LitMC-BERT: Transformer-Based Multi-Label Classification of Biomedical Literature With An Application on COVID-19 Literature Curation

The rapid growth of biomedical literature poses a significant challenge for curation and interpretation. This has become more evident during the COVID-19 pandemic. LitCovid, a literature database of COVID-19 related papers in PubMed, has accumulated over 200,000 articles with millions of accesses. Approximately 10,000 new articles are added to LitCovid every month. A main curation task in LitCovid is topic annotation, where an article is assigned up to eight topics, e.g., Treatment and Diagnosis. The annotated topics have been widely used both in LitCovid (e.g., accounting for ∼18% of total uses) and in downstream studies such as network generation. However, topic annotation has been a primary curation bottleneck due to the nature of the task and the rapid literature growth. This study proposes LITMC-BERT, a transformer-based multi-label classification method in biomedical literature. It uses a shared transformer backbone for all the labels while also capturing label-specific features and the correlations between label pairs. We compare LITMC-BERT with three baseline models on two datasets. Its micro-F1 and instance-based F1 are 5% and 4% higher than the current best results, respectively, and it requires only ∼18% of the inference time of the Binary BERT baseline. The related datasets and models are available via https://github.com/ncbi/ml-transformer.


INTRODUCTION
The rapid growth of biomedical literature significantly challenges manual curation and interpretation [1,2]. These challenges have become more evident in the context of the COVID-19 pandemic. Specifically, the median acceptance time of COVID-19 papers is about 2 times and 20 times faster than the acceptance time for papers about Ebola and cardiovascular disease, respectively [3]. The number of articles in the literature related to COVID-19 is growing by about 10,000 articles per month [4]. LitCovid [5,6], a literature database of COVID-19 related papers in PubMed, has accumulated a total of more than 100,000 articles, with millions of accesses each month by users worldwide. LitCovid is updated daily, and this rapid growth significantly increases the burden of manual curation. In particular, annotating each article with up to eight possible topics, e.g., Treatment and Diagnosis, has been a bottleneck in the LitCovid curation pipeline. Fig. 1 shows the characteristics of topic annotations in LitCovid; we will explain the related curation process below. The annotated topics have been widely used both in LitCovid directly and in downstream studies. For instance, topic-related searching and browsing account for over 18% of LitCovid user behaviors [6], and the topics have also been used in downstream studies such as citation analysis and knowledge network generation [7][8][9]. Therefore, it is important to develop automatic methods to overcome this bottleneck.
Innovative text mining tools have been developed to facilitate biomedical literature curation for over two decades [2][10][11][12]. Topic annotation in LitCovid is a standard multi-label classification task, which aims to assign one or more labels to each article [13]. To facilitate manual topic annotation, we previously employed the deep learning model Bidirectional Encoder Representations from Transformers (BERT) [14]. We used one BERT model per topic, known as Binary BERT, and previously demonstrated that this method achieved the best performance of the available models for LitCovid topic annotations [6]; other studies have reported consistent results [15]. Indeed, existing studies in biomedical text mining have demonstrated that Binary BERT achieves the best performance in multi-label classification tasks [16,17]. However, this method has two primary limitations. First, by training each topic individually, the model ignores the correlation between topics, especially for topics that often co-occur, biasing the predictions and reducing generalization capability. Second, using eight models significantly increases the inference time, causing the daily curation in LitCovid to require significant computational resources.
To this end, this paper proposes LITMC-BERT, a transformer-based multi-label classification method in biomedical literature. It uses a shared BERT backbone for all the labels while also capturing label-specific features and the correlations between label pairs. It also leverages multi-task training and label-specific fine-tuning to further improve the label prediction performance. We compared LITMC-BERT to three baseline methods using two sets of evaluation metrics (label-based and example-based) commonly used for multi-label classification on two datasets: the LitCovid dataset, which consists of over 30,000 articles, and the HoC (the Hallmarks of Cancers) dataset, which consists of over 1,500 articles (the only benchmark dataset for multi-label classification methods in biomedical text mining). It achieved the highest performance on both datasets: its instance-based F1 and accuracy are 3% and 8% higher than the BERT baselines, respectively. Importantly, it also achieves the SOTA performance in the HoC dataset: its micro-F1 and instance-based F1 are 5% and 4% higher than the current best results reported in the literature, respectively. In addition, it requires only ~15% of the inference time needed for Binary BERT, significantly improving the inference efficiency. LITMC-BERT has been employed in the LitCovid production system, making the curation more sustainable. We also make the related datasets, code, and models available via https://github.com/ncbi/ml-transformer.

Multi-label classification
Multi-label classification is a standard machine learning task where each instance is assigned one or more labels. Multi-label classification methods can be categorized into two broad groups [18]: problem transformation, which transforms the multi-label classification problem into relatively simpler problems such as single-label classification, and algorithm adaptation, which adapts the methods (such as changing the loss function) for multi-label data.
The methods under the problem transformation group are traditional approaches to address multi-label classification. The most popular methods include (1) binary relevance, where a corresponding binary classification model is trained for each label [19], (2) label powerset, where a binary classification model is trained for every combination of the labels [20], and (3) classifier chains, where the output of a binary classification model is used as the input to train a further binary classification model [21]. Such methods have achieved promising performance in a range of multi-label classification tasks [13,22]. Indeed, existing studies have shown that binary relevance BERT achieved the best performance for topic annotation in LitCovid [6,15]. However, it is computationally expensive, and transforming multi-label classification tasks into binary classification may ignore the correlations among labels. Recently, an increasing number of deep learning methods under the algorithm adaptation group have been proposed which predict all the labels directly as the output [23][24][25].
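To make the problem transformation idea concrete, the following minimal sketch shows how the same multi-label annotations can be turned into training targets under binary relevance and label powerset. The label subset, the toy annotations, and the function names are hypothetical and chosen only for illustration; classifier chains are omitted for brevity.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical subset of topics and toy annotations (one label set per article).
LABELS = ["Treatment", "Diagnosis", "Prevention"]
docs: List[Set[str]] = [
    {"Treatment", "Diagnosis"},
    {"Prevention"},
    {"Treatment"},
    {"Treatment", "Diagnosis"},
]


def binary_relevance(label_sets: List[Set[str]], labels: List[str]) -> Dict[str, List[int]]:
    """Build one independent 0/1 target vector per label."""
    return {lab: [int(lab in s) for s in label_sets] for lab in labels}


def label_powerset(
    label_sets: List[Set[str]], labels: List[str]
) -> Tuple[List[int], Dict[Tuple[str, ...], int]]:
    """Give every distinct label combination its own class id."""
    class_ids: Dict[Tuple[str, ...], int] = {}
    targets: List[int] = []
    for s in label_sets:
        key = tuple(lab for lab in labels if lab in s)
        class_ids.setdefault(key, len(class_ids))
        targets.append(class_ids[key])
    return targets, class_ids


br_targets = binary_relevance(docs, LABELS)
lp_targets, lp_classes = label_powerset(docs, LABELS)
```

Binary relevance yields one target vector per label and therefore one model per label, which is the setting that Binary BERT follows; label powerset keeps co-occurrence information but the number of targets grows with the number of observed combinations.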

Multi-label text classification in the domain of biomedical text mining
Text classification methods have been widely applied in biomedical text mining for biomedical document triage [26], retrieval [27], and curation [28]. Compared to the general domain, text classification, especially multi-label classification, in biomedical text mining has three primary challenges: (1) domain-specific language ambiguity, e.g., a gene may have over 10 different synonyms mentioned in the literature; conversely, a gene and a chemical could share the same name [29]; (2) limited benchmark datasets for method development and validation, e.g., the HoC dataset [30] is the only multi-label text dataset among commonly-used benchmark datasets for biomedical text mining [16,17] and it only has about 1,500 PubMed documents; and (3) deployment difficulty, i.e., an important contribution of biomedical text mining is to provide open-source tools and servers that biomedical researchers can readily apply. Therefore, the designed methods should be scalable to massive biomedical literature and efficient in standard research tool production settings where computational resources are limited, e.g., graphics processing units (GPUs) are not commonly available for method deployment.
Such challenges impact the method development in biomedical text mining. Most of the existing methods focus on transfer learning, which employs word embeddings (such as Word2Vec) or transformer-based models (such as BERT) that are pre-trained on biomedical corpora to extract text representations [31][32][33][34]. Indeed, this also applies to the multi-label classification methods in biomedical text mining. Existing studies have mostly used Binary BERT [16,17] for multi-label classification: for each label, it trains a corresponding BERT (or other type of transformer) classification model. The evaluation results show that the BERT-related approaches achieve the best performance for multi-label classification in biomedical text mining compared to other multi-label classification methods that have been used in the general domain [35,36].

LitCovid curation pipeline
A primary focus of this study is to develop a multi-label classification method to facilitate COVID-19 literature curation such as topic annotation in LitCovid. Here we summarize the topic annotation in the LitCovid curation pipeline and its challenges.
The detailed LitCovid curation pipeline is summarized in [6]. For topic annotation, an article in LitCovid is considered for one or more of eight topics (General information, Mechanism, Transmission, Diagnosis, Treatment, Prevention, Case report, or Epidemic forecasting) when applicable. Fig. 1 shows the characteristics of topic annotations of the articles in LitCovid by the end of September 2021. Prevention, Treatment, and Diagnosis are the topics with the highest frequency. Over 20% of the articles have more than one topic annotated. Some topics co-occur frequently, such as Treatment and Mechanism, where papers describe underlying biological pathways and potential COVID-19 treatment (https://www.ncbi.nlm.nih.gov/pubmed/33638460). The annotated topics have been demonstrated to be effective for information retrieval and have been used in many downstream applications. Specifically, topic-related searching and browsing account for over 18% of LitCovid user behaviors among millions of accesses, and this has been the second most accessed feature in LitCovid [6]. The topics have also been used in downstream studies such as evidence attribution, literature influence analysis, and knowledge network generation [7][8][9].
However, annotating these topics has been a primary bottleneck for manual curation. First, compared to the general domain, biomedical literature has domain-specific ambiguities and its semantics are difficult to understand. For example, the Treatment topic can be described in different ways including patient outcomes (e.g., 'these factors may impact in guiding the success of vaccines and clinical outcomes in COVID-19 infections'), biological pathways (e.g., 'virus-specific host responses and vRNA-associated proteins that variously promote or restrict viral infection'), and biological entities (e.g., 'unique ATP-binding pockets on NTD/CTD may offer promising targets'). Second, compared to other curation tasks in LitCovid (document triage and entity recognition), topic annotation is more difficult due to the nature of the task (assigning up to eight topics) and the ambiguity of natural languages (such as different ways to describe COVID-19 treatment procedures). Initially, the annotation was done manually by two curators with little machine assistance. To keep up with the rapid growth of COVID-19 literature, Binary BERT has been developed to support manual annotation. However, previous evaluations show that it has an F1-score 10% lower than the tools assisting other curation tasks in LitCovid [6]. Also, Binary BERT requires a significant amount of inference time because each label needs a separate BERT model for prediction. This challenges the LitCovid curation pipeline, which may have thousands of articles to curate within a day.

Experiment datasets
We used the LitCovid BioCreative and HoC datasets for method development and evaluation. TABLE 1 provides the characteristics.
The LitCovid BioCreative dataset [37] contains 24,960, 6,239, and 2,500 PubMed articles in the training, development, and testing sets, respectively. The topics were assigned consistently using the above annotation approach. All the articles contain both titles and abstracts available in PubMed and have been manually reviewed by two curators. The only difference is that the datasets do not contain the General Information topic since the priority of the topic annotation is given to the articles with abstracts available in PubMed [6]. In addition, the testing set contains the articles that were added to LitCovid from 16th June to 22nd August, after the construction of the training and development datasets. Using incoming articles to generate the testing set facilitates the evaluation of the generalization capability of automatic methods. To our knowledge, this dataset is one of the largest multi-label classification datasets on biomedical English scientific text. We have made this dataset publicly available to the community via https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/.
The HoC dataset contains 1,580 PubMed abstracts annotated by two curators with 10 currently known hallmarks of cancer. The dataset is available via https://www.cl.cam.ac.uk/~sb895/HoC.html. We used the same dataset split as previous studies [16,24]; the training, development, and testing sets contain 1,108, 157, and 315 articles, respectively. As mentioned, it is the only dataset used for multi-label classification in biomedical literature among commonly-used benchmark datasets [16,17,24].

LITMC-BERT architecture
The architecture of LITMC-BERT is summarized in Fig. 2 and Fig. 3. Fig. 2 compares its architecture with other approaches (we will also use them as baselines), whereas Fig. 3 details its underlying modules. The detailed hyperparameters are also summarized in TABLE 2 and 3. As mentioned, most biomedical text mining studies have used Binary BERT (Fig. 2 (A)) for multi-label classification [16,17]. Indeed, we have applied Binary BERT to annotate topics in the LitCovid curation pipeline as well [6]. An alternative approach, which we denote as Linear BERT (Fig. 2 (B)), uses a shared BERT model followed by a Sigmoid function (or other similar activation functions) to output all the label probabilities directly. Linear BERT also forms the basis of LITMC-BERT (Fig. 2 (C)). In contrast to Linear BERT, in LITMC-BERT each label has its own module (Label Module) to capture label-specific representations, and the label representations are further used (Label Pair Module) to predict whether a pair of labels co-occurs. It also leverages multi-task training and label-based fine-tuning. We explain each in detail below.
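The Linear BERT output layer described above can be illustrated with a minimal, dependency-free sketch: a shared document vector is mapped to one logit per label, and independent Sigmoid functions turn the logits into label probabilities. The toy vector, weights, label names, and the 0.5 decision threshold are assumptions made for illustration; they are not taken from the paper or its repository.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def linear_head(doc_vector: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """Compute one logit per label from a shared vector, then apply independent sigmoids."""
    probs = []
    for w_row, b in zip(weights, biases):
        logit = sum(w * x for w, x in zip(w_row, doc_vector)) + b
        probs.append(sigmoid(logit))
    return probs


def decode(probs: List[float], labels: List[str], threshold: float = 0.5) -> List[str]:
    """Keep every label whose probability reaches the (assumed) threshold."""
    return [lab for lab, p in zip(labels, probs) if p >= threshold]


labels = ["Treatment", "Diagnosis", "Prevention"]   # hypothetical label subset
doc_vector = [1.0, -1.0]                            # stands in for the BERT CLS vector
weights = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]      # one weight row per label
biases = [0.0, 0.0, 1.0]

probs = linear_head(doc_vector, weights, biases)
predicted = decode(probs, labels)
```

Because each label has its own Sigmoid rather than sharing a softmax, several labels can be predicted for the same article, while a single forward pass through the backbone serves all labels.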

Transformer Backbone
The Transformer Backbone applies a transformer-based model to get a general representation of an input text; in this case, the input text is the title and abstract (if available) of an article. In this study, the transformer-based model is BioBERT [38], which is a BERT model pre-trained on PubMed and PMC articles. We evaluated a range of BERT variants, and BioBERT (v1.0) gave the overall highest performance as the backbone model.

Label Module
Each label has a Label Module to capture its specific representations for the final label classification. Figure 3 (A) shows its detail. Essentially, the Label Module combines the final hidden vector for the CLS token of a BERT backbone (Figure 3 (A)(1); we call it the CLS vector) and the label-specific vector (Figure 3 (A)(2)) to produce the final label feature vector (Figure 3 (A)(5)).
Using the CLS vector of a BERT backbone is recommended by the authors of BERT for classification tasks [14]. For LITMC-BERT, it is shared by all the labels (since a shared BERT model is used as the backbone). In addition, for each label, a Multi-head Self-Attention [39] and a global average pooling layer are applied to the last encoder layer of the BERT backbone (Figure 3 (A)(2)) to get a label-specific vector. This is designed to capture specific features for each label. We further normalize the CLS vector and the label-specific vector with a multi-layer perceptron (MLP) consisting of a few dense layers (Figure 3 (A)(3) and Figure 3 (A)(4)). This approach has been demonstrated to be effective for combining feature vectors from different sources [28]. The normalized vectors are summed up to produce the final label vector (Figure 3 (A)(5)).
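The data flow of the Label Module can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions rather than the released implementation: it uses single-head attention in place of Multi-head Self-Attention, one dense layer with ReLU in place of the MLP, random weights, and hypothetical dimensions and parameter names.

```python
import numpy as np

rng = np.random.default_rng(0)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (simplification of multi-head)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def dense_relu(x, w, b):
    """One dense layer with ReLU, standing in for the 'few dense layers' of the MLP."""
    return np.maximum(x @ w + b, 0.0)


def label_module(hidden_states, params):
    """Combine the shared CLS vector with a label-specific vector."""
    cls_vector = hidden_states[0]                                   # (1) CLS vector
    attended = self_attention(hidden_states, params["w_q"], params["w_k"], params["w_v"])
    label_vector = attended.mean(axis=0)                            # (2) global average pooling
    cls_norm = dense_relu(cls_vector, params["w_cls"], params["b_cls"])      # (3)
    label_norm = dense_relu(label_vector, params["w_lab"], params["b_lab"])  # (4)
    return cls_norm + label_norm                                    # (5) final label vector


seq_len, hidden, out_dim = 6, 8, 4  # hypothetical toy sizes
params = {
    "w_q": rng.normal(scale=0.1, size=(hidden, hidden)),
    "w_k": rng.normal(scale=0.1, size=(hidden, hidden)),
    "w_v": rng.normal(scale=0.1, size=(hidden, hidden)),
    "w_cls": rng.normal(scale=0.1, size=(hidden, out_dim)),
    "b_cls": np.zeros(out_dim),
    "w_lab": rng.normal(scale=0.1, size=(hidden, out_dim)),
    "b_lab": np.zeros(out_dim),
}
hidden_states = rng.normal(size=(seq_len, hidden))  # stands in for the last encoder layer
feature = label_module(hidden_states, params)
```

In a full model, each label would hold its own copy of these parameters while `hidden_states` would come from the shared backbone, so only the lightweight module is duplicated per label.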

Label Pair Module
The Label Pair Module further uses the label representations from the Label Module and captures correlations between label pairs. Figure 3 (B) shows its detail.
For a pair of labels 1 and 2, the Label Pair Module first uses their corresponding feature representations produced by the Multi-head Self-Attention in the Label Module (Figure 3 (A)) as inputs. Then it performs co-attention (Figure 3 (B)(2)) and global average pooling (Figure 3 (B)(3)) to get two vectors from the inputs. The co-attention mechanism is an adaptation of the Multi-head Self-Attention in which the query and key components come from the two labels of the pair (e.g., the attention from label 1 to label 2 and the attention from label 2 to label 1 in Figure 3 (B)(2)). This has been demonstrated to be effective for modeling correlations between pairs [40,41]. Then, the two vectors are fused using the same method as above (Figure 3 (B)(4) and Figure 3 (B)(5)) to get the final label pair vector (Figure 3 (B)(6)). The label pair vector is used to predict whether labels 1 and 2 co-occur, which serves as an auxiliary task for the multi-task training process introduced below. Auxiliary tasks are not directly related to primary tasks (label predictions in this case) but have been shown to be effective in multi-task training for making the shared representation more generalizable [42]. In addition, while the relations between label pairs are important, this does not necessarily apply to every label pair. We therefore define a hyperparameter called the label pair threshold: the Label Pair Module is only applied to a label pair whose co-occurrence ratio is above the threshold. For a pair of labels 1 and 2, this ratio is calculated as the number of training instances in which labels 1 and 2 co-occur divided by the smaller of the two labels' instance counts in the training set.
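The label pair selection rule described above can be sketched as follows: for each pair, the co-occurrence count is divided by the smaller of the two label counts, and the pair is kept only if this ratio is above the label pair threshold. The toy training annotations, function names, and threshold values are hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple


def cooccurrence_ratio(label_sets: List[Set[str]], a: str, b: str) -> float:
    """Co-occurrence count divided by the smaller of the two label counts."""
    n_a = sum(a in s for s in label_sets)
    n_b = sum(b in s for s in label_sets)
    n_ab = sum(a in s and b in s for s in label_sets)
    smaller = min(n_a, n_b)
    return n_ab / smaller if smaller else 0.0


def select_label_pairs(
    label_sets: List[Set[str]], labels: List[str], threshold: float
) -> List[Tuple[str, str]]:
    """Return the label pairs whose ratio is above the label pair threshold."""
    return [
        (a, b)
        for a, b in combinations(labels, 2)
        if cooccurrence_ratio(label_sets, a, b) > threshold
    ]


# Hypothetical toy training annotations.
train = [
    {"Treatment", "Mechanism"},
    {"Treatment", "Mechanism"},
    {"Treatment"},
    {"Diagnosis"},
    {"Diagnosis", "Treatment"},
    {"Prevention"},
]
labels = ["Treatment", "Mechanism", "Diagnosis", "Prevention"]

loose_pairs = select_label_pairs(train, labels, threshold=0.4)
strict_pairs = select_label_pairs(train, labels, threshold=0.6)
```

Dividing by the smaller count means a rare label that almost always appears together with a frequent one still produces a high ratio, so such pairs are not filtered out merely because one label is rare.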

Multi-task training and label-based fine-tuning
The LITMC-BERT training process employs multi-task training where it trains and predicts the labels (main tasks) and co-occurrences (auxiliary tasks) simultaneously. The loss during the multi-task training is the total loss of the main tasks and auxiliary tasks. Given that the main tasks are the focus, we define a hyperparameter called auxiliary task weight (from 0 to 1) which takes a proportion of the auxiliary task losses. The full hyperparameters and baselines are provided below. When the multi-task training converges, it further fine-tunes the Label Module for each label while freezing the weights of the other modules. Such a training approach has been shown to be effective in both text mining and computer vision applications [43,44].
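A minimal sketch of the multi-task objective described above is the sum of the main-task losses plus the auxiliary-task losses scaled by the auxiliary task weight. Binary cross-entropy is assumed here as the per-task loss because the loss function is not named in this section; the function names and toy values are hypothetical.

```python
import math
from typing import Sequence


def bce(prob: float, target: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one prediction (assumed per-task loss)."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def multitask_loss(
    label_probs: Sequence[float],
    label_targets: Sequence[int],
    pair_probs: Sequence[float],
    pair_targets: Sequence[int],
    aux_weight: float,
) -> float:
    """Main-task loss plus a weighted proportion of the auxiliary-task loss."""
    if not 0.0 <= aux_weight <= 1.0:
        raise ValueError("aux_weight is expected to lie in [0, 1]")
    main = sum(bce(p, t) for p, t in zip(label_probs, label_targets))
    aux = sum(bce(p, t) for p, t in zip(pair_probs, pair_targets))
    return main + aux_weight * aux


# Toy example: one label prediction and one label-pair (co-occurrence) prediction.
loss_half = multitask_loss([0.5], [1], [0.5], [0], aux_weight=0.5)
loss_main_only = multitask_loss([0.5], [1], [0.5], [0], aux_weight=0.0)
```

Setting the weight to 0 recovers training on the label predictions alone, which is why the weight can be read as how strongly the co-occurrence tasks are allowed to shape the shared representation.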

Baseline models
We compared LITMC-BERT to three baseline models: ML-Net (a shallow deep learning multi-label classification model which has achieved superior performance in biomedical literature) [24], Binary BERT (Fig. 2 (A)), and Linear BERT (Fig. 2 (B)).
ML-Net is an end-to-end deep learning framework which has achieved favorable state-of-the-art (SOTA) performance in a few biomedical multi-label text classification tasks [24]. ML-Net first maps texts into high dimensional vectors through deep contextualized word representations (ELMo) [33], and then combines a label prediction network and label count prediction to infer an optimal set of labels for each document.
Binary BERT and Linear BERT are introduced in 3.2. For N labels, Binary BERT trains N BERT classification models (one label each) whereas Linear BERT provides all the N label predictions in one model. Note that previous studies have mostly used Binary BERT in biomedical literature [16,17]. It has also been the SOTA model for multi-label classification in biomedical literature [15][16][17] and was previously used in the LitCovid production system [6].

Hyperparameters
For each model, we performed hyperparameter tuning on the datasets and selected the best set of hyperparameters based on the validation set loss. TABLE 2 provides the hyperparameter values in the LitCovid BioCreative dataset; the configuration files of the hyperparameters are also provided in the GitHub repository. Importantly, for BERT-related models (Binary BERT, Linear BERT, and LITMC-BERT), we controlled their shared hyperparameters (BERT backbone, maximum sequence length, learning rate, early stop steps, and batch size) to ensure a fair and direct comparison.

Evaluation metrics and reporting standard
There are a number of evaluation measures for multi-label classification tasks [13,45,46], which can be broadly divided into two groups: (1) label-based measures, which evaluate the classifier's performance on each label, and (2) example-based measures (also called instance-based measures), which aim to evaluate the multi-label classifier's performance on each test instance. Both groups complement each other: in the case of topic annotation, label-based measures quantify the specific performance for each topic, whereas example-based measures quantify the effectiveness of models at the document level (a document may contain several topics). We employed representative metrics from both groups to provide a broader view of the performance.
Specifically, we used six evaluation measures as the main metrics. They consist of four label-based measures (macro-F1, macro-Average Precision (AP), micro-F1, and micro-AP) and two instance-based measures (instance-based F1 and accuracy, which in this case is also the exact match ratio and the complement of the zero-one loss). We further reported six evaluation measures that have been used to calculate the main metrics as additional metrics. They consist of four label-based measures (macro-Precision, macro-Recall, micro-Precision, and micro-Recall) and two instance-based measures (instance-based Precision and instance-based Recall). Their calculation formulas are summarized below.

Label-based measures
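Label-based measures evaluate the multi-label classifier's performance separately on each label by calculating the true positives (TP), false positives (FP), and false negatives (FN) on the test set. For the j-th label $y_j$, we calculated the following four metrics:

$$\mathrm{Precision}_j = \frac{TP_j}{TP_j + FP_j}, \qquad \mathrm{Recall}_j = \frac{TP_j}{TP_j + FN_j}$$

$$\mathrm{F1}_j = \frac{2 \cdot \mathrm{Precision}_j \cdot \mathrm{Recall}_j}{\mathrm{Precision}_j + \mathrm{Recall}_j}, \qquad \mathrm{AP}_j = \sum_{n} \left(\mathrm{Recall}_n - \mathrm{Recall}_{n-1}\right) \mathrm{Precision}_n$$

F1 and AP are aggregated measures using both Precision and Recall in the calculation. AP is also a threshold-based measure which summarizes a Precision-Recall curve, where $\mathrm{Precision}_n$ and $\mathrm{Recall}_n$ are the Precision and Recall at each threshold (denoted as n in the formula).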
To measure the overall metrics for all the labels, we calculated both macro-averaged (using unweighted averaging across labels) and micro-averaged scores (counting TP, FP and FN globally rather than at label level) for the labels.
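The difference between macro- and micro-averaging can be illustrated with a small, dependency-free sketch for F1: macro-averaging computes F1 per label and takes the unweighted mean, whereas micro-averaging pools TP, FP, and FN over all labels first. The toy gold and predicted label matrices and the function names are hypothetical; the convention of returning 0 when a label has no positives and no predictions is an assumption.

```python
from typing import List, Tuple


def label_counts(y_true: List[List[int]], y_pred: List[List[int]], j: int) -> Tuple[int, int, int]:
    """Count TP, FP, and FN for the j-th label over all test documents."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 written in terms of counts; equals the harmonic mean of Precision and Recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Unweighted mean of per-label F1 scores."""
    n_labels = len(y_true[0])
    return sum(f1_from_counts(*label_counts(y_true, y_pred, j)) for j in range(n_labels)) / n_labels


def micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """F1 computed from TP, FP, and FN summed over all labels."""
    n_labels = len(y_true[0])
    totals = [label_counts(y_true, y_pred, j) for j in range(n_labels)]
    tp, fp, fn = (sum(c[k] for c in totals) for k in range(3))
    return f1_from_counts(tp, fp, fn)


# Toy data: rows are documents, columns are two labels (frequent, rare).
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [0, 1], [0, 0]]

macro = macro_f1(y_true, y_pred)
micro = micro_f1(y_true, y_pred)
```

In this toy case the rare label is never predicted correctly, which pulls macro-F1 down more than micro-F1; this is why macro-averaged scores are sensitive to performance on low-frequency labels.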

Example-based measures
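The example-based metrics evaluate the multi-label classifier's performance separately by comparing the predicted labels with the gold-standard labels for each test example. We focus on the following four metrics:

$$\mathrm{Accuracy} = \frac{1}{p}\sum_{i=1}^{p} \mathbb{1}\left[Y_i = \hat{Y}_i\right], \qquad \mathrm{Precision} = \frac{1}{p}\sum_{i=1}^{p} \frac{|Y_i \cap \hat{Y}_i|}{|\hat{Y}_i|}$$

$$\mathrm{Recall} = \frac{1}{p}\sum_{i=1}^{p} \frac{|Y_i \cap \hat{Y}_i|}{|Y_i|}, \qquad \mathrm{F1} = \frac{1}{p}\sum_{i=1}^{p} \frac{2\,|Y_i \cap \hat{Y}_i|}{|Y_i| + |\hat{Y}_i|}$$

where p is the number of documents in the test set; $Y_i$ refers to the true label set for the i-th document in the test set; and $\hat{Y}_i$ refers to the predicted label set for the i-th document in the test set.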
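As an illustration of the four example-based metrics, the sketch below computes accuracy (exact match), instance-based Precision, Recall, and F1 by comparing the true and predicted label sets of each document and averaging over documents. It assumes the common per-document definitions; treating an empty prediction or empty gold set as contributing 0 is an additional assumption, and the toy label sets are hypothetical.

```python
from typing import Dict, List, Set


def example_based_metrics(true_sets: List[Set[str]], pred_sets: List[Set[str]]) -> Dict[str, float]:
    """Average per-document accuracy, precision, recall, and F1 over p documents."""
    p = len(true_sets)
    totals = {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0}
    for y, y_hat in zip(true_sets, pred_sets):
        overlap = len(y & y_hat)
        totals["accuracy"] += float(y == y_hat)                 # exact match
        totals["precision"] += overlap / len(y_hat) if y_hat else 0.0
        totals["recall"] += overlap / len(y) if y else 0.0
        size_sum = len(y) + len(y_hat)
        totals["f1"] += 2 * overlap / size_sum if size_sum else 0.0
    return {name: value / p for name, value in totals.items()}


# Toy gold-standard and predicted topic sets for three documents.
gold = [{"Treatment", "Diagnosis"}, {"Prevention"}, {"Treatment"}]
pred = [{"Treatment"}, {"Prevention"}, {"Treatment", "Mechanism"}]

scores = example_based_metrics(gold, pred)
```

Only the second document is an exact match, so accuracy is much lower than instance-based F1 here; this gap illustrates why accuracy is the strictest of the example-based measures.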

Statistic test and reporting standard
We repeated each model 10 times, reported the mean and max values of the repeats for each evaluation measure above, and conducted the Wilcoxon rank-sum test (Confidence Interval at 95%; one-tailed) following previous studies [40,47].

RESULTS AND DISCUSSIONS

Overall performance
TABLE 3 demonstrates the overall performance of the models on both datasets. As mentioned, we used six main metrics and reported their mean and max results (i.e., 12 evaluation measurement results). Out of these 12 measurement results, LITMC-BERT consistently achieved the highest results in 10 of them in the LitCovid BioCreative dataset and all the 12 in the HoC dataset. On average, its macro-F1 score is about 10% higher than ML-Net in both datasets; the same applies to other measures such as macro-AP and accuracy. Compared with Binary BERT, its label-based measures are about 2% and 4% higher on the LitCovid BioCreative and HoC datasets, respectively. Its instance-based measures on the HoC dataset show a larger difference; e.g., its accuracy is up to 10% higher. The observations are similar when comparing LITMC-BERT with Linear BERT: e.g., its macro-F1 and accuracy are up to 2% and 4% higher on the HoC dataset, respectively. In terms of comparing Binary BERT with Linear BERT, Binary BERT achieved overall better performance on the LitCovid BioCreative dataset, which is consistent with the literature [6,15], whereas Linear BERT achieved overall better performance on the HoC dataset.
In addition, Fig. 4 shows the distribution of macro-F1 scores of the models and the P-values of the Wilcoxon rank-sum test. On both datasets, LITMC-BERT consistently had a better macro-F1 score than ML-Net (P-values close to 0) and both Binary BERT and Linear BERT (P-values smaller than or close to 0.001).
Further, compared with the current SOTA results on the HoC dataset, LITMC-BERT also achieved higher performance. For LitCovid BioCreative, LITMC-BERT achieved better performance than the results reported by the challenge overview from 80 system submissions worldwide [37]. Existing studies on the HoC dataset used different measures and only reported one evaluation result (without repetitions to report the average performance or perform statistical tests). One study used instance-based F1 and reported that BlueBERT (base) and BlueBERT (large) achieved the highest instance-based F1 of 0.8530 and 0.8730, respectively, compared with other BERT variants [16]. In contrast, LITMC-BERT achieved a mean instance-based F1 of 0.9030 and a maximum instance-based F1 of 0.9169, consistently higher than the reported performance. Similarly, another study used micro-F1 on a slightly different version of the HoC dataset (this is different from other studies [16,24]) and reported that PubMedBERT achieved the highest micro-F1 of 0.8232 [17]. The mean and maximum micro-F1 of LITMC-BERT are 0.8648 and 0.8787, respectively. We manually examined the results and found that one possible reason is that the existing studies use the BERT model at the sentence level and then aggregate the predictions to the abstract level for the HoC dataset [16]; this may ignore the inter-relations among sentences and cannot capture the context at the abstract level. In contrast, we directly applied the models at the abstract level, which overcomes these limitations.

Additional measures, label-specific results, and an ablation analysis
TABLE 4 provides additional measures to complement the main metrics. As mentioned, we reported the mean and maximum of six additional metrics (i.e., 12 in total). Out of these 12 additional measurement results, LITMC-BERT achieved the highest results in 7 of them in the LitCovid BioCreative dataset and 11 of them in the HoC dataset, which is consistent with the main measurement results in TABLE 3.
In addition, we further analyzed the performance of each individual label. Fig. 5 and Fig. 6 show the F1s of each label in the LitCovid BioCreative and HoC datasets, respectively. Out of the seven labels in the LitCovid BioCreative dataset, LITMC-BERT had the highest F1 in four of them. Similarly, it had the highest F1 in seven out of 10 labels in the HoC dataset. The results also demonstrate that LITMC-BERT had much better performance for labels with low frequencies. For the LitCovid BioCreative dataset, its F1s are up to 9% and 6% higher for the Epidemic Forecasting (accounting for 1.64% of the testing set) and Transmission (5.12%) labels than Binary BERT and Linear BERT, respectively. For the HoC dataset, its F1s are also up to 6% and 5% higher for the Avoiding immune destruction (5.40%) and Enabling replicative immortality (5.71%) labels than Binary BERT and Linear BERT, respectively. This suggests that LITMC-BERT might be more robust to the class imbalance issue. Indeed, existing studies have demonstrated that the class imbalance issue remains an open challenge for multi-label classification and that it is more difficult to improve the classification performance for rare classes [21,22]. This is more evident given that BERT-related models can already achieve F1-scores of close to or above 90% for labels with high frequencies on both datasets. Therefore, the performance of topics with low frequencies is arguably more critical.
Further, we performed an ablation analysis to quantify the effectiveness of the LITMC-BERT modules. Specifically, we compared the performance of LITMC-BERT without the Label Module, the Label Pair Module, or both (i.e., Linear BERT) using the same evaluation procedure above. TABLE 5 shows the results. Recall that Linear BERT uses the same BERT backbone and does not capture label-specific features or correlations between labels; therefore, we can directly compare the effectiveness of the Label Module and the Label Pair Module with Linear BERT. On both datasets, LITMC-BERT with both modules had the highest performance in all the measures. For instance, the Label Module increased the average macro-F1 by 2.1% and 0.5% in LitCovid BioCreative and HoC, respectively; the Label Pair Module increased the average macro-F1 by 0.7% and 0.5% in LitCovid BioCreative and HoC, respectively. Consistent observations are also shown in other metrics; for example, the Label Module increased the average macro-AP by 2.5% and 1.5% in LitCovid BioCreative and HoC, respectively. This suggests that the two modules complement each other and combining both is effective. In addition, removing either module reduced the performance in both datasets, and removing both of them gave the lowest performance on average. This suggests both modules are effective. Also, the results suggest that the Label Module is more effective in the LitCovid BioCreative dataset (e.g., the macro-F1 is reduced by up to 2% if removing the Label Module) whereas the Label Pair Module is more effective in the HoC dataset (e.g., the macro-F1 is reduced by up to 1% if removing the Label Pair Module).

Generalization and efficiency analysis
The above evaluations show that LITMC-BERT achieved consistently better performance on both datasets. We further evaluated its generalization and efficiency in the LitCovid production environment. While transformer-based models have achieved SOTA results in many applications, their inference time is significantly longer than that of other types of models [48]. It is thus important to measure their efficiency in practice. Reducing inference time is also critical to the LitCovid curation pipeline, which may have thousands of articles to curate within a day (the peak was over 2,500 articles in a single day) [6]. A random sample of 3,000 articles in LitCovid was collected between October and December 2021, independent of the training set. We used it as an external validation set and measured the accuracy and efficiency of the models. As mentioned, Binary BERT was previously used in LitCovid production for topic annotation. We used a single processor on CPU with a batch size of 128, which is consistent with the LitCovid production setting, and tracked the inference time accordingly. TABLE 6 details the performance. LITMC-BERT achieved the best performance in all the accuracy-related measures, took ~18% of the prediction time of Binary BERT, and was only 0.05 sec/doc slower than Linear BERT as the trade-off.

Binary BERT took about 3.4 seconds on average to predict topics for an article. This does not include overhead time (e.g., switching to other automatic curation tasks) or postprocessing time (e.g., sorting the probabilities and showing the related articles for manual review). Therefore, just predicting the topics for a large batch of articles may take over an hour, delaying the daily curation of LitCovid. In contrast, LITMC-BERT takes only about 0.5 seconds on average for inference, which accounts for ~15% of the time used by Binary BERT. We have deployed LITMC-BERT in the LitCovid production system given its superior performance in both effectiveness and efficiency.
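As a rough illustration of what these per-article times mean for daily curation, the arithmetic can be sketched as follows. The per-document times and the peak daily volume are the approximate figures quoted above; the function and variable names are hypothetical, and overhead and postprocessing time are ignored.

```python
# Back-of-the-envelope estimate of daily prediction time.
# Per-document times are the approximate averages reported in the text;
# the batch size is the reported peak of about 2,500 articles in one day.
SECONDS_PER_DOC = {"Binary BERT": 3.4, "LITMC-BERT": 0.5}
PEAK_ARTICLES_PER_DAY = 2500


def batch_minutes(seconds_per_doc: float, n_docs: int) -> float:
    """Total prediction time in minutes, ignoring overhead and postprocessing."""
    return seconds_per_doc * n_docs / 60.0


estimates = {
    name: batch_minutes(sec, PEAK_ARTICLES_PER_DAY)
    for name, sec in SECONDS_PER_DOC.items()
}
ratio = SECONDS_PER_DOC["LITMC-BERT"] / SECONDS_PER_DOC["Binary BERT"]

for name, minutes in estimates.items():
    print(f"{name}: {minutes:.1f} min for {PEAK_ARTICLES_PER_DAY} articles")
print(f"LITMC-BERT / Binary BERT time ratio: {ratio:.1%}")
```

With these rounded inputs, a peak-day batch would take Binary BERT well over two hours of pure prediction time, versus roughly 21 minutes for LITMC-BERT, and the ratio of the two rounded per-document times comes out at about 15%.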

Limitations and future work
While LITMC-BERT achieved the best overall performance on both datasets compared with other competitive baselines, it does have certain limitations that we plan to address in the future. First, at the method level, it still relies on transfer learning from BERT backbones given the scale of multi-label classification datasets in biomedical literature. In contrast, methods used in other domains include label clustering [49] and label graph attention [50]. We plan to investigate these methods and quantify whether they are effective for biomedical literature. Second, it has more hyperparameters to tune (such as finding the optimal label pair threshold) than other, more straightforward BERT-based models. It would be better to incorporate dynamic modules that learn these hyperparameters adaptively. Third, the Label Pair Module focuses only on co-occurring labels, which may miss more complex scenarios such as labels in n-ary relations.
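To make the second and third limitations more concrete, the following is a minimal sketch of how a fixed label pair threshold might be applied. It is an illustration under assumptions, not the procedure used in LITMC-BERT: the function `select_label_pairs`, the parameter `min_ratio`, and the definition of the threshold as the fraction of documents containing both labels are all hypothetical, and the toy annotations are made up.

```python
from collections import Counter
from itertools import combinations


def select_label_pairs(label_sets, min_ratio):
    """Return label pairs whose co-occurrence ratio is at least min_ratio.

    label_sets: iterable of label collections, one per document.
    min_ratio:  hypothetical threshold on
                (documents containing both labels) / (all documents).
    """
    label_sets = [set(labels) for labels in label_sets]
    pair_counts = Counter()
    for labels in label_sets:
        # Sort so that each unordered pair maps to one canonical tuple.
        # Only pairs (size 2) are counted; n-ary relations are not captured.
        pair_counts.update(combinations(sorted(labels), 2))
    n_docs = len(label_sets)
    return {
        pair: count / n_docs
        for pair, count in pair_counts.items()
        if count / n_docs >= min_ratio
    }


# Toy annotations with LitCovid-style topic names (made-up data).
docs = [
    {"Treatment", "Mechanism"},
    {"Treatment", "Mechanism"},
    {"Diagnosis", "Treatment"},
    {"Prevention"},
]
selected = select_label_pairs(docs, min_ratio=0.5)
print(selected)
```

In this sketch the set of selected pairs changes with `min_ratio`, which is why such a threshold has to be tuned per dataset, and the use of size-2 combinations shows why relations among three or more labels fall outside what a pair-based module can represent.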

CONCLUSION
In this paper, we propose LITMC-BERT, a novel transformer-based multi-label classification method for biomedical literature. Compared with existing multi-label classification methods in biomedical literature, it captures both label-specific features and the correlations between label pairs. The multi-task training approach also makes it more efficient than binary models. LITMC-BERT achieved the highest overall performance on two datasets compared with three baselines. It also takes only ~18% of the inference time taken by the previous best model for COVID-19 literature. LITMC-BERT has been deployed in the LitCovid production system to make curation more sustainable and effective. We plan to further improve the method so that it is more dynamic and capable of handling more complex relations among labels, and to further quantify its effectiveness on multi-label classification tasks beyond biomedical literature (such as clinical notes).

Qingyu Chen received a Ph.D. in Biomedical Informatics from the University of Melbourne. He is currently a research fellow at the BioNLP lab, National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), and the lead instructor of text mining courses at the Foundation for Advanced Education in the Sciences. His research interests include biomedical text mining, medical image analytics, and biocuration. Dr. Chen has published over 30 first-authored papers and 50 papers in total. He serves as the Associate Editor of Frontiers in Digital Health, ECR editor of Applied Clinical Informatics, and PC member for the IEEE International Conference on Healthcare Informatics and the ACL BioNLP workshop.

Jingcheng Du received a Ph.D. in health informatics from The University of Texas Health Science Center at Houston (UTHealth), TX, USA. He is an assistant professor in health informatics at UTHealth School of Biomedical Informatics. His research interests include machine learning, biomedical natural language processing, and knowledge representation.

Alexis Allot received his Ph.D. in Bioinformatics in 2015 from the University of Strasbourg, France. He then worked at EMBL/EBI as a bioinformatician and is now working as a postdoctoral fellow at NIH. His research interests include biomedical text mining, data mining, and web development.

Zhiyong Lu received the Ph.D. degree in bioinformatics from the School of Medicine, University of Colorado in 2007. Dr. Lu is currently Deputy Director of Literature Search, National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), directing R&D to improve literature search services such as PubMed and LitCovid. He is also an NIH Senior Investigator (with early tenure), leading the Natural Language Processing (NLP) and Machine Learning research at NLM/NIH. Dr. Lu has published over 200 scientific articles and books. He serves as Associate Editor for the journals Bioinformatics and Artificial Intelligence in Medicine. Dr. Lu is an elected fellow of the American College of Medical Informatics.

Fig. 2. An overview of BERT-based multi-label classification models for biomedical literature, using an example of classifying two labels (Labels 1 and 2). (A) Binary BERT: train a BERT model for each label; (B) Linear BERT: train a shared BERT model and output all the label probabilities at once; (C) LITMC-BERT (our proposed approach): train a shared BERT model, capture label-specific features (Label Module), model label pair relations (Label Pair Module), and predict both labels and their co-occurrences via co-training.

Fig. 4. The distributions of macro-F1s for each model on the LitCovid BioCreative (A) and HoC (B) datasets. Each model was repeated 10 times and the Wilcoxon rank sum test (confidence interval at 95%; one-tailed) was performed. The P-values are shown in the figure.

Fig. 5. The performance (F1) of the methods for each label in the LitCovid dataset.

Fig. 6. The performance (F1) of the methods for each label in the HoC dataset.

TABLE 1. CHARACTERISTICS OF THE EXPERIMENT DATASETS. #ARTICLES: THE NUMBER OF ARTICLES; LABEL (%): THE PROPORTION OF THE ARTICLES WITH A SPECIFIC LABEL; SOME LABEL NAMES OF THE HOC DATASET ARE SHORTENED FOR PRESENTATION PURPOSES.

TABLE 4. ADDITIONAL EVALUATION MEASURES OF THE METHODS ON THE LITCOVID AND HOC DATASETS.