TaughtNet: Learning Multi-Task Biomedical Named Entity Recognition From Single-Task Teachers

In Biomedical Named Entity Recognition (BioNER), the use of current cutting-edge deep learning-based methods, such as transformer-based language models (e.g. BERT, GPT-3), can be substantially hampered by the absence of publicly accessible annotated datasets. When the BioNER system is required to annotate multiple entity types, various challenges arise because the majority of current publicly available datasets contain annotations for just one entity type: for example, mentions of disease entities may not be annotated in a dataset specialized in the recognition of drugs, resulting in a poor ground truth when using the two datasets to train a single multi-task model. In this work, we propose TaughtNet, a knowledge distillation-based framework allowing us to fine-tune a single multi-task student model by leveraging both the ground truth and the knowledge of single-task teachers. Our experiments on the recognition of mentions of diseases, chemical compounds and genes show the appropriateness and relevance of our approach w.r.t. strong state-of-the-art baselines in terms of precision, recall and F1 scores. Moreover, TaughtNet allows us to train smaller and lighter student models, which may be easier to use in real-world scenarios, where they have to be deployed on limited-memory hardware devices and guarantee fast inferences, and shows a high potential to provide explainability. We publicly release both our code on GitHub and our multi-task model on the Hugging Face repository.


I. INTRODUCTION
Numerous industrial sectors, including healthcare, are being revolutionised by the uncontrolled growth of data produced by humans and machines as well as the availability of [...]. Concomitantly, efforts are being made to collect and make available the unstructured health information associated with hospital admissions (e.g. EHRs, laboratory tests, medications). As a result, the field of biomedical text understanding can profitably benefit from the current advancements in Deep Learning and Natural Language Processing techniques.
Biomedical Named Entity Recognition (BioNER) consists in identifying mentions of biomedical entities (e.g. disorders, chemical compounds, genetic information) from unstructured text data. It is the first and essential step of many text understanding applications, such as the construction of knowledge graphs for data representation and analysis or conversational agents including research assistants and medical chatbots. It is extremely difficult to develop a BioNER system that can recognise a wide range of entity types with high precision and recall for a number of reasons, including:
• Presence of synonyms, alternate spellings, polysemous words: Biomedical datasets are characterized by a large number of synonyms or alternate spellings of entities, which are often referred to with non-standard abbreviations; polysemy is very common, i.e. the same token could represent different entities based on its context (e.g. the token "VHL" may refer to the Von Hippel-Lindau disease or to the name of the gene which causes the disease).
• Lack of annotated data: To guarantee high quality, the labeling process of healthcare datasets requires time, effort and domain knowledge. As a result, there is a lack of publicly available training data. Furthermore, the majority of datasets cover only one or two entity types, making it necessary to integrate different data sources.
• Inference time and memory constraints: Being (usually) a component of a larger pipeline architecture, the BioNER system has to be able to promptly provide its results when required. Moreover, in conversational agents, it may be necessary to deploy the system on devices with a limited amount of memory.
NER systems for biological text mining used to be primarily dictionary- and rule-based, but they had a number of issues, including the out-of-vocabulary problem, i.e. they struggled to deal with unseen and/or polysemous words and had a low recall.
TABLE I: EXAMPLE OF TAUGHTNET OUTPUT FOR THE IDENTIFICATION OF DISEASE, CHEMICAL AND GENE MENTIONS, COMPARED TO THE GROUND TRUTH.
As a result of the availability of an increasing number of human-labeled datasets, BioNER systems evolved over time by means of deep learning techniques able to infer features from sentence contexts. These methods were typically based on Bidirectional Long Short-Term Memory networks with Conditional Random Fields (BiLSTM-CRF) [1], [2] and/or tried to capture character-level features of words [3], [4], [5], [6]. Recently, large-scale language models pre-trained on biomedical corpora and fine-tuned over BioNER datasets [7], [8], [9], [10], [11] have shown their remarkable potential to enhance the state of the art of biomedical entity recognition and their promising prospects for improvement as the availability of training data increases [12].
Nevertheless, the above-mentioned models usually have hundreds of millions of parameters, and recent research demonstrates that as the training parameters are increased, performance on downstream tasks improves [13]. The expansion of model parameters implies computational and memory limitations, which may make it more difficult to use these systems in real-world settings.
In this paper, we aim to use the technological advances brought about by Transformer models to train a multi-task BioNER system capable of recognizing multiple entity types from its inputs while dealing with the data shortage affecting the biomedical field, where most publicly available datasets contain tags for only one entity type. Designing, training and deploying a separate Transformer-based BioNER model for each available dataset is extremely impractical, owing both to the memory requirements of such models and to the problems arising from overlapping predictions (e.g. two models assigning two different entity types to the same mention). Based on this premise, we propose TaughtNet, a multi-task framework based on knowledge distillation that allows us to fine-tune a single transformer architecture to recognize multiple entity types (an output example, taken from the test set of the NCBI-Disease dataset, is shown in Table I). Similar works from Khan et al. [14] and Yoon et al. [15] propose changes in the model architecture and training procedure to accomplish the task: the former trains multiple models sharing some layers in order to build a "shared knowledge" across the datasets, while the latter leverages an ensemble of single-task models. In contrast, TaughtNet produces a single, independent Student Transformer model that is capable of recognising a variety of entity types. Our Teachers do not create an ensemble that works together to make predictions; rather, they merely impart their knowledge to the student during the training phase.
In our experiments, we demonstrate that TaughtNet not only allows us to efficiently identify multiple entity types while ensuring state-of-the-art performance on three benchmark datasets, but can also be applied to smaller and lighter students, which may be more easily used in real-world scenarios where they must be deployed on hardware with limited memory and/or must guarantee quick inferences. Additionally, we show the potential of TaughtNet to easily provide explainability for its predictions, which is not always possible when utilising multiple models or intricately adjusted architectures.
The rest of the paper is structured as follows. We recall some background on Biomedical Named Entity Recognition, Pretrained Language Models, Multi-Task Learning and Knowledge Distillation and describe the main Related Works in Section II. Next, we introduce the training framework of TaughtNet in Section III. Experiments are described in Section IV. Finally, we conclude our paper and discuss future directions in Section V.

A. Biomedical Named Entity Recognition
The Named Entity Recognition (NER) task was introduced in [16] with the aim of identifying mentions of interest in unstructured texts. Biomedical Named Entity Recognition (BioNER) differs from general NER in several respects [17]: (1) datasets are characterized by a large number of synonyms or alternate spellings of entities, which are often referred to with (even non-standard) abbreviations; (2) entities often consist of long sequences of tokens, making it difficult to detect their boundaries; (3) entities are sometimes nested, e.g. an entity of class "species" can be part of a longer entity of class "disease"; (4) polysemy is very common, i.e. the same token may refer to different entity types, and the right one has to be chosen based on its context. While neural networks have been shown to generally outperform other approaches because of their capacity to analyse the syntactic and semantic structure of sentences [13], [18], annotating the data needed to train them is a laborious and time-consuming task that requires knowledge from domain experts. Furthermore, the lack of resources affecting the healthcare industry is primarily caused by privacy concerns surrounding the sharing of personal information. Unless it is feasible to employ a significant quantity of data provided by private entities (such as hospitals and organisations), it is preferable for a recognition system to maximise the usage of publicly accessible datasets.

B. Pretrained Language Models in the Healthcare Domain
In recent years, research interest in the area of Natural Language Processing has grown rapidly, especially thanks to the pretrain-and-finetune approach, which has brought significant improvements in many downstream tasks [13], [18], [19], [20]. A broad variety of pretrained models have been presented in the healthcare domain, driven by the well-established fact that pretraining the language model using domain-dependent training data significantly enhances performance [11]. Lewis et al. [12] provide an accurate comparison of the current landscape of pretrained healthcare models, highlighting the main training choices affecting downstream performance.
Techniques mostly vary in terms of the training data, which is acquired either from scientific literature (e.g., PubMed, Semantic Scholar, PMC) or from medical records (e.g. MIMIC-III or other private datasets). In the first scenario, data can be easily retrieved (at least for the English language), allowing for the collection of enormous amounts of raw text to train the model; in the second scenario, data is more difficult to gather and share due to privacy concerns, but is closer to the real world of medical practice than the idealised information found in textbooks and journals [21].
The focus of this paper is not to pretrain a novel language model, but rather to design a fine-tuning framework which, based on knowledge distillation, allows us to accomplish the NER task for multiple entities by exploiting pretrained language models and heterogeneous publicly available healthcare datasets, each of them referring to a different entity type.

C. Multi-Task Learning
Multi-Task Learning (MTL) aims to leverage multiple datasets that are similar to one another yet address various tasks [22]. The key idea is that the knowledge acquired by the model for solving a task (e.g., disease extraction) can help it in solving similar tasks (e.g. drug extraction).
In biomedical text mining, the first approaches (e.g. [23]) ignored subword information, which can be crucial to obtain high performance. Wang et al. [6] propose the combination of a multi-task BiLSTM-CRF model and a BiLSTM layer for modeling character sequences, obtaining promising results. To the best of our knowledge, [14] is the first work adopting the multi-task learning framework with a pre-trained language model. Yoon et al. [15] highlight that despite the high recall obtained by MTL models, their precision is relatively low, i.e. they have difficulties in differentiating between entity types, primarily due to the presence of polysemous words in text which confuse the model. To solve this false-positive problem, the authors propose CollaboNet, a network composed of multiple models, each one built on a different dataset for a different task, which collaborate during training and inference to output the final prediction. Despite the promising results, this framework requires "collaborator" models to be stored in memory at inference time and to provide their outputs when a prediction is required, resulting in low computational and memory efficiency.
To overcome the low-precision and the computational and memory consumption challenges, inspired by CollaboNet, we developed TaughtNet, a training framework which allows us to fine-tune a single transformer language model for multi-task BioNER based on Knowledge Distillation. In simple terms, we train single-task models on different datasets; these models do not collaborate to produce the predictions, but rather "teach" a single multi-task "student" how to predict the entity types in which they are experts.

D. Knowledge Distillation
Knowledge Distillation (KD) was originally proposed in [24] as a teacher-student framework which allows the knowledge embedded in a large "teacher" model to be shared with its small "student". Modeling the behavior of teacher and student with functions $f_T(\cdot)$ and $f_S(\cdot)$, respectively, the objective of KD is to minimize the following objective function:
$$\mathcal{L}_{KD} = \sum_{x \in X} L\big(f_T(x), f_S(x)\big),$$
where $X$ is the training dataset and $L(\cdot)$ denotes the loss function computing the difference between the two behavior function outputs for the input $x \in X$. With the primary aim to "compress" the knowledge embedded in a large model (which shows good performance but is too large to be used in real scenarios) into a smaller one, the application of KD in NLP and pre-trained models has been extensively studied [25], [26], [27], [28], [29], [30], [31].
Research on the application of the KD framework for purposes other than model compression is restricted to a few works. Reimers et al. [32] try to transfer the knowledge embedded in an English BERT model to the German language. In [33], a finetuned BERT teacher is used as extra supervision to improve the text generation performance of conventional Seq2Seq student models.
To the best of our knowledge, TaughtNet is the first approach exploiting KD in a NER scenario to transfer the knowledge encoded in a variety of teachers, specialized in single entity types, into a single student, which learns to recognize all the entity types.
The multi-teacher scenario in the application of the KD approach has been thoroughly investigated [34], [35], [36], [37]. Fukuda et al. [35] hypothesize that the different "views" provided by various teacher distributions may help the student generalize better while also capturing the complementary information embedded in each teacher stream. In [34], the teacher is an ensemble of models whose outputs are determined by the combination of the individual model predictions, and the student learns to imitate its behavior by minimizing the Kullback-Leibler (KL) divergence [38] between student and teacher distributions (which the authors prove to be equivalent to minimizing the cross-entropy error between the two distributions). The use of an ensemble knowledge distillation framework in [36] results in better student accuracy thanks to the encouragement of heterogeneity in feature learning. [37] highlights the importance of assigning the proper weights to teachers when distilling their knowledge.
In contrast to traditional KD approaches, where teachers and students share the same tasks, we aim to design a student able to handle all the tasks learned from teachers in a single model. Tan et al. [39] propose a similar approach, designing a multilingual translation system based on knowledge distillation from multiple individual teachers handling separate language pairs. Their experimental results, showing that the multilingual model reaches comparable performance with teachers (even outperforming them in many cases), further encourage our work.

III. METHOD
In this work, we aim to leverage a set of publicly available healthcare datasets to train a single multi-task BioNER model. A comprehensive overview of our framework is shown in Fig. 1.
To help the reader follow our methodology, we summarize the adopted notation in Table II and support every methodological step with a running example.

A. Problem Formulation
Let $E$ be the set of entity types we aim to identify, with $e_i \in E$ representing the $i$-th entity type ($i \in \{1, \dots, |E|\}$). A corpus of annotated sentences $D_i = (X_i, Y_i)$ is associated with each entity type, with $X_i$ being the set of sentences $x = \{x_j \mid j \in \{1, \dots, H\}\}$, where $H$ represents the maximum sequence length, and $Y_i$ being the relative set of labels. In this work, we will refer to the IOB2 annotation schema [40], assigning the "B" label to the beginning, the "I" label to the inside and the "O" label to the outside of an entity mention.
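As an illustration of the IOB2 schema mentioned above, the following minimal sketch encodes entity mentions, given as token spans, into B/I/O tags. The helper name `spans_to_iob2`, the span convention (end index exclusive) and the example sentence are assumptions made for this illustration only.

```python
from typing import List, Tuple


def spans_to_iob2(num_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Encode entity mentions as IOB2 tags.

    Each span is a (start, end) pair of token indices, end exclusive
    (a convention assumed for this sketch). Spans must not overlap.
    """
    tags = ["O"] * num_tokens          # every token starts outside any mention
    for start, end in spans:
        tags[start] = "B"              # first token of the mention
        for j in range(start + 1, end):
            tags[j] = "I"              # remaining tokens of the mention
    return tags


# Hypothetical sentence with one disease mention covering tokens 6..8.
tokens = ["Mutations", "in", "the", "VHL", "gene", "cause",
          "Von", "Hippel-Lindau", "disease"]
disease_tags = spans_to_iob2(len(tokens), [(6, 9)])
for token, tag in zip(tokens, disease_tags):
    print(f"{token}\t{tag}")
```

Note that in a disease-only dataset the gene mention "VHL" stays tagged as O, which is exactly the partial-annotation issue the framework addresses.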
Based on such datasets, our aim is to learn a model $f(\cdot)$ able to map each token $x_j$ in a sentence $x$ to its label $y_j \in Y_{multi}$, where:
$$Y_{multi} = \{\text{O}\} \cup \{\text{B-}e_i, \text{I-}e_i \mid e_i \in E\}.$$

B. TaughtNet
The structure of this section reflects the procedural steps summarized in Fig. 1 by comprehensively describing the phases involved in the training procedure: (1) datasets aggregation, (2) retrieval of teacher distributions, (3) aggregation of teacher distributions and (4) student training.
1) Datasets Aggregation: Based on the available training datasets $D_1, D_2, \dots, D_{|E|}$, we build an aggregated dataset:
$$D_S = (X_S, Y_S),$$
where $X_S$ results from the concatenation of the sentences contained in each single-task dataset, $X_S = X_1 +\!\!+ X_2 +\!\!+ \dots +\!\!+ X_{|E|}$, and the same goes for the labels $Y_S$, with the only difference that B and I labels are diversified based on the corresponding entities, as described in Section III-A. The aggregated dataset $D_S$ will serve as the data source to obtain the distribution representing the knowledge of teachers (used for knowledge distillation) and as a ground truth reference during student training.
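The aggregation step can be sketched as follows, assuming each single-task dataset is a list of (tokens, IOB2 labels) pairs. The function names, the data layout and the toy sentences are hypothetical and are not taken from the released code.

```python
from typing import Dict, List, Tuple

Sentence = List[str]
Labels = List[str]
Dataset = List[Tuple[Sentence, Labels]]


def relabel(labels: Labels, entity: str) -> Labels:
    """Diversify B and I labels with the entity type; O is left unchanged."""
    return [lab if lab == "O" else f"{lab}-{entity}" for lab in labels]


def aggregate_datasets(datasets: Dict[str, Dataset]) -> Tuple[Dataset, List[str]]:
    """Concatenate single-task datasets and build the multi-task label set."""
    aggregated: Dataset = []
    label_set = ["O"]
    for entity, data in datasets.items():
        label_set += [f"B-{entity}", f"I-{entity}"]
        for tokens, labels in data:
            aggregated.append((tokens, relabel(labels, entity)))
    return aggregated, label_set


# Toy single-task datasets (hypothetical sentences).
toy = {
    "disease": [(["VHL", "disease", "is", "rare"], ["B", "I", "O", "O"])],
    "chemical": [(["Haloperidol", "induces", "catalepsy"], ["B", "O", "O"])],
}
d_s, y_multi = aggregate_datasets(toy)
print(y_multi)
for tokens, labels in d_s:
    print(list(zip(tokens, labels)))
```

In the second toy sentence, "catalepsy" is left as O because the chemical dataset carries no disease annotations; this gap in the aggregated ground truth is what the teacher distributions are meant to compensate for.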
2) Retrieval of Teacher Predictions: Let $\theta_T^1, \theta_T^2, \dots, \theta_T^{|E|}$ be the parameters learnt by teacher models on their corresponding single-task datasets. For each sentence token $x_j \in x$, the $i$-th teacher will be able to provide the distribution $T_i^j$:
$$T_i^j(y_j = k \mid x; \theta_T^i), \quad k \in \{B, I, O\}.$$
3) Distributions Aggregation: Thanks to knowledge distillation, a student model learns how to mimic the output distribution of a teacher model. Differently from the standard approach, our student has to learn from a heterogeneous set of teachers, each of them able to identify a different entity type. Hence, we need an aggregation phase, where teacher distributions are merged into one single distribution to be used in the knowledge distillation framework.
Let $x_j \in x$ be a token we have to aggregate distributions for. Let us denote with $p_k^i = T_i^j(y_j = k \mid x; \theta_T^i)$ the probability which the $i$-th teacher assigns to the label $k$, where $k \in \{B, I, O\}$.
The probability of the token $x_j$ being assigned to the labels B-$e_i$, I-$e_i$ and O can be respectively computed as the probability of the intersection of the following events: the $i$-th teacher assigns B (respectively I) to $x_j$ while every other teacher assigns O; all the teachers assign O. Given the independence between teachers and the mutual exclusivity characterizing each teacher distribution, we can then compute the probabilities of the aggregated distribution $A$ as follows:
$$A_j(y_j = \text{B-}e_i \mid x; \theta_T^1, \dots, \theta_T^{|E|}) = p_B^i \prod_{m \neq i} p_O^m,$$
$$A_j(y_j = \text{I-}e_i \mid x; \theta_T^1, \dots, \theta_T^{|E|}) = p_I^i \prod_{m \neq i} p_O^m,$$
$$A_j(y_j = \text{O} \mid x; \theta_T^1, \dots, \theta_T^{|E|}) = \prod_{m=1}^{|E|} p_O^m.$$
Given a sentence token $x_j \in x$, $j \in \{1, \dots, H\}$, the output of this phase is the distribution of $Y$ labels:
$$A_j(y_j = k \mid x; \theta_T^1, \dots, \theta_T^{|E|}), \quad k \in Y.$$
Running Example: Let $x_j \in x$ be the input token and [...]
4) Student Training: Let us represent the student model with its parameters $\theta_S$ and its output distribution $S\{y_t = k \mid x; \theta_S\}$, $k \in Y$. The fine-tuning procedure aims to minimize a loss function composed of two terms: the former measuring the distance of the student distribution from its teachers' distribution, the latter representing its error on the ground truth. Formally, we can define our loss as shown below:
$$L = \lambda L_{KD} + (1 - \lambda) L_{GT}, \quad (13)$$
where $L_{KD}$ and $L_{GT}$ are the knowledge distillation and ground-truth losses, respectively, while $\lambda$ is a hyperparameter controlling their weight on the overall loss $L$.
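The distributions aggregation phase (step 3) can be illustrated with a small sketch for a single token. It assumes that a token takes the label B-e_i (or I-e_i) when the i-th teacher predicts B (or I) and every other teacher predicts O, and that teachers are independent, so the joint probabilities are products. The function name, the dictionary layout, the toy probabilities and the optional renormalisation step are assumptions of this sketch, not details confirmed by the text.

```python
from math import isclose, prod
from typing import Dict

TeacherDist = Dict[str, float]  # {"B": p_B, "I": p_I, "O": p_O} for one token


def aggregate_teachers(teacher_probs: Dict[str, TeacherDist],
                       normalise: bool = True) -> Dict[str, float]:
    """Merge per-entity teacher distributions into one multi-task distribution."""
    # All teachers say O -> the token is outside every entity type.
    aggregated = {"O": prod(p["O"] for p in teacher_probs.values())}
    for entity, p in teacher_probs.items():
        # Probability that every *other* teacher assigns O to this token.
        others_o = prod(q["O"] for other, q in teacher_probs.items()
                        if other != entity)
        aggregated[f"B-{entity}"] = p["B"] * others_o
        aggregated[f"I-{entity}"] = p["I"] * others_o
    if normalise:
        # Assumption of this sketch: rescale so the values sum to one, since
        # conflicting events (two teachers both predicting B/I) are dropped.
        total = sum(aggregated.values())
        aggregated = {k: v / total for k, v in aggregated.items()}
    return aggregated


# Toy teacher outputs for one token (made-up numbers).
teachers = {
    "disease": {"B": 0.7, "I": 0.1, "O": 0.2},
    "chemical": {"B": 0.05, "I": 0.05, "O": 0.9},
    "gene": {"B": 0.1, "I": 0.1, "O": 0.8},
}
raw = aggregate_teachers(teachers, normalise=False)
agg = aggregate_teachers(teachers)
best = max(agg, key=agg.get)
print(best, round(agg[best], 3))
```

With these toy numbers the disease teacher is confident while the other two mostly predict O, so the aggregated distribution puts most of its mass on B-disease.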
Although the Kullback-Leibler divergence is suitable for this knowledge-distillation task, [34] proves that minimizing the Kullback-Leibler divergence is equivalent to minimizing the cross-entropy error between two distributions. Hence, similarly to [39], it is sufficient to train the student model to minimize the following loss function:
$$L_{KD} = -\sum_{t=1}^{H} \sum_{k \in Y} A_t(y_t = k \mid x; \theta_T^1, \dots, \theta_T^{|E|}) \log S\{y_t = k \mid x; \theta_S\}, \quad (14)$$
where $H$ is the sequence length and $S\{\cdot\}$ denotes the student distribution.
The ground-truth-based loss function is:
$$L_{GT} = -\sum_{t=1}^{H} \sum_{k \in Y} \mathbb{1}\{y_t = k\} \log S\{y_t = k \mid x; \theta_S\}, \quad (15)$$
where the indicator $\mathbb{1}\{\cdot\}$ represents the one-hot label annotated in the ground truth.
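To make the two loss terms concrete, the sketch below evaluates them for a single token using plain Python. The convex combination with weight `lam` on the distillation term and `1 - lam` on the ground-truth term is an assumption of this sketch, as are the function names and the toy distributions; the sequence-level loss would sum these token-level values over the H positions.

```python
from math import isclose, log
from typing import Dict

Dist = Dict[str, float]


def cross_entropy(target: Dist, student: Dist, eps: float = 1e-12) -> float:
    """Cross-entropy -sum_k target[k] * log(student[k]) for one token."""
    return -sum(t * log(student[k] + eps) for k, t in target.items())


def token_loss(aggregated: Dist, student: Dist, gold_label: str,
               lam: float = 0.5) -> float:
    """Weighted sum of distillation and ground-truth losses for one token.

    The weighting lam * L_KD + (1 - lam) * L_GT is assumed for illustration.
    """
    l_kd = cross_entropy(aggregated, student)          # teacher knowledge term
    one_hot = {k: 1.0 if k == gold_label else 0.0 for k in student}
    l_gt = cross_entropy(one_hot, student)             # ground-truth term
    return lam * l_kd + (1.0 - lam) * l_gt


# Toy distributions over a reduced label set (made-up numbers).
aggregated = {"O": 0.1, "B-disease": 0.8, "I-disease": 0.1}
student = {"O": 0.2, "B-disease": 0.7, "I-disease": 0.1}

only_gt = token_loss(aggregated, student, "B-disease", lam=0.0)
only_kd = token_loss(aggregated, student, "B-disease", lam=1.0)
mixed = token_loss(aggregated, student, "B-disease", lam=0.5)
print(only_gt, only_kd, mixed)
```

Setting `lam` to 0 recovers standard supervised fine-tuning, while setting it to 1 trains the student on the aggregated teacher distribution alone.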

IV. EXPERIMENTS
In this section, we report our empirical evaluation of TaughtNet. In the first place, we train three students for three different biomedical entity types: diseases, chemical compounds and genetic information. Thereafter, we train several student architectures with different size and parameters, and report our results in the Results subsection. Specifically, we report: (1) a comparison of our best student with several state-of-the-art baselines; (2) results of different students with different architectures and size; (3) a comparison of all the students in terms of their level of agreement on predictions; (4) an error analysis w.r.t. different error types; and (5) an explainability experiment which investigates how the inner workings change from the teachers to the student.

A. Datasets and Teachers
We evaluate the performance of our approach with three benchmark datasets, each of which has been constructed from PubMed abstracts: NCBI-Disease [41], BC5CDR [42], BC2GM [43]. All the datasets, with their training, development and test splits, have been downloaded from: https://github.com/dmis-lab/biobert. We encoded word labels by using the IOB2 notation format [44].
For each one of the datasets, we trained our teachers by finetuning for 30 epochs a RoBERTa-large architecture which had been pre-trained on PubMed and PMC and MIMIC-III with a BPE Vocab learnt from PubMed [12].
A summary of the datasets, in terms of size and entity-type, and of the teachers, in terms of their precision, recall and F1 scores, is provided in Table III.

B. Evaluation Details
For all the datasets, we used the same dataset splits as BioBERT [8], which are based on earlier publications for a fair evaluation. In particular, training/development/test splits of NCBI-disease and BC5CDR corpora are the same as their original version, while the training set of BC2GM has been modified because the original corpus does not provide a development set. Thus, 2,500 sentences are split off from the training data to generate the development set.

C. Metrics
Quality: For the evaluation of the quality of the named entity recognition approaches, we used the Precision, Recall and F1 metrics computed with the seqeval Python framework. In simple terms, Precision is the percentage of entities found by the system which are correct, while Recall is the percentage of entities of the test set which are found by the system. A system with a low Precision is not able to differentiate between entity types, while a low Recall indicates the inability to recognize entities.
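The following sketch shows how entity-level Precision, Recall and F1 can be computed from IOB2 tag sequences with exact span and type matching. It is a simplified, standard-library illustration and not the seqeval implementation; in particular, stray I- tags without a preceding B- tag are ignored here, which is a choice made for this sketch. All names and the toy sentence are hypothetical.

```python
from typing import List, Set, Tuple

Entity = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def extract_entities(tags: List[str]) -> Set[Entity]:
    """Collect (type, start, end) mentions from an IOB2 tag sequence."""
    entities: Set[Entity] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel closes a final mention
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((etype, start, i))     # close the open mention
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]           # open a new mention
    return entities


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match entity-level scores for one tagged sentence."""
    gold_e, pred_e = extract_entities(gold), extract_entities(pred)
    tp = len(gold_e & pred_e)                   # same type and same span
    precision = tp / len(pred_e) if pred_e else 0.0
    recall = tp / len(gold_e) if gold_e else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = ["B-disease", "I-disease", "O", "B-gene", "O"]
pred = ["B-disease", "I-disease", "O", "B-chemical", "O"]
p, r, f = precision_recall_f1(gold, pred)
print(p, r, f)
```

In this toy case the span of the second mention is right but its type is wrong, so it counts both as a false positive and as a false negative.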
To measure the degree of agreement among different models, we used the Cohen's Kappa metric, which can be computed as follows:
$$\kappa = \frac{p_o - p_e}{1 - p_e},$$
where $p_o$ is the relative observed agreement among predictions, and $p_e$ is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category.
Memory occupation and inference time: The efficiency of models has been evaluated based on their size (in terms of MB of memory occupied) and the samples-per-second (SPS) processed during the training and inference phases. A model with too many parameters is difficult to deploy on hardware systems with strict memory constraints, while a slow model is difficult to integrate in complex systems where the NER engine is just a step in a pipeline. Our experiments have been performed on an Oracle Cloud Infrastructure (OCI) instance with an Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00 GHz (12 cores) and an NVIDIA Tesla V100 SXM2 GPU.
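The Cohen's Kappa metric defined above can be computed with a few lines of Python. The sketch below treats two models as two annotators assigning one label per token; the function name and the toy label sequences are made up for illustration.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's Kappa between two label sequences of equal length."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of positions with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both sequences use a single identical label
    return (p_o - p_e) / (1.0 - p_e)


# Toy token-level predictions from two hypothetical models.
model_a = ["O", "O", "B", "I", "O", "B"]
model_b = ["O", "O", "B", "O", "O", "B"]
kappa = cohen_kappa(model_a, model_b)
print(round(kappa, 3))
```

Here the two models agree on five of six tokens, but part of that agreement is expected by chance because O dominates both sequences, which is why Kappa is lower than the raw agreement rate.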

D. Settings and Hyperparameters
We developed our framework on top of the Hugging Face transformers library [45]. We experimented with several model architectures and weights of varying size. Specifically, we used RoBERTa-large-PM-M3-Voc and RoBERTa-base-PM-M3-Voc-train-longer from Lewis et al. [12], and huawei-noah/TinyBERT_General_4L_312D and distilroberta-base from the Hugging Face model hub.
To fine-tune our models, we used an Adam optimizer with an initial learning rate of 5e-5 and β1 = 0.9, β2 = 0.999, ε = 1e−8. The batch size was set to 8 and the maximum sequence length to 128.
Although the results in terms of quality are always satisfactory from the first epochs, we usually found the highest performance after 20 epochs. This result is consistent with findings from Lee et al. [8].
As concerns the loss function designed to train the student model, we used the KLDivLoss and NLLLoss PyTorch implementations for the knowledge distillation L KD and ground-truth L GT loss components.

E. Results
1) Comparison With Baselines: We compare the quality of our best student with several baselines, described as follows:
• Merged: the simplest way to train a multi-label NER model from single-entity datasets is to merge them into one aggregated dataset to be used for training and testing. We fine-tuned the same RoBERTa-large model architecture used for teachers on this dataset until convergence.
• CollaboNet: aggregates the results of collaborator single-task models, and uses them as an additional input to the target multi-task model.
• MT-BioNER: multi-task transformer-based neural architecture, where different models for different datasets share some layers to build a "shared" knowledge across tasks.
Table IV reports results over the three benchmark datasets in terms of Precision, Recall and F1 scores. Thanks to the utilization of high-performing teachers, our student model achieves the best results for each of the datasets. Interestingly, the performance obtained on the NCBI dataset surpasses that of the related teacher thanks to the indirect positive effect of (1) the data augmentation obtained by merging all the datasets and (2) the joint training based on both the ground truth and teacher predictions. A comparative discussion with baselines is provided in Section IV-F.
2) Smaller and Smaller Students: Thanks to its knowledge distillation-based architecture, one of the advantages of TaughtNet is that it offers a straightforward way to train small multi-task models by leveraging the knowledge of large and high-performing teachers. In our experiments, we compare results of the student architectures described as follows:
• Distil: distilled version of BERT-base introduced by Sanh et al. [27]. It has 40% fewer parameters and runs 60% faster than BERT-base.
• Tiny: distilled version of BERT-base introduced by Jiao et al. [31], 7.5x smaller and 9.4x faster on inference than BERT-base.
Results are reported in Table V in terms of model size, samples-per-second (SPS) processed during the training and inference phases, and F1 scores over the three benchmark datasets. Interestingly, the Base architecture achieves F1 scores closely resembling those of its Large counterpart. The smaller students obtain lower F1 scores, but their considerable improvement in memory occupation and processing time could make them a suitable choice in limited-resource scenarios. In the experiments that follow, we delve into the differences between these students and their corresponding teachers.

3) Levels of Agreement (Cohen's Kappa):
We computed the Cohen's Kappa metric to measure the degree of agreement among models and the ground truth. Heatmaps in Fig. 2 show agreements over the three benchmark datasets among the ground truth, the teacher, and the size-decreasing student architectures. Despite the disagreement between distilled models and their teacher (which highlights a limitation in distilling their knowledge, to be explored in future work), results show an overall agreement between teachers and their students and among student architectures.
4) Error Analysis: We further explored the differences among models based on the number of correctly-retrieved entity mentions (CORRECT), new predictions deriving from the application of the framework (NEW) and their errors, which can be divided into five categories described as follows: [...] recognizes the presence of an annotated named entity, but the span is wrong. It can be seen from the data in Table VI that students trained with TaughtNet allow us to retrieve a considerable number of novel entity mentions which were not annotated in the ground truth, thanks to the knowledge of the teachers employed. Consistent with the above-reported experiments, Large and Base students are able to detect a significantly higher number of new entity mentions w.r.t. distilled architectures. The main limitation of distilled architectures w.r.t. their "larger" counterparts is in the number of CFN errors, i.e. they are not able to identify mentions which are actually annotated.
The majority of errors fall in the RLOS category, meaning that models are able to identify an entity mention, but the range detected is not the same as the ground truth. However, previous works have shown that this type of error is often a result of the subjectivity and inconsistency of span annotations [46], [47]. Some examples are shown in Table VII. It is important to note that many of the errors are due to the ability of our model to recognize multiple entity types: for example, the two words of the gene mention "estrogen receptor" (see WRLS, 2nd example) are assigned by our model to two different entity types ("estrogen" as a chemical compound, "receptor" as a gene).
5) Explainability: We apply Integrated Gradients [48] to assign an importance score to each input token by approximating the integral of gradients of the output w.r.t. the inputs. To investigate how the inner workings of the models change from Teachers to Student, we report in Fig. 3 the explanations from the three large Teachers and the resulting Student for the sentence: "Subchronic inhibition of nitric-oxide synthesis modifies haloperidol-induced catalepsy and the number of NADPH-diaphorase neurons in mice", which contains at least one mention per entity type. Interestingly, although our experiment was carried out only to prove the effortless interpretability of our method (which does not modify the architecture of the Student model and thus can leverage off-the-shelf methods to explain its predictions), we also observed that the explanations provided by the Student are better targeted (i.e. lower number of influential tokens) and more understandable.
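The idea behind Integrated Gradients can be illustrated on a toy differentiable model. The sketch below approximates the path integral of gradients with a midpoint Riemann sum for a logistic function of three input features; the model, its weights, the zero baseline and all function names are assumptions made for this illustration and are unrelated to the Transformer models and tooling used in the experiments.

```python
from math import exp
from typing import Callable, List

Vector = List[float]


def integrated_gradients(grad_f: Callable[[Vector], Vector], x: Vector,
                         baseline: Vector, steps: int = 200) -> Vector:
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    avg_grad = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps                    # midpoint of the s-th slice
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = grad_f(point)
        for i, g in enumerate(grads):
            avg_grad[i] += g / steps                 # running average of gradients
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


WEIGHTS = [2.0, -1.0, 0.5]  # toy model parameters (made up)


def model(v: Vector) -> float:
    """Logistic model standing in for a network output score."""
    z = sum(w * vi for w, vi in zip(WEIGHTS, v))
    return 1.0 / (1.0 + exp(-z))


def model_grad(v: Vector) -> Vector:
    """Analytic gradient of the logistic model w.r.t. its inputs."""
    y = model(v)
    return [w * y * (1.0 - y) for w in WEIGHTS]


x = [1.0, 0.5, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(model_grad, x, baseline)
print([round(a, 4) for a in attributions])
print(round(sum(attributions), 4), round(model(x) - model(baseline), 4))
```

The last line checks the completeness property: the attributions sum (up to the approximation error) to the difference between the model output at the input and at the baseline, which is what makes the per-feature scores interpretable as contributions.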

F. Discussion
In our experiments, we have studied in-depth the effects of learning from several single-task transformer-based teachers and contrasted TaughtNet with strong baselines from the current literature.
In Table VIII we show a methodological comparison of state-of-the-art methods accompanied by averaged precision, recall and F1 scores on the benchmarking datasets used in this work. We can observe that multi-task methods that are based on high-performing pre-trained transformer models consistently outperform CollaboNet in most situations, despite the fact that CollaboNet effectively addresses the low-precision problem of multi-task learning systems by defining a collaborative framework made of single-task BiLSTM-CRF models that also solves the type conflict problem, i.e. different models recognising the same mentions. In TaughtNet, we have leveraged the advantages of both multi-task learning and transformers by dealing with the same low-precision and type conflict problems as CollaboNet.
The result is a single high-performing fine-tuned transformer model able to identify mentions of several entity types, which makes it (1) easy to lighten under constraining hardware and computing time requirements thanks to lighter students (e.g. DistilBERT, TinyBERT), and (2) easy to interpret by the use of off-the-shelf explainability techniques, since we do not change any module in the architecture.

V. CONCLUSION & FUTURE WORK
The difficulty in finding a single dataset with all the entities required for a Biomedical Named Entity Recognition System (e.g. diseases, genes, species, drugs) has laid the foundations of this work. TaughtNet has the objective to integrate various publicly available single-task healthcare datasets in a single BERT architecture which can be used as a fast and highly performing BioNER engine in real applications, such as conversational agents or knowledge graph development.
Experimental results demonstrate that not only does TaughtNet surpass strong state-of-the-art baselines, but it is also a valuable option when constrained by strict computational and memory requirements thanks to its ability to train lightweight models that distill the knowledge from high-performing single-task teachers. Furthermore, we have shown the potential of TaughtNet to provide explainability, which is a valuable advantage, especially when dealing with healthcare data.
There is abundant room for further progress in exploring the use and application of knowledge distillation to bring the student performance as close as possible to that of teachers. As future work, we would like to integrate more datasets and to extend the framework not only to other downstream tasks, but also to other application domains, since the technique is not dependent on the biomedical domain.

ACKNOWLEDGMENT
This work is carried out within the framework of the Knowledge graphs for next-generation health science applications project with Oracle America Inc. and with the support of the Oracle for Research program, within a research agreement between Oracle Research Lab and the Department of Electrical Engineering and Information Technologies at the University of Naples Federico II (DIETI). All opinions reflected in this paper are those of the authors and not necessarily those of the funding agency.