Transformers for Clinical Coding in Spanish

Automatic clinical coding is an essential task in the process of extracting relevant information from the unstructured documents contained in electronic health records (EHRs). However, most research on computer-based methods for clinical coding focuses on texts written in English, owing to the limited availability of medical linguistic resources in other languages. With nearly 500 million native speakers, there is worldwide interest in processing healthcare texts in Spanish. In this study, we systematically analyzed transformer-based models for automatic clinical coding in Spanish. Using a transfer-learning-based approach, the three existing transformer architectures that support the Spanish language, namely, multilingual BERT (mBERT), BETO and XLM-RoBERTa (XLM-R), were first pretrained on a corpus of real-world oncology clinical cases with the goal of adapting the transformers to the particularities of Spanish medical texts. The resulting models were fine-tuned on three distinct clinical coding tasks, following a multilabel sentence classification strategy. For each analyzed transformer, the domain-specific version outperformed the original general-domain model across these tasks. Moreover, combining the developed strategy with an ensemble approach that leverages the predictive capacities of the three distinct transformers yielded the best results, with MAP scores of 0.662, 0.544 and 0.884 on the CodiEsp-D, CodiEsp-P and Cantemist-Coding shared tasks, remarkably improving the previous state-of-the-art performance by 11.6%, 10.3% and 4.4%, respectively. We publicly release the mBERT, BETO and XLM-R transformers adapted to the Spanish clinical domain at https://github.com/guilopgar/ClinicalCodingTransformerES, providing the clinical natural language processing community with advanced deep learning methods for performing medical coding and other tasks in the Spanish clinical domain.


I. INTRODUCTION
The incremental adoption of electronic health records (EHRs) as a main component of the information systems of hospitals and healthcare services in general has raised a series of questions for the scientific community that remain open. One of the main problems derived from the growing use of EHRs is the need to extract the most significant information stored in these systems, the effective management of which has direct implications for improving patient health care. EHRs contain data whose volume is constantly growing and whose nature is heterogeneous, including personal information of patients, standardized health codes, medical images, and genetic data stored in a wide variety of formats. EHRs also include free-text documents that use domain-specific natural language (with specialized vocabulary and terminology) and store information about clinical notes, discharge reports, radiological or pathology reports, anamnesis and notes of clinical examinations, medical orders, etc. [1]. However, the unstructured nature of the text these documents contain makes it especially complex to directly and systematically extract relevant information on medical concepts from them.
Structured clinical information, in the form of clinical data encoded according to controlled and standardized vocabularies, is a fundamental resource that enables not only the administrative management of data and the efficient exchange of information between institutions but also the development of statistical analyses [2], of a retrospective or prospective nature, and epidemiological studies, which can be carried out based on data from cohorts or patient populations. These analyses can subsequently serve as support for clinical decision-making and for optimizing the management of the economic and financial resources of healthcare services [3]. Automatic clinical coding consists of transforming unstructured clinical texts written in specialized natural language into structured formats that conform to standardized coding terminologies using computational methods [2]. Given the importance of the natural-language descriptions stored in the different documents involved in medical and healthcare processes in EHRs, automatic clinical coding is an essential task in the process of extracting meaningful information from patient data. It improves many clinical and productivity aspects of the work of the health professionals involved, freeing professional coders from the arduous task of manually coding the large amounts of text generated in daily clinical practice [4], and supports the optimization of diagnoses and procedures.
Traditionally, natural language processing (NLP) strategies have been applied to the problem of automatic clinical coding [2], [5]–[7], although more recent studies focus on the use of rule-based approaches, machine learning (ML) strategies, and deep learning (DL) models [8]–[12]. However, most of the previous works in the literature focus on texts written in English due to the limited availability of annotated corpora with standardized clinical coding labels and additional linguistic resources in other languages.
The main goal of this study is to develop clinical coding models for Spanish medical documents by adapting several transformer-based models to the particularities of the Spanish healthcare domain. For this purpose, we pretrained the models using a private corpus of deidentified real-world oncology clinical texts written in Spanish. The resulting models were further fine-tuned on 3 clinical coding tasks, namely, CodiEsp-D, CodiEsp-P and Cantemist-Coding, using two public annotated clinical Spanish corpora [13]–[16]. In this work, we explored 3 transformers that support the Spanish language: multilingual BERT (mBERT) [17], BETO [18] and XLM-RoBERTa (XLM-R) [19]. With the aim of adapting transformers to the particularities exhibited by real-world small-data clinical coding tasks, we developed a multilabel sentence classification approach that also serves as a data augmentation procedure. Following the proposed strategy, the transformers achieved new state-of-the-art (SOTA) performance in each of the clinical coding tasks explored in this work. The 3 transformer-based models adapted to the Spanish clinical domain are publicly available at https://github.com/guilopgar/ClinicalCodingTransformerES.

II. BACKGROUND

A. ICD-10 CODING
The ICD classification system (International Statistical Classification of Diseases and Related Health Problems) establishes a standardized coding that allows the statistical analysis of the mortality and morbidity of patients belonging to healthcare services. The ICD-10 edition, whose corresponding Spanish version is called CIE-10-ES (see Fig. 1), is structured hierarchically into chapters that group codes of up to seven characters, allowing it to encode over 70,000 diagnoses and 72,000 procedures. Hence, the length of the ICD-10 codes ranges from a minimum of 3 characters to a maximum of 7 characters, depending on the degree of specificity needed for the concept to be encoded (see Fig. 1). Coding with ICD-10 in general, and with CIE-10-ES in particular, presents great difficulties, mainly due to the enormous sparsity of the code distribution: certain diseases and procedures are much less frequent than others, which results in corpora annotated with thousands of rare codes and only a few tens or hundreds of frequent codes. As a consequence, the datasets tend to be highly imbalanced and, in addition, tend to present important biases derived from the local factors characteristic of each health service or institution. All of this makes it difficult to develop automatic coding algorithms that reach acceptable levels of performance.
For the particular case of automatic coding of clinical texts in Spanish using CIE-10-ES codes, few works have been published. Blanco et al. [20] explored different DL models using a multilabel classification approach to address the problem of clinical coding on a corpus obtained from the public hospital system of the Basque Country in Spain. Pérez et al. [21] used an approach based on latent Dirichlet allocation (LDA) to carry out a multilabel classification of EHRs obtained from the cardiology department of the same public hospital system, obtaining positive results with the 124 most frequent CIE-10-ES codes present in the corpus. More recently, Almagro et al. [22] compared algorithms based on binary outputs, groups of subsets and extreme classification (eXtreme Multilabel Text Classification, or XMTC) to assign CIE-10-ES codes to clinical texts from hospital discharge reports of the Hospital Universitario Fundación Alcorcón in Spain, concluding that ensemble methods based on weighting each code according to its training frequency and performance can achieve better scores on extreme distributions, such as CIE-10-ES coding.

B. CLINICAL CODING SHARED TASKS
In the last five years, the CLEF eHealth Lab has organized a series of shared clinical coding tasks on multilingual or non-English corpora. In 2020, Task 1 of CLEF eHealth corresponded to the CodiEsp track [16], which stands out for being the first shared task consisting of the automatic coding of clinical cases in Spanish. To this end, the CodiEsp corpus was provided: a synthetic corpus consisting of 1,000 samples of clinical cases, manually curated by the organizers of the task. Within the CodiEsp track, three different shared subtasks were proposed, all of them based on the CodiEsp corpus: CodiEsp Diagnosis (CodiEsp-D), CodiEsp Procedure (CodiEsp-P) and Explainable AI (CodiEsp-X). Given a clinical free text written in specialized natural Spanish language, the CodiEsp-D and CodiEsp-P tasks consisted of assigning to the text a list of CIE-10-ES diagnostic and procedural codes, respectively. In contrast, the CodiEsp-X subtask required the prediction of both types of CIE-10-ES codes together with the exact reference to the text segment that justified the assignment of each code; that is, this last subtask became a named-entity-recognition and normalization (NER-N) task. Among the methodologies used by the participants to solve these subtasks, the best result for CodiEsp-D (with a mean average precision (MAP) value of 0.593) was obtained using an ML algorithm, while the best MAP score for CodiEsp-P (0.493) was achieved by systems that did not use ML. Finally, it should be noted that most of the models presented to the CodiEsp-X subtask used NLP algorithms from outside the scope of ML [16]. As a consequence of the line of work started by the evaluation tasks of the CodiEsp track, some studies have emerged in recent months that make use of the CodiEsp corpus, or of other specific clinical corpora in Spanish, to address the problem of automatic clinical coding using various methodologies, among which DL techniques stand out [12].
In the same line of work on automatic coding of clinical texts in Spanish, last year the Cantemist Track for Cancer Text Mining evaluation campaign was carried out [13], which proposed a series of shared tasks aimed at developing systems for assigning ICD-O codes (International Classification of Diseases for Oncology) to clinical oncology texts in Spanish. More specifically, the goal was to assign CIE-O-3.1 codes (the Spanish equivalent of ICD-O, version 3.1, for neoplasm morphology) to the texts included in a gold standard corpus with mentions of tumor morphologies. Among the methodologies used in the Cantemist shared tasks, the best results (with a maximum MAP of 0.847) were obtained using transformer-based language models [23] and bidirectional recurrent neural networks with LSTM units [13]–[15].

C. ATTENTION MODELS FOR NLP DOWNSTREAM TASKS
In recent years, a new family of models has emerged that is capable of associating each word with a contextual numerical representation that considers the specific context in which the word appears within the text. These types of models are known as contextual embeddings. Some of them are based on semisupervised sequence learning, as is the case with ELMo [24], ULMFiT [25], the Transformer [23], BERT [17] and, more recently, RoBERTa [26], T5 [27], XLM-R [19] and XLNet [28]. While those based on recurrent neural networks, such as ELMo or ULMFiT, present efficiency problems due to the sequential nature of these networks, the models based on the Transformer architecture, such as BERT, RoBERTa, T5, XLM-R and XLNet, rely on attention mechanisms that, among other advantages, increase computational efficiency by allowing the parallelization of a large part of the network architecture. Furthermore, another particularity of these attention models is that they can be pretrained on a general-domain corpus and later fine-tuned and adapted on a domain-specific corpus to solve a particular NLP task. This technique, known as transfer learning (TL), is commonly used to fit DL algorithms to small datasets.
Recently, we advanced our research line on the application of TL approaches to problem solving in the biomedical field [29]–[31]. In this study, we continued the same line of work by applying a TL-based strategy to address the automatic clinical coding problem in Spanish using transformers. There are preliminary works in which BERT-based models have been applied, with moderate results, to single clinical coding tasks in Spanish [14], [15], [32]. However, there is a lack of studies that systematically analyze the performance of transformers in the Spanish medical domain, particularly across distinct clinical coding problems. Additionally, there is also a lack of publicly available transformer-based models pretrained on clinical corpora in Spanish. Given the significant interest in both industry and academia in developing automatic systems for extracting relevant clinical information contained in Spanish medical documents, the availability of transformers adapted to the particularities of the real-world Spanish clinical domain would facilitate the adoption of these models in downstream medical NLP tasks. Accordingly, the two main contributions of this work are, on the one hand, a systematic analysis of the performance of transformers on three distinct clinical coding tasks in Spanish and, on the other hand, the release of the first publicly available transformer-based models adapted to the Spanish clinical field, providing the medical NLP community with cutting-edge DL methods for performing clinical coding and other downstream tasks in the Spanish medical domain.

III. MATERIALS AND METHODS

A. CORPORA
1) DOMAIN-SPECIFIC PRETRAINING
With the aim of adapting the transformer-based models to the particularities of the Spanish clinical domain, we further pretrained them using a private corpus of deidentified clinical cases retrieved from the Galén Oncology Information System [33], [34]. The corpus corresponds to a collection of real-world oncology clinical texts written in Spanish by physicians from the oncology departments of the Hospital Regional Universitario and the Hospital Universitario Virgen de la Victoria in Málaga, Spain. In total, the corpus comprises 30.9K documents, 64.4M words and 437.6M characters.

2) CLINICAL CODING FINE-TUNING
We used 2 publicly available annotated clinical corpora in Spanish to fine-tune the models on 3 clinical coding tasks. Both the CodiEsp-D and CodiEsp-P tasks are based on the CodiEsp corpus [16], a collection of 1K clinical cases annotated with both CIE-10-ES diagnosis and procedure codes. The Cantemist corpus [13] comprises 1.3K clinical cases from the oncology domain, annotated with CIE-O-3 codes for the Cantemist-Coding task. Each of the two clinical corpora was split into training, development and test subsets. Table 1 summarizes the annotation distribution of the two analyzed corpora.
For each of the 3 clinical coding tasks addressed in this work, the available annotations contained the assignment of a set of clinical codes to each document in the corpus (see Fig. 2A). Additionally, the organizers of both the CodiEsp and Cantemist tracks also proposed two NER-N tasks, namely, CodiEsp-X, which was based on the CodiEsp corpus and considered the same code annotations used in the CodiEsp-D and CodiEsp-P tasks, and Cantemist-Norm, which considered the same documents and code annotations used in the Cantemist-Coding task. In contrast with the annotation format of the clinical coding tasks, the code annotations of the NER-N tasks contained an additional field indicating the mention in the text that supported the coding assignment (see Fig. 2B).

B. TRANSFORMER-BASED MODELS
In this work, we systematically analyzed the performance of different transformer-based models on clinical coding tasks in Spanish.For this reason, we explored 3 transformers that support the Spanish language, namely, mBERT [17], BETO [18] and XLM-R [19].To the best of our knowledge, these are the only publicly available transformer-based models that include Spanish among their supported languages.
• mBERT: the multilingual version of the BERT-Base model [17], pretrained on a corpus comprising Wikipedia texts from 104 languages. The model uses a multilingual WordPiece [35] vocabulary of ∼110K subtokens, and the total number of trainable parameters is ∼177M.
• BETO: the Spanish BERT model, which uses an architecture similar to the BERT-Base model, with a total of ∼110M trainable parameters [18]. The pretraining corpus exclusively contains texts in Spanish, including data from Wikipedia and the OPUS Project [36]. The model uses a Spanish vocabulary of ∼31K subwords.
• XLM-R: the multilingual version of the RoBERTa-Base model [26], pretrained following a modified version of the XLM approach [37] on a CommonCrawl corpus covering 100 languages [19]. The model uses a large multilingual SentencePiece [38] vocabulary of ∼250K subtokens, and the total number of trainable parameters is ∼278M.
The workflow of our TL approach is shown in Fig. 3, and each stage of the pipeline is described in the next subsections.

C. TRANSFER LEARNING APPROACH
1) UNSUPERVISED PRETRAINING
The transformers analyzed in this study were further pretrained on a corpus of unlabeled oncology clinical texts. Specifically, following the two pretraining objectives proposed by Devlin et al. [17], the two BERT-based models explored in this work, namely, mBERT and BETO, were optimized on the next sentence prediction (NSP) task and the masked language model (MLM) pretraining objective with the whole-word masking (WWM) modification. Following the approach developed by Conneau et al. [19], we used the MLM objective with the dynamic masking modification to pretrain the XLM-R model.
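As an illustration of this stage, the following is a minimal sketch of domain-adaptive MLM pretraining with whole-word masking using the Hugging Face transformers library. The corpus path, hyperparameters and Trainer setup are illustrative assumptions rather than the exact configuration used in this work, and the NSP objective applied to the BERT-based models is omitted for brevity.

```python
# Sketch: continued MLM pretraining with whole-word masking (WWM).
# "galen_notes.txt" is a hypothetical file with one unlabeled note per line.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForWholeWordMask, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

dataset = load_dataset("text", data_files={"train": "galen_notes.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Dynamically masks all subtokens of a randomly chosen word at once.
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer,
                                        mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-galen",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator)
trainer.train()
```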

2) SUPERVISED FINE-TUNING
In this work, we addressed the clinical coding problem in Spanish using transformers. Generally, clinical coding is addressed as a multilabel text classification task, where, for a collection of documents, a set of standard medical codes must be assigned to each text. However, when applying transformer-based models to real-world clinical coding tasks, two main issues arise. First, transformer architectures can only deal with fixed-length input sequences due to the quadratic time complexity of the self-attention layers with respect to the input sequence length [23]. For instance, the transformers analyzed in this work were designed to process subword input sequences up to a maximum length of 512. This represents a significant constraint when addressing long-text classification problems such as the clinical coding tasks faced in this study, since numerous clinical cases from the CodiEsp and Cantemist corpora have a subtoken sequence length above the maximum size supported by the transformers. Second, DL architectures have been shown to be primarily effective in contexts where a large set of samples is used to train the models, given their high number of trainable parameters. In small-data text classification scenarios, such as the clinical coding tasks faced in this work, only a few hundred (200-1,000) annotated texts are available to train the classification systems. Consequently, the transformers explored in this study, with hundreds of millions of weights, would be prone to critical overfitting issues if they were fitted on text classification tasks using a set of samples as scarce as the documents contained in the CodiEsp and Cantemist corpora.
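The first issue can be checked directly. Below is a hedged sketch of counting the subtokens a BERT-style tokenizer produces for a clinical case and comparing the count against the 512-subword limit; the example text is a hypothetical truncated case, not an actual CodiEsp document.

```python
# Sketch: verify whether a clinical case exceeds the 512-subtoken limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

text = "Paciente de 64 años con antecedentes de hipertensión ..."  # full case here
n_subtokens = len(tokenizer.encode(text))  # includes [CLS] and [SEP]
print(f"{n_subtokens} subtokens; exceeds limit: {n_subtokens > 512}")
```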
With the aim of adapting transformers to the particularities exhibited by real-world clinical coding tasks, based on our previous works [14], [32], we developed a 3-phase multilabel sentence classification approach that not only addresses the fixed-length input sequence limitation presented by transformers but also serves as a data augmentation procedure. In this way, leveraging the information available for the NER-N tasks (see Section III-A2), we applied our 3-phase strategy to each clinical coding task, i.e., CodiEsp-D, CodiEsp-P and Cantemist-Coding, to convert the multilabel long-text classification problem into a multilabel sentence classification task. The next paragraphs describe our sentence classification approach for clinical coding.
Phase 1: Creation of a corpus of annotated sentences. For each document in the clinical coding corpus, the text was first split into sentences using the SPACCC Sentence Splitter tool. Then, exploiting the additional information available for the corresponding NER-N task (CodiEsp-X in the case of the CodiEsp-D and CodiEsp-P tasks, and Cantemist-Norm in the case of the Cantemist-Coding task), each sentence was exclusively annotated with the clinical codes whose text references were contained within the limits of the sentence. As an example, Fig. 4 shows the annotated sentences obtained from the text of the S1130-14732005000300004-1 clinical case (see also Fig. 2A), using the diagnosis code annotations from the CodiEsp-X task available for the document (see also Fig. 2B). In this way, a corpus of sentences annotated with clinical codes was created. The generated corpus was significantly larger than the original clinical coding corpus, given that, instead of considering the whole text of a document as a single training sample, each sentence obtained from the document was treated as an individual training instance.
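The core of this phase can be sketched as follows, under simplifying assumptions: sentences and mentions are represented as character-offset spans, the helper function annotate_sentences is hypothetical, the sentence splitting and annotation formats of the SPACCC tool and the NER-N files are abstracted away, and the codes in the toy example are merely illustrative.

```python
# Sketch of Phase 1: label each sentence with the codes whose supporting
# mentions (from the NER-N annotations) fall inside the sentence span.
from typing import Dict, List, Tuple

def annotate_sentences(
    sentences: List[Tuple[int, int, str]],  # (start, end, text) per sentence
    mentions: List[Tuple[int, int, str]],   # (start, end, code) per mention
) -> List[Dict]:
    instances = []
    for s_start, s_end, s_text in sentences:
        codes = sorted({code for m_start, m_end, code in mentions
                        if m_start >= s_start and m_end <= s_end})
        # Each sentence becomes an individual training instance.
        instances.append({"text": s_text, "codes": codes})
    return instances

# Toy two-sentence document with one annotated mention per sentence.
sents = [(0, 23, "Paciente con neoplasia."), (24, 43, "Se realiza biopsia.")]
ments = [(13, 22, "c80"), (35, 42, "0dbn8zx")]  # illustrative codes
print(annotate_sentences(sents, ments))
# [{'text': 'Paciente con neoplasia.', 'codes': ['c80']},
#  {'text': 'Se realiza biopsia.', 'codes': ['0dbn8zx']}]
```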
Phase 2: Fine-tuning on a multilabel sentence classification task. Using the corpus of labeled sentences, we fine-tuned each transformer model on a multilabel sentence classification problem. To perform the supervised fine-tuning of the whole model architecture on a multilabel sentence-level task, the output representation encoded by the model for the initial beginning of sequence (BOS) subtoken was fed into a final classification layer with C sigmoid units, where C denotes the number of distinct codes present in the fine-tuning corpus. Thus, given a sentence as input to the model, the generated output vector of length C could be interpreted as the probability of each of the C codes occurring within the input sentence.
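A minimal sketch of this architecture in TensorFlow/Keras follows, assuming mBERT as the encoder and illustrative values for C and the sequence length; the paper reports the RAdam optimizer, but plain Adam is used here to keep the sketch dependency-free.

```python
# Sketch of Phase 2: transformer encoder + sigmoid layer with C units,
# fed by the output representation of the BOS ([CLS]) subtoken.
import tensorflow as tf
from transformers import TFAutoModel

C = 2000       # illustrative number of distinct codes in the corpus
MAX_LEN = 128  # maximum subword sequence length used in this work

encoder = TFAutoModel.from_pretrained("bert-base-multilingual-cased")

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32,
                                name="attention_mask")

hidden = encoder(input_ids, attention_mask=attention_mask).last_hidden_state
bos = hidden[:, 0]  # representation of the BOS subtoken
probs = tf.keras.layers.Dense(C, activation="sigmoid", name="codes")(bos)

model = tf.keras.Model([input_ids, attention_mask], probs)
# One sigmoid per code: multilabel classification via binary cross-entropy.
model.compile(optimizer=tf.keras.optimizers.Adam(3e-5),
              loss="binary_crossentropy")
```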

Phase 3: Predicting code probabilities at the document level.
Although each model was fine-tuned on a supervised sentence-level task, the predictive performance of the models was evaluated at the document level. Consequently, using a maximum probability criterion, we postprocessed the probabilities predicted by the models at the sentence level to produce a vector of code probabilities at the document level. Hence, given a set S containing all sentences obtained from a single document d as input, the model outputs an |S| × C probability matrix P. The criterion then consists of selecting the maximum probability value across each column of P, obtaining a final vector p of length C that represents the probability of each code occurring in d.
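In matrix terms, this criterion is simply a column-wise maximum; a small sketch with illustrative numbers:

```python
# Sketch of Phase 3: collapse the |S| x C sentence-level probability matrix
# of a document into a length-C document-level vector (column-wise maximum).
import numpy as np

def document_probabilities(P: np.ndarray) -> np.ndarray:
    """P has shape (num_sentences, num_codes)."""
    return P.max(axis=0)

P = np.array([[0.1, 0.8, 0.3],   # sentence 1
              [0.6, 0.2, 0.4]])  # sentence 2
print(document_probabilities(P))  # [0.6 0.8 0.4]
```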
As a result, at prediction time, given a set of D documents as input data, our workflow for clinical coding using transformers produced a D × C matrix of coding probabilities predicted by the model at the document level (see Fig. 3).

D. EXPERIMENTS
We implemented our TL approach for clinical coding using both the transformers library developed by the HuggingFace team [39] and the TensorFlow implementation of BERT developed by Google. For all transformer-based models examined in this work, we set a maximum input sequence length of 128 subwords. We fixed the same values for most of the hyperparameters during the pretraining of the models (see Supplementary Table S1 for further details). When fine-tuning the models to perform clinical coding, we used the RAdam [40] optimizer with a learning rate of 3 × 10⁻⁵ and a batch size of 16; the number of epochs was empirically determined on the development set of the corresponding clinical coding corpus, with an upper limit of 50 epochs. Additionally, when fine-tuning the models on the CodiEsp-D task, we enriched the CodiEsp corpus with a set of Spanish abstracts annotated with CIE-10-ES diagnosis codes, provided by the organizers of the CodiEsp track [16]. Regarding the hardware resources employed, all experiments were conducted using a single GeForce GTX 1080 Ti GPU.

IV. RESULTS
Table 2 shows the predictive performance of the 3 transformer-based models on the CodiEsp-D, CodiEsp-P and Cantemist-Coding tasks. For each model, we compared the original version pretrained on general-domain corpora (see Section III-B) with the Spanish clinical version obtained by further pretraining the general-domain model on the oncology clinical cases from the Galén corpus (see Section III-C1). For each transformer, we fine-tuned 5 instances with different random initializations. The MAP metric, the official evaluation metric of the tasks [13], [16], was employed to evaluate the performance of the models. With the aim of maximizing the score obtained for this rank-based metric, for each document in the test set, we returned all codes considered by the model, sorted in descending order according to their predicted probability of occurrence. Among all models, BETO-Galén achieved the best performance on the 3 clinical coding tasks, with average MAP values of 0.616, 0.514 and 0.862. The two multilingual transformers adapted to the clinical domain, namely, mBERT-Galén and XLM-R-Galén, achieved almost the same performance across the 3 tasks, with mBERT-Galén obtaining mean MAP scores of 0.609, 0.495 and 0.858 and XLM-R-Galén achieving average MAP values of 0.611, 0.493 and 0.859, respectively. Compared with the general-domain transformers, the clinical versions of the models improved the performance for clinical coding in Spanish; across the 3 tasks, for each transformer, the clinical version surpassed the general-domain version of the model in terms of average MAP scores. When comparing the maximum MAP scores obtained in this study with the previously reported SOTA results, new SOTA performance was achieved in each of the 3 clinical coding tasks. Thus, for the CodiEsp-D task, XLM-R (0.601), mBERT (0.602), XLM-R-Galén (0.615), mBERT-Galén (0.616) and BETO-Galén (0.619) exceeded the prior SOTA performance (0.593) reported by the organizers of the shared task [16]. In the case of the CodiEsp-P task, XLM-R-Galén (0.498), mBERT-Galén (0.508) and BETO-Galén (0.52) surpassed the previous SOTA result (0.493) [16]. Finally, for the Cantemist-Coding task, again, the transformers adapted to the Spanish clinical domain, namely, mBERT-Galén (0.86), XLM-R-Galén (0.861) and BETO-Galén (0.864), outperformed the prior SOTA result (0.847) [13].
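Since MAP rewards a good ranking rather than a hard assignment, producing the submission for each document amounts to sorting the full code vocabulary by predicted probability; a small sketch with illustrative ICD-O-style morphology codes:

```python
# Sketch: rank all codes of a document by descending predicted probability.
import numpy as np

codes = np.array(["8000/3", "8140/3", "8500/3"])  # illustrative codes
p = np.array([0.91, 0.12, 0.35])                  # document-level probabilities

ranking = codes[np.argsort(-p)]
print(list(ranking))  # ['8000/3', '8500/3', '8140/3']
```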

A. ENSEMBLE
TABLE 3. Ensemble models' performances on the CodiEsp-D, CodiEsp-P and Cantemist-Coding test sets, according to the MAP metric. For each task, the best obtained result is bolded, while the second best is underlined.

Additionally, we proposed an ensemble approach to combine the different clinical coding predictions made by the models. Thus, given a set comprising D documents, our proposed workflow for clinical coding using a transformer-based model A outputs a D × C probability matrix M_A (see Fig. 3), with C representing the number of codes considered by the model. As a result of fine-tuning 5 distinct instances of each model, 5 different probability matrices, namely, M_A^1, M_A^2, ..., M_A^5, were obtained for model A. To combine the 5 distinct matrices into a single probability matrix M_A^E, our ensemble approach plainly consisted of summing the 5 probability matrices obtained for model A, i.e., M_A^E = Σ_{i=1}^{5} M_A^i. Moreover, our ensemble strategy could also be applied to combine the coding predictions made by any number of distinct models by directly summing all probability matrices obtained from the models. For instance, given M_A^E and M_B^E as the ensemble probability matrices of transformers A and B, respectively, a single probability matrix M_{A+B}^E could be obtained by summing both previous matrices, i.e., M_{A+B}^E = M_A^E + M_B^E. Table 3 describes the performance of our ensemble approach applied to combine both the coding predictions made by single models and the predictions made by multiple different models. Regarding the ensemble approach applied to single models, the BETO-Galén ensemble obtained the best results on the 3 clinical coding tasks, with MAP values of 0.648, 0.537 and 0.88, respectively. In relation to the ensemble approach applied to multiple models, the ensemble combining the predictions of the 3 transformers adapted to the Spanish clinical domain, namely, mBERT-Galén, BETO-Galén and XLM-R-Galén, achieved the best performance among all models analyzed in this study, with MAP values of 0.662, 0.544 and 0.884 on the CodiEsp-D, CodiEsp-P and Cantemist-Coding tasks, respectively. In fact, according to the MAP scores, the results obtained by the mBERT-Galén + BETO-Galén + XLM-R-Galén ensemble remarkably surpassed the prior SOTA performance by 11.6% on the CodiEsp-D task, 10.3% on the CodiEsp-P task, and 4.4% on the Cantemist-Coding task.
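The ensemble rule described above reduces to elementwise matrix addition; a sketch with random placeholder matrices:

```python
# Sketch: sum the D x C probability matrices of the 5 fine-tuned instances
# of a model, then sum the per-model ensembles to combine transformers.
import numpy as np

rng = np.random.default_rng(0)
D, C = 10, 5  # placeholder numbers of documents and codes

def ensemble(matrices):
    return np.sum(matrices, axis=0)

M_E_A = ensemble([rng.random((D, C)) for _ in range(5)])  # e.g., BETO-Galén
M_E_B = ensemble([rng.random((D, C)) for _ in range(5)])  # e.g., mBERT-Galén
M_E_AB = M_E_A + M_E_B  # multi-model ensemble; rankings use these summed scores
```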

B. ADDITIONAL METRICS
Finally, although MAP was the official evaluation metric of the tasks, the organizers also evaluated the participating systems using additional metrics [13], [16], namely, the micro-averaged precision, recall and F1-score. For completeness, in Supplementary Tables S2-S3, we also report the results obtained by the analyzed transformers on the clinical coding tasks according to these additional evaluation metrics. In terms of the F1-score, for each transformer, the clinical version surpassed the general-domain version of the model. In addition, the mBERT-Galén + BETO-Galén + XLM-R-Galén ensemble improved the prior SOTA performance on the CodiEsp-P task while obtaining results comparable to the SOTA performance on the CodiEsp-D and Cantemist-Coding tasks.

V. DISCUSSION

A. DOMAIN-SPECIFIC MODELS
Automatic clinical coding is a crucial task in the process of extracting valuable information from unstructured patient data stored in modern EHR systems. This work systematically analyzed the performance achieved by 3 transformer architectures when applied to the problem of clinical coding in Spanish. We followed a TL-based strategy to adapt the transformers to the distinctive features of the Spanish medical domain. For this purpose, the models were first pretrained using a private corpus of real-world oncology cases in Spanish. To evaluate the validity of the proposed approach, we compared the performance obtained by the original general-domain version of the transformers with that achieved by the Spanish clinical version of the models. The obtained results showed that, for each analyzed transformer, the domain-specific model outperformed the corresponding general model across the 3 clinical coding tasks explored in this study. Among the 3 transformer architectures, BETO, the Spanish version of BERT, was the model that benefited the most from the domain adaptation procedure, improving its average performance on the CodiEsp-D, CodiEsp-P and Cantemist-Coding tasks by approximately 5.5%, 12.2% and 5.4%, respectively (see Table 2), when pretrained on the Galén corpus.
Additionally, we further examined the MLM loss of the 3 transformers on a validation corpus of 4.4K deidentified genetic counseling documents retrieved from the Galén Information System [33], [34]. The MLM loss should be considered an intra-model performance evaluation metric, as it greatly depends on factors specific to each model, such as the vocabulary employed by the tokenizer. The MLM loss values of the general-domain versions of mBERT, BETO and XLM-R were 3.331, 2.114 and 3.21, respectively. When further pretrained on the corpus of oncology clinical cases, mBERT, BETO and XLM-R reduced their MLM loss scores to 1.666, 1.639 and 1.131, respectively. These loss values show that, for each analyzed transformer, its clinical language modeling capabilities improved when further pretrained on the corpus of real-world oncology cases.
Thus, the results obtained in this work for automatic clinical coding in Spanish support the hypothesis that, when adapted to the specificities of the clinical domain, transformer-based models outperform their original nonspecific-domain versions on downstream medical tasks. Although this hypothesis was already explored in previous works [41]–[43], its validity had only been demonstrated for clinical NLP tasks in the English language, for which a considerable amount of standardized and curated medical linguistic resources is publicly available. In contrast, in this study, we have experimentally shown the effectiveness of the clinical domain adaptation of transformers when applied to small-data NLP tasks in a non-English language with limited textual resources. In this way, a Spanish version and two multilingual versions of transformer-based models have benefited from pretraining their architectures on a real-world corpus of 30.9K oncology texts to further tackle clinical coding downstream tasks in the Spanish language.

B. FINE-TUNING APPROACH
Once the clinical pretraining of the transformers was completed, the resulting models were fine-tuned following a multilabel sequence classification approach to perform clinical coding. In this study, we also analyzed the impact of the developed fine-tuning approach on the achieved results by comparing them with the results obtained when following other existing clinical coding strategies that also deal with the input sequence limitation presented by transformers. In a preliminary work [32], we designed a fragment-based classification approach consisting of fine-tuning the models on a corpus of text streams annotated with clinical codes. In a more recent preliminary work [14], we modified the previous strategy by generating a collection of annotated text fragments comprising sequences of multiple contiguous sentences. Finally, in the current study, we created a corpus of fragments exclusively spanning a single sentence (see Section III-C2).

TABLE 4. Comparison of the clinical coding performance of BETO-Galén following different strategies to perform the fine-tuning of the model: the text-stream approach [32], the multiple-sentences strategy [14] and the single-sentence classification approach proposed in this work. For each clinical coding task, we describe the number of fine-tuning instances obtained from the training and development subsets when applying each fine-tuning approach (Size column), as well as the average subword sequence length of the obtained fine-tuning instances (Len column). Additionally, for each task, we report the performance of the model in terms of the distribution of the MAP values obtained over 5 distinct executions (Mean ± Std and Max columns) and the MAP score obtained by combining the 5 model instances following the ensemble approach (Ens column).
Table 4 shows the results obtained by the best performing model, i.e., BETO-Galén (see Section IV), across the 3 clinical coding tasks analyzed in this work when fine-tuned following the previously described approaches. When comparing the text-stream fine-tuning approach [32] with the multiple-sentences strategy [14], the latter allowed the BETO-Galén model to obtain better results in each of the clinical coding tasks. Although both strategies produced fine-tuning corpora of similar sizes, following the multiple-sentences approach, the model was fine-tuned on a corpus of annotated text fragments with full semantic meaning, which resulted in the superior predictive performance of the transformer on the downstream tasks. Among the three strategies, the single-sentence approach proposed in this work yielded a significantly superior performance of the BETO-Galén model across the 3 tasks. The key aspect of the single-sentence strategy is that, in contrast to the multiple-sentences approach, each generated text fragment comprises only one sentence, hence reducing the potential length of the input sequences to the model. As a consequence, the size of the generated fine-tuning corpus was significantly larger than the number of text fragments obtained using the other two strategies. For this reason, the developed single-sentence approach not only deals with the input sequence constraint of transformer architectures but also works as an effective data augmentation method, which has been shown to boost the performance of transformer-based models when applied to small-data NLP problems such as the clinical coding tasks in Spanish addressed in this work.
This study has several limitations. We mainly focused on clinical coding, which is a document-level NLP task. However, the domain-specific transformers examined in this work could also be applied to word-level NLP problems, such as NER tasks, for which transformer-based models have been shown to achieve SOTA performance in the clinical NLP domain [41]–[43], mainly for the English language. Additionally, future studies should pay special attention to model interpretability: although a few efforts have already been made in this particular area, most DL models are still regarded as ''black boxes''. In the medical domain, interpretability is mandatory if computer-based methods are to serve as support in clinical decision-making.

VI. CONCLUSION
In this paper, we systematically evaluated the performance of 3 transformer-based models for automatic clinical coding in Spanish. By means of a TL-based strategy, the models were first pretrained on a private corpus of real-world oncology clinical cases with the aim of adapting the transformers to the specificities of the Spanish medical domain. The resulting models were further fine-tuned following a multilabel sentence classification approach that not only addressed the fixed-length input sequence constraint presented by transformers but also effectively served as a data augmentation procedure. The combination of the developed TL-based strategy with an ensemble approach that leveraged the predictive capabilities of the distinct models yielded the best obtained results, which remarkably surpassed the prior SOTA performance by 11.6% on the CodiEsp-D task, 10.3% on the CodiEsp-P task, and 4.4% on the Cantemist-Coding task. Furthermore, we publicly released the mBERT, BETO and XLM-R transformers adapted to the Spanish clinical domain.

FIGURE 1. Example of the CIE-10-ES hierarchical structure for the ''other and unspecified malignant neoplasm of skin'' category (C44). (The descriptions are preserved in Spanish, as in the original CIE-10-ES edition of the Spanish Ministerio de Sanidad.)

TABLE 1. Distribution of the clinical code annotations used in each of the 3 clinical coding tasks. Only the CIE-10-ES diagnosis annotations were considered in the CodiEsp-D task, while only the CIE-10-ES procedure annotations were used in the CodiEsp-P task.

FIGURE 2. Illustration of the two existing annotation formats using the S1130-14732005000300004-1 clinical document from the CodiEsp-D development corpus. (A) Format of the code annotations available for the clinical coding tasks. (B) Format of the code annotations available for the NER-N tasks.

FIGURE 3. Workflow of the TL approach for automatic clinical coding using transformers.

FIGURE 4. Illustration of the sentences annotated with CIE-10-ES diagnosis codes obtained from the S1130-14732005000300004-1 document belonging to the CodiEsp-D development set. The WordPiece tokenizer of the BETO model was used to generate the subtoken sequence of each sentence split from the text.

TABLE 2. Models' performances on the CodiEsp-D, CodiEsp-P and Cantemist-Coding test sets. The distribution of the MAP values obtained by the 5 distinct fine-tuned instances of each model is described by reporting the mean, standard deviation and maximum values. For the maximum-values column of each task, the best obtained result is bolded, while the second best is underlined.