Testing Contextualized Word Embeddings to Improve NER in Spanish Clinical Case Narratives

In the Big Data era, there is an increasing need to fully exploit and analyze the huge quantity of information available about health. Natural Language Processing (NLP) technologies can contribute by extracting relevant information from unstructured data contained in Electronic Health Records (EHR) such as clinical notes, patients’ discharge summaries and radiology reports. The extracted information can help in health-related decision making processes. The Named Entity Recognition (NER) task, which detects important concepts in texts (e.g., diseases, symptoms, drugs, etc.), is crucial in the information extraction process yet has received little attention in languages other than English. In this work, we develop a deep learning-based NLP pipeline for biomedical entity extraction in Spanish clinical narratives. We explore the use of contextualized word embeddings, which incorporate context variation into word representations, to enhance named entity recognition in Spanish language clinical text, particularly of pharmacological substances, compounds, and proteins. Various combinations of word and sense embeddings were tested on the evaluation corpus of the PharmacoNER 2019 task, the Spanish Clinical Case Corpus (SPACCC). This data set consists of clinical case sections extracted from open access Spanish-language medical publications. Our study shows that our deep-learning-based system with domain-specific contextualized embeddings coupled with stacking of complementary embeddings yields superior performance over a system with integrated standard and general-domain word embeddings. With this system, we achieve performance competitive with the state-of-the-art.

based on Deep Learning, over traditional Machine Learning (ML) algorithms. However, beyond the development of new NN-based methods, researchers have started to explore the impact of improved strategies for the representation of text information provided as input to both NN-based and other ML methods.
Starting from Bag of Words (BoW) representations, word pre-processing has evolved to include more sophisticated word representations such as word2vec word embeddings [6], GloVe [7] and FastText [8] embeddings, with the latter able to capture subword information from texts. Applied in a range of different NLP tasks, methods using word embeddings have led to significant breakthroughs in model performance for biomedical NER tasks where limited training data is available [9].
Further advances in text preprocessing have been proposed based on language models that give each word a different embedding vector depending on its usage context. The embedding function is trained either from a language modeling perspective [10] or based on recovering masked parts of tokens [11]. The downstream tasks which incorporate these embeddings are considered to be learned in a semi-supervised manner because they benefit from large amounts of unlabeled data [12], [13].
Language representation models can be further applied with or without fine-tuning to problems arising in different domains. The approach of learning on one dataset and applying the model to another dataset is called Transfer Learning.
In our experiments, we explore the use of both Flair and BERT contextualized embeddings as they have been shown to outperform other types of embeddings on a variety of sequence labeling tasks [11], [17].
In addition to pre-trained domain-specific Spanish FastText embeddings [18], we generate domain-specific Spanish contextualized embeddings by pre-training language representation models using the corpus retrieved from the Scientific Electronic Library Online (SciELO) website. The clinical case narrative data from the publications there was used to construct the PharmacoNER dataset. To the best of our knowledge, these are the first contextualized embeddings for Spanish clinical texts made available to a wide audience. The large corpus of more than one billion sentences from SciELO we make available is itself a valuable resource. This paper extends and deepens a preliminary version of our experiments, which are described in [19]. In particular, we add experiments using the Flair framework which outperform our previous results obtained with the BERT model. We also experiment with word embedding stacking approaches, further improving the results we obtained on the PharmacoNER corpus.
The contributions described in this paper are as follows: (1) we retrieve task-specific corpora for training; (2) we construct task-specific contextualized word embeddings from scratch based on Flair and BERT architectures; (3) we compare model performances based on constructed word embeddings, explore how these may be combined with other types of embeddings, and compare these with the standard embeddings, producing new baselines; and (4) we conduct an extensive error analysis checking the source of errors for different models.
The pretrained weights for Flair and BERT models, as well as the SciELO corpora used for their training, are made publicly available in a Google Drive repository.

A. BIOMEDICAL ENTITY EXTRACTION APPROACHES
Simple approaches to biomedical NER which sometimes give surprisingly good results have made use of rules or dictionaries.
For example, Eftimov et al. [20] built a set of regular expressions to extract evidence-based dietary recommendations from scientific publications and websites. They first detected target mentions in textual data and then extracted them using the rule-based technique.
Various strategies for dictionary lookup have also been shown to be effective [21]. Such approaches leverage biomedical terminology resources or ontologies, and are particularly relevant for biomedical NER where named entities often correspond to fine-grained domain-specific concepts.
However, with the development of automatic NLP methods, such approaches are rarely applied on their own to solve NER tasks, but rather are used to generate features to feed ML and deep learning (DL) models. For example, in a recent Meddocan challenge on Spanish medical document anonymization [22], rule-based techniques were actively utilized in ML and DL methods to identify patients' email addresses, locations, phone numbers, etc. In addition, participants in the challenge used domain- and language-specific gazetteers and Brown clusters derived through unsupervised ML. For example, Perez et al. [23] concluded that Brown clusters and gazetteers played a significant role in ML system performance. Further, Lopez et al. [24] tested both ML and rule-based approaches and concluded that a hybrid of the two gives the best result.
Lee et al. [25] solve the problem of biomedical NER in two steps, first discovering entity boundaries using Support Vector Machines (SVM) and then applying an ontology-based hierarchical classification method to classify the identified entities. Their system achieved a promising F-score of 66.7 on the GENIA corpus [26].
Early work on machine learning-based NER includes such techniques as reranking relying on kernels [27] as well as pure feature processing [28]. Kernel-based methods for entity extraction such as SVM, utilized in numerous papers [29]-[31], became popular for extracting entities from texts, including biomedical texts [32]. In the latter paper, the authors examined different kernel functions for the problem of biomedical NER and concluded that a tree-based kernel is more capable of entity extraction.
Current state-of-the-art methods for NER are based on NN architectures, in particular, DL convolutional NNs (CNN) and recurrent NNs (RNN). Transfer learning approaches, in particular the use of pre-trained contextualized word embeddings, have augmented performance of these methods, giving strong results in a number of downstream tasks.
For example, in the Meddocan shared task the best result was achieved by a system which utilized pretrained contextualized Flair embeddings fed into a simple RNN model. However, when dealing with more complex biomedical NER problems including long, discontinuous, overlapping entities, hybrid approaches show the best results. Li et al. [33] integrated KB embeddings in their tree-structured LSTM framework, achieving a gain of approximately 3 points in F-score.
Related to this, contextualized word embeddings together with part-of-speech (PoS) tags were examined for Bulgarian NER [34], showing sizeable improvements over the state-of-the-art. In another work, a combination of different types of contextualized embeddings was explored over English biomedical literature corpora [35]. The best results were obtained when combining ELMo and Flair word embeddings. Another relevant work includes the extraction of adverse drug events on the 2018 N2C2 shared task corpus [36]. The authors experimented with the off-the-shelf Flair NER framework and kernel-based methods and concluded that a neural Flair-based approach outperforms standard SVM-based methods. In the work of Basaldella et al. [37], the authors pretrained ELMo and Flair contextualized word embeddings on health forums within Reddit and applied them to health social media data for various NER problems. They concluded that domain-based contextualized word embeddings heavily influence the performance on downstream tasks, outperforming embeddings trained either on general-purpose data or on scientific papers when applied to user-generated content. Our experiments are very similar to this work.
An extensive overview of recent advances in the NLP field can be found in the work of Minaee et al. [38]. While focusing on document classification, it describes several methods, such as transformers, which are fully applicable to NER. Young et al. [39] cover in detail a key element of the current paper, distributed and contextualized word representation, among other recent trends introduced in NLP.

B. SPANISH CLINICAL TEXT PROCESSING
Spanish is an inflectional language with a richer morphology compared to the English language; morphemes denote several syntactic, semantic and grammatical features of words (such as gender, number, etc). From a syntactic point of view, Spanish texts have more subordinate clauses and long sentences with a high word order flexibility; for instance, the subject can be located in any position in a sentence instead of only before the verb.
Clinical notes have many occurrences of abbreviations and usually English abbreviations coexist with Spanish ones. For instance, ''PSA'' corresponds to ''prostate-specific antigen'' and it is preferred to ''APE'' (''antígeno prostático específico''). However, polysemic abbreviations are very common in both languages.
From a syntactic point of view, sentences are very similar in both languages (short sentences or phrases, with use of negation particles and non-standard abbreviations, misspellings, speculation and ungrammatical sentences, among other phenomena).
In summary, there are more lexical variants of medical terms in Spanish with respect to English due to replication or partial adaptation of terms. For these reasons, analyzing these texts is a more resource consuming task, and normalization tools are required.

C. PharmacoNER 2019 SHARED TASK
PharmacoNER is ''the first task on chemical and drug mention recognition from Spanish medical texts, namely from a corpus of Spanish clinical case studies'' [1]. According to the organizers, ''the main aim was to promote the development of named entity recognition tools of practical relevance, that is, chemical and drug mentions in non-English content, determining the current-state-of-the art, identifying challenges and comparing the strategies and results to those published for English data''.
The challenge consisted of two subtracks: (1) NER offset and entity classification and (2) Entity indexing. We focus on the NER task. In total, 22 teams participated in the first subtrack. Xiong et al. [40] placed first with an overall F-score of 91.05. They used the multi-lingual large version of the pre-trained BERT model with further fine-tuning to the PharmacoNER NER problem. The key to the success of their implementation of the BERT model in comparison to other participants' BERT implementations was that they incorporated more semantic and syntactic features, such as word shape and PoS tags, into their model embedding layer. Moreover, they applied a Spanish biomedical abbreviation detection tool; however, they did not detail how the extracted abbreviations were further used.
The second-best results of Stoeckel et al. [41] were updated after the formal challenge with an F-score of 90.52. They used the Flair model and made use of an additional corpus derived from SciELO, although of a smaller size than ours. They used this corpus to train word2vec and FastText word embeddings, and for Flair language model (LM)-based embeddings they used pre-trained Spanish general domain word embeddings. Sun et al. [42] achieved the third-best result with an F-score of 89.24. Like Xiong et al. [40], they also used the pre-trained version of BERT with subsequent fine-tuning but without incorporating any additional features.
Overall, many participants experimented with document encoding techniques. For example, Rivera Zavala et al. [43] gathered similar size Spanish biomedical corpora to train their own FastText embeddings. Moreover, they used sense2vec [44] pre-trained embeddings. Both of these embeddings have proven useful in extracting biomedical concepts.
Later, other research papers appeared addressing NER on the PharmacoNER corpus. Multi-tasked and stacked model approaches were offered by [45]. Their best multi-tasking approach achieved an F-score of 91.4. In another paper, a set of 104 sophisticated context patterns was constructed [46]. With this knowledge-based approach, the authors achieved an impressive F-score of 91. We do not compare our results with the results of these two papers, as the approach of [45] required more annotated data, and the approach of [46] required manual rule construction relying on Spanish language syntax.

A. FLAIR
Flair embeddings were developed by the Zalando research group [17]. They are contextualized string embeddings in the sense that the contextualized embedding vectors are trained without any notion of words but purely treat texts as sequences of characters. This is the main difference between this type of embedding and others such as word2vec [47], GloVe [7], and ELMo [17].
Flair is trained using an LM objective function aimed at predicting the next character of a sequence, thus keeping information on the character ordering in a text sequence. By learning the character level representations in both directions it was possible to get the context for each character in both right and left directions. To generate a word embedding from characters the first and last character states of each word are extracted and concatenated.
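The extraction step described above can be illustrated with a minimal sketch. It assumes that one hidden state per character is already available from a forward and a backward character LM, and that the forward state is read at the last character of a word and the backward state at its first character; the function name, the whitespace tokenization and the plain-list vectors are illustrative assumptions, not the Flair implementation.

```python
from typing import List, Tuple

Vector = List[float]


def word_embeddings_from_char_states(
    text: str,
    forward_states: List[Vector],
    backward_states: List[Vector],
) -> List[Tuple[str, Vector]]:
    """Build one vector per whitespace-separated word.

    forward_states[i]  : state of the left-to-right LM after reading character i
    backward_states[i] : state of the right-to-left LM after reading character i
    """
    if not (len(text) == len(forward_states) == len(backward_states)):
        raise ValueError("one state per character is required")

    result: List[Tuple[str, Vector]] = []
    start = None
    # A sentinel space at the end flushes the final word.
    for i, ch in enumerate(text + " "):
        if ch != " " and start is None:
            start = i  # first character of a new word
        elif ch == " " and start is not None:
            end = i - 1  # last character of the current word
            # Concatenate: forward state at the last character,
            # backward state at the first character.
            vector = forward_states[end] + backward_states[start]
            result.append((text[start:end + 1], vector))
            start = None
    return result
```

Because each word vector is read off the character states, a word never seen during training still receives a representation, which is consistent with the robustness to OOV words noted for these embeddings.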
From a computational and memory point of view, character-level models are more efficient to store and train than models based on word-level embeddings. Moreover, they have proven to be more effective for rare and out-of-vocabulary (OOV) words and for morphologically rich languages [17].
In our experiments, we use the enhanced version of Flair embeddings called Pooled Contextualized String Embeddings [48]. It is different from the previously developed Flair LM in that it better handles representation for words in an underspecified context. By dynamically aggregating the contextualized embedding of each unique word, this information is later used to expand the embedding for the same word encountered in a poorly, ambiguously specified context. This situation is often encountered in the Spanish biomedical NER tasks, when two words with similar suffixes express different types of substances, as for example, creatinina and hemoglobina where the latter is a protein but the former is not.
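The aggregation idea can be sketched as follows. This is a simplified illustration under stated assumptions: the class name is hypothetical, mean pooling is chosen for the example, and the pooled vector is concatenated to the current contextual vector; it is not the code of the Flair library.

```python
from collections import defaultdict
from typing import Dict, List

Vector = List[float]


class PooledEmbedder:
    """Keeps a memory of every contextual embedding seen for each word."""

    def __init__(self) -> None:
        self.memory: Dict[str, List[Vector]] = defaultdict(list)

    def embed(self, word: str, contextual: Vector) -> Vector:
        # Store the new contextual embedding for this word.
        self.memory[word].append(contextual)
        seen = self.memory[word]
        # Mean-pool all embeddings seen so far, dimension by dimension
        # (mean pooling is an assumption made for this sketch).
        pooled = [sum(dim) / len(seen) for dim in zip(*seen)]
        # The final representation combines the local context with the
        # aggregated "global" view of the word.
        return contextual + pooled
```

With such a memory, a mention that appears in an uninformative sentence still carries information gathered from earlier, clearer occurrences of the same word.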

B. BERT
Bidirectional Encoder Representations from Transformers (BERT) is a deep learning language representation model developed by the Google research team [11]. In contrast to ELMo and Flair, it can be used not only for generating contextualized word embeddings, but also for the downstream task itself through a process called fine-tuning.
BERT is trained using a masked word piece objective and a next sentence prediction objective. Its architecture consists of stacked multi-layered transformers, each having a self-attention mechanism with multiple attention heads. The self-attention in the encoder architecture of BERT allows better capture of long-distance relationships among concepts by avoiding locality bias.
BERT can be further pre-trained for a specific domain or fine-tuned for a specific task [49]. In particular, fine-tuning for token level classification tasks is supported by putting a linear layer, which takes as an input the last hidden state of the sequence, on top of the BERT model.
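The token-level classification head described above amounts to applying the same linear map to the last hidden state of every token and taking the highest-scoring label. The sketch below uses plain Python lists and hypothetical toy weights to show the computation only; it assumes a plain argmax decision and does not reproduce the BERT code or the loss used in our experiments.

```python
from typing import List

Vector = List[float]


def classify_tokens(
    hidden_states: List[Vector],
    weights: List[Vector],
    bias: Vector,
    labels: List[str],
) -> List[str]:
    """Apply one linear layer per token and pick the best label.

    hidden_states : one last-layer vector per token
    weights       : one row per label, each of the hidden-state size
    bias          : one value per label
    """
    predictions = []
    for h in hidden_states:
        # logits[k] = weights[k] . h + bias[k]
        logits = [
            sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(weights, bias)
        ]
        predictions.append(labels[logits.index(max(logits))])
    return predictions
```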

C. ADDITIONAL EMBEDDINGS
It has been demonstrated that the concatenation of contextualized embeddings with standard embeddings usually leads to an improvement in results [10], [17]. Following this, for our experiments we used the concatenation of Flair embeddings with Spanish general (not domain-specific) FastText embeddings [8], domain-specific Spanish biomedical FastText embeddings [18], byte-pair encoded (BPE) embeddings [50] and character embeddings [51]. The results of models with and without these additional embeddings are presented.
General FastText embeddings for Spanish were trained using the full dump of Spanish-language Wikipedia, while Spanish domain-specific biomedical embeddings utilizing the architecture of FastText were trained over the SciELO (SciELO.org) corpus with 100 million tokens and the health section of Wikipedia with 82 million tokens.
Character embeddings are generated using an RNN model and are then concatenated with the other types of word embeddings in a model. While the BPE model provides subword embeddings for 275 languages, we used only one language from this model. It produces relatively light-weight embeddings as they consist of sub-word tokens of words. This method has been shown to deal well with unknown words and to produce results on a par with the standard word embeddings.
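Stacking in this sense is a per-token concatenation of the vectors produced by each embedding type. A minimal sketch, assuming each embedding is exposed as a function from a token to a vector (the two toy embedders below are hypothetical stand-ins, not real FastText, BPE or character models):

```python
from typing import Callable, List, Sequence

Vector = List[float]
Embedder = Callable[[str], Vector]


def stack_embeddings(token: str, embedders: Sequence[Embedder]) -> Vector:
    """Concatenate the vectors of all embedders for one token."""
    stacked: Vector = []
    for embed in embedders:
        stacked.extend(embed(token))
    return stacked


def toy_word_embedding(token: str) -> Vector:
    # Hypothetical stand-in for a pre-trained word embedding lookup.
    return [float(len(token)), 1.0]


def toy_char_embedding(token: str) -> Vector:
    # Hypothetical stand-in for a character-level embedding.
    return [float(ord(token[0]) % 7)]


vector = stack_embeddings("creatinina", [toy_word_embedding, toy_char_embedding])
```

The dimensionality of the stacked vector is the sum of the individual dimensionalities, so every added embedding type enlarges the input of the sequence labeling network.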

D. ENTITY EXTRACTION
In the PharmacoNER task, there are 4 relevant types of entity mentions, although for the official evaluation, only the first 3 types are used:
• Normalizables (Normalizable): mentions of biomedical concepts which can be normalized to the SNOMED-CT and ChEBI vocabularies;
• No_Normalizables (Non-normalizable): biomedical concepts which cannot be normalized to the given vocabularies;
• Proteínas (Proteins): mentions of genes and proteins;
• Unclear (Unclear): general substance mentions.
The problem of biomedical NER can be framed as a sequence labeling task where the goal is to extract the correct spans of entities. We therefore used a BIO schema.

In this schema, each token in a document is classified as [B]eginning, [I]nside, or [O]utside of an entity mention.
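As an illustration of the schema, the sketch below converts entity spans given as token offsets into BIO tags. The function name, the span format (start, exclusive end, label) and the example sentence are assumptions made for this example and are not taken from the corpus.

```python
from typing import List, Tuple

# (first token index, index after the last token, entity label)
TokenSpan = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[TokenSpan]) -> List[str]:
    """Return one BIO tag per token for non-overlapping entity spans."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the mention
    return tags


# Hypothetical example sentence with one two-token mention.
tokens = ["Se", "administró", "ácido", "acetilsalicílico", "."]
tags = spans_to_bio(len(tokens), [(2, 4, "NORMALIZABLES")])
```

In this example the two tokens of the mention receive the B- and I- tags of the entity type, and all other tokens are tagged O.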
Other than for the BERT experiments, all experiments were conducted using the Flair framework (https://github.com/zalandoresearch/flair), which is built on top of PyTorch and provides a convenient means of experimenting with different combinations of word embeddings. It provides an off-the-shelf neural-based system supporting entity extraction. We train a Long Short Term Memory (LSTM) network with a hidden state of 256 dimensions, a learning rate of 0.1, and a mini-batch size of 8, optimized with Adam. We train for 150 epochs and, to prevent overfitting, keep the model that performs best during training on the validation set provided by the organizers of the competition.
We were unable to conveniently experiment with BERT embeddings using the Flair framework and instead preferred the Google Cloud TensorFlow TPU setup for both training contextualized word embeddings and the downstream task fine-tuning and predictions, as it works much faster. However, at the time of writing, TPUs did not support inference on downstream tasks, so it was necessary to switch over to CPU instances for this step.
We used a Conditional Random Fields loss [52] as it has been shown to increase the accuracy for NER tasks. The training and evaluation batch sizes were set to 32 and 8, respectively, and the learning rate was set to 5e-5. The maximum sequence length was set to 160.
Despite the common advice to fine-tune the BERT model for just 3-10 epochs, we fine-tuned it for 30 epochs as we noticed it improved the predictions.

E. PharmacoNER CORPUS
The PharmacoNER corpus was used for training and testing our models. It consists of 1000 annotated SPACCC articles derived from open access Spanish medical publications in SciELO, an electronic library where complete full-text articles from scientific journals of Latin America, South Africa, and Spain are systematically collected and stored. Table 1 shows summary statistics of the PharmacoNER corpus. Results are scored with the scoring tool distributed by the organizers of the challenge. For concepts, true positives are strict (the system concept span must match the beginning and end of a gold concept span exactly). We report micro-averaged results of the strict evaluation since that was the metric used to score the shared task.
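For reference, exact-span micro-averaged scoring can be sketched as below. This is not the organizers' scoring tool; the entity representation (document id, start offset, end offset, label) and the function name are assumptions made for the illustration.

```python
from typing import Iterable, Tuple

# (document id, start offset, end offset, entity label)
Entity = Tuple[str, int, int, str]


def strict_micro_prf(
    gold: Iterable[Entity], predicted: Iterable[Entity]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-score with exact matching.

    A prediction counts as a true positive only if its document, both
    offsets and its label are identical to those of a gold entity.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

Micro-averaging pools the counts over all entity types before computing the ratios, so frequent types weigh more heavily in the final score than rare ones.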
For training the model, we combined the training and development corpora (yielding 11970 sentences for the merged corpora) and, after random shuffling, selected 10% of the merged data for validation purposes.
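The merge-and-split procedure can be sketched as follows. The function name, the fixed seed and the rounding of the validation size are assumptions made for reproducibility of the example; the exact shuffling used in our experiments is not specified here.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def merge_and_split(
    train: Sequence[T],
    dev: Sequence[T],
    validation_fraction: float = 0.1,
    seed: int = 13,  # hypothetical seed, chosen only for this example
) -> Tuple[List[T], List[T]]:
    """Merge two corpora, shuffle, and hold out a validation share."""
    merged = list(train) + list(dev)
    rng = random.Random(seed)  # local generator, global state untouched
    rng.shuffle(merged)
    n_validation = round(len(merged) * validation_fraction)
    return merged[n_validation:], merged[:n_validation]
```

With the 11970 merged sentences and a fraction of 0.1, this procedure would hold out 1197 sentences for validation.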

F. LANGUAGE MODEL TRAINING DATASET
We selected a subset of SciELO text based on some heuristics to be in line with the corpus used for training and testing the model. In particular, we chose articles based on the criteria that the specified area of the document is Health Sciences and then selected text in particular sections of the articles.
We also used the same corpora for training the BERT language representation model with the vocabulary size set to 128000.

G. LANGUAGE MODEL TRAINING
The Flair_Sc LM was trained until the perplexity reached 1.92. The settings used to train word embeddings are: a hidden size of 1150, 3 layers, a maximum sequence length of 240, a mini-batch size of 100, and 1000 epochs.
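For readers less familiar with the stopping criterion, perplexity is the exponential of the average negative log-likelihood the model assigns to the correct next symbol. The sketch below assumes a character-level LM and natural logarithms; the function name and the input format are illustrative.

```python
import math
from typing import Sequence


def perplexity(correct_char_probs: Sequence[float]) -> float:
    """Perplexity from the probabilities assigned to each correct character."""
    if not correct_char_probs:
        raise ValueError("at least one probability is required")
    mean_nll = -sum(math.log(p) for p in correct_char_probs) / len(correct_char_probs)
    return math.exp(mean_nll)
```

Under this definition, a model that always assigns probability 0.5 to the correct character has a perplexity of 2, so a value of 1.92 corresponds to slightly less uncertainty than a choice between two equally likely characters.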
The training of Flair_Sc LM was done using 1 GPU instance.
The BERT language representation was trained using Tensor Processing Units (TPU) instances in Google Colab with the number of training steps 1B. TPU is designed to efficiently scale operations among different machines, thus making calculations on tensors faster than on a GPU. Google Cloud persistent storage is required for storing and uploading weights for training. Moreover, Google Colab shuts down its server every 8 hours, so training has to be resumed manually. Overall, it took more than 4 days to train the BERT language representation, substantially longer than it takes to train the Flair_Sc LM.

III. RESULTS
The comparative results of experiments are presented in Table 2, where we show our best Flair-based and BERT model results:
• Flair_Sc_ext2: the extended model is trained using the custom SciELO Flair embeddings Flair_Sc, SciELO FastText embeddings, BPE embeddings and character embeddings;
• BERT_Sc: BERT-based word embeddings are trained on the SciELO corpus. Subsequently, the BERT model is fine-tuned for the downstream task.
To compare our results with others, we selected the top results in the challenge leader board and we omit results for which no description of the system was provided. For the best model, precision for all types of entities is higher than recall, especially for Normalizables entities. This means that while the entities the model predicts are usually correct, it misses a larger share of the gold standard entities.
Indeed, compared with the best systems' results, our system achieves higher precision but relatively weaker recall. Overall, our results are 0.21 points behind the best system of Xiong et al. [40] for this task.
No_Normalizables entities comprising the minority class are not captured by our models. Techniques for tackling the class imbalance should be considered in future experiments with sequence labeling architectures.

IV. DISCUSSION
A. NUMBER OF TRAINING EPOCHS
Fig. 1 shows the evolution of the loss and F-score over the number of epochs. It can be seen that the loss becomes steady after around 27 epochs and the test F-score stabilizes at around the same point. Overall, the test set loss curve resembles the validation set loss curve, which means that the validation set is a good proxy for measuring the model performance.

B. ABLATION ANALYSIS
For our ablation analysis, we explored the following additional combinations of word embeddings:   The results of different variations of stacking word embeddings are shown in Table 3.
In general, LM-based embeddings lead to better results than the standard ones. It can also be seen that the model enriched with different types of word embeddings gives better results in terms of precision, recall and F-score. Domain-specific word embeddings improve the results even though they are much smaller in size than general domain ones. Augmenting word embeddings with additional subword level embeddings such as FastText, BPE and character embeddings further improves the results.
We also experimented with searching concepts in SNOMED-CT using the Meaning Cloud tool (https://www.meaningcloud.com/); however, it did not work well, as many concepts for the shared task were annotated based on their synonyms.

C. ERROR ANALYSIS
For error analysis, we split gold standard entities into 2 groups: short entities with a length less than or equal to 2, and long entities with a length greater than or equal to 3. For the best model Flair_Sc_ext2, the origin and distribution of errors are presented in Fig. 2.
It can be seen that the majority of errors are for the short predicted entities for which there is not even partial overlap with gold standard entities (No intersections false positives (FP)). Indeed, many biomedical entities are acronyms and abbreviations which could be easily misclassified based on casing and length of entities. Interestingly, the second primary source of errors for short predicted entities is that the model predicts two entities where the gold standard has a single entity (Longer FP). A smaller number of errors are related to short gold standard entities which the model fails to detect (false negatives, FN). For long entities, the main source of error is that predicted entities are shorter than required (Shorter FP), contributing nearly 75% of the total error.
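A possible way to assign predictions to these error groups is sketched below. It is an approximation under explicit assumptions: entity length is measured in tokens, spans are (start, exclusive end) token offsets, and "Longer FP" and "Shorter FP" are decided by comparing the predicted span length with that of the gold span it overlaps most; the "Shifted FP" label for equal-length partial overlaps is introduced only for completeness. The actual analysis script may differ.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (first token index, index after the last token)


def overlaps(a: Span, b: Span) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def length_group(span: Span) -> str:
    """Short entities have at most 2 tokens, long ones at least 3."""
    return "short" if span[1] - span[0] <= 2 else "long"


def categorize_prediction(pred: Span, gold_spans: List[Span]) -> str:
    if pred in gold_spans:
        return "correct"
    overlapping = [g for g in gold_spans if overlaps(pred, g)]
    if not overlapping:
        return "No intersections FP"
    # Compare against the gold span sharing the most tokens with the prediction.
    gold = max(overlapping, key=lambda g: min(pred[1], g[1]) - max(pred[0], g[0]))
    pred_len, gold_len = pred[1] - pred[0], gold[1] - gold[0]
    if pred_len > gold_len:
        return "Longer FP"
    if pred_len < gold_len:
        return "Shorter FP"
    return "Shifted FP"  # same length, different boundaries (label assumed here)


def false_negatives(gold_spans: List[Span], pred_spans: List[Span]) -> List[Span]:
    """Gold entities that no prediction touches at all."""
    return [g for g in gold_spans if not any(overlaps(g, p) for p in pred_spans)]
```

Counting the categories separately for the short and long groups yields a breakdown of the same shape as the one discussed above.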
In Table 4, we present a comparison of errors among 3 models: the best model Flair_Sc_ext2, the model which uses only Flair embeddings trained over the target corpus Flair_Sc, and the model based on a set of standard embeddings Standard_Sc.
Interestingly, the main discrepancy in the number of errors between the Flair_Sc model and the best model is the larger number of undetected short entities (FN). All other differences in error counts between the two models lie in the range of 1-7 in either direction.
In relation to the best model, the main source of errors for Standard_Sc is related as well to the falsely predicted short entities without intersections with gold standard ones (No intersections FP) with almost 15% more predicted FP. It indicates that the best model utilizing the contextual embeddings learns the meaning of acronyms, abbreviations and overall short uppercased words more effectively, assigning them biomedical labels with more caution.
This comparison also shows that lower performing models are much worse at detecting the boundaries of short biomedical concepts, often predicting longer concepts. It is interesting to observe that for the long predicted concepts, the absolute numbers and distribution of errors for the best Flair_Sc_ext2 and Standard_Sc models are mostly the same. However, the Flair_Sc model performs slightly worse in terms of predicting shorter concepts than the gold standard ones (i.e. predicting three consecutive terms instead of four, etc).
In Table 5, we present two examples of sentences with underlined gold standard and predicted entities. Sentences were chosen from the representative groups of the most common errors for different models. Here, FP is the abbreviation for FP without intersections. It can be observed that the Standard_Sc model in both examples predicted long entities which were either FP or longer versions of gold standard entities. Flair-based models also confuse short upper-cased entities, but in fewer cases.
Interestingly, in the second example, although both Flair_Sc and Standard_Sc models have detected 'USA' entity as a PROTEIN, the Flair_Sc_ext2 model which combines embeddings from both models did not give this entity a biomedical label.
In terms of the best parameter setting, we did not perform parameter selection for either the Flair or BERT models; this might further increase model quality.

V. CONCLUSION
In this work, we have explored the application of transfer learning techniques, in particular, language representation-based word embeddings, to the problem of extracting biomedical entities from 1000 Spanish clinical case narratives. By leveraging the knowledge from a huge amount of unlabeled data, with language model pre-training it becomes possible to build a high-quality NER system even with this small amount of annotated data.
With this aim, we trained domain-specific Spanish language models, in particular, Flair and BERT to derive contextualized word embeddings and applied them to the PharmacoNER biomedical NER data achieving competitive results. We showed that domain-specific word embeddings outperform general embeddings, despite being trained on a smaller corpus. Moreover, we demonstrated that stacking together word embeddings of different nature can improve model performance.
Error analysis has shown that the main source of errors for all models is over-zealous recognition of short entities. Indeed, biomedical entities are often short and upper-cased and can be easily mixed up with other abbreviated short words. Testing the approach by analyzing other Spanish health-related texts, such as social media [53], with similar characteristics (e.g., a large number of abbreviations, lack of grammatical structure, punctuation marks, etc.) and others (e.g., patient oriented terminology not included in any resource, slang words, etc.) could help to cope with these phenomena.
Moreover, standard embedding-based models often fail by detecting long false positive entities or longer versions of gold standard entities (in particular, for FastText models). However, it should be noted that the ability to detect long entities could be beneficial in particular scenarios.
One direction for improvement could be more sophisticated utilization of contextualized embeddings. For example, they could be incorporated into state-of-the-art NER architectures such as graph-based NNs or NNs with a dependency tree-based attention mechanism to further improve capturing of long-distance relationships between biomedical entities.
For handling the imbalance of classes, different strategies such as loss function modification could be applied in future work.