Chinese Medical Named Entity Recognition Based on Fusion of Global Features and Multi-Local Features

Chinese medical Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that aims to extract key information from Chinese medical texts. Recently, the Transformer has become the mainstream approach in NLP due to its powerful global feature extraction. However, entities in NER usually appear as contiguous subsequences, so local features cannot be neglected, and the uncertainty of Chinese word segmentation further increases the difficulty of the task. In this paper, we propose a network structure that combines global feature extraction with multi-local feature extraction to improve Chinese medical NER. On top of the global features extracted by the Transformer, Bi-LSTM is used to extract multi-local features, and a context integration mechanism enhances the local features by integrating the forward and backward global contexts into each cell, yielding a more comprehensive representation of each cell. We also propose an attention-based feature fusion method that lets the decoder focus on the information most important for predicting the current character. During global feature extraction, the flat-lattice structure is introduced to generate all potential Chinese word segmentation results, and a span-based relative positional encoding integrates direction and distance awareness, overcoming the Transformer's inability to capture sequential characteristics. Finally, a CRF with conditional constraints is used as the decoder. Experimental results on two benchmark datasets show the effectiveness of our model: it significantly outperforms state-of-the-art methods on the medical NER task, achieving $F1$ values of 93.64% on CCKS2017 and 85.01% on CCKS2019.


I. INTRODUCTION
In recent years, artificial intelligence technology has become a focal point of research in the medical field. Clinical data in medical records contain a wealth of vital knowledge and information. The automatic identification and extraction of these crucial clinical named entities is essential for constructing medical knowledge graphs, enabling intelligent diagnostics, and supporting various applications. It constitutes an indispensable component of many Clinical Natural Language Processing (NLP) systems.
Chinese medical Named Entity Recognition (NER) [1] is the task of extracting key information from texts in the Chinese medical domain, such as symptoms, diseases, and anatomical details from Chinese electronic medical records. It is typically approached as a sequence labeling problem, where the input is a clinical text and the output is a label sequence corresponding to the input. Intuitively, NER consists of two subtasks: recognizing entity boundaries and identifying entity categories.
In recent years, researchers have studied NER tasks extensively, and the models and improvements they proposed have shown excellent results. Since the initial application of BiLSTM to sequence labeling tasks [2], Recurrent Neural Networks (RNN) have been widely employed in NER. These models exhibit robust capabilities in learning contextual representations, making them the preferred encoders for the majority of NER models. Subsequently, the Transformer enabled the use of fully connected self-attention as an encoder, establishing long-distance contextual models and addressing the limitations of recursive models. Due to its structural advantages in parallelism and long-range contextual modeling, the Transformer has found widespread application and achieved excellent results in tasks such as machine translation and reading comprehension [3].
However, existing methods still have limitations in NER tasks within the Chinese medical domain. BiLSTM and other RNN models exhibit strong representation capabilities for sequence features, yet they handle long-term dependencies within sequences poorly and do not fully exploit the semantic information within a text sequence or the interrelationships between words. While the Transformer excels at capturing long-distance dependencies, it falls short in capturing the relative positional relationships between characters. Yan et al. [4] addressed this limitation by enhancing the traditional Transformer with Relative Positional Encoding (RPE), demonstrating that RPE improves the Transformer's performance on NER tasks. The Transformer encoder is also limited by its insensitivity to local information: in NER, entities typically appear as contiguous subsequences, and for a given character, its neighboring characters often provide the most useful information for determining its label.
To exploit both global and local feature extraction, we combine RNNs, which are well suited to sequence tasks, with the Transformer, which is better at capturing long-range dependencies, balancing the strengths and weaknesses of both. We propose a network that incorporates global features and multi-local features.

II. RELATED WORK
This section reviews existing work closely related to our study. In particular, we focus on the development of the NER task, followed by a description of improved methods for positional encoding.

A. NAMED ENTITY RECOGNITION
NER is a multi-class sequence labeling task: the input is a sequence of characters and the output is a sequence of labels, one per character. As natural language processing tasks have grown more complex, researchers have continuously advanced NER theory, and three mainstream families of NER methods have emerged: rule- and dictionary-based, traditional machine learning based, and deep learning based models. Nowadays, multilayer neural networks and Large Language Models (LLMs) have become the most popular methods for named entity recognition and other NLP tasks [5], [6], [7].
In recent years, machine learning approaches have been widely applied to NER, primarily Support Vector Machines (SVM) [8], Hidden Markov Models (HMM), and Conditional Random Fields (CRF) [9]. Compared to rule-based and lexicon-based methods, machine learning models exhibit strong adaptability and portability. McCallum and Li [10] used a CRF model for NER, achieving an F1 score of 88.96% on the CoNLL-2003 dataset.
The main advantage of deep learning over classical machine learning is automatic feature extraction, which is well suited to multi-level data. Collobert et al. [11] were the first to apply deep neural networks to NER; their CNN-CRF model achieved superior performance at a lower computational cost. Huang et al. [2] pioneered the use of BiLSTM for sequence labeling, and its outstanding representation capabilities have made BiLSTM the most widely used encoder. Building on that work, Lample et al. [12] introduced a novel architecture combining BiLSTM and CRF, reaching state-of-the-art (SOTA) levels in NER across different languages at the time.
With the introduction of the Transformer [13], it soon became the mainstream in NLP tasks. Radford et al. [14] established an encoder-decoder framework based on the self-attention model. The self-attention structure integrates contextual information better, capturing more information than RNN models while significantly improving training efficiency. Lin et al. [15] introduced self-attention extraction to explain sentence embeddings, achieving better performance in various text-related tasks. Baevski et al. [16] proposed pre-training bidirectional Transformer models on lexical reconstruction (cloze) tasks, demonstrating improved syntactic performance in GLUE experiments. Yamada et al. [17] proposed a new bidirectional Transformer-based contextual representation of words and entities, introducing an entity-aware self-attention mechanism and achieving remarkable performance. Sukhbaatar et al. [18] modified the Transformer model using the Relative Positional Embeddings (RPE) proposed by Shaw et al. [19] and the caching mechanism proposed by Dai et al. [20]; this accelerated training and testing, learned the optimal attention span, and significantly expanded the maximum context size used in the Transformer, yielding better results. Fang et al. [21] developed an end-to-end neural network for joint entity and relation extraction, leveraging multi-head attention and two prompting mechanisms. Guo et al. [22] replaced the fully connected structure in the Transformer with a star-shaped topology; by sharing intermediate nodes they made connections available between each pair of nodes, achieving substantial improvements on medium-sized datasets.
In 2019, the emergence of BERT [23], built on the Transformer, significantly enhanced dynamic contextual word representations. Cui et al. [24] proposed Whole Word Masking (WWM), in which the Masked Language Model (MLM) pre-training task recovers whole words rather than word fragments.
With the continuous development of deep learning in NLP, numerous applications have been introduced in the medical domain. Moezzi et al. [25] proposed a Transformer-based fine-grained NER architecture for clinical information extraction, in which the Transformer-based model outperformed previous methods. However, most applications in the medical field still focus on English. Unlike English, Chinese lacks natural word delimiters, which significantly increases segmentation uncertainty and thus the complexity of NER. To better account for the characteristics of Chinese, Zhang and Yang [26] used a lattice-structured LSTM to represent the vocabulary in sentences, incorporating potential lexical details into a character-based LSTM-CRF. Many recent models account for Chinese characteristics by adding radical and pinyin encodings at the encoding layer, effectively improving NER performance in the Chinese medical field [27], [28], [29]. An et al. [30] enhanced features by improving character-level encoding, while Zhang et al. [31] improved word information retrieval by combining character-level with word-level encoding. Li et al. [32] directly introduced an external lexicon to enhance the recognition of Chinese medical entities. In addition, some approaches combine RNNs or CNNs with self-attention to obtain multi-level features of Chinese text, but they only perform simple concatenation of features at different levels [33], [34].
Previous methods for Chinese medical NER have focused primarily on the encoding layer, emphasizing certain text features while either ignoring features at different levels of the sentence or merely concatenating the features they obtain. In this paper, we explore features at different levels of a sentence to better understand its content, and propose a network framework that combines global features with multiple local features, aiming to overcome these limitations and effectively improve Chinese medical NER.

B. POSITIONAL ENCODING
Unlike traditional recurrent models, non-recurrent models are less sensitive to positional information. The Transformer's self-attention architecture models the dependencies among elements at various positions in a sequence, but it yields a family of permutation-equivariant functions [35], which prevents the Transformer from accessing the positional information of sequence elements; in other words, the model cannot capture the sequential features of the sequence. In natural language, however, the same word appearing in different positions often has different meanings. To address this problem, Vaswani et al. [13] proposed using sinusoidal signals of different frequencies to generate the positional encoding, where the encoding of position $pos$ can be expressed by the following equations.

$$P_{pos,2i} = \sin\!\left(pos / 10000^{2i/d_{model}}\right), \qquad P_{pos,2i+1} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$

where $i$ ranges over $[0, d_{model}/2]$ and $d_{model}$ is the input size. This sine-based positional encoding gives the Transformer the ability to model the position of characters and the distance between any two characters, which we call APE. Although APE can represent some positional information, it still lacks directionality compared with an RNN: in a Bi-LSTM, the model collects information from the left and right sides of a token separately, taking both direction and position into account. Based on this, Yan et al. [4] proposed an RPE for the Transformer encoder. The relative position information between $x_t$ and $x_j$ is represented by $R_{t-j}$, and the attention scores are calculated as

$$A_{t,j} = Q_t K_j^{\top} + Q_t R_{t-j}^{\top} + u K_j^{\top} + v R_{t-j}^{\top}$$

where $u$ and $v$ are learnable parameters. The directionality of the relative position between characters follows from the properties of the sine and cosine functions: $\sin(-x) = -\sin(x)$ is odd while $\cos(-x) = \cos(x)$ is even, so the cosine terms capture the absolute positional relationship between characters and the sine terms capture the direction. Liu et al. [36] used the neural ODE method to model the evolution of the encoding along the position index as a dynamical system, overcoming limitations such as the inflexibility of fixed sine functions and the lack of learnable parameters. Luo et al. [37] developed Universal RPE-based attention (URPE), a positional encoding with learnable parameters that can be applied to different datasets and frameworks.
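As an illustration, the sinusoidal APE above can be sketched in a few lines of NumPy (a minimal sketch; the function name and vectorized layout are our own):

```python
import numpy as np

def sinusoidal_encoding(pos, d_model):
    """APE for a single position: even dimensions use sin, odd use cos."""
    i = np.arange(d_model // 2)                        # i in [0, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)   # pos / 10000^(2i/d_model)
    enc = np.empty(d_model)
    enc[0::2] = np.sin(angle)                          # P_{pos, 2i}
    enc[1::2] = np.cos(angle)                          # P_{pos, 2i+1}
    return enc
```

At position 0 every sine term is 0 and every cosine term is 1, and nearby positions yield nearby encodings, which is what lets attention infer distances between characters.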
The design of the RPE plays a very important role in the modification of the Transformer [38].From an empirical perspective, many studies have shown that Transformers based on RPE can achieve impressive performance in various language tasks and have better inference ability in longer sequences [39].It is also worth noting that RPE makes it easy to extend Transformer to other data schemas.

III. BACKGROUNDS
Chinese NER is closely related to word segmentation: the boundaries of the entities to be recognized are also boundaries of Chinese words. An intuitive approach to Chinese NER is therefore to first perform word segmentation and then annotate the segments with entity categories. However, because entity identification then relies on the segmentation output, inaccurate segmentation leads to significant error propagation. Studies have demonstrated that character-based methods outperform word-based methods in NER [40], [41], [42].
However, character-based NER has limitations, as it does not fully utilize explicit word and word-order information. To address this, Zhang and Yang [26] proposed a lattice structure that integrates potential lexical details into a character-based LSTM-CRF. By considering all potential segmentation results, it explicitly leverages word and word-sequence information, significantly improving Chinese NER performance.
Due to the complexity and dynamism of the lattice structure, most lattice-based models struggle to fully exploit the parallelism of GPUs, resulting in slow inference. To address this, Li et al. [43] transformed the lattice into a flat structure: they reconstructed the lattice by assigning indices to characters and modeled the lattice input directly with a Transformer. By leveraging the Transformer encoder's capabilities and excellent parallelism, the drawbacks of the lattice LSTM were effectively addressed. The structure of the Flat-Lattice Transformer is illustrated in Fig. 1; the input to the Transformer encoder includes character, word, and lattice information. This remains essentially a character-based model, avoiding propagation errors while still taking vocabulary information into account. A glossary built from the sentence provides the different word combinations, and this vocabulary list is merged with the original sequence to form a new input sequence. A span set is then defined over the new sequence, marking the head and tail positions of each character or word in the original sequence, so that the model obtains more accurate word-separation information during training.
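The flattening step can be illustrated with a toy sketch (a hypothetical helper; real implementations match words with a trie over a large lexicon rather than brute force):

```python
def build_flat_lattice(chars, lexicon):
    """Merge characters and matched lexicon words into one flat sequence.

    Each token keeps its head and tail index in the original character
    sequence, which is exactly the span information the encoder consumes.
    """
    spans = [(c, i, i) for i, c in enumerate(chars)]   # characters: head == tail
    for i in range(len(chars)):
        for j in range(i + 1, len(chars)):
            word = "".join(chars[i:j + 1])
            if word in lexicon:
                spans.append((word, i, j))             # word span covers chars i..j
    return spans
```

For the characters "abcd" and the lexicon {"ab", "bcd"}, the flat sequence contains the four character spans plus ("ab", 0, 1) and ("bcd", 1, 3); the word spans carry the potential segmentation results without committing to any one of them.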

IV. METHOD
This section presents the proposed method. Formally, the NER task can be stated as follows: given a sequence of characters $x = \{x_1, x_2, \ldots, x_n\}$, the goal is to map $x$ to a label sequence $y = \{y_1, y_2, \ldots, y_n\}$ of the same length, one label per character. In this paper, real Chinese electronic medical records are used as the main experimental subject to validate the model. The architecture of the proposed method is shown in Fig. 2 and comprises three parts: the global feature extraction module, the local feature extraction module, and the feature fusion module. Each part is introduced separately below.

A. GLOBAL FEATURE EXTRACTION MODULE
The global feature extraction module uses the Transformer as encoder, incorporating the flat-lattice structure to address Chinese word segmentation. Simultaneously, RPE is introduced, enabling the Transformer encoder to capture sequential features in the text.
The RPE provides distance and direction information to the encoder, compensating for the Transformer's shortcomings relative to recurrent models such as RNN and LSTM. The Transformer encoder with RPE is shown in Fig. 3.
The design of the RPE is based on the span set, using dense vectors to model the intersection, inclusion, and separation relationships between two spans, computed from successive transformations of head and tail position information. Two vectors $h[i]$ and $t[i]$ represent the head and tail positions of span $x_i$, and the relative positions of $x_i$ and $x_j$ are computed as Eq. (6)-Eq. (9). For sequence-based tasks, element-order information is crucial, and the RPE provides this information to the model:

$$d_{ij}^{(hh)} = h[i] - h[j], \quad d_{ij}^{(ht)} = h[i] - t[j], \quad d_{ij}^{(th)} = t[i] - h[j], \quad d_{ij}^{(tt)} = t[i] - t[j]$$

where $d_{ij}^{(hh)}$ is the distance from the head of $x_i$ to the head of $x_j$, $d_{ij}^{(ht)}$ is the distance between the head of $x_i$ and the tail of $x_j$, $d_{ij}^{(th)}$ is the distance between the tail of $x_i$ and the head of $x_j$, and $d_{ij}^{(tt)}$ is the distance between the tail of $x_i$ and the tail of $x_j$. The position of the span is ultimately encoded as a simple non-linear transformation of the four distances; the RPE of characters $i$ and $j$ is represented as

$$R_{ij} = \mathrm{ReLU}\!\left(W_r\left(P_{d_{ij}^{(hh)}} \oplus P_{d_{ij}^{(ht)}} \oplus P_{d_{ij}^{(th)}} \oplus P_{d_{ij}^{(tt)}}\right)\right)$$
where $W_r$ is a learnable parameter, $\oplus$ denotes the concatenation operator, and $P_d$ is the absolute position encoding operation of Eq. (1) and Eq. (2), computed with $d$ in place of $pos$, where $i$ is the dimensional index of the encoding. The span-based RPE is then used in the variant self-attention mechanism of [20], computed as

$$A^{*}_{ij} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{ij} + U^{\top} W_{k,E} E_{x_j} + V^{\top} W_{k,R} R_{ij}$$

where $W_q, W_{k,R}, W_{k,E} \in \mathbb{R}^{d_{model} \times d_{head}}$ and $U, V \in \mathbb{R}^{d_{head}}$ are learnable parameters and $R_{ij}$ is the RPE vector; $A^{*}$ then replaces $A$ in the traditional Transformer. Afterwards, the character representations are fed to the feature fusion layer, while the vocabulary information generated by the lattice structure is truncated.
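The four span distances underlying the RPE can be sketched directly (our own function name; $h$ and $t$ follow the head/tail notation above):

```python
def span_distances(h_i, t_i, h_j, t_j):
    """Four head/tail distances between spans x_i and x_j (Eqs. 6-9)."""
    return {
        "hh": h_i - h_j,   # head of x_i to head of x_j
        "ht": h_i - t_j,   # head of x_i to tail of x_j
        "th": t_i - h_j,   # tail of x_i to head of x_j
        "tt": t_i - t_j,   # tail of x_i to tail of x_j
    }
```

For a character span (2, 2) and a word span (1, 3), the signs of the four distances jointly tell the encoder that the character lies inside the word, which is how intersection, inclusion, and separation between spans become learnable.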

B. LOCAL FEATURE EXTRACTION MODULE
The paper [44] verified that extracting local features is effective for named entity recognition, using sliding windows in a local feature extraction module to extract local information undisturbed by the whole-sequence information, but without further processing of the local features. In this paper, to address Bi-LSTM's limited capacity for characterizing sentences, the extracted regional features are further fused with context to improve the representation in each Bi-LSTM cell.
The local feature extraction process is illustrated in Fig. 4 and consists of two parts: the BiLSTM encoder and a gate-based context integration module. First, from the Bi-LSTM outputs we extract the contexts of the final forward cell and the final backward cell as the relative global information. After fusing these contexts with each cell, we obtain the local weight and relative global weight of each cell through the gate mechanism, then perform the corresponding element-wise products and summation to obtain the final multi-local features.
RNN-based models have natural advantages for sequence problems. In this module, Bi-LSTM extracts features from local intervals. It performs feature extraction in both directions of the sequence, allowing the encoder to attend well to the positional information of each character. Internally, it uses a gate mechanism to control the flow and forgetting of features, which helps capture local features for each word.
Due to the diversity of entity lengths in the dataset, a single local interval length may cause semantic limitations, so that information important to the decoder's predictions is ignored. To avoid this, multiple local intervals are set to extract multi-local features and obtain richer multi-level semantic information.
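The gate-based context integration of Fig. 4 can be sketched for a single cell as follows (a minimal NumPy sketch; the parameter names `W1`, `b1`, `W2`, `b2` and their shapes are our assumptions, not the paper's exact parameterization):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cell_fusion(G, h, W1, b1, W2, b2):
    """Fuse the relative global context G with one cell's local feature h.

    Both gates read the concatenation [G; h]; the fused feature is the
    element-wise weighted sum of the global and local parts.
    """
    IN = np.concatenate([G, h])     # concatenated context, as described in the text
    w_g = sigmoid(W1 @ IN + b1)     # weight for the global context
    w_h = sigmoid(W2 @ IN + b2)     # weight for the local feature
    return w_g * G + w_h * h        # element-wise fusion
```

With zero weight matrices both gates output 0.5, so the fused feature is the plain average of G and h; training moves the gates away from that neutral point so each cell decides how much global context to absorb.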
Let the length of one local interval be $l$, and let $h_i^l$ denote the Bi-LSTM feature of the cell centred on character $x_i$. We concatenate the relative global contextual features $G$ with the local features to obtain $IN$, and adjust its dimensions through a linear mapping; the result is then used as input to the gate mechanism to derive weights. The weights $W_g^{i,l}$ and $W_h^{i,l}$ for each cell are assigned by the sigmoid function:

$$IN_i^l = W_m\!\left(G \oplus h_i^l\right), \quad W_g^{i,l} = \sigma\!\left(W_1 IN_i^l + b_1\right), \quad W_h^{i,l} = \sigma\!\left(W_2 IN_i^l + b_2\right)$$

We use the two weights to fuse the contexts and obtain the final local features:

$$H_i^l = W_g^{i,l} \odot G + W_h^{i,l} \odot h_i^l$$

where $\odot$ denotes the element-wise product. By continuously sliding the interval block, each character in turn becomes the centre of the interval, and feature extraction is performed for intervals of length $l$ centred on different characters. Ultimately, all local features of $x_i$ across the interval lengths are collected as the multi-local features $H$,
where $l_1, l_2, \ldots$ are the selected interval lengths.

C. FEATURE FUSION MODULE

To fuse the global features $Z$ with the multi-local features $H$, an attention-based fusion is used. A query matrix is obtained from the global features, $Q = FC(Z)$, where $FC$ denotes a fully connected layer; the concatenation of global and local features $[Z, H]$ serves as the Key and Value matrices. The final attention score is the fused representation $y$. The result is used as input to the CRF decoder to predict the label of each character in $x$.
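The attention-based fusion can be sketched as follows (single-head, unbatched, a simplification under our own shape assumptions; the paper's exact projection layers may differ):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_fusion(Z, H, Wq):
    """Fuse global features Z (n x d) with multi-local features H (m x d).

    Q comes from the global features through a linear map; the concatenation
    [Z, H] serves as both Key and Value, so each position can attend to the
    feature level most important for predicting its character.
    """
    KV = np.concatenate([Z, H], axis=0)            # (n + m, d) keys and values
    Q = Z @ Wq                                     # (n, d) queries from global features
    scores = softmax(Q @ KV.T / np.sqrt(Z.shape[-1]))
    return scores @ KV                             # (n, d) fused output y
```

Because each attention row is a convex combination over both global and local rows, the decoder input mixes the two feature levels per position instead of blindly concatenating them.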

D. DECODER LAYER
For the decoding layer, a CRF is used as the model's decoder to return the final tag sequence $Y$. During inference, constraints are added to the CRF transition matrix so that logically invalid predictions are avoided, which enhances the precision of the model. Given the sequence $x$, let $Y(x)$ be the set of all valid label sequences $y'$; the probability of $y$ is computed as

$$P(y|x) = \frac{\exp\!\left(\sum_{i} f(y_{i-1}, y_i, x)\right)}{\sum_{y' \in Y(x)} \exp\!\left(\sum_{i} f(y'_{i-1}, y'_i, x)\right)}$$
where the function $f(y_{i-1}, y_i, x)$ calculates the transition score from $y_{i-1}$ to $y_i$ plus the score for $y_i$. The objective is to maximize $P(y|x)$.
During prediction, the Viterbi algorithm is used to determine the optimal tag sequence, i.e. the one maximizing the objective function.
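Viterbi decoding with the conditional constraint expressed as forbidden transitions can be sketched as follows (tag indices and shapes are illustrative):

```python
import numpy as np

def viterbi(emissions, transitions):
    """Return the highest-scoring tag path for a linear-chain CRF.

    emissions:   (n, k) per-position tag scores.
    transitions: (k, k) transition scores; setting an entry to -inf forbids
                 that transition (e.g. O -> E-TY1 under the BEMSO scheme).
    """
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)   # best previous tag for each current tag
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):        # backtrack through the stored pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With emissions favouring tag 1 at both steps but `transitions[1, 1] = -inf`, the decoder is forced into a legal path such as `[1, 0]` instead of `[1, 1]`, which is exactly the effect of the conditional constraints described above.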

V. EXPERIMENTS

A. DATASET AND METRICS
To demonstrate the effectiveness of our model, the experiments use open-source electronic medical record datasets: CCKS2017 and CCKS2019. The data are sourced from authentic electronic medical records of the cloud hospital platform of Beijing Juristic Cloud Health Technology Co., Ltd. The data were desensitized by experts, and the medical entity annotations were produced primarily by professional medical teams. In this paper, the BEMSO annotation scheme is used for entity tagging: B-TY1 denotes the starting character of an entity of type TY1, M-TY1 a middle character, E-TY1 the ending character, S-TY1 a single-character entity of type TY1, and O a character that does not belong to any entity. Fig. 5 shows the detailed annotation method.
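The BEMSO scheme can be sketched as a conversion from entity spans to per-character tags (entity spans as `(start, end, type)` with inclusive end; our own helper):

```python
def bemso_tags(length, entities):
    """Map (start, end, type) spans onto BEMSO tags for a character sequence."""
    tags = ["O"] * length                     # O: outside any entity
    for start, end, ty in entities:
        if start == end:
            tags[start] = f"S-{ty}"           # single-character entity
        else:
            tags[start] = f"B-{ty}"           # beginning character
            for i in range(start + 1, end):
                tags[i] = f"M-{ty}"           # middle characters
            tags[end] = f"E-{ty}"             # ending character
    return tags
```

For a five-character text with a three-character Symptom entity at positions 0-2 and a single-character Anatomy entity at position 4, this yields B-Symptom, M-Symptom, E-Symptom, O, S-Anatomy.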
In CCKS2017, the named entity types are Anatomy, Symptom, Disease, Exam, and Treatment; in CCKS2019, they are Anatomy, Medicine, Disease, Exam, Operation, and Check. In addition to the Chinese medical datasets, we also used the traditional Chinese datasets MSRA and Resume to extend our experiments, comparing against classical baseline models to demonstrate the universality of the proposed model. The details of the traditional Chinese datasets are presented in Table 3.
To assess the effectiveness of our model, we select the same evaluation metrics as previous works: precision (Prec.), recall (Rec.), and the F1 value, based on official strict matching. Strict matching requires the ground truth and the recognition result to share identical mentions, boundaries, and entity types. The metrics are calculated as

$$Prec. = \frac{TP}{TP + FP}, \quad Rec. = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times Prec. \times Rec.}{Prec. + Rec.}$$

where $TP$ is the number of correctly recognized entities, $FP$ the number of spurious entities, and $FN$ the number of missed entities.
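Strict matching over entity sets can be sketched directly from these formulas (entities as `(start, end, type)` triples; a hypothetical helper):

```python
def strict_prf(gold, pred):
    """Precision, recall and F1 under strict matching.

    An entity counts as a true positive only if its boundaries and type
    both match the ground truth exactly.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                              # exact-match hits
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Note that a prediction with the right type but a boundary off by one character counts as both a false positive and a false negative, which is why boundary errors are penalized twice under strict matching.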

B. IMPLEMENTATION SETTINGS
All code is developed with the PyTorch framework and runs on a single-GPU server. In the experiments on CCKS2017, the batch size is set to 16, the learning rate to 1e-4, and the number of epochs is set

C. BASELINES
To assess the effects of our proposed model, the following baseline models are used for performance comparisons on the same datasets.
1) BiLSTM-CRF [2]: a classical RNN-based NER model using a bidirectional LSTM as encoder and a CRF as decoder.
2) Lattice-LSTM [26]: a NER model incorporating character information with the addition of word segmentation information.
3) FLAT [43]: an improvement on the lattice structure that uses a Transformer to model flattened word information, better attending to long-range contextual information.
4) BERT [23]: a pre-trained language model based on the Transformer encoder, adaptable to different NLP downstream tasks by fine-tuning.
5) TENER [4]: a model incorporating direction-awareness, distance-awareness, and un-scaled attention, employing an adapted Transformer encoder to represent character and word features.

D. EXPERIMENTAL RESULT
In this section, we compare the baseline models against our proposed model to validate the overall effectiveness of our approach. Table 4 shows the performance comparison: our model improves precision, recall, and F1 on both datasets. In addition to the baseline models, we conducted a comparative analysis of our model against some of the most recently published NER models in the Chinese medical field.
For CCKS2017, we selected models from recently published articles and examined their results:
1) AR-CCNER: incorporates radical-level feature extraction into BiLSTM-CRF and employs an attention mechanism to extract deeper semantic information.
2) BiLSTM-Att-CRF: a BiLSTM-CRF model that employs an attention mechanism to find the highest joint probability of the input tag sequence and a predetermined tag set.
3) Multi-level semantic fusion: concatenates radical-level, character-level, and token-level embeddings, feeds them into a Bi-LSTM for feature extraction, and extracts syntactic dependencies using graph neural networks.
4) FT-BERT-BiLSTM-CRF-Radical: Li et al. pre-trained BERT on unlabelled text to extract text features, used LSTM and CRF to decode the predicted labels, and introduced dictionary features into the model.
5) MUSA-BiLATM-CRF: proposes an improved character-level representation, with character embeddings and character-label embeddings throughout, to improve the specificity and diversity of the feature representation.
The performance of the models proposed in this paper is presented in Table 5, revealing superior results compared to the recently published articles; in particular, our models outperform the selected benchmarks.
Similarly, for CCKS2019, we evaluated recently published models and their corresponding results:

VI. ANALYSIS

A. ABLATION STUDY

1) MODULE ABLATION
To evaluate the effectiveness of our model on the NER task, we design ablation experiments showing the impact of each module on the overall model. We verify the model's performance after removing, respectively, the entire local feature extraction module, the flat-lattice structure with token-level embedding, the RPE, and the conditional constraints on the CRF transition matrix. A further ablation experiment verifies the effectiveness of the local feature extraction module alone: all enhancement modules except local feature extraction are removed simultaneously. The results of all ablation experiments are shown in Table 7.
Comparing the results, the model degrades to a corresponding degree regardless of which part is removed. Removing the multi-local feature extraction module hurts most, with F1 values of 92.28% and 83.29% on CCKS2017 and CCKS2019 respectively. Furthermore, in the experiment where only local feature extraction is added, F1 values of 92.67% and 83.96% are obtained on CCKS2017 and CCKS2019 respectively, improving on all baseline models, and the CCKS2017 result surpasses most recent models. These results show that attending to local information in the NER task allows the decoder to focus on the information most important for predicting labels, effectively improving entity recognition accuracy.

2) LOCAL FEATURE ABLATION
First, to demonstrate the enhancement brought by the context integration mechanism, we conducted experiments with it removed; the results are shown in Table 8 and exhibit that the model's performance improves substantially with the context integration mechanism.
To explore which local intervals provide the most information for the NER task, we conducted a statistical analysis of the lengths of all entities in the two datasets; the results are shown in Fig. 6. Most entities in both datasets have lengths between 1 and 9. To ensure that the local intervals cover a wide range of entity lengths, we created a list of candidate interval lengths, l = [1, 3, 5, 7, 9]. With a single interval, the optimal performance is achieved at l = 5 for CCKS2017 and at l = 1 for CCKS2019. Since the entity lengths of the two datasets are concentrated in this range, effective local information enables the model to focus not only on the overall semantics but also on the local subsequences where entities lie, providing more information for character-label classification and thus more accurate predictions. However, contrary to our intuition, the improvements on CCKS2019 at l = 3 and l = 5 were not as large as predicted. Our analysis suggests that, due to the variety of entities in CCKS2019, its entity-length distribution is more dispersed, and focusing on a single local interval may limit semantic acquisition and cause misjudgments. This is one of the important reasons we chose to extract multiple local intervals. Finally, after several experiments, we chose l = [1, 3, 5] for CCKS2017 and l = [1, 5, 7] for CCKS2019 as the multi-local interval lengths; this choice achieved the highest performance while keeping the increase in training time within a reasonable range.

3) FEATURE FUSION ABLATION
In order to verify the impact of the attention-based feature fusion approach on the overall model, we compare it with traditional feature fusion methods such as concatenation and addition. The results of the comparison are displayed in Table 9. It is evident from the table that the attention-based feature fusion method gives better performance, indicating that it makes better use of each part of the features. This demonstrates the positive effect of the feature fusion module on the overall NER performance.
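The attention-based fusion contrasted with concatenation and addition above can be sketched as follows; this is a simplified NumPy illustration in which a single scoring vector `w` stands in for the model's learned attention parameters, so the exact scoring function is an assumption:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(features, w):
    """Fuse feature vectors (global + each local) for one character.

    features: (n, d) array, one row per feature source.
    w: (d,) scoring vector (stand-in for learned attention parameters).
    """
    scores = features @ w        # one scalar score per feature source
    alpha = softmax(scores)      # attention weights over the sources
    return alpha @ features      # weighted sum -> fused (d,) vector

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))   # e.g. 1 global + 3 local features, d = 8
fused = attention_fuse(feats, rng.standard_normal(8))
print(fused.shape)  # (8,)
```

Unlike concatenation, the fused vector keeps the dimension d, and the weights let the decoder lean on whichever feature source is most informative for the current character.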

B. CASE STUDY
In order to get a more intuitive feel for the effectiveness of our proposed model, we chose two real medical cases for the case study. In these two cases, we compare the entity recognition results of the model proposed in this paper with those of the baseline model. The results are presented in Table 10. In Case 1, the recognition results of our model are identical to the ground truth, whereas the baseline model makes two errors: '' ''(physical examination) is not identified, and the entity '' ''(cerebral infarction) is not accurately identified but is instead labelled ''Anatomy'', resulting in a boundary identification error and a type identification error. From the results, it can be seen that some of the errors of the baseline model stem from incorrect Chinese word segmentation and insensitivity to local information. In Case 2, the baseline model does not perform well on single-character entities, with ' '(heart), ' '(lungs) and ' '(abdomen) not identified; it also fails to identify the entity ' '(all skin mucosa), identifying only ' '(skin mucosa), which results in an entity boundary identification error.
After analysing the recognition results of the two cases, it can be concluded that both the lattice-structure handling of Chinese word segmentation results and the fusion of global sequence features with local features provide more useful information for entity class determination, resulting in more complete and accurate recognition.

VII. CONCLUSION
In this paper, we propose a network structure that fuses global features with multi-local features, and present a feature fusion method based on an attention mechanism.
137518 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
This structure takes full advantage of both the Transformer and RNN-based models. Global and multi-local features of the sequences are extracted using the Flat-Lattice Transformer with relative positional encoding and Bi-LSTM, respectively, and a context integration mechanism is used in the local feature extraction module to enhance Bi-LSTM's ability to characterise sentences. Next, feature fusion is performed using an attention-based approach, which allows the decoder to focus on the more important information during prediction, and a flat-lattice structure is used to fully account for the Chinese word segmentation problem. Finally, we analyse the results obtained on two benchmark datasets and verify that our approach improves the accuracy of entity recognition and achieves better overall performance without requiring external knowledge or syntactic dependencies. However, the increase in model runtime due to the addition of local feature extraction is not negligible. To address this limitation, we plan to improve the local feature extraction module in the future to increase the model's speed, and to consider the challenge of recognizing long and difficult entities in the medical domain, both of which have great potential for NER tasks in the Chinese medical domain.

APPENDIX PERFORMANCE OF SINGLE LOCAL FEATURE EXTRACTION
The detailed results of the local feature ablation experiments conducted on both datasets are shown in Table 11.

First, we use the Transformer encoder for global feature extraction and add local features of different lengths to the global features. The local information is selected by sliding interval blocks and fed into the Bi-LSTM, while a context integration mechanism is introduced to enhance the ability of the local feature extraction module to represent sequences by integrating the overall context into each Bi-LSTM cell. The multi-local features are effectively combined with the global features through an attention-based feature fusion layer, and finally the transition matrix of the CRF decoder is conditionally constrained to decode the label sequence of the characters. The main contributions of this work are summarized as follows: 1) We propose a network structure that combines the global features extracted by the Flat-Lattice Transformer with the multi-local features extracted by Bi-LSTM, making the model more sensitive to local information and improving entity prediction. 2) A context integration mechanism is introduced into the Bi-LSTM to enhance the local features, adding the forward and backward global contexts to each individual cell to strengthen its representational capability. 3) An attention-based feature fusion method is proposed, which effectively fuses global features with multi-local features and outperforms traditional feature fusion methods. The experimental results on the most common public datasets show that the model performs better in comparison with other models.
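The conditional constraint on the CRF transition matrix mentioned above can be realised as a mask that forbids invalid BIO transitions; the following is a minimal sketch with an illustrative label set and penalty value (not the paper's actual labels or constants):

```python
import numpy as np

labels = ["O", "B-Body", "I-Body", "B-Sym", "I-Sym"]

def allowed(prev, nxt):
    """A transition to I-X is valid only from B-X or I-X of the same type X."""
    if nxt.startswith("I-"):
        return prev[2:] == nxt[2:] and prev[:1] in ("B", "I")
    return True  # O and any B-X are reachable from any label

# Invalid transitions get a large negative score so Viterbi never selects them.
NEG = -1e4
mask = np.array([[0.0 if allowed(p, n) else NEG for n in labels]
                 for p in labels])
print(mask[labels.index("O")][labels.index("I-Body")])  # -10000.0
```

Adding such a mask to the learned transition scores guarantees that decoded sequences never start an entity with an I- tag or switch entity type mid-entity.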

FIGURE 2. Model structure diagram. Here n denotes that the local interval extraction module performs feature extraction for n local intervals of different lengths.
length l is extracted from the global features obtained from the Transformer encoder and fed into the Bi-LSTM for local feature extraction. The variable h denotes a local feature, and a local feature $h_i^l$ of length l is obtained for the character $x_i$, as depicted in the following equation, where $\overrightarrow{h}_l$ represents the feature of the last cell of the forward LSTM within the local interval $z_{i-\lfloor l/2 \rfloor}$ to $z_{i+\lfloor l/2 \rfloor}$, and $\overleftarrow{h}_1$ denotes the feature of the first cell of the backward LSTM within the same interval
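The window indices implied by this notation can be computed as follows; this sketch assumes a symmetric split around character i (floor division on each side for odd l) and clipping at the sequence boundaries, which is an assumption about how edge characters are handled:

```python
def local_interval(i, l, n):
    """Indices of the length-l window centred on character i, clipped to [0, n).

    Mirrors the interval z_{i - floor(l/2)} .. z_{i + floor(l/2)} for odd l;
    the clipping behaviour at sequence edges is an assumption.
    """
    start = max(0, i - l // 2)
    end = min(n, i + l // 2 + 1)
    return list(range(start, end))

print(local_interval(4, 5, 10))  # [2, 3, 4, 5, 6]
print(local_interval(0, 5, 10))  # [0, 1, 2] -- window clipped at the left edge
```

For odd l the unclipped window always has exactly l characters, with x_i in the centre.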

FIGURE 4. The local feature extraction module, where l denotes the length of one local interval.

Algorithm 1. Multi-Local Feature Extraction
Input: the global feature Z; the list of local interval lengths L.
Output: the multi-local feature H.
1: Initialize InI ← List
2: Initialize H ← List
3: for each l ∈ L do
   ...
8:   for u ← 1 to ⌊l/2⌋ + 1 do
9:   ...
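The multi-local extraction loop can be sketched in Python as follows; note that the mean over each window merely stands in for the Bi-LSTM encoder of the paper, and the window indexing and edge clipping are simplifying assumptions:

```python
import numpy as np

def multi_local_features(Z, L):
    """For each interval length l in L, slide a length-l window over the
    global features Z of shape (n, d) and pool each window.  The mean
    pooling here is a placeholder for the paper's Bi-LSTM encoder."""
    n, d = Z.shape
    H = []
    for l in L:
        h_l = np.empty((n, d))
        for i in range(n):
            lo, hi = max(0, i - l // 2), min(n, i + l // 2 + 1)
            h_l[i] = Z[lo:hi].mean(axis=0)   # pool the local interval
        H.append(h_l)
    return H  # list of (n, d) arrays, one per interval length

Z = np.arange(12, dtype=float).reshape(6, 2)  # toy global features, n=6, d=2
H = multi_local_features(Z, [1, 3, 5])
print(len(H), H[0].shape)  # 3 (6, 2)
```

With l = 1 each window contains only the character itself, so the first output reproduces Z; larger l blends in neighbouring characters, which is the local context the model exploits.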
For the implementations of FLAT and TENER, we refer to open-source projects on GitHub. For the implementation of the baseline model BERT-CRF, the open-source pre-trained model bert-base-chinese from Hugging Face is used. During the experiments, all model training times are within 4 to 8 hours.

FIGURE 6. Statistics of entity lengths in the different datasets.
The contribution of local information to the NER task is verified by setting up experiments with single local information extraction. The improvements in F1 values for single-local and multi-local feature extraction are shown in Fig. 7. The growth values in the graph are obtained by comparison with the model in which only the local feature extraction module is removed. It is evident from the outcomes that incorporating local feature extraction enhances the overall performance of the model on both datasets.

FIGURE 7. Ablation experiments on the two datasets: performance comparison of no local features, single-local features, and multi-local features. l = x denotes that only the interval of length x is used for local feature extraction. The improvement values are calculated against the ablation in which only the local feature extraction module is removed.

FIGURE 8. The iterative process of F1 values: comparison of no local features, single-local features with l = 5, and multi-local features.
The specific statistics for the datasets are shown in Table 1 and Table 2.

TABLE 1. Statistics on the number of entities in the CCKS2017.

TABLE 2. Statistics on the number of entities in the CCKS2019.

TABLE 4. Performance comparison with the baseline model.

TABLE 5. Comparison with recent model performance on the CCKS2017.

TABLE 6. Comparison with recent model performance on the CCKS2019.

TABLE 8. Context integration mechanism ablation experiments.

TABLE 9. Feature fusion method ablation experiments.

TABLE 11. Performance results of local feature ablation experiments.