A Residual BiLSTM Model for Named Entity Recognition

As one of the most powerful neural networks, Long Short-Term Memory (LSTM) is widely used in natural language processing (NLP) tasks. Meanwhile, the BiLSTM-CRF model is one of the most popular models for named entity recognition (NER), and many state-of-the-art models for NER are based on it. In this paper, we propose a new residual BiLSTM model and combine it with a conditional random field (CRF) layer for NER tasks. Starting from the popular BiLSTM-CRF model, we replace the BiLSTM with our residual BiLSTM blocks to encode words or characters. We evaluate our model on Chinese and English datasets, using both word2vec and BERT to generate word or character vectors. Furthermore, we conduct experiments to compare the NER performance of different residual block structures. The experimental results show that our model effectively improves the performance of both Chinese and English NER without introducing any external knowledge.


I. INTRODUCTION
As a fundamental task in natural language processing, named entity recognition (NER) aims to identify the named entities in unlabeled sentences or texts. Named entities belong to special semantic types such as person (PER), organization (ORG) and location (LOC). Thus, NER is a typical classification task: a model is trained on texts in which named entities have been correctly labeled, and then predicts the named entities in other unlabeled texts. NER has received much attention because it affects the performance of downstream NLP tasks such as relation extraction [1] and entity linking [2].
In recent years, deep learning technologies have been widely used in a variety of NLP and computer vision (CV) tasks. Compared with convolutional neural networks (CNNs), recurrent neural networks (RNNs) are more widely used for NLP tasks because they can capture the semantic features of a long sequence. The most popular RNN model is Long Short-Term Memory (LSTM) [3], which has achieved success in many NLP tasks [4], [5]. For NER, the state-of-the-art models are usually based on BiLSTM-CRF [6], which uses a BiLSTM to extract the features of input sentences and connects them to a conditional random field (CRF) layer to jointly predict target labels. Many of these models introduced external knowledge to the BiLSTM-CRF model, such as radical features [7], character features [8], segmentation features [9] and sentence features [10]. Some other models employed multi-task learning to perform NER jointly with other related tasks [11], [12]. However, almost all of the above works combine existing models or methods. Only a few researchers [13]-[15] have attempted to improve the structure of BiLSTM-CRF.
The associate editor coordinating the review of this manuscript and approving it for publication was Constantinos Marios Angelopoulos.
Meanwhile, in the area of CV, residual networks have achieved state-of-the-art accuracy on image recognition and some other related tasks. For instance, ResNets [16] use identity mappings as skip connections between layers to train very deep networks with over 100 layers. DenseNets [17] connect each layer to all of its subsequent layers to build a deep residual network. Compared with a convolution kernel, the basic unit in CNNs, the LSTM kernel is more complex. Hence, it is a challenge to design a residual network based on LSTMs. Inspired by residual CNNs, we introduce a new type of residual block to BiLSTM to build a residual BiLSTM model for NER.
In this paper, we propose a new residual BiLSTM model and connect it to a CRF layer to perform NER tasks. To demonstrate the effectiveness of our model, we conduct experiments on both Chinese and English NER datasets with fixed vectors generated by word2vec [18] and GloVe [19] as inputs respectively. Furthermore, we use BERT [20] to generate higher-quality input vectors for both Chinese and English NER to demonstrate that our model can also benefit from BERT. We choose BiLSTM-CRF, stacked BiLSTM-CRF and other state-of-the-art models that introduce external knowledge or multi-task learning approaches as our baseline models. We also conduct experiments to investigate the impact of the residual block structure and the number of layers on performance. The experimental results show that our proposed residual BiLSTM model effectively improves the performance of both Chinese and English NER.
The contributions of this paper can be summarized as follows:
• We propose a new residual BiLSTM model which introduces a new type of residual block to improve the feature extraction capability of BiLSTM.
• We apply our model to NER tasks on English and Chinese datasets without introducing any external knowledge. The experimental results demonstrate the effectiveness of our model.
• As a feature extractor, our model has the potential to be applied to other NLP tasks.
The rest of the paper consists of four main parts. In the first part, we introduce the existing state-of-the-art models for NER and some related works. In the second part, we elaborate the motivation and the architecture of our model. In the third part, we evaluate our model on a variety of English and Chinese datasets. In the fourth part, we conduct an ablation study to investigate the impact of each component in the residual block and of the number of layers.

A. NAMED ENTITY RECOGNITION
Recurrent neural networks such as LSTM have shown their advantages in a variety of NLP tasks. [6] proposed the BiLSTM-CRF model, which is widely used for NER in various languages. Most of the state-of-the-art models for NER are based on the BiLSTM-CRF model, and there are several approaches to improving its performance. The first approach is to introduce external knowledge or other existing models to the BiLSTM-CRF model, such as character, segmentation and context features. Both Chinese and English characters contain semantic information that can contribute to NER tasks. For English NER, [8] used BiLSTM to encode characters in OOV words. [21] and [22] combined LSTM with a CNN that encodes English characters. For Chinese NER, [7] introduced radical features as the input of a BiLSTM-CRF model instead of Chinese words. [23] proposed a lattice LSTM model that uses both characters and words as inputs, taking the results of lexicon-based segmentation as extra input for the LSTM. [24] can be treated as an improved version of the lattice LSTM model that integrates the segmentation results into characters as the final input of the LSTM. [25] combined CNNs and a self-attention mechanism with BiLSTM-CRF, using a global self-attention layer to capture information from characters and sentence contexts. Besides character features, segmentation features [9] and sentence features [10] can also be utilized as external knowledge. The second approach is to adopt a multi-task learning strategy to train the NER task and other related tasks together. [11] trained the NER task jointly with the Chinese word segmentation task. [12] introduced an adversarial transfer learning framework and a self-attention mechanism to learn NER tagging and Chinese word segmentation jointly. [26] incorporated coreferential relations to enrich CNN-BiLSTM-CRF. The third approach is to improve the representations of inputs. Language models such as BERT [20] and ELMo [27] achieved state-of-the-art results in a variety of NLP tasks. They can generate dynamic vectors according to different contexts rather than fixed vectors.

B. RESIDUAL NEURAL NETWORKS
[28] proposed the Highway Network, which was the first to train very deep end-to-end networks. ResNets [16] improved the Highway Network by adding identity mappings as skip connections between layers. DenseNets [17] made shortcut connections between each layer and its subsequent layers to build a deep residual network. Furthermore, [29] analyzed the impact of various usages of activation. In addition to the above works based on CNNs, several works have built residual networks based on LSTMs. [14] proposed stacked residual LSTM networks to generate paraphrases. [15] proposed residual LSTMs for distant speech recognition. The work most similar to ours is [13], which employed stacked residual LSTMs for NER.

III. OUR MODEL
In this section, we first introduce the motivation and the difference between the residual structure we propose and that of ResNets. Then we elaborate the architecture of our model in three sections. We take a 3-layer residual BiLSTM model as an example to illustrate the residual structure; note that the number of layers can be changed. Figure 3 shows the overall architecture of the 3-layer residual BiLSTMs with a CRF layer. The whole model consists of three main parts. The bottom part is an input layer and the top part is a CRF layer, as in most models based on BiLSTM-CRF. Our innovation is the residual BiLSTM blocks in the middle part.

A. COMPARISONS OF STRUCTURES BETWEEN RESNETS AND OUR MODEL
In this section, we compare the structure of our model with the model in [13], which uses the same structure as ResNets. Then we analyze the impact of the residual structure on LSTMs and show the motivation of our model. From the results of [13]-[15] we can see that applying the residual structure of ResNets to BiLSTM is not as effective as applying it to CNNs. The reason is that residual LSTMs are not only deep networks: each LSTM layer also carries information about long-term dependencies, which is the main difference from residual CNNs. Figure 1 and Figure 2 show two different structures of residual LSTMs. The structure of the network in Figure 1 is the same as that of ResNets, and Figure 2 shows the structure we propose. Let J be the loss function of the network in Figure 1, given in (1). Taking the first LSTM layer as an example, the derivative of J with respect to $c_t^1$ is given in (2). In (2) we can see that the term involving F yields a larger value than the items with more product terms. Thus, the value of (2) mainly depends on the last several items. As we know, F contains the addition of the direct outputs from each layer. If we use the result of (2) to update $c_t^1$ during back propagation, $c_t^1$ will contain much direct information from every layer. As a result, the long-term dependencies of all layers are mixed together and distributed across each layer. However, the cell state of each LSTM layer should be relatively independent of the other layers, which allows the residual LSTMs to learn more information. Thus, the motivation of our model is to make the long-term dependencies of each layer more distinct in order to improve the capability of residual LSTMs. We adopt a ''local residual'' strategy to build the structure. We suppose that two adjacent LSTM layers have a relatively strong correlation, and keep only the shortcut connections between them, as shown in Figure 2. The loss function J of our model and its derivative follow from the structure in Figure 2.
VOLUME 8, 2020

In this section we illustrate the approach of encoding the input for our model. We denote an input sentence as X. Note that we use only independent tokens as inputs and do not introduce any external knowledge.

C. RESIDUAL BiLSTM BLOCKS
In this section we illustrate the structure of the residual BiLSTM blocks we propose. As shown in Figure 3, a residual BiLSTM block consists of a BiLSTM layer, a shortcut connection and four additional layers: a fully-connected layer, layer normalization [30], ReLU [31] and dropout [32], which are used to prevent overfitting and the vanishing gradient problem. Inspired by ResNets and DenseNets, we design a BiLSTM-based residual block that follows the ''BN-ReLU-Weight'' order recommended by [29]. We take the l-th block as an example to illustrate the structure of the residual BiLSTM blocks.
In the l-th block, we denote the output of the (l−1)-th block as $r_t^{l-1}$ for position t. The output of the dropout layer, $d_t^l$, is obtained by applying the fully-connected layer with weight matrix $W_{fc}^l$ to $r_t^{l-1}$, followed by $G^l$, the composite function of layer normalization, ReLU and dropout. The vector $d_t^l$ is then used as the input to the BiLSTM in this block. In the basic LSTM function, $c_t^l$, $i_t^l$, $o_t^l$ and $f_t^l$ denote the cell state, input gate, output gate and forget gate respectively, and $W_p^l$, $b_p^l$, $\sigma$ and $\odot$ denote the weight matrices, biases, sigmoid function and element-wise product respectively. We denote the hidden state of the forward LSTM as $\overrightarrow{h}_t^l$ and that of the backward LSTM as $\overleftarrow{h}_t^l$, and concatenate the two hidden states into a single vector. The final output of the l-th residual block is then obtained from this vector and the shortcut connection. Thus, we introduce a new type of identity shortcut connection to BiLSTMs to build a residual BiLSTM model. To illustrate the structure of the residual BiLSTM blocks more clearly, the output of the l-th residual block can also be rewritten in terms of $H^l$, the composite function of all operations in the l-th residual block.
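As a rough illustration of the operations described above, the following Python sketch applies a fully-connected layer followed by layer normalization, ReLU and dropout to the previous block's output, and then runs one step of a standard LSTM cell. It is a minimal sketch under assumptions, not the authors' implementation: the function names (`pre_lstm_transform`, `lstm_step`), the omission of a bias in the fully-connected layer, the omission of a learned gain and bias in layer normalization, and the gate parameterization over the concatenation of the previous hidden state and the input are all illustrative choices. The shortcut connection and the backward direction are deliberately left out.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def layer_norm(x, eps=1e-5):
    """Normalize x to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(x):
    return [max(0.0, v) for v in x]


def dropout(x, rate, rng=None):
    """Inverted dropout; rate=0.0 (inference) returns x unchanged."""
    if rate == 0.0:
        return list(x)
    rng = rng or random
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def pre_lstm_transform(r_prev, W_fc, rate=0.0, rng=None):
    """Assumed form d = G(W_fc r_prev), with G = dropout(ReLU(layer_norm(.)))."""
    return dropout(relu(layer_norm(matvec(W_fc, r_prev))), rate, rng)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(d, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell.

    W[name] maps the concatenation [h_prev; d] to the pre-activation of
    gate `name`, where name is "i", "f", "o" or "g" (candidate cell state).
    """
    z = h_prev + d  # list concatenation [h_prev; d]

    def gate(name, act):
        return [act(v + bias) for v, bias in zip(matvec(W[name], z), b[name])]

    i = gate("i", sigmoid)    # input gate
    f = gate("f", sigmoid)    # forget gate
    o = gate("o", sigmoid)    # output gate
    g = gate("g", math.tanh)  # candidate cell state
    c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c
```

A bidirectional layer would run `lstm_step` once left to right and once right to left with separate parameters, then concatenate the two hidden states at each position.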

D. CRF LAYER
The CRF layer is usually used as the top layer in models for NER. Compared with LSTMs that predict output labels independently, a CRF can capture the dependency information across the output labels. For example, a label B-PER cannot follow B-PER. As in BiLSTM-CRF, we use a CRF layer together with the residual BiLSTMs to predict output labels. We denote an output label sequence as $y = \{y_1, y_2, \ldots, y_n\}$. The score of the sequence can be written as follows:
$$s(X, y) = \sum_{i} A_{y_i, y_{i+1}} + \sum_{i} P_{i, y_i}$$
where A denotes the transition score matrix and P denotes a score matrix of the label probabilities predicted by the residual BiLSTMs. Thus, the probability of the sequence y is:
$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$
where $Y_X$ denotes all possible label sequences. We use the Viterbi algorithm to find the label sequence with the highest score as the prediction, which can be written as follows:
$$y^{*} = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$$
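To make the scoring and decoding steps concrete, the following Python sketch computes the score of a label sequence from a transition matrix A and a per-position score matrix P, and finds the highest-scoring sequence with the Viterbi algorithm. It is a minimal sketch under assumptions: the function names are illustrative, labels are represented as integer indices, and transitions from a start label or to an end label are omitted, which may differ from the authors' formulation.

```python
def sequence_score(P, A, y):
    """Score of label sequence y: per-position scores plus transition scores.

    P[i][k] is the score of label k at position i.
    A[j][k] is the score of moving from label j to label k.
    """
    emit = sum(P[i][tag] for i, tag in enumerate(y))
    trans = sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return emit + trans


def viterbi(P, A):
    """Return (best_path, best_score) over all label sequences."""
    n, k = len(P), len(P[0])
    score = list(P[0])  # best score of any path ending in each label
    backptr = []        # backptr[i - 1][cur] = best previous label
    for i in range(1, n):
        new_score, ptr = [], []
        for cur in range(k):
            best_prev = max(range(k), key=lambda prev: score[prev] + A[prev][cur])
            new_score.append(score[best_prev] + A[best_prev][cur] + P[i][cur])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    last = max(range(k), key=lambda t: score[t])
    best_score = score[last]
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best_score


# Toy example with two labels; the transition 0 -> 1 is heavily penalized.
P = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
A = [[0.0, -10.0], [0.0, 0.0]]
best_path, best_score = viterbi(P, A)
```

In the toy example the per-position scores alone would favor label 1 in the middle position, but the transition penalty makes the decoder keep label 0 throughout, which is the kind of label dependency a CRF layer is meant to capture.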

A. DATASETS AND EXPERIMENTAL SETTINGS
We use five of the most widely used datasets, namely CoNLL-2003 [33], MSRA [34], Weibo [11], OntoNotes 4.0 [35] and OntoNotes 5.0 [36], to evaluate our model on English and Chinese NER tasks. The statistics of the sentences in the datasets are shown in Table 1. We apply the BIOES schema (B-begin, I-inside, O-outside, E-end, S-single) to all NER datasets, as the baselines did. For example, the entity ''Kurdistan Democratic Party'' with 3 words is labeled as ''B-ORG I-ORG E-ORG'', where ''ORG'' denotes the entity type organization. The entity ''Ramallah'' with a single word is labeled as ''S-LOC'', where ''LOC'' denotes the entity type location. Our model has nearly the same types of hyper-parameters as BiLSTM-CRF, which makes it much simpler than previous sophisticated models.
In the experiments, we adopt pre-trained English word vectors published by GloVe [19] and Chinese character vectors published by [37] as the fixed input for all datasets. Furthermore, we utilize BERT as the dynamic input for CoNLL-2003 and MSRA to demonstrate the robustness of our model. The hyper-parameters for fixed input and BERT are shown in Table 2 and Table 3 respectively. We use the Adam optimizer [38] with a gradient clipping of 5.0. Compared with the hyper-parameters in Table 3, we increase the batch size and LSTM hidden size in Table 2: because our model has fewer parameters than BERT, we can increase these values to accelerate training.
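The BIOES labeling described above can be sketched in a few lines of Python. This is an illustrative helper, not part of the authors' pipeline: the function name `spans_to_bioes`, the span format `(start, end, type)` with an inclusive end index, and the example sentence around the two entities are all assumptions made for the example.

```python
def spans_to_bioes(n_tokens, spans):
    """Convert entity spans to BIOES tags.

    spans is a list of (start, end, entity_type) with inclusive token indices.
    """
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        if start == end:
            tags[start] = f"S-{etype}"      # single-token entity
        else:
            tags[start] = f"B-{etype}"      # first token
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"      # inner tokens
            tags[end] = f"E-{etype}"        # last token
    return tags


# Hypothetical sentence built around the two entities from the text.
tokens = ["The", "Kurdistan", "Democratic", "Party", "met", "in", "Ramallah"]
tags = spans_to_bioes(len(tokens), [(1, 3, "ORG"), (6, 6, "LOC")])
```

Here `tags` labels ''Kurdistan Democratic Party'' as B-ORG I-ORG E-ORG and ''Ramallah'' as S-LOC, with every other token labeled O.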
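The text does not say whether the gradient clipping of 5.0 is applied to the global gradient norm or to each gradient value. The sketch below assumes clipping by global L2 norm, a common choice; the function name `clip_by_global_norm` and the representation of gradients as a list of flat lists are illustrative assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients so that their joint L2 norm is at most max_norm.

    grads is a list of flat lists, one per parameter tensor.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return [list(vec) for vec in grads]
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads]


# A gradient with norm 10 is scaled down to norm 5; the direction is kept.
clipped = clip_by_global_norm([[6.0, 8.0]])
```

Clipping by norm keeps the update direction unchanged and only limits its length, which is why it is often preferred for recurrent networks with occasional large gradients.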

B. RESULTS ON ENGLISH NER DATASETS
In this section, we evaluate our model on the English NER datasets CoNLL-2003 and OntoNotes 5.0. We take the approach proposed by [8] to generate input vectors for English NER, where the inputs are composed of pre-trained word vectors from GloVe and character vectors learned by a BiLSTM network. The results are shown in Table 4 and Table 5. Our model achieves an F1-score of 92.22% on CoNLL-2003 and 89.65% on OntoNotes 5.0, outperforming the baselines on both datasets. Our model also outperforms the residual LSTM model in [13] significantly. Meanwhile, we can observe that the stacked BiLSTM model performs worse than [13]. This demonstrates that shortcut connections can improve the performance of stacked BiLSTM, and that the residual structure in our model is more effective and reasonable than that of [13], which uses the same structure as ResNets.
Since most NLP tasks can benefit from BERT, we also adopt BERT to generate dynamic input vectors for our model on the CoNLL-2003 dataset. We use the official BERT tools offered by Google, which adopt the AdamW [39] algorithm for optimization. Because our model is more complex to fine-tune with BERT, we use the method proposed by [40], which fine-tunes a complex model with BERT in two steps. Table 6 shows the F1-scores on CoNLL-2003. The baselines also adopt BERT or ELMo as the input. We can see that our model works with BERT more effectively than the baselines, which again shows the effectiveness and robustness of our model.
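For readers unfamiliar with how F1-scores are usually computed for NER, the following Python sketch extracts entity spans from BIOES tags and computes an exact-match, entity-level F1. The paper does not state which scorer was used, so this is an assumption based on common practice; the function names `bioes_to_spans` and `entity_f1` and the toy tag sequences are illustrative.

```python
def bioes_to_spans(tags):
    """Extract a set of (start, end, type) spans from a BIOES tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.add((i, i, label))
            start = None
        elif prefix == "B":
            start, etype = i, label
        elif prefix == "I" and start is not None and label == etype:
            continue  # still inside the current entity
        elif prefix == "E" and start is not None and label == etype:
            spans.add((start, i, label))
            start = None
        else:
            start = None  # "O" or an ill-formed tag closes any open span
    return spans


def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1 between gold and predicted tag sequences."""
    gold, pred = bioes_to_spans(gold_tags), bioes_to_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the predicted ORG entity has a wrong boundary, the LOC is right.
gold = ["B-ORG", "I-ORG", "E-ORG", "O", "S-LOC"]
pred = ["B-ORG", "E-ORG", "O", "O", "S-LOC"]
score = entity_f1(gold, pred)
```

Under exact matching, an entity with a wrong boundary counts as both a false positive and a false negative, so one boundary error among two entities gives precision, recall and F1 of 0.5 in the toy example.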

C. RESULTS ON CHINESE NER DATASETS
In this section, we evaluate our model on the Chinese NER datasets MSRA, Weibo and OntoNotes 4.0. We use the pre-trained Chinese character embeddings proposed by [37] for all datasets. The results on the 3 datasets are shown in Table 7 to Table 9 respectively. Our model achieves an F1-score of 92.17% on MSRA, an improvement of 1.67% in F1-score over BiLSTM-CRF. It also outperforms the baselines on OntoNotes 4.0. For the Weibo dataset, the F1-score of our model is slightly worse than that of [25]. The reason is that the Weibo dataset is relatively small and our model uses only characters as inputs, whereas [25] utilizes sentences as external information. Nonetheless, the performance of our model is still better than that of most baselines. Meanwhile, we can observe that the F1-scores of the model in [13] on the 3 Chinese datasets are all lower than those of our model, which is consistent with the results for English NER.
For Chinese NER, we also adopt BERT and BERT-based language models to generate dynamic input vectors and evaluate our model on the MSRA dataset. We use Chinese BERT-Base, BERT-wwm [41] and ERNIE 1.0 Base [42] to generate Chinese character vectors. We use the same tool and optimization method as in the previous section, and the fine-tuning learning rates are 3e-5, 4e-5 and 5e-5 for BERT-Base, BERT-wwm and ERNIE respectively. The results are shown in Table 10. We can see that the results of using BERT or BERT-based models are much better than those of the models [23]-[25] that use external knowledge, which again demonstrates that our model can benefit from BERT and achieves better performance than the baselines. It also shows that a good pre-trained model can significantly improve NER performance.

A. IMPACT OF RESIDUAL BLOCK STRUCTURE
In this section, we perform Chinese NER on the MSRA dataset with several different types of residual block structures. We use the same hyper-parameters shown in Table 2. Figure 4 shows four different representative structures of the residual block. Type A builds shortcut connections in a way that is similar to ResNets, where the element-wise addition is before the identity mapping. Note that Type A has the same structure as the model in [13]. Type B is like Type A, except that it moves the element-wise addition before the BiLSTM layer. Type C can be treated as a simplified version of our model: it removes the fully-connected layer, layer normalization, ReLU and dropout from the residual block. Type D builds residual LSTMs for the forward LSTMs and backward LSTMs separately. Furthermore, we conduct further experiments by removing the fully-connected layer and layer normalization. We choose Chinese NER because the model for English NER needs an extra BiLSTM network to encode characters, which may influence the results of the experiment.
From Table 11 we can observe that the performance of Type C is better than that of Type A and Type B, which demonstrates that it is more reasonable and effective to build a residual BiLSTM model with the structure we propose. Meanwhile, it also shows that the ''BN-ReLU-Weight'' order proposed by [29] for the block also works in residual LSTMs.

B. IMPACT OF NUMBER OF RESIDUAL BiLSTM LAYERS
In this section, we repeat the Chinese NER tasks on the MSRA dataset while changing the number of residual BiLSTM blocks. As the baseline, we choose the traditional stacked BiLSTM-CRF model, which has no shortcut connections and directly uses the output of the previous LSTM as the input of the next LSTM. The results are shown in Table 12. We can observe that increasing the number of layers in the stacked BiLSTM-CRF improves performance only slightly compared to the BiLSTM-CRF model. For our model, the highest F1-score is achieved by the 4-layer residual BiLSTM-CRF. The F1-score begins to drop when the number of layers is more than 4, which is consistent with the results of [13]. The reason might be that an LSTM kernel has a more complex structure and many more parameters than a CNN kernel, so a deep LSTM network with multiple layers overfits more easily. Hence, it is relatively difficult to train it effectively and to learn high-quality long-term dependency information for each LSTM layer. Nevertheless, our model with 4 layers still outperforms the stacked BiLSTM-CRF models significantly.

C. ABLATION STUDY
In this section, we investigate the effectiveness of each layer in the residual BiLSTM blocks through an ablation study on MSRA. The results are shown in Table 13. Layer normalization makes the largest contribution. Meanwhile, ReLU and dropout, which are used to prevent overfitting, both contribute to our model, and the dense layer we introduce improves the performance slightly further. We conduct two extra experiments in which we replace ReLU with ELU [57] and LSTM with GRU [58], but neither change contributes much to the F1-score. In particular, GRU is not suitable for the residual BiLSTM units, because the hidden state and cell state of a GRU are the same state, which easily breaks the long-term dependencies of each layer when a GRU is used to build a residual network. By contrast, LSTM keeps the hidden state and the cell state as two separate states, which makes it more suitable for the residual structure.

D. CASE STUDY
In this section, we conduct a case study on the stacked BiLSTM, the residual BiLSTM and our model, each with a CRF layer. The number of layers is set to 3. The results are shown in Table 14.
We can see that our model predicts the entities in the two sentences correctly. By contrast, the stacked BiLSTM model predicts neither the boundary in the first sentence nor the entity type in the second sentence correctly. The residual BiLSTM model in [13] only predicts the entity type correctly. The results show that our model can capture richer semantic information from texts for NER.

VI. CONCLUSION AND FUTURE WORK
We present a novel residual BiLSTM model for NER tasks. We introduce a new type of residual block based on BiLSTMs. Unlike most other state-of-the-art models, which introduce external knowledge or multi-task learning, we focus on innovating the structure of the residual network based on BiLSTMs. We evaluate our model on Chinese and English NER datasets. The experimental results show that our model effectively improves the performance of both Chinese and English NER without introducing external knowledge. Meanwhile, our model performs well with both fixed and dynamic inputs, which demonstrates its robustness. Furthermore, we conduct experiments with several different structures of residual blocks. The results demonstrate the effectiveness of the residual block structure we propose.
In the future, we will try to combine our model with an attention mechanism. For example, we can use attention layers to control the weight of each layer. We will also try to introduce external knowledge such as contextual information as extra input to enhance our model. In addition, we will apply our model to other NLP tasks. For example, our model can be used instead of BiLSTM to encode sentences for relation extraction and to extract text features for text classification.
GANG YANG received the bachelor's degree in information engineering from Xi'an Jiaotong University, Xi'an, in 2011, where he is currently pursuing the Ph.D. degree in computer technology. His main research directions are natural language processing and deep learning.
HONGZHE XU received the Ph.D. degree in mechanical engineering, in 2004. From 1999 to 2012, she was a Vice Professor with the Research School of Computing. Since 2012, she has been a Professor with the Research School of Computing, Xi'an Jiaotong University, Xi'an. She is the author of ten books, more than 50 articles, and more than five inventions. Her research interests include intelligent platform in cloud environment, object-oriented big data analysis, application of data mining, and medical big data analysis.