LSTM-CRF Neural Network With Gated Self Attention for Chinese NER

Named entity recognition (NER) is an essential part of natural language processing. The Chinese NER task differs from NER in many European languages because Chinese text lacks natural word delimiters. Therefore, Chinese Word Segmentation (CWS) is usually regarded as the first step of Chinese NER. However, word-based NER models that rely on CWS are vulnerable to incorrectly segmented entity boundaries and to out-of-vocabulary (OOV) words. In this paper, we propose a novel character-based Gated Convolutional Recurrent neural network with Attention, called GCRA, for the Chinese NER task. In particular, we introduce a hybrid convolutional neural network with a gating filter mechanism to capture local context information and a highway neural network after the LSTM to select characters of interest. An additional gated self-attention mechanism is used to capture global dependencies from multiple subspaces and between arbitrary characters. We evaluate the proposed model on three datasets: SIGHAN bakeoff 2006 MSRA, Chinese Resume, and Literature NER. The experimental results show that our model outperforms other state-of-the-art models without relying on external resources such as lexicons or on multi-task joint training.


I. INTRODUCTION
Named entity recognition (NER) plays a critical role in the field of natural language processing (NLP). The task aims to extract and categorize entities with specific meanings in unstructured text, such as person (PER), location (LOC), and organization (ORG). NER is one of the most widely used key technologies in information extraction. It is also a foundation for NLP tasks such as relation extraction [1], event extraction [2], and question answering [3]. Therefore, in-depth research on the NER task has high practical value.
Compared with the NER of Indo-European languages represented by English, Chinese NER task is more complicated.
The associate editor coordinating the review of this manuscript and approving it for publication was Qichun Zhang.
English has apparent inflections (singular, plural, tense, etc.), but Chinese lacks them. Besides, Chinese has fuzzy word boundaries, complex entity structures, and varied forms of expression, which make the Chinese NER task more difficult. Therefore, the correct identification of named entities in Chinese text is of great significance for subsequent Chinese information processing tasks.
At present, a mature approach to the NER task is to model it as a sequence labeling problem. The standard state-of-the-art models for English NER are BiLSTM-CRF models, which can effectively capture context feature information [4]-[7]. However, there are no apparent delimiters between words in Chinese sentences, so word segmentation is usually performed before the sequence is fed into a word-based model. Each segmented word is mapped to a fixed-length word representation. A word-level sequence labeling model is then applied, in the same way as for English NER.
However, entity boundaries are tied to the segmentation results. Segmentation errors lead to incorrect NER labels, which limits the performance of the subsequent NER task. Moreover, because of the large number of Chinese words, many named entities are treated as OOV words in word-based models. Besides, after word segmentation, the parameter size of the embedding layer increases significantly, which results in data sparsity problems and leads to overfitting. Take "南京市长江大桥" (Nanjing Yangtze River Bridge) as a typical example. Due to Chinese linguistic features, word boundaries are often ambiguous, and the same sentence may be segmented in distinct ways. Depending on the segmentation granularity, the sequence "南京市长江大桥" can be divided into "南京市 (Nanjing City)/长江大桥 (Yangtze River Bridge)" or "南京 (Nanjing City)/市长 (Mayor)/江大桥 (Daqiao Jiang)". As shown in Table 1, different segmentations yield completely different boundary information, which leads to two distinct sequence labeling outcomes. Word-based models cannot judge which segmentation is right, which results in incorrect entity recognition.
Recently, studies have shown that character-level representation can avoid many of the problems listed above, and researchers have found that word-based models underperform character-based models in deep learning-based Chinese NER [8]-[11]. However, due to the polysemy and polymorphism of Chinese characters, purely character-based NER focuses only on per-character information and loses the latent word and word sequence information. It is therefore worth exploring how to effectively integrate segmentation information into character-based models for better semantic understanding.
To overcome the shortcomings of traditional character-based models, we propose a new neural network, called GCRA, to improve the performance of the Chinese NER task. Firstly, in the embedding layer, we concatenate the segmentation label vector to the character embeddings as a soft feature, which uses word sequence information indirectly. The model can therefore avoid the error propagation caused by word segmentation errors while still benefiting from both character and word information. Next, the character representation is fed into the hybrid gated convolutional layer to carry out detailed feature extraction and implicitly connect local feature information. Further, the highway neural network [12] is utilized to refine the hidden representation of the BiLSTM. Finally, the self-attention layer is employed to capture context-related information in multiple subspaces, which helps the model better understand the sentence structure and ultimately improves performance.
In this paper, we compare our model with state-of-the-art methods on three datasets: SIGHAN bakeoff 2006 MSRA, Chinese Resume, and Literature NER. The three datasets come from the news domain, the social media domain, and the literature domain, respectively.
The main contributions of this paper can be summarized as follows:
• We propose a novel neural model called the GCRA model for the Chinese NER task. The model can not only exploit local context features effectively but also capture the global dependencies of the whole character sequence.
• We design a character-level hybrid gated convolutional neural network which combines the dilated gated convolution with the standard gated convolution. It can effectively connect local feature information and avoid gradient vanishing during training.
• We conduct our experiments on various Chinese NER datasets in different domains. The experimental results demonstrate that our model outperforms other state-of-the-art models without using any external lexicon resources or multi-task joint training.
The remainder of the paper is organized as follows. Section II reviews the related work on Chinese NER. Section III presents the main idea of the proposed GCRA model. Section IV presents the experimental results and analysis. Section V concludes our work.

II. RELATED WORK
Significant research has been devoted to the NER task. Early NER systems were mainly based on rules and dictionaries, which suffer from poor extensibility and an inability to find OOV words. With the advent of statistical machine learning, the NER task was abstracted into a sequence labeling problem. Traditional sequence labeling models for NER extensively used Hidden Markov Models (HMM) [13] and Conditional Random Fields (CRF) [14]. However, all these models rely heavily on feature engineering and external resources.
In recent years, deep learning has provided a new approach to natural language processing problems and has attracted considerable attention. Given the shortcomings of feature engineering, deep learning is a useful tool for automatic learning, distributed representation of words, and deep feature extraction. Deep neural networks replace the manual feature engineering of traditional machine learning. For English NER, neural network models have demonstrated excellent performance in identifying entities. Collobert et al. [15] proposed a CNN-CRF model to automatically extract deep features for sequence labeling tasks. Huang et al. [16] proposed a bidirectional LSTM-CRF network structure for sequence tagging. However, these models concatenate hand-crafted spelling features and context features with word embeddings as the input vectors to the neural network. Lample et al. [4] presented a bidirectional LSTM-CRF architecture that combines word-level features with character-level features, applying another LSTM layer to generate the character-level features. Similarly, Ma and Hovy [5] used a character Convolutional Neural Network (CNN) to extract English character-level features on top of the LSTM-CRF network structure. Chiu and Nichols [6] reported a hybrid of bidirectional LSTM and CNNs, which automatically detects word-level and character-level features.
Chinese NER research developed relatively late, and it is more difficult because of the particularity of Chinese word information. Some researchers also treat the Chinese NER task as a character sequence labeling problem and take advantage of external data to compensate for insufficient annotated corpus resources. In particular, Collobert et al. [15], Passos et al. [17], Huang et al. [16], and Luo et al. [18] leveraged lexicon features to improve performance. Peters et al. [19] pre-trained a neural bidirectional language model to augment word representations by introducing character-level knowledge.
Existing research indicates that character-based methods are an empirically better choice than word-based methods [8]-[11]. However, character-based NER models carry only a limited amount of character information and cannot fully exploit latent word and word sequence information. To solve this problem, some researchers have studied how to better leverage word-level information for the Chinese NER task. Some proposed to use segmentation information as soft features for NER [20], [21]. Peng and Dredze [22] and Cao et al. [23] designed multi-task learning models that jointly learn Chinese NER tagging and Chinese word segmentation. Zhang and Yang [24] integrated latent word-level information into a character-based LSTM-CRF model by identifying candidate lexicon words from the sentence using a lattice-structured LSTM model. Zhang et al. [25] investigated a dynamic meta-embeddings method and applied it to the Chinese NER task. They utilized the attention mechanism to combine features of both character and word granularity in the embedding layer. Zhu and Wang [26] proposed a Convolutional Attention Network model, which used the word segmentation vector as soft features to improve Chinese NER performance. Their work avoided any dependency on external word embeddings and lexicon resources.
In our work, we enhance the input representation by concatenating the segmentation label vector directly to the character embeddings. Besides, we design the hybrid gated convolution layer and the gated self-attention network, which can effectively alleviate gradient vanishing during training and capture detailed deep features. Experiments on several datasets show that our proposed GCRA model can significantly improve the performance of the Chinese NER task.

III. MODELS
As with most named entity recognition methods, our work treats the NER task as a sequence labeling problem. To eliminate the effects of word segmentation error propagation, we utilize a character-level BiLSTM-CRF as our basic structure and apply the BIOES tagging scheme for the Chinese NER task. The overall architecture of our proposed model is shown in Figure 1. The model mainly consists of five layers: the embedding layer, the hybrid gated convolution layer, the highway BiLSTM layer, the gated self-attention layer, and the CRF decoding layer. Each part of our proposed model is presented in detail in the following sections.

A. EMBEDDING LAYER
Prior work shows that applying segmentation as soft features for character-based Chinese NER models can improve performance [20], [21]. In this work, we concatenate the segmentation label vector to the character embedding to augment the input representation. The word segmentation information is represented by the BIOES scheme. Formally, in the Chinese NER task, we denote an input sentence as $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ represents the $i$th character in the sentence $X$. Then, we map discrete characters into distributed feature representations in the embedding layer. The input representation for each character is embedded in distributional space as $x_i^c$:

$$x_i^c = e^c(x_i) \oplus e^s(\mathrm{seg}(x_i))$$

where $e^c$ and $e^s$ denote a pre-trained character embedding lookup table and a BIOES scheme segmentation label embedding lookup table, respectively, and $\oplus$ is the connection (concatenation) operator. The term $\mathrm{seg}(x_i)$ represents the segmentation label of character $x_i$, which is given by a word segmenter.
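The concatenation of character and segmentation-label embeddings can be illustrated with a minimal NumPy sketch. The vocabularies, the embedding sizes, and the random lookup tables below are hypothetical placeholders chosen for illustration; they are not the pre-trained vectors or dimensions used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabularies; a real system would load pre-trained character vectors.
char_vocab = {"<unk>": 0, "南": 1, "京": 2, "市": 3}
seg_vocab = {"B": 0, "I": 1, "O": 2, "E": 3, "S": 4}

char_dim, seg_dim = 6, 3  # illustrative sizes, not the paper's settings
e_c = rng.normal(size=(len(char_vocab), char_dim))  # character lookup table e^c
e_s = rng.normal(size=(len(seg_vocab), seg_dim))    # segmentation-label lookup table e^s


def embed(chars, seg_labels):
    """Return e^c(x_i) concatenated with e^s(seg(x_i)) for every character."""
    char_ids = [char_vocab.get(c, char_vocab["<unk>"]) for c in chars]
    seg_ids = [seg_vocab[s] for s in seg_labels]
    return np.concatenate([e_c[char_ids], e_s[seg_ids]], axis=-1)


# One three-character word, so the segmenter labels are B, I, E.
x = embed(["南", "京", "市"], ["B", "I", "E"])
print(x.shape)  # (3, 9)
```

Each row of `x` is the input representation of one character: the first `char_dim` entries come from the character table and the remaining `seg_dim` entries from the segmentation-label table.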

B. HYBRID GATED CONVOLUTION LAYER
We use hybrid gated convolutions to connect local feature information and extract context information. As shown in Figure 2, the layer has two separate blocks. The left block is the dilated gated convolution block, which consists of two layers of dilated convolution and a gated filtering mechanism similar to the highway network. The right block is the normal gated convolution block, which has a standard convolutional layer with gated linear units [27]. We concatenate the two separate outputs as the final output of the hybrid gated convolution layer.

1) NORMAL GATED CONVOLUTION
Dauphin et al. [27] have shown that the gating mechanism can improve the performance of language modeling tasks. The gated convolutional network adds a gating switch to control the information flow. These gates can alleviate gradient vanishing during training, since one of the convolutions has no activation function. For the embedding output $X$, the gated convolution layer output can be expressed as:

$$G_{n}(X) = (X * W + b) \otimes \sigma(X * V + c)$$

where $*$ denotes the convolution operator, and $W$ and $V$ with $b$ and $c$ denote kernels and biases, respectively, which are parameters to be learned. $\sigma$ represents the sigmoid function, and $\otimes$ means the element-wise product between matrices.

Given a 1-D convolutional filter $w = \{w_{-r}, w_{-r+1}, \ldots, w_{r}\}$ of window size $l = 2r + 1$ and dilation $d$, the convolutional operator applied to each token $x_t$ with output $c_t$ is defined as:

$$c_t = \sum_{k=-r}^{r} w_k \, x_{t + k \cdot d}$$

where dilation $d = 1$ is equivalent to a normal convolution. Wu et al. [29] proposed a gated linear dilated residual network for the reading comprehension task, which mainly consists of dilated convolutions and gated linear units with residual connections. For our dilated gated convolution block, we also use dilated convolution instead of normal convolution to extend the receptive field, but our gated filtering mechanism is more similar to the highway network. We combine the residual connection and the gated convolutional neural network to achieve selective multi-channel transmission of information. We use $C(X)$ to represent the output of the dilated convolution. The final dilated gated convolution block output can be expressed as:

$$G_{d}(X) = \big(1 - \sigma(C_2(X))\big) \otimes X + \sigma(C_2(X)) \otimes C_1(X)$$

where $X$ is the input of this layer, and $C_1(X)$ and $C_2(X)$ are the outputs of two different dilated convolutions. $\sigma$ represents the sigmoid function, and $\otimes$ denotes the element-wise product between matrices. After comparing experimental results with different dilation rates, we use two-layer dilated gated convolutions with dilation rates 1 and 2. So the output of the hybrid gated convolution is as follows:

$$G(X) = G_{d}(X) \oplus G_{n}(X)$$

where $\oplus$ represents the connection operator.
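The two gated blocks can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the highway-style form of the dilated gate, the zero "same" padding, the bias added after the dilated convolution, and all names and sizes are assumptions, and only a single dilated layer is shown rather than the two stacked layers with rates 1 and 2.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def dilated_conv1d(x, w, b, d=1):
    """Zero-padded 1-D dilated convolution that preserves the sequence length.

    x: (n, d_in) sequence; w: (l, d_in, d_out) kernel with odd window l = 2r + 1;
    b: (d_out,) bias; d: dilation. Computes c_t = sum_k w_k x_{t + k*d} + b.
    """
    n = x.shape[0]
    r = w.shape[0] // 2
    pad = r * d
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zeros outside the sequence
    out = np.zeros((n, w.shape[2]))
    for k in range(-r, r + 1):
        # Row t + k*d of x sits at row t + k*d + pad of the padded array.
        out += xp[pad + k * d: pad + k * d + n] @ w[k + r]
    return out + b


def normal_gated_conv(x, W, b, V, c):
    """Gated linear unit: (X*W + b) multiplied element-wise by sigmoid(X*V + c)."""
    return dilated_conv1d(x, W, b) * sigmoid(dilated_conv1d(x, V, c))


def dilated_gated_conv(x, W1, b1, W2, b2, d):
    """Assumed highway-style gate: (1 - g) * X + g * C1(X), with g = sigmoid(C2(X))."""
    g = sigmoid(dilated_conv1d(x, W2, b2, d))
    return (1.0 - g) * x + g * dilated_conv1d(x, W1, b1, d)


def init_kernel(rng, l, dim):
    return rng.normal(scale=0.1, size=(l, dim, dim)), np.zeros(dim)


rng = np.random.default_rng(0)
n, dim, l = 7, 4, 3  # toy sequence length, channel size, and window size
x = rng.normal(size=(n, dim))
W, b = init_kernel(rng, l, dim)
V, c = init_kernel(rng, l, dim)
W1, b1 = init_kernel(rng, l, dim)
W2, b2 = init_kernel(rng, l, dim)

# Hybrid output: concatenate the dilated gated block and the normal gated block.
hybrid = np.concatenate(
    [dilated_gated_conv(x, W1, b1, W2, b2, d=2), normal_gated_conv(x, W, b, V, c)],
    axis=-1,
)
print(hybrid.shape)  # (7, 8)
```

Because the carry path `(1 - g) * x` passes the input through without any activation, the dilated block requires the convolution to keep the channel size unchanged, which is why the toy kernels map `dim` channels to `dim` channels.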

C. HIGHWAY-LSTM LAYER
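Hochreiter and Schmidhuber [30] proposed the LSTM to solve the gradient vanishing and exploding problems of traditional recurrent neural networks. Its key components are an adaptive gating mechanism and a memory cell. A typical LSTM cell structure is depicted in Figure 3. The cell contains an input gate $i_t$, an output gate $o_t$, which controls the signal strength flowing to the next unit, and a forget gate $f_t$, which controls how much of the previous cell state is forgotten. We take the output of the CNN layer, $g = [g_1, g_2, \ldots, g_n]$, as input. Then, the LSTM unit at step $t$ can be expressed as:

$$
\begin{aligned}
i_t &= \sigma(W_i [h_{t-1}; g_t] + b_i) \\
f_t &= \sigma(W_f [h_{t-1}; g_t] + b_f) \\
o_t &= \sigma(W_o [h_{t-1}; g_t] + b_o) \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}; g_t] + b_c) \\
c_t &= f_t \otimes c_{t-1} + i_t \otimes \tilde{c}_t \\
h_t &= o_t \otimes \tanh(c_t)
\end{aligned}
$$

where $\sigma$ is the element-wise sigmoid function, $\otimes$ is the element-wise product, and the $W$ and $b$ terms are trainable weight matrices and biases.

The unidirectional LSTM only retains information from the past sequence of vectors, because the hidden state is passed from front to back. To leverage both past and future sequence information, we use a bidirectional LSTM to capture the context features of the sentence. So the hidden state of the BiLSTM is as follows:

$$h_t = \overrightarrow{h_t} \oplus \overleftarrow{h_t}$$

where $\overrightarrow{h_t} \in \mathbb{R}^{d_h}$ and $\overleftarrow{h_t} \in \mathbb{R}^{d_h}$ are the hidden states of the forward and backward LSTM at position $t$, respectively, and $\oplus$ represents the connection operator.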
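A bidirectional LSTM pass can be sketched in NumPy as below. The sketch uses the standard LSTM formulation with one stacked weight matrix per direction; that stacking, the zero initial states, and all names and sizes are assumptions made for illustration and may differ from the exact parametrization used by the authors.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(g_t, h_prev, c_prev, W, b):
    """One LSTM step. W: (4*d_h, d_h + d_in) stacks the i, f, o gates and the candidate."""
    d_h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, g_t]) + b
    i = sigmoid(z[:d_h])                 # input gate
    f = sigmoid(z[d_h:2 * d_h])          # forget gate
    o = sigmoid(z[2 * d_h:3 * d_h])      # output gate
    c_tilde = np.tanh(z[3 * d_h:])       # candidate cell state
    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)
    return h, c


def run_lstm(G, W, b, d_h, reverse=False):
    """Run the LSTM over the rows of G, optionally from the last position to the first."""
    steps = G[::-1] if reverse else G
    h, c = np.zeros(d_h), np.zeros(d_h)
    outputs = []
    for g_t in steps:
        h, c = lstm_step(g_t, h, c, W, b)
        outputs.append(h)
    if reverse:
        outputs = outputs[::-1]          # realign outputs with the original positions
    return np.stack(outputs)


def bilstm(G, forward_params, backward_params, d_h):
    """Concatenate forward and backward hidden states at every position."""
    fwd = run_lstm(G, *forward_params, d_h)
    bwd = run_lstm(G, *backward_params, d_h, reverse=True)
    return np.concatenate([fwd, bwd], axis=-1)


rng = np.random.default_rng(0)
n, d_in, d_h = 5, 3, 4                   # toy sizes
G = rng.normal(size=(n, d_in))
fw = (rng.normal(scale=0.1, size=(4 * d_h, d_h + d_in)), np.zeros(4 * d_h))
bw = (rng.normal(scale=0.1, size=(4 * d_h, d_h + d_in)), np.zeros(4 * d_h))
H = bilstm(G, fw, bw, d_h)
print(H.shape)  # (5, 8)
```

Each row of `H` has dimension `2 * d_h`, matching the concatenated hidden state at one position.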
The highway network allows information to pass through the layers of a deep neural network with little attenuation, which effectively alleviates gradient problems. In this paper, we use the highway network to control the information flow with an adaptive gate network. The overall architecture of the highway-LSTM is illustrated in Figure 4. The output of the highway-LSTM layer is calculated as follows:

$$
\begin{aligned}
tg &= \sigma(W_g h + b_g) \\
z &= tg \otimes f(W_h h + b_h) + (1 - tg) \otimes h
\end{aligned}
$$

where $\sigma$ is the element-wise sigmoid function, $\otimes$ is the element-wise product, and $f$ is the rectified linear unit. $W_g$, $W_h$ and $b_g$, $b_h$ represent the weight matrices and bias vectors, respectively. $tg$ denotes the transform gate, which controls how much information is transformed and passed to the next layer, and $(1 - tg)$ is called the carry gate, which allows the input to be passed to the next layer directly. Therefore, the highway network input $h$ and output $z$ are required to have the same shape.
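The highway computation described above (a transform gate, a ReLU transform, and a carry gate) can be written as a short NumPy sketch. The sizes and random weights are hypothetical; the gate bias of -1 follows the initialization reported in the experimental settings.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def highway(h, W_g, b_g, W_h, b_h):
    """Mix a ReLU-transformed copy of h with h itself using a sigmoid transform gate."""
    tg = sigmoid(h @ W_g + b_g)                    # transform gate
    transformed = np.maximum(0.0, h @ W_h + b_h)   # f is the rectified linear unit
    return tg * transformed + (1.0 - tg) * h       # (1 - tg) is the carry gate


rng = np.random.default_rng(0)
n, dim = 5, 8                                      # toy sequence length and hidden size
h = rng.normal(size=(n, dim))
W_g = rng.normal(scale=0.1, size=(dim, dim))
W_h = rng.normal(scale=0.1, size=(dim, dim))
b_g = np.full(dim, -1.0)                           # negative bias favours carrying the input
b_h = np.zeros(dim)

z = highway(h, W_g, b_g, W_h, b_h)
print(z.shape)  # (5, 8)
```

The weight matrices are square because the carry path adds `h` to the output directly, so the input and output shapes must match.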

D. GATED SELF-ATTENTION LAYER
Self-attention is an attention mechanism that relates different positions of a single sequence to compute an interactive representation of the sequence. Recent evidence suggests that it performs well on a variety of tasks, such as machine translation [31], semantic role labeling [32], and relation extraction [33]. Inspired by these works, we utilize the multi-head self-attention mechanism to capture global sequence information from multiple subspaces and exploit the inner features contained in the text. Attention is essentially a function that maps a query and a set of key-value pairs to an output. For self-attention, we use the highway-LSTM output $Z = [z_1, z_2, \ldots, z_n]$ to initialize $Q$, $K$, and $V$. The scaled dot-product attention can be calculated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

where $Q \in \mathbb{R}^{n \times 2d_h}$, $K \in \mathbb{R}^{n \times 2d_h}$, and $V \in \mathbb{R}^{n \times 2d_h}$ denote the query matrix, key matrix, and value matrix, respectively. $d_k$ is the dimension of the hidden units of the BiLSTM, and $\sqrt{d_k}$ plays a scaling role, keeping the inner product of $Q$ and $K$ from becoming too large.
The essence of multi-head attention is to perform multiple self-attention calculations, which lets the model capture more features from different representation subspaces. The multi-head attention mechanism linearly projects $Q$, $K$, and $V$ through parameter matrices that are not shared across heads, and then performs the scaled dot-product attention. This process is repeated $m$ times in parallel, and the results are concatenated and linearly projected to obtain the new representation. The final result $S$ can be expressed as:

$$
\begin{aligned}
\mathrm{head}_i &= \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \\
S &= (\mathrm{head}_1 \oplus \mathrm{head}_2 \oplus \cdots \oplus \mathrm{head}_m) W^{o}
\end{aligned}
$$

where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{o}$ are the projection matrices, and $d_k = 2d_h/m$.

The tag of each position in the sentence depends on the context to a different degree. We therefore introduce a gating mechanism to generate a representation that combines context features and self-features. The gated output representation is computed from $Z$ and $S$ using the sigmoid function $\sigma$, the element-wise product $\otimes$, and the connection operator $\oplus$. Finally, we apply a fully connected layer to compute the probability scoring matrix. For the gated output $u_t$ at position $t$, it can be described as:

$$o_t = W_p u_t + b_p$$

where $W_p \in \mathbb{R}^{|k| \times 4d_h}$ and $b_p \in \mathbb{R}^{|k|}$ are the trainable parameters, $|k|$ denotes the number of output labels, and $n$ is the length of the input sequence. $O = [o_1, o_2, \ldots, o_n]$ is the output probability matrix, whose size is $n \times |k|$.
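The multi-head self-attention step can be sketched in NumPy as follows. This is a minimal illustration with hypothetical sizes and random projection matrices; it covers only the scaled dot-product attention and the head concatenation, and it does not attempt to reproduce the gating that follows or the batch normalization mentioned in the experimental settings.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights


def multi_head_self_attention(Z, W_q, W_k, W_v, W_o):
    """Project Z once per head, attend, concatenate the heads, and project again."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        head, _ = scaled_dot_product_attention(Z @ Wq_i, Z @ Wk_i, Z @ Wv_i)
        heads.append(head)
    return np.concatenate(heads, axis=-1) @ W_o


rng = np.random.default_rng(0)
n, d_model, m = 5, 8, 4            # d_model plays the role of 2 * d_h; m is the number of heads
d_k = d_model // m                 # per-head dimension
Z = rng.normal(size=(n, d_model))  # stands in for the highway-LSTM output
W_q = rng.normal(size=(m, d_model, d_k))
W_k = rng.normal(size=(m, d_model, d_k))
W_v = rng.normal(size=(m, d_model, d_k))
W_o = rng.normal(size=(d_model, d_model))

S = multi_head_self_attention(Z, W_q, W_k, W_v, W_o)
_, weights = scaled_dot_product_attention(Z, Z, Z)
print(S.shape)  # (5, 8)
```

Every row of `weights` sums to one, so each output position is a convex combination of the value vectors at all positions, which is how the layer relates arbitrary pairs of characters regardless of distance.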

E. CRF LAYER
In the NER sequence labeling task, there is a strong dependency between the tags of adjacent characters. For example, the I-PER (I-person) tag should follow a B-PER (B-person) tag or an I-PER tag. Also, the I-LOC (I-location) tag cannot appear directly after the B-PER tag or the S-PER (S-person) tag. Therefore, instead of making independent tagging decisions using the output of the fully connected layer, we utilize a CRF to infer the entity tags of a sequence jointly. The CRF can express this dependence and effectively add constraints to the final predicted tag sequence.
The CRF layer is trained to predict the most probable tag sequence $y = \{y_1, y_2, \ldots, y_n\}$ for a given sentence $X = \{x_1, x_2, \ldots, x_n\}$. The score of the tag sequence can be calculated as:

$$s(X, y) = \sum_{i=1}^{n} O_{i, y_i} + \sum_{i=0}^{n} T_{y_i, y_{i+1}}$$

where $O_{i, y_i}$ represents the score of the $y_i$th tag of the $i$th character $x_i$ in the sentence. $T$ is a transition score matrix, where $T_{i,j}$ denotes the score of a transition from tag $i$ to tag $j$. $y_0$ and $y_{n+1}$ represent the start and end tags of a sentence, and we add them to the set of possible tags. Therefore, $T$ is a square matrix of size $k + 2$. Then, the probability of the ground-truth label sequence $y$ is defined as:

$$p(y \mid X) = \frac{\exp\big(s(X, y)\big)}{\sum_{\tilde{y} \in Y_X} \exp\big(s(X, \tilde{y})\big)}$$

where $\tilde{y}$ denotes an arbitrary label sequence, and $Y_X$ is the set of all possible output label sequences for the input $X$.
In decoding, we use the Viterbi algorithm [34] to predict the best path, that is, the tag sequence with the highest score:

$$y^{*} = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$$

Given a set of manually labeled data $\{(x_i, y_i)\}_{i=1}^{N}$, we add L2 regularization to the negative log-likelihood loss for training. The specific loss function is as follows:

$$L = -\sum_{i=1}^{N} \log p(y_i \mid x_i) + \frac{\lambda}{2} \lVert \theta \rVert_2^2$$

where $\lambda$ is the L2 regularization hyper-parameter, and $\theta$ denotes the parameter set. For training, we minimize the loss function $L$ using shuffled mini-batch stochastic gradient descent with the Adam update rule.
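The sequence score, the normalized probability, and Viterbi decoding can be illustrated with a small NumPy sketch. The tag indexing (real tags first, then a start and an end tag), the random scores, and the brute-force normalization are assumptions made to keep the example tiny and easy to check; a practical CRF layer would compute the normalizer with the forward algorithm instead of enumerating every path.

```python
from itertools import product

import numpy as np


def sequence_score(O, T, tags, start, end):
    """Sum of emission scores O[i, y_i] plus transition scores along start -> tags -> end."""
    path = [start, *tags, end]
    emission = sum(O[i, t] for i, t in enumerate(tags))
    transition = sum(T[a, b] for a, b in zip(path[:-1], path[1:]))
    return float(emission + transition)


def log_probability(O, T, tags, start, end):
    """log p(y | X), normalized by enumerating all paths (viable only for tiny examples)."""
    n, k = O.shape
    all_scores = [sequence_score(O, T, y, start, end) for y in product(range(k), repeat=n)]
    return sequence_score(O, T, tags, start, end) - float(np.logaddexp.reduce(all_scores))


def viterbi(O, T, start, end):
    """Return the highest-scoring tag path and its score using dynamic programming."""
    n, k = O.shape
    score = T[start, :k] + O[0]                  # best score of paths ending in each tag at position 0
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + T[:k, :k]        # cand[a, b]: best path ending in a, then a -> b
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0) + O[i]
    score = score + T[:k, end]                   # transition into the end tag
    best = int(score.argmax())
    path = [best]
    for i in range(n - 1, 0, -1):                # follow the back-pointers to the first position
        best = int(back[i, best])
        path.append(best)
    return path[::-1], float(score.max())


rng = np.random.default_rng(0)
N, K = 4, 3                                      # toy sentence length and number of real tags
START, END = K, K + 1                            # extra start and end tags, so T is (K + 2) x (K + 2)
O = rng.normal(size=(N, K))
T = rng.normal(size=(K + 2, K + 2))

best_path, best_score = viterbi(O, T, START, END)
print(best_path, np.exp(log_probability(O, T, best_path, START, END)))
```

The negative of `log_probability` for the gold tag sequence is the per-sentence term of the training loss; the L2 penalty on the parameters would be added on top of it.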

IV. EXPERIMENTS
In this section, to evaluate the effectiveness of the proposed GCRA model, we compare it with previous state-of-the-art methods on different Chinese NER datasets. We describe the datasets, the parameter settings, and the results of our experiments.

A. DATASETS
We evaluate our proposed model on three Chinese NER datasets: the MSRA NER dataset [35], the Literature NER dataset [36], and the Chinese Resume dataset [24]. Table 2 provides detailed statistics for each dataset.
• The MSRA dataset comes from the SIGHAN 2006 shared task for Chinese NER [35]. It consists of news text in simplified Chinese and contains three annotated named entity types: PER (Person), ORG (Organization), and LOC (Location). No development set is available in the MSRA dataset. Therefore, we sample 10% of the training set as the development set.
• The Literature dataset is annotated from hundreds of Chinese literature articles and contains seven entity types: Thing, Person, Location, Time, Metric, Organization, and Abstract. The Literature dataset is already divided into training, development, and test sets.
• The Resume dataset consists of resumes of senior executives from listed companies in the Chinese stock market and contains eight types of named entities: CONT (Country), EDU (Educational Institution), LOC, PER, ORG, PRO (Profession), RACE (Ethnicity Background), and TITLE (Job Title).

B. EXPERIMENTAL SETTINGS
We adopt the BIOES tagging scheme, where each character in the corpus is labeled as one of B (Begin), I (Inside), O (Outside), E (End), and S (Single). Studies have suggested that the BIOES scheme performs remarkably better than the BIO scheme, since BIOES provides more detailed position information [37]. Table 3 shows the values of the hyper-parameters for our model. We select the parameters according to the performance on the development set of each dataset. We set the character embedding size and the hidden sizes of the CNN and Bi-LSTM to 300 dimensions. The sliding window size of all convolutional layers is set to 3. The highway gate bias is initialized with a vector of −1. We use Adam [38] as the optimizer with an initial learning rate of 0.001, and gradient norms are clipped at 5.0. The number of self-attention projections $m$ is 8. To avoid overfitting, we set the L2-norm regularization parameter to 0.005 and apply dropout to the embedding layer with a rate of 0.5. Batch normalization is applied to the outputs of the self-attention layer. We set the batch size to 64 for the MSRA dataset and to 20 for the other datasets. The character embeddings used in our proposed model are from Chinese-Word-Vectors [39], which are pre-trained on the Baidu Encyclopedia corpus by Skip-Gram with Negative Sampling (SGNS).
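The BIOES labeling described above can be made concrete with a short Python sketch. The two helper functions and the span format (start index, exclusive end index, entity type) are hypothetical conventions chosen for illustration.

```python
def words_to_bioes(words):
    """Segmentation labels for each character, given a list of segmented words."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return labels


def spans_to_bioes(length, spans):
    """Entity tags for a sentence of `length` characters.

    spans: iterable of (start, end_exclusive, entity_type) tuples.
    """
    tags = ["O"] * length
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"
        else:
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"
            tags[end - 1] = f"E-{etype}"
    return tags


# Two segmentations of the same seven-character sequence give different soft features.
print(words_to_bioes(["南京市", "长江大桥"]))    # ['B', 'I', 'E', 'B', 'I', 'I', 'E']
print(words_to_bioes(["南京", "市长", "江大桥"]))  # ['B', 'E', 'B', 'E', 'B', 'I', 'E']

# A single location entity covering all seven characters.
print(spans_to_bioes(7, [(0, 7, "LOC")]))
```

The same scheme serves two roles in the model: `words_to_bioes` produces the segmentation labels used as input features, while `spans_to_bioes` produces the entity tags that the CRF layer predicts.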
For evaluation, following most previous work, we use Precision (P), Recall (R), and F1 score as metrics to evaluate the recognition effectiveness of the model.
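The three metrics can be computed from BIOES tag sequences as in the Python sketch below. It assumes entity-level exact matching, where a predicted entity counts as correct only if its boundaries and type both match a gold entity; this is the usual convention for NER evaluation but is an assumption here, and the function names are hypothetical.

```python
def extract_entities(tags):
    """Collect (start, end_exclusive, type) spans from a BIOES tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            entities.add((i, i + 1, label))
            start = None
        elif prefix == "B":
            start, etype = i, label
        elif prefix == "E" and start is not None and label == etype:
            entities.add((start, i + 1, label))
            start = None
        elif prefix == "I" and start is not None and label == etype:
            continue
        else:
            start = None  # "O" or an ill-formed continuation closes any open span
    return entities


def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 under exact span and type matching."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = ["B-LOC", "I-LOC", "E-LOC", "O", "S-PER"]
pred = ["B-LOC", "I-LOC", "E-LOC", "O", "O"]
p, r, f1 = precision_recall_f1(gold, pred)
print(p, r, round(f1, 4))  # 1.0 0.5 0.6667
```

In this example the single predicted entity is correct, so precision is 1.0, but one of the two gold entities is missed, so recall is 0.5.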

C. EXPERIMENT RESULTS
We compare our experimental results with previous state-of-the-art methods on the MSRA dataset, the Literature dataset, and the Chinese Resume dataset, respectively. Besides, we propose two baselines and a GCRA model.

1) MSRA DATASET
Among the previous models for Chinese NER on the MSRA dataset, Lu et al. [43] introduced multi-prototype embedding features to the Chinese NER task, and Dong et al. [44] exploited a neural LSTM-CRF with radical features of Chinese characters. Yang et al. [45] proposed a five-stroke based CNN-BiRNN-CRF model for Chinese NER that considers semantic information as well as n-gram features. Cao et al. [23] used Adversarial Transfer Learning and self-attention to jointly train Chinese NER with Chinese word segmentation for better performance. Zhang and Yang [24] constructed a lattice LSTM structure to exploit word information in the character sequence by incorporating lexicon information into the neural network. Although the model achieves a state-of-the-art F1-score of 93.18%, it leverages external lexicon data, and the result may be affected by the quality of the lexicon. Zhang et al. [25] investigated a dynamic meta-embeddings method and applied it to the Chinese NER task. Zhu and Wang [26] proposed a Convolutional Attention Network model to improve Chinese NER performance without depending on word embeddings or additional lexicons.
In the second block of Table 4, we list the results of the baselines and our proposed model. Our baseline model achieves an F1-score of 91.36% using only character embeddings and softword information. We add a highway network to refine the hidden representation of the Bi-LSTM, and the experimental results show that the Baseline + Highway model surpasses most of the previous methods. Compared with the state-of-the-art model proposed by Zhang and Yang [24], our character-based model gives a highly competitive accuracy of 93.71% without external lexicon data or multi-task joint training. Compared with the state-of-the-art result among character-based models, proposed by Zhu and Wang [26], our GCRA model achieves a higher F1-score of 93.08% on the MSRA dataset.

2) LITERATURE DATASET
Zhang et al. 2019(a) [25] proposed the DME-SUM model, which applied the dynamic meta-embeddings method to combine the character and word vectors. Zhang et al. 2019(b) [25] presented a DME-attention based model, which implemented two attention layers to integrate character and word information by element-wise summation.
The results of our baselines and proposed models are listed in the second block of Table 5. Our baseline Bi-LSTM + CRF achieves an F1-score of 72.79%, and adding a highway network improves the F1-score to 73.48%, which is better than previous methods. Compared with the state-of-the-art model proposed by Zhang et al. 2019(b) [25], our GCRA model performs better without using external data and yields a 1.26% increase in F1-score.

3) RESUME DATASET
In the second block of Table 6, the results show that our proposed Baseline + Highway model achieves a highly competitive F1-score of 94.87%. Our proposed character-based GCRA model outperforms the previous methods and achieves a state-of-the-art F1-score of 95.54% on the Chinese Resume dataset, which demonstrates the effectiveness of our proposed model.

D. RESULTS ANALYSIS
With the introduction of the gating mechanism, our model can effectively avoid gradient vanishing during training and achieve selective multi-channel transmission of information. As shown in Tables 4, 5, and 6, the Baseline + Highway model gains a significant improvement in F1-score over the baseline model. This indicates that the gate network can perform more detailed feature extraction and learn more complicated dependencies. Our proposed GCRA model outperforms previous methods on the Chinese Literature and Resume datasets and gives highly competitive results on the MSRA dataset without utilizing any external resources. Compared with the baseline model, our proposed GCRA model yields noticeable improvements of 1.72%, 1.5%, and 1.26% on the MSRA, Literature, and Resume datasets, respectively. This demonstrates the effectiveness of our proposed model for the Chinese NER task: it better understands a sentence and achieves better recognition. However, the overall performance on the Literature NER dataset is relatively low, and previous methods all obtain higher precision and lower recall. The lower recall means that many unknown entities cannot be recognized, which may be explained by the various rhetorical devices and the large number of ambiguous cases in Chinese literature text. Nevertheless, the remarkable improvement on the Literature NER dataset suggests that our proposed model can efficiently handle the problem of unknown entities.

V. CONCLUSION
In this paper, we propose a new model (GCRA) for the Chinese NER task, which utilizes gated filtering to refine the hidden representation and avert gradient problems. In our model, we apply hybrid gated convolutions, a highway-LSTM, and a gated self-attention mechanism to learn the inner features of the sentence and capture context information from multiple subspaces. Experiments on three datasets demonstrate that our proposed model achieves better performance than previous state-of-the-art methods. Furthermore, our model does not depend on any external resources or domain-specific knowledge. Thus, it can be easily extended to other sequence labeling tasks, such as Chinese Word Segmentation and Part-of-Speech Tagging.
In the future, we will consider using transfer learning to integrate knowledge from other NLP tasks into the Chinese NER task to improve performance.

FIGURE 1. The whole architecture of our proposed GCRA model.

FIGURE 2. The architecture of hybrid gated convolution layer.

2) DILATED GATED CONVOLUTION
Strubell et al. [28] applied iterated dilated convolutions to expand the receptive field, which gives better capacity than traditional CNNs for the NER task. To enable the CNN model to capture longer distances without increasing the number of model parameters, we use a dilated convolution. In a normal CNN, each kernel window consists of adjacent inputs, whereas dilated convolutions define a wider effective input width by introducing dilation between inputs. Consider a 1-D convolutional filter $w = \{w_{-r}, w_{-r+1}, \ldots, w_{r}\}$ of window size $l = 2r + 1$ and the input sequence $X = \{x_1, x_2, \ldots, x_n\}$.

FIGURE 4. The architecture of highway network layer.
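In the multi-head attention, $W_i^{Q}$, $W_i^{K}$, $W_i^{V} \in \mathbb{R}^{2d_h \times d_k}$ and $W^{o} \in \mathbb{R}^{2d_h \times 2d_h}$ represent the projection matrices, and $d_k = 2d_h/m$. The $\otimes$ is the element-wise product and $\oplus$ represents the connection operator.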

TABLE 1. Examples of word segmentation.

TABLE 2. Detailed statistics of datasets.

In Tables 4, 5, and 6, we use Baseline to denote the BiLSTM + CRF model and Baseline + Highway to denote the Highway-LSTM + CRF model. The best experimental results in the tables are in bold.

Table 4 shows the experimental results on the MSRA dataset. The first block gives the results of previous models for Chinese NER on the MSRA dataset. Chen et al. [40], Zhang et al. [41], and Zhou et al. [42] employed statistical models with rich hand-crafted features, and Lu et al. [43] introduced multi-prototype embedding features.

TABLE 4. Experimental results on MSRA dataset.

Table 5 shows the results on the Literature dataset. Xu et al. 2018(a) [36] employed a bi-directional LSTM for Chinese Literature NER, and Xu et al. 2018(b) [36] used a CRF with a feature template that includes unigram and bigram features. The first two rows in the first block clearly show that the CRF achieves better performance than the bi-directional LSTM, which is probably attributable to the feature template. Zhang et al. 2019(a) [25] proposed the DME-SUM model.

TABLE 5. Experimental results on Literature dataset.

Table 6 shows the comparative results on the Chinese Resume dataset. The first three rows of the first block respectively represent the char-based LSTM model, the word-based LSTM model, and the Lattice model proposed by Zhang and Yang [24]. Zhu and Wang 2019(a) [26] used a BiGRU + CRF model, and Zhu and Wang 2019(b) [26] leveraged a CNN-BiGRU + CRF model for Chinese Resume NER. Zhu and Wang 2019(c) [26] presented a Convolutional Attention Network model, which achieves an F1-score of 94.94% on the Resume dataset.

TABLE 6. Experimental results on Chinese Resume dataset.