An Instance Transfer-Based Approach Using Enhanced Recurrent Neural Network for Domain Named Entity Recognition

Recently, neural networks have shown promising results for named entity recognition(NER), which needs a number of labeled data to for model training. When meeting a new domain (target domain) for NER, there is no or a few labeled data, which makes domain NER much more difficult. As NER has been researched for a long time, some similar domain already has well labeled data(source domain). Therefore, in this paper, we focus on domain NER by studying how to utilize the labeled data from such similar source domain for the new target domain. We design a kernel function based instance transfer strategy by getting similar labeled sentences from a source domain. Moreover, we propose an enhanced recurrent neural network (ERNN) by adding an additional layer that combines the source domain labeled data into traditional RNN structure. Comprehensive experiments are conducted on two datasets. The comparison results among HMM, CRF and RNN show that RNN performs better than others. When there is no labeled data in domain target, compared to directly using the source domain labeled data without selecting transferred instances, our enhanced RNN approach gets improvement from 0.8052 to 0.9328 in terms of F1 measure.


I. INTRODUCTION
In recent years, Web data and knowledge management attracts the interests from industry and research fields. There are various promising applications, such as intelligent recommendation, machine Question & Answer, knowledge graph and so on. Named entity recognition is a fundamental and very important step in the automatic information extraction. A recognition method with high quality can directly improve the follow-up processing results of Web data management products. NER research shows great successes in various domains and becomes a hot topic [10], [26], such as social media [28], [33], language texts [13], biomedicine [2], [25], and so on. As we know, texts of different domains may vary from features, writing styles and structures. Domain NER meets a challenge that annotating data for new domains is labor intensive.
For domain NER,a new domain(target domain) has no or a few labeled data. However, it is natural to think that if some similar domain(source domain) with enough labeled data The associate editor coordinating the review of this manuscript and approving it for publication was Huiqing Wen . already exists, it is possible to borrow some from this similar domain. Domain adaptation targets at transferring the source domain knowledge to the target domain [16], which is an effective way to solve the problem of labeling large amount of data on new corpus or domains. In other words, if a NER model is trained well for some fixed source domain, it is interesting to study how to deploy them across one or more different target domains.
In this paper, we focus on domain NER for the politics text domain in Chinese high schools to support an automatic question and answer system (Q & A) that will take the national college entrance examination (NCEE) in the future. However there is no public politics text corpus used by Chinese high school for NER task. Moreover, there is no labeled data. We notice that People's Daily corpus is free for public download and similar to politics text. Therefore, we propose an instance transfer based approach for domain NER with enhanced Recurrent Neural Network (RNN). First, we design an instance transfer strategy to extract similar sentences from the People's Daily corpus (source domain). Here, instances mean labeled data and politics text is our target domain. Then, recurrent neural network (RNN) model can be trained based on the transferred instances. Moreover, we improve the traditional RNN model by enhancing its activation function and structure. Finally, an instance transfer enhanced RNN (ERNN) model is proposed to do NER for politics text target domain, which is trained based on the transferred instances from similar source domain (People's daily corpus). Compared with the traditional RNN model, experimental results show that our ERNN with the instance transfer strategy can get improvement from 80.52% to 93.28% in terms of F1 measure.
In addition, we consider other situation where a small number of labeled data is available to further investigate the performance of our proposed approach. Since the politics text as target domain data is needed to obtain, we collect texts from high school books and relevant websites and manually label a very small set of them, making labeled data much less than the unlabeled one. Experimental results show that labeling data for target domain is quite useful, reaching at 92.13 in terms of F1 measure. However, our instance transfer based enhanced RNN can further improve the F1 value to 93.81. Finally, we adopt the co-training approach using our proposed ERNN model and CRF by taking advantage of large unannotated target domain data, which get the F1 value of 94.02.
The rest of the paper is organized as follows. In Section 2, we introduce related work on NER (especially on domain NER), recurrent neural network and transfer learning. Section 3 elaborates our approach including the instance transfer strategy, our proposed ERNN. Experiments and results are given in Section 4, then we make the conclusion and future work in Section 5.

II. RELATED WORK
As the virtual step in information extraction and a shared task of CoNLL-2003 [23], NER has been widely studied nowadays and most are about domain NER. Reference [35] employ conditional random fields (CRF) and structure support vector machines (SSVMs) to do chemical entity mention recognition in patients and chemical passage detection. Reference [6] propose novel active learning algorithms to significantly save annotation cost for clinical NER task. Besides, considering the condition that available labeled data is not at hand sometimes, [5] present a NER system for tagging fiction. To avoid the lack of annotated data, they bootstrap a model from term clusters and leverage multiple instances of the same name in a text instead. However it needs to pass through the corpus to build a feature vector corresponding to all the contexts in one document. Moreover, a large number of researches about domain NER are on biomedical fields [9], [20], and the typical methods are CRF [12], [24], and other supervised learning approaches [11].
As neural networks have shown promising results in domain NER [9], such as RNN, LSTM, and so on. Reference [27] use deep neural networks for referring to the real world (i.e. game states) to improve Japanese chess NER. Reference [34] combine the bidirectional long short-term memory (LSTM) and CRF to automatically explore words and characters level features. However, transfer learning and semi-supervised learning, like co-training, which are effective ways for the situation when train data is much less than unannotated data.
When considering how to deal with the situation of lack of labeled data, [21] have categorized and reviewed the research progress on transfer learning for classification, regression, and clustering problems. Similar to the work of [22] in domain adaption, in this paper we study a instance transfer strategy, which is hot today but rarely used in NER [3], [7], to make use of out-of-domain data (source domain).
Co-training is also an effective way to use unlabeled data in supervised learning task, [14] use a bilingual co-training and maximum entropy model to carry out English and Chinese NER. Reference [19] modify the original co-training to cover knowledge from unlabeled data to recognize bio named entities in text. Besides, clustering should be another research direction for this problem [29], [30], [31]. Our interest is domain NER for Chinese high school news texts. In this paper we also put our proposed ERNN into a co-training approach to see how much improvement can be obtained through large amounts of unlabeled data.
The contributions and differences of our work from others are listed as follow: 1) We design a instance transfer strategy by selecting similar sentences from a source domain.
2) We add an additional layer to the traditional RNN structure by coupling knowledge from the transferred similar instances.
3) We conduct experiments on two datasets. No matter there is no labeled data or no in target domain, our proposed approach can improve the quality of NER.

III. OUR APPROACH A. PROBLEM DESCRIPTION
Manually labeling data is time consuming in most machine learning methods. However, there are some other domain data with well-labeled data. As Pan and Yang [21] have declared, in such cases, knowledge transfer, if done successfully, would greatly improve the performance of learning without taking lots of efforts to label data. Our interest is Chinese high school politics text NER to support automatic Question and Answer application. There is no labeled data at hand for this new domain, but we find that some public NER data, like People's Daily corpus, is similar to it. The similar domain with well-labeled data is called source domain, while the new domain without enough labeled data is called target domain. In this paper, we propose an instance transfer enhanced RNN model for domain NER by leveraging labeled data from a source domain. In Section 3.2, we will introduce an our instance transfer strategy that will select similar labeled data (sentences) from a source domain. In Section 3.3, we will discuss how to improve the traditional RNN model by adding those selected data into the network structure of RNN.

B. OUR INSTANCE TRANSFER STRATEGY
We design an instance transfer strategy where the target domain is Chinese high school politics texts, and People's Daily is the source domain. Since texts from different domains may vary from features, writing styles and structures, it is not suitable to use all labeled data from the source domain. In this paper we first consider two functions to compute sentence similarity between source domain and target domain. The two functions are Gaussian radial basis function (RBF) and polynomial kernel function which are described as following: • Gaussian RBF kernel: • polynomial kernel function: Here the − → x stands for the sentence vector in source domain, while − → z represents the mean of target domain corpus matrix. And we set σ = 1 and d = 2. We use the two functions to evaluate the sentence similarity between source domain and target domain and sort source domain sentences according to their similarity scores. Then two transfer strategies are given to select top similar sentences: 1) Directly regard the top n sentences as target domain training data.
2) Repeat the nth sentence k times in source domain where the value of k depends on n. The original source corpus has been enlarged.

C. OUR ERNN MODEL
Through our instance transfer strategy, target domain borrows labeled data from source domain and top similar sentences have been assigned more weights than others. Therefore, the next step of our approach is to enhance RNN model by utilizing those labeled data. RNN is designed to solve the problems involved in time and sequence [18]. Not only the output of moment t is influenced by that of previous moment, but also the nodes between hidden layer are all connected to each other. In this paper, we propose an enhanced RNN by modifying both its activation function and structure to do domain NER task.

1) ACTIVATION FUNCTION
As described in [8], a deep neural network is an ensemble of vectors and matrices that stand for bias and weight values, and nonlinear activation function. A suitable activation function means much for a deep neural network. Its changes can speed up model training [32] and enhance stability [15], etc. However, for a given domain, the best nonlinear function remains unknown, even though many rectifier-type nonlinear functions have been proposed as activation functions [8]. The performance of a same activation function may be widely different when applying it to different tasks. For NER task, compared to other frequently used activation functions such as ReLu, PReLu, tanh and so on, the sigmoid function shows best in our preliminary experiments. Its definition is given in Eq. 3.
However, the model training process involves the derivation, while the derivative of sigmoid function tends to zero when the argument x approaches infinity (which called the saturation phenomenon) both at its left and right. This may make it harder to train model. Even so, sigmoid is most similar to the reflex mechanism on the biological neuron level and the interval of its output is always between 0 and 1, which can represent the prediction probability of a label. In our experiments, its linear approximation function expressed by Eq. 4, shows a good performance and a is set to be 0.2 and b is 0.5 A more intuitive depiction is shown in Figure 1.
Focusing on our NER task, a new activation function is proposed by combining the two above functions (shown in Eq. 5). Here parameters α and β are coefficients and they are determined by experimental performances.
The new activation function has some advantages compared to its two original functions: sigmoid and linear approximation. It ameliorates the former's saturation phenomenon on the one hand and smooths the latter on the other hand. In our experiments, the results using the proposed activation function are better than that using sigmoid or its linear approximation.

2) MODEL STRUCTURE
In order to take full advantage of source domain data, we also improve RNN's structure and propose an instance transfer enhanced RNN (ERNN), which can input the source domain   sentences into RNN. Our ERNN is modified based on the Elmantype described in [17].
The new structure we modified is depicted as Figure 2 and Figure 3. In Figure 2, the left shows for the original structure and the right describes our modified recurrent neural network when unfolding its structure in time of the computation involved in its forward computation. T nodes represent the hidden layer, meanwhile S indicates the confluent layer. The source domain data combines with the output of the hidden layer T. The overview of the structure is shown in Figure 3 that depicts the domain source data input i compared to Figure 2. As shown in Figure 2, − → U , − → V , − → w , − → w 1 , − → w 2 are weights and parameters. The output of the S t is computed by Eq. 6. In output layer, we use softmax to get the prediction results i.e. o(t), as shown in Eq. 7.
Here F is the activation function, i stands for the source domain data, as marked in Figure3. − → b 0 and − → b 1 are bias vectors. In this way we achieve the reuse of the valuable source domain corpus.

IV. EXPERIMENTS
We conduct two sets of experiments on two totally different datasets for different purposes. One is NER experiments among three popular models, i.e., HMM, CRF and RNN, which aims to explore the classification quality with different training data sizes. The other is domain NER using our ERNN with instance transfer.

A. DATA DESCRIPTION
There are two datasets used in our experiments. For NER experiment among three popular models, we mainly focus the changes of the classification performance with different amount of training data. We conduct the experiment on a standard corpus named ATIS [17]. It contains 127 classes and uses the in/out/begin (IOB) representation. A sentence in ATIS can be expressed as in Table 1. The other dataset is collected from Chinese high school politics textbooks and websites, which is the target domain in this paper. The source domain corpus is People's Daily 1998 that is open online. In addition, due to lack of label information, we set 11 classes for target domain based on People's Daily 1998, including person names, regular time words, single time words such as those means ''night'', proprietary organization and company name. For example, Chinese characters are ''Xinhua News Agency''and''the United Front Department'', and special region name, noun and etc.

B. DATA PRE-PROCESSING
Specifically, for NER experiment, ATIS contains 3983 training sentences and 893 testing sentences. In order to find out how it works when the labeled data is less than the unlabeled one, we reorganized the training set. Based on ATIS original train/test proportion (3983/893), we randomly select 20%, 40%, 60%, 80% and 100% of total train data (3983 sentences) as our training sets to train two traditional statistical models (Hidden Markov Model and CRF) and RNN model and one RNN model. For domain NER experiment, we first pick the most common words from target domain data as our dictionary. By counting the times a word has appeared in this corpus, we select top 13,450 words as our dictionary and other words are marked as 'unknown' in a sentence. After this, we clean the data based on a rule that is if a sentence meets any of the following two conditions, it would be regarded as noisy data and abandoned..
• The length is less than 3 • Too many 'unknown' tags(more than 50%) After preprocessing, the People's Daily sentences have been reduced from 19484 to 4818, meanwhile the number of politics text sentences is also reduced from 82700 to 15584. Since there is no annotated data of the latter, we manually mark 3043 sentences. Table 2 shows the statistics of the labeled and unlabeled data size of both datasets.
For People's Daily, we use 80%/20%of the total 4818 sentences as our train/test data sets. For our target domain data in severe lack of tagged data, we use manually labeled 2035 sentences and tag other 1008 sentences from the large unannotated set as the test set.

C. EVALUATION AND SETUP
In this paper, all the experimental results were evaluated by the popular evaluation measures, i.e., precision (P), recall (R) and F1-score. In addition, for supervised learning experiment, we used K-fold cross validations. K is different based on the different size of train set. For example, when using 20% of the original train set, K is 5, and when using 40% and 60%, K is 3.
We also do some preliminary experiments to set the ERNN parameters. For example, the word embedding we use is trained by target domain texts via word2vec, a Google open source project which brings about 14.1% performance improvement. Besides, we set the ERNN context window size equal to 1, which performs better than those 5, 7 and 9 (probably because there are too many nonentity labels in a sentence). For instance transfer, since the RBF similarity function achieves a better result, we use it to enlarge the source domain corpus.

D. NER EXPERIMENTAL RESULTS OF THREE POPULAR NER MODELS
In this part, we compare the NER performance of a current popular deep learning model, i.e., RNN with two traditional models, i.e., HMM and CRF. And we also mainly explore the NER performance with different training data sizes. We conduct several experiments using ATIS's total training data's 20%, 40%, 60%, 80% and 100% to train the three popular NER models. The results are shown in Figure 4, Figure 5 and Figure 6. FromFigure4, Figure5 and Figure6 we can see how the three models behave when being trained on different data sizes. Overall, the P, R and F1 of these three models are gradually rising when using more training data. However, the performances are not growing linearly when the data size gets larger. While using more and more data to train a model,   the degree of improvement becomes smaller. Actually, when the training data size increases from 20% to 40%, the three models generally achieve a highest improvement (about 4% improvement in terms of F1 measure). After that, increasing the training data size brings little benefits.
We have done experiments on BiLSTM-CRFt. When using all training data, its F1 score is 90.28. Because this article mainly discusses the impact of transfer learning on NER performance, the following experiments discuss the two basic models of CRF and RNN.
This indicates that when meeting a new domain for NER, labeled data will actually improve performance. But VOLUME 8, 2020 manually labeling data is labor intensive. Putting more labor on labeling data could not bring us larger improvement considering the time and the labor we need. Therefore, in this paper, we propose an instance transfer strategy by transferring labeled data from other well labeled domain, which is a effective way to solve the problem of lack of labeled data in a new domain.

1) INSTANCE TRANSFER
We have designed the two transfer learning strategies introduced in Section 3.2. One is directly using the RBF top n (here we set n=800) similar sentences from source domain as new additional target domain training data. The other is increasing the source corpus by Eq. 8, i.e. the nth most similar sentences repeat k times. Our preliminary results show that the latter is better. Therefore, due to page limit, we just discuss the latter in this paper. Finally, 41500 sentences are obtained as labeled data. We still keep the 80%/20% train/test proportion when using the enlarged training data to train our ERNN model. The parameters are given in Equation 8.
Since in the above experiment, RNN show better performance than HMM and CRF, RNN related baselines and our approach are listed as follows(Here 'IT' means 'Instance Transfer'). As discussed in Section 2, most of current researches that study deep neural network for NER work on trying different variants of deep learning models by assuming that a number of training data is ready. However, we consider two situations. One is that there is no labeled data in target domain, i.e., RNN_D_IT and ERNN_IT. The other is there is a few labeled data in target domain, i.e., RNN_L, RNN_L_D_IT and ERNN_L_IT. Our problem context is quite different from most of current work.
1. RNN_D_IT: train a traditional RNN model (left part in Figure 2) directly with source domain. This training process is intuitive and any NER model can be applied like this.
2. ERNN_IT: train our ERNN model ( Figure 3) with selected instances from source domain by our instance transfer strategy (Eq. 8).
3. RNN_L: train a traditional RNN model (left part in Figure 2)with a few labeled data from target domain. This is a traditional training process for supervised learning and more results are discussed in Section 4.4. 4. RNN_L_D_IT: train a traditional RNN model (left part in Figure 2) with a few labeled data from target domain and all instances in source domain. The problem context is the same as that of [22], but RNN is used in our paper instead of CRF.

ERNN_L_IT
: train our ERNN model ( Figure 3) with a few labeled data from target domain and transferred instances in source domain by our instance transfer strategy (Eq. 8).
For the situation that there is no labeled data in target domain, the experimental results are reported in Table 3. The results of RNN_D_IT tell us that directly using all the labeled data from source domain directly is not much satisfactory. With the help of our instance transfer strategy, similar sentences are selected and repeatedly used as training data. The F1 measure of our ERNN_IT reaches at 93.28%, with the improvement about 15.84%.
In addition, the other situation for domain NER is considered where a few labeled data is available. Therefore, experiments are conducted by adding our manually labeled data into RNN_D_IT and ERNN_IT. Experimental results are show in Table 4. Traditional RNN model gets the F1 value of 92.13, which further indicates that manually labeled data for target domain can largely improve the quality of domain NER. Under this situation. our ERNN_L_IT can still obtain the F1 value of 93.81, and it benefits from our designed instance transfer strategy and adding a particular layer to RNN. The F1 value of RNN_L_D_T is 93.06, which says that source domain can help target domain NER. Transfer strategy should be studied. The results in Table 3 and Table 4 tell us that transfer learning is an effective way to leverage the labeled data from source domain corpus.

2) CO-TRAINING
The experiments in instance transfer part aim to utilize the source domain data to raise the recognition performances. Along with instance transfer learning methods, we also consider taking advantages of unannotated data since we have much of them at hand. Specifically, we conduct experiments by co-training our ERNN and a statistical probability model i.e. CRF. Co-training is kind of a semi-supervised strategy that is firstly proposed by [4] in 1998 and the conditional independence of the data views is declared as a required criterion. However, [1] shows that the independence assumption can be relaxed, which means co-training is still effective under a weaker independence assumption. In this paper, we explore the effect of co-training with our ERNN and adopt it to leverage the large unannotated in-domain data. The initial training data size is 2035 sentences here, while the unannotated data is 12541 sentences. We select top 800 high confidence level sentences in each iteration and after iterated 10 times, all 12541 sentences get labeled. Table 5 shows the results of each model in each iteration.
For the co-training experiments, large unannotated data also brings rise to the quality of NER, but not as much as that of instance transfer experiments. Due to the CRF's low starting point, the ERNN F1 score goes down after reaching 94.02 in terms of F1 measure.

V. CONCLUSION AND FUTURE WORK
Domain NER is important and useful in various applications. In this paper, we study instance transfer and RNN to improve the quality of domain NER. We leverage the source domain data by proposing an instance transfer enhanced RNN called ERNN. In addition, we adopt co-training strategy to leverage the large unannotated in-domain data to further improve the recognition performances. In the future, we are going to make a further exploration to automatic QA system, relationship extraction between entities.