Reducing Wrong Labels for Distantly Supervised Relation Extraction With Reinforcement Learning



I. INTRODUCTION
Relation extraction (RE) aims to predict the semantic relation between two entities in a given text. The extracted relation facts play an important role in various natural language processing (NLP) tasks, such as knowledge graph construction [1] and question answering (QA) [2].
Most existing supervised RE methods rely entirely on manually labeled relation-specific training data and thus suffer from a lack of large-scale labeled data. Generating such data is expensive and inefficient, and the quantity obtained is limited. In order to obtain large-scale training data, distant supervision (DS) [3] is proposed. DS assumes that if two entities have a relation in the knowledge base (KB), any sentence mentioning these two entities expresses that relation.
The associate editor coordinating the review of this manuscript and approving it for publication was Fatih Emre Boran .
Although DS can automatically and effectively generate a large amount of labeled training data, it suffers from the wrong labeling problem. As shown in Fig. 1, taking the triple (donald_trump, place_of_birth, united_states) as an example, the sentence ''donald_trump was elected the 45th president of the united_states.'' will be regarded as a positive instance by DS, and the relation place_of_birth is assigned to this sentence. However, the sentence does not describe this relation at all. Thus, DS introduces noise into the training data.
To solve the problem of wrong labeling, previous studies introduce multi-instance learning (MIL) [4]-[6] into the distantly supervised RE task. MIL divides the sentences containing the same entity pair into a bag, which is labeled with the relation mentioned by the entity pair in the KB. In these studies, the impact of wrong labels is reduced by changing the sentence-level RE task to a bag-level task.
However, there are still two problems in previous studies: (1) Sentence-level relation prediction cannot be handled. The trained model is sensitive to the bag, especially when none of the sentences in the bag expresses the relation at all. Compared to bag-level RE, sentence-level RE is more widely used in other applications. (2) Few studies focus on the sentence-level noisy label problem. Most methods only alleviate the problem of noisy sentences or bag-level noisy labels.
To better explain the above problems, an example of bag-level and sentence-level distantly supervised RE is shown in Fig. 1. For the first problem, bag-level extraction can find that a set of sentences containing the same entity pair (donald_trump, united_states) has two relations, place_of_birth and president_of. However, sentence-level prediction can further map each relation to the corresponding sentences [7]. In the bag-level extraction task above, at least one sentence must be selected even if no sentence in a given bag expresses the relation. It is very common that all sentences contained in a bag are noise. The results surveyed by Feng et al. [7] on the commonly used Riedel dataset [4] show that 53% of 100 sample bags have no sentences that express the relation. Such noise bags seriously affect the training of the RE model. For the second problem, a sentence containing an entity pair may be labeled with multiple relations by DS, but not all labels are expressed by the sentence. Thus, a sentence may carry multiple noisy labels. For example, the sentence ''donald_trump was born in the united_states.'' is labeled with the two relations place_of_birth and president_of. This sentence only describes the relation place_of_birth, and president_of is a noisy label.
In order to deal with the problem of multiple noisy labels, an express-only-one assumption is proposed. It assumes that an entity pair expresses only one relation in a sentence. Intuitively, this indicates that a sentence mentioning the entity pair can have only one correct label; it is impossible for a sentence to have multiple correct labels. Based on this assumption, a sentence-level label denoising model based on reinforcement learning (RL) is proposed. This model includes two modules: a label denoiser and a relation extractor. The label denoiser selects a correct label for each sentence in the original data. To handle the first problem, after obtaining the data with the correct labels, the relation extractor predicts sentence-level relation probabilities. The proposed model selects reliable labels for sentences, which reduces the frequency with which the labels of all sentences in a bag are incorrect, and alleviates the sensitivity to bags. The main difficulty here is how to jointly train the relation extractor and the label denoiser, especially when the label denoiser does not know exactly which relation labels are correct for each sentence.
The label denoiser addresses this challenge by adopting a value-based RL method. Although it is difficult to obtain reliable labels for sentence-level noisy labels with supervised methods, we can measure the effectiveness of the selected labels as a whole. The sentence-level label denoising task therefore has the following three properties. First, trial-and-error search: the label denoiser tries to select the most valuable label for a sentence without a clear relation label. Second, dynamic interaction: the label denoiser interacts with the environment containing the data and the relation extractor. The label denoiser filters the noisy labels in the training data; the relation extractor then performs relation prediction on the cleaned data; and the results of the relation prediction are fed back to the label denoiser. These three parties dynamically interact with each other to obtain reliable labels and RE results. Finally, delayed reward: the rewards from the relation extractor can be received only when the label denoising process is completed.
The major contributions of our work can be summarized as follows: • We propose an RL-based sentence-level label denoising model, which includes two modules: a label denoiser and a relation extractor. The joint training of the two modules can reduce the noisy labels of sentences, improve the performance of sentence-level relation prediction, and reduce sensitivity to bags.
• To address the multiple noisy labels problem, we propose an express-only-one assumption, and design a deep Q-network (DQN) as a label denoiser to select reliable labels.
• We propose a reward function based on the difference between relation prediction results. The reward of the label denoiser comes from the changes in the predicted relation scores of the relation extractor before and after label selection.
• The experimental results show that the DQN-based model is effective in reducing noisy labels, and the proposed method outperforms various state-of-the-art baselines on both the Riedel dataset and a human-annotated dataset.

II. RELATED WORK
RE is one of the most important research tasks in NLP. Many supervised methods have been proposed to improve the performance of RE. Zelenko et al. [8] use kernel-based machine learning methods to extract relations from unstructured natural language sources. Other conventional machine learning classifiers, including naive Bayes [9], support vector machines [10], and maximum entropy [11], have been proven able to achieve outstanding performance on labeled relation-specific training data. However, these methods rely strongly on the quality of the features extracted by NLP tools, which inevitably leads to error propagation or accumulation. Motivated by the successful application of deep neural networks in other fields, such as image processing [12] and speech recognition [13], researchers have proposed supervised deep neural networks to automatically learn low-dimensional text features in RE tasks: convolutional neural networks (CNN) [14], recursive neural networks (RNN) [15], and bidirectional long short-term memory (BiLSTM) [16]. Although supervised methods demonstrate superior performance on the RE task, they require a large amount of human-annotated training data, which is time-consuming and labor-intensive to obtain.
To solve the problem of human-labeled data, DS is first proposed by Mintz et al. [3], which can build large-scale data quickly and automatically by aligning plain texts with triples in a KB. However, it inevitably leads to the wrong label problem. Afterward, MIL [4]-[6] is introduced into RE, changing the RE task from the sentence level to the bag level. Riedel et al. [4] employ MIL and the at-least-one assumption in RE. Hoffmann et al. [5] present a probabilistic graphical model for distantly supervised RE, which can deal with overlapping relations. Surdeanu et al. [6] train a Bayesian framework by the expectation maximization (EM) algorithm to resolve the multi-instance multi-label problem for RE. Later, considering automatic feature engineering, researchers combine MIL and deep neural networks for distantly supervised RE. Zhou et al. [17] propose a piecewise convolutional neural network (PCNN) to select the sentence most likely to represent a bag for each entity pair. Lin et al. [18] adopt a sentence-level attention model to assign different weights to each sentence in the bag, which achieves outstanding performance in RE. Yuan et al. [19] propose a cross-relation cross-bag selective attention method to reduce the noisy data problem. Sahu et al. [20] build a labeled-edge graph convolutional neural network model on a document-level graph to capture local and non-local dependency information. In the study of MIL, the training and testing processes are performed at the bag level. However, these bags contain positive instances as well as noisy sentences, and the sentences mentioning the same entity pair may not describe the same relation. Therefore, previous studies on MIL are sensitive to bags and have difficulty handling sentence-level relation prediction.
In addition to the relation extraction methods above, RL has been proposed as an effective method to improve the performance of distantly supervised RE. Feng et al. [7] and Zeng et al. [21] select high-quality sentences to train relation classifiers, which can reduce noisy sentences and improve the accuracy of sentence-level relation prediction. Feng et al. [7] present an RL-based model. The model consists of an instance selector that filters noisy sentences with RL and a relation classifier that predicts sentence-level relations. Zeng et al. [21] use RL to guide the training of the relation extractor and follow the at-least-one assumption to obtain the delayed reward. Qin et al. [22] propose an RL framework to systematically learn a sentence-level false-positive indicator, whose goal is to automatically recognize false positives for each relation type without any supervised information. Afterward, an RL-based bag-level label denoising method is proposed by Sun et al. [23] to correct noisy labels by treating a new label as the gold label. They use a policy network to correct wrong labels and an extraction network to provide rewards for the corrected labels from the policy network. Another bag-level label denoising method without RL is introduced by Liu et al. [24]. They use soft labels to correct noisy labels during training. The soft label is a joint score function that combines the relation scores based on the entity-pair representation and the confidence of the hard label.
These existing methods mainly focus on reducing the impact of noisy sentences and bag-level noisy labels on relation prediction, without removing sentence-level noisy labels. Our model employs an RL algorithm to select reliable labels from the multiple relation labels generated by DS. The differences between our method and these methods are as follows: (1) Our method aims to solve the problem of sentence-level noisy labels. Feng et al. [7] and Zeng et al. [21] filter noisy sentences. Qin et al. [22] redistribute noisy sentences into the negative set. Sun et al. [23] and Liu et al. [24] correct bag-level noisy labels. In addition, our method and those of Feng et al. [7] and Zeng et al. [21] can handle the sentence-level RE task, while the other methods focus on the bag-level extraction task. Sentence-level RE is more widely used in other applications than bag-level RE, and bags easily contain noisy sentences. (2) Our label denoising method employs DQN, a value-based RL algorithm, to select the labels with the highest value for the sentences. In contrast, existing denoising methods use a simple linear function [24] or a policy function [7], [21]-[23]. (3) The rewards of our label denoiser come from the differences between the predicted relation scores of the data before and after the labels are changed by the relation extractor. The rewards intuitively reflect the changes in the relation predictions of the relation extractor. The rewards of Sun et al. [23] and Feng et al. [7] come directly from the prediction results of the relation extractor. The rewards of Qin et al. [22] are proportional to the differences between the F1 scores of adjacent epochs.

III. PRELIMINARIES
In this section, we introduce the concepts and notation used to describe the label denoising and RE tasks more clearly.

Definition 1 (Distantly Supervised Data):
When an entity pair (h, t) has multiple relations R = {r_1, r_2, ..., r_h} in the KB, the sentence s will be labeled with multiple relation labels by DS. The sentence associated with the two entities and labeled by multiple relations can be expressed as (h, R, t, s), where R = {r_1, r_2, ..., r_h}. For example, if there are two triples (donald_trump, president_of, united_states) and (donald_trump, place_of_birth, united_states) in the KB, then the sentence ''donald_trump was elected the 45th president of the united_states.'' is labeled with the relations president_of and place_of_birth, which is represented as (donald_trump, {president_of, place_of_birth}, united_states, donald_trump was elected the 45th president of the united_states.). The sentences associated with two entities and labeled by multiple relations constitute the distantly supervised data, denoted by

D = {(h_i, R_i, t_i, s_i) | i = 1, ..., N}, and the set of relation labels for all sentences is {R_1, R_2, ..., R_N}.
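As a minimal illustration of Definition 1 (a toy sketch with the example triples from the text, not the paper's actual data pipeline), the instances (h, R, t, s) can be built by grouping the KB relations that share an entity pair:

```python
from collections import defaultdict

# Build distantly supervised instances (h, R, t, s) from KB triples sharing
# the same entity pair. Entity and relation names are illustrative.
def build_instances(triples, sentence):
    by_pair = defaultdict(set)
    for h, r, t in triples:
        by_pair[(h, t)].add(r)
    return [(h, sorted(R), t, sentence) for (h, t), R in by_pair.items()]

triples = [("donald_trump", "president_of", "united_states"),
           ("donald_trump", "place_of_birth", "united_states")]
sent = "donald_trump was elected the 45th president of the united_states."
print(build_instances(triples, sent))
# one instance carrying both labels, although the sentence
# only expresses president_of
```

This is exactly the multiple-noisy-labels situation that the express-only-one assumption of Definition 2 is designed to resolve.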
Definition 2 (Express-Only-One Assumption): This assumption states that if an entity pair appears in a sentence, the entity pair can express only one relation in that sentence, even if the entity pair has multiple relations in the KB. For the example mentioned above, the sentence expresses the relation president_of, and place_of_birth is a noisy label. In this paper, we select only the relation most correctly expressed by the entity pair in the sentence. Thus, we choose the relation president_of as the correct label for this sentence.
Definition 3 (Label Denoising): Given a sentence associated with two entities and labeled by multiple relations (h, R, t, s), where the set R = {r_1, r_2, ..., r_h} is composed of multiple relations. Under the express-only-one assumption, label denoising aims to filter out the noisy labels in the relation set R and select the single relation r that is expressed by the sentence s.

Definition 4 (Relation Extraction): Given a sentence containing the target entity pair (h, t, s), where h is the head entity, t is the tail entity, and s is a sentence containing the entity pair (h, t), RE aims to predict the relation r for the entity pair (h, t) in the sentence s.

B. SENTENCE ENCODING WITH PCNN
PCNN aims to extract the word-level features and the structural information between the two entities of each sentence, and is commonly used as a sentence encoder in distantly supervised RE tasks [17], [18]. To this end, we use word embeddings to encode the words into low-dimensional vectors. In addition, PCNN requires position embeddings to specify the positions of the two entities in the sentence. As illustrated in Fig. 2, given the sentence s, we use word2vec (https://code.google.com/p/word2vec/) to obtain the d_w-dimensional word embeddings. Following previous works [17], [18], we use d_p-dimensional position embeddings, which are defined as low-dimensional vectors of the relative distances from the current word to the head or tail entity. The word embeddings and position embeddings are concatenated into an l × d encoding matrix, which is the input of PCNN. Similar to previous work [18], d = d_w + 2 * d_p, and l is a fixed value of 120. Then, PCNN uses convolution operations to flexibly extract the local features of the sentence. Finally, piecewise max pooling obtains the maximum value in each piece according to the positions of the target entity pair. The sentence encoding produced by PCNN is represented as V_pcnn.

IV. METHODOLOGY
Fig. 3 shows the overall framework of the proposed sentence-level label denoising in this paper. This framework contains two parts: a label denoiser and a relation extractor. According to the express-only-one assumption, the label denoiser uses a value-based RL algorithm to select a reliable label for each sentence containing multiple wrong labels. The selected labels and the original sentences form new data. Then, the relation extractor uses a typical neural network model to perform sentence-level relation prediction on the new data. The change in the prediction results intuitively reflects the quality of the labels selected by the label denoiser.
Thus, a reward based on the differences between the relation prediction scores is provided to the label denoiser. The label denoiser and the relation extractor are trained jointly to optimize the label denoising and extraction processes.
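As a concrete illustration of the piecewise max pooling used by the PCNN encoder described above (a minimal sketch for one convolution filter, not the full encoder), the feature map is split into three segments by the two entity positions and each segment is max-pooled separately:

```python
# Piecewise max pooling: split one filter's per-token responses into three
# pieces delimited by the head/tail entity positions, then max-pool each
# piece. Feature values below are toy numbers.
def piecewise_max_pool(feature_map, head_pos, tail_pos):
    left, right = sorted((head_pos, tail_pos))
    pieces = [feature_map[:left + 1],
              feature_map[left + 1:right + 1],
              feature_map[right + 1:]]
    # empty pieces (e.g. an entity at the sentence boundary) pool to 0.0
    return [max(p) if p else 0.0 for p in pieces]

fm = [0.1, 0.9, 0.3, 0.7, 0.2, 0.5]
print(piecewise_max_pool(fm, 1, 4))  # -> [0.9, 0.7, 0.5]
```

Compared with ordinary max pooling over the whole sentence, this preserves coarse structural information about what lies before, between, and after the entity pair.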

A. DQN-BASED RELATION DENOISER
Distantly supervised RE predicts the relations of entity pairs on automatically generated training data. However, the sentence mentioning the target entity pair may not directly determine which relations are expressed. Therefore, we design a value-based RL algorithm as a label denoiser to determine the correct labels of sentences. In the model, the label denoiser employs DQN, which acts as the agent in RL. The denoiser interacts with the environment containing the data and the relation extractor, and guides the removal of noisy labels step by step. More specifically, the denoiser selects the label with the largest score from the multiple relation labels as the gold label, and provides the selected data to the relation extractor. Then, the relation extractor performs relation prediction. The difference between the scores obtained by relation prediction is used as the reward of the denoiser. In addition, the delayed reward can be obtained from the relation extractor only when the selection of all the labels is finished.
It should be noted that the label denoiser here uses DQN, a value-based RL algorithm. DQN is suitable for small and discrete action spaces. Additionally, DQN can calculate the reward for each action and find the action with the highest value through the value function. Although the policy gradient is also a widely used RL method, it is not suitable for our situation: it is more efficient in continuous (or high-dimensional) action spaces. Besides, the policy gradient usually converges to a local optimum rather than a global optimum, and sometimes makes many inefficient attempts.
Next, we will introduce four fundamental components of the proposed RL method: state, action, reward, and objective function.

1) STATE
The prerequisite of RL is that the external environment can be modeled as a Markov decision process (MDP) [25]. However, the sentences in the training data for distantly supervised RE are independent of each other. That is to say, it is difficult to satisfy this prerequisite using only the information of the sentence being processed as the current state. Therefore, in addition to the current sentence information, the current state also includes information from previous states. We employ a continuous real-valued vector S_t to describe the state, which is encoded as a concatenation of the following vectors: (1) the current sentence encoding, which is obtained from PCNN; (2) the currently selected action, i.e., the one-hot vector of the chosen label; (3) the currently predicted scores from the relation extractor, which are calculated by Eq. (6); (4) the previous sentence encoding; (5) the previously selected action; (6) the previously predicted scores.
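The state construction above can be sketched as follows (a toy example with made-up dimensions; the real encodings and scores come from PCNN and the relation extractor):

```python
# Build the state S_t as a concatenation of the six components listed in
# the text. All vectors below are toy values with illustrative dimensions.
def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v

def build_state(cur_enc, cur_action, cur_scores,
                prev_enc, prev_action, prev_scores, n_relations):
    return (cur_enc + one_hot(cur_action, n_relations) + cur_scores +
            prev_enc + one_hot(prev_action, n_relations) + prev_scores)

state = build_state([0.2, 0.5], 1, [0.1, 0.9],   # current step
                    [0.3, 0.4], 0, [0.6, 0.4],   # previous step
                    n_relations=2)
print(len(state))  # -> 12
```

Including the previous step's encoding, action, and scores is what gives the otherwise independent sentences an MDP-like sequential structure.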

2) ACTION
Given a sentence associated with two entities and labeled by multiple relations (h, R, t, s), where R = {r_1, r_2, ..., r_h}, our task is to select the label r with the highest value from the multiple labels R for the sentence.
In the model, we use DQN as an agent, which determines the optimal action according to the value of the Q function. The final action is obtained by

a_t = argmax_a Q(S_t, a)    (1)

The Q function is defined as follows:

Q(S_t, a; Θ) = f(W S_t + b)    (2)

where S_t denotes the state, a represents the action, f is a non-linear function such as tanh or ReLU, W is the weight matrix, and b is the bias vector.
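A minimal sketch of this action selection, assuming a single-layer Q network with tanh activation and random toy weights (not the paper's trained parameters), restricted to the candidate labels DS attached to the sentence:

```python
import math, random

random.seed(0)

# One Q value per action: Q(S_t, a) = tanh(W S_t + b)[a]
def q_values(state, W, b):
    return [math.tanh(sum(w * s for w, s in zip(row, state)) + bi)
            for row, bi in zip(W, b)]

# The denoiser only chooses among the labels DS assigned to this sentence.
def select_action(state, W, b, candidate_labels):
    q = q_values(state, W, b)
    return max(candidate_labels, key=lambda a: q[a])

state = [0.2, -0.1, 0.4]
W = [[random.uniform(-1, 1) for _ in state] for _ in range(3)]
b = [0.0, 0.0, 0.0]
print(select_action(state, W, b, candidate_labels=[0, 2]))
```

Restricting the argmax to the candidate labels reflects the express-only-one assumption: the agent picks one of the DS-generated labels rather than an arbitrary relation.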
Given distantly supervised data consisting of sentences with correct and incorrect labels, we hope that the label denoiser can filter out the wrong labels through DQN. On the resulting cleaned data, RE can achieve better performance.

3) REWARD
Intuitively, the performance of the relation extractor can be improved when the label denoiser filters the wrong labels of the instances during the training process. The changes in the prediction results of the relation extractor reflect the quality of the selected labels. Therefore, a reward function based on the difference between relation prediction results is proposed. The reward is expressed as the difference between the prediction results before and after label selection:

R_m = p(y_t^m | x, Θ) − p(y_t^{m−1} | x, Θ̃),   if the selected label is unchanged    (3)
R_m = p(y_t^m | x, Θ) − max_y p(y | x, Θ̃),     if the selected label is changed

where x is an instance in the distantly supervised data and y represents a relation label of the sentence. p(y_t^m | x, Θ) denotes the prediction score of the relation label under the currently updated relation extractor, p(y_t^{m−1} | x, Θ̃) is the prediction score of the relation label under the previous relation extractor, and max_y p(y | x, Θ̃) is the maximum predicted score over all relations under the previous relation extractor. There are two cases of reward. In the first case, the selected label of the instance has not changed; that is, the label currently selected by the label denoiser is the same as the previous relation label. We use the difference between the extraction results as the reward. In the second case, the label has changed: the previous relation label is different from the currently selected label. The reward is then the difference between the prediction score of the currently selected label and the maximum score of the previous relation prediction. This means that at step m, the label denoiser receives a positive reward only if the score of the relation prediction improves. Otherwise, the label denoiser receives a negative reward.
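The two-case reward can be sketched as follows (all scores are toy, hypothetical values standing in for the extractor's predictions):

```python
# Difference-based reward: if the selected label is unchanged, compare its
# new and old scores; if it changed, compare the new label's score with the
# best previous prediction. Scores below are toy values.
def reward(cur_label, prev_label, cur_scores, prev_scores):
    if cur_label == prev_label:
        return cur_scores[cur_label] - prev_scores[prev_label]
    return cur_scores[cur_label] - max(prev_scores)

# label unchanged: score of label 1 rose from 0.5 to 0.7 -> positive reward
print(reward(1, 1, [0.3, 0.7], [0.5, 0.5]))
# label changed to 0, whose new score 0.3 is below the previous max 0.5
# -> negative reward
print(reward(0, 1, [0.3, 0.7], [0.5, 0.5]))
```

In both cases the sign of the reward tracks whether the label change (or retention) improved the extractor's confidence, which is exactly the signal the denoiser needs.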

4) OBJECTIVE FUNCTION
Q(S_t, a; Θ) represents the output of the Q network, and Q(S_t, a'; Θ̃) represents the output of the target network. Our goal is to make the network Q(S_t, a; Θ) continuously approximate the target network Q(S_t, a'; Θ̃). We can obtain y_t using the Bellman equation:

y_t = R_t + γ max_{a'} Q(S_{t+1}, a'; Θ̃)    (4)

where y_t is taken as the target value of Q(S_t, a; Θ). Then, the loss function of RL is defined by the squared error:

L(Θ) = (1/n) Σ_{t=1}^{n} (y_t − Q(S_t, a_t; Θ))²    (5)

where n is the total number of sentences in the distantly supervised RE data. Similar to previous works [18], we use stochastic gradient descent (SGD) to minimize this objective function.
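The target computation and squared-error loss above can be sketched with toy numbers (the Q values here are illustrative stand-ins for the target network's outputs):

```python
# DQN target: y_t = R_t + gamma * max_a' Q~(S_{t+1}, a'),
# where Q~ denotes the target network's values.
def bellman_target(r, gamma, next_q_values):
    return r + gamma * max(next_q_values)

# Squared-error loss averaged over n transitions.
def squared_loss(targets, q_selected):
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_selected)) / n

y = bellman_target(r=0.2, gamma=0.9, next_q_values=[0.1, 0.5, 0.3])
print(round(y, 3))  # -> 0.65
print(squared_loss([y, 0.0], [0.65, 0.1]))
```

Using a separate, slowly updated target network for y_t is the standard DQN stabilization trick; the loss then pulls Q(S_t, a_t; Θ) toward these fixed targets.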

B. SENTENCE-LEVEL RELATION EXTRACTOR
The relation extractor aims to obtain the prediction scores on the cleaned data from the label denoiser, and to feed the differences between scores back to the label denoiser. More specifically, the sentence encoder takes word embeddings and position embeddings as input and learns a sentence representation with a neural network. In this paper, the relation extractor uses the typical PCNN model as the sentence encoder. PCNN can capture the structural information between two entities in the sentence and has achieved satisfactory results in the RE task. After obtaining the PCNN-based sentence encoding, we use a non-linear layer to project the sentence encoding into the target space of the relation label classes, and provide it to a softmax classifier to predict the relation labels. Given a sentence containing the target entity pair (h, t, s), where h is the head entity, t is the tail entity, and s is a sentence containing the entity pair (h, t), the prediction score of each possible relation for the sentence is calculated as follows:

p(r_i | s, θ) = exp(o_i) / Σ_{j=1}^{c} exp(o_j)    (6)

where c denotes the total number of relations and o is the final output of the neural network, which is defined as follows:

o = M tanh(V_pcnn) + d    (7)

where tanh represents the hyperbolic tangent function, M is the transformation matrix, and d denotes a bias vector. Finally, we employ cross-entropy to define the objective function for sentence-level RE:

J(θ) = − Σ_{i=1}^{n} log p(r_i | s_i, θ)    (8)

where s_i denotes the i-th sentence in the distantly supervised data and r_i is the possible relation label of the sentence s_i.
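The softmax scoring and per-sentence cross-entropy term can be sketched as follows (the logits are toy values standing in for the network output o):

```python
import math

# Softmax over the c relation logits: p(r_i | s) = exp(o_i) / sum_j exp(o_j)
def softmax(o):
    m = max(o)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in o]
    z = sum(exps)
    return [e / z for e in exps]

# Per-sentence cross-entropy term: negative log-likelihood of the gold label
def cross_entropy(probs, gold_index):
    return -math.log(probs[gold_index])

p = softmax([2.0, 0.5, -1.0])       # toy logits for c = 3 relations
print([round(x, 3) for x in p])
print(round(cross_entropy(p, 0), 3))
```

Summing the cross-entropy terms over all sentences gives the objective of Eq. (8); the per-relation probabilities are also what the reward function in Eq. (3) compares across training steps.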

C. JOINT TRAINING
Because the label denoiser dynamically interacts with the environment containing the data and the relation extractor, we need to jointly train the label denoiser and the relation extractor. The complete joint training process is shown in Algorithm 1, which mainly includes parameter initialization and pre-training of the relation extractor, parameter initialization of the label denoiser, and joint training (lines 1-4). In this paper, the parameters of the relation extractor are initialized randomly, and then the extractor is pre-trained. If we also initialized the parameters of the label denoiser randomly and pre-trained the denoiser, it would consume more time and could even yield extremely poor results. Thus, the parameters of the label denoiser are initialized with the pre-trained parameters of the relation extractor. The relation extractor and label denoiser use SGD to minimize their objective functions (Eq. (5) and Eq. (8)). Algorithm 2 gives detailed information on the joint training process. In lines 3-5, the label denoiser filters the noisy labels of the training set and obtains the cleaned data. Then, in lines 7 and 8, the relation extractor performs relation prediction on the data with the new labels and provides the differences between the prediction results as rewards to the label denoiser. Afterward, the target value of the Q function is obtained in line 9. Finally, the parameters of all networks are updated in lines 11-15.
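The denoise-extract-reward cycle described above can be paraphrased as a high-level loop (toy stubs stand in for the paper's actual networks; entity, relation, and sentence names are illustrative):

```python
import random

random.seed(1)

RELATIONS = ("r1", "r2")

def denoise(labels):
    # label denoiser stub: select one label per sentence
    return random.choice(labels)

def train_extractor(cleaned):
    # relation extractor stub: toy prediction scores per sentence and relation
    return {s: {r: random.random() for r in RELATIONS}
            for (_h, _lab, _t, s) in cleaned}

def joint_train(data, epochs=2):
    prev_scores = None
    for _ in range(epochs):
        # step 1: the denoiser selects one label per instance
        cleaned = [(h, denoise(R), t, s) for (h, R, t, s) in data]
        # step 2: the extractor is (re)trained and produces scores
        scores = train_extractor(cleaned)
        # step 3: score differences become the denoiser's rewards
        # (fed back to update the DQN in the full model)
        if prev_scores is not None:
            rewards = [scores[s][lab] - prev_scores[s][lab]
                       for (_h, lab, _t, s) in cleaned]
        prev_scores = scores
    return cleaned

data = [("e1", ["r1", "r2"], "e2", "sentence one")]
print(joint_train(data))
```

The key design choice visible even in this sketch is that the two modules never need sentence-level gold labels: the extractor's score movements alone drive the denoiser's updates.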

V. EXPERIMENTS

A. DATASET AND SETTINGS

1) RIEDEL DATASET
We evaluate our model on the Riedel dataset [4], which is the most widely used dataset for the distantly supervised RE task. The statistics of the dataset are summarized in Table 1.

2) HUMAN-ANNOTATED TESTING DATASET
In order to better evaluate our model, we also employ a human-annotated testing dataset. This dataset contains 166,004 negative instances from the original Riedel testing dataset and 4,288 manually annotated positive instances [26]. Phi et al. [26] selected the 4,288 positive instances by using three annotators to check false-positive examples from the 5,863 positive instances of the Riedel testing dataset. As shown in the last row of Table 1, the human-annotated dataset we used includes 170,292 sentences, 96,678 entity pairs, and 1,419 relation facts.

3) PARAMETER SETTINGS
Similar to previous works, the word2vec tool is applied to train the word embeddings on the NYT corpus. We set the dimension of the word embeddings to 50 and that of the position embeddings to 5. The window size and the number of feature maps for CNN and PCNN are set to 3 and 230, and the number of hidden units for BiLSTM is 230. Moreover, the learning rate and the dropout probability are set to 0.001 and 0.5, respectively. The batch size is fixed at 300. The number of training episodes is set to 40. The weighting factor τ is 0.001.

4) BASELINES
To show the effectiveness of the proposed method on the Riedel dataset and the human-annotated dataset, we compare our method with six strong baselines. For a fair comparison with the baselines, we use their publicly released source code or the results from their reports.

a: TRADITIONAL FEATURE-BASED METHODS
• Mintz [3]: A multi-class logistic regression model, which is a traditional feature-based RE method.
• MultiR [5]: A probabilistic graphical model, which alleviates the problem of the overlapping relation on the distantly supervised RE.
• MIML [6]: A graphical model, which is to resolve the multi-instance multi-label problem for RE.

b: NEURAL NETWORK METHODS
• PCNN [17]: A neural network model that uses the PCNNs module to obtain the sentence embedding.
• PCNN+ATT [18]: One of the state-of-the-art neural network approaches in distantly supervised RE, which applies the PCNNs module to generate sentence encodings and uses sentence-level attention to reduce the weights of noisy sentences.
• PCNN+ATT+Policy [22]: To handle the false positive problem, it uses a policy-based RL method to redistribute false positives into the negative examples, with PCNN+ATT generating the bag encoding.
• PCNN+DQN: Our proposed sentence-level label denoising model, which uses the PCNNs module to obtain sentence representation.

B. EXPERIMENTAL RESULTS AND ANALYSIS 1) COMPARISON WITH BASELINES ON THE RIEDEL DATASET
We evaluate our proposed method by comparing it with three feature-based methods (Mintz, MultiR, and MIML) and three neural network methods (PCNN, PCNN+ATT, and PCNN+ATT+Policy) on the Riedel dataset. The precision-recall curves of these models are shown in Fig. 4. Clearly, we can observe that: (1) The neural network approaches [17], [18], [22] perform better than the traditional feature-based methods [3], [5], [6], which proves that neural network approaches can automatically extract useful information in RE tasks without human intervention.
(2) The curve of PCNN+ATT is significantly higher than that of PCNN over the entire range of recall, which illustrates sentence-level attention mechanism can effectively assign smaller weights to noisy instances.
(3) PCNN+ATT+Policy generally outperforms PCNN + ATT, and has an obvious improvement compared with PCNN. It demonstrates that a policy-based RL algorithm redistributing false-positive samples into a negative sample set can further improve the extraction performance.
(4) Compared with the six strong baselines, our proposed model PCNN+DQN achieves the best precision at the same recall in most regions of the curves. The results fully indicate that our proposed sentence-level label denoising model effectively removes noisy labels and improves extraction performance.

2) COMPARISON WITH NEURAL NETWORK METHODS WITH DIFFERENT SENTENCE ENCODERS ON THE RIEDEL DATASET
In order to verify the generalization ability of our proposed model with different sentence encoders, we replace the PCNNs module with CNNs and BiLSTMs, respectively. Fig. 5 shows the precision-recall curves of neural network methods with different sentence encoders (PCNN, CNN, and BiLSTM). We find that combining RL algorithms (policy gradient and DQN) can boost the performance of PCNN, CNN, and BiLSTM; in particular, our models PCNN/CNN/BiLSTM+DQN achieve the highest precision with the corresponding sentence encoders. However, for CNN+ATT, when the recall is greater than 0.2, its precision is higher than that of CNN+ATT+Policy, while significantly lower than that of our proposed CNN+DQN. Therefore, the proposed DQN-based sentence-level label denoising method can be effectively applied to different sentence encoder models, including CNN and BiLSTM.

3) COMPARISON WITH NEURAL NETWORK METHODS WITH DIFFERENT SENTENCE ENCODERS ON THE HUMAN-ANNOTATED DATASET
To verify the effectiveness and robustness of our models, we run the proposed models (PCNN/CNN/BiLSTM+DQN) and other neural network models with different sentence encoders (PCNN/CNN/BiLSTM, PCNN/CNN/BiLSTM+ATT, and PCNN/CNN/BiLSTM+ATT+Policy) on the human-annotated dataset. Because this dataset is labeled by human effort, its labels have higher reliability, and the results obtained by the extractor have higher credibility. The precision-recall curves of these methods are shown in Fig. 6. Our models achieve the highest precision with all three sentence encoders. The result proves that our extractor can achieve good performance even under the influence of noisy data. In addition, it fully demonstrates the effectiveness and robustness of our proposed methods.

4) TOP-RANKED PRECISION IN DIFFERENT RECALLS
The precision-recall curves of some methods with the same sentence encoder in Fig. 5 and Fig. 6 are difficult to distinguish. Thus, we report the top-ranked precision at recalls of 10.0%/20.0%/30.0%/40.0% and their mean values in Table 2, where the mean value is the average of the top-ranked precisions at these four recall levels. For each sentence encoder on the Riedel dataset and the human-annotated dataset, combining DQN achieves the best performance and remarkably improves the original sentence encoder. CNN+DQN obtains the best average precision among CNN-based sentence encoders, 16.0 and 13.4 percentage points higher than that of the original sentence encoder CNN on the Riedel dataset and human-annotated dataset, respectively. In addition, CNN+DQN has better performance than CNN+ATT and CNN+ATT+Policy. Similarly, BiLSTM+DQN also achieves the highest average precisions on both datasets, which are 9.6% and 15.0% higher than BiLSTM, 6.4% and 11.4% higher than BiLSTM+ATT, and 2.7% and 9.2% higher than BiLSTM+ATT+Policy. Likewise, our proposed PCNN+DQN performs better than the other three PCNN-based models and remains the best at all four recall levels and in average precision on both datasets. The results indicate the effectiveness and robustness of the proposed methods across different sentence encoders.
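The top-ranked precision at a fixed recall level can be read directly off a precision-recall curve. A minimal sketch of how Table 2 could be assembled from such a curve follows; the curve values here are illustrative, not the paper's numbers.

```python
# Sketch: top-ranked precision at fixed recall levels and their mean.

def precision_at_recall(precisions, recalls, target):
    """Return the precision at the first curve point where recall reaches target."""
    for p, r in zip(precisions, recalls):
        if r >= target:
            return p
    return 0.0  # the curve never reaches the target recall

# Illustrative curve points (precision, recall), highest confidence first.
precisions = [1.00, 0.90, 0.80, 0.70, 0.60]
recalls    = [0.08, 0.15, 0.25, 0.35, 0.45]

# The four recall levels used in Table 2, and their mean precision.
levels = [0.10, 0.20, 0.30, 0.40]
picked = [precision_at_recall(precisions, recalls, t) for t in levels]
mean_precision = sum(picked) / len(picked)
```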

5) TOP N PRECISION
Following previous works [18], [24], we also apply the top-N precision (P@N) metric to evaluate our proposed models and the six baselines. As shown in Table 3, we report the P@100, P@300, P@500, and their mean for each model, where the mean is the average of P@100, P@300, and P@500. We find that: (1) In P@100, P@300, and P@500, CNN+DQN, BiLSTM+DQN, and PCNN+DQN achieve the highest precision for the corresponding sentence encoders on the two datasets, respectively, which indicates that our DQN-based denoiser effectively improves the performance of RE. (2) For the Riedel dataset, CNN+DQN achieves the highest average precision among CNN-based sentence encoders, namely 14.3 points higher than CNN, 6.4 points higher than CNN+ATT, and 5.6 points higher than CNN+ATT+Policy. Similarly, BiLSTM+DQN improves on BiLSTM, BiLSTM+ATT, and BiLSTM+ATT+Policy by 14.9%, 14.6%, and 9.4% in mean precision, respectively. PCNN+DQN performs better than the models based on the other two sentence encoders and obtains the best average precision, 84.9%. (3) For the human-annotated dataset, the DQN-based methods achieve the best performance for the corresponding sentence encoders and significantly improve the original sentence encoders. Based on these results, we conclude that our sentence-level label denoiser shows effectiveness and robustness on both datasets.
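P@N simply measures the fraction of correct predictions among the N highest-confidence ones. A minimal sketch, with illustrative toy data and small N values in place of 100/300/500:

```python
# Sketch: the P@N metric and its mean over several N, as reported in Table 3.

def p_at_n(predictions, n):
    """Fraction of correct predictions among the top-n by confidence.

    predictions: list of (confidence, is_correct) pairs.
    """
    top = sorted(predictions, key=lambda x: -x[0])[:n]
    return sum(correct for _, correct in top) / n

# Illustrative toy predictions: six of them, four correct.
predictions = [(0.99, 1), (0.95, 1), (0.90, 0), (0.85, 1), (0.70, 0), (0.60, 1)]

p_at_2 = p_at_n(predictions, 2)   # both top-2 predictions are correct
mean_p = sum(p_at_n(predictions, n) for n in (2, 4, 6)) / 3
```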

6) CASE STUDY
In order to evaluate the quality of the labels selected by the label denoiser, we randomly select 200 instances from the Riedel dataset and check them manually. The results demonstrate that our model reaches a high label denoising accuracy of 91.5% (183/200), which proves that the sentence-level label denoiser can effectively select the correct labels. Table 4 shows some typical examples of label denoising with our proposed model. For each case, we show the two entities, the DS labels, the labels selected by the DQN-based denoiser, and the instances. Among the DS labels, the correct labels are emphasized in bold.
We find that the proposed method can recognize instances with wrong labels during the training process and successfully choose the correct labels. For example, the first instance is labeled with both the relations place_of_birth and nationality because the triples (luis_moreno, place_of_birth, panama) and (luis_moreno, nationality, panama) both exist in Freebase. However, this sentence does not express the relation place_of_birth. Similarly, the second sentence also fails to express the relation place_lived between marian_marsh and palm_desert. For these instances, the proposed sentence-level label denoiser can use the express-only-one assumption to select the correct labels automatically. In addition, our method has a strong ability to distinguish similar relations, such as the relations between people and locations (nationality, place_of_birth, place_lived, place_of_death).
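The selection step under the express-only-one assumption amounts to keeping the single candidate relation that the denoiser scores highest. The following sketch illustrates this idea only; the scoring function and its values are hypothetical stand-ins for the paper's DQN-based Q-values, not the actual model.

```python
# Sketch of express-only-one label selection: among the multiple relation
# labels DS attaches to one sentence, keep only the highest-scoring one.

def select_label(candidate_labels, score_fn):
    """Pick the single candidate label with the highest denoiser score."""
    return max(candidate_labels, key=score_fn)

# Hypothetical scores for the first case in Table 4: the sentence about
# luis_moreno expresses nationality, not place_of_birth.
fake_scores = {"place_of_birth": 0.12, "nationality": 0.81}
chosen = select_label(["place_of_birth", "nationality"], fake_scores.get)
```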
Although the sentence-level label denoiser achieves high accuracy in label selection for most cases, there are still some wrong selections. For instance, the correct relation contains in the last instance in Table 4 is wrongly recognized as administrative_divisions. In general, the label denoiser selects the same relation label for instances with similar sentence patterns in the DS dataset. During the training process of the label denoiser, the model may find that the common sentence pattern of the relation contains is ''in PLACE, PLACE'' [27]. In the fourth instance, the phrase ''in wellington, new_zealand'' matches this pattern, so the relation of the fourth instance can easily be identified as contains. However, the sentence pattern of the last instance is slightly different from the common pattern of contains, and the relative distance between the entity pair argentina and buenos_aires is large. In addition, since this sentence has complex semantic relations, it is difficult for typical models to extract the relation, and further reasoning is needed.
Due to the diversity of sentence patterns and the complexity of semantic relations, we think that a small number of incorrect selections are inevitable. In general, our proposed models can effectively remove noisy labels and improve the performance of the distantly supervised RE systems.

VI. CONCLUSION
In this paper, the express-only-one assumption is proposed, which states that a sentence associated with two entities and labeled with multiple relations has only one correct label. Based on this assumption, a new sentence-level label denoising model based on reinforcement learning (RL) is proposed for distantly supervised relation extraction (RE). This model has two modules: a label denoiser and a relation extractor. The task of the label denoiser is to select a reliable label from multiple relations during the training process and provide cleaned data for the relation extractor. The relation extractor aims to obtain the results for relation prediction and uses the difference in the results to guide the learning of the label denoiser. The two modules are trained jointly to optimize the label denoising and extraction processes. The experimental results show that the proposed model can effectively reduce sentence-level noisy labels and remarkably outperforms previous state-of-the-art baselines on both the Riedel dataset and the human-annotated dataset.