Improving Distantly-Supervised Relation Extraction through BERT-based Label & Instance Embeddings

Distantly-supervised relation extraction (RE) is an effective method to scale RE to large corpora but suffers from noisy labels. Existing approaches try to alleviate noise through multi-instance learning and by providing additional information, but manage to recognize mainly the most frequent relations, neglecting those in the long tail. We propose REDSandT (Relation Extraction with Distant Supervision and Transformers), a novel distantly-supervised transformer-based RE method that captures a wider set of relations through highly informative instance and label embeddings, by exploiting BERT's pre-trained model and the relationship between labels and entities, respectively. We guide REDSandT to focus solely on relational tokens by fine-tuning BERT on a structured input that includes the sub-tree connecting an entity pair and the entities' types. Using the extracted informative vectors, we shape label embeddings, which we also use as an attention mechanism over instances to further reduce noise. Finally, we represent sentences by concatenating relation and instance embeddings. Experiments on the NYT-10 dataset show that REDSandT captures a broader set of relations with higher confidence, achieving state-of-the-art AUC (0.424).


Introduction
Relation Extraction (RE) aims to detect semantic relationships between entity pairs in natural texts and has proven to be crucial in various natural language processing (NLP) applications, including question answering, and knowledge-base (KB) population.
Most RE methods follow a supervised approach, and the amount of labeled training data they require makes the whole process time- and labor-intensive. To automatically construct datasets for RE, Mintz et al. (2009) proposed to use distant supervision (DS) from a KB, assuming that if two entities exhibit a relationship in a KB, then all sentences mentioning these entities express this relation. Inevitably, this assumption generates false positives and leads distantly-created datasets to contain erroneous labels. To alleviate the wrong labeling problem, Riedel et al. (2010) relaxed this assumption so that it does not hold for all instances and, along with Hoffmann et al. (2011) and Surdeanu et al. (2012), proposed multi-instance learning. Under this setting, classification shifts from instance level to bag level, with a bag consisting of all instances that contain a specific entity pair.
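As an illustration of the bag-level setting, the following minimal sketch groups sentences by entity pair and labels each bag from a toy KB. The tuple format, the `build_bags` helper, the example sentences, and the use of "NA" for pairs missing from the KB are assumptions made for this sketch, not part of the described system.

```python
from collections import defaultdict

def build_bags(sentences, kb):
    """Group sentences into bags keyed by (head, tail) and label each bag via the KB.

    sentences: iterable of (head, tail, text) tuples (hypothetical format).
    kb: dict mapping (head, tail) -> relation; pairs absent from the KB get "NA".
    """
    bags = defaultdict(list)
    for head, tail, text in sentences:
        bags[(head, tail)].append(text)
    # Distant supervision: every sentence in a bag inherits the KB relation,
    # which is where the noisy labels come from.
    return {
        pair: {"relation": kb.get(pair, "NA"), "sentences": texts}
        for pair, texts in bags.items()
    }

sentences = [
    ("Barack Obama", "Hawaii", "Barack Obama was born in Hawaii."),
    ("Barack Obama", "Hawaii", "Barack Obama visited Hawaii last week."),
    ("Paris", "France", "Paris is the capital of France."),
]
kb = {("Barack Obama", "Hawaii"): "/people/person/place_of_birth"}
bags = build_bags(sentences, kb)
```

The second sentence of the first bag does not express the birth-place relation, yet it receives that label; this is the kind of noise that bag-level classification is meant to absorb.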
The study of the above approaches led us to the following core observations. First, among all models used in the literature, a pre-trained transformer-based language model (LM) can help in recognizing a broader set of relations, albeit at the expense of time and computational resources. Second, the relationship between label and entities can carry valuable information but is rarely used, with external knowledge typically preferred. Driven by these observations, we were inspired to develop a novel transformer-based model that can efficiently capture instance and label embeddings with lower complexity, so as to drive RE toward recognizing a broader set of relations.
We propose REDSandT (Relation Extraction with Distant Supervision and Transformers), a novel transformer-based RE model for distant supervision. To handle the problem of noisy instances, we guide REDSandT to focus solely on relational tokens by fine-tuning BERT on a structured input that includes the sub-tree connecting an entity pair (STP) and the entities' types. The input's RE-specific formation, along with BERT's knowledge from unsupervised pre-training, results in REDSandT generating informative vectors. Using these vectors, we shape relation embeddings representing the entities' distance in vector space. Relation embeddings are then used as relation-wise attention over the instance representation to reduce the effect of less-informative tokens. Finally, REDSandT encodes sentences by concatenating relation and weighted-instance embeddings, with relation classification occurring at bag level as a weighted sum over the predictions of the bag's sentences.
We chose BERT over other transformer-based models because it considers bidirectionality while training. We assume that this characteristic is important for efficiently capturing entities' interactions without requiring an additional task that significantly increases complexity (i.e., fine-tuning an auxiliary objective in GPT (Alt et al., 2019)).
The main contributions of this paper can be summarized as follows:
• We extend BERT to handle multi-instance learning, so that the model is fine-tuned directly in a DS setting and error accumulation is reduced.
• Relation embeddings captured through BERT fine-tuned on our RE-specific input help to recognize a wider set of relations, including relations in the long tail.
• Compressing the input sentence to its relational tokens through STP encoding allows us to capture informative instance embeddings while keeping complexity low enough to train our model on modest hardware.

REDSandT
Given a bag of sentences $\{s_1, s_2, \ldots, s_n\}$ that concern a specific entity pair, REDSandT generates a probability distribution over the set of possible relations. REDSandT utilizes the BERT pre-trained LM to capture the semantic and syntactic features of sentences by transferring pre-trained common-sense knowledge. We extend BERT to handle multi-instance learning, and we fine-tune the model to classify the relation linking the entity pair given the associated sentences.
During fine-tuning, we employ a structured, RE-specific input to minimize architectural changes to the model (Radford and Salimans, 2018). Each sentence is adapted to a structured text that includes the sentence's tokens connecting the entity pair (STP) along with the entities' types. We transform the input into a (sub-)word-level distributed representation using BPE and positional embeddings from BERT fine-tuned on our corpora. Then, we form the final sentence representation by concatenating the relation embedding and the sentence representation weighted with the relation embedding. Lastly, we use attention over the bag's sentences to shape the bag representation, which is then fed to a softmax layer to get the bag's relation distribution.
REDSandT can be summarized in three components, namely sentence encoder, bag encoder, and model training. Each component is described in detail in the following sections, with the overall architecture shown in Figures 1 and 2.

Sentence Encoder
Given a sentence x and an entity pair $(h, t)$, REDSandT constructs a distributed representation of the sentence by concatenating relation and instance embeddings. The overall sentence encoding is depicted in Figure 1; the following sections examine the parts of the sentence encoder in a bottom-up way.

Input Representation
Relation extraction requires a structured input that can sufficiently capture the latent relation between an entity pair and its surrounding text. Our input representation encodes each sentence as a sequence of tokens, depicted at the very bottom of Figure 1.
It starts with the head entity type and token(s) followed by the delimiter [H-SEP], continues with the tail entity type and token(s) followed by the delimiter [T-SEP], and ends with the token sequence of the sentence's STP path. The whole input starts and ends with the special delimiters [CLS] and [SEP], respectively. In BERT, [CLS] typically acts as a pooling token representing the whole sequence for downstream tasks, such as RE. Several other sentence encodings were attempted, with the presented one performing the best. Moreover, the ablation studies in section 4.2 reveal the importance of encoding the entities' types and of compressing the original sentence to the STP path presented below. Below, we briefly present how we form the sub-tree parse of the input and the entity types.
Sub-tree parse of input sentence: We utilize the sub-tree parse (STP) of the input sentence in order to reduce the noisy words within the sentence and focus on the relational tokens. Precisely, STP preserves the path of the sentence that connects the two entities with the parent of their least common ancestor (LCA). In contrast to other implementations (Liu et al., 2018), which shape the final STP sequence by re-assigning the participating tokens to their original sequence order, we preserve the tokens' order within the STP, achieving a grammatical normalization of the original sentence.
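To make the sub-tree idea concrete, here is a minimal sketch that extracts the path between two entity tokens through their least common ancestor (LCA) and adds the LCA's parent. The parent-array encoding of the dependency tree, the single-token entities, the toy parse, and in particular the position at which the LCA's parent is placed (prepended) are assumptions of this sketch; the text only states that the path order is preserved.

```python
def ancestors(heads, i):
    """Return [i, parent(i), ..., root] for a tree stored as a parent array (-1 = root)."""
    path = [i]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path

def stp_indices(heads, head_idx, tail_idx):
    """Token indices on the head -> LCA -> tail path, plus the LCA's parent if any."""
    up_h = ancestors(heads, head_idx)
    up_t = ancestors(heads, tail_idx)
    tail_ancestors = set(up_t)
    # First node on the head's upward path that also dominates the tail.
    lca = next(n for n in up_h if n in tail_ancestors)
    head_side = up_h[:up_h.index(lca)]        # head entity up to (excluding) the LCA
    tail_side = up_t[:up_t.index(lca)][::-1]  # from below the LCA down to the tail entity
    path = head_side + [lca] + tail_side
    parent = heads[lca]
    # Assumption: the LCA's parent is placed in front of the path.
    return path if parent == -1 else [parent] + path

# Hypothetical parse of "The company said Jobs founded Apple in 1976".
tokens = ["The", "company", "said", "Jobs", "founded", "Apple", "in", "1976"]
heads = [1, 2, -1, 4, 2, 4, 4, 6]
stp_tokens = [tokens[i] for i in stp_indices(heads, head_idx=3, tail_idx=5)]
```

Here the LCA of "Jobs" and "Apple" is "founded", whose parent is "said", so tokens such as "The", "company", "in", and "1976" are dropped.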
Entity Type special tokens: To the extent that every relation puts some constraint on the types of the participating entities (Liu et al., 2014; Vashishth et al., 2018), we incorporate the entity type in the model's structured input (see bottom of Figure 1). Precisely, we incorporate 18 generic entity types, obtained by recognizing the entities of the NYT-10 sentences with the spaCy model. We consider these types KB-independent and easily accessible, and our experiments in section 4.2 indicate that their inclusion improves performance.
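The structured input described in this section can be sketched as a simple token-list assembly. The function name, the bracketed spelling of the entity-type tokens (`[PERSON]`, `[ORG]`), and the example sentence are hypothetical; only the ordering of the parts and the delimiters [CLS], [H-SEP], [T-SEP], and [SEP] come from the text.

```python
def build_structured_input(head_type, head_tokens, tail_type, tail_tokens, stp_tokens):
    """Assemble: [CLS] head-type head [H-SEP] tail-type tail [T-SEP] STP tokens [SEP]."""
    return (
        ["[CLS]", head_type] + head_tokens + ["[H-SEP]"]
        + [tail_type] + tail_tokens + ["[T-SEP]"]
        + stp_tokens + ["[SEP]"]
    )

structured = build_structured_input(
    head_type="[PERSON]",          # hypothetical spelling of an entity-type token
    head_tokens=["Jobs"],
    tail_type="[ORG]",             # hypothetical spelling of an entity-type token
    tail_tokens=["Apple"],
    stp_tokens=["said", "Jobs", "founded", "Apple"],
)
```

Keeping the entity block separate from the STP block means the model always finds the entity types and mentions at predictable positions, regardless of sentence length.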

Input Embeddings
The input embedding $h_0$ to BERT is created by summing over the positional and byte pair embeddings for each token in the structured input.
Byte-pair tokens encoding: To make use of sub-word information, we tokenize the input using byte-pair encoding (BPE) (Sennrich et al., 2016). In particular, we use the tokenizer from the pre-trained model (30,000 tokens), which we extend with 20 task-specific tokens (e.g., [H-SEP], [T-SEP], and the 18 entity type tokens). The added tokens serve a special meaning in the input representation and thus are not split into sub-words by the tokenizer.
Positional encoding: Positional encoding is an essential part of BERT's attention mechanism. Precisely, BERT learns a unique position embedding to represent each of the input (sub-word) token positions within the sequence.
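The summation of token and position embeddings can be sketched in plain Python as follows. The random tables stand in for learned parameters, the toy dimensions and the token ids are invented for illustration, and placing the 20 added tokens after the 30,000 pre-trained ones is an assumption of this sketch.

```python
import random

def make_table(rows, dim, rng):
    """Random stand-in for a learned embedding table of shape (rows, dim)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]

def input_embeddings(token_ids, token_table, position_table):
    """h_0[t] = token_embedding[token_ids[t]] + position_embedding[t]."""
    return [
        [tok + pos for tok, pos in zip(token_table[tid], position_table[t])]
        for t, tid in enumerate(token_ids)
    ]

rng = random.Random(0)
base_vocab, added_tokens = 30_000, 20   # vocabulary sizes stated in the text
dim, max_len = 8, 16                    # toy sizes chosen for this sketch
token_table = make_table(base_vocab + added_tokens, dim, rng)
position_table = make_table(max_len, dim, rng)

# Hypothetical ids: the added special tokens occupy the rows after the base vocabulary.
token_ids = [30_000, 17, 30_001, 5]
h0 = input_embeddings(token_ids, token_table, position_table)
```

Extending the vocabulary only adds rows to the token table, so the pre-trained rows are left untouched and the new rows are learned during fine-tuning.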

Sentence Representation
The input sequence is transformed into feature vectors ($h_L$) using BERT's pre-trained language model, fine-tuned on our task. In contrast to the common practice of representing the sentence by the [CLS] vector in $h_L$ (Alt et al., 2019), we argue that not all words contribute equally to the sentence representation.
By encoding the underlying relation as a function of the examined entities and by giving attention to the vectors related to this underlying relation, we can further reduce sentence noise and improve precision. The core modules are the relation embedding, the entities-wise attention, and the relation attention. We examine them below.
Relation Embedding: We formulate relation embeddings using the TransE model (Bordes et al., 2013). Assuming that a relation r holds between an entity pair (h, t), TransE regards the embedding of the underlying relation l as the distance (difference) between the h and t embeddings ($l_i = t_i - h_i$). We then shape the relation embedding for each sentence i by applying a linear transformation on the head and tail entity vectors, activated through a Tanh layer to capture possible nonlinearities:
$$l_i = \mathrm{Tanh}(w_l (t_i - h_i) + b_l),$$
where $w_l$ is the underlying relation weight matrix and $b_l \in \mathbb{R}^{d_t}$ is the bias vector. We mark the relation embedding as l because it represents the possible underlying relation between the two entities and not the actual relationship r. Head $h_i$ and tail $t_i$ embeddings reflect only the entities' related tokens, which we capture through simple entities-wise attention, shown below.
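A minimal sketch of the relation embedding described above, assuming it is computed as a linear map of the tail-minus-head difference followed by Tanh. The function name, the identity weight matrix, and the toy vectors are illustrative choices, not values from the experiments.

```python
import math

def relation_embedding(head_vec, tail_vec, weight, bias):
    """l = tanh(W (t - h) + b), with W given as a list of rows."""
    diff = [t - h for h, t in zip(head_vec, tail_vec)]
    return [
        math.tanh(sum(w * d for w, d in zip(row, diff)) + b)
        for row, b in zip(weight, bias)
    ]

head = [0.2, -0.1, 0.4]   # toy head-entity embedding
tail = [0.5, 0.3, 0.4]    # toy tail-entity embedding
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity, for readability
b = [0.0, 0.0, 0.0]
l = relation_embedding(head, tail, W, b)
```

With an identity weight matrix and zero bias the result is simply the element-wise Tanh of the TransE difference, which makes the link to $l_i = t_i - h_i$ easy to see; a learned matrix lets the model rescale and mix those dimensions.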
Entities-wise Attention: Head and tail embeddings participating in the relation embedding are created by summing over the respective token vectors from BERT's last layer $h_L$. We capture these tokens through head-wise and tail-wise attention. Head-wise attention assigns the weight $\alpha^h_{it}$ to focus on head-related tokens, and tail-wise attention assigns the weight $\alpha^t_{it}$ to focus on tail-related tokens. Head $h_i$ and tail $t_i$ embeddings are then shaped as follows:
$$h_i = \sum_{t} \alpha^h_{it} h_{it}, \qquad t_i = \sum_{t} \alpha^t_{it} h_{it},$$
where $h_{it}$ is the vector of token t of sentence i in $h_L$.
Relation Attention: Even though REDSandT is trained on the STP, which naturally preserves only relational tokens, we wanted to further reduce any remaining noise at the sentence level. For this reason, we use relation attention to emphasize the sentence tokens that are most related to the underlying relation $l_i$. We calculate the relation attention $\alpha_r$ by comparing each sentence representation against the learned representation $l_i$ for each sentence i. Then, we weight BERT's last hidden layer $h_L \in \mathbb{R}^{d_h}$ with the relation attention, obtaining the weighted hidden representation $h'_L$. Finally, the sentence representation $s_i \in \mathbb{R}^{2 d_h}$ is computed as the concatenation of the relation embedding $l_i$ and the sentence's weighted hidden representation $h'_L$:
$$s_i = [l_i ; h'_L].$$
Several other representation techniques were tested, with the presented method performing the best.
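The three steps of this section (entity-wise sums, relation attention, concatenation) can be sketched together as below. Several details are assumptions of this sketch rather than statements from the text: the entity-wise weights are taken to be 0/1 masks, the relation attention is taken to be a softmax over dot products between token vectors and the relation embedding, and the relation embedding is simplified to the plain tail-minus-head difference.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(vectors, weights):
    """Sum of vectors, each scaled by its weight."""
    dim = len(vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, vectors)) for k in range(dim)]

def sentence_representation(token_vecs, rel_emb):
    """Concatenate the relation embedding with the relation-attended token sum."""
    alpha = softmax([dot(vec, rel_emb) for vec in token_vecs])  # assumed attention form
    return rel_emb + weighted_sum(token_vecs, alpha)            # list concatenation

# Toy last-layer token vectors for a three-token sentence.
token_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head = weighted_sum(token_vecs, [1, 0, 0])   # assumed 0/1 mask over head tokens
tail = weighted_sum(token_vecs, [0, 0, 1])   # assumed 0/1 mask over tail tokens
rel_emb = [t - h for h, t in zip(head, tail)]  # simplified: no linear map or Tanh
s = sentence_representation(token_vecs, rel_emb)
```

The output has twice the hidden size: the first half carries the relation embedding and the second half the token vectors re-weighted toward that relation.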

Bag Encoder
Bag encoding, i.e., the aggregation of sentence representations in a bag, aims to reduce the noise generated by the erroneously annotated relations that accompany DS. Assuming that not all sentences contribute equally to the bag representation, we use selective attention (Lin et al., 2016) to emphasize the sentences that better express the underlying relation.
$$B = \sum_{i} \alpha_i s_i$$
As seen, selective attention represents the bag as a weighted sum of the individual sentences. Attention $\alpha_i$ is calculated by comparing each sentence representation against a learned representation r:
$$\alpha_i = \frac{\exp(s_i r)}{\sum_{j=1}^{n} \exp(s_j r)}$$
Finally, the bag representation B is fed to a softmax classifier to obtain the probability distribution over the relations:
$$p(l) = \mathrm{Softmax}(W_r B + b_r),$$
where $W_r$ is the relation weight matrix and $b_r \in \mathbb{R}^{d_r}$ is the bias vector.
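A minimal sketch of the bag encoder, assuming the attention weights are a softmax over dot products between each sentence representation and the learned query r, as in common selective-attention implementations. The function names and all numeric values are invented for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bag_probabilities(sentence_vecs, query, W, b):
    """Selective attention over sentences, then a softmax classifier on the bag vector."""
    alpha = softmax([dot(s, query) for s in sentence_vecs])
    dim = len(sentence_vecs[0])
    bag = [sum(a * s[k] for a, s in zip(alpha, sentence_vecs)) for k in range(dim)]
    logits = [dot(row, bag) + bias for row, bias in zip(W, b)]
    return softmax(logits), alpha

sentence_vecs = [[2.0, 0.0], [0.0, 1.0]]       # toy sentence representations s_i
query = [1.0, 0.0]                             # toy learned representation r
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]     # three toy relations
b = [0.0, 0.0, 0.0]
probs, alpha = bag_probabilities(sentence_vecs, query, W, b)
```

The first sentence aligns better with the query, so it dominates the bag vector; a mislabeled sentence in the same bag would be down-weighted in the same way.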

Training
REDSandT utilizes a transformer model, precisely BERT, which is fine-tuned on our specific setup to capture the semantic features of relational sentences. Below, we present the overall process.

Model Pre-training
For our experiments, we use the pre-trained bert-base-cased language model (Devlin et al., 2018), which consists of 12 layers, 12 attention heads, and 110M parameters, with each layer being a bidirectional Transformer encoder (Vaswani et al., 2017). The model is trained on cased English text from BooksCorpus and Wikipedia, with a total of 800M and 2,500M words, respectively. BERT is pre-trained using two unsupervised tasks, masked LM and next sentence prediction, with masked LM being its core novelty, as it allows the previously impossible bidirectional training.

Model Fine-tuning
We initialize the weights of REDSandT with the pre-trained BERT model and fine-tune its last four layers under the multi-instance learning setting presented in Figure 2, given the specific input shown in Figure 1. We settled on fine-tuning only the last four layers after experimentation. During fine-tuning, we optimize the following objective:
$$L(D) = \sum_{i=1}^{|B|} \log P(r_i \mid B_i; \theta),$$
where, for all $|B|$ entity-pair bags in the dataset, we want to maximize the probability of correctly predicting the relation $r_i$ of bag $B_i$ given its sentences' representation and the parameters $\theta$.
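Maximizing the log-probability of each bag's relation, as described above, is equivalent to minimizing a summed negative log-likelihood. The sketch below computes that loss for already-predicted bag distributions; the function name and the toy probabilities are assumptions for illustration.

```python
import math

def negative_log_likelihood(bag_probs, gold_relations):
    """Negative sum over bags of log P(gold relation | bag)."""
    return -sum(math.log(probs[gold]) for probs, gold in zip(bag_probs, gold_relations))

# Toy predicted distributions for two bags over three relations.
bag_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
gold_relations = [0, 1]   # index of the distantly-supervised label of each bag
loss = negative_log_likelihood(bag_probs, gold_relations)
```

In practice this quantity would be computed per mini-batch of bags and back-propagated only into the layers that are being fine-tuned.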

Experimental Setup

Dataset
We conduct experiments on the widely used benchmark dataset NYT-10 (Riedel et al., 2010), which was built by aligning triples in Freebase to the NYT corpus and contains 53 relations. There are 522,611 (172,448) sentences, 281,270 (96,678) entity pairs, and 18,252 (1,950) relation mentions in the train (test) set. We provide an enhanced dataset, NYT-10enhanced, including both the STP and the shortest dependency path (SDP) versions of the input sentences, as well as the head and tail entity types, to facilitate future implementations.

Hyper-parameter Settings
Table 1: AUC and P@N evaluation results. P@N represents precision calculated for the top N rated relation instances. Columns: RE methods, AUC, P@100, P@300, P@500.

Our model follows a steady, downward trend, acting similarly to RESIDE at low and medium recalls and surpassing all baselines at very high recall values. We believe the reason is that we use potential label information both as an additional feature and as attention over the instance tokens. The learned label embeddings are of high quality, since they carry common knowledge from the pre-trained model fine-tuned on the specific dataset and task. Moreover, the chosen pre-trained model, BERT, considers bidirectionality while training and is thus able to efficiently capture head and tail interaction.
Table 1, which presents AUC and precision at various points of the P-R curve, shows our model's precision to lie between that of RESIDE and DISTRE while preserving the state-of-the-art AUC. Precisely, REDSandT's precision does not exceed that of RESIDE, even though it is close, which suggests that additional side information would improve our model. Meanwhile, REDSandT surpasses DISTRE's precision, which we attribute to our selected pre-trained model that efficiently captures label embeddings. Consequently, our model is more consistent across the various points of the P-R curve.
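For readers unfamiliar with these metrics, the sketch below computes P@N and the area under a precision-recall curve from predictions ranked by confidence. The helper names, the toy ranking, the trapezoidal integration, and the choice to start the curve at recall 0 with the first observed precision are assumptions of this sketch and may differ from the evaluation scripts behind the reported numbers.

```python
def precision_at_n(ranked_correct, n):
    """Fraction of correct predictions among the top n (n >= 1)."""
    top = ranked_correct[:n]
    return sum(top) / len(top)

def pr_curve(ranked_correct, total_positives):
    """(recall, precision) after each prediction, in descending confidence order."""
    points, hits = [], 0
    for k, correct in enumerate(ranked_correct, start=1):
        hits += correct
        points.append((hits / total_positives, hits / k))
    return points

def auc(points):
    """Trapezoidal area under the curve, starting at recall 0."""
    area, prev_r, prev_p = 0.0, 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area

# 1 = correct prediction, 0 = wrong; sorted by descending model confidence.
ranked_correct = [1, 1, 0, 1, 0]
p_at_2 = precision_at_n(ranked_correct, 2)
p_at_5 = precision_at_n(ranked_correct, 5)
area = auc(pr_curve(ranked_correct, total_positives=4))
```

A model with high P@100 but a steep decline afterwards can still have a low AUC, which is why both kinds of numbers are reported side by side.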
Table 2 shows the distribution over relation types for the top 300 predictions of REDSandT and the baseline models. REDSandT encompasses 10 distinct relation types, two of which (place founded, /geographic distribution) are not recognized by any of the other models. PCNN+ATT predictions are highly biased towards a set of only four relation types, while RESIDE captures three additional types. DISTRE and REDSandT manage to recognize more types than all other models, emphasizing the contribution of transfer knowledge. Moreover, REDSandT rightly does not predict the /location/country/capital relation that DISTRE does; the DISTRE authors found in manual evaluation that most errors arise from this predicted relation. Meanwhile, we highlight REDSandT's effectiveness in recognizing relations in the long tail. In particular, our model captures the founders (1.47%), neighborhood of (1.06%), person/children (0.47%), and sports team/location (0.16%) relations. Relations are listed in descending order of their population in the test set, with the respective percentage given in parentheses.

Ablation Studies
To assess the effectiveness of the different modules of REDSandT, we create four ablation models, which respectively remove the relation embeddings, remove the entity types from the input, remove the relation attention over instance tokens, and replace STP with the shortest dependency path (SDP) (Xu et al., 2015) in sentence encoding ("REDSandT w. SDP").
As shown in Table 3, all modules contribute to the final model's effectiveness. The greatest impact comes from the relation embeddings, whose removal results in the highest drop in AUC (2 units) and P@300 (5.3%). Meanwhile, P@100 goes up to 80%, with an inspection of the top 300 predictions revealing a focus on only 5 relation types, of which /location/contains makes up 79%. The simple integration of entity types in the input representation is the next most important feature that boosts our model. Next, "REDSandT w. SDP" shows STP's superiority, while a manual inspection of the model's top 300 predictions shows SDP's weakness in recognizing relations in the long tail, with focus given to the /person/nationality relation. Finally, removing the relation attention over instance tokens has the least effect on AUC (0.002) and precision (∼2%). Meanwhile, we notice that this model focuses solely on 8 relation types in the top 300 predictions.
Figure 4 shows a visualization of the relation attention weights, highlighting the different parts of the sentence that drive relation extraction, for two long-tail relations. In both cases, we see that the special tokens preserve important information, and that the entity type is given more weight than the entity itself. Moreover, we see which tokens affect the relation the most. The tokens "girlfriend", "son", and the repetition of the name "James" are predictive of the "children" relation, while the tokens "neighborhood", "was", "in", along with a GPE entity type, indicate a probable "neighborhood of" relation.

Related Work
Our work is related to distant supervision, neural relation extraction (mainly pre-trained LMs), sub-tree parse of input, label embedding, and entity type side information.
Distant Supervision: DS plays a key role in RE, as it satisfies its need for extensive training data easily and inexpensively. The use of DS (Craven and Kumlien, 1999; Snow et al., 2005) to generate large training data for RE was proposed by Mintz et al. (2009), who assumed that all sentences that include an entity pair which exhibits a relationship in a KB express the same relation. However, this assumption comes with noisy labels, especially when the KB is not directly related to the domain at hand. Multi-instance learning methods were proposed to alleviate the issue by conducting relation classification at the bag level, with a bag including the instances that mention the same entity pair (Riedel et al., 2010; Hoffmann et al., 2011).
Neural Relation Extraction: While the performance of the above approaches relies heavily on handcrafted features (POS tags, named entity tags, morphological features, etc.), the advent of neural networks in RE set the focus on model architecture. Zeng et al. (2014) propose a CNN-based method to automatically capture the semantics of sentences, while PCNN (Zeng et al., 2015) became the common architecture to embed sentences. PCNN is used in several approaches that handle DS noisy patterns, such as intra-bag attention (Lin et al., 2016), inter-bag attention (Ye and Ling, 2019), soft labeling (Liu et al., 2017; Wang et al., 2018), and adversarial training (Wu et al., 2018; Qin et al., 2018). Moreover, Graph-CNNs proved an effective way to encode syntactic information from text (Vashishth et al., 2018).
The latest development of pre-trained LMs relying on the transformer architecture (Vaswani et al., 2017) has been shown to capture semantic and syntactic features better (Radford and Salimans, 2018). Howard and Ruder (2018) found that they significantly improve text classification performance, prevent overfitting, and increase sample efficiency. Shi and Lin (2019) fine-tuned BERT (Devlin et al., 2018) on the TACRED dataset, showing that simple NNs built on top of BERT improve performance. Meanwhile, Alt et al. (2019) extended GPT (Radford and Salimans, 2018) to the DS setting by incorporating a multi-instance training mechanism, proving that pre-trained LMs provide a stronger signal for DS than specific linguistic and side-information features (Vashishth et al., 2018).
Side information: Apart from model architecture, several methods propose additional information to further reduce noise. Vashishth et al. (2018) use relation phrases and incorporate Freebase entity types, achieving state-of-the-art precision at higher recall values, while Ji (2017) and Hu et al. (2019) use entity descriptors to enhance entity and label embeddings, respectively.
Sub-Parses of Input: Xu et al. (2015) showed the importance of the shortest dependency path (SDP) in reducing words irrelevant to RE. Liu et al. (2018) further reduce the noise within sentences by preserving the sub-path of the sentence that connects the two entities with their least common ancestor's parent (STP). In contrast with Liu et al. (2018), who shape the final STP sequence by re-assigning the participating tokens to their original sequence order, we preserve the tokens' order within the STP to maintain the emerging grammar information.
Label Embedding: Label embeddings aim to embed labels in the same space as word vectors. The idea comes from computer vision, with Wang et al. (2018) introducing them to text classification and Hu et al. (2019) using them as an attention mechanism over relational tokens in distantly-supervised RE. We make use of the TransE (Bordes et al., 2013) model to shape label embeddings as the entities' distance in BERT's vector space, and we show that their use both as a feature and as attention over sentences significantly improves RE.

Conclusion
We presented a novel transformer-based relation extraction model for distant supervision. REDSandT manages to acquire highly informative instance and label embeddings and is efficient at handling the noisy labeling problem of DS. REDSandT captures highly informative embeddings for RE by fine-tuning BERT on an RE-specific structured input that focuses solely on relational arguments, including the sub-tree connecting the entities along with the entities' types. Then, it utilizes these vectors to encode label embeddings, which are also used as an attention mechanism over instances to reduce the effect of less-informative tokens. Finally, relation extraction occurs at bag level by concatenating label and weighted instance embeddings. Extensive experiments on the NYT-10 dataset illustrate REDSandT's effectiveness over existing baselines in the current literature. Precisely, REDSandT manages to recognize relations that other methods fail to detect, including relations in the long tail. Future work includes investigating whether, and to what extent, additional information such as entity descriptors influences REDSandT's performance, as well as whether the special token embeddings can act as global embeddings for RE.

Figure 1: Sentence Representation in REDSandT. The input embedding $h_0$ to BERT is created by summing over the positional and byte pair embeddings for each token in the structured input. States $h_t$ are obtained by self-attending over the states of the previous layer $h_{t-1}$. The final sentence representation is obtained by concatenating the relation embedding $r_{ht}$ and the final fine-tuned BERT layer $h_L$ weighted with the relation attention $\alpha_r$. Head and tail tokens participating in the relation embedding formation are marked with bold and dashed lines, respectively. The structured input starts and ends with the special delimiters [CLS] and [SEP], respectively.

Figure 2: Transformer architecture (left) and training framework (right). Sentence representation $s_i$ is formed as shown in Figure 1.

State-of-the-art Models
To evaluate REDSandT, we compare against the following state-of-the-art models:
• Mintz (Mintz et al., 2009): A multi-class logistic regression model under the distant supervision setting.
• PCNN+ATT (Lin et al., 2016): A CNN model with instance-level attention.
• RESIDE (Vashishth et al., 2018): A NN model that uses several kinds of side information (entity types, relational phrases) and employs Graph-CNN to capture syntactic information of instances.
• DISTRE (Alt et al., 2019): A transformer model, GPT fine-tuned for RE with an auxiliary objective under the distant supervision setting.

Results

Comparison with State-of-the-art Models

Figure 3 compares the precision-recall curves of REDSandT against state-of-the-art models. We observe that: (1) The NN-based approaches outperform the probabilistic method (Mintz), showing the limitation of human-designed features against the features automatically extracted by neural networks. (2) RESIDE, DISTRE, and REDSandT achieve better performance than PCNN+ATT, which, even though it exhibits the highest precision at the beginning, soon follows an abrupt decline. This reveals the importance of both side information (i.e., entity types and relation aliases) and transfer knowledge. (3) RESIDE performs the best at low recalls and generally performs well, which we attribute to the multitude of side information given. (4) Although DISTRE exhibits 3.5% greater precision at medium-level recalls, it presents 2-12% lower precision at recall values <0.25 compared to RESIDE and REDSandT. (5) Our model shows the most stable behavior.
Note: Compared to our 18 KB-independent entity types, the RESIDE authors use 38 Freebase-specific entity types.
Table 3: Evaluation results (AUC and P@N) of variant models on the NYT-10 dataset.

Figure 4: Relation attention weights for the children (top) and neighborhood of (bottom) long-tail relations.