JoinER-BART: Joint Entity and Relation Extraction With Constrained Decoding, Representation Reuse and Fusion

Joint Entity and Relation Extraction (JERE) is an important research direction in Information Extraction (IE). Given the strong performance of fine-tuned pre-trained BERT in a wide range of NLP tasks, most studies of JERE are nowadays based on the BERT model. Rather than predicting a simple tag for each word, these approaches are usually forced to design complex tagging schemes, as they may have to extract entity-relation pairs which overlap with others from the same sequence of word representations in a sentence. Recently, sequence-to-sequence (seq2seq) pre-trained BART models have shown better performance than BERT models in many NLP tasks. Importantly, a seq2seq BART model can simply generate sequences of (many) entity-relation triplets with its decoder, rather than just tag input words. In this article, we present a new generative JERE framework based on pre-trained BART. Different from the basic seq2seq BART architecture: 1) our framework employs a constrained classifier which only predicts either a token of the input sentence or a relation at each decoding step, and 2) we reuse representations from the pre-trained BART encoder in the classifier instead of a newly trained weight matrix, as this better utilizes the knowledge of the pre-trained model and context-aware representations for classification, and empirically leads to better performance. In our experiments on the widely studied NYT and WebNLG datasets, we show that our approach outperforms previous studies and establishes a new state-of-the-art (92.91 and 91.37 F1 respectively in exact match evaluation).


I. INTRODUCTION
EXTRACTING entities and their corresponding semantic relations is a fundamental and critical step for automatically constructing knowledge graphs from unstructured text. This task tries to detect all possible relational triplets that consist of two entities and the corresponding semantic relation between them in the given raw text. These triplets are of the form <subject, relation, object> (<s, r, o> for simplicity).
As shown in Fig. 1, a given text is likely to contain multiple triplets, with the added complexity that a single entity may be part of different relational triplets (SEO), and two entities of the same entity pair may engage in different relations (EPO). As a result, the task requires highly complex classifier and tag designs if we regard it as a sequence labeling (i.e., tagging) task.
By contrast, the seq2seq framework offers a simpler solution than treating the task as sequence labeling, as seq2seq models can simply generate triplet sequences of any length. Indeed, [1] presented a seq2seq model for the joint entity and relation extraction task. Their approach uses a copy mechanism to enhance its ability to copy source tokens.
Recently, fine-tuning models pre-trained on large amounts of raw text has shown great performance in a wide range of NLP tasks and has become the de-facto approach in the community, especially in low-resource scenarios. Given the great success of the pre-trained BERT model for a wide range of tasks [2], many studies try to improve the performance of the joint entity and relation extraction task [3], [4], [5], obtaining substantial performance improvements with pre-trained BERT, as BERT captures much knowledge from the large amount of raw text it has been trained on.
Although these tagging-based methods with pre-trained BERT [3], [4] achieve better performance than previous seq2seq-based approaches, a BERT model only has an encoder and is not pre-trained for generation; it is therefore more suitable for sequence labeling and classification tasks than for sequence generation. As a result, BERT-based approaches have to design complex tagging frameworks for joint entity and relation extraction to handle the possibility of SEO and EPO cases.
More recently, [6] pre-train a seq2seq model, BART, with improved performance over BERT in a wide range of task evaluations, including but not limited to abstractive dialogue, question answering, summarization, intention classification, word filling [7], and machine translation [8]. In this article we explore whether it is possible to design a BART-based seq2seq model for the joint entity-relation extraction task that leads to good performance with a simple seq2seq architecture. The decoder of a seq2seq model is capable of autoregressively producing a target sequence of arbitrary types and length; we can thus easily obtain entity-relation triplet sequences that meet all requirements of joint entity-relation extraction, including normal (Normal), Single Entity Overlapping (SEO) and Entity Pair Overlapping (EPO) situations (as shown in Fig. 2). Below we present a BART-based seq2seq architecture for joint entity-relation extraction.
In the joint entity-relation extraction task, the target sequence is highly constrained: each token comes either from the source sequence or from a set of pre-defined relation tokens.
To specifically address this in our framework, 1) we employ a constrained classifier which only makes predictions over the source tokens and the defined relation classes at each decoding step; 2) instead of learning new weight vectors for the classification of source tokens, we reuse representations from the pre-trained BART encoder in the classifier, which leverages the knowledge of the pre-trained model and uses context-aware representations for classification; and 3) since recent studies show that different layers of deep models capture linguistic properties at different levels [9], [10], [11], [12], [13], [14], [15], and a fusion of multi-layer representations is likely to bring further benefits [16], [17], [18], [19], we do not only use representations of the last BART encoder layer for classification, but additionally use preceding layers' representations for the prediction of source tokens.
In our experiments on the widely studied NYT and WebNLG datasets for this task, we show that our approach is able to significantly outperform previous studies and establish a new state-of-the-art.
Our main contributions are as follows:
- We present a seq2seq model for the joint entity-relation extraction task based on pre-trained BART. As the model can naturally generate entity-relation triplet sequences of arbitrary lengths, it can handle complex cases (including SEO and EPO) easily.
- We design a constrained decoding-based classifier that, at each step, makes predictions over either source tokens or pre-defined relation types for the task.
- Our model leverages and fuses different levels of representations and knowledge gained during pre-training for the classification over source tokens.
- Empirically, our approach establishes a new state-of-the-art performance on both the NYT and WebNLG datasets in the strict exact match evaluation, with F1 scores of 92.91 and 91.37 respectively, significantly outperforming the previous SoTA by a large margin (+0.9 and +3.6 F1 respectively).

A. Joint Entity and Relation Extraction
The key ingredients of a knowledge graph are relational triplets of the form <subject, relation, object>, consisting of two entities connected by their semantic relation. Extracting relational triplets from unstructured text is important for the automatic construction of large-scale knowledge graphs [3].

One key issue for joint entity and relation extraction is that a sentence may contain multiple relational triplets that overlap either in a single entity (SEO) or in an entity pair (EPO) with other relations, as shown in Fig. 1. Traditional sequence tagging schemes only predict one tag for each token, and conventional relation classification only yields one relation between each entity pair, which makes them fail to handle SEO and EPO cases properly. As a result, very complex tagging schemes were developed to attempt to capture such situations with sequence tagging technologies [3], [4].

B. Seq2Seq Approaches
The seq2seq model normally consists of an encoder and a decoder (with cross-attention mechanisms between decoder and encoder to attend encoded source representations).
Given a source sequence X, the encoder (with parameters θ_E) encodes X into a sequence of dense vectors h_E, with each vector representing a source token.
The decoder (with parameters θ_D) takes the encoded source representations h_E and the previous decoding history Y_{<t}, and produces the representation h_{D,y_t} of the next token y_t.
Normally, there is a classifier inside the decoder which interprets the dense vector h_{D,y_t} produced by the last decoder layer into a probability distribution over some vocabulary.

Fig. 2. Our framework with constrained decoding, representation reusing, and multi-layer representation fusion for joint entity-relation extraction based on pre-trained BART. In this example, the relational triplet is <Sudan, /location/country/capital, Khartoum>. The positions of "Sudan" and "Khartoum" are 12 and 10 respectively in the source sentence, "/location/country/capital" is abbreviated to "<lcc>", and it is the 231st relation and the 247th class with a source length of 16. The indexes of the special separator token and the end-of-sentence token are 263 and 264 respectively, with 16 tokens on the source side and 246 kinds of relations.
The decoder iteratively appends the predicted token y_t to the decoding history and makes new predictions. The decoder of seq2seq models is capable of generating sequences of arbitrary length, and it can easily handle both SEO and EPO cases, as both source entities and entity pairs can be decoded as many times as required in the target sequence.
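As an illustration of this decoding loop, the following minimal sketch (our own, library-agnostic; `encode` and `decode_step` are hypothetical callables standing in for the encoder and a single decoder step) shows how a decoder iteratively appends predictions until the end-of-sentence token is produced; greedy search is used for brevity.

def greedy_decode(encode, decode_step, source_tokens, sos_id, eos_id, max_len=128):
    h_enc = encode(source_tokens)                # h_E: one vector per source token
    history = [sos_id]                           # decoding history Y_{<t}, starts with <sos>
    while len(history) < max_len:
        probs = decode_step(h_enc, history)      # distribution over output classes
        y_t = max(range(len(probs)), key=probs.__getitem__)
        history.append(y_t)                      # append the prediction and continue
        if y_t == eos_id:                        # stop once <eos> is produced
            break
    return history[1:]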
Early research on joint entity and relation extraction with neural models includes [1], who present a seq2seq model for the task. Their model consists of a bi-directional RNN encoder (specifically, bi-GRU) to encode the source sentence, multiple decoders each of which generates specific types of entity-relation triplets, and copy mechanisms to copy the first and second entity of a relational triplet from the source side. [32] also utilize an RNN (specifically, LSTM) based seq2seq model, but use a single decoder which predicts an entity-relation triplet in each decoding step, where the decoder has two pointer networks to find the subject and object entities respectively, and a classification network for the relation of the entity pair.

C. BERT-Based Approaches
[2] pre-train the Transformer encoder on large-scale unlabeled text, and then fine-tune it for a wide range of NLP tasks. BERT substantially advances state-of-the-art results for many natural language processing tasks.

To benefit from BERT in entity and relation extraction, [3] propose a cascade binary tagging framework (CasRel). The framework first obtains representations of the source input sentence with the pre-trained BERT encoder. Next, it extracts relational triplets in two cascaded steps with a subject tagger and a set of relation-specific object taggers to address the overlap issue for SEO and EPO. The subject tagger uses two identical binary classifiers to detect the start and end positions of subjects, assigning each token a binary tag that indicates whether the current token corresponds to the start or end position of a subject. Each relation-specific object tagger then adds the representation of the subject extracted by the subject tagger to the BERT sentence representation, and predicts the start and end positions of the corresponding object given the subject and the relation.

[4] regard the joint extraction task as a Token Pair Linking problem, and present a one-stage TPLinker model. Based on the BERT sentence representation, TPLinker enumerates all possible token pairs and uses 3 types of matrices to tag token pairs, representing whether two tokens are an entity head and its corresponding tail, the subject head and its corresponding object head for each relation, and the subject tail and its corresponding object tail for the same relation (for each relation, there are two classifiers for the prediction of the subject head and object head token pair, and the subject tail and object tail token pair, respectively). These link matrices can then be decoded into different tagging results, from which all entities and their overlapping relations can be extracted. While these approaches are able to obtain substantial improvements using pre-trained BERT, their tagging schemes and the machinery required to implement them are highly complex.

III. OUR METHOD
A. Seq2Seq BART for Joint Entity Relation Extraction

[6] present BART, a denoising autoencoder for pre-training seq2seq models. BART uses a standard Transformer-based neural machine translation architecture, and is pre-trained to reconstruct the noised source input text corrupted by token masking, token deletion, text infilling (where spans of text are replaced with a single mask token), sentence permutation and document rotation. BART can be seen as generalizing BERT (a bidirectional encoder), GPT (a left-to-right decoder), and other recent pre-training schemes. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of tasks including abstractive dialogue, question answering, summarization, and machine translation [8]. Is it possible to leverage the simplicity of seq2seq architectures while benefiting from pre-training for the joint entity-relation extraction task with BART?
In our approach, we fine-tune BART to predict a sequence of relational triplets Y given a source input sentence X. The special start-of-sentence token <sos> is prepended to X and the special end-of-sentence token <eos> is appended to X before feeding it into the BART encoder. The target sequence Y is then of the form: <sos>, s_1, <sep>, o_1, r_{(s_1,o_1)}, ..., s_j, <sep>, o_j, r_{(s_j,o_j)}, <eos>, where s_i, o_i and r_{(s_i,o_i)} stand for the subject, object and their relation in the i-th triplet, and <sep> is a special token separating the subject and the object. Relational triplets are distinguished by the relation type r_{(s_i,o_i)}. The specifics of this representation are motivated by our investigation of different representations in Table VI further below. There is no constraint that a source token can only be generated once, so both SEO and EPO can be handled in this manner. The BART decoder starts generating with the special <sos> token, and learns to iteratively decode the whole sequence, producing the special <eos> token to end the triplet sequence.
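To make the target format concrete, the following small sketch (our own illustrative helper, not part of the released code) linearizes a set of triplets into this sequence; the example reproduces the triplet from Fig. 2.

def linearize_triplets(triplets):
    """triplets: list of (subject_tokens, relation, object_tokens) tuples."""
    target = ["<sos>"]
    for subject, relation, obj in triplets:
        # <sos>, s_i, <sep>, o_i, r_(s_i,o_i), ...
        target += list(subject) + ["<sep>"] + list(obj) + [relation]
    return target + ["<eos>"]

# Example with the triplet from Fig. 2 (single-token entities):
# linearize_triplets([(["Sudan"], "<lcc>", ["Khartoum"])])
# -> ['<sos>', 'Sudan', '<sep>', 'Khartoum', '<lcc>', '<eos>']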
During fine-tuning, the encoder takes the input sentence X. The gold relational triplet sequence Y prepended by the <sos> token is fed into the decoder as its input, and the decoder is trained to predict the shifted sequence, i.e., Y appended by the <eos> token. We minimize the negative log-likelihood loss

    L(θ) = − Σ_t log P^gold_{y_t},

where P^gold_{y_t} is the probability of the correct class predicted by the model at step t, and θ stands for the model parameters.
During decoding, we find the sequence with the highest overall probability via beam search. Since the beam search algorithm tries to maximize the sequence probability, the prediction of subsequent tokens also affects the selection of tokens before them:

    P(Y | X; θ) = ∏_{t=1}^{|Y|} P(y_t | Y_{<t}, X; θ),

where |Y| indicates the total number of tokens (both entity and relation) in Y.
Our joint entity and relation extraction model using pre-trained BART has special characteristics compared to general seq2seq modeling: to enforce that the model picks entity tokens only from the source sequence, and to make the most of the pre-trained model, we add constrained decoding (Section III-B), representation reusing (Section III-C), and multi-layer representation fusion (Section III-D) mechanisms to our model (shown in Fig. 2) on top of general BART fine-tuning, as described in the following subsections.

B. Constrained Decoding
In seq2seq modeling for the joint entity relation extraction task, as the decoder is expected to generate a sequence of relational triplets, the classifier should only produce, at each decoding step, either a token that appears on the source side, one of a set of globally specified relation types, or one of the special tokens (<sep>, <eos>).
To address this, we specifically design the weights and biases of the linear classifier in (3). Its weight matrix and bias vector are both the concatenation of three parts:

    W = [W_X | W_r | W_s],    b = [b_X | b_r | b_s],

where W_X and b_X are the indexed weight vectors and bias scalars of the tokens of the source input sentence X, W_r and b_r are the weight matrix and bias vector for all relation types, W_s and b_s are for the special separator and end-of-sentence tokens, and | indicates concatenation. With our approach, the linear classifier only makes predictions over source tokens, the pre-defined set of relations and the two special tokens. If the predicted class index k is no larger than |X| (where |X| is the number of source tokens), the model produces the k-th source token; if the index is between |X| + 1 and |X| + |R| (where |R| is the number of pre-defined relations), the model generates the (k − |X|)-th relation; otherwise, indices |X| + |R| + 1 and |X| + |R| + 2 are for the <sep> and <eos> tokens respectively, and decoding is completed when we encounter the <eos> token. Our approach is simple but caters to the specifics of the task well.
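A hedged PyTorch sketch of this constrained classifier (our own illustration; tensor names, shapes and the 0-based indexing are assumptions) is given below.

import torch

# Sketch of the constrained classifier: the weight matrix and bias are the
# concatenation of a per-source-token block, a relation block and a block for
# the two special tokens, so logits are only produced for these classes.
def constrained_logits(h_dec, W_X, b_X, W_r, b_r, W_s, b_s):
    # h_dec: (hidden,) decoder output at the current step.
    # W_X: (|X|, hidden), W_r: (|R|, hidden), W_s: (2, hidden); b_*: matching biases.
    W = torch.cat([W_X, W_r, W_s], dim=0)      # (|X| + |R| + 2, hidden)
    b = torch.cat([b_X, b_r, b_s], dim=0)      # (|X| + |R| + 2,)
    return h_dec @ W.T + b                     # logits over the constrained classes

def interpret_class(k, src_tokens, relations):
    # Map a predicted class index k (0-based here) back to an output symbol.
    n_src, n_rel = len(src_tokens), len(relations)
    if k < n_src:
        return src_tokens[k]                   # copy the k-th source token
    if k < n_src + n_rel:
        return relations[k - n_src]            # the (k - |X|)-th relation
    return "<sep>" if k == n_src + n_rel else "<eos>"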

C. Representation Reusing
Compared to learning a new randomly initialized weight matrix W X for the classification over source tokens, it might be better if we reuse the hidden states produced by the pre-trained model.This way we can recycle knowledge already gained from pre-training.
Specifically, we construct W_X based on the hidden representations produced by the BART encoder for the classification. We investigate both the use of the source token embeddings of the pre-trained model, h^0_E (where 0 indicates the embedding layer), and the context-aware representations of the last BART encoder layer, h^{|E|}_E (where |E| stands for the depth of the BART encoder, 12 in our case with pre-trained BART).
As the hidden representation h_E may not lie in the same vector space as the classifier, we employ a two-layer neural network to transform h_E into the vector space of the classifier weights, producing ĥ_E.
ĥ_E can then be used as the W_X part of the classifier weight for the classification over source tokens.
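The following PyTorch sketch (ours; the inner width and activation of the two-layer network are assumptions) shows how the last encoder layer's hidden states can be transformed and reused as the W_X block of the constrained classifier.

import torch.nn as nn

# Sketch of representation reusing: instead of learning a new weight matrix,
# the last BART encoder layer's hidden states are transformed by a two-layer
# network and used as the source-token block W_X of the constrained classifier.
class ReuseProjection(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),                               # activation is our assumption
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, h_enc_last):
        # h_enc_last: (|X|, hidden) hidden states of the last encoder layer.
        return self.mlp(h_enc_last)                  # used as W_X in the classifier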

D. Multi-Layer Representation Fusion
Previous studies show that different layers of deep models capture linguistic properties at different levels [9], [10], [11], [12], [13], [14], [15]. Specifically, [10] show that: 1) phrase-level information is captured in the lower layers, and 2) the intermediate layers compose a rich hierarchy of linguistic information, starting with surface features at the bottom, syntactic features in the middle, followed by semantic features at the top. [11] find that the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way, and that the regions responsible for each step appear in roughly the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference. [12] demonstrate that word morphology and part-of-speech information are captured at the lower layers of the model, while lexical semantics and non-local syntactic and semantic dependencies are better represented at the higher layers of the model.
We speculate that we may also get improved performance by leveraging representations of some shallow layers which work near the level of entity recognition and relation extraction, in addition to the last layer, while reusing BART encoder representations for classification.We study two types of multi-layer fusion approaches, soft fusion and hard fusion.
Following [33], in the soft fusion approach, we learn a weight vector w over layers, which is normalized into an importance probability vector p by the softmax. Next, we weight the outputs of all BART encoder layers with p, and use the aggregated result for classification:

    p = softmax(w),    h_E^{fused} = Σ_{l=0}^{|E|} p_l h_E^l,

where 0 stands for the embedding layer.
The weight vector w is trained by back-propagation during fine-tuning, and we expect it to learn to assign a higher weight to those layers which provide useful information for entity relation extraction than the other layers.
However, the softmax normally leads to a soft and dense probability distribution, i.e., representations of all layers are used, even those layers with a low weight. It might be better to only pick the layers which are more important for the task while totally discarding the others. In the hard fusion approach, we pick representations of just some encoder layers and average them for classification:

    h_E^{fused} = (1 / |S|) Σ_{l∈S} h_E^l,

where S is a manually selected set of layers.
To quantify the importance of BART encoder layers for the task, we first test the performance of models that average the representations of each layer with that of the last layer (which is expected to be the most informative one, as it sees all preceding layers), sort the results in descending order, and iteratively add layers into a set S one at a time, testing the performance, until it no longer increases when a new layer is added to the set.
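The selection procedure can be summarized by the following sketch (ours; `evaluate` is a hypothetical callable returning the development-set F1 of a model that averages the representations of the given set of encoder layers).

def select_layers(evaluate, last_layer=12):
    # Rank the other layers by how well each performs when averaged with the last layer.
    candidates = sorted(range(last_layer),                     # 0 = embeddings, 1..11
                        key=lambda l: evaluate({l, last_layer}),
                        reverse=True)
    selected, best = {last_layer}, evaluate({last_layer})
    for layer in candidates:
        score = evaluate(selected | {layer})
        if score <= best:            # stop when adding a layer no longer helps
            break
        selected.add(layer)
        best = score
    return selected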
As shown in Fig. 2, when performing multi-layer fusion, we take the output of the 2-layer neural network (MLP) as the representation of the last layer while using the other layers' outputs directly for weighted aggregation or averaging. We empirically find that this leads to better performance than aggregating or averaging all layers' representations before or after the MLP. We speculate that the potential reason might be the following: when picking out the layers working on entities and their relations, using their outputs directly, without processing by the MLP, lets them receive gradients more directly from the loss function during back-propagation; this shortens the gradient path to these layers and may supervise their fine-tuning better. At the same time, the last encoder layer's representations are also used by the cross-attention sub-layers in the decoder, and may thus need to preserve more information than is required for entity-relation extraction; the MLP might help pick out the information relevant for entity-relation extraction.
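A hedged PyTorch sketch of the two fusion variants (ours; module names, shapes and the inner MLP details are assumptions) follows; as described above, the MLP is only applied to the last layer's output.

import torch
import torch.nn as nn

class SoftFusion(nn.Module):
    # Soft fusion: learn a weight per layer, softmax-normalize, and aggregate.
    def __init__(self, num_layers, hidden_size):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_layers))           # learned layer weights
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size),
                                 nn.GELU(),
                                 nn.Linear(hidden_size, hidden_size))

    def forward(self, all_layers):
        # all_layers: list of (|X|, hidden) outputs, embeddings first, last layer last.
        outs = list(all_layers[:-1]) + [self.mlp(all_layers[-1])]
        p = torch.softmax(self.w, dim=0)                         # importance probabilities
        return sum(p[i] * h for i, h in enumerate(outs))

def hard_fusion(all_layers, mlp, selected=(9, 12)):
    # Hard fusion: average only the selected layers; the last layer goes through the MLP.
    last = len(all_layers) - 1
    outs = [mlp(all_layers[l]) if l == last else all_layers[l] for l in selected]
    return torch.stack(outs).mean(dim=0)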
IV. EXPERIMENTS

We evaluate our approach on the widely studied NYT and WebNLG datasets. The training, validation and test sets of NYT have 56195, 5000, and 5000 sentences respectively, with 24 types of relations. The training, validation and test sets of WebNLG have 5019, 500, and 703 sentences respectively, with 246 types of relations. Additionally, we divide the test sets into three subsets, Normal, SEO and EPO, according to the entity overlap between relational triplets. We also divide the test sets into 5 subsets according to the number of relational triplets in the sentence. These subsets are useful to examine the performance of approaches in handling each of these cases. Statistics of the NYT and WebNLG datasets are shown in Table I.

A. Settings
We used the pre-trained BART model from [6]. We adopted the cross-entropy loss for fine-tuning. Model parameters were optimized by Adam [39]. We employed all default settings of the BART open source project for hyperparameters, including a learning rate of 10^-5; carefully tuning these hyperparameters may lead to better performance, but this is beyond the main concern of this article. We used a beam size of 2 for decoding, as this results in the best performance on the development sets. We implemented our method based on the PyTorch and transformers libraries, and ran experiments on a single RTX3090.
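For reference, a minimal sketch of this setup (not the released code; the exact BART checkpoint name is our assumption, and the constrained classifier of Section III is built on top of the base model) might look as follows.

import torch
from transformers import BartTokenizer, BartModel

# Pre-trained BART, Adam with a learning rate of 1e-5; decoding uses beam size 2.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartModel.from_pretrained("facebook/bart-large")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)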
We compared our method with the following baselines:
1) Tagging: [30] propose a tagging scheme that converts the joint extraction task into a tagging problem. They employ a Bi-LSTM layer to encode the input sentence and an LSTM-based layer with biased loss to enhance the relevance of entity tags.
2) CopyRE: [1] present a seq2seq model with a copy mechanism, which can handle entity overlap cases.
3) GraphRel: [31] present a relation extraction model which uses GCNs to jointly learn named entities and relations.
4) CopyMTL: [36] propose a multi-task learning framework equipped with a copy mechanism to allow the model to predict multi-token entities.
5) WDec: [32] propose a pointer network-based decoding approach where an entire tuple is generated at every time step.
6) AttentionRE: [37] present an attention-based joint model, which mainly contains an entity extraction module and a relation detection module. The model devises a supervised multi-head self-attention mechanism as the relation detection module to learn the token-level correlation for each relation type separately.
7) CasREL: [3] propose a cascade binary tagging framework containing a subject tagger and a set of relation-specific object taggers.
8) TPLinker: [4] regard the joint extraction task as a Token Pair Linking problem, and present a one-stage TPLinker model.
We adopted strict exact match for evaluation, which only regards an extracted relational triplet as correct when the subject, the object and their relation all match the reference. We report the standard micro precision (P), recall (R) and F1 score (F1) in line with previous studies.
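The strict exact-match metric can be computed as in the following small sketch (ours, for illustration).

def micro_prf(predicted, gold):
    """predicted/gold: per-sentence sets of (subject, relation, object) triplets;
    a predicted triplet is correct only if it exactly matches a gold triplet."""
    num_pred = sum(len(p) for p in predicted)
    num_gold = sum(len(g) for g in gold)
    num_correct = sum(len(p & g) for p, g in zip(predicted, gold))
    precision = num_correct / num_pred if num_pred else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1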

B. Main Results
We first compare the performance of our approach with the baselines. In our approach, we use constrained decoding, representation reusing, and the hard multi-layer fusion mechanism that averages the representations of the 9th and the last BART encoder layer for classification. This configuration is selected by the ablation study conducted based on the average F1 score on the development sets of both datasets. Results are shown in Table II.
Comparing the performance of tagging-based approaches with those involving pre-trained BERT shows that, in general, the use of pre-trained models substantially improves performance. Table II shows that our BART-based approach significantly surpasses all baselines in all metrics on both datasets. Specifically, our seq2seq model with pre-trained BART achieves F1 scores of 92.91 and 91.37 on the NYT and WebNLG datasets respectively, establishing new state-of-the-art performance on the two datasets.
Concurrent to our work, OneRel [40] regards the task as a span classification problem following TPLinker, and achieves performance (92.9 and 91.0 on NYT and WebNLG respectively) comparable to ours (92.91 and 91.37 correspondingly) with a simple but effective relation-specific tagging strategy. However, the span classification of both TPLinker and OneRel requires them to design complex tagging schemes and their computational complexity is O(n^2), while our seq2seq architecture enhanced by constrained decoding, representation reusing and multi-layer fusion is much simpler and its decoding computational complexity is only O(n).

C. Ablation Study
We conduct ablation studies on both NYT and WebNLG datasets and select the best fusion and representation reusing setting for our model by average F1 score on the development sets.
a) Representation reusing and constrained decoding: We first test the effects of using randomly initialized weights, the pre-trained embedding weights (with and without constrained decoding) and the representation of the last BART encoder layer for the classification over source tokens. As BART ties the embedding matrix with the weight matrix of the classifier, initializing the classifier weight with the embedding matrix without constrained decoding is indeed vanilla BART fine-tuning for the task. Results are shown in Table III.
TABLE III: Effects of Representation Reusing and Constrained Decoding

Regarding representation reusing, Table III shows that: 1) representation reusing for the classification over source tokens is crucial for performance, as using either the pre-trained embeddings or the last BART encoder layer output substantially outperforms using a randomly initialized weight matrix, especially on the WebNLG dataset, and 2) the use of the last BART encoder layer output for source token classification leads to better performance than using the embedding layer, suggesting that the deep context-aware representations produced by the BART encoder layers benefit the classification over source tokens.
Representation reusing leads to large performance improvements. We conjecture the potential reasons might be: 1) the weight matrix for the classification over source tokens is very large (the size of the vocabulary times the model hidden dimension) and contains a very large number of parameters, whose learning is likely to have a strong impact on the performance of the model, and 2) as pointed out by previous studies [10], [11], [12] on understanding pre-trained models, some pre-trained encoder layers may learn representations useful for joint entity relation extraction, which are present in the last encoder layer output but not in the embeddings.
Table III shows that constrained decoding leads to better performance than decoding without it. We suggest that constrained decoding brings about several advantages in addition to the performance gain: 1) with constrained decoding, we only need to make predictions over source input tokens, which is much more efficient and faster than predicting over the full vocabulary, and 2) context-aware word representations are only available for source tokens, not for all tokens in the vocabulary; without constrained decoding, we could not use context-aware representations and multi-layer representation fusion for classification.

b) Multi-layer fusion:
We investigate how multi-layer fusion mechanisms affect the performance. Results are shown in Table IV.
In the hard fusion mechanism, when averaging the outputs of the last layer with another layer: 1) even the worst setting studied, the average of the last layer and layer 4, performs on par with using the last layer only, whose baseline (BART with constrained decoding and representation reusing) is already quite strong (92.63/90.34 F1 on NYT/WebNLG respectively, surpassing TPLinker), and 2) layer 9 and layer 5 lead to the best and second best performance respectively. However, averaging the representations of the last layer with both layers 9 and 5 does not outperform the combination of the last layer and layer 9 in average F1 score, and averaging the last layer representation with that of layer 9 leads to the best performance in terms of average F1 score.
Table IV shows that: 1) both soft and hard multi-layer fusion mechanisms bring about higher F1 scores on both the NYT and the WebNLG datasets, and 2) the hard fusion mechanism that averages the outputs of the last layer and layer 9 performs better than the soft fusion mechanism that utilizes all encoder layers, and leads to the best average F1 score. We adopt the hard multi-layer fusion mechanism that averages the outputs of the last layer and layer 9 in the other experiments by default.
To verify how entity-related information is captured by individual encoder layers of pre-trained BART, we regard Named Entity Recognition (NER) as a sequence labelling task, adopt the simple linear probing approach [41] on the OntoNotes 5.0 dataset [42], and measure the prediction accuracy of the output representations of BART encoder layers. Only the linear probe is trained on the NER task, and the pre-trained BART is frozen during probe training. Despite its simplicity, we suggest that this is sufficient to compare pre-trained layers. Results are shown in Table V. Table V shows that: 1) comparing the pre-trained embedding layer with a randomly initialized one, pre-trained BART layers capture NER-related information, and 2) layers 7 and 8 achieve the highest probing accuracy, indicating that they capture the most relevant information for NER. This correlates well with the result that layer 9 leads to the highest entity-relation extraction F1 in Table IV, considering that relation extraction is expected to be performed after the recognition of entities.
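A hedged sketch of this probing setup (ours; the checkpoint name, layer index and tag count are parameters or assumptions, and data loading and the training loop are omitted) is given below.

import torch
import torch.nn as nn
from transformers import BartModel

class LinearNERProbe(nn.Module):
    # Pre-trained BART is frozen; only a linear classifier over one encoder
    # layer's output is trained to predict per-token NER tags.
    def __init__(self, layer, num_tags):
        super().__init__()
        self.bart = BartModel.from_pretrained("facebook/bart-large")
        for p in self.bart.parameters():
            p.requires_grad = False                      # freeze the pre-trained model
        self.layer = layer                               # 0 = embeddings, 12 = last layer
        self.probe = nn.Linear(self.bart.config.d_model, num_tags)

    def forward(self, input_ids, attention_mask=None):
        with torch.no_grad():
            enc = self.bart.get_encoder()(input_ids=input_ids,
                                          attention_mask=attention_mask,
                                          output_hidden_states=True)
        h = enc.hidden_states[self.layer]                # (batch, seq_len, hidden)
        return self.probe(h)                             # per-token NER tag logits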
c) The generation order inside relational triplets: Following the common practice of previous frameworks [1], [3], [32], our decoder also produces the subject of the triplet first, but two questions remain: 1) is it better to generate the relation before the object, or in the reverse order? and 2) when we encounter entity overlap, shall we repeat the overlapped entity in each triplet, or can we omit the overlapped entity in the following triplets? We test these two questions, and results are shown in Table VI.
Table VI shows that: 1) generating tokens in the order of <subject, object, relation> leads to the best performance, and 2) it is better to keep the overlapped entity between triplets. Regarding possible reasons for the better performance when the overlapped entity is kept rather than omitted, we conjecture that 1) keeping these overlapped entities leads to a consistent form of the triplet sequence, where the decoder always first produces the subject, followed by the object and their relation, which might be easier to learn than a varying form, and 2) previous studies [43], [44] show that the Transformer's multi-head self-attention network has a preference for attending to adjacent tokens; keeping these overlapped tokens ensures that the subject always appears near the object and their corresponding relation, avoiding the need for the model to capture long-distance dependencies. We simply generate the triplets following their order in the datasets, as we find that generating the triplet sequence in the reverse order does not lead to significant changes in performance, and we think that the generation order of triplets may not have a significant effect on the performance.

D. Analysis of Entity Overlap and Triplet Numbers
As mentioned, we divide the test sets in two ways: 1) based on the entity overlap (this leads to 3 categories: Normal, SEO and EPO), and 2) based on the number of relational triplets per sentence (this leads to 5 categories: 1, 2, 3, 4, ≥ 5). We test our approach and some baselines in each category to observe their performance for these settings.
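For illustration, a simplified sketch of the overlap categorization (ours; edge cases may be handled differently in the official preprocessing scripts) is shown below.

def overlap_category(triplets):
    """triplets: list of (subject, relation, object) tuples for one sentence."""
    pairs = [(s, o) for s, _, o in triplets]
    if len(pairs) != len(set(pairs)):
        return "EPO"        # the same entity pair takes part in more than one relation
    entities = [e for pair in set(pairs) for e in pair]
    if len(entities) != len(set(entities)):
        return "SEO"        # a single entity is shared by different triplets
    return "Normal"         # no entity is shared between triplets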
a) Entity overlap: We first test the performance of approaches in handling entity overlap cases. Results are shown in Fig. 3.
Fig. 3(a) shows the results when there is no entity overlap between relational triplets. Our approach performs better than all baselines on both datasets, outperforming the second-place CasREL by 3.7 and 0.4 F1 scores on the NYT and WebNLG datasets respectively. Fig. 3(b) shows the results when there is a single entity overlap between relational triplets. Our approach again brings better performance than all baselines on both datasets, outperforming the second best by 4.0 and 3.0 F1 scores on the NYT and WebNLG datasets respectively. Fig. 3(c) shows the results when both entities of one relational triplet overlap with another. Our approach also leads to better performance than all baselines, outperforming the second best by 3.6 and 6.4 F1 scores on the NYT and WebNLG datasets respectively. The good performance of our approach in handling both SEO and EPO cases shows its strong ability in taking care of entity overlap. We conjecture that, as the pre-trained BART decoder produces the entities autoregressively, compared to previous studies that employ a set of newly learnt complex classifiers which aggressively make predictions over all possible token pairs, it is easier for the BART decoder to generate entities multiple times, and the knowledge gained during its pre-training (as probed in Table V) may empower it to extract the corresponding object of the subject more precisely.

b) Performance on different numbers of triplets: Next, we analyze the performance of our method and baselines in handling sentences with different numbers of relational triplets. Results are shown in Fig. 4.
Fig. 4 shows that on the NYT dataset, our approach outperforms all baselines in all cases. On the WebNLG dataset, our approach outperforms the baselines when there are 1, 3, 4 and ≥ 5 relational triplets in the sentence, while AttentionRE performs best for the case with 2 relational triplets.

E. Case Study
We manually inspect a few samples of the TPLinker baseline and our approach from the NYT test set for a case study. Results are shown in Fig. 5.
The first 3 examples in Fig. 5 show that our approach is more precise at extracting the object given the subject and the relation than our baseline, and is less likely to over-extract (i.e., link possible objects to other subject and relation pairs found). This is also confirmed in Table VII, which shows that there is a larger performance gap between the baseline and our approach in object prediction (−0.43) than in subject prediction (−0.15). We conjecture the reason might be that the subject and corresponding object extraction of our approach relies on the pre-trained BART layers, which might be more powerful than the newly introduced classifiers of the baseline method that aggressively predict over all possible pairs. Our approach is also better at extracting the subjects and objects themselves than the baseline, as shown in the last few examples in Fig. 5 and the large improvements in recall in Table VII.

V. RELATED WORK

a) Joint entity and relation extraction:
Early studies [20], [21], [22], [23] on relational triplet extraction follow the pipeline architecture, which first performs entity detection followed by relation classification. To address the error propagation issue, joint entity and relation extraction approaches have been proposed [24], [26], [27], [45], [46]. Entity(-pair) overlap is the main concern in recent studies. [30] propose a tagging scheme that converts the joint extraction task to a tagging problem, and design an LSTM-based tagging model to extract entities and their relations. [1] divide the sentences into three types according to triplet overlap degree, and propose a seq2seq model with a copy mechanism. [31] present a GCN-based model for the task. [36] propose a multi-task learning framework equipped with a copy mechanism. [47] propose a Seq2UMTree model to minimize the effects of exposure bias. [32] propose a pointer network-based decoding approach where an entire tuple is generated at every time step. On using pre-trained models, [3] propose a cascade binary tagging framework containing a subject tagger and a set of relation-specific object taggers, outperforming their baselines by a large margin with the pre-trained BERT encoder. [4] regard the joint extraction task as a Token Pair Linking problem, and present the TPLinker model, which enumerates and tags token pairs based on BERT representations. [5] propose representation iterative fusion based on heterogeneous graph neural networks. [48], [49] explore the use of GPT, but only for relation extraction. More recently, [50] introduce contrastive triplet extraction to encourage the model to generate gold triplets instead of negative ones. [5] employ heterogeneous graph neural networks for the modeling of relations and words. [51] facilitate the task by potential relation prediction and a global correspondence component to align the subject and object. [40] use a scoring-based classifier to evaluate whether a token pair and a relation belong to a factual triplet, together with a relation-specific tagging strategy for simple but effective decoding, and obtain performance comparable to our work. Complementary to our work, [52] present a large-scale distantly supervised dataset obtained by NLI, and fine-tune BART on the dataset to improve its performance.
b) Pre-trained models: Pre-training for NLP dates back to word vectors [53], [54]. [9] pre-train a deep bidirectional language model on a large text corpus, and show that these representations can be easily added to existing models and significantly improve the state of the art. [2] utilize the Transformer encoder [55] and pre-train BERT. [56] propose XLNet, a generalized autoregressive pre-training method. [57] present the Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks. [58] propose MAsked Sequence to Sequence pre-training (MASS) for encoder-decoder based language generation tasks. [59] show that hyper-parameter choices have a significant impact on the final results. [60] present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. [61] propose ELECTRA, a more sample-efficient pre-training task called replaced token detection. [62] present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. [63] propose a Transformer distillation method that is specially designed for knowledge distillation of Transformer-based models. [64] propose DynaBERT, which can flexibly adjust its size and latency by selecting adaptive width and depth. [65] explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. [66] demonstrate that scaling up language models substantially improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches.

c) Layer representation probing and multi-layer fusion: [9] present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals. [10] find that BERT gradually captures phrase-level information, surface features, syntactic features and semantic features from shallow to deep layers. [11] find that BERT represents the traditional NLP pipeline in an interpretable and localizable way, in the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference. [67] show that Multilingual BERT's robust ability to generalize cross-lingually is underpinned by a multilingual representation. [12] show that deep NMT models learn a non-trivial amount of linguistic information: word morphology and part-of-speech information are captured at the lower layers, while lexical semantics and non-local syntactic and semantic dependencies are better represented at the higher layers. [13] propose a parameter-free probing technique for analyzing pre-trained language models. [14] propose an information-theoretic operationalization of probing. [15] propose a Bird's Eye information-theoretic probe for detecting if and how representations encode the information in linguistic graphs. To improve performance by multi-layer fusion, [16] propose a densely connected NMT architecture. [17] propose to simultaneously expose all layers' outputs with layer aggregation and multi-layer attention mechanisms. [18] propose a multi-layer representation fusion approach to fusing stacked layers. [19] propose to use routing-by-agreement strategies to aggregate layers dynamically.
d) Comparison with previous work: Our simple architecture outperforms all compared baselines.
Compared to previous seq2seq-based models for joint entity and relation extraction, we 1) employ pre-trained BART, which substantially improves the performance, 2) focus on the decoding machinery and present a simple constrained decoding method that only manipulates the linear classifier, which can effectively adapt the BART decoder to the task and can be compared to the copy mechanisms or pointer networks of previous studies, 3) reuse the hidden representations of pre-trained BART for the classification over source tokens, which makes the most of the pre-training and partially avoids learning a large randomly initialized weight matrix, and 4) propose the soft and hard multi-layer representation fusion mechanisms which significantly improve the performance by leveraging representations of task-relevant encoder layers.
Compared to tagging based approaches (with pre-trained BERT), we use a seq2seq model which can easily address the entity overlap issue, avoiding the design of complicated tagging schemes.
The recent REBEL model [52] also uses BART. REBEL presents a large-scale distantly supervised dataset obtained by NLI, and fine-tunes BART on this dataset to improve its performance, obtaining an F1 of 92.0 on NYT. In contrast, our work focuses on the decoding of BART, proposing constrained decoding, representation reusing and multi-layer fusion mechanisms, leading to an F1 of 92.91 on the NYT dataset without leveraging any other data. We suggest that the two studies are complementary, as our approach can also leverage the REBEL dataset for further improvements. OneRel [40] still regards the problem as a span classification following TPLinker [4] with O(n^2) complexity, despite using a simple but effective relation-specific tagging strategy. In contrast, our work adapts pre-trained seq2seq BART to the task with constrained decoding, pre-trained representation reusing (for classification) and layer fusion, a simple approach with a decoding computational complexity of only O(n).

VI. CONCLUSION
We present a simple seq2seq model for joint entity and relation extraction. Our approach is empowered by pre-trained BART, with constrained decoding, representation reusing, and multi-layer representation fusion mechanisms to enhance its performance. The model can simply decode the relational triplet sequence and address the entity overlap issue well, without special and complex design of tagging schemes.
In our experiments on the widely studied NYT and WebNLG datasets, our model achieves F1 scores of 92.91 and 91.37 respectively in the strict exact match evaluation, and establishes new state-of-the-art performance.Our analysis shows the strong ability of our approach in handling entity overlap cases.

Fig. 1. Example of Single Entity Overlap (SEO) and Entity Pair Overlap (EPO) relational triplets from the NYT dataset.

Fig. 5. Case study. Wrong predictions are marked in red. Best viewed in color.

TABLE I: Statistics of the NYT and WebNLG datasets.

TABLE VI: Effects of the order inside relational triplets.

TABLE VII: Individual results of triplet elements on the NYT dataset.