Graph Convolution Over Multiple Latent Context-Aware Graph Structures for Event Detection

Event detection is a particularly challenging problem in information extraction. Recent neural network models have shown that dependency trees can better capture the correlation between candidate trigger words and their related context in a sentence. However, the syntactic information conveyed by the original dependency tree is insufficient for detecting triggers, since a dependency tree obtained from natural language processing toolkits ignores semantic context information. Existing approaches employ a static graph structure based on the original dependency tree, which cannot distinguish the interrelations among trigger words and contextual words. How to make effective use of relevant information in dependency trees while ignoring irrelevant information therefore remains a challenging research question. To address this problem, we investigate a graph convolutional network over multiple latent context-aware graph structures for event detection. We apply a multi-head attention mechanism to the BERT representation and the original adjacency matrix to generate multiple latent context-aware graph structures (a “dynamic cutting” strategy), which automatically learns how to select useful dependency information. Furthermore, we investigate graph convolutional networks with residual connections to combine local and non-local contextual information. Experimental results on the ACE2005 dataset show that our model achieves competitive performance compared with dependency-tree-based methods for event detection.


I. INTRODUCTION
Event detection is an important and challenging task in information extraction, which aims to discover specific types of event triggers in texts. This task is one of the most significant research topics in social media analysis [1] and benefits a wide range of downstream tasks, including question answering [2], automatic text summarization [3] and others. Event triggers are generally verbs or nominalizations that evoke the events of interest and serve as the main word(s) of the corresponding event. More precisely, the event detection task involves identifying event triggers and classifying them into specific types. An example is shown in Figure 1: in the sentence ''In Baghdad, a cameraman died when an American tank fired on the Palestine Hotel'', an event detection system is expected to detect ''died'' as a trigger of type Die and ''fired'' as a trigger of type Attack.
The associate editor coordinating the review of this manuscript and approving it for publication was Orazio Gambino.
In early years, researchers designed various hand-crafted feature sets, such as lexical features and contextual features, to extract events [4]-[6]. Although these methods can achieve high performance in specific fields, they rely heavily on manual annotations and on features specific to each event type. More recent event extraction models can be categorized into two classes: sequence-based models and dependency-based models. Sequence-based models use only the given sentences [7]-[9], whereas dependency-based models incorporate dependency trees into the models [10]-[14]. Compared to sequence-based models, dependency-based models can more effectively capture the connection between an event trigger word and its corresponding entities or other trigger words through the non-local syntactic structure of the sentence. Among dependency-based models, Graph Convolutional Networks (GCNs) [15] are often used to model syntactic dependency information for event detection. For example, reference [11] constructs a graph structure using the adjacency matrix obtained from the dependency tree, and reference [13] utilizes multi-order syntactic representations of sentences via a graph convolutional network with aggregative attention. However, these dependency-based models explicitly use an invariant graph structure generated from the adjacency matrix, and they cannot distinguish which syntactic information in the dependency tree is useful and which is not. Figure 1 depicts the dependency parse tree of the aforementioned sentence. The words along the path through ''fired'' form a trimmed phrase, ''American tank fired on Hotel'', of the original sentence, which conveys much information about the candidate trigger ''fired''. The remaining words, such as ''In Baghdad, a cameraman died when'', are less informative and may introduce noise if not handled properly. Intuitively, a model capable of learning to balance including and excluding information from the full tree will benefit event trigger detection.
To alleviate the problems above, we propose a graph convolutional network over multiple latent context-aware graph structures for event detection. First, a sentence is fed into BERT [16] to generate its BERT representation. Second, based on the BERT representation, multi-head attention [17] is employed to convert the original dependency graph into multiple fully connected edge-weighted graphs. Then, GCNs take the BERT representations and the adjacency matrices of the fully connected graphs as input to learn a syntactic contextual representation of each node. A residual connection [18] is employed to integrate the BERT representation with the syntactic contextual representations. Finally, a linear classifier is adopted to extract event triggers. The multi-head attention mechanism exploits BERT representations to transform the original syntactic graph into multiple latent context-aware graphs, which can be understood as a ''dynamic cutting'' strategy on the dependency matrix. Because multi-head attention jointly attends to information from different representation subspaces at different positions, the model can learn how to select relevant sub-structures for candidate triggers. Through this ''dynamic cutting'' strategy, GCNs can draw on syntactic structure information efficiently. Moreover, the residual connection mechanism combines the BERT representation with the GCN output to better leverage the structural information of the full dependency tree in the event detection task.
We extensively evaluated the effectiveness of our model on the ACE2005 dataset. Experimental results show that our model outperforms previous dependency-based approaches in terms of both Recall and F1-measure. In summary, the contributions of this article are as follows:
• We propose a new method for event detection with a ''dynamic cutting'' strategy. It utilizes the BERT semantic representation to generate multiple latent context-aware graph structures via a multi-head attention mechanism. This ''dynamic cutting'' strategy on the adjacency matrix learns how to select and discard information in the dependency trees at different positions.
• We employ a Graph Convolutional Network to exploit more non-local and non-sequential dependency information. Furthermore, a residual connection mechanism is investigated to combine the local contextual representations learned by BERT with the non-local contextual information generated by Graph Convolutional Networks.
• Experimental results show that our proposed method can substantially improve the performance of event detection, reaching an F1-measure of 76.34% and a Recall of 75.21% on the ACE2005 dataset.

II. RELATED WORK
Graph Convolutional Networks have been employed in event detection to explore syntactic representations, since they provide an effective mechanism for linking words directly to their informative contexts in the sentence. We divide the relevant studies into two categories: event detection and graph convolutional networks.

A. EVENT DETECTION
Event detection is a fundamental problem in information extraction and natural language processing [7], [19]. Early approaches to event detection were feature-based methods that employed hand-designed feature sets in different statistical models [5], [20]. These methods required extensive human engineering, which essentially limits model performance. The last few years have witnessed the success of neural network models for event detection. Typical models employ convolutional neural networks (CNNs) [7], [21] or recurrent neural networks (RNNs) [8], [22]. While such models effectively capture relations in the local context, they have limited capability of exploiting non-local and non-sequential dependencies. In many applications, however, such dependencies can significantly reduce tagging ambiguity and improve overall performance.
In order to capture non-local dependencies in the input space, dependency-based approaches were proposed to incorporate structural information into neural models. A syntactic relation representation based on dependency trees can capture the interrelation between candidate trigger words and related contexts better than a plain sentence representation. A dependency bridge recurrent neural network (dbRNN) was proposed to carry syntactically related information when modeling each word for event extraction [10]. Reference [11] used a CNN based on the dependency tree and an entity mention-guided pooling scheme to perform event detection. Reference [12] introduced syntactic shortcut arcs to enhance information flow and used an attention-based GCN to perform event detection. Reference [13] proposed a dependency-tree-based GCN model with aggregative attention to combine multi-order word representations from different GCN layers. However, these GCN-based models cannot filter out noisy syntactic information because their dependency graph structure is invariant. In this article, we propose a model that dynamically selects useful information from the dependency tree in order to improve performance.

B. GRAPH CONVOLUTIONAL NETWORKS
Graph convolutional networks, one of the typical variants of graph neural networks (GNNs), generalize CNNs to graphs [23], [24]. They have been successfully applied to many natural language processing tasks, such as text classification [25], semantic role labeling [26], relation extraction [27], [28], machine translation [29] and knowledge base completion [30]. Early efforts [23], [31] attempted to extend neural networks to deal with arbitrarily structured graphs, where the states of nodes are updated based on the states of their neighbors. Given a graph, a graph convolutional network embeds each node by recursively aggregating the representations of its neighbors. Subsequent efforts improved computational efficiency with local spectral convolution techniques [32]. Our method is closely related to the GCN model of [11], which used a convolutional neural network based on dependency trees and a novel pooling method to improve the performance of the event detection task.
More recently, graph attention networks (GATs) [33] were proposed to summarize neighborhood states by using masked self-attentional layers [17]. Our motivation and method for applying an attention mechanism in graph convolutional networks differ from theirs. In particular, each node in GATs attends only to its neighbors, while our model considers the correlation among all nodes. The network topology in GATs remains unchanged, while our model constructs fully connected graphs to capture long-range semantic interactions.

III. TASK DEFINITION
In this article, we focus on the event detection task defined in the Automatic Content Extraction (ACE) evaluation, where an event is defined as a specific occurrence involving one or more participants. To facilitate the understanding of this task, Table 1 introduces some ACE terminology.
The goal of event detection is to identify event triggers and categorize their event types. For example, in the sentence ''Obama beats McCain.'', an event detection system is expected to detect an Elect event along with the trigger word ''beats''. The ACE 2005 evaluation defines 8 event supertypes with 33 subtypes, such as Attack or Elect, plus a ''Not Applicable (NA)'' type. Following previous work [7], [8], [19], [34], we treat these simply as 34 separate event types and ignore the hierarchical structure among them.
Some triggers consist of multiple tokens. Previous work [35] applies a BIO annotation schema to assign a trigger label to each token $w_i$. In this work, we treat consecutive tokens that share the same predicted label as a single trigger.
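The rule for multi-token triggers can be illustrated with a minimal sketch. The function name `merge_triggers`, the label `O` for non-trigger tokens, and the example sentence are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from itertools import groupby


def merge_triggers(tokens, labels, null_label="O"):
    """Group consecutive tokens that share a non-null label into one trigger.

    Returns (start, end, event_type, text) tuples, with `end` exclusive.
    """
    triggers = []
    index = 0
    for label, group in groupby(labels):
        length = len(list(group))
        if label != null_label:
            text = " ".join(tokens[index:index + length])
            triggers.append((index, index + length, label, text))
        index += length
    return triggers


# Hypothetical labeled sentence, not taken from the ACE2005 corpus.
tokens = ["A", "cameraman", "died", "when", "a", "tank", "opened", "fire"]
labels = ["O", "O", "Die", "O", "O", "O", "Attack", "Attack"]
print(merge_triggers(tokens, labels))
# [(2, 3, 'Die', 'died'), (6, 8, 'Attack', 'opened fire')]
```

One consequence of this rule is that two adjacent triggers of the same type are merged into a single trigger, which a BIO schema would keep apart.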

IV. METHODOLOGY
Event detection can be formalized as a sequence labeling problem. We treat every word in a sentence as a trigger candidate and classify each candidate into a certain event type. As shown in Table 2, the event detection model collectively assigns a tag to each word in the sentence to indicate whether it triggers a specific type of event (O means that it does not belong to any event type).
In this section, we describe the details of the proposed method for event detection. Figure 2 shows the overall architecture of our proposed model, which primarily involves the following four components:
(I) Sentence Encoding Layer, which encodes the input sentence into a sequence of hidden embeddings.
(II) Graph Construction Layer, which exploits a multi-head attention mechanism to construct multiple latent context-aware graph structures.
(III) Graph Convolution Layer, which performs graph convolution over the multiple fully connected graphs with the help of residual connections.
(IV) Trigger Classification Layer, which assigns an event type (including the NA type) to each candidate in a sentence.
We introduce each component in turn in the following subsections.

A. SENTENCE ENCODING LAYER
Formally, an event detection instance is an n-word sequence $W = \{w_1, \ldots, w_n\}$, where $w_i$ refers to the i-th token. A sentence encoder $E(\cdot)$, which is a neural network, encodes the word sequence W into hidden embeddings. In this article, we select BERT [16] as the sentence encoder. The input of BERT is the sum of three types of embeddings: the WordPiece embedding [36], the position embedding and the segment embedding. The input sequence for BERT is constructed by placing the special BERT tokens <CLS> and <SEP> before and after the word sequence, respectively.
Since the input contains only one sentence, all its segment ids are set to zero. After feeding the input representation described above into 12 successive Transformer encoder blocks, we obtain the BERT contextual representation $H = \{h_0, h_1, \ldots, h_n, h_{n+1}\}$. Removing the representations of <CLS> and <SEP> leaves $H = \{h_1, h_2, \ldots, h_n\}$, the sequence of hidden embeddings that is used as the input of the subsequent graph construction layer.

B. GRAPH CONSTRUCTION LAYER
Each dependency tree can be represented as a syntactic graph in terms of an adjacency matrix. Let A be the adjacency matrix of the original syntactic graph, which is generated from the dependency tree of the sentence. As mentioned before, existing methods keep the original syntactic graph structure invariant. In the graph construction layer, by contrast, the BERT representation and the adjacency matrix based on the original dependency tree are used to generate multiple latent context-aware graph structures. The original dependency tree is converted into a fully connected edge-weighted graph by constructing an attention guided adjacency matrix $\tilde{A}$. In this work, we compute $\tilde{A}$ by using multi-head attention [17]. The model employs the BERT semantic vectors to calculate the correlation between words from different perspectives (N heads). An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. For the t-th head (t = 1, · · · , N), Q and K are both equal to the output H of the sentence encoding module, each head has its own parameter matrices for projecting Q and K, and d is the dimension of the hidden layer. With the multi-head attention mechanism, the system obtains N different attention guided adjacency matrices, where N is the number of heads and $\tilde{A}^{(t)}$ is the attention guided adjacency matrix generated by the t-th head. Figure 3 shows an example of transforming the original adjacency matrix into multiple attention guided adjacency matrices. The BERT representation and the adjacency matrix are fed into the multi-head attention mechanism to produce multiple fully connected edge-weighted graphs. Compared with the original sparse dependency graph, the fully connected edge-weighted graphs can better express the correlation between nodes.
In practice, we treat the original adjacency matrix as an initialization so that the dependency information can be captured in the node representations for later attention calculation. The attention guided adjacency matrices and the BERT embeddings are jointly fed into the GCNs to learn a syntactic contextual representation of each word.
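The construction of one attention guided adjacency matrix per head can be sketched as follows. This is a minimal illustration that assumes the standard scaled dot-product form of [17], $\tilde{A}^{(t)} = \mathrm{softmax}\big(Q W_t^{Q} (K W_t^{K})^{\top} / \sqrt{d}\big)$ with $Q = K = H$; the exact equation of the paper is not reproduced here. The names `attention_adjacency`, `w_q` and `w_k`, the toy values, and the use of the projected dimension as d are all assumptions. The sketch also leaves out the initialization from the original adjacency matrix described above.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_adjacency(h, w_q, w_k):
    """One head: row-wise softmax of (H W_q)(H W_k)^T / sqrt(d)."""
    q = matmul(h, w_q)
    k = matmul(h, w_k)
    d = len(q[0])  # assumption: d is the projected dimension
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]


# Toy encoder output for a 3-token sentence (hypothetical values).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Two heads, each with its own (W_q, W_k) pair (hypothetical values).
heads = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
    ([[0.5, -0.5], [0.5, 0.5]], [[1.0, 0.0], [0.0, -1.0]]),
]
adjacency = [attention_adjacency(H, w_q, w_k) for w_q, w_k in heads]
for t, matrix in enumerate(adjacency, start=1):
    print(f"head {t}:", [[round(v, 3) for v in row] for row in matrix])
```

Because each row comes out of a softmax, every entry is strictly positive, so each head yields a fully connected edge-weighted graph whose rows sum to one.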

C. GRAPH CONVOLUTION LAYER
This layer applies graph convolutional networks with residual connections over the multiple latent context-aware graph structures to model the sentence.

1) GCNs
We first introduce the general framework of graph convolutional networks. A graph with n nodes can be represented by an n × n adjacency matrix A. Let G = {V, E} be the dependency parse tree of a sentence $W = \{w_1, \ldots, w_n\}$, where V contains n nodes corresponding to the n tokens in W and E is the set of edges. Each edge $(w_i, w_j)$ in E is directed from the head word $w_i$ to the dependent word $w_j$ and has the dependency label $K(w_i, w_j)$. Following previous work [15], we also add all the self-loops. For instance, in the dependency tree of Figure 1, there is a directed edge from the node for the word $w_i$ = ''died'' (the head word) to the node for the word $w_j$ = ''cameraman'' (the dependent word) with the edge label $K(w_i, w_j)$ = K(''died'', ''cameraman'') = nsubj, the reversed dependency arc with the additional type label K(''cameraman'', ''died'') = nsubj', and two self-loops on ''died'' and ''cameraman'' with their own type label. Each graph structure G based on a dependency tree corresponds to an adjacency matrix A. Intuitively, $a_{ij} = 1$ and $a_{ji} = 1$ if an edge exists between node i and node j; otherwise $a_{ij} = 0$ and $a_{ji} = 0$, where $a_{ij}$ is an element of A.
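As an illustration of this definition, the following sketch builds the symmetric binary adjacency matrix with self-loops from a list of head-dependent edges. The function name `build_adjacency` is hypothetical, and the three-token fragment uses one edge stated in the text (''died'' to ''cameraman'') plus an assumed determiner edge (''cameraman'' to ''a'') that is included only for illustration.

```python
def build_adjacency(n, edges, self_loops=True):
    """Binary adjacency matrix for n tokens from (head, dependent) index pairs.

    Each dependency arc is stored in both directions; self-loops are optional.
    """
    a = [[0] * n for _ in range(n)]
    for head, dependent in edges:
        a[head][dependent] = 1
        a[dependent][head] = 1  # reversed arc
    if self_loops:
        for i in range(n):
            a[i][i] = 1
    return a


tokens = ["a", "cameraman", "died"]
edges = [
    (2, 1),  # died -> cameraman (nsubj), as in the Figure 1 example
    (1, 0),  # cameraman -> a, assumed for illustration
]
A = build_adjacency(len(tokens), edges)
for row in A:
    print(row)
```

The dependency labels themselves are not stored here, which matches the arcs-only setting of the main model.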
The graph convolution operation gathers information from neighboring nodes in the graph. In particular, the graph convolution vector $h_i^{(l)}$ at the l-th layer for node i is computed as in Equation (4), where A is the adjacency matrix of a graph with n nodes, $W^{(l)}$ and $b^{(l)}$ are the model parameters, and ρ is an activation function (e.g., ReLU [37]). Moreover, we use the output H of the sentence encoding module to initialize the node representation $h^{(0)}$ of the first GCN layer. Since we have N different fully connected adjacency matrices, N separate graph convolution operations are required.
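A single graph convolution step can be sketched as below. The sketch assumes the common propagation rule $h_i^{(l)} = \rho\big(\sum_{j=1}^{n} A_{ij} W^{(l)} h_j^{(l-1)} + b^{(l)}\big)$ with ρ = ReLU, which is consistent with the symbols defined above but is a reconstruction rather than a quotation of Equation (4). The function name `gcn_layer` and the toy values are hypothetical.

```python
def gcn_layer(adj, h, weight, bias):
    """One graph convolution: ReLU(sum_j adj[i][j] * (h_j @ weight) + bias).

    adj:    n x n adjacency matrix (binary or attention weighted)
    h:      n x d_in node representations
    weight: d_in x d_out parameter matrix
    bias:   d_out bias vector
    """
    # Project every node representation once: h @ weight.
    projected = [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
        for row in h
    ]
    output = []
    for i in range(len(adj)):
        aggregated = [
            sum(adj[i][j] * projected[j][k] for j in range(len(h))) + bias[k]
            for k in range(len(bias))
        ]
        output.append([max(0.0, value) for value in aggregated])  # ReLU
    return output


# Toy chain graph of 3 nodes with self-loops (hypothetical values).
adj = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
h1 = gcn_layer(adj, h0, identity, [0.0, 0.0])
print(h1)  # [[1.0, 1.0], [2.0, 2.0], [1.0, 2.0]]
```

With an identity weight matrix and zero bias, each output row is simply the sum of the node's own vector and those of its neighbors, which makes the aggregation easy to verify by hand.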

2) GCNs WITH RESIDUAL CONNECTIONS
GCNs can gather information from up to k hops away by stacking k layers, but sometimes the length of the shortcut path between two triggers is less than k. To avoid over-propagating information, we adopt residual connections [18] in the GCNs. A residual connection mechanism can combine the local contextual representation learned by BERT with the non-local contextual information generated by the Graph Convolutional Networks. Note that this mechanism allows contextual information to flow across GCN layers. Specifically, the initial representation of the GCN is integrated into each node's representation at the last iteration. Based on Equation (4), the computation of the L-th layer (L is the total number of GCN layers) is modified to integrate $h^{(0)}$, the initial input vector (the BERT representation H), with the other parameters defined as above. The BERT representation is integrated only at the last iteration, while the graph representation is updated according to Equation (4) in the other layers. Based on the t-th attention guided adjacency matrix $\tilde{A}^{(t)}$ (t = 1, · · · , N), we obtain the final output $H^t$ for the t-th head, where $H_i^t$ is the final representation of the i-th token.
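The following sketch illustrates one plausible reading of this design: L ordinary GCN layers are applied, and the initial representation is added to the output of the last layer only. Adding $h^{(0)}$ after the final activation is an assumption, since the modified equation is not reproduced here; the names `propagate` and `gcn_with_residual` and the toy values are likewise hypothetical. The sketch requires the input and output dimensions to match.

```python
def propagate(adj, h, weight, bias):
    """One GCN step: ReLU(adj @ (h @ weight) + bias)."""
    projected = [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
        for row in h
    ]
    return [
        [
            max(0.0, sum(adj[i][j] * projected[j][k] for j in range(len(h))) + bias[k])
            for k in range(len(bias))
        ]
        for i in range(len(adj))
    ]


def gcn_with_residual(adj, h0, layers):
    """Stack GCN layers and add the initial representation to the last output.

    layers: list of (weight, bias) pairs, one per GCN layer.
    """
    h = h0
    for weight, bias in layers:
        h = propagate(adj, h, weight, bias)
    # Residual connection: integrate h0 only after the final layer.
    return [[out + init for out, init in zip(row, start)] for row, start in zip(h, h0)]


# Toy attention-weighted graph of 2 nodes (hypothetical values).
adj = [[0.5, 0.5], [0.5, 0.5]]
h0 = [[1.0, 3.0], [3.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
layers = [(identity, [0.0, 0.0]), (identity, [0.0, 0.0])]
print(gcn_with_residual(adj, h0, layers))  # [[3.0, 5.0], [5.0, 3.0]]
```

In the toy example, two rounds of averaging make both nodes identical ([2.0, 2.0]), and the residual term restores the token-specific information from the encoder.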

3) GCNs WITH LINEAR COMBINATION
In addition to the residually connected layers, we include a linear combination layer after the multi-layer GCNs to merge the representations from the different GCN blocks and obtain a more expressive representation. The final representation of each node is computed by fusing the node's representations from all graph convolutional networks. Formally, the outputs of the multiple GCNs are concatenated and fed into the linear combination layer, where $H_{con} = [H^1, \cdots, H^N] \in \mathbb{R}^{dN}$ is the concatenation of the N separate GCN outputs, N is the number of attention heads, $W_{com} \in \mathbb{R}^{(dN) \times d}$ is a learnable transformation matrix and $b_{com}$ is the bias vector.
VOLUME 8, 2020
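The linear combination layer can be sketched as follows, assuming the affine form $D_{com} = H_{con} W_{com} + b_{com}$ applied to each token, which matches the shapes given above. The function name `linear_combination` and the toy values are hypothetical.

```python
def linear_combination(head_outputs, w_com, b_com):
    """Concatenate the per-head token vectors and apply an affine map.

    head_outputs: N matrices of shape n x d (one per head)
    w_com:        (d * N) x d matrix, given as a list of rows
    b_com:        bias vector of length d
    """
    n_tokens = len(head_outputs[0])
    combined = []
    for i in range(n_tokens):
        # H_con for token i: the N head vectors laid end to end.
        concat = [value for head in head_outputs for value in head[i]]
        combined.append([
            sum(x * w for x, w in zip(concat, column)) + b
            for column, b in zip(zip(*w_com), b_com)
        ])
    return combined


# Two heads (N = 2), two tokens, d = 2 (hypothetical values).
head_1 = [[1.0, 2.0], [3.0, 4.0]]
head_2 = [[5.0, 6.0], [7.0, 8.0]]
# This particular W_com averages the two heads.
w_mean = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
print(linear_combination([head_1, head_2], w_mean, [0.0, 0.0]))
# [[3.0, 4.0], [5.0, 6.0]]
```

The weights chosen in the example reproduce mean pooling over heads, which shows that averaging is a special case of the learnable linear combination.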

D. TRIGGER CLASSIFICATION LAYER
After applying the proposed method over the GCNs, the syntactic contextual representations of all tokens are obtained. The goal of event detection is to predict the type of each token based on these representations. The context vectors $D_{com}$ are fed into a fully connected network to predict the trigger labels, where f is a non-linear activation, $d_i \in D_{com}$, and $y_i$ is the final output for the i-th trigger label. We minimize the negative log-likelihood loss to train our model. Due to the data sparsity in the ACE 2005 dataset, we adopt a bias loss function [35] to balance the event label weights during training.
In this loss, D is the number of sentences in the training corpus, $D_{i,w}$ is the number of tokens in sentence $D_i$, $I(y_i^t)$ is an indicator function that outputs 1 if $y_i^t$ is O and 0 otherwise, and β is a hyper-parameter larger than 1.
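The effect of the bias loss can be sketched as below. The sketch assumes a weighted negative log-likelihood in which tokens labeled O receive weight 1 and all other tokens receive weight β, which follows the definitions above and the general form of [35]; it is a reconstruction, not the paper's exact equation. The function name `bias_loss` and the data layout are hypothetical.

```python
import math


def bias_loss(probabilities, gold_labels, beta=5.0, null_label="O"):
    """Weighted negative log-likelihood over all tokens of all sentences.

    probabilities: per sentence, per token, a dict mapping label -> probability
    gold_labels:   per sentence, the list of gold labels
    Tokens labeled `null_label` get weight 1, event tokens get weight beta.
    """
    total = 0.0
    for sentence_probs, sentence_gold in zip(probabilities, gold_labels):
        for token_probs, gold in zip(sentence_probs, sentence_gold):
            weight = 1.0 if gold == null_label else beta
            total -= weight * math.log(token_probs[gold])
    return total


# One sentence with two tokens and a uniform prediction (hypothetical values).
probabilities = [[{"O": 0.5, "Attack": 0.5}, {"O": 0.5, "Attack": 0.5}]]
gold_labels = [["O", "Attack"]]
print(bias_loss(probabilities, gold_labels, beta=5.0))
```

With β = 5, an error on the trigger token costs five times as much as the same error on the O token, which counteracts the dominance of O labels in the training data.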

V. EXPERIMENTS

A. DATASET, EVALUATION METRIC AND RESOURCES
We evaluate our method on the ACE2005 dataset. To comply with previous work [8], [10], [38], we use the same pre-defined split of the documents. Table 3 shows the data statistics.
Following previous work [39]-[42], we use the same criteria to judge the correctness of each predicted event mention:
• Event Trigger Identification: A trigger is correctly identified if its offset matches a reference trigger.
• Event Type Classification: A trigger is correctly classified if both its offset and event type match a reference trigger.
We use the Stanford CoreNLP toolkit to parse every sentence in the corpus, including tokenization, sentence splitting and dependency parsing.
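The two correctness criteria above can be sketched as a small scoring routine. The tuple layout `(sentence_id, start, end, event_type)`, the function names, and the example predictions are assumptions made for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 for two sets of items."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


def evaluate(predicted, gold):
    """Score trigger identification (offset only) and classification (offset + type)."""
    identification = precision_recall_f1(
        {trigger[:3] for trigger in predicted},
        {trigger[:3] for trigger in gold},
    )
    classification = precision_recall_f1(set(predicted), set(gold))
    return identification, classification


# Hypothetical triggers: (sentence_id, start, end, event_type).
gold = {(0, 3, 4, "Die"), (0, 8, 9, "Attack")}
predicted = {(0, 3, 4, "Die"), (0, 8, 9, "Transport"), (0, 1, 2, "Attack")}
identification, classification = evaluate(predicted, gold)
print("identification P/R/F1:", identification)
print("classification P/R/F1:", classification)
```

In this example the second prediction counts as correctly identified but wrongly classified, so identification scores are higher than classification scores.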
In the graph convolution layer, we use a three-layer GCN with 768 hidden units, a linear combination layer with 768*N hidden units, where N is the number of heads, and 300 hidden units for the linear classification layer. Stochastic gradient descent over shuffled mini-batches with the Adadelta update rule [43] is used for training. We use ReLU [37] as our non-linear activation function. We also set the dropout rate to 0.2 and the bias loss parameter β to 5.

C. OVERALL PERFORMANCE
To demonstrate the effectiveness of our model (called MH-GCN), we compare it with several classic previous works, which we divide into four categories: feature-based models, sequence-based neural network models, external resource-based models and GCN-based neural network models.

1) FEATURE-BASED MODELS
• Cross-Event [44], using document-level information to improve the performance of ACE event extraction.
• MaxEnt [19], only using lexical features, basic features and syntactic features designed by human.

2) SEQUENCE-BASED NEURAL NETWORK MODELS
• DMCNN [7], which exploits a dynamic multi-pooling convolutional neural network for event trigger detection.
• dbRNN [10], which adds dependency bridges with weight to BiLSTM for event extraction.
• LearnToSelectED [45], which features the automatic identification of important context words in the sentence.

3) EXTERNAL RESOURCE-BASED MODELS
• GMLATT [46], which exploits the multi-lingual information for more accurate context modeling.
• BERT+Boot [38], which applies an adversarial training mechanism to enhance distantly supervised event detection models.
• EKD [47], which leverages external open-domain trigger knowledge to provide extra semantic support.

4) GCN-BASED MODELS
• GCN-ED [11], which investigates GCN on syntactic dependency tree and an argument pooling mechanism to improve performance.
• JMEE [12], which exploits GCN with a self-attention aggregation mechanism and a highway network to improve the performance of GCN for event extraction.
• MOGANED [13], which uses GCN with aggregated attention to combine multi-order word representations from different GCN layers.
• RA-GCN [48], which investigates a relation-aware GCN on the tree with syntactic dependency relation labels.
MH-GCN is a GCN-based model. Table 4 compares it with the state-of-the-art methods on the blind test set. From the table, we can make the following observations. Among all these methods, sequence-based neural network models beat feature-based models on F1-measure. Moreover, compared with sequence-based models, external resource-based models and GCN-based models achieve better performance (except LearnToSelectED). This phenomenon is not surprising. Sequence-based neural network models can automatically learn salient features in the data to achieve better performance. External resource-based models utilize more data from external resources, and GCN-based models incorporate structural information to achieve further improvement.
We conduct a t-test to compare the proposed method with the other methods on F1-measure. Compared with the best feature-based model, MH-GCN gains a 7.5-point improvement in F1-score. Moreover, our model is superior to all of the sequence-based baselines. MH-GCN achieves the best performance among the GCN-based models that employ dependency arcs only. It improves on the recall and F1 of MOGANED, the best reported method that uses dependency arcs only, by 2.9 and 0.5 points, respectively. This demonstrates the effectiveness of exploiting the BERT representation to enhance graph convolution via the multi-head attention mechanism.
Compared with RA-GCN, our model achieves a higher precision but a lower F1-measure. RA-GCN exploits dependency arcs and relations simultaneously with a relation-aware GCN, while MH-GCN uses dependency arcs only. The dependency relations obtained from a syntactic parsing toolkit are noisy, which is responsible for the lower precision of RA-GCN. Our model outperforms the external resource-based models except EKD, which uses a teacher-student model to distill open-domain trigger knowledge from WordNet. This demonstrates that external resources are useful for improving event extraction.

D. MODEL STABILITY ANALYSIS
To further explore the performance stability of our model, we perform a 5-fold cross-validation on the ACE2005 dataset; the results are shown in Table 5. The ACE2005 corpus includes 6 different domains: broadcast conversation (bc), broadcast news (bn), telephone conversation (cts), newswire (nw), usenet (un) and weblogs (wl). In order to keep the distribution of domains consistent across folds, we divide the data of each domain into five parts and splice the corresponding parts together. We then choose a different fold each time as the testing set and use the remaining four folds as the training set. In Table 5, the average of trigger identification is 75.16% (variance 0.21) and the average of trigger classification is 70.9% (variance 0.18). Our model thus obtains similar performance in each fold when the data for each domain is consistent.
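The domain-consistent fold construction can be sketched as follows. Round-robin assignment within each domain is one possible way to ''divide the data of each domain into five parts'' and is an assumption, as are the function names and the synthetic document ids.

```python
from collections import defaultdict


def domain_stratified_folds(documents, k=5):
    """Split (doc_id, domain) pairs into k folds with every domain spread evenly."""
    by_domain = defaultdict(list)
    for doc_id, domain in documents:
        by_domain[domain].append(doc_id)
    folds = [[] for _ in range(k)]
    for domain in sorted(by_domain):
        # Deal the documents of this domain out to the folds in turn.
        for position, doc_id in enumerate(by_domain[domain]):
            folds[position % k].append(doc_id)
    return folds


def cross_validation_splits(folds):
    """Yield (train, test) lists, holding out one fold at a time."""
    for held_out in range(len(folds)):
        train = [doc for i, fold in enumerate(folds) if i != held_out for doc in fold]
        yield train, folds[held_out]


# Synthetic corpus: 10 hypothetical documents per ACE2005 domain.
domains = ["bc", "bn", "cts", "nw", "un", "wl"]
documents = [(f"{domain}_{i}", domain) for domain in domains for i in range(10)]
folds = domain_stratified_folds(documents)
for train, test in cross_validation_splits(folds):
    print(len(train), "training documents,", len(test), "test documents")
```

Each test fold then contains documents from all six domains in the same proportion as the full corpus.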
In general, repeated experiments give a more reliable estimate of performance, since a single experiment may not reflect the true performance of a model. We therefore choose different random seeds and conduct 10-test experiments based on the pre-defined data split [38]. Figure 4 shows the 10-test experiments on the ACE 2005 dataset. The average of trigger identification is 79.23% (variance 0.18) and the average of trigger classification is 76.34% (variance 0.05), which demonstrates that our model is robust on event detection. These two groups of experiments from different perspectives (i.e., cross-validation and the 10-test experiments) show that the performance of our model is relatively stable.

E. EFFECT OF DEPENDENCY RELATION
To further examine whether dependency relations can improve our model, we design a variant (MH-GCN-RW) based on MH-GCN. MH-GCN-RW applies a gate on the edges to weight the importance of each relation in the GCN. The modified graph convolution operation, based on Equation (4), involves $g_{ij}^{(l)}$, the relation-weighted coefficient between nodes i and j at the l-th layer; $r_{ij}$, the relation embedding between nodes i and j; and $W_{rel}$, the relation-specific model parameters. The other parameters are defined as above. Table 6 shows the effect of relations in our model. The modified model achieves higher performance than MH-GCN but lower performance than RA-GCN. This demonstrates that syntactic dependency relations can provide information that improves performance. Note that MH-GCN-RW uses a simple fusion module, while RA-GCN uses a relation-aware aggregation module and a context-aware relation update module simultaneously. Designing a more effective relation fusion module may therefore further improve the model. In the future, we are going to explore how to integrate syntactic dependency relations into our model.

F. ABLATION STUDY
To study the contribution of the attention mechanism and the residual connections, we design ablation experiments with three architectures based on MH-GCN: 1) BERT-GCN, which integrates the BERT encoder with the original syntactic graph structure instead of the fully connected edge-weighted graphs; 2) MH-GCN/RC, which removes the residual connections; 3) MH-GCN-Mean, which adopts mean pooling to combine the representations of the multiple GCN blocks, whereas MH-GCN adopts the linear combination layer.
The experimental results are shown in Table 7. All three modified models perform worse than MH-GCN. MH-GCN/RC performs the worst, which suggests that the residual connection mechanism plays an important role in context-aware graph structures. It implies that the context feature representation produced by BERT brings essential information. Without residual connections, the GCNs converge slowly and are more likely to ignore relevant information. BERT-GCN drops more on precision than on recall, which illustrates that the multi-head attention learned from the matrices helps MH-GCN to predict trigger words more precisely. The performance drop of MH-GCN-Mean is the smallest among the three modified models. Although averaging the representations of the multiple GCNs achieves competitive performance for event detection, the proposed linear combination layer can distinguish the importance of the syntactic representations of different heads, which yields a 0.8% improvement on F1-measure.

G. EFFECT OF MULTI-HEAD GCNs ON EXTRACTING MULTIPLE EVENTS
In order to further demonstrate the effectiveness of MH-GCN, especially for sentences with multiple events, we divide the test set into two parts (single event and multiple events) following previous work [8], [12] and evaluate them separately. A sentence with exactly one trigger is a single-event sentence; otherwise, it is a multiple-event sentence. Table 8 shows the performance (F1-measure scores) of DMCNN [7], JRNN [8], JMEE [12] and two other baseline systems, named Embeddings+T and CNN. Embeddings+T uses word embeddings and the traditional sentence-level features in [19], while CNN is similar to DMCNN, except that it applies the standard pooling mechanism instead of the dynamic multi-pooling method.
From the table, we see that MH-GCN significantly outperforms all the other methods when the input sentences contain more than one event (i.e., the row labeled 1/N in the table). On the 1/N split, our framework is 1.5% better than JMEE, which demonstrates that multi-head attention helps to alleviate the multiple-events issue. The multi-head attention mechanism allows the model to jointly attend to information from different representation subspaces at different positions and thereby to maintain the correlation between multiple events.
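The split into single-event and multiple-event sentences can be sketched as below. The function name `split_by_event_count` and the data layout are hypothetical, and leaving sentences without any trigger out of both parts is an assumption, since the text does not say how they are handled.

```python
def split_by_event_count(sentences):
    """Partition (sentence_id, triggers) pairs into single- and multiple-event parts.

    Sentences without any trigger are placed in neither part (assumption).
    """
    single = [s for s in sentences if len(s[1]) == 1]
    multiple = [s for s in sentences if len(s[1]) > 1]
    return single, multiple


# Hypothetical test sentences with their gold trigger words.
sentences = [
    ("s1", ["died", "fired"]),
    ("s2", ["beats"]),
    ("s3", []),
]
single, multiple = split_by_event_count(sentences)
print("1/1:", [s[0] for s in single])    # ['s2']
print("1/N:", [s[0] for s in multiple])  # ['s1']
```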

H. ERROR ANALYSIS
We further examine the output of MH-GCN on the test set to determine the contribution of each event type to trigger classification errors. Three types of errors arise: (i) missing an event trigger in the test set (called Missing), (ii) proposing a fake event trigger (called Spurious), and (iii) assigning the wrong event type to a trigger (called Wrong Type). Table 9 shows the top two event types appearing in these errors and their corresponding percentages. The three error types account for 45.56% (Missing), 48.65% (Spurious) and 5.79% (Wrong Type) of the errors. Attack and Transport are the types that appear frequently in the missing and spurious errors.
A careful analysis of the missing cases reveals that the errors mainly correspond to trigger words that do not appear in the training data, such as the word ''admits'' (of type Transport) in the sentence ''Turkey would lose a aid-package unless it admits troops into the country for the Iraq conflict.'' The spurious errors occur due to the confusable context of trigger words. For instance, the word ''war'' in the following sentence can easily be mistaken for an Attack event (due to its context with the word ''victims''): ''. . . provided the money went for goods to victims of the first Gulf War.'' Highly ambiguous triggers (words that can trigger many event types) are the main reason for the wrong type errors. In the sentence ''A rocket landed in farmlands and the other hit a house inside the refugee camp.'', the word ''landed'' is misclassified as a Transport event instead of an Attack event.
To address the problems mentioned above, the scale of the training data could be increased by introducing external resources, which would help with rare or unseen trigger words. Moreover, designing a model that better captures the context could mitigate the confusable context problem.
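The three error categories used in this analysis can be made concrete with a short sketch. It assumes that gold and predicted triggers of one sentence are given as mappings from token index to event type; the function name and the toy annotations are hypothetical.

```python
from typing import Dict, List


def categorize_errors(gold: Dict[int, str], pred: Dict[int, str]) -> Dict[str, List[int]]:
    """Assign each erroneous token index to Missing, Spurious or Wrong Type."""
    errors: Dict[str, List[int]] = {"Missing": [], "Spurious": [], "Wrong Type": []}
    for idx in sorted(set(gold) | set(pred)):
        g, p = gold.get(idx), pred.get(idx)
        if g is not None and p is None:
            errors["Missing"].append(idx)      # gold trigger not predicted
        elif g is None and p is not None:
            errors["Spurious"].append(idx)     # predicted trigger not in gold
        elif g != p:
            errors["Wrong Type"].append(idx)   # trigger found, type differs
    return errors


# Toy annotations (hypothetical): token index -> event type.
gold = {2: "Attack", 7: "Attack", 11: "Die"}
pred = {2: "Transport", 9: "Attack", 11: "Die"}

errors = categorize_errors(gold, pred)
```

In this toy case, token 2 is a Wrong Type error, token 7 is Missing, token 9 is Spurious, and token 11 is correct and therefore not counted.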

VI. DISCUSSION
To better understand what the model has learned via the ''dynamic cutting'' strategy on the adjacency matrix, we visualize the new edge-weighted adjacency matrix generated by the attention mechanism based on the BERT representation. We use the sentence ''In Baghdad, a cameraman died when an American tank fired on the Palestine Hotel'' as an example in Figure 5. The weights in the adjacency matrix are viewed as the strength of relatedness between words. Figure 5a shows the change that results from applying the attention mechanism to the dependency matrix. Since the original adjacency matrix generated from the dependency tree is binary (0 or 1) and symmetric, the edges on the top all have the same color, while the edges on the bottom differ in color intensity. Compared with the original dependency matrix, the pruned matrix produced by the attention mechanism can hierarchically express the correlation between words in the sentence. All words are equally important in the original dependency matrix, although some words are directly connected by short arcs. However, the new adjacency matrices focus more on the words associated with the trigger words. For example, the edges died-cameraman and fired-tank are darker than in-cameraman and fired-Palestine, respectively. This shows that the attention mechanism is able to distill useful information for trigger words from the dependency tree. Figure 5b shows the different dependency matrices generated by two attention heads. The first head gives more attention to the trigger word fired, while the second focuses on the other trigger word, died. Because different attention heads attend to information at different positions, leveraging the multi-head attention mechanism helps alleviate the multiple-events phenomenon. Multi-head attention can aggregate information from the multi-dimensional space of multiple events to keep the associations between multiple events.
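The re-weighting behaviour discussed above can be illustrated with a simplified sketch. It is not the exact formulation of the model: it assumes scaled dot-product attention for a single head, restricted by a softmax to the edges already present in the binary dependency matrix. The function name, the toy vectors and the three-token graph are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def masked_attention_adjacency(q: Matrix, k: Matrix, adj: List[List[int]]) -> Matrix:
    """Re-weight the edges of a binary adjacency matrix with attention scores.

    q, k: per-token query/key vectors for ONE head, shape (n, d).
    adj:  binary adjacency matrix of the dependency tree, shape (n, n).
    Row i of the result is a softmax over the neighbours of token i;
    entries where adj is 0 stay 0, so no new edges are introduced.
    """
    n, d = len(q), len(q[0])
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = [j for j in range(n) if adj[i][j]]
        if not neighbours:
            continue
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in neighbours]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        for j, e in zip(neighbours, exps):
            out[i][j] = e / total
    return out


# Toy graph over three tokens with hypothetical edges died-cameraman and died-fired.
tokens = ["cameraman", "died", "fired"]
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]

# Hypothetical per-head token vectors (used as both queries and keys).
head1 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
head2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

a1 = masked_attention_adjacency(head1, head1, adj)
a2 = masked_attention_adjacency(head2, head2, adj)
```

With these toy vectors, the first head puts more weight on the edge from ''died'' to ''cameraman'' and the second on the edge from ''died'' to ''fired'', so two heads yield two differently weighted graphs over the same dependency edges.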

VII. CONCLUSION
We present a new method based on graph convolutional networks for event detection. An attention mechanism is applied to the BERT representation and the adjacency matrix to generate multiple latent context-aware graph structures, which can dynamically retain relevant information and ignore irrelevant information from the dependency trees at different positions. Using BERT semantic information to dynamically prune the dependency matrices can distill a more beneficial syntactic structure for trigger words. We introduce graph convolutional networks with residual connections to combine the local and non-local contextual information, which can effectively enhance the information flow of the graph structure. The proposed model is empirically shown to be effective on sentences with multiple events and yields competitive performance on the ACE2005 dataset. For future work, we expect to investigate joint models for event extraction (i.e., both event detection and argument prediction) based on the proposed model. We also plan to apply the GCN models to other datasets and extend them to other information extraction tasks such as relation extraction.