DPNPED: Dynamic Perception Network for Polysemous Event Trigger Detection

Event detection is the process of analyzing event streams to detect the occurrences of events and categorize them. General methods for solving this problem identify and classify event triggers. Most previous works focused on improving the recognition and classification networks and neglected the representation of polysemous event triggers. Polysemous words are inherently ambiguous and therefore hard to detect. To improve polysemous trigger detection, this paper proposes a novel framework called DPNPED, which dynamically adjusts the network depth between polysemous and common words. Firstly, to measure polysemy, a difficulty factor is devised based on the frequency of a word occurring as an event trigger. Secondly, DPNPED utilizes a confidence measure to automatically adjust the network depth by comparing the predicted and initial probability distributions. Finally, our model applies a focal loss to dynamically integrate the difficulty factor and confidence measure to enhance the learning of polysemous triggers. The experimental results show that our method achieves a noticeable improvement in polysemous event trigger detection.


I. INTRODUCTION
Event detection (ED) is an aspect of information extraction that aims at detecting event triggers in text and classifying them into specific event types [1], [2], [3], [4], [5], [6]. In general, an event trigger is a keyword or phrase that describes an event. For instance, in the sentence shown in Table 1, the ED system is required to identify two events: an Attack event triggered by ''hacked'', and a Die event triggered by ''death''. Some triggers are simple to detect, but others are difficult to identify. Taking the benchmark ACE2005 as an example, as shown in Table 1, judging the Attack event triggered by ''hacked'' and the Die event triggered by ''death'' is straightforward. On the contrary, it is confusing to determine whether ''going'' is a Transport event, meaning that Obama goes to a place, or a Start-Position event, meaning that Obama will work in the White House.
The associate editor coordinating the review of this manuscript and approving it for publication was Bo Pu.

TABLE 1. An example of intuitive and polysemous triggers. ''hacked'' and ''death'' are intuitive triggers and ''going'' is a polysemous trigger.
From the results of the baseline model, it can be observed that the event triggers that are difficult to detect are mainly concentrated in polysemous words, such as ''go'', ''force'' and ''take''. However, most of the previous works [5], [7], [8], [9], [10] focused on improving the recognition and classification networks, neglecting the representation of polysemous event triggers. Therefore, these methods are not effective in detecting polysemous triggers. Based on previous experience, polysemous triggers require more context to decrease ambiguity. To obtain more contextual information, one usually uses external knowledge [9], [11] or increases the network depth [12], [13]. For reusability, we are inclined to adjust the network depth. Regarding the dynamic semantic representation of words, [14] applied adaptive computation time to recurrent neural networks, which can dynamically increase the number of network layers. Reference [13] designed a universal transformer for BERT based on adaptive computation time, which improved on a variety of challenging problems, such as language modeling, reasoning, and question answering. Overall, the core issue of dynamic networks is how to determine whether the network needs to stack more layers. In [14] and [15], the sigmoid probability of RNN cells or transformers is accumulated until it exceeds a given threshold. The problem with these methods is that the cumulative sigmoid probability as a halting condition is task-agnostic and hard to interpret; it implies that the representation of all tasks is similar. To provide a task-oriented and reasonable halting mechanism, we design a confidence measure derived from the divergence [15] between two probability distributions.
The default probability distribution is a uniform initialization. In each layer, we calculate the target probability distribution and compare the two: if their difference is greater than the given confidence threshold, we stop increasing layers; otherwise, we stack more layers.
To improve the semantic representation of polysemous triggers, this paper applies a difficulty factor and a confidence measure to automatically adjust the network depth between polysemous and common words. As shown in Table 2, the word ''go'' appears 271 times in the training set, 44 of which are triggers, so its frequency as an event trigger is 0.162. At the same time, ''go'' acts as a trigger for various event types, such as ''Transport'' and ''End-Organization''. Obviously, compared with the trigger ''kill'', the trigger ''go'' is more difficult to detect. To make our model focus on polysemous triggers, we define a difficulty factor to measure the polysemy of a word, calculated by σ(1/Freq), where σ is the sigmoid function. A word whose ''Freq'' is less than 0.5 is defined as a highly polysemous trigger. According to statistics, more than 20% of triggers in ACE and MAVEN are polysemous triggers.
In summary, the key contributions of this work are as follows:
• Via a confidence measure based on the divergence, we propose a dynamic perception network that can dynamically adjust the network depth between polysemous and common words.
• Via the prior probability of each word as an event trigger, we introduce a difficulty factor to estimate the polysemy of a word. The difficulty factor is used in the focal loss to enhance the learning of polysemous triggers.

TABLE 2. Trigger statistics, where ''Freq'' means the frequency of a word as an event trigger, and ''Type'' means the number of event types a word acts as a trigger for. ''Difficulty'' represents the difficulty factor of a word. ''Polysemy'' is high when ''Freq'' is less than 0.5. Some of the existing methods find it difficult to detect ''go'' and ''take''.
• We have conducted extensive experiments on the ACE 2005 and MAVEN datasets. The experimental results show that our model achieves a noticeable improvement on polysemous event trigger detection.

II. METHODOLOGY
In this paper, we formulate event trigger detection as a sequence labeling task. Formally, given a sentence X = [x_1, x_2, ..., x_n] of n words and an index i of the trigger candidate x_i, the goal is to predict the event type y of this candidate. We define a start and end tag for each word. The tag ''O'' stands for the ''other'' event, which means that the corresponding word is not an event trigger. The tag ''EventType'' indicates that the word is a trigger. For example, as can be seen in Figure 1, the trigger word ''hacked'' is annotated as ''Attack'' [5], [16], and our model tags all words in the sentence sequentially via the dynamic perception network, where ''hacked'' receives a deeper network than ''death''.
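As an illustration of the start/end tagging scheme above, a minimal sketch is given below. The helper function and the tokenized example sentence (following Table 1) are hypothetical, not part of the released code.

```python
# Hypothetical sketch of the start/end tagging scheme: each trigger span
# contributes its event type at its start index (y_start) and end index
# (y_end); every other position is tagged "O".
def make_start_end_labels(tokens, triggers):
    """triggers: list of (start_idx, end_idx, event_type), indices inclusive."""
    y_start = ["O"] * len(tokens)
    y_end = ["O"] * len(tokens)
    for s, e, etype in triggers:
        y_start[s] = etype
        y_end[e] = etype
    return y_start, y_end

tokens = ["Hackers", "hacked", "the", "site", ",", "causing", "a", "death"]
labels = make_start_end_labels(tokens, [(1, 1, "Attack"), (7, 7, "Die")])
```

Single-word triggers such as ''hacked'' simply have identical start and end indices.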

A. TRIGGER ENCODER
This paper takes token embeddings as input. In the embedding layer, each token x_i consists of the following vectors:
• The word embedding vector of each token: this is obtained by looking up a pre-trained model [17]. The vector can be represented as w_i, and it contains the token, segment, and position embeddings.
• The candidate embedding vector of each token: it can be obtained from the training data, if a word appears as a trigger word, set the candidate to 1. Otherwise, set it to 0. The vector can be represented as t i .
• The pos embedding vector of each token: the part-of-speech tag of each word is obtained from the stanza tool. A total of 57 tags are used in this paper. The vector can be represented as p_i.

The embedding of each token can be formulated as x_i, which is fed into the dynamic perception network. Specifically, as shown in Figure 2, taking x_i as the input, it is transformed by the embedding layers into a vector representation e_i. Next, the transformer blocks in the encoder perform layer-by-layer feature extraction as in equation 3, where CS is the number of computational steps, and the predicted probability distribution of the current hidden state is given by equation 4.

FIGURE 1. Framework overview of the dynamic perception network for event detection. The dynamic perception network adopts the difficulty factor and confidence measure to adjust the network depth between polysemous and common words. It can be seen that, compared to other words, ''hacked'' has a deeper network. CS represents the number of computational steps, i.e., the number of network layers. In each layer, the predicted probability distribution is calculated and compared with the initial probability distribution to determine whether the model needs to stack more layers.

FIGURE 3. The confidence measure score of different predictions. q_i^j is a uniform distribution (the red dotted line) and p_i^j is the predicted probability of our dynamic network. The confidence measure score is calculated by equation 6: the more the predicted probability distribution deviates from the uniform distribution, the larger the confidence measure score.
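The embedding layer described above can be sketched as a simple concatenation of the three lookups. The table sizes and the 16-dimensional candidate/pos embeddings below are illustrative assumptions; only the 57-tag POS inventory is stated in the paper.

```python
import numpy as np

# Minimal sketch of the embedding layer: each token's representation e_i
# concatenates the pre-trained word embedding w_i, the binary trigger-candidate
# embedding t_i, and the POS embedding p_i. Dimensions are illustrative.
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(30522, 768))   # pre-trained lookup table (assumed size)
cand_emb = rng.normal(size=(2, 16))        # candidate flag: 0 or 1
pos_emb = rng.normal(size=(57, 16))        # 57 POS tags, as stated above

def embed_token(word_id, is_candidate, pos_id):
    w = word_emb[word_id]
    t = cand_emb[int(is_candidate)]
    p = pos_emb[pos_id]
    return np.concatenate([w, t, p])       # e_i = [w_i; t_i; p_i]

e = embed_token(101, True, 3)
```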

B. CONFIDENCE MEASURE
In order to measure whether a layer has enough information to predict the target correctly, we design a confidence measure that estimates the disparity between the predicted probability distribution and the initial probability distribution. Formally, with a uniform initial distribution q_i^j(k) = 1/M, the confidence measure score of the i-th token in the j-th layer is expressed as

c_i^j = (1 / log M) Σ_{k=1}^{M} p_i^j(k) log(M · p_i^j(k)),   (6)

where M is the number of labeled classes, p_i^j(k) is the probability that the i-th token in the j-th layer belongs to label k, and p_i^j ∈ R^M. The factor 1/log M is used to balance the discrepancy across different tasks. As shown in Figure 3, the more the predicted probability distribution deviates from the initial distribution, the larger the confidence measure score.
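A minimal sketch of this score, assuming the divergence is the KL divergence between the predicted distribution p and the uniform initial distribution q = 1/M, scaled by 1/log M as stated above so that the score lies in [0, 1]:

```python
import math

# Confidence measure sketch: KL(p || uniform) / log(M).
# Score is 0 for a uniform prediction and approaches 1 for a one-hot one.
def confidence_score(p):
    M = len(p)
    # KL(p || uniform) = sum_k p_k * log(M * p_k)
    kl = sum(pk * math.log(M * pk) for pk in p if pk > 0)
    return kl / math.log(M)

uniform = [0.25] * 4                    # no information gained yet
peaked = [0.97, 0.01, 0.01, 0.01]       # confident prediction
```

With the threshold µ = 0.5 used later in the paper, the uniform prediction would trigger further layers while the peaked one would halt.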
If the confidence measure score c_i^j of the input x_i is greater than the given threshold µ (a constant), the halting step is reached. Otherwise, we stack more layers and increase the computational step of the input x_i: CS_i ← CS_i + 1. To avoid mistakes, the dynamic perception network tends to compute for as long as possible. Therefore, we set a maximum number of computational steps (MC). If the computational step CS_i ≥ MC, then CS_i is the boundary for input x_i, and the output of this input is taken from the last layer.
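The halting mechanism can be sketched as the loop below. `layer_forward` and `predict_distribution` are hypothetical stand-ins for the transformer block and the per-layer classifier; `confidence_score` is any [0, 1]-valued score as described above.

```python
# Dynamic-depth loop sketch: keep applying layers to a token until the
# confidence score exceeds the threshold mu, or the maximum computational
# step MC is reached (then the last layer's output is used).
def dynamic_depth(h, layer_forward, predict_distribution,
                  confidence_score, mu=0.5, MC=24):
    cs = 0
    while cs < MC:
        h = layer_forward(h)
        cs += 1
        if confidence_score(predict_distribution(h)) > mu:
            break  # halting step reached: this layer's output is final
    return h, cs
```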

C. DIFFICULTY FACTOR
We notice that some triggers are easy to detect, but others are baffling to identify. To make our model focus on polysemous triggers, we define a difficulty factor to measure the polysemy of a word. Practically, the difficulty factor is calculated from the frequency of a word as an event trigger. The prior probability is expressed as f_i = count(triggers) / count(words), where count(triggers) represents the number of times a word occurs as an event trigger and count(words) is the number of occurrences of this word in the training corpus. For example, as can be seen from Table 2, the word ''war'' appears 379 times in the training set, 338 of which are triggers, and its frequency as an event trigger is 0.851. From the labeled data, the frequency f_i of each word as an event trigger can be calculated.
If a word appears frequently but has a low frequency as a trigger, it generally means that the word is difficult to detect. Finally, the difficulty factor of a word is d_i = σ(1/f_i), where f_i ∈ (0, 1] and d_i ∈ [0.731, 1). It can be seen that the difficulty factor of polysemous words is larger than that of common words.
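The difficulty factor can be computed directly from corpus counts. The sketch below uses the counts of ''go'' and ''war'' from the paper's running examples; the helper function itself is illustrative.

```python
import math

# Difficulty factor sketch: f_i is the prior frequency of a word as an
# event trigger, and d_i = sigmoid(1 / f_i), which maps f in (0, 1]
# to d in [sigmoid(1), 1) = [0.731, 1).
def difficulty(trigger_count, word_count):
    f = trigger_count / word_count            # f_i in (0, 1]
    return 1.0 / (1.0 + math.exp(-1.0 / f))   # d_i in [0.731, 1)

d_go = difficulty(44, 271)    # "go": f ~ 0.162, highly polysemous
d_war = difficulty(338, 379)  # "war": mostly a trigger, common
```

Polysemous words like ''go'' receive a difficulty factor near 1, while words that are almost always triggers stay near the lower bound 0.731.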

D. DYNAMIC FUSION LOSS
To learn the difficulty of each word, during training we use a focal loss that adds a factor d_i × (1 − p_i(k))^γ to the standard cross-entropy criterion. Setting γ > 0 reduces the relative loss for well-classified examples, putting more focus on hard, misclassified triggers. The total loss of our model L(x) over the sequence x is formulated as

L(x) = − Σ_{i=1}^{N} Σ_{k=1}^{M} d_i (1 − p_i(k))^γ y_i(k) log p_i(k) + α Σ_{i=1}^{N} CS_i / MC,   (10)

where N is the input sequence length and M is the number of labeled classes. d_i is the difficulty factor of the input x_i, p_i(k) is the probability that the i-th token belongs to label k, and y_i(k) is the ground-truth indicator. To incorporate the computational step CS_i into the focal loss, we divide CS_i by MC, and α is a balance factor.
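For one token, the loss above can be sketched as follows. The exact way the CS_i/MC term is combined with the focal term is our reading of the description, so this is an assumption-laden sketch rather than the released implementation.

```python
import math

# Per-token dynamic fusion loss sketch: difficulty-weighted focal
# cross-entropy plus an alpha-weighted depth-regularization term CS_i / MC.
# `gold` is the index k of the ground-truth label.
def token_focal_loss(probs, gold, d_i, cs_i, gamma=2.0, alpha=0.2, MC=24):
    p_k = probs[gold]
    focal = -d_i * (1.0 - p_k) ** gamma * math.log(p_k)
    return focal + alpha * cs_i / MC
```

With γ = 0 and α = 0 this reduces to a difficulty-weighted cross-entropy; γ > 0 down-weights well-classified tokens, as the text states.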

E. SPAN PREDICTION
Given the output of DPN, the model first predicts the probability of each token being a start index as p_start = softmax(h_cs W_start), where h_cs ∈ R^{N×d} is the output of DPN, N is the sequence length, and d is the hidden dimension. The start position predicted probability is p_start ∈ R^{N×M}, where M is the number of classes. The end index prediction procedure is exactly the same, except that another matrix W_end is used to obtain the probability matrix p_end ∈ R^{N×M}. Finally, our model starts from each predicted start index and finds the nearest end-index prediction of the same event type. At training time, the input X is paired with two label sequences y_start and y_end of length N, representing the ground-truth label of each token x_i being the start or end index of a trigger. Therefore, we have two losses, L_start and L_end, for the start and end index predictions, where the loss function is equation 10. The overall training objective to be minimized is their sum: L = L_start + L_end.

III. EXPERIMENTAL SETTINGS
A. Datasets
We conduct experiments on the widely used ACE2005 and MAVEN [10] datasets. For more statistical details, please refer to Appendix A.

B. Evaluation Metric
The evaluation metric is similar to previous works [10], [18]. For more details, please refer to Appendix B.

C. Implementation Details
During training, we use the same embedding as BERT [17] for input sequence initialization. The token, segment, and position embeddings are initialized from the pre-trained model. The candidate and pos embeddings use random initialization. The implementation of the transformer block is identical to that of Vaswani et al. [19]. After each transformer block, the confidence measure score between the output probability distribution and the uniform distribution is calculated by equation 6. If the confidence measure score c_i of the input x_i is greater than the given threshold µ, the output of the current transformer block is taken as the final output for that token, and this output is copied through the remaining blocks until the number of layers accumulates to MC. For other hyper-parameters and details, please refer to Appendix C.

D. Comparison Methods
For comparison, we investigate the performance of the following state-of-the-art methods: (1) MOGANED [20], which uses a graph convolution network with aggregative attention to explicitly model and aggregate multi-order syntactic representations. (2) DMBERT, which applies a large event-related candidate set with an adversarial training mechanism to identify informative instances and filter out noisy ones.
(3) GatedGCN [18], which presents a gating mechanism to filter noisy information in the hidden vectors of the GCN models for event detection based on the information from the trigger candidate. (4) EE-GCN [21], which exploits syntactic structure and typed dependency label information to perform event detection. (5) MLBiNet [22], which applies a multi-layer bidirectional network to capture the document-level association of events and semantic information simultaneously. For more implementation details and hyper-parameters, please refer to Appendix D.

IV. EXPERIMENTAL RESULTS
A. OVERALL RESULTS
B. ANALYSIS OF POLYSEMOUS TRIGGERS
We divide the triggers into two parts (low and high polysemy) and evaluate them separately, using the ACE2005 test set and the MAVEN dev set (since the MAVEN test set only supports online testing). Among all event triggers, if the probability of a word being an event trigger is less than 0.5, it is a highly polysemous trigger; otherwise, it is a common trigger. As shown in Table 4, more than 20% of triggers in ACE and MAVEN are highly polysemous triggers. As can be seen in Table 5, benefiting from the dynamic perception network and the difficulty factor, compared with MLBiNet and DMBERT, the F1 score of our model improves significantly on highly polysemous triggers, increasing by 8.8% on the ACE2005 test set and 9.5% on the MAVEN dev set. On ACE2005, MLBiNet is 1.9% higher than our model on common trigger (Low) recognition; the main reason is that it uses the document-level association of events to solve the multi-event problem. However, the multi-event problem in MAVEN is not pronounced, so the effect of MLBiNet deteriorates drastically while our model remains robust.

C. ABLATION ANALYSIS
We ablate each part of our model on ACE 2005 and MAVEN, as shown in Table 11. First, without the difficulty factor, we observe performance drops of 1.54% on ACE2005 and 0.73% on MAVEN, which verifies the usefulness of the difficulty factor in polysemous trigger extraction. By removing the pos embedding and the trigger candidate embedding, the performance drops slightly. Furthermore, after replacing DPN with LSTM or BERT, the performance decrease is the most significant. The main reason is that LSTM and BERT have difficulty extracting polysemous triggers, resulting in boundary errors. This shows that the DPN module increases the computational steps for difficult words, thereby improving the detection of polysemous triggers. Finally, replacing the Span decoder with softmax or CRF decoders causes a general performance drop.

D. ANALYSIS OF DYNAMIC PERCEPTION NETWORK
The dynamic perception network employs a confidence measure to adjust the depth of the neural network. As can be seen in Table 7, we list a few highly polysemous triggers. For example, the frequency of ''threat'' as an event trigger is 0.228. Compared with BERT+Span, DPN+Span uses 19 computational steps, and the F1 score is relatively increased by 33.3%. As the computational steps increase, DPNPED interacts more with the context and can better capture the contextual semantics. On the whole, the overall F1 score on highly polysemous triggers has increased by 9.7% in MAVEN and 16.5% in ACE2005. For the results on common triggers, please refer to Appendix E.
As shown in Table 8, when the maximum computational step is set to 24, our DPNPED model has 135M parameters, which is 63% smaller than BERT large. The main reason is that DPNPED uses the confidence measure to reduce the number of network layers (a 46%-54% decline) for semantically clear words. As the confidence measure score is calculated at each layer, the inference time is only reduced by 17%-20%. As shown in Table 9, on ACE2005 and MAVEN, compared to BERT large, although the number of layers of the DPNPED model is reduced by 54% and 46%, the F1 score is increased by 6.3% and 1.8%, respectively. According to previous works [23], [24], as the number of BERT layers increases, the semantics of some words gradually converge, which is called the semantic overfitting of words. Although each word has 24 layers in BERT large, this leads to the semantic overfitting of some words. In contrast, our dynamic perception network removes unnecessary layers through the confidence measure, thus reducing the semantic overfitting problem. Compared with MC = 24, the effect of MC = 12 is 0.6% worse on ACE2005 and 0.4% worse on MAVEN; the maximum number of computational steps limits the learning ability of the model.

E. ANALYSIS OF DIFFICULTY FACTOR
How can the difficulty of text reading be measured? The Lexile Framework for Reading [25] is widely accepted around the world. The framework's measures are based on two factors: semantic difficulty (word frequency) and syntactic complexity (sentence length). Over decades of research, both of these factors have been shown to be excellent predictors of how difficult a text is to comprehend. In practice, we pad all sentences to a fixed length; therefore, we concentrate on semantic difficulty. For comparison, we define five types of semantic difficulty, including 1/f_i and σ(1/f_i) as dynamic semantic difficulty, and the constants d_i = 0.731 and d_i = 1 as static semantic difficulty. To learn hard words better, this paper applies a focal loss [26], [27], [28] to the standard cross-entropy loss, which makes our model focus on hard, misclassified words. In the focal loss, the difficulty factor d_i dynamically adjusts the loss weight of polysemous words and common words.
As shown in Table 10, for dynamic semantic difficulty, compared with 1/f_i, the F1 score of σ(1/f_i) on highly polysemous triggers is relatively increased by 8.0% (ACE2005) and 6.5% (MAVEN). Due to the wide range of 1/f_i, the training loss fluctuates greatly, and the sigmoid function limits it to [0.731, 1). For static semantic difficulty, d_i = 0.731 or d_i = 1 means that the focal loss only adjusts the balance coefficient. Compared with static semantic difficulty, the dynamic semantic difficulty σ(1/f_i) is increased by 3.4% (ACE 2005) and 3.2% (MAVEN) on highly polysemous triggers. Dynamic semantic difficulty allows triggers that are difficult to identify in sentences to get more attention.

F. HYPERPARAMETER SENSITIVITY ANALYSIS
When the modulating factor of the focal loss γ = 0, the loss is equivalent to the cross-entropy loss. The experiments show that the focal loss is more suitable for event detection tasks with imbalanced event types. The confidence measure is mainly used to determine whether more layers are needed. As shown in Figure 4, if µ is too small, the number of computational steps is reduced and our model underfits. According to the statistics over a large number of experimental results, µ = 0.5 is the most suitable value.

V. LIMITATIONS
Nonetheless, these results must be interpreted with caution and a number of limitations should be borne in mind. Firstly, although the DPN design is versatile and applicable to all sequence labeling problems, the improvement is not obvious in our NER and part-of-speech tagging experiments. Secondly, the difficulty factor depends on the training data, so it is not suitable for recognizing out-of-vocabulary words.

VI. RELATED WORK
As reported by [18], [31], [32], and [33], pre-trained language models turn out to require deeper layers when long-distance dependency information is required. However, in practice, deeper models require substantial computational resources and data to fine-tune. Therefore, in the interest of both computational efficiency and ease of learning, it seems preferable to dynamically vary the number of layers for different inputs. Dynamic neural networks [2], [13], [14], [32] change their architectures based on the input data. Networks with dynamic depth [33], [34], [35], [36] achieve deep semantic representation in two ways: early exiting when shallower sub-networks have high classification confidence [36], [37], [38], or adding additional layers adaptively [39], [40]. For the halting conditions of dynamic networks, [41], [42], [43] propose contrastive learning approaches to measure semantic integrity based on divergence, which determine whether to stop early or to add additional layers. Although dynamic neural networks can enrich the semantic information of polysemous triggers, it is still difficult to fully understand low-frequency polysemous triggers; therefore, more attention needs to be paid to these hard samples. For the evaluation of text difficulty, [44], [45], [46] characterize text difficulty with word frequencies. [26], [27] design a focal loss to deal with hard samples, and [47] uses contrastive learning by selecting hard negative samples. Various methods have been proposed for event trigger detection [1], [3], [5], [6]. However, the above methods treat all triggers equally, leaving them insufficient for dealing with polysemous triggers. To learn a better representation, we design a dynamic perception network that can adjust the network depth for polysemous triggers.

VII. CONCLUSION
In this work, to address polysemous triggers, we propose the dynamic perception network, which applies a confidence measure to determine whether more layers are needed and a difficulty factor in the cross-entropy loss to focus learning on hard triggers. Our model achieves a noticeable improvement on the polysemous event trigger detection task on the ACE 2005 and MAVEN datasets.

APPENDIX B EVALUATION METRIC
Our evaluation metrics are identical to all previous works for a meaningful comparison. The comparison is based on the F measure (F1), Precision (P), and Recall (R): P = TP / (TP + FP), R = TP / (TP + FN), and F1 = 2PR / (P + R). TP (True Positive) is the number of triggers assigned correctly. FP (False Positive) is the number of triggers that do not belong to these event types but are incorrectly assigned to them. FN (False Negative) is the number of triggers that belong to these event types but are incorrectly assigned to other events.
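The metric definitions above can be sketched directly from the counts:

```python
# Precision, recall, and F1 from true-positive, false-positive, and
# false-negative trigger counts, as defined above.
def prf1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)
```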

APPENDIX C IMPLEMENTATION DETAILS
The confidence measure threshold µ is 0.5 and the maximum computational step MC is 24. Our model is trained using the Adam optimizer with a learning rate of 2e-5, a decay rate of 0.99, and a weight decay of 0.01. The dropout rate is set to 0.1, the maximum sequence length is 512, and the batch size is 8. During inference, the difficulty factor of each word is the same as in training, calculated from the training data. The loss function balance factor α is 0.2.

APPENDIX D COMPARISON METHODS DETAILS
In this section, we provide more implementation details of the baselines. For a fair comparison, all of these models are tested on an NVIDIA TESLA P100 GPU. Most of the previous models treat event detection as a sequence labeling problem and use the ''B/I/O-trigger'' schema to extract event triggers.

MOGANED uses a dependency-tree-based graph convolution network with aggregative attention to explicitly model and aggregate multi-order syntactic representations in sentences. The dependency tree is built with the Stanford CoreNLP toolkit and the pre-trained word embedding is glove 6B 100d. The batch size, training epochs, and sequence length are the same as in their paper.

DMBERT applies a large event-related candidate set with an adversarial training mechanism to iteratively identify informative instances from the candidate set and filter out noisy ones, and then fine-tunes on the task dataset. The large distantly supervised dataset from Freebase contains 142,611 labeled instances and 21 event types. The initial embedding is based on BERT and the hyperparameters are the same as in their paper.

GatedGCN exploits gate diversity and the overall contextual importance scores of the words on graph convolution neural networks. The dependency tree is built with the Stanford CoreNLP toolkit. The word embedding is initialized from the cased BERT base and its parameters are frozen during training in this work. The learning rate is 5e-5 and the number of GCN layers is 2. Other parameters take default values.

EE-GCN exploits syntactic structure and typed dependency label information to perform event detection. The framework is the same as GatedGCN, and the dependency tree is built with the Stanford CoreNLP toolkit. All hyperparameters take default values.

MLBiNet treats event detection as a seq2seq problem based on a multi-layer bidirectional network. The dropout rate is 0.5 and the penalty coefficient is 2e-5. The number of layers of the bidirectional network is 2 and other hyperparameters take default values.

APPENDIX E DYNAMIC PERCEPTION NETWORK ON COMMON TRIGGER
As can be seen in Table 13, we list a few common triggers. Overall, the average number of computational steps for common triggers is smaller than that for polysemous triggers. The advantage of DPNPED on common triggers is not obvious; the main reason is that the semantics of common triggers are relatively clear.

APPENDIX F CASE STUDY
As can be seen in Table 14, the ambiguity of the trigger ''go'' is serious. At the same time, the computational steps of our model vary with the context. The experimental results show that dynamically adjusting the network depth is better than fixing it.

APPENDIX G ERROR ANALYSIS
To analyze the bad cases in event detection, we randomly sampled 50 of them and give some reasonable explanations for our model's prediction errors. We summarize them in Table 15.
For precision, rare triggers (such as ''free'', ''end'', etc.) that appear fewer than five times in the training data are the main cause of prediction errors. Annotation errors are also a severe problem. For example, in ''On the news that Jessica Lynch is eventually going to come home'', ''come'' should represent a Transport event, but it is marked as a Meet event. Other errors involve ambiguous and indirect triggers such as ''go'' and ''the'', which require a more in-depth semantic analysis of the sentence to identify.
For recall, rare triggers still account for the largest proportion. Multi-word triggers are the secondary cause of detection errors. For example, in ''The deadliest conflict since World War II and has been going on for five years'', ''World'' should be marked as B-Attack, ''War'' as I-Attack, and ''II'' as I-Attack. Other errors involve ambiguous and indirect triggers.

ACKNOWLEDGMENT
The authors would like to thank the reviewers for their helpful comments and suggestions.