Japanese Event Factuality Analysis in the Era of BERT

Recognizing event factuality is a crucial factor for understanding and generating texts with abundant references to possible and counterfactual events. Because event factuality is signaled by modality expressions, identifying modality expressions is also an important task. The question then is how to solve these interconnected tasks. On the one hand, while neural networks facilitate multi-task learning by means of parameter sharing among related tasks, the recently introduced pre-training/fine-tuning paradigm might be powerful enough for a model to learn one task without indirect signals from another. On the other hand, ever-increasing model sizes make it practically difficult to run multiple task-specific fine-tuned models at inference time, so parameter sharing can be seen as an effective way to reduce the model's size. Through experiments, we found that: (1) BERT-CRF outperformed non-neural models and BiLSTM-CRF; and (2) BERT-CRF neither benefited from nor was harmed by multi-task learning, indicating the practical viability of BERT-CRF combined with multi-task learning.


I. INTRODUCTION
Identifying the factuality of an event mention is an important task in natural language processing (NLP), with a wide range of potential applications such as information extraction, recognizing textual entailment, reasoning, and natural language understanding [1], [2], [3], [4], [5], [6]. Here we work on a recently published corpus of Japanese shogi (Japanese chess) commentaries [7] to develop a system for event factuality analysis, although the proposed method can readily be ported to other corpora that follow the same design principle. As an extensive-form game, shogi allows a computer to ground most event mentions in a game tree. Yet it is complex enough for its commentaries to exhibit a rich variety of factual statuses, for example, a possibility (Ex. (1)) and a counterfactual (Ex. (2)) (event mentions are marked with underlines):

(1) White may use the static rook strategy.
(2) The prediction that White would use the cheerful central rook strategy turned out to be false.

Given these, we expect event factuality analysis to help the automatic generation of human-like commentaries, among other applications.
The design principle this corpus adopts is to decompose event factuality analysis into a combination of several subtasks. Event mentions need to be detected to begin with. To assign factual statuses to them, we need to identify words and phrases that convey factuality information, which are a subset of modality expressions. Identifying grammaticalized verbs can be a useful filtering step because, due to semantic bleaching, they are unsuitable for further factuality analysis. We also notice that event mentions have a substantial overlap with named entities (NEs) specially designed for the shogi domain [8]. The divide-and-rule strategy is also useful for corpus construction because it facilitates speedy and consistent annotation.
The question, then, is how to solve the closely related but different subtasks as a whole. Since manually writing rules to connect them [9] is daunting, it is desirable to make a computer automatically learn their relationships from data. While each subtask can be straightforwardly formalized as sequence labeling, how best to exploit dependencies among subtasks remains unknown. The creators of the annotated corpus only reported preliminary experiments where they independently tackled each subtask using a non-neural sequence labeling tool [7].
One apparently promising approach is multi-task learning. Unlike taggers supplied with hand-crafted features, neural networks allow flexible knowledge sharing among related subtasks, which has proven effective in natural language analysis [10], [11]. For sequence labeling, knowledge sharing can be done by building subtask-specific taggers on top of a shared text encoder. The shared encoder transforms the input sentence into a sequence of vector representations, and each tagger uses them to predict labels. By sharing the encoder, the taggers implicitly exploit inter-task dependencies.
The situation has changed with the introduction of the powerful pre-training/fine-tuning paradigm [12], however. It has been shown that Transformer-based models pre-trained on a huge raw corpus outperform existing neural models by large margins and tend to retain good performance even when only a small amount of training data is available for the target task. This raises the possibility that pre-trained models are powerful enough to overshadow indirect signals from related subtasks.
From a practical point of view, it is non-negligible that pre-trained models are huge, and their success is driving a race to build even larger models. If the model is fine-tuned separately for each subtask, we end up running multiple variants of a huge model at inference time. For this reason, we observe that huge pre-trained models give new significance to multi-task learning: it is an effective way to reduce the model's size when we have multiple related tasks.
We conducted experiments to identify NEs, modality expressions, event classes, and event factuality, either separately or jointly. We found that BERT-CRF consistently outperformed non-neural models and BiLSTM-CRF, reconfirming the power of pre-training. Multi-task learning brought neither an increase nor a decrease in performance for BERT-CRF. We thus conclude that BERT-CRF with multi-task learning is a practical solution.

II. TASK DESIGN
As shown in Table 1, we adopt the task design proposed by Matsuyoshi et al. [7]. We assume that the input sentence is segmented into words. Our task is to perform sequence tagging for the following four layers.

A. NAMED ENTITIES
21 NE types are defined for the shogi domain [8]. With the BIO tagging scheme [13], each word is given one of 43 (= 21 × 2 + 1) tags. Note that many NEs happen to be event mentions. For example, moves (Mn) and defensive formations (Ca) are likely to be events.

B. MODALITY EXPRESSIONS
The example modality expressions in Table 1 indicate that their target events are counterfactual and possibly factual, respectively. As an agglutinative language, Japanese often uses complex sequences of function words as modality expressions. There are also some predicates that quantify the degree of factuality of their arguments, and hence modality expressions can simultaneously be event mentions (''break'' in this example). For ease of annotation, modality expressions are not explicitly linked to the corresponding event mentions, let alone their scopes.

C. EVENT CLASSES
One of 8 tags is assigned to the head word of an event mention and the O tag to other words. The purpose of this layer is to distinguish factuality-bearing event mentions (e.g., EVe) from others. For example, grammaticalized verbs that do not warrant factuality statuses are given EVf tags.

D. EVENT FACTUALITY
One of 6 tags, such as FNc (certain−) and FPr (probable+), is assigned to the head word of a factuality-bearing event mention, while other words are given O tags.
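To make the four layers concrete, the sketch below shows one way the parallel tag sequences could look for a short commentary sentence. The segmentation, the modality tag names (B-Mod/I-Mod), and the particular tag assignments are our illustrative assumptions, not gold annotations from the corpus.

```python
# Illustrative only: a hypothetical tokenized sentence ("White may switch to a
# static rook strategy") with one possible assignment of the four tag layers.
words = ["後手", "が", "居飛車", "に", "する", "かも", "しれ", "ない"]

layers = {
    # NE layer: BIO tags over the 21 shogi NE types (43 tags in total).
    "named_entity":     ["O", "O", "B-Ca", "O", "O",   "O",     "O",     "O"],
    # Modality layer: "かもしれない" ("may") signals a possibility.
    "modality":         ["O", "O", "O",    "O", "O",   "B-Mod", "I-Mod", "I-Mod"],
    # Event class layer: one tag on the head word of the event mention.
    "event_class":      ["O", "O", "O",    "O", "EVe", "O",     "O",     "O"],
    # Event factuality layer: one tag on the head of a factuality-bearing event.
    "event_factuality": ["O", "O", "O",    "O", "FPr", "O",     "O",     "O"],
}

# Every layer is a sequence-labeling problem over the same word sequence.
assert all(len(tags) == len(words) for tags in layers.values())
```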

III. PROPOSED METHOD
Fig. 1 shows an overview of the proposed neural network model. To solve the four related subtasks introduced in Section II, we adopt multi-task learning that enables parameter sharing: we build task-specific CRF taggers on top of a shared encoder.

The input word sequence, $x_1, x_2, \cdots, x_N$, is converted into a sequence of word embeddings, $e_1, e_2, \ldots, e_N$, using a lookup table. The vector sequence is fed into the encoder to obtain $h_1, h_2, \ldots, h_N$, the vector representations of the input sequence.
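As a minimal sketch of this input pipeline (in PyTorch, with placeholder dimensions), the lookup table is simply an embedding matrix indexed by word ids:

```python
import torch
import torch.nn as nn

# Placeholder sizes: vocabulary V, embedding dimension d, sentence length N.
V, d, N = 32000, 128, 10
lookup = nn.Embedding(V, d)              # the lookup table
x = torch.randint(0, V, (1, N))          # word ids x_1, ..., x_N (batch size 1)
e = lookup(x)                            # embeddings e_1, ..., e_N, shape (1, N, d)
# e is then fed into the shared encoder (BiLSTM or BERT, described next)
# to obtain the contextual representations h_1, ..., h_N.
```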
For the encoder, we test (1) BiLSTM and (2) BERT. BiLSTM is a combination of a forward LSTM and a backward LSTM. LSTM [14] is a powerful extension of recurrent neural networks and is capable of capturing long-distance dependencies. By combining two LSTM units, BiLSTM makes use of both the left and right contexts. For brevity, let $\mathrm{LSTM}_f$ denote the black-box forward LSTM. At time $t$, it takes $e_t$ and its previous output $\overrightarrow{h}_{t-1}$ as input and outputs $\overrightarrow{h}_t$. The backward LSTM is defined in an analogous way. Combining the two, BiLSTM computes $h_t$ as follows:
$$h_t = \overrightarrow{h}_t \oplus \overleftarrow{h}_t,$$
where $\oplus$ is the vector concatenation operation.

BERT (Bidirectional Encoder Representations from Transformers) [12] is a modern pre-trained language representation model known for achieving state-of-the-art performance on a wide range of tasks. Since BERT is pre-trained on a large raw corpus, we expect it to complement our small annotated data.¹

¹ In preliminary experiments, we also tested transfer learning from the latest version of the BCCWJ modality corpus [15]. It was a balanced corpus covering multiple domains. Although it was annotated with event class and factuality tags that were fully compatible with those of Matsuyoshi et al. [7], no annotation was available for modality expressions. We found no significant improvement with transfer learning, however.
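The sketch below illustrates the shared BiLSTM encoder and the per-task linear projections that produce $o_{m,t}$; the hidden size and the tag-set sizes other than NE (43) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SharedBiLSTMEncoder(nn.Module):
    """Forward and backward LSTMs whose hidden states are concatenated,
    so that h_t = h_fwd_t (+) h_bwd_t (dimensions are illustrative)."""
    def __init__(self, emb_dim=128, hidden=64):
        super().__init__()
        self.fwd = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.bwd = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, e):                          # e: (batch, N, emb_dim)
        h_fwd, _ = self.fwd(e)                     # left-to-right states
        h_bwd, _ = self.bwd(torch.flip(e, dims=[1]))
        h_bwd = torch.flip(h_bwd, dims=[1])        # re-align right-to-left states
        return torch.cat([h_fwd, h_bwd], dim=-1)   # (batch, N, 2 * hidden)

# Each subtask m has its own linear layer mapping h_t to tag scores o_{m,t};
# the tag counts below (other than 43 for NEs) are rough assumptions.
num_tags = {"ne": 43, "modality": 9, "event_class": 9, "factuality": 7}
heads = nn.ModuleDict({m: nn.Linear(64 * 2, k) for m, k in num_tags.items()})
```

With BERT as the encoder, the same per-task heads would simply be attached to its 768-dimensional outputs instead.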
For each subtask $m \in M$, the task-specific CRF [16] takes $h_1, h_2, \ldots, h_N$ as input and produces tagging decisions $y_m = y_{m,1}, y_{m,2}, \cdots, y_{m,N}$. $h_t$ is first linearly transformed into $o_{m,t}$, whose dimension equals the number of tag types. $o_m$ is then used to calculate the probability of $y_m$:
$$p(y_m \mid o_m) = \frac{\exp\big(s(y_m)\big)}{\sum_{y'_m} \exp\big(s(y'_m)\big)}, \qquad s(y_m) = \sum_{t=0}^{N} T_m^{y_{m,t},\, y_{m,t+1}} + \sum_{t=1}^{N} o_{m,t}^{y_{m,t}},$$
where $o_{m,t}^{y_{m,t}} \in \mathbb{R}$ is the score for the output tag $y_{m,t}$ according to $o_m$, and $T_m^{y_{m,t-1},\, y_{m,t}} \in \mathbb{R}$ is the score of the transition from $y_{m,t-1}$ to $y_{m,t}$. At $t = 0$, the special token BOS (beginning of sentence) is assigned to $y_{m,0}$. Similarly, the special token EOS (end of sentence) is assigned to $y_{m,N+1}$.
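For reference, the unnormalized score $s(y_m)$ that this probability is built from can be computed as in the sketch below; turning it into $p(y_m \mid o_m)$ additionally requires the log-partition term, typically computed with the forward algorithm.

```python
def crf_sequence_score(o_m, y_m, T_m, bos, eos):
    """Unnormalized score of one tag sequence y_m = (y_{m,1}, ..., y_{m,N}):
    transitions BOS -> y_{m,1} -> ... -> y_{m,N} -> EOS plus the emission
    scores o_{m,t}^{y_{m,t}}. o_m is an N x K table of emission scores and
    T_m a transition table indexed by tag ids (including BOS and EOS)."""
    score = T_m[bos][y_m[0]]
    for t in range(len(y_m)):
        score += o_m[t][y_m[t]]                    # emission score at position t
        if t + 1 < len(y_m):
            score += T_m[y_m[t]][y_m[t + 1]]       # transition to the next tag
    return score + T_m[y_m[-1]][eos]               # transition into EOS
```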
Let $D_m$ be the training data for task $m$. The task-specific objective function is defined as the negative log-likelihood
$$J_m = -\sum_{(x,\, y_m) \in D_m} \log p(y_m \mid o_m).$$
Finally, we define the overall objective function as a weighted sum of the task-specific objective functions:
$$J = \sum_{m \in M} \alpha_m J_m,$$
where $\alpha_m \geq 0$ and $\sum_{m \in M} \alpha_m = 1$. Here we employ the multiple gradient descent algorithm (MGDA) [17], and $\alpha_m$ is automatically tuned at each backward step.
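In code, the combined objective is simply a convex combination of the per-task losses; in the sketch below the weights are taken as given, whereas in our setup they are set automatically by MGDA [17] at each backward step.

```python
def multitask_loss(task_losses, alphas):
    """Weighted sum of task-specific objectives J = sum_m alpha_m * J_m,
    with alpha_m >= 0 and sum_m alpha_m = 1."""
    assert all(a >= 0.0 for a in alphas.values())
    assert abs(sum(alphas.values()) - 1.0) < 1e-6
    return sum(alphas[m] * task_losses[m] for m in task_losses)
```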

IV. EVALUATION
A. EXPERIMENTAL SETTINGS
Table 2 summarizes the corpus specifications. We used Japanese Wikipedia for pre-training and the shogi commentary corpus [7], [8] for evaluation. We used automatic word segmentation by KyTea [18] for the former and gold-standard word segmentation for the latter. The shogi commentary corpus was annotated with event factuality and other linguistic phenomena. For evaluation, the dataset was partitioned into ten roughly equal-sized subsets. In each run, eight subsets were used for training, one for development, and the remaining one for evaluation. Hyper-parameters were tuned on the development set. This procedure was repeated ten times, with a distinct subset chosen for evaluation in each run. We report micro F1 scores averaged over the ten runs.
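The evaluation protocol can be sketched as follows; how the development fold is chosen relative to the test fold is not specified above, so the rotation used below is an assumption.

```python
def cross_validation_splits(docs, n_folds=10):
    """Ten roughly equal folds: in each run, one fold is held out for testing,
    one for development, and the remaining eight are used for training."""
    folds = [docs[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        dev = folds[(i + 1) % n_folds]              # assumed rotation of the dev fold
        train = [d for j, f in enumerate(folds)
                 if j not in (i, (i + 1) % n_folds) for d in f]
        yield train, dev, test
```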

B. MODELS
As a baseline non-neural model, we used Linear CRF, a CRF model with sparse hand-crafted features. It directly outputs tags for each of the four subtasks. As shown in Table 3, the features used were word and POS n-grams (n ≤ 3) drawn from a window of three words on each side of the target word, as well as the target word itself. We used KyTea [18] to obtain POS tags. We also tested PWNER, an off-the-shelf non-neural sequence labeling tool, which was used by the creators of the annotated corpus to provide initial evaluations [7].
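A sketch of this feature template is given below; the feature string format is our own, but the content (word and POS n-grams with n ≤ 3 within a window of three words on each side of the target word) follows Table 3.

```python
def crf_features(words, pos, t, max_n=3, window=3):
    """Word and POS n-gram features (n <= 3) for target position t, drawn from
    a window of `window` words on each side of the target word."""
    feats = []
    lo, hi = max(0, t - window), min(len(words), t + window + 1)
    for n in range(1, max_n + 1):
        for i in range(lo, hi - n + 1):
            span = "%d:%d" % (i - t, i - t + n)        # offsets relative to t
            feats.append("W[%s]=%s" % (span, "_".join(words[i:i + n])))
            feats.append("P[%s]=%s" % (span, "_".join(pos[i:i + n])))
    return feats
```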
For the proposed neural network-based method, we tested BiLSTM-CRF and BERT-CRF, whose sentence encoders were BiLSTM and BERT, respectively. The models with multi-task learning (+multi) were compared against the models without it (unmarked). Without multi-task learning, we obtain a separately fine-tuned BERT model for each subtask, which makes running them all at inference time a practical challenge. We also tested multi-task learning restricted to modality expressions, event classes, and event factuality (+MEF), i.e., excluding named entity recognition.
In the pre-training step for BERT, we first segmented sentences into word sequences with KyTea [18] and then split each word into subwords with WordPiece [19], using a vocabulary size of 32,000. We used Adam [20] as the optimization algorithm.
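A minimal sketch of the subword step is given below, assuming the KyTea-segmented Wikipedia text has been written one sentence per line (words separated by spaces) to a hypothetical file wiki_kytea_segmented.txt; the use of the Hugging Face tokenizers library is our assumption, as the text does not name the WordPiece implementation.

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(
    lowercase=False,
    handle_chinese_chars=False,  # keep KyTea word units instead of splitting every CJK character
)
# Learn a 32,000-subword vocabulary from the pre-segmented corpus.
tokenizer.train(files=["wiki_kytea_segmented.txt"], vocab_size=32000)
tokenizer.save_model(".")        # writes vocab.txt used for BERT pre-training
```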
The details of the network configurations are shown in Tables 4 and 5. For the BiLSTM-CRF model, 64 × 2 dimensional vectors are fed into a CRF layer for each task because the outputs of the forward and backward LSTMs are concatenated. For BERT-CRF, 768-dimensional vectors are fed into a CRF layer for each task. In both models, dropout [21] was applied to each layer.

C. RESULTS AND DISCUSSION
The main results are shown in Table 6. Overall, BERT-CRF performed the best, consistently beating BiLSTM-CRF by large margins. Non-neural PWNER worked surprisingly well, especially for event classes and event factuality.
Multi-task learning (+multi) yielded no clear gains or losses. BERT-CRF+multi performed relatively poorly for NEs. As indicated by Table 2, the number of NE tags was much larger than the numbers of event-related tags. This motivated us to try +MEF, but it brought no consistent changes either.
For further analyses, we calculated tag-wise statistics. For detailed descriptions of the tag types, please refer to Mori et al. [8] and Matsuyoshi et al. [7]. Tables 7 and 8 show the results of the four subtasks. In these tables, ''Freq.'' indicates the number of instances of each tag type in the corpus. Most noticeably, the frequencies are skewed toward a few tag types. BERT-CRF performed relatively well for low-frequency tags, demonstrating the effectiveness of pre-training. Again, we observed no clear trend for the effect of +multi.
As we discussed in Section I, multi-task learning, or more precisely, parameter sharing among subtasks, has a practical advantage in computational efficiency because running multiple variants of fine-tuned BERT at inference time can be prohibitively expensive. The absence of any performance gain or loss due to multi-task learning leads us to conclude that BERT-CRF combined with multi-task learning is the pragmatic choice for event factuality analysis.

V. CONCLUSION
We proposed a deep neural network model for Japanese event factuality analysis. We combined pre-training, multi-task learning, and other techniques to achieve high performance on this important task. We reconfirmed that pre-training was highly effective in improving accuracy. While multi-task learning did not improve accuracy, it saves us from running multiple variants of huge fine-tuned models. Our experiments led us to conclude that BERT-CRF combined with multi-task learning is the practical choice for event factuality analysis.
Although our experiments employed a shogi (Japanese chess) commentary corpus, the proposed method is applicable to other domains if the task is designed in a similar way. In the future, we will apply the proposed approach to other domains, possibly with knowledge transfer from the shogi domain. We would also like to use event factuality analysis to tackle the symbol grounding problem since shogi is characterized by multiple possible worlds.