Capsule Networks With Word-Attention Dynamic Routing for Cultural Relics Relation Extraction



I. INTRODUCTION
Online museums and online cultural relic information provide better resources to help people understand cultural relics and supply data for information extraction research. Identifying the semantic relations between cultural relic entities in online cultural relic data is of great significance for various applications related to cultural relics, such as cultural relic knowledge graphs and cultural relic retrieval.
Relation extraction (RE) is one of the important steps in knowledge extraction; its main task is to extract the semantic relation between entities in a text to obtain a triple ⟨entity 1, relation, entity 2⟩. Such relations exist between cultural relic entities, for example between cultural relic name (CRN) and cultural relic dynasty (CRD), CRN and unearthed date (UD), CRN and unearthed location (UL), and CRN and museum collection (MC).
The associate editor coordinating the review of this manuscript and approving it for publication was Paolo Napoletano.
Examples of semantic relations are presented in Table 1. One of the relations between two entities in sentence S1 is the cultural relic unearthed date between CRN and UD, with e1 = ''Three Sheep Bronze Lei'' and e2 = ''1977''; the second is the cultural relic unearthed location between CRN and UL, with e1 = ''Three Sheep Bronze Lei'' and e2 = ''Liu Jiahe, Pinggu, Beijing''. In sentence S2, the relation between CRN and CRD is the cultural relic dynasty, with e1 = ''the Lei'' and e2 = ''Shang dynasty''. In the last sentence, the relation is the cultural relic museum collection between CRN and MC, with e1 = ''Three Sheep Bronze Lei'' and e2 = ''the capital museum''.
In previous relation extraction studies, researchers have applied many different traditional models (e.g., feature-based methods [1] and kernel function-based methods [2]). Although these methods perform reasonably on the relation extraction task, the feature-based approach does not take full advantage of the contextual structure information of the entities, and its training data rely mainly on manual labelling, which introduces many incorrectly labelled examples into the experiments. Kernel function-based methods are slow in both training and testing and are not suitable for processing large-scale data [3].
Deep neural networks can automatically learn the underlying feature information and alleviate the laborious task of manual annotation [4]. In recent years, many researchers have applied neural network models to relation extraction tasks, such as convolutional neural networks (CNNs) [5], recurrent neural networks [6], and bidirectional long short-term memory (BiLSTM) [7]. These systems predict one relation for a single entity pair in a sentence. However, multiple entity pairs commonly appear in a sentence and may describe more than one relation, so it is beneficial to consider other relations when extracting relations from the context. Moreover, when modelling spatial information, the spatially insensitive CNN and LSTM are inevitably limited by rich text structures, which makes it difficult for them to encode text effectively and weakens their expressive power.
Hinton et al. [8] proposed the capsule network, which replaces the single neuron nodes of traditional neural networks with neuron vectors to avoid the representational limitations of CNNs and RNNs. Sabour et al. [9] proposed a dynamic routing algorithm to replace the max-pooling of CNNs and RNNs. The length of a capsule indicates the significance of a feature, and the direction of a capsule expresses the specific property of that feature. In this paper, we utilize a capsule network with dynamic routing to extract one or multiple relations between multiple entity pairs in a sentence.
Word representation is considered an important factor influencing the performance of relation extraction. Recent research has shown that word embeddings induced from a large corpus can effectively capture the semantic and syntactic information of words [10]. Character-based word embedding can improve the performance of natural language processing tasks by capturing unknown words [11]. Because Chinese cultural relic data often contain unknown words, and to account for the importance of context characteristics, we combine word embedding with character embedding as part of the word representations. In addition, the part of speech (POS) of words and the position of words are also incorporated into our model because they carry critical context information for key characters and highlight the impact of entities.
In the task of relation extraction in the field of cultural relics, relations are sparse and irrelevant words are common in the data, so extracting the contextual information of long sentences is pivotal. In addition, irrelevant words in a sentence strongly affect model performance in relation extraction [12]. Effectively removing noise from sentences remains a key issue in existing relation extraction research. Half of the sentences in the original cultural relic data are longer than 50 words; even though multiple relations may occur in a sentence, many words remain irrelevant. Since deep neural networks with attention mechanisms have achieved remarkable results in natural language processing tasks, some researchers have introduced attention mechanisms into relation extraction to mitigate noisy data [13], [14]. However, these attention models are all built on neural networks of different depths, adding a single-layer attention mechanism over different feature levels of the input vector. To further capture contextual relevance and reduce the decay of useful information in long sentences, we devise a word-attention mechanism in the dynamic routing algorithm to effectively search for capsules containing correlated features.
We propose a relation extraction model named WAtt-Capsnet (the capsule network with word-attention dynamic routing) for the relation extraction task of online cultural relic data. In our method, the capsule network captures richer instantiation features, while the combination embedding used as the model input captures critical context information. To reduce the decay of useful information in long sentences and further capture contextual relevance, we propose a routing algorithm based on a word-attention mechanism to focus on informative words. We conduct comparative experiments between our model and the baselines, and the experimental results demonstrate that our method achieves significant performance on the relation extraction task of online cultural relic data.
The main contributions of our work are summarized as follows: (1) We propose a relation extraction model named WAtt-Capsnet, which uses capsule networks with word-attention dynamic routing to extract instantiation features for the relation extraction task of online cultural relic data.
(2) We also present combination embedding to capture the characteristic information of Chinese sentences by considering the contribution of words and parts of speech and characters as well as the position of words, which captures rich internal structure information of the sentence and highlights the impact of entities.
(3) We also incorporate word-attention into dynamic routing by considering the different contributions of words; it iteratively amends the connection strength to mitigate the long-term memory problem by increasing the weight coefficients of informative words.
VOLUME 8, 2020
The remaining sections of this paper are organized as follows. Section 2 reviews prior work on traditional and deep learning models for relation extraction and on word representation. Section 3 presents our framework for cultural relic relation extraction and introduces the working principle of each component of our model in detail. Section 4 introduces the dataset, evaluation, and baselines and describes the experimental process and results. Section 5 discusses the performance of our model on relation extraction tasks in light of the experimental results. Section 6 summarizes our major research findings and future research priorities.

II. RELATED WORK
A. TRADITIONAL METHODS FOR RELATION EXTRACTION
Relational extraction has important significance for the research of information retrieval, text understanding and knowledge graph construction [15]. Traditional methods for relation extraction are feature engineering-based and kernel function-based.
Early relation extraction methods used handcrafted features: a set of designed features and a classifier are trained to classify new relation instances. Kambhatla [16] employed maximum entropy models combining diverse lexical, syntactic and semantic features extracted from the text to address the scarcity of labelled data and the errors produced by entity detection modules when extracting semantic relations between entities. Building on this work, Zhou et al. [17] incorporated additional features and systematically explored feature-based relation extraction. They combined base phrase chunking information at the syntactic level and used semantic information (e.g., WordNet and name lists) to improve performance.
The overall performance of feature-based approaches largely depends on the validity of the generated features, and methods applying high-level lexical and syntactic features suffer from error propagation [18]. Kernel-based methods avoid explicit feature engineering by designing kernel functions that compute the similarity of two relation instances [19]. Mooney and Bunescu [20] proposed a kernel method to extract semantic relations between entities, based on a generalization of three types of subsequence kernels that maintain entity relations. In their method, each word (instead of each sequence) is generalized to a feature vector, and each relation instance is then represented by a sequence of feature vectors. Khayyamian et al. [21] proposed a generalized convolution tree kernel, and Zhang et al. [22] proposed syntactic parse tree kernels. The convolution tree kernel produces various subkernels from different definitions of the two weighting functions for relation extraction.

B. DEEP LEARNING MODELS FOR RELATION EXTRACTION
In kernel-based methods, elaborately designed kernels rely on pre-existing systems, so errors of the individual modules accumulate downstream. Fortunately, this problem has been alleviated by deep learning [4]. Early deep learning models treated relation extraction as a multi-class classification problem that assigns a relation class label to each sentence. Santos et al. [23] proposed a ranking CNN for relation classification. The model generates a distributed vector representation of the text using a convolutional layer and scores each class by comparing the distributed vector representation with the class representations. However, CNNs cannot well represent the precise spatial relations between high-level parts. Recurrent neural networks are another popular deep learning model for relation extraction. Sorokin et al. [6] utilized an LSTM-based encoder to jointly capture the representations of all relations in the sentential context; the final prediction combines the representations of the context relations with the representation of the target relation.
To better capture important features, Lin et al. [24] proposed an attention-based CNN model, and Zhang et al. [25] proposed an attention-based LSTM model for RE. However, it is difficult for CNNs and LSTMs to obtain richer instantiation features.

C. CAPSULE NETWORKS FOR RELATION EXTRACTION
Recently, capsule networks have been proposed to address the representational imperfections of CNNs and RNNs. Wang et al. [26] proposed a framework based on the capsule network for relation extraction. They used a capsule network in the sentence encoder, and the involvement of the capsule layer made encoding spatial patterns more powerful, which helped to determine the relation expressed in sentences. Zhang et al. [27] presented a framework based on capsule networks for relation extraction. In their method, the multi-label relation extraction problem is converted into multiple binary classification problems by the capsule networks. They treated aggregation as a routing problem so that complex and interleaved features could be identified by the capsule network. Zhang et al. [28] proposed capsule networks with dynamic routing for multi-labelled relation extraction to address the challenge of multiple relations in a sentence. To better cluster features, they devised a dynamic routing algorithm based on an attention mechanism and further designed a sliding-margin loss function to extract the relations precisely.
Because of the effectiveness of the capsule network with attention-based dynamic routing in relation extraction, we use the capsule network with word-attention dynamic routing to extract instantiation features for the task of extracting cultural relic text relations.

D. WORD REPRESENTATION
Liu [5] used a simple convolutional neural network to automatically capture features instead of handcrafted ones. The model encodes the input sentence using word vectors and lexical features and then applies a convolutional layer and a softmax layer to produce a probability distribution over the relation classes. The model assigned a vector to each synonym class but did not take full advantage of the real representational ability of word embeddings. Zhou et al. [29] converted each word into a word embedding via a matrix-vector product (i.e., every word is transformed into a real-valued vector). For each input sentence, they looked up the embedding matrix for each word and used 50-dimensional and 100-dimensional pre-trained word vectors as input to their long short-term memory (LSTM) layer. Word embedding effectively captures the semantic and syntactic features of words but underuses the features contained in characters. Zhang et al. [30] used character embeddings initialized with vectors pre-trained by word2vec together with character position embeddings as the vector representation. In a sentence containing two entities, each character is mapped to a low-dimensional dense vector that combines the character vector and the position vector of the word in the sentence. The position embedding encodes the relative positions of the current character with respect to the former and latter entities, each corresponding to a position embedding.
To make better use of character characteristics and effectively capture semantic and syntactic information of words, we propose a combination of word embedding and character embedding to improve the performance of relation extraction.

III. METHODOLOGY
In this section, we describe the architecture of WAtt-Capsnet, based on a capsule network with word-attention dynamic routing, for cultural relic relation extraction. Our proposed network is a variant of the original capsule network. As depicted in Figure 1, our relation extractor comprises four primary layers: the input representation layer, bidirectional LSTM layer, capsule network with a word-attention dynamic routing layer, and relation prediction layer.
(1) Input Representation Layer. The input representations of our model incorporate four components to obtain the critical context information for each word.
(2) Bidirectional LSTM Layer. The forward and backward networks of the BiLSTM model are used to capture the low-level features of the sentences. (3) Capsule Network with Word-Attention Dynamic Routing Layer. Low-level features are clustered into high-level relation capsules, with word-attention amending the connection strength during routing. (4) Relation Prediction Layer. A separate-margin loss function similar to Zhang [28] is used to predict possible relations.

A. INPUT REPRESENTATION LAYER
In the input representation layer, we incorporate four components as the input representations of our model for each sentence: word embedding, character embedding, part-of-speech (POS) vectors and position vectors, because they contain critical context information for each sentence.

1) WORD EMBEDDING
We adopt word2vec-based embedding [31], an open-source tool developed by Google that learns low-dimensional continuous vector representations of words. The word embeddings, pre-trained in the skip-gram setting of word2vec, are distributed representations of words that map each word w_i of the sentence s_i to a d_w-dimensional real-valued vector.

2) CHARACTER EMBEDDING
We also use word2vec to train character embeddings to address the unsatisfactory performance of word segmentation for relation extraction tasks [32]. Character embeddings are representations mapping each character c_i of the word w_i to a d_c-dimensional vector.

3) POS EMBEDDINGS
To represent the POS features of the words in s_i, POS tags are predefined according to the scheme of Jieba, and we then map each POS tag of w_i to a d_POS-dimensional vector.

4) POSITION EMBEDDINGS
To represent the position features, we use the relative distances between the current word and M entities. When s_i contains only a single entity pair relation, M = 2; when s_i contains multiple entity pair relations, we limit the maximum number of entity pairs in a sentence to two, so M = 4. When the distance between a word and entity1 is 2 and the distance between the word and entity2 is -4, the word has two positional characteristics [d_1, d_2], which are represented as a d_P-dimensional vector. Position embeddings highlight the impact of entities.
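As a sketch, the relative-distance features described above can be computed as follows; the function name and the indexing of entities by their first word are illustrative assumptions, not the paper's implementation:

```python
def position_features(seq_len, entity_positions):
    """Relative distance from each word to each entity.

    entity_positions: word indices where each entity begins
    (at most two entity pairs, i.e. up to four entities, so M = 4).
    Returns one M-tuple of signed distances per word.
    """
    feats = []
    for i in range(seq_len):
        feats.append(tuple(i - e for e in entity_positions))
    return feats

# A word at index 5 with entity1 at index 3 and entity2 at index 9
# has the positional characteristics [d1, d2] = [2, -4] from the text.
feats = position_features(12, [3, 9])
assert feats[5] == (2, -4)
```

Each signed distance is then looked up in a trainable table to obtain its d_P-dimensional vector.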
Finally, we concatenate the word embedding, character embedding, POS vectors and position vectors and then obtain a combinational representation for each word as input for our model.
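The concatenation can be sketched in a few lines of numpy; the dimensions below are placeholders rather than the paper's actual settings, and the random vectors stand in for learned embeddings:

```python
import numpy as np

d_w, d_c, d_pos, d_p, M = 100, 50, 10, 5, 4  # placeholder dimensions

rng = np.random.default_rng(0)
word_vec = rng.normal(size=d_w)      # word2vec word embedding
char_vec = rng.normal(size=d_c)      # character embedding for the word
pos_vec = rng.normal(size=d_pos)     # POS tag embedding
dist_vec = rng.normal(size=d_p * M)  # one d_P-dim vector per relative distance

# combinational representation for one word
x = np.concatenate([word_vec, char_vec, pos_vec, dist_vec])
assert x.shape == (d_w + d_c + d_pos + d_p * M,)  # 100 + 50 + 10 + 20 = 180
```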

B. BIDIRECTIONAL LSTM LAYER
Bidirectional LSTM is effective for modelling sequence data in relation extraction tasks [33]. Consequently, we leverage BiLSTM to deeply exploit the low-level features of the sentence. The combination embedding from the input representation layer is fed into the BiLSTM model, and the global sequence features are captured by both the forward and backward networks. Finally, the hidden state vector h_t is obtained as h_t = →h_t ⊕ ←h_t, where →h_t and ←h_t are the forward and backward hidden states and ⊕ is the element-wise sum.
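A toy illustration of the element-wise combination (not a full BiLSTM; the hidden states here are arbitrary values):

```python
import numpy as np

# Toy forward/backward hidden states for a sentence of L words,
# each with hidden size d.
L, d = 4, 3
h_fwd = np.arange(L * d, dtype=float).reshape(L, d)
h_bwd = np.ones((L, d))

# h_t = forward state ⊕ backward state, where ⊕ is the element-wise sum
h = h_fwd + h_bwd
assert h.shape == (L, d)
```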

C. CAPSULE NETWORK WITH WORD-ATTENTION DYNAMIC ROUTING
1) CAPSULE NETWORK
The output of a capsule is a vector that can express the features extracted by nodes more richly, and different dimensions of the vector record different attributes of the same feature. In our work, we use capsule networks to extract instantiation features. A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of relation feature. The length of a capsule's output vector represents the probability that the relation feature exists, and the orientation of the vector expresses the specific property of that feature. We disperse all the semantic information captured by the BiLSTM into the first-level capsules, so the representation of each word token is expressed by capsules. A non-linear ''squash'' function is then used to compress the vector length:

v_j = (||s_j||^2 / (1 + ||s_j||^2)) (s_j / ||s_j||),

where v_j is the output of capsule j and s_j is its input. Except for the first layer of capsules, the total input s_j to a capsule j is a weighted sum over all the prediction vectors û_{j|i} from the capsules in the low-level layer, each calculated by multiplying the instantiation parameters u_i of a lower-layer capsule by a weight matrix W_ij:

s_j = Σ_i c_ij û_{j|i},  û_{j|i} = W_ij u_i,

where c_ij is the coupling coefficient that is iteratively obtained by dynamic routing. The initial logit b_ij between capsule i and capsule j is the log prior probability that they should be coupled, and it determines the coupling coefficients c_ij through a routing softmax:

c_ij = exp(b_ij) / Σ_k exp(b_ik).

With the capsule network, we obtain high-level capsules that express relation features more richly.
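The squash function and the routing softmax above can be written down directly; this is a small numpy sketch of those two equations (not the authors' code):

```python
import numpy as np

def squash(s):
    """v_j = (||s||^2 / (1 + ||s||^2)) * (s / ||s||):
    short vectors shrink towards 0, long vectors approach length 1."""
    norm2 = np.sum(s ** 2)
    return (norm2 / (1.0 + norm2)) * s / (np.sqrt(norm2) + 1e-9)

def coupling(b):
    """Routing softmax turning logits b_ij (for one capsule i) into c_ij."""
    e = np.exp(b - b.max())
    return e / e.sum()

v = squash(np.array([3.0, 4.0]))           # ||s|| = 5
assert np.linalg.norm(v) < 1.0             # the length encodes a probability
assert np.allclose(coupling(np.zeros(3)), 1 / 3)  # uniform coupling at b = 0
```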
However, the long source sentences in the cultural relic data cause large attenuation in the gradient update and failures to effectively update the parameters at the head of the sequence. Therefore, we propose a word-attention dynamic routing algorithm that iteratively amends the connection strength. Our attention mechanism, an extension of the standard attention mechanism, allows the decoder to obtain important input information and thus mitigates the long-term memory problem. The goal of word-attention is to ignore irrelevant words and focus on informative words: it assigns higher weights to informative words and lower weights to irrelevant words, increasing the influence of the informative ones.
Let P be the matrix of output vectors [p_1, p_2, ..., p_L] generated by the BiLSTM layer, where L denotes the sentence length. Following Zhou et al. [29], the weight β_ij of a relational word p_ij is computed as

β_ij = softmax(p_ij q_ij),

where w_ij is a weight matrix and the vector q_ij represents an informative word.
The pseudocode for the word-attention routing algorithm is summarized in Algorithm 1.

Algorithm 1 Word-Attention Dynamic Routing
1: procedure ROUTING(û_{j|i}, b_{j|i}, r)
2: for r iterations do
3: c_ij = softmax(b_ij) for all low-level capsules i
4: for all capsules j in the high-level layer do
5: s_j = Σ_i c_ij û_{j|i}, û_{j|i} = W_ij u_i
6: β_ij = softmax(p_ij q_ij)
7: v_j = squash(s_j · c_{j|i} β_ij), a_j = v_j
8: b_{j|i} = b_{j|i} + v_j · û_{j|i}
9: Return v_j, a_j
10: End procedure
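The routing loop can be sketched in numpy as below. This is our reading of Algorithm 1, not the authors' implementation: in particular, folding the word-attention weights β into the vote aggregation as a per-word scaling is an assumption about how β interacts with the coupling coefficients.

```python
import numpy as np

def squash(s):
    norm2 = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / (np.sqrt(norm2) + 1e-9)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_attention_routing(u_hat, beta, iterations=3):
    """u_hat: (num_in, num_out, d) prediction vectors û_{j|i}
    beta:  (num_in,) word-attention weights of the low-level capsules
    Returns (num_out, d) high-level relation capsules v_j."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))            # routing logits b_ij
    for _ in range(iterations):
        c = softmax(b, axis=1)                 # coupling coefficients c_ij
        w = c * beta[:, None]                  # word-attention scales each word's vote
        s = np.einsum('ij,ijd->jd', w, u_hat)  # s_j = sum_i beta_i c_ij û_{j|i}
        v = squash(s)
        b = b + np.einsum('jd,ijd->ij', v, u_hat)  # agreement update (line 8)
    return v

u_hat = np.random.default_rng(1).normal(size=(6, 4, 8))
beta = softmax(np.random.default_rng(2).normal(size=6))
v = word_attention_routing(u_hat, beta)
assert v.shape == (4, 8)
```

Informative words (large β) thus contribute more strongly to every high-level capsule across all routing iterations.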

D. RELATION PREDICTION LAYER
To train the proposed model, we use a separate-margin loss, following Sabour et al. [9], for the relation extraction task:

L_j = Y_j max(0, m+ − ||v_j||)^2 + λ (1 − Y_j) max(0, ||v_j|| − m−)^2,

where Y_j = 1 if relation j is present and Y_j = 0 otherwise. The top and bottom margins are set to m+ = 0.9 and m− = 0.1, respectively. λ is the down-weighting of the loss for absent relations, and we set λ = 0. The total loss of a sentence is the sum of the losses for all relations.
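The margin loss can be sketched as follows; note that the default λ = 0.5 below is the conventional down-weighting from Sabour et al. and is an assumption for illustration, not necessarily the value used in this work:

```python
import numpy as np

def margin_loss(v_norms, y, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss summed over relation capsules.
    v_norms: capsule lengths ||v_j||; y: 1 if relation j is present, else 0.
    lam down-weights the loss for absent relations (0.5 is the common
    choice from Sabour et al.; the paper's value may differ)."""
    present = y * np.maximum(0.0, m_plus - v_norms) ** 2
    absent = lam * (1 - y) * np.maximum(0.0, v_norms - m_minus) ** 2
    return np.sum(present + absent)

# A present relation with capsule length above m+ incurs no loss,
# and an absent relation with length below m- incurs no loss.
assert margin_loss(np.array([0.95, 0.05]), np.array([1, 0])) == 0.0
assert margin_loss(np.array([0.5, 0.5]), np.array([1, 0])) > 0.0
```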

IV. EXPERIMENTS
A. DATASET AND EVALUATION
To evaluate our model, we employ the List of National Cultural Relics Collection (LNCRC) (http://www.sxhm.com/, http://gl.sach.gov.cn/collection-of-cultural-relics/index.html) as the cultural relic semantic relation annotation specification. We manually annotate 32,760 cultural relic texts captured from the National Museum of China online (http://www.chnmuseum.cn/). The training and test sets contain 22,932 and 9,828 cultural relic texts, respectively. We adopt precision, recall and F1-score as evaluation metrics.
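For reference, these metrics over extracted relation triples reduce to simple counts of true positives, false positives and false negatives; a minimal sketch (micro-averaged, illustrative counts):

```python
def precision_recall_f1(tp, fp, fn):
    """Micro precision/recall/F1 from counts of correctly extracted (tp),
    wrongly extracted (fp) and missed (fn) relation triples."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
assert p == 0.8 and r == 0.8
```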

B. BASELINE MODELS
We compare the performance of our model with the following baselines: (1) CNN, proposed by Santos et al. [23] for the relation classification task; (2) CNN+Att, the attention-based CNN model for relation extraction proposed by Lin et al. [24]; and (3) LSTM, proposed by Sorokin et al. [6], which utilizes an LSTM-based encoder to jointly capture the representations of all relations in the sentential context.

C. EXPERIMENTS
1) PERFORMANCE COMPARISON BETWEEN OUR MODEL AND BASELINES
To verify the overall performance of our model in the task of cultural relic relation extraction, we conduct comparison experiments between our model and the baseline models. The experimental results in Table 1 show the overall relation extraction performance of our model and the baselines on our evaluation dataset. It can be seen that our model WAtt-Capsnet outperforms all baselines and exhibits the best precision, recall and F1-score.

2) EFFECT OF WORD REPRESENTATION ON OUR METHOD
To evaluate the effect of word representation on our model, we compare our method with models applying the capsule network with word-attention dynamic routing that use four kinds of word representations as input: word embedding, character embedding, the combination of word and character embedding, and the combination of word, character embedding and POS. Building on the combination of word, character embedding and POS, we consider the importance of position and add position vectors to the word representation to ensure that our model effectively captures the features of the entities. The experimental results are shown in Figure 2. Our model WAtt-Capsnet achieves the maximum precision, recall and F1-scores on the relations DUR, LUR, DMR and MCR, as well as on the mean of the four relations.
In Figure 2, DUR is the cultural relic/date_unearth relation, LUR is the cultural relic/location_unearth relation, DMR is the cultural relic/dynasty_manufactured relation, MCR is the cultural relic/museum_collection relation, and All is the mean over DUR, LUR, DMR and MCR. Word uses word embeddings as the input representation of our model; Character uses character embeddings; W_C uses the combination of word and character embeddings; W_C_POS uses the combination of word embeddings, character embeddings and POS vectors; and WAtt-CapsNet is our method, which uses all four components as the input representation: word embeddings, character embeddings, part-of-speech (POS) vectors and position vectors.

3) EFFECT OF WORD-ATTENTION DYNAMIC ROUTING ON OUR METHOD
To better validate the performance of word-attention dynamic routing, we compare our model with the variant NWatt-CapNet, which removes word-attention from dynamic routing. The experimental results in Figure 3 show that WAtt-Capsnet, using a capsule network with word-attention dynamic routing, outperforms the variant NWatt-CapNet, although the variant model remains effective. Word-attention is thus useful for capsule networks and greatly improves the performance of cultural relic relation extraction.

V. DISCUSSION
The overall performance comparison in Table 1 shows higher precision, recall and F-measure than the baseline models, which demonstrates that the capsule network with word-attention dynamic routing performs better than the baselines and implies the appropriateness and advantage of our proposed model, which captures richer instantiation features, over existing approaches in the task of cultural relic relation extraction.
The further evaluation of the word representation, constituted by word embedding, character embedding, POS vectors and position vectors, reveals that using pre-trained embeddings combined with POS and word position can dramatically improve relation extraction performance. Most likely, the same word appearing in different positions in a sentence has different POSs and positions and is therefore assigned different representation vectors as the model input; the POS of a word further expresses its features, and the position of a word highlights the impact of entities. This could be one reason why our model generally improves the overall relation extraction performance and outperforms the other approaches.
The evaluation of word-attention dynamic routing on our method also suggests that word-attention dynamic routing is useful for capsule networks and greatly improves the performance of cultural relic relation extraction. This may be because word-attention dynamic routing focuses on informative words: the weight coefficients of informative words are greater than those of irrelevant words, which increases the influence of informative words and enhances the identification of different types of relations. This could be another reason why our model generally improves the overall relation extraction performance.

VI. CONCLUSIONS
In this paper, we propose a model named WAtt-Capsnet, based on capsule networks with word-attention dynamic routing, for the relation extraction task of online cultural relic data. The capsule network captures richer instantiation features, and word-attention dynamic routing enhances relation discrimination and reduces the decay of useful information in long sentences, so that informative word features can be combined effectively in relation extraction. We also present the combination embedding of word embedding, character embedding, POS and word position as the model input to capture the features and rich internal structure information of sentences. Comparative experimental results demonstrate that our method achieves significant performance on the relation extraction task of online cultural relic data.
At present, most deep learning methods for relation extraction are supervised methods and require a large quantity of labelled data. In the future, we will study the semi-supervised and unsupervised learning methods of relation extraction to further improve the efficiency of relation recognition and reduce dependency on labelled data simultaneously.
MIN ZHANG received the B.S. degree from Xidian University, Xi'an, China, in 2004, and the M.S. degree in computer software and theory from Northwest University, Xi'an, in 2010. She is currently pursuing the Ph.D. degree in software engineering at Northwest University.
She is also an Associate Professor. Her research interests include information processing, machine learning, and knowledge graph. She is a member of China Computer Federation.
GUOHUA GENG is currently a Professor at Northwest University, Xi'an, China. She is also a Ph.D. Supervisor with the School of Information Science and Technology, Northwest University, China. Her major research interests include information processing and image processing.
She is the Director of the National-Local Joint Engineering Research Center of Cultural Heritage Digitization and a reviewer in many international conferences and journals. She also received several academic rewards, including three National Science and Technology Progress Awards (second class).