Research on Joint Extraction Method of Entity and Relation Triples Based on Hierarchical Cascade Labeling

As an important research field of artificial intelligence, the knowledge graph is developing rapidly, and triplet extraction is the key to knowledge graph construction. The traditional pipeline extraction method propagates entity recognition errors into relation extraction and degrades the extraction effect. Besides, the traditional pipeline method cannot solve the SEO (Single Entity Overlap) and EPO (Entity Pair Overlap) problems. Motivated by this, we compare the advantages and disadvantages of the mainstream joint extraction methods for entity and relation triples and propose a new joint extraction method based on a hierarchical cascade labeling model (named the HCL model), which is built on the cooperation of multiple neural networks. Further, we construct a balanced-sampling Chinese dataset for entity and relation triplet extraction that contains SEO and EPO cases. We carry out experiments on this balanced dataset, and the F1 value of the HCL model reaches 65.4%, better than the other baseline models.


I. INTRODUCTION
In the current big data era, the rapid development of information equipment in people's lives has caused explosive growth of data. The data are also diverse, including structured, semi-structured, and unstructured data. How to use these multi-source heterogeneous data and how to extract useful knowledge from them have long been a research focus [1], [2]. Entity and relation triplet extraction is a core task in the field of information extraction; its main purpose is to extract triplet knowledge from different types of data and build a knowledge graph that supports downstream tasks such as intelligent recommendation, semantic search, and deep question answering [3], [4].
Compared with extracting useful information from unstructured data, extracting useful information from structured data is relatively easy, and the extraction tasks can be completed according to certain rules [5]. At present, there is a lack of effective methods to utilize the current large amount of unstructured data, so the intrinsic value of text data cannot be realized. Triple extraction technology can extract triplet knowledge from text, which supports knowledge fusion over unstructured and multi-source heterogeneous data. This paper focuses on entity and relation triple extraction for Chinese text data, uses deep learning methods, and proposes an HCL (Hierarchical Cascade Labeling) model to jointly extract entity-relation triples in a cascaded manner based on hierarchical cascade labeling. The experimental results show that the HCL model achieves better results than other baseline models under small-sample data conditions, so it has advantages in scenarios lacking open corpora, such as some vertical industries. In addition, the HCL model achieves good results on a dataset containing SEO and EPO phenomena, so it has advantages when the text corpus contains overlapping entity relationships.
The main contributions of this paper are as follows: (1) This paper summarizes the problems faced by current triplet extraction and analyzes the current mainstream extraction methods, including the parameter sharing method and the sequence labeling method, comparing the advantages and disadvantages of each; (2) Because of the current lack of a balanced Chinese triplet dataset, this paper constructs a reasonable balanced Chinese triplet dataset. The public Duie2.0 entity-relationship dataset is optimized and simplified, and a threshold is set for each relation to ensure balanced sampling of the data; (3) For the joint extraction task of entity and relation triples, this paper analyzes the multi-neural-network cooperation model and proposes a hierarchical cascade labeling (HCL) model. Roberta-2BiLSTM-FC is used for head entity labeling, and head entity feature enhancement and CNN dimension reduction are used in the cascaded extraction model for tail entity and relation labeling.
(4) On the balanced-sampling triplet dataset, joint triple extraction models using the traditional dictionary for feature extraction and models using the Bert model for feature extraction are all compared with the HCL model. The F1 value of the HCL model reaches 65.4%, superior to the other baseline methods.
The rest of this paper is organized as follows: Section II introduces the related work on triplet extraction, Section III introduces the method proposed in this paper, Section IV introduces the experimental environment, datasets, and result analysis, and Section V concludes this paper.

II. RELATED WORK
At present, there are still many problems in extracting entity relation triples from text [6]. In general, there are two main types of problems. The first is that texts contain many polysemous and even mutually nested entities [7]. In different contexts, the same entity may have multiple meanings, and entities may contain one another. For example, in 'Wuhan University is located in Wuhan', entity nesting occurs between the head entity 'Wuhan University' and the tail entity 'Wuhan': the word 'Wuhan' acts as a modifier in the head entity, but in the tail entity it denotes a place. In 2011, Hoffmann and Zhang [8] first considered overlapping entity relationships in the model construction process. The second type of problem is that texts contain many overlapping entity relationships [9]. Currently, there are two main types: SEO and EPO [10], as shown in Fig. 1.
SEO means single entity overlap. For example, in the text 'Zhang San was born on January 28, 1981 and is a lawyer', the subject is 'Zhang San'. There is a relationship named 'Date of birth' between the subject 'Zhang San' and the object 'January 28, 1981', and another relationship named 'Occupation' between the subject 'Zhang San' and the object 'Lawyer'.
EPO refers to the overlap of entity pairs in the text. For example, in the text 'Li Si was born in Shanghai', the subject is 'Li Si' and the object is 'Shanghai', and two kinds of relationships hold between this entity pair (Li Si, Shanghai).

Deep learning methods are widely used in various classification tasks [11], [12], and for the task of entity relationship extraction, deep learning currently outperforms traditional feature engineering [13]. Therefore, this paper mainly uses deep learning technology to carry out the joint extraction of entity and relation triples. The mainstream methods for joint extraction of entity relation triples are shown in Fig. 2. From the perspective of implementation, entity relationship extraction mainly includes the parameter sharing method and the joint decoding method [14]. The parameter sharing method includes mapping from entity pair to relation, mapping from relation to entity pair, and mapping from head entity to tail entity and relation [15], [16]. The sequence labeling method is commonly used in the joint decoding method.
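To make the two overlap patterns concrete, here is a small illustrative sketch (the helper name and the 'Place of residence' relation are hypothetical, not from the paper) that classifies a sentence's gold triples as Normal, SEO, or EPO:

```python
from collections import Counter

def overlap_patterns(triples):
    """Classify a sentence's gold triples.

    triples: list of (subject, relation, object) tuples.
    SEO: some entity participates in more than one triple.
    EPO: some (subject, object) pair carries more than one relation.
    Note that an EPO sentence also exhibits entity overlap by construction.
    """
    pair_counts = Counter((s, o) for s, _, o in triples)
    entity_counts = Counter()
    for s, _, o in triples:
        entity_counts[s] += 1
        entity_counts[o] += 1
    patterns = set()
    if any(c > 1 for c in pair_counts.values()):
        patterns.add("EPO")
    if any(c > 1 for c in entity_counts.values()):
        patterns.add("SEO")
    return patterns or {"Normal"}
```

On the Zhang San example above, the shared subject yields SEO; two relations over the same (Li Si, Shanghai) pair would yield EPO.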

A. THE PARAMETER SHARING METHOD
The general process of the parameter sharing method is shown in Fig. 3. It generally includes three layers of structure: text input layer, feature extraction layer, and parameter sharing layer [17]. The parameter sharing layer includes two subtasks: entity recognition and relationship extraction. According to the different parameter flow directions of the parameter sharing layer, it is divided into three types: mapping from entity pair to relationship, mapping from relationship to entity pair, and mapping from head entity to tail entity and relationship [18].

1) MAPPING FROM ENTITY PAIRS TO RELATIONSHIPS
In the method of mapping from entity pairs to relationships, entity pairs are first identified in the parameter sharing layer, and then relation extraction is carried out [19]. The two tasks are treated as a pipeline, so current research on this method focuses on continuously improving the accuracy of the two subtasks, entity identification and relation extraction, in which deep learning models are widely used.
On entity recognition tasks, Chen et al. [20] proposed an entity recognition model based on an improved CNN, which has a more powerful ability to capture global context information than a plain CNN. Ronran and Lee [21] proposed a two-layer BiLSTM-CRF model whose F1 value reached 91.10%.
On relation extraction tasks, Yuan et al. [22] proposed a cross-relationship attention mechanism, which can notice the interaction between relation types and enhance the effect of relation extraction. Liu et al. [23] introduced curriculum learning into the remote supervised relationship extraction task, and the proposed model improves the generalization ability of the model through cooperative training of relationship extractors and tutor networks.
There are two defects in this method [24]. First, entity recognition and relationship extraction are treated as two serial tasks, creating a dependency between them, so entity recognition errors are amplified in the relation extraction task. Second, although this method can solve the SEO problem, it cannot solve the EPO problem.

2) MAPPING FROM RELATIONSHIPS TO ENTITY PAIRS
The method of mapping from relationships to entity pairs first uses a multi-label classification model to judge the relationship type in the parameter sharing layer and then carries out entity recognition. Zhou et al. [25] improved the effect of relationship extraction by stacking convolutional neural networks on BiLSTM and decoded the entity pairs corresponding to the current relationship type using a unidirectional LSTM. Yuan et al. [26] used a gate mechanism to reduce the impact of unrelated relationship types on entity recognition after extracting relationships.
This method can solve the SEO and EPO problems by first judging all relationship categories and then generating all triples in turn. However, it is difficult to extract relationships directly without first identifying entities, and the method struggles to achieve good results on most datasets.

3) MAPPING FROM HEAD ENTITY TO TAIL ENTITY AND RELATIONSHIP
The method of mapping from the head entity to the tail entity and relationship first extracts the head entity in the parameter sharing layer and then extracts the tail entity and relationship at the same time [27]. Therefore, this method is also called the half-pointer, half-labeling method. Katiyar and Cardie [28] used the BILOU sequence labeling method to label the head entity and used an improved attention model to extract the relationship and tail entity. Wei et al. [29] proposed a cascaded binary labeling framework, which uses the Bert model for feature extraction and takes the relationship as a condition to map tail entities through head entities, as shown in Formula 1:

f_R(S) → O    (1)

where R stands for the relationship, S stands for the head entity, and O stands for the tail entity.
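The relation-conditioned mapping f_R(S) → O can be sketched as a small PyTorch module (a minimal illustration of the idea, not the authors' exact architecture; layer names and the additive fusion are assumptions): each token representation is conditioned on the detected head entity vector, and sigmoid start/end scores are produced for every relation at once.

```python
import torch
import torch.nn as nn

class RelationConditionedTagger(nn.Module):
    """Sketch of cascaded binary tagging: given token features and the
    head entity (subject) vector, predict start/end probabilities of the
    tail entity for every relation r simultaneously."""
    def __init__(self, hidden, num_relations):
        super().__init__()
        self.start_fc = nn.Linear(hidden, num_relations)
        self.end_fc = nn.Linear(hidden, num_relations)

    def forward(self, token_feats, subject_vec):
        # token_feats: (batch, seq_len, hidden); subject_vec: (batch, hidden)
        # condition every token on the detected head entity (additive fusion)
        fused = token_feats + subject_vec.unsqueeze(1)
        start = torch.sigmoid(self.start_fc(fused))  # (batch, seq_len, num_relations)
        end = torch.sigmoid(self.end_fc(fused))
        return start, end
```

Because each relation has its own output column, one subject can yield tails under several relations, which is why this family of methods handles SEO and EPO.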
Compared with the first two methods, it is easier to identify the relationship and tail entities on the basis of the identified head entities, and this method can solve both the SEO and EPO problems.
Therefore, this paper uses this idea and proposes a joint extraction method of entity relationship triples based on hierarchical cascade labeling.

B. THE SEQUENCE LABELING METHOD
The sequence labeling method realizes joint extraction of entities and relations by designing a unified annotation scheme for both. NovelTagging [30] first used sequence tagging to realize joint extraction and extended the original BIES (Begin, Inside, End, Single) tagging scheme. The encoded sentence sequence is directly decoded into tagging information, and the relational triplet is obtained from the tags. Dai et al. [31] further improved the tagging scheme by tagging a sentence of length n a total of n times, once for each word position, so that each word can be tagged multiple times.
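A simplified sketch of such a unified annotation scheme (the exact tag format below, '<BIES position>-<relation>-<role>', and the single-triple restriction are illustrative simplifications of the NovelTagging scheme):

```python
def novel_tagging_labels(tokens, subject, relation, obj):
    """Encode ONE triple as per-token tags: '<BIES>-<relation>-<role>',
    role 1 = subject, role 2 = object, 'O' for all other tokens."""
    def span_tags(span, role):
        if len(span) == 1:
            return [f"S-{relation}-{role}"]
        return ([f"B-{relation}-{role}"]
                + [f"I-{relation}-{role}"] * (len(span) - 2)
                + [f"E-{relation}-{role}"])

    tags = ["O"] * len(tokens)
    for span, role in ((subject, 1), (obj, 2)):
        start = tokens.index(span[0])  # naive: first occurrence of the span
        for i, tag in enumerate(span_tags(span, role)):
            tags[start + i] = tag
    return tags
```

Because every token carries at most one tag, a single pass of this scheme cannot represent a token that belongs to two triples, which is exactly the SEO/EPO limitation discussed below.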
This method is easy to implement, and entities and relationships are directly marked through a unified annotation scheme [32]. However, a model that directly marks entities and relationships has difficulty achieving good results, and the SEO and EPO problems cannot be solved through such an annotation scheme.
Based on the above comparative analysis of current methods, this paper adopts the method of mapping from head entity to tail entity and relationship, designs a multi-neural-network collaborative joint extraction model for entity relation triples, and optimizes the model to better extract entity relation triples.

III. METHOD
In this paper, we propose a new joint entity relation triple extraction network architecture based on hierarchical cascade labeling. The structure of the HCL model is shown in Fig. 4. The HCL model mainly includes two parts: head entity labeling, and relationship and tail entity labeling.

A. HEAD ENTITY LABELING
The structure of the head entity labeling is shown in Fig. 5. To extract the text features and identify the head entities, the head entity labeling includes the Roberta coding layer, BiLSTM layer 1, BiLSTM layer 2, and the FC layer, and the head entity is marked through this whole process.

1) ROBERTA LAYER IN HEAD ENTITY IDENTIFICATION
To improve the accuracy of head entity recognition, this paper adopts the Roberta model, an improved version of the Bert model [33] and a current mainstream feature encoding method, as the text feature extraction model. As shown in Fig. 6, after the text enters the Roberta layer, the Roberta tokenizer uses its dictionary to map the line of text into token embeddings.
Token embeddings represent the feature values of the words in the text, while the entity relationships contained in the text are also related to the characteristics of word order and sentence order [34]. Before the text features are input into the transformer of Roberta, three types of vectors need to be fused: the word feature vectors (token embeddings), position embeddings, and segment embeddings [35]:

X_Token ∈ R^(W×d_m)    (2)
X_Position ∈ R^(W×d_m)    (3)
X_Segment ∈ R^(W×d_m)    (4)

where X_Token is the token vector, X_Position is the position vector, X_Segment is the segment vector, W is the number of Chinese characters in the sentence, and d_m is the dimension of the word vector; the model input is the element-wise sum of the three. In this paper, the text length is not uniformly truncated, and the length of the original line is kept unchanged during model operation.
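The fusion of the three embeddings can be sketched as follows (a minimal illustration of the standard BERT/RoBERTa input layer; the class name and sizes are arbitrary):

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Model input = token + position + segment embeddings,
    each row of shape (W, d_m) for a W-character sentence."""
    def __init__(self, vocab_size, max_len, d_m):
        super().__init__()
        self.token = nn.Embedding(vocab_size, d_m)
        self.position = nn.Embedding(max_len, d_m)
        self.segment = nn.Embedding(2, d_m)

    def forward(self, token_ids, segment_ids):
        # position ids 0..W-1, shared across the batch
        pos_ids = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.position(pos_ids).unsqueeze(0)
                + self.segment(segment_ids))
```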
After token embedding, the Roberta model mainly uses the multi-head attention mechanism to extract feature information from the sentence [36]. The process is shown in Formula 5:

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (5)

The Roberta model obtains the three vectors query (Q), key (K), and value (V) by multiplying random matrices W with the vectors in the embedding layer, and these matrices are continuously updated and optimized during training. d_k represents the dimension of the query and key vectors [37].
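Formula 5 translates directly into a few lines of PyTorch (single-head case, without the learned projection matrices, for illustration):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (B, L_q, L_k)
    weights = torch.softmax(scores, dim=-1)             # rows sum to 1
    return weights @ v                                  # (B, L_q, d_v)
```

With a zero query the attention weights are uniform, so the output is simply the mean of the value vectors, which is a handy sanity check.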
The Roberta model stacks the attention mechanism over several iterations to obtain a better effect [38]. The structure of a single iteration is shown in Fig. 7; it includes a multi-head attention layer, an Add & Normalize layer, and a feed-forward layer.

2) BILSTM LAYER IN HEAD ENTITY IDENTIFICATION
An LSTM (Long Short-Term Memory) network is a special form of RNN (Recurrent Neural Network) that can better handle sequence data such as text [39]. Compared with the traditional RNN model, LSTM alleviates the vanishing gradient problem on long texts [40]. Therefore, this paper uses the BiLSTM layer to decode the output vector of the Roberta layer twice to obtain data features that better reflect entities and relationships. As shown in Fig. 8, the network structure mainly includes an update gate and a forgetting gate [41]. The update gate mainly controls the retention or deletion of information from the forward state. The forgetting gate controls whether the calculation of the candidate state depends on the previous state.
The forgetting gate, update gate, and output state are calculated by the following formulas:

f_t = σ(W_f · [h_(t−1), x_t] + b_f)    (6)
i_t = σ(W_i · [h_(t−1), x_t] + b_i)    (7)
C̃_t = tanh(W_c · [h_(t−1), x_t] + b_c)    (8)
C_t = f_t ∗ C_(t−1) + i_t ∗ C̃_t    (9)
o_t = σ(W_o · [h_(t−1), x_t] + b_o)    (10)
h_t = o_t ∗ tanh(C_t)    (11)

where x_t represents the current input data, h_(t−1) represents the hidden state at the previous time step, h_t represents the current hidden state, σ is the sigmoid function with a value range of 0 to 1, W_f, W_i, W_c, W_o are the weight matrices, b_f, b_i, b_c, b_o are the bias vectors, and tanh is the activation function.
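The gate equations above can be checked with a small NumPy sketch of one LSTM time step (the dict-based weight layout is an illustrative convention, not part of the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the gate equations above.
    W maps gate name -> weight matrix applied to [h_{t-1}; x_t];
    b maps gate name -> bias vector."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])       # forgetting gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # update (input) gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate state
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t
```

With all-zero weights, every gate outputs 0.5 and the candidate state is 0, so the cell state simply halves; this matches equations (6)-(11) term by term.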

3) FC LAYER IN HEAD ENTITY IDENTIFICATION
This model uses the FC layer to perform feature processing and dimension transformation on the output of the BiLSTM layer, ensuring that the output dimension of the model corresponds to the dimensions of the starting position of the head entity. The formula is as follows:

Y = XA + b    (12)

where X is the input matrix, A is the weight matrix, and b is the bias vector.
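As a sketch, the FC layer of the head entity labeling can be written as a linear map from BiLSTM features to per-token start/end pointer scores (the two-column output and sigmoid activation are assumptions based on the half-pointer design described above):

```python
import torch
import torch.nn as nn

class HeadEntityPointer(nn.Module):
    """Y = XA + b, mapping per-token BiLSTM features to
    start/end probabilities of the head entity."""
    def __init__(self, hidden):
        super().__init__()
        self.fc = nn.Linear(hidden, 2)   # column 0: start, column 1: end

    def forward(self, x):                # x: (batch, seq_len, hidden)
        return torch.sigmoid(self.fc(x))
```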

B. RELATIONSHIP AND TAIL ENTITY LABELING
Based on head entity recognition, the relationship and tail entity are extracted in a cascade. The structure of the relationship and tail entity labeling is shown in Fig. 9. The text features extracted by the Roberta layer are spliced with the output of the head entity recognizer; this splicing enhances the head entity features of the original text and benefits the subsequent relationship and tail entity labeling. Since the spliced features become longer, a CNN network is added for dimensionality reduction [42]. Finally, the start and end positions of the tail entities under all relationships are marked by two FC layers whose width matches the total number of relationships, and the relationship and tail entity are obtained, as shown in Formula 13. The output of the subject tagger is processed so that its dimensions are compatible with the Roberta output to enable the feature splicing operation. The dimensional transformation is shown in the following formula, where A_Sub represents the output vector of the head entity locator, N represents the all-ones vector whose dimension is the same as Roberta's output, and bert_dim represents the configured dimension of the Roberta model.
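The splice-then-reduce cascade described above can be sketched as follows (a minimal illustration under stated assumptions: the kernel size, the tiling of the head entity vector, and the sigmoid outputs are our choices, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class TailRelationTagger(nn.Module):
    """Sketch of the tail entity and relation labeling stage:
    splice the broadcast head-entity vector onto each token feature,
    reduce the widened features with a 1D CNN, then two FC layers emit
    start/end scores for every relation."""
    def __init__(self, bert_dim, num_relations):
        super().__init__()
        self.reduce = nn.Conv1d(2 * bert_dim, bert_dim, kernel_size=3, padding=1)
        self.start_fc = nn.Linear(bert_dim, num_relations)
        self.end_fc = nn.Linear(bert_dim, num_relations)

    def forward(self, token_feats, head_vec):
        # token_feats: (B, L, bert_dim); head_vec: (B, bert_dim)
        tiled = head_vec.unsqueeze(1).expand_as(token_feats)   # broadcast subject
        spliced = torch.cat([token_feats, tiled], dim=-1)      # (B, L, 2*bert_dim)
        reduced = self.reduce(spliced.transpose(1, 2)).transpose(1, 2)
        start = torch.sigmoid(self.start_fc(reduced))          # (B, L, num_relations)
        end = torch.sigmoid(self.end_fc(reduced))
        return start, end
```

The Conv1d here plays the dimension-reduction role the paper assigns to the CNN: it halves the spliced feature width back to bert_dim before the relation-wide FC taggers.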

IV. EXPERIMENTS

A. CONSTRUCTION OF A BALANCED SAMPLING DATASET
Duie2.0 is currently the largest Chinese entity relationship extraction dataset, with 173,109 lines of training data and 15,475 lines of test data, covering 50 entity relationship types. However, the number of occurrences of each relationship is uneven, with large differences, which is not suitable for triple extraction experiments [43]. To ensure that all kinds of relationships are sampled in a balanced manner, and to compare the effects of different models more fairly, this paper optimizes Duie2.0: the threshold for the maximum number of occurrences of each relationship is set to 240 in the training set and 50 in the test set. Some examples of the sampling results are shown in Table 1. A few relationships with too little data in Duie2.0, such as the 31st relationship, cannot reach the average level, but most relationships can, so the optimized data is relatively balanced overall. The created triplet dataset includes 10,661 lines of training data, 2,019 lines of test data, and 49 relationship types. The constructed dataset has two characteristics: first, each line of the corpus contains the annotated triples and the text sentence to be trained, so the quality of the dataset is high; second, the dataset contains the two special scenarios of SEO and EPO. Part of the corpus is shown in Table 2.
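The threshold-capped sampling can be sketched as follows (a simplification: it assumes one relation label per line, whereas real Duie2.0 lines may contain several triples; the function name and data layout are hypothetical):

```python
from collections import defaultdict

def balanced_sample(lines, max_per_relation):
    """Cap the number of sampled lines per relation type,
    e.g. max_per_relation=240 for the training split and 50 for
    the test split. Lines whose relation hit the cap are skipped."""
    counts = defaultdict(int)
    sampled = []
    for text, relation in lines:
        if counts[relation] < max_per_relation:
            counts[relation] += 1
            sampled.append((text, relation))
    return sampled
```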

B. EXPERIMENTAL SETTINGS AND EVALUATION METHOD
Our experiment is based on the PyTorch framework, version torch 1.7.1, with transformers module version 2.5.1. The Chinese-Roberta-wwm-ext model is used as the coding layer, and the main parameter settings are listed in Table 3. To evaluate the experimental results, this paper uses precision, recall, and the F1 value as the main evaluation indexes and uses the trained model to predict the entity-relationship triples on the test set. When predicting triples, the probability threshold for predicting the head entity, tail entity, and relationship is set to 0.5. The calculation formulas are as follows:

P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2PR / (P + R)

where P represents the precision of the experimental results, R represents the recall, TP represents the number of correctly predicted triples, FP represents the number of incorrectly predicted triples (so TP + FP is the total number of predictions), and FN represents the number of annotated triples that were not predicted (so TP + FN is the number of triples annotated in the corpus itself).
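The evaluation formulas amount to a few lines of code (with zero-division guards added for robustness; the function name is ours):

```python
def prf1(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, 8 correct predictions with 2 spurious and 2 missed triples gives P = R = F1 = 0.8.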

C. BASELINES
Under the constructed balanced dataset, the HCL model is compared with the Baseline model [44] proposed at the 2019 LIC (2019 Language and Intelligence Challenge), the CASREL model proposed at ACL 2020 (Annual Meeting of the Association for Computational Linguistics) [29], and the NPCTS model proposed in the Journal of Wuhan University in 2022 [45].
The HCL model targets the joint extraction of entity relations from Chinese text, for which it is difficult to find a unified classical algorithm for comparison. To further verify the performance of the model, the HCL model is also compared with its eight variant models and three variants of the Baseline model [44], as detailed in Table 4.

D. MAIN RESULTS AND ANALYSIS
The HCL model uses a cascade extraction algorithm aiming to extract entities and relationships, and the experiments are conducted on the whole model, not on individual parts. The comparison of triple extraction results using the traditional dictionary method for feature extraction is shown in Fig. 10. Comparing the effects of different head entity extractors and tail entity extractors, it can be seen that with the traditional dictionary as the feature extraction method, using 2BiLSTM-CNN for head entity labeling and CNN-FC for the tail entity and relation annotator is superior to the other methods; this is the second column in Table 5. Overall, when the traditional dictionary method is used for feature extraction, the F1 value stays below 60%. The comparison of triple extraction results using Bert for feature extraction is shown in Fig. 11. The F1 values of these models are very close. Among them, Bert-BiLSTM-FC as the head entity labeling converges fastest, with the F1 value reaching 56.8% at the beginning, while Bert-2BiLSTM-FC as the head entity labeling is slightly stronger overall than the other three models, with a best F1 value of 63.8%. On the whole, the F1 value with Bert-based feature extraction finally reaches more than 60%, about 10% higher than with the traditional dictionary method.
When the head entity labeling uses the Roberta-2BiLSTM-FC model and the tail entity and relationship labeling uses different schemes, the experimental curves are shown in Fig. 12. It can be seen that when the tail entity and relationship labeling adopts Roberta-FC, the effect is the worst because the head entity feature is not added, while the tail entity and relationship labeling of the HCL model, Roberta-FeatureStitch-CNN-FC, performs best, with an F1 value reaching 65.4%.
The HCL model is compared with the Baseline model [44], the CASREL model [29], and the NPCTS model [45]. The experimental results are shown in Fig. 13, Fig. 14, and Fig. 15. It can be seen that the effects of CASREL, NPCTS, and HCL on the entity-relationship triple extraction task are all better than the baseline model proposed at the 2019 LIC, with the F1 value increasing by about 10%. The HCL model is slightly better than the CASREL and NPCTS models, and when the epoch is greater than 8, the HCL model is more stable than both.
For the feature coding, three cases are compared: the traditional dictionary method used as the feature extraction layer, the Bert method used as the feature extraction layer, and the Roberta method used in the HCL model in this paper.
In the head entity labeling, we compare the BiLSTM model, BiGRU model, CNN model, and other models.
Different models such as FeatureStitching-FC and FeatureStitching-CNN-FC are compared on the tail entity and relationship labeling.
To ensure optimal values for the baseline models, the feature parameters of the dictionary method in the experiment follow the baseline [44] of the 2019 Language and Intelligence Challenge jointly held by the China Computer Federation (CCF), the Chinese Information Processing Society of China (CIPS), and Baidu. The feature parameters of the baseline model using the Bert model follow the CASREL model proposed by Zhepei Wei et al. [29] at ACL 2020. For the maximum epoch parameter, we found that when the dictionary method is used as the feature extraction layer the results stabilize by epoch 100, and when Bert or Roberta is used as the feature extraction layer the results stabilize by epoch 60, so we set the maximum epoch to 100 and 60 respectively.
The results are shown in Table 5. It can be seen from the experimental results that in the joint triple extraction task, when the head entity labeling is changed from LSTM (Row 7 in Table 5) to BiLSTM (Row 8 in Table 5), the F1 value increases by 1%, which indicates that BiLSTM works well as the head entity encoder. When the head entity labeling is changed from BiLSTM (Row 8 in Table 5) to two BiLSTMs (Row 10 in Table 5), the effect improves but the best epoch is larger, indicating a better effect at the cost of timeliness. When a CNN layer (Row 11 in Table 5) is added after the two BiLSTMs (Row 10 in Table 5), the F1 value decreases, which indicates that the CNN cannot play its role in the head entity labeling: the feature dimension does not increase after the Bert feature encoder and the two BiLSTMs, so adding a CNN cannot serve as dimension reduction and instead hurts the effect. Finally, the HCL model (Row 15 in Table 5) achieves the best effect, with a best epoch of 17 and good timeliness. Its precision reaches 67.5%, its recall 63.5%, and its F1 value 65.4%. Compared with the DictionaryEncoding-2BiLSTM-CNN-FC model, the F1 value increases by 10.4%. This proves that our method is effective for the task of entity and relation triple extraction.

V. CONCLUSION
For the joint extraction task of entity relation triples, this paper analyzes the current problems and mainstream methods. We construct a balanced dataset and establish a hierarchical cascade labeling (HCL) model. Comparison experiments on the balanced dataset show that our method achieves a higher F1 value than other baseline models. In follow-up work, we hope to continue to improve the training effect of the model, especially by continuously optimizing the tail entity and relationship labeling to improve the accuracy of triplet extraction.
YIBO LIU is currently pursuing the Ph.D. degree with the School of Information and Communication, National University of Defense Technology. His research interests include knowledge engineering, natural language processing, artificial intelligence, and deep learning.
FENG WEN is currently a Professor with the School of Information and Communication, National University of Defense Technology. His research interests include computer science, software engineering, and computer-human interaction.
TENG ZONG is currently pursuing the Ph.D. degree with the School of Information and Communication, National University of Defense Technology. His research interests include intelligent algorithm and data governance.
TAOWEI LI received the M.S. degree from the National University of Defense Technology, where he is currently pursuing the Ph.D. degree with the College of Information and Communication. His research interests include data engineering and software algorithm.