Towards Knowledge Enhanced Language Model for Machine Reading Comprehension

Machine reading comprehension is a crucial and challenging task in natural language processing (NLP). Recently, knowledge graph (KG) embedding has gained massive attention as it can effectively provide side information for downstream tasks. However, most previous knowledge-based models do not take the structural characteristics of the triples in KGs into account, and only convert them into vector representations for direct accumulation, leading to deficiencies in knowledge extraction and knowledge fusion. To alleviate this problem, we propose a novel deep model, KCF-NET, which incorporates knowledge graph representations with context as the basis for predicting answers, leveraging a capsule network to encode the intrinsic spatial relationships in KG triples. In KCF-NET, we fine-tune BERT, a high-performance contextual language representation model, to capture complex linguistic phenomena. Besides, a novel fusion structure based on the multi-head attention mechanism is designed to balance the weight of knowledge and context. To evaluate the knowledge expression and reading comprehension ability of our model, we conducted extensive experiments on multiple public datasets, including WN11, FB13, SemEval-2010 Task 8 and SQuAD. Experimental results show that KCF-NET achieves state-of-the-art results in both link prediction and MRC tasks with a negligible parameter increase compared to BERT-Base, and obtains competitive results in the triple classification task with significantly reduced model size.


I. INTRODUCTION
With the rapid development of deep learning and the increasing availability of datasets (e.g., SQuAD, RACE) [1], [2], machine reading comprehension (MRC) has achieved remarkable advances in the last few years. A common approach is to use recurrent neural networks (RNNs) to process the given text and the question in order to predict or generate the answers. RNNs, such as Long Short-Term Memory (LSTM) units [3], are effective at encoding context semantics and capturing long-term dependencies. Through the recurrent connections, each word in the given passage or question also carries information about its neighbors. Besides, the attention mechanism [4] is widely used with RNNs to promote interaction between the question and the given passage. The attention mechanism imitates the repeated thinking people perform while doing reading comprehension, capturing the semantic relevance of words in questions and passages. Therefore, many studies have used attention-based RNNs [5] as backbones and achieved promising results. However, existing methods can only obtain the semantic information of a given text, lacking common knowledge, which limits the understanding of words or causes deviations.
In the process of human reading comprehension, people may utilize world knowledge when the question cannot be answered simply by contextual reasoning. External knowledge plays such an important role that it is considered to be the biggest gap between MRC and human reading comprehension. Knowledge graph (KG) [6], introduced by Google in 2012, offers an effective way to organize and manage massive amounts of information, providing rich structured knowledge facts for better language understanding. Recently, introducing structured knowledge from KGs into MRC has become a new trend, and several studies centered on knowledge-based machine reading comprehension (KBMRC) [7] have been proposed. Zhang et al. [8] utilized both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE), which achieves significant improvements on various knowledge-driven tasks.
Liu et al. [9] proposed a knowledge-enabled language representation model (K-BERT) with knowledge graphs, which introduces soft-position and visible matrix to avoid knowledge noise to a large extent. In the former, the TransE algorithm used for extracting and encoding structured knowledge has a poor ability to obtain semantic information, while in the latter, the sentence tree used to limit the impact of knowledge greatly increases the complexity of information fusion.
In this paper, we take advantage of the structural characteristics of the triples in KGs and incorporate knowledge graph representations into MRC to overcome the shortcomings of the above methods. Two challenges lie on the road to this knowledge integration: (1) Structured Knowledge Encoding: although the triple structure (head entity, relation, tail entity) in KGs organizes facts well, it cannot be directly used for downstream tasks, so how to effectively extract and encode the related informative facts in KGs for language representation models is an important problem. (2) Knowledge Noise Issue: the intervention of irrelevant knowledge or synonyms may be misleading and divert a sentence from its correct meaning, hence how to design a reasonable fusion mechanism is another challenge.
To tackle these challenges, we propose a two-stage model, KCF-NET. For knowledge extraction, we first convert the triples in the KG into vector representations, and design CapsBERT, a novel module based on CapsNet that mines the global relationships between entities. The global semantic information obtained from CapsBERT is more comprehensive and richer than that in previous work, which only retrieves information about related entities. For knowledge fusion, we propose a new fusion mechanism adopting the idea of gating, which balances the weight distribution between the meaning of the original text and external knowledge, limiting the impact of external knowledge in a simple form.
In summary, the main contributions of this paper are as follows:
• To enhance knowledge expression in relational triples, we design a novel knowledge extraction module consisting of BERT and CapsNet, which captures the horizontal and vertical relationships of the triple matrix, covering features both between and within triples.
• To balance the weight of knowledge and context, a novel knowledge fusion structure based on the multi-head attention mechanism is proposed, which utilizes gating units to preserve the original information and avoid knowledge noise.
• In order to accelerate the convergence of CapsNet and avoid overfitting, we design an initialization mechanism based on the similarity between capsules to optimize the dynamic routing algorithm.

The remainder of this article is structured as follows. Related work is discussed in Section 2. Section 3 presents KCF-NET in detail. Finally, we evaluate the experimental results on the datasets in Section 4 and draw conclusions in Section 5.

II. RELATED WORK
A. MACHINE READING COMPREHENSION
Machine reading comprehension is a task that measures the extent to which a machine understands natural language by reading a document and answering questions. Compared with traditional tasks such as word segmentation, named entity recognition and syntactic analysis, MRC usually involves longer passages and deeper semantic information, requiring the comprehensive use of text representation, retrieval, coreference resolution and reasoning. To accurately predict the answer, many efforts have been devoted to extracting the correlation between the context and the question. Hermann et al. [10] proposed an attention-based neural network that calculates the similarity between the context and the question. Weston et al. [11] introduced an end-to-end memory network, which can explicitly store long-term memories and provides easy access for reading them. In this way, the model can understand the context and question more deeply through multiple turns of interaction. Dhingra et al. [12] proposed the GA reader, which utilizes a gating mechanism to decide the extent to which question information affects context words when updating the context representations.
However, due to the limited knowledge contained in the context, there is still a long way to go before machines truly understand text. Knowledge-based machine reading comprehension (KBMRC), as a new trend, brings MRC tasks much closer to real-world applications. KBMRC differs from MRC mainly in its inputs: in addition to the sequences of the context and question, related knowledge extracted from a knowledge base is necessary in KBMRC. Long et al. [13] proposed a new task, which provides additional entity descriptions extracted from a knowledge base as external knowledge to help entity prediction. Miller et al. [14] utilized Key-Value Memory Networks to find relevant external knowledge and store it in memory slots as key-value pairs; the keys are then matched with the query to generate relevant knowledge representations. Wang & Jiang [15] proposed a data enrichment method based on semantic relations in WordNet. For each word in the context and question, they try to find the positions of related passage words, which are regarded as explicit knowledge to assist answer prediction. Pan et al. [16] proposed a novel encoder-decoder supplementary architecture, which transfers knowledge learned from machine comprehension to sequence-to-sequence tasks to deepen the understanding of the text. In conclusion, KBMRC has demonstrated its advantage in exploiting knowledge for question answering, which is no longer restricted to the given context.

B. KNOWLEDGE GRAPH EMBEDDING
Knowledge graphs provide rich structured knowledge resources for various NLP tasks. However, the structure of triples cannot be directly transformed into semantic expressions. To this end, knowledge graph embedding models have been proposed to learn vector representations for entities and relations in KGs. Bordes et al. [20] proposed TransE, inspired by the interesting translation-invariant phenomenon of lexical semantics and syntactic relations in the word vector space. TransE regards a relation in the knowledge base as a translation vector between entities.
However, because of its simple structure, TransE struggles when dealing with complex relations. To break through its limitations in handling the complex 1-N, N-1 and N-N relations, Wang et al. [21] modeled relational triples on hyperplanes and Lin et al. [22] built entity and relation embeddings in separate spaces. More fine-grained models were proposed one after another, such as TransD, TransG and KG2E [23]-[25]. In addition, Ji et al. [26] proposed TranSparse to address the heterogeneity and imbalance of entities and relations. Recently, deep learning approaches have demonstrated breakthrough performance. Nguyen et al. [27] proposed ConvKB, a convolutional neural network (CNN)-based model for KG completion. Yao et al. [28] employed BERT for knowledge graph representation and achieved state-of-the-art results in triple classification, link prediction and relation prediction. Zhong et al. [29] treated triples in knowledge graphs as textual sequences and proposed a novel network based on the Transformer. However, the methods mentioned above do not take the characteristics of structured knowledge in triples into account. In this paper, CapsNet is combined with BERT to enhance knowledge expression.

C. KNOWLEDGE FUSION
In recent years, enhancing pre-trained language models [30] with external knowledge for downstream tasks has become a new trend. Meanwhile, how to combine knowledge representations with the characteristics of a specific task has become an immense challenge. Yang et al. [31] took advantage of external knowledge bases to improve recurrent neural networks for MRC. To effectively integrate background knowledge with information from the currently processed text, they employed an attention mechanism with a sentinel to adaptively decide whether to attend to external knowledge and which information is useful. Zhang et al. [32] utilized both large-scale textual corpora and KGs to train an enhanced language representation model named ERNIE. ERNIE consists of two stacked modules: T-encoder and K-encoder. The former captures basic lexical and syntactic information from the input tokens, while the latter uses TransE to obtain extra token-oriented knowledge information and integrates it into the textual information through a GELU function. Mihaylov et al. [33] proposed KN-Reader, encoding external commonsense knowledge as a key-value memory. By calculating the attention between the knowledge and the context representation, KN-Reader selects the candidate with the highest attention score as the answer. Wang & Jiang [34] designed two types of attention mechanisms: mutual attention, which fuses the question representations into the passage to obtain question-aware passage representations, and self-attention, which fuses the question-aware passage representations into themselves to obtain the final passage representations. Yang et al. [35] proposed KT-NET, which employs an attention mechanism to adaptively select desired knowledge from KBs, and then fuses the selected knowledge with BERT to enable context- and knowledge-aware predictions.
However, most previous methods retrieve relevant entries from the knowledge base before encoding them into the MRC model, and can thus only refer to the KB locally. In this paper, we first learn the overall knowledge graph embeddings, and then integrate the knowledge with the context. We design a novel fusion structure that can control the influence of knowledge on reading comprehension and is easier to train.

III. METHOD
In this work we consider the span extraction task in MRC. Given a passage with m tokens P = {P_1, P_2, ..., P_m} and a question with n tokens Q = {Q_1, Q_2, ..., Q_n}, our goal is to predict an answer A, which is constrained to be a contiguous span of the passage, denoted as A = {P_s, ..., P_e}, where 1 ≤ s ≤ e ≤ m.
As depicted in Fig. 1, KCF-NET can be divided into an embedding layer, a fusion layer and an output layer. Previous work in the literature only converts the triples in the KG into vector representations; when related entities are mentioned in the article, the semantic vector is used as external knowledge for direct accumulation, leading to deficiencies in both knowledge extraction and knowledge fusion. To take full advantage of the facts in KGs, which contain rich language patterns, we propose a novel module, CapsBERT, as a pre-processing step, which fine-tunes pre-trained BERT [36] and utilizes a capsule neural network (CapsNet) [37] for knowledge graph representation. BERT is a state-of-the-art contextual language representation model built on a multi-layer bidirectional Transformer encoder, which is able to capture complex linguistic phenomena. Sabour et al. [38] first introduced CapsNet, which employs vector neurons to capture entities in images. Extracted features are sent from capsules in one layer to those in the next layer by a powerful dynamic routing mechanism, which can encode the intrinsic spatial relationships of triples between the part and the whole. Different from the traditional application of CapsNet, we use capsules to model the entries at the same dimension in the entity and relation embeddings. Thus, each capsule can encode more characteristics of the embedded triple to enhance its knowledge expression.
Given a question and passage, KCF-NET first splits the input into sentences. Then it employs, in turn, (1) a BERT embedding layer containing two encoding methods, which compute a context-aware representation and a knowledge graph representation of the reading text respectively; (2) a fusion layer to integrate context information with external knowledge, enabling rich interactions between them; and (3) an output layer to predict the final answer.

A. EMBEDDING LAYER
The embedding layer is designed as two branches, using BERT encoders to model reading text and facts in KGs respectively, denoted as context-BERT and knowledge-BERT.
As depicted in Fig. 2, context-BERT takes the passage P and question Q as input, and computes a context-aware representation for each token. Specifically, given a passage with m tokens P = {P_1, P_2, ..., P_m} and a question with n tokens Q = {Q_1, Q_2, ..., Q_n}, we first pack them into a sequence π of length m + n + 3, i.e.,

π = (<CLS>, Q_1, ..., Q_n, <SEP>, P_1, ..., P_m, <SEP>),

where <SEP> is the token separating Q and P, and <CLS> is the token for classification. For each token π_i in π, we construct its input representation as

h_i^0 = t_i + s_i + p_i,

where t_i, s_i, p_i are the token, segment and position embeddings respectively, i indexes the token position, and the superscript 0 marks the input of the BERT layer. Tokens in Q share a same segment embedding δ_q and tokens in P a same segment embedding δ_p. Such input representations are then fed into L successive Transformer encoder blocks, i.e.,

H^j = TransformerBlock(H^{j-1}), j = 1, ..., L.

The final hidden states o^BERT = H^L ∈ R^{n_s × d_w} are taken as the output of this layer, where n_s = m + n + 3 is the length of the packed sequence and d_w is the hidden dimension.
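The packing scheme above can be sketched in a few lines. This is a minimal toy illustration, not the real BERT tokenizer: the helper names and the toy embedding sum are our own, and real token/segment/position embeddings are learned lookup tables.

```python
import numpy as np

def pack_inputs(q_tokens, p_tokens):
    """Pack question Q and passage P into one sequence of length m + n + 3:
    [<CLS>] Q [<SEP>] P [<SEP>], with segment id 0 for Q and 1 for P."""
    tokens = ["<CLS>"] + q_tokens + ["<SEP>"] + p_tokens + ["<SEP>"]
    segments = [0] * (len(q_tokens) + 2) + [1] * (len(p_tokens) + 1)
    positions = list(range(len(tokens)))
    return tokens, segments, positions

def input_representation(tok_emb, seg_emb, pos_emb):
    # h_i^0 = t_i + s_i + p_i: element-wise sum of the three embeddings
    return tok_emb + seg_emb + pos_emb

tokens, segments, positions = pack_inputs(["who", "wrote", "hamlet"],
                                          ["hamlet", "by", "shakespeare"])
h0 = input_representation(np.ones(4), np.full(4, 2.0), np.full(4, 3.0))
```

The segment embedding is what lets the model tell question tokens from passage tokens even after the sequences are concatenated.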

B. CapsBERT
As depicted in Fig. 3, CapsBERT takes advantage of the structural characteristics of the triples for further knowledge extraction and is designed as a pre-training step to obtain knowledge-BERT. CapsBERT is optimized from two main aspects: one is to improve the integrity of the knowledge graph, and the other is to better acquire local and global knowledge from triples.
Similar to context-BERT, the input of CapsBERT consists of three parts, namely token embeddings, segment embeddings and position embeddings. The token embedding sequence is the concatenation of the head entity, relation and tail entity tokens. In addition, we employ a CNN and CapsNet to enhance its knowledge expression ability. We use convolution kernels of several different sizes to extract local features from different angles from the triple matrix output by knowledge-BERT. In this way, not only the overall characteristics of the triple but also the local relationships between the items within the triple can be captured. The convolution process can be expressed as

c_{i,j} = f( Σ_{v=1}^{k} w_v · o^BERT_{i,j−v+1} + b ),

where w_v is the value of the v-th column of the convolution kernel matrix, o^BERT_{i,j−v+1} is the value at row i, column j−v+1 of the triple matrix output by the BERT encoding layer, k is the column width of the convolution kernel matrix and b is the bias.
Since convolution kernels of different sizes produce feature maps of different sizes, we use max pooling to convert all features to a uniform size and concatenate them as the output, which is then fed into the corresponding capsules.
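The multi-width convolution plus max pooling described above can be sketched as follows. This is a toy NumPy illustration under our own assumptions: the kernel values, the ReLU nonlinearity for f, and per-row pooling are illustrative, not the paper's exact configuration.

```python
import numpy as np

def conv_feature_maps(triple_matrix, kernels):
    """Slide each 1-D kernel of width k along the columns of the triple
    matrix output by knowledge-BERT:
        c_{i,j} = relu(sum_v w_v * o_{i, j-v+1} + b)."""
    rows, cols = triple_matrix.shape
    maps = []
    for w, b in kernels:                      # kernels of different widths
        k = len(w)
        fmap = np.empty((rows, cols - k + 1))
        for i in range(rows):
            for j in range(cols - k + 1):
                fmap[i, j] = max(0.0, np.dot(w, triple_matrix[i, j:j + k]) + b)
        maps.append(fmap)
    # max pooling converts every feature map to a uniform size (one value
    # per row here); the pooled features are then concatenated
    pooled = [m.max(axis=1, keepdims=True) for m in maps]
    return np.concatenate(pooled, axis=1)

m = np.ones((2, 5))                           # toy 2-row "triple matrix"
kernels = [(np.array([1.0, 1.0]), 0.0), (np.array([1.0, 1.0, 1.0]), 0.0)]
out = conv_feature_maps(m, kernels)
```

Pooling is what lets kernels of different widths feed a fixed-size capsule input.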
We design two capsule layers to capture the relationships within triples. The first layer consists of η capsules, each of dimension 1 × d_w, where d_w is the dimension of the final hidden state of the knowledge-BERT encoding layer. The second capsule layer is composed of a single capsule of dimension 1 × d_cap, whose output is the score of the triple, predicting whether the triple is valid or not. The parameters of the two capsule layers are updated by the dynamic routing algorithm. However, the uniform initialization in the original CapsNet ignores the differences among the lower-level capsules and increases the number of iterative calculations. To alleviate this problem, we propose a simi-dynamic routing algorithm, which strengthens the upward transfer of similar feature vectors while weakening the transfer of dissimilar ones. Specifically, for all capsules i in layer l and capsules j in layer (l+1), we use the similarity of the lower-level capsules as the initial routing logits, which makes the capsule network converge faster. Finally, we define the scoring function of the triple as

f(h, r, t) = ||e||,

where e is the output vector of the second capsule layer, whose length is used as the score predicting whether the triple (h, r, t) is valid or not. We compute a cross-entropy loss as the training target:

L = − Σ_{(h,r,t) ∈ Δ ∪ Δ'} [ y_{(h,r,t)} log f(h, r, t) + (1 − y_{(h,r,t)}) log(1 − f(h, r, t)) ],

where Δ denotes the set of positive knowledge graph triples, Δ' denotes the set of negative triples, and y_{(h,r,t)} is 1 for positive triples and 0 for negative ones.
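The simi-dynamic routing idea can be illustrated with a minimal NumPy sketch for the simplified case of a single output capsule. The function names, the use of the mean prediction vector as the similarity anchor, and the softmax over lower capsules are our assumptions for this toy setting, not the paper's exact algorithm.

```python
import numpy as np

def squash(v):
    """Non-linear squashing: keeps the direction, bounds the length below 1."""
    n2 = float(np.dot(v, v))
    return (n2 / (1.0 + n2)) * v / (np.sqrt(n2) + 1e-9)

def simi_dynamic_routing(u_hat, iters=3):
    """u_hat: (num_lower_capsules, d) prediction vectors for one output
    capsule. Instead of the uniform (zero) initialization of the original
    dynamic routing, the logits start from each lower capsule's similarity
    to the mean prediction, boosting similar features from iteration one."""
    b = u_hat @ u_hat.mean(axis=0)            # similarity-based initialization
    for _ in range(iters):
        c = np.exp(b - b.max())
        c /= c.sum()                          # coupling coefficients
        e = squash((c[:, None] * u_hat).sum(axis=0))
        b = b + u_hat @ e                     # agreement update
    return e

u = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
e = simi_dynamic_routing(u)                   # ||e|| then scores the triple
```

Starting the logits from similarities means the two agreeing capsules dominate the output immediately, which is the claimed source of faster convergence.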

C. FUSION LAYER
The input of the fusion layer consists of two parts. Through context-BERT, we obtain the embedding of passages and questions with rich semantic information, o^BERT ∈ R^{n_s × d_w}, and through the knowledge-BERT module, we obtain the knowledge graph embedding o^knowledge ∈ R^{n_s × d_k} of the entities in the text, where n_s is the sentence length, d_w is the output dimension of the context-BERT layer, and d_k is the output dimension of the knowledge-BERT layer. We integrate them through a nonlinear activation function:

y_i = σ( w_ci · o^BERT_i + w_ei · o^knowledge_i + b ),

where w_ci and w_ei are learned weights, σ is an activation function such as ReLU and b is a bias. When the i-th word is not part of an entity, o^knowledge_i is an all-zero vector, so only the context term w_ci · o^BERT_i contributes. Through the information fusion module, we obtain the knowledge fusion encoding y = {y_i} ∈ R^{n_s × d_w}.
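A minimal sketch of this fusion step, with toy weights of our own choosing (the real w_ci, w_ei are learned):

```python
import numpy as np

def fuse(o_bert, o_knowledge, w_c, w_e, b):
    """y_i = ReLU(w_c o_bert_i + w_e o_knowledge_i + b). For tokens that
    are not part of any entity, o_knowledge_i is all zeros, so only the
    context term contributes."""
    return np.maximum(o_bert @ w_c.T + o_knowledge @ w_e.T + b, 0.0)

o_bert = np.array([[1.0, 0.0, 0.0],      # token 1: no entity knowledge
                   [0.0, 1.0, 0.0]])     # token 2: linked to an entity
o_know = np.array([[0.0, 0.0],
                   [1.0, 1.0]])
w_c = np.eye(2, 3)                       # toy projection to d_w = 2
w_e = np.ones((2, 2))
y = fuse(o_bert, o_know, w_c, w_e, np.zeros(2))
```

The first row of `y` depends only on the context encoding, which is exactly the behavior described for non-entity tokens.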
Different from previous works, we believe that the original information in the context should be more decisive for question answering than external knowledge. Too much knowledge incorporation may divert a word vector from the original sentence meaning. Therefore, in order to preserve the original information and avoid knowledge noise, we employ gating units to regulate the flow of information, including a transform gate [39] and a carry gate, as shown in Fig. 4. The former determines how much original information is transferred to the subsequent module, whereas the latter chooses how much information from the knowledge fusion encoding to carry to the subsequent module. Thus, part of the information can quickly pass through the layer without transformation, while the rest first goes through a nonlinear transformation layer:

y'_i = t_i ⊙ H(y_i) + c_i ⊙ y_i,  c_i = 1 − t_i,

where t denotes the transform gate, c denotes the carry gate and H is a nonlinear transformation. The gating structure has two advantages: one is to control the impact of the knowledge fusion encoding on machine reading comprehension, and the other is to let the contextual semantic information skip the knowledge fusion module, which makes the model easier to train. Since the information fusion module operates on each word independently, the fusion vector of each word lacks contextual information from the other words. Therefore, a multi-head attention mechanism with a residual connection is added to encode the resulting fusion vectors:
o^fusion = BN( y' + MultiHead(y') ),

where MultiHead(·) denotes the multi-head attention mechanism, BN(·) denotes batch normalization and o^fusion = {o_i} ∈ R^{n_s × d_w} is the output of the knowledge fusion layer.
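The transform/carry gating can be sketched as a standard highway layer. The coupled form c = 1 − t and the tanh transformation are our assumptions; the paper may parameterize the two gates independently.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(y, W_h, b_h, W_t, b_t):
    """out = t * H(y) + (1 - t) * y: the transform gate t decides how much
    transformed information passes on, and the carry gate c = 1 - t lets
    the rest of the fusion encoding skip the layer unchanged."""
    t = sigmoid(y @ W_t.T + b_t)         # transform gate
    h = np.tanh(y @ W_h.T + b_h)         # nonlinear transformation H(y)
    return t * h + (1.0 - t) * y         # carry gate c = 1 - t

y = np.array([[0.5, -0.2]])
# bias the transform gate far negative: almost everything is carried through
out = highway(y, np.zeros((2, 2)), np.zeros(2),
              np.zeros((2, 2)), np.full(2, -20.0))
```

With the gate biased closed, the layer approaches the identity, which is the "easier to train" property the text appeals to.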

D. OUTPUT LAYER
Finally, in the output layer, we use a dense layer and Softmax to obtain, for each position, the probability of being an answer boundary, as depicted in Fig. 5. The probability of each token being the start or end of the answer span is calculated as

p^1 = softmax(V w_1),  p^2 = softmax(V w_2),

where v_i, the i-th row of V, is obtained by passing the i-th row of the knowledge fusion layer output o^fusion through the fully connected layer, and w_1, w_2 are learned weight parameters. The loss function is

L = − (1/N) Σ_{j=1}^{N} ( log p^1_{y^1_j} + log p^2_{y^2_j} ),

where N is the batch size and y^1_j, y^2_j are the ground-truth start and end positions.
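A small sketch of the start/end prediction above, with toy values standing in for the learned V, w_1 and w_2:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_span_probs(V, w1, w2):
    """p1 = softmax(V w1), p2 = softmax(V w2): one logit per token for the
    start and end of the answer span (V holds one row per token)."""
    return softmax(V @ w1), softmax(V @ w2)

def span_loss(p1, p2, y1, y2):
    """Negative log-likelihood of the gold start y1 and end y2."""
    return -(np.log(p1[y1]) + np.log(p2[y2]))

V = np.array([[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]])   # 3 tokens, d = 2
p1, p2 = answer_span_probs(V, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
loss = span_loss(p1, p2, 0, 1)
```

At inference time the predicted span is the (start, end) pair with the highest joint probability subject to start ≤ end.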

IV. EXPERIMENTS
We evaluate our approach from two aspects, namely knowledge expression and reading comprehension ability. Specifically, knowledge expression ability is assessed by testing CapsBERT on the triple classification task and the knowledge-BERT module on the link prediction task. Reading comprehension accuracy is evaluated on the entire KCF-NET model. This section is organized in the chronological order of the experiments.

A. TRIPLE CLASSIFICATION
1) DATASETS
We ran CapsBERT on two widely used benchmark datasets: WN11 and FB13. The former is a subset of WordNet, and the latter is a subset of Freebase. Table 1 provides statistics of the two datasets mentioned above.

2) EVALUATION PROTOCOL
For the triple classification task, we use accuracy to evaluate the classification performance. We compare our CapsBERT with multiple state-of-the-art KG embedding methods including: TransE [17] used distributed representations to describe the triples in KGs, obtaining semantic information through simple mathematical calculations based on translation invariance.
TransG [21] proposed a variant of TransE, which leveraged a Bayesian non-parametric infinite mixture model to handle multiple relation semantics by generating multiple translation components for a relation.
DistMult [40] proposed a novel approach that utilizes the learned relation embeddings to mine logical rules.
ConvKB [24] proposed a novel use of CNN for KB completion task.
AATE [41] proposed a mutual attention mechanism between relation mention and entity description to learn more accurate textual representations for further improving knowledge graph representation.

KG-BERT [25] fine-tuned the BERT model on textual sequences to predict the plausibility of a triple or a relation.
ERNIE [8] proposed a knowledgeable language representation model, which aggregates both context and knowledge facts from large-scale textual corpora and KGs to predict both tokens and entities.
K-BERT [9] utilized domain-specific knowledge to train a knowledge-enabled language representation model and introduced soft position and visible matrix to limit the impact of knowledge.

3) QUANTITATIVE EVALUATION
Triple classification aims to judge whether a given triple (h, r, t) is correct or not. Since our model is based on supervised learning, positive and negative samples are needed to guide training. However, the training sets of WN11 and FB13 contain only positive samples. Therefore, we need to generate negative samples for these two datasets before the experiment. The negative triple set Δ' is generated by replacing the head entity h or tail entity t of a positive triple (h, r, t) ∈ Δ with a random entity h' or t', i.e.,

Δ' = { (h', r, t) | h' ∈ E, (h, r, t) ∈ Δ } ∪ { (h, r, t') | t' ∈ E, (h, r, t) ∈ Δ },

where E denotes the entity set.
Note that a corrupted triple is not treated as a negative example if it is already in the positive set Δ. To keep the classification balanced, we generate the same number of negative samples as positive samples.
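The corruption procedure above can be sketched as follows; the function name and the 50/50 head-vs-tail choice are illustrative assumptions.

```python
import random

def corrupt_triples(positives, entities, seed=0):
    """Build the negative set by replacing the head or tail of each positive
    triple with a random entity, skipping corrupted triples that already
    appear in the positive set. One negative is generated per positive,
    keeping the two classes balanced."""
    rng = random.Random(seed)
    pos = set(positives)
    negatives = []
    for h, r, t in positives:
        while True:
            e = rng.choice(entities)
            cand = (e, r, t) if rng.random() < 0.5 else (h, r, e)
            if cand not in pos:               # never reuse a true triple
                negatives.append(cand)
                break
    return negatives

positives = [("A", "part_of", "B"), ("B", "part_of", "C")]
negatives = corrupt_triples(positives, ["A", "B", "C", "D"])
```

The membership check against the positive set is the detail that prevents accidentally labeling a true fact as negative.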
We first pre-trained the knowledge graph embeddings, initializing the parameters of the BERT encoding layer with models officially released by Google. These models were pre-trained on the concatenation of BooksCorpus (800M words) and English Wikipedia (2,500M words), using the masked language model and next sentence prediction tasks.
We use BERT-Tiny, BERT-Small and BERT-Base for comparative experiments. In the convolutional layer, we use separable convolutions and pooling over different layers to convert BERT outputs with different hidden state dimensions into the input size of the capsule network. Since the input of the triple classification task is only the concatenation of entities and relations, we set the maximum sequence length to 20.
As results obtained in different environments may differ, we use KG-BERT as our benchmark model, where KG-BERT* denotes the accuracy obtained by training in our local environment. As can be seen from Table 2, although the proposed CapsBERT does not achieve the highest accuracy, its overall size is three to seven times smaller than recent BERT-based work, while the average accuracy only decreases by no more than 3%, to 89.25%, still outperforming traditional triple classification models.

B. ABLATION EXPERIMENTS ON CAPSBERT
In order to analyze the impact of different components in CapsBERT on knowledge expression, we designed four different training strategies in the training process, as shown in Fig. 6.
Strategy A: Directly use BERT for triple classification without a capsule network. We selected Google's pre-trained BERT-Tiny, BERT-Small and BERT-Base for experiments and fine-tuned them.
Strategy B: Train CapsBERT as a whole.
Strategy C: First train BERT separately and freeze its parameters, then train the entire CapsBERT; the BERT parameters are not updated during the overall training process.
Strategy D: First train BERT and then train the entire CapsBERT, that is, the knowledge-BERT layer will be trained twice.
From Table 3, we observe that the original BERT performs well, achieving an accuracy of 93.11%, whereas the model in Strategy B only reaches 69.68%, close to the class distribution of the experimental dataset. From this we conclude that without pre-training, knowledge-BERT does not generate appropriate contextual semantic information, leaving the capsule layer unable to extract spatial information between entities and resulting in poor performance. Both Strategy C and Strategy D work well. In contrast, Strategy C is limited by its frozen parameters. Pre-training followed by re-training makes the triple semantic representation obtained by knowledge-BERT better meet the requirements of the capsule network. Strategies C and D also confirm that pre-trained BERT achieves better results when re-trained on downstream tasks.

C. LINK PREDICTION
1) DATASETS
The SemEval-2010 Task 8 dataset includes 8000 training samples and 2717 test samples, each consisting of a sentence and the category label of the relationship between two entities in the sentence, using <e1>, </e1> and <e2>, </e2> to mark the locations of the entities. The relationship distribution of the dataset is shown in Fig. 7:

2) EVALUATION PROTOCOL
For the relation classification task, we use the F1 score for evaluation, which takes into account both the precision and the recall of the classification model. The evaluation metrics are calculated as follows:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  F1 = 2 · Precision · Recall / (Precision + Recall),

where TP is the number of samples correctly predicted for a category, FP is the number of samples of other categories incorrectly predicted as that category, and FN is the number of samples of that category incorrectly predicted as other categories. We compare our model with some other strong baselines: Entity Attention Bi-LSTM [42] adopted dynamic long short-term memory to extract complex relations between entities.
Multi-Attention CNN [43] proposed a novel convolutional neural network architecture for relation classification, relying on two levels of attention to discern patterns in heterogeneous contexts.
R-BERT [44] leveraged the pre-trained BERT language model and incorporated information from the target entities to tackle the relation classification task.
BERT-EM+MTB [45] utilized BERT and distributional similarity for relation learning.
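The F1 computation from the evaluation protocol above can be written as a small helper (for multi-class evaluation it would be applied per category and averaged):

```python
def f1_score(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN),
    F1 = 2 * P * R / (P + R); zero when there are no true positives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = f1_score(8, 2, 2)   # P = R = 0.8, so F1 = 0.8
```

Guarding the zero-denominator cases matters for rare relation categories that the model never predicts.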

3) QUANTITATIVE EVALUATION
In order to employ knowledge-BERT from CapsBERT, that is, the BERT layer already trained on the triple classification task, in the subsequent machine reading comprehension, we introduce the SemEval-2010 Task 8 dataset for the link prediction task and modify its input form accordingly. Link prediction aims to predict the relation category between two entities, showing the model's ability to analyze the contextual semantic information of entities in text and to detect connections between entities.
The experimental results are shown in Table 4. Knowledge-BERT (base) obtains the best F1 score of 91.83, an increase of 2.64% over BERT. The performance of knowledge-BERT (small) does not exceed the initial BERT due to its limited number of parameters, but compared with CNN- or LSTM-based methods it makes great progress. To analyze the experimental results visually, we draw a confusion matrix of the predictions over the test samples, as shown in Fig. 8. It can be seen that our model effectively distinguishes each relation category. After joint training with the capsule network, knowledge-BERT can not only express the semantic information of an entity itself, but also capture the relationships between entities. In the subsequent machine reading comprehension task, we integrate knowledge-BERT into KCF-NET.

D. MACHINE READING COMPREHENSION
1) DATASETS
SQuAD, regarded as a milestone for MRC, collects 536 articles from Wikipedia, on which crowd-workers asked more than 100,000 questions. The answer to each question is a span of the corresponding passage. In this paper, we choose SQuAD 1.1 since we focus more on answerable questions.

2) EVALUATION PROTOCOL
For the machine reading comprehension task, the input of the model is the concatenation of the passage P and the question Q, and the output is the start position and end position of the answer; the text between the two positions is extracted as the predicted answer. Two metrics are used to evaluate our model: (1) Exact Match (EM): the percentage of predictions that exactly match an answer in the gold answer list, i.e., whether the predicted answer text is identical to a reference answer text.
(2) F1 score: the average token-level overlap between the answers predicted by the model and the gold answers.
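The two metrics can be sketched as follows, in the spirit of the official SQuAD evaluation script; the normalization details (lower-casing, stripping punctuation and articles) are assumptions about the standard protocol, not taken from the paper:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lower-case, drop punctuation and articles, squeeze whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """EM: 1.0 iff the normalized texts are identical."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Token-level F1 between predicted and gold answer spans."""
    pred_toks = normalize(prediction).split()
    gold_toks = normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

Over a dataset, both scores are averaged per question, taking the maximum over the reference answers when several are given.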
We compare KCF-NET with some other strong baselines on SQuAD, including: BiDAF [46] used a bidirectional attention flow mechanism to obtain a query-aware context representation without early summarization.
QANet [47] used CNN and self-attention instead of recurrent neural networks to model local and global interactions, respectively.
KAR [48] integrated the general knowledge with the MRC model and used attention mechanisms to improve the robustness to noise.

RM-Reader [49] designed a memory-based answer prediction to gradually accumulate reading knowledge and continuously refine the answer span.
SLQA [50] proposed a multiple granularity hierarchical attention MRC model, which performed attention and fusion at different levels.

3) QUANTITATIVE EVALUATION
In KCF-NET, we choose BERT-Base as the context-BERT to extract contextual semantic information, and for knowledge-BERT we use the BERT-Small model that has been trained in CapsBERT, which has 4 layers and 8 self-attention heads; the dimension of its output hidden state vector is 512. In addition, we set the maximum input sequence length to 384. For hyper-parameters, we set dropout to 0.1, the learning rate to 5e-5, and the batch size to 16. We use BERT-Base and BERT-Large as baselines for comparative experiments. As can be seen from Table 5, our model shows a clear improvement over BERT-Base: on the development set, EM increases by 0.6 percentage points and F1 by 0.46 percentage points, indicating that integrating external knowledge can improve the performance of pre-trained language models on MRC. Compared with BERT-Large, our model does not obtain better results, due to the gap in the number of parameters.

E. ABLATION EXPERIMENTS ON KCF-NET
In order to study the impact of different components on the performance of KCF-NET, we design three strategies for further experiments: Strategy A: Directly use context-BERT for machine reading comprehension, without knowledge-BERT or the knowledge fusion layer. We choose Google's pre-trained BERT-Base and fine-tune it.
Strategy B: Employ context-BERT and knowledge-BERT, and add their outputs directly for fusion.
Strategy C: Use the complete KCF-NET, containing context-BERT, knowledge-BERT and the knowledge fusion layer. Figure 9 shows the model structures of Strategy A and Strategy B. As can be seen from Table 6, each component of KCF-NET plays a certain role, of which the effect of BERT is the most obvious. By comparison, we can conclude that the introduction of external knowledge has a positive effect on reading comprehension, but an inappropriate integration mechanism can mislead the model into making wrong predictions.
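To illustrate the difference between Strategy B (direct addition) and a gated fusion of the kind Strategy C relies on, here is a minimal sketch. The scalar weights and element-wise gate are illustrative assumptions, not the paper's exact fusion layer:

```python
import math

def gated_fusion(context_vec, knowledge_vec, w_c, w_k, bias):
    """Element-wise gated fusion of context and knowledge vectors.

    For each dimension, a gate g = sigmoid(w_c*c + w_k*k + bias)
    decides how much context versus knowledge to keep:
        fused = g * c + (1 - g) * k
    Strategy B (direct addition) corresponds to ignoring the gate
    entirely. Scalar weights are used here purely for illustration;
    a real fusion layer would use learned weight matrices.
    """
    fused = []
    for c, k in zip(context_vec, knowledge_vec):
        g = 1.0 / (1.0 + math.exp(-(w_c * c + w_k * k + bias)))
        fused.append(g * c + (1.0 - g) * k)
    return fused
```

With a strongly positive gate the output follows the context vector, and with a strongly negative gate it follows the knowledge vector, which is how a gate can suppress knowledge noise instead of accumulating it blindly.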

F. EFFECTS OF KNOWLEDGE EMBEDDING LAYER
Experiments are also conducted to analyze the influence of different knowledge graph data. It can be seen from Table 7 that the choice of knowledge graph dataset also has a great impact on the knowledge graph embedding. We find that knowledge-BERT trained on WN11 performs better than that trained on FB13.
We infer that this may be related to the composition of the KGs. WN11, which consists of words and corresponds to distinct word facts, is more suitable for constructing word semantic vectors for machine reading comprehension. Besides, the result of Strategy C shows that knowledge graphs can complement each other to some extent.

G. EFFECTS OF MODEL PARAMETERS
We further evaluate models with different parameter settings; the results are summarized in Table 8.
According to Table 8, we find that the dimension of the hidden state, the number of hidden layers, and the number of attention heads all have a positive impact on MRC. However, as the number of parameters increases, the training time also increases, so it is necessary to balance performance and complexity in practical applications. To compare different parameters visually, we first grouped all models according to the number of hidden layers and plotted the curves shown in Fig. 11. With the increase of the hidden state dimension, the EM score and F1 score improve greatly, but the larger the hidden state dimension, the smaller the gain. In addition, we find that by fusing external knowledge, KCF-NET can indeed improve MRC performance to a certain extent; when the hidden state dimension is small, the improvement is very significant. In most cases, the improvement in KCF-NET's EM score is slightly greater than that in its F1 score. We believe this is the positive effect of external knowledge on entity encoding, which makes the originally fuzzy position prediction more precise. Since the F1 score measures the degree of overlap, which is a softer indicator, its improvement is not as high as that of the EM score.
Then, we regroup the models for further observation, reassigning them to four groups according to the number of attention heads, as shown in Fig. 12. Group (a) contains models {1, 5, 9, 13, 17, 21}. We find that as the number of hidden layers increases, the model's EM score and F1 score improve greatly, especially when the number of hidden layers is small. In addition, compared with Fig. 11, when the number of hidden layers increases, the gain of KCF-NET is more volatile than that of BERT and is not entirely in a downward trend. We speculate that this is because the number of hidden layers in BERT does not directly affect the final word representation, whereas the knowledge coding layer in KCF-NET directly affects the entity representation, so the impact is more obvious. Overall, this experiment shows that the knowledge coding layer and the knowledge fusion layer in KCF-NET have a positive effect on the entire model, especially for models with few parameters. For BERT with a large number of parameters, more contextual information about words can already be learned in the pre-training stage, which serves a purpose similar to the external background knowledge added in this paper. Therefore, KCF-NET brings a limited but still positive improvement to BERT with a larger number of parameters. This again demonstrates the effect of external knowledge on improving the performance of machine reading comprehension.

H. EFFECTS OF TRANSFORMER LAYER
In previous experiments, we have studied BERT coding layers with different parameters. In this section, we investigate the influence of the information carried by different layers within BERT.
As shown in Fig. 13, we distinguish the different Transformer layers of BERT-Base with different colors. We design four strategies as follows: Strategy A: Only use the first layer of BERT-Base, connected to an MLP and Softmax for output.
Strategy B: Only use the last layer of BERT-Base.
Strategy C: Use the last three layers of BERT-Base.
Strategy D: Use all twelve layers of BERT-Base.
The experimental results can be seen in Table 9. The information obtained in Layer 1 is so limited that it can hardly be used directly, whereas that in the last layer is the most effective. Besides, blindly concatenating the features of all layers does not yield the best results.
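The four layer-selection strategies can be sketched as follows; the concatenation scheme for Strategies C and D is an assumption about how multiple layers would be combined before the MLP head:

```python
def select_layers(hidden_states, strategy):
    """Pick the feature fed to the MLP + Softmax head.

    hidden_states: list of per-layer feature vectors (plain lists),
    index 0 = first Transformer layer, index -1 = last layer.
    Strategies C and D concatenate the chosen layers (an assumed
    combination scheme, used here for illustration).
    """
    if strategy == "A":   # first layer only
        return hidden_states[0]
    if strategy == "B":   # last layer only
        return hidden_states[-1]
    if strategy == "C":   # last three layers, concatenated
        return [x for layer in hidden_states[-3:] for x in layer]
    if strategy == "D":   # all layers, concatenated
        return [x for layer in hidden_states for x in layer]
    raise ValueError(f"unknown strategy: {strategy}")
```

The experiment above suggests that later layers carry the most task-relevant information, while indiscriminately stacking every layer (Strategy D) adds dimensions without adding commensurate signal.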

V. CONCLUSION
Most existing methods can only obtain the semantic information contained in the given context, which limits the understanding of words or causes deviations. This paper presents a knowledge-based machine reading comprehension model, KCF-NET. Compared to previous knowledge-based models, we make breakthroughs in knowledge extraction and embedding fusion. Specifically, we consider the structural characteristics of the triples in KGs, rather than just converting triples into vector representations for direct accumulation. Considering the advantages of capsule networks in encoding spatial information, we propose CapsBERT to capture the horizontal and vertical relationships of the triple matrix, containing features both between and within the triples. Meanwhile, a novel fusion mechanism is proposed to preserve the original information and avoid knowledge noise, which uses gating units to regulate the flow of information and is easier to train. Experimental results show that CapsBERT achieves competitive performance compared to the existing best models on the triple classification task, with the number of parameters reduced by three-quarters. On the link prediction task, knowledge-BERT obtains the current best F1 score of 91.83. Compared with the performance of BERT-Base on SQuAD, our model has a higher accuracy with a negligible parameter increase. This work demonstrates the feasibility of further enhancing advanced language models with knowledge from KBs, which indicates a potential direction for future research. VOLUME 8, 2020