Reinforcement Learning Over Knowledge Graphs for Explainable Dialogue Intent Mining

In light of the millions of households that have adopted intelligent assistant powered devices, multi-turn dialogue has become an important field of inquiry. Most current methods identify the underlying intent in the dialogue using opaque classification techniques that fail to provide any interpretable basis for the classification. To address this, we propose a scheme to interpret the intent in multi-turn dialogue based on specific characteristics of the dialogue text. We rely on policy-guided reinforcement learning to identify paths in a graph, yielding concrete paths of inference that serve as interpretable explanations. The graph is induced based on the multi-turn dialogue user utterances, the intents, i.e., standard queries of the dialogues, and the sub-intents associated with the dialogues. Our reinforcement learning method then discerns the characteristics of the dialogue in chronological order as the basis for multi-turn dialogue path selection. Finally, we consider a wide range of recently proposed knowledge graph-based recommender systems as baselines, mostly based on deep reinforcement learning, and find that our method performs best.


I. INTRODUCTION
Across the globe, millions of households have adopted intelligent assistant powered devices. In light of this, multi-turn dialogue, in particular task-oriented multi-turn dialogue, which aims to handle specific questions, has become an important field of inquiry with substantial real-world impact. The system not only needs to identify a user's information need from this dialogue but also locate an appropriate answer from all the knowledge that is accessible to it. Such knowledge can oftentimes be regarded as taking the form of a knowledge graph, and locating an answer often corresponds to identifying relevant nodes in the graph [1].
Recent work in this area has exploited advances in neural representation learning to address this task [2], [3]. However, in real-world deployments of such systems, it is not sufficient for a multi-turn dialogue recognition system to merely use latent vector representations for knowledge graph nodes to identify appropriate responses. Rather, the system ought to be able to offer the user clear explanations of how the multi-turn dialogue led to specific intention recognition outcomes. In this paper, we consider a knowledge graph providing information such as user utterances, the sub-intents associated with the dialogues, and the standard queries of the dialogues.
The associate editor coordinating the review of this manuscript and approving it for publication was Arianna Dulizia.
We propose a method called PGMD that draws on a neural reinforcement learning network to navigate the knowledge graph in pursuit of the pertinent query nodes in the graph. The reinforcement learning agent starts from a user utterance from the current multi-turn dialogue and searches the knowledge graph iteratively with the goal of obtaining a precise and interpretable path in the graph for intent recognition. As the agent makes its prediction based on specific paths in the graph, we have a highly interpretable model that can easily explain the underlying process of intent recognition [4].
Thus, the goal of our paper is not only to identify the candidate sets of intentions in multi-turn dialogue, but also to provide an interpretable path in the knowledge graph that explains the process of identifying such intentions. This novel strategy yields a means of overcoming the shortcomings of current approaches. We model the intent recognition process as a Markov decision process based on a knowledge graph. Reinforcement learning is invoked for each given multi-turn dialogue, wherein the agent learns to search for the sub-intents associated with the dialogues, and finally to search for the standard queries of the dialogues. The search path can serve as an explanation of the dialogue intent prediction process.
The main contributions of this paper are as follows:
1) We use multi-turn dialogue data to construct a knowledge graph and train a node embedding model for this knowledge graph, which mainly includes the following types of nodes: user utterance nodes, sub-intent nodes, and standard query nodes. In light of the sparsity of the textual data, our model draws on the BERT pre-training model [5] to obtain the word representations of the user utterances to train the model.
2) We propose a reinforcement learning method for path selection called PGMD. Since multi-turn dialogue has chronological characteristics, we use a BiLSTM (Bidirectional Long Short-Term Memory) network with an attention mechanism in our reinforcement learning agent to obtain the state characteristics of the path. We also propose a new reward that computes the macro-averaged matching score between the nodes on the path and the query nodes.
3) We design multi-turn dialogue tracking path searching algorithms, including a backward tracking strategy and a forward tracking strategy, to find different paths as candidate sets for identified intents.

II. RELATED WORK

A. KNOWLEDGE GRAPH-DRIVEN RECOMMENDATION
The primary objective in a recommendation task is to determine the suitability of items for users that they have not yet seen or used. There are two principal ways of incorporating a knowledge graph into a recommendation engine. The first is the feature-driven recommendation method, which involves extracting pertinent user and item attributes from the knowledge graph as features that can then be included in traditional models, such as the FM model, LR model, etc. [6]. The second is the path-based recommendation method. [7] considers the knowledge graph as a heterogeneous information network and then constructs meta-graph based features between items. [8] proposed a new model named KPRN, which can generate paths based on the semantics of entities and relations. [9] transferred the relation information in the knowledge graph in order to figure out the reasons why a user prefers an item. [10] proposed a model named KGCN, which can mine the associated attributes between items in the knowledge graph. In particular, [11]-[14] proposed different path-based methods to obtain recommendation results for Linked Open Data (LOD). [12] identified the recommended path in LOD based on variable importance scores. [11] and [13] used DBpedia to extract semantic path-based features from which the recommendation results are computed. [14] investigated the incorporation of graph-based features into LOD path-based systems. One advantage of this second approach is the full and intuitive use of the network structure of the knowledge graph. In existing work, the recommendation engine is trained based on prior interactions between users and items.
However, in our dialogue engine, we need to recommend appropriate query nodes based on the user dialogue, and there may not have been any prior interaction at all with any relevant items. Thus, our setting is quite different from general recommendation systems. Nevertheless, we compare our algorithm against state-of-the-art recommendation engines.

B. REINFORCEMENT LEARNING
In recent years, a large number of studies in different areas have identified reinforcement learning as a promising artificial intelligence technique. Reinforcement learning is used not only for standard text mining tasks such as text classification [15], but has also been explored for knowledge graphs of the sort mentioned above. For example, in question answering, a knowledge graph may be considered as the environment for an agent. [1] used reinforcement learning with a reward function that considers accuracy, diversity, and efficiency to find paths in the knowledge graph, and [16] proposed a multi-hop knowledge graph reasoning approach to question answering. [17] proposed a knowledge graph question answering model based on end-to-end learning. [18] proposed a collaborative system that contains two agents: one reasons over paths in the knowledge graph, while the other extracts relations from a background corpus. More recently, [4] proposed a method called Policy-Guided Path Reasoning, which couples recommendation and interpretability by providing actual paths in a knowledge graph. [19] proposed a method that identifies explicit paths from users to items over the knowledge graph as the recommendation results; the experimental results show that the method not only yields good recommendations but also provides explanations. [20] proposed a cooperative system comprising a reasoning agent and an information extraction agent to handle question answering. The reasoning agent identifies the path over the knowledge graph, and the information extraction agent provides shortcuts or missing relations for long-distance target entities. [21] proposed a new performance metric for question answering agents, which improves the results of question answering models at the cost of not answering a limited number of questions that had been answered correctly.
[22] proposed a new method named CogKR, which includes a summary module and a reasoning module to handle the one-shot knowledge graph reasoning problem. Besides, reinforcement learning can also be used in automated knowledge base completion and knowledge-aware conversation generation. [23] proposed a new framework that reasons about the relations between the missing factors and updates the knowledge base to implement automated knowledge base completion. [24] proposed a new chatting machine that can generate conversation by reasoning over an augmented knowledge graph containing both triples and texts. We compare against such work in our experiments.
Compared with other reinforcement learning path reasoning methods, PGMD uses a BiLSTM network with an attention mechanism to extract path features, which follow an important underlying ordered time sequence, and uses a new formula that computes the macro-averaged matching score between the nodes on the path and the query nodes as the soft reward. Besides, the action trimming method also plays an important role in the algorithm.

C. DIALOGUE CONSTRUCTION
Building an automated conversational agent is a long-cherished goal in Artificial Intelligence (AI). At present, there are two common ways to construct a dialogue bot: generation-based methods [25], [26] and retrieval-based methods [2], [3], [27]-[30]. In particular, [2], [27], [28] analyzed different baselines that aim to select the next response on the Ubuntu Dialogue Corpus. [3] formed a fine-grained context representation by formulating previous utterances into context to obtain better performance. [29] proposed a model named SMN to address the problem of losing relationships among utterances or important contextual information. [30] used a Deep Attention Matching Network to select responses, which takes advantage of the attention mechanism to extract information from the user utterance and the response. Generation-based models generate the best answer given the context. With sufficient data, they can learn various ways to generate diverse responses. However, they may not be sufficiently stable. Retrieval-based chatbots, on the other hand, select a suitable response from a pre-built inventory of potential responses. Their advantage is that the entire system is relatively stable as they only consider a specific narrow domain. However, the set of potential answers is limited by the repository.
Multi-turn retrieval-based dialogue systems usually compute the matching scores between user utterances and responses and then select suitable responses from the response inventory. In contrast, our method primarily identifies the standard queries for the multi-turn dialogues. The users can then obtain responses according to the standard queries identified by our system.

D. DIALOGUE INTENT MINING
Dialogue systems are usually classified into two categories: task-oriented dialogue systems and non-task-oriented dialogue systems. Task-oriented dialogue systems aim to handle specific questions, whereas non-task-oriented dialogue systems do not have specific targets. The first step of the pipeline for task-oriented dialogue systems is to capture users' intents from their utterances; the second step is to take actions based on the task policy; and finally the system selects a suitable response from the pre-built inventory associated with the actions to reply to the user [31]. Methods using deep learning techniques have made great progress in dialogue intent mining [32]-[34], and convolutional neural networks (CNN) are used to capture the user utterance features to identify standard queries [35]. Moreover, [36] and [37] built on CNN-based models to obtain better performance.
There is no previous work using knowledge graphs to identify suitable query nodes with explainable paths for multi-turn dialogue. Our method can give clear explanations about how the multi-turn dialogues led to the query nodes.

III. PRELIMINARIES
In a task-oriented dialogue system (cf. Figure 1), the system response in each turn of the dialogue is decided by the intention analyzed from the previous turns of user utterances, which plays a significant role in the whole dialogue system. The goal of our system is precisely this intent mining step in a task-oriented dialogue system, carried out via reinforcement knowledge graph reasoning.

A. INPUT
In our experiments, we consider a dataset coming from a company's real customer service hotline, with 19 predefined standard queries and about 120,000 call dialogues. As input, we consider multi-turn dialogue data including the automated customer service agent from the company that attempts to identify the human caller's intent with regard to an inventory of standard queries, i.e., the intents for which the customer service can provide predefined help. The user utterances in the dataset refer to user questions, the sub-intents for dialogues include the demands associated with the dialogues and the relevant business units for the dialogues, and the intents for dialogues refer to standard queries. An example of such a conversation is given in Table 1. The data comprises customer utterance, business unit and demand for each turn of customer questions or utterances, and the standard query IDs (QIDs) for the overall multi-turn dialogue. We assume the connection between business units and their demands is known.

B. PROBLEM FORMULATION
A knowledge graph (KG) $G$ is a graph $G = \{(e, r, e') \mid e, e' \in \mathcal{E}, r \in \mathcal{R}\}$ that captures factual information. A node $e$ represents an entity, class, type, or literal. $\mathcal{E}$ is the set of nodes, and $\mathcal{R}$ is the set of edges between pairs of entities $e, e'$, where two nodes $e, e'$ are connected by a predicate $r$, forming a semantic fact (subject $e$, predicate $r$, object $e'$), e.g., (Berkeley, locatedIn, California).
KGs are widely used to model such semantic relationships. In this paper, we model the multi-turn dialogue process as an ad hoc knowledge graph $G_D$, created on the fly to capture relationships between nodes including multi-turn dialogue text utterances $T$ of customers and a subset of standard queries $Q$, where $T, Q \subseteq \mathcal{E}$ and $T \cap Q = \emptyset$. The two entities are connected via predicates $r_{t,q}$, where $t \in T$, $q \in Q$. Overall, in our dataset, there are 4 kinds of entities and 7 types of predicates, as described in Table 2. Given a knowledge graph $G_D$, the maximum length of searchable paths $K$, and the number of standard queries $N$, the goal is to learn a model and identify a candidate set $\{(q_n, p_n) \mid 0 \le n < N\}$ for each customer question $t \in T$, where $p_n$ denotes the probability of query $q_n$. Thus, for every pair $(t, q_n)$, we need to have a path $p_k(t, q_n)$ with $2 \le k \le K$.
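Definition 2 (1-Hop Scoring Function): We define a scoring function $f$ to compute the degree to which the entity $e$ matches the entity $e_k$, where $r$ denotes the relationship:
$$f(e, e_k) = \langle \mathbf{e} + \mathbf{r}, \mathbf{e}_k \rangle + b_{e_k} \quad (1)$$
Here $\langle \cdot, \cdot \rangle$ denotes the dot product, $\mathbf{e}$, $\mathbf{r}$, and $\mathbf{e}_k$ are the embeddings of $e$, $r$, and $e_k$, and $b_{e_k}$ is a bias term for $e_k$.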
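As an illustration of the 1-hop scoring function, the following minimal sketch evaluates a translation-style score of the form "dot product of (entity embedding + relation embedding) with the candidate embedding, plus a bias". The function name `one_hop_score` and the toy two-dimensional vectors are assumptions made for this example; they are not taken from the paper's implementation.

```python
def one_hop_score(e, r, e_k, b_e_k):
    """Return <e + r, e_k> + b_{e_k} for plain Python lists of floats.

    e, r, e_k are embeddings of equal length; b_e_k is a scalar bias.
    """
    if not (len(e) == len(r) == len(e_k)):
        raise ValueError("embeddings must have the same dimensionality")
    return sum((ei + ri) * ki for ei, ri, ki in zip(e, r, e_k)) + b_e_k


# Toy example with hypothetical 2-dimensional embeddings.
e = [0.5, -1.0]    # source entity
r = [0.5, 1.0]     # relation
e_k = [2.0, 3.0]   # candidate entity
score = one_hop_score(e, r, e_k, b_e_k=0.1)
```

A higher score indicates that translating the source entity by the relation lands closer (in the dot-product sense) to the candidate entity.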

IV. METHODOLOGY
We construct a knowledge graph for the multi-turn dialogue based on the entities and relations shown in Table 2. All user questions (or utterances), standard queries (which serve as the intents of dialogues), sub-intents of dialogues which include business units and demands can become entity nodes in such a graph. The goal is to find the correct query for the overall dialogue. This entails devising a strategy to pursue paths emanating from the user question node and leading to an appropriate query node. The searching path which begins from the first turn, passes through the following turns and finally reaches the query can be modeled as a Markov Decision Process. Hence, we rely on reinforcement learning for navigation along the graph towards the correct query node.

A. DIALOGUE GRAPH CONSTRUCTION
From the multi-turn dialogue, we induce a knowledge graph (cf. Figure 2) with edges connecting 4 types of entities. For every customer utterance node, there are multiple paths in the graph that can reach the standard query nodes. We assess which standard query has the highest probability based on scores within the knowledge graph. First, in order to construct a knowledge graph, we begin by extracting triples of the form $(e, r, e')$ from the data and add them to the knowledge graph. For instance, according to Table 2, ''Credit Pay'' is a business unit entity (business b2 in Figure 2), while ''stolen'' is a demand entity (Demand d2 in Figure 2), and the predicate ''includes'' is the relationship for the edge linking these two nodes.
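The triple-based construction can be sketched as follows. This is a minimal illustration, assuming a directed adjacency-list representation; the helper `build_graph` and every triple other than (''Credit Pay'', ''includes'', ''stolen'') are hypothetical placeholders, since the actual predicates are those listed in Table 2.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an adjacency list {head: [(relation, tail), ...]} from triples."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph


triples = [
    ("Credit Pay", "includes", "stolen"),          # example from the text
    ("question_1", "mentions", "Credit Pay"),      # hypothetical triple
    ("stolen", "leads_to", "query_7"),             # hypothetical triple
]
graph = build_graph(triples)
neighbors = graph["Credit Pay"]
```

Whether reverse edges are also inserted is a design choice that the text does not specify; the sketch keeps edges directed.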

1) VECTOR REPRESENTATIONS
While it is possible to apply structured queries on a knowledge graph, to make full use of the rich information that it provides, we additionally learn vector representations. We exploit the elegant TransE method [38] to learn such representations. For the embedding layer, the embeddings for query nodes, business nodes and demand nodes are initialized randomly based on a certain distribution, and then updated in the process of training TransE.
VOLUME 8, 2020

2) BERT QUESTION TEXT EMBEDDINGS
However, considering only structural information, the embeddings for question entities would remain overly coarse-grained and uninformative in light of the rich semantic structure of linguistic utterances. Hence, in order to learn richer embeddings capturing fine-grained semantic nuances, we invoke the pretrained BERT model [5] for embedding initialization, and fine-tune it in the process of training the TransE model. Overall, the process of training TransE can be summarized by Equation 2 and Equation 3.

B. MULTI-TURN DIALOGUE PATH SEARCHING ALGORITHMS
In order to identify the user intent by mapping it to a standard query, we need to find the correct path in the knowledge graph from the user question node to such a query node. We design multi-turn dialogue path searching algorithms including a backward tracking strategy and a forward tracking strategy. We formalize the forward tracking strategy as Algorithm 1, and formalize the overall algorithmic procedure using the forward tracking strategy as Algorithm 2 by considering the specific multi-turn property of the data. While searching for paths, we rely on reinforcement learning to select the next node in the knowledge graph (cf. blue lines). For a multi-turn dialogue, the shorter the search path, the more reliable its result tends to be. Hence, we define a threshold σ as the maximum number of search steps. When using the forward tracking strategy, we start searching the path from the utterance in the first turn of the multi-turn dialogue. For a three-turn dialogue {question_1 → question_2 → question_3}, for instance, we set the node question_1 as the starting point when searching for a query node. However, the query node may not be reached within the limited number of steps. If the path searching process stops at a certain step, the multi-turn dialogue will fail to return a query. In this case, we forward track to search from question_2.
Conversely, when using the backward tracking strategy, we start searching the path from the utterance in the last turn of the multi-turn dialogue. If the query node is not reached, we backward track to search from the previous turn of the multi-turn dialogue.
For this path selection part, the time complexity for each dialogue is $O(T \cdot L \cdot A)$, where $T$ represents the maximum number of turns of the dialogue, $L$ represents the maximum length of the path when the system searches for query nodes, and $A$ represents the maximum out-degree in the knowledge graph.

Algorithm 1 (fragment)
        ...
        if current_node ∈ Q then
            return current_node    ▷ find the target node
        end if
    end while
    k++    ▷ need forward tracking
end while
return ∅
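The forward and backward tracking strategies can be sketched in a few lines. This is a simplified illustration rather than the paper's Algorithm 1: the function `track_search`, the callback `choose_next` (standing in for the learned policy), and the toy graph are all assumptions introduced here.

```python
def track_search(turns, graph, query_nodes, choose_next, max_steps, forward=True):
    """Search for a query node, restarting from another turn on failure.

    turns: question nodes in chronological order.
    graph: dict mapping a node to the list of its neighbor nodes.
    choose_next: callable(current, candidates) -> chosen neighbor.
    Returns (query_node, start_turn) or None if no query node is reached.
    """
    order = list(turns) if forward else list(reversed(turns))
    for start in order:
        current, visited = start, {start}
        for _ in range(max_steps):
            # Exclude nodes already on the path, as in the action space.
            candidates = [n for n in graph.get(current, []) if n not in visited]
            if not candidates:
                break
            current = choose_next(current, candidates)
            visited.add(current)
            if current in query_nodes:
                return current, start   # found the target node
        # Not found within max_steps: track to the next start turn.
    return None


toy_graph = {
    "question_1": ["business_1"],
    "business_1": ["demand_1"],
    "question_2": ["business_2"],
    "business_2": ["query_7"],
}
first_candidate = lambda current, candidates: candidates[0]
result = track_search(["question_1", "question_2"], toy_graph,
                      {"query_7"}, first_candidate, max_steps=3)
```

In this toy run the search from `question_1` dead-ends, so the procedure tracks forward and finds `query_7` starting from `question_2`. The nested loops over turns, steps, and candidate neighbors mirror the O(T * L * A) bound stated above.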

C. REINFORCEMENT LEARNING
The goal of our reinforcement learning is to pursue suitable paths in the knowledge graph. Algorithm 2 provides the details of the proposed reinforcement learning empowered PGMD algorithm, which extends the path searching process.

1) POLICY/VALUE NETWORK
At every step, the reinforcement learning model requires the state of the current search path to select the best action to take. An important property of multi-turn dialogue is that the different turns adhere to an underlying ordered time sequence. For instance, the query nodes for the two partial dialogues question_1 → question_2 and question_2 → question_1 may be entirely different. Thus, the network needs to account for this temporal order property of the data. We use an Actor-Critic algorithm, with the structure of the network as given in Figure 3. A BiLSTM network with an attention mechanism is invoked to extract path features. We further concatenate the embedding of the historical nodes with the BiLSTM model's output as a fusion layer, and then pass it through two fully connected layers. Finally, the probabilities of actions in the action space are emitted by the actor layer. The effect of the network is evaluated by the critic layer.

2) STATES
The state is the input of the policy network and provides information about the current path. To avoid overfitting, we only consider partial paths. Let $k$ be the upper bound on the number of historical nodes used to make a decision. The state $s_t$ at step $t$ consists of the start node embedding $E_q$ and the path embedding running from the node $e_{t-k+1}$ to the current node $e_t$, including edges: $(E_{e_{t-k+1}}, E_{r_{t-k+2}}, \ldots, E_{r_t}, E_{e_t})$, where $E_e$ is the embedding of the entity $e$ and $E_r$ is the embedding of the predicate $r$.
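To make the state definition concrete, the sketch below assembles a flat state vector from the start node embedding and the last k nodes of the path together with the edges between them. The helper `build_state`, the front zero-padding for paths shorter than k nodes, and the toy embeddings are assumptions for illustration only; the text does not specify how short paths are handled.

```python
def build_state(start_emb, path_embs, k, dim):
    """Concatenate the start embedding with the last k nodes and k-1 edges.

    path_embs alternates entity and relation embeddings:
    [E_e0, E_r1, E_e1, ..., E_rt, E_et], so it always has odd length.
    Shorter paths are zero-padded at the front (an assumption).
    """
    window_len = 2 * k - 1                   # k entities + (k - 1) relations
    window = path_embs[-window_len:]
    padding = [[0.0] * dim] * (window_len - len(window))
    state = list(start_emb)
    for vec in padding + window:
        state.extend(vec)
    return state


start = [1.0, 1.0]
path = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
state = build_state(start, path, k=2, dim=2)
short_state = build_state(start, path[:1], k=2, dim=2)
```

With k = 2 only the last two entities and the edge between them are kept, so the state length stays fixed regardless of how long the full path has grown.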

3) ACTIONS
For a current node $e_t$ at step $t$, the complete action space includes all the outgoing connected nodes of $e_t$ (excluding historical nodes). Some nodes in the graph may have a large out-degree. Owing to efficiency considerations, we propose an action pruning strategy. We compute the scores of node $e_t$ with all nodes in the complete action space $A$ according to the 1-hop scoring function $f(e_t, a)$, $a \in A$. Given $\delta$ as the upper bound on the size of the action space, we eliminate low-scoring actions after sorting. The pruned action space is defined in Equation 4.
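The pruning step amounts to a top-δ selection under the 1-hop score. The following sketch assumes each candidate action carries its relation embedding, node embedding, and bias; the function `prune_actions` and the toy values are hypothetical, and historical nodes are assumed to have been filtered out beforehand.

```python
def prune_actions(current_emb, actions, delta):
    """Keep at most delta actions with the highest 1-hop scores.

    actions: list of (node_id, relation_emb, node_emb, node_bias).
    Returns the retained node ids, highest score first when pruning applies.
    """
    def score(action):
        _, rel, node, bias = action
        return sum((c + r) * n for c, r, n in zip(current_emb, rel, node)) + bias

    if len(actions) <= delta:
        return [a[0] for a in actions]       # nothing to prune
    ranked = sorted(actions, key=score, reverse=True)
    return [a[0] for a in ranked[:delta]]


current = [1.0, 0.0]
candidates = [
    ("a", [0.0, 0.0], [1.0, 0.0], 0.0),    # score 1.0
    ("b", [0.0, 0.0], [-1.0, 0.0], 0.0),   # score -1.0
    ("c", [0.0, 1.0], [0.0, 2.0], 0.5),    # score 2.5
]
kept = prune_actions(current, candidates, delta=2)
```

Here the lowest-scoring candidate `b` is dropped, which is the behavior the pruning strategy relies on to remove noisy neighbors of high-degree nodes.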

4) REWARD
During path searching in the KG, it is not possible to confirm whether the action will ultimately reach the correct target before the final step. Hence, we cannot rely solely on a binary reward that indicates whether the agent has reached the target. Instead, we propose a soft reward formula for the case in which the agent reaches a query node other than the target. As the number of nodes of each type on the path may vary, and we wish for each type of node to play the same role, we take as the reward the macro-averaged matching score between the nodes on the path and the query node. The reward function is defined in Equation 5, whose cases are: the macro-averaged matching score (soft reward) if $e_t \in Q$ and $e_t \neq e_r$; 1 if $e_t = e_r$; and 0 otherwise.
Here $Q$ is the set of query entities, $B$ is the set of business entities, $D$ is the set of demand entities, and $T$ is the set of question entities. $e_0$, $e_1$, and $e_2$ represent nodes on the search path, with $e_0 \in D$, $e_1 \in B$, $e_2 \in T$. $e_r$ is the query node corresponding to the multi-turn dialogue. $n_0$ is the number of demand nodes on the path, $n_1$ is the number of business nodes on the path, and $n_2$ is the number of question nodes on the path.
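The idea of a macro-averaged soft reward can be illustrated as follows: matching scores are first averaged within each node type (demand, business, question) and the per-type means are then averaged, so that each type contributes equally regardless of how many of its nodes lie on the path. This is only a sketch under stated assumptions: the function `soft_reward`, the `match` callable (assumed to return values in [0, 1]), and the toy scores are hypothetical, and the exact form and normalization of the paper's Equation 5 are not reproduced here.

```python
def soft_reward(path, final_node, target, query_nodes, match):
    """Return 1 for the target, a macro-averaged score for other queries, else 0.

    path: list of (node_id, node_type) pairs for non-query nodes on the path.
    match: callable(node_id, query_id) -> matching score (assumed in [0, 1]).
    """
    if final_node == target:
        return 1.0
    if final_node not in query_nodes:
        return 0.0
    per_type = {}
    for node, node_type in path:
        per_type.setdefault(node_type, []).append(match(node, final_node))
    if not per_type:
        return 0.0
    type_means = [sum(scores) / len(scores) for scores in per_type.values()]
    return sum(type_means) / len(type_means)


toy_scores = {"q1": 0.2, "q2": 0.4, "b1": 0.6, "d1": 0.9}
path = [("q1", "question"), ("q2", "question"), ("b1", "business"), ("d1", "demand")]
match = lambda node, query: toy_scores[node]
queries = {"query_3", "query_7"}

reward_wrong_query = soft_reward(path, "query_3", "query_7", queries, match)
reward_target = soft_reward(path, "query_7", "query_7", queries, match)
reward_no_query = soft_reward(path, "d1", "query_7", queries, match)
```

In the toy path the two question nodes are averaged to 0.3 before being combined with the business (0.6) and demand (0.9) means, giving a soft reward of about 0.6 rather than the 0.525 a plain average over all four nodes would give.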

A. SETTINGS

1) DATASET
The details of the dataset were already given in Section III. From the total of 120,000 call dialogues, we randomly selected one-tenth as the test set, one-tenth as the validation set, and the remaining eight-tenths for training.

2) DATA PROTECTION STATEMENT
1) The data used in this research does not involve any Personal Identifiable Information (PII).
2) The data used in this research were all processed by data abstraction and data encryption, and the researchers were unable to restore the original data.
3) Sufficient data protection was carried out during the process of experiments to prevent the data leakage and the data was destroyed after the experiments were finished.
4) The data is only used for academic research and sampled from the original data, therefore it does not represent any real business situation in Ant Financial Services Group.

3) EVALUATION METRICS
The experiments aim to evaluate whether our algorithm can predict the correct query for the user-provided questions within the dialogue. To this end, we compute macro-Precision (Prec.), macro-Recall (Rec.), and macro-F1 to evaluate the performance of the top-1 result. There are also scenarios in the company requiring multiple queries to be selected. Thus, we additionally compute Precision@K, which counts a result as correct when there is at least one match with the ground-truth query among the top-ranked K result queries.
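These metrics can be computed with a few lines of code. The sketch below is illustrative: the function names and toy labels are assumptions, and macro-F1 is taken here as the mean of per-class F1 scores, which is one common convention that the text does not pin down.

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def precision_at_k(y_true, ranked_preds, k):
    """Fraction of dialogues whose true query is among the top-k predictions."""
    hits = sum(1 for t, ranked in zip(y_true, ranked_preds) if t in ranked[:k])
    return hits / len(y_true)


y_true = ["q1", "q2", "q1", "q3"]
ranked = [["q1", "q2"], ["q1", "q2"], ["q3", "q1"], ["q3", "q2"]]
top1 = [r[0] for r in ranked]

prec, rec, f1 = macro_prf(y_true, top1)
p_at_1 = precision_at_k(y_true, ranked, 1)
p_at_2 = precision_at_k(y_true, ranked, 2)
```

On this toy data the top-1 hit rate is 0.5, while every ground-truth query appears within the top two candidates, which is why Precision@K is the more relevant figure when several queries may be returned.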

4) IMPLEMENTATION DETAILS
In our experiments, we relied on a maximum searching step limit σ = 3, an upper bound of historical state nodes k = 3, and an upper bound δ = 100 for the action space. We set the dimensionality of the word embeddings to 100. To increase the diversity of paths, we set the dropout rate to 0.5 and use the SGD$_2$ optimizer, i.e., the SGD optimizer with a momentum of 0.9. We train the model for 10 epochs, setting the learning rate to 0.0001 and adopting a batch size of 64, with the entropy loss weight set to 0.001. We report the performance under different settings of the most important parameters, namely the optimizer, the maximum searching step limit, and the upper bound for the action space, in Section V-C, Section V-G, and Section V-F, respectively, and select the parameters corresponding to the best result as the configuration. The other parameters, namely the dimensionality of the word embeddings [1], dropout rate [17], learning rate [4], batch size [4], upper bound of historical state nodes [4], and entropy loss weight [17], are set according to previous work.

B. BASELINES
We compare the proposed PGMD against both recommender systems and text classification methods. As the baselines are not specifically designed for our problem, they rely on varying subsets of data sources. Details of the data sources used by every baseline are given in Table 3.

BERT-Classification: We use the pre-trained BERT vectors of the questions for the task of intent classification.
BPR [39]: The Bayesian Personalized Ranking approach for recommendation, which is one of the state-of-the-art ranking-based methods for top-N recommendation with numerical ratings. We use BPR-MF for model learning.
DeepCoNN [40]: The Deep Cooperative Neural Networks model for recommendation, which models users and items jointly using review text for rating prediction.
KGCN [10]: The Knowledge Graph Convolutional Network for recommendation, which mines associated attributes between items on knowledge graph.
KTUP [41]: A Joint Knowledge Graph Recommender, which transfers the relation information in the knowledge graph in order to figure out the reason why a user prefers an item.
JRL [42]: A Joint Representation Learning (JRL) framework based on multi-view machine learning, which is capable of incorporating heterogeneous information sources for top-N recommendation by learning user/item representations in a unified space.
Semhash-Classification [43]: We use Semantic Hashing vectors of the questions for the task of intent classification.
DeepPath [1]: A method for knowledge graph reasoning, which includes a reward function that takes the accuracy, diversity and efficiency into consideration.
MINERVA [16]: A reinforcement learning method for knowledge reasoning, which navigates the agent based on the input query to identify predictive paths in the graph.
MultiHopKG [17]: An approach to reasoning in knowledge graphs, which reduces the influence of false negative supervision and weakens the sensitivity to spurious paths of on-policy RL.
PGPR: We adapt PGPR to our problem by removing the BiLSTM and attention mechanism in the policy/value network (cf. Figure 3), keeping only the concatenation layer of historical nodes.

C. OPTIMIZER
We analyze the performance when using different optimizers during training. Five optimizers are compared in Table 5: Adam, SGD$_1$, SGD$_2$, RMSProp, and AdaGrad. Here, SGD$_1$ refers to the SGD optimizer with no momentum, while SGD$_2$ refers to the SGD optimizer with a momentum of 0.9. The alpha parameter for the RMSProp optimizer is set to 0.9. The time (min) refers to the average time cost per epoch during training. With SGD$_2$, the system performs best, but requires much more training time. One explanation is that the momentum increases the rate of convergence and helps the optimizer escape local optima. With Adam, the performance is not the best, but the time cost is lower. Thus, Adam is an optimizer with excellent overall performance in our experiments. The use of an adaptive learning rate allows the loss function to converge quickly. However, although Adam converges quickly in the early stage of training, the final generalization ability of the model is not as good as that of the model trained with SGD.

D. QUANTITATIVE ANALYSIS
In order to compare PGMD against the baseline models exhaustively, we conduct an extensive quantitative analysis of these models. First, we train each model for 10 epochs with the default settings mentioned above, including the SGD$_2$ optimizer, and observe the top-1, top-2, and top-3 results of the different models. The best values for every model are reported in Table 6. Overall, PGMD performs better than the baselines in terms of precision, recall, and F1.
The results of DeepCoNN and KGCN are dismal. This might be because DeepCoNN relies on user reviews and item scores for training. KGCN, meanwhile, builds a knowledge graph around the item and the corresponding attributes. It relies on the user's previous interactions with the item to update the embedding matrix of users and achieve the effect of ''knowing'' a particular user. However, for our dialogue dataset, text utterances are often not repeated. Thus, the user information is limited to that given by the BERT vector representation of the dialogue text. As there is little duplicate text in the data, it is difficult to learn relationships between a text item and a target query node. Thus, every dialogue shows up as introducing a new user at test time, severely hampering the result quality.

E. EVALUATION OF PATH SEARCHING STRATEGIES
For further analysis, we compare our forward tracking strategy during path searching with a backward tracking one. We trained the model with the SGD$_2$ optimizer and the other default settings. The total numbers of multi-turn dialogues with respect to different numbers of turns are listed in Table 4. The table also provides statistics about which question nodes the algorithm could find query nodes for during the path search. Dialogues for which the algorithm is unable to find a query node during path reasoning are referred to as invalid dialogues. ''Que'' refers to question nodes in the tables. For example, the value 6,822 in Table 4 signifies that, among all the 42,175 3-turn dialogues, there are 6,822 for which the query node is found at the second turn.
The experimental results show that by using the forward path searching strategy the number of searches from the previous question is more than the number of searches from the next question. In contrast, with the backtracking path search strategy, the number of searches from the next question is more than the number of searches from the previous one. The model using the backward tracking strategy attains a lower accuracy than that adopting the forward tracking strategy. One explanation is that for a multi-turn dialogue in our scenario, a new turn usually serve as additional information of the previous turn and doesn't contain complete information. When using forward tracking strategy, it is easier for the system to find the target query node based on the complete information. In contrast, when using backward tracking strategy, it is possible for the system to find wrong query node because of the incomplete information contained of the later turn.
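The difference between the two strategies can be illustrated with a minimal Python sketch. It assumes a hypothetical `find_query` callable that stands in for the policy-guided path search from a single question node and returns a query node or `None`. The function name `search_dialogue` and the toy lookup table are likewise illustrative assumptions and are not taken from the PGMD implementation.

```python
from typing import Callable, Optional, Sequence, Tuple


def search_dialogue(
    turns: Sequence[str],
    find_query: Callable[[str], Optional[str]],
    forward: bool = True,
) -> Optional[Tuple[int, str]]:
    """Try the turns of one dialogue in order and return (turn_index, query).

    forward=True starts from the earliest turn (forward tracking);
    forward=False starts from the latest turn (backward tracking).
    Returns None when no turn yields a query node (an invalid dialogue).
    """
    indices = range(len(turns))
    order = indices if forward else reversed(indices)
    for i in order:
        query = find_query(turns[i])  # hypothetical stand-in for the path search
        if query is not None:
            return i, query
    return None


# Toy stand-in (assumed data): the first turn carries the complete question,
# while the follow-up turn alone is ambiguous and reaches another query node.
toy_paths = {
    "turn-1: complete question": "query A",
    "turn-2: partial follow-up": "query B",
}
dialogue = list(toy_paths)

forward_result = search_dialogue(dialogue, toy_paths.get, forward=True)
backward_result = search_dialogue(dialogue, toy_paths.get, forward=False)
```

In this toy setting the forward strategy stops at the first turn, whereas the backward strategy stops at the later, less complete turn and returns a different query node, which mirrors the explanation given above.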

F. INFLUENCE OF ACTION PRUNING STRATEGY
The action space has an important effect on the result. In this experiment, we evaluate PGMD with different sizes of the pruned action space. When the number of actions is larger than the upper bound δ, the action space is trimmed according to the scoring function: higher-scoring actions are more likely to be kept, while lower-scoring actions are removed from the action space. We want to explore whether a larger action space, up to keeping all the actions, helps the system perform better than a smaller one. Because Adam is one of the most widely used optimizers, we analyze the performance of the system with different action space sizes using both the Adam optimizer and the SGD 2 optimizer. The size of the trimmed action space varies from 100 to 500, with a step size of 100. We can observe from Table 7 that a smaller pruned action space is more likely to yield better performance. The results indicate that applying the action pruning strategy with the 1-hop score function is effective: with a smaller pruned action space, the system trims more noisy nodes, which increases the probability of reaching the correct query nodes.
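The pruning step can be sketched as follows. This is a minimal illustration under stated assumptions: actions are represented as hypothetical `(relation, next_node)` pairs, the `score` callable stands in for the 1-hop score function, and the pruning is simplified to a deterministic top-δ selection, whereas the text only states that higher-scoring actions are more likely to be kept.

```python
from typing import Callable, List, Sequence, Tuple

Action = Tuple[str, str]  # (relation, next_node); an assumed representation


def prune_actions(
    actions: Sequence[Action],
    score: Callable[[Action], float],
    delta: int,
) -> List[Action]:
    """Keep at most `delta` actions, preferring higher scores."""
    if delta < 0:
        raise ValueError("delta must be non-negative")
    if len(actions) <= delta:
        return list(actions)  # small action spaces are left untouched
    # sorted() is stable, so tied actions keep their original relative order.
    return sorted(actions, key=score, reverse=True)[:delta]


# Toy scores (assumed values) standing in for the 1-hop score function.
toy_scores = {
    ("demand", "n1"): 0.9,
    ("business", "n2"): 0.1,
    ("dialog", "n3"): 0.5,
}
kept = prune_actions(list(toy_scores), toy_scores.__getitem__, delta=2)
```

With δ = 2, the lowest-scoring action is dropped; with δ at least as large as the action space, nothing is removed.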

G. HISTORY REPRESENTATIONS
The maximum length of the path that the system may follow when searching for a query node is an important hyperparameter. We compared maximum path lengths of 3, 4, and 5 with the Adam optimizer and the SGD 2 optimizer. Table 8 provides the results for the different maximum lengths. We find that the system performs best with a maximum path length of 3. One explanation is that although longer search paths lead to more query nodes, they yield less reliable predictions. Therefore, when defining the maximum path length, we need to consider the trade-off between the reliability of the predictions and the search range over query nodes.
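The search-range side of this trade-off can be illustrated with a small sketch. It uses an exhaustive breadth-first search rather than the policy-guided search of PGMD, and the adjacency-list graph, the node names, and the function `reachable_queries` are assumptions made only for illustration. It shows how the set of reachable query nodes grows with the maximum path length, and says nothing about prediction reliability.

```python
from collections import deque
from typing import Dict, List, Set


def reachable_queries(
    graph: Dict[str, List[str]],
    start: str,
    query_nodes: Set[str],
    max_len: int,
) -> Set[str]:
    """Return the query nodes reachable from `start` within `max_len` hops."""
    found: Set[str] = set()
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if node in query_nodes:
            found.add(node)
        if depth == max_len:
            continue  # do not expand beyond the maximum path length
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return found


# Toy graph (assumed): Q1 lies 3 hops from q0, Q2 lies 5 hops from q0.
toy_graph = {
    "q0": ["d1"],
    "d1": ["q1"],
    "q1": ["Q1", "b1"],
    "b1": ["q2"],
    "q2": ["Q2"],
}
within_3 = reachable_queries(toy_graph, "q0", {"Q1", "Q2"}, max_len=3)
within_5 = reachable_queries(toy_graph, "q0", {"Q1", "Q2"}, max_len=5)
```

In this toy graph, raising the limit from 3 to 5 adds a second, more distant query node to the candidates.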

VI. CASE STUDY ON PATH REASONING
To illustrate how our model allows for interpretability, we present a case study based on the previous experimental results. Figure 4 illustrates how the predicted paths can be used to explain the process of intent recognition. In the first example, the question ''I don't know why there is a default on my Credit Account.'' has the Demand value ''Max term''. The question ''Yes I used to default once but not this time. I want to ask that why I cannot use my Credit Pay and how to solve this issue?'' also has the Demand value ''Max term'', and its query is ''Default issues''. Thus, we can infer that the query for the question ''I don't know why there is a default on my Credit Account.'' is also ''Default issues''.
In the second example, the question ''Why I still cannot use my Credit Pay after repaying the debts.'' has the Business value ''Pay for Credit Pay advanced''. The question ''I didn't register any Credit Pay account, but I was informed that I borrowed money from it and then paid back to it.'' has the same Business value, and its query is ''Repaying issues''. We can therefore infer that the query of the question ''Why I still cannot use my Credit Pay after repaying the debts.'' is also ''Repaying issues''.
In the third example, the question ''I lost my phone, and I can't log in my credit account.'' leads on, within the dialogue, to the question ''I cannot log in my credit account now. How can I pay the bill?'', which in turn leads on to the question ''When I log in the credit account, It says that the account doesn't exist.'' The query of this last question is ''Login issues'', so we can infer that the query of the question ''I lost my phone, and I can't log in my credit account.'' is also ''Login issues''.
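The way a predicted path doubles as an explanation can be made concrete with a short sketch. The path below encodes the first example as a list of (head, relation, tail) steps. The relation names (`has_demand`, `demand_of`, `has_query`), the shortened node labels, and the helper functions are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Step = Tuple[str, str, str]  # (head node, relation, tail node)


def is_connected(path: List[Step]) -> bool:
    """Check that each step starts at the node where the previous one ended."""
    return all(prev[2] == cur[0] for prev, cur in zip(path, path[1:]))


def explain(path: List[Step]) -> str:
    """Render a reasoning path as one readable line."""
    if not path:
        raise ValueError("empty path")
    if not is_connected(path):
        raise ValueError("path steps are not connected")
    parts = [path[0][0]]
    for _, relation, tail in path:
        parts.append(f"--{relation}--> {tail}")
    return " ".join(parts)


# First example from the case study, with shortened node labels.
# The relation names are hypothetical.
first_example = [
    ("Q: default on my Credit Account", "has_demand", "Max term"),
    ("Max term", "demand_of", "Q: cannot use my Credit Pay"),
    ("Q: cannot use my Credit Pay", "has_query", "Default issues"),
]
explanation = explain(first_example)
predicted_query = first_example[-1][2]  # the query node where the path ends
```

The final node of the path is the predicted query, and the rendered string lists the intermediate nodes that justify it.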

VII. CONCLUSION
In this paper, we present a novel explainable approach for intent identification in multi-turn dialogue. Our PGMD approach relies on a reinforcement learning neural network to navigate a query-specific ad hoc knowledge graph in pursuit of relevant query nodes, via our order-aware forward tracking path searching algorithm for multi-turn dialogue. We conducted a series of experiments demonstrating that PGMD is a powerful method for multi-turn dialogue intent identification that provides intuitive explanations and outperforms state-of-the-art related work. We believe our method can be extended to dynamic knowledge graphs in order to handle settings in which new knowledge appears over time. As new nodes and edges are added to the knowledge graph, some existing nodes and edges are likely to be removed from it. To obtain the embeddings of nodes and edges efficiently after each update, we could use the DKGE model [44], which supports online embedding learning, to compute the updated knowledge graph embedding.