Intention Detection Based on Siamese Neural Network With Triplet Loss

Understanding the user’s intention is an essential task for the spoken language understanding (SLU) module in a dialogue system, as it provides vital information for managing and generating future actions and responses. In this paper, we propose a triplet training framework based on the multiclass classification approach to conduct the training for the intention detection task. Precisely, we utilize a Siamese neural network architecture with metric learning to construct a robust and discriminative utterance feature embedding model. We modify the RMCNN model and fine-tune the BERT model as Siamese encoders to train utterance triplets from different semantic aspects. The triplet loss can effectively distinguish the details of two input data by learning a mapping from sequence utterances to a compact Euclidean space. After generating the mapping, the intention detection task can be easily implemented using standard techniques with pre-trained embeddings as feature vectors. Besides, we use a fusion strategy to enhance the utterance feature representation in the downstream intention detection task. We conduct experiments on several benchmark datasets for the intention detection task: the Snips dataset, the ATIS dataset, the Facebook multilingual task-oriented datasets, the Daily Dialogue dataset, and the MRDA dataset. The results illustrate that the proposed method can effectively improve the recognition performance on these datasets and achieves new state-of-the-art results on single-turn task-oriented datasets (Snips dataset, Facebook dataset) and a multi-turn dataset (Daily Dialogue dataset).


I. INTRODUCTION
Dialogue systems are being integrated into various devices, such as Google Home [1] and Amazon Echo [2], and allow users to speak to the system directly to perform specific tasks efficiently. The spoken language understanding (SLU) module is an indispensable component of the dialogue system. A typical SLU module is designed to transform the spoken language into a specific semantic template so that human language can be well understood by the dialogue system. After that, the dialogue management module can facilitate future actions according to the detection results of the SLU module. The role of the intention detection task in SLU is to discriminate the implicit intention by recognizing the intents of received utterances. The intent tag is a semantic label attached to each utterance in a dialogue, which represents the user's intention and a concise interpretation of the utterance [3]. Therefore, the intention detection task is crucial to enhance the spoken language understanding performance in the dialogue system.
The associate editor coordinating the review of this manuscript and approving it for publication was Julien Le Kernec.
In our research, we study spoken language as described in written format. In practice, it is challenging to study spoken language because of some attributes of natural language. Firstly, the sparsity of semantic information and obscure slang in spoken language make utterances difficult for a model to interpret thoroughly [4]. For instance, the average length of some utterances is no more than 20 words. Secondly, the same underlying utterance can have different tags or multiple tags, which gives rise to ambiguity in classifying intention labels. As shown in Table 1, the utterance 'Yeah' has three tags, which are 'Backchannel,' 'Agree,' and 'Yes/No Answer,' respectively. Prior work on multi-class classification for intention detection exploits Softmax to train an encoder on labeled training data. The features learned under the supervision of Softmax cannot be sufficiently distinguished because Softmax does not consider the intra-class compactness of features. The category prediction focuses only on finding a decision boundary, which results in poor generalization capabilities. Inspired by these observations, we assume that the intention recognition performance can benefit from constructing robust and discriminative feature representations of short utterances. To this end, we improve the conventional method by proposing a novel triplet training framework based on multi-class classification learning.
Pre-trained language models have recently proved to be very useful and efficient in learning general language representations. For instance, the BERT model is conceptually simple and empirically powerful in a large number of natural language processing tasks [5]. Inspired by the pre-trained language model learning approach and transfer learning techniques, we borrow the concept of pre-training and combine it with triplet loss to learn a structured space of interpretable utterance representations.
Specifically, we design a two-stage process for intent classification, which includes feature embedding learning and intention prediction. In the first stage, we develop the RMCNN model and BERT model as Siamese encoder with metric learning to obtain robust and discriminative feature embeddings by minimizing the intra-class differences. In the second stage, we fuse the features from pre-trained feature embedding models and add additional relevant information as completed feature sets to predict intention labels in the downstream task.
We summarize the contributions of this paper as follows: (1) The proposed triplet training framework learns discriminative utterance features by using the same weights on different inputs. (2) The triplet loss function infers a non-linear mapping in the resulting latent space, and the inter-class sample distances are maximized based on a certain margin [6]. (3) The triplet selection turns out to be crucial for model convergence. By considering the strong correlations within the dialogue context, we propose a sequential sampling strategy that keeps the intention transition traits in the triplet sampling process. (4) In the downstream task, we predict the probability distribution of each intent label based on multi-class classification learning. We obtain utterance features by fusing the features from different pre-trained feature embedding models. Besides, we extend the features with relevant information as external knowledge, such as speaker information.
The rest of the paper is organized as follows: Section II introduces the related research methods; Section III introduces the model framework and methodology; Section IV conducts experiments on benchmark datasets; Section V analyzes the results from different aspects; Section VI concludes the whole article and outlines future work.

II. RELATED WORK
A. INTENTION DETECTION TASK
The learning methods for the intention detection task are divided into two categories: multi-class classification and sequence labeling. The multi-class classification models used in experiments include SVM [7], Naive Bayes [8], and Maximum entropy [9]. The sequence labeling methods include HMM [7] and SVM-HMM [10]. Plenty of features have been exploited in traditional models, including lexical features, syntactic features, prosodic cues, and dialogue structure. For example, keywords [11] and vocabulary pairs as lexical features [12] can highlight the particularity of a sentence. Besides, syntactic features like utterance length [10] and word order [13] have shown their utility for identifying intention tags. However, the traditional approaches for intention detection relied on hand-crafted features that were time-consuming and labor-intensive to build.
The emergence of deep learning methods effectively alleviated the constraints of the traditional approaches and achieved state-of-the-art results in fields from natural language processing to computer vision [14]. For example, Khanpour et al. [15] utilized a pre-trained word embedding matrix and a modified RNN model to represent the utterance features. Kim [16] used a CNN as an utterance encoder with pre-trained embeddings, which performed well on this task. Lee and Dernoncourt [17] achieved leading results by investigating standard RNN and CNN models that incorporated preceding short texts as context to predict dialogue act tags. Besides, some studies utilized a joint learning approach to conduct intention detection and slot filling [48], [49]. In addition, some researchers considered the contextual structure of the multi-turn dialogue, so the intention detection task can also be regarded as a sequence labeling task. Kumar et al. [18] utilized a hierarchical Bi-LSTM to capture utterance granularity and inherent properties from multiple levels of the conversation and predicted sequential dialogue acts with a CRF model. Tu et al. [19] built a hybrid neural network-based ensemble model for Chinese hierarchical dialogue. Notably, that work incorporated speaker changes as a feature to illustrate utterance peculiarity. Furthermore, some other features were useful to generate more discriminative predictions in detecting the user's intention. Examples include the location of a comment in a web forum [20], the speaking preferences of users [20], the dialogue topic context of the same user [21], and the emotion transition traits of a user's blog [22]; in addition, the ratings and comments of products on shopping websites were treated as weak labels to learn the sentence representation [34].

B. LANGUAGE REPRESENTATION MODEL
Recently, the language representation model improved significantly in many NLP tasks, such as textual entailment, semantic similarity, reading comprehension, and question answering [23]. The language representation models can provide powerful context-dependent representations by pre-training on a large scale unlabeled data, such as Contextualized Word Representations (ELMo) [24], Generative Pre-trained Transformer (GPT) [25] and Bidirectional Encoder Representations from Transformers (BERT) [5]. Besides, these models can be easily applied to different downstream tasks with minimum parameters. Therefore, we exploit the concept of pre-trained language model representation to construct a novel utterance feature embedding model in this paper.

C. METRIC LEARNING
Using a deep neural network with a distance metric to learn feature embeddings has been successfully applied to many tasks, such as face recognition [26], speech recognition [27], [28] and speaker identification. For example, Google's FaceNet [26] utilized a semi-hard triplet mining approach to compose triplets of facial pictures, which obtained excellent performance. He et al. [29] achieved outstanding performance on 3D object retrieval by combining triplet loss and center loss. Huang et al. [30] applied triplet loss in training to automatically recognize the emotional state in spoken language. To deal with spoken language, Cambria [31] presented a system that directly learned a mapping from speech features to a compact fixed-length speaker discriminative embedding. The triplet loss function focuses on fine-grained identification and adds a measurement in the latent space, which can help the model distinguish the details.

D. MULTI-SOURCE FUSION
Generally, the exceptional performance of a classification model depends to a great extent on sufficiently large training corpora. To comprehensively understand sentences, a fusion strategy can aggregate multiple sources to enrich the features and boost learning performance [31]. Majumder et al. [32] fused multimodal resources like audio, video, and text for sentiment analysis. Tay et al. [33] generated sentence representations by using a gating mechanism to combine the sentence token features and sentiment lexicon features. Sun et al. [35] detected emotional elements by using a mixed model to extract sentimental objects and their tendencies from product reviews. In particular, the multi-stream architecture is prevalent in data fusion. For example, Simonyan and Zisserman [36] designed a two-stream ConvNet architecture to capture spatial and temporal features, which achieves significant performance even with limited training data. Inspired by these experiments, we use the fusion strategy in the downstream task to enhance the utterance feature representation.

III. PROPOSED METHOD
Before describing the proposed method in detail, we introduce the mathematical notation for the intention detection task. In this experiment, we deal with the intention detection task based on multi-class classification learning. Suppose we have n utterance sequences $X = \{x_1, x_2, \ldots, x_n\}$ with the corresponding sequence of intent labels $Y = \{y_1, y_2, \ldots, y_n\}$. Each utterance $x_i$ of a dialogue is composed of a sequence of words $x_i = \{w_1, w_2, \ldots, w_j\}$. Given an unseen utterance $x_i$, the purpose of this paper is to construct a model that learns a valid feature representation and accurately predicts the corresponding intent label $y_i$. Besides, we evaluate the proposed model on single-turn task-oriented dialogue and multi-turn conversation. It is worth noting that the multi-turn conversation contains the speaker's role information, so we supplement the role information as a feature in the downstream task. Each utterance corresponds to a speaker tag in $C = \{c_1, c_2, \ldots, c_n\}$.

A. THE WHOLE FRAMEWORK
This section introduces the whole framework of the proposed model. The entire structure consists of three parts: triplet sample selection, triplet training, and the downstream task of intention classification. Firstly, the system needs a sampling strategy to generate valid triplet data $(x_i^a, x_i^p, x_i^n)$ as training objects. One triplet sample consists of an anchor sample $x_i^a$, a positive sample $x_i^p$, and a negative sample $x_i^n$. Then, we input all the triplet samples into the Siamese encoder and train the model with a triplet loss function. The triplet training model uses the same weights on different inputs and learns to separate the two related samples of the same class $(x_i^a, x_i^p)$ from the negative sample $x_i^n$. To avoid meaningless calculation in the training process, we verify whether triplet samples are valid by setting a margin parameter and observing the Euclidean distance between the embedded triplets in the test section. After the training, we obtain a robust pre-trained feature embedding model, which can better reflect the specific characteristics of an utterance. Secondly, given the well-defined feature embedding model with its parameters, we use it to map utterances in the downstream task. The critical components of triplet training are the Siamese model selection and the triplet data composition. Therefore, these essential components and our modifications are described in the following subsections.

B. THE TRIPLET SIAMESE NEURAL NETWORK
1) TRIPLET LOSS TRAINING
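The triplet loss function is calculated on the triplet data $(x_i^a, x_i^p, x_i^n)$, in which the anchor sample $x_i^a$ and the positive sample $x_i^p$ are extracted from the same intention category. We obtain the negative sample $x_i^n$ from an intention category different from that of $x_i^a$ and $x_i^p$. We exploit the feature embedding model $f_\theta(x) \in \mathbb{R}^d$ to map utterance triplets to a d-dimensional Euclidean space, and the distances are measured in the resulting latent space:
$$\| f_\theta(x_i^a) - f_\theta(x_i^p) \|_2^2 + \alpha < \| f_\theta(x_i^a) - f_\theta(x_i^n) \|_2^2, \quad \forall \left( f_\theta(x_i^a), f_\theta(x_i^p), f_\theta(x_i^n) \right) \in T \tag{1}$$
Here $f_\theta(\cdot)$ refers to the Siamese encoder, $f_\theta(x_i^a)$, $f_\theta(x_i^p)$ and $f_\theta(x_i^n)$ are the outputs from the Siamese encoder, and T is the set of all possible triplets in the training set.
The triplet loss optimizes the model by minimizing the distance between $f_\theta(x_i^a)$ and $f_\theta(x_i^p)$ and maximizing the distance between $f_\theta(x_i^a)$ and $f_\theta(x_i^n)$ by at least a margin parameter $\alpha \in \mathbb{R}^+$. The triplet loss $L_{triplet}$ is as follows:
$$L_{triplet} = \sum_{i=1}^{N} \left[ \| f_\theta(x_i^a) - f_\theta(x_i^p) \|_2^2 - \| f_\theta(x_i^a) - f_\theta(x_i^n) \|_2^2 + \alpha \right]_+$$
where N stands for the number of triplets in the training set, i denotes the i-th triplet sample, and $[\cdot]_+$ denotes $\max(\cdot, 0)$. During triplet training, generating all possible triplets yields many triplets that already satisfy the constraint easily, which results in slower convergence. Therefore, it is vital to select valid triplet samples to improve training efficiency. The following section describes the triplet sampling strategies.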
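The margin-based loss described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the squared Euclidean distance, the margin value 0.2, and the function name `triplet_loss` are assumptions made for the example.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss over a batch of embedding triplets.

    Each argument has shape (N, d). The margin value is an assumed
    placeholder, and squared Euclidean distance is an assumption.
    """
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    # Squared distance between anchor and positive / anchor and negative.
    d_ap = np.sum((anchor - positive) ** 2, axis=1)
    d_an = np.sum((anchor - negative) ** 2, axis=1)
    # A triplet contributes only when the negative is not yet
    # farther away than the positive by at least the margin.
    per_triplet = np.maximum(d_ap - d_an + margin, 0.0)
    return per_triplet.sum(), per_triplet


# Toy batch: the first triplet is already well separated,
# the second has a negative that coincides with the anchor.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
p = np.array([[1.0, 0.0], [0.0, 1.0]])
n = np.array([[0.0, 1.0], [0.0, 1.0]])
total, per = triplet_loss(a, p, n, margin=0.2)
```

In this toy batch only the second triplet violates the margin, so only it contributes to the loss; triplets that already satisfy the margin produce no gradient, which is why triplet selection matters for convergence.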

2) TRIPLET SAMPLING STRATEGY
It is crucial to comply with the triplet constraint to ensure fast convergence. Based on this constraint, we adopt two sampling strategies to extract triplets: the random sampling strategy and the sequential sampling strategy.
The random sampling strategy randomly composes triplets as training objects without order. Initially, we design a generator to randomly sample two different intention categories from all N intention candidates, which yields a total of N(N − 1)/2 possible intention category pairs. For each selected pair, we randomly choose one category as the negative label and the other as the positive label. Then, we randomly select one utterance from the negative label and two utterances from the positive label. We combine the three selected utterances into one triplet for training. After each epoch, we resample the triplets based on the batch size.
In contrast, the multi-turn dialogue datasets show specific correlations between adjacent utterances and adjacent intents. For example, the 'Question' tag is frequently followed by the 'Affirmative' tag, and the 'Request' tag always connects with the 'Repeat Response' tag. The disadvantage of the random sampling strategy is that it composes triplets without order, so it cannot take the context into account in triplet selection. Therefore, the encoder might learn useless context information from utterances in random order. From this point of view, we keep the intention transition traits in triplet selection. To this end, we keep the original intent sequence order for the anchor samples. Then we randomly select other utterances from the same intention category as the anchor samples to serve as positive samples. We form the negative utterance sequences from intention categories that differ from the intention category of the anchor utterances.
Then, we input the triplets into the Siamese encoders to train the feature embedding models. Through the sequential sampling strategy, the Siamese encoder can learn valid context information in training. The following sections describe the Siamese neural networks.
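The two sampling strategies described above can be sketched as follows. This is a hedged illustration under stated assumptions: the function names, the toy intent sequence, and the rule of skipping a category that has fewer than two utterances are all hypothetical choices for the example, not details confirmed by the paper.

```python
import random
from collections import defaultdict


def group_by_intent(intents):
    """Map each intent label to the indices of the utterances carrying it."""
    groups = defaultdict(list)
    for idx, label in enumerate(intents):
        groups[label].append(idx)
    return groups


def random_triplets(intents, num_triplets, rng):
    """Random strategy: draw two distinct categories, then the utterances."""
    groups = group_by_intent(intents)
    labels = list(groups)
    if len(labels) < 2 or not any(len(groups[l]) >= 2 for l in labels):
        raise ValueError("need two categories, one with two utterances")
    triplets = []
    while len(triplets) < num_triplets:
        # rng.sample returns the two labels in random order, so the
        # positive/negative roles are assigned randomly as well.
        pos_label, neg_label = rng.sample(labels, 2)
        if len(groups[pos_label]) < 2:
            continue  # assumed rule: retry when no anchor-positive pair exists
        anchor, positive = rng.sample(groups[pos_label], 2)
        negative = rng.choice(groups[neg_label])
        triplets.append((anchor, positive, negative))
    return triplets


def sequential_triplets(intents, rng):
    """Sequential strategy: anchors follow the original dialogue order."""
    groups = group_by_intent(intents)
    labels = list(groups)
    triplets = []
    for anchor, label in enumerate(intents):
        candidates = [i for i in groups[label] if i != anchor]
        other_labels = [l for l in labels if l != label]
        if not candidates or not other_labels:
            continue
        positive = rng.choice(candidates)
        negative = rng.choice(groups[rng.choice(other_labels)])
        triplets.append((anchor, positive, negative))
    return triplets


# Toy intent sequence of one dialogue (labels borrowed from the text).
intents = ["Question", "Affirmative", "Question",
           "Request", "Affirmative", "Request"]
rng = random.Random(0)
ran = random_triplets(intents, 4, rng)
seq = sequential_triplets(intents, rng)
```

Each triplet is a tuple of utterance indices `(anchor, positive, negative)`. In the sequential variant the anchors appear in the original utterance order, so a context-aware encoder sees the intent transitions as they occur in the dialogue.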

3) SIAMESE RMCNN NEURAL NETWORK
We modify the RMCNN model as a Siamese encoder to train the utterance triplets and generate a fixed-dimension representation. Firstly, we have n utterances $X = \{x_1, x_2, \ldots, x_n\}$ in the dialogue. Each utterance contains variable-length word tokens $x_i = \{w_1, w_2, \ldots, w_j\}$. After triplet sampling, we obtain utterance triplet samples. For each utterance sample in a triplet, we embed the word tokens into vectors $E = \{e_1, e_2, \ldots, e_n\}$ through a trainable embedding matrix pre-trained on enormous unlabeled data. The bidirectional GRU model encodes the sequence of token embeddings to produce the corresponding hidden vectors $H = \{h_1, h_2, \ldots, h_i\}$, and extracts the context information by concatenating the hidden states from the forward and backward directions. The operation of the bidirectional GRU is formulated as follows:
$$\overrightarrow{h_t} = \overrightarrow{\mathrm{GRU}}(e_t, \overrightarrow{h_{t-1}}), \quad \overleftarrow{h_t} = \overleftarrow{\mathrm{GRU}}(e_t, \overleftarrow{h_{t+1}}), \quad h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$
in which $h_t$ maintains the sequence information of the utterance. Then, we feed the output from the Bi-GRU layer into the CNN layer. The CNN model can capture fine-grained local features inside a multi-dimensional field. The convolutional operation includes a real-valued filter $W_c$, which is applied to a window of l continuous word vectors to produce a new feature map. A scalar feature $c_i$ is generated from a window of words $h_{i:i+l}$ by:
$$c_i = f(W_c \bullet h_{i:i+l} + b_c)$$
where the symbol $\bullet$ indicates the dot product operation, l refers to the width of the convolutional kernel, f is a non-linear function (ReLU), $W_c$ is the convolutional matrix, and $b_c$ is a bias term. Each kernel corresponds to an utterance detector that extracts specific n-gram patterns at various granularities. The kernel is applied to each possible region matrix to produce a feature map:
$$c = [c_1, c_2, \ldots, c_m]$$
in which m is the number of channels. The pooling layer can extract local dependencies in different regions to preserve the most useful information.
Then, we apply the pooling layers to capture the most valuable feature from each feature map, which include the global maximum pooling layer and the global average pooling layer. The outputs from the two pooling layers are concatenated together as the local phrase feature of the dialogue:
$$\hat{c} = [\mathrm{gmp}(c); \mathrm{gap}(c)]$$
where 'gmp' indicates the global maximum pooling layer and 'gap' indicates the global average pooling layer. Then, the outputs of the pooling layers for the different kernel widths are concatenated. Finally, three fully connected layers with 'tanh' activation are stacked together, followed by an L2-normalization layer, to form the final utterance embedding. The Siamese RMCNN neural network is optimized by minimizing the triplet loss, and the Adam optimizer is used during training.
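The convolution, pooling, and normalization steps described above can be sketched with NumPy. This is only an illustration under stated assumptions: random vectors stand in for the Bi-GRU hidden states, the three fully connected 'tanh' layers are omitted, and the toy sizes (6 tokens, 8 hidden units, 4 channels) and all function names are hypothetical.

```python
import numpy as np


def conv1d_relu(hidden, kernel, bias):
    """Slide a kernel of width l over the hidden states and apply ReLU.

    hidden: (T, d) sequence of hidden vectors.
    kernel: (l, d, m) filter bank with m output channels.
    Returns a feature map of shape (T - l + 1, m).
    """
    T, _ = hidden.shape
    l, _, m = kernel.shape
    out = np.empty((T - l + 1, m))
    for i in range(T - l + 1):
        window = hidden[i:i + l]  # l consecutive hidden vectors
        out[i] = np.tensordot(window, kernel, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)


def pooled_features(feature_map):
    """Concatenate global max pooling and global average pooling."""
    return np.concatenate([feature_map.max(axis=0), feature_map.mean(axis=0)])


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    return v / max(np.linalg.norm(v), eps)


rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 8))  # stand-in for Bi-GRU outputs
pooled = []
for width in (1, 2, 3):           # kernel widths used in the paper's setup
    kernel = rng.normal(size=(width, 8, 4))
    feature_map = conv1d_relu(hidden, kernel, np.zeros(4))
    pooled.append(pooled_features(feature_map))
embedding = l2_normalize(np.concatenate(pooled))
```

Because the final vector has unit length, Euclidean distances between embeddings are bounded, which keeps the margin in the triplet loss on a comparable scale across utterances.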

4) SIAMESE BERT NEURAL NETWORK
In this section, we fine-tune the pre-trained BERT model as a Siamese encoder to train utterance triplet samples. Given the sequence utterances $X = \{x_1, x_2, \ldots, x_n\}$, the encoder applies multi-head self-attention, and the outputs of the attention heads at position i are concatenated:
$$s_i = [s_i^{(1)}, \ldots, s_i^{(k)}] W_O \tag{14}$$
in which k is the number of attention heads, h is the dimension of the hidden states, and $d_s$ is the scaling parameter of the scaled dot-product attention. $W_Q$, $W_K$, $W_V$ and $W_O$ indicate the model parameters. The output of the residual connection and the normalization module is denoted as $\tilde{S} = \{\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_N\}$. The output of the position-wise fully connected sublayer $O = \{o_1, o_2, \ldots, o_N\}$ is calculated as follows:
$$o_i = W_2 \, g(W_1 \tilde{s}_i + b_1) + b_2 \tag{16}$$
in which $W_1$, $W_2$, $b_1$ and $b_2$ are the model parameters and g is the non-linear activation of the sublayer. A residual connection layer and a normalization layer follow the sublayer and produce the final contextual representation. We feed the final contextual representation into three fully connected layers with 'tanh' activation and an L2-normalization layer to get the final utterance embedding. The Siamese BERT encoder is optimized with the triplet loss function through end-to-end backpropagation, and the Adam optimizer is utilized during training.

C. FEATURE FUSION IN DOWNSTREAM TASK
1) FEATURE-BASED STRATEGY
Fine-tuning the pre-trained language model can save expensive pre-computing. The pre-trained feature representation can be easily tested in many experiments with cheaper models on top of this representation [37]. Therefore, there is no need to train complex models afterward. In this paper, we verify our pre-trained feature embedding model by utilizing the feature-based strategy for the downstream task. The feature-based strategy passes utterance features from the well-defined pre-trained language model to different downstream tasks.
The intention detection task in our experiment is based on the multi-class classification learning method, which can be seen in Fig. 2. The pre-trained feature embedding models ($f_{RMCNN}$, $f_{BERT}$) form two robust utterance representations from different semantic aspects, which are denoted as follows:
$$U_{RMCNN} = f_{RMCNN}(x_i), \quad U_{BERT} = f_{BERT}(x_i)$$
Then, we feed the utterance features $U_{BERT}$ and $U_{RMCNN}$ into the fully connected layers, respectively. We use the Softmax classifier to predict the probability distribution of intention labels, where $W_U$, $b_U$, $W_Q$, and $b_Q$ are the model parameters of the classifier. We take cross-entropy as the loss function and Adam as the optimizer during training. End-to-end backpropagation is employed in the training process.

2) MULTI-FEATURE FUSION STRATEGY
The multi-source fusion strategy can effectively improve the performance of natural language learning by drawing on various relevant resources [38]. Inspired by this conception, we employ a fusion strategy to accumulate semantic information about an utterance from several aspects, such as utterance granularity, dialogue structure, and speaker information, which can be seen in Fig. 3. The same sentence may express different meanings when viewed from different aspects. To be specific, the RMCNN model can capture the global structural features of the input sentence. The BERT model remedies the limitation of insufficient training corpora and provides more external knowledge about common utterance words. In addition, the participants in a multi-turn conversation have different roles and speaking preferences in various domains, which can also be regarded as a distinctive feature to enhance utterance differences. We denote the speaker information in the model as 'C'. Specifically, we use numerical values to represent different speakers. We build a two-stream fusion model to integrate the utterance features from different models and show their different aspects. Firstly, we set two pre-trained feature embedding models as two streams to encode an utterance from different aspects. We feed the sequence of word tokens into the models independently and obtain the optimal parameters of each model. In this section, we compose the utterance encoder using the two models with their optimal settings. After the optimal parameters are trained in each stream, the outputs from each stream are concatenated together and then input to the classifier. Then, we extend the utterance representation to $U_{all} = [U_{RMCNN}, U_{BERT}, U_{Speaker}]$. Precisely, $U_{RMCNN}$ refers to the structural feature learned from the Siamese RMCNN model, $U_{BERT}$ refers to the fine-grained contextual feature learned from the BERT triplet model, and $U_{Speaker}$ is an additional feature that refers to the speaker's role aligned with each utterance.
Then, all the features are concatenated together into a comprehensive utterance representation. The Softmax function is connected to the encoders to calculate the probability distribution, and the output is $P = \{p_1, p_2, \ldots, p_n\}$, in which n is the number of intention labels and $p_i$ is the predicted probability that the utterance belongs to the corresponding intent tag i. The final predicted tag is $\hat{y} = \arg\max(P)$. The model is optimized by minimizing the cross-entropy loss, and the Adam optimizer is used during training.
VOLUME 8, 2020
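The fusion and prediction steps described above can be sketched as follows. This is a minimal sketch under stated assumptions: random vectors stand in for the two pre-trained stream outputs, a single linear layer replaces the trained classifier, and the feature sizes, the five labels, and the function names are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def fuse_and_predict(u_rmcnn, u_bert, speaker_id, W, b):
    """Concatenate both stream features with a numeric speaker value,
    then return the fused vector, label probabilities, and the argmax."""
    u_all = np.concatenate([u_rmcnn, u_bert, [float(speaker_id)]])
    probs = softmax(W @ u_all + b)
    return u_all, probs, int(np.argmax(probs))


rng = np.random.default_rng(0)
u_rmcnn = rng.normal(size=4)   # stand-in for the Siamese RMCNN feature
u_bert = rng.normal(size=3)    # stand-in for the Siamese BERT feature
W = rng.normal(size=(5, 8))    # 5 intent labels, 4 + 3 + 1 fused inputs
b = np.zeros(5)
u_all, probs, pred = fuse_and_predict(u_rmcnn, u_bert, 1, W, b)
```

Appending the speaker as a single numeric value follows the description in the text; a one-hot encoding would be a natural alternative when the speaker identifiers have no ordinal meaning.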

IV. EXPERIMENT
A. DATASETS
We evaluate the proposed model on several benchmark datasets. The evaluation targets of the intention detection task include not only task-oriented dialogues but also multi-turn dialogues. In previous studies [6], the intention detection task on multi-turn conversation is regarded as multi-class classification. Therefore, we convert the multi-turn conversation from its nested dialogue structure into a flat structure, so that the utterance triplets can be properly sampled. Besides, we also perform a series of pre-processing steps with Stanford's CoreNLP tool [39] to reduce text noise, such as utterance tokenization and word lemmatization.
We introduce three single-turn task-oriented dialogue datasets and two multi-turn dialogue datasets, which are listed below:
• The SNIPS dataset [40] is collected from the Snips personal voice assistant and contains 7 intent types. The number of samples for each intention label is approximately the same.
• The ATIS dataset [41] consists of audio recordings of people making flight reservations. The training set includes utterances, and the test set contains 893 utterances. We follow the previous experiment and set aside a validation set of 500 utterances from the training set. There are 21 intention labels in the dataset.
• Facebook's multilingual dataset [42] contains annotated utterances in English, Spanish, and Thai. It covers the weather, alarm, and reminder domains. There are 12 intention labels in the training set.
• The Daily Dialogue dataset [43] is a high-quality multi-turn dialogue dataset, which mainly records dialogues about people's everyday life. Each utterance of the Daily Dialogue dataset is manually labeled with a topic tag, an intention tag, and an emotion tag.
• The ICSI Meeting Recording Dialogue Act (MRDA) dataset [44] contains 72 hours of multi-party meeting speech from 75 naturally occurring meetings. The original tag set of MRDA includes 11 general tags and 39 specific tags. Based on the previous experiments, we utilize the most widely used class map to cluster all tags into 5 groups of intention categories.

B. HYPER-PARAMETERS TUNING
In this section, we describe the parameters used in model training, which are associated with the triplet training process and the downstream task. All the work is implemented under the TensorFlow framework.
For triplet training with the Siamese RMCNN model, we pad each utterance to the maximum length. We initialize the word vectors with 300-dimensional word2vec vectors. We set the dropout to 0.3 after the embedding layer to avoid over-fitting. The hidden size of the Bi-GRU is 512 in one direction. We use multiple kernel sizes (1, 2, 3) in the CNN layer to encode different utterance granularities, and the filter size is 256. Three fully connected layers and an L2-normalization layer follow. We use the Adam optimizer with a learning rate of 2e-4 and a weight decay of 1e-6.
For the Siamese BERT model, we fine-tune the BERT model with metric learning to obtain utterance features. The pre-trained BERT encoder is trained on unlabeled data, namely the Books corpus (800M words) and English Wikipedia (2500M words). The maximum length of an utterance is 50. The BERT-base model has 12 layers, 768 hidden states, and 12 heads. The hidden dim of the token embedding is 50. We use the Adam optimizer with a learning rate of 3e-5 and a weight decay of 1e-6. For the other parameters, we follow the original BERT paper [5].
Furthermore, we utilize the feature-based strategy in the downstream intention detection tasks. The pre-trained RMCNN and BERT feature embedding models are each employed as the encoder of a single stream. In this stage, we set the hidden size to 64, use the Adam optimizer with a learning rate of 2e-4, and set the batch size to 256.
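For reference, the settings reported in this section can be collected into a single configuration object. The values below are taken from the text; the dictionary layout and key names are hypothetical and are not the authors' configuration format.

```python
# Hyper-parameters as reported in this section; key names are illustrative.
HPARAMS = {
    "siamese_rmcnn": {
        "word_vector_dim": 300,     # word2vec initialization
        "embedding_dropout": 0.3,
        "gru_hidden_size": 512,     # per direction
        "kernel_sizes": (1, 2, 3),
        "num_filters": 256,
        "learning_rate": 2e-4,
        "weight_decay": 1e-6,
    },
    "siamese_bert": {
        "max_utterance_length": 50,
        "num_layers": 12,
        "hidden_size": 768,
        "num_heads": 12,
        "learning_rate": 3e-5,
        "weight_decay": 1e-6,
    },
    "downstream": {
        "hidden_size": 64,
        "learning_rate": 2e-4,
        "batch_size": 256,
    },
}

# Both triplet-training stages share the same weight decay.
shared_decay = {cfg["weight_decay"]
                for name, cfg in HPARAMS.items() if name != "downstream"}
```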

C. BASELINES
We compare the proposed model with several state-of-the-art baseline models. For the single-turn task-oriented datasets, the baselines include the following:
• Attention-BiRNN [45] utilizes an encoder-decoder model for jointly learning the intention detection task and the slot-filling task. An attention-weighted sum of all encoded hidden states is used to recognize the intention.
• Slot-Gated Attention [46] uses slot-gated LSTM to learn context vector, which improves the performance of intention classification.
• Capsule-NLU [47] accomplishes the intention detection by exploiting the hierarchical semantic information. They propose a re-routing schema to synergize further the slot filling performance using the inferred intention representation.
• Joint BERT [48] uses joint intention classification and slot filling based on the pre-trained BERT model.
• BERT-SLU [49] provides a novel encoder-decoder framework based on a multi-class classification method to jointly learn intention detection and slot filling. The model uses BERT as an encoder to train on utterances and then designs a decoder to detect the intention label.
• Cross-Lingual transfer [42] uses a multilingual machine translation encoder as contextual word representations to predict intents.
According to previous studies, several multi-turn dialogue datasets also contain the intention detection task. We therefore also verify the model on multi-turn dialogue datasets to evaluate its generalization capability, and compare our model with the existing baselines, which include:
• is a simple baseline model, which applies the text feature and a multi-class classification algorithm to dialogue act classification.
• The CNN [17] method encodes the utterance with a CNN model and classifies it with a Softmax classifier. The encoder considers the two preceding utterances as context information in the experiment.
• The Bi-LSTM-CRF [18] method uses a hierarchical bidirectional LSTM as the encoder to learn the conversation representation and a conditional random field as the top layer to predict the intention label.
• CRF-ASN [49] incorporates hierarchical semantic inference with a memory mechanism for utterance modeling at multiple levels and uses a structured attention network on the linear-chain CRF to dynamically separate the utterance into cliques.
• Dual-Attention [50] uses a novel dual task-specific attention mechanism to capture interaction information between intents and conversation topics for utterances.
• SelfAttn-CRF [51] uses a hierarchical deep neural network to model different levels of utterances and dialogue acts and uses a CRF to predict dialogue acts.

V. DISCUSSION
A. THE RESULT ANALYSIS
Table 3 and Table 4 show the intention detection accuracy on the different datasets. The prefix RAN denotes the random triplet sampling strategy, and SEQ denotes the sequential triplet sampling strategy. For example, RAN-BERT is the random sampling strategy with the BERT model as the Siamese encoder, and SEQ-BERT is the sequential sampling strategy with the BERT model as the Siamese encoder. The remaining model names follow the same convention. As Table 3 and Table 4 show, the proposed model significantly outperforms the baseline models and achieves state-of-the-art performance on the Snips, Facebook (EN), and DYDA datasets. Although the proposed model does not obtain state-of-the-art results on the ATIS and MRDA datasets, the results still indicate that its feature learning ability is useful. For the task-oriented dialogue datasets, the proposed feature learning model achieves a recognition accuracy of 99.29% (from 98.96%) on the Snips dataset and 99.22% (from 99.11%) on the Facebook (EN) dataset. The fusion features improve the performance slightly further, reaching 99.31% on the Snips dataset, 99.56% on the ATIS dataset, and 99.28% on the Facebook (EN) dataset. For the multi-turn dialogue datasets, the SEQ-CNN, SEQ-RCNN, and SEQ-BERT models improve the accuracy on the DYDA dataset over the state-of-the-art model by 0.6%, 2.9%, and 1.5%, respectively. The multi-source data fusion compensates for data sparsity to a certain extent. It outperforms the other methods because it integrates a wide range of available features, achieving 91.3% on the DYDA dataset and 91.0% on the MRDA dataset.
However, the gains on the ATIS and MRDA datasets are slight. One reason is that the data distributions in these two datasets are both imbalanced. In the MRDA dataset, the class "Statement" accounts for more than 50% of the intention labels. In the ATIS dataset, the intention label "flight" also accounts for almost half of the total training data. Under the sampling strategy, the sampled utterances are affected by the proportion of intent categories in the dataset, so it is difficult for the model to learn exact features for classes with very few examples. Another reason is that ambiguity in label correlation and label annotation is harmful to triplet feature learning. The MRDA dataset was found to have a high negative correlation between previous label entropy and accuracy, which indicates the impact of label noise. Some utterances in the ATIS dataset carry more than one label, whereas this experiment only studies a single intent per utterance, which affects the results to some extent. The last reason is that the triplet training method adopts a flat dialogue structure to compose utterance triplets and predicts the intents with a multi-class classification approach in the downstream task. The model focuses only on the current utterance and ignores the hierarchical context structure, which damages the recognition performance on multi-turn conversations. In future work, we need to consider how to integrate triplet training more effectively with nested, structured dialogue.
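The influence of class imbalance on triplet sampling can be illustrated with a small sketch. It assumes that anchors are drawn uniformly from the utterances, which is only one possible reading of the sampling step. The label counts and the helper name `anchor_label_share` are hypothetical and are not taken from the ATIS or MRDA statistics.

```python
import random
from collections import Counter


def anchor_label_share(labels, num_triplets, seed=0):
    """Draw anchors uniformly from the utterances and return each label's share."""
    rng = random.Random(seed)
    counts = Counter(rng.choice(labels) for _ in range(num_triplets))
    return {label: counts[label] / num_triplets for label in sorted(set(labels))}


# Hypothetical, made-up label counts: one class covers half of the data.
labels = (["flight"] * 500 + ["airfare"] * 300
          + ["ground_service"] * 150 + ["meal"] * 50)

shares = anchor_label_share(labels, num_triplets=10_000)
for label, share in shares.items():
    print(f"{label:15s} {share:.3f}")
```

Under this assumption the dominant class supplies roughly half of all anchors, while the rarest class supplies only a few percent, so the embedding model sees far fewer triplets anchored on minority intents.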

B. ABLATION STUDIES
The previous section showed the improvement achieved by the proposed model; in this section, we explore the contribution of each part. We first perform ablation studies to verify whether the proposed feature embedding models contribute to the intention classification task. Then, we examine the effect of the BERT model selection. Next, we study the impact of the sampling strategy selection. The margin parameter selection is also vital for model optimization, so we test a wide range of margin parameters in the experiment. Finally, we use the T-SNE visualization method to inspect the pre-trained feature learning models. Table 5 compares the basic models and the proposed triplet training models on the different dialogue datasets. To validate the generalization ability of the proposed model, we also add the other multilingual Facebook data (the Spanish and Thai versions) to the experiment. The CNN and RCNN models require language-specific text preprocessing, so they are not comparable in this setting. Hence, we fine-tune the pre-trained multilingual BERT model to evaluate these two datasets. We run the comparative experiments under fixed hyperparameters and parameters.

1) THE EFFECT OF THE ENCODER SELECTION
The results in Table 5 indicate that the pre-trained feature learning models are able to learn more discriminative feature representations for the intention classification task. Specifically, the fine-tuned BERT model performs better than the RMCNN model among the basic models. However, triplet training significantly improves the learning ability of RMCNN. From Table 5, the SEQ-RMCNN model performs better than the BERT and CNN encoders on the Snips, ATIS, Facebook, and DYDA datasets. We attribute this to the fact that the combination of Wikipedia embeddings and the RMCNN model can effectively capture fine-grained local semantic details. The Siamese BERT encoder also improves the intention classification results because the pre-trained BERT model provides rich semantic information obtained through unsupervised training on enormous external knowledge. The results demonstrate that the pre-trained feature embedding model can effectively improve conventional multi-class classification by supplementing it with utterance triplet training.

2) THE EFFECT OF THE SAMPLING STRATEGY
In this section, we discuss the effect of the sampling strategy on the classification results. The results in Table 5 show that both sampling strategies effectively improve on the basic models (without triplet training). Specifically, the sequential method is slightly better than the random method. The multilingual datasets also show that the sequential strategy is better than the random strategy: SEQ-BERT improves over RAN-BERT by 0.76% on the Facebook (Spain) dataset and by 2% on the Facebook (Thai) dataset. A likely reason is that, with random selection, the feature learning model may learn useless context information.
Furthermore, we compare the results for each intention label of the DYDA dataset to show in detail the effect of the different strategies on context-sensitive data. As shown in Fig. 4, the DYDA dataset has four intention labels: Inform, Commissive, Question, and Directive. The proposed models generally perform well on the labels "Inform" and "Question" because these two intents appear often in spoken language. Although the models perform poorly on the label "Commissive" because of the lack of data, the sequential strategy still makes the feature representation more distinguishable. Specifically, the result of SEQ-CNN is 0.25 higher than that of RAN-CNN, and the result of SEQ-RMCNN is 0.26 higher than that of RAN-RMCNN. For the "Directive" label, the sequential strategy yields gains of 0.24 with CNN, 0.28 with RMCNN, and only 0.08 with BERT. Therefore, the sequential sampling strategy can effectively select valid utterance triplets for spoken language data.
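To make the contrast between the two strategies concrete, the following sketch builds one triplet of utterance indices per strategy. The exact sampling procedures are those defined earlier in the paper; the code below is only an illustration under stated assumptions. In particular, `sequential_triplet` ASSUMES that "sequential" means taking the nearest following utterances in dataset order as the positive and negative samples. Both function names and the toy label sequence are hypothetical.

```python
import random


def random_triplet(labels, anchor_idx, rng):
    """Pick the positive and the negative uniformly at random from the whole data set."""
    anchor_label = labels[anchor_idx]
    positives = [i for i, y in enumerate(labels)
                 if y == anchor_label and i != anchor_idx]
    negatives = [i for i, y in enumerate(labels) if y != anchor_label]
    return anchor_idx, rng.choice(positives), rng.choice(negatives)


def sequential_triplet(labels, anchor_idx):
    """ASSUMPTION: use the nearest following utterances (wrapping around) as samples.

    Requires the anchor label to occur at least twice in `labels`.
    """
    n = len(labels)
    anchor_label = labels[anchor_idx]
    order = [(anchor_idx + k) % n for k in range(1, n)]
    positive = next(i for i in order if labels[i] == anchor_label)
    negative = next(i for i in order if labels[i] != anchor_label)
    return anchor_idx, positive, negative


# Toy dialogue labels (made up for illustration).
labels = ["Question", "Inform", "Question", "Directive", "Inform", "Commissive"]

print(random_triplet(labels, 1, random.Random(0)))  # negative may be any non-Inform index
print(sequential_triplet(labels, 0))                # (0, 2, 1): neighbours of the anchor
```

Under this reading, the sequential triplet stays close to the anchor's dialogue context, whereas the random triplet can pair the anchor with utterances from unrelated parts of the data.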

3) THE EFFECT OF THE BERT MODEL SELECTION
In this section, we study the influence of the choice of pre-trained BERT model on the single-turn dialogue datasets. The pre-trained BERT models are publicly released in Google's GitHub repository (https://github.com/google-research/bert). They include monolingual and multilingual versions. According to the results, the monolingual BERT model benefits the English dataset but improves less on the Facebook (Spain) and Facebook (Thai) datasets, whereas the multilingual model effectively improves the performance on the cross-language datasets. Therefore, we use monolingual models for the English datasets and multilingual models for the other language datasets. In addition, the BERT models come in two uncased versions and two cased versions, so we compare basic BERT and BERT triplet training on the English datasets. To keep the number of parameters in the interaction system to a minimum, we only verify the base model. From Table 6, we can see that the uncased model performs better than the cased model for utterance representation. The random sampling strategy may degrade the performance of the cased model on the Snips and Facebook datasets. In the following experiments, we adopt the BERT uncased base model as the Siamese BERT encoder to train utterance triplets.
Moreover, we verified the effect of token embedding on the task-oriented dialogue datasets. We assume that token embedding may provide finer-grained semantic information about utterances than sentence embedding. Therefore, we compare sentence embedding and token embedding on all task-oriented dialogue datasets. We denote token embedding by T in Table 7 and Table 8. As these tables show, token embedding enhances the semantic information of the utterance and improves the performance of intention detection. Therefore, we choose token embedding as the utterance feature representation in this experiment.

4) THE EFFECT OF THE MARGIN PARAMETER
As mentioned in (16), the margin parameter controls the relative distance from a feature embedding to its positive samples and to its negative samples. Therefore, the margin parameter selection is essential for model convergence and optimization. From Fig. 5, we can observe that the triplet loss optimization is sensitive to the margin parameter. A margin that is either too large or too small results in inferior performance. A large margin may cause over-fitting, and a small margin may weaken the triplet loss because a small value is not enough to distinguish fine details. Therefore, we test different margin parameters under fixed hyperparameters to observe their impact on recognition performance. We evaluate margin values over a wide range, from 0.1 to 20, and list the final choice for each dataset: 5 for the Snips dataset, 1 for the ATIS dataset, 1.5 for the Facebook dataset, and 15 for the DYDA and MRDA datasets. We keep these margin parameters fixed in the following experiments.
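The role of the margin can be seen in a minimal sketch of the standard hinge-style triplet loss, max(0, d(a, p) - d(a, n) + margin). This is the common textbook form with plain Euclidean distance and is assumed here for illustration; the precise formulation used in the paper is the one given in (16), which may differ (for example, by using squared distances). The helper names and the toy embeddings are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D embeddings (made-up values, not taken from the paper).
anchor, positive, negative = [0.0, 0.0], [3.0, 4.0], [6.0, 8.0]

# d(a, p) = 5 and d(a, n) = 10, so the negative is already 5 units farther away.
for margin in (1.0, 5.0, 15.0):
    print(margin, triplet_loss(anchor, positive, negative, margin))
# margin 1.0 -> 0.0, margin 5.0 -> 0.0, margin 15.0 -> 10.0
```

With a small margin the triplet already satisfies the constraint and contributes no gradient, while a large margin keeps penalizing triplets that are already well separated, which is consistent with the sensitivity described above.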

5) VISUALIZATION OF LEARNED REPRESENTATION
In this section, we apply the T-SNE [52] method to visualize 2D feature embeddings of the test data learned by the triplet learning models. With the T-SNE visualization in Fig. 6, we can intuitively observe the impact of the feature learning models on different datasets. The first column is the original data distribution of each dataset, and the second column shows the utterance feature embeddings of the pre-trained SEQ-BERT model. As shown in Fig. 6, the feature embeddings of the same intention category move visibly closer to each other and form distinct clusters. Hence, the proposed models are beneficial for extracting more discriminative features through utterance triplet training. The triplet loss training results in a better feature embedding when the margin parameter is chosen appropriately.
However, the feature embedding of the MRDA corpus is not as clearly separated as that of the DYDA dataset because the data distribution of the MRDA dataset is imbalanced. The "Statement" tag accounts for approximately 50% of the test data, so the remaining four intents are not clear enough to visualize. Overall, this visualization supports the intuition that a better underlying feature embedding for short utterances can be obtained with a Siamese neural network architecture with metric learning.
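The visual impression of tighter clusters can be complemented by a simple numerical check that compares the mean distance between embeddings of the same intent with the mean distance between embeddings of different intents. This check is not part of the reported experiments; it is a sketch under the assumption that embeddings are available as plain lists of floats, and the function name and toy data are hypothetical.

```python
import math
from itertools import combinations


def mean_pair_distances(embeddings, labels):
    """Return (mean intra-class distance, mean inter-class distance)."""
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        d = math.dist(embeddings[i], embeddings[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Toy 2-D embeddings forming two well-separated clusters (made-up values).
embeddings = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels = ["Inform", "Inform", "Question", "Question"]

intra, inter = mean_pair_distances(embeddings, labels)
print(f"intra={intra:.2f} inter={inter:.2f}")
```

A lower intra-class distance relative to the inter-class distance corresponds to the more compact, better separated clusters that the visualization is meant to show.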

VI. CONCLUSION AND FUTURE WORK
In conclusion, we formulated the intention detection task from the perspective of enriching the semantic information of utterances. In the first stage, we proposed a novel feature embedding model that uses the fine-tuned BERT model and the RMCNN model as Siamese encoders with a triplet loss function. The RMCNN and BERT Siamese encoders were trained on utterance triplets, and the triplet loss function optimizes the embedding model end-to-end. We thus obtain two well-trained feature embedding models that capture discriminative utterance features from different aspects. Moreover, we introduced the sequential sampling strategy in triplet selection to capture context within the dialogue. In the second stage, we used a multi-source fusion strategy to boost the recognition performance of the downstream intention detection task. Given the pre-trained models, we predict intention labels by fusing the discriminative pre-trained features with other relevant features within the dialogue. Extensive experiments demonstrated the effectiveness of the proposed model for intention detection on several benchmark datasets. The results show that the proposed method can effectively improve the recognition accuracy on these datasets. For single-turn task-oriented dialogue, the model achieves 99.31% on the Snips dataset, 99.56% on the ATIS dataset, 99.28% on the Facebook (English) dataset, 97.67% on the Facebook (Spain) dataset, and 96.39% on the Facebook (Thai) dataset. For multi-turn conversation, the recognition accuracy reaches 91.3% on the DYDA dataset and 91.0% on the MRDA dataset.
There is still much room for improvement in our system. Firstly, we can evaluate different neural network architectures, loss functions, and distance metrics within the pre-training framework. Secondly, the multi-class classification approach may degrade the results because the model predicts intents by considering only the current time step. Beyond single-turn and multi-turn dialogue, there are more complicated dialogue structures, such as multi-party and multi-modal dialogue. Therefore, combining intricate dialogue structures with metric learning could be a new direction. Furthermore, triplet loss training can also be applied to other NLP tasks in the dialogue system field, such as emotion detection and topic adaptation, which are also promising for future research.