Bridge Pre-Training and Clustering: A Unified Contrastive Learning Framework for OOD Intent Discovery

Discovering Out-of-Domain (OOD) intents is essential for developing new skills in a task-oriented dialogue system. Previous methods suffer from poor knowledge transferability from in-domain (IND) intents to OOD intents, and inefficient iterative clustering. In this paper, we propose an efficient unified contrastive learning framework to discover OOD intents, bridging the gap between IND pre-training stage and OOD clustering stage. Specifically, we employ a supervised contrastive learning (SCL) objective to learn discriminative pre-trained intent features for clustering. And we introduce an efficient end-to-end contrastive clustering method to jointly learn representations and cluster assignments. Besides, we propose an adaptive contrastive learning (ACL) method to automatically adjust the weights of different negative sample pairs for a given anchor according to their semantic similarities. Extensive experiments on two benchmark datasets show that our method is more robust and achieves substantial improvements over the state-of-the-art methods.


I. INTRODUCTION
Discovering Out-of-Domain (OOD) or unknown intents from user queries is an essential component in a task-oriented dialog system [1], [2], [3], [4], [5]. By grouping new unknown intents into different clusters, we may identify future development directions to improve the dialogue system. Different from normal text clustering tasks, OOD discovery needs to consider how to leverage the prior knowledge of known in-domain (IND) intents to enhance the clustering of unknown OOD intents, which makes it difficult to directly apply existing clustering algorithms [6], [7], [8], [9] to the OOD discovery task.
We classify the existing methods of OOD discovery into two main categories, unsupervised and semi-supervised OOD discovery. Unsupervised methods [10], [11], [12] only model OOD data but ignore prior knowledge of in-domain data, which impairs final clustering performance. Therefore, recent work focuses on the semi-supervised setting where there exist a few labeled IND intents [4], [5]. The method in [5] first pre-trains a BERT-based [14] in-domain intent classifier using a cross-entropy classification loss and then uses intent representations to calculate the similarity of OOD sample pairs as weak supervised signals. The gap between pre-trained IND features and unseen OOD data makes it hard to generate high-quality pairwise pseudo labels. Then, [4] proposes an iterative clustering method, DeepAligned, to obtain pseudo classification labels. In each training epoch, it first performs k-means [7] on the extracted pre-trained intent features and then uses the produced aligned cluster assignments to fine-tune the intent classifier. However, DeepAligned learns intent representations and cluster assignments in a pipeline manner, which is inefficient and may cause error propagation. Generally, these semi-supervised methods do not match the IND pre-training objective in the first stage with the OOD clustering objective in the second stage and suffer from poor knowledge transferability. Therefore, in this paper, we aim to align the two-stage learning objectives and improve the efficiency and accuracy of OOD discovery via a unified contrastive learning framework.
The associate editor coordinating the review of this manuscript and approving it for publication was Arianna Dulizia.
The main challenges of OOD intent discovery are summarized as follows: (1) Knowledge Transferability. It is hard to effectively transfer prior IND knowledge to OOD data, because the classification objectives in the IND pre-training stage do not align with the clustering objectives in the OOD clustering stage, which leaves a natural gap in the knowledge transfer from IND to OOD. (2) Jointly Learning Representations and Cluster Assignments. Previous OOD clustering methods [4], [6] iteratively learn intent features and cluster assignments. Limited by the inherent inefficiency of clustering algorithms like k-means, these methods suffer from lazy two-stage back-propagation signals, which results in poor performance. Consequently, it is vital to jointly learn representations and cluster assignments.
To solve these challenges, we propose an efficient unified contrastive learning framework (COD) to discover OOD intents, as shown in Fig 1. For knowledge transferability, we employ a supervised contrastive learning (SCL) pre-training objective [15] to better learn discriminative pre-trained intent features for clustering. The previously used cross-entropy (CE) loss only focuses on whether a sample is correctly classified and does not explicitly distinguish the margins between categories [3], [13]. In contrast, SCL aims to minimize intra-class variance by pulling together IND samples belonging to the same class and to maximize inter-class variance by pushing apart samples from different classes. Moreover, pre-training with SCL aligns with the clustering objective discussed later, which can bridge the gap between pre-training and clustering. For jointly learning representations and cluster assignments, we introduce an efficient end-to-end contrastive clustering method to simultaneously model the instance-level and cluster-level representation spaces. Specifically, we regard the rows of the input feature matrix of a batch of augmented examples as instance representations and the columns as cluster representations [16], [17]. Then we can construct two levels of contrastive objectives, where the instance-level one helps capture low-level linguistic knowledge and the cluster-level one facilitates learning high-level semantic concepts. Further, we theoretically find that the instance-level contrastive objective has a negative impact on clustering performance, since it pushes apart representations from different instances even when they belong to the same cluster and have strong semantic similarities. Therefore, we propose an adaptive contrastive learning (ACL) method to automatically adjust the weights of different negative samples for a given anchor according to their semantic similarities.
ACL aims to perform a smaller penalty on semantically similar negative intent samples but a larger penalty on distant negatives, which can be regarded as a soft negative sampling strategy. Combining the pre-training stage and the clustering stage, we propose a simple but strong unified contrastive learning framework for the OOD discovery task, which can effectively solve both knowledge transfer and clustering efficiency issues.
The novelty of our proposed method comes from three aspects: (1) We are the first to propose a unified contrastive learning framework for the OOD discovery task. In contrast, previous methods [4], [5], [6] use two independent models to learn IND features and OOD clustering respectively, which creates a gap between pre-trained IND features and unseen OOD data. Our method employs a unified view to solve the two problems. (2) We introduce a supervised contrastive learning pre-training objective to learn more discriminative intent features than the previous cross-entropy loss. (3) We propose a novel adaptive contrastive learning mechanism to perform better OOD clustering using soft negative sampling.
Our contributions are four-fold: (1) To the best of our knowledge, we are the first to propose a unified contrastive learning framework for OOD discovery, bridging the gap between pre-training and clustering. (2) We introduce a supervised contrastive learning pre-trained objective to learn discriminative intent features by maximizing inter-class variance and minimizing intra-class variance. (3) We propose a novel adaptive contrastive learning mechanism to perform soft negative sampling. (4) Experiments and analysis on two benchmark datasets demonstrate the effectiveness of our framework for OOD discovery.

A. INTENT MODELING
There are two main applications of intent modeling in the task-oriented dialogue system, intent classification and OOD discovery. The former aims to distinguish intent types jointly with other tasks, like slot filling [24], [25], [26]. The latter is to leverage intent representations to construct clustering signals [4], [5], [12]. In this paper, we focus on the latter application. It is important to leverage hidden semantic information to construct supervised signals for intent feature learning.
VOLUME 11, 2023

B. CLUSTERING
Most existing clustering methods are unsupervised, such as partition-based methods [7], hierarchical methods [27], density-based methods [28], and feature dimensionality reduction methods [27]. However, these methods suffer from high computational complexity and poor performance since they cannot capture high-level semantics of intent features. Deep clustering methods were then proposed to leverage the strong feature modeling capability of deep neural networks (DNNs), such as JULE [29], DEC [9], and DeepCluster [6]. Joint unsupervised learning (JULE) [29] combines deep feature learning with hierarchical clustering but incurs huge computational and memory costs on large-scale datasets. Deep Embedded Clustering (DEC) [9] trains the autoencoder with the reconstruction loss and iteratively refines the cluster centers by optimizing KL-divergence with an auxiliary target distribution. Compared with DEC, Deep Clustering Network [6] further introduces a k-means loss as the penalty term to reconstruct the clustering loss. However, these methods follow a two-stage clustering process and only use unsupervised data.
Recent work performs semi-supervised clustering with the aid of some labeled data, such as KCL [30], CDAC+ [5], DeepAligned [4] and DKT [46]. KCL [30] uses deep neural networks to perform pairwise constraint clustering. It first trains an extra network for binary similarity classification with a labeled auxiliary dataset. Then, it transfers the prior knowledge of pairwise similarity to the target dataset and uses KL-divergence to evaluate the pairwise distance. CDAC+ [5] is specifically designed for discovering new intents. It uses limited labeled data as a guide to learn pairwise similarities. However, it is limited in providing specific supervised signals and fails to estimate the number of novel classes. DeepAligned is the previous mainstream baseline for OOD discovery, which iteratively learns intent representations and then cluster assignments. DKT is a newly proposed method, which introduces contrastive learning in OOD discovery for the first time and uses a multi-head decoupling framework to map the shared intent representations of BERT to the instance-level and cluster-level subspaces. Our proposed COD significantly outperforms all previous methods on two benchmark datasets.

A. PROBLEM FORMULATION
In this paper, we denote OOD discovery as OOD clustering with IND pre-training unless otherwise stated. Given a set of labeled in-domain data $(X^{IND}, Y^{IND})$ and unlabeled OOD data $(X^{OOD}, Y^{OOD})$, OOD discovery aims to cluster OOD groups from unlabeled OOD data using prior knowledge from labeled IND data. Note that IND classes do not overlap with OOD classes.

B. OVERALL ARCHITECTURE
Fig 2 displays the overall architecture of our proposed unified contrastive learning framework for OOD discovery, COD. We follow a two-stage framework similar to [5] and [4]: IND pre-training and OOD clustering. For IND pre-training, we employ a supervised contrastive learning (SCL) objective to better learn discriminative pre-trained intent features along with the traditional cross-entropy (CE) loss. For OOD clustering, we introduce an efficient end-to-end contrastive clustering method to jointly learn representations and cluster assignments. Besides, we propose an adaptive contrastive learning (ACL) loss to automatically adjust the weights of different negative sample pairs for a given anchor according to their semantic similarities. We describe the details in the following sections.

C. SUPERVISED CONTRASTIVE PRE-TRAINING
We first pre-train an intent feature extractor using labeled IND data. Specifically, for a fair comparison we use a BERT [14] intent classifier similar to that of [4], including the input layer, BERT encoder, and a pooling layer. Finally, we obtain the intent representation $z_i \in \mathbb{R}^H$ for the input sample $x_i$. Previous intent classification models [2], [4], [5], [19] always use the cross-entropy objective, which only focuses on whether a sample is correctly classified and does not explicitly distinguish the margins between categories. Inspired by recent contrastive work [3], [15], [20], we employ a supervised contrastive learning (SCL) objective to learn discriminative intent features by maximizing inter-class variance and minimizing intra-class variance. In the SCL objective, $N_{y_i}$ is the total number of examples in the batch that have the same label as $y_i$, and $\mathbb{1}$ is an indicator function. Following [21], [22], we employ simple dropout [23] as data augmentation. As Fig 1(a) shows, SCL aims to pull together IND samples belonging to the same class and push apart samples from different classes, which helps distinguish OOD cluster boundaries. In the implementation, we perform joint training using both SCL and CE. We also tried other variants, such as using only SCL or using SCL first and then CE. However, simply adding SCL and CE gets the best performance. We conduct a comprehensive analysis (see Section V-A) of the effect of SCL from multiple perspectives, including IND and OOD, and SCL achieves better performance than CE on both.
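For reference, the SCL objective described above can be written in a commonly used form. This is a sketch under stated assumptions rather than a reproduction of the equation used in this work: it assumes a batch of $N$ examples, dot-product similarity between representations, and a temperature $\tau$, with $N_{y_i}$ and the indicator $\mathbb{1}$ as defined in the text.

```latex
\mathcal{L}_{SCL} = \sum_{i=1}^{N} -\frac{1}{N_{y_i} - 1}
  \sum_{j=1}^{N} \mathbb{1}_{i \neq j}\, \mathbb{1}_{y_i = y_j}
  \log \frac{\exp\left(z_i \cdot z_j / \tau\right)}
            {\sum_{k=1}^{N} \mathbb{1}_{i \neq k} \exp\left(z_i \cdot z_k / \tau\right)}
```

Under this form, every same-label pair in the batch contributes a positive term, which is what pulls IND samples of one class together while the shared denominator pushes other classes away.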

D. ADAPTIVE CONTRASTIVE CLUSTERING
After transferring knowledge from known intents, we propose an efficient end-to-end contrastive clustering method to group similar OOD intents into the same cluster. The key challenge of OOD clustering is how to jointly learn representations and cluster assignments. The previous mainstream method DeepAligned [4] uses the DeepCluster [6] algorithm with an alignment mechanism to iteratively learn intent representations and then cluster assignments. We argue that this method suffers from poor clustering efficiency and lazy back-propagation signals. Therefore, we introduce an end-to-end contrastive clustering method [16] to mitigate the above issues. Specifically, we first use the pre-trained intent classifier to obtain a feature matrix given a batch of dropout-augmented OOD samples. Then we adopt two individual two-layer nonlinear MLPs g(·) to map the feature matrix to a new subspace where two contrastive objectives are applied. We regard the rows of the new feature matrix as instance representations and the columns as cluster representations [17]. Next, we can construct two levels of contrastive objectives, where the instance-level one helps capture low-level linguistic knowledge and the cluster-level one facilitates learning high-level semantic concepts. We use different transformation MLPs for the two-level contrastive objectives, which has been shown to be effective by [16].

1) ADAPTIVE INSTANCE-LEVEL CONTRASTIVE LOSS
We formulate the original instance-level CL loss for a given sample $z_i$ (Eq 2), where $z_i$ represents the transformed vector of the i-th intent sample and $z_j$ is its dropout-augmented sample. $\mathbb{1}_{[k \neq i]}$ is an indicator function evaluating to 1 if $k \neq i$, and $\tau$ denotes a temperature parameter. We then split the normalization term of this loss into a sum over $P$, the positive set of anchor $z_i$, and a sum over $N$, the negative set. The original instance-level CL uses the anchor sample and its augmented sample as a positive pair but regards all other samples in the batch as negatives, which is not suitable for OOD clustering, because clustering tries to pull together samples within the same cluster and push apart samples from different clusters. Therefore, to decrease the weight of the false negative samples that belong to the same cluster as the anchor, we propose an adaptive contrastive loss, shown in Fig 2 as Soft Negative Sampling. The main intuition is to adaptively adjust the temperature of each negative sample according to its semantic similarity to the anchor (Eq 4 & 5), where $\tau_{i,j}$ denotes the temperature between anchor $z_i$ and another sample $z_j$, and $\odot$ represents the dot product of $z_i$'s cluster logits $C_i$ and $z_j$'s cluster logits $C_j$ from the cluster-level contrastive head. Here we use the samples' cluster logits to compute how likely they are to belong to the same cluster. If the similarity of $(z_i, z_j)$ is above a fixed hyperparameter $\tau_0$ (we set it to 0.5), then the negative sample $z_j$ gets a relatively larger temperature and a smaller penalty, so that $z_i$ and $z_j$ are not pushed away from each other. In this way, we can keep pairs with similar semantics as close as possible.
Here we give a theoretical explanation of our proposed adaptive instance-level contrastive (ACL) loss. We rewrite the original instance-level contrastive loss $L(x_i)$ in terms of pairwise similarities $s_{i,j}$. For convenience, we denote $s_{i,i}$ as the positive pair and $s_{i,j}$, $i \neq j$, as negative pairs, which is slightly different from Eq 2. Then we analyze the gradients of $L(x_i)$ with respect to positive samples and different negative samples following [37], [38] (Eq 8 & 9).
We observe from Eq 8 & 9 that the gradient with respect to a negative sample is proportional to the exponential term $\exp(s_{i,j}/\tau)$, since all other terms are the same for all negative samples. If we increase the temperature $\tau$ of negative samples belonging to the same cluster, the gradient (penalty) gets smaller, so that these false negatives within a cluster can get closer than true negatives from different clusters. Therefore, we can keep intent representations from the same cluster close and dense. In the implementation, we use the dot product of cluster logits between two samples to measure whether they belong to the same cluster, as shown in Eq 4 & 5.
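The soft negative sampling idea above can be illustrated with a small numerical sketch. It assumes the usual softmax form of the instance-level loss, for which the gradient with respect to a negative similarity is exp(s_ij / tau_ij) / (tau_ij * Z), with Z the normalization term. The function names and the `boost` multiplier are hypothetical: the text only states that similar negatives receive a relatively larger temperature, not by how much.

```python
import math

def pair_temperatures(anchor_probs, candidate_probs, positive,
                      base_tau=0.5, tau0=0.5, boost=2.0):
    """Return one temperature per candidate for a single anchor.

    anchor_probs / candidate_probs are cluster probability vectors (C_i, C_j).
    boost is a hypothetical multiplier, not a value taken from the paper.
    """
    taus = []
    for j, probs in enumerate(candidate_probs):
        # Dot product of cluster vectors: high when both samples favor the same cluster.
        same_cluster_score = sum(a * b for a, b in zip(anchor_probs, probs))
        is_soft_negative = j != positive and same_cluster_score > tau0
        taus.append(base_tau * boost if is_soft_negative else base_tau)
    return taus

def negative_gradients(sims, taus, positive):
    """Gradient of -log softmax with respect to each negative similarity."""
    exps = [math.exp(s / t) for s, t in zip(sims, taus)]
    normalizer = sum(exps)
    return {j: exps[j] / (taus[j] * normalizer)
            for j in range(len(sims)) if j != positive}

# Candidate 0 is the augmented positive, 1 is a likely false negative, 2 is a true negative.
anchor_probs = [0.9, 0.1]
candidate_probs = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9]]
sims = [0.9, 0.8, 0.1]

adaptive_taus = pair_temperatures(anchor_probs, candidate_probs, positive=0)
uniform = negative_gradients(sims, [0.5, 0.5, 0.5], positive=0)
adaptive = negative_gradients(sims, adaptive_taus, positive=0)
```

In this toy batch the likely false negative receives a smaller gradient under the adaptive temperatures than under a uniform temperature, while the distant negative receives a slightly larger one, which matches the intended behavior of a smaller penalty on similar negatives and a larger penalty on distant ones.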

2) CLUSTER-LEVEL CONTRASTIVE LOSS
When projecting a data sample into a space whose dimensionality equals the number of clusters, the i-th element of its feature can be interpreted as its probability (logit) of belonging to the i-th cluster. Meanwhile, all the i-th elements from a batch of feature vectors (the i-th column of the feature matrix) form the i-th cluster representation. Intuitively, OOD clustering aims to pull together positive cluster representation pairs from the same cluster and push apart negative pairs from different clusters. We simply use dropout augmentation to obtain the augmented version of each cluster representation of the original samples. Therefore, we formulate the cluster-level CL over these pairs, where $y_i$ denotes the i-th cluster representation (the i-th column of the feature matrix), $y_j$ is its dropout-augmented cluster representation, and $M$ is the number of clusters. To avoid the trivial solution in which most instances are assigned to a single cluster, we also add a regularization term $H(y_i)$, where $y_{ji}$ is the $(j, i)$ coordinate of the cluster-level feature matrix $Y$. We simply add the above three objectives and optimize them together in the experiments, which still yields significant improvements. For inference, we only use the cluster-level contrastive head and compute the argmax to get the cluster results without additional k-means.
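The column view of the cluster-level head, the anti-collapse regularizer, and argmax inference can be sketched as follows. The sketch assumes each row of the cluster-level output is already a probability vector over clusters, and it uses the entropy of the batch-level cluster assignment distribution as the regularizer, which is one common choice and an assumption about the exact form of H. All function names are hypothetical.

```python
import math

def cluster_representations(cluster_probs):
    """Columns of the batch matrix: one vector per cluster."""
    return [list(column) for column in zip(*cluster_probs)]

def cluster_assignments(cluster_probs):
    """Inference step: pick the most probable cluster for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in cluster_probs]

def assignment_entropy(cluster_probs):
    """Entropy of the batch-level cluster distribution.

    Higher values mean samples are spread over clusters, so maximizing it
    discourages assigning most instances to a single cluster.
    """
    total = sum(sum(row) for row in cluster_probs)
    entropy = 0.0
    for column in zip(*cluster_probs):
        p = sum(column) / total
        if p > 0:
            entropy -= p * math.log(p)
    return entropy

balanced = [[0.9, 0.1], [0.1, 0.9]]   # two samples, two clusters, evenly used
collapsed = [[0.9, 0.1], [0.8, 0.2]]  # both samples lean toward cluster 0
```

With these toy inputs, the balanced batch reaches the maximum entropy for two clusters, while the collapsed batch scores lower, so a regularizer of this kind favors the balanced solution.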

IV. EXPERIMENTS

A. DATASETS
We conduct experiments on two benchmark datasets, CLINC [33] and Banking [34]. CLINC contains 22,500 queries covering 150 intents and Banking contains 13,083 customer service queries with 77 intents. We show the detailed statistics in Table 1. Following previous work, to construct IND/OOD data, we divide the two datasets according to the specified OOD ratio (10%, 20%, 30% for CLINC, 10% for Banking), and the rest is IND data. For the semi-supervised setting (the main focus of this paper), we use the labeled IND data for pre-training and the unlabeled OOD data for clustering. For the unsupervised setting, we only use unlabeled OOD data for clustering. We rerun all the baselines three times using our settings and report the averaged results on the same divided IND/OOD datasets for reliable and fair evaluation. For each run, all the models use the same divided dataset. Due to limited resources, we only perform a 10% split on Banking, but pay more attention to extensive ablation studies to understand the effectiveness of our proposed method.
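The IND/OOD split described above can be sketched as follows. This is a minimal illustration that assumes the OOD ratio is applied at the level of intent classes and that the OOD class count is rounded to the nearest integer. The function name `split_intents` and the seed handling are hypothetical and not taken from the paper.

```python
import random

def split_intents(intent_labels, ood_ratio, seed=0):
    """Randomly mark a fraction of intent classes as OOD; the rest are IND."""
    rng = random.Random(seed)
    classes = sorted(set(intent_labels))
    num_ood = round(len(classes) * ood_ratio)  # rounding rule is an assumption
    ood_classes = set(rng.sample(classes, num_ood))
    ind_classes = set(classes) - ood_classes
    return ind_classes, ood_classes

# Placeholder class names standing in for the 150 CLINC intents.
clinc_like_labels = [f"intent_{i}" for i in range(150)]
ind_classes, ood_classes = split_intents(clinc_like_labels, ood_ratio=0.1)
```

Fixing the seed per run is one way to make sure all compared models see the same divided dataset, as required for the averaged results.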

B. BASELINES
We mainly compare our method with semi-supervised baselines: PTK-means (k-means with IND pre-training), DeepCluster [6] and three OOD discovery methods, CDAC+ [5], DeepAligned [4] and DKT [46]. We also report the results of unsupervised methods for a comprehensive comparison. For fairness, we use the same BERT backbone as the baselines.
To reduce the randomness of splitting IND/OOD, we average results over three random runs following [4]. We adopt three widely used metrics to evaluate the clustering results: Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). To calculate ACC, we use the Hungarian algorithm [35] to obtain the mapping between the predicted classes and ground-truth classes.
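The ACC computation can be illustrated with a short sketch. Note that it swaps in an exhaustive search over label permutations as a stand-in for the Hungarian algorithm [35]: both return the best one-to-one mapping, but the exhaustive version is only practical for a handful of clusters and is used here purely to keep the example dependency-free. The function name is hypothetical.

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over all one-to-one mappings from predicted to true labels."""
    labels = sorted(set(y_true) | set(y_pred))
    best_correct = 0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))  # predicted label -> candidate true label
        correct = sum(1 for t, p in zip(y_true, y_pred) if mapping[p] == t)
        best_correct = max(best_correct, correct)
    return best_correct / len(y_true)

# Cluster ids are arbitrary, so a relabeled but otherwise perfect clustering scores 1.0.
perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
partial = clustering_accuracy([0, 0, 1, 1], [0, 0, 0, 1])
```

For realistic cluster counts such as those in CLINC or Banking, a polynomial-time assignment solver would be needed instead of this factorial-time search.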

C. IMPLEMENTATION DETAILS
For a fair comparison with previous work, we use the same pre-trained BERT model (bert-base-uncased) as our network backbone. During the pre-training phase, the training batch size is 128. For DKT, during the pre-training phase, the training batch size is 128, and during the clustering phase, the training batch size is 512 for CLINC-10%, CLINC-30%, and Banking-10%, and 400 for CLINC-20%. The learning rate is 5e-5 in the pre-training phase and 0.0003 in the clustering phase. For the instance-level contrastive head, the dimensionality of the row space is set to 128, the temperatures of SCL and instance-level CL are 0.5, and the cluster-level temperature parameter τ = 1.0 is used for all datasets. For DeepAligned, the training batch size is 128, the learning rate is 5e-5, and the dimension of intent features is 768. For CDAC+, the training batch size is 256, and the learning rate is 5e-5. We use the same dynamic thresholds as [5]. We freeze all but the last transformer layer of the BERT backbone to speed up training.
Table 2 shows the main results of our proposed method compared to the baselines. Our method consistently outperforms all the previous baselines by a large margin. For the semi-supervised setting on CLINC-10%, COD w. ACL outperforms DeepAligned by 3.11% (ACC), 6.34% (ARI), and 3.66% (NMI). On Banking, COD w. ACL also gets significant improvements of 8.78% (ACC), 6.36% (ARI), and 2.12% (NMI). The results demonstrate the effectiveness of our proposed contrastive framework for OOD discovery. Specifically, comparing COD with DeepAligned, COD gets an improvement of 1.33% (ACC), 2.69% (ARI), and 1.49% (NMI) on CLINC-10%. Comparing COD with COD w. ACL, we find ACL brings a further improvement of 1.78% (ACC), 3.65% (ARI), and 2.17% (NMI), which confirms that adaptive contrastive learning helps learn better cluster assignments.
Besides, comparing unsup COD with semi-sup COD, the latter significantly outperforms the former by 9.77%(ACC), 15.34%(ARI), 7.87%(NMI), which demonstrates the effectiveness of SCL pre-training. Overall, both COD and ACL achieve superior performance and the combination of the two is the best.

V. QUALITATIVE ANALYSIS

A. EFFECT OF SUPERVISED CONTRASTIVE LEARNING
Supervised contrastive learning (SCL) contributes to model discriminative representation. We analyze the effect of SCL from multiple perspectives.
We first analyze the spatial distribution of representations when our proposed clustering objective is not used. For in-domain data, we use the intra-class distance, which is the mean Euclidean distance between each sample and its class center, and the inter-class distance, which is the mean Euclidean distance between the center of each class and the centers of the 3 classes closest to it. For OOD data, we use the SC metric [32] to evaluate the quality of OOD clusters (see details in Appendix A). It jointly considers the relationship between the intra-cluster distance and the inter-cluster distance and is used to characterize the tightness of clusters. It should be noted that the cluster labels of OOD data are calculated by k-means, since we aim to analyze the effect of SCL. As shown in Table 3, we use two basic settings, No-pretraining and CE (using cross-entropy as the pre-training loss). On this basis, we find that after adding SCL, each statistical indicator significantly improves. This shows that SCL is effective for improving the data distribution and modeling discriminative representations.
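The two in-domain statistics defined above can be computed as in the following sketch. It assumes features are plain coordinate tuples and that at least two classes are present; when fewer than three other classes exist, it averages over those available. Function names are hypothetical.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def class_centers(features, labels):
    """Mean feature vector per class."""
    grouped = defaultdict(list)
    for vec, label in zip(features, labels):
        grouped[label].append(vec)
    return {label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
            for label, vecs in grouped.items()}

def intra_class_distance(features, labels):
    """Mean distance between each sample and its own class center."""
    centers = class_centers(features, labels)
    dists = [euclidean(vec, centers[label]) for vec, label in zip(features, labels)]
    return sum(dists) / len(dists)

def inter_class_distance(features, labels, nearest=3):
    """Mean distance between each class center and its closest other centers."""
    centers = list(class_centers(features, labels).values())
    per_class = []
    for i, center in enumerate(centers):
        others = sorted(euclidean(center, other)
                        for j, other in enumerate(centers) if j != i)
        closest = others[:nearest]
        per_class.append(sum(closest) / len(closest))
    return sum(per_class) / len(per_class)

toy_features = [(0.0,), (2.0,), (10.0,), (12.0,)]
toy_labels = ["a", "a", "b", "b"]
intra = intra_class_distance(toy_features, toy_labels)
inter = inter_class_distance(toy_features, toy_labels)
```

A smaller intra-class distance together with a larger inter-class distance indicates tighter and better separated classes, which is the trend the comparison between CE and SCL pre-training looks for.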
Then, we further conduct experiments on the model with our proposed clustering method added. As shown in Table 4, compared to the corresponding base setting, the addition of SCL brings consistent improvement on all metrics. This indicates that pre-training with SCL does align with the clustering objective and can effectively bridge the gap between pre-training and clustering. Furthermore, we also independently analyze the 5 OOD clusters with the worst clustering metric (those with the lowest SC in the no-pretraining setting). As shown in Fig 5, after adding the SCL training objective, we observe significant improvements in SC, which shows that our method brings clear improvements to the OOD clusters that are difficult to cluster accurately. This is of great significance in practical applications.

B. EFFECT OF COD AND ACL
To understand the effectiveness of COD and ACL, we perform OOD intent visualization of DeepAligned, COD and COD w. ACL in Fig 3. COD is the overall contrastive learning framework for OOD discovery and ACL is our proposed adaptive instance-level contrastive loss. Comparing COD to DeepAligned, we observe that DeepAligned produces some mixed OOD clusters (see red and black dots in Fig 3(a)) while COD successfully separates them, which indicates COD learns discriminative OOD cluster assignments. But we also find some OOD clusters have narrow distributions (see black and brown dots in Fig 3(b)). We argue that this is because COD uses the original instance-level contrastive loss, which pushes apart the samples within the same cluster. After using ACL, we get a more uniform and tight distribution. The visualization shows that both COD and ACL help OOD discovery and complement each other. We also display OOD SC curves during training in Fig 4. Results show COD converges faster and better than DeepAligned. Note that the initial SC of COD (w. ACL) is worse than that of DeepAligned because we add a new, randomly initialized cluster-level MLP head while DeepAligned directly uses k-means, but our methods still converge faster via contrastive objectives. This demonstrates the efficiency of our proposed COD.

C. ESTIMATE THE NUMBER OF CLUSTER K
Since we may not know the exact number of OOD clusters, we use the following K estimation method [4] to determine the number of clusters K before clustering. The method estimates K with the aid of the well-initialized intent features. We assign a big $K'$ as the number of clusters at first. As a good feature initialization is helpful for partition-based methods (e.g., k-means), we use the well pre-trained model to extract intent features. Then, we perform k-means with the extracted features. We suppose that real clusters tend to be dense even with $K'$, and that the size of more confident clusters is larger than some threshold $t$. Therefore, we drop the low-confidence clusters whose size is smaller than $t$, and calculate K with:

$$K = \sum_{i=1}^{K'} \delta\left(|S_i| \geq t\right)$$

where $|S_i|$ is the size of the i-th produced cluster, and $\delta(\cdot)$ is an indicator function that outputs 1 if the condition is satisfied and 0 otherwise. Notably, we assign the threshold $t$ as the expected cluster mean size $\frac{N}{K'}$ in this formula.
Table 5 shows the OOD clustering results using the automatic K-value estimation strategy. We find that our method achieves the best performance under both the fixed and auto K settings. Besides, under the auto K setting, all methods show a decline in the metrics, which shows that the unknown K value is a great challenge for OOD discovery. However, the reduction of our proposed method is significantly lower than that of other methods, indicating that our method is robust to the challenge of unknown K, which reflects its practicability.
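The counting step of the K estimation procedure can be sketched in a few lines. The sketch takes the cluster sizes produced by k-means with the large initial K' as input rather than running k-means itself, and uses the expected mean cluster size N / K' as the threshold, as described above. The function name is hypothetical.

```python
def estimate_num_clusters(cluster_sizes):
    """Count clusters whose size reaches the expected mean size N / K'."""
    k_prime = len(cluster_sizes)          # initial, deliberately large K'
    num_samples = sum(cluster_sizes)      # N
    threshold = num_samples / k_prime     # t = N / K'
    return sum(1 for size in cluster_sizes if size >= threshold)

# Six initial clusters over 150 samples: three dense ones and three small ones.
estimated_k = estimate_num_clusters([50, 40, 45, 5, 3, 7])
```

In this toy case the threshold is 25, so the three small clusters are dropped and the estimate is 3.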

D. VISUALIZATION AT DIFFERENT TRAINING EPOCHS
To see the evolution of our method in the training, we show a visualization at four different timestamps throughout the training process in Fig 6. Results show features are mixed in the beginning and cluster assignments become increasingly visible and distinct as the training process goes.

E. EFFECT OF IND DATA
We analyze the impact of in-domain data on the effect of clustering from two perspectives, the number of IND classes and the number of samples per class, as shown in Figure 7. Overall, the performance of our method is much better than the baselines. Moreover, as the amount of in-domain data decreases, all methods show varying degrees of performance fluctuation, and the fluctuation amplitude of our proposed method is the smallest, which shows that our method depends little on the in-domain data and has stronger performance and robustness in the few-shot scenario.

F. ERROR ANALYSIS
We further analyze the error cases of DeepAligned and COD w. ACL in Fig 8. We find that DeepAligned tends to confuse semantically similar OOD intents, while our COD w. ACL can effectively distinguish them. For example, DeepAligned incorrectly clusters the accept_reservation intent into cancel_reservation (14% error rate) while COD w. ACL gets 100% accuracy. The result shows COD w. ACL helps separate semantically similar OOD intents. We hypothesize that this is because the adaptive instance-level contrastive learning helps the model learn discriminative linguistic knowledge.
Results show that for all the OOD clustering metrics ACC, ARI and NMI, $\tau_0 = 0.5$ gets the best performance. Values that are too large or too small both result in a significant performance drop. Our method with $\tau_0$ in (0.4, 0.9) outperforms the state-of-the-art baselines, and $\tau_0$ in (0.5, 0.7) brings larger improvements (above 2%), which shows that $\tau_0$ is robust. To reduce randomness, we average results over three random runs. The standard deviation (std) of DeepAligned is 1.16, and the std of COD (t=0.5) is 0.67.
Table 6 shows the effect of different batch sizes for our proposed COD w. ACL on CLINC-10%. Results show that a larger batch size of input samples obtains better performance on OOD discovery.

I. ABLATION STUDY
In the OOD clustering stage, the intent representation output by BERT is mapped to instance-level and cluster-level subspaces respectively, and optimized with different contrastive losses. In Table 7, we remove each of the two subspaces in turn, where w/o cluster-level means only instance-level contrastive learning is used for learning representations, and w/o instance-level means only cluster-level contrastive learning is used for learning representations. Results show both instance-level and cluster-level contrastive losses contribute to the performance. When the cluster-level contrastive loss is removed, it is difficult for the model to learn the cluster structure from the unlabeled data, so the performance degradation is the most significant.

VI. CONCLUSION
In this paper, we propose a unified contrastive learning framework for OOD discovery, bridging the gap between pre-training and clustering. For IND pre-training, we employ a supervised contrastive learning (SCL) loss to learn discriminative intent features. For OOD clustering, we introduce an efficient end-to-end contrastive clustering method to jointly learn representations and cluster assignments. Besides, we propose an adaptive contrastive learning (ACL) method to automatically adjust the weights of different negative samples. Experiments on two benchmark datasets prove the effectiveness of our method. And extensive analyses demonstrate our method converges faster and better than the previous SOTA, helps separate semantically similar OOD intents and is robust to different IND data and K. Besides, we find even if the number of OOD clusters is not given, our method still gets relatively accurate estimation and is more robust to K. We also perform visualization and error analysis to understand the reason for the performance improvements. We hope to explore more self-supervised learning methods for future work.

APPENDIX A SILHOUETTE COEFFICIENT (SC)
Following [4], we use the cluster validity index (CVI) to evaluate the quality of clusters obtained during each training epoch after clustering. Specifically, we adopt an unsupervised metric, the Silhouette Coefficient [32], for evaluation:

$$SC = \frac{1}{N} \sum_{i=1}^{N} \frac{b(I_i) - a(I_i)}{\max\{a(I_i), b(I_i)\}}$$

where $a(I_i)$ is the average distance between $I_i$ and all other samples in the i-th cluster, which indicates the intra-class compactness, and $b(I_i)$ is the smallest distance between $I_i$ and all samples not in the i-th cluster, which indicates the inter-class separation. The range of SC is between -1 and 1, and a higher score means better clustering results.
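A small self-contained sketch of the Silhouette Coefficient is given below. It follows the widely used definition in which b is the mean distance from a sample to the members of the nearest other cluster, which is an assumption about how the separation term is computed here. It also assumes at least two clusters and distinct points, and assigns a score of 0 to samples in singleton clusters. The function names are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_coefficient(points, labels):
    """Mean silhouette score over all samples (requires at least two clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, point in enumerate(points):
        own = [j for j in clusters[labels[i]] if j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(point, points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(sum(euclidean(point, points[j]) for j in members) / len(members)
                for label, members in clusters.items() if label != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well separated one-dimensional clusters.
toy_sc = silhouette_coefficient([(0.0,), (1.0,), (10.0,), (11.0,)], [0, 0, 1, 1])
```

For the toy input the score is close to 0.9, reflecting compact and well separated clusters.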

APPENDIX B COMPARISON WITH DKT FRAMEWORK
Our proposed COD w. ACL and the DKT framework are two different training strategies. They differ in two respects: (1) In terms of implementation, the motivation of DKT is to decouple the shared intent representations obtained through BERT into instance-level and cluster-level representations through a multi-head framework, so DKT maps BERT's output into two subspaces in the model structure. In the IND pre-training stage and the OOD clustering stage, separate contrastive learning objectives are designed to optimize the two subspaces. However, our COD w. ACL does not adopt the multi-head framework in the IND pre-training stage, but directly uses the CE+SCL objective to constrain the representation output by BERT. (2) In terms of method, the clustering algorithm adopted by DKT is contrastive clustering [16], that is, using an instance-level CL and a cluster-level CL to optimize the instance-level and cluster-level subspaces respectively. In this paper, we propose an adaptive contrastive clustering (ACC) method, which addresses the problem that the traditional instance-level CL pushes similar samples away as negative samples. The adaptive contrastive clustering method automatically adjusts the weights of different negative samples according to their semantic similarity to a given anchor, which helps form a more compact cluster distribution and is one of the innovations of this paper. We also give a theoretical analysis of this in Section III-D.