DimCL: Dimensional Contrastive Learning for Improving Self-Supervised Learning

Self-supervised learning (SSL) has gained remarkable success, for which contrastive learning (CL) plays a key role. However, the recent development of new non-CL frameworks has achieved comparable or better performance with high improvement potential, prompting researchers to enhance these frameworks further. Assimilating CL into non-CL frameworks has been thought to be beneficial, but empirical evidence indicates no visible improvements. In view of that, this paper proposes a strategy of performing CL along the dimensional direction instead of along the batch direction as done in conventional contrastive learning, named Dimensional Contrastive Learning (DimCL). DimCL aims to enhance the feature diversity, and it can serve as a regularizer to prior SSL frameworks. DimCL has been found to be effective, and the hardness-aware property is identified as a critical reason for its success. Extensive experimental results reveal that assimilating DimCL into SSL frameworks leads to performance improvement by a non-trivial margin on various datasets and backbone architectures.

Compared with the CL-based frameworks, the non-CL ones [9], [12] have a unique advantage: they are simpler frameworks that do not use negative samples, yet achieve comparable or even superior performance on benchmark datasets (such as ImageNet-1K and CIFAR-10/100). Thus, there is a trend to shift from CL to non-CL frameworks.

FIGURE 1: Dimensional contrastive learning (DimCL). As the term suggests, existing BCL performs CL along the batch direction to encourage diversity of representations, while our proposed DimCL performs CL along the dimensional direction to encourage diversity among elements within a representation (termed feature diversity). Our DimCL can be used as a plug-and-play regularization method to improve non-CL (and CL-based) SSL frameworks.

Recognizing the significance of CL in the development of SSL, this work attempts to distill beneficial properties of CL to push the frontiers of non-CL frameworks further. However, naively assimilating CL into non-CL does not show visible improvement, as pointed out in BYOL [25]. This can be attributed to the fact that the frameworks mentioned above focus on the same inter-instance level of constraints and mainly pursue the same objective (augmentation invariance). In essence, existing CL encourages representation diversity among the instances in a batch. In this paper, CL is instead utilized to encourage diversity among the representation elements, obtaining "feature diversity"; we refer to this as Dimensional Contrastive Learning. To avoid confusion between batch contrastive learning and dimensional contrastive learning, we denote them as BCL and DimCL, respectively. The difference between BCL and DimCL is depicted in Fig. 2.
A prudent variation in BCL led to a separate SSL framework, while the proposed DimCL (as illustrated in Fig. 1) is designed as a regularizer for feature diversity enhancement to support other frameworks. Even though DimCL is originally motivated to boost non-CL frameworks, empirically, DimCL is found to also enhance the performance of existing CL-based frameworks and can be generalized to other domains (e.g., supervised learning). This implies that feature diversity is necessary for good representations.
Our contributions are as follows:
• Recognizing the significance of CL in the development of self-supervised learning, we are the first to apply DimCL to push the frontiers of non-CL frameworks. In contrast to existing BCL, our proposed DimCL performs CL along the dimensional direction and can be used as a regularizer for boosting the performance of non-CL (and CL-based) frameworks.
• We perform extensive experiments on various frameworks with different backbone architectures on diverse datasets to validate the effectiveness of our proposed DimCL. We also investigate the reason for the benefit brought by DimCL and identify the hardness-aware property as an essential factor.
The rest of this paper is organized as follows. Section II summarizes the related works. Section III describes the background of batch contrastive learning. Section IV presents the proposed method, DimCL. Section V provides the experiment setup and results. Section VI shows the ablation study on important hyper-parameters. Section VII provides some discussions about DimCL. Finally, Section VIII concludes this work.
MoCo v1 [29] has attracted significant attention by demonstrating superior performance over supervised pre-training counterparts in downstream tasks. It makes use of a large number of negative samples while decoupling their number from the batch size by introducing a dynamic dictionary. Inspired by [9], MoCo v2 [10] applies stronger augmentations and an additional MLP projector, showing significant performance improvement over the first version of MoCo. [14] has empirically shown that the predictor from the non-CL frameworks [12], [25] helps to gain a performance boost for MoCo variants with ViT structures [20].
Several works explain the key properties that lead to the success of CL. It is noticeable that the momentum update [9] and a large number of negative samples play an important role in preventing collapse. The InfoNCE loss was identified to have the hardness-aware property, which is critical for optimization [64] and for preventing collapse via instance de-correlation [1]. [15], [34], [36], [49], [62], [66], [69] have demonstrated that hard negative sample mining strategies can be beneficial for better performance over the baselines. Notably, [65] identified that CL forms alignment and uniformity of the feature space, which benefits downstream tasks.
Most contrastive learning frameworks adopt the instance discrimination task, which inevitably causes the class collision problem [74], where representations of images from the same class are forced to be different. This problem can hurt the quality of the learned representation. Different from the above methods, which perform CL along the batch direction, DimCL performs CL along the dimensional direction in order to encourage diversity among representation elements instead of representation vectors. This approach never faces the class collision problem.
Non-Contrastive Learning. Non-contrastive learning focuses on making representations augmentation-invariant without using negative samples. In the absence of negative samples, training a simple Siamese network with the cosine similarity loss leads to complete collapse [1], [25]. BYOL [25] and SimSiam [12] demonstrated that a careful architecture design that breaks the architectural symmetry can avoid collapse. Specifically, a special 'predictor' network is added in conjunction with an exponential moving average update (BYOL) or with a stop-gradient in one branch (SimSiam). Besides, several works have attempted to demystify the success of BYOL [25]. A recent work [24] suggested that batch normalization (BN) plays a critical role in the success of BYOL; however, another work [52] refutes that claim by showing BYOL works without BN.
Recognizing the strong points of CL in this development process, this work tries to distill beneficial properties of CL in a novel manner and use them as a regularizer to boost the performance of non-CL (and CL-based) frameworks. Moreover, most non-CL frameworks aim to learn augmentation-invariant representations, whose training often leads to trivial constant solutions (i.e., collapse) [6]. DimCL naturally avoids collapse as it encourages diversity in the solution, which makes it a great complement to non-CL frameworks.

III. BACKGROUND
Conventional contrastive learning, i.e., BCL, aims to make representations similar if they come from different augmented versions of the same image and dissimilar if they come from different images. In short, it aims to produce meaningful, discriminative representations. More specifically, BCL involves query, positive, and negative samples. The considered image is called the query sample. The augmented views of the query image are called positive samples. The other images in the same batch and their augmented views are called negative samples. The loss of CL-based frameworks essentially pulls the query representation close to the positive sample representation and pushes it far apart from the negative sample representations. Mathematically, given an encoder f, an input image is augmented and encoded as a query q ∈ R^D or positive key k^+ ∈ R^D, which are often l2-normalized to avoid scale ambiguity [9], [28]. Processing a mini-batch of N images forms a set of queries Q = {q_1, q_2, ..., q_N} and positive keys K^+ = {k_1^+, k_2^+, ..., k_N^+}. Considering a query q_i, the corresponding negative keys are defined as K_i^- = {q_j}_{j≠i} ∪ {k_j^+}_{j≠i}. With similarity measured by the dot product, BCL can be achieved by the simple CL loss below [64]:

L_i = - q_i · k_i^+ + Σ_{k^- ∈ K_i^-} q_i · k^-    (1)

The gradient of L_i w.r.t. q_i is derived as:

∂L_i/∂q_i = - k_i^+ + Σ_{k^- ∈ K_i^-} k^-    (2)

The above equation treats all negative keys equally. Based on this, [64] showed that the simple loss in Eq. 1 performs poorly in practice.
The InfoNCE loss is formulated as follows [28]:

L_i = - log [ exp(q_i · k_i^+ / τ) / ( exp(q_i · k_i^+ / τ) + Σ_{k^- ∈ K_i^-} exp(q_i · k^- / τ) ) ]    (3)

with τ denoting the temperature. InfoNCE has been identified to outperform the simple loss in Eq. 1 due to its hardness-aware property, which puts more weight on optimizing hard negative pairs (where the query is close to negative keys), as shown in [64].
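As a concrete illustration, Eq. 3 can be sketched for a single query in a few lines of NumPy (a minimal sketch of ours; the function name and signature are illustrative, not from the paper's code):

```python
import numpy as np

def infonce(q, k_pos, k_negs, tau=0.1):
    """InfoNCE for one l2-normalized query q against one positive key
    k_pos and a stack of negative keys (rows of k_negs)."""
    pos = np.exp(np.dot(q, k_pos) / tau)    # exp(q . k+ / tau)
    negs = np.exp(k_negs @ q / tau)         # exp(q . k- / tau), one per negative
    return float(-np.log(pos / (pos + negs.sum())))
```

A query aligned with its positive key and orthogonal to its negatives yields a near-zero loss, while a query aligned with a negative key yields a large loss, reflecting the pull/push behavior described above.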

IV. METHODOLOGY
Dimensional Contrastive Learning (DimCL) explores a new way of using InfoNCE compared to BCL. As shown in Fig. 2, BCL aims to make meaningful, discriminative representations by applying InfoNCE along the batch direction; the keys and queries are the representation vectors. By contrast, DimCL encourages each representation element to contain a piece of distinct information so as to maximize the amount of information contained in the overall representation, i.e., feature diversity enhancement. To this end, DimCL makes the elements of the representation vector orthogonal to each other in terms of information by minimizing the empirical correlation among column vectors. A novel form of InfoNCE along the dimensional direction is proposed as the loss to achieve this objective. Therein, the corresponding queries and keys are column vectors, each of which is formed from the same-index representation elements within a batch, as highlighted in Fig. 2.
Mathematically, similar to BCL, given a mini-batch of N images, we have a set of queries G = {g_1, g_2, ..., g_D} and positive keys H^+ = {h_1^+, h_2^+, ..., h_D^+}, where each g_i, h_i^+ ∈ R^N is a column vector. Considering a query g_i, the corresponding negative keys are defined as H_i^- = {g_j}_{j≠i} ∪ {h_j^+}_{j≠i}. In order to maximize feature diversity, the considered query g_i should be orthogonal to all negative keys in H_i^-. The corresponding objective is:

g_i · h^- = 0, ∀ h^- ∈ H_i^-    (4)

Empirically, we observe that the original InfoNCE is sufficient to achieve this objective without any modification (e.g., adding the absolute value); evidence is provided in the discussion. This can be explained by considering the exp term and the effect of the temperature τ. With small τ, exp(x/τ) puts a high weight on pushing a positive value x toward zero with a correspondingly high gradient, but pays almost no attention to a negative value x of the same magnitude due to its much smaller gradient. For simplicity, we adopt the following loss as the DimCL optimization target:

L_i^DimCL = - log [ exp(g_i · h_i^+ / τ) / ( exp(g_i · h_i^+ / τ) + Σ_{h^- ∈ H_i^-} exp(g_i · h^- / τ) ) ]
(5)

Note that, in DimCL, each query g_i has a total of 2D - 2 negative keys instead of 2N - 2 as in BCL. Moreover, each column vector g, h is l2-normalized along the batch direction, instead of along the dimensional direction as in BCL. Furthermore, the proposed DimCL inherits the hardness-aware property of traditional BCL, for which we provide more detail in the discussion part.
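To make the transposition concrete, the following sketch (our own illustrative code, not the authors' implementation) applies InfoNCE along the dimensional direction: the two views are transposed so that each length-N column becomes a query, l2-normalized along the batch direction, and contrasted against its positive column plus the 2D - 2 negative columns:

```python
import numpy as np

def dimcl_loss(z1, z2, tau=0.1):
    """DimCL sketch for two (N, D) batches of representations from two
    augmented views. Columns are l2-normalized along the batch direction
    and used as queries g_i and positive keys h_i^+; the negatives are
    the remaining 2D - 2 columns of both views."""
    g = z1.T / np.linalg.norm(z1.T, axis=1, keepdims=True)  # (D, N) queries
    h = z2.T / np.linalg.norm(z2.T, axis=1, keepdims=True)  # (D, N) positive keys
    D = g.shape[0]
    sims = np.exp(g @ np.vstack([h, g]).T / tau)            # (D, 2D) similarity terms
    pos = sims[np.arange(D), np.arange(D)]                  # exp(g_i . h_i^+ / tau)
    # denominator: positive plus 2D - 2 negatives (exclude g_i itself)
    denom = sims.sum(axis=1) - sims[np.arange(D), D + np.arange(D)]
    return float(np.mean(-np.log(pos / denom)))
```

Mutually decorrelated columns keep the negative terms small and the loss near its minimum, whereas redundant (highly correlated) columns inflate the denominator and the loss.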
Contrary to BCL, which works as an independent SSL framework, DimCL serves as a regularizer to benefit existing SSL frameworks. Denoting L_BASE as the loss of the SSL baseline, DimCL can be simply assimilated into the baseline by a linear combination to form the final loss:

L = (1 - λ) L_BASE + λ L_DimCL    (6)

where λ ∈ [0, 1] is a weight factor balancing the two loss components. We perform a grid search and find that λ = 0.1 works well in most cases, and recommend this value as a starting point for more fine-grained tuning. The pseudo algorithm is provided in Algorithm 1.
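Assuming Eq. 6 takes the convex-combination form above (our reading; the exact weighting scheme may differ in the authors' code), assimilating DimCL into a baseline reduces to one line per training step:

```python
def total_loss(base_loss, dimcl, lam=0.1):
    """Final training objective (sketch of Eq. 6): a convex combination
    of the baseline SSL loss and the DimCL regularizer.
    lam = 0.1 is the recommended starting point; lam = 0 recovers the
    plain baseline."""
    return (1.0 - lam) * base_loss + lam * dimcl
```

In a training loop, `base_loss` and `dimcl` would be the two scalar losses computed on the same pair of augmented views before backpropagation.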

V. EXPERIMENTS

A. EXPERIMENT SETUP
To show its effectiveness, we evaluate DimCL by assimilating it into state-of-the-art non-CL and CL-based frameworks. Five widely used benchmark datasets are considered, including CIFAR-10 [38], CIFAR-100 [38], STL-10 [16], ImageNet-100 [57], and ImageNet-1K (1000 classes) [39]. Different encoders (ResNet-18, ResNet-50) are also considered. The performance is benchmarked with linear classification evaluation and transfer learning on object detection, following the common evaluation protocol in [12], [25], [29]. More specifically, the encoder is pre-trained in an unsupervised manner on the training set of the selected dataset without labels [39]. For linear classification evaluation, the pre-trained frozen encoder is evaluated by training an additional linear classifier and testing on the corresponding test set. For object detection evaluation, the pre-trained frozen encoder is evaluated with a Faster R-CNN detector (C4-backbone) on the object detection datasets (i.e., VOC object detection).
In this paper, the Faster R-CNN detector (C4-backbone) is fine-tuned on the VOC trainval 07+12 set with the standard 2x schedule and tested on the VOC test2007 set [29], [68]. More details regarding the two evaluation methods are provided in the Appendix.

B. IMPLEMENTATION DETAILS
For a simple implementation, DimCL directly uses the InfoNCE loss [9] but transposes the input. BCL framework implementations are based on the open library solo-learn [61]. Setups of the SSL baseline frameworks for training are described below. Image augmentations. The paper follows the settings in previous approaches [9], [25]. Concretely, a patch of the image is sampled and resized to 224 × 224. Random horizontal flips and color distortion are applied in turn. The color distortion is a random sequence of saturation, contrast, brightness, and hue adjustments, plus an optional grayscale conversion. Gaussian blur and solarization are applied to the patches at the end.
For example, on CIFAR-100 with ResNet-50, DimCL enhances the baselines MoCo v2 and SimCLR with performance boosts of +1.56% and +4.46%, respectively. A more significant performance boost can be observed for BYOL (+6.63%) and SimSiam (+11.4%). In addition, during the experiments, the BASEs are highly tuned to get their best performance, while the BASEs+DimCL are not. With a fine-tuned parameter search, a higher gain might be possible. Overall, the results indicate that DimCL is compatible with both CL and non-CL SSL frameworks, yielding a non-trivial performance gain. Furthermore, it also generalizes well across various datasets and backbones.
When evaluating the performance of DimCL under different metrics, the results suggest the same conclusion. More specifically, an experiment is conducted on CIFAR-100 with a ResNet-18 backbone. The pre-trained models of the baselines and of DimCL are evaluated on the classification task with different performance metrics: Top-1 accuracy, Top-5 accuracy, Top-1 KNN, and Top-5 KNN. The results, shown in Tab. 2, suggest that DimCL consistently improves the baseline under various performance metrics.


2) Large-scale dataset
For the large-scale dataset, ImageNet-1K is chosen, and BYOL is selected as the baseline. Due to resource constraints, BYOL and BYOL+DimCL are pre-trained for 100 epochs without labels. The results are reported in Tab. 3. They show that on the large-scale dataset, DimCL improves the BYOL baseline with a performance boost of +2.0% and outperforms all other frameworks. This performance is consistent with the results in Tab.

3) Longer Training
To demonstrate that the results are consistent between short training (200 epochs) and long training (1000 epochs), we conduct experiments on CIFAR-100 and ImageNet-100 with BYOL as the baseline framework [25]. Top-1 classification accuracies are reported in Tab. 4. We observe that DimCL also provides a consistent performance boost for long training. Specifically, incorporating DimCL significantly boosts the top-1 accuracy of BYOL from 70.54% to 71.94% (+1.4%) on CIFAR-100, and further improves BYOL from 81.24% to 82.51% on ImageNet-100. It is reasonable that the performance boost margin can be relatively smaller for long training compared to short training.
Fig. 3 shows the learning curves in two different settings: 200 epochs (a) and 1000 epochs (b). The results demonstrate that the benefit of our method does not vanish but further improves BYOL in long training. There is a high correlation in performance improvement between short and long training, indicating that the 200-epoch setting is reasonably adequate for evaluating the performance gain.

VI. ABLATION STUDY
In this section, we provide ablation for important hyperparameters of DimCL: the temperature τ , the weight factor λ, and the dimensionality D.

A. THE EFFECT OF THE TEMPERATURE τ
We monitor the changes in feature diversity and performance when assimilating DimCL into BYOL with various τ values. The experiment runs on CIFAR-100 for 200 epochs. The results in Fig. 4 suggest that selecting a reasonable τ leads to high feature diversity (and performance). τ = 1 does not lead to good feature diversity. The value of τ that yields the best performance is around 0.1, which coincides with the τ used in conventional BCL frameworks [9], [29]. There is a drop in performance when using a too large or too small τ in DimCL.

B. THE EFFECT OF THE WEIGHT FACTOR λ
The balance weight factor between DimCL and the baseline plays an important role in gaining performance. We conduct experiments with λ in Eq. 6 over the range [0, 1], keeping all other parameters unchanged.
The results in Fig. 5 show that for all λ in the range (0, 0.7), our method consistently outperforms the baseline BYOL (corresponding to λ = 0) in both measures: top-1 classification accuracy and top-1 KNN accuracy. λ = 0.1 is found to be the optimal value for boosting performance when plugging DimCL into BYOL. λ generally depends on the baseline and dataset. However, we empirically find that setting λ to 0.1 often gives the best performance for the most recent SSL frameworks and datasets. It is recommended to use this value at the beginning of the tuning process when using our DimCL regularization.

C. THE EFFECT OF THE DIMENSIONALITY D
As DimCL targets dimension-wise diversity, dimensionality is a key factor that needs to be considered. We provide ablation studies on the effects of dimensionality. Tab. 6 shows the top-1 accuracy on CIFAR-100 with 200 epochs of BYOL and BYOL+DimCL.
The results show that for small dimensionality, DimCL provides a large improvement over the baseline. For bigger dimensionality, the improvement tends to shrink. This is understandable, since DimCL aims to maximize the useful information (in other words, minimize the redundancy) contained in a low-dimensional representation. For bigger dimensionality, there is plenty of space for storing information, which reduces the importance of DimCL. It is also noticeable that for very small dimensionality (e.g., under 256), the performance starts to drop for both BYOL and BYOL+DimCL, since there is not much space left for storing information.

VII. DISCUSSION

A. FEATURE DIVERSITY ENHANCEMENT
Our proposed DimCL is motivated by enhancing feature diversity, which is defined as the independence among the elements of a representation. In other words, good feature diversity means each element of a representation should carry a piece of distinct information about the input image. In this view, feature diversity can be evaluated by considering the correlation among all pairs of negative column vectors. Given a tensor of size N × D, the feature diversity measure is defined as:

FD = 1 - (1 / |P|) Σ_{(g,h) ∈ P} |sim(g, h)|    (7)

where P is the set of all negative column-vector pairs, g, h ∈ R^N are column vectors, and sim(·) is the cosine similarity measure. The range of the feature diversity measure is [0, 1]. The optimum value of feature diversity is 1, which means all elements of the representation are mutually independent.
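The measure can be sketched as follows (our own reconstruction of Eq. 7, averaging the absolute cosine similarity over all off-diagonal column pairs of a single view):

```python
import numpy as np

def feature_diversity(z):
    """Feature diversity of an (N, D) representation batch: 1 minus the
    mean absolute cosine similarity over all distinct column pairs."""
    g = z / np.linalg.norm(z, axis=0, keepdims=True)   # l2-normalize each column
    sim = g.T @ g                                      # (D, D) cosine similarities
    D = sim.shape[0]
    off_diag = np.abs(sim[~np.eye(D, dtype=bool)])     # all negative pairs
    return float(1.0 - off_diag.mean())
```

Mutually orthogonal columns give a diversity of 1; fully redundant (identical) columns give 0.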
To demonstrate the enhancement of feature diversity, we take BYOL [25] and SimSiam [12] into account, whose encoders are designed to learn representations invariant to augmentation without considering feature diversity. We assimilate DimCL into BYOL and SimSiam, then observe the changes in feature diversity and accuracy. Results are reported in Tab. 7.
Interestingly, the BASEs generate embeddings that already have high feature diversity. Adding DimCL to the BASEs has a strong effect in further increasing the feature diversity, which translates into performance improvement. Specifically, DimCL improves feature diversity by 0.05 (5 percentage points) with a corresponding +5.49% accuracy on BYOL, and by 0.17 (17 percentage points) with a corresponding +10.82% accuracy on SimSiam. The larger the feature diversity improvement, the better the performance gain. The relation between feature diversity and performance is shown clearly in Fig. 6.
From the perspective of information theory, improving feature diversity can be related to the Information Bottleneck objective [59], which encourages a representation that conserves as much information about the sample as possible. This has been noted as beneficial in various works [2], [35], [59]. Our result is one more piece of empirical evidence for the benefit of feature diversity.

B. HARDNESS-AWARE PROPERTY IN DIMCL
The hardness-aware property plays a key role in BCL, controlling the uniformity-tolerance dilemma [65] and leading to its success. From the optimization viewpoint, the hardness-aware property puts more weight on optimizing negative pairs that have high similarities. This behavior is reminiscent of hard example mining and has proven to be effective [4], [36], [49], [62], [66], [72].
The hardness-aware property of DimCL can be understood via a gradient analysis of the loss function. Consider the loss for query g_i:

L_i^DimCL = - log [ exp(g_i · h_i^+ / τ) / ( exp(g_i · h_i^+ / τ) + Σ_j exp(g_i · h_j^- / τ) ) ]    (8)

where g and h are the l2-normalized column vectors. Define

α_i' = exp(g_i · h_i^+ / τ) / ( exp(g_i · h_i^+ / τ) + Σ_j exp(g_i · h_j^- / τ) )    (9)

which can be interpreted as the probability of g_i being recognized as the positive column vector h_i^+, and

α_j = exp(g_i · h_j^- / τ) / ( exp(g_i · h_i^+ / τ) + Σ_k exp(g_i · h_k^- / τ) )    (10)

which can be interpreted as the probability of g_i being recognized as the negative vector h_j^-. We can easily see that α_i' + Σ_j α_j = 1 and all α > 0. The gradient of L_i^DimCL w.r.t. the query g_i is then derived as:

∂L_i^DimCL/∂g_i = - (1/τ) [ (1 - α_i') h_i^+ - Σ_j α_j h_j^- ]    (11)

Eq. 11 reveals how DimCL makes the query similar to the positive key and dissimilar from the negative keys. Concretely, if g_i and h_i^+ are very close, the gradient w.r.t. g_i is very small because 1 - α_i' ≈ 0 and Σ_j α_j ≈ 0 (since α_i' + Σ_j α_j = 1). Thus, the optimizer barely updates the query g_i. By contrast, if g_i and h_j^- are very close, the weight α_j is large, encouraging the optimizer to push the query far away from the corresponding negative key.
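The hardness-aware weighting can be checked numerically: the softmax weights α appearing in the gradient assign more mass to negatives that are closer to the query (the function name and test vectors below are illustrative, not from the paper):

```python
import numpy as np

def grad_weights(g, h_pos, h_negs, tau=0.1):
    """Softmax weights from the gradient analysis: alpha' for the
    positive key and alpha_j for each negative key; they sum to 1."""
    logits = np.concatenate(([np.dot(g, h_pos)], h_negs @ g)) / tau
    a = np.exp(logits - logits.max())   # stable softmax
    a /= a.sum()
    return a[0], a[1:]                  # alpha', array of alpha_j
```

A hard negative (nearly aligned with g_i) receives a far larger α_j than an easy, near-orthogonal one, so the optimizer concentrates on decorrelating the hardest column pairs.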
Regarding the ability to treat negative keys differently, the gradient weight w.r.t. a negative key is proportional to exp(g_i · h_j^- / τ). This shows that hard column pairs, where the query g_i is close to a negative key, are penalized more through a larger α_j. In other words, the optimizer pays more attention to optimizing hard column pairs, which leads to better optimization results than treating them equally. This phenomenon is the hardness-aware property of the loss in Eq. 8. The effect of the hardness-aware property in DimCL in relation to feature diversity can be seen empirically in the ablation study (Fig. 4).

C. BEYOND CL AND NON-CL

Previous results show that DimCL is beneficial in boosting the performance of CL and non-CL frameworks by a non-trivial margin. Here, we also investigate a recent work that designed an explicit term for decorrelation, Barlow Twins (BT) [71].
We experiment by adding the correlation-reduction loss of BT to the previous baseline BYOL and comparing it against DimCL. The results in Tab. 9 show that BYOL+DimCL strongly outperforms BYOL+Barlow.

This empirical result suggests that DimCL provides better performance than BT.

D. DIMCL FOR SUPERVISED LEARNING.
Since DimCL works as a regularizer enhancing feature diversity, it is expected to benefit fields beyond self-supervised learning, e.g., supervised learning (SL). This experiment utilizes DimCL to boost SL on the CIFAR-100 and CIFAR-10 datasets. We use the solo-learn library [61] to train the supervised model with a ResNet-18 backbone [30]. DimCL is assimilated with the cross-entropy loss to train the model simultaneously. Tab. 10 shows the top-1 classification accuracy on the test set. On CIFAR-10, DimCL shows a slight improvement, while on CIFAR-100, DimCL boosts conventional supervised learning from 70.27% to 71.68% (+1.41%), demonstrating the benefit of DimCL for SL.

E. DIMCL VERSUS ABSCL
AbsCL denotes the variant of DimCL whose loss adds the absolute value to the similarity, so as to directly enforce the orthogonality objective: in order to maximize feature diversity, the considered query g_i should be orthogonal to all negative keys in H_i^-. It is important to note that even without the temperature, DimCL and AbsCL can outperform the baseline. However, to achieve the best performance, the temperature needs to be present. At the optimal τ = 0.1, the performance of DimCL is nearly the same as that of AbsCL. This phenomenon can be explained by considering the exp term and the effect of the temperature τ: with small τ, exp(x/τ) puts a high weight on pushing a positive value x toward zero with a correspondingly high gradient, but pays almost no attention to a negative value x of the same magnitude due to its much smaller gradient.
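The asymmetry of exp(x/τ) that makes the absolute value unnecessary is easy to verify numerically: at τ = 0.1, the gradient magnitude of exp(x/τ) at a positive similarity x = 0.5 dwarfs that at x = -0.5 (the values below are illustrative):

```python
import numpy as np

tau = 0.1
# d/dx exp(x / tau) = exp(x / tau) / tau, evaluated at x = +0.5 and x = -0.5
grad_at_pos = np.exp(0.5 / tau) / tau    # steep: positive similarities are penalized hard
grad_at_neg = np.exp(-0.5 / tau) / tau   # nearly flat: negative similarities are ignored
ratio = grad_at_pos / grad_at_neg        # e^{1/tau} = e^{10}, roughly 2.2e4
```

So with a small temperature, InfoNCE already pushes positive similarities toward zero while leaving negative similarities almost untouched, mimicking the effect of the absolute value.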

F. VISUALIZATION OF REPRESENTATION.
Visualization of representations via t-SNE is reported to see the effect of DimCL on the representation space. Fig. 7 and Fig. 8 show the representations of the BYOL baseline and of our method in 2D space. The experiment is conducted on CIFAR-10 with 10 classes. The results clearly show that our method (Fig. 8) gives more separable representations. More specifically, airplane, automobile, ship, and truck are almost separable from each other and also from the animal classes. All classes are scattered in more compact clusters compared to the baseline (Fig. 7).
To quantitatively show the difference between the two representation spaces, the intra-class distance and inter-class distance [56] are calculated and provided in Tab. 12.

FIGURE 2: The difference between (a) Batch Contrastive Learning (BCL) and (b) Dimensional Contrastive Learning (DimCL). BCL performs CL along the batch direction to encourage representation diversity, whereas DimCL performs CL along the dimensional direction to encourage feature diversity. N is the batch size, and D is the feature dimension.

FIGURE 3: Top-1 classification accuracy learning curves on the test set of CIFAR-100 for (a) 200 epochs and (b) 1000 epochs. The figure shows consistent results between long and short training. Note that at the same epoch, the top-1 accuracy of the two settings is not necessarily the same due to the cosine learning rate scheduler.

FIGURE 4: Feature diversity (a) and performance (b) with respect to τ on the test set of CIFAR-100. Our hypothesis emphasizes the importance of increasing the feature diversity, or decreasing the correlation, to remove the residual information of the feature representation.

FIGURE 5: Top-1 classification accuracy and top-1 KNN accuracy with respect to λ on the test set of CIFAR-100. Note that the performance at λ = 0, corresponding to the baseline BYOL, is much lower than when incorporating BYOL with our loss.

FIGURE 6: Relation between feature diversity and performance during training on CIFAR-100. (a) Top-1 test classification accuracy. (b) The corresponding feature diversity. Higher feature diversity leads to higher performance.

FIGURE 7: t-SNE plot of the ten CIFAR-10 classes for the BYOL baseline trained for 200 epochs (accuracy = 88.51%), using the 10,000 samples of the test set.

FIGURE 8: t-SNE plot of the ten CIFAR-10 classes for BYOL + DimCL trained for 200 epochs (accuracy = 90.57%), using the 10,000 samples of the test set.

TABLE 1: Top-1 classification test accuracy (%) of the BASEs (the baseline frameworks) and BASE+DimCL (the baseline with DimCL regularization) across various datasets and backbones. All models are trained for 200 epochs. Classification is performed with a linear classifier trained on top of the frozen pre-trained encoder (output of the evaluated framework). "*" denotes an improved version of MoCo v2 with symmetric loss.

TABLE 2: Performance evaluated with different metrics. The methods are trained on the CIFAR-100 dataset for 200 epochs with ResNet-18 as the backbone. Classification is performed with a linear classifier trained on top of the frozen pre-trained encoder. The test accuracy is reported with various performance metrics: Top-1 accuracy, Top-5 accuracy, Top-1 KNN, and Top-5 KNN.

TABLE 3: ImageNet-1K classification. All frameworks are trained without labels on the training set for 100 epochs.

TABLE 4: Training with 1000 epochs. Linear classification accuracy (%) on the test set of CIFAR-100. All models are pre-trained on the training set without labels before evaluation. Note that MoCo v2+ is the improved version of MoCo v2 with symmetric loss [17].

TABLE 6: The effects of dimensionality on DimCL. The table shows the top-1 accuracy on CIFAR-100 with 200 epochs for BYOL and BYOL+DimCL.

TABLE 9: Comparison between DimCL and Barlow Twins on top of the BYOL baseline. Models are trained for 200 epochs with ResNet-18 on CIFAR-100. We report top-1 linear classification accuracy (%).

TABLE 7: Comparison of feature diversity and performance on the CIFAR-100 dataset for both BASE (baseline) and +DimCL (baseline with DimCL regularization). All frameworks are pre-trained for 200 epochs with the ResNet-18 backbone.

Furthermore, as shown in Tab. 8, when incorporated into BT, DimCL can also improve BT.

TABLE 10: DimCL for improving supervised learning. Models are trained for 200 epochs with ResNet-18 and ResNet-50 backbones on the 4 datasets. We report top-1 classification accuracy (%).

TABLE 11: DimCL versus AbsCL. We report top-1 linear test accuracy (%) on CIFAR-10 and CIFAR-100. All methods are trained for 200 epochs. At τ = 0.1, both DimCL and AbsCL perform best, and their performance is almost identical.

Empirically, Tab. 11 shows that the original InfoNCE is sufficient to achieve the objective without any modification (e.g., adding the absolute value).

TABLE 12: Intra-class distance and inter-class distance on the CIFAR-10 test set.

The quantitative result agrees that BYOL+DimCL forms more compact clusters while maintaining a higher separation among different clusters compared to BYOL.

VIII. CONCLUSION
This paper introduces Dimensional Contrastive Learning (DimCL), a new way of applying CL. DimCL works as a regularizer that can be assimilated into non-CL (and CL) based frameworks to boost performance on downstream tasks such as classification and object detection. DimCL enhances feature diversity among elements within a representation, and shows high compatibility and generalization across datasets, frameworks, and backbone architectures. We believe that feature diversity is a key, indispensable ingredient for learning representations. This paper focuses on images and provides mostly empirical evidence, but DimCL could be generalized to other modalities (e.g., audio, video, text) and supported with theoretical results. We leave this for future work.