Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students



Abstract:

The popular methods for semi-supervised semantic segmentation mostly adopt a unitary network model based on convolutional neural networks (CNNs) and enforce consistency of the model's predictions over perturbations applied to the inputs or the model. However, such a learning paradigm suffers from two critical limitations: a) difficulty in learning discriminative features for the unlabeled data; b) difficulty in learning both global and local information from the whole image. In this paper, we propose a novel semi-supervised learning (SSL) approach, called Transformer-CNN Cohort (TCC), that consists of two students: one based on a vision transformer (ViT) and the other based on a CNN. Our method subtly incorporates multi-level consistency regularization on the predictions and the heterogeneous feature spaces via pseudo-labeling for the unlabeled data. First, as the inputs of the ViT student are image patches, the extracted feature maps encode crucial class-wise statistics. Accordingly, we propose class-aware feature consistency distillation (CFCD), which leverages the outputs of each student as pseudo labels and generates class-aware feature (CF) maps for knowledge transfer between the two students. Second, as the ViT student has more uniform representations across all layers, we propose consistency-aware cross distillation (CCD) to transfer knowledge between the pixel-wise predictions of the cohort. We validate the TCC framework on the Cityscapes and Pascal VOC 2012 datasets, where it outperforms existing SSL methods by a large margin. Project page: https://vlislab22.github.io/TCC/.
Date of Conference: 13-17 May 2024
Date Added to IEEE Xplore: 08 August 2024
Conference Location: Yokohama, Japan


I. INTRODUCTION

Semantic segmentation [1], [2] is a crucial scene-understanding task in computer and robotic vision, aiming to generate pixel-wise category predictions for an image. Most state-of-the-art (SoTA) methods focus on exploring the potential of convolutional neural networks (CNNs) and learning strategies [3], [4]. However, a hurdle in training these models is the lack of large-scale, high-quality annotated datasets, which imposes a heavy burden on real applications, e.g., autonomous driving [5]. Consequently, growing attention has been paid to deep semi-supervised learning (SSL) for semantic segmentation [6], which exploits labeled data together with additional unlabeled data.
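
To make the cohort's two consistency terms concrete, below is a minimal PyTorch-style sketch, written under stated assumptions rather than as the authors' implementation: ccd_loss lets each student's hard pseudo labels supervise the other's pixel-wise predictions, while cfcd_loss pools per-class feature prototypes under those pseudo labels and aligns them across students. The function names, the mean-pooling used to form the class-aware features, the equal loss weighting, and the assumption that both students project their features to a shared channel dimension are all illustrative choices.

import torch
import torch.nn.functional as F

def ccd_loss(logits_vit, logits_cnn):
    # Consistency-aware cross distillation (sketch): each student's hard
    # pseudo labels supervise the other's pixel-wise predictions.
    pseudo_vit = logits_vit.argmax(dim=1)  # [B, H, W]
    pseudo_cnn = logits_cnn.argmax(dim=1)
    return (F.cross_entropy(logits_vit, pseudo_cnn)
            + F.cross_entropy(logits_cnn, pseudo_vit))

def class_prototypes(feats, pseudo, num_classes):
    # Pool features into per-class prototypes (class-aware features) by
    # mean-pooling over the regions each class occupies in the pseudo labels.
    B, D, H, W = feats.shape
    pseudo = F.interpolate(pseudo.unsqueeze(1).float(),
                           size=(H, W), mode="nearest").squeeze(1).long()
    protos = []
    for c in range(num_classes):
        mask = (pseudo == c).unsqueeze(1).float()    # [B, 1, H, W]
        denom = mask.sum(dim=(2, 3)).clamp(min=1.0)  # avoid div-by-zero for absent classes
        protos.append((feats * mask).sum(dim=(2, 3)) / denom)
    return torch.stack(protos, dim=1)                # [B, num_classes, D]

def cfcd_loss(feats_vit, feats_cnn, pseudo_vit, pseudo_cnn, num_classes):
    # Class-aware feature consistency distillation (sketch): align each
    # student's class prototypes built from the other student's pseudo labels.
    protos_vit = class_prototypes(feats_vit, pseudo_cnn, num_classes)
    protos_cnn = class_prototypes(feats_cnn, pseudo_vit, num_classes)
    return (F.mse_loss(protos_vit, protos_cnn.detach())
            + F.mse_loss(protos_cnn, protos_vit.detach()))

# Toy usage: random tensors stand in for the two students' logits and features.
B, C, D, H, W = 2, 21, 64, 32, 32
logits_vit, logits_cnn = torch.randn(B, C, H, W), torch.randn(B, C, H, W)
feats_vit, feats_cnn = torch.randn(B, D, H, W), torch.randn(B, D, H, W)
unsup_loss = ccd_loss(logits_vit, logits_cnn) + cfcd_loss(
    feats_vit, feats_cnn, logits_vit.argmax(1), logits_cnn.argmax(1), C)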
