Introduction
Large-scale deep learning models require correspondingly large labeled datasets in order to learn. This is a clear limitation for deep learning in general, since manual annotation is a very time-consuming and costly task. It was therefore a great breakthrough when it was discovered that models can benefit from unlabeled data, by using them to design and solve pretext tasks in which the sample itself serves as the label.
This concept resembles a real-world analog: the learning that takes place in an infant's brain. Supervision from experts (adults) is responsible for only a small percentage of the knowledge acquired throughout a human's lifespan, and it mostly targets specific material rather than the perception of the real world at large. Simply by observing the behavior of other people and surrounding objects, children gain an intuitive understanding of their environment. While we have fairly successfully replicated a form of supervised learning, self-supervision is an area that still lags behind, despite theoretically offering the most advantages.
Naturally, the data needs to be diverse and of sufficient size for this process to yield any benefits. Fortunately, by leveraging the ocean of unlabeled data freely available on the web, researchers have managed to significantly boost the performance of their models. Great examples of this concept can be found in both the language [1] and image [2], [3] domains, showing great promise for a variety of potential applications in other domains.
The most intuitive and (arguably) popular pretext task applied both in vision and language is the so-called “fill in the gaps”. It refers to tasks where input samples are truncated to occlude pieces of information or corrupted by applying various types of perturbations and noise. The deep learning model is given the damaged sample as input and is tasked to reconstruct/complete the original. During this process, the model learns features related to the sample’s class. These features are incredibly useful and can be leveraged to improve the performance of the model in downstream supervised tasks, involving potentially much smaller datasets.
As this approach became more common, more sophisticated techniques of self-supervision were devised [4], [5], [6]. However, few studies have actually delved into what it is that the models learn [7], [8], [9], and even fewer have examined point clouds in particular [10], [11]. Point cloud models differ significantly from image-based ones. As opposed to an image's canonical grid structure, uniform density, and general data availability, point clouds are highly irregular, highly inconsistent in terms of density, and generally scarce compared to images. Additionally, contrary to an image's fixed pixel positions, point clouds are unordered, often differ in cardinality, and require that their handling be invariant to permutations. Due to the challenges that arise in the domain shift from language/image to point clouds, we find that this particular area is lacking.
In this study we adopt the standard transformer as our baseline model. By “standard transformers” we refer to models that follow the architecture presented in ViT [12], as opposed to architectures that use transformer-like blocks. It is a highly versatile, strong architecture that has demonstrated incredible results in both language and vision, overtaking previous state-of-the-art models [13]. However, transformer-based works on point clouds do not reflect this success. In this work we explore various aspects of this architecture, in order to better understand its inner workings and find ways to improve its performance. We argue that our findings will be of further use in future works, since they are targeted at a widely used, general-purpose architecture rather than a specialized one.
To summarize, our contributions are twofold:
We propose strategic unfreezing, a finetuning strategy that retains the properties of the backbone, learned through pretraining, while increasing the accuracy of our baseline both in ModelNet40 (+0.5%) and ScanObjectNN (+0.86%, +1.73%, +0.07%), and achieving state-of-the-art results among transformer models. (Sec. IV)
We adjust explainability tools from the image and NLP domain to work with point clouds, with the aim of understanding the inner workings of the transformer and the effect of pretraining. Interestingly, we uncover that with more data, the transformer seems to learn the inductive bias of convolution, to attend locally. (Sec. V)
Related Work
A. Deep Learning on Point Clouds
The domain shift from 2D to 3D came with several extra challenges, due to the lack of grid structure and uneven density of point clouds. Early attempts involved the application of image methods as-is, using multi-view images [14], [15] or voxels [16], [17]. Not long after that, specialized architectures emerged. PointNet [18] pioneered point-wise MLPs and pooling for extracting global features. PointNet++ [19] later followed a multi-scale approach by incorporating neighborhood information. Following this paradigm, other works create more complex kernels, taking advantage of geometric priors [20], [21], [22], [23], [24].
More recently, transformer variants have been successfully applied to the point cloud domain. Originally introduced in [26], transformers were quickly established as the go-to choice in NLP. Dosovitskiy et al. [12] later introduced the vision transformer, operating on image data. This was accomplished by tokenizing the input into patches and incorporating spatial information through a positional embedding. This approach also allowed data from different modalities to be used jointly [27], [28], making the transformer the most prominent architecture in that field. Point Transformer [29] is among the first works to apply a transformer-like architecture to point cloud data. They apply intra-neighborhood attention to create a patch feature vector and use pooling operations to downsample the point cloud.
In PCT [30], they extract point embeddings through a downsampling network and apply a series of single attention transformer layers to the remaining points. The feature maps of these layers are concatenated, and global features are extracted through mean and max pooling. In PVT [31], the authors create specialized attention modules for both points and voxels, utilizing the complementary information they contain to extract better features.
Point-BERT [32] tries to generalize the plain transformer, as used in both the image and NLP domains. They train a ‘tokenizer’ autoencoding network, based on DGCNN [33], to map point neighborhoods into feature vectors. During pretraining, a transformer encoder-decoder architecture receives masked point clouds as input and tries to reconstruct the embeddings of the ‘tokenizer’. Additionally, the classification token is trained through a contrastive loss as in [4]. Point-MAE [25] and Point-Multiscale MAE (M2AE) [34] both utilize masked autoencoding to pretrain their transformer backbones, by directly reconstructing the actual points of the masked neighborhoods. Point-M2AE uses a pyramid-like backbone that gradually downsamples the input point cloud, obtaining multi-scale features. On the other hand, Point-MAE uses the same architecture as Point-BERT, that is, the plain transformer.
While hierarchical architectures such as M2AE are typically associated with slightly better performance, they require parameter tuning when transferred to other domains or scaled to larger datasets. Since the main scope of this work is explainability, the simplicity and cross-domain applicability of MAE are more valuable. It can be trivially scaled by adding or removing blocks and provides a fair ground for comparisons with image and NLP. Therefore, for our baseline model, we choose to adopt the transformer presented in Point-MAE [25].
B. Self Supervised Learning
The idea of making use of large amounts of unlabeled data to improve a learning model is not new; rather, it has been around for decades [35], [36], [37], [38]. However, its popularity did not surge until the mid-2010s [2], [3], [39], when the availability of unlabeled data at scale started to increase and hardware could finally keep up with the challenge. Of particular importance is masked autoencoding, pioneered by [1], which revolutionized language modeling. This paradigm was later followed by other works, such as [40], [41], and [42], that further scaled up in terms of parameter and dataset size, achieving state-of-the-art results in the few-shot and zero-shot settings. In [43] they experiment with pretraining on synthetic data, artificially injected with desirable properties.
Another line of work that has been gaining popularity recently is contrastive learning [4], [5], [44], [45], [46], [47], [48] as a way to learn meaningful data representations. It is accomplished by generating positive and negative sample pairs and training the model to pull them together or push them apart, respectively. The pairs are generated through data augmentations: perturbations, crops, and transformations. A similar idea, dubbed contrastive clustering, is applied to groups of samples instead of pairs [6], [49]. Reference [6] in particular eases the computational load by introducing learnable cluster centers. In [50], the authors eliminate the need for negative pairs by training two networks to produce matching feature vectors for two different views of the same sample.
Despite the scarcity of point cloud data compared to image and language data, and the corresponding need for good self-supervision techniques in this domain, the topic has only recently begun attracting attention. Reference [51] proposes to split point clouds into parts and use these parts to parameterize gaussian mixture models. The model is then trained by using a loss that resembles likelihood maximization. In [52] the authors perform patch-level contrastive learning. By rotating a query patch they form positive pairs, while negative pairs are formed by taking into account any other patch of the shape. A similar approach is followed in [53], finding positive and negative pairs by using an inductive model to perform pseudolabeling. In [54], a similar approach to masked autoencoding is followed, that is, completion of occluded point clouds acquired by taking custom viewpoints through virtual cameras. Reference [55] argues that performing pretext tasks on single object point clouds might have limited benefits in real world applications. They instead sample positive and negative pairs from complex scene scans, hoping to get a better estimate of the target distributions, and apply a contrastive loss to train their model.
C. Explainability in Deep Learning
An extremely important field in deep learning research is explainability. Deep models are generally seen as black boxes and any attempt at designing new or improving existing models is usually empirical. Nevertheless, there are several tools one can use in order to better understand a model’s behavior. In [56] and [57] the authors present CKA, a similarity measure between feature representations of two arbitrary neural networks. Reference [7] utilizes this tool and provides valuable insights into the differences between how CNNs and transformers learn, while [8] tries to compare the representations of networks with different depths and feature map sizes. Other works [9], [11] utilize gradient-based methods in order to visualize the receptive fields or the relevancy of input patches towards the model’s decision. We use a combination of all the above tools to attain a complete picture of our model and accompanying pretraining scheme.
Baseline Method, Tools, and Dataset
In this section, we briefly overview our baseline model and pretraining setup. After training and finetuning, we assess the quality of the learned features both qualitatively and quantitatively, by using explainability tools and measuring the accuracy in the classification task. Further details regarding the model and the training process can be found in the appendix. We utilize PyTorch Lightning for our codebase, in favor of clarity, transparency, and reproducibility. The code is publicly available on GitHub.
A. Masked AutoEncoders
For our baseline model, we adopt the vision transformer trained with masked autoencoding, as presented by Pang et al. [25], which we restate here for the sake of completeness. The transformer comprises 12 blocks with 6 attention heads per block. The input point cloud is split into N possibly overlapping patches, 60% of which are masked out. The remaining patches are embedded into feature space via a small PointNet-like [18] network, and the patch centroids are used to generate positional encodings. The two are added to form the input to our transformer, which outputs a feature vector for each patch. At this stage, masked patches are assigned a mask token and a positional encoding and are concatenated with the rest before being fed into the decoder. The decoder reconstructs the masked patches, and a reconstruction loss based on the Chamfer distance is applied patch-wise.
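A minimal sketch of this pipeline is given below, assuming pre-grouped patches and their centroids as input; the embedding networks, encoder, decoder, and dimensions are simplified stand-ins rather than the exact Point-MAE modules.

import torch
import torch.nn as nn

def chamfer(a, b):
    # a, b: (M, P, 3) point patches; symmetric Chamfer distance
    d = torch.cdist(a, b)                                   # (M, P, P)
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

class TinyPointMAE(nn.Module):
    def __init__(self, dim=384, pts_per_patch=32, mask_ratio=0.6):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Sequential(nn.Linear(pts_per_patch * 3, dim), nn.GELU(), nn.Linear(dim, dim))
        self.pos_embed = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True), num_layers=12)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True), num_layers=4)
        self.head = nn.Linear(dim, pts_per_patch * 3)        # reconstruct masked patch points

    def forward(self, patches, centers):
        # patches: (B, N, pts_per_patch, 3), centers: (B, N, 3)
        B, N = patches.shape[:2]
        n_mask = int(N * self.mask_ratio)
        perm = torch.rand(B, N, device=patches.device).argsort(dim=1)
        vis_idx, mask_idx = perm[:, n_mask:], perm[:, :n_mask]

        pos = self.pos_embed(centers)                        # (B, N, D)
        tok = self.patch_embed(patches.flatten(2)) + pos     # (B, N, D)
        vis = torch.gather(tok, 1, vis_idx[..., None].expand(-1, -1, tok.size(-1)))
        enc = self.encoder(vis)                              # encoder sees visible tokens only

        # Masked patches enter the decoder as mask tokens plus their positional encodings.
        mask_pos = torch.gather(pos, 1, mask_idx[..., None].expand(-1, -1, pos.size(-1)))
        dec_in = torch.cat([enc, self.mask_token.expand(B, n_mask, -1) + mask_pos], dim=1)
        rec = self.head(self.decoder(dec_in)[:, -n_mask:]).view(B, n_mask, -1, 3)

        gt = torch.gather(patches, 1, mask_idx[..., None, None].expand(-1, -1, *patches.shape[2:]))
        return chamfer(rec.flatten(0, 1), gt.flatten(0, 1))

During finetuning, only the embedding networks and the encoder are kept as the backbone, while the decoder is discarded.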
B. Explainability Tools
For explainability, one of the main points of interest in this work, we utilize a variety of tools. The Centered Kernel Alignment (CKA) criterion offers a versatile way of comparing feature representations between two models across a dataset, and its importance to our work cannot be overstated. It provides insights into the differences between models that have been pretrained and finetuned versus only finetuned, as well as models that have been pretrained using different datasets and/or strategies. We omit the math behind CKA and refer the interested reader to [56] and [57]. Attention visualization is another commonly used tool that helps us understand how the model correlates the point patches based on the extracted geometric features. Finally, we compute and visualize the receptive fields to better understand the information flow within the network.
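As an illustration, the linear variant of CKA between two feature matrices extracted from the same samples can be computed as follows (a simplified sketch; see [56], [57] for the full formulation and the kernel variant):

import torch

def linear_cka(x, y):
    # x: (n_samples, d1), y: (n_samples, d2) -- features of the same samples from two layers/models
    x = x - x.mean(dim=0, keepdim=True)          # center each representation
    y = y - y.mean(dim=0, keepdim=True)
    hsic = (x.T @ y).norm() ** 2                 # ||Y^T X||_F^2
    return hsic / ((x.T @ x).norm() * (y.T @ y).norm())

Computing this value for every pair of layers over the validation set yields similarity maps such as those shown later in Figures 3 and 10.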
C. Dataset
As opposed to their image counterparts, point cloud datasets are generally fewer and significantly smaller in size. Additionally, there are a lot of impactful differences between point cloud datasets, including sampling density, scanning device-specific artifacts, and shape variations in general (real-world objects, CAD models). As a result, a model trained on a specific dataset may not generalize to other data distributions. Our first goal towards explainability is to see how the amount of training data might affect the network’s performance. To this end, we concatenate the commonly used ShapeNet [58] with the more obscure CC3D [59] (41k and 43k training samples, respectively). CC3D contains high resolution, single-object, fine-grained CAD shapes from arbitrary categories. To study the effects of the increased volume of data, we conduct pretraining experiments using only ShapeNet (S) and the concatenated dataset (C+S). We compare the two by finetuning for classification in ModelNet40 [60] and ScanObjectNN [61] and report the results in Tables 1, 2. In both cases, our proposed dataset helps achieve better accuracy. For the rest of the paper, we will be using C+S for pretraining, unless otherwise specified.
Strategic Unfreezing
There are several ways to approach the finetuning process of a pretrained model. Most commonly, the pretrained backbone is frozen, that is, gradient propagation is allowed but no weight updates happen, and a smaller classification (or other) head is trained to map the learned features to logits. This method is based on the assumption that the backbone has already learned a robust representation of the data and is capable of separating the class clusters effectively. However, the learned data representation often differs significantly from the data used in the downstream tasks. A great example of this is pretraining on synthetic data and using the model in real-world applications.
Another approach is to perform task-specific adjustments to the model by training both the backbone and the added head end-to-end using a small learning rate. However, this approach comes with an important caveat: the random initialization of the classification head's weights can cause the weights of the backbone to be perturbed in an unintended direction, effectively destroying the learned features and representations.
In order to properly evaluate these approaches, we perform the following comparative experiments.
We test the first method by measuring the accuracies in ModelNet40 classification (Figure 2). We see that when keeping the backbone frozen and relying only on the classification head, the accuracy is vastly inferior to that obtained when the backbone is unfrozen at various points. This is because no task-specific knowledge has been incorporated into the backbone.
In Figure 3(a), we test the second method. We compare the pretrained backbone that has been finetuned without freezing and the pretrained backbone without finetuning. It is evident that there is hardly any similarity between them, except for the very early layers. This arguably defeats the purpose of pretraining, since very few of the valuable properties are retained in the final network. This deficiency is also reflected in the final accuracy score, as seen in Figure 2.
Finally, in Figure 2 we note the accuracy of the model that has not been pretrained. It exhibits lower accuracy than all of the pretrained models, demonstrating the effectiveness of pretraining.
Graphical description of the two pretraining pipelines studied in this paper, namely Masked-AutoEncoding (MAE) [25] and Momentum Contrast (MoCo) [4]. In simple terms, MAE trains an autoencoder to reconstruct a shape with missing parts, whereas MoCo trains two networks (Student/Teacher) to generate approximately equal predictions for different augmentations of a data sample.
Comparison of different unfreezing points for the backbone of the transformer model. Unfreezing the model too early or too late can result in suboptimal results, as the network may either ’forget’ the features learned through pretraining or fail to acquire task-specific knowledge, respectively.
CKA comparison of the pretrained backbone (y-axis) with versions that have been finetuned for 300 epochs, unfreezing the backbone on various epochs (x-axis). The first and second blocks indicate the positional and feature embedding extractors, while the rest of the blocks correspond to the outputs of the attention layers. High values indicate high similarity between feature representations. The later the network is unfrozen, the higher the similarity with the pretrained backbone, retaining the properties learned through pretraining. In the case of (a), as done in [25], the final network has very little similarity with the pretrained backbone, thereby nullifying the effectiveness of pretraining.
Before diving into what the model actually learns (section VI), we first propose a two-stage training scheme that retains the pretrained network’s characteristics and obtains superior performance. First, we train the classification head for the majority of epochs, ensuring that it has learned to separate the clusters formed via the backbone. Finally, we jointly train both of them for the remainder of the epochs. In order to figure out the appropriate step to unfreeze the backbone, we perform several experiments and compare the feature representations with the pretrained network each time (Figure 3). Additionally, we measure the best and the convergence accuracy for each of these models in the classification task on ModelNet40 (Figure 2). It is evident that the earlier the backbone is unfrozen, the more the pretrained backbone’s features are distorted, and the final accuracy is compromised.
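As a concrete illustration, the two-stage scheme can be expressed as a training callback; the sketch below uses PyTorch Lightning, on which our codebase is built, but the `backbone` attribute name and the default unfreeze epoch are illustrative placeholders rather than the exact ones in our code.

import pytorch_lightning as pl

class StrategicUnfreeze(pl.Callback):
    # Stage 1: the backbone stays frozen and only the classification head trains.
    # Stage 2: from `unfreeze_epoch` onward, backbone and head train jointly.
    def __init__(self, unfreeze_epoch: int = 250):
        super().__init__()
        self.unfreeze_epoch = unfreeze_epoch

    def on_fit_start(self, trainer, pl_module):
        for p in pl_module.backbone.parameters():        # assumes a `backbone` attribute
            p.requires_grad = False

    def on_train_epoch_start(self, trainer, pl_module):
        if trainer.current_epoch == self.unfreeze_epoch:
            for p in pl_module.backbone.parameters():
                p.requires_grad = True

A callback of this form is simply passed to the Lightning Trainer used for finetuning, e.g. pl.Trainer(max_epochs=300, callbacks=[StrategicUnfreeze(250)]).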
In order to identify the proper unfreezing epoch, one must consider the data and the task. For ModelNet40 classification, whose data is essentially a subset of ShapeNet, similar samples were seen by the model during pretraining. As a result, unfreezing during the late stages is appropriate. On the other hand, when finetuning on ScanObjectNN, which consists of real scanned objects, it is wiser to unfreeze earlier so that the backbone can become accustomed to the new data distribution. Finally, when finetuning on the hardest variant of ScanObjectNN, which includes additional modifications such as rotations, we found that the most effective approach is to perform an extra domain adaptation pretraining step before finetuning. The results of this step can be seen in Table 2 in the SoNN entry. More details are presented in the Appendix.
Designing and training a model is a difficult and iterative process, that is based heavily on parameter tuning. We believe that with our method researchers will be able to make more informed decisions that will allow them to utilize their models to their maximum capacity.
Explainability Study
A. Attention Visualization
In this section, we use the backbone that has been pretrained with MAE and finetuned with the strategy outlined in the previous section. This is because it retains the useful properties acquired through pretraining, but also has a trained classification token, as opposed to [25], which uses max-average pooling. By visualizing the attention matrices at each block we can obtain meaningful information regarding how the model correlates various parts of the shape. This information can help identify patterns that indicate whether the pretraining procedure was successful, such as semantic correlation, symmetry, and locality.
We observe that in all input shapes, the first attention block is always global, while the rest gradually shift from attending to specific regions to attending globally. Sharp geometric features in particular attract high attention scores in most layers (Figure 4). This behavior is quantitatively verified through attention distances in the following section.
Attention Visualization of the classification token for each block (1–12), averaged across heads (brighter = higher score). Although the classification token’s attention score towards itself cannot be visualized, it has the highest value in all cases. This score is included in the normalization process, so that the relative scale between them is visible.
It is important to note that finetuned models that have not been pretrained do not share this trait. As can be seen in Figure 5, although the classification token attends to locations with valid semantic meaning, the behavior is erratic and repetitive, and there is no clear transition from specific areas to the whole shape.
Attention Visualization of the classification token for each block (1–12), averaged across heads, for a finetuned model without pretraining. It is evident that the locations it attends to do not follow any recognizable pattern. In each layer, the attention is focused on a specific location, which is often the same between layers (e.g., the head in layers 0, 5, 7, 10).
B. Attention Distance
Following the paradigm of [7], we study the sorted attention distance. The mean Attention Distance (mAD) of head $h$ in layer $l$ is defined as
\begin{equation*} mAD_{l,h} = \frac{1}{N^{2}} \sum_{ij} \left(\mathbf{A}_{l,h} \odot \mathbf{D}\right)_{ij}, \end{equation*}
where $\mathbf{A}_{l,h}$ is the attention matrix of the head, $\mathbf{D}$ contains the pairwise Euclidean distances between patch centroids, and $N$ is the number of patches.
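In practice, this quantity is computed per layer directly from the attention maps and the patch centroids; the following is a minimal sketch of that computation (averaging over the validation set is omitted):

import torch

def mean_attention_distance(attn, centers):
    # attn: (H, N, N) attention weights of one layer (one matrix per head)
    # centers: (N, 3) patch centroids used for the positional encodings
    N = centers.shape[0]
    dist = torch.cdist(centers, centers)             # (N, N) pairwise distance matrix D
    return (attn * dist).sum(dim=(-2, -1)) / N**2    # (H,) mAD per head

Sorting the per-head values in ascending order gives curves such as those in Figure 6.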
Based on the graphs in Figure 6, it can be inferred that with the increase in the amount of training data, the earlier layers tend to focus more locally with certain attention heads, while others still maintain a global focus. This observation suggests that the transformer network may begin to learn the inductive bias of the convolution, that is, attending to local features, while still retaining the capability to aggregate global information, thereby resulting in more comprehensive representations. This mitigates the general struggle of self-attention to model local relationships [64], without requiring any modifications to the architecture.
Attention Distance of the attention heads (distances are sorted in ascending order for clarity). The distances are averaged across the entire validation set of ModelNet40. We see that as the amount of data increases, the early layers of the network begin to attend locally, while higher layers incorporate global information, exhibiting similar behaviour to ViT in the image domain [12]. Notably, a key difference is observed in the very first layer, where all heads consistently attend globally.
Interestingly, our findings are consistent with those of [7], [65], and [66], indicating that transformers may exhibit akin behaviors across domains. A noticeable difference is that in the case of ShapeNet (Figure 6b) the first two layers seem to attend globally with all of their attention heads, whereas when CC3D is added to the pretraining data, this phenomenon is only observed in the first layer (Figure 6c). We assume that the network requires a rough understanding of the entire shape before opting to extract local features. This behavior should be studied further if point cloud datasets at large scale become available in the future.
C. Effective Receptive Fields
We have established that by using our unfreezing strategy, the final network shares many similarities with the pretrained network. In order to understand their differences (trained for pretext versus downstream tasks), we visualize the effective receptive fields [7]. We accomplish this by selecting a patch and propagating the gradients backwards from its feature vector to its embedding, including both feature and positional information. We then visualize the norm of said gradients and present the results in Figure 7.
Effective receptive fields of the patch of interest (colored in red). We compute the gradient of the feature representations generated by the last layer with respect to the input patches. Each point is colored based on the magnitude of the gradient of its corresponding patch. In the right variant, which is finetuned for classification, the model retrieves information from the entire shape, as it tries to incorporate class information. Conversely, the left variant, which is the pretrained backbone, focuses mostly on local and symmetric parts, aligning with the task it was trained for, that is, reconstruction.
We notice that the pretrained model, trained explicitly for reconstruction, looks strictly at neighboring as well as semantically similar patches, since they contain the most relevant information. These areas have the highest impact on the finetuned model as well, but its receptive field is noticeably wider. We believe this is because the classification task requires global features.
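A minimal sketch of the underlying computation is shown below; `embed` and `backbone` are placeholders for the combined feature/positional embedding stage and the transformer encoder, assumed to produce and consume tensors of shape (1, N, D).

import torch

def receptive_field(backbone, embed, patches, centers, query_idx):
    # Embed the patches (feature + positional information) and keep their gradients;
    # tokens requires grad because the embedding networks have trainable weights.
    tokens = embed(patches, centers)               # (1, N, D)
    tokens.retain_grad()
    features = backbone(tokens)                    # (1, N, D) output feature vectors
    # Backpropagate from the chosen patch's feature vector to all input embeddings.
    features[0, query_idx].norm().backward()
    return tokens.grad.norm(dim=-1).squeeze(0)     # (N,) gradient magnitude per input patch

Visualizing these magnitudes for a chosen query patch produces maps such as those in Figure 7.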
Contrastive Learning
Having delved into the inner workings of masked autoencoding and the properties associated with it as a pretraining scheme, we need to evaluate its effectiveness against other pretraining methods in the literature. We choose what we believe to be the next most popular one, contrastive learning. Impressed by its performance in the image domain, we follow the approach of Momentum Contrast (MoCo) [4], which we extend to work with 3D data.
In MoCo, two copies of the network are trained simultaneously, one through backpropagation and the other using a momentum update rule. Positive pairs are created by feeding crops of the same shape with different augmentations to these networks, while previous activations of the momentum network act as negative pairs. We follow the same strategy and create our crops by using a mix of gaussian noise, anisotropic scaling, rotations, and random point dropouts.
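For reference, the two core MoCo ingredients we adapt, the momentum update of the teacher encoder and the InfoNCE loss against a queue of past teacher activations, can be sketched as follows; the momentum and temperature values are the common defaults of [4], not necessarily our exact settings.

import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(student, teacher, m=0.999):
    # Teacher weights follow the student through an exponential moving average.
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.data.mul_(m).add_(ps.data, alpha=1 - m)

def moco_loss(q, k, queue, temperature=0.07):
    # q: (B, D) student features, k: (B, D) teacher features (detached),
    # queue: (K, D) previous teacher activations acting as negatives
    q, k, queue = F.normalize(q, dim=1), F.normalize(k, dim=1), F.normalize(queue, dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)       # (B, 1) positive logits
    l_neg = q @ queue.T                            # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)   # positive is index 0
    return F.cross_entropy(logits, labels)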
By finetuning the models pretrained with MoCo, we observe that, although it improves the original model without pretraining, there is a significant gap in accuracy compared to its MAE counterpart (Table 3). This, however, does not rule out the possibility that other variants of contrastive learning might be more effective. Instead, it suggests that a well-designed pretraining scheme will most likely offer a significant performance boost to any baseline model.
As opposed to masked autoencoding, an interesting property of contrastive learning is that it allows the classification token to be trained during the pretraining stage. In fact, by examining the attention scores, we notice that the classification token attends to characteristic areas of the input shape and exhibits symmetry, attending to shape parts with similar semantics (Figure 8). We also observe that from layer 8 onward, the CLS token attends uniformly to the whole shape. This is verified by Figure 9(b), where layers 10 and 11 have the exact same attention distance across all heads. Finally, we see that through MoCo pretraining the backbone does not learn to attend to neighboring patches, indicating that the training scheme might require more data or alterations to converge. For the sake of completeness, we also tested a MoCo and MAE hybrid (details in the Appendix).
Attention scores of the classification token, averaged across all heads, pretrained using the MoCo pipeline. We visualize a subset of the network layers (1, 3, 7, and 8, left to right). While in the early layers the CLS token attends to meaningful parts of the shape, exhibiting a tendency to attend to symmetric parts, from layer 8 onwards it attends uniformly to the whole shape.
Comparison of Attention Distances achieved through Masked AutoEncoding (MAE) [25] and Momentum Contrast (MoCo) [4] pretraining schemes. We observe that the setup of MoCo is insufficient to enable the transformer to learn to attend locally. Layers 10 and 11 in MoCo have the exact same attention distance.
Furthermore, useful insights can be extracted by comparing the two pretraining schemes against each other, as well as themselves, to get an idea of how the information is distributed throughout the layers. The results can be seen in Figure 10.
CKA comparisons between the two pretrained backbones (MAE and MoCo pipelines) and themselves. The first two entries correspond to the positional and feature embedding, respectively, while the rest correspond to the outputs of the attention blocks. We observe little resemblance between the representations learned through the different pretraining pipelines. Interestingly, the final feature representations obtained through the MAE pipeline appear to rely more on the features learned within the layers of the network, showing little similarity with the positional embeddings. Conversely, the MoCo pipeline demonstrates a higher reliance on the positional embedding.
We observe that middle layers in MAE seem to have similar feature representations, while there is some similarity between input and output layers. However, the same does not seem to apply to MoCo, where each layer has common representations only with a few of its preceding and succeeding layers, creating a chain-like representation. Astoundingly, the layers seem to gradually lose their similarity with the early feature embedding and rely more and more on the positional embedding, whereas in MAE, the feature embedding is the most dominant in all layers. Naturally, MAE and MoCo are based on fundamentally different principles, so they bear very little resemblance, and only in the middle layers.
Ablation Studies
In this section we focus on two separate topics: the masking ratio and the intermediate representations of the transformer layers.
First, we focus on the reconstruction ability of the backbone, trained through the MAE pipeline. Point-MAE has already studied the effect of masking on the final accuracy, concluding that a 60% masking ratio is the optimal choice. In our experiments we want to test the reconstruction ability of the network when masking the input with a higher ratio. As depicted in Figure 11, even in the extreme case where 90% of the input is masked out, the network can provide a decent reconstruction. We believe that this is because the positional embeddings of the masked patches are given to the decoder, providing useful hints regarding the position of individual parts.
Reconstruction results using different masking ratios. The network is pretrained, using MAE, with a 60% masking ratio. We test the reconstruction ability of the system when we provide as input a point cloud with higher masking ratios. Input and reconstructed patches are shown in red and cyan, respectively.
The classification token in the transformer model progressively accumulates information in order to obtain a high-level understanding of the input shape. In our second experiment we try to evaluate the representations attained at intermediate layers. In Figure 12 we use the same pretrained backbone, but only a subset of its transformer blocks. Numeric results are presented in Table 4. When removing up to 5 layers, the model manages to retain an accuracy above the baseline. Interestingly, this is the same layer at which the transition from location-specific to uniform attention becomes clearer. This indicates that the class cluster separation is possible even in earlier layers, and any additional information simply contributes to refining these clusters.
Accuracy graph of truncated copies of the same model. Each entry specifies the layer from which the feature vector is derived (all succeeding layers are discarded for that run). We observe that the model maintains higher accuracy than the baseline, even with a fraction of the original model’s parameters.
Discussion
Using our proposed unfreezing strategy, we manage to outperform our baseline model, as well as models that use transformers in a similar way. We do not use voting unless explicitly mentioned, for fair comparisons, nor do we include random transforms in the validation set. The method we suggest, combined with our learning rate scheduling policy, yields stable training with very close convergence points and weaker fluctuations.
Given sufficient training time, task-specific knowledge must be incorporated into the backbone in order to improve performance, but only once the head network has been trained enough for the weights to move in the appropriate direction. The specific intervals can vary depending on the task and dataset. For instance, data with background obtained from scans (in ScanObjectNN) require more training time for the backbone, so the optimal unfreezing point is around 200 epochs instead of 250. We also find that when the data distribution is fairly different from the pretraining data, it is more beneficial to pretrain our network for a few epochs on the new distribution instead of unfreezing the backbone earlier (Table 2).
In our explainability experiments, we observe that point neighborhoods with sharp geometric features are generally more likely to be attended to by the classification token. This is an indicator of good pretraining, as sharp features are more relevant descriptors of the sample’s class. We also notice that the transition from local to global seems characteristic of transformers. This property is experimentally backed in the image domain [7], and our own experiments confirm that this holds true in point clouds as well. An additional surprising finding is the contrast between the behaviors of two finetuned models, one that has been pretrained and one that has not. In the first case the human eye can clearly capture a transition from local to global, whereas in the second case the attention scores are mostly random, containing no identifiable pattern.
Finally, we remark that doubling the volume of data causes more attention heads to attend locally and introduces accuracy boosts in the classification task, albeit small. In [7] they observe similar behavior but with even greater impact, which is likely attributed to the significantly larger scale of both the datasets and models. Nevertheless, in both cases it suggests that local information aggregation is a property linked with good performance.
Future Work
Our analysis in section VI demonstrates that contrastive learning exhibits useful properties, namely, symmetry and the ability to train a classification token in the pretraining stage. Consequently, we believe that contrastive learning can potentially outperform MAE in several tasks, given enough training time and data. In fact, we theorize that due to their distinct characteristics, some pretraining schemes may offer greater advantages for a specific downstream task than others. In future work, we plan on experimenting on a framework for determining which pretraining scheme to use, in order to utilize our models to their fullest potential.
ACKNOWLEDGMENT
(Ioannis Romanelis and Vlassis Fotis contributed equally to this work.)
Appendix A
MoCo & MAE Hybrid Pretraining
We devise a pretraining scheme that combines core concepts from both MoCo and MAE. The main idea is that the reconstruction loss from MAE will help the backbone converge faster, being a more straightforward task, while the contrastive loss will help train the CLS token and slightly contribute to the backbone features as well. The scheme is realized through a weighted sum of both losses, appropriately scaled so that they are of the same order of magnitude.
Specifically, the standard reconstruction loss presented in [33] is applied and combined with the contrastive loss through the weighted sum
\begin{equation*}L = L_{rec} + w_{c} \cdot L_{con},\end{equation*}
where $w_{c}$ is chosen so that the two terms are of the same order of magnitude.
Although the hybrid model’s accuracy reaches an unimpressive 91.8% score, we believe that with further research it is possible to create a model that makes the best of both worlds (masked autoencoding and contrastive learning).
Strategic Unfreezing
As briefly discussed in the main paper (due to space constraints), the finetuning data from ModelNet40 have already been seen during the pretraining stage, and therefore the distribution of said data has been captured by the backbone. It is thus sufficient to train the backbone only during the last 50 epochs, where the learning rate is small enough that no excessively large weight updates occur.
ScanObjectNN is probably an even better indicator for this strategy. We observe in Table 2 that, when training on objects without background, the same unfreezing point yields the best results. On the other hand, when background is included, unfreezing at 200 epochs is optimal and obtains even better accuracy. While the higher accuracy can easily be attributed to the extra information contained in the background itself, it is important to notice that the background changes the data distribution significantly. This change, coupled with the fact that the data has been obtained through scanning, leads us to think that 50 epochs are no longer enough for the backbone to adjust to the new distribution.
This argument is further verified by finetuning on the hard variant of ScanObjectNN. Sure enough, 100 epochs of training with the unfrozen backbone makes a staggering difference in the final accuracy.
Generally speaking, there is no explicit way to determine the exact epoch at which the unfreezing yields optimal results. However, our proposed “strategic unfreezing” method is a practical way of iteratively improving the model at hand, by providing concrete proof regarding the similarity with the original pretrained version of the model. Consequently, it allows the researcher to make informed design choices and gain an overall deeper understanding.
Optimization and Learning Rate Policy
For both our optimizer and learning rate we follow one of [33]’s suggestions. For our finetuning we use the AdamW optimizer with a peak learning rate of
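For illustration, a typical AdamW setup with linear warmup followed by cosine decay, the kind of schedule implied by a peak learning rate, is sketched below; the peak learning rate, weight decay, warmup length, and epoch count shown are placeholders rather than our exact configuration.

import math
import torch

def build_optimizer(model, peak_lr=1e-3, weight_decay=0.05, warmup=10, epochs=300):
    opt = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=weight_decay)
    def schedule(epoch):
        if epoch < warmup:                                  # linear warmup to the peak rate
            return (epoch + 1) / warmup
        t = (epoch - warmup) / max(1, epochs - warmup)      # cosine decay towards zero
        return 0.5 * (1 + math.cos(math.pi * t))
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=schedule)
    return opt, sched                                       # call sched.step() once per epoch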
Data Preprocessing
All data samples are normalized to lie inside the unit sphere. For data augmentation we apply random sampling of the point clouds and anisotropic scaling, following the paradigm of our baseline.
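A sketch of this preprocessing is given below; the number of sampled points and the scaling range are illustrative values, not necessarily those used by the baseline.

import torch

def preprocess(points, n_points=1024, scale_range=(0.67, 1.5), train=True):
    # points: (P, 3) float tensor
    points = points - points.mean(dim=0)                  # center at the origin
    points = points / points.norm(dim=1).max()            # fit inside the unit sphere
    if train:
        idx = torch.randperm(points.shape[0])[:n_points]  # random subsampling
        points = points[idx]
        scale = torch.empty(3).uniform_(*scale_range)     # per-axis (anisotropic) scaling
        points = points * scale
    return points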
Hardware
All models were trained on a single NVIDIA RTX 4090 GPU. It should be noted that the memory requirements did not exceed 16GB. The training times were (approximately) 16 hours for pretraining on the combined dataset. For finetuning on our best model (unfreeze 250) the training time was 2 hours on ModelNet40 and 1 hour on ScanObjectNN. Note that when unfreezing at an earlier stage, the total time is lengthened due to the weight updates happening inside the backbone. It takes the model approximately 36 ms to process a batch with 16 samples.
Appendix B
Class Clusters
In order to get a better idea of the causes behind misclassified objects, we present a confusion matrix (Figure 13) generated from samples of the ModelNet40 validation set. The most striking entry is the 15th category, “flower pot”, being mistaken for classes 26, “plant”, and 37, “vase”. These categories contain an unreasonably large amount of overlap, e.g., pots contain plants, plants come in pots, and a lot of pots have a vase-like shape (Figure 14). Therefore, it is hardly surprising that these objects are being misclassified, and this particular error is not a good descriptor of the model’s performance.
Confusion matrix visualization for the validation set of ModelNet40. Each row has been normalized through division by the number of samples of that class.
Visualization of samples from problematic classes of ModelNet40. First, second and third rows correspond to “flower pot”, “plant” and “vase” respectively.
In Figures 15 and 16 (MAE and MoCo, respectively) we visualize the feature vectors of 15 shapes from 10 arbitrary classes, randomly sampled from the validation set of ModelNet40 and projected into 2D space using t-SNE. Naturally, the test accuracy is the sole indicator of how well the clustering is performed. However, by using this technique we hope to understand whether the clustering/classification operates within the realms of human logic. For this purpose, we are mostly interested in the misclustered samples.
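The projection itself is straightforward; a minimal sketch, assuming the per-shape feature vectors have already been extracted from the backbone, is given below (the perplexity is an illustrative default).

import numpy as np
from sklearn.manifold import TSNE

def project_2d(features: np.ndarray, perplexity: float = 30.0) -> np.ndarray:
    # features: (n_shapes, feature_dim) array of backbone output vectors
    return TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(features)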
Clustering performed by the model in the feature space for 10 arbitrary classes. The model has been pretrained using the MAE pipeline. t-SNE is used to project the feature vectors into 2-dimensional space. (A) A sample from the airplane class is placed near the guitar cluster, an “excusable” choice, as both share an elongated shape and sharp protruding parts. (B) Two shapes from the person and plant classes are placed further from their clusters, and near each other, as they have near non-existent depth. (C) Two uniquely shaped lamps have been assigned to a cluster of their own.
Clustering performed by the model in the feature space for 10 arbitrary classes. The model has been pretrained using the MoCo pipeline. t-SNE is used to project the feature vectors into 2-dimensional space. (A) A tall cactus is clustered together with the people class. (B1), (B2) Two lamps with distinct characteristics have been assigned to a separate cluster. (C) A plant with hanging leaves is placed along with the table cluster, most likely because the leaves share a similar structure with table legs.
All in all, these mistakes seem plausible from a human perspective. In particular, the model seems to be “fooled” by objects that have similar height/width or are composed of roughly equivalent parts. This is hardly a fault, however, as it only decreases the score on the ModelNet40 benchmark while achieving greater generalization ability.