Point Transformer

In this work, we present Point Transformer, a deep neural network that operates directly on unordered and unstructured point sets. We design Point Transformer to extract local and global features and relate both representations by introducing the local-global attention mechanism, which aims to capture spatial point relations and shape information. For that purpose, we propose SortNet, as part of the Point Transformer, which induces input permutation invariance by selecting points based on a learned score. The output of Point Transformer is a sorted and permutation invariant feature list that can directly be incorporated into common computer vision applications. We evaluate our approach on standard classification and part segmentation benchmarks to demonstrate competitive results compared to the prior work. Code is publicly available at: https://github.com/engelnico/point-transformer


I. INTRODUCTION
Processing 3D point sets using deep neural networks has become very popular in the past few years. Three-dimensional information has a wide range of applications in autonomous driving [1]-[6] and computer vision [7], [8]. However, training neural networks on point sets is not trivial. First, point sets are unordered and thus require the neural network to be permutation invariant. Second, the number of points in the set is usually dynamic and unstructured. Finally, the network needs to be robust against rotation and translation to operate in the metric space, and since the points describe objects, the network needs to capture the spatial relations between the points.
Standard neural architectures, such as convolutional neural networks (CNN), have shown promising results for structured data. For that reason, several point set processing approaches attempt to transform the points into regular representations such as voxel grids [9], [10] or rendered views of the point clouds [11], [12]. However, transforming the point sets leads to loss of shape information as geometric relations between points are removed. Furthermore, these methods suffer from high computational complexity due to the sparsity of the 3D points. To address these limitations, there is another family of approaches that act directly on the point set. The main idea is to process each point individually with a multi-layer perceptron (MLP) and then fuse the representation to a vector of fixed size with a set pooling operation over a latent feature space [7], [13]. Set pooling is a symmetric function that is permutation invariant. Additionally, under certain conditions, set pooling acts as a universal set function approximator [14]. Nevertheless, Wagstaff et al. [15] argue that reducing the latent representation to a vector of fixed length can be impractical since the cardinality of the input set is usually not considered. Thus, the capacity of the vector may not be sufficient to capture the spatial relations of the point set, which may reduce the overall performance. Therefore, the set pooling mechanism can become a bottleneck for point processing networks.
Our motivation is to remove the set pooling method and overcome the aforementioned bottleneck, while still achieving a permutation invariant representation that models the point set relations in terms of object shape and geometric dependencies. Therefore, it is necessary to introduce a symmetric set function that replaces traditional set pooling operations. For that, we adapt the attention mechanism [16], which was originally introduced for natural language processing and is used to weight and score sequences (words) based on learned importance. To our understanding, we face a similar problem in 3D point processing, given that we need to relate representations of the input points to capture and describe the object's shape. Additionally, attention itself does not depend on the input ordering, i.e. it is permutation-invariant, as it consists of matrix multiplication and summation only, which makes it well-suited for our problem. However, the output is still unordered; thus, directly processing the output of attention for standard computer vision tasks is not possible. Consequently, our goals can be outlined as follows:
• Avoid the bottleneck that can occur while employing set pooling operations [15].
• Present a novel permutation invariant network architecture that adapts the popular and prevalent attention mechanism for 3D point processing.
• Demonstrate superior performance compared to traditional set pooling methods to justify the use of attention and reinforce the claims made by Wagstaff et al.
To address these problems, we propose SortNet, a permutation invariant network module that learns ordered subsets of the input with latent features of local geometric and spatial relations. For that, we learn important key points, which we call top-k selections, that replace the set pooling operation. Since current state-of-the-art methods have shown that aggregating local and global information increases the network's capabilities of capturing context information [7], [17], [18], we employ SortNet to generate local features of the point cloud. Moreover, global features of the entire point cloud are related to the sorted local features using local-global attention. Local-global attention attends both feature representations to capture the underlying shape. Since the local features are ordered, the output of local-global attention is ordered and permutation invariant, and thus it can be used for a variety of visual tasks such as shape classification and part segmentation. An overview of our network is outlined in Fig. 1. Since we aim to process 3D point sets using the ideas proposed by the Transformer network architecture [19], we took inspiration from [20] and name our network Point Transformer.

FIGURE 1.
Overview of the Point Transformer Pipeline. A point cloud serves as input to our network from which local and global features are extracted. We sort local features using SortNet, a module that focuses on important points based on a learned score. We then employ local-global attention to relate global and local features. We aim to capture geometric relations and shape information. The resulting feature representation is permutation invariant and can be used for common computer vision tasks.
Overall, our contributions can be summarized as follows:
• We propose Point Transformer, a neural network that uses the multi-head attention mechanism and operates directly on unordered and unstructured point sets.
• We present SortNet, a key component of Point Transformer, that induces permutation invariance by selecting points based on a learned score.
• We evaluate Point Transformer on two standard benchmarks and show that it delivers competitive results.

II. RELATED WORK
Below, we discuss approaches that process 3D points and are related to our work.

A. POINT SET PROCESSING
Point clouds are irregular and unordered sets of points with a variable number of elements, thus applying standard neural networks to 3D points is not possible. For that reason, previous approaches rely on transforming the point sets into an ordered representation, such as voxel grids. The metric space is discretized into small regions (voxels), which are labeled as occupied if a point lies inside the voxel. Then, 3D convolutional networks (CNN) can be easily applied to the voxel-based representation [9], [10], [21]. This pre-processing, however, reduces the resolution as multiple points are combined into a single voxel and thus damages important spatial relations of the metric space. Furthermore, voxelization increases the memory requirements and computational complexity due to the sparsity of the 3D points. To address these limitations, multiple extensions have been proposed that try to leverage the sparsity of 3D data [22]-[24], but still fail to process large amounts of input points.
View-based methods: In contrast to building voxel grids, a lot of research has been conducted on rendering point clouds into 2D images, i.e. a structured representation of the underlying 3D shape. Then, working with traditional CNNs is possible [12], [25]. Since shape information can be occluded by rendering point clouds from a specific viewpoint, multi-view approaches have been proposed that render multiple images from different angles [11], [12], [26], [27]. Even though images are rendered from different views, the model still fails to capture all geometric and spatial relations. To this day, multi-view approaches achieve impressive results on standard 3D benchmarks. However, the transformation from sparse 3D points into images increases computational complexity as well as required memory.
Shape-based methods: PointNet [13] is a pioneering network architecture that operates directly on 3D point sets, and it is invariant to input point permutations. Therefore, a transformation into a structured representation is no longer necessary. PointNet uses a multi-layer perceptron (MLP) with shared weights that encodes spatial features for each input point separately. Then, a symmetric function, e.g. max pooling, is applied to the latent features to induce permutation invariance and create a global feature representation of the input. PointNet established the de facto standard for point processing that many state-of-the-art approaches still rely on [1], [28]. However, it is not able to encode and capture local information, since the max pooling operation induces permutation invariance, but also destroys local structures and relations of the points in metric space. To address this issue, Qi et al. proposed the improved PointNet++ [7] architecture, a hierarchical model that abstracts the input points with every layer to produce sets with fewer elements. First, centroids of local regions are sampled using hand-crafted algorithms, then local features are encoded to the centroids by exploring the local neighborhood. This allows the network to capture fine-grained patterns and improves the performance on current datasets. A general approach related to unordered sets was introduced by Zaheer et al. [14], demonstrating the capabilities of pooling operations to induce permutation invariance. Importantly, they prove that the set pooling method is a universal approximator for any set function. In general, problems arise with set pooling when the reduced feature vector lacks the capacity to capture important geometric relations. Our work addresses this limitation with a network topology that encodes the entire point cloud by relating local information with the global shape structure.
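The set pooling idea described above can be illustrated with a minimal sketch. It assumes a single shared linear layer with ReLU in place of a full MLP, and the names `shared_mlp`, `max_pool` and `global_feature` are hypothetical; it is not the PointNet implementation, only a demonstration that a shared per-point function followed by max pooling yields the same vector for any input order.

```python
import random

def shared_mlp(point, weights):
    """Apply the same one-layer perceptron with ReLU to a single point."""
    return [max(0.0, sum(w * x for w, x in zip(row, point))) for row in weights]

def max_pool(features):
    """Channel-wise maximum over all per-point feature vectors."""
    return [max(channel) for channel in zip(*features)]

def global_feature(points, weights):
    """Shared per-point encoding followed by symmetric set pooling."""
    return max_pool([shared_mlp(p, weights) for p in points])

rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 3 -> 4 features
points = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(8)]   # 8 xyz points

shuffled = points[:]
rng.shuffle(shuffled)

# The pooled vector has a fixed size and does not depend on the point order.
same = global_feature(points, weights) == global_feature(shuffled, weights)
print(same)
```

The pooled vector always has four entries here, regardless of how many points are passed in, which is the fixed-capacity property that the bottleneck argument refers to.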
Convolutions on Point Clouds: Classic convolutional neural networks require the input data to be ordered, such as images or voxel grids. Since points are unstructured, an active research area is the definition of convolution operations that can operate on irregular 3D point sets such as KPConv [29], SpiderCNN [30] or PointCNN [31]. These methods achieve state-of-the-art performance on a variety of tasks. However, due to the irregularities of the shape and point density, point convolutions are usually hard to design and the kernel needs to be adapted for different input data [32].

B. ATTENTION
Attention itself has its origin in natural language processing [16], [33]. Traditionally, encoder-decoder recurrent neural networks (RNN) were used for machine translation applications, where the last hidden state is used as the context vector for the decoder to sequentially produce the output. The problem is that dependencies between distant inputs are difficult to model using sequential processing. Bahdanau et al. [16] introduced the attention mechanism that takes the whole input sequence into account by taking the weighted sum of all hidden states and, additionally, models the relative importance between words. Vaswani et al. [19] improved the attention mechanism by introducing multi-head attention and proposing an encoder-decoder structure that solely relies on attention instead of RNNs or convolutions. Therefore, they reduce the computational complexity. In this work, multi-head attention is the basis for Point Transformer.
Attention with point cloud processing: Neural networks that rely on attention achieved impressive results in machine translation, and were adapted to operate on point clouds by treating the points as sequences. Vinyals et al. [34] proposed a network that processes unordered sets using attention. They show that the network is able to sort numbers. However, they only focus on generic sets. In contrast, we present an approach that is applied to different point cloud related tasks for capturing shape and geometry information. Recently, Lee et al. [20] proposed Set Transformer, a method that is related to our approach. They adapt the original Transformer network to process unordered sets by using induced points, i.e. trainable parameters of the network, that are attended to the input. Set Transformer focuses on general sets as input. Furthermore, Lee et al. demonstrate that it is applicable to point sets. In our work, Point Transformer is specifically designed to process point clouds and leverage important characteristics of points in metric space such as shape and geometric relations.
Xie et al. [35] propose ShapeContextNet, where they hierarchically apply the shape context approach that acts as a convolutional building block. To overcome the difficulties of manually tuning the shape context parameters, Xie et al. employ self-attention to combine the selection and feature aggregation process into one trainable operation. However, similar to point cloud convolutions, shape context relies on a manual selection of the shape context kernels which is sensitive to the irregularities of point cloud data.
The Point2Sequence model [17] uses an attention-based sequence-to-sequence network. The approach first extracts local regions and produces local features using an LSTM-based attention module. Using a set pooling method, a global feature vector is generated following the ideas of [14] and [13]. However, it relies on a sequence-to-sequence architecture that tends to be computationally more complex than multi-head attention [19]. Furthermore, in contrast to our method, Point2Sequence uses a max-pooling operation to make the network permutation invariant. Yang et al. [36] introduce a network architecture that replaces traditional subsampling methods like furthest point sampling (FPS) with an attention-based selection process using the Gumbel-softmax function, which is similar to the proposed SortNet module.
Recently, Tao et al. [37] proposed a multi-head attentional point cloud processing network that uses a rotation invariant representation of point clouds as input. For that, they employ a multi-head attentional convolution layer (MACL) with attention coding. However, their work focuses on designing a rotation invariant network that relies on global max pooling operations, whereas Point Transformer together with SortNet leverages the strengths and advantages of the attention operation to select useful local point structures and relates them to the global shape to induce permutation invariance.

III. FUNDAMENTALS
Attention was first proposed for natural language processing, where the goal is to focus on a subset of important words [16]. Here, we frame the problem in the context of point sets. We consider the unordered point set $\mathcal{P} = \{p_i \in \mathbb{R}^D, i = 1, \dots, N\}$. Our goal is to map $\mathcal{P}$ to the output space $\mathbb{R}^O$ with the set function $f: \mathcal{P} \to \mathbb{R}^O$. Furthermore, we assume that $f$ is invariant to input permutations. Since the input point set represents some object, e.g. from laser scans, the points are not independent of each other. We aim to make use of the attention mechanism to capture the relations between the points, as well as shape information for performing visual tasks such as object classification or segmentation. Next, we briefly present attention and introduce the Transformer architecture in the context of point sets.

A. ATTENTION
The idea of the attention mechanism is to set an importance-based focus on different parts of an input sequence. Consequently, relations between inputs are highlighted that can be used to capture context and higher-order dependencies. The attention function $A(\cdot)$ describes a mapping of $N$ queries $Q \in \mathbb{R}^{N \times d_k}$ and $N_k$ key-value pairs $K \in \mathbb{R}^{N_k \times d_k}$, $V \in \mathbb{R}^{N_k \times d_v}$ to an output in $\mathbb{R}^{N \times d_v}$ [19]. Using the pairwise dot product $QK^T \in \mathbb{R}^{N \times N_k}$, a score is calculated indicating which part of the input sequence to focus on:

$$\mathrm{score}(Q, K) = \sigma(QK^T) \tag{1}$$

where $\mathrm{score}(\cdot): \mathbb{R}^{N \times d_k} \times \mathbb{R}^{N_k \times d_k} \to \mathbb{R}^{N \times N_k}$. We set the activation function $\sigma(\cdot) = \mathrm{softmax}(\cdot)$ and scale $QK^T$ by $1/\sqrt{d_k}$ to increase stability [19]. To capture the relations between the input points, the values $V$ are weighted by the scores from Equation (1). Therefore, we have

$$A(Q, K, V) = \mathrm{score}(Q, K)\, V \tag{2}$$

It is apparent that the attention function (2) is a weighted sum of $V$, where a value gets more weight if the dot product between the query and the corresponding key yields a higher score.
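As a minimal sketch of the attention function described above, the following pure-Python code computes the softmax of the scaled dot products between queries and keys and uses the result to weight the values. The helper names (`softmax`, `matmul`, `transpose`, `score`, `attention`) and the toy matrices are illustrative assumptions, not part of the published implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def score(q, k):
    """Row-wise softmax of Q K^T scaled by 1 / sqrt(d_k); shape N x N_k."""
    d_k = len(q[0])
    logits = matmul(q, transpose(k))
    return [softmax([x / math.sqrt(d_k) for x in row]) for row in logits]

def attention(q, k, v):
    """Weighted sum of the values; shape N x d_v."""
    return matmul(score(q, k), v)

Q = [[1.0, 0.0], [0.0, 1.0]]                               # N = 2, d_k = 2
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                   # N_k = 3
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]    # d_v = 3

out = attention(Q, K, V)
print(len(out), len(out[0]))   # number of queries, value dimension
```

Because `V` is the identity matrix in this toy example, each output row equals the corresponding row of scores, so the weights can be read off directly: the first query gives equal weight to the first and third key and less to the second.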
If not specified otherwise, we set the model dimension to

B. TRANSFORMER
The Transformer network [19] is an extension of the attention mechanism from Equation (2) that consists of an encoder-decoder structure and introduces multi-head attention. In the following, we explain multi-head attention in detail, as our Point Transformer architecture relies on it. Instead of employing a single attention function, multi-head attention first linearly projects the queries, keys and values $Q$, $K$, $V$ $h$ times to $d_k$, $d_k$ and $d_v$ dimensions, respectively, using separate feed-forward networks to learn relations from different subspaces. Then, attention is applied to each projection in parallel. The output is then concatenated and projected again using a feed-forward network. Thus, multi-head attention can be defined as follows:

$$\mathrm{Multihead}(Q, K, V) = (\mathrm{head}_1 \oplus \dots \oplus \mathrm{head}_h)\, W^O, \quad \mathrm{head}_i = A(Q W_i^Q, K W_i^K, V W_i^V) \tag{3}$$

The $\oplus$ operation denotes matrix concatenation, $W_i^Q$, $W_i^K$ and $W_i^V$ are the projections of head $i$, and $W^O \in \mathbb{R}^{h d_v \times d_m}$ is a learnable parameter matrix [19]. To achieve similar computational complexity as traditional attention, the dimensions of each head $d_k$, $d_v$ are reduced such that $d_k = d_v = d_m / h$. For the Transformer architecture, Vaswani et al. [19] define encoder and decoder stacks of identical layers that consist of multi-head attention and a pointwise fully connected layer, each with a residual connection followed by layer normalization [38]. We call this layer multi-head attention and define it as follows:

$$A^{MH}(X, Y) = \mathrm{LayerNorm}(S + \mathrm{rFF}(S)) \tag{4}$$

where $S = \mathrm{LayerNorm}(X + \mathrm{Multihead}(X, Y, Y))$ and rFF is a row-wise feed-forward network that is applied to each input independently. In practice, multiple multi-head attention layers can be deployed in sequence to further capture higher-order dependencies. Note that the output of $A^{MH}$ depends on the ordering of $X$, thus it is not permutation invariant. However, the values of the corresponding outputs for each input point are always the same regardless of the input order, since $A^{MH}$ only consists of matrix multiplication and summation.
For the task of point processing, we take the unordered point set $\mathcal{P}$ and generate a latent feature representation $p_i^{latent}$ with dimension $d_m$ for every $p_i \in \mathcal{P}$ using an rFF and concatenate them to form $P = [p_1^{latent}, \dots, p_N^{latent}] \in \mathbb{R}^{N \times d_m}$. Based on $P$ we now define the self multi-head attention as:

$$A^{self}(P) := A^{MH}(P, P) \tag{5}$$

which performs multi-head attention between all elements of $P$, thus resulting in a matrix of the same size as $P$.
To attend elements of different sets, we additionally introduce a second matrix representation $Q$ of another set $\mathcal{Q} = \{q_j \in \mathbb{R}^D, j = 1, \dots, N_k\}$ that has been projected to latent feature dimension $d_m$, thus $Q \in \mathbb{R}^{N_k \times d_m}$. We can now define cross multi-head attention as:

$$A^{cross}(P, Q) := A^{MH}(P, Q) \tag{6}$$

which outputs a matrix of dimension $N \times d_m$ whose row order depends on the ordering of $P$. Since the output is not permutation invariant but follows the ordering of the input, Transformer and multi-head attention cannot be used directly for point data without further processing. To solve this problem, we introduce our novel Point Transformer architecture that handles unordered point sets.
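The ordering behaviour discussed above can be checked with a small sketch. It uses plain single-head scaled dot-product attention and deliberately leaves out the learned projections, residual connections and layer normalization of the full layer, so it is a simplification rather than the layer itself. The function name `attention` and the toy features are hypothetical.

```python
import math

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of row vectors."""
    d_k = len(keys[0])
    output = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        output.append([
            sum(w / total * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return output

P = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]   # three latent point features
Q = [[0.5, 0.5], [1.0, -1.0]]              # two latent features of another set
order = [2, 0, 1]                          # a permutation of the rows of P
P_permuted = [P[i] for i in order]

out = attention(P, Q, Q)                   # rows of P attend to the rows of Q
out_permuted = attention(P_permuted, Q, Q)

# Each output row is unchanged; only its position follows the input order.
rows_follow_input = out_permuted == [out[i] for i in order]
print(rows_follow_input)
```

The per-point values are identical in both runs, but the rows appear in the permuted order, which is why an additional ordering step is needed before the output can be flattened into a fixed-layout feature vector.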

IV. POINT TRANSFORMER
This section presents Point Transformer, a neural network that operates on point set data and is based on the multi-head attention mechanism. The network is permutation invariant due to a new module that we name SortNet. Our goal is to explore shape information of the point set by relating local and global features of the input. This is done using cross multi-head attention. To introduce our method, we first give an overview of the complete Point Transformer architecture, which is shown in Fig. 2. Our approach is divided into three parts:
1) SortNet, which extracts ordered local feature sets from different subspaces.
2) Global feature generation of the whole point set.
3) Local-global attention, which relates local and global features.
As introduced in Sec. III, we consider the point set $\mathcal{P} = \{p_i \in \mathbb{R}^D, i = 1, \dots, N\}$ as input to our network. In most cases, the point dimension is given by $D = 3$ when xyz coordinates are considered. Moreover, it is possible to append additional point features, for example lidar intensity values ($D = 4$) or point normal vectors ($D = 6$). Point Transformer consists of two independent branches: a local feature generation module, i.e. SortNet, and a global feature extraction network. For the local feature branch, the input $\mathcal{P}$ is projected to latent space with dimension $d_m$ using a row-wise feed-forward network. Then, we employ self multi-head attention on the latent features to relate the points to each other. Finally, SortNet outputs a sorted set of fixed length. This module is comparable to a kernel in convolutional neural networks, where the activation of a kernel depends on regions of the input space, i.e. the receptive field. SortNet works in a similar fashion: it focuses on points of interest according to the learnable score derived from the latent feature representation. For the extraction of global features, we employ set abstraction with multi-scale grouping introduced by [7]. After obtaining features from both branches, we employ our proposed local-global attention to combine and aggregate local and global features of the input point cloud. Since the ordering of the output of local-global attention depends only on the ordered local features, the output of Point Transformer is ordered and permutation invariant as well, and can directly be incorporated into computer vision applications such as shape classification and part segmentation.

A. SORTNET
The local feature generation module, i.e. SortNet, is one of our key contributions. It produces local features from different subspaces that are permutation invariant by relying on a learnable score. We show the architecture in Fig. 3. SortNet receives the original point cloud $\mathcal{P} \in \mathbb{R}^{N \times D}$ and the projected latent feature representation $P = [p_1^{latent}, \dots, p_N^{latent}] \in \mathbb{R}^{N \times d_m}$ from the row-wise feed-forward network. We employ an additional self multi-head attention layer on the latent features to capture spatial and higher-order relations between each $p_i \in \mathcal{P}$.
Subsequently, a row-wise feed-forward (rFF) network is used to reduce the feature dimension to one, thus creating a learnable scalar score $s_i \in \mathbb{R}$ for each input point $p_i$, which incorporates spatial relations due to the self multi-head attention layer. We now define the pair $\langle p_i, s_i \rangle_{i=1}^{N}$, which assigns the corresponding score to every input point. Let $(\mathcal{Q}, \geq)$ be a totally ordered set. We select from the original input point list the $K \leq N$ points with the highest score values and sort them accordingly such that:

$$\mathcal{Q} = \{q_j, \; j = 1, \dots, K\}$$

where $q_j = \langle p_i^j, s_i^j \rangle_{j=1}^{K}$, $p_i^j \in \mathcal{P}$ such that $s_i^1 \geq \dots \geq s_i^K$. In other words, we employ the top-k operation to search for the $K$ highest scores $s_i$ and select the associated input points $p_i$. After selecting $K$ points using the learnable score, we now capture localities by grouping all points from $\mathcal{P}$ that are within the Euclidean distance $r$ of each selected point, i.e. we perform a ball query search similar to [7]. The grouped points are then used to encode local features, denoted by $g^j \in \mathbb{R}^{d_m - 1 - D}$, $j = 1, \dots, K$. We choose the feature dimension of the grouped points $g^j$ such that the resulting dimension of the local feature vector corresponds to the model dimension $d_m$. The scores $s_i^j$, as well as the local features $g^j$ from the grouping layer, are concatenated to the corresponding input points $p_i^j$ to include the score calculation into our optimization problem and encode local characteristics to the selected point. Thus, we obtain our local feature vector

$$f_i^j = p_i^j \oplus s_i^j \oplus g^j, \quad f_i^j \in \mathbb{R}^{d_m}$$

Consequently, the output of SortNet constitutes one local feature set

$$\mathcal{F}_m^L = \{f_i^j, \; j = 1, \dots, K\}$$

Since $\mathcal{Q}$ is an ordered set, it follows that $\mathcal{F}_m^L$ is ordered as well. To capture dependencies and local features from different subspaces, we employ $M$ separate SortNets. Finally, the $M$ feature sets are concatenated to obtain an ordered local feature set of fixed size

$$\mathcal{F}^L = \mathcal{F}_1^L \cup \dots \cup \mathcal{F}_M^L, \quad F^L \in \mathbb{R}^{K \cdot M \times d_m}$$
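The selection and grouping steps of one SortNet can be sketched as follows. This is a simplified illustration under several assumptions: the scores are random placeholders for the learned scores, the grouped points are encoded with a stand-in (channel-wise maximum of absolute offsets) rather than a learned network, and all function names are hypothetical. The sketch also assumes that no two scores are equal, since ties would make the order ambiguous.

```python
import math
import random

def top_k(points, scores, k):
    """Select the k highest-scoring points, sorted by descending score."""
    ranked = sorted(zip(scores, points), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]

def ball_query(points, center, radius):
    """All points within Euclidean distance `radius` of `center`."""
    return [p for p in points if math.dist(p, center) <= radius]

def group_feature(neighbors, center):
    """Stand-in encoder: channel-wise max of absolute offsets to the center."""
    return [max(abs(p[c] - center[c]) for p in neighbors) for c in range(len(center))]

def sortnet_features(points, scores, k, radius):
    """Ordered local features: point, score and group feature concatenated."""
    features = []
    for score, point in top_k(points, scores, k):
        g = group_feature(ball_query(points, point, radius), point)
        features.append(list(point) + [score] + g)
    return features

rng = random.Random(0)
points = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(32)]
scores = [rng.random() for _ in points]   # placeholder for learned scores

features = sortnet_features(points, scores, k=4, radius=0.5)

# Shuffle points and scores together: the ordered output stays the same.
order = list(range(len(points)))
rng.shuffle(order)
shuffled = sortnet_features(
    [points[i] for i in order], [scores[i] for i in order], k=4, radius=0.5
)
print(features == shuffled)
```

With D = 3 and a three-dimensional group feature, each row has 3 + 1 + 3 = 7 entries, which plays the role of the model dimension in this toy setting.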

B. GLOBAL FEATURE GENERATION
The second branch of Point Transformer is responsible for extracting global features from the input point cloud. To reduce the total number of points and save computational time and memory, we employ the set abstraction multi-scale grouping (MSG) layer introduced by Qi et al. [7]. We subsample the entire point cloud to $N' < N$ points using the furthest point sampling algorithm (FPS) and find neighboring points to aggregate features of dimension $d_m$, resulting in a global representation of dimension $N' \times d_m$. Note that the global feature representation is still unordered since no sorting or set pooling operation was performed.
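For readers unfamiliar with furthest point sampling, the following is a minimal sketch of the standard greedy algorithm, not the batched GPU version typically used in practice. The function name, the `start` parameter and the toy points are illustrative assumptions.

```python
import math

def furthest_point_sampling(points, n_samples, start=0):
    """Greedily pick the point furthest from all points selected so far."""
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n_samples:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [
            min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)
        ]
    return selected

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
indices = furthest_point_sampling(points, n_samples=3)
print(indices)   # [0, 3, 4]
```

Starting from index 0, the sampler first picks the point at x = 10 and then the point at x = 5, which lies furthest from both. The result depends on the start index and therefore on the input order, which is consistent with the global representation remaining unordered.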

C. LOCAL-GLOBAL ATTENTION
The goal of Point Transformer is to relate local and global feature sets, $\mathcal{F}^L$ and $\mathcal{F}^G$ respectively, to capture shape and context information of the point cloud. After obtaining both feature lists, we employ self multi-head attention $A^{self}$ on the local features $\mathcal{F}^L$ as well as the global features $\mathcal{F}^G$. Then, the cross multi-head attention layer $A^{cross}$ from Equation (6) is applied such that every global feature is scored against every local feature, thus relating local context with the underlying shape. We call this operation local-global attention $A^{LG}$ (see Fig. 2) and define it as follows:

$$A^{LG} := A^{cross}(A^{self}(F^L), A^{self}(F^G))$$

where $F^L$ and $F^G$ are the matrix representations of $\mathcal{F}^L$ and $\mathcal{F}^G$, respectively. The last row-wise feed-forward layer in the multi-head attention mechanism of $A^{LG}$ reduces the feature dimension to $d'_m < d_m$ in order to decrease computational complexity, thus we have $A^{LG} \in \mathbb{R}^{K \cdot M \times d'_m}$. In other words, we take every local feature from SortNet and score the global features against it. At this point, it is important to note that we relate the local features, i.e. a subset of the input $\mathcal{F}^L \subseteq \mathcal{P}$, with the global structure. Thus, we avoid reducing the shape representation using set pooling; instead, the output of local-global attention includes information of the entire point cloud, i.e. the underlying shape, as well as local characteristics. As with multi-head attention, for local-global attention, we employ multiple cross and self multi-head attention layers in sequence to learn higher-order dependencies [19]. Since the ordering of the local features $\mathcal{F}^L$ defines the order of the output of local-global attention, we obtain a permutation invariant latent representation of fixed size of the aggregated features that can directly be incorporated into computer vision tasks.
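The claim that the output order is defined by the local features alone can be illustrated with a small sketch. It assumes a single-head cross attention without the self-attention stages, learned projections, residual connections or layer normalization, and it uses random placeholder features; the function name and sizes are hypothetical.

```python
import math
import random

def cross_attention(local_feats, global_feats):
    """Each local feature (query) forms a softmax-weighted sum of global features."""
    d = len(global_feats[0])
    out = []
    for q in local_feats:
        logits = [sum(a * b for a, b in zip(q, g)) / math.sqrt(d) for g in global_feats]
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        out.append([
            sum(w / total * g[c] for w, g in zip(weights, global_feats))
            for c in range(d)
        ])
    return out

rng = random.Random(0)
local_feats = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]    # K*M = 4, ordered
global_feats = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(6)]   # N' = 6, unordered

shuffled_global = global_feats[:]
rng.shuffle(shuffled_global)

out = cross_attention(local_feats, global_feats)
out_shuffled = cross_attention(local_feats, shuffled_global)

# Reordering the global features changes the result only by rounding error.
unchanged = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(out, out_shuffled)
    for a, b in zip(row_a, row_b)
)
print(len(out), unchanged)
```

The output has one row per local feature, in the order of the local features, and the weighted sum over the global features is insensitive to their order up to floating-point rounding.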

D. COMPLETE MODEL
To recap, Point Transformer functions as follows: Our architecture consists of two independent branches, SortNet for the extraction of local features and a global feature generation module. SortNet constitutes a novel architecture that selects a number of input points based on a learned score from latent features, resulting in $M \cdot K$ ordered feature vectors with dimension $d_m$. In the global feature branch, we employ multi-scale grouping to reduce the total number of points to $N'$ while aggregating spatial information. Then, local-global attention is used to relate both spatial signatures, producing a permutation invariant and ordered representation of length $K \cdot M$ with reduced dimension $d'_m$ (see Fig. 2), which can be used for different tasks such as shape classification or part segmentation. Additionally, we demonstrate the processing chain of our model as a flowchart in Fig. 4.
Shape Classification assigns the point cloud to one of $C$ object classes. For this, we flatten the sorted output of local-global attention to a vector of fixed size $\mathbb{R}^{M \cdot K \cdot d'_m}$ and reduce the dimensions to $\mathbb{R}^C$ using a row-wise feed-forward network. Thus, each output represents one class. Using a final softmax layer, class probabilities are produced. The shape classification head is shown in Fig. 2 a).
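The classification head can be sketched as flattening followed by a projection and a softmax. The sketch below assumes a single linear layer in place of the feed-forward network, random placeholder weights instead of trained ones, and small hypothetical sizes; the names `classify` and `softmax` are illustrative.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax producing class probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights, bias):
    """Flatten the ordered feature rows, project to C logits, apply softmax."""
    flat = [x for row in features for x in row]
    logits = [sum(w * x for w, x in zip(row, flat)) + b for row, b in zip(weights, bias)]
    return softmax(logits)

M, K, d_out, C = 2, 3, 4, 5   # hypothetical sizes, not the paper's settings
rng = random.Random(0)
features = [[rng.uniform(-1, 1) for _ in range(d_out)] for _ in range(M * K)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(M * K * d_out)] for _ in range(C)]
bias = [0.0] * C

probs = classify(features, weights, bias)
print(len(probs), round(sum(probs), 6))
```

Flattening is only meaningful because the rows arrive in a fixed order: each position of the flattened vector is always multiplied by the same weight, so a different row order would produce different logits.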

Method                 ModelNet (accuracy, %)   ShapeNet (mean IoU, %)
PointNet [13]          89.2                     83.7
PointNet++ [7]         91.9                     85.1
ShapeContextNet [35]   89.8                     84.6
Deep Sets [14]         90.3                     -
Point2Sequence [17]    92.6                     85.2
Set Transformer [20]   90.4                     -
PAT [36]               91.7                     -
Tao et al. [37]        87

Part Segmentation assigns a label to each point of the input set. State-of-the-art methods [7], [17] upsample a global feature vector obtained from a set pooling operation using interpolation. We, however, employ an additional cross multi-head attention layer to attend the output of $A^{LG}$, i.e. the aggregated shape and context information, to each point of the input set $\mathcal{P}$. It is important to note that we project the points in the global feature generation branch to $d'_m$ dimensions and apply self multi-head attention. The features are additionally used for the set abstraction layer. Later, we attend the projected features with the output of Point Transformer. Thus, we can relate each point to the entire point cloud. The result is a matrix of dimension $\mathbb{R}^{N \times d'_m}$. Then, a row-wise feed-forward layer reduces the dimension of each point to the $C$ possible classes, $\mathbb{R}^{N \times C}$. Again, using a final softmax layer, per-point class probabilities are produced as shown in Fig. 2 b).

V. EXPERIMENTS
In this section, we perform two standard evaluations on Point Transformer. We compare our results with approaches that operate directly on 3D point sets [7], [13], [14], attention-based approaches [17], [20], [35] and methods that use point cloud convolutions [29]-[31], [39]. Moreover, we provide a detailed analysis and visualizations of the components of our approach. We implement our network in PyTorch [40], where we rely on the RAdam optimizer [41] for all experiments. The weights of each layer are initialized using the popular Kaiming normal initialization method [42]. Our implementation will be made publicly available.

A. POINT CLOUD CLASSIFICATION
We evaluate Point Transformer on the ModelNet40 dataset [10] and use the modified version by Qi et al.  in the range of [−0.1, 0.1]. Additionally, we apply random dropout of the input points as proposed in [7], [13].  Table 1. Point Transformer outperforms attention-based methods (top part of Table 1) and achieves accuracy on par with state-of-the-art methods (bottom part of Table 1), with a classification accuracy of 92.8%.

B. POINT CLOUD PART SEGMENTATION
Here, we evaluate Point Transformer on the challenging task of point cloud part segmentation on the ShapeNet dataset [43], which contains 13,998 training samples and 2,874 test samples. The dataset is composed of objects from 16 categories with a total of 50 part labels. The goal is to predict the part label of every point. To address this task, the network has to learn a deep understanding of the underlying shape. For the part segmentation, we set $M = 10$ and $K = 16$. Again, we use xyz coordinates with normal vectors ($D = 6$) and $N = 1024$ input points. For this experiment, we follow the setup of [13], where a one-hot encoding of the category is concatenated to the input points as an additional feature. We report the mean IoU (Intersection-over-Union) in Table 1. Finally, we visualize exemplary results of the part segmentation task in Fig. 5.
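As an illustration of the metric, the following sketch computes the IoU of one shape averaged over the part labels of its category. It assumes one common convention, namely that a part absent from both prediction and ground truth counts as an IoU of 1; the text does not state which convention or averaging scheme was used, so this choice, the function name and the toy labels are assumptions.

```python
def shape_iou(pred, target, part_labels):
    """Mean IoU over the part labels of a single shape."""
    ious = []
    for part in part_labels:
        inter = sum(1 for p, t in zip(pred, target) if p == part and t == part)
        union = sum(1 for p, t in zip(pred, target) if p == part or t == part)
        # Assumed convention: a part missing from both counts as a perfect match.
        ious.append(1.0 if union == 0 else inter / union)
    return sum(ious) / len(ious)

pred   = [0, 0, 1, 1, 2, 2]   # predicted part label per point
target = [0, 0, 1, 2, 2, 2]   # ground-truth part label per point

miou = shape_iou(pred, target, part_labels=[0, 1, 2, 3])
print(round(miou, 4))   # 0.7917
```

In this toy case the per-part IoUs are 1, 1/2, 2/3 and 1 (part 3 is absent from both), which average to about 0.79. A dataset-level score would then average such values over all test shapes.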

C. NETWORK COMPLEXITY
We examine the network complexity of Point Transformer and perform a comparison to related approaches. The results of this experiment are shown in Table 3. We performed all experiments on a Nvidia GeForce 1080Ti. Point Transformer has about 13.5 million learnable parameters (51 MB), which is fewer than KPConv (15 million learnable parameters). However, our model is about 6 times bigger than PointNet++ and Point2Sequence. This is mainly due to the fact that the Transformer model itself has a lot of learnable parameters. For example, one SortNet only has about 10,000 learnable parameters, which shows that SortNet can be incorporated into any existing network architecture without much space requirements and computational overhead, as it only adds about 1.2 ms of inference time. In many cases, the forward pass of multiple SortNets can additionally be performed in parallel. Even though Point Transformer has more learnable parameters than, e.g., PointNet++, it still has a faster inference time because multi-head attention blocks are highly optimized and computation is also performed in parallel by employing multiple attention heads. For the computational complexity of the network, an upper bound can be estimated from the most expensive operation, which in our case is the multi-head attention mechanism. The complexity is given by $O(N^2 \cdot d_m)$, thus it scales quadratically with respect to the total number of input points.
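The figures above can be sanity-checked with a short calculation. The sketch assumes 32-bit floating-point parameters, which the text does not state, and uses a hypothetical model dimension of 256 for the attention cost; both function names are illustrative.

```python
def parameter_memory_mib(n_params, bytes_per_param=4):
    """Memory of the parameters in MiB, assuming 32-bit floats by default."""
    return n_params * bytes_per_param / 2**20

def attention_cost(n_points, d_model):
    """Leading term N^2 * d_m of the attention complexity."""
    return n_points**2 * d_model

print(round(parameter_memory_mib(13_500_000), 1))          # 51.5
print(attention_cost(2048, 256) / attention_cost(1024, 256))  # 4.0
```

Under the 32-bit assumption, 13.5 million parameters occupy about 51.5 MiB, in line with the reported 51 MB. Doubling the number of input points quadruples the leading cost term, which is what quadratic scaling means in practice.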

D. HYPERPARAMETER STUDY
Here, we analyze the effect of the number of SortNets in our Point Transformer architecture, as well as the number of Top-K selections, on the ModelNet40 dataset [10]. The results are shown in Tab. 4. Furthermore, we present the hyperparameters that were used for the reported classification and part segmentation results in Tab. 5. The parameters follow the notation introduced in Fig. 2 and Fig. 3. The values were found by performing a hyperparameter grid search for classification and part segmentation, similar to Tab. 4. We report the set of parameters that achieved the best overall performance. Note that for the rFF, each value in the parentheses denotes one layer, where the value represents the feature dimension of that layer.

E. POINT TRANSFORMER DESIGN ANALYSIS
We conduct an ablation study to show the influence of each Point Transformer module. Afterward, we qualitatively examine our classification results by visualizing the learned point set regions that contribute to the classification output.
Ablation study of SortNet: We first evaluate Point Transformer using only the SortNet module from Fig. 3 with the classification head from Fig. 2 a). Our aim is to show that the learned scores reflect the importance of points for the classification task. In addition, we want to verify that SortNet selects points that help to understand the underlying shape. Since we cannot explicitly define which points are the most important, we rely on the accuracy score. In detail, we train SortNet in three different experiments and deliberately set M = 10 and K = 12, selecting only a subset of the entire point cloud (M · K = 120, N = 1024). In the first experiment, we train SortNet as it is implemented in the Point Transformer pipeline. In the second experiment, we replace the Top-K selection process with furthest point sampling (FPS). In the third experiment, we randomly select K points from the input set instead of using the learned Top-K selection. It is important to note that the last two experiments remove the permutation invariance property. However, we want to show that SortNet performs better than a random selection of points and handcrafted sampling methods. Thus, we rely on random sampling and FPS as baselines. The results are shown in Table 2 a). With randomly sampled points, SortNet achieves 60.1% classification accuracy. When we apply FPS to cover most of the underlying shape, the accuracy increases to 74.8%, indicating that spatial information is preserved. Finally, when we use the learned Top-K selection, we achieve the highest classification accuracy of 83.4%. This empirically shows that SortNet learns to focus on important shape regions.
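The three selection strategies compared in this ablation can be sketched in NumPy as follows. This is a simplified illustration rather than the authors' implementation: the function names are hypothetical, and the `scores` array stands in for the per-point scores that SortNet learns.

```python
import numpy as np

def top_k_selection(scores, k):
    """Indices of the k highest scores, ordered from highest to lowest."""
    return np.argsort(-scores, kind="stable")[:k]

def furthest_point_sampling(xyz, k, start=0):
    """Greedy FPS: repeatedly pick the point furthest from all chosen points."""
    selected = [start]
    min_dist = np.linalg.norm(xyz - xyz[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Keep, for every point, the distance to its nearest selected point.
        min_dist = np.minimum(min_dist, np.linalg.norm(xyz - xyz[nxt], axis=1))
    return np.array(selected)

def random_selection(num_points, k, rng):
    """Draw k distinct point indices uniformly at random."""
    return rng.choice(num_points, size=k, replace=False)
```

When each score depends only on its own point and all scores are distinct, the points returned by `top_k_selection` are the same, and in the same order, for any permutation of the input. FPS as sketched here depends on the starting index, and random selection on the drawn indices, so neither yields a fixed ordered output under permutation.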
Ablation study of global feature generation: In this ablation study, we compare different sampling methods for the extraction of global features. We rely on the complete Point Transformer pipeline as shown in Fig. 2 and replace the set abstraction (MSG) with different sampling approaches. Again, we evaluate the accuracy on the classification task. The results are presented in Table 2 b). In the first experiment, we use the complete input point cloud. Then, we sample N = 128 points using furthest point sampling, which slightly improves our result by 0.4%. When we additionally aggregate features from local regions around the sampled points, i.e., set abstraction with multi-scale grouping (MSG) [7], the accuracy further increases to 92.8%. This indicates that scoring the local features against every input point makes it harder to find important relations. Additionally, by uniformly selecting fewer points and aggregating local features, the network can concentrate on meaningful parts of the underlying shape.
Rotation robustness of SortNet: In this section, we evaluate the robustness of SortNet against rotations of the input cloud. For this, we evaluate Point Transformer on the ModelNet40 test set with randomly rotated input point clouds. Even though we did not train the network with rotations, we still achieve a classification accuracy of 92.3%, compared to 92.8% without rotations. When we applied the same input rotations to PointNet++, its classification accuracy dropped from 91.9% to 88.6%. To qualitatively support this claim, we visualize the learned Top-K selections of one SortNet for different rotations in Fig. 6, which shows that SortNet still focuses on similar local regions even when the input point cloud is rotated.
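A test-time rotation of this kind could be applied as in the sketch below. The text does not specify how the rotations were sampled, so drawing an arbitrary 3D rotation is an assumption, as are the function names and the convention that normal vectors occupy columns 3 to 5 of the input.

```python
import numpy as np

def random_rotation_matrix(rng):
    """Draw a random 3x3 rotation matrix (orthogonal, determinant +1)."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    # Remove the sign ambiguity of the QR factorization.
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        # Flip one axis to turn a reflection into a proper rotation.
        q[:, 0] = -q[:, 0]
    return q

def rotate_point_cloud(points, rotation):
    """Rotate xyz coordinates and, if present, the normal vectors."""
    rotated = points.copy()
    rotated[:, :3] = points[:, :3] @ rotation.T
    if points.shape[1] >= 6:
        rotated[:, 3:6] = points[:, 3:6] @ rotation.T
    return rotated
```

Because a rotation preserves lengths and angles, the normals remain unit vectors and stay consistent with the rotated surface, so only the orientation of the input changes.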
Visualizations of learned local regions: Here, we show that SortNet focuses on local regions, similar to the receptive field of a CNN. For this, we visualize the learned Top-K selections of multiple trained SortNet modules on different models of the same object class in Fig. 7 and Fig. 8. It is apparent that each SortNet tries to select similar regions even when the shape of the model is slightly different. This, together with the results on rotation robustness, suggests that SortNet is aware of the underlying shape.
All Top-K selections: As an additional evaluation, we show all selected points of M = 8 SortNet modules in Fig. 9 for the classification task. Points that were selected by the same SortNet are shown in the same color. It is apparent that different SortNet modules focus on different parts of the object and, in combination, still retain as much of the underlying shape as possible.

VI. CONCLUSION AND FUTURE WORK
In this work, we proposed Point Transformer, a permutation invariant neural network that relies on the multi-head attention mechanism and operates on irregular point clouds. The core of Point Transformer is a novel module that receives a latent feature representation of the input point cloud and selects points based on a learned score. We relate local features to the global structure of the point cloud, thus exploiting context and inducing shape-awareness. The output of Point Transformer is a sorted and permutation invariant feature list that is used for shape classification and part segmentation. Finally, we show that our point selection mechanism is based on importance for the specified task. As future work, we want to focus on improving the efficiency of the Transformer architecture by implementing recent advances for self-attention, such as [44], [45].