Learning to Detect 3D Symmetry From Single-View RGB-D Images With Weak Supervision

3D symmetry detection is a fundamental problem in computer vision and graphics. Most prior works detect symmetry when the object model is fully known, while few study symmetry detection on objects with partial observations, such as single RGB-D images. Recent work addresses the problem of detecting symmetries from incomplete data with a deep neural network by leveraging dense and accurate symmetry annotations. However, due to the tedious labeling process, full symmetry annotations are not always practically available. In this work, we present a 3D symmetry detection approach that detects symmetry from single-view RGB-D images without using symmetry supervision. The key idea is to train the network in a weakly-supervised manner to complete the shape based on the predicted symmetry such that the completed shape is similar to existing plausible shapes. To achieve this, we first propose a discriminative variational autoencoder to learn a shape prior that determines whether a 3D shape is plausible or not. Based on the learned shape prior, a symmetry detection network is presented to predict symmetries that produce shapes with high shape plausibility when completed based on those symmetries. Moreover, to facilitate end-to-end network training and multiple symmetry detection, we introduce a new symmetry parametrization for the learning-based estimation of both reflectional and rotational symmetry. The proposed approach, coupling symmetry detection with shape completion, essentially learns a symmetry-aware shape prior, facilitating more accurate and robust symmetry detection. Experiments demonstrate that the proposed method detects reflectional and rotational symmetries accurately and generalizes well to challenging scenarios, such as objects with heavy occlusion and scanning noise. Moreover, it achieves state-of-the-art performance, improving the F1-score over the existing supervised learning method by 2%-11% on the ShapeNet and ScanNet datasets.


INTRODUCTION
Symmetry is omnipresent in both the real and man-made worlds. Symmetry detection has long been of interest to vision and graphics researchers [1], [36], [42]. As a long-standing problem, it benefits a wide range of downstream applications for understanding, processing, and modeling the world at different scales and in various modalities. Based on its mathematical definition, symmetry can be detected by finding the geometric invariance under the transformation of a symmetry group, e.g., reflection, rotation, or translation. For example, various methods have been proposed to collect evidence of symmetric correspondences under a symmetry transformation and detect symmetry from complete or well-observed 3D shapes where sufficient symmetry evidence exists [42].
However, when the problem setting shifts to symmetry detection with incomplete observations, such as a single-view RGB-D image of a 3D object, conventional methods become intractable. Detecting 3D symmetries from partial or even single-view observations is potentially useful for a variety of applications such as shape completion [59], shape understanding [41], inverse procedural modeling [4], and object pose estimation for robotic manipulation [44]. However, few robust methods exist for symmetry detection on incomplete data. SymmetryNet [53] is one of the pioneering works that detect 3D reflectional and rotational symmetries from single-view RGB-D images by training a carefully designed deep neural network.
Despite its satisfactory performance, SymmetryNet requires a large amount of training data with 3D symmetry annotations along with dense symmetry correspondences. Manually annotating 3D symmetries accurately is clearly laborious, especially for RGB-D images. A straightforward method is to hand-draw and manipulate symmetry axes in 3D space with an interactive tool, which is hard to do accurately. Another, more common way is to first find a similar shape in a 3D shape repository, annotate or detect symmetries on that 3D shape, and then transfer the symmetries from the object coordinate space to the camera reference frame of the RGB-D image based on the 6D object pose. Here, the effort of manual labeling is still prohibitive [38], limiting the amount of training data and the generality of the learned model.
In this paper, we propose a weakly-supervised learning-based method that detects global 3D symmetry from single-view RGB-D images without requiring symmetry annotations. Instead of training the network with direct symmetry supervision, as shown in Fig. 1, our approach learns to predict symmetries such that the symmetry-induced shape completion results in a plausible shape according to a pre-learned shape prior. This is motivated by the psychological mechanism of human perception: the visual detection of symmetry is an integral part of the perceptual organization process applied to every incoming visual stimulus and is correlated to the processes of object recognition and amodal completion [61]. Our weakly-supervised approach can be regarded as an implementation of this idea via coupling symmetry prediction and shape completion.
To do so, we first need to learn a shape prior to determine whether a 3D shape is plausible or not. Specifically, we develop a discriminative variational autoencoder (DVAE). We first learn a variational autoencoder (VAE) with plausible shapes collected from off-the-shelf shape repositories. Consequently, the latent space, as a normal distribution, represents a prior distribution of plausible shapes. We then introduce an extra Gaussian distribution for implausible shapes and make it distant from the normal distribution. This discriminative learning of the VAE naturally results in two distinguishable distributions for plausible and implausible shapes, respectively. With the Gaussian assumption on both distributions, DVAE generalizes well on novel shapes.
Having learned the shape prior, we train a network for detecting 3D symmetries from single-view RGB-D images with weak supervision. The backbone of our symmetry detection network is an RGB-D transformer, a novel attention-based multi-modality feature aggregation network that aggregates appearance and geometric features at multiple scales. Symmetry hypotheses are generated based on the observed points of the depth image. The partially observed point cloud is then completed with each of the predicted symmetries and fed into the learned DVAE to measure its plausibility. The network is trained to predict symmetries that produce shapes with high plausibility when completed based on those symmetries.
Moreover, we propose to represent reflectional symmetry as a 3D transformation matrix and represent rotational symmetry as a center location as well as an axis direction.
With this symmetry parametrization, the symmetry detection and shape completion network is differentiable and can estimate both reflectional and rotational symmetries effectively. In particular, for the problem of multiple reflectional symmetry estimation, the parametrization contains a perpendicularity constraint, facilitating the accurate estimation of multiple symmetries.
Through extensive evaluation, we demonstrate that our method achieves state-of-the-art performance on two public datasets. Notably, it outperforms the supervised method SymmetryNet with an F1-score improvement of 2%-11% on the ShapeNet and ScanNet datasets. We also show that our method is able to produce high-quality symmetry detection results in challenging scenarios such as objects with heavy occlusion and scanning noise.
In summary, we make the following contributions:
- We propose a framework to detect 3D symmetry from RGB-D images without symmetry supervision, which is applicable to real-world data with good generality.
- We propose a discriminative variational autoencoder to learn the shape prior and to estimate shape plausibility from incomplete data.
- We propose the RGB-D transformer, a new multi-modality aggregation network architecture, to fuse appearance and geometric features at multiple scales.
- The proposed method achieves state-of-the-art performance on two public datasets and shows good generality and robustness in challenging scenarios.

RELATED WORK
2D Symmetry Detection. Symmetry is a pervasive phenomenon in both real-world and man-made environments. As symmetry provides a fundamental intermediate-level clue, the perception and recognition of symmetry in 2D images have been well explored in computer vision over the last decades [33], [37], [57]. Various methods have been proposed to automatically detect multiple types of 2D symmetry, such as reflectional symmetry, rotational symmetry, translational symmetry, and combinations of them [35]. Prominent methods include direct methods [28], voting-based methods [45], moment-based methods [39], and learning-based methods [16], [60]. However, existing methods require a sufficient number of symmetric counterparts to appear in the image space, limiting their applicability to complex scenes with data incompleteness. In contrast, our work detects symmetries in 3D space and does not necessitate the presence of a large portion of symmetric correspondences.

Fig. 1. We propose a deep neural network to estimate 3D symmetry from single-view RGB-D images with weak supervision. Given an incomplete shape (a), our method predicts symmetry such that the completed shape based on the symmetry (b) is similar to existing plausible shapes. This is achieved by learning to embed the plausible shapes into a unified distribution (c) in the latent space.

3D Symmetry Detection. Many attempts have been made to detect symmetry from 3D geometry [42], [49]. Remarkable progress has been achieved on extracting symmetries and applying such structural information to various tasks [41], [48], [59], [73]. Existing symmetry detection methods can be categorized according to the type of symmetries that are of interest, such as exact [32] or approximate [47], global [27] or partial [19], and Euclidean [17] or intrinsic [68]. For example, Mitra et al. [41] studied the problem of detecting partial and approximate symmetry in Euclidean space. Xu et al.
proposed an algorithm to detect partial intrinsic reflectional symmetry. However, most prior symmetry detection methods are only applicable to scenes/objects with complete geometry and show unsatisfactory performance on incomplete data, such as the geometry captured by single-view RGB-D images. To reduce these requirements, learning-based follow-up works have been developed [53], [75]. However, competitive results could only be achieved by training on a large amount of annotated data, which is not always available in real-world datasets, e.g., KITTI [18]. Unlike these works, our method detects symmetry without symmetry annotations, which is a more general solution.
3D Shape Completion. Our work is related to 3D shape completion from single-view scans. Early conventional 3D shape completion approaches interpolate the missing region based on the observed surface [13], [25] or structural priors [54]. However, they cannot adapt to shapes with severe incompleteness or novel structures. Another research direction of 3D shape completion is to find similar CAD models in 3D shape repositories [10], [30], [31], [58], but it is limited by the scale of the repositories.
More recently, 3D shape completion has been dominated by learning-based methods, where a trainable model learns a mapping between the partial scan and its complete counterpart. According to the shape representation, these works can be divided into several categories: voxel, point cloud, and implicit field. Voxel-based shape completion approaches discretize the 3D shape into volumetric grids and leverage 3D CNNs to learn the mapping [12], [56]. Most of these approaches are limited to low resolution since 3D convolutional operations are generally memory- and computation-intensive. The point cloud is a more flexible representation compared to volumetric grids. Existing studies focus on generating shapes using a coarse-to-fine generation scheme [71] or learning to deform 2D surface patches into 3D surface elements to facilitate the completion of fine-grained details [20], [34]. There are also pioneering works that represent 3D shapes with an implicit representation [7], [40], [46]. Despite the promising results on shape reconstruction, the implicit representation is limited by immature network architectures and the requirement of large-scale training data. Our method is based on point clouds [43], [74]. Unlike the point cloud-based approaches described above, our method completes shapes via the predicted object symmetry, which explicitly involves structural information and is therefore more robust.
Object Pose Estimation. Our work is also relevant to object pose estimation. Previous works mainly focus on estimating instance-level pose and require 3D CAD models [29], [52], [63], which are unavailable in our problem setting. Our problem is more similar to category-level pose estimation, where the transformation between the camera coordinate frame and a category-wise shape coordinate frame is estimated [6], [64]. Nevertheless, those methods cannot be applied to symmetry detection directly, because symmetry detection requires object symmetry annotations, which are unknown for most 3D shape datasets, even for datasets with axis-aligned objects such as ShapeNet [5]. This inspires our idea of learning to estimate symmetry without direct symmetry supervision.
Weakly-Supervised Learning. Weakly-supervised learning is a general concept for learning problems where the supervision signal is incomplete, inaccurate, or inexact, aiming at reducing the requirement of obtaining labeled data [76]. Incomplete supervision concerns situations where only a small amount of labeled data is given while more unlabeled data is available [51]. Inaccurate supervision describes the problem where the supervision is noisy, so it is not accurate enough to train a satisfactory model [72]. Inexact supervision addresses the problem where the given supervision is relevant but indirect [2], [70]. A plethora of works on weakly-supervised learning have been proposed in the computer vision and graphics communities over the past decade to address a wide range of problems, such as object detection [62], semantic segmentation [26], and 3D reconstruction [56]. Our method, for the first time, introduces weakly-supervised learning to the symmetry detection task. It leverages the inherent relations between symmetry and the shape prior, and infers symmetry with the inexact supervision from existing shape repositories.

Overview
The input to our method is a single-view RGB-D image containing an object. We assume the object is already segmented from the image by a detector. The output is $M_{\mathrm{ref}}$ reflectional symmetries and $M_{\mathrm{rot}}$ rotational symmetries of the object, where $M_{\mathrm{ref}} \in [0, 3]$ and $M_{\mathrm{rot}} \in [0, 1]$. In our method, detecting symmetry consists of two key components: 1) a network that identifies whether a shape is plausible or not by learning the shape prior from a large-scale shape repository; 2) a network that detects symmetries from single-view RGB-D images such that the completed symmetric shape fits the learned shape prior. In the following, we provide solutions to implement these components. First, we introduce the discriminative variational autoencoder (DVAE) with an out-of-distribution likelihood estimation module to estimate the plausibility of a shape (Section 3.2). The DVAE, trained with a completion task, is capable of not only completing shapes for plausible symmetric objects, but also distinguishing implausible objects from plausible ones. Second, we present a symmetry prediction network that estimates symmetries for objects in RGB-D images with weakly-supervised learning based on the learned DVAE (Section 3.3). Fig. 2 illustrates the overview of our method.

Learning Distributions for Plausibility Estimation
In this subsection, our goal is to learn a function $s_x = f(x)$, where $x$ is the 3D point cloud of the input shape and $s_x \in [0, 1]$ is the estimated shape plausibility. Note that $x$ is not necessarily a complete object and may contain noise. The function $f(\cdot)$ is implemented with a VAE architecture, where the encoder embeds $x$ into a high-dimensional latent space and the decoder generates a complete shape from the latent code. Recent advances have demonstrated the remarkable capability of VAEs for shape reconstruction when trained on large-scale shape repositories. However, a VAE assumes all input data lies in the true data distribution and falls short of identifying out-of-distribution samples (i.e., implausible shapes in our case). To solve this problem, we introduce an out-of-distribution likelihood estimation module into the conventional VAE framework. Its core idea is to learn two separate distributions for plausible and implausible shapes, respectively, so the network can distinguish plausible from implausible shapes in the latent space.

Learning Distribution of Plausible Shapes
We first describe the basic VAE for distribution learning of plausible shapes. The encoder of the VAE embeds the input data into a high-dimensional feature space. Specifically, the encoder takes a point cloud $x \in \mathbb{R}^{k \times 3}$ as input, where $k$ is the number of points, and outputs the mean $\mu$ and the standard deviation $\sigma$ of a Gaussian distribution $p(z|x) = \mathcal{N}(\mu, \sigma)$ by

$$\mu = \mathrm{FC}_{\mu}(\mathrm{Enc}(x)), \qquad \sigma = \mathrm{FC}_{\sigma}(\mathrm{Enc}(x)),$$

where $\mathrm{Enc}(\cdot)$ is the encoder, $\mathrm{FC}_{\mu}$ and $\mathrm{FC}_{\sigma}$ are fully-connected layers, and $\mu, \sigma \in \mathbb{R}^{l}$. The architecture of the VAE encoder consists of four point set abstraction layers, four point feature propagation layers [50], and two 1D convolutional layers. The decoder samples from the distribution $z \sim \mathcal{N}(\mu, \sigma)$ and is expected to reconstruct $x$ by

$$\hat{x} = \mathrm{Dec}(z),$$

where $\mathrm{Dec}(\cdot)$ is the decoder and $\hat{x} \in \mathbb{R}^{k \times 3}$ is the reconstructed shape. A naive way to implement the decoder is to use an MLP that takes the sampled code $z$ and a 2D point set sampled uniformly in the unit square as input, and outputs the deformed 3D points. However, directly generating an object with complex geometry by a single MLP is practically infeasible [20], [69]. To cope with this problem, similar to existing point-based generation networks [20], [34], we use $m$ MLPs to generate a set of deformed 3D point sets simultaneously. The deformed 3D point set generated by each MLP corresponds to a small planar patch of the object surface, and their union represents the whole object. Each surface patch contains $k/m$ points, so the overall surface formed by all patches has the same number of points as the input $x$.
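To make this structure concrete, the following PyTorch sketch illustrates the two fully-connected heads and the multi-MLP patch decoder described above. It is a minimal sketch under stated assumptions: the point cloud backbone is abstracted into a pre-computed global feature `feat`, and the hidden sizes and per-forward re-sampling of the unit square are our illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DVAEHeads(nn.Module):
    def __init__(self, feat_dim=256, latent_dim=64, k=1024, m=16):
        super().__init__()
        self.fc_mu = nn.Linear(feat_dim, latent_dim)     # FC_mu
        self.fc_sigma = nn.Linear(feat_dim, latent_dim)  # FC_sigma (log-variance)
        self.k, self.m = k, m
        # m small MLPs, each deforming a 2D patch into k/m 3D surface points
        self.patch_mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(latent_dim + 2, 128), nn.ReLU(),
                          nn.Linear(128, 64), nn.ReLU(),
                          nn.Linear(64, 3))
            for _ in range(m)])

    def forward(self, feat):
        # feat: (B, feat_dim) global feature from the point cloud encoder backbone
        mu, log_var = self.fc_mu(feat), self.fc_sigma(feat)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterize
        pts = self.k // self.m
        patches = []
        for mlp in self.patch_mlps:
            # 2D points sampled uniformly in the unit square, deformed to 3D
            uv = torch.rand(feat.shape[0], pts, 2, device=feat.device)
            z_rep = z.unsqueeze(1).expand(-1, pts, -1)
            patches.append(mlp(torch.cat([z_rep, uv], dim=-1)))
        return torch.cat(patches, dim=1), mu, log_var  # (B, k, 3) completed shape
```

Each of the $m$ MLPs outputs $k/m$ points, so the concatenated patches reproduce the $k$-point resolution of the input, as described above.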
The training loss of the VAE is

$$L_{\mathrm{VAE}} = L_{\mathrm{KL}} + w_{\mathrm{recon}} \cdot L_{\mathrm{recon}},$$

where $L_{\mathrm{KL}}$ is the Kullback-Leibler divergence loss, which pushes the predicted Gaussian distribution $\mathcal{N}(\mu, \sigma)$ to approximate the standard normal distribution $\mathcal{N}(0, 1)$; $L_{\mathrm{recon}}$ is the reconstruction loss, computed as the Earth Mover's Distance [15] between $x$ and $\hat{x}$; and $w_{\mathrm{recon}}$ is a pre-defined weight.

Learning Distribution of Implausible Shapes
The above shape completion VAE forces the predicted distributions of all input shapes to be close to a standard normal distribution, which increases the generality on untrained in-distribution samples but brings an incapability of identifying out-of-distribution samples [22]. Based on the network described above, we propose to learn an extra distribution of implausible shapes simultaneously with the VAE, equipping the network with a shape discrimination ability. Specifically, we learn to embed an implausible shape $\bar{x}$ into another Gaussian distribution $\mathcal{N}(\bar{\mu}, \bar{\sigma})$ in the latent space that is away from the plausible shape distribution $\mathcal{N}(\mu, \sigma)$. Inspired by the idea of negative sampling in VAEs [9], the implausible shapes are generated from the 3D shape repository with random editing and perturbation. Examples of the generated shapes are visualized in Fig. 16. The details of the generation process are described in the Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2022.3186876.
Overall, the training loss of the DVAE with out-of-distribution sample identification becomes

$$L_{\mathrm{DVAE}} = L_{\mathrm{KL}} + L_{\mathrm{KL}}^{\mathrm{neg}} + w_{\mathrm{recon}} \cdot L_{\mathrm{recon}},$$

where the Kullback-Leibler divergence loss of implausible shapes $L_{\mathrm{KL}}^{\mathrm{neg}}$ optimizes the predicted Gaussian distribution $p(z|\bar{x})$ to be close to a predefined distribution $\mathcal{N}(\bar{\mu}, \bar{\sigma})$. We use $\bar{\mu} = 1$ and $\bar{\sigma} = 5$ in our experiments. Please refer to Section 4 for the quantitative experiment and the discussion regarding the rationality of this parameter selection.

Fig. 2. Method overview. The method consists of two main components. First, a discriminative variational autoencoder is proposed to distinguish plausible shapes from implausible shapes via learning two separate distributions for each of them in the latent feature space. Second, given an RGB-D image with an object in it, the RGB-D transformer aggregates the multi-modality features and estimates symmetries as well as the symmetry-induced object proposals. The network training optimizes the predicted symmetry such that the object proposal is as similar to the plausible shapes as possible. This is achieved by using the plausibility loss and the visibility loss while keeping the parameters of the pre-trained encoder of the discriminative VAE fixed.
The two KL divergence loss functions provide the model with a discriminative component: the negative data is expected to be close to the negative distribution, and thus is far from the positive distribution. The reconstruction loss is also beneficial to the discriminative task, by regularizing and balancing the learned latent space based on the plausible complete shapes, so the latent space will be less influenced by the randomly generated implausible shapes.
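The sketch below shows how the two KL terms can be computed in closed form, assuming diagonal Gaussian posteriors parametrized by log-variance, a $\mathcal{N}(0, 1)$ target for plausible shapes, and a $\mathcal{N}(1, 5)$ target for implausible ones. The additive combination mirrors the loss above; the Earth Mover's Distance operator is left abstract (`emd_fn`), since it typically comes from an external op.

```python
import torch

def kl_to_gaussian(mu, log_var, prior_mu, prior_sigma):
    # closed-form KL(N(mu, sigma^2) || N(prior_mu, prior_sigma^2)), per sample
    var = log_var.exp()
    kl = (torch.log(prior_sigma / var.sqrt())
          + (var + (mu - prior_mu) ** 2) / (2 * prior_sigma ** 2) - 0.5)
    return kl.sum(dim=-1)

def dvae_loss(mu_pos, log_var_pos, mu_neg, log_var_neg,
              recon, target, emd_fn, w_recon=0.2):
    # plausible shapes -> N(0, 1); implausible shapes -> N(1, 5)
    l_kl = kl_to_gaussian(mu_pos, log_var_pos,
                          torch.tensor(0.), torch.tensor(1.)).mean()
    l_kl_neg = kl_to_gaussian(mu_neg, log_var_neg,
                              torch.tensor(1.), torch.tensor(5.)).mean()
    l_recon = emd_fn(recon, target)  # Earth Mover's Distance (external op)
    return l_kl + l_kl_neg + w_recon * l_recon
```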

Shape Plausibility Estimation
During DVAE inference, the shape plausibility can be predicted with the out-of-distribution likelihood estimation. Basically, we leverage the KL divergence between the prior and the posterior distributions of the input shape. Suppose $p_x$ is the distribution predicted for input shape $x$. The shape plausibility $s_x$ is measured by

$$s_x = g\big[ D_{\mathrm{KL}}\left(p_x \,\|\, \mathcal{N}(\bar{\mu}, \bar{\sigma})\right) - D_{\mathrm{KL}}\left(p_x \,\|\, \mathcal{N}(0, 1)\right) \big], \qquad (5)$$

where $g[\cdot]$ is the sigmoid function and $D_{\mathrm{KL}}$ is the KL divergence.
A high $s_x$ means that $x$ is close to the distribution of plausible shapes and far from that of implausible shapes.
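A minimal self-contained sketch of Equation (5), assuming the reconstructed form above (the difference of the two KL divergences passed through a sigmoid); shapes near the plausible prior and far from the implausible prior receive scores close to 1.

```python
import torch

def kl_to_prior(mu, log_var, prior_mu, prior_sigma):
    # per-sample closed-form KL(N(mu, sigma^2) || N(prior_mu, prior_sigma^2))
    var = log_var.exp()
    kl = (torch.log(prior_sigma / var.sqrt())
          + (var + (mu - prior_mu) ** 2) / (2 * prior_sigma ** 2) - 0.5)
    return kl.sum(dim=-1)

def plausibility(mu, log_var):
    # s_x = sigmoid(D_KL(p_x || N(1,5)) - D_KL(p_x || N(0,1))), per sample
    d_neg = kl_to_prior(mu, log_var, torch.tensor(1.), torch.tensor(5.))
    d_pos = kl_to_prior(mu, log_var, torch.tensor(0.), torch.tensor(1.))
    return torch.sigmoid(d_neg - d_pos)  # s_x in [0, 1]
```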
The effectiveness of the reconstruction loss for shape plausibility estimation is demonstrated in Fig. 3. The reconstruction loss of plausible shapes regularizes the latent space by forcing incomplete yet plausible shapes to be close to complete ones in the latent space. This makes the shapes distribute in the latent space according to geometric correctness instead of shape completeness.

Weakly-Supervised Symmetry Detection Network
Given the learned distributions, symmetry detection from RGB-D images could be formulated as the problem of maximum likelihood estimation (MLE). In our problem, the goal of MLE is to determine the parameters of object symmetry, via optimizing a likelihood function, such that the completed shape induced by the predicted object symmetry is most plausible. We adopt a deep neural network to approximate this likelihood function. An overview of the network is shown in Fig. 4. In the following, we describe the network architecture, the symmetry parametrization and the loss functions.

RGB-D Transformer for Feature Aggregation
Given an RGB-D image with a segmented object in it, we first crop the depth image according to the segmentation mask of the object and convert the cropped depth image into a point cloud. Next, we extract and aggregate features from the color image and the point cloud. Previous works have demonstrated the effectiveness of utilizing both the appearance features and the geometric features from the two modalities [23], [67]. Despite this progress, current RGB-D feature aggregation methods are still unsatisfactory in correlating the crucial information in each modality [55], [65]. Symmetry detection from single-view RGB-D images demands that the method be able to accumulate appearance and geometry clues, not only in local patches but also across long-range regions. In particular, the long-range region context inherently corresponds to symmetric counterpart searching, which is well suited to our problem. Hence, we propose the RGB-D Transformer, a new attention-based module that aggregates appearance features and geometric features over multi-scale regions. To achieve this, we first extract features from the color image and the point cloud by a 2D CNN [21] and a point convolutional network [50], respectively. Then the features from the two modalities are fetched from multi-scale feature maps and fused by $N_s$ self-attention layers and $N_m$ mutual attention layers. The network architecture of the two attention layers is shown in Fig. 5.

Fig. 3. The difference between the proposed discriminative VAE (DVAE) trained with/without the reconstruction loss. (a) The DVAE without the reconstruction loss tends to distribute the shapes in the latent space according to their completeness. (b) In contrast, the reconstruction loss forces the incomplete yet plausible shapes to be close to the complete ones in the latent space, such that the shapes are distributed in the latent space by geometric correctness, i.e., shape plausibility.

Fig. 4. The architecture of the symmetry detection network. The RGB-D transformer first feeds the RGB image and the point cloud (converted from the depth image) into a 2D CNN and a point convolutional network to extract multi-scale appearance and geometric features, respectively. For each 3D point of the object, the multi-scale features are fetched, concatenated, and processed by an attention module to learn the correlations. This is achieved by the self-attention layers and mutual attention layers. The aggregated object feature is fed into a symmetry detector that estimates multiple symmetries as well as the corresponding object proposals. The estimated symmetries are then optimized by the plausibility loss and the visibility loss simultaneously.

Suppose the concatenated point-level feature at point $p$ is

$$F_p = \left[ F^p_{C,1}, F^p_{C,2}, F^p_{C,3}, F^p_{D,1}, F^p_{D,2}, F^p_{D,3} \right],$$

where $F^p_{C,1}, F^p_{C,2}, F^p_{C,3}$ are the appearance features fetched from the multi-scale 2D CNN feature maps and $F^p_{D,1}, F^p_{D,2}, F^p_{D,3}$ are the geometric features fetched from the multi-scale point convolutional network feature maps. The self-attention layer processes $F_p$ with Linear Attention [24]:

$$S = \mathrm{SelfAttention}(Q, K, V) = \phi(Q)\left(\phi(K)^{T} V\right),$$

where

$$Q = F_p W_Q, \quad K = F_p W_K, \quad V = F_p W_V \qquad (7)$$

are the query vector, the key vector, and the value vector, respectively, and $W_Q$, $W_K$, $W_V$ are the learned weights. $\phi(\cdot)$ is computed as

$$\phi(\cdot) = \mathrm{elu}(\cdot) + 1,$$

where $\mathrm{elu}(\cdot)$ is the exponential linear unit activation function [8]. $\phi(\cdot)$ is applied row-wise to the matrices $Q$ and $K$. Two Add & Norm layers and a feed-forward layer follow to further aggregate the feature. The output of the self-attention layers is the transformed point-level feature $F^p_s$. To generate the feature of the entire object, the mutual attention layers compute

$$F_o = \phi(Q)\left(\phi(K)^{T} V\right),$$

where $Q = F_s W_Q$, $K = F_s W_K$, $V = F_s W_V$ are the query vector, the key vector, and the value vector, respectively. Note that $Q$, $K$, $V$ here do not share weights with those in Equation (7). Two Add & Norm layers and a feed-forward layer follow to further aggregate the feature. A visualization of how the attention layers aggregate features is shown in Fig. 13.
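A minimal sketch of the linear attention step with the $\phi(x) = \mathrm{elu}(x) + 1$ feature map from [24]. Multi-head splitting, the Add & Norm and feed-forward layers, and the row normalization used in [24] are omitted to match the equation shown above.

```python
import torch
import torch.nn.functional as F

def linear_self_attention(x, w_q, w_k, w_v):
    # x: (B, N, C) concatenated point-level features; w_*: (C, C) projections
    q = F.elu(x @ w_q) + 1               # phi(Q)
    k = F.elu(x @ w_k) + 1               # phi(K)
    v = x @ w_v
    return q @ (k.transpose(1, 2) @ v)   # phi(Q) (phi(K)^T V), linear in N
```

Because $\phi(K)^{T} V$ is a $C \times C$ matrix, the cost grows linearly with the number of points $N$ rather than quadratically, which is what makes long-range counterpart searching affordable here.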

Symmetry Detection
Given the object feature $F_o$, a symmetry detector is applied to estimate $M_{\mathrm{ref}}$ reflectional symmetries and $M_{\mathrm{rot}}$ rotational symmetries. The symmetry detector is implemented with MLPs. $M_{\mathrm{ref}}$ and $M_{\mathrm{rot}}$ are set to 3 and 1, respectively. We use the following parametrization to represent the reflectional and rotational symmetries, as well as the symmetric counterparts. For reflectional symmetry, as shown in Fig. 6a, the symmetry detector estimates a rigid transformation $T = \{R \,|\, t\}$, where $R$ and $t$ are the rotation matrix and the translation vector, respectively. $R$ is first estimated as a quaternion and then converted into a matrix.
Since the mirror planes of the potential reflectional symmetries, if they exist, should be perpendicular to each other, the counterparts of the input shape $x$ are computed from the predicted transformation with this perpendicularity constraint built in:

$$x^{i}_{\mathrm{ref}} = T^{-1} M_i T x, \quad i \in \{1, 2, 3\},$$

where $M_1$, $M_2$, and $M_3$ are the reflection matrices about the $x$-$y$, $y$-$z$, and $x$-$z$ planes, respectively. The goal of the above equation is to transfer the input point cloud into a canonical coordinate system in which the $x$-$y$, $y$-$z$, and $x$-$z$ planes are the potential symmetry planes.
For rotational symmetry prediction, as shown in Fig. 6b, the network estimates the object center $c = (c_x, c_y, c_z)$ and the rotational symmetry axis $u = (u_x, u_y, u_z)$, where $u$ is a unit vector with $u_x^2 + u_y^2 + u_z^2 = 1$. The counterparts of the input shape $x$ are then computed as

$$x^{\theta}_{\mathrm{rot}} = Q_{\theta}(x - c) + c, \quad \theta \in \Theta,$$

where $\Theta = \{k \cdot \pi/8\}_{k=1,\dots,16}$ and $Q_{\theta}$ is the rotation matrix of rotating by angle $\theta$ around the axis $u$:

$$Q_{\theta} = \begin{pmatrix} \cos\theta + u_x^2(1-\cos\theta) & u_x u_y(1-\cos\theta) - u_z\sin\theta & u_x u_z(1-\cos\theta) + u_y\sin\theta \\ u_y u_x(1-\cos\theta) + u_z\sin\theta & \cos\theta + u_y^2(1-\cos\theta) & u_y u_z(1-\cos\theta) - u_x\sin\theta \\ u_z u_x(1-\cos\theta) - u_y\sin\theta & u_z u_y(1-\cos\theta) + u_x\sin\theta & \cos\theta + u_z^2(1-\cos\theta) \end{pmatrix}.$$

Please refer to the Appendix for the formula derivation, available in the online supplemental material.
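A sketch of both counterpart computations under stated assumptions: the reflectional branch uses the reconstructed $x_{\mathrm{ref}} = T^{-1} M T x$ form (transform into the canonical frame, reflect, transform back), the rotational branch uses the axis-angle (Rodrigues) form of $Q_{\theta}$ above, and a row-vector point convention is assumed throughout.

```python
import torch

def reflect_counterparts(x, R, t):
    # x: (N, 3) observed points; R: (3, 3), t: (3,) of the predicted T = {R | t}
    x_canon = x @ R.T + t                  # into the canonical frame
    mirrors = [torch.diag(torch.tensor(m))
               for m in ([1., 1., -1.], [-1., 1., 1.], [1., -1., 1.])]
    # reflect about the x-y, y-z, x-z planes, then map back to the camera frame
    return [((x_canon @ M) - t) @ R for M in mirrors]

def rotate_counterparts(x, c, u):
    # x: (N, 3); c: (3,) center; u: (3,) unit axis direction
    zero = torch.zeros(())
    K = torch.stack([torch.stack([zero, -u[2], u[1]]),
                     torch.stack([u[2], zero, -u[0]]),
                     torch.stack([-u[1], u[0], zero])])  # cross-product matrix
    out = []
    for k in range(1, 17):                 # theta in {k*pi/8}, k = 1..16
        theta = torch.tensor(k * torch.pi / 8)
        Q = (torch.cos(theta) * torch.eye(3)
             + torch.sin(theta) * K
             + (1 - torch.cos(theta)) * torch.outer(u, u))  # Rodrigues form
        out.append((x - c) @ Q.T + c)
    return out
```

Both branches are built entirely from differentiable tensor operations, which is what permits the end-to-end training noted next.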
Since the computations of $x_{\mathrm{ref}}$ and $x_{\mathrm{rot}}$ are differentiable, the symmetry detection network can be trained in an end-to-end manner.

Weakly-Supervised Training
We denote the input shape together with its estimated symmetric counterparts as the generated object proposals $O = \{o_i\}$, where each proposal $o_i$ is the union of the observed point cloud $x$ and the counterparts induced by one predicted symmetry. To train the symmetry detection network, each object proposal $o_i \in O$ is fed into the pre-trained DVAE encoder, which embeds it into the high-dimensional latent space to estimate the shape plausibility using Equation (5). Note that the DVAE encoder is fixed during training; only the parameters of the symmetry detection network are optimized.
We use two loss functions to train the network. First, the shape plausibility loss maximizes the plausibility of the generated object proposals:

$$L_{\mathrm{plau}} = \sum_{o_i \in O} \left(1 - s_{o_i}\right),$$

where $s_{o_i}$ is the shape plausibility of object proposal $o_i$ estimated by the DVAE encoder. Besides, we introduce a counterpart visibility loss to penalize predicted counterpart points that appear in the observed region:

$$L_{\mathrm{vis}} = \sum_{o_i \in O} K_i,$$

where $K_i$ is the number of points in $o_i$ that are located in the observed region of the camera. See Fig. 7 for an illustration.
The overall training loss is

$$L = L_{\mathrm{plau}} + w_{\mathrm{vis}} \cdot L_{\mathrm{vis}}.$$
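A sketch of the combined weak-supervision objective, assuming the reconstructed loss forms above. It reuses the `plausibility` helper from the sketch in Section 3.2; `in_observed_region` is an assumed helper returning a boolean mask of points falling inside the observed region of the camera, and `encode_fn` is the frozen DVAE encoder.

```python
def symmetry_detection_loss(proposals, encode_fn, in_observed_region, w_vis=1e-5):
    # proposals: list of (N_i, 3) object proposals O = {o_i}
    # encode_fn: frozen, pre-trained DVAE encoder -> (mu, log_var)
    l_plau, l_vis = 0.0, 0.0
    for o in proposals:
        mu, log_var = encode_fn(o.unsqueeze(0))               # batch of one
        l_plau = l_plau + (1.0 - plausibility(mu, log_var)).sum()
        l_vis = l_vis + in_observed_region(o).float().sum()   # K_i
    return l_plau + w_vis * l_vis
```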

Network Inference
During network inference, the symmetry detection network predicts 3 reflectional symmetries and 1 rotational symmetry. The plausibility of the corresponding object proposals is then estimated, and we filter out predicted symmetries whose plausibility is less than 0.5. Although we consider visibility during network training, the network may still output symmetries that violate this constraint. To alleviate the issue, we treat these incorrect symmetries as false positives and remove them from the set of predicted symmetries. This is achieved by examining whether a sufficient number of counterpart points are located in the observed region of the camera frustum.
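A sketch of this inference-time filtering. The 0.5 plausibility threshold comes from the text; the `max_visible_ratio` cutoff and the `in_observed_region` helper are illustrative assumptions standing in for the frustum test described above.

```python
def filter_symmetries(symmetries, proposals, scores, in_observed_region,
                      max_visible_ratio=0.1):
    # scores: plausibility s_{o_i} of each proposal, in [0, 1]
    kept = []
    for sym, o, s in zip(symmetries, proposals, scores):
        if s < 0.5:
            continue                       # implausible completion
        visible = in_observed_region(o).float().mean()
        if visible > max_visible_ratio:
            continue                       # counterpart violates visibility
        kept.append(sym)
    return kept
```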

Category-Specific Train & Inference Protocol
The DVAEs and the symmetry detection networks are trained individually on each object category. During the inference, the method requires the category information of the object to be known, so it can adopt the appropriate category-specific DVAE as well as the symmetry detection network. In practical applications where the category information is not given, the category could be predicted by any feasible classifier/detector.

Discriminative Variational Autoencoder
The DVAE is trained on the ShapeNet dataset. To enhance the sim-to-real ability, we add random noise perturbation to both the generated plausible and implausible shapes. During training, we feed an equal number of positive and negative samples into the network at each step. In the previous paragraphs, we have discussed some of the parameter settings; here, we list the rest. The number of points $k$ is 1024. The latent feature length $l$ is 64. The number of MLPs $m$ in the DVAE decoder is 16. We use the Adam optimizer with an initial learning rate of 0.001. The batch size is 32. $w_{\mathrm{recon}}$ is set to 0.2. The network training of one object category takes about 3-5 epochs to converge. The inference time is about 0.3 s.

Symmetry Detection Network
Given a learned DVAE of a specific category, the corresponding symmetry detection network is adopted. The symmetry detection network is trained with the DVAE parameters fixed. In particular, the symmetry detection network can be trained and tested on any RGB-D dataset with object segmentation; object symmetry annotation is not a necessity. The numbers of self-attention layers $N_s$ and mutual attention layers $N_m$ are both 2. The number of attention heads in the self-attention and mutual attention layers is 3. For RGB-D datasets with object pose annotation, an extra pose estimation loss can be added on the predicted rigid transformation $T$ to increase the training speed. In such a case, an $L_1$ loss is added during the initial 5 training epochs to quickly transform the object into the canonical coordinate system. The $L_1$ loss is then turned off, making the network focus on symmetry detection. The network is trained using the Adam optimizer with an initial learning rate of 0.001 and a batch size of 32. The training weight $w_{\mathrm{vis}}$ is 0.00001. The network training of one category takes about 8-10 epochs to converge. The inference time is about 0.5 s.

RESULTS AND EVALUATION
We evaluate the proposed method in this section. First, we describe the experimental datasets (Section 4.1) and the evaluation metrics (Section 4.2). Second, we show the detected symmetries on various scenes (Section 4.3) and quantitatively compare our method against the baselines (Section 4.4). Third, ablation and parameter studies (Section 4.5), robustness evaluation (Section 4.6), and shape discrimination evaluation (Section 4.7) are conducted to further analyze the proposed method.

Experimental Datasets
Our experiments are conducted on four public datasets: ShapeNet [5], ScanNet [11], KITTI [18] and SAPIEN [66]. We describe the data pre-processing and experimental setting for each dataset as follows.
ShapeNet consists of a large number of synthetic CAD models. We use the processed data from [53], which contains the rendered RGB-D images and symmetry annotations. We use the train/test split that includes 100,000 training RGB-D images and 14,400 testing RGB-D images across 10 categories. The test set is split into two subsets: holdout view and holdout instance. ShapeNet is adopted to evaluate the proposed method on large-scale data with various categories.
ScanNet includes RGB-D sequences of real-world indoor scenes, with objects annotated with ground-truth symmetries by [3]. In total, there are 18,970 RGB-D images, each containing various objects, across 400 scenes. The test set is split into two subsets: holdout view and holdout scene. ScanNet is adopted to evaluate performance on large-scale data of real-world scenes.
KITTI is a large-scale real-world dataset with scanning sequences of outdoor scenes. Since there are no object symmetry annotations, we show qualitative comparison results.
SAPIEN contains part-based articulated objects. This dataset is used to evaluate the stability of the methods in terms of handling complex objects with minor part-level differences. This is achieved by conducting a symmetry detection experiment on discrete frames of a video in which some parts of the objects are moving.

Evaluation Metrics
Since each object may have an arbitrary number of symmetries, evaluating symmetry detection models is not straightforward. To evaluate the performance of symmetry detection methods, the metrics should: 1) measure whether the model predicts the number of symmetries correctly; and 2) verify whether the detected symmetries have accurate parameters. To this end, we use the two evaluation metrics described as follows.
First, we opt to use the precision-recall curve, produced by altering the threshold on the confidence value of the estimated symmetries, and the corresponding F1-score. To determine whether a predicted symmetry is positive or negative, we compute the dense symmetry error between the predicted symmetry and the ground-truth symmetry. For reflectional symmetry, the dense symmetry error between the predicted symmetry $S_{\mathrm{ref}}$ and the ground-truth symmetry $\hat{S}_{\mathrm{ref}}$ on an object with point cloud $P = \{P_i\}, i \in [1, N]$, is measured as

$$E^{\mathrm{ref}} = \frac{1}{N} \sum_{i=1}^{N} \left\| T_{\mathrm{ref}}(P_i) - \hat{T}_{\mathrm{ref}}(P_i) \right\|_2,$$

where $T_{\mathrm{ref}}$ and $\hat{T}_{\mathrm{ref}}$ are the reflections induced by $S_{\mathrm{ref}}$ and $\hat{S}_{\mathrm{ref}}$, respectively. For rotational symmetries, the dense symmetry error between the predicted symmetry $S_{\mathrm{rot}}$ and the ground-truth symmetry $\hat{S}_{\mathrm{rot}}$ is

$$E^{\mathrm{rot}} = \frac{1}{|\Gamma| \, N} \sum_{\gamma \in \Gamma} \sum_{i=1}^{N} \left\| T_{\mathrm{rot},\gamma}(P_i) - \hat{T}_{\mathrm{rot},\gamma}(P_i) \right\|_2,$$

where $T_{\mathrm{rot},\gamma}$ is the rotational transformation of $S_{\mathrm{rot}}$ with rotation angle $\gamma$, and the set of rotation angles is $\Gamma = \{k \cdot \pi/8\}_{k=1,\dots,16}$. Predicted symmetries whose dense symmetry error to any ground-truth symmetry is less than 0.25 are counted as true positives.

Second, to evaluate how accurate the predicted symmetry parameters are, we propose the mean symmetry error. This metric evaluates the accuracy of the estimated symmetry on test sets in which each object has a fixed number of symmetries. The mean symmetry errors of reflectional symmetry $G^{\mathrm{ref}}$ and rotational symmetry $G^{\mathrm{rot}}$ are computed by averaging the dense symmetry error over the symmetry predictions of all objects:

$$G^{\mathrm{ref}} = \frac{1}{K} \sum_{k=1}^{K} E^{\mathrm{ref}}_k, \qquad G^{\mathrm{rot}} = \frac{1}{K} \sum_{k=1}^{K} E^{\mathrm{rot}}_k,$$

where $E^{\mathrm{ref}}_k$ and $E^{\mathrm{rot}}_k$ are the dense symmetry errors of the $k$th object in the subset for reflectional and rotational symmetry, respectively, and $K$ is the number of objects in the test set.
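A small sketch of the reflectional dense symmetry error, assuming the reconstructed formula above and a hypothetical plane parametrization $(n, d)$ with unit normal $n$ and offset $d$ (i.e., the plane $n \cdot p + d = 0$), which is one common way to represent a symmetry plane.

```python
import torch

def reflect(points, n, d):
    # mirror points about the plane n . p + d = 0 (n: (3,) unit normal)
    return points - 2.0 * ((points @ n) + d).unsqueeze(-1) * n

def dense_reflection_error(points, pred_plane, gt_plane):
    # points: (N, 3); each plane: (n, d) tuple
    p_pred = reflect(points, *pred_plane)
    p_gt = reflect(points, *gt_plane)
    return (p_pred - p_gt).norm(dim=-1).mean()  # E^ref
```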

Qualitative Results
We visualize qualitative results of our method on various data in Fig. 8. Our method not only achieves accurate symmetry detection on synthetic data (rows 1-3) but also generalizes well to more challenging scenarios (rows 4-8), such as objects with complex geometry, heavy occlusion, texture-less surfaces, sparse points, and poor lighting conditions. Moreover, we qualitatively compare our method against the state-of-the-art methods in Fig. 9. Our method accurately detects symmetry in all the test cases and is more robust than the state-of-the-art methods.

Comparison to State-of-The-Arts
To further show the advantages of the proposed method, we quantitatively compare it to state-of-the-art methods. First, we compare the proposed method against existing methods that directly detect symmetries. Second, since symmetry detection from RGB-D images is relevant to other tasks, such as shape completion and pose estimation, we also compare it with the modified state-of-the-art shape completion and pose estimation methods that are applicable to the symmetry detection task.
Baselines. We compare our method against the following baselines. SymmetryNet [53]: the state-of-the-art symmetry detection approach that detects both reflectional and rotational symmetries from RGB-D images. It is a competitive baseline, as it learns symmetry prediction with dense and accurate symmetry supervision via a progressive network architecture. This baseline is compared in two manners: a unified SymmetryNet trained and tested on all object categories (SymmetryNet-U) and category-specific SymmetryNets trained and tested on each category separately (SymmetryNet-C). Shape completion: a learning-based approach that couples a recent shape completion network [34] with a symmetry detection network that works on completed shapes [17]. Specifically, it adopts a two-stage pipeline that first completes the shape and then infers the potential symmetries. Pose estimation [6]: the state-of-the-art category-level 6D pose estimation network. The symmetries are obtained by transforming the category-average symmetries with the predicted object pose. Geometric fitting [14]: a heuristic-based reflectional symmetry detection method that tolerates moderate data incompleteness of the target shape. NeRD [75]: a learning-based method that estimates the normal direction of the reflectional symmetry plane from RGB images. It can detect at most one reflectional symmetry per image.
Could the Method Find All Symmetries Accurately? We conduct an experiment on the ShapeNet and ScanNet datasets for both reflectional and rotational symmetry detection. The precision-recall curves are plotted in Figs. 10 and 11. Geometric fitting fails on most of the tested data, which shows the difficulty of the task caused by data incompleteness. Our method performs the best compared to the other learning-based baselines, especially on the holdout instances and holdout scenes, which are more challenging. This demonstrates the robustness and generality of our method in both synthetic and real-world scenarios. It is worth noting that our method achieves even better performance than the supervised-learning baselines SymmetryNet-U and SymmetryNet-C. This verifies the effectiveness of our idea of detecting symmetry via shape completion. In particular, we see that SymmetryNet-U slightly outperforms SymmetryNet-C on several test subsets (see Figs. 10c, 10d, 11a, and 11b), implying that the performance of symmetry detection on specific categories might benefit from the knowledge in all training data. Shape completion is also inferior to our method, demonstrating that a naive combination of shape completion and symmetry detection is less effective for symmetry detection.

Fig. 8. Qualitative symmetry detection results on ShapeNet [5], KITTI [18], and ScanNet [11]. Our method is able to handle objects with complex geometry, heavy occlusion, texture-less surfaces, sparse points, and poor lighting conditions.
How Accurate Are the Predicted Symmetries? We conduct an experiment to evaluate the accuracy of the predicted symmetries. To achieve this, we first pick several RGB-D images that contain objects with only one symmetry as the test set. For methods that may output multiple symmetries per image, we only use the symmetry with the highest prediction confidence. Since NeRD can only estimate the normal direction of the reflectional symmetry plane, we generate the full symmetry parameters by using the object center predicted by our method. The results are reported in Table 1. Our method achieves state-of-the-art performance in all experiments, thanks to the discriminative ability of the DVAE, which identifies incorrect symmetry predictions through the analysis of the generated object proposals.

Fig. 9. Qualitative comparisons against existing methods on challenging scenarios. Geometric fitting [14] fails on most of the examples due to data incompleteness. Shape completion [17], [34] is unstable on objects with sparse points. SymmetryNet [53] and Pose estimation [6] detect most of the symmetries correctly, but are less accurate on shapes with complex geometry. In contrast, our method accurately detects all the symmetries.

Ablation and Parameter Study
In Fig. 12, Tables 2 and 3, we study several key components and parameters to quantify their efficacy. The experiments are conducted on ShapeNet holdout instance and ScanNet holdout scene subsets.
No Decoder. The DVAE decoder learns to generate completed shapes with the reconstruction loss, which might not be necessary for the shape discrimination learning task. We turn off the decoder and the reconstruction loss, and retrain the network to evaluate the performance. Interestingly, we found that the ablated baseline leads to a quantitative performance drop, especially on the ScanNet dataset. This demonstrates shape reconstruction is indeed helpful to our method, confirming the idea that the reconstruction loss could regularize the learned latent feature space.
No Color Input. We disable the color image input to study the impact of appearance features. The degraded performance of the baseline validates the necessity of the color image input. It is worth noting that the training of this baseline is much slower than the full method, showing that color images could facilitate the network training by providing crucial clues to alleviate the symmetry prediction ambiguities.
No RGB-D Transformer. The RGB-D transformer aggregates features from the color and depth images using an attention mechanism. To validate the efficacy of the attention module, we replace the self-attention layers and the mutual attention layers with two fully-connected layers, so the ablated baseline has no attentional feature aggregation. We observe two phenomena from the result of No RGB-D transformer. First, it is inferior to the full method in terms of F1-score, indicating that the RGB-D transformer is indeed helpful. Second, the ablated baseline has a higher precision when the recall is small, but it falls sharply when the recall exceeds 0.2. This means the RGB-D transformer is especially useful for detecting symmetry in challenging cases. See Fig. 13 for a visualization of the RGB-D transformer's effects.
No Visibility Loss. We turn off the visibility loss so the predicted symmetric counterparts might appear in the invalid space. We found that the ablation leads to a small performance decrease, implying that the visibility constraint could be learned in an end-to-end manner during the network training and is indeed beneficial to symmetry detection.
Alternative Negative Prior. Our method selects $\mathcal{N}(1, 5)$ as the target Gaussian distribution of implausible shapes. To study its rationality, we first compare it to a naive alternative, No negative distribution, which directly pushes the negative samples away from the distribution of plausible shapes, i.e., $\mathcal{N}(0, 1)$. As such, the loss of the negative distribution learning becomes $L_{\mathrm{KL}}^{\mathrm{neg}} = \max(0, D - L_{\mathrm{KL}})$, where $D = 1$ is the pre-defined margin. The performance drops dramatically, especially on the challenging real-world test set ScanNet holdout scene, which contains severely occluded objects, demonstrating that explicitly learning a Gaussian distribution for the implausible shapes is the better choice. Besides, we study alternative parameter settings for the negative Gaussian distribution. The results are shown in Tables 2 and 3. In general, our method is not sensitive to the parameter setting. It achieves the best performance when $\bar{\mu} = 1$ and $\bar{\sigma} = 5$. In particular, the large $\bar{\sigma}$ provides a high latent space capacity to accommodate various implausible shapes with high diversity.

Table 1. The mean symmetry error (MSE) of our method and several baselines (lower is better). Table 2. The maximal F1-score for different $\bar{\mu}$ with $\bar{\sigma} = 5$ (higher is better); our method performs best with $\bar{\mu} = 1$. Table 3. The maximal F1-score for different $\bar{\sigma}$ with $\bar{\mu} = 1$ (higher is better); our method performs best with $\bar{\sigma} = 5$.

Robustness Evaluation
Robustness to Scanning Noise. To test how our method performs on RGB-D images with scanning noise, we conduct a pressure test: we randomly add Gaussian noise perturbation to the point clouds of objects in the ShapeNet holdout instance subset. The results are shown in Fig. 14 (left). Our method remains robust when the noise standard deviation is at most 4 cm (with shapes normalized into a 1 m × 1 m × 1 m cube).

Robustness to Occlusion. We also test our method on RGB-D images with occlusions. For the RGB-D images in the ShapeNet holdout instance subset, we randomly add realistic obstacles and test our method on them. The performance under different occlusion ratios is reported in Fig. 14 (right). It shows that our method is robust to moderate occlusion, e.g., occlusion ratios up to 40%. Note that the occlusion ratio is computed based on the mutual occlusion caused by the added obstacles and does not include the unseen area due to self-occlusion.
Robustness to Structure Changes. To show the stability of our method in realistic scenarios, namely RGB-D images of objects with changing structure, we test our method on the rendered RGB-D images in SAPIEN, which includes videos of articulated objects whose parts can be manipulated. The task is extremely challenging because the object appearance in all images is similar, but the object symmetries might differ. In this experiment, the model is trained on ScanNet and fine-tuned on SAPIEN. Quantitative and qualitative comparisons are shown in Table 4 and Fig. 15, respectively. Our method quantitatively outperforms SymmetryNet by a large margin on the tested data. It is also noteworthy that SymmetryNet fails on most of the test scenes, as shown in Fig. 15. In contrast, our method is more stable under structure changes, thanks to the generality provided by the discriminative variational autoencoder. This strong generality makes our method applicable to challenging real-world cases, such as the examples shown in Figs. 8 and 9.

Table 4. The maximal F1-score (higher is better). Compared to SymmetryNet [53], our method is more stable and robust to structure changes.

Fig. 15. Robustness evaluation on objects with changing structure. Our method is robust to object structure changes, thanks to the generality provided by the DVAE.

Shape Discrimination Evaluation
The symmetry detection network is built on the pre-trained DVAE with its shape discriminative ability. To show how good this shape discriminative ability is, we quantitatively evaluate the DVAE with a classification task. We produce a test set containing 5,000 plausible shapes and 5,000 implausible shapes, generated using the methods described in Section 3.4. Note that the task is nontrivial, as the shapes are incomplete and may have complex patterns due to the random cropping and editing. The DVAE achieves a high accuracy (92.4%) on this dataset, outperforming the baseline [50], which achieves an accuracy of 78.9%. This demonstrates the good shape discrimination ability of our method. Examples of the test shapes are visualized in Fig. 16.

CONCLUSION AND DISCUSSION
With our work, we have studied the problem of detecting 3D reflectional and rotational symmetries from single-view RGB-D images. To tackle the problem, we have proposed a symmetry detection network with weak supervision. At its heart, the proposed method learns a symmetry-induced shape completion that generates plausible shapes with the estimated symmetry. To this end, we have proposed a discriminative variational autoencoder to learn the shape prior and to estimate shape plausibility from incomplete data, as well as a new multi-modality aggregation network for feature extraction and symmetry estimation. Besides, a new symmetry parametrization is proposed to facilitate effective end-to-end network training and multiple symmetry estimation. The proposed weakly-supervised framework leads to robust symmetry detection and is more applicable to challenging scenarios with good generality. Moreover, the proposed method achieves state-of-the-art performance on two public datasets. Our current solution has the following limitations. First, we assume the objects are pre-segmented and pre-classified from the RGB-D images by a detector. The detected symmetries might be inaccurate if the object is incorrectly segmented or classified. Optimizing the object segmentation and the symmetry prediction simultaneously with a unified network is an interesting problem and might benefit more applications. Second, we train individual models for each object category separately, limiting the cross-category generalization ability. One straightforward future direction is to learn a universal network for all categories. Third, our method needs a color image input. However, symmetry is a purely geometric property and could potentially be detected with only geometric input. A potential future direction is to explore symmetry detection with blurry color images or even without color images. We expect that symmetry detection from single-view images will provide useful geometric complements to more downstream applications.
Yifei Shi (Member, IEEE) received the PhD degree in computer science from the National University of Defense Technology, in 2019. He is an assistant professor with the College of Intelligence Science and Technology, National University of Defense Technology (NUDT). During 2017-2018, he was a visiting student research collaborator with Princeton University, advised by Thomas Funkhouser and Szymon Rusinkiewicz. His research interests mainly include computer vision, computer graphics, especially on object/scene analysis and manipulation by machine learning and geometric processing techniques. He has published more than 20 papers in top-tier conferences and journals, including CVPR, ECCV, ICCV, SIGGRAPH Asia, and ACM Transactions on Graphics.
Xin Xu (Senior Member, IEEE) received the BS degree in electrical engineering from the Department of Automatic Control, National University of Defense Technology, and the PhD degree in control science and engineering from the College of Mechatronics and Automation, National University of Defense Technology. He has been a visiting scientist for cooperation research in the Hong Kong Polytechnic University, University of Alberta, and the University of Strathclyde, respectively. Currently, he is a full professor and the director of the Department of Intelligent Science and Technology with the National University of Defense Technology. His main research areas include: reinforcement learning and intelligent vehicles, learning control, robotics and machine learning. He has coauthored four books and published more than 150 papers in international journals and conferences. He is an associate editor of Information Sciences, CAAI Transactions on Intelligence Technology, Acta Automatica Sinica, Intelligent Automation and Soft Computing. He has also been a guest editor of IEEE Transactions on System, Man and Cybernetics: Systems, International Journal of Adaptive Control and Signal Processing. He received the 2nd class National Natural Science Award of China in 2012, the 1st class Natural Science Award of Hunan Province, P. R. China, in 2009, and the Fok Ying Tong Youth Teacher Fund of China in 2008. He is a committee member of the IEEE Technical Committee on Approximate Dynamic Programming and Reinforcement Learning and the IEEE Technical Committee on Robot Learning. He has served as a PC member or session chair in many international conferences.
Junhua Xi received the master's degree from the National University of Defense Technology, in 2013. She is currently working toward the PhD degree with the College of Computer Science and Technology, National University of Defense Technology. Her research interests include 3D vision and robotics, especially on object analysis, multiview stereo and large-scale scene reconstruction.
Xiaochang Hu is currently working toward the PhD degree with the College of Intelligence Science and Technology, National University of Defense Technology. His research interests include robotics, outdoor scene understanding and semi-supervised learning. He has published several papers in international journals and conferences such as Knowledge-Based Systems, Journal of Field Robotics, IEEE Transactions on Multimedia, etc.
Dewen Hu (Senior Member, IEEE) received the BSc and MSc degrees from Xi'an Jiaotong University, China, in 1983 and 1986, respectively, and the PhD degree from the National University of Defense Technology, in 1999. In 1986, he was with the National University of Defense Technology. From October 1995 to October 1996, he was a visiting scholar with The University of Sheffield, U.K. In 1996, he was promoted as a professor. He has authored more than 200 articles in journals, such as the Brain, the Proceedings of the National Academy of Sciences of the United States of America, the NeuroImage, the Human Brain Mapping, the IEEE Transactions on Pattern Analysis and Machine Intelligence, the IEEE Transactions on Image Processing, the IEEE Transactions on Signal Processing, the IEEE Transactions on Neural Networks and Learning Systems, the IEEE Transactions on Medical Imaging, and the IEEE Transactions on Biomedical Engineering. His research interests include pattern recognition and cognitive neuroscience. He is currently an action editor of Neural Networks, and an associate editor of IEEE Transactions on Systems, Man, and Cybernetics: Systems.
Kai Xu (Senior Member, IEEE) received the PhD degree from the National University of Defense Technology, in 2011. He is a professor with the College of Computer, National University of Defense Technology. He conducted visiting research with Simon Fraser University and Princeton University. His research interests include geometric modeling and shape analysis, especially on data-driven approaches to the problems in those directions, as well as 3D vision and its robotic applications. He has published more than 80 research papers, including more than 20 SIGGRAPH/TOG papers. He has co-organized several SIGGRAPH Asia courses and Eurographics STAR tutorials. He serves on the editorial board of ACM Transactions on Graphics, Computer Graphics Forum, Computers & Graphics, and The Visual Computer. He also served as program cochair of CAD/Graphics 2017, ICVRV 2017 and ISVC 2018, as well as PC member for several prestigious conferences including SIGGRAPH, SIGGRAPH Asia, Eurographics, SGP, PG, etc. His research work can be found in his personal website: www.kevinkaixu.net.