PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking

Tracking 6D poses of objects from videos provides rich information to a robot in performing different tasks such as manipulation and navigation. In this work, we formulate the 6D object pose tracking problem in the Rao-Blackwellized particle filtering framework, where the 3D rotation and the 3D translation of an object are decoupled. This factorization allows our approach, called PoseRBPF, to efficiently estimate the 3D translation of an object along with the full distribution over the 3D rotation. This is achieved by discretizing the rotation space in a fine-grained manner, and training an auto-encoder network to construct a codebook of feature embeddings for the discretized rotations. As a result, PoseRBPF can track objects with arbitrary symmetries while still maintaining adequate posterior distributions. Our approach achieves state-of-the-art results on two 6D pose estimation benchmarks. A video showing the experiments can be found at https://youtu.be/lE5gjzRKWuA


I. INTRODUCTION
Estimating the 6D pose of objects from camera images, i.e., the 3D rotation and 3D translation of an object with respect to the camera, is an important problem in robotic applications. For instance, in robotic manipulation, 6D pose estimation of objects provides critical information to the robot for planning and executing grasps. In robotic navigation tasks, localizing objects in 3D provides useful information for planning and obstacle avoidance. Due to its significance, various efforts have been devoted to tackling the 6D pose estimation problem in both the robotics community [7, 4, 43, 40] and the computer vision community [32, 21, 12].
Traditionally, the 6D pose of an object is estimated using local-feature or template matching techniques, where features extracted from an image are matched against features or viewpoint templates generated from the 3D model of the object. The 6D object pose can then be recovered using 2D-3D correspondences of these local features or by selecting the best matching viewpoint of the object [7, 11, 12]. More recently, machine learning techniques have been employed to detect keypoints or learn better image features for matching [2, 18]. Thanks to advances in deep learning, convolutional neural networks have recently been shown to significantly boost estimation accuracy and robustness [15, 44, 30, 38, 43]. So far, the focus of image-based 6D pose estimation has been on the accuracy of single-image estimates; most techniques ignore temporal information and provide only a single hypothesis for an object pose. In robotics, however, temporal data and information about the uncertainty of estimates can also be very important for tasks such as grasp planning or active sensing. Temporal tracking in video data can improve pose estimation [28, 5, 17, 8]. In the context of point-cloud based pose estimation, Kalman filtering has also been used to track 6D poses, where Bingham distributions have been shown to be well suited for orientation estimation [36]. However, unimodal estimates are not sufficient to adequately represent the complex uncertainties arising from occlusions and possible object symmetries.

Fig. 1. Overview of our PoseRBPF framework for 6D object pose tracking. Our method leverages a Rao-Blackwellized particle filter and an auto-encoder network to estimate the 3D translation and a full distribution of the 3D rotation of a target object from a video sequence.
In this work, we introduce a particle filter-based approach to estimate full posteriors over 6D object poses. Our approach, called PoseRBPF, factorizes the posterior into the 3D translation and the 3D rotation of the object, and uses a Rao-Blackwellized particle filter that samples object poses and estimates discretized distributions over rotations for each particle. To achieve accurate estimates, the 3D rotation is discretized at a 5 degree resolution, resulting in a distribution over 72 × 37 × 72 = 191,808 bins for each particle (elevation ranges only from -90 to 90 degrees). To achieve real-time performance, we pre-compute a codebook of embeddings for all discretized rotations, where the embeddings come from an auto-encoder network trained to encode the visual appearance of an object from arbitrary viewpoints at a certain scale (inspired by [37]). For each particle, PoseRBPF first uses the 3D translation to determine the center and size of the object bounding box in the image, then computes the embedding for that bounding box, and finally updates the rotation distribution by comparing the embedding with the pre-computed entries in the codebook using the cosine distance. The weight of each particle is given by the normalization factor of the rotation distribution. Motion updates are performed efficiently by sampling from a motion model over poses and a convolution over the rotations. Fig. 1 illustrates our PoseRBPF framework for 6D object pose tracking. Experiments on the YCB-Video dataset [43] and the T-LESS dataset [14] show that PoseRBPF is able to represent uncertainties arising from various types of object symmetries and can provide more accurate 6D pose estimation.
Our work makes the following main contributions:
• We introduce a novel 6D object pose estimation framework that combines Rao-Blackwellized particle filtering with a learned auto-encoder network in an efficient and principled way.
• Our framework is able to track full distributions over 6D object poses. It can also do so for objects with arbitrary kinds of symmetries, without the need for any manual symmetry labeling.
The rest of the paper is organized as follows. After discussing related work, we present our Rao-Blackwellized particle filtering framework for 6D object pose tracking, followed by experimental evaluations and a conclusion.

II. RELATED WORK
Our work is closely related to recent advances in 6D object pose estimation using deep neural networks. The current trend is to augment state-of-the-art 2D object detection networks with the ability to estimate 6D object poses. For instance, [15] extend the SSD detection network [24] to 6D pose estimation by adding viewpoint classification to the network. [38] utilize the YOLO architecture [31] to detect the 3D bounding box corners of objects in images, and then recover the 6D pose by solving the PnP problem. PoseCNN [43] designs an end-to-end network for 6D object pose estimation based on the VGG architecture [35]. Although these methods significantly improve the 6D pose estimation accuracy over traditional methods [12, 2, 18], they still face difficulties with symmetric objects, and most methods manually specify the symmetry axis for each such object. In contrast, [37] introduce an implicit way of representing 3D rotations by training an auto-encoder for image reconstruction, which does not need pre-defined symmetry axes for symmetric objects. We leverage this implicit 3D rotation representation in our work, and show how to combine it with particle filtering for 6D object pose tracking.
The particle filtering framework has been widely applied to different tracking applications in the literature [27, 34, 16, 33], thanks to its flexibility in incorporating different observation models and motion priors. It also offers a rigorous probabilistic formulation for estimating uncertainty in the tracking results. Different approaches have been proposed to track the poses of objects using particle filters [1, 6, 29, 42, 20]. However, in order to achieve good tracking performance, a particle filter requires a strong observation model. Also, the tracking frame rate is limited by the particle sampling efficiency. In this work, we factorize the 6D object pose tracking problem and deploy a Rao-Blackwellized particle filter [10], which has been shown to scale to complex estimation problems such as SLAM [39, 26] and multi-model target tracking [19, 33]. We also employ a deep neural network as an observation model that provides robust estimates of object orientations even under occlusions and symmetries. Our design allows us to evaluate all possible orientations in parallel using an efficient GPU implementation. As a result, our method can track the distribution over the 6D pose of an object at 20 fps.

III. 6D OBJECT POSE TRACKING WITH POSERBPF
The goal of 6D object pose tracking is to estimate the 3D rotation R and the 3D translation T of an object for every frame in an image stream. In this section, we first formulate the 6D object tracking problem in a particle filtering framework, and then describe how to utilize a deep neural network to compute the likelihoods of the particles and to achieve an efficient sampling strategy for tracking.

A. Rao-Blackwellized Particle Filter Formulation
At time step k, given observations Z_{1:k} up to time k, our primary goal is to estimate the posterior distribution of the 6D pose of an object P(R_k, T_k | Z_{1:k}), where R_k and T_k denote the 3D rotation and 3D translation of the object at time k, respectively. Using a vanilla particle filter to sample over this 6D space is not feasible, especially when there is large uncertainty over the orientation of the object. Such uncertainties occur frequently when objects are heavily occluded or have symmetries that result in multiple orientation hypotheses. We thus propose to factorize the 6D pose estimation problem into 3D rotation estimation and 3D translation estimation. This idea is based on the observation that the 3D translation can be estimated from the location and the size of the object in the image. The translation estimation provides the center and scale of the object in the image, based on which the 3D rotation can be estimated from the appearance of the object inside the bounding box. Specifically, we decompose the posterior into:

P(R_k, T_k | Z_{1:k}) = P(R_k | T_k, Z_{1:k}) P(T_k | Z_{1:k}), (1)

where P(T_k | Z_{1:k}) encodes the location and scale of the object, and P(R_k | T_k, Z_{1:k}) models the rotation distribution conditioned on the translation and the images.

Fig. 3. Illustration of the computation of the conditional rotation likelihood by codebook matching. Left) Each particle crops the image based on its translation hypothesis. The RoI for each particle is resized and the corresponding code is computed using the encoder. Right) The rotation distribution P(R | Z, T) is computed from the distance between the code for each hypothesis and those in the codebook.
This factorization directly leads to an efficient sampling scheme for a Rao-Blackwellized particle filter [10, 39], where the posterior at time k is approximated by a set of N weighted samples X_k = {T_k^i, P(R_k | T_k^i, Z_{1:k}), w_k^i}_{i=1}^N. Here, T_k^i denotes the translation of the ith particle, P(R_k | T_k^i, Z_{1:k}) denotes the discrete distribution of the particle over the object orientation conditioned on the translation and the images, and w_k^i is the importance weight. To achieve accurate pose estimation, the 3D object orientation, consisting of azimuth, elevation, and in-plane rotation, is discretized into bins of size 5 degrees, resulting in a distribution over 72 × 37 × 72 = 191,808 bins for each particle (elevation ranges only from -90 to 90 degrees). At every time step k, the particles are propagated through a motion model to generate a new set of particles X_{k+1}, from which we can estimate the 6D pose distribution.
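As a concrete sketch, the per-particle state above can be written as a sampled translation paired with a dense rotation histogram over the discretized Euler grid (class and variable names here are illustrative, not taken from our released code):

```python
import numpy as np

# Discretize SO(3) as (azimuth, elevation, in-plane) Euler bins at 5 degrees.
# Azimuth and in-plane rotation span 360 degrees (72 bins each); elevation
# spans -90..90 degrees (37 bins): 72 * 37 * 72 = 191,808 bins in total.
AZ_BINS, EL_BINS, IP_BINS = 72, 37, 72
N_ROT_BINS = AZ_BINS * EL_BINS * IP_BINS  # 191808

class Particle:
    """One Rao-Blackwellized particle: a sampled 3D translation plus a
    full discrete distribution over the 3D rotation."""
    def __init__(self, translation):
        self.t = np.asarray(translation, dtype=np.float64)  # (x, y, z)
        # Uniform rotation distribution at initialization.
        self.rot_dist = np.full((AZ_BINS, EL_BINS, IP_BINS),
                                1.0 / N_ROT_BINS)
        self.weight = 1.0

p = Particle([0.0, 0.0, 1.0])
```

Each particle thus carries only three continuous dimensions to sample, while the rotation posterior is maintained analytically over the grid.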

B. Observation Likelihoods
The observation likelihoods of the two posteriors, P(Z_k | T_k) and P(Z_k | T_k, R_k), measure the compatibility of the observation Z_k with the object pose at the 3D rotation R_k and the 3D translation T_k. According to Bayes' rule, it is sufficient to estimate the likelihood

P(Z_k | T_k, R_k), (2)

up to a normalization constant. Intuitively, a 6D object pose estimation method, such as [15, 38, 43], could be employed to estimate the observation likelihoods. However, these methods only provide a single estimate of the 6D pose instead of a probability distribution, i.e., there is no uncertainty in their estimation. Also, these methods are computationally expensive if we would like to evaluate a large number of samples in the particle filter.
Ideally, if we could synthetically generate an image of the object with the pose (R_k, T_k) in the same scene as the observation Z_k, we could compare the synthetic image with the input image Z_k to measure the likelihoods. However, this is not feasible, since it is very difficult to synthesize the same lighting, background, or even occlusions between objects as in the input video frame. In contrast, it is straightforward to render a synthetic image of the object with constant lighting, a blank background, and no occlusion, given the 3D model of the object. Therefore, inspired by [37], we apply an auto-encoder to transform the observation Z_k into the same domain as the synthetic rendering of the object. Then we can compare image features in the synthetic domain to measure the likelihoods of 6D poses efficiently.
1) Auto-encoder: An auto-encoder is trained to map an image Z of the target object with pose (R, T) to a synthetic image Z' of the object rendered from the same pose, where the synthetic image Z' is rendered with constant lighting and contains no background or occlusion. In this way, the auto-encoder is forced to map images with different lighting, backgrounds, and occlusions to the common synthetic domain. Fig. 2 illustrates the input and output of the auto-encoder during training. In addition, the auto-encoder learns a feature embedding f(Z) of the input image.
Instead of training the auto-encoder to reconstruct images with arbitrary 6D poses, which makes the training challenging, we fix the 3D translation to a canonical one, T_0 = (0, 0, z)^T, where z is a pre-defined constant distance. The canonical translation indicates that the target object is in front of the camera at distance z. The 3D rotation R is uniformly sampled during training. After training, for each discretized 3D rotation R_i, a feature embedding f(Z(R_i, T_0)) is computed using the encoder, where Z(R_i, T_0) denotes a rendered image of the target object from pose (R_i, T_0). We consider the set of all feature embeddings of the discretized 3D rotations to be the codebook of the target, and we show how to compute the likelihoods using the codebook next.
2) Codebook Matching: Given a 3D translation hypothesis T_k, we can crop a Region of Interest (RoI) from the image Z_k and feed the RoI into the encoder to compute a feature embedding of the RoI. Specifically, the 3D translation T_k = (x_k, y_k, z_k)^T is projected to the image to find the center (u_k, v_k) of the RoI:

u_k = f_x · x_k / z_k + p_x,   v_k = f_y · y_k / z_k + p_y, (3)

where f_x and f_y denote the focal lengths of the camera, and (p_x, p_y)^T is the principal point. The size of the RoI is determined by (z / z_k) · s, where z and s are the canonical distance and the RoI size used in training the auto-encoder, respectively. Note that each RoI is a square region in our case, which makes the RoI independent of the rotation of the object.
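The projection and scaling step above can be sketched in a few lines (function and parameter names are illustrative):

```python
import numpy as np

def roi_from_translation(t, fx, fy, px, py, z_bar, s):
    """Project a translation hypothesis t = (x, y, z) into the image to get
    the RoI center, and scale the canonical RoI size s (valid at the
    canonical training distance z_bar) by z_bar / z, so farther objects
    get smaller RoIs."""
    x, y, z = t
    u = fx * x / z + px
    v = fy * y / z + py
    size = z_bar / z * s  # square RoI, independent of the object rotation
    return u, v, size

# An object at twice the canonical distance gets an RoI of half the size.
u, v, size = roi_from_translation((0.0, 0.0, 2.0), fx=500, fy=500,
                                  px=320, py=240, z_bar=1.0, s=128)
```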
The RoI is fed into the encoder to compute the feature embedding f(Z_k(T_k)). Finally, we compute the cosine distance, also referred to as a similarity score, between the feature embedding of the RoI and each code in the codebook to measure the rotation likelihood:

P(R_c^j | T_k, Z_k) ∝ φ( ⟨f(Z_k(T_k)), f(Z(R_c^j, T_0))⟩ / (‖f(Z_k(T_k))‖ · ‖f(Z(R_c^j, T_0))‖) ), (4)

where R_c^j is one of the discretized rotations in the codebook, and φ(·) is a Gaussian probability density function centered at the maximum cosine distance among all the codes in the codebook for all the particles. In this way, we obtain a probabilistic likelihood distribution over all the rotations in the codebook given a translation. Fig. 3 illustrates the computation of the rotation likelihoods by codebook matching.
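A minimal NumPy sketch of this codebook matching step (the standard deviation and all names are illustrative; here the Gaussian is centered at this particle's own best score for simplicity):

```python
import numpy as np

def rotation_likelihood(code, codebook, std=0.05):
    """Cosine similarity between the RoI code and every codebook entry,
    turned into an (unnormalized) likelihood with a Gaussian centered at
    the best score. `std` corresponds to the 0.03-0.1 range used in our
    experiments."""
    code = code / np.linalg.norm(code)
    cb = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    sims = cb @ code                      # one score per discrete rotation
    best = sims.max()                     # center of the Gaussian
    lik = np.exp(-0.5 * ((sims - best) / std) ** 2)
    return sims, lik

# Toy codebook with three "rotations"; the query matches the first entry.
codebook = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
sims, lik = rotation_likelihood(np.array([1.0, 0.0]), codebook)
```

In practice the codebook matrix multiply is a single batched GPU operation over all 191,808 codes.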
Since the auto-encoder is trained with the object at the center of the image and at a certain scale, i.e., with the canonical translation T_0, any change in scale or deviation of the object from the image center results in poor reconstructions (see Fig. 4). Particles with incorrect translations generate RoIs in which the object is off-center or at the wrong scale, so we can check the reconstruction quality of the RoI to measure the likelihood of the translation hypothesis. We utilize this property to compute the translation likelihood P(Z_k | T_k). Intuitively, if the translation T_k is correct, the similarity scores in Eq. (4) for rotations close to the ground truth rotation will be high. Specifically, P(Z_k | T_k) is computed as the sum of the probability densities P(R_c^j | T_k, Z_k) over all the discrete rotations.
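This summation can be sketched as follows, with the Gaussian centered at the best similarity score across all particles so that poorly matching RoIs score low everywhere (names and values are illustrative):

```python
import numpy as np

def translation_likelihood(sims, global_best, std=0.05):
    """P(Z|T) for one translation hypothesis: the sum of the per-rotation
    likelihoods derived from the codebook similarity scores `sims`.
    A well-centered, well-scaled RoI matches several nearby viewpoints
    and scores close to `global_best`; a shifted or mis-scaled RoI
    reconstructs poorly and scores low for every rotation."""
    return np.exp(-0.5 * ((sims - global_best) / std) ** 2).sum()

good = np.array([0.90, 0.88, 0.50])   # sharp, high-similarity matches
bad = np.array([0.30, 0.29, 0.28])    # uniformly poor matches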

C. Motion Priors
A motion prior is used to propagate the distribution of the poses from the previous time step k−1 to the current time step k. We use a constant velocity model to propagate the probability distribution of the 3D translation:

P(T_k | T_{k-1}, T_{k-2}) = N(T_{k-1} + α(T_{k-1} − T_{k-2}), Σ_T), (5)

where N(µ, Σ) denotes the multivariate normal distribution with mean µ and covariance matrix Σ, and α is a hyperparameter of the constant velocity model. The rotation prior is defined as a normal distribution with mean R_{k-1} and fixed covariance Σ_R:

P(R_k | R_{k-1}) = N(R_{k-1}, Σ_R), (6)

where we represent the rotation R using Euler angles. The rotation prior can then be implemented as a convolution of the previous rotation distribution with a 3D Gaussian kernel.
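A sketch of this propagation step (hyperparameter values are illustrative; note the periodic `wrap` boundary for the azimuth and in-plane axes of the Euler grid):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def propagate(t, t_prev, rot_dist, alpha=0.8, trans_std=0.01, rot_sigma=1.0):
    """Constant-velocity translation prior plus a rotation prior applied
    as a 3D Gaussian convolution over the discretized Euler-angle grid.
    Azimuth and in-plane rotation are periodic ('wrap'); elevation is
    not ('nearest')."""
    mean = t + alpha * (t - t_prev)                  # constant velocity
    t_new = np.random.normal(mean, trans_std)        # sample from N(mean, Σ_T)
    rot_new = gaussian_filter(rot_dist, sigma=rot_sigma,
                              mode=['wrap', 'nearest', 'wrap'])
    rot_new /= rot_new.sum()                         # renormalize
    return t_new, rot_new

# Small toy grid instead of the full 72 x 37 x 72 distribution.
t_new, rot_new = propagate(np.array([0.0, 0.0, 1.0]),
                           np.array([0.0, 0.0, 0.9]),
                           np.full((8, 5, 8), 1.0 / 320))
```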

D. 6D Object Pose Tracking Framework
The tracking process can be initialized from any 2D object detector that outputs a 2D bounding box of the target object. Given the first frame Z_1, we backproject the center of the 2D bounding box to compute the (x, y) components of the 3D translation, and sample different depths z uniformly to generate a set of translation hypotheses. The translation T_1 with the highest likelihood P(Z_1 | T) is used as the initial hypothesis, and P(R | T_1, Z_1) as the initial rotation distribution.
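The initialization step can be sketched as follows (the depth range and all names are illustrative assumptions, not values from our experiments):

```python
import numpy as np

def init_translations(bbox_center, fx, fy, px, py, z_range=(0.5, 2.0), n=100):
    """Backproject the detector's 2D box center (u, v) and sample depths z
    uniformly to form initial translation hypotheses; the hypothesis with
    the highest likelihood P(Z_1|T) then seeds the filter."""
    u, v = bbox_center
    zs = np.random.uniform(*z_range, size=n)
    xs = (u - px) / fx * zs       # backprojection: x = (u - px) * z / fx
    ys = (v - py) / fy * zs
    return np.stack([xs, ys, zs], axis=1)

# A box centered at the principal point backprojects to x = y = 0.
hyps = init_translations((320, 240), fx=500, fy=500, px=320, py=240)
```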
At each following frame, we first propagate the N particles with the motion priors. Then the particles are updated with the latest observation Z_k. Specifically, for each particle, the translation estimate T_k^i is used to compute the RoI of the object in image Z_k. The resulting RoI is passed through the auto-encoder to compute the corresponding code. For each particle, the rotation distribution is updated with:

P(R_k | T_k^i, Z_{1:k}) ∝ P(Z_k | T_k^i, R_k) Σ_{R_{k-1}} P(R_k | R_{k-1}) P(R_{k-1} | T_{k-1}^i, Z_{1:k-1}), (7)

where P(Z_k | T_k^i, R_k) is the rotation likelihood defined in Eq. (4), and P(R_k | R_{k-1}) is the motion prior. Finally, we compute the posterior of the translation P(T_k^i | Z_{1:k}) as the normalization factor of this rotation distribution and use it as the weight w_k^i of the particle. The systematic resampling method [9] is used to resample the particles according to the weights w_k^{1:N}. Our 6D object pose tracking framework is shown in Fig. 5.

Fig. 5. We propose PoseRBPF, a Rao-Blackwellized particle filter for 6D object pose tracking. For each particle, the orientation distribution is estimated conditioned on the translation estimation, while the translation estimation is evaluated with the corresponding RoIs.

Some robotic tasks require the expectation of the 6D pose of the object from the particle filter for decision making. The expectation can be represented as (R_k^E, T_k^E). The translation expectation T_k^E can be computed simply by averaging the translation estimates T_k^{1:N} of all N particles, due to the uni-modal nature of the translation in the object tracking task. Computing the rotation expectation R_k^E is less obvious, since the distribution P(R_k) might be multi-modal, and simply performing a weighted average over all the discrete rotations is not meaningful. To compute the rotation expectation, we first summarize the rotation distributions of all the particles by taking the maximum probability for every discrete rotation, resulting in the rotation distribution P(R_k^E). The rotation expectation R_k^E is then computed by a weighted average of the discrete egocentric rotations within a neighborhood of the previous rotation expectation R_{k-1}^E, using the quaternion averaging method proposed in [25].
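The weighted quaternion average in the neighborhood step can be computed with the eigenvector method of [25], which handles the q / -q sign ambiguity of quaternions automatically (a minimal sketch; names are illustrative):

```python
import numpy as np

def average_quaternions(quats, weights):
    """Weighted quaternion average as the eigenvector of the weighted sum
    of outer products q q^T belonging to the largest eigenvalue. Because
    q and -q contribute the same outer product, the antipodal ambiguity
    is resolved automatically."""
    q = np.asarray(quats, dtype=np.float64)
    w = np.asarray(weights, dtype=np.float64)
    m = (w[:, None, None] * (q[:, :, None] * q[:, None, :])).sum(axis=0)
    vals, vecs = np.linalg.eigh(m)     # eigenvalues in ascending order
    avg = vecs[:, -1]                  # eigenvector of the largest eigenvalue
    return avg / np.linalg.norm(avg)

# q and -q represent the same rotation; the average recovers it (up to sign).
q = average_quaternions([[1.0, 0, 0, 0], [-1.0, 0, 0, 0]], [0.5, 0.5])
```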
Performing codebook matching with the estimated RoIs also provides a way to detect tracking failures. We first find the maximum similarity score among all the particles; if this maximal score is lower than a pre-defined threshold, we declare a tracking failure. Algorithm 1 summarizes our Rao-Blackwellized particle filter for 6D object pose tracking.
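The systematic resampling step [9] used in our update can be sketched as follows; one shared uniform offset with N evenly spaced positions gives lower variance than independent multinomial resampling:

```python
import numpy as np

def systematic_resample(weights):
    """Systematic resampling: draw one uniform offset, then take N evenly
    spaced positions over the cumulative weights and return the selected
    particle indices."""
    n = len(weights)
    positions = (np.random.uniform() + np.arange(n)) / n
    cumsum = np.cumsum(weights / np.sum(weights))
    return np.searchsorted(cumsum, positions)

# The zero-weight particle (index 1) is never selected.
idx = systematic_resample(np.array([0.1, 0.0, 0.9]))
```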

E. RGB-D Extension of PoseRBPF
PoseRBPF can be extended to use depth measurements for computing the observation likelihoods. With the RGB image Z_k^C and the additional depth measurements Z_k^D, the observation likelihood in Eq. (2) can be rewritten as:

P(Z_k | T_k^i, R_k) = P(Z_k^C | T_k^i, R_k) P(Z_k^D | T_k^i).

Note that the auto-encoder only uses the RGB image. To compute the likelihood P(Z_k^D | T_k^i) for a translation hypothesis T_k^i, we first render a depth image Ẑ_k^{Di} of the object at the pose of the particle. By comparing the rendered depth image Ẑ_k^{Di} with the depth measurements Z_k^D, we first estimate the visibility mask V_k^i = {p : Ẑ_k^{Di}(p) ≤ Z_k^D(p) + m}, where p indicates a pixel in the image and m is a small positive constant margin to account for sensor noise. A rendered pixel p with depth less than Z_k^D(p) + m is thus determined to be visible. With the estimated visibility mask, the visible depth discrepancy between the two depth maps is computed as:

∆_k^i = mean_{p ∈ V_k^i} min( |Ẑ_k^{Di}(p) − Z_k^D(p)|, τ ) / τ,

where τ is a pre-defined threshold for each object. For every particle, we compute its depth score s_k^i = v_k^i (1 − ∆_k^i), where v_k^i is the visibility ratio of the object, i.e., the number of visible pixels according to the visibility mask divided by the total number of pixels rendered. Finally, we compute P(Z_k^D | T_k^i) ∝ φ(s_k^i), where φ(·) is a Gaussian probability density function centered at the maximum depth score among all the particles.
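A minimal sketch of this depth score on toy 2 × 2 depth maps (the margin, threshold, and names are illustrative; a zero in the rendered map marks pixels where the object is not rendered):

```python
import numpy as np

def depth_score(rendered, measured, valid, margin=0.02, tau=0.05):
    """Visibility-weighted depth score: estimate the visibility mask from
    the rendered depth (a rendered pixel is visible if its depth is less
    than the measured depth plus the margin m), average the visible
    depth discrepancy clipped at tau, and weight by the visibility
    ratio. High when the hypothesis is well aligned and unoccluded."""
    visible = valid & (rendered < measured + margin)
    if not visible.any():
        return 0.0
    disc = np.minimum(np.abs(rendered[visible] - measured[visible]), tau) / tau
    vis_ratio = visible.sum() / valid.sum()  # visible / total rendered pixels
    return vis_ratio * (1.0 - disc.mean())

rendered = np.array([[1.0, 1.0], [1.0, 0.0]])   # object covers 3 pixels
measured = np.array([[1.0, 1.01], [0.6, 0.0]])  # third pixel is occluded
valid = rendered > 0
score = depth_score(rendered, measured, valid)
```

Here two of the three rendered pixels are visible and nearly aligned, so the score is high but discounted by the occluded pixel.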

IV. EXPERIMENTS

A. Datasets
We evaluated our method on two datasets: the YCB Video dataset [43] and the T-LESS dataset [14].
YCB Video dataset: The YCB Video dataset contains RGB-D video sequences of 21 objects from the YCB Object and Model Set [3]. It contains textured and textureless household objects placed in different arrangements. Objects are annotated with 6D poses, and two metrics are used for quantitative evaluation. The first metric is ADD, the average distance between corresponding 3D points on the object at the ground-truth pose and at the predicted pose. The second metric is ADD-S, the average distance from each 3D model point at the ground-truth pose to the closest model point at the predicted pose. ADD-S is designed for symmetric objects, since it focuses on shape matching rather than exact pose matching.
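The two metrics can be sketched directly from their definitions (brute-force nearest neighbor for clarity; names are illustrative):

```python
import numpy as np

def add_metric(pts, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between CORRESPONDING model points under the
    ground-truth and estimated poses."""
    p_gt = pts @ R_gt.T + t_gt
    p_est = pts @ R_est.T + t_est
    return np.linalg.norm(p_gt - p_est, axis=1).mean()

def adds_metric(pts, R_gt, t_gt, R_est, t_est):
    """ADD-S: for each ground-truth point, distance to the CLOSEST
    estimated point, so pose-equivalent rotations of symmetric objects
    are not penalized. O(n^2) for clarity."""
    p_gt = pts @ R_gt.T + t_gt
    p_est = pts @ R_est.T + t_est
    d = np.linalg.norm(p_gt[:, None, :] - p_est[None, :, :], axis=2)
    return d.min(axis=1).mean()

# A 180-degree rotation of this symmetric point set is penalized by ADD
# but not by ADD-S.
pts = np.array([[1.0, 0, 0], [-1.0, 0, 0], [0, 1.0, 0], [0, -1.0, 0]])
I3, zero = np.eye(3), np.zeros(3)
R180 = np.diag([-1.0, -1.0, 1.0])
add_val = add_metric(pts, I3, zero, R180, zero)    # 2.0
adds_val = adds_metric(pts, I3, zero, R180, zero)  # 0.0
```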

T-LESS:
This dataset contains RGB-D sequences of 30 non-textured industrial objects. Evaluation is done on 20 test scenes. The dataset is challenging because the objects do not have texture and exhibit various forms of symmetry and occlusion. We follow the evaluation pipeline of the SIXD challenge and use the Visible Surface Discrepancy err_vsd [13] to evaluate the quality of the pose estimation. The visible surface discrepancy is computed as the mean distance between the visible surface points under the estimated and ground-truth poses. The reported metric is the recall of correct 6D poses, where a pose is correct if err_vsd < 0.3 with a tolerance of 20 mm and object visibility of more than 10%.

B. Implementation Details
The auto-encoder is trained for each object separately for 150,000 iterations with a batch size of 64, using the Adam optimizer with a learning rate of 0.0002. The auto-encoder is optimized with the L2 loss on the N pixels with the largest reconstruction errors. Larger values of N are more suitable for textured objects, to capture more details. We use N = 2000 for textured objects and N = 1000 for non-textured objects. The training data is generated by rendering the object at a random rotation and superimposing it on random crops of the MS-COCO dataset [22] at a resolution of 128 × 128. In addition to the target object, three additional objects are sampled at random locations and scales to generate training data with occlusions. The target object is positioned at the center of the image and jittered by 5 pixels, and the object scale is sampled uniformly between 0.975 and 1.025 with random lighting. Color is randomized in HSV space, and we also add Gaussian noise to the pixel values to reduce the gap between real and synthetic data. The images are rendered online for each training step to provide a more diverse set of training data. The architecture of the network is described in [37]. It consists of four 5 × 5 convolutional layers and four 5 × 5 deconvolutional layers for the encoder and the decoder, respectively. The standard deviations used to compute the observation likelihoods in Eq. (4) are selected between 0.03 and 0.1. The codebook for each object is precomputed offline and loaded at test time. The computation of the observation likelihood is done efficiently on a GPU. Table I shows the frame rate at which PoseRBPF can process images.
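The bootstrapped L2 loss over the N hardest pixels can be sketched as follows (a NumPy illustration of the idea; a real implementation would live in an autodiff framework, and all names here are ours):

```python
import numpy as np

def bootstrapped_l2(pred, target, n_hardest):
    """L2 reconstruction loss averaged over only the n_hardest pixels
    with the largest per-pixel squared error (summed over color
    channels). Concentrating the loss on hard pixels preserves fine
    texture detail during training."""
    err = ((pred - target) ** 2).sum(axis=-1).ravel()  # per-pixel error
    return np.sort(err)[-n_hardest:].mean()            # mean of the hardest

# Toy 4 x 4 RGB image with a single hard pixel of squared error 1.0;
# averaging over the 2 hardest pixels gives (1.0 + 0.0) / 2 = 0.5.
target = np.zeros((4, 4, 3))
pred = target.copy()
pred[0, 0, 0] = 1.0
loss = bootstrapped_l2(pred, target, n_hardest=2)
```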

C. Results on YCB Video Dataset
Table II shows the pose estimation results on the YCB Video dataset, where we compare with the state-of-the-art methods for pose estimation using RGB images [43, 40] and RGB-D images [43, 41]. We initialize PoseRBPF using PoseCNN at the first frame or after the object was heavily occluded. On average, this happened only 1.03 times per sequence. As can be seen, our method significantly improves the accuracy of 6D pose estimation when using 200 particles. Note that our method handles symmetric objects such as 024_bowl and 061_foam_brick much better. One of the objects on which PoseRBPF performs poorly is 036_wood_block, which is caused by the difference between the texture of the 3D model of the wood block and the texture of the wood block in the real images. In addition, the physical dimensions of the wood block differ between the real images and the model contained in this dataset. Another observation is that the accuracy improves significantly with the number of particles, because more samples cover the variations in scale and translation of an object much better.
It has been shown in the context of robot localization that adding samples drawn according to the most recent observation can improve the localization performance [39].
Here, we applied such a technique by sampling 50% of the particles around PoseCNN predictions and the other 50% from the particles of the previous time step. Our results show that such a hybrid version, PoseRBPF++, further improves the pose estimation accuracy of our approach. Fig. 7 illustrates 6D pose estimation on the YCB Video dataset. Depth measurements contain useful information to improve the pose estimation accuracy. By comparing the depth of the object rendered at the estimated pose with the depth image (explained in Sec. III-E), our method achieves state-of-the-art performance. The comparison between the RGB and RGB-D versions of PoseRBPF shows that using depth information with the same number of particles significantly improves the accuracy of the estimated poses. Note that depth information is only used during inference, and the encoder takes only RGB images.

D. Results on T-LESS Dataset
Table III presents our results on the T-LESS dataset. T-LESS is a challenging dataset because the objects do not have texture and are frequently occluded in different frames. We compare our method with [37], which uses a similar auto-encoder but does not use any temporal information. We evaluate both using ground truth bounding boxes and using the detection output from RetinaNet [23], as used in [37]. Our tracker uses 100 particles and is reinitialized whenever the observation likelihood drops below a threshold. The results show that, in the RGB case, the recall for correct object poses doubles by tracking the object pose rather than predicting it from single images. With additional depth images, the recall can be further improved by around 76%, and PoseRBPF outperforms [37] refined with ICP by 28%. For the experiments with ground truth bounding boxes, the rotation is tracked using the particle filter and the translation is inferred from the scale of the ground truth bounding box. This experiment highlights the viewpoint accuracy. In this setting, recall increases significantly for all methods, and the particle filter consistently outperforms [37], which shows the importance of temporal tracking for object pose estimation.

E. Analysis of Rotation Distribution
Unlike other 6D pose estimation methods that output a single estimate for the 3D rotation of an object, PoseRBPF tracks full distributions over object rotations. Fig. 8 shows example distributions for some objects. There are two types of uncertainty in these distributions. The first source is the symmetry of the objects, which results in multiple poses with similar appearances. As expected, each cluster of viewpoints corresponds to one of the symmetry modes. The variance of each cluster corresponds to the true uncertainty of the pose. For example, for the bowl, each ring of rotations corresponds to the uncertainty over the azimuth, because the bowl is rotationally symmetric. Different rings show the uncertainty over the elevation.
To measure how well PoseRBPF captures rotation uncertainty, we compared PoseRBPF's estimates to those of PoseCNN, assuming a Gaussian uncertainty with mean at the PoseCNN estimate. Fig. 6 shows this comparison for the scissors and foam brick objects. Here, the x-axis ranges over percentiles of the rotation distributions, and the y-axis shows how often the ground truth pose is within 0, 10, or 20 degrees of one of the rotations contained in the corresponding percentile. For instance, for the scissors, the solid red line indicates that 80% of the time, the ground truth rotation is within 20 degrees of a rotation taken from the top 20% of the PoseRBPF distribution. If we take the top 20% of rotations estimated by PoseCNN assuming a Gaussian uncertainty, this number drops to about 60%, as indicated by the lower dashed red line. The importance of maintaining multi-modal uncertainties becomes even more prominent for the foam brick, which has a 180° symmetry. Here, PoseRBPF achieves high coverage, whereas PoseCNN fails to generate good rotation estimates even when moving further from the generated estimate.

V. CONCLUSION
In this work, we introduced PoseRBPF, a Rao-Blackwellized particle filter for tracking 6D object poses. Each particle samples a 3D translation and estimates the distribution over 3D rotations conditioned on the image bounding box corresponding to the sampled translation. PoseRBPF compares each bounding box embedding to learned viewpoint embeddings so as to efficiently update the distributions over time. We demonstrated that the tracked distributions capture both the uncertainties arising from object symmetries and the uncertainty of the object pose. Experiments on two benchmark datasets with household objects and symmetric textureless industrial objects show the superior performance of PoseRBPF.

Fig. 2 .
Fig. 2. Illustration of the inputs and outputs of the auto-encoder. Images with different lighting, backgrounds, and occlusions are fed into the network to reconstruct synthetic images of the objects from the same 6D poses. The encoder generates a feature embedding (code) of the input image.

Fig. 4 .
Fig. 4. Visualization of the reconstruction of RoIs by the auto-encoder. The left column is the ground-truth RoI. The other two columns show the reconstructions under shifting and scale change. As shown, the reconstruction quality degrades with deviations from the ground-truth RoI. This property makes the auto-encoder a suitable choice for computing the observation likelihoods.

Fig. 6 .
Fig. 6. Rotation coverage percentile comparison between PoseRBPF and PoseCNN for scissors and foam brick. The foam brick has a 180° planar rotational symmetry, while the scissors is an asymmetric object.

Fig. 7 .
Fig. 7. Visualization of estimated poses on the YCB Video dataset (left) and the T-LESS dataset (right). Ground truth bounding boxes are red, green bounding boxes are particle RoIs, and the object models are superimposed on the images at the poses estimated by PoseRBPF.

Fig. 8 .
Fig. 8. Visualization of rotation distributions. For each image, the distribution over the rotation is visualized. The lines represent rotations whose probability is higher than a threshold. The length of each line is proportional to the probability of that viewpoint. As can be seen, PoseRBPF naturally represents uncertainties due to various kinds of symmetries, including the rotational symmetry of the bowl, the mirror symmetry of the foam brick, and the discrete rotational symmetries of the T-LESS objects on the right.

Fig. 7
Fig. 7 shows the 6D pose estimation of PoseRBPF on several T-LESS images.

TABLE I. EFFECT OF THE NUMBER OF PARTICLES ON FRAME RATE IN TRACKING.

TABLE III. T-LESS RESULTS: OBJECT RECALL FOR err_vsd < 0.3 ON ALL PRIMESENSE