3D-ReConstnet: A Single-View 3D-Object Point Cloud Reconstruction Network

Object 3D reconstruction from a single-view image is an ill-posed problem. Inferring the self-occluded part of an object makes 3D reconstruction a challenging and ambiguous task. In this paper, we propose a novel neural network for generating a 3D-object point cloud model from a single-view image. The proposed network, named 3D-ReConstnet, is an end-to-end reconstruction network. The 3D-ReConstnet uses a residual network to extract the features of a 2D input image as a feature vector. To deal with the uncertainty of the self-occluded part of an object, the 3D-ReConstnet uses a Gaussian probability distribution learned from the feature vector to predict the point cloud. The 3D-ReConstnet can generate a deterministic 3D output for a 2D image with sufficient information, and it can also generate semantically different 3D reconstructions for the self-occluded or ambiguous part of an object. We evaluated the proposed 3D-ReConstnet on the ShapeNet and Pix3D datasets, where it obtained improved results over the compared methods.


I. INTRODUCTION
Reconstructing the shape of 3D objects from a single view is a fundamental task in robot navigation and grasping, CAD, virtual reality and similar applications. Therefore, data-driven 3D object reconstruction has attracted increasing attention.
At present, there are two kinds of 3D object representations: voxel and point cloud. Voxel-based neural networks [1]-[3] reconstruct 3D objects by generating voxelized three-dimensional occupancy grids. However, the voxel representation suffers from two problems: sparse information and high computational complexity, especially when processing high-resolution 3D objects. To make up for these deficiencies, Fan et al. [4] proposed point cloud-based 3D object reconstruction, a deep learning method for point cloud generation. The 3D point cloud of an object is composed of three-dimensional points uniformly sampled from the surface of the object. The point cloud model is scalable and flexible, so we use the point cloud as our 3D representation.
The difficulties of 3D point cloud reconstruction are: 1. When a 2D input contains enough information, the reconstruction network needs to infer an accurate 3D reconstruction; 2. When a 2D input is ambiguous or uncertain, the reconstruction network needs to generate multiple plausible 3D reconstructions for the 2D input. As shown in Figure 1 (a), the view of the chair in the two 2D images provides enough information for reconstruction. It is unreasonable to predict a single deterministic output for an ambiguous input. In this work, we propose a neural network called 3D-ReConstnet for single-view 3D point cloud reconstruction. The 3D-ReConstnet uses a residual network to extract a feature vector from the 2D input, and uses a probability distribution learned from the feature vector to predict the 3D point cloud. The 3D-ReConstnet can generate a deterministic 3D output, as shown in Figure 1 (b), for a 2D image with sufficient information (such as Figure 1 (a)), but in the case of uncertainty or ambiguity in the input 2D image (such as Figure 2 (a)), 3D-ReConstnet can generate multiple plausible reconstructions, as shown in Figure 2.
The associate editor coordinating the review of this manuscript and approving it for publication was Genoveffa Tortora.
In summary, our contributions in this work are as follows:
1) We propose an end-to-end 3D point cloud reconstruction network: 3D-ReConstnet. The end-to-end network structure enables 3D-ReConstnet to infer a 3D point cloud directly from 2D image features, avoiding the propagation of features across separate networks, as in multi-stage networks [18], and thereby avoiding the loss of features.
2) For an ambiguous 2D input, our 3D-ReConstnet can generate multiple plausible 3D reconstructions from a single input image.
3) We evaluated 3D-ReConstnet on the ShapeNet and Pix3D datasets. The experimental results show that 3D-ReConstnet outperforms the state-of-the-art reconstruction methods in the task of single-view 3D reconstruction.
The rest of this paper is organized as follows: Section II introduces the related work. In Section III, we introduce the 3D-ReConstnet in detail. In Section IV, we evaluate the 3D-ReConstnet on the ShapeNet and Pix3D datasets. Section V concludes this paper.

II. RELATED WORKS
A. SINGLE-VIEW 3D RECONSTRUCTION
Traditional 3D reconstruction methods [5]-[7] need correspondences across multiple views. By contrast, single-view 3D reconstruction requires only one image, which is an advantage over the traditional methods. Single-view 3D reconstruction can be roughly divided into voxel-based 3D reconstruction and point cloud-based 3D reconstruction.
Voxel-based 3D reconstruction. As described below, a number of works have been based on voxel representations. Choy et al. [1] trained a recurrent neural network to learn the mapping from a 2D image to a 3D output from a large amount of synthetic data. In [8], a 3D local shape generation method is proposed. This method infers a low-resolution but complete output by using a 3D encoder, and associates the output with the 3D shapes in a shape database to obtain a 3D voxel reconstruction. Tulsiani et al. [9] proposed an unsupervised 3D voxel reconstruction neural network trained by multi-view observations with unknown poses. Shubham et al. [10] explored ways to reconstruct 3D outputs by using different 2D view projections, such as depth maps, color images and image semantics. Although several studies [11], [12] are devoted to solving the two defects of voxels, sparse information and high computational complexity, and have achieved some good results, the defects of voxels are still obvious compared with point clouds.
Point cloud-based 3D reconstruction. Fan et al. [4] first proposed a 3D reconstruction method based on point clouds. In this method, the Chamfer distance (CD) and the Earth Mover's distance (EMD) were chosen as loss functions to train an autoencoder point cloud generation network, and multiple plausible reconstructions can be generated for an ambiguous input by a variational autoencoder [14], [15]. In [16], a joint segmentation and point cloud reconstruction network, 3D-PSRNet, was proposed. In the training process, 3D-PSRNet propagates information from the segmentation or reconstruction task to the other task, and uses the CD and a location-aware segmentation loss as the loss function. The main contribution of the work in [17] is a geometric adversarial loss with two components: a geometric loss and a conditional adversarial loss. The geometric loss is responsible for ensuring that the shape of the 3D reconstruction is close to the ground truth, while the conditional adversarial loss generates a semantically meaningful point cloud. Mandikal et al. [18] proposed a two-stage point cloud reconstruction network: 3D-LMNet. First, 3D-LMNet uses the chamfer loss to train a point cloud auto-encoder. Then, 3D-LMNet uses a diversity loss and a latent matching loss to map the latent vector of the auto-encoder to a probability distribution in order to handle uncertain 2D inputs. In [19], a deep pyramid network for generating dense 3D point clouds was proposed. The pyramid network is trained with the CD and EMD losses to predict a low-resolution point cloud. Then, the low-resolution point cloud is turned into a high-resolution point cloud by a dense reconstruction network. Chen et al. [20] proposed a Point Auto-Encoder, which is built on novel semi-convolutional and semi-fully-connected layers that can handle the problem of mapping from a single global feature vector to a massive number of 3D points.
All the related work of point cloud-based 3D reconstruction is devoted to two problems in 3D point cloud reconstruction: 1. Design a better point cloud reconstruction neural network. 2. Choose a more suitable loss function. Only by putting forward better solutions to the above two problems, can we reconstruct more accurate 3D output for 2D input with sufficient information and reasonable output with multiple possibilities for uncertain 2D input.

III. APPROACH
A. ARCHITECTURE OF 3D-ReConstnet
The architecture of the 3D-ReConstnet is shown in Figure 3. Our 3D-ReConstnet is an end-to-end neural network. The end-to-end network architecture enables the semantic features of 2D images to be transferred only within the network, rather than across networks as in 3D-LMNet [18], thus reducing the loss of features. The 3D-ReConstnet has three main tasks: 2D input feature extraction, sampling a probabilistic vector, and point cloud generation. The deep neural network ResNet-50 is used to extract the features of 2D images. The residual connections make the deep neural network ResNet-50 easy to train, so it can extract sufficient semantic features of 2D images without suffering from vanishing gradients. The 3D-ReConstnet first uses the residual network ResNet-50 proposed in [21] to extract the features of the 2D input image and obtains a feature vector Z. After that, a fully connected layer compresses the dimension of the vector Z from 1000 to 100, and yields the vector Z_c.
We learn a probabilistic distribution from the vector Z_c in order to generate multiple possible 3D shapes for an uncertain 2D input. We map the vector Z_c of a specific 2D input to a Gaussian vector Z, i.e. Z ∼ N(µ, σ²). We use the "reparameterization trick" of Variational Auto-Encoders [14], [15] to deal with the randomness in the network. The network in the middle dotted box in Figure 3 is responsible for predicting the mean µ and standard deviation σ from Z_c, sampling ε ∼ N(0, 1), and finally obtaining the Gaussian probabilistic vector as Z = µ + εσ. The mean µ of Z is unconstrained, and the standard deviation σ is constrained by ε, so that the uncertain 2D input image can be reconstructed meaningfully and diversely.
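The sampling step Z = µ + εσ can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sample_latent`, the 4-dimensional vectors and the chosen values of µ and σ are assumptions made only for illustration (the paper's Z_c has 100 dimensions, and µ and σ are predicted by the network).

```python
import random

def sample_latent(mu, sigma, rng=random):
    """Draw Z = mu + eps * sigma element-wise, with eps ~ N(0, 1)."""
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + e * s for m, e, s in zip(mu, eps, sigma)]
    return z, eps

# Hypothetical 4-dimensional example (assumed values, not taken from the paper).
mu = [0.5, -0.2, 0.0, 1.0]
sigma_zero = [0.0, 0.0, 0.0, 0.0]  # unambiguous input: every draw collapses to mu
sigma_wide = [1.0, 1.0, 1.0, 1.0]  # ambiguous input: each draw gives a different Z

rng = random.Random(0)
z_det, _ = sample_latent(mu, sigma_zero, rng)
z_a, eps_a = sample_latent(mu, sigma_wide, rng)
z_b, eps_b = sample_latent(mu, sigma_wide, rng)
```

Because the randomness enters only through ε, the mapping from µ and σ to Z stays differentiable, which is why the trick allows the sampling network to be trained end to end.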
We use a multi-layer perceptron (MLP) with two hidden layers and one output layer to transform the probabilistic vector Z into point cloud data. The activation function of the two hidden layers is Leaky ReLU [22], and that of the output layer is tanh [23]. The output channels of the two hidden layers are 512 and 1024 respectively. The output layer has N×3 output channels, where N is the number of points in the point cloud. The value of the ReLU function on the negative axis is zero. However, the Leaky ReLU function has non-zero values on the negative axis (Figure 4 (a)), and Leaky ReLU also has non-zero derivative values on the negative axis (Figure 4 (b)). Therefore, with Leaky ReLU as the activation function, the negative information in the neural network is not lost. The normalized point cloud coordinates lie in the range [−1,1], which means that many points have negative coordinates. During forward propagation, the coordinate information contained in negative values is passed to the next layer through Leaky ReLU. During back propagation, because the derivative of Leaky ReLU is not zero, the gradient corresponding to negative values still contributes to the weight update of the network.
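The decoder described above can be sketched as a small forward pass. This is an illustrative sketch only: the helper names, the random weight initialisation and the Leaky ReLU slope of 0.2 are assumptions (the text does not state the slope), and small layer sizes are used in the demo so that it runs quickly in pure Python. The paper's sizes would be 100 → 512 → 1024 → N×3.

```python
import math
import random

def leaky_relu(x, slope=0.2):
    """Leaky ReLU; the slope value is an assumption, not given in the paper."""
    return x if x > 0.0 else slope * x

def dense(inputs, weights, biases):
    """Fully connected layer: `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random placeholder weights; a trained network would supply real ones."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_decoder(latent_dim, hidden1, hidden2, num_points, seed=0):
    rng = random.Random(seed)
    sizes = [latent_dim, hidden1, hidden2, num_points * 3]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

def decode(z, layers):
    """Two Leaky ReLU hidden layers, then a tanh output reshaped to N x 3."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h = [leaky_relu(v) for v in dense(z, w1, b1)]
    h = [leaky_relu(v) for v in dense(h, w2, b2)]
    out = [math.tanh(v) for v in dense(h, w3, b3)]
    return [tuple(out[i:i + 3]) for i in range(0, len(out), 3)]

# Small demo sizes (assumed) keep the pure-Python example fast.
layers = build_decoder(latent_dim=8, hidden1=16, hidden2=32, num_points=5)
points = decode([0.1] * 8, layers)
```

Whatever the weights are, the tanh output keeps every generated coordinate inside [−1,1], which matches the range of the normalized ground truth.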
Using tanh as the activation function of the last layer of the multi-layer perceptron quickly fits the generated point cloud data to the range [−1,1]. Figure 5 (a) and (b) are schematic diagrams of tanh and of the derivative of tanh, respectively. As shown in Figure 5 (a), the tanh activation function restricts the coordinate values of the generated 3D point cloud data to [−1,1]. The ground truth read by the reconstruction network is normalized, that is, the coordinates of the ground truth lie exactly in [−1,1]. In the early stage of network training, the tanh activation function reduces the gap between the generated point cloud data and the ground truth as much as possible, which accelerates fitting. However, we use Leaky ReLU instead of tanh as the activation function in the first two layers of the MLP. This is because the gradient of a neural network may vanish during training, and an improper activation function is one of the causes of vanishing gradients. From Figure 5 (b), it can be seen that the derivative of tanh is 1 at x = 0 and less than 1 everywhere else, and that it tends to 0 as x tends to positive or negative infinity. In other words, the derivative of the tanh activation function is less than 1 in most cases. When tanh is used as the activation function, the product obtained from the chain rule may approach 0 as the gradients are multiplied across layers, which eventually leads to a vanishing gradient. In order to reduce this risk, we only use tanh as the activation function in the last layer of the MLP.
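The effect of multiplying activation derivatives through the chain rule can be checked numerically. The sketch below is a simplification under stated assumptions: it ignores the weight matrices and looks only at the activation derivatives, and the pre-activation values and the Leaky ReLU slope of 0.2 are hypothetical.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which equals 1 only at x = 0."""
    return 1.0 - math.tanh(x) ** 2

def leaky_relu_grad(x, slope=0.2):
    """Derivative of Leaky ReLU; the slope is an assumed value."""
    return 1.0 if x > 0.0 else slope

def chained_gradient(grad_fn, pre_activations):
    """Product of activation derivatives along a chain of layers."""
    product = 1.0
    for x in pre_activations:
        product *= grad_fn(x)
    return product

# Hypothetical positive pre-activations at three stacked layers.
pre_acts = [2.0, 2.0, 2.0]
tanh_chain = chained_gradient(tanh_grad, pre_acts)
leaky_chain = chained_gradient(leaky_relu_grad, pre_acts)
```

For these values the tanh chain shrinks by several orders of magnitude while the Leaky ReLU chain stays at 1. On the negative axis Leaky ReLU also scales the gradient by its slope, but that factor is constant and does not decay towards 0 as the input grows in magnitude.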

B. LOSS FUNCTION
The loss function of the 3D-ReConstnet combines the reconstruction losses described below with a diversity loss, which is defined as in [18] (Formula 2). In the diversity loss, ϕ_o is the azimuth angle of the maximum occlusion view, and ϕ_i is the azimuth angle of the 2D input image. The goal of network training is to minimize the loss. The diversity loss only acts on the standard deviation σ of the sampling network for the probabilistic vector Z. The smaller the difference between ϕ_o and ϕ_i, that is, the larger the occlusion of the 2D input, the greater the value of σ. The larger the value of σ, the more likely the 3D-ReConstnet is to generate multiple plausible reconstructions.
Since point cloud is an unordered representation, we need to use a loss function independent of the relative order of the input points to train the point cloud generation network. Fan et al. [4] proposed using Chamfer distance (CD) and Earth Mover's distance (EMD) [25] to train point cloud generation network. This method was widely used in later works [16]- [19].
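Let X_gt ∈ R^(N×3) represent the ground truth and X_pred ∈ R^(N×3) represent the generated point cloud, where N represents the number of points in the point cloud. Following [4], the chamfer distance between X_gt and X_pred is defined as:
$$d_{CD}(X_{gt}, X_{pred}) = \sum_{x \in X_{gt}} \min_{y \in X_{pred}} \|x - y\|_2^2 + \sum_{y \in X_{pred}} \min_{x \in X_{gt}} \|x - y\|_2^2$$
The chamfer distance measures the squared distance between each point in set X_gt and its closest point in set X_pred, and vice versa.
VOLUME 8, 2020
The EMD loss between X_gt and X_pred is defined, following [4], as:
$$d_{EMD}(X_{gt}, X_{pred}) = \min_{\phi : X_{gt} \to X_{pred}} \sum_{x \in X_{gt}} \|x - \phi(x)\|_2$$
where φ is a bijection. The EMD performs a point-to-point mapping between set X_gt and set X_pred.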
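As a concrete illustration of the two distances described above, the following sketch computes both for tiny point sets. It is not the training code: the function names are hypothetical, the Chamfer distance is taken as the sum of squared nearest-neighbour distances in both directions (some implementations average instead), and the EMD is found by brute force over all bijections, which is feasible only for a handful of points. Practical implementations rely on approximate solvers.

```python
import math
from itertools import permutations

def sq_dist(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def chamfer_distance(x_gt, x_pred):
    """Sum of squared nearest-neighbour distances in both directions."""
    gt_to_pred = sum(min(sq_dist(x, y) for y in x_pred) for x in x_gt)
    pred_to_gt = sum(min(sq_dist(x, y) for x in x_gt) for y in x_pred)
    return gt_to_pred + pred_to_gt

def emd_brute_force(x_gt, x_pred):
    """Exact EMD by trying every bijection; only usable for very small sets."""
    if len(x_gt) != len(x_pred):
        raise ValueError("EMD with a bijection needs equally sized point sets")
    best = math.inf
    for perm in permutations(range(len(x_pred))):
        cost = sum(math.sqrt(sq_dist(x, x_pred[j])) for x, j in zip(x_gt, perm))
        best = min(best, cost)
    return best

# Hypothetical toy data: three ground-truth points and a copy shifted by 0.5 in z.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x, y, z + 0.5) for x, y, z in gt]

cd_value = chamfer_distance(gt, shifted)
emd_value = emd_brute_force(gt, shifted)
cd_reordered = chamfer_distance(gt, list(reversed(gt)))
```

Reordering the points leaves both values unchanged, which is the order-independence property that a loss on unordered point clouds requires.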
In [4], Fan et al. illustrate the mean-shape behavior of CD and EMD through pictures. Later research that trained 3D reconstruction networks with CD and EMD found the following characteristics:
1) The chamfer distance is related to the contour of the reconstructed point cloud. A 3D reconstruction network trained with the chamfer distance captures the rough contour of a 3D object more easily [4], [26]. However, a reconstruction network trained with CD tends to generate a splashed shape, which blurs the geometry of the reconstructed shape, and the chamfer loss may confuse different reconstructions that have similar chamfer distances. Figure 6 illustrates the cause of this confusion. Let the blue dots represent the ground truth and the yellow dots represent the predicted point cloud. Suppose that the yellow dots in Figure 6 (a) and (b) represent two different reconstruction results. D_1 ∼ D_6 represent the distances between 6 ground-truth points and 6 points obtained from the first reconstruction. D′_1 ∼ D′_6 represent the distances between the 6 ground-truth points and 6 points obtained from the second reconstruction. If the sum of D_1 ∼ D_6 is equal to the sum of D′_1 ∼ D′_6, the CD will treat the two reconstruction results as equal, although they are not. In the EMD loss function, φ represents a bijection between the ground truth and the predicted point cloud, so the EMD loss does not suffer from this reconstruction confusion.
2) The EMD is related to the visual quality of the reconstructed point cloud [4], [26]. The lower the EMD loss, the better the visual quality of the 3D reconstruction [26], [27]. However, a reconstruction network trained with EMD is not good at capturing the overall contour of the reconstructed object.
We can see that the CD and the EMD loss each have their own advantages, so we take their combination as the loss function of 3D-ReConstnet. The role of CD is to train the network to form the contour of the reconstructed object, and the role of EMD is to train the network to refine the appearance of the reconstructed object.

IV. EXPERIMENTS
We evaluated the proposed 3D-ReConstnet on the ShapeNet [28] dataset and the Pix3D [29] dataset, respectively. The ShapeNet dataset consists of 43809 CAD models in 13 categories. The Pix3D dataset consists of 7595 real images and their corresponding metadata (masks, ground truth CAD models and pose). In order to compare with the related works [4], [18], we use the same partition of training set and test set as [1]: the ratio of training set to test set is 4 to 1. We use the training set taken from ShapeNet to train the 3D-ReConstnet, and carry out 3D reconstruction experiments on the ShapeNet test set and on the Pix3D dataset respectively.
Implementation Details: 3D-ReConstnet is trained using the Adam optimizer, with a batch size of 32 and a learning rate of 0.00005 for 50 epochs. We crop each 2D input image to 128 × 128 and use it as the input of 3D-ReConstnet. The parameters of the ResNet-50 [21] used to extract the features of 2D images are shown in Table 1. Because we only want to extract the features of 2D images, we do not use the softmax at the end of ResNet-50 as in [21].
Evaluation Methodology: We use the Chamfer distance (CD) and Earth Mover's distance (EMD), calculated on 1024 randomly sampled points, to evaluate the reconstruction quality in all our experiments. We selected three images from each object category in the ShapeNet and Pix3D datasets and show their qualitative 3D reconstruction results in Figures 7 and 8. Figure 9 shows the qualitative 3D reconstruction results for ambiguous 2D inputs.
Figure 7 shows the qualitative 3D reconstruction results of three images of each object category in ShapeNet. The predicted resolution of the 3D point cloud reconstruction in Figure 7 is 2048. Table 2 shows the CD and EMD values of the point clouds reconstructed by 3D-ReConstnet (ours), PSGN [4] and 3D-LMNet [18] on the ShapeNet dataset. The smaller the values of CD and EMD, the better the reconstruction quality. The CD and EMD values of 3D-ReConstnet are lower than those of PSGN and 3D-LMNet, and 3D-ReConstnet also has the lowest mean CD and EMD scores.
Figure 8 shows the qualitative 3D reconstruction results of three images of each object category in Pix3D. The predicted resolution of the 3D point cloud reconstruction in Figure 8 is 2048. Table 3 shows the CD and EMD values of the point clouds reconstructed by 3D-ReConstnet (ours), PSGN [4] and 3D-LMNet [18] on the Pix3D dataset. As on ShapeNet, the CD and EMD values of 3D-ReConstnet are lower than those of PSGN and 3D-LMNet, and 3D-ReConstnet also has the lowest mean CD and EMD scores.

C. GENERATING MULTIPLE PLAUSIBLE OUTPUTS
In this experiment, we select the 2D images with the parameter ϕ_o = 180° in Formula 2 from the chair category of ShapeNet, that is, the back-view images with the maximum occlusion. For each back-view chair image, we generated three 3D reconstruction outputs using 3D-ReConstnet. In Figure 9 we show the back and side views of each 3D reconstruction obtained with a different ε. The predicted resolution of the 3D point cloud reconstruction in Figure 9 is 1024. The back view shows the consistency between the reconstruction results and the 2D input, and the side view shows the diversity of the reconstruction results. As shown in Figure 9, 3D-ReConstnet can generate semantically different reconstructions which are consistent with the ambiguous input image with the largest occlusion. From Figure 9, we can see that the handle and leg structures differ among these reconstruction results.

V. CONCLUSION
In this paper, we propose an end-to-end single-view 3D reconstruction network: 3D-ReConstnet. The 3D-ReConstnet maps the features learned from a 2D image to a normally distributed vector to deal with the uncertainty of the self-occluded part of an object. The proposed 3D-ReConstnet can generate a deterministic 3D output for a 2D image with sufficient information, while generating semantically different 3D reconstructions for an ambiguous 2D input. We evaluated 3D-ReConstnet on the ShapeNet and Pix3D datasets.
The experimental results show that 3D-ReConstnet outperforms the state-of-the-art reconstruction methods in the task of single-view 3D reconstruction.