Cascade Convolution Neural Network for Point Set Generation

Automatic and efficient 3D object modeling has become critical in industrial applications. The advancement of deep convolutional neural networks (CNNs) has prompted researchers to learn 3D geometry directly from images. However, the feature maps extracted by CNNs are better suited to image-processing tasks because they mainly contain deep texture information about the entire 2D image, whereas 3D reconstruction with CNNs demands geometric information about a specific object. Existing architectures mainly try to infer geometric structure from texture information, which leads to an uneven distribution of points in the generated point cloud. To address this problem, we propose a cascade point set generation network (CPSGN) that deforms the predicted object while inferring the object's 3D geometric information from the 2D image more effectively, so that the distribution of the final object becomes more uniform and denser. The CPSGN consists of a point set generation part that produces a basic 3D object and a point deformation part that fine-tunes it. In addition, we designed a projection loss that optimizes the geometry of the model by measuring shape differences from multiple perspectives. Experimental results on different benchmark datasets indicate that the produced point-based model outperforms existing approaches.


I. INTRODUCTION
The reconstruction of 3D objects from 2D images is attracting much attention for applications such as mapping in robot navigation, 3D animation, and 3D printing [1]. Depending on the information in the input 2D image and the representation of the output 3D object, 3D reconstruction includes multiple approaches. For the visual representation, researchers commonly adopt meshes, voxels, or point clouds to represent the generated objects [2]. Although all three representations can effectively represent and visualize 3D objects, point clouds provide more advantages in industrial applications that require specific geometric information, and offer a higher degree of freedom with less memory [3]. More specifically, voxel-based representation takes an enormous amount of memory with low computational efficiency, especially at high resolution [4], and mesh-based representation offers poor flexibility and scalability since it is commonly derived from point clouds [5]. Furthermore, the point cloud is a standard 3D acquisition format that is especially suitable for geometric operations and transformations, and it is widely used by common 3D scanning devices such as Kinect, LiDAR, and ToF cameras [6]. We therefore adopt the point cloud as our final representation to benefit from its efficiency and flexibility compared with the other representations.
Regarding the 2D input information, 3D reconstruction can be split into reconstruction from a single-view image and reconstruction from multiple-view images. Traditional 3D reconstruction research has focused on multiple-view images. The main idea is that multiple-view geometry [7] recovers the 3D geometric structure of a static object from images taken from different viewpoints. Common approaches include structure from motion [8], simultaneous localization and mapping [9], scene reconstruction [10], and point cloud alignment [5]. Compared with multi-view 3D reconstruction, it is more difficult to predict 3D geometric information from a single-view image.
Benefiting from deep learning and large-scale 3D object datasets [11], single-view 3D reconstruction methods [3,4,12,13] have achieved great success in recent years. A common method is to use variational autoencoder (VAE) theory [12] to generate a 3D object: the 2D information of the image is encoded in a latent space, a 3D object is constructed directly from the latent information, and a specific objective function constrains the distribution of the predicted point set to be close to the ground truth. However, methods with a VAE-like architecture face the following challenges: (i) In the encoder, convolution layers are commonly used to extract deep features from the original image. However, texture features make up a large part of the feature maps extracted by a CNN, especially in deep layers [13]. Thus, the feature vector in the latent space loses much geometric information, and it is difficult to infer the geometry of the object directly from texture features. Excessive texture features push the network to focus on the large surface parts of the image, which turns 3D point set prediction into a recognition or classification problem [14]: the core task of the network changes from reconstruction to retrieving the closest shape in the database [4].
(ii) In 3D reconstruction, the Chamfer distance (CD) [15] is the loss function used by most researchers; it measures the difference between two point-based distributions and, when minimized, forces each point in the predicted set toward its corresponding position [3]. However, similarity between two point-based distributions only guarantees that the overall shape of the prediction is close to the real object. The constraint ability of the CD function is therefore limited for geometric structures, especially complex ones. Additionally, the CD function induces a splashy shape that blurs the geometric structure, and it is sensitive to outliers [14]. Under the CD constraint, the predicted points may cluster at the center of the body of the label, which makes the body part of the prediction dense and the other parts sparse. When the distribution of the predicted point set is close to the label, the CD function directly ignores a small number of outliers during optimization. Therefore, the optimization easily falls into a local minimum, and a CD-based network learns sparse regions and geometric structure poorly.
(iii) Furthermore, the required number of points and degree of sparseness depend on the application. In practice, the size of the predicted point set is usually fixed for training, as neural network methods require. A common size that can basically describe the shape of a 3D object is 1024 points [14]; however, many real applications may require a larger number of points for a single object.
To address the above problems, we propose a novel neural network, called a cascade point set generation network (CPSGN), to generate a 3D object from a single-view image. As shown in Figure 1, the CPSGN is composed of two parts: a point generation network and a point deformation network. Rather than directly synthesizing a point cloud object, we adopt a coarse-to-fine strategy that generates a primary point-based 3D object with a basic geometric shape from a 2D image and then deforms the primary object into a large point set with a point-based CNN, as depicted in Figure 2. In the cascade network, the final 3D object not only benefits from the deep texture features of the 2D image but also gains detail and point resolution while each layer maintains the global features of the primary point cloud. The deformation network thus compensates for the sparse, rough points produced by the generation network. Meanwhile, dynamic graph convolutional neural networks (DGCNN) [15] introduced an edge convolution method that learns both local and global geometric features from point clouds; edge convolution generates edge features that describe the relationships between a point and its neighbors. We designed our deformation network based on this edge convolution module.
Furthermore, the CPSGN regularizes predictions from a global perspective and thus complements the previous CD loss for better object reconstruction from a single image [8]. Due to the characteristics of a point cloud, it is difficult to measure 3D shape differences on the point cloud directly, yet it is relatively easy on 2D projections [16], where geometric information is preserved well from different views. Thus, we propose a projection loss that uses multiple 2D projection images to measure geometric similarity. It regularizes the prediction globally by enforcing consistency with the ground truth across different 2D views while following the 3D semantics of the point cloud.
The main contributions of this work are as follows: (1) We propose a novel cascade architecture that models a 3D object from a 2D image. The network first models a 3D object from 2D texture information, and then deforms the point set into a dense point set.
(2) We propose a loss function, called projection loss, to regularize the generated object so that its geometric shape is closest to the real object.
(3) We evaluated the CPSGN on benchmark datasets and found that it achieves state-of-the-art performance.
II. RELATED WORK

A. SINGLE-IMAGE 3D RECONSTRUCTION
Based on object representations, single-image 3D reconstruction includes three streams. It is easier to transform 2D features extracted by a CNN into a 3D object using voxels, a regular grid representation. Choy et al. [17] proposed a 3D recurrent neural network (3D-R2N2) using long short-term memory to convert 2D features to 3D features and to generate a voxel-based object from single or multiple images. Follow-up work [18] used a backpropagated ray trace pooling operation to learn 2D silhouettes and an adversarial constraint to regularize voxel-based objects with unlabeled 3D shapes. Wu et al. [19] proposed a VAE architecture that combines two generative models to encode images in a latent space and reconstruct a voxel-based model. Tulsiani et al. [20] proposed a geometric consistency constraint to jointly learn a pose estimation network and a voxel-based object generation network with unsupervised learning.
For clearer visualization, the mesh is another direction in 3D object generation. Wang et al. [4] proposed a cascade ResNet network based on graph convolution to generate a mesh-based 3D object. Pontes et al. [21] used free-form deformation and a sparse linear combination to infer the parameters of a compact mesh representation. Kato et al. [22] proposed a gradient-based 3D mesh network using silhouette image supervision. For point-based representation, Fan et al. [3] first proposed a novel point set generation network (PSG) that encodes a 2D image and outputs an unordered point-based object. Based on PSG, Jiang et al. [23] proposed an adversarial loss to constrain the prediction geometry.

B. POINT CLOUD FEATURE EXTRACTION
Although CNNs achieve excellent success in image feature extraction, it is hard to maintain the same effect on point clouds because they are unordered representations. Qi et al. [24] proposed the pioneering PointNet, which directly extracts point-based representations using multilayer perceptrons and global pooling, to address this problem. However, PointNet could not adequately extract local features, so PointNet++ [25] was proposed to integrate global and local representations at an increased computation cost. Furthermore, many researchers have applied classical methods such as k-nearest neighbors (KNN) and KD-trees to help PointNet extract point cloud features comprehensively. KD-Net [26] builds a KD-tree over the input point cloud, followed by hierarchical feature extraction from the leaves to the root. SO-Net [27] uses the spatial distribution of the point cloud, extracted by a self-organizing map, ahead of the PointNet architecture. Wang et al. [15] proposed an edge convolution method to extract geometric relationships among points, which considers neighborhoods of points rather than acting on each point independently. In this work, we adopt edge convolution as the feature extractor in the deformation part, since geometric information is essential for upsampling and reordering a point set.

III. METHOD
The CPSGN adopts a cascade strategy to force the output point cloud toward a uniform distribution that follows the geometric information in the 2D image. More specifically, given a single-view image, the first-stage network (the point generation network) generates a fundamental point cloud with a rough shape and sparse points; a dense and uniformly distributed point cloud is then generated by the second-stage network (the point deformation network) based on the fundamental output and the 2D geometric features. Compared to previous single-view point cloud reconstruction methods, which generate the 3D object directly from the 2D image, our method reduces the concentration of points on edges and corners and the excessive reliance on deep texture features.
In the following sections, we first briefly introduce the edge convolution kernel used in the deformation part. Then, we explain the architectures of the point generation network and the point deformation network in detail. The loss functions are discussed at the end.

A. PRELIMINARY: EDGE CONVOLUTION
We first provide some background on the edge convolution used in the deformation part; a more detailed introduction can be found in [15]. Given a point cloud with n points P = {p_1, …, p_n} ⊆ ℝ³, the edge convolution layer (EdgeConv) produces an F-dimensional feature for each of the same n points. More specifically, EdgeConv builds a directed graph G = (V, ℰ) using k-nearest neighbors to represent the local structure of the point cloud, where V = {1, …, n} denotes the vertices and ℰ ⊆ V × V denotes the edges. Assume that the k points p_{j1}, …, p_{jk} are closest to point p_i ∈ P, which yields the directed edges (i, j1), …, (i, jk). The edge features are defined as e_ij = h_Θ(p_i, p_j), where h_Θ: ℝ³ × ℝ³ → ℝ^F is a nonlinear function with learnable parameters Θ. In detail, EdgeConv adopts the asymmetric edge function h_Θ(p_i, p_j) = h̄_Θ(p_i, p_j − p_i), where the patch center p_i captures the global geometric information and p_j − p_i captures the local neighborhood information. Thus, the edge feature is denoted as shown in Equation (1):

e_ij = h̄_Θ(p_i, p_j − p_i),   (1)

where Θ is a set of learnable parameters. EdgeConv implements h̄_Θ as a shared multilayer perceptron with ReLU and then aggregates symmetrically over each channel with max pooling, so the output feature of point p_i is p′_i = max_{j:(i,j)∈ℰ} e_ij.
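As a concrete illustration, the EdgeConv operation can be sketched with NumPy. The single linear-plus-ReLU edge function and the weight shapes below are simplifications of the shared MLP, not the exact implementation in [15]:

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of each point (self excluded)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]  # (n, k)

def edge_conv(points, weight, k=4):
    """One EdgeConv layer: h(p_i, p_j) = ReLU(W [p_i, p_j - p_i]),
    followed by max pooling over the k neighbors of each point."""
    idx = knn_indices(points, k)
    centers = np.repeat(points[:, None, :], k, axis=1)    # (n, k, 3) p_i
    offsets = points[idx] - centers                       # (n, k, 3) p_j - p_i
    edge_in = np.concatenate([centers, offsets], axis=-1) # (n, k, 6)
    edge_feat = np.maximum(edge_in @ weight.T, 0.0)       # (n, k, F)
    return edge_feat.max(axis=1)                          # (n, F)
```

The max pooling makes the output invariant to the ordering of the neighbors, which is the symmetric aggregation described above.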

B. POINT GENERATION NETWORK
The point generation network aims to generate a 3D object with a basic shape and sparse points from a single-view image. As shown in Fig. 2, we build this architecture on the core idea of U-Net [28], which transfers low- and high-resolution features from the encoder to the decoder. More specifically, the network captures more geometric and texture features from pixel-level information in the encoder and regresses the locations of points through two different branches in the decoder. For the encoder, the low- and high-resolution features of the image are extracted by a VGG-like encoder with the pooling layer between each convolution block removed. For the decoder, features at different resolutions are fused and sent into two branches, whose outputs are composed into a fundamental point set. Here, the FC and DC branches are expected to predict different parts of the object. The FC branch focuses on the corners and edges of the object with fewer points (240 pts) by stacking fully connected layers. Rather than using only the low-resolution deep features, we flatten and combine the features from different layers (conv2, conv3, conv4, conv5). At the end of the FC branch, we add them together and output 240*3 units, i.e., 240 points, through a final fully connected layer. The DC branch focuses on the body of the object with the majority of the points (768 pts) through multiple deconvolution blocks, with skip connections added to Deconv6, Deconv5, and Deconv4. Fig. 3 visualizes the outputs of the DC and FC branches: the DC branch generates the main part of the object, whereas the FC branch generates the remainder. Finally, the fundamental point cloud is composed from the outputs of the FC branch and the DC branch.
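The FC branch's flatten-combine-regress pattern can be sketched as follows. The feature-map shapes and the single linear regression layer are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def fc_branch(feature_maps, weights, bias):
    """Sketch of the FC branch: flatten and concatenate the feature maps
    from several conv layers (conv2..conv5), then regress 240 x 3 point
    coordinates with a final fully connected layer."""
    flat = np.concatenate([f.ravel() for f in feature_maps])
    coords = flat @ weights + bias        # (720,) = 240 points x 3 coords
    return coords.reshape(240, 3)
```

In the real network the regression would be several stacked fully connected layers with activations; the point is that multi-resolution features are fused before the coordinates are predicted.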

C. POINT DEFORMATION NETWORK
The goal of the point deformation network is to make the primary point cloud (sparse and rough) finer and denser while maintaining its geometric structure. More specifically, the primary point cloud is supplemented and reordered, based on its features and the projection loss (see Section D), into the final point cloud. The features of the primary point cloud provide rough geometric structure information to maintain the object's shape, while the projection loss constrains the locations of the reorganized points in the point deformation network.
As shown in Fig. 4, a point-based CNN is used to extract features of the primary point cloud (1024 points), which the deformation leverages to progressively restructure the rough point cloud into a dense point-based object (2048 points). More specifically, the encoder consists of five edge convolution blocks with multilayer perceptron layers (64, 64, 64, 128, 256) and k-nearest neighbors (k = 20). It captures both global and local latent information from the primary point cloud. The feature of the primary point cloud is concatenated from each layer of the encoder, which fuses information from different spatial scales and preserves the prime geometry of the primary point cloud. This feature is then fed into a two-level parallel network of two convolution layers, where the weights of each level are independent, each producing a 256-dimensional feature vector for the 1024 points. Finally, we concatenate them to obtain a feature map for 2048 points. Compared to latent information encoded from a 2D image, the latent information encoded from a point cloud relies on location information rather than texture information. By restructuring the point set, we try to put each point in a more reasonable position so that the output point cloud is denser and more evenly distributed.
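A minimal sketch of the two-level parallel expansion that doubles the point count, assuming simple linear-plus-ReLU branches in place of the convolution layers (the 1024-to-2048 shapes follow the text; the rest is illustrative):

```python
import numpy as np

def expand_point_features(feat, w1, w2):
    """Two parallel branches with independent weights each map the
    (1024, C) feature map to (1024, 256); concatenating along the point
    axis yields per-point features for 2048 points."""
    branch1 = np.maximum(feat @ w1, 0.0)   # (1024, 256)
    branch2 = np.maximum(feat @ w2, 0.0)   # (1024, 256)
    return np.concatenate([branch1, branch2], axis=0)  # (2048, 256)
```

Because the two branches have independent weights, each input point contributes two distinct feature vectors, which is what allows the decoder to place the duplicated points at different locations.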

D. LOSS FUNCTION
To make the CPSGN trainable, we define three objective functions to constrain the quality of the predicted object: the EMD [32] to constrain the distribution of the primary point cloud, the CD to regress each point's location toward its correct position in the ground truth, and a proposed projection loss to focus on the geometry and reduce outliers.

1) EARTH MOVER'S DISTANCES LOSS
EMD measures the distance between two point sets. For the primary point cloud S1, EMD calculates the minimum distance to the ground-truth point cloud G over all possible arrangements of correspondence. More formally, given two point sets S1 and G, EMD is defined as shown in Equation (2):

d_EMD(S1, G) = min_{φ: S1→G} Σ_{x∈S1} ‖x − φ(x)‖₂,   (2)

where φ: S1 → G is a bijection. In this instance, the EMD loss function is the standard EMD function, L_EMD = d_EMD(S1, G).
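For small point sets, the EMD above can be computed exactly with the Hungarian algorithm, as in this sketch; practical training code typically uses a differentiable approximation instead:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd(pred, gt):
    """EMD between equal-size point sets: the minimum total pairwise
    distance over all bijections pred -> gt, solved exactly here."""
    assert pred.shape == gt.shape
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # optimal bijection
    return cost[rows, cols].sum()
```

Note that EMD is invariant to the ordering of the points: permuting either set leaves the optimal matching cost unchanged.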

2) CHAMFER DISTANCES LOSS
CD is widely applied in 3D reconstruction research; it establishes pairwise relationships based on suboptimal nearest-neighbor matching. Given a point set, CD sums the squared distance between each point and its nearest neighbor in the ground truth. Since the CPSGN is a cascade network that produces point clouds of two different sizes, we set two CD losses covering the two parts, based on the standard distance function. For the point generation network, we apply the standard CD function as the loss. Formally, given two point sets S1 and G, the CD loss is defined as shown in Equation (3):

d_CD(S1, G) = Σ_{x∈S1} min_{y∈G} ‖x − y‖₂² + Σ_{y∈G} min_{x∈S1} ‖x − y‖₂².   (3)
For the point deformation network, we apply the CD with L1 regularization of the overall point cloud offsets to measure the difference between the final object and the ground truth, which forces the network to deform the template as little as possible. Formally, given two point sets S2 and G, this CD loss is defined as shown in Equation (4).
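Both CD variants can be sketched in a few lines of NumPy. Reading the "L1" variant as the non-squared nearest-neighbor distance is our assumption; the squared form matches the standard CD:

```python
import numpy as np

def chamfer(pred, gt, squared=True):
    """Chamfer distance: for each point, the distance to its nearest
    neighbor in the other set, summed over both directions.
    squared=True is the standard CD; squared=False is one reading of
    the L1-style variant used for the deformation network."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    if squared:
        d = d ** 2
    return d.min(axis=1).sum() + d.min(axis=0).sum()
```

Unlike EMD, CD does not require the two sets to have the same size and does not enforce a one-to-one matching, which is why many predicted points can collapse onto the same ground-truth point.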

3) PROJECTION LOSS
Both the CD loss and the EMD loss focus on optimizing the locations of points in a point set, yet they cannot guarantee that the predicted point set's geometry is similar to the ground truth. Following this line of thought, we propose a projection loss that measures geometric inconsistencies between the predicted point set S2 and the ground truth G. Specifically, we simultaneously project S2 and G onto 2D image planes from different camera angles so that the 2D projections preserve as much shape information as possible.
To project the 3D points, we apply perspective projection. In detail, we define a point (x_c, y_c, z_c) as the transformation of a point (x_w, y_w, z_w) from the 3D world coordinate system to the camera coordinate system using Equation (5):

(x_c, y_c, z_c)ᵀ = R · ((x_w, y_w, z_w)ᵀ − t),   (5)

where t is the camera position and R is the orientation of the camera. We then project the transformed point onto a two-dimensional plane using Equation (6):

u = f · x_c / z_c,  v = f · y_c / z_c,   (6)

where u and v are the projection coordinates and f is the focal length. As shown in Fig. 5, such a pair of projections on the feature map is similar to the masks in a segmentation problem. Therefore, we adopt a dice-confidence loss to measure the overlap between the predicted projection and the ground-truth projection. First, we denote the projections of the point clouds S and G as P_S and P_G, respectively. The dice loss is defined as shown in Equation (7):

L_dice(P_S, P_G) = 1 − 2|P_S ∩ P_G| / (|P_S| + |P_G|),   (7)

where |P_S ∩ P_G| is the intersection of the projection images P_S and P_G. We found that using only the dice confidence distance is unstable during training. Thus, an L2 distance term is added, and the projection loss function is defined as shown in Equation (8).
L_proj(S, G) = Σ_v (‖P_S^v − P_G^v‖₂² + β · L_dice(P_S^v, P_G^v)),   (8)

where v indexes the different camera angles for the projection, S = S1 or S2 is the corresponding point cloud, and β is a factor to control the projection terms for the sparse and dense point clouds.
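The projection-and-dice pipeline of Equations (5)-(7) can be sketched as follows. The image size, rasterization, and centering are illustrative assumptions; a real implementation would use the dataset's camera intrinsics:

```python
import numpy as np

def project(points, R, t, f=100.0, size=64):
    """Perspective-project world-space points into a binary mask:
    camera transform p_c = R (p_w - t), then u = f*x_c/z_c, v = f*y_c/z_c.
    Points are assumed to lie in front of the camera (z_c > 0)."""
    cam = (points - t) @ R.T
    u = f * cam[:, 0] / cam[:, 2]
    v = f * cam[:, 1] / cam[:, 2]
    img = np.zeros((size, size), dtype=bool)
    iu = np.clip(np.round(u + size / 2).astype(int), 0, size - 1)
    iv = np.clip(np.round(v + size / 2).astype(int), 0, size - 1)
    img[iv, iu] = True
    return img

def dice_loss(p, q):
    """1 - Dice coefficient between two binary projection masks."""
    inter = np.logical_and(p, q).sum()
    return 1.0 - 2.0 * inter / (p.sum() + q.sum())
```

A hard binary rasterization like this is not differentiable; the training code would use a soft, differentiable projection in its place.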

4) FINAL LOSS
The final loss function consists of the losses of the point generation network and the point deformation network, as shown in Equation (9):

L = L_gen + L_def.   (9)

For the generation network, we adopt the EMD loss as the main loss and the standard CD loss as an auxiliary loss, as shown in Equation (10), where λ1 is a factor that balances the two losses:

L_gen = L_EMD(S1, G) + λ1 · d_CD(S1, G).   (10)

For the deformation network, we adopt the CD with L1 regularization as the main loss and the projection loss as an auxiliary loss, as shown in Equation (11), where λ2 is a factor that balances the two losses and the projections of the point cloud S2 and the ground truth G are compared in the projection term:

L_def = d_CD-L1(S2, G) + λ2 · L_proj(S2, G).   (11)

IV. EXPERIMENT
In this section, we first introduce the dataset, implementation details, and evaluation metrics. Second, we compare different representation methods on the evaluation metrics. Third, we focus on comparing the reconstructed details of the point-based objects.

A. DATASET
To train and evaluate our networks, we use the dataset provided by Choy et al. [17], which includes rendered images of 50,000 models belonging to 13 object categories from ShapeNet [11]. The ShapeNet dataset collects 3D CAD models organized by the WordNet hierarchy, rendered with various camera viewpoints, intrinsics, and extrinsics. To compare each method fairly, the training/testing split was kept the same as in Choy et al. [17], with each category split into 80% for training and 20% for evaluation.

B. IMPLEMENTATION DETAILS
Our network is implemented with TensorFlow on an Nvidia Titan 1080. First, we initialize the camera intrinsics for the projection loss, where the camera position is set to (2, 1, 2) on the x, y, and z axes, respectively, and the focal length is set to 100. Second, we conduct the experiment with seven different angles for the projection loss. We optimize the network using ADAM with a batch size of 12 for 150 total epochs. The initial learning rate is 0.0001 and is then decreased after 15k iterations using natural exponential decay. We use two different weight initialization methods for the two networks: the point generation network adopts He initialization [29], since ReLU activations follow each convolution layer, and the point deformation network uses Xavier initialization [30] to reduce the risk of poor local minima.

C. EVALUATION METRICS
Following previous research [3], we adopt EMD as a standard evaluation metric; it calculates the distance between a pair of point sets, and a smaller value indicates greater similarity. Since a standard 3D reconstruction metric may not thoroughly reflect geometric quality, we also evaluate our network with the F-score [31]. CD and EMD capture occupancy or point-wise distance, focusing on the location of each generated point, and it has been reported that the CD value does not reflect the quality of model reconstruction well [6]; thus, we do not adopt the CD score to evaluate performance in this work. By contrast, the F-score computes precision and recall by checking the percentage of points in the prediction or ground truth that can find a nearest neighbor in the other set within a certain threshold τ: precision is the fraction of predicted points within τ of some ground-truth point, and recall is the fraction of ground-truth points within τ of some predicted point.
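The F-score described above can be sketched directly; the threshold value in the usage below is illustrative:

```python
import numpy as np

def f_score(pred, gt, tau=0.01):
    """F-score at distance threshold tau: precision is the fraction of
    predicted points with a ground-truth neighbor within tau, and
    recall is the fraction of ground-truth points with a predicted
    neighbor within tau."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = float((d.min(axis=1) < tau).mean())
    recall = float((d.min(axis=0) < tau).mean())
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

Because both directions are checked, the F-score penalizes missing parts (low recall) as well as spurious clusters of points (low precision), which point-wise distances alone can hide.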

D. COMPARISON WITH THE SOTA
For a comprehensive comparison, we evaluated our model from two aspects. First, we compared it with the three representations on different metrics. It is difficult to measure the differences between objects generated with different representations, so we selected a well-known study of each representation as a comparative experiment: the voxel-based objects generated by Choy et al. [17], the point-based objects generated by Fan et al. [3], and the mesh-based objects generated by N3MR [22]. All the models were trained with the same data for the same amount of time. Note that the labels corresponding to the dense model have 2048 points, and the other labels have 1024 points. Table I shows the EMD of the different methods for the 13 object categories. Our method achieves the best mean score and better scores than the other methods for most objects. When generating a dense point-based object, one step is upsampling the sparse model, so the EMD value fluctuates to a certain extent.
Compared with the previous metric, the F-score quantifies the shape information of the model better. Table II summarizes the F-scores of the various methods for the 13 objects. The quantitative results in Table II show that our method is superior for all 13 objects, achieving significantly better results than the others: every object has higher model completion with fewer missing parts. Fig. 6 shows the visual results compared with PSG. PSG applies only the Chamfer loss, which acts like a regression loss and gives too much freedom to the point cloud. In contrast, our model does not suffer from these issues thanks to the cascade architecture, the integration of perceptual features, and the carefully defined losses used during training. Our result is not restricted by resolution under a limited memory budget and contains both a smooth continuous surface and local details.
We focused on a visual comparison with PSG because it uses the same representation. We generated both a sparse object (1024 points) and a dense object (2048 points) for comparison. In this part, every input image is a standard view with high geometric similarity to the training data. The PSG results seem comparable to ours when generated from these input images; however, the 3D objects generated by PSG lose some details around edges. For example, PSG failed to correctly generate the engines and the tail of the airplane.
On the other hand, we also input 2D images with special shapes. As shown in Figure 7 and Figure 8, for input images with special geometry, PSG generates objects that have only a rough outline. By contrast, our model generates the details of objects, such as the legs, the connections, and the hollows of the table, more accurately than PSG.

E. ABLATION STUDY
For a comprehensive evaluation, we performed ablation studies to verify the effectiveness of the proposed projection loss. We applied the same hyperparameters mentioned above in all experiments. To clearly demonstrate the effect of the projection loss, we evaluated both the EMD and the F-score of the CPSGN (2048 pts) with and without the projection loss. Table III shows the EMD and F-score with the different loss functions for the 13 objects. After adding the projection loss, the mean scores of both the EMD and the F-score improve, and in most categories the network with the projection loss performs better on both metrics. This confirms that the projection loss has a positive influence on the point cloud generation task.

V. DISCUSSION
In this work, we built a cascade network to generate a more flexible and dense point cloud from a single-view image. Compared with previous research [5], our network generates a more uniform and dense point cloud. However, some details are still lost, especially at the edges of the point cloud; for example, the legs of a rarely seen chair are hard to reconstruct from the image. One possible reason is the limitation of single-view reconstruction, which cannot provide some of the detailed geometric information. Another possible reason comes from the dataset, which covers only objects with conventional structure. Thus, there remains ample room for further study of 3D reconstruction.

VI. CONCLUSION
We have presented a cascade network for single-view 3D object reconstruction. Following the coarse-to-fine strategy, the CPSGN includes a point generation part and a point deformation part.