Research on 3D Reconstruction of Furniture Based on Differentiable Renderer

Due to self-occlusion, traditional 3D reconstruction algorithms have difficulty recovering the 3D structure of an object from a single image. With the rapid development of convolutional neural networks, 3D reconstruction based on deep learning has attracted wide attention from researchers. However, obtaining 3D supervision data for objects is expensive. To solve these problems, we combine a convolutional neural network with a differentiable renderer and propose Mesh_CA, which enables reconstruction from a single image without 3D supervision data. Specifically, an ellipsoid is first initialized for each input view; the features extracted by the convolutional neural network then guide the deformation of the ellipsoid to produce the generated 3D object. After that, the generated object is passed into the differentiable renderer, which outputs the corresponding contour information. Finally, the error between the predicted contour and the ground-truth contour is computed, and the final 3D object is obtained after training and testing. Trained and tested on five categories of furniture objects from the large-scale public dataset ShapeNet, the proposed Mesh_CA surpasses current classical methods.


II. RELATED WORKS
There are many representations of 3D objects. Section A, "3D Reconstruction Based on Deep Learning", focuses on the three common representations, namely voxels, point clouds, and meshes [16], [17], and then surveys mesh generation methods, on which our work builds. In addition, related applications of attention mechanisms are described in Section B, "Attentional Mechanisms".

A. 3D RECONSTRUCTION BASED ON DEEP LEARNING
Voxel reconstruction. Choy et al. [18] first proposed a voxel-based 3D reconstruction method, the 3D recurrent reconstruction neural network (3D-R2N2), which is now widely used as a baseline for comparison. 3D-R2N2 uses RNNs to fuse the feature maps of the input images, solving the inconsistency between single-view and multi-view reconstruction methods and enabling refinement of the generated 3D shapes as more views are input. However, it also has some drawbacks. First, for a given sequence of images, the reconstruction results of RNN-based methods are not consistent when the input order differs. Second, due to the long-term memory loss of RNNs, they cannot fully utilize the input images, which degrades the reconstruction results. Third, 3D-R2N2 has high computational complexity and, limited by computer hardware, can only generate low-resolution voxels, thus losing much detail.

To improve the quality of the generated shapes while avoiding multi-stage training, Liu et al. [19] proposed the Variational Shape Learner (VSL). VSL uses skip connections to combine local latent variables into a global latent variable; finally, all the local variables are concatenated with the global variable to represent the encoded shape. In their work, the reconstruction results of VSL trained jointly on all categories are significantly worse than those after training on individual categories, which indicates that VSL does not learn the latent representation of objects well. Xie et al. [20] proposed Pix2Vox++; their contribution is a context-aware fusion network that refines the details of the generated voxels based on learned fusion scores. Pix2Vox++ relies on object-centered coordinates to align multi-view features. However, object-centered coordinates encourage the network to memorize observed meshes, which may lead to poor generalization. To use shape priors to learn a generic shape representation for unseen categories, Zhang et al. [21] proposed the generalized reconstruction framework GenRe (Generalizable Reconstruction). Although GenRe hallucinates the unseen parts of shape primitives, it fails to exploit global shape symmetry to produce correct predictions. This is not surprising, given that their network design does not explicitly model such regularity; a possible future direction is to incorporate priors that facilitate learning high-level concepts such as symmetry. To solve the problem of topology perception in 3D shape reconstruction, Chen …

… an architecture that also uses 2D projections for supervision; however, this architecture is proposed for specific classes and learns from a set of unlabeled images. In contrast to Lin, Mandikal and Radhakrishnan [29] proposed a hierarchical approach that reconstructs dense point clouds using deep pyramidal networks. Because outlier points in the sparse point cloud get aggregated in the dense reconstruction, certain predictions exhibit artifacts consisting of small clusters of points around some regions. Lu et al. [30] recently proposed an attention-based approach to generate dense point clouds from a single input view; this framework can be regarded as an improvement on the work of Mandikal. To fit more categories of objects, they need to improve the attention mechanism in the future.

The fact that a point cloud captures only the surface features of a shape makes it an efficient representation for 3D shapes. However, the unstructured nature, lack of connectivity, and irregularity of point clouds mean that they cannot be easily processed by deep learning models. Besides, a generated point cloud needs post-processing before it can be applied to virtual reality, robotics, and other fields. The mesh representation, by contrast, can be applied to various fields without post-processing, which draws researchers' attention to mesh-based 3D reconstruction.

Mesh reconstruction. The model proposed by Kar et al. [31] is one of the pioneering works on 3D reconstruction from a single image in a mesh representation. The model is built for a specific class and learns, from several annotated images, a deformable shape model that captures shape variation within the class. Its major failure modes include failing to capture the correct scale and pose of the object, and thus fitting the silhouette badly in some cases; its subtype prediction also fails on some instances (e.g., CRT vs. flat-screen "tvmonitors"), leading to incorrect reconstructions. Groueix et al. [32] proposed AtlasNet, whose goal is to learn to reconstruct 3D meshes directly from a single image or point cloud; the main contribution is a decoder design that takes latent representations as input and outputs parametric surface elements. But AtlasNet generalizes poorly: for example, when not trained on chairs, it struggles to define clear thin structures such as legs or armrests, especially when they are associated with a change in the topological genus of the surface. Wang et al. [33] first used Graph Convolutional Networks (GCNs) for mesh-based 3D reconstruction and proposed Pix2Mesh. GEOMetrics, proposed by Smith et al. [34], is similar to Pix2Mesh and improves on the work of Wang in terms of loss functions, mesh adaptation, and vertex information updates. Both methods are restricted to generating meshes with the same topology as the initial mesh, which limits their applicability; future research directions include addressing the restrictive constant topology prescribed by the initial mesh through reconstruction and generation methods.

In recent years, many researchers have turned to combining differentiable renderers with neural networks [35], [36], [37]. In the unsupervised shape reconstruction work of …

B. ATTENTIONAL MECHANISMS
… and alignment simultaneously. Instead of encoding the entire input sentence as a fixed-length context vector, their work encodes the input sentence as a sequence of vectors and adaptively selects a subset of these vectors at decoding time. To address the problem that visual question answering models neglect the modeling of object relationships in an image, Cheng et al. [42] proposed the relational reasoning model Graph Attention Network Relation Reasoning (GAT2R) by introducing a graph attention mechanism. The GAT2R model includes question feature extraction, scene graph generation, a scene graph update part, multimodal fusion, and answer prediction. Among these, the scene graph update part uses a question-guided graph attention network to dynamically update the graph node representations, thus providing more accurate input for the subsequent answer prediction.

Understanding the position and size of the same structure of an object in different views is important for the 3D reconstruction of furniture objects. Therefore, the SK module and the CA module are introduced in this paper to make fuller use of the input image features and thus improve the reconstruction performance of the model.

…

Firstly, an initialized ellipsoid is predefined for each input furniture image; this initial mesh is an isotropic sphere with 642 vertices. Then, the target 3D object is generated by deforming the ellipsoid. The mesh generator consists of two parts, an encoder and a decoder; its structure is shown in Figure 2.
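To make the deformation step concrete, the following PyTorch sketch predicts per-vertex offsets for a 642-vertex template from the encoded image feature. The two-layer fully connected decoder here is an illustrative assumption, not the architecture defined in Figure 2.

    import torch
    import torch.nn as nn

    class MeshGenerator(nn.Module):
        # Deform a 642-vertex template sphere into the target shape by
        # predicting per-vertex offsets from the image feature vector.
        def __init__(self, encoder, num_vertices=642):
            super().__init__()
            self.encoder = encoder          # image -> (B, 512, 1, 1) feature
            self.num_vertices = num_vertices
            self.decoder = nn.Sequential(   # assumed decoder, for illustration
                nn.Linear(512, 1024),
                nn.LeakyReLU(0.01),
                nn.Linear(1024, num_vertices * 3),
            )

        def forward(self, images, template_vertices):
            feat = self.encoder(images).flatten(1)              # (B, 512)
            offsets = self.decoder(feat).view(-1, self.num_vertices, 3)
            return template_vertices.unsqueeze(0) + offsets     # (B, 642, 3)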

Encoder. The encoder used in this paper is the same as ResNet18 except that the last fully connected layer is removed. The input images are processed by the encoder, and the output feature vector is 1 × 1 × 512.
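A minimal sketch of this encoder, assuming torchvision's stock ResNet18 with the final fully connected layer stripped:

    import torch
    import torchvision

    # ResNet18 with the final fully connected layer removed: a (B, 3, H, W)
    # image batch maps to a (B, 512, 1, 1) feature volume.
    resnet = torchvision.models.resnet18(weights=None)
    encoder = torch.nn.Sequential(*list(resnet.children())[:-1])

    x = torch.randn(1, 3, 224, 224)
    print(encoder(x).shape)  # torch.Size([1, 512, 1, 1])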
…

The influence of the triangular face $f_j$ on the image plane is simulated using the probability map $D_j$, defined as in equation (2):

$$D_j^i = \operatorname{sigmoid}\!\left(\delta_j^i \cdot \frac{d^2(i,j)}{\sigma}\right) \tag{2}$$

where $\sigma$ is a positive scalar that controls the sharpness of the probability distribution, $i$ indexes pixel $P_i$ and $j$ indexes face $f_j$, and $d(i,j)$ is the closest distance from triangular face $f_j$ to pixel $P_i$. $\delta_j^i$ is a sign indicator that maps pixels inside and outside $f_j$ to (0.5, 1) and (0, 0.5), respectively; the value of $\delta_j^i$ is 1 when $P_i$ lies inside $f_j$.
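A minimal sketch of equation (2), assuming the per-pixel distances and inside/outside masks have already been computed by the rasterizer (the default for sigma is an assumption):

    import torch

    def fragment_probability(d, inside, sigma=1e-4):
        # Equation (2): D_j = sigmoid(delta * d^2 / sigma).
        # d      -- closest distance from each pixel P_i to face f_j
        # inside -- True where P_i lies inside f_j (delta = +1, else -1)
        # sigma  -- sharpness of the probability distribution
        delta = torch.where(inside, torch.ones_like(d), -torch.ones_like(d))
        return torch.sigmoid(delta * d.square() / sigma)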

407
The Euclidean distance is used to compute $d(i,j)$. Let $t_j^i \in \mathbb{R}^3$ be the barycentric coordinates of the point on the edge of $f_j$ closest to $P_i$. Then the signed Euclidean distance $D_E(i,j)$ from $P_i$ to the edge of $f_j$ is calculated as follows:

$$D_E(i,j) = \delta_j^i \cdot \lVert P_i - U_j\, t_j^i \rVert_2 \tag{3}$$

where $\delta_j^i$ is the sign indicator defined above, with value 1 when $P_i$ lies within the triangular face $f_j$, and $U_j$ denotes the screen-space vertex positions of $f_j$.
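Under the same assumptions, the signed distance of equation (3) might be computed as follows, given the closest edge point $U_j t_j^i$ in screen space:

    import torch

    def signed_edge_distance(P, closest, inside):
        # Equation (3): D_E(i, j) = delta * ||P_i - U_j t||.
        # closest -- screen-space point on the edge of f_j nearest to P_i
        d = torch.linalg.norm(P - closest, dim=-1)
        return torch.where(inside, d, -d)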

Therefore, $\partial D_E(i,j)/\partial U_j$ can be obtained by differentiating equation (3) with the chain rule:

$$\frac{\partial D_E(i,j)}{\partial U_j} = -\,\delta_j^i \cdot \frac{(P_i - U_j t_j^i)\,(t_j^i)^{\top}}{\lVert P_i - U_j t_j^i \rVert_2} \tag{4}$$

The aggregation function $A_s$ is used to merge the color maps $C_j$ into the rendered output $I$, based on the probability maps $D_j$ and the depths $z_j$; it bridges the relationship between the 2D plane and the 3D space. The aggregation function is calculated as in equation (5):

$$I^i = \sum_j w_j^i\, C_j^i + w_b^i\, C_b \tag{5}$$

where $w_j^i$ is calculated as in equation (6):

$$w_j^i = \frac{D_j^i \exp\!\left(z_j^i/\gamma\right)}{\sum_k D_k^i \exp\!\left(z_k^i/\gamma\right) + \exp\!\left(\epsilon/\gamma\right)} \tag{6}$$

In the above two equations, $C_b$ is the background color, …

The contour of an object is independent of its color and depth map, so the aggregation function $A_O$ of the contour is further explored based on equation (5) and is calculated as below:

$$A_O:\; S^i = 1 - \prod_j \left(1 - D_j^i\right)$$

…

To quantitatively evaluate the reconstruction performance of the network, the real and generated meshes are voxelized with a voxel grid of size $32^3$, and the Intersection over Union (IoU) between the voxels is calculated as in equation (11):

$$\mathrm{IoU} = \frac{\sum_{i,j,k} \mathbb{I}\!\left(p_{(i,j,k)} > t\right)\, \mathbb{I}\!\left(y_{(i,j,k)}\right)}{\sum_{i,j,k} \mathbb{I}\!\left[\mathbb{I}\!\left(p_{(i,j,k)} > t\right) + \mathbb{I}\!\left(y_{(i,j,k)}\right)\right]} \tag{11}$$

where $(i,j,k)$ denotes the location of a voxel, $p_{(i,j,k)}$ denotes the probability of a voxel being present and obeys the Bernoulli distribution $[1 - p_{(i,j,k)},\, p_{(i,j,k)}]$, $y_{(i,j,k)}$ denotes the corresponding ground-truth output with values in $\{0, 1\}$, $\mathbb{I}(\cdot)$ is the indicator function, and $t$ is the voxelization threshold. If $p_{(i,j,k)}$ is greater than $t$, a voxel exists at that position; otherwise, the position is an empty voxel. A higher IoU value indicates a better reconstruction.
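The two aggregation functions and the IoU metric can be sketched as follows. The tensor layouts, the temperature gamma, and the threshold t are assumptions for illustration, and the naive softmax here would need a numerically stable form in practice:

    import torch

    def aggregate_color(D, C, z, C_b, gamma=0.1, eps=1e-3):
        # Equations (5)-(6): blend per-face color maps C (F, H, W, 3) using
        # weights that grow with the probability maps D (F, H, W) and the
        # normalized depths z (F, H, W); C_b (3,) is the background color.
        num = D * torch.exp(z / gamma)
        den = num.sum(dim=0) + torch.exp(torch.tensor(eps / gamma))
        w = num / den                                      # face weights
        w_b = torch.exp(torch.tensor(eps / gamma)) / den   # background weight
        return (w.unsqueeze(-1) * C).sum(dim=0) + w_b.unsqueeze(-1) * C_b

    def aggregate_silhouette(D):
        # Contour aggregation A_O: the union of the per-face probability maps.
        return 1.0 - torch.prod(1.0 - D, dim=0)

    def voxel_iou(p, y, t=0.4):
        # Equation (11): IoU between predicted occupancies p and ground
        # truth y on a 32^3 grid, thresholding p at t.
        pred, gt = p > t, y > 0.5
        inter = (pred & gt).sum().float()
        union = (pred | gt).sum().float()
        return (inter / union).item()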

The activation function used in this paper is Leaky-ReLU. Compared with the sigmoid and tanh activation functions, it is cheap to compute and converges faster because it involves no exponential operations. In addition, compared with ReLU, Leaky-ReLU does not discard the negative part of the input.
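A quick comparison illustrating the point about negative inputs:

    import torch

    x = torch.tensor([-2.0, -0.5, 0.0, 1.0])
    print(torch.nn.ReLU()(x))                          # negatives zeroed out
    print(torch.nn.LeakyReLU(negative_slope=0.01)(x))  # negatives scaled, kept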

For a fair comparison, the number of iterations is 250,000, consistent with the work of Liu et al. The parameters of the hybrid loss function are set to α = 5 × 10⁻⁴ and β = 5 × 10⁻⁴ (the same as Kato et al.). The optimizer is Adam with the officially recommended values (learning rate α = 0.0001, β₁ = 0.9, β₂ = 0.999). …

… and the other with contour and shading supervision. In this paper, we compare with their first variant, which is trained with contour information as supervision.
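Returning to the training setup above, a minimal sketch of the configuration; the stand-in model and placeholder loss are assumptions, and only the hyperparameters come from the text:

    import torch

    model = torch.nn.Linear(512, 642 * 3)  # stand-in for the mesh generator
    alpha, beta = 5e-4, 5e-4               # hybrid loss weights
    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=1e-4, betas=(0.9, 0.999))

    for step in range(250_000):
        optimizer.zero_grad()
        feat = torch.randn(8, 512)          # placeholder image features
        offsets = model(feat)
        # total loss = contour loss + alpha * Laplacian + beta * smoothing
        loss = offsets.square().mean()      # placeholder loss
        loss.backward()
        optimizer.step()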

As can be seen from Table 2, compared with SoftRas and NMR, the overall mIoU of the model proposed in this paper increases by about 6.1% and 6.6%, respectively. Besides cabinet reconstruction, the performance of the proposed model also improves greatly when reconstructing the other categories.

Some furniture images in the ShapeNet test set are randomly selected for reconstruction and compared with the SoftRas model; the results are shown in Figure 5. The first row shows the input single images, the second row the reconstruction results of SoftRas, and the third row the reconstruction results of our Mesh_CA.

As shown in Figure 5, the model proposed in this paper predicts the 3D structure of objects well (e.g., the three legs of the table in the third column), and the visual quality of the reconstructions is also improved by the Laplacian prediction loss and the smoothing loss. The reconstructions of the chair and the cabinet in the last two columns are also smoother. However, when reconstructing the bench (second column), the hollow structure of the bench's backrest is not recovered.

… 3D latent features of the object, rather than simply fitting a function to the training set.

This paper presents a 3D reconstruction method for furniture objects based on a differentiable renderer, named Mesh_CA. Using only the contour information of the object as supervision, Mesh_CA realizes 3D reconstruction of furniture objects. Mesh_CA is trained and tested on the five furniture categories in ShapeNet and compared with the current mainstream methods NMR and SoftRas; the results show that its mIoU is about 6.6% and 6.1% higher, respectively. Finally, generalization experiments show that Mesh_CA has good generalization performance. However, the model can only deform the initial template mesh into a fixed topology, which means it can only reconstruct specific classes of objects. In addition, the generalization performance of the model needs further improvement before it can be applied to real scenes. Therefore, future research directions include improving the generalization performance of the model and generating 3D objects with arbitrary topology.