Deep-Learning-Based 3-D Surface Reconstruction—A Survey

In the last decade, deep learning (DL) has significantly impacted industry and science. Initially largely motivated by computer vision tasks in 2-D imagery, the focus has shifted toward 3-D data analysis. In particular, 3-D surface reconstruction, i.e., reconstructing a 3-D shape from sparse input, is of great interest to a large variety of application fields. DL-based approaches show promising quantitative and qualitative surface reconstruction performance compared to traditional computer vision and geometric algorithms. This survey provides a comprehensive overview of these DL-based methods for 3-D surface reconstruction. To this end, we will first discuss input data modalities, such as volumetric data, point clouds, and RGB, single-view, multiview, and depth images, along with corresponding acquisition technologies and common benchmark datasets. For practical purposes, we also discuss evaluation metrics enabling us to judge the reconstructive performance of different methods. The main part of the document will introduce a methodological taxonomy ranging from point- and mesh-based techniques to volumetric and implicit neural approaches. Recent research trends, both methodological and for applications, are highlighted, pointing toward future developments.


I. INTRODUCTION
In the last decade, advances in artificial intelligence, in particular in deep learning (DL) [1], [2], [3], have been adopted by a multitude of fields and have, thus, led to major breakthroughs in science and industry alike. One of the major driving forces behind these developments is the field of computer vision and its desire to "teach" machines how to recognize patterns within image and video data. Initially, a strong emphasis was placed on the interpretation of 2-D information; however, recent advances in cost-effective scanner-based data acquisition and the establishment of large-scale shape repositories have brought the analysis of 3-D data into focus. Still, complexity, variety, and irregularities in 3-D shape representations pose significant methodological challenges.
The reconstruction of 3-D surfaces of objects from different types of input data formats, such as point clouds, depth maps, single-view, or multiview images, is fundamental to a number of application fields, such as computer vision, robotics, CAD, medicine, city planning, disaster prevention, and archeology. One special use case of 3-D reconstruction is human shape reconstruction and pose estimation from images or videos, which is addressed in other works [4], [5]. Despite a long research history for 3-D surface reconstruction, the precise representation of 3-D geometrical objects remains an unsolved problem, usually requiring the reconstructed 3-D surfaces to be: 1) highly resolved and smooth; 2) water-tight, i.e., "without gaps"; 3) in accordance with possible ground truth; 4) robust against noisy or incomplete input; and 5) simultaneously densely and compressibly represented.
Several reconstruction-related surveys [17], [18] present early approaches, with [17] providing an overview of classical, non-DL-based surface reconstruction methods from point clouds with respect to priors and [18] reviewing RGB-D scene reconstruction approaches. Another DL-based surface reconstruction survey [19] focuses on image-based methods. This article, however, covers broader data modalities and thoroughly reviews recent trends in 3-D surface reconstruction, including implicit neural representations and neural radiance fields (NeRFs).
In this survey, we present a comprehensive overview of these state-of-the-art DL-based approaches to 3-D surface reconstruction. Our main goal is to provide method researchers with a guide to current work and applied researchers with a toolbox for their domain challenges. Toward this end, we first provide a broad introduction to input data formats (see Section II), acquisition technologies (see Section III), and widely used benchmarking datasets (see Section IV). Section V covers evaluation metrics that enable a quantitative judgment of the reconstructive performance of a method, regardless of whether it is classical or learning-based. The main part of this survey (see Section VI) highlights DL methods to reconstruct 3-D surfaces using volumetric, point- and mesh-based, and implicit neural representations. We assume that the reader has a general grasp of neural networks and DL concepts to thoroughly follow the content. Discussion, current trends, and challenges are highlighted in Section VII. Finally, Section VIII summarizes and concludes this survey.

II. INPUT DATA
Various types of data representations can be used as input for the 3-D surface reconstruction task. Conventional representations of 3-D inputs can be divided into Euclidean and non-Euclidean data. Examples of non-Euclidean data representations are point clouds or meshes, while Euclidean data representations can be volumetric, RGB-D data, or multiview images.
Point clouds are currently the most common format of raw 3-D sensor data. With the improvement of scanning devices, leading to enhanced capabilities for capturing the surrounding 3-D environment in various applications and representing it with points, point clouds are becoming increasingly important and available. Thus, processing this type of representation using neural networks and DL techniques has attracted considerable attention. From a mathematical point of view, point clouds comprise an irregular data structure in the form of an unordered set of points. Each point on a 3-D surface of an object can basically be defined by a vector of its (x, y, z) coordinates, which can be inferred by various 3-D data acquisition techniques. Hence, the size of the representation matrix of a 3-D object is initially N × 3 for N points. The matrix may also contain additional properties, including color, transparency, surface normals, and other scanner information. However, pure point clouds do not include the interconnections between vertices. Since a point cloud is a set, its elements are orderless, a characteristic that causes many challenges for surface reconstruction methods. Point clouds can be easily converted to or extracted from other data representations, such as voxels, depth maps, or meshes. For instance, they can be obtained from depth images by projecting the depth value of each pixel into 3-D space.
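As a concrete illustration of the depth-to-point-cloud conversion mentioned above, the following is a minimal sketch of backprojecting a depth image into an N × 3 point cloud under a pinhole camera model; the intrinsic parameters (fx, fy, cx, cy) and the synthetic depth values are placeholders rather than values from any particular sensor or dataset.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Backproject a depth image (H x W, metric depth) into an N x 3 point cloud.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); pixels with zero or invalid depth are discarded.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep valid depths only

# Toy example with placeholder intrinsics and random depths.
depth = np.random.uniform(0.5, 3.0, size=(480, 640))
cloud = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(cloud.shape)  # (N, 3)
```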
Meshes are another highly popular type of representation for 3-D objects, providing detailed and connected geometries in an efficient way. They are irregular data embedded in continuous space. Their basic components are vertices, edges, i.e., pairs of vertices, and (triangular) faces, i.e., n-tuples of edges, forming an undirected graph.
In volumetric representations, the basic element is a voxel. A voxel in a 3-D grid is a cuboid equivalent to a pixel in 2-D space. The 3-D grid, regardless of being sparse or dense, can be fed to a neural network as the input.
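Conversely, a point cloud can easily be rasterized into such a grid. The snippet below is a small illustrative sketch, not tied to any specific method in this survey, that voxelizes an (N, 3) point cloud into a binary occupancy grid of a chosen resolution.

```python
import numpy as np

def voxelize(points, resolution=32):
    """Convert an (N, 3) point cloud into a binary occupancy grid.

    The cloud is normalized into the unit cube, and each point marks
    the voxel it falls into as occupied.
    """
    mins = points.min(axis=0)
    extent = (points.max(axis=0) - mins).max() + 1e-9
    idx = ((points - mins) / extent * (resolution - 1)).astype(int)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

occupancy = voxelize(np.random.rand(2048, 3), resolution=32)
print(occupancy.sum(), "occupied voxels out of", occupancy.size)
```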
An RGB-D image is a combination of an RGB image and a depth image. It not only has RGB information for each pixel but also includes depth information.
Multiview images are a collection of (single-view) images taken from different angles of an object. By putting these images together, 3-D information can be partly retrieved.
On the other hand, 2-D data, such as single-view RGB images, can also serve as network input for surface reconstruction, either on their own, in which case the method is called single-view reconstruction (SVR) [69], [70], [71], or in conjunction with one of the 3-D inputs mentioned earlier.

III. DATA ACQUISITION
As explained in Section II, point clouds are the most common format of raw 3-D sensor data. 3-D point cloud data are acquired through sensing technologies that measure distance [i.e., 3-D laser scanning, also known as light detection and ranging (LiDAR)] or generated with stereo- and multiview image-derived systems that can be based on red, green, blue-depth (RGB-D) cameras, stereo cameras, and multiple synthetic aperture radar (SAR) image pairs [16], [72]. High-quality 3-D point clouds can capture the 3-D surface geometries of target objects (e.g., physical features that occupy the Earth's surface and ocean bottom) with a spatial accuracy up to the millimeter level and a point density of a few thousand points per square meter (pts/m²).

A. 3-D Laser Scanning (LiDAR)
LiDAR is a remote sensing (RS) active technology that uses light in the form of a pulsed laser to measure the distance between the sensor and the object under study [73]. By measuring the time that emitted pulses take to travel to a target, LiDAR derives 3-D representations of objects. LiDAR can also operate at different wavelengths (i.e., multispectral LiDAR [74], [75]) to discriminate the different spectral reflectances of land-cover classes [76], [77].
Depending on the platform on which the LiDAR sensor is mounted, a 3-D laser scanner is classified as a terrestrial laser scanner (TLS or ground LiDAR), airborne laser scanner (ALS), mobile laser scanner (MLS), and unmanned laser scanner (ULS) [16], [72].
A TLS uses ground-based RS systems (e.g., tripods) to cover middle- or close-range areas with scans performed in all directions, including upward [78]. Once scans of a single zone are completed, the tripod is moved to another location to scan from another angle or capture data from a new area. As TLS systems are static during the acquisition process, they reach the highest point cloud density and can produce high-quality 3-D models of the interiors of buildings and heritage sites.
Nevertheless, TLS systems cannot always be used, especially for scanning restricted locations that are not safe or accessible for teams (e.g., areas of dense vegetation and unsafe building sites). In these cases, LiDAR sensors can be mounted on airborne platforms. ALS systems are also used to acquire point cloud data over large areas (e.g., for 3-D building reconstruction [79]).
When target regions are directly accessible, their structures and objects can be reconstructed from data acquired by MLS systems, i.e., LiDAR sensors mounted on moving vehicles (e.g., to derive high-resolution 3-D city models [80]).
Since drones and other unmanned vehicles have become cheaper and autonomous navigation more reliable [81], ALS and MLS are often operated as ULS systems. Their platforms are compact and lightweight, which enables them to be deployed as first-response tools for disaster management. ULS systems can make a first scan of the terrain to track movements and changes and deliver 3-D mapping of the most affected locations [82], [83].

B. Photogrammetry
While LiDAR performs a direct measurement of the target object, i.e., by physically hitting a feature with light and measuring the reflection, approaches based on photogrammetry or computer vision theory [84] use a set of overlapping images taken from different locations to identify isolated points within a target. This includes not only airborne photogrammetry but also satellite stereo systems, which can map larger regions quickly. Image-based reconstruction algorithms can estimate the relative locations of these points and eventually convert the overlapping images into a 3-D point cloud. For instance, structure from motion (SfM) algorithms [85] can process multiview images simultaneously by automatically estimating camera positions and orientations, while dense matching and multiview stereo (MVS) algorithms [86] can generate large volumes of points (e.g., for large-scale scenarios and crowded environments).

C. RGB-D Camera
Similar to LiDAR, RGB-D cameras measure the distance between the sensor and the objects. Depth information for each RGB pixel of the image is retrieved via a depth sensor. An RGB-D camera generates a colored point cloud by mapping RGB images with depth information (i.e., images include the (x, y, z) spatial coordinates and RGB colors). In this case, the point cloud is not the direct result of RGB-D scanning [87], [88] since the camera generates pixelwise depth data rather than unstructured points. RGB-D cameras are generally cheaper than LiDAR systems and are mostly used in indoor environments for close-range applications [89].
Structured light and time of flight (ToF) [90], which are active imaging systems, serve as depth cameras and calculate the distance from the sensor to an object, consequently providing 3-D information. The depth of an object can be determined using ToF sensors by measuring the duration of light travel from the sensor to the object and back: the distance follows as d = c·Δt/2, where c is the speed of light and Δt is the measured round-trip time. By determining the ToF of light, these sensors can calculate the object's distance and create a detailed depth map, which can be directly used or easily converted to a point cloud, for instance. Structured light sensors employ the deformation of a projected pattern to determine the distance. By emitting a known light pattern onto a scene and examining how the pattern changes as it interacts with objects in the scene, these sensors are able to accurately measure the depth information of the objects. Structured-light-based 3-D scanners are comparatively more affordable and lighter in weight than their laser-based counterparts. Due to their higher sensitivity to lighting conditions, they may not operate well in outdoor environments or in challenging conditions, such as dusty rooms. For black or glossy surfaces, a specific spray should be applied before 3-D scanning.

D. SAR Point Cloud
SAR is an active RS system that can operate day and night and can penetrate clouds and smoke. Interferometric SAR (InSAR) extends the principle of SAR to the 3-D domain [91] by taking advantage of the physical properties of microwaves [92]. An InSAR system compares the phase of multiple SAR image pairs acquired from slightly different viewing angles to generate InSAR-based point clouds. SAR tomography (TomoSAR) and persistent scatterer interferometry (PSI) are two major techniques that generate point clouds with InSAR [16]. They are used to monitor terrain changes (e.g., surface deformations and human-made structures [93]).

E. Videogrammetry
3-D point clouds can also be reconstructed using video frames (i.e., the input data are video streams instead of a collection of images). This approach is referred to as videogrammetry [94] and is based on the principles of photogrammetry. It can reconstruct point clouds from the frames of a video since their information is sequentially interconnected. Videogrammetry approaches provide a valuable alternative to still-image approaches. They can be semiautomatic since the search for target points in different images can be achieved by measuring or tracking features of interest between consecutive video frames. However, the reconstruction needs to be coupled with effective frame selection algorithms (e.g., video frames are selected based on the surveyed geometry) and robust 3-D processing methodologies [95].

IV. DATASETS
DL approaches are data-demanding; thus, they require large amounts of data with high-quality 3-D shapes and ground truths. Recent developments in scanning and sensing technologies have led to the collection of various widely used and openly accessible benchmarking datasets. These datasets are used to train and evaluate the performance of DL methods for different tasks, including 3-D reconstruction. In this section, we summarize some of the most popular datasets, which can be used by different 3-D DL approaches, with a focus on 3-D reconstruction. Table 1 offers a comparative overview of these datasets.
ShapeNet [96] is a richly annotated, large-scale synthetic dataset of 3-D shapes represented by 3-D computer-aided design (CAD) models of objects, providing roughly 3 000 000 shapes. This dataset has been used for computer graphics and vision purposes. The final representation of this dataset can be a mesh.
4) KITTI [99], [100] is a real-world urban scene dataset composed of images and point clouds. The dataset was acquired by the autonomous driving platform Annieway while driving around the city of Karlsruhe. Evaluation benchmarks were developed for several computer vision and robotic tasks, such as stereo, optical flow, visual odometry, SLAM, 3-D object detection, and 3-D object tracking. Semantic KITTI [101], which is based on KITTI, provides pointwise annotations for semantic segmentation and semantic scene completion purposes. The dataset comprises 28 classes, including classes for nonmoving and moving objects.
5) ScanNet [102] is a 3-D reconstruction dataset of indoor scenes consisting of 2.5 million frames (views) derived from more than 1500 RGB-D scans.
11) The dataset presented by Koch et al. [109] is a CAD model dataset with one million 3-D models. Koch et al. [109] offered a pipeline that is able to convert these CAD models into other representations in order to be processable by DL techniques. The models are provided in .obj and 3-D Systems' stereolithography CAD file format (.stl).
12) Semantic3D.net [110] is a large labeled 3-D point cloud dataset of natural scenes with over four billion points in eight class labels. These dense point clouds, which were recorded by TLSs, depict urban and rural outdoor terrestrial scenes.
13) H3D [111] is a high-resolution real-world dataset containing both point clouds (H3D(PC)) and meshes (H3D(Mesh)) of airborne LiDAR data and can be used for semantic segmentation in geospatial applications. The point clouds are classified into 11 classes, and labeled 3-D textured meshes can be derived from them.
14) 3-D Furnished Rooms with layOuts and semaNTics (3D-Front) [112] is a synthetic dataset of indoor CAD model scenes, containing 18 968 rooms with 3-D objects. The individual objects are taken from 3D-FUTURE [113]. The CAD models are stored in .obj and .mtl file formats.
15) 3-D Furniture shape with TextURE (3D-FUTURE) [113] is a repository of 3-D furniture shapes in the household scenario enriched with 3-D and 2-D annotations. It includes 20 240 synthetic images of 5000 different rooms. Stylistic and texture details of individual objects are provided. The 3-D models are stored in .obj file format.
16) SensatUrban [114] is a dataset for urban-scale point cloud understanding. It covers 7.6 km² of urban areas in the cities of Birmingham, Cambridge, and York. The point clouds are obtained from high-resolution aerial images, which are captured by a UAV mapping system.
17) Stanford 3-D Scanning Repository [115] is a surface reconstruction repository containing some famous 3-D models, such as the Stanford bunny, happy Buddha, dragon, and armadillo, in .ply format. These 3-D models and some others also exist in the Large Geometric Models Archive [116].

V. EVALUATION METRICS
Evaluation metrics are used to assess the performance of DL models [1], [2], [3]. Various metrics have been proposed for testing deep geometric learning methods. Some of the common distance metrics used for surface reconstruction methods are the Chamfer distance (CD), earth mover's distance (EMD), and Hausdorff distance (HD), which all measure the discrepancy between two sets, as illustrated in Fig. 2. Another common metric for evaluating 3-D reconstruction solutions is the Intersection over Union (IoU). Furthermore, the formulas in this section denote false positives, false negatives, true positives, and true negatives as FPs, FNs, TPs, and TNs, respectively.
1) The CD [30] measures the distance between two different surfaces or sets of points by first calculating the distances between predicted points and their ground-truth nearest neighbors and then averaging all of these distances. The calculated value represents the dissimilarity between the predicted output and the ground truth; the lower the value, the better the result. Let S1 and S2 be two point clouds that represent the predicted and ground-truth shapes, and let x and y be two points that belong to these point clouds, respectively. Then, the CD is obtained by averaging, over all x in S1, the distance to the nearest neighbor y in S2 and adding the analogous term in the opposite direction.
2) The EMD, also known as the Wasserstein distance in mathematics and optimization theory [30], [117], [118], is based on solving an optimization problem, called the transportation problem. The transportation problem attempts to find the least-expensive flow of goods from suppliers to consumers while satisfying the consumers' demand. For the calculation of the EMD of two point sets, each point in one set is assigned to a unique point in the other set to fulfill an optimal assignment. The EMD uses the bijection between the points that minimizes the total sum of the pairwise distances. Consider S1 ⊆ R³ and S2 ⊆ R³ to be two point sets of equal size, representing the predicted and ground-truth shapes, respectively. The EMD [30] is then defined as the minimum, over all bijections ϕ : S1 → S2, of the sum of the distances between each point x ∈ S1 and its assigned point ϕ(x) ∈ S2.
3) The HD considers the farthest and largest dissimilarity between the predicted output and the ground truth. A point in one set that has the worst mismatch, i.e., the maximum distance from its nearest point in the other set, determines the HD. The metric is, however, not very robust toward outliers.
4) The IoU, also known as the Jaccard index, is often used as a quality measure in object detection and semantic segmentation. As illustrated in Fig. 3, it is defined as the overlap between the prediction and the ground truth, divided by their union. The lower the IoU, the worse the prediction result. The IoU can also be easily utilized for evaluating voxel-based representations and specifying the overlap between a reconstructed 3-D voxel grid and its voxelized ground truth. For volumetric approaches, the IoU can be formulated by counting the voxels for which both the thresholded prediction I(p(i, j, k) > t) and the ground truth y(i, j, k) indicate occupancy and dividing this count by the number of voxels occupied in either the prediction or the ground truth [20], where I(·) is an indicator function, p(i, j, k) is the predicted voxel occupancy probability, t is a voxelization threshold, and y(i, j, k) is the ground-truth occupancy probability.
5) In classification problems, precision is the number of predictions correctly assigned to one label, i.e., true positives, divided by the number of all predictions assigned to that label, including those identified incorrectly, i.e., false positives (see Fig. 4): precision = TP/(TP + FP). The average precision (AP) is computed by averaging all precision values of all positively labeled samples [98]. The mean AP (mAP) is the average of the AP calculated over all classes. For point clouds, precision is calculated as the percentage of predicted points that are close to the ground-truth surface, i.e., with a distance less than a specific threshold [119].
6) Recall or sensitivity denotes the ratio between the number of predictions correctly assigned to one class (TP) and the actual number of elements in that class, including those that are incorrectly assigned to the other label (FN) (see Fig. 4): recall = TP/(TP + FN). It is a measure of how well a DL model can find all labels of one class. For point clouds, recall is calculated as the percentage of points on the ground truth that are close to the predicted surface, i.e., having a distance less than a specific threshold [119].
7) The F1 score, also known as the balanced F-score, F-measure, or dice similarity coefficient (DSC), is the harmonic mean of precision and recall: F1 = 2 · precision · recall/(precision + recall). The higher the value, the better the result. For point clouds, precision and recall can be calculated by checking the percentage of points in one point cloud, for instance, the predicted point cloud or the ground truth, that can find a neighbor from the other point cloud within a threshold [38]. Intuitively, the F-score can be interpreted as the percentage of points that were reconstructed correctly [119].
8) In classification problems, the accuracy (Acc) is the ratio between correct predictions and all predictions, i.e., Acc = (TP + TN)/(TP + TN + FP + FN); it shows how much of the data is labeled correctly. However, it is not an appropriate metric for imbalanced datasets as it does not take into account the distribution skew [120].
9) Normal consistency (NC) [45] is defined as the mean absolute dot product of the surface normal of each point, i.e., a perpendicular vector to the surface at the given point, in one mesh and the surface normals of its nearest neighbors in the other mesh. Here, ∂M̂ and ∂M denote the predicted and ground-truth mesh surfaces, n(p) and n(q) are unit normal vectors on these mesh surfaces, respectively, π2(p) and π1(q) indicate the projections of p and q onto the respective other surface mesh, and ⟨·, ·⟩ implies the inner product. The higher the NC, the better the result.
10) The Jensen-Shannon divergence (JSD) [31] measures the similarity between marginal point distributions. It is mainly based on the Kullback-Leibler (KL) divergence [121]. Considering two point clouds and a voxel grid that discretizes 3-D space, the number of points within each voxel from the predicted point set P and the ground-truth point set G is counted. The JSD between the obtained empirical distributions (PP, PG) is then calculated as the mean of the KL divergences of PP and PG from their mixture M = (1/2)(PP + PG).
11) Coverage [31] quantifies the fraction of points in the ground-truth set S2 that are matched to points in the predicted set S1. A match happens when a point in the ground-truth set is the nearest neighbor of a point in the predicted set, where "nearness" D(·, ·) is measured using distance metrics, such as the CD or EMD. High coverage indicates that most of the points in S2 are roughly present within S1. However, this does not assess the quality of the predicted set. Achieving perfect coverage is possible despite large distances between the predicted point set and the ground-truth set [33].
12) Minimum matching distance (MMD) [31], [33] is a complement to the coverage metric. It measures the distance between every point in the ground-truth set S2 and its nearest neighbor in the predicted set S1 and averages these distances in order to evaluate the quality of the predicted set; here, D(·, ·) is again measured using distance metrics, such as the CD or EMD.
13) The light field descriptor (LFD) [122] measures visual similarity between 3-D shapes. In short, LFD assumes that a 3-D object can be represented as a number of 2-D views; therefore, if two 3-D models are similar, they also look alike from all views. A light field, which is used in image-based rendering, is defined as a 5-D function that represents the radiance at a given 3-D point along a given direction. To extract the LFD of a 3-D model, a set of image renderings (silhouettes) is obtained from different angles. These rendered images are acquired using cameras located on the vertices of a fixed regular dodecahedron, i.e., 20 vertices, which surrounds the 3-D model. Each of these silhouettes is then encoded both by a region shape descriptor (Zernike moments descriptor) and a contour shape descriptor (Fourier descriptor) for similarity comparisons. A visual representation can be found in Fig. 5. LFD is a good visual similarity metric for 3-D surfaces; however, by rendering merely the silhouette of the shape without lighting, LFD can only observe the condition of the shape on the edge of the silhouette [49]. The dissimilarity DA between two 3-D models is calculated as the minimum, over all rotations i between the camera positions of the two models, of the summed dissimilarities d(I_k^1, I_k^2) between corresponding rendered images I_k^1 and I_k^2 of the ith rotation.

Fig. 5. LFD similarity computation [122]: a different mapping between rendered images of the two 3-D models is chosen, and thus, another similarity value is extracted (c). Eventually, the rotation of camera positions with the best similarity is found (d). The similarity between the two 3-D models is attained by summing up the similarities from all the corresponding images.
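As a practical illustration of the point-set metrics above, the following sketch computes the CD, HD, the distance-threshold-based F-score, and the volumetric IoU with NumPy/SciPy. It assumes one common convention (mean of non-squared nearest-neighbor distances for the CD); published implementations differ in squaring and normalization, so values are only comparable when the same convention is used. The threshold tau and the toy point sets are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def nn_dists(a, b):
    """For each point in a, distance to its nearest neighbor in b."""
    return cKDTree(b).query(a)[0]

def chamfer(pred, gt):
    # Symmetric mean nearest-neighbor distance (one common CD convention).
    return nn_dists(pred, gt).mean() + nn_dists(gt, pred).mean()

def hausdorff(pred, gt):
    # Worst-case mismatch in either direction; sensitive to outliers.
    return max(nn_dists(pred, gt).max(), nn_dists(gt, pred).max())

def f_score(pred, gt, tau=0.01):
    # Precision: predicted points within tau of the ground-truth surface;
    # recall: ground-truth points within tau of the prediction.
    precision = (nn_dists(pred, gt) < tau).mean()
    recall = (nn_dists(gt, pred) < tau).mean()
    return 2 * precision * recall / (precision + recall + 1e-9)

def voxel_iou(prob, gt_occ, t=0.5):
    # Volumetric IoU between a thresholded occupancy prediction and the
    # ground-truth occupancy grid.
    pred_occ = prob > t
    inter = np.logical_and(pred_occ, gt_occ).sum()
    union = np.logical_or(pred_occ, gt_occ).sum()
    return inter / max(union, 1)

pred = np.random.rand(1000, 3)
gt = np.random.rand(1200, 3)
print(chamfer(pred, gt), hausdorff(pred, gt), f_score(pred, gt, tau=0.05))
```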

VI. DL-BASED 3-D SURFACE RECONSTRUCTION
DL-based 3-D surface reconstruction approaches can be broadly classified into four main categories according to their representation, as illustrated in Fig. 6.
1) Volumetric representations define a surface via small cuboids, either a dense 3-D voxel grid [20], [21], [22], [23], [24], [25] or an octree [26], [27], [28], [29]. Dense voxels are the 3-D analog of a pixel in 2-D space, i.e., cubical elements in a regularly spaced 3-D grid. Octrees, in turn, are obtained by recursively splitting 3-D space into octants, i.e., eight equally sized cells. In this data structure, only cells that carry information, i.e., those close to the surface boundary, are subdivided. Neighboring cells that have the same value do not need to be subdivided, and all of these areas can be represented by a single large octree cell.
Existing approaches can be mainly divided into three categories.
a) Patch-based approaches attempt to reconstruct the final shape by learning a group of mappings from 2-D squares to 3-D patches and putting together these small patches.
b) Deformable-template-based approaches deform the vertices of a template mesh with predefined interconnections and predict the final shape based on it.
c) Other mesh generation methods are so singular that they do not fit either of these groups and are sorted into a catch-all category.

A. Volumetric Representations
Volumetric approaches in neural networks for 3-D surface reconstruction rely on describing the object through a grid. By extending the concept of 2-D convolutions to 3-D, a grid can be easily processed using learning-based approaches, such as neural networks.
Analogously to the concept of a pixel in the 2-D world, a voxel is a cubical element in a regularly spaced 3-D grid. An octree can be built by recursively subdividing the space into octants until a predefined maximum depth is reached. Additional information can be stored in the cubic cells (both in dense and sparse voxel structures) to help reconstruct surfaces, as follows.
1) Signed distance functions (SDFs) express the distance between the center of each voxel and the closest point on the surface of an object. Such values can be stored in each cuboid by calculating distance functions (DFs) [25], [123]. SDFs, a variation of DFs, additionally encode whether a cell lies inside or outside the object through the sign of the distance value. Truncated SDFs (TSDFs) [124] go beyond the SDF definition by specifying a truncation threshold for the SDF values stored in the cuboids, i.e., assigning a fixed value to voxels that are not near enough to the surface, namely, those whose signed distance values exceed the defined threshold.

2) Occupancy or indicator functions indicate whether a cuboid is occupied by the surface of an object or not.
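As a rough illustration of how such distance values can be stored in a grid, the sketch below computes a truncated (unsigned) distance field over a voxel grid from a point cloud using a k-d tree; recovering the sign, i.e., inside/outside, additionally requires oriented normals or depth-map fusion as in TSDF pipelines [124], which is omitted here. Resolution and truncation values are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def truncated_distance_field(points, resolution=32, trunc=0.1):
    """Unsigned, truncated distance field of a point cloud on a voxel grid.

    Each voxel center stores the distance to the closest input point,
    clipped to the truncation threshold `trunc` (in normalized units).
    """
    mins = points.min(axis=0)
    extent = (points.max(axis=0) - mins).max() + 1e-9
    pts = (points - mins) / extent  # normalize into the unit cube
    # Voxel center coordinates of the regular grid.
    ticks = (np.arange(resolution) + 0.5) / resolution
    gx, gy, gz = np.meshgrid(ticks, ticks, ticks, indexing="ij")
    centers = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    dist, _ = cKDTree(pts).query(centers)
    return np.minimum(dist, trunc).reshape(resolution, resolution, resolution)

tdf = truncated_distance_field(np.random.rand(5000, 3), resolution=32, trunc=0.1)
print(tdf.shape, tdf.min(), tdf.max())
```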
Learning voxel-based SDF representations is usually rather complicated compared to occupancy representations since dealing with DFs in 3-D space is more difficult than simply classifying a voxel as occupied or unoccupied [45]. However, voxel-based SDF approaches provide the advantage of generating smoother surfaces compared to occupancy-grid-based approaches. A general disadvantage of voxel-based methods is their resolution limitation by the underlying 3-D grid. Mesh extraction approaches, such as the classical Marching Cubes (MC) algorithm [125], can be used to infer a mesh from the final output of these methods.
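For example, given an occupancy (or SDF) grid predicted by any of the methods below, a triangle mesh can be extracted with an off-the-shelf MC implementation. The sketch assumes scikit-image is available and uses a 0.5 iso-level for occupancy probabilities (for an SDF, the iso-level would be 0); the spherical toy grid merely stands in for a network prediction.

```python
import numpy as np
from skimage import measure

# Toy occupancy grid of a sphere standing in for a network prediction.
res = 64
coords = np.linspace(-1.0, 1.0, res)
x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
occupancy = (x**2 + y**2 + z**2 < 0.5**2).astype(np.float32)

# Marching Cubes at the 0.5 iso-surface yields vertices and triangle faces.
verts, faces, normals, _ = measure.marching_cubes(occupancy, level=0.5)
print(verts.shape, faces.shape)  # (V, 3) vertex positions, (F, 3) vertex indices
```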
1) Dense Voxels: The majority of approaches with a dense voxel-based representation voxelize the 3-D space in order to apply 3-D convolutional neural networks (CNNs) on a grid directly. In this section, we first present pioneering studies that applied CNNs to a 3-D representation, i.e., dense voxels, for shape classification and then introduce 3-D surface reconstruction and shape completion approaches that use dense voxels.
a) Volumetric CNNs for 3-D shape classification: Several studies have focused on solving shape classification and recognition tasks using dense voxels [21], [23], [98], [126], [127], [128], [129]. One of the pioneers in building DL models for the 3-D world is 3-D ShapeNets, as proposed by Wu et al. [98]. They were among the first authors to show the application of CNNs to a 3-D representation. The introduced architecture uses a convolutional deep belief network for representing a 3-D shape as a probabilistic distribution of binary variables on a 3-D voxel grid. 3-D ShapeNets is able to conduct several tasks, from shape recognition to reconstruction and completion, as well as next-best-view prediction. The DL model takes a single-view depth map of the physical object as input and converts it into a volumetric representation. The occupancy status of each cell is specified by classifying it as either free space, unknown space, or observed surface. Next, a deep belief network is trained on this grid of size 30³. In terms of accuracy, precision, and recall metrics, 3-D ShapeNets outperforms several baseline methods for 3-D shape classification and retrieval, such as the LFD approach [122] and the spherical harmonic descriptor (SPH) [130], even though it utilizes a mesh at lower resolution. It was further shown that the DL model is able to automatically learn general 3-D features.
Maturana and Scherer [126] introduced VoxNet, which voxelizes input point cloud data and processes the grid with a 3-D-CNN for object recognition tasks. The authors utilized a volumetric grid for representing the estimated spatial occupancy and a 3-D-CNN for extracting features and predicting class labels directly from the occupancy grid of size 32³. Each point in the input point cloud is mapped to discrete volume coordinates. The resulting voxel volumes are fed to the proposed shallow neural network. VoxNet has fewer parameters compared to 3-D ShapeNets [98], i.e., less than one million versus over 12.4 million parameters, while achieving 8% and 6% higher average accuracy on the ModelNet10 and ModelNet40 datasets, respectively. However, in both of these methods, the memory and computational costs increase cubically with respect to the input resolution.
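The following PyTorch sketch outlines a VoxNet-style classifier over a 32³ occupancy grid; the layer sizes and the number of classes are illustrative choices, not the exact configuration reported in [126].

```python
import torch
import torch.nn as nn

class VoxNetLike(nn.Module):
    """Small 3-D CNN over a 1 x 32 x 32 x 32 occupancy grid (illustrative)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=5, stride=2), nn.ReLU(),   # 32 -> 14
            nn.Conv3d(32, 32, kernel_size=3), nn.ReLU(),            # 14 -> 12
            nn.MaxPool3d(2),                                        # 12 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 6 * 6 * 6, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, voxels):
        return self.classifier(self.features(voxels))

logits = VoxNetLike()(torch.rand(4, 1, 32, 32, 32))  # batch of 4 grids
print(logits.shape)  # torch.Size([4, 10])
```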
ORION [127], which is based on VoxNet [126], studies the importance of object orientation for 3-D object recognition results. Unlike VoxNet and 3-D ShapeNets [98], which augment the training data with rotations of the objects to achieve rotational invariance of the network, ORION seeks to predict the object orientation. The proposed network uses 3-D convolutional networks for 3-D recognition and adds an auxiliary orientation loss for better classification performance. By forcing the network to predict the object orientation in addition to class labels during training, more accurate classification results can be achieved at test time. The ORION network is shallower than the method proposed by Brock et al. [21], which is discussed further below, leading to fewer trainable parameters.
Some studies utilize multiview CNNs for analyzing a 3-D shape. Multiview CNNs work in three steps: 1) rendering a 3-D shape as a collection of images from different viewpoints; 2) inferring features for each viewpoint; and 3) fusing these features across the various views. In order to minimize the performance gap between multiview CNNs and volumetric CNNs, Qi et al. [128] suggested two new volumetric CNN architectures. One architecture focuses on local regions, while the other uses anisotropic probing kernels for convolving a 3-D cube, then projecting the 3-D volume to a 2-D image, and afterward applying image-based CNNs for classification. The proposed CNNs surpass volumetric CNN-based methods, such as 3-D ShapeNets [98] and VoxNet [126]. Moreover, their classification accuracy competes with some multiview-based methods, such as MVCNN [131], the LFD approach [122], and SPH [130], given the same 3-D resolution of 30³.
b) 3-D surface reconstruction and shape completion using volumetric representations: In this section, we review the studies that leverage dense voxel representations for 3-D surface reconstruction [20], [21], [22], [23] and 3-D shape completion [24], [25]. Choy et al. [20] introduced a framework, the 3-D recurrent reconstruction network (3D-R2N2), for both single- and multiview 3-D reconstruction. This method takes one or more RGB images of an object from arbitrary viewpoints as input and outputs a 3-D occupancy grid. The proposed network is composed of three main modules, as shown in Fig. 7: 1) a 2D-CNN, which encodes the input into a low-dimensional feature vector; 2) a 3-D convolutional long short-term memory (LSTM) [132], in which the 3D-LSTM units keep their previous cell states or update them whenever more observations, i.e., multiview images, become available; and 3) a 3-D deconvolutional neural network (3D-DCNN), which decodes the 3D-LSTM hidden states into a higher resolution and produces the final occupancy grid.

Fig. 7. Overview of the 3D-R2N2 network [20]. The input to this network is one or more RGB images from arbitrary viewpoints, and the output is a 3-D occupancy grid. The main modules of 3D-R2N2 are an encoder, a 3-D LSTM, and a decoder.
In the LSTM module, the 3D-LSTM units are arranged in a grid structure in such a way that each of them focuses on reconstructing a particular part of the output. Two versions of 3D-LSTMs, 3D-LSTMs without output gates and 3-D gated recurrent units (GRUs), were tried out in 3D-R2N2, with the latter achieving better results. The output size is 32³. Although the generation of detailed and thin parts of objects and the reconstruction of objects with high texture levels remain very challenging, 3D-R2N2 performs better in SVR on real-world images than the category-specific approach proposed by Kar et al. [133], which learns 3-D shapes using camera viewpoint estimations together with object silhouettes. 3D-R2N2 is also able to produce accurate outputs compared to the MVS method [86] in multiview reconstruction (MVR).
Brock et al. [21] investigated generative and discriminative voxel modeling with deep ConvNet architectures. In short, their method presents a voxel-based variational autoencoder (VAE) [134], [135] for reconstruction and interpolation, a graphical user interface for investigating the latent space of autoencoders (AEs), and a deep voxel-based CNN for object classification. The output size of the network is 32³. The voxel-based VAE learns to reconstruct features of an object, attaining acceptable reconstruction accuracy. It further facilitates the transition from one object to another by interpolating between their reconstructions. The neural model has significantly fewer parameters than FusionNet [129], i.e., 18 million as opposed to 118 million. Nevertheless, it achieves competitive results compared to ORION [127], considering that ORION uses orientation augmentations to improve the classification.
The TL-embedding network [22] learns a vector representation of an object, which is both generative in 3-D, i.e., able to reconstruct objects in 3-D space from this representation, and predictable from 2-D images, i.e., able to be extracted from images. As shown in Fig. 8, this architecture is composed of a convolutional network, which brings about predictability, and an autoencoder, which results in generativeness. It generates outputs with a 20³ resolution. This method captures stylistic details better than the method proposed by Kar et al. [133].
Wu et al. [23] introduced a framework, called the 3-D generative adversarial network (3D-GAN), which generates novel volumetric 3-D objects from a probabilistic latent space. 3D-VAE-GAN, an extension of 3D-GAN, provides the ability to reconstruct surfaces from input images. For the generation and recognition of 3-D objects, this method utilizes both generative adversarial modeling [136], [137] and volumetric convolutional networks [98], [126], as illustrated in Fig. 9. Furthermore, it fuses 3D-GAN with a VAE [134] for 3-D object reconstruction from a single 2-D image. The resolution of its final output can reach up to 64³. The classification accuracy of this network is roughly similar to that of volumetric learning-based approaches, such as VoxNet [126] and ORION [127], but is lower than that of the method proposed by Qi et al. [128]. It shows a higher AP for voxel prediction compared to the work by Girdhar et al. [22] in a single-image 3-D reconstruction task. However, 3D-VAE-GAN usually creates a noisy and incomplete output from an input image. Studies conducted by Wu et al. [138] showed that, ultimately, training GANs together with recognition networks can lead to high instability.
Stutz and Geiger [25] introduced a learning-based approach with weak supervision for 3-D shape completion. It takes a 3-D bounding box and an incomplete point cloud as input and predicts the complete object shape. The completion process is done in two steps. 1) A shape prior is learned, i.e., a VAE is employed to learn a 3-D shape model on synthetic data, encoding the shape models in a dataset using occupancy grids and SDFs at 24 × 52 × 24 resolution. 2) Shape inference is performed. For this, 3-D shape completion is considered a maximum likelihood (ML) problem. The authors used the amortized ML (AML) approach that works over the lower dimensional latent space z from the first step. It keeps the pretrained decoder from the previous step fixed and adds a new encoder. The encoder is trained without supervision, i.e., without using explicit labels, and learns to directly predict ML solutions from incomplete input observations using an ML loss. The presented method was shown to be faster than a fully supervised baseline while using 9% or less supervision and still producing competitive results.
Dai et al. [24] fused a volumetric DNN with a 3-D shape synthesis procedure to complete partial 3-D inputs. Their approach generates the output in two major stages. 1) A shape prediction step predicts a volumetric grid with 32³ resolution as a low-resolution global structure of the input. The proposed network, the 3-D-encoder-predictor network (3D-EPN), consists of 3-D convolutional layers and attempts to predict distance field values for missing data. 2) A patch-based 3-D shape synthesis step employs a synthesis procedure to improve local details and create a high-resolution output using CAD model priors. Given the predicted coarse output from the first stage, the authors carried out a search for similar 3-D shape models in the ShapeNet [96] database. Based on the results, they sought to find similar local patches in these shape models for the purpose of local detail synthesis. The resolution of the final voxel grid is 128³. Without the synthesis step, 3D-EPN provides only a low resolution and is unable to predict local details and fine structures. Nevertheless, it outperforms 3-D ShapeNets [98] and Poisson methods [8], [9].
In another approach, Dai et al. [139] suggested the sparse generative neural network (SG-NN), a self-supervised scene completion approach that accepts an incomplete RGB-D scan as input and predicts a high-resolution 3-D reconstruction while also inferring unseen, missing geometry. The self-supervised nature of this technique allows for training entirely on real-world, partial scans. This eliminates the requirement for synthetic ground truth. Self-supervision is achieved by removing some frames from a given (incomplete) RGB-D scan, resulting in an even more incomplete input; this input is used to create an input-target pair (the original scan is considered the target scan). The difference in partialness is then correlated in this input-target pair, while regions that have never been observed are masked out during training. Despite the fact that fully complete scenes are not used as samples during training, this approach reaches high levels of completeness by learning to generalize completion patterns across the training set. Architecturally, SG-NN is a fully convolutional encoder-decoder network capable of predicting the final high-resolution geometry as a sparse TSDF representation. This end-to-end formulation generates a 3-D scene in a coarse-to-fine manner. SG-NN is built upon sparse convolutions [140] that operate only on surface geometry. This self-supervised approach produces more accurate and complete scenes in comparison to a fully supervised approach, such as 3D-EPN [24].
In general, voxel-based methods encounter a number of difficulties. Information loss may occur due to the discretization and transformation of input data into coarse voxels. Moreover, the cubic growth in memory limits the resolution, and the overall computational demands bring about coarse final outputs. Generating higher resolution surfaces requires deeper networks; however, the network depth is constrained by the available GPU memory, which limits the ability of CNNs with volumetric decoders to produce high-resolution outputs [141].
2) Octrees: Dense voxel representations are associated with a number of challenges regarding resolution, memory, and computational complexity. In many cases, though, the 3-D shape surface occupies only a small portion of 3-D space. Hence, octrees mark a popular approach for partitioning space, as they allow the 3-D data to be stored in a sparse structure [142], [143]. For the octree construction of a 3-D shape, a bounding cube is created around the entire shape. This bounding cube is recursively subdivided: in each step, all cuboids that are occupied by a shape boundary are traversed, and each of them is divided into eight smaller, equally sized cuboids. However, in order to enable CNN operations on an octree, this data structure needs to be updated and slightly changed, which leads to complex implementations, while the resolution is still limited by the underlying 3-D grid [45]. Convolutions and pooling are applied to octrees similarly to CNN operations on dense voxels, with the main difference being that the elementary operand is an octant.
a) OctNet: Riegler et al. [144] presented OctNet, which enables the usage of high-resolution inputs for DL purposes. OctNet is based on a 3-D-CNN that can be applied to a special form of octree data structure to learn representations from high-resolution 3-D data. Vanilla octree implementations might encounter data access speed issues in high-resolution (high recursion depth) octrees. On the other hand, for convolutional network operations, such as convolution or pooling, it is crucial to have frequent access to different data elements, such as cell neighbors. In order to provide faster data access and reduce cell traversal time, the authors proposed a hybrid grid-octree data structure. They used a shallow octree, which is an octree with maximum depth D = 3, as a basic building block. Several of these shallow octrees are stacked in a regular grid structure to cover the whole volume. The input resolution effects of this representation were evaluated on three different tasks: 3-D classification, 3-D orientation estimation of unknown object instances, and semantic segmentation of 3-D point clouds. For high-resolution inputs in the 3-D shape classification task, OctNet runs faster and requires less memory as opposed to DenseNet, a densely voxelized version of OctNet. In general, both OctNet and DenseNet perform better than a shallow network such as VoxNet [126], verifying that network depth is of great importance.
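The recursive subdivision described at the beginning of this subsection can be sketched in a few lines. The toy example below, which is independent of OctNet's hybrid grid-octree layout, subdivides only cells that contain points, up to a maximum depth; all names and parameters are illustrative.

```python
import numpy as np

def build_octree(points, center, half_size, depth, max_depth=3):
    """Recursively subdivide occupied cells of a point cloud into octants.

    Returns a nested dict; leaves store the points that fall inside them.
    Only cells containing points (i.e., near the surface) are subdivided,
    which is what keeps the structure sparse. Points lying exactly on a
    cell boundary may be assigned to more than one octant in this toy version.
    """
    if depth == max_depth or len(points) == 0:
        return {"center": center, "half_size": half_size, "points": points}
    children = []
    for dx in (-0.5, 0.5):
        for dy in (-0.5, 0.5):
            for dz in (-0.5, 0.5):
                c = center + half_size * np.array([dx, dy, dz])
                mask = np.all(np.abs(points - c) <= half_size / 2, axis=1)
                if mask.any():  # skip empty octants entirely
                    children.append(build_octree(points[mask], c,
                                                 half_size / 2, depth + 1, max_depth))
    return {"center": center, "half_size": half_size, "children": children}

pts = np.random.rand(2000, 3)
tree = build_octree(pts, center=np.array([0.5, 0.5, 0.5]), half_size=0.5, depth=0)
```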
OctNet does not generate an octree structure, and this structure has to be known in advance for both input and output. In classification and semantic segmentation tasks, this does not pose a problem. However, learning the volumetric structure of objects and scenes, and being able to construct them, is crucial in generative tasks, such as reconstruction, generation, and completion, since the input and output partitioning structures might differ. OctNetFusion [26] proposes a learning-based approach, which learns to partition the space and can predict an SDF or a binary occupancy map. The network takes one or more 2.5-D depth maps as input. To reconstruct precise and complete 3-D outputs, it fuses depth information from different viewpoints into a coarse volumetric grid. Then, this volumetric grid (grid-octree structure) is fed to the OctNetFusion network architecture, consisting of encoder-decoder modules. The network determines whether a cell should be subdivided or not in a coarse-to-fine manner. The output resolution can be up to 256³. This approach performs qualitatively and quantitatively better than traditional volumetric fusion approaches, such as vanilla TSDF fusion [124] and TV-L1 fusion [145], for volumetric fusion tasks and better than Voxlets [146] for volumetric shape completion from a single image.
b) O-CNN: Another concurrent work in the scope of octree-based CNNs (O-CNNs) for 3-D shape analysis is the O-CNN [147]. The authors' main idea is to represent 3-D objects with octrees and execute 3-D-CNNs only on nodes or cuboids that are occupied by boundaries of the 3-D object, instead of sliding the convolutional kernel over the whole voxel grid, as done for the standard convolution computation in full voxel grids. The network constructs an octree from an oriented 3-D input model, e.g., an oriented triangle mesh or a point cloud with oriented normals, and enriches each octant of this data structure with metainformation, such as shuffle key vectors, label vectors, and the input signal, which are needed for the convolution operations. Furthermore, a hash table is built to accelerate the neighborhood search in the convolution. By storing the octree data structure in graphics memory, O-CNN can be easily and efficiently trained and evaluated on GPUs. To demonstrate the efficiency of their network, the authors evaluated it on three shape analysis tasks: object classification, shape retrieval, and shape segmentation. In terms of classification accuracy, O-CNN performed better than VoxNet [126], slightly worse than the method proposed by Brock et al. [21], and competitively with nonvoxel-based methods, such as PointNet [148]. In addition, the impact of different input representations on the same network architecture (O-CNN) was investigated. The results showed that an octree input achieves higher accuracy compared to full voxel structures. For object part segmentation, O-CNN yields better or comparable performance compared to other methods, such as PointNet [148].
To improve the computation and memory efficiency of O-CNN, Wang et al. [27] proposed the extension "Adaptive O-CNN," which consists of an encoder-decoder structure and uses patch-guided adaptive octree shape representations. Contrary to approaches such as volumetric CNNs, where the output is generated as voxels of the same resolution, this method can generate adaptive octrees based on a patch-guided partitioning strategy and with differently sized planar patches. The underlying assumption is the subdivision rule, which states that splitting all octants to the finest level is not necessary. The process can be stopped early for some of the octants, and the local shape inside these octants can be represented by simple patches, e.g., planar patches. However, this approach limits the quality of the output and may encounter some difficulties in generating watertight and curved surfaces. Adaptive O-CNN obtains better or comparable classification accuracy compared to PointNet [148], OctNet [144], and O-CNN [147], yet it performs worse than PointNet++ [149], Kd-Network [150], and the method proposed by Brock et al. [21]. For the task of shape reconstruction from a single image, Adaptive O-CNN surpasses PointSetGen (PSG) [30] and AtlasNet [34] in generating more detailed geometry.
c) Other octree prediction approaches: Häne et al. [28] introduced a hierarchical surface prediction (HSP) framework for high-resolution voxel grid prediction in 3-D object reconstruction. The main idea boils down to generating and predicting high-resolution voxels around the predicted surface and coarse-resolution voxels for the interior and exterior parts of an object. The high-resolution voxels are not predicted directly; instead, a coarse-to-fine approach is used to create smoother 3-D models hierarchically and in a multiresolution fashion. Starting with an approximation of the coarse geometry of the output, more finely resolved details are added step by step by refining the surface. This process finally results in a voxel grid with up to 256³ resolution. The proposed method is based on an encoder-decoder architecture. A convolutional encoder encodes the input into a feature vector, and then, an upconvolutional decoder predicts the voxel grid or final data structure (called the voxel block octree data structure in that work). Classifying each voxel as boundary, free space, or occupied space, only voxels with a boundary label require a high-resolution prediction since they cover the actual surface. The major difference between HSP and OctNet [144] is that OctNet takes the structure of the shallow octrees as input, while HSP predicts the structure of the tree together with its content. HSP produces more accurate surfaces with higher resolutions compared to low-resolution baselines predicting dense voxels.
In a similar approach, Tatarchenko et al. [29] suggested the octree generating network (OGN), a convolutional decoder that can generate and predict the octree structure of 3-D shapes, along with the occupancy value of each cell. It operates on octrees and reconstructs 3-D shapes in a multiresolution manner, as illustrated in Fig. 10. This method generates results up to a resolution of 512³. The network gradually reconstructs a high-resolution surface from the initial, low-resolution dense voxel grid using hash-table-based octree blocks. If the reconstructed surface has not yet reached the final output resolution, cells with a "mixed," i.e., undetermined, state are passed to the next layer of the network for further subdivision. Providing the same accuracy as dense voxel grids at low resolutions, OGN offers lower memory consumption and shorter run-times at higher resolutions in comparison to voxel-grid-based networks. In particular, it is 20 times faster and requires two orders of magnitude less memory at 512³ resolution.

B. Point-Based Representations
These days, point clouds are becoming increasingly important and available due to the improvements in scanning devices in recent years. A point cloud is a set of points in 3-D space, inferred by various 3-D data acquisition techniques. It is an irregular data format since there is no canonical order between the points in a set. Each point can be defined by its (x, y, z) coordinates. Therefore, the size of the matrix representing a 3-D object is initially N × 3 for N points. The number of columns in this matrix representing the features might be extended if other information, such as color and normals, exists. Considering the irregular and unordered nature of point clouds, it is difficult to apply DL techniques, such as CNNs, directly on them. Consequently, in order to process a point cloud with neural networks, it was common to transform it into voxel grids or collections of images. These transformations usually present numerous challenges, such as information loss, voluminous data, resolution constraints, and high computational costs. To reduce the overhead of data transformation to other formats, different methods for effectively processing point clouds with neural networks have been proposed, which will be discussed in Sections VI-B1 and VI-B2.
1) PointNet and PointNet++: Pioneering works in the field of learning global features directly on point clouds are PointNet [148] and PointNet++ [149]. PointNet, as proposed by Qi et al. [148], directly consumes a raw point cloud as input and uses it for discriminative DL tasks, e.g., object classification, semantic segmentation, and part segmentation. As illustrated in Fig. 11, each of the points in the input set is processed by a small neural network individually and independently based on its own coordinates, resulting in a high-dimensional embedding of the points. Following the embedding step, a simple symmetric function, such as max pooling, is utilized to aggregate the encodings from each of the points. The symmetric function is chosen such that it respects the permutation invariance of the input points. The aggregation step yields a global feature vector, which encodes the whole shape and can be fed to different neural networks for recognition purposes. PointNet achieves higher classification accuracy compared to the LFD approach [122], which is a 3-D model retrieval method, SPH [130], and other methods with volumetric representations, such as 3-D ShapeNets [98], VoxNet [126], and another method previously proposed by Qi et al. [128]. Although it has around 17 times fewer parameters than multiview-based methods, such as MVCNN [131], its performance is only slightly lower compared to these methods. PointNet provides linear complexity O(N) in both the spatial and temporal domains, where N is the number of input points, while the complexity grows quadratically with respect to image resolution for multiview methods and cubically with respect to the volume size for volumetric methods. More importantly, because its permutation-invariant design processes points independently before a single global aggregation, PointNet cannot capture local information and, thus, lacks generalization.
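A minimal PointNet-style classifier can be written in a few lines of PyTorch: a shared per-point MLP (implemented with 1-D convolutions), a max-pooling aggregation that is invariant to the point order, and an fc head. The input/output transform networks and other details of [148] are omitted, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP + order-invariant max pooling (illustrative)."""

    def __init__(self, num_classes=10):
        super().__init__()
        # 1-D convolutions with kernel size 1 act as a shared MLP over points.
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1),
        )
        self.head = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, num_classes)
        )

    def forward(self, xyz):                            # xyz: (B, N, 3)
        feats = self.point_mlp(xyz.transpose(1, 2))    # (B, 1024, N)
        global_feat = feats.max(dim=2).values          # symmetric aggregation
        return self.head(global_feat)

logits = TinyPointNet()(torch.rand(8, 2048, 3))
print(logits.shape)  # torch.Size([8, 10])
```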
In order to resolve the issues of PointNet, Qi et al. [149] introduced the extension PointNet++, which pays more attention to local features and combines them with global features to infer better results. The architecture is built on top of PointNet, enriching it with a hierarchical feature learning approach. The whole process, which is done recursively, can be summarized as follows: 1) specifying centroids of local regions by sampling a subset of the input point cloud using the farthest point sampling (FPS) algorithm (sketched below); 2) finding local neighborhoods of these centroid points using a radius-based ball query; and 3) applying a mini-PointNet in each neighborhood to mimic the concept of a convolution kernel and conduct convolution-like operations in point space for the purpose of local feature extraction. The presented method proved to be robust toward nonuniform sampling density, which might occur due to perspective effects, variations in radial density, motion, and so on. Compared to PointNet, PointNet++ achieves an improved classification accuracy on the ModelNet40 dataset.
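Step 1) above, FPS, can be sketched as follows; this is a straightforward O(K·N) NumPy version for illustration rather than the batched GPU implementation used in practice.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Pick k well-spread centroids from an (N, 3) point cloud.

    Greedily adds the point farthest from the set already selected.
    """
    n = points.shape[0]
    selected = [np.random.randint(n)]            # random seed point
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())                 # farthest from the current set
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[selected]

centroids = farthest_point_sampling(np.random.rand(4096, 3), k=512)
print(centroids.shape)  # (512, 3)
```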
2) Point cloud reconstruction and generation: PointNet was mainly implemented for discriminative tasks, such as classification and segmentation.The first approach for reconstructing a 3-D point cloud of an object from a single (monocular) RGB or RGBD image was proposed by Fan et al. [30] and is based on a generative learning-based approach.The main contributions of this work are given as follows: 1) designing a point set generator network; 2) proposing two proper loss functions for the comparison of the ground truth with the network's predictions for point sets, i.e., CD and EMD; 3) modeling uncertainty and ambiguity of the ground truth.The proposed network is composed of an encoder and a predictor part.The encoder transforms the input into an embedding space.The predictor is divided into two parallel branches: a deconvolution (deconv) branch and a fully connected (fc) branch.The deconvolution branch learns the smooth parts and main body of the object, while the fc branch learns nonsmooth parts and details.The results of these branches are then concatenated to create the final point set.In comparison to 3D-R2N2 [20], which generates a volumetric representation from single or multiview images, this method produces better results on CD, EMD, and IoU metrics.In addition, it is able to reconstruct thin structures more accurately.
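The CD loss mentioned above compares two point sets by nearest-neighbor distances in both directions. A minimal, batched PyTorch sketch of a symmetric Chamfer loss might look as follows (a squared-distance variant; the exact formulation in [30] may differ in normalization):

```python
# Symmetric Chamfer distance between two point sets.
import torch

def chamfer_distance(a, b):
    """a: (B, N, 3), b: (B, M, 3); returns a scalar loss."""
    # Pairwise squared distances between every point in a and every point in b.
    d = torch.cdist(a, b, p=2) ** 2            # (B, N, M)
    # For each point, squared distance to its nearest neighbor in the other set.
    a_to_b = d.min(dim=2).values.mean(dim=1)   # (B,)
    b_to_a = d.min(dim=1).values.mean(dim=1)   # (B,)
    return (a_to_b + b_to_a).mean()

pred = torch.rand(4, 1024, 3, requires_grad=True)
gt   = torch.rand(4, 1024, 3)
chamfer_distance(pred, gt).backward()          # differentiable w.r.t. predictions
```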
Achlioptas et al. [31] proposed a solution for generative tasks and unsupervised representation learning based on an end-to-end pipeline that can reconstruct point clouds using deep autoencoders (AEs) and GANs. The autoencoder extracts features by learning a lower dimensional representation of the input, based on which the GAN [136] generates point clouds. In the autoencoder architecture, the authors exploited a PointNet-like encoding scheme to learn compact representations. The encoder generates a latent code that is invariant to the order of the input points. The latent code is converted back to a point cloud using a standard deep network with three fc layers as a decoder. The authors further investigated three different approaches for point cloud generation: 1) a GAN operating on the raw point cloud; 2) a latent-GAN, which is a plain GAN trained on the latent space of the pretrained AE; and 3) Gaussian mixture models operating on the latent space learned by the AE. The study indicated that the proposed AE provides good generalization capacity toward unseen data. However, the output of the proposed DL model architecture is limited to 2048 points, and generating high-quality surfaces with such a small number of points is challenging.
Another closely related approach that attempts to solve unsupervised learning challenges using deep autoencoders is FoldingNet [32]. The presented architecture, as illustrated in Fig. 12, utilizes a simple graph-based scheme as the encoder part (similar to the method proposed in [151], an improved and generalized version of PointNet) in order to encode local neighborhood structure information. Since applying convolution operations on graphs is difficult, the authors suggested building the k-nearest neighbor graph (k-NNG) and repeatedly applying max-pooling operations on each node's neighborhood. This way, the DL model is able to capture locality and extract features of neighboring points. For the decoder part, a folding-based scheme is proposed to reconstruct the point cloud by deforming a 2-D grid template. Due to the fact that 3-D point clouds are often sampled from object surfaces, one can make the assumption that any 3-D object surface can be converted and squeezed into a 2-D plane. It is also possible to reverse this process, i.e., wrapping 3-D shapes with a fixed 2-D sheet of paper (plane). This property builds the foundation of the proposed method.
The decoder maps 2-D points from a 2-D template grid to the surface of the 3-D object using folding operations. The definition of the folding operations, i.e., the 2-D-to-3-D mapping, is the main contribution of that work, making it the first single learned parametric function embedding a (gridded) 2-D (point) manifold into 3-D space and a fundamental building block for other surface reconstruction approaches. FoldingNet's decoder requires only about 7% of the parameters of the fc decoder proposed by Achlioptas et al. [31], yet it was shown to perform better at feature extraction in terms of classification accuracy and reconstruction loss. Overall, FoldingNet achieves higher classification accuracy than other unsupervised methods, such as the LFD approach [122], SPH [130], the TL-embedding network [22], and 3D-GAN [23].
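A folding-style decoder can be sketched as below; the two-stage folding of a fixed 2-D grid conditioned on a shape codeword follows the idea described above, while the grid size, codeword dimension, and layer widths are illustrative assumptions rather than FoldingNet's actual configuration.

```python
# Sketch of a folding-style decoder: a fixed 2-D grid is concatenated with the
# shape codeword and "folded" into 3-D by shared MLPs.
import torch
import torch.nn as nn

class FoldingDecoder(nn.Module):
    def __init__(self, code_dim=512, grid_size=45):
        super().__init__()
        # Fixed 2-D template grid in [-1, 1]^2 with grid_size^2 points.
        u = torch.linspace(-1.0, 1.0, grid_size)
        grid = torch.stack(torch.meshgrid(u, u, indexing="ij"), dim=-1).reshape(-1, 2)
        self.register_buffer("grid", grid)
        self.fold1 = nn.Sequential(nn.Linear(code_dim + 2, 256), nn.ReLU(),
                                   nn.Linear(256, 3))            # first folding
        self.fold2 = nn.Sequential(nn.Linear(code_dim + 3, 256), nn.ReLU(),
                                   nn.Linear(256, 3))            # second folding

    def forward(self, codeword):                  # codeword: (B, code_dim)
        B, M = codeword.shape[0], self.grid.shape[0]
        code = codeword.unsqueeze(1).expand(B, M, -1)
        grid = self.grid.unsqueeze(0).expand(B, M, 2)
        pts = self.fold1(torch.cat([code, grid], dim=-1))  # grid -> intermediate 3-D
        pts = self.fold2(torch.cat([code, pts], dim=-1))   # refine the folded surface
        return pts                                          # (B, M, 3) point cloud
```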
PointFlow [33] is a 3-D point cloud generation framework that learns a distribution of distributions, i.e., the distribution of shapes and the distribution of points for a given shape. A VAE is applied to transform 3-D points sampled from the point prior into a realistic point cloud conditioned on a shape vector. The distributions are modeled in two steps. First, the distribution of the latent space of shapes is learned. To enable the method to sample multiple shapes, PointFlow extracts latent vectors of different shapes. A sampled Gaussian vector (a shape prior) is transformed into a shape latent vector using a continuous normalizing flow (CNF) [152], [153], [154]. In the second step, the distribution of points on a specific shape is learned for shape generation. Given a sampled 3-D Gaussian point cloud (point prior) and a shape latent vector inferred from the first step, a CNF is used to move the input points to their new locations and transform them into the target shape. For generative tasks, PointFlow outperforms the methods proposed in [31] in terms of the 1-nearest neighbor accuracy (1-NNA) metric while having fewer parameters. With respect to the EMD score, it achieves better autoencoding performance than Achlioptas' method [31] for point cloud reconstruction from inputs.

C. Mesh-Based Representations
Meshes are an irregular type of data that is difficult to predict with neural networks. Their components are vertices, edges, and faces, which together define the connectivity and geometry of a surface.
1) Patch-Based Approaches: Groueix et al. [34] introduced a method for 3-D surface generation, called AtlasNet, as illustrated in Fig. 13. They suggested generating a 3-D surface and representing it as a set of folded 2-D squares. The input shape can be either a 2-D image or a 3-D point cloud; the method outputs the corresponding 3-D mesh and its atlas parameterization. The approach thus covers two tasks: autoencoding an input 3-D point cloud and reconstructing a 3-D shape from an input RGB image. 3-D point clouds are encoded using a PointNet-based encoder, which transforms the input point cloud into a 1024-D latent vector. Input images are encoded using ResNet-18 [173]. The decoder consists of four fc layers, which extract the final surface. The target 3-D surface is estimated using multilayer perceptrons (MLPs), which learn the local mapping of 2-D points to 3-D surface points. Therefore, by transforming the 2-D squares to the 3-D surface using learnable parametrizations, i.e., MLPs or patches, the final surface is covered in a way similar to putting paper strips on a shape to make a papier-mâché. The difference between the proposed method and FoldingNet [32], which is a folding-based method, is that FoldingNet deforms just one 2-D square or patch, while AtlasNet investigates a varying number of 2-D squares. Results from AtlasNet showed that the usage of multiple patches improves 3-D reconstruction. For SVR from a 2-D RGB image, AtlasNet yields qualitatively better performance compared to the dense voxel-based method 3D-R2N2 [20], the octree-based method HSP [28], and a point-based method [30]. Furthermore, it was shown that AtlasNet provides good generalization properties; however, it generates artifacts such as self-intersecting parts and overlapping patches.
Badki et al. [37] proposed an approach to extract a 3-D mesh from a noisy, sparse, unordered, and nonoriented set of points.Instead of learning shape priors at the object level, the method learns them locally while enforcing global consistency.In order to represent these priors and local features, small mesh patches, called meshlets, were used.These meshlets can be interpreted as a dictionary of local features and learned priors.The final mesh is the union of all meshlets.The authors used a VAE for learning the priors by using a very large dataset of meshlets, which was extracted from objects in the ShapeNet dataset.During training, the local priors are learned with meshlets.At inference, meshlets are deformed to match the input point cloud via distance minimization.Since individual meshlets are updated independently in order to adapt to the points, the overall mesh extracted from their union is not watertight.Therefore, a global consistency step is performed to eliminate inconsistencies across all meshlets, as illustrated in Fig. 14.Compared to occupancy networks [45] and AtlasNet [34], which are class-specific algorithms that learn priors at the object level, and deep geometric priors [174], this method produces better quantitative results in terms of CD and HD metrics.It also performs qualitatively well at reconstructing objects from unseen classes during training, coping with noise, and being robust to dramatic changes in the object's pose.
For all the aforementioned methods, mesh patches and the tessellation process may affect the quality of the final surface, especially for complex shapes.Therefore, these approaches may generate self-intersecting meshes and might be unable to generate closed surfaces.
2) Deformable Template-Based Approaches: Deformable template-based approaches take a template mesh with predefined interconnections as input, deform the vertices, and predict the final shape based on this.These approaches can generally reconstruct meshes and shapes with simple topology; however, they struggle to generate complex structures with a lot of details.Wang et al. [38] designed Pixel2Mesh, an end-to-end reconstruction pipeline for extracting a 3-D triangular mesh from a single RGB image.Taking an input image and an ellipsoid with fixed numbers of edges and vertices as the initial mesh, it gradually deforms the mesh using a graph-based CNN [graph convolutional network (GCN)] to generate the final 3-D shape.As illustrated in Fig. 15, the overall method is composed of two main parts.1) An image feature network (2D-CNN), which is used to infer perceptual features using an input color image.2) A three-block cascaded mesh deformation network (graph-based ResNet) that takes care of initial mesh deformation in a coarse-to-fine manner.Each graph-based ResNet block takes the perceptual feature concatenated with 3-D feature encoding of the input mesh as input.In their study, the authors showed that Pixel2Mesh outperforms 3D-R2N2 [20] and the pointbased method proposed by Fan et al. [30] in terms of the mean of F-score, CD, and EMD metrics.Qualitywise, it produces smoother surfaces with local details.Nevertheless, the approach shows generalization issues and can only generate meshes and objects of topologies similar to the initial mesh.
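The vertex-deformation idea behind such graph-based refinement can be illustrated with a single, simplified graph-convolution step that maps vertex features to per-vertex 3-D offsets; the layer definition below is a generic sketch and not the GCN variant used in Pixel2Mesh.

```python
# One simplified graph-convolution step over mesh vertices: each vertex is
# updated from its own feature and the mean of its neighbors' features, and a
# small head maps the result to a 3-D displacement of the template vertex.
import torch
import torch.nn as nn

class VertexGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim)
        self.w_neigh = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):          # x: (V, in_dim), adj: (V, V) 0/1 adjacency
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = adj @ x / deg           # mean over neighboring vertices
        return torch.relu(self.w_self(x) + self.w_neigh(neigh))

conv = VertexGraphConv(in_dim=3, out_dim=64)
head = nn.Linear(64, 3)
verts = torch.rand(100, 3)              # template vertices (e.g., an ellipsoid)
adj = (torch.rand(100, 100) > 0.9).float()
adj = ((adj + adj.T) > 0).float()       # symmetrize the toy adjacency matrix
new_verts = verts + head(conv(verts, adj))   # deformed vertex positions
```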
Pixel2Mesh++ [39] builds upon Pixel2Mesh to produce 3-D meshes from multiview images. The main idea is that adding more images (three to five) of an object as input provides more information for a shape generation method and, thus, results in more accurate and detailed reconstructions. Pixel2Mesh++ consists of a multiview deformation network (MDN), which processes cross-view information to predict optimal deformations. First, a coarse mesh is produced by Pixel2Mesh, which is then fed to the MDN part to be refined progressively by adding details. With regard to the F-score metric, Pixel2Mesh++ generates better results than 3D-R2N2 [20], the learned stereo machine (LSM) [175], and two other baselines that the authors implemented using Pixel2Mesh [38]. In addition, it generalizes well across various semantic categories and produces high-quality outputs with accurate details.
Recent efforts by Kanazawa et al. [40] utilized a CNN image encoder followed by three modules for 3-D shape generation, camera pose estimation, and texture prediction. The CNN acts as an encoder, producing a latent representation of a single input image, which is fed to the three prediction modules. The 3-D structure of a shape is generated by deforming a learned category-specific mean shape with instance-specific predicted deformations. Texture is parameterized as a UV image that is predicted using texture flow. This mechanism enables the method to transfer the texture of one instance onto another. However, it cannot reproduce the detailed structure of the input shape. The presented approach obtains comparable results to the one proposed by Kar et al. [133] in terms of the IoU metric. Kar et al. [133] exploited segmentation masks and optionally a set of keypoints as annotations during inference to generate 3-D rigid objects. Contrary to that, the method of Kanazawa et al. [40] only utilizes these annotations during training and directly predicts a 3-D structure from an unannotated input image at inference time.
Hanocka et al. [41] introduced Point2Mesh for reconstructing meshes from point clouds.The core idea is a mesh fitting process for the reconstruction of the final mesh.

Fig. 15. Pixel2Mesh network [38] is a deformable template-based approach that reconstructs a 3-D triangular mesh from a single RGB input image. It consists of three mesh deformation blocks used for mesh resolution enhancement and vertex location estimation.
In addition to the input point cloud, an initial watertight mesh is fed to the network.This initial mesh represents a coarse approximation of the point cloud, which is iteratively deformed from outside-in using a CNN to fit the input point cloud, as illustrated in Fig. 16.Accordingly, a network learns displacement and deformation of the mesh vertex positions.The optimization of Point2Mesh is based on MeshCNN [176], which is a CNN-based pipeline applied on triangular meshes.Unlike Screened PSR, Point2Mesh is agnostic to normal orientation and ensures watertight reconstructions from noisy input with missing parts and unoriented normals.It also achieves a higher F-score compared to Screened PSR [9] and deep geometric priors [174] for shape denoising and completion.However, Point2Mesh requires a large amount of compute time and memory, possibly alleviated by data parallelism or model parallelism [177].
3) Other Mesh Generation Methods: Liao et al. [42] investigated end-to-end 3-D surface prediction using a differentiable MC (DMC) algorithm. In previous research, surface prediction was solved in two steps: first, predicting an intermediate SDF/occupancy representation using an auxiliary loss and, second, extracting the 3-D mesh in a separate postprocessing step, such as the MC algorithm. However, applying backpropagation to the MC algorithm is intractable due to its nondifferentiability. Hence, in order to unite these steps into an end-to-end framework, the authors inserted a differentiable formulation as a final layer into a 3-D-CNN. A point cloud, which is used as input, is directly converted into a volumetric representation using a grid pooling operation, e.g., max pooling in each cell. An encoder-decoder network with skip connections is then used to process the pooled features, with the decoder operating in volumetric space. That way, it not only estimates occupancy probabilities but also predicts the vertex displacement field for a surface mesh. Compared with baseline methods that infer occupancy or TSDF first and then apply MC as a postprocessing step, DMC achieves superior results with respect to CD, accuracy, and completeness metrics. Nevertheless, difficulties may arise while reconstructing very thin surfaces, and disconnected parts can become connected.
Scan2Mesh [43] is a generative model that combines convolutional and graph neural network architectures to predict a complete, lightweight, and structured 3-D mesh representation from an unstructured and incomplete range scan of an object. The aim is to predict both vertex locations and edges. Initially, a feature space is computed through a set of 3-D convolutions from the input TSDF. The vertices are then predicted based on the extracted features. A fully connected graph is generated from the predicted vertices, in which all of the vertices are connected to each other via edges. Next, a graph neural network is used to classify edges and extract the ones that belong to the mesh graph structure. Using this intermediate graph of predicted edges and vertices, a dual graph is created, which comprises a set of valid potential faces. Finally, another GNN is applied to predict the final face structures from the dual graph. Scan2Mesh offers better qualitative and quantitative performance compared to 3-D ShapeNets [98], 3D-EPN [24], and PSR [8], [9]. However, it depends on fully connected graphs for predicting edges, which leads to limitations in model size (MS).
Mesh R-CNN [44] is an approach that unifies 2-D perception and 3-D shape prediction. It takes a single RGB image as input, detects 2-D object instances in the image, and creates a category label, bounding box, segmentation mask, and 3-D mesh prediction for each detected object as outputs. Mesh R-CNN utilizes Mask R-CNN [178], an end-to-end region-based 2-D object detector, for the detection of 2-D objects. The 3-D shape prediction step depicted in Fig. 17 is based on a hybrid approach, which first produces a coarse voxel representation of a detected object, transforms this voxelization into an initial 3-D triangular mesh, and, finally, refines this mesh by modifying the vertex positions using a GCN. This approach achieves better results compared to a voxel-based method, such as 3D-R2N2 [20], a point-based method [30], and a mesh-based method, such as Pixel2Mesh [38], in single-image shape prediction considering the CD and F1-score metrics.
Liu et al. [35] attempted mesh reconstruction from input point clouds by fully utilizing the input and simply adding connectivity to the existing points. Toward this end, they introduced a deep point cloud network that proposes candidate triangles and predicts faces. This information is provided as input to a mesh generation module. First, a k-nearest neighbor (k-NN) graph is built for each point in the input point cloud in order to decide which three points should form a triangle face and infer candidate triangle proposals. Next, an MLP network is employed to classify candidate triangles and filter out incorrect triangles, such as the ones that connect two independent but spatially adjacent parts of the shape, using the intrinsic-extrinsic ratio (IER). To infer the local connectivity between the vertices of a triangle, the ratio of the geodesic distance (intrinsic metric) to the Euclidean distance (extrinsic metric) was proposed. Finally, in a postprocessing step, the remaining candidate triangles are sorted and merged in a greedy way to generate the final mesh. The approach outperforms several learning-based methods, such as AtlasNet [34], deep geometric priors [174], deep MC [42], and DeepSDF [50], as well as traditional reconstruction methods, such as PSR [8], [9], MC [125], and BPA [10], in terms of F-score, CD, and NC metrics. Moreover, it generates higher quality outputs with fine-grained structures than the aforementioned methods and offers the capability to be transferred to unseen categories.
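As a rough illustration of the IER, the snippet below approximates the geodesic (intrinsic) distance by shortest paths on a k-NN graph of the point cloud and divides it by the Euclidean (extrinsic) distance; the function name, neighborhood size, and use of SciPy's Dijkstra routine are assumptions made for this sketch, not the authors' implementation.

```python
# Intrinsic-extrinsic ratio between two points of a point cloud: the geodesic
# distance is approximated on a k-NN graph, the Euclidean distance is direct.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra
from scipy.spatial import cKDTree

def intrinsic_extrinsic_ratio(points, i, j, k_neighbors=8):
    tree = cKDTree(points)
    dist, idx = tree.query(points, k=k_neighbors + 1)   # first neighbor is the point itself
    rows = np.repeat(np.arange(len(points)), k_neighbors)
    cols = idx[:, 1:].ravel()
    graph = csr_matrix((dist[:, 1:].ravel(), (rows, cols)),
                       shape=(len(points), len(points)))
    geodesic = dijkstra(graph, directed=False, indices=[i])[0, j]   # intrinsic
    euclidean = np.linalg.norm(points[i] - points[j])                # extrinsic
    return geodesic / euclidean    # close to 1 for points on the same smooth patch

pts = np.random.rand(500, 3)
print(intrinsic_extrinsic_ratio(pts, 0, 1))
```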
Daroya et al. [36] proposed a recurrent neural network (RNN)-based method, called recurrent edge inference network (REIN), to produce triangulated surface meshes from sparse input point clouds using a bottom-up approach. The network tries to predict edges sequentially and generates a mesh by processing points one at a time from a queue of points. The latent vector of the input point cloud, which is inferred by a PointNet-based [148] autoencoder, is also used to enrich the data with global structure information about an object. For edge prediction, the authors relied on recurrent networks, inspired by GraphRNN [179]. An RNN can be a good choice for inferring sequential predictions based on previous states [180]. To tackle the memory issues of processing large point clouds, small sections of the input point cloud are fed into the network one at a time, instead of processing all of it at once. In each small section, points in the queue are processed consecutively by REIN in two steps. 1) Edge Prediction: REIN tries to predict connections, i.e., edges, between the new vertex (which was chosen from the queue) and the current partially predicted mesh. Two RNNs are used for edge prediction: the State RNN and the Edge RNN. The State RNN encodes the current state of the graph with its nodes and edges, given a point cloud and its latent vector as input. The Edge RNN attempts to predict the sequence of edges considering the current state. 2) Face Generation: All of the vertices and predicted edges are investigated to form faces. However, the face generation module encounters problems generating surfaces from edge predictions, especially for nonmanifold surfaces. Qualitatively and quantitatively, REIN produces better mesh surfaces than BPA [10] and PSR [8].

D. Implicit Neural Representation
Neural networks are universal function approximators [181]; hence, they can be used to approximate any measurable function, including SDF and occupancy/indicator function, or to model other properties, such as radiance fields.Neural networks that parameterize such implicitly defined functions, without explicitly parameterizing the surface or properties of interest, are considered implicit neural representations [51].
Similar to implicit functions stored in discretized voxel grids, different functions can provide geometric information for parameterizing a surface by a neural network [123].There are also other functions that focus on capturing surface-related properties, such as appearance, texture, or reflectance properties.In particular, these functions can be as follows.
1) Level set methods define a DF f over the entire 3-D domain and then extract the zero-level set f = 0 as the boundary of the input object, as illustrated in Fig. 18. They divide 3-D space into three parts: an interior, an exterior, and the points lying exactly on the object's surface. Given a point (x, y, z), the function f calculates the distance of this point to the boundary of the object and specifies its sign (SDF) [182], thereby deciding the location of the point w.r.t. the surface. The sign indicates whether a point is inside or outside of the surface. Therefore, in contrast to SDFs stored in voxels, which discretize 3-D space and store an SDF value in each voxel, SDFs in implicit neural representations are calculated for each point individually using a neural network. DeepSDF [50], which will be explained further down in this survey, was the first paper to propose this approach.
2) Occupancy functions model an approximate likelihood of whether a point is occupied by part of an object or not. This can be expressed as a binary classification problem, classifying a point as occupied or unoccupied. The approach can be interpreted as a special case of the SDF that only considers the sign of the SDF values [50]. Occupancy networks [45] and IM-NET [49] fall into this category and will be clarified subsequently.
3) Radiance fields refer to a set of techniques that aim to model the radiance or appearance properties of an object or scene. Notable examples of these methods include NeRF [57] and its variants, and Sections VI-D2a and VI-D2b will provide thorough explanations of them.
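For intuition, the difference between a signed distance and an occupancy function can be illustrated on an analytic unit sphere; the two toy functions below are purely didactic and unrelated to any learned model.

```python
# Toy illustration of signed distance vs. occupancy on a sphere of radius 1.
import numpy as np

def sdf_sphere(p, radius=1.0):
    """Signed distance: negative inside, zero on the surface, positive outside."""
    return np.linalg.norm(p, axis=-1) - radius

def occupancy_sphere(p, radius=1.0):
    """Occupancy: 1 if the point is inside the shape, else 0 (the sign of the SDF)."""
    return (sdf_sphere(p, radius) < 0).astype(np.float32)

p = np.array([[0.0, 0.0, 0.5], [0.0, 0.0, 2.0]])
print(sdf_sphere(p))        # [-0.5  1. ]
print(occupancy_sphere(p))  # [1. 0.]
```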

1) Implicit Neural Representation Based on Variants of SDF or Occupancy Function:
The key idea behind these implicit neural representations is to represent a shape as a neural network that takes a point in space as input and outputs some property of that space, i.e., mapping it to occupancy or signed distance of the shape at that coordinate.However, implicit neural representations cannot directly derive detailed 3-D shape features.Thus, an extraction step is needed to infer a corresponding explicit representation, such as a mesh.A possible isosurface extraction approach is the classical MC algorithm [125].
Compared to voxel-based representations, the memory cost of implicit neural representations remains constant with respect to the resolution.However, the capability of these methods to reconstruct fine details is constrained by the capacity of their underlying network architectures [51].
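A typical extraction pipeline evaluates the implicit network on a dense grid and runs MC on the sampled values. The sketch below uses scikit-image's marching_cubes; since an untrained MLP need not cross zero, the "network" here is a stand-in consisting of an analytic sphere SDF plus a small random MLP perturbation, and the grid resolution is an arbitrary choice.

```python
# Turning an implicit representation into an explicit mesh: sample the implicit
# function on a regular grid and extract its zero-level set with marching cubes.
import torch
import torch.nn as nn
from skimage.measure import marching_cubes

mlp = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))

def implicit_f(p):                      # p: (N, 3) -> (N, 1) signed distance
    # Stand-in for a trained network: sphere SDF plus a small MLP perturbation,
    # which guarantees that a zero crossing exists inside the sampled grid.
    return (p.norm(dim=-1, keepdim=True) - 0.6) + 0.05 * mlp(p)

res = 64
lin = torch.linspace(-1, 1, res)
grid = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), dim=-1)
with torch.no_grad():
    values = implicit_f(grid.reshape(-1, 3)).reshape(res, res, res).numpy()

# Vertices and faces of the reconstructed triangle mesh (zero-level set).
verts, faces, normals, _ = marching_cubes(values, level=0.0,
                                          spacing=(2 / (res - 1),) * 3)
```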
As mentioned previously, occupancy networks, IM-NET, and DeepSDF [45], [49], [50] represent concurrent pioneering works on implicit neural representations. Mescheder et al. [45] introduced a new representation for 3-D geometry, called occupancy networks, which predicts a continuous occupancy function using a neural network for the extraction of 3-D meshes. As illustrated in Fig. 19, the occupancy function is approximated with a DNN that determines an occupancy probability value between 0 and 1 for every possible point in 3-D space (similar to a neural network for binary classification). The mesh is then generated from the occupancy network by utilizing a simple multiresolution isosurface extraction (MISE) algorithm, which employs octree structures and the MC algorithm [125]. This expressive approach does not require the discretization of 3-D space. The representation can be inferred from different kinds of input, such as single images, noisy point clouds, and coarse discrete voxel grids, and can encode various structures efficiently. In comparison to methods using different 3-D representations, such as 3D-R2N2 [20] (a voxel-based method), point set generating networks [30] (a point-based method), and Pixel2Mesh [38] and AtlasNet [34] as mesh-based techniques, occupancy networks show competitive qualitative and quantitative results across these input types.
In a similar fashion, Chen and Zhang [49] attempted to solve 3-D shape analysis and synthesis problems by proposing an implicit field decoder (IM-NET), which is based on the application of binary classifiers.Based on two inputs, a point coordinate and a feature vector encoding a shape (extracted from a shape encoder), IM-NET specifies whether the point is inside or outside the surface, using only the sign of its SDF.They utilized their proposed implicit decoder as the decoder part of some conventional frameworks (such as autoencoders (AEs) and GANs) and proposed IM-AE and IM-GAN, respectively.IM-AE and IM-GAN can be used for both 3-D reconstruction and shape generation tasks.Based on visual results, IM-AE generates smoother and high-quality surfaces compared to a classical 3-D-CNN-based decoder implementation, operating on voxelized shapes.IM-GAN showed better performance compared to AtlasNet [34] (in which output quality is constrained by the number of generated points) and 3D-GAN [23] (low coverage).For the single-view 3-D reconstruction task, the proposed framework constructs higher quality results than AtlasNet [34] and HSP [28].However, applying the implicit decoder on each point in the training set increases training time considerably.In addition, the network does not generalize well to other categories since it is trained individually for each shape category.
With DeepSDF [50], a novel shape representation based on the concept of SDFs was introduced. Instead of storing the SDF in a discretized regular grid, as done in classical surface reconstruction techniques, the network directly learns continuous SDFs of 3-D shapes from point samples. The trained network predicts the SDF value for the input data, from which the zero-level set surface can be extracted. The zero isosurfaces can be rendered and visualized through raycasting or polygonization algorithms, e.g., MC [125]. The network takes (x, y, z) coordinates and a shape encoding vector as input to model a dataset of shapes. In order to obtain a meaningful latent space of shapes, an autodecoder is used to learn a shape embedding without an encoder. One of the advantages of the method is that the network size is considerably smaller compared to voxel-based methods. DeepSDF outperforms AtlasNet [34] (a mesh-based method) and OGN [29] (an octree-based method) in reconstructing complex topologies with fine details. It further outperforms 3D-EPN [24] (SDFs stored in voxels) for the shape completion task.
Sitzmann et al. [51] introduced a novel architecture, called sinusoidal representation networks (SIRENs), i.e., fc neural networks that use a periodic sine function as their nonlinearity for implicit neural representations. The motivation behind this lies in the fact that many recently published studies on implicit neural representations employing rectified linear unit (ReLU)-based MLPs are incapable of capturing high-frequency details of the input signal. There are two possible explanations for this phenomenon. 1) Conventional neural network architectures encounter difficulties in learning to apply the same function at two different coordinates, and thus, the learned functions are not shift-invariant in general. 2) ReLU nonlinearities cannot parameterize any signal that has information in its second derivative since their second derivative is zero everywhere. Therefore, the authors suggested replacing conventional nonlinearities, such as tanh or ReLU, with a periodic sine activation function to improve the final results. This replacement results in a certain degree of shift-invariance and also addresses the second-derivative problem since the derivative of a sine is a shifted sine itself. The method was applied to a wide variety of areas, including image, audio, and video representations, 3-D reconstruction, and solving first- and second-order differential equations. In the 3-D shape reconstruction task, SIREN captures details of complex objects and scenes better than ReLU-based implicit representations, such as NeRF [57].
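A sine-activated layer of this kind can be sketched as follows; the frequency factor omega_0 and the uniform initialization bounds follow the general recipe described in the paper, but the concrete values and layer sizes here are illustrative.

```python
# Minimal sine-activated (SIREN-style) layer and network.
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_dim, out_dim, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_dim, out_dim)
        with torch.no_grad():
            # Wider init for the first layer, scaled init for subsequent layers.
            bound = 1 / in_dim if is_first else math.sqrt(6 / in_dim) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        # sin(omega_0 * (Wx + b)): smooth, with nonvanishing higher derivatives.
        return torch.sin(self.omega_0 * self.linear(x))

siren = nn.Sequential(SineLayer(3, 256, is_first=True),
                      SineLayer(256, 256),
                      nn.Linear(256, 1))      # e.g., SDF value at (x, y, z)
print(siren(torch.rand(8, 3)).shape)          # torch.Size([8, 1])
```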
a) Methods based on unsigned distances: Some studies exploit unsigned distances instead of occupancy or signed distances for learning representations. With sign agnostic learning (SAL), Atzmon and Lipman [52] proposed a DL approach based on raw input data without any oriented normals or signs. Generally, regression-based methods, such as DeepSDF [50] or occupancy networks [45], utilize a regression loss for training and need inside/outside ground-truth information for this process. In contrast to these methods, SAL uses a sign agnostic loss function that can be applied directly to raw unsigned data. The algorithm generates high-quality surfaces in comparison to AtlasNet [34] and a baseline method that approximates the SDF based on the work in [9]. The D-Faust dataset, which comprises raw scans of humans in various poses, is used for the experiments. Although the signed implicit ground-truth representation is not needed in the loss function during training, and closing the surfaces of the training data is unnecessary, SAL still predicts an SDF as the final output. As a consequence, it closes gaps even in open surfaces and generates only closed outputs (closed surfaces, in this case, divide 3-D space into three regions: inside, outside, and on the surface of an object, and they do not have separate parts). Neural distance field (NDF) [53] is a method to predict the unsigned distance field of 3-D surfaces using a neural network. Similar to SAL, NDF does not close shapes during training. However, in contrast to IF-Net [56] and SAL [52], it can successfully generate open surfaces, shapes with inner structures, and open manifolds.
DUDE [54] is another approach, which is able to represent a surface by combining the unsigned distance field with the normal vector field.Evaluation of this method in comparison to DeepSDF and SAL demonstrates its superiority in producing high-quality outputs, especially for open surfaces, with visually pleasant renderings.The main difference between NDF and DUDE compared to SAL is that the first two can reconstruct both open and closed shapes with complex and detailed topology, while the latter attempts to close parts that should be open.
b) Part-based approaches: Encoding an entire surface into a single latent vector can lead to substantial information loss since the limited size and capacity of the latent representation causes accuracy and generalization issues [48].In order to solve the difficulties of generalizing to other shape categories and scaling to large scenes, researchers resort to conditioning an implicit neural representation on local geometric features [46], [47], [48], [55], [56], [183], [184].There are different approaches to the implementation of such conditioning.Some approaches fuse the volumetric representation (voxel grids) with the implicit neural representation and use local features stored in voxels for inferring implicit neural representation [46], [47], [55], [56].Others use local patches to learn implicit neural functions [48], [183], [184].All of these methods leveraged the advantages of encapsulating local and global information for proposing more generalizable and scalable approaches.
Jiang et al. [55] suggested the local implicit grid (LIG) representation, which decomposes 3-D space into a regular grid of overlapping part-sized local regions and encodes each region with implicit feature vectors. The key idea behind the algorithm is that objects in different categories share similar geometric features and details not at the microscale, i.e., a very small patch, or the macroscale, i.e., the entire object, but at the part scale. Therefore, a part autoencoder was used to learn embeddings for different parts of an object and extract a meaningful abstraction of its shape. The autoencoder consists of a 3-D-CNN encoder and an implicit network decoder in the form of a reduced version of the IM-NET [49] decoder. During inference, a pretrained implicit function decoder is used in each grid cell in order to generate the respective scene part. Eventually, the overlapping latent grids are optimized via the proposed mechanism to reconstruct the entire scene. Since this method generalizes shape priors learned from object datasets, it does not need any training on scene-level datasets for reconstructing scenes from sparse oriented point samples. Therefore, it generates higher quality outputs from unseen object categories than other methods, such as IM-NET [49], since IM-NET learns only a single embedding for an entire object. Compared to traditional surface reconstruction methods, such as PSR [8], [9], LIG is capable of recovering thin structures and details very well.
Likewise, Chibane et al. [56] introduced implicit feature networks (IF-Nets) that are composed of an encoding and a decoding tandem.The network takes voxels or point clouds as the input and predicts whether point p lies inside or outside of an object, resulting in a continuous surface at arbitrary resolution.To encode local and global structures of a 3-D shape, a 3-D multiscale grid of deep features is extracted instead of using a single vector to summarize an entire object.Consequently, rather than classifying (x, y, z) point coordinates directly, the decoder classifies a point based on these extracted features and creates occupancy predictions.IF-NET achieves better quantitative results than occupancy networks [45], point set generation network [30], deep MC [42], and IM-NET [49] in point cloud completion, voxel super-resolution, and single-view human reconstruction tasks.Moreover, Chibane and Pons-Moll [185] proposed an extension of IF-Nets for 3-D texture completion.
Peng et al. [46] developed convolutional occupancy networks, a hybrid voxel grid/implicit neural representation-based approach that combines convolution operations with implicit representations in the form of a convolutional encoder with an implicit occupancy decoder. The method is independent of the input representation. Given a point cloud or voxel grid as input, the method uses a 2-D plane encoder/3-D volume encoder based on PointNet to process the input by converting it into features and projecting these local features onto plane(s)/a volume. A convolutional 2-D plane decoder/3-D volume decoder further processes the feature plane(s)/volume using 2-D/3-D U-Nets [186], [187], integrating both local and global information. In the end, a small fc occupancy network [45] is used to predict the occupancy probability for a given query point p from its feature in 3-D space. For rendering and extracting meshes from the input, the MISE algorithm is applied during inference. Evaluation of both object- and scene-level reconstructions was performed using synthetic and real-world datasets. The major difference between this method [46] and the original occupancy networks [45] is that convolutional occupancy networks capture the local features of the space in addition to global features, leading to higher generalizability, scalability, and faster training. Moreover, it benefits from the translational equivariance property of convolutional networks, while not supporting rotational equivariance.
In a similar work, Chabra et al. [47] introduced deep local shapes (DeepLSs), a method for deep shape representation, which uses learned local shape priors.As illustrated in Fig. 20, the key idea is the decomposition of a shape into small components in order to improve reconstruction results.To this end, local information of these components is stored in a grid of independent latent codes.Based on these, SDFs are predicted by applying DeepSDF [50] as a local shape neural network to each grid cell.DeepLS outperforms DeepSDF in accuracy and inference time by approximately an order of magnitude.
Unlike occupancy networks [45] and DeepSDF [50], which extract the global latent code vector from the entire input, local patches are modeled as deep implicit functions in patch-based approaches [48], [183], [184].Erler et al. [48] presented a patch-based learning framework, called Points2Surf, which generates accurate implicit surfaces directly from raw point clouds without surface normals.The underlying algorithm is based on the notion of considering a shape as a collection of small shape patches.Instead of representing an entire surface as a single latent vector, Points2Surf creates separate feature vectors for different patches to describe local details in addition to global information.By decomposing the surface reconstruction problem into learning a global function (that learns the sign of SDF) and a local function (that learns the absolute distance field of SDF with respect to local patches), Points2Surf succeeds in being robust to noise and missing parts and also generalizing well to unseen shapes.In addition, Points2Surf yields a significant drop in the reconstruction error on unseen classes compared to both data-and nondata-driven methods, such as DeepSDF [50] and AtlasNet [34], or SPR [9].However, this patch-based approach results in longer computation time, inconsistencies between outputs of neighboring patches, and nonwatertight and bumpy surfaces.
c) Equivariant neural networks: Chatzipantazis et al. [203] introduced an SE(3)-equivariant coordinate-based attention network called TF-ONet for 3-D surface reconstruction.Local shape modeling and equivariance are the two core design principles of this method.SE(3) stands for special Euclidean group in three dimensions representing transformations including translations and rotations in 3-D.In simple terms, equivariance means that, when the pattern in the input changes, i.e., when it is rotated or shifted to a specific direction, the output should also change in an equivalent proportion.TF-ONet works directly on unoriented and irregular point clouds and outputs the occupancy field of a shape.To predict the occupancy score at any given point in space, TF-ONet creates equivariant features for each point that function as keys and values of specialized attention blocks.This enables TF-ONet to output high-quality reconstructions and generalize to novel scenes composed of multiple objects, despite being trained on single objects in canonical poses.Inspired by SE(3) transformers [204] and tensor field networks [205], TF-ONet attention modules ensure equivariance by incorporating symmetries into the learning process.It is basically a two-level approach.1) The first level, i.e., an encoder, applies self-attention in local neighborhoods around each point to infer local features from the point cloud.
2) The second level, i.e., a cross-attention occupancy network, uses the extracted point features and the coordinates of a query point in space to calculate the value of the occupancy function for the specific query point.
For single-object reconstruction tasks, TF-ONet performs better than nonequivariant networks, such as occupancy networks [45], convolutional occupancy networks [46], and IF-Net [56], as well as equivariant networks, such as vector neurons [206] and GraphOnet [207], considering evaluation metrics, such as Chamfer-L1, F1-score, and IoU. For scene reconstruction tasks trained only on single objects, global shape modeling-based techniques, such as occupancy networks [45] and vector neurons [206], are not able to generalize to scenes containing multiple objects. Moreover, local shape modeling-based methods, such as convolutional occupancy networks [46], which are not equivariant under SE(3) transforms, are only able to produce low-quality objects in novel poses. TF-ONet instead excels at these tasks and can generalize to novel scenes with high quality, benefiting both from local shape modeling and its equivariance properties.
2) NeRF-Based Approaches: a) Fundamentals of NeRF: Neural radiance fields [57], commonly referred to as NeRFs, are primarily used for view synthesis. The main idea behind NeRFs is to train a model that can produce new views of a scene or an object and represent it in 3-D, given a set of 2-D images from different viewing angles as input. Hence, multiple input views of a scene and their corresponding camera poses are used to render new views of that scene by interpolating between the given views. The NeRF method employs an fc deep network to represent a scene. Each input (x, y, z, θ, ϕ) is a single continuous 5-D coordinate that encompasses spatial position and viewing direction, and each output (RGB, σ) comprises the view-dependent emitted radiance and the volume density at that particular spatial location. Consequently, the neural network describes an implicit function that exists throughout all locations as a continuous representation without any discretization. As a result, by implicitly encoding density and color through a neural network, NeRF has demonstrated impressive performance on novel view synthesis of a particular scene.
Although overfitting is usually an undesirable behavior in machine learning, the key part of this approach is the usage of a neural network that is overfitted to one particular scene and only cares about this specific scene.For rendering a new scene, it is necessary to take a fresh neural network and train it from scratch until it is overfitted to the new scene.Therefore, instead of storing a scene as a mesh or a voxel grid, the scene is stored in the weights of the neural network.For instance, if a scene consists of a tree, the weights represent this tree and are very specific to it, outputting nonsense for another scene if not being trained once again.
To explain the fundamentals of NeRF in more detail according to Fig. 21, the images first have to be transformed into 5-D coordinates (x, y, z, θ, ϕ), where (x, y, z) are the coordinates of a point in 3-D space and (θ, ϕ) describe the viewing angle. A ray is cast through each pixel of an image. Therefore, every pixel in every input picture defines a ray, which is then sampled along its length. Consequently, each input image sends out many rays, and for each ray, there are many sampled points. Next, for each location represented as (x, y, z, θ, ϕ), the neural network effectively determines the presence of an object and subsequently identifies its corresponding color. This nine-layer fc network provides four numbers (RGB, σ) as output: (RGB) is the color of that particular point, and σ is its density. The density value serves as an indicator of the presence or absence of an object in the designated region of space, as well as of how dense it is. If this process is done for all the points in space from all viewing angles, a complete 3-D representation of the scene can be inferred. The neural network outputs different results for the same location depending on the viewing angle. Accordingly, it can capture reflections, lighting effects, and transparency. Eventually, classical volume rendering techniques are employed to project the network outputs onto a 2-D image. Given that volume rendering is intrinsically differentiable, it is possible to define a loss function that measures the difference between the predicted and the ground-truth color of the ray. In order to convert NeRF to a mesh, MC can be further applied.
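Concretely, the pixel color of a ray $\mathbf{r}$ is obtained from the sampled densities and colors by the standard discretized volume rendering quadrature (notation as commonly used for NeRF):

$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j<i} \sigma_j \delta_j\right)$

where $\sigma_i$ and $\mathbf{c}_i$ are the density and color predicted at the $i$th sample, $\delta_i$ is the distance between adjacent samples along the ray, and $T_i$ is the accumulated transmittance. The loss compares $\hat{C}(\mathbf{r})$ with the ground-truth pixel color.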
To produce high-resolution complex scenes, two interesting tricks are utilized: 1) positional encoding and 2) hierarchical sampling. Positional encoding, which is similar to the one used in transformers [208], maps the 5-D input vector to a higher dimensional space using sine and cosine waves, helping the MLP approximate and represent high-frequency functions. It enhances the ability of the neural network not only to capture coarse-grained structures but also to represent finer details well. Hierarchical sampling is a two-step sampling method with two networks: a coarse network and a fine network. The points on a ray are first sampled uniformly along its length. These sampled points are run through the coarse network for density prediction. Next, an evaluation step takes place to decide where more samples should be drawn in the second round, based on the output of the previous step. Thus, the output of the coarse network discloses where the important content is. The second round of sampling concentrates on points with higher density, i.e., points closer to the perceived object, and the vicinity of such points is sampled much more densely. Both the coarse and fine networks are optimized at the same time using a joint loss.
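The frequency-based mapping can be sketched as a small standalone function; the number of frequency bands and the input scaling below are illustrative choices (NeRF uses different band counts for positions and viewing directions).

```python
# NeRF-style positional encoding: each coordinate is mapped to sin/cos features
# at exponentially increasing frequencies so the MLP can fit fine detail.
import math
import torch

def positional_encoding(x, num_bands=10):
    """x: (..., D) roughly in [-1, 1]; returns (..., D * 2 * num_bands)."""
    freqs = (2.0 ** torch.arange(num_bands)) * math.pi   # 2^0*pi ... 2^(L-1)*pi
    angles = x.unsqueeze(-1) * freqs                      # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                      # (..., D * 2L)

xyz = torch.rand(1024, 3) * 2 - 1
print(positional_encoding(xyz).shape)                     # torch.Size([1024, 60])
```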
Delving into the advantages associated with NeRFs, these methods are not tied to a fixed set of views and do not require any 3-D supervision. In addition, NeRFs are memory-efficient compared to voxel grid representations. One neural network of one scene fits into a few megabytes, which might even be smaller than the input image size for that scene, whereas dozens of gigabytes might be needed for storing the same scene in voxels. Regarding the limitations of NeRFs, what makes them impractical is their requirement for a large number of high-quality posed images as input. The more images are fed, the better the output quality will be. Another downside is related to their high computational cost, originating from optimizing each scene individually without sharing knowledge between different scenes [62]. This implies that, for every scene, the network has to be trained again, and a pretrained one cannot be utilized. For instance, it takes around 100k-300k iterations, i.e., roughly one to two days, for the naive NeRF network [57] to be trained on a single scene using a single NVIDIA V100 GPU.
b) NeRF and its variants for view synthesis: This section provides a summary of some of the papers that aim to enhance NeRF and its abilities. In NeRF++, Zhang et al. [61] analyzed NeRF and uncovered three major problems and situations in which NeRF might fail: shape-radiance ambiguity, near-field ambiguity, and the parameterization of unbounded scenes, such as large real-world scenes. The first two issues are related to the fact that NeRF is actually overparameterized, i.e., the degree of freedom for NeRF to hallucinate and move toward a completely wrong answer is high. However, the authors of NeRF [57] use an interesting implementation trick as regularization: they feed the viewing angles only into the very last layers of the MLP network. Therefore, the MLP starts with only the spatial coordinates of a point, and the viewing angles enter in the last layers, resulting in a limited degree of freedom for NeRF. Accordingly, if all 5-D coordinates were fed to the network from the beginning, the shape-radiance ambiguity would become a big issue, affecting the quality of NeRF's outputs drastically.
NeRF++ proposes a couple of solutions to tackle these three problems and enhance output quality. By introducing an auxiliary loss, NeRF can avoid moving toward a poor solution, which may lead to a completely wrong scene geometry estimation, thus addressing the shape-radiance ambiguity issue. Furthermore, adaptive near-field culling is proposed to solve the near-field ambiguity issue. It adaptively culls the front part of each view frustum based on the geometry of a scene, i.e., contrary to vanilla NeRF, it avoids estimating geometry right in front of the camera. The third issue concerns real-world scenarios in which a precise reconstruction of the objects in front of the camera is essential, but the camera also captures items far beyond these objects, which necessitates a certain level of reconstruction of the distant items as well. NeRF++ suggests an inverted sphere parameterization of the background that enables a detailed reconstruction of both the foreground and the background. This is done by training two NeRFs, one for the foreground part of the scene and the other for the background part, increasing the capacity of the model for reconstructing details at different levels. NeRF++ still needs per-scene training, and one scene takes about three days to train.
PixelNeRF [62] is built upon the concept of NeRFs for 3-D reconstruction and synthesizing photorealistic 3-D scenes from a single or a small number of posed images.PixelNeRF attempts to tackle the requirement of NeRFs for a lot of images as the input and make it generalizable.Considering the fact that extracting 3-D geometry and the appearance of a scene from limited input is a challenging task, and NeRFs do not share knowledge between the scenes, the framework proposes to condition an NeRF on spatial image features.Thus, pixelNeRF employs a fully convolutional image encoder that infers a pixel-aligned feature grid.Then, a spatial location and its corresponding encoded feature are fed to an NeRF network for color and density prediction.PixelNeRF shows better generalization capabilities and performance compared to NeRF.However, its rendering time is still slow, and more input views cause a linear increase in the runtime.
In another concurrent work to overcome the generalizability issue and long optimization time of NeRFs, MVSNeRF [63] suggests a DNN that can reconstruct an NeRF, given only three nearby input views.This approach combines plane-swept cost volumes, which are used for geometry-aware scene reasoning in MVS, with NeRF models.To create a cost volume, MVSNeRF first warps 2-D image features onto a plane sweep.Then, a 3-D-CNN is leveraged for the reconstruction of a neural encoding volume with per-voxel neural features.Next, features interpolated from the encoding volume are employed to predict density and RGB radiance for an arbitrary point using an MLP.Achieving comparable or better rendering results, MVSNeRF can significantly surpass NeRFs [57] in terms of optimization time efficiency, i.e., roughly 30 times faster, if more images are provided as input.Moreover, it generalizes better than PixelNeRF [62] and IBRNet [209].
MipNeRF [210] attempts to address one of the problems of NeRF, which is the production of blurred or aliased renderings when dealing with training or testing images at different scales. In NeRF, all of the cameras have the same distance to an object. Thus, it is able to perform view synthesis without the need to consider scaling or aliasing. However, when new cameras are added at different scales, NeRF begins to collapse since it is a single-scale model trying to tackle a multiscale problem. To fix this issue, MipNeRF proposes several modifications to vanilla NeRF, including the following: 1) casting a cone instead of sending a ray through each pixel; 2) slicing the cone into conical frustums instead of sampling single points along each ray; 3) computing an integrated positional encoding instead of the positional encoding of a single coordinate along the ray; and 4) in general, training a single neural network that describes the scene at multiple scales instead of training separate neural networks at various scales. These new properties help MipNeRF reason about the scale of its inputs. MipNeRF is capable of producing high-resolution renderings across multiple scales rather than just at a single scale as in vanilla NeRF. NeRF's performance decreases when being trained on multiscale data, while MipNeRF's does not. The number of parameters in MipNeRF is half of that in NeRF, while also being 7% faster on their multiscale dataset. Mip-NeRF 360 [211] and ZipNeRF [212] are other recent methods for antialiasing NeRFs.
In Instant NGP, a work proposed by NVIDIA, Müller et al. [213] try to facilitate and speed up neural graphics primitive tasks. A neural graphics primitive is an object represented by a neural network that takes a query as input, such as a position and some extra parameters, and outputs appearance and shape attributes. Examples of NGPs are SDF computation, NeRFs, radiance caching, and so on. To achieve simplicity, instant training, real-time rendering, and high-quality results, Instant NGP proposes, as its main ideas, a multiresolution hash encoding that stores trainable feature vectors in a compact spatial hash table, a small neural network called a fully fused neural network, and improvements to the training and rendering algorithms.
The amount of research effort based on NeRF is increasing. From relighting [64], [65], [214], [215] and view synthesis without pose supervision [216] to learning nonrigid objects and dynamic scenes [66], [67], [68], [217], [218], [219] and tackling the computational challenges of NeRF toward real-time rendering [58], [59], [60], [220], numerous studies have been conducted to broaden the horizons of NeRF and its various applications.
c) NeRF for 3-D surface reconstruction: In an NeRF model, the scene geometry is hidden inside the neural network, i.e., it is implicit. In order to achieve 3-D surface reconstruction and transform the NeRF representation into an explicit representation, such as a mesh, a surface extraction step is essential. By analyzing and thresholding the learned density, i.e., extracting an arbitrary level set of the density function learned by NeRF, and using methods such as MC, the baseline NeRF can extract and reconstruct an approximate explicit 3-D geometry [221]. Although NeRF and its variants generate impressive results for the novel view synthesis task, they cannot output high-quality 3-D surface reconstructions. The quality of the extracted 3-D geometry is not satisfactory because the initial objective of NeRF is novel view synthesis, not 3-D surface reconstruction. Since the density-based representation used in NeRFs is flexible and does not impose enough constraints on the 3-D geometry [222], it limits the inference of accurate surface geometry, especially when ambiguities exist. Therefore, the extracted surfaces usually contain artifacts. To alleviate this issue, several papers on the 3-D surface reconstruction task have tried to incorporate implicit neural surface representation approaches based on an SDF or an occupancy function into NeRF-based methods, benefiting from the advantages of both categories. In these methods, instead of the density-based scene representation used in NeRF, the scene space is usually represented as an SDF or an occupancy function.
Oechsle et al. [223] proposed UNIfied Neural Implicit SUrface and Radiance Fields (UNISURF), a framework for 3-D surface reconstruction that captures high-quality implicit surface geometry from multiview images without the need for object masks. It unifies implicit surface models with radiance fields for the reconstruction of solid, nontransparent objects given a set of RGB images. UNISURF represents surfaces and defines object or scene geometries using occupancy values. It learns and optimizes this implicit surface via a volume rendering method like NeRF. The output mesh is extracted using the MISE algorithm [45]. Considering reconstruction quality, UNISURF outperforms NeRF [57]. There are some limiting factors for this method, including the reconstruction of solid objects only and constraints in modeling transparencies, a performance drop for overexposed or rarely visible regions in the ground-truth images, and the inability to resolve shape-appearance ambiguities, such as shadows and holes in objects.
In a concurrent attempt, Wang et al. [222] presented NeuS, which learns a neural implicit surface representation based on an SDF using volume rendering, with the goal of reconstructing the 3-D surface of an object or scene from multiple images taken from different viewpoints without mask supervision. Instead of applying standard volume rendering or standard surface rendering alone, the framework combines NeRF-inspired volume rendering with a neural SDF surface representation. The key idea is to represent the 3-D surface as the zero-level set of an SDF, i.e., with a neural implicit SDF, and to introduce a new volume rendering scheme that trains this representation robustly: images are rendered from the implicit SDF, and the network weights are learned by minimizing the difference between the rendered and the input images. NeuS performs quantitatively and qualitatively better than NeRF [57] and UNISURF [223] in high-quality surface reconstruction. One failure case of NeuS, however, is its inability to accurately reconstruct textureless regions, caused by the ambiguity such regions pose for neural rendering.
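The following is a deliberately simplified sketch, not NeuS's exact unbiased weighting scheme, of how an SDF can drive NeRF-style compositing: SDF samples along a ray are mapped to per-interval opacities through a steep logistic function, and colors are accumulated with the usual transmittance-weighted sum. The function `render_ray` and the sharpness parameter `s` are illustrative assumptions.

```python
# A simplified sketch of SDF-driven volume rendering along a single ray.
import torch

def render_ray(sdf_vals, colors, s=64.0):
    """sdf_vals: (N,) signed distances at ordered samples along one ray.
    colors: (N, 3) radiance predicted at the same samples.
    s: sharpness of the logistic; larger s concentrates weight at the surface.
    """
    phi = torch.sigmoid(s * sdf_vals)        # ~1 outside the surface, ~0 inside
    # Opacity of interval i from the drop of phi across the interval
    # (clamped to be non-negative), loosely following the NeuS formulation.
    alpha = torch.clamp((phi[:-1] - phi[1:]) / (phi[:-1] + 1e-6), min=0.0)
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha]), dim=0)[:-1]
    weights = trans * alpha                   # rendering weights per interval
    return (weights.unsqueeze(-1) * colors[:-1]).sum(dim=0)
```

In practice, the sharpness is typically a trainable quantity that increases during optimization, so the rendering weights progressively concentrate at the zero-level set of the SDF.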
Variants of NeuS [224], [225] have been proposed with the goal of improving reconstruction quality. HF-NeuS [224], a method for multiview surface reconstruction with high-frequency details, decomposes the SDF into a base and a displacement function and gradually adds high-frequency details through a coarse-to-fine strategy. Geo-Neus [225] enhances the learning of the neural SDF by combining a constraint based on the sparse 3-D points from SfM with a photometric consistency constraint from MVS.
In a similar fashion to NeuS, the concurrent work VolSDF [226] suggested a volume rendering framework for implicit neural surfaces. Its core contribution is to replace general-purpose MLP densities with densities from a specific family, in this case modeling the density as a function of the signed distance to the scene's surface. Two fully connected networks, one approximating the SDF of the learned geometry and the other representing the scene's radiance field, form the structure of this framework. Compared to NeRF [57] and NeRF++ [61], VolSDF generates more accurate results. One limitation of VolSDF is its assumption of a homogeneous object with constant density; moreover, its reconstruction time remains high because the network is trained independently for each scene.
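Schematically, and up to notation, the density model at the heart of VolSDF can be written as a scaled cumulative distribution function (CDF) of the Laplace distribution applied to the negated signed distance, so that the density transitions sharply across the zero-level set of the SDF (a sketch of the idea rather than a verbatim reproduction of the paper's parameterization):

```latex
% d_Omega(x): signed distance to the surface; alpha, beta > 0: scale parameters.
\sigma(\mathbf{x}) \;=\; \alpha\,\Psi_{\beta}\!\bigl(-d_{\Omega}(\mathbf{x})\bigr),
\qquad
\Psi_{\beta}(s) \;=\;
\begin{cases}
  \tfrac{1}{2}\exp\!\left(\tfrac{s}{\beta}\right), & s \le 0,\\[2pt]
  1 - \tfrac{1}{2}\exp\!\left(-\tfrac{s}{\beta}\right), & s > 0.
\end{cases}
```

Inside the object, the density approaches $\alpha$; outside, it decays toward zero, with $\beta$ controlling how sharp the transition across the surface is.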
Recently, SDFStudio [227], a framework for neural implicit surface reconstruction, has been released. It is built on top of nerfstudio [228] and provides a unified implementation of VolSDF, NeuS, and UNISURF, three popular neural implicit surface reconstruction techniques. Thanks to this unified and modular implementation, transferring ideas between methods is simple; for example, the idea from Geo-Neus can be integrated with VolSDF, yielding a Geo-VolSDF method.

VII. D I S C U S S I O N A N D F U T U R E T R E N D S
In Section VI, the latest attempts toward 3-D reconstruction using DL techniques were reviewed. A summary and comparison of the presented learning-based surface reconstruction approaches can be found in Table 2. Furthermore, Table 3 contains a quantitative report on the performance of some of the approaches on the ShapeNet dataset. There is a qualitative gap between 3-D models created by learning-based approaches and artist-created CAD models [43], and there are still open problems in this field. Some of these challenges are listed in the following.
In the existing approaches, serious bottlenecks are caused by computation time and limited generalization power. The requirement of long training times hinders the adoption of some DL-based approaches, and concerns have been raised about the environmental impact of prolonged training periods. To this end, designing models with fewer parameters and lower complexity, yet high performance, is a worthwhile goal; the use of transfer learning may serve as a partial solution. Regarding generalizability, methods capable of multicategory generalization, i.e., generalizing well to other topology categories, should be investigated further. One solution might be to learn latent shape spaces that are not class-specific.
Consequently, as a future direction, moving toward models with comparably shorter training times and stronger generalizability is an interesting yet reasonable strategy.
Current methods depend heavily on external supervision for annotating input data; reducing the need for supervision is a desirable trait for a learning-based approach [40]. Furthermore, although various large-scale datasets suitable for geometric DL tasks exist, there is still a need for datasets with richer 3-D annotations tailored to shape and surface reconstruction.
On the other hand, some of the current evaluation metrics fall short of capturing surface properties accurately. It is therefore necessary not to be limited to quantitative results but also to examine qualitative results to gain a deeper understanding of surface details. Moreover, devising better and more robust evaluation metrics that are at the same time computationally efficient and less complex (in point cloud comparison, CD has quadratic complexity, for instance) is another area that deserves attention.
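To illustrate the complexity remark, a brief sketch of a (squared) Chamfer distance between two point sets is given below. Definitions vary in the literature (sum versus mean, squared versus unsquared distances); the point here is the N × M pairwise distance matrix that makes the naive computation quadratic.

```python
# A brief sketch of a symmetric (squared) Chamfer distance between point sets.
import numpy as np

def chamfer_distance(P, Q):
    """P: (N, 3) array, Q: (M, 3) array. Returns the symmetric Chamfer distance."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise matrix
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```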
In the context of volumetric methods, various challenges remain. Because of the discretization of the data, some input information and details may be partially lost. The cubic growth of memory and computational costs with resolution, and the resulting poor scalability, make it difficult to infer high-resolution outputs. Considering the influence of 3-D resolution on the performance of volumetric CNNs, better results could be achieved by designing efficient volumetric CNN architectures that scale to higher resolutions [128].
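A back-of-the-envelope calculation illustrates this cubic growth, assuming a dense grid with one 32-bit value per voxel (actual memory layouts and feature channels vary):

```python
# Dense voxel grid memory as a function of resolution (4 bytes per voxel).
for res in (64, 128, 256, 512):
    megabytes = res**3 * 4 / 2**20
    print(f"{res}^3 grid: {megabytes:.0f} MB")
# 64^3: 1 MB, 128^3: 8 MB, 256^3: 64 MB, 512^3: 512 MB
```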
For point-based approaches, current methods extract a fixed and limited number of points from the point cloud and feed them to the network architecture, which affects the output quality. Overcoming this limitation and implementing models able to handle variable-length input is an ambitious yet interesting future direction.
In mesh-based approaches, it is challenging to define a loss on meshes that is easy to optimize [34]. One limitation of patch-based methods in this category, which affects the reconstruction of fine details, is the use of fixed-scale mesh patches [37]; a coarse-to-fine approach that extracts mesh patches at different scales might yield more precise outputs. Generating a closed shape with patch-based methods, as well as recognizing and segmenting shapes with them, are issues that still require solutions [34].
Implicit neural representations have recently gained popularity due to their performance and favorable properties. However, the isosurface extraction approaches used to obtain explicit representations from them are computationally intensive and, thus, constitute a bottleneck. Furthermore, it may be worthwhile to combine sign-agnostic implicit neural approaches with generative methods, such as GANs [52]. Moreover, NeRF-based approaches mostly suffer from high computational cost, long training times, and an inability to share knowledge between scenes, making them scene-specific networks. The need for many input images to obtain high-quality outputs should also be alleviated. Improving the time and computation efficiency of NeRF-based methods, their generalizability to unseen scenes, and their surface reconstruction ability are important open research questions.
In general, several aspects should not be ignored in future studies: reducing the performance gap between synthetic and real-world data; proposing better and more representative evaluation metrics for quantifying shape reconstruction results [49]; conducting research on the challenging task of scene-level reconstruction; equipping methods with multiscale, coarse-to-fine reconstruction [48]; capturing high-frequency details so that thin parts of a scene or object are reconstructed in high quality; considering equivariance when designing a neural network; and fusing the different approaches shown in Fig. 6 so as to enjoy their benefits simultaneously. In addition, the application of transformer architectures [208], i.e., DL models based on the self-attention mechanism, appears promising in 3-D vision [237], [238], [239]. Self-supervised learning [240], a technique for predicting unobserved or hidden parts of the input from its observed parts, is another interesting approach for tackling reconstruction, and computer vision problems in general, with low-quality and limited data. Furthermore, given the current interest, diffusion models [241], [242], [243], [244], which learn to generate meaningful outputs from pure noise, are another exciting approach for 3-D generation, completion, and reconstruction [245], [246].
Surface reconstruction applications are equally expected to play an increasingly important role. One of the major uses will be in observational RS-related disciplines, where surface reconstruction will aid archeological discoveries, agriculture, disaster prevention and response, and cartography. Design- or projection-based applications, including, but not limited to, 3-D modeling for games and movies, architecture, or CAD, also have great potential to benefit from learned surface reconstruction. Yet, all of the aforementioned scenarios consider only (close to) static surfaces. Accurate reconstruction of dynamically changing objects and environments, nonrigid objects or scenes, textureless regions, and transparent objects, as well as overcoming the challenges of rarely visible regions, occlusions, shadows, and holes in an object or scene, will be crucial and consequential next steps in this field of study. Overall, more applications of neural learning approaches to surface reconstruction will emerge, especially in SFX and VFX animation, human reconstruction, robotics, autonomous driving, and medicine.

VIII. C O N C L U S I O N
In this article, we provided a review of state-of-the-art approaches for learning-based 3-D surface reconstruction. We have taken no special perspective, making the manuscript accessible not only to method researchers but also to applied users seeking to contextualize these approaches for their domains.
For this, we have reiterated commonly used open and accessible benchmarking datasets, different input and output data modalities, and some acquisition techniques. To make processing results comparable, we have highlighted widely used metrics for evaluating learned models and detailed their particularities.
The main part of this article has introduced DL-based 3-D surface reconstruction approaches. In summary, these can be classified into four major categories based on their output representations: 1) voxel-based; 2) point-based; 3) mesh-based; and 4) implicit neural. For each category, we listed well-known methods and explained their contributions, challenges, strengths, and weaknesses.
Although deep 3-D surface reconstruction has made impressive progress over the last few years, several challenges remain. The following nonexhaustive list highlights the major open issues: 1) computation time; 2) generalizability; 3) energy consumption and environmental impact; 4) representation compression; 5) resolution; 6) water tightness; 7) nonrigid, dynamic, or transparent object reconstruction; and 8) reconstruction of rarely visible or occluded regions, shadows, and holes in an object or a scene. Toward the end of this article, we discussed current challenges and possible future trends in deep 3-D surface reconstruction. We expect that coming research will put a strong emphasis on self-attention-based models, due to their excelling performance in DL in general and in 2-D computer vision problems, i.e., the vision transformer and its derivatives, in particular. Moreover, self-supervision will be a strong community focus due to its ability not only to improve reconstructive performance overall but also to leverage small and potentially domain-specific datasets. The application of diffusion models seems to be a promising direction as well. Finally, albeit in a niche setting, the quantification of reconstruction uncertainties will be of utmost importance for safety-critical applications and certain scientific application settings.

A c k n o w l e d g m e n t
The authors thank their funding agencies.

Fig. 1. Output representations of various 3-D surface reconstruction approaches. DL-based 3-D surface reconstruction approaches can be broadly classified into four main categories according to their representation: volumetric, point cloud, mesh, and an example of implicit neural representation based on SDF. (a) Object. (b) Voxelized. (c) Point cloud. (d) Mesh. (e) Implicit.

Fig. 2. Visualization of (a) CD, (b) EMD, and (c) HD metrics. Red dots and blue dots belong to two different point sets, and each of these metrics measures the distance between these two sets in a unique way.

Fig. 5. (a) Comparison of LFDs between two 3-D models: a pig and a cow. First, rendered images are extracted for both 3-D models. Then, as illustrated in (b), all 2-D images from the same views are compared, and a similarity value for this camera angle is obtained.

Fig. 8. TL-embedding network [22]. During training, two types of input are fed to the network: 2-D RGB images as the input to the ConvNet at the bottom and 3-D voxel maps as the input to the autoencoder at the top. The network outputs a 3-D voxel map.

Fig. 11. PointNet architecture [148], which is used for classification and segmentation tasks, directly accepts a point cloud as input. Each point in the input point cloud is processed by a small neural network individually and independently. Then, point features are aggregated by max pooling, a simple symmetric function that respects the permutation invariance of the input points. The aggregation step creates a global feature vector that encodes the entire shape.

Fig. 12. FoldingNet architecture [32] consists of a graph-based encoder (an improved and generalized version of PointNet), which encodes local neighborhood structure information, and a folding-based decoder, which reconstructs the point cloud through a 2-D grid template deformation process.

Fig. 13. AtlasNet [34] is a patch-based approach that takes either a 2-D image or a 3-D point cloud as input and outputs a 3-D mesh.

Fig. 16. Point2Mesh [41] takes a point cloud (in blue) and a deformable initial mesh as input and gradually reconstructs the final output shape.

Fig. 17. Mesh R-CNN [44] architecture. After the object detection step, the voxel branch predicts a coarse voxel representation for each object detected by Mask R-CNN [178]. Then, in the mesh refinement branch, the cubified object is transformed into a mesh through a series of refinement steps.

Fig. 18. Level set methods divide a 3-D space into three parts: an interior part (f < 0), an exterior part (f > 0), and an exact overlap with the object's surface (f = 0).

Fig. 19. The occupancy networks architecture [45] predicts the occupancy of each point in 3-D space using a DNN. Different encoder architectures are used depending on the task and input: a ResNet-18 architecture [173] for image input, a PointNet encoder [148] for point cloud input, and a 3-D CNN for voxel input.

Fig. 20. DeepLS [47] decomposes a scene into local shapes and uses a set of locally learned continuous SDFs defined by a neural network.
Table 1 (Continued). Comparison of Benchmark Datasets (excerpt). Each object in this dataset can be a 3-D mesh; the 3-D shapes are stored in the Wavefront object file format (.obj), which describes the surface geometry of a 3-D shape through its vertices and faces, along with companion material template library (.mtl) files used to store material definitions. The full ShapeNet dataset is not yet publicly available. ModelNet is a synthetic dataset that includes a comprehensive and clean collection of 127 915 CAD models from 662 object categories and consists of two subsets, ModelNet10 and ModelNet40, with ten and 40 classes, respectively. ModelNet10 has also been annotated with the orientation of the CAD models, which are given in the Geomview object file format (.off).

Table 3. Quantitative Report About Some of the Methods' Performance on ShapeNet. CD, IoU, AP, and F_Score Are Calculated as the Average. For IoU, F_Score, and AP, the Higher the Better; for CD, the Lower the Better. Abbreviations: Number of ShapeNet Categories Used in an Experiment (#Cats), Not Measured or Not Mentioned (-), SVR, MVR, Reconstruction (R), Completion (C), Autoencoding (AE), Training Time (T), Inference Time (I), Generating a Mesh (mg), Memory (Mem.), and MS. * Is Calculated for (32³). + Is Related to Chamfer-L1. For Detailed Information Regarding Data Preparation Methods, Train/Test Splits, Metrics, and Other Specific Details, Please Refer to the Context of Each Individual Paper.