Introduction
In recent decades, demands for effective and efficient monitoring of the dynamic change of buildings and construction installations in urban areas have continually increased in the fields of architecture, engineering and construction/facility management (AEC/FM), urban planning, and surveying and mapping. This is especially true in applications like tracking progress, increasing profitability, controlling quality, ensuring security, and investigating incidents [1]. In these applications, automatic methods using measurements like 2-D imaging [2]–[4], photogrammetry [5]–[7], or 3-D laser scanning [8]–[10] have received much attention, compared with conventional approaches relying on visual inspection and extensive manual data collection and document analysis. Among the data types used in these methods, 3-D point clouds generated by laser scanning and multiview stereo vision have become commonplace across a broad range of applications [11].

A point cloud is a direct data source that has proven well suited to urban mapping and 3-D building reconstruction: measured points of an object are assigned 3-D coordinates directly during measuring. Compared with indirect data sources such as 2-D projected images or 1-D measured distances, the use of point clouds can considerably streamline the modeling of surfaces and the reconstruction of geometry [12]. Making use of point clouds, the 3-D reconstruction of man-made infrastructure and buildings in urban scenarios has been extensively developed, with scan-to-BIM being one of the best-known examples [13], [14]. The reconstructed as-built BIM is even becoming a powerful solution for accurate project progress monitoring and change detection on construction sites [8], [15]. However, raw point clouds typically comprise countless dynamic and temporary objects, for instance, temporary formwork for concrete walls, which impede the reconstruction of the walls themselves. Furthermore, the acquired 3-D coordinates do not contain any semantic or topological information. Thus, we need a workflow within a designed framework to reconstruct 3-D models from input 3-D point clouds. Unlike in the fields of computer vision, computer graphics, or earth observation, in AEC/FM and related applications building reconstruction encompasses more content: capturing the geometric shape and appearance of real objects, reconstructing 3-D geometry, interpreting semantic and topological information, and representing the 3-D information with surface or volumetric representations.
In Fig. 1, we demonstrate an example pipeline from an acquired photogrammetric point cloud to the expected semantically rich 3-D building model. As seen from the figure, the creation of the desired 3-D models of objects in a construction scene involves not just acquiring 3-D points, but also deriving the spatial geometry of surfaces and interpreting the semantic labels of objects. That is to say, we need to convert 3-D scenes of the real world into digital models described with a high-level, semantically rich representation. The following two essential questions should be considered and answered when designing processing workflows and related methods.
Fig. 1. Reconstruction of objects with semantically rich 3-D models from point clouds (example using as-built BIM from [5]).
What is the best method to acquire 3-D point clouds from construction sites and urban scenes?
What is the workflow for reconstructing objects from construction sites and urban scenes?
These two questions point to the two major topics of the 3-D reconstruction of buildings and civil infrastructures: data acquisition and processing techniques. Here, aside from buildings, we also take civil infrastructures into consideration, comprising public and private physical structures like roads, railways, bridges, and tunnels, which are of importance in the field of AEC/FM. Compared with residential buildings, they have different geometries but share common key techniques for reconstructing 3-D models. Relating to these topics, various attempts and numerous solutions have been reported in publications covering shape reconstruction and object detection from point clouds of multiple sources. However, the majority of existing publications only partially solve one or several specific problems in the entire workflow of reconstructing 3-D building models, involving diverse techniques such as classification or modeling. Moreover, although there are already comprehensive review papers like [8], [16], and [17], their content focuses more on the applications of point clouds or the general development of scan-to-BIM than on an investigation of the different data properties and an in-depth analysis of the data processing techniques themselves. Considering that the achieved level of technology is the bottleneck for any potential application in practical engineering projects, we review the related work covering solutions to the following questions, in order to give an overview of the current state of the art in data acquisition and processing techniques.
How to acquire 3-D measurements mapping the scene?
How to integrate datasets into the same reference frame?
How to interpret scenes and extract objects of interest?
How to represent the object with geometric models?
Fig. 2. Generic workflow of the reconstruction of semantically rich 3-D models from point clouds.
Point Cloud Data
A point cloud is a set of 3-D data points in Euclidean space. Each of these points stands for a single measurement on an object surface, recorded with its 3-D coordinates and, in many cases, additional attributes such as intensity or color.
A. Acquisition of 3-D Points
The acquisition of 3-D points means measuring the 3-D coordinates, as well as the attributes, of points in Euclidean space, and then recording the measured points and organizing them under the same coordinate frame. This acquisition can be achieved through a wide variety of sensors and methods following different principles of measuring 3-D coordinates.
1) Major Principles of Measuring 3-D Coordinates
There are two major working principles for measuring 3-D coordinates when generating point clouds, namely the ranging-based principle and the imaging-based principle. Methods using the ranging-based principle rely mainly on active sensors, with the 3-D laser scanner being a commonly employed example, while methods using the imaging-based principle rely on measurements from passive sensors, particularly different types of cameras [18]. In Fig. 3, we illustrate the estimation of the 3-D coordinates of a point in the same scene using methods of these two principles (e.g., laser scanning and multiview stereo vision). Ranging-based methods physically infer the position of a 3-D point by means of active rangefinders, including structured light, laser beams, and other active sensing techniques, with light detection and ranging (LiDAR) systems and time-of-flight (ToF) cameras frequently used as sensors. By contrast, imaging-based methods do not directly derive the ranges between the sensor and the object. Instead, the sensor only receives 2-D signals (i.e., images) reflected or emitted by the object surface, and the 3-D point coordinates are then estimated via triangulation from stereo image pairs. For imaging-based methods, cameras that respond to visible light and output a matrix of digital pixels are usually used. In comparison, ranging-based methods directly obtain 3-D information, with the scale implicit in the range measurements, while imaging-based methods must derive this information from the images, meaning scale factors must be identified using ancillary information. Moreover, imaging-based methods can provide more reliable radiometric information, ranging from the visible domain to the infrared one, depending on the optical sensors used; ranging-based methods generally only provide the intensity of the reflected signals as the primary attribute. Regarding the accuracy of the measured points, points acquired by ranging-based methods typically have higher accuracy in the direction of depth than in the directions perpendicular to it, whereas points acquired by imaging-based methods normally have higher accuracy in the directions of the image plane than in that of the depth. One more thing to note is that the quality of each point acquired by imaging-based methods can be assessed by the uncertainty of the stereo matching, while for points of ranging-based methods we can only conduct a general assessment based on the errors of the sensor system. Cameras used in imaging-based methods are more portable and compact, which makes them more suitable in critical situations (e.g., no observation points or highly occluded scenes) of construction-related applications. Benefiting from the recent development of laser scanning devices, more portable active sensors like backpack-based solid-state LiDAR have been developed, facilitating indoor mapping and reconstruction as well. New devices like the Leica CountryMapper, a hybrid system combining LiDAR and optical sensors, have also been developed, so that the acquired point clouds can offer highly accurate geometry together with textures and RGB colors.
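As a minimal illustration of the imaging-based principle, the sketch below recovers a 3-D point from a rectified stereo pair via pinhole triangulation; the focal length, baseline, and pixel coordinates are hypothetical values chosen for the example.

```python
import numpy as np

def triangulate_rectified(u_left, u_right, v, f, baseline, cx, cy):
    """Recover a 3-D point from a rectified stereo pair.

    u_left, u_right : horizontal pixel coordinates of the same point
    v               : vertical pixel coordinate (equal in both images)
    f               : focal length in pixels
    baseline        : distance between the two camera centers (m)
    cx, cy          : principal point in pixels
    """
    disparity = u_left - u_right          # larger disparity -> closer point
    z = f * baseline / disparity          # depth along the viewing axis
    x = (u_left - cx) * z / f             # lateral coordinates from the
    y = (v - cy) * z / f                  # pinhole projection model
    return np.array([x, y, z])

# Hypothetical example: f = 1000 px, baseline = 0.3 m, disparity = 12 px.
point = triangulate_rectified(652.0, 640.0, 480.0, 1000.0, 0.3, 640.0, 512.0)
print(point)  # depth z = 1000 * 0.3 / 12 = 25 m
```

Since depth is inversely proportional to disparity, a fixed matching uncertainty yields a depth error that grows quadratically with distance, which explains why imaging-based points are more accurate in the image plane than in depth.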
Fig. 3. Two principles of measuring 3-D coordinates. (a) Ranging-based method using laser scanning. (b) Imaging-based method using multiview stereo vision.
2) Methods of Generating Point Clouds
Using the two aforementioned principles, as well as combinations thereof, a wide variety of methods is available for generating point clouds with 3-D measurements, including laser scanning (i.e., ALS, MLS, and TLS), ToF imaging, multiview stereo (MVS) vision, structure from motion (SfM), simultaneous localization and mapping (SLAM), and single image depth estimation (SIDE). The devices used in these methods involve laser scanners on various platforms and different types of cameras; in Fig. 4, we illustrate several laser scanning systems and camera types. To give an overview of the pros and cons of these methods, we provide a comparison of methods using different principles in Fig. 5(a). The considered aspects include point density, point accuracy, point attributes, time efficiency, coverage area, cost, and required assistance. The density and accuracy of points relate to the geometric quality of the generated point clouds; ideally, we expect the acquired points to be dense and accurate. The attributes indicate whether the acquired point clouds carry rich information (e.g., intensity, RGB colors, and the number of returns); richer attributes can significantly broaden the fields of application. Time efficiency, coverage area, and cost stand for the difficulty and total cost of conducting a measuring campaign with a given method; a feasible solution must keep a good balance between efficiency and cost. The required assistance evaluates whether the measuring campaign (i.e., the generation of point clouds) needs special assistance (e.g., precise maps or GPS/IMU measurements), reflecting the need for skilled operators in the measurement campaign and in pre- or post-processing.
Fig. 4. Laser scanning systems and different types of cameras for point cloud generation. (a) TLS (Z+F IMAGER 5010), (b) ALS (WHU Kylin Cloud-I), (c) MLS (Fraunhofer IOSB MODISSA), (d) handheld single-lens reflex camera (Canon EOS 5D), (e) stereo cameras mounted on a crane (red circles), and (f) UAV-based camera.
Fig. 5. Comparison of point cloud generation methods. (a) Advantages and disadvantages of various methods. (b) Statistics from selected papers (shown in Table II) on the point cloud data used in reconstruction.
It can be seen from this radar chart that each method has its pros and cons, which makes it suitable for specific applications. For example, ALS and MLS are more suitable for large-scale outdoor mapping tasks, while ToF imaging and SIDE are better suited to close-range scenarios. SfM and MVS are usually used together as sequential steps. For indoor 3-D reconstruction, SLAM is becoming increasingly popular. In terms of point density and accuracy, TLS is the best solution. In terms of coverage area, only ALS can measure large-scale point clouds in a single flight, but at a very high cost. It is noteworthy that an increasing number of novel data-collection methods and sensors (such as radar) are emerging as well, enabling more options for generating 3-D point clouds. In Fig. 5(b), we provide statistics on the methods for generating point clouds in the publications we reviewed. From these statistics, we find that TLS and MVS (using stereo vision) are the primary solutions, but many newer sources like RGB-D data and MLS are emerging.
3) Restrictions on Data Acquisition
When acquiring point clouds in an urban scenario or on a construction site, it is inevitable to encounter complex environments. In practice, there are plenty of restrictions on the measured sites during the acquisition of 3-D observations. In particular, several critical aspects should be considered further, including accessibility and visibility, dynamics and temporal changes, and demands for multimodal measurements.
a) Accessibility and visibility: The first restriction on data acquisition pertains to the accessibility and visibility of the observed site. Owing to the inaccessibility of specific site locations, the desired viewing angle may not always be available in crowded places. For terrestrial measurements, due to legal issues like the protection of privacy and the ownership of private residences, cameras or laser scanners may not be placed in certain areas without permission. This means that, in some situations, the backside of street buildings in residential areas is not accessible. For example, the construction site investigated in [6] is located in central Munich, very near the central railway station. As seen in Fig. 6(b), buildings, sidewalks, and even subway lines crowd all the surrounding areas. The distance from the site to the surrounding buildings can be no more than 12 m.
Fig. 6. Restrictions on data acquisition. Accessibility and visibility: (a) image taken from the crane and (b) the actual scenario of the streets and buildings neighboring the site (reproduced with permission from the author of [6]). Dynamics during measuring: (c) scanned points of static and moving cars from MLS and (d) a moving excavator in the scene. Multimodal data of the same scene: (e) thermal infrared image and (f) MLS point cloud of a street building.
Such a large, complex urban environment raises the difficulty of obtaining adequate and accurate images, as suitable locations for acquiring images are quite limited. Moreover, for UAV-based observations, the flying area and altitude of the drone are also restricted in urban areas due to security risks posed by power lines and high-rise buildings. Thus, many aerial observation positions are not available. As a consequence of limited accessibility, the visibility of the observed targets is strongly affected. For example, occlusions may frequently occur, because barriers between the sensor and the object impede the line of sight. Occlusions blocking the view from specific observation points lead to inadequate data collection and information loss. In Fig. 6(a), for instance, the lattice frame of the crane makes some parts of the building roof invisible. Moreover, man-made objects with repetitive patterns and periodic shapes increase the difficulty of recognizing certain kinds of structural elements, which is counterproductive to scene interpretation. All these factors should be considered for a successful data acquisition in urban scenarios. In fact, such requirements imply that related policies and regulations should keep pace with urbanization and engineering demands.
b) Dynamics and temporal changes: The second restriction stems from the dynamics and temporal changes of the observed site. Dynamics represent the moving objects in the observed scene, such as pedestrians and moving vehicles. Temporal changes denote changes of static objects, for example, building elements that have undergone changes, planned or otherwise, during construction. It is therefore always challenging to capture 3-D spatial information without disturbances caused by dynamics and temporal changes in an urban scenario. Owing to the line-by-line principle of scanning, dynamics cause substantial deformation of the measured objects in laser scanning points. In Fig. 6(c), we compare the scanned points of a static and a moving car, and we can see that the points of the moving vehicle are significantly deformed. Similarly, for construction engineering projects, since construction is always a dynamic process, moving workers, equipment, machines, and temporary devices and installations commonly appear on the site. In Fig. 6(d), for example, we show a typical situation on a construction site: the movement of the excavator leads to temporary occlusions during laser scanning and thus to missing points in the acquired point cloud. To cope with such dynamic changes during data acquisition, an optimized and adaptive measuring plan and coordination with the construction schedule are always necessary, which differs from conventional surveying and mapping applications.
c) Demands for multimodal measurements: A multimodal dataset differs from the commonly used multisource dataset: only when a data acquisition includes measurements of different modalities is it characterized as multimodal. Theoretically, an optimized data acquisition should be multimodal, including more than a single data source or data mode, for example, RGB images, 3-D points, GIS maps, and cadastral records. For instance, in [19], combining a thermal infrared image with point clouds yields a 3-D representation of the object with its temperature distribution and emission properties [see Fig. 6(e)], which simplifies the task of thermal building inspection. The need for multimodal measurements is increasing, but conventional data acquisition cannot fully satisfy it.
B. Organization of 3-D Point Clouds
A point cloud is a discrete sampling of the continuous surfaces of objects. Thus, to ensure adequate resolution and point density in an urban environment, millions or even billions of points are needed. To this end, the organization of the data directly influences the operation of a designed framework or workflow. Moreover, the raw point cloud is usually unstructured, so using a suitable data structure to recover the local geometry and spatial topology of the measured points facilitates the processing steps and improves processing performance. Apart from the point-based structure (i.e., the direct use of points as fundamental elements for processing), voxel-based and patch-based structures are also frequently used in plenty of applications. In Fig. 7, we visualize statistics of the reviewed literature on the data structures used for point cloud processing. As seen from the pie chart, point-based structures still dominate the majority of applications. However, data structures with preclustered points (i.e., voxels, supervoxels, and superpoints) account for approximately 15% of all studies, revealing that they also play a vital role. These data structures are discussed in further detail in the following sections.
Fig. 7. Statistics of studies from selected papers (shown in Table II) using different data structures.
1) Point-Based Structure
Discrete points with their 3-D coordinates stored as items in a list constitute the simplest data structure for point clouds. Here, we can classify the point-based structure into two categories: structured point clouds and unstructured point clouds. In this context, structured means that the point cloud has a structure like a fixed raster or 3-D grid. Laser scanners using fixed rasters will generate such structured point clouds. A ToF camera measures the depth of a 3-D scene, producing what is actually an image whose intensities represent distance. As an image, this point cloud data (i.e., the depth image) is naturally structured by 2-D grids, in which the points are also termed pixels. In such a structured point cloud, the relation between points, namely the spatial topology, is known. More specifically, the relation between points is constrained by the raster. With such a regular raster constraint, the processing steps for a structured point cloud are made easier. For instance, the detection of planes can be simplified to checking the local homogeneity of depth in the 2-D rasters. Moreover, since the relationship between adjacent points is known, operations involving searches for nearest neighbors are accelerated. A point cloud acquired via stereo matching can possibly be organized as a structured one as well, since each point can be back-projected to the image, which has a raster structure. By contrast, unstructured point cloud data has no such fixed raster. Thus, in an unstructured point cloud, a processing algorithm has to traverse the entire list of points to identify their adjacency, which is time-consuming.
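A common remedy for unstructured point clouds is to index them once with a spatial search structure such as a k-d tree. The minimal sketch below uses SciPy's cKDTree to recover point adjacency without traversing the whole list; the cloud and query parameters are placeholders.

```python
import numpy as np
from scipy.spatial import cKDTree

# Unstructured point cloud: a plain N x 3 array with no raster topology.
points = np.random.rand(100000, 3)

# Building a k-d tree once recovers an implicit spatial topology, so that
# neighbor queries no longer require scanning the entire point list.
tree = cKDTree(points)

# k nearest neighbors of a query point ...
dists, idx = tree.query(points[0], k=8)

# ... or all neighbors within a fixed search radius.
idx_radius = tree.query_ball_point(points[0], r=0.01)
```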
2) Voxel-Based Structure
Like a pixel in a 2-D image, a voxel is a basic rasterized unit structuring 3-D space, standing for a position in a regular cubic grid. Voxelization transforms a point set into a voxel grid, in which the geometry of surfaces is approximated at a given spatial resolution by the points inside each voxel. Voxels are normally indexed by tree structures within a 3-D grid. In [20], the input point cloud is organized with an octree structure, which partitions each piece of space into eight equal subspaces. Here, the octree serves as a 3-D version of the 2-D quadtree, indexing spatial positions. As a consequence, subspaces (i.e., voxels) at different levels of the octree occupy spaces of different sizes. A general representation of the voxelization of point clouds is shown in Fig. 8.
Compared with the point-based data organization, voxel structures result in a simpler depiction of complicated scenes. At the same time, the voxelization process is also a down-sampling, further reducing computational costs. Furthermore, the tree structure established during the division of the space restores the adjacency topology, considerably accelerating traversal (i.e., the search for neighbors). In common cases, the points inside a voxel are approximated by plane models [21] or abstracted features [22]. In this way, negative influences resulting from an inhomogeneous density of points can be suppressed. Noise and outliers are also reduced, since they are suppressed during the approximation or abstraction process. However, the resolution of the voxels determines the granularity of segments and labeled points, meaning the use of voxel structures is always a compromise. In other words, the selection of a suitable resolution for the voxel structure is one of the keys to output performance. It is therefore necessary to have a heuristic or analytical understanding of the application prior to voxelization, given the various criteria mandated by different situations. Taking all this into consideration, we conclude that further development of the voxel structure lies in preclustered frameworks with simplified or adaptive constraints (e.g., supervoxels).
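A minimal voxelization sketch is given below: each point is mapped to an integer grid index, and all points falling into one voxel are replaced by their centroid. The voxel_size parameter is exactly the resolution whose selection the preceding discussion refers to; the numbers are hypothetical.

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Replace all points that fall into one cubic voxel by their centroid."""
    # Integer voxel index of every point along each axis.
    keys = np.floor(points / voxel_size).astype(np.int64)
    # One id per occupied voxel; 'inverse' maps each point to its voxel.
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True,
                                   return_counts=True)
    inverse = inverse.ravel()
    # Accumulate coordinates per voxel, then divide by the occupancy count.
    centroids = np.zeros((counts.size, 3))
    np.add.at(centroids, inverse, points)
    return centroids / counts[:, None]

# Hypothetical usage: reduce a dense cloud at a 5 cm voxel resolution.
cloud = np.random.rand(100000, 3)
reduced = voxel_downsample(cloud, voxel_size=0.05)
```

A coarser voxel_size shrinks the data and suppresses noise further, but at the price of the segment granularity discussed above.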
3) Patch-Based Structure
The patch-based structure entails preclustering points with common characteristics into patches and using these patches as basic units for further processing. The supervoxel structure is a popular example of a patch-based structure, clustering basic voxels using local k-means clustering [23], weighted distances [24], or link-chains [25], among others. In contrast with the voxel structure, supervoxels maintain the borders between neighboring entities and further increase computational performance. In Fig. 9, we compare point clouds organized with three different structures: points, voxels, and supervoxels.
Supervoxelization is, however, merely an over-segmentation of the entire point cloud, so a second clustering of patches is involved. The clustering of over-segmented patches into complete segments is therefore an inevitable task when using patch-based structures. There are two popular strategies for this second clustering problem. One utilizes supervised classification to label patches. For the obtained patches (e.g., superpoints [26] or supervoxels [27]), geometric features [25], [28], [29], spectral information [30], or colors [24] can be extracted from the points allocated to individual patches. Since patches consist of preclustered points with homogeneous characteristics, the problem of selecting a suitable neighborhood for approximating features is easily avoided: the edges of a patch have already been defined adaptively during preclustering, with isolated points eliminated and rough borders smoothed. Moreover, owing to the strengths of supervised learning, the labels assigned to patches are highly accurate. Complete segments can therefore easily be attained by recurrently coalescing patches with the same semantic labels. Nonetheless, supervised approaches need a large amount of accurate training data, a tremendous amount of time, and extensive manual work.
One way around these constraints is to aggregate patches with local or global optimization algorithms in an unsupervised way. Local convexity is coupled with region growing to cluster supervoxels into complete segments in [31]. In [32], a global adjacency graph with geometric consistency is constructed, with the supervoxel structure serving as nodes; the aggregation of supervoxels is then accomplished by evaluating connectivity via the minimization of a binary cost function. In [33], supervoxels are placed in a local neighborhood graph of a certain width, and a clustering is determined from their connections via Markov clustering (MCL). The main strength of unsupervised approaches is that they require no training sets and usually incur lower computational costs. Nonetheless, they cannot assign conceptual labels to patches and may have adaptability issues when dealing with complex structures.
C. Registration of Multiple Point Sets
In many applications, the use of multiple point clouds from various sensors, platforms, times, and/or observation positions must be considered. The registration of these varying point clouds is a precondition for obtaining full coverage of the entire scene or a repeatedly observed temporal dataset [35]. Point clouds of arbitrary initial positions and orientations are aligned with 3-D models or with other point clouds by means of a spatial transformation from one coordinate frame to another. In Fig. 10, we illustrate the alignment of two point clouds into the same coordinate frame. The two point clouds are normally termed the target and source data, respectively, and registration aligning the source data to the target data is called pairwise registration. Once there is more than one set of source data, it becomes a multiview registration. Unlike manual registration using surveying markers, here we focus only on automatic point cloud registration, also termed marker-less registration.
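Whatever feature elements are used, pairwise registration ultimately estimates a spatial transformation of the source frame into the target frame. For point clouds with an unknown scale (e.g., photogrammetric ones, as noted earlier), this is commonly written as a seven-parameter similarity transformation:

```latex
% Pairwise registration maps a source point p into the target frame via a
% seven-parameter similarity transformation (scale s, rotation R, shift t):
\mathbf{p}' = s\,\mathbf{R}\,\mathbf{p} + \mathbf{t},
\qquad \mathbf{R}\in SO(3),\; s>0,\; \mathbf{t}\in\mathbb{R}^{3}.
```

For laser-to-laser registration, the scale is typically fixed to s = 1, leaving the six rigid degrees of freedom that the methods reviewed below estimate.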
Fig. 10. Illustration of registration between two point clouds. (a) Target data: MLS point cloud. (b) Source data: photogrammetric point cloud from SfM. (c) Registration result (reproduced with permission from the author of [34]).
There is a wide range of publications describing methods for marker-less registration between different point clouds by the use of geometric characteristics. Any registration method includes two major phases, namely the extraction of feature elements and the finding of correspondences. These methods can be roughly divided into three main groups based on the type of feature elements they use: point-based, primitive-based, and global feature-based approaches. In Table I, we present statistics derived from the reviewed literature on point cloud registration. As can be seen from the table, points, primitives, and global features are all commonly used in registration applications. It is also noteworthy that the bulk of the data involved in registration consists of TLS point clouds [see the statistics in Fig. 11(a)]. As described in the previous section, this is because the static platform of TLS restricts its field of view, so that occlusions sometimes occur in urban scenarios. Such occlusions must be resolved by registration to achieve a full-scene analysis. Moreover, in Fig. 11(b), we present statistics of the feature elements used in different methods, which reveal that points and primitives are still the dominant elements used for registration. Besides the data and feature elements, Table I also lists many strategies and algorithms (e.g., 4PCS) used for finding corresponding points or elements in target and source point clouds. These feature elements and algorithms are discussed in further detail in the following sections.
Fig. 11. Statistics from selected papers (shown in Table I) on the (a) data acquisition and (b) feature elements used in registration.
1) Point-Based Registration
The concept of point-based approaches lies in the determination of corresponding pairs of points from different point clouds. For instance, the classic iterative closest point (ICP) algorithm, as well as its variants, iteratively minimizes the distances between points in the overlapping areas of point clouds [43], [85], [86]. ICP-based methods normally require approximate initial transformation parameters, and the iteration process takes a great deal of time. Requiring no iterations, the 4-point congruent sets (4PCS) algorithm, as well as its variants, is another representative point-based method; it utilizes unique sets of four congruent points whose distance ratios are invariant to affine transformations [39], [40], [48]. Nonetheless, the core of 4PCS is to reduce the number of candidate elements, and correspondences still need to be verified by rejectors like RANSAC. For a large dataset with a high point density, a down-sampling stage is usually necessary before applying 4PCS-based approaches, but this sacrifices details of the scene. Rather than down-sampling all points, using selected key points as elements is a solution that significantly reduces the computational cost. Many detectors are used for selecting key points, for example, SIFT [87], [88], DoG [48], and virtual intersecting points [60]. Similarly, feature points extracted by FPFH [52] and structural semantics [64] are also used as elements, but in these cases, correspondences are found via the similarity between features rather than distances.
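The sketch below illustrates a minimal point-to-point ICP in the spirit described above, assuming a coarse initial alignment: closest-point correspondences from a k-d tree, a closed-form rigid transform via SVD (the Kabsch solution), and iteration. Production variants add outlier rejection, point-to-plane metrics, and convergence tests.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Closed-form least-squares rotation and translation (Kabsch/SVD)."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(source, target, iterations=30):
    """Minimal point-to-point ICP; assumes a coarse initial alignment."""
    tree = cKDTree(target)
    current = source.copy()
    for _ in range(iterations):
        # Correspondences: the closest target point for each source point.
        _, idx = tree.query(current)
        R, t = best_rigid_transform(current, target[idx])
        current = current @ R.T + t
    return current
```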
In Fig. 12, we compare point-based registration using key points with the 4PCS strategy and global feature-based registration using 3-D phase correlation [84]. In the point-based workflow, key points are first extracted from both the source and target datasets using 3-D key point detectors. Then, correspondences between key points from the source and target datasets are identified by means of the 4PCS strategy, during which incorrectly matched pairs of points are rejected. Transformation parameters are finally estimated from the 3-D coordinates of the corresponding points. Point-based methods have been widely used for both coarse and fine registration, since they are feasible for various scenarios [55]. However, point-based methods are sensitive to varying point densities and outliers, as point-to-point distances can be influenced by both.
2) Primitive-Based Registration
Primitive-based registration is an alternative registration strategy in which geometric primitives formed by points (e.g., lines and curves [68], [89] or planar surfaces [69], [72]) serve as candidates for registration. Compared with points, geometric primitives are higher level structures with fewer degrees of freedom, enhancing the robustness of matching corresponding feature pairs and of estimating orientations [90]. Line features are typical instances of geometric primitives used for registration, including straight edge lines [86], [89], lines between intersecting planes [61], crest curves [68], and borders of building footprints [91]. Planes [92]–[94], as well as curved surfaces [76], are likewise utilized as geometric primitives for aligning two coordinate frames. Planes are the dominant structures in many point clouds, particularly those of urban areas [95], and they can easily be extracted from geometric attributes (e.g., the positions and normal vectors of points). Nonetheless, contrasting point-based registration methods with methods using lines or planes, the latter need abundant linear artifacts or smooth surfaces to create adequate primitive candidates, which mainly depends on the scene content. Therefore, primitive-based methods may encounter problems in scenarios with natural landscapes only (i.e., those without artificial infrastructure). Additionally, the quality of the extracted lines or surfaces affects the registration result: the extraction of planes via model fitting or region growing with smoothness constraints is somewhat time-consuming and can be unreliable, which decreases the efficiency of registration, and the consistency of the extracted planes has a significant impact on the accuracy of the orientation parameters. It is also popular to use voxelized structures as primitives. For example, extended Gaussian image (EGI) features of voxel clusters [96] have been used to find correspondences for coarse registration, providing acceptable results when matching point clouds in indoor scenarios. These promising results encourage the concept of using the voxel structure in lieu of the point structure for fast and efficient registration between point cloud pairs.
As for finding corresponding primitives, the similarity between geometric attributes is usually utilized. In [89], angles and distances between lines are calculated to identify correspondences. In [97], the alignment between extracted planar patches is established by means of an interpretation tree and additional constraints. In [92], planar surfaces identified by region growing are matched via their locations, boundary lengths, bounding boxes, and mean intensities. In [98], to avoid iterations, a global optimization is implemented using locally consistent planes. In addition to similarities between the properties of planes, geometric constraints on the layout of planar surfaces are also used. In [99], the intersection angles of plane triples are used to compute coarse transformation parameters. In [60], intersecting points are used as tie points for estimating the transformation. Similarly, in [72], the distances between plane triples and intersecting points are minimized within a RANSAC process to estimate the transformation. Nonetheless, in urban scenes, parallel planes (e.g., the parallel facades of a building) lead to ambiguities in the search for correspondences [72]. Thus, when applying plane-based registration in a large-scale urban district, reducing redundant planar surfaces becomes a critical problem. To tackle this drawback, instead of using plane triples, a four-plane solution combining plane orientations with the 4PCS strategy [34], or similarity matching via the angles between pairs of planes [75], has been implemented. The use of selected four-plane sets considerably reduces the number of element sets, so that the finding of correspondences can be accelerated.
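To make the reduced degrees of freedom concrete, the hypothetical sketch below estimates a rigid transform from three or more matched, non-parallel planes n·x = d (the correspondences are assumed known): the rotation follows from the normal directions, via the same SVD step used for points, and the translation from the plane offsets. With parallel planes the linear system becomes rank-deficient, which is exactly the ambiguity noted above.

```python
import numpy as np

def transform_from_planes(normals_src, d_src, normals_tgt, d_tgt):
    """Rigid transform from >= 3 matched planes n . x = d (non-parallel).

    Rotation aligns the source normals with the target normals; the
    translation then satisfies n_t . t = d_t - d_s for each matched plane.
    """
    # Rotation from normal directions (Kabsch without centering).
    H = normals_src.T @ normals_tgt
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:
        Vt[-1] *= -1
        R = Vt.T @ U.T
    # Each matched plane constrains t along its normal; solve least squares.
    # Rank-deficient (ambiguous) if all normals are (near-)parallel.
    t, *_ = np.linalg.lstsq(normals_tgt, d_tgt - d_src, rcond=None)
    return R, t
```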
3) Global Feature-Based Registration
All the abovementioned registration methods utilize local information of the point clouds, derived either from the points themselves or from clustered primitives. Besides this, registration can also be achieved using global features of the entire point cloud. For example, point densities are utilized to conduct registration via coherent point drift [79] and kernel correlation of affinities [80], respectively. In [55], the authors introduce a global vector of locally aggregated descriptors to align multiple point clouds without knowing the view order or positions. In [81], fundamental spatial structures corresponding to low-frequency components in the frequency domain are separated; with the help of 2-D projection and the Fourier transform, translation and rotation in Euclidean space can be converted to global phase differences in the phase domain. In [83], a fast and robust solution for shift estimation between point clouds is proposed, which uses a global strategy matching low-frequency components in the frequency domain. As an improvement, a new perspective on point cloud registration from local to global is proposed in [84]: by correlating the whole signals represented by the point clouds and estimating the parameters in closed form, robust point cloud registration can be achieved even in low-overlap and highly noisy cases. Theoretically, global feature-based registration methods are more robust than those based on local features, but they generally require a large overlap ratio; without sufficient overlap, the global features may differ significantly.
Key Techniques for 3-D Reconstruction
Having reviewed the acquisition, organization, and registration of different 3-D point clouds, we now survey key techniques for 3-D reconstruction (i.e., the modeling of objects) using point clouds. Here, key techniques stand for the basic and common strategies, approaches, and algorithms that play important and indispensable roles in the reconstruction of 3-D models from point clouds. In the context of this article, key techniques should also be versatile, namely usable in the reconstruction of either residential buildings or civil infrastructures. The very different workflows for reconstructing objects from 3-D point clouds can be categorized into two strategies, the grouping-based strategy and the labeling-based strategy, according to the order in which the segmentation and classification processes are conducted. The difference between the workflows using these two strategies is shown in Fig. 13. In the grouping-based strategy (i.e., Type I), the segmentation or clustering of points is carried out first, and the recognition of objects is then performed on the segmented primitives. By contrast, in the labeling-based strategy (i.e., Type II), all points are first annotated with specific labels. For example, in the Type II case [101] shown in Fig. 13, the input point clouds consist only of building roofs, which have already been labeled. These labeled points are then clustered into individual segments representing various objects.
In the workflow of the grouping-based strategy, the primary procedure is segmenting the point cloud into primitives with common attributes or geometric properties. The partitioned primitives are then provided with semantic labels, and subsequently, the modeling of the labeled primitives is performed. Conversely, the workflow of the labeling-based strategy begins with semantic labeling directly on the points; the labeled points are then clustered into geometric primitives, and finally, the modeling is implemented using the clusters of labeled points. However, regardless of which workflow a strategy follows, it can never avoid the core processing steps of segmentation, classification, and geometric modeling. Under certain circumstances, the borders between these processing steps may cease to be evident. For instance, a model-fitting method can generate parametric models, but in the meantime it also serves as the operator segmenting the point cloud and classifying the points of different objects. We list a number of representative publications in Table II, concerning the reconstruction of man-made infrastructure and buildings in urban scenarios. The strategies used in the workflows, the fields of application, and the specific tasks are summarized in the table as well. Specifically, publications marked as Type I stand for approaches using the grouping-based strategy, while those termed Type II denote approaches following the labeling-based strategy. In these applications, the 3-D data are acquired via various sensors (e.g., TLS, depth cameras, TomoSAR, and MVS vision). At the same time, according to the applied algorithms and methods, the 3-D data are organized in various structures, including points, pixels, voxels, superpoints, and supervoxels. Based on the publications reviewed in the table, we provide detailed reviews and discussions of segmentation, classification, and geometric modeling algorithms and methods in the following sections, in order to give a thorough technical analysis.
A. Segmentation of Point Clouds
The segmentation of point clouds is the grouping of points into several homogeneous components sharing one or more common features [157]. We provide an example in Fig. 14 illustrating the segmentation of a set of points. Relevant approaches to point cloud segmentation can be divided into two major categories: attribute-based techniques and geometry-based techniques. Attribute-based techniques use point intensities or color information to group points into segments sharing the same semantic information or the same attributes, while geometry-based techniques segment points according to the structural homogeneity of the surfaces or structures that the points belong to. Both approaches have their positives and negatives. Brightness or color information is not always accurate or reliable, as its quality primarily depends on the recording technology of the sensor; it is also influenced by the materials of the objects and by the lighting conditions. For urban areas in particular, changing lighting conditions and complex environments of artifacts with similar textures, colors, and lighting render attribute-based techniques ineffective. Therefore, in many cases we face a strictly geometric segmentation problem when parsing building scenes. In general, methods using geometry-based techniques can be subdivided into four essential categories: model-based methods, region growing-based methods, clustering-based methods, and energy optimization-based methods [20]. In Fig. 15, we present statistics of the reviewed literature (listed in Table III) on segmentation methods. As seen in the figure, model-fitting methods account for more than one-third of applications. While region growing and clustering-based methods share nearly equal portions of the remaining applications, energy optimization-based methods occupy a much smaller portion. The details, pros, and cons of these four categories are discussed in the following sections.
Fig. 14. Illustration of building segmentation. (a) Raw point cloud from TLS, rendered with RGB values, and (b) segmentation results.
Fig. 15. Statistics of studies from selected papers (shown in Table III) using different segmentation methods.
1) Model-Based Segmentation
Model-based approaches associate points at a local or global level using specific mathematical representations, relying on their geometric characteristics (for instance, spatial locations and normal vectors). Points that meet the criteria for fitting the same mathematical model (either spatially or parametrically) are extracted from the point cloud as a single segment. Specifically, model-based methods are implemented via two main strategies: parameter domain-based methods and spatial domain-based methods. Parameter domain-based methods match spatial points in the parameter domain according to mathematical models transforming spatial structures into parametric expressions. Typical examples of this type are the 3-D Hough transform (HT) and its variants. The fitting of points in the HT is implemented via a voting procedure carried out in the mathematical parameter space, with the points of an entity selected by a local maximum in the accumulator space. Herein, the mathematical model and the corresponding parameters receiving the highest voting scores are chosen as the model for segmenting points. The HT has been utilized to detect lines [158], planes [159], cylinders [160], and spheres [161] in parametric space. Many similar methods, such as the Gaussian map [162] and tensor voting [163], can also be placed in this group of voting-and-accumulating techniques within the parameter domain.
Spatial domain-based approaches, by contrast, explicitly infer the optimal parameters of geometric structures from the 3-D coordinates of points within the spatial domain; the optimal parameters are usually calculated using robust estimators and least squares-based algorithms. Both robust estimators and least squares serve the model fitting process, but their mechanisms of fitting differ. For a given mathematical model, robust estimators reject outliers so that inlier points can be kept; however, a robust estimator cannot directly optimize the parameters of a model. Thus, a least squares estimation is usually applied to the inliers selected by the robust estimator. RANSAC and its extensions are the most common robust estimators used for fitting regular geometric shapes [124], [164], [165], and they can even extract shapes formed by primitives from point clouds polluted with noise or outliers. As far as least squares algorithms are concerned, the classic least squares approach is sensitive to gross errors and outliers. Thus, for model fitting tasks, robust variants of least squares are generally used [166]; for example, they are used to classify surfaces and geometric primitives in [167]. Still, such studies often point out how difficult and computationally inefficient it can be to fit higher order surfaces. It is noteworthy that principal component analysis (PCA)-based methods [168] belong to the least squares-based methods, since they optimize the model parameters by minimizing the squared orthogonal distances of the points to the fitted structure.
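A minimal RANSAC plane extraction in the spirit of the methods above is sketched below; the distance threshold and iteration count are hypothetical. In line with the division of labor just described, the returned inliers would then be refined by a (e.g., PCA-based) least squares fit to obtain the final plane parameters.

```python
import numpy as np

def ransac_plane(points, threshold=0.02, iterations=500, rng=None):
    """Fit one plane with RANSAC; return the indices of its inliers.

    threshold : maximum point-to-plane distance for an inlier (m)
    """
    rng = rng or np.random.default_rng()
    best_inliers = np.array([], dtype=int)
    for _ in range(iterations):
        # Hypothesis: a plane through three randomly sampled points.
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-12:                  # degenerate (collinear) sample
            continue
        n /= norm
        # Consensus: all points within 'threshold' of the hypothesis.
        dist = np.abs((points - p0) @ n)
        inliers = np.nonzero(dist < threshold)[0]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers
```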
2) Region Growing-Based Segmentation
Region growing (RG)-based methods are the second option; their so-called growing process is implemented iteratively by analyzing the points neighboring seed regions and assessing whether or not they belong to the seed region. In the growing process, the selection of seeds and the growing criteria are the two most influential factors for this kind of method. A seed is the origin of a growing process, and a region growing procedure consists of a number of parallel growing processes from different seeds. Two growing regions can be merged into a single one if the points at their border have common characteristics; in other words, the growing process can cross the boundaries between two regions. Here, the number and distribution of seeds influence the granularity of the segments, while the seed positions significantly impact the performance of the segmentation. Typically, areas with the least curvature [168] or areas with the smallest plane fitting residuals [161] are marked as seeds, because seed positioning should avoid boundary and border areas. For instance, seeds on the edge of a surface or at the corner of an object will yield over-segmentation [e.g., the over-segmented fences shown in Fig. 16(b)], as the frequently changing curvatures in these areas stop the growing process. Over-segmentation may also occur for curved objects of large size (e.g., tubes with a large-radius elbow joint) [169] or for surfaces with irregular shapes [e.g., the over-segmented bushes shown in Fig. 16(b)]. The decision on whether the growing process continues or stops depends on the consistency between the grown region and the examined point, which is assessed by the growing criteria. The consistency of normal vector orientations [170], the curvatures of points [171], and the smoothness of the surface [161] are widely used criteria for continuing or stopping the growing. In [168], PCA-based local characteristics were recently adopted as growing criteria for their distinctiveness.
Fig. 16. Over- and under-segmentation of point clouds. (a) Original point cloud, (b) result with over-segmentation (area in the dashed box), and (c) result with under-segmentation (area in the dashed box) (reproduced with permission from the author of [182]).
It is noteworthy that the elements used for growing are not limited to original points; patches from clustered points can also be utilized. For instance, in [20], the octree architecture and the region growing framework are combined for rapid surface patch segmentation. Likewise, the octree-based voxel structure, in tandem with graph-based slicing, is applied to segment cylindrical artifacts in industrial scenarios in [169]. In [172], voxels serve as patches for the growth of planes, with the similarity between eigenvalue-based features as the criterion. In [173], fragments with planar surfaces identified by RANSAC or the 3-D HT are used to represent all surfaces in the scene. In [101], TIN meshes are used for the growth of building roof primitives. Throughout these approaches, the identification of neighboring elements is essential to growing; the most commonly used strategy is to find the directly adjacent elements or the k-nearest neighbors of the element under examination.
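A minimal smoothness-based region growing sketch is given below, assuming precomputed normals. The angle threshold is a hypothetical growing criterion of the kind discussed above, and a full variant would order the seeds by curvature, lowest first, rather than take them in index order.

```python
import numpy as np
from collections import deque
from scipy.spatial import cKDTree

def region_growing(points, normals, k=16, angle_thresh_deg=10.0):
    """Grow smooth regions: a neighbor joins if its normal deviates little."""
    tree = cKDTree(points)
    cos_thresh = np.cos(np.radians(angle_thresh_deg))
    labels = np.full(len(points), -1)     # -1 = not yet assigned
    region = 0
    for seed in range(len(points)):       # a full variant would order seeds
        if labels[seed] != -1:            # by curvature, lowest first
            continue
        labels[seed] = region
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            _, neighbors = tree.query(points[i], k=k)
            for j in neighbors:
                # Smoothness criterion: near-parallel normals (sign-agnostic).
                if labels[j] == -1 and abs(normals[i] @ normals[j]) > cos_thresh:
                    labels[j] = region
                    queue.append(j)
        region += 1
    return labels
```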
3) Clustering-Based Segmentation
The third major category consists of segmentation methods based on clustering. Such methods investigate the relation or resemblance of adjacent points in a given region according to their spatial coordinates and geometric characteristics. Points whose proximity or similarity meets the appropriate thresholds are treated as associated, or even connected; on this basis, all associated points are aggregated into a single cluster, namely a complete segment. In comparison to region growing-based methods, clustering-based methods do not involve the setting of seeds; instead, only the connectivity between points is checked. To ensure the right balance between the completeness of segments and the preservation of their edges, the clustering criteria and the clustering manner are the crucial aspects. The former judge whether two points should be connected or not, while the latter decides which points become candidates in the next clustering iteration and the strategy for finding them. The Euclidean distance [174], the angle between normal vectors [20], and the consistency of densities [175], [176] between two elements (e.g., points or patches) are typical criteria used as guidelines for clustering. With respect to the clustering manner, mean-shift [177] and connected relations [31] are the most common strategies.
One of the major bottlenecks for clustering-based methods is the computational cost, which depends on the complexity with which similarity and/or proximity are calculated and cost functions are optimized. Currently, multiple clustering criteria are normally combined to create a reliable segmentation method, which significantly increases computational costs. In addition, the definition of optimal clustering thresholds influences the granularity of the segmented clusters; with unsuitable thresholds, under-segmentation may occur [e.g., the under-segmented roofs shown in Fig. 16(c)]. Recently, patch-based clustering approaches, which utilize 3-D patches composed of points instead of individual points as basic elements, have attracted more attention. For instance, voxels [178]–[180], slices [181], and planar fragments [173] are used as elements for clustering. Similar to the elements used in region growing, the generation of patches is actually a preclustering that creates over-segmented elements. These elements are usually capable of finding the edges between objects and facilitate the boundary preservation of segments. Furthermore, the use of a patch structure greatly reduces computational costs and the adverse effects of outliers and varying point densities [24]. Nevertheless, the resolution (i.e., the size of each element) of the patches impacts the accuracy and the retention of details of the segments.
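As one off-the-shelf instance of clustering by Euclidean proximity and density consistency, the sketch below uses DBSCAN from scikit-learn; eps and min_samples are hypothetical thresholds of the kind whose selection is discussed above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Connectivity is judged purely by Euclidean proximity: points closer than
# 'eps' (here a hypothetical 0.1 m) are connected, chains of connected
# points form one cluster, and the label -1 marks sparse outliers.
points = np.random.rand(50000, 3)
labels = DBSCAN(eps=0.1, min_samples=10).fit_predict(points)
num_clusters = labels.max() + 1
```

Choosing eps too large merges neighboring objects into one cluster, i.e., exactly the under-segmentation effect noted above.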
4) Energy Optimization-Based Methods
Energy optimization-based methods convert point cloud segmentation into an optimization task over energy functions, defined under a specified data structure and energy estimation process. The aforementioned region growing and clustering-based methods are actually implemented under a local strategy, since neighborhood examination dominates the entire segmentation process. Compared with region growing and clustering, energy optimization-based methods convert the partition of points into a problem of minimizing the costs (i.e., the energy) of assigning points to different possible groups. This means that assigning a point to a cluster creates a cost, and only when all points are assigned to the optimal clusters is the sum of all assignment costs minimized. Thus, finding an optimal clustering of points means finding the solution that minimizes a designed cost function on a global scale. The graphical model is the most common approach used to depict points directly with a mathematically sound structure that uses context to deduce hidden information from the provided observations [183]–[185]. The two major groups of methods using graphical models are those using regular weighted graphs and those using Markov-based graphs. The former group includes normalized cuts [186], min-cuts [187], and graph-based segmentation [188], [189], while the latter group pertains to approaches such as the Markov random field (MRF) [190] or the conditional random field (CRF) [191], which are solved by the graph-cut algorithm or its variants [192]. A large topological range of the constructed graph can produce better segmentation results for methods using graphical models, but such a complex and large graph leads to a heavier computational burden [193].
Apart from graphical models, other energy-based methods like the level set [194] and global energy minimization [195], [196] can also be used for separating planes from the entire scene. It should also be noted that energy optimization-based approaches are often used to refine initial segmentation outcomes [129], [197], [198], formulating segment refinement as a labeling optimization task. For some specific applications, energy optimization-based methods are the preferred solution for segmentation; for example, plane segmentation is better formulated as a global optimization problem concerning the entire scene [196]. In the quest for a globally optimal solution, optimization-based methods are likely to be more resilient to elevated noise and outlier proportions than the other strategies [199], but they also incur higher computational costs [195], [196].
B. Classification of Point Clouds
Compared with segmentation, classification offers semantic labels for individual points or grouped-point primitives and is a crucial step in parsing the point clouds of 3-D scenes. We display an example of the classification of a set of points with semantic labels in Fig. 17. Recent progress in machine learning and computer vision has shown that a well-designed solution to 3-D point cloud classification is suitable for the labeling task, even in a real dynamic world. Indeed, semantic interpretation via classification is also a vital step for object reconstruction, owing to the necessity of parsing the semantics of building points. For the labeling of points or primitives, classification can be implemented in either a rule-based or a data-driven manner. The rule-based manner classifies points or primitives with predefined rules or prior knowledge; typical examples include knowledge-based classification [102] and model-based recognition [132], among others. In other words, the classifier is manually designed and estimated from prior knowledge. By contrast, data-driven manners require a learning process with labeled training samples, in which the classification rules, as well as the classifier, are learned and optimized during training. Both manners, however, typically comprise three essential steps [200]: the recovery of a local neighborhood for the point or primitive, the description of the geometry based on the 3-D information of the local neighborhood, and the classification of all 3-D points based on their respective geometric descriptions.
Sketch of classification of the point cloud. (a) Real scene, (b) original point cloud, and (c) labeled points (reproduced with permission from the author of [201]).
We show a standard workflow for point-based classification using features and classifiers in Fig. 18. To label a point, a local neighborhood of this point must first be selected. According to the spatial positions and distribution of points within this neighborhood, geometric features can be extracted and then selected. Based on these features, a trained classifier can assign the point a label indicating a specific kind of object in the real world. Researchers have recently reported many accomplishments in solving problems relating to these three steps. Nevertheless, many difficulties remain in distinguishing building structures in complex scenes from point clouds, such as random point sampling, varying point density, complex structural components, and heterogeneous data sources. In addition, when dealing with large 3-D point clouds, computing costs should also be taken into consideration.
1) Recovery of the Local Neighborhood
The recovery of the neighborhood, which determines the local area of a certain point or primitive, is necessary if one is to represent a point or a primitive with detailed information. Depending on the purpose, the description of various object details relies on the local context of all points within the chosen neighborhood. However, the scale and shape of objects vary, so the selected neighborhood should be capable of describing geometric information at various scales and ranges. The ways of defining neighborhoods can be roughly split into two types: single-scale neighborhoods and multiscale neighborhoods. The first type derives characteristics from a fixed-size neighborhood; the second, by contrast, adopts flexible neighborhood sizes. A neighborhood can be defined by a given shape of a certain size centered at a point of interest. For example, the spherical [203] [see Fig. 19(a)] and cylindrical [204] [see Fig. 19(b)] neighborhoods around an investigating point with an LRF/LRA are the most widely used single-scale neighborhood definitions. The investigating point, also termed the key point or point of interest, is the point whose features are to be extracted and is generally rich in information content. Rather than defining a shape as the neighborhood, the neighborhood of each point can also be defined by a certain number of nearest neighbors (i.e., the k-nearest neighbors), which adapts the spatial extent of the neighborhood to the local point density.
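Both neighborhood definitions can be sketched with a k-d tree as follows (the file name, radius, and k are illustrative assumptions, not values from the cited works):

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.loadtxt("scene.xyz")   # hypothetical N x 3 array of coordinates
tree = cKDTree(points)
p = points[0]                      # the investigating point (point of interest)

# Single-scale spherical neighborhood: all points within radius r of p
r = 0.5                            # illustrative radius in scene units
sphere_idx = tree.query_ball_point(p, r)

# k-nearest-neighbor definition: a fixed number of closest points, whose
# spatial extent adapts to the local point density
k = 30
dists, knn_idx = tree.query(p, k=k)

# A vertical cylindrical neighborhood can be emulated by filtering a
# generous 3-D query on the horizontal (xy) distance only
cand = tree.query_ball_point(p, 5.0)
xy_dist = np.linalg.norm(points[cand][:, :2] - p[:2], axis=1)
cyl_idx = [i for i, d in zip(cand, xy_dist) if d <= r]
```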
In [207], a feature selection technique is used within a multiscale neighborhood to enhance the performance of feature engineering. The multiscale neighborhood can be described as a hybrid of simple neighborhoods of various forms and sizes, with identical features being extracted separately from each scale for further feature encoding. Another option is to use over-segmented or preclustered patches [201] to specify an adaptive neighborhood. For example, point-based hierarchical clusters with a latent Dirichlet allocation (LDA) model are generated in [208], in which cluster features are extended for the classification of objects of different sizes. Similarly, in [209], the authors introduce a multilayer framework within an octree partition that generates features comprising different levels of subspace to detect single entities (e.g., vehicles).
2) Description of the Local Geometry
A local geometry description is intended to abstract the local geometric information in a defined neighborhood of the investigating point or primitive. The description encapsulates the derived information in feature vectors, typically in the form of a histogram [210], with the similarity or dissimilarity of the feature vectors serving as the basis on which the classifier infers labels. During the last decade, a wide range of feature extraction algorithms has been introduced, with various description methods of local geometry developed. According to the level of geometric detail described, they can generally be grouped into two main categories: low-level and high-level descriptions.
The low-level description consists only of fundamental geometric properties of the neighborhood (e.g., dimensionality) and the spatial arrangement (e.g., surface curvature) of the 3-D points within it [200]. In other words, if we depict the point cloud in the frequency domain, a low-level description focuses only on the low-frequency components. As a representative, the eigenvalue-based feature description is derived from the tensor of coordinates encoding the 3-D structure, namely the 3-D covariance matrix of the coordinates of all points in a local neighborhood [211], [212]. This 3-D structure tensor, characterized by the three eigenvalues of the covariance matrix, can be viewed as a dimensionality reduction of the local structural information. More precisely, these three eigenvalues describe the local properties of 1-D, 2-D, and 3-D primitives. A number of local 3-D structural features, including eigenentropy, scattering, omnivariance, etc. [200], have been established, which allow a more intuitive depiction of volumetric structures [213]. It is worth mentioning that the features of the low-level description are usually adapted to the scale of the chosen area so that the optimum neighborhood size can be identified. In [200], a thorough investigation was carried out on how to improve the distinctiveness of low-level geometric features by adjusting neighborhood sizes and selecting feature subsets to remove trivial features. Through optimization of the neighborhood dimensions, a suitable low-level geometric combination can provide a higher quality classification, although the optimization itself leads to significant increases in processing time and memory usage. In contrast, deriving features through 3-D local shape descriptors constitutes a kind of high-level description, namely an abstracted or compact depiction of point characteristics based on their support region (i.e., the neighborhood in our terminology) [214].
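A minimal sketch of the eigenvalue-based description discussed above is given below; the feature definitions follow conventions common in the literature, while the exact feature set used in [200] is broader:

```python
import numpy as np

def eigen_features(nbhd):
    """Low-level eigenvalue-based features of a local neighborhood
    (N x 3 array of coordinates), following common definitions."""
    c = nbhd - nbhd.mean(axis=0)
    cov = (c.T @ c) / len(nbhd)          # 3-D covariance (structure) tensor
    lam = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues, descending order
    lam = np.clip(lam, 1e-12, None)      # guard against degenerate cases
    l1, l2, l3 = lam
    e = lam / lam.sum()                  # normalized eigenvalues
    return {
        "linearity":    (l1 - l2) / l1,  # 1-D (line-like) structure
        "planarity":    (l2 - l3) / l1,  # 2-D (plane-like) structure
        "sphericity":   l3 / l1,         # 3-D (volumetric) structure
        "omnivariance": (l1 * l2 * l3) ** (1.0 / 3.0),
        "eigenentropy": -(e * np.log(e)).sum(),
    }
```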
If we depict the point cloud in the frequency domain, a high-level description focuses more on the high-frequency components. As concluded in [6], these 3-D local shape descriptors can be classified into three major categories [210], [215]: descriptors encoding the spatial distribution of points in the neighborhood, descriptors depicting the geometric signature of points on the local surface, and descriptors combining both spatial distributions and geometric signatures. Descriptors of the first category typically specify a local reference frame or axis (LRF or LRA) for the investigating point, according to which the 3-D neighborhood is separated into a certain number of bins. By collecting the distribution of spatial locations over these bins in the 3-D support region, a histogram is encoded. This category includes the spin image (SI) [216], the 3-D tensor [217], and the 3-D shape context (3-DSC) [76], as well as its variations such as the unique shape context (USC) [214] and the cylindrical 3-DSC [218]. For descriptors of the second category, the feature histogram is generated by encoding concise geometric attributes (e.g., orientations of normal vectors or surface curvatures) within the 3-D neighborhood of the investigating point. Point feature histograms (PFH), the more efficient fast point feature histograms (FPFH) [132], the local surface patch (LSP) [219], and the radius-based surface descriptor (RSD) [174] are representatives of this category. Descriptors of the last category utilize a hybrid structure that incorporates histograms of the spatial distribution together with geometric signatures. An example is the signature of histograms of orientations (SHOT) [214], which encodes histograms of normal-vector directions according to the spatial locations of points in a spherical neighborhood. The output histogram of SHOT encapsulates both the global distribution of points and the contextual histograms representing the angles of the normal vectors.
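To make the first category concrete, a highly simplified spin image [216] can be sketched as follows, assuming a unit normal vector as the LRA; the bin count and support radius are illustrative choices:

```python
import numpy as np

def spin_image(p, n, nbhd, bins=8, support=0.5):
    """Simplified spin-image descriptor: neighbors are accumulated in a
    2-D (alpha, beta) histogram that is invariant to rotations about
    the normal axis (LRA) at the investigating point p.
    p: (3,) point, n: (3,) unit normal, nbhd: (N, 3) neighborhood."""
    d = nbhd - p
    beta = d @ n                                   # signed height along the normal
    alpha = np.sqrt(np.maximum((d * d).sum(1) - beta**2, 0.0))
    hist, _, _ = np.histogram2d(
        alpha, beta, bins=bins,
        range=[[0.0, support], [-support, support]])
    return hist.ravel() / max(hist.sum(), 1.0)     # normalized feature vector
```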
The definition of the local reference frame and the scale of the neighborhood have significant effects on the quality of the depiction of local geometry for both low- and high-level descriptions. A consistent and reliable local reference frame should be invariant to rigid transformations, which enables the extraction of robust features [215]. Regarding the scale of the neighborhood, a large neighborhood encodes more data; however, it renders the local shape descriptors more prone to occlusion and noise, which can significantly affect the effectiveness and reliability of feature extraction [200], [220]. If the point cloud is polluted with distortion and outliers, the LRF/LRA can be skewed, which particularly affects the accumulation of the spatial positions of points. This is especially relevant for photogrammetric point clouds, whose geometric quality is inferior to that of point clouds created by laser scanning. To address this problem, in [6] a robust estimator (i.e., MLESAC) is used to define the principal axis instead of the one identified via PCA. Equally important for the representation of features is the shape of the neighborhood, which defines the respective neighborhood range encapsulating all considered 3-D points. Developing a local 3-D shape descriptor with a robust LRF and an application-specific neighborhood could be a promising way to improve accuracy in particular applications.
3) Classifier Used for Labeling Inference
Derived feature vectors representing geometric properties must eventually be fed into the classifier in order to infer semantic labels. As mentioned, the majority of current labeling approaches favor a data-driven classification strategy, also termed a supervised solution. Supervised classification means learning a classifier from training data, that is, from feature vectors and their associated labels. This form of classification can be conducted in several different ways. Point-based classification is a typical category, in which each point is labeled during the inference process [220]. By comparison, segment-based classification has gained interest due to its ability to simultaneously distinguish individual objects in a scene; here, preclustering or segmentation is performed in advance to produce homogeneous primitives [221].
Commonly used classifiers for point cloud classification include nearly all supervised learning classifiers, such as support vector machines (SVM) [222], [223], AdaBoost [224], RF [212], and CRF [206], [225]. For an ALS dataset covering a large area, the work of [224] proposes a classification framework using SVM and compares the output of different variants of the SVM algorithm, publishing a comprehensive analysis of multiclass classification results. In [226], the AdaBoost classifier is used together with the input ratio for identification tests and for measuring attribute significance. The results of the RF classifier enable a comparison of the various approaches. In [212], more than twenty features are extracted from LiDAR points and pruned by iterative backward feature removal; with the aid of an RF classifier, points with reliable labels in large-scale urban scenes are obtained and the variable importance is estimated. Further research on the RF classifier is presented in [227], where the significance of features and the strengths of the classifier are tested with permutation accuracy measures.
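As an illustrative sketch of supervised point labeling with an RF classifier using scikit-learn (the file names, feature set, and hyperparameters are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical arrays: per-point feature vectors (e.g., the eigenvalue
# features sketched earlier) and manually annotated semantic labels
X_train = np.load("train_features.npy")   # shape (N, F)
y_train = np.load("train_labels.npy")     # shape (N,)
X_test = np.load("test_features.npy")

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rf.fit(X_train, y_train)
labels = rf.predict(X_test)               # inferred semantic labels per point

# Variable importance, in the spirit of [212] and [227], falls out of
# the trained forest and can guide feature selection
importance = rf.feature_importances_
```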
A multiscale CRF implementation is provided in [225] to improve the classification accuracy of TLS points. In contrast to other classifiers, CRF offers incremental labeling accuracy via logistic regression. For instance, a context-based labeling approach with a minimal CRF is developed in [213] for a high-point-density MLS dataset of complex urban scenes; the reliability of the CRF is then objectively measured by integrating the outcomes of point-wise labeling with the RF classifier. The researchers in [206] also incorporate the RF classifier into a CRF framework to boost classification accuracy, evaluating the RF feature importance and classifying 3-D scenes with a hierarchical CRF. A multiscale neighborhood selection strategy is applied in [207], grouping neighbors into subspaces of three different dimensions and integrating only weakly correlated features, which improves classification efficiency with the RF classifier. Recently, over-segmentation based on a preclustered data structure (e.g., supervoxels [228]) has often been used to reduce the data size and increase computational performance, together with the classifier and the method from [201].
4) Deep Learning for Classification
In recent years, high-performance computational hardware has significantly boosted deep learning techniques, which have proven to be powerful tools for point cloud classification tasks. Unlike traditional point cloud classification approaches, deep learning techniques are usually implemented in an end-to-end way, combining feature extraction and classification in a single network, so that these two parts usually cannot be separated in deep learning-based methods.
In their initial stage, deep learning methods for point clouds were based on a 3-D-to-2-D projection strategy [229], [230]. For example, in [231], 2-D images are generated by projecting point-based 3-D local features into a 2-D matrix; after the labeling of the 2-D pixels, the semantic label of each pixel is back-projected to the corresponding 3-D point. Alternatively, deep learning for point clouds can be implemented in volumetric ways, inspired by voxel-based point cloud classification. For instance, in [232], an octree-based convolutional neural network is proposed, in which the averaged normal vectors of the points in each leaf serve as the input to a CNN. In [233], 3-D points are organized in a voxel structure, and unified features of individual voxels are generated and passed to a region proposal network, tackling the sparsity of 3-D points. Unlike the conventional schemes transforming points into other formats, one of the breakthroughs of deep learning in point cloud classification is the emergence of PointNet and its derivatives [234]–[236], which introduced a novel scheme that processes points directly. In these networks and their improved successors, point sets are processed directly, so that an end-to-end classification framework is achieved without initial preprocessing of points, dramatically streamlining semantic labeling. On the basis of PointNet, a multiscale approach is added in [237], which has successfully been applied to large-scale ALS point cloud classification. Recently, graph structures combined with neural networks have also been adopted in different fields with remarkable performance. For example, a graph CNN was first implemented with mini-batches for hyperspectral image classification with impressive results [238]. In point cloud processing, graph CNNs, which use a graphical model to organize the points fed into the network, have also shown promising results in various applications [26], [239], [240].
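As a deliberately reduced sketch of the direct point-processing idea behind PointNet [234] (the layer widths, class count, and the omission of the input/feature transform networks are our simplifications), a shared per-point MLP followed by a symmetric max pooling can be written in PyTorch as:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Simplified PointNet-style classifier: a per-point MLP shared
    across all points, followed by max pooling, which makes the output
    invariant to the ordering of the input points."""
    def __init__(self, num_classes=8):
        super().__init__()
        # 1x1 convolutions act as an MLP applied to every point
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, pts):          # pts: (batch, 3, num_points)
        f = self.mlp(pts)            # per-point features
        g = f.max(dim=2).values      # order-invariant global feature
        return self.head(g)          # class logits per point cloud

logits = TinyPointNet()(torch.randn(2, 3, 1024))  # two clouds of 1024 points
```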
Compared with classic approaches, these deep learning-based methods can provide less noisy results, with a regularity that arises, somewhat uncontrollably, as a by-product of the metaparameterization of the networks [241]. Moreover, the performance of some deep learning techniques (e.g., PointNet) depends on the sampling of the input data, since noise and errors can be induced in the splitting, down-sampling, and interpolation processes for objects of varying scales, especially at the boundaries between objects. Thus, postprocessing (e.g., interpolation or smoothing) should be applied to the network outputs. Recently, deep learning-based methods have become popular for classification tasks in the construction field. For example, in [242], road cracks are detected and extracted from laser-scanned range images via deep learning. In [243], transfer learning is applied to acquire labels for point clouds from online photos. In [244], point clouds of building interiors are semantically segmented via augmented training datasets and deep learning.
C. Geometric Modeling of 3-D Primitives
Geometric modeling is intended to generate 3-D shape models of labeled primitives, such as walls, floors, and ceilings. The representation used for the output of the reconstruction is usually parametric modeling (e.g., model fitting or matching), surface modeling (e.g., boundary representation), or volumetric modeling [e.g., constructive solid geometry (CSG) representation]. In Fig. 20, we give an illustration sketching the surface modeling of a set of discrete photogrammetric points.
Sketch of modeling of the point cloud. (a) Original photogrammetric point cloud acquired by UAV. (b) Surface models of individual buildings.
As stated in [8], volumetric object extraction and parametric object description are the most applicable methods for as-built BIM reconstruction, as BIMs are defined mainly by volumetric and parametric representations. However, the reconstruction of models with surface representations is more common in the field of reverse engineering. This is because the solid geometry of as-built BIMs in volumetric models cannot be thoroughly observed or deduced, owing to occlusions during measurement and the absence of topological interactions (e.g., the manner of connections) between structures; only observable surfaces with visible geometric connections and configurations can be reconstructed. Therefore, we typically recreate a mathematical representation of parametric structures and map them to a volumetric model with additional information (e.g., a CAD base) or parametric knowledge (e.g., a grammar dictionary of structures). In Fig. 21, we present a distribution of the selected literature (listed in Table IV) regarding the reviewed modeling methods. As can be seen in the figure, surface modeling is still the mainstream method in plenty of applications, but many newer works prefer parametric and volumetric methods. The pros and cons of these modeling methods are discussed in further detail in the following sections.
Statistics of studies from the selected papers (listed in Table IV) using different modeling methods.
1) Parametric Modeling
The most striking feature of parametric modeling is that the geometry can be represented concisely using standard mathematical expressions (e.g., cylinders, cubes, or planes). There are two major parametric modeling strategies. The first is model fitting, which has been discussed in the review of segmentation techniques that analyze points at local or global scales, employing certain geometric models according to their geometric attributes (e.g., spatial positions and normal vectors). Among the model-fitting methods using mathematical equations, HT [159]–[161] and sample consensus methods (e.g., RANSAC [124] and maximum likelihood estimation SAC (MLESAC) [149]) are popular representatives. Since these algorithms have been discussed in the introduction to segmentation methods, the details are not repeated here. As we have pointed out, however, challenges arise because model fitting cannot deal with objects or structures having complex and irregular surfaces. One possible solution for modeling objects whose surfaces have complex mathematical expressions is shape matching. This approach relates directly to the use of local 3-D shape descriptors, discussed in the section on high-level geometry description. To implement this method, several reference objects with known mathematical expressions are needed to build a dictionary, and geometric primitives are matched with those references according to features extracted via shape descriptors (e.g., feature histograms) [245]. The respective pairs of primitives and references are then considered matched, and each primitive is assigned the model and mathematical expression of its reference.
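As a minimal sketch of sample-consensus model fitting in the spirit of RANSAC [124] (the iteration count, tolerance, and plain three-point plane hypothesis are illustrative choices, not parameters from the cited works):

```python
import numpy as np

def ransac_plane(points, iters=500, tol=0.02, seed=0):
    """Fit a plane n.x + d = 0 to an (N, 3) point array by repeatedly
    hypothesizing a plane from 3 random points and keeping the
    hypothesis supported by the largest number of inliers."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = np.array([], dtype=int), None
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:                    # degenerate (collinear) sample
            continue
        n = n / norm
        d = -n @ sample[0]
        dist = np.abs(points @ n + d)      # point-to-plane distances
        inliers = np.flatnonzero(dist < tol)
        if len(inliers) > len(best_inliers):
            best_inliers, best_model = inliers, (n, d)
    return best_model, best_inliers
```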
2) Surface Modeling
Surface modeling, also referred to as surface reconstruction, is a type of nonparametric representation. Surface models are useful for modeling complex geometric entities (e.g., incomplete structures during construction). Unlike in parametric modeling, the representation may still be quantified with parameters, but the geometric shape of the primitives is not necessarily described by a fixed mathematical model. Instead, a surface representation (e.g., with polygons or meshes) is generated based on the actual underlying geometry. The most popular approaches to surface modeling involve the boundary-based description (B-rep) and the mesh-based representation [8]. The B-rep defines the 2-D or 3-D contours of the primitives as surfaces and then symbolizes the boundaries of primitives through indirect or implicit lines. These lines form a closed polygon as the surface description, with the 3-D model consisting of a combination of such surfaces. Boundaries of primitives (e.g., structural components) are generated by the use of the alpha-shape algorithm in [6] and [130], and then represented with polygons using rotating calipers [246] and cell decomposition [247], respectively. The closed surfaces of 3-D models can be achieved by energy minimization of surface orientations [114], horizontal slicing and vertical projection [129], [130], [248], stochastic analysis, or a graph editing dictionary [249]. The modeling of surfaces is simple and easy to implement. However, the modeling process always attempts to approximate the complex geometry of an object by means of simple polygonal surfaces (e.g., planes and curved surfaces), which leads to a tradeoff between detail and abstraction when creating simple surfaces. Furthermore, surface modeling captures only the visible part of a structure, which requires a further transformation into a grammar-rich representation of building components. Since many residential buildings have repeated structures or standard shapes, pattern-based modeling may estimate such details using predefined building configurations [101]. The mesh-based representation, a viable alternative, defines the surface by meshes (e.g., triangles [51] or cubes [174]). A mesh-based model is easy to implement and accurately describes complex surfaces. In Fig. 22, we provide the mesh- and surface-based modeling workflow given in [141], which illustrates a typical procedure using the Type II strategy: points of buildings are first classified and then clustered into primitives with RANSAC. However, it is challenging to parameterize meshes, as they lack an explicit parametric structure.
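A minimal 2.5-D meshing sketch is given below, assuming a roughly horizontal, pre-segmented point set stored in a hypothetical file; practical pipelines would rather use alpha shapes or dedicated surface reconstruction to respect concave boundaries and holes:

```python
import numpy as np
from scipy.spatial import Delaunay

# Points of a roughly horizontal segment (e.g., a floor) are projected
# to the xy plane and triangulated; each triangle indexes three of the
# original 3-D points, yielding a simple vertex/face mesh.
segment = np.load("floor_points.npy")   # hypothetical (N, 3) segment
tri = Delaunay(segment[:, :2])          # triangulate in 2-D
faces = tri.simplices                   # (M, 3) vertex indices per triangle
mesh = (segment, faces)                 # vertices + faces as the mesh model
```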
3) Volumetric Modeling
At present, volumetric modeling has not been comprehensively employed to create 3-D models of objects from point clouds. This approach usually requires prior knowledge to help determine the volumetric geometry since, in the majority of cases, only part of the surface can be observed and measured. Thus, the topology between two structural elements cannot be accurately identified through surface observations alone. In Fig. 23, we illustrate this ambiguity problem when recovering volumetric models of connecting walls merely from surface observations: using only surface observations, the way the two wall elements connect cannot be inferred. To overcome this problem, a common strategy is to assume that the structures can be described by a combination of a small number of volumetric primitives [8] (e.g., planes, superquadrics, and generalized cylinders). Such volumetric primitives may be stored in a general CAD repository [1], [104], with these basic design models being linked to artifacts by seeking the best alignment between model and artifact. The volumetric primitive is selected to represent the matched part of the object, and the entire object is reproduced by a combination of such volumetric primitives. A volumetric representation can also be obtained by transforming a surface-based representation [250], which could benefit from mainstream trends in advanced surface modeling techniques, as sketched in the example below.
Ambiguity problem in the volumetric modeling using surface observations of two connected walls, from which there are two possible topological connections. (a) End of Wall II is connected to the surface of Wall I or (b) the end of Wall I is connected to the surface of Wall II.
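To illustrate the mapping from surface observations to a volumetric parameterization, the following conceptual sketch (our own, not a method from the cited works) turns two fitted, approximately parallel wall faces into a center-plane/thickness description; resolving the end topology of Fig. 23 would still require prior knowledge or rules:

```python
import numpy as np

def wall_box(n, d1, d2, footprint):
    """Map two fitted wall faces n.x + d1 = 0 and n.x + d2 = 0 (assumed
    parallel, sharing the unit normal n) onto a swept-box, CSG-style
    wall primitive defined by a center plane and a thickness."""
    thickness = abs(d1 - d2) / np.linalg.norm(n)
    d_center = 0.5 * (d1 + d2)
    # 'footprint' stands for a polygon on the center plane delimiting
    # the wall extent; estimating it, and the connections at wall ends,
    # is the knowledge-driven part of volumetric reconstruction.
    return {"normal": n, "offset": d_center,
            "thickness": thickness, "footprint": footprint}
```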
D. Limitations on Current Techniques
Despite the abundance of research conducted on the matter, the techniques designed to reconstruct buildings still have limitations that restrict their application to practical projects. These limitations concern four significant aspects: efficiency, effectiveness, the generality of uses, and robustness to disturbances.
1) Efficiency and Implementation
The first limitation is the efficiency of current methods and algorithms, which directly relates to the computational cost and the complexity of implementation. A method with high computational cost requires high-performance hardware, which goes against the trend of using low-cost Internet of Things (IoT) devices for the automated and intelligent monitoring of construction, since these portable and unattended devices usually have limited computational power. The complexity of implementation significantly affects the efficiency of executing the proposed methods or workflows as well, and the complexity of the employed algorithms also influences the applicability of the proposed approaches. Fortunately, current algorithms and methods still have considerable room for improving efficiency.
2) Effectiveness of Methods
The second limitation is the effectiveness of current methods: the performance of the developed methods is limited by current techniques, leading them to underperform when applied in practice. Specifically, two aspects should be mentioned, namely the levels of detail and accuracy, as well as the levels of automation.
a) Levels of detail and accuracy: For reconstructed 3-D models of buildings, the levels of detail (LODs) and levels of accuracy (LOAs) [251] are essential indicators of the quality, complexity, and applicability of the modeling [252]. The LODs and LOAs of a building model, indicating how detailed and how accurate a model is, are usually set according to various concerns, including data acquisition cost, labor expense, and target applications [253]. For applications in the fields of AEC/FM, a building model with high LODs and LOAs increases usability, but requires more storage space at a higher cost. Moreover, the reconstruction of buildings from point clouds is a reverse-engineering task, which yields less information than a BIM generated from given blueprints and makes it difficult to fully recover all details. Regarding accuracy, not only the level of accuracy matters, but also the type of accuracy; both must be considered, as they relate directly to the data structures used and the topology of objects.
b) Levels of automation: In current construction practice, the level of automation is an essential criterion that affects the performance of algorithms and methods. However, the current systems and frameworks integrating the abovementioned data acquisition and processing techniques still exhibit a low level of automation, with many manual steps involved. Moreover, the setting and tuning of parameters and thresholds still need human intervention. To avoid manual work, the design concept should aim for a more adaptive and intelligent workflow that incorporates prior knowledge in software development; such prior knowledge can be derived from existing documented records.
3) Generality and Reproducibility of Uses
The third limitation is the generality of the techniques. From a technical perspective, civil engineering and the construction industry are highly standardized fields, yet the related projects and applications are each considered unique, because the geological and climatic conditions, locations, legal issues and policies, as well as site situations, may differ from one project to another. Therefore, the developed algorithms and methods should generalize across various tasks, which involves two major aspects: the universality of methods and the interoperability of data formats.
a) Universality of methods: In many published studies, the developed solutions depend strongly on prior knowledge or preconditions drawn from the data itself, which is more akin to a data-driven strategy. For example, many classification methods only work on certain classes or shapes of objects and cannot generalize to complex environments. Such solutions are counterproductive for industrial implementation in practical projects, which requires a standardized workflow or modular processing. In Fig. 24, we give an illustration from [22] showing that, for the same construction site, data sourced from different sensors presents different levels of data quality. Thus, the proposed method should be universal enough to deal with various low-quality data. If a proposed reconstruction method lacks universality, it cannot be evaluated with benchmark datasets for a fair comparison of its performance.
Different types of point clouds at the same scene. (a) TLS point cloud and (b) photogrammetric point cloud. (reproduced with permission from the author of [22]).
b) Interoperability of data formats: This is a practical issue for data exchange between the software and systems of different research or engineering fields. Different fields maintain different and independent data format standards, so that data created in one format can hardly be shared by the software and systems of another field. For example, both the PLY and IFC formats can describe the 3-D shape of an object; however, even if they maintain the same 3-D shape in the workspace, their ways of parameterizing the model are totally different, namely surface and volumetric modeling, respectively, so that it is almost impossible to directly modify or utilize them across systems.
4) Robustness to Disturbances
The last major limitation of current techniques is their robustness to disturbances. Here, disturbances cover the outliers, noise, and systematic errors in low-quality data; robustness is the ability of a method to cope with such errors during its execution. Urban mapping, construction, and infrastructure-related projects are always dynamic and complex, which means that disturbances like temporary objects, noisy backgrounds, and outliers in datasets are inevitable. For example, defects in point cloud data due to occlusion in cluttered scenes and noisy data due to registration errors lower the quality of the data. Thus, any algorithm or method should be robust. However, this idea has not yet been fully implemented in methodological design. Overly complicated algorithms or methods increase the risk of failed output. An optimized solution should follow Occam's razor, consisting of a single dominant framework comprising several independent core processing steps; this eases both the enhancement of the robustness of each step and troubleshooting. Moreover, such a modular design is also a prerequisite for any future upgrading of the entire workflow. To increase the robustness and reliability of proposed methods, much software and many toolkits have to involve manually operated steps, but this simultaneously hinders automation.
Research Gaps
So far, we have surveyed commonly used 3-D point cloud data and a wide variety of existing methods for building reconstruction from point clouds, and we have discussed their restrictions and limitations. However, considerable gaps remain between the state of the art and the application demands, gaps which have previously been ignored or only partially addressed. Specifically, the research gaps involve three essential aspects: the development of public benchmarks and evaluation, the adoption and adaptation of computer vision and machine learning, and new trends in novel devices and techniques.
A. Public Benchmark and Evaluation
For any research application, the performance of a developed algorithm or method should be assessed on public standard benchmark datasets, which contribute significantly to algorithm development, evaluation, and comparison [254]. For building reconstruction from point clouds, we also need standard frameworks for conducting the evaluation process, supported by benchmark datasets. Unlike urban mapping, construction and infrastructure-related applications are complex and result in unique projects, requiring specific labor, materials, equipment, and processes. This makes it challenging to generate benchmarks for assessing the performance of algorithms and methods, because the applications vary drastically from one to another. Currently, the majority of evaluations are based on self-made ground truth or on benchmarks from other fields (e.g., computer vision). Blueprints and BIMs are authoritative references that can serve for evaluating surface-based or volumetric modeling performance. In Fig. 25, we show an illustrative sketch of a 2-D blueprint and the created 3-D BIM corresponding to measured point clouds. However, they are only available when the designs or BIMs are accessible. One proposed benchmark is the ISPRS Benchmark on Indoor Modeling, which has brought the validation of indoor modeling to a new stage. However, for many reverse-engineering applications in archaeology or architecture, there are still very few existing BIMs or available blueprints for validating 3-D reconstruction results. In the future, more benchmarks for building or object reconstruction in the scenario of construction projects should be developed.
Possible reference data. (a) Illustrative sketch of a 2-D blueprint. (b) 3-D BIM. (c) Photogrammetric point cloud.
Evaluations in civil engineering and construction projects should differ somewhat from those used in computer vision and remote sensing, in which theoretical accuracy is a significant index. Current evaluation methods emphasize technical issues alone, which is detached from application reality. Thus, an evaluation system concerning both data acquisition and data processing should be established for selecting and assessing appropriate methods for different applications.
B. Adoption and Adaptation of Computer Vision and Machine Learning
Computer vision (CV) and machine learning (ML) are two of the most active branches of computer science concerned with point clouds, and they have produced a wide range of inspiring algorithms, methods, and strategies. In many previous studies, such as [255] and [256], the reconstruction or progress monitoring of as-built buildings or infrastructure has been achieved using methods from computer vision and machine learning. Deep learning in particular, which has established itself as a dominant technique in the field of artificial intelligence, has broadened the horizon of 3-D building reconstruction from point clouds, with numerous methods having been proposed. However, to use these algorithms and methods in civil engineering and the construction industry, a gap must be bridged between the state of the art of point cloud processing in CV and ML and the practical demands of AEC/FM and engineering projects. The following points should be considered in the future to facilitate better collaboration between the fields of computer vision and machine learning and those of AEC/FM and engineering projects.
1) Application Scenario
The algorithms and methods developed in the field of computer vision are mainly designed for indoor scenarios and pay little attention to low-quality outdoor datasets contaminated by noise and outliers [254]. Our demands in AEC/FM and engineering projects cover both indoor and outdoor application scenarios of different scales, densities, and qualities. The data acquired in indoor scenarios are vastly different from those gathered on construction sites, which renders many current algorithms and methods infeasible.
2) Training Data Preparation
A large percentage of computer vision-based methods rely on a supervised learning strategy, which requires the manual preparation of training datasets. Only a few datasets have been proposed for civil engineering tasks [7], and transfer learning from existing datasets can only partially solve this problem [257]. There is thus still an urgent demand for generating such training datasets, but this is a time-consuming and labor-intensive task.
3) Evaluation Criteria
In the field of computer vision, overall accuracy, represented by recall and precision, and mean average precision are the most significant metrics [254]. These metrics represent the outcomes of a statistical evaluation. For applications in civil engineering, however, the output accuracy of certain objects is of higher value, which calls for object-based evaluation. For instance, the topological clarification [258] of objects is a crucial point for reconstructed structural elements and should be considered as well. Moreover, depending on the application, the accuracy can be assessed at the pixel level, point level, or even object level, which differs from field to field and relates to the real demands.
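For reference, the statistical metrics mentioned above follow the standard definitions

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$

where $TP$, $FP$, and $FN$ are the counts of true positives, false positives, and false negatives at the chosen evaluation level (pixel, point, or object).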
C. New Trends of Novel Devices and Techniques
The last research gap lies in the recent trends of novel devices and techniques, including ubiquitous acquisition and online processing, embedded systems and IoT devices, and computing power for big data. The following points should be considered in the future to strengthen the connection between current data acquisition and processing techniques and the demands of the AEC/FM fields.
1) Ubiquitous Acquisition and Online Processing
With the increasing variety of sensors, both high-end acquisition systems and consumer-level acquisition devices can provide massive, publicly accessible datasets. These new acquisition paradigms translate into lower control over the acquisition process, which must be compensated by an increased robustness of the algorithms and by structural or physical a priori knowledge. Moreover, there are many applications, such as disaster management and damage assessment from reconstructed buildings and landscapes, where tight timing restrictions make an online reconstruction approach indispensable. In particular, we foresee the need to extend the surveyed methods to the online setting in order to support such challenging problems in building reconstruction from point clouds.
2) Embedded Systems and IoT Devices
Owing to rapid developments in technology and the Internet, embedded systems and IoT devices are bringing a new era to AEC/FM and engineering projects. IoT devices and embedded systems aim to amalgamate everything under a common infrastructure, providing not only control over things (i.e., buildings and infrastructure) but also information on their actual status. Various applications of IoT for the development of smart city infrastructure and smart dwelling construction projects have already been presented, and the use of IoT greatly accelerates the automation and monitoring of construction projects. However, research on point cloud processing techniques for IoT devices and embedded systems is at a very early stage, in terms of both hardware and software.
3) Computing Power for Big Data
Novel point cloud acquisition methods will contribute not only to an increase in the variety and popularity of collected datasets, but also to a quickly growing scale of the acquired data. With the large-scale datasets required for urban mapping and construction tasks, we no longer deal with individual buildings or installations, but rather with entire scenes, possibly at city scale, with enormous numbers of objects and structural elements of various shapes and sizes. Moreover, the need for computing power also stems from the higher dimensionality of 3-D point clouds, which have one more dimension than 2-D images; for recording the same scene at comparable resolution, the acquired 3-D point clouds are much larger than images. Under such conditions, recovering the geometry, semantics, and topology of objects from billions of measured points is a challenging big data problem, which calls for improved computational power and a redesigned processing framework.
Conclusion
3-D point cloud data play a vital role in the monitoring of construction sites, construction works, and construction equipment. A wide range of techniques has been developed to acquire 3-D point clouds, including ranging-based methods like laser scanning and imaging-based methods like MVS and SLAM. Moreover, point cloud data have already been used in a range of applications, including 3-D building model reconstruction, building condition assessment, and construction progress analysis.
Against this background, this article provides a comprehensive review of the state of the art of 3-D point clouds and their related key techniques. The measuring of 3-D coordinates and the generation of 3-D point clouds are introduced, and the related data structures for organizing discrete points, as well as the registration techniques for aligning multiple point clouds, are reviewed and analyzed. Several essential technologies are then reviewed and discussed, covering three major parts: segmentation of point clouds, classification of point clouds, and modeling from the generated 3-D points. The benefits, drawbacks, and appropriate conditions of each representative technique are also addressed. Finally, based on the literature review and discussions, the limitations of current techniques, as well as the research gaps, are identified, indicating the following future research directions.
An application-oriented data acquisition workflow and a benchmark-based evaluation system should be developed, concerning aspects like improving the accessibility and visibility during data acquisition, the creation of benchmark datasets, the establishment of an effective evaluation system, and the possible use of multimodal datasets.
Advanced data processing techniques should be further studied, considering the balance between efficiency and effectiveness, the generality of uses, and the robustness to disturbances in the scenarios of civil engineering applications and construction sites.
Collaboration with computer vision and machine learning, as well as a deeper connection with the fields of AEC/FM and engineering projects, should be further enhanced to fill the gaps between developed techniques and real demands of applications.
ACKNOWLEDGMENT
This work was carried out within the frame of the Leonhard Obermeyer Center (LOC) at Technische Universität München (TUM) [www.loc.tum.de]. The authors would like to thank Dr. P. Polewski for providing Fig. 20, M. Hödel for his great help with the proofreading, and the authors of the Semantic3-D dataset [www.semantic3d.net], which was used for illustrations of point clouds. The authors also appreciate the help of Dr. M. Hebel and J. Gehrung of Fraunhofer IOSB for providing the MLS dataset used for illustrations, Dr. S. Tuttas and Dr. L. Hoegner for acquiring point clouds of construction sites, and finally Prof. A. Borrmann and Dr. A. Braun of TUM for providing the as-built BIM used for illustrations.