Planar Abstraction and Inverse Rendering of 3D Indoor Environments

Scanning and acquiring a 3D indoor environment suffers from complex occlusions and misalignment errors. The reconstruction obtained from an RGB-D scanner contains holes in geometry and ghosting in texture. These artifacts are easily noticeable and cannot be considered visually compelling VR content without further processing. On the other hand, the well-known Manhattan-world priors successfully recreate relatively simple structures. In this article, we push the limits of planar representations of indoor environments. Given an initial 3D reconstruction captured by an RGB-D sensor, we use planes not only to represent the environment geometrically but also to solve an inverse rendering problem that considers texture and light. The complex process of shape inference and intrinsic imaging is greatly simplified with the help of the detected planes, yet produces a realistic 3D indoor environment. The generated content can adequately represent spatial arrangements for various AR/VR applications and can be readily composited with virtual objects possessing plausible lighting and texture.


INTRODUCTION
A realistic 3D environment has extensive possible applications. The immediate use is visualizing 3D content for commercial indoor solutions, namely real estate or interior design of homes, offices, or hotel rooms. Realistic VR content could even serve as a tool for treating cognitive disorders, including Alzheimer's disease [1], [2], [3], or for producing training data for machine-learning tasks with physically based rendering [4] and a simulator [5]. The reconstructed environment can ultimately support higher-level understanding and manipulation of everyday environments when equipped with semantics and intelligent agents.
Recent advances in 3D scanning technology allow very accurate reconstruction of 3D objects in a controlled setting using one or more cues from stereo reconstruction [6], shading [7], silhouettes [8], or depth sensors [9], [10]. However, conventional 3D reconstruction technology cannot simply be extended to uncontrolled, large-scale indoor environments, because the collected data suffers from clutter and there is often insufficient overlap between camera frames [11].
The representation of indoor environments is usually based on either 3D reconstruction or image-based rendering. The former has easily noticeable artifacts due to holes, misaligned texture (or depth boundaries), exposure changes between frames, and extensive smoothing of small geometry. The latter shows photo-realistic texture that is visually compelling when the viewpoint is similar to that from which the original image was captured. The rendering quality, however, degrades significantly with substantial changes in viewpoint, especially for geometrically complex objects. Image-based rendering also lacks an underlying geometry that could enable manipulations such as changes of object pose, lighting, or materials.
In this paper, we present the power of plane-based abstraction to better represent the majority of static geometric structures in indoor environments and to complete the occluded parts (Fig. 1). Plane-based abstraction has already proved its effectiveness in large-scale urban modeling and Manhattan-world indoor modeling [12], [13], [14], [15]: indoor environments are well represented and separated by planar structures, mainly walls, floors, and desktops.
We demonstrate that detected planes can play an important role not only in geometric completion, but also in texturing, lighting, and registration. Our intuition comes from the fact that computer graphics has long benefited from billboard-based abstraction with bump maps or textures to render high-quality images quickly. Similarly, we can use the underlying base plane primitives with improved texture to create the illusion of a detailed reconstruction, and further extract intrinsic components. The underlying assumption is that the bare-bones geometry of the indoor environment is completely enclosed by a planar structure (walls, ceiling, and floor) with homogeneous texture.
Starting from 3D-scanned incomplete geometry of the indoor environment, we first detect major planes within the scene and optimize over registration and exposure terms of individual frames observing each plane to have a consistent texture. During the process, the texture is decomposed into the homogeneous, flat background and occluding foreground with variation and details. The flat geometry and background texture of the planes are then interpolated and extrapolated to fill the inevitable holes and to complete the room structure, similar to [16].
In addition to the geometry estimate, corresponding material and texture information is necessary to augment the model with realistic rendering in an AR/VR scenario [17], [18], [19]. We further exploit our assumption to deduce the direct and indirect lighting and the background materials jointly. This demonstrates the extended potential of 3D indoor environments with plausible rendering for large-scale virtual applications. Even in a confined space, indoor lighting varies across locations due to light fixtures, windows, and the shadows of complex objects. In previous work, these effects are often ignored, and large-scale lights are simply modeled with environment maps or directional lighting [20], [21], [22], [23]. In contrast, we directly locate the direct light sources and account for their spatially varying effects in addition to environmental lighting.
We offer three main contributions: converting the 3D reconstruction of an indoor environment into an abstract and compact representation of a realistic indoor environment based on planar primitives; decomposing the created content with an inverse rendering pipeline, making it readily available for physically based rendering through a joint analysis of shape, texture, and lighting; and extracting basic semantics of foreground/background segmentation utilizing the recovered geometry and texture. We demonstrate that our representation creates realistic visualizations for various indoor spaces and at the same time decomposes the scene semantically (walls, floors, desktops, and objects) and intrinsically (geometry, lighting, and texture). The new form of visualization is compared with other recent representations of 3D indoor environments [16], [24], [25].

RELATED WORK
Reconstructing and visualizing large-scale indoor environments has a variety of practical applications, ranging from AR/VR systems to intelligent agents that exploit the geometry and semantics of a real-world environment, or data augmentation for machine learning [4]. Acquiring large-scale uncontrolled indoor data, however, is challenging. Everyday environments are highly complex and exhibit occlusions from various objects. There is static furniture as well as non-static objects, and the accessible viewpoints are often limited to the remaining open space.
Three major approaches exist by which to represent a real-world indoor environment. First, the volumetric fusion of large-scale environments converts the accumulated depth measurements into a dense mesh [25] that can be further used for high-level tasks such as semantic segmentation [24], [26] or simulating activities [5]. This representation creates the most thorough geometry of the actual environment. The texture is usually created by projecting RGB frames onto the measured depth data, but misalignment issues create ghosting artifacts. The misalignment comes either from the imprecise geometry of the reconstruction method, which smooths out small details through accumulation, or from calibration errors between frames with insufficient overlap. In addition to texture errors, there are inevitable holes in the measurements when scanning large environments.
The second approach to representation is image-based rendering ( [11], [27] and the references therein), which does not have the texture misalignment problem. Instead, image-based rendering techniques create as crisp an image as possible by deforming the existing image frames based on rough estimates of the underlying geometry. The rough geometry estimate only serves as a guide to warp images and create a realistic visualization. While reflectance recovery or augmentation becomes faithful [28] with accurate reconstruction and known illumination, the visualization quality deteriorates if the viewpoint changes significantly from those of the measured frames. Without improved geometry, the representation can only be used for visualization and cannot be extended to other tasks such as free navigation, scene editing, or semantic segmentation.
The third approach to representation uses a geometric proxy to represent the core elements of the indoor scenes. One of the most successful ways to represent large-scale indoor environments is by using a Manhattan-World prior to fill the holes in a stereo reconstruction [14], [15]. Planes are successfully used to create a plausible reconstruction of a building with limited inputs [12], [13], [29]. Huang et al. [16] used plane geometry and texture to fill holes and create plausible 3D content of indoor environments. Starting from the planar assumption to create a complete geometry as suggested in the literature mentioned above, our work incorporates light information to complete the rendering pipeline of scanned indoor environments.
In this paper, we present a plane-based approximation of indoor environments. We are not limited to mere geometric reconstruction, but also generate plausible texture and lighting parameters given the fitted planes. Previous literature on 3D object rendering and compositing extracted environmental light and material components from an image [17], [18], [19]. Similarly, we are solving intrinsic image decomposition [30], which is inherently under-constrained. We substitute shape and reflectance priors with the recovered planar scene geometry and the reflectance of background walls, respectively. Our work is inspired by previous work that successfully exploited planar geometry for other tasks. Teichman et al. [31] used detected planes to solve jointly for SLAM and sensor intrinsics calibration, while Jachnik et al. [32] solved for the entire light field of a small (book-sized) specular planar surface. Similarly, we jointly solve for the necessary calibration parameters, reflectance, and light parameters. Zhang et al. [33] also used planar reconstruction to find the reflectance of walls and light parameters. They only recovered indirect lighting as a cube map and 1-3 reflectance values of walls for the entire room; therefore, their method cannot be applied to large rooms with non-rectangular shapes. Our pipeline, on the other hand, embraces more general planar geometries and recovers both direct and indirect lighting. The planar assumption additionally generates segmentation of non-planar objects and foreground texture as a by-product of the decomposition, which can potentially provide useful information for other tasks.

PROBLEM AND ASSUMPTIONS
The input to our system is an RGB-D sequence capturing the indoor environment and a reconstructed mesh using the volumetric fusion method [24], [25]. The model is processed in three steps: (1) finding the geometric representation with planar abstraction (Section 4), (2) estimating texture (Section 5), and (3) setting the light parameters (Section 6).
We first detect planes on the reconstructed mesh based on the assumption that the indoor environment is composed of planar structures. This is a widely used assumption, and its success has been demonstrated in various indoor applications. We then extend the assumption further and extract the bare-bones structure of the indoor environment that encloses the given geometry. Based on the extracted structure of the room, the planes are classified as room geometry or other structures. The extracted planes of the room geometry are extended or connected, while the remaining non-room geometry is treated separately; these are possible candidates for non-static objects with semantic meaning.
Second, we cluster the colors of each plane and assign a background color with dominant homogeneous reflectance under a Lambertian assumption. The reflectance of an individual plane is estimated by the homogeneous background region, which is the region with no response under a conventional edge detector. We first compensate for color differences caused by per-frame auto-exposure using geometric correspondences between the mesh and the frames. Then the equalized color values are used to create texture on the mesh. From the mesh, we can locate the positions of direct lights and assign background reflectance colors for each plane. The per-plane texture is completed by optimizing the pose in the foreground and by filling the unobserved texture with the background color. At the same time, we estimate direct and indirect light parameters using the observed parts with the homogeneous texture.
The final output is the 3D content of the indoor environment, which is greatly simplified from the original sensor sequence or initial mesh. The lightweight mesh is complete and visually more attractive, with a clear texture and filled holes. We also obtain a geometric context of the 3D elements, distinguishing the room structure (the essential static elements of walls, ceiling, and floor, with representative background reflectance) from other objects and foreground decorations. In addition, we obtain the indirect light field as well as the direct lighting. In short, the pipeline produces the necessary information on geometry, texture, and lighting utilizing the planar assumption, which can be readily rendered or used for a variety of indoor visualization, navigation, and mixed-reality applications.

GEOMETRY ESTIMATION
The use of planar abstraction for indoor geometry has been reported in many previous publications. In this section, we briefly discuss how we incorporate existing techniques using Manhattan-world priors. The overall pipeline is depicted in Fig. 2. We first detect planes in individual frames and label the planes on the initial geometry (Section 4.1). We use this information to create a globally connected room structure and fill in the missing geometry (Section 4.2).

Plane Detection and Refinement
The input to the system is a sequence of color-depth frames and their calibration information. We also have a 3D mesh built by volumetric reconstruction of the registered depth measurements. Instead of using the original depth measurements, which tend to be noisy, we generate the depth frames by reading the OpenGL depth buffer when the reconstructed mesh is rendered with the calibration parameters. From the rendered depth frames, we adapt the fast plane detection method suggested by [34]. By detecting the planes in individual frames instead of on the 3D mesh, our method becomes robust against the large-scale distortion of planes that occurs in the 3D mesh and effectively incorporates visibility information. The detected planes are then projected back onto the reconstructed mesh, and planes with sufficient overlap are merged into the same plane. The criterion for overlap is necessarily heuristic; we merge two detections if more than 10 percent of the detected plane pixels overlap when projected onto any of the frames. Examples of detected planes are shown in Figs. 2 and 3.
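As an illustration, the overlap criterion can be written as a simple mask test. In this sketch the fraction is measured against the smaller of the two projected detections, which is one possible reading of the criterion; the masks and names are illustrative rather than taken from the paper.

```python
import numpy as np

def should_merge(mask_a: np.ndarray, mask_b: np.ndarray, ratio: float = 0.10) -> bool:
    """Merge two plane detections if, in some frame, more than `ratio` of the
    smaller detection's pixels overlap the other detection. `mask_a` and
    `mask_b` are boolean images of the planes projected into the same frame."""
    overlap = np.logical_and(mask_a, mask_b).sum()
    smaller = min(mask_a.sum(), mask_b.sum())
    return smaller > 0 and overlap / smaller > ratio
```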
The plane parameters are found by running an optimization on the detected planes, as suggested by [16]. We find the plane parameters by jointly optimizing a data-fitting term and an orthogonality term. Further details can be found in Section 4.3 of [16].
In addition to the plane parameters, we also record the visibility information of individual planes. We project the measurements of the individual frames onto each plane and record the visibility (existing, free-space, or occluded) based on the distance from the plane. This information is used in the next section to complete the missing regions.

Plane Completion
Merely placing the detected planes is not enough to infer a complete room structure. While previous works detect corners and edges to generate a compact planar representation for visualization [12], [16], we found that the choice of connections between neighboring planes is usually ad hoc and often results in wrong configurations. The main reason is that the observed data is distorted and occluded, and lacks the necessary connectivity information. The causes of the incompleteness are complex, and merging or connecting planes with a constant distance threshold results in erroneous decisions. This is in contrast to the plane merging in Section 4.1, where obviously overlapping planes are marked as the same plane. Open spaces at doorways, windows, or pillars make the problem even more complicated.
Given the high-resolution mesh, we need to enforce a strong prior to ultimately synthesize a simplified structure. At a high level, our approach is similar to urban scene modeling [35], but the essential prior is different. For urban buildings, they utilize structural cues from GIS footprints, satellite imagery, and semantic segmentation, and infer combinations of sweep-edges. For the indoor environment, we rely solely on RGB-D measurements and the volumetric reconstruction; we define an individual room as the space enclosed by a ceiling, a floor, and a loop of walls, and enforce this stronger prior to generate the room structure. Currently, we focus on the single dominant loop, but the algorithm can be extended to sequentially add additional loops if needed. The decisions for extrapolating and merging planes are bold in the beginning but become more conservative in the later stages.
Sequentially, we first detect the ceiling and floor, then the largest enclosing loop of walls from the detected planes, and the remaining planes are added last. Specifically, we first detect the ceiling and floor using the known gravity direction. Then we define candidate walls to be vertical planes that can extend from the ceiling to the floor, considering visibility information. The candidate walls are projected onto the floor plane as lines with two endpoints, as shown in Fig. 3 (top left). We first merge proximal lines that can be represented as a single wall (Fig. 3, top middle), because large planes are often only partially observed by the sensor and are thus measured as several disconnected patches. The endpoints of the lines are then connected to create a loop, extrapolating or adding new lines as necessary (Fig. 3, top right). The 2D projection can then be converted into a room structure. The described pipeline finds the room structure in most cases, except when there are windows of significant size or hallways, as in Fig. 4. Once the loop of the room is defined, the remaining planes are conservatively connected when the intersection is observed.
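For illustration, the test for merging two proximal wall segments projected onto the floor might look as follows; the angle and offset thresholds are illustrative values, not parameters reported in the paper.

```python
import numpy as np

def line_params(seg):
    """Normal form (n, d) of the infinite 2D line through a projected wall segment."""
    p, q = np.asarray(seg, dtype=float)
    t = q - p
    n = np.array([-t[1], t[0]])
    n /= np.linalg.norm(n)
    return n, float(n @ p)

def mergeable(seg_a, seg_b, ang_thresh_deg=10.0, dist_thresh=0.15):
    """True if two projected wall segments can be represented by a single wall:
    nearly parallel supporting lines and a small perpendicular offset
    (thresholds in degrees and meters are illustrative)."""
    na, da = line_params(seg_a)
    nb, db = line_params(seg_b)
    dot = float(na @ nb)
    parallel = abs(dot) > np.cos(np.deg2rad(ang_thresh_deg))
    # Compare line offsets with consistent normal orientation.
    offset = abs(da - db) if dot > 0 else abs(da + db)
    return parallel and offset < dist_thresh
```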
The ceilings and floors are converted into a 3D mesh with the help of a polygon triangulation library (earcut, https://github.com/mapbox/earcut), and the surrounding vertical rectangular walls are each represented with two triangles. Among the remaining planes, the ones that can be fitted into rectangles are also represented with two triangles, and the others are meshed using a quad-tree structure on the plane (Fig. 3, bottom right). The overall size of the mesh is significantly reduced because 3D reconstruction from the volumetric fusion of sensors is usually based on a marching-cubes algorithm [36] and creates triangles at the resolution of the voxel grid (Fig. 3, bottom left and bottom middle). In the next section, we create a complete high-resolution texture for the simple mesh to generate visually compelling 3D content.

COLOR ESTIMATION
The previous section describes how we extract simplified geometry that captures the essential elements of the indoor environment. Having a rough geometry can also guide the inverse problem of decomposition, such as intrinsic imaging or inverse rendering. In this section, the planar abstraction is exploited to improve registration and create high-resolution texture. We first explain how we equalize colors by compensating for auto-exposure and white balancing. The equalized color information is used to detect light and to generate texture per plane by registering camera positions. The texture is then segmented into either a detailed foreground or uniform background color. The background color is used to fill the occluded region and solve for light parameters using simplified geometry. These stages are described in detail below, and also depicted in Fig. 5.

Color-Transfer Optimization
The texture can be generated by projecting multiple frames onto the planes using the known calibration parameters. However, corresponding pixels in different frames do not have the same color, even on Lambertian surfaces, because of auto-exposure and white balancing. We compensate for the different exposures of the same geometric correspondence using the method suggested by [33]. We find corresponding pixels in the images for sampled vertices $p_k$ on the initial reconstructed mesh and solve the following optimization problem:

$$\min_{\{t_i\},\,\{C(p_k)\}} \sum_{i,k} h_\alpha\big( t_i\, I_i(p_k) - C(p_k) \big),$$

where $t_i$ are per-image exposures, $C(p_k)$ are vertex radiances, $I_i(p_k)$ is the gamma-corrected ($\gamma = 2.2$) observed pixel value of vertex $p_k$ in image $i$, and $h_\alpha$ is a robust loss. We only use correspondences that are not occluded and that are projected onto low-gradient regions in the images, and we solve for R, G, and B independently. Note that we do not use locally varying transfer functions as suggested by [16], [37], because we are interested in emulating the actual color transfer caused by exposure or white balance with per-frame parameters, so as to maintain the global relative brightness of the entire geometry. In practice, there exist non-Lambertian surfaces and outliers due to registration errors or geometric errors in the reconstruction. We therefore ignore vertices $p_k$ whose observations $I_i(p_k)$ vary by more than 0.2 when the colors are converted to the range of 0 to 1. We optimize the nonlinear least-squares problem using the Huber loss function with $\alpha = 0.1$. Implementation details are given in the results section (Section 7). We assign the per-vertex reflectance using the correspondences calculated by projecting each vertex position onto the individual frames. As the equalized colors cover a wider range of color variation owing to the different white-balance values of different frames, the resulting vertex colors span high-dynamic-range radiance values. If there are multiple corresponding pixel values for a vertex, we assign confidence weights based on saturation and the geometric arrangement between the vertex and the camera, as suggested by [33]. The vertex color is chosen as the weighted median of the equalized colors of the corresponding pixels.
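As a concrete illustration, the following sketch solves this per-channel exposure problem with SciPy's robust least squares. The multiplicative model $t_i\, I_i(p_k) \approx C(p_k)$ and the gauge fix $t_0 = 1$ (used to remove the trivial all-zero solution) are assumptions of this sketch rather than details taken from the paper; array shapes and names are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def equalize_exposures(I, valid, huber_scale=0.1):
    """Jointly estimate per-frame exposures t_i and vertex radiances C(p_k)
    for one color channel. I has shape (n_frames, n_verts) with gamma-corrected
    observations; `valid` marks unoccluded, low-gradient correspondences."""
    n_frames, n_verts = I.shape
    fi, vk = np.nonzero(valid)

    def residuals(x):
        # Fix t_0 = 1 to remove the global scale ambiguity.
        t = np.concatenate([[1.0], x[:n_frames - 1]])
        C = x[n_frames - 1:]
        return t[fi] * I[fi, vk] - C[vk]

    x0 = np.concatenate([np.ones(n_frames - 1), I.mean(axis=0)])
    sol = least_squares(residuals, x0, loss="huber", f_scale=huber_scale)
    return np.concatenate([[1.0], sol.x[:n_frames - 1]]), sol.x[n_frames - 1:]
```

In practice the same routine would be run once per color channel, as the text solves for R, G, and B independently.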
The high-dynamic-range mesh can then be used to locate the direct, diffuse lights. To detect lights, we find vertices whose radiance is above a threshold value and identify the connected components among them. The center of mass of each connected component is chosen as a direct light source. The calculated frame weights $t_i$ and the locations of the direct light sources are used in the later sections.
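A minimal sketch of this light-detection step, assuming the mesh is given as vertex and face arrays with a per-vertex scalar radiance; the threshold value and helper names are illustrative, not from the paper.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def detect_lights(verts, faces, radiance, thresh=0.9):
    """Locate direct light sources as centers of mass of connected bright
    regions on the HDR mesh. `verts` is (n, 3), `faces` is (m, 3) vertex
    indices, `radiance` is a per-vertex scalar."""
    bright = radiance > thresh
    # Keep only mesh edges whose both endpoints are bright.
    e = np.vstack([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
    e = e[bright[e[:, 0]] & bright[e[:, 1]]]
    n = len(verts)
    adj = coo_matrix((np.ones(len(e)), (e[:, 0], e[:, 1])), shape=(n, n))
    _, label = connected_components(adj, directed=False)
    lights = []
    for c in np.unique(label[bright]):
        members = np.flatnonzero(bright & (label == c))
        lights.append(verts[members].mean(axis=0))  # center of mass of the region
    return np.array(lights)
```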
The sample results of color equalization are shown in Fig. 6. The input frames in Fig. 6a and 6c observe the same wall, but the white balance and exposure vary as the sensor automatically adjusts the exposure on a per-frame basis. Our color-equalization pipeline effectively reduces the color discrepancy, as shown in Fig. 6b and 6d. As a result, the mesh texture in Fig. 6f is more coherent than the original mesh texture in Fig. 6e. Also, the texture assigned with the median of observations greatly reduces blurry texture or ghosting artifacts due to misalignments.

Per-Plane Registration
Similar to the mesh colors, the texture of the indoor structure on the simplified planes can be generated by projecting RGB frames onto the planes. With known registration and 3D plane locations, this could be solved with a simple homography. However, the initial geometry is distorted, and the planar structure may be reconstructed as non-flat geometry; consequently, the initial registration against the distorted geometry is erroneous.
We solve for the camera-to-world 4x4 registration $T = \{T_i\}$ of each frame. Individual frames whose pixels correspond to a specific plane are further mapped to the plane with $T_p$, which transforms the world coordinate system into the plane coordinate system. In our setting, the transform $T_p$ is defined such that one meter is mapped to 200 pixels. We use uniform scaling and a rigid transformation to map onto the individual plane, whose x and y coordinates represent the pixel coordinates and where z = 0. We want to find a small update to the frame transformation, $T_i = \Delta T_i \cdot T_i^0$, where $T^0 = \{T_i^0\}$ is the initial transformation and the update $\Delta T_i$ is assumed to stay close to the identity. A point in a camera frame $p^{camera}$ is then mapped to the plane by $p^{plane} = T_p \cdot T_i \cdot p^{camera}$.

We refine the registration of the individual frames by jointly solving for planarity, sparse, and dense constraints. The formulation is similar to [16], but the individual terms are evaluated in the plane coordinate system:

$$E(T) = E_g(T) + \lambda_s E_s(T) + \lambda_d E_d(T).$$

The weights $\lambda_s$ and $\lambda_d$ are set individually for each plane. The first term $E_g(T)$ is the geometric term that holds the measurements close to the detected plane. When defined in the plane's coordinate system, we can minimize the distance to the plane by minimizing the z-coordinate. In other words, the geometric term can be written as

$$E_g(T) = \sum_i \sum_k \big\| e_3^\top\, p^{plane}_{i,k} \big\|^2,$$

where $e_3 = (0, 0, 1)$ is the unit vector in the z-direction, and the term is minimized over all corresponding points $p^{plane}_{i,k}$ (indexed by $k$) obtained by projecting all frames $i$. The second term $E_s(T)$ optimizes the locations of sparse image feature correspondences. As before, all frame images are warped onto the plane, and sparse features are extracted and matched for every pair of frames $(i, j)$. We use the OpenCV implementation [38] of the ORB feature detector [39] and the brute-force matching strategy. We sort the matches by score and keep only the best third. We further filter out matches that are too far apart when mapped onto the same plane (we use 10 pixels). We empirically found that the matches can be unstable when only one or two filtered matches remain; when there are more than three matches, we add the sparse term to the optimization:

$$E_s(T) = \sum_{(i,j)} \sum_k \big\| \big[p^{plane}_{i,k} - p^{plane}_{j,k}\big]_{x,y} \big\|^2,$$

where we optimize over the x, y coordinates of the matched locations projected onto the plane. Finally, the third term $E_d(T)$ optimizes the dense photometric consistency of individual pixel values $C(p^{plane})$ in the generated texture. If $I_i(p^{plane}_{i,k})$ represents the pixel intensity of a point in the color-corrected and warped image of frame $i$, it can be written as

$$E_d(T) = \sum_i \sum_k \big\| I_i(p^{plane}_{i,k}) - C(p^{plane}_k) \big\|^2.$$

For the dense and geometric terms, considering all points on the entire plane would be too expensive, as the 3D coordinates on the detected plane are continuous. We therefore use uniformly sampled points to reduce the problem size. When optimized using one point for every five pixels in both the x and y directions, the optimization takes about five minutes per plane. The sampling rate can be adjusted according to the number of frames involved and the memory available for the optimization.
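The bookkeeping behind the sparse term can be sketched with OpenCV as below; the function operates on two grayscale frames already warped into the plane's pixel coordinates, and the filtering thresholds follow the values quoted above (best third of the matches, 10-pixel distance, at least three surviving matches). The helper name and return format are illustrative.

```python
import cv2
import numpy as np

def sparse_plane_matches(warp_i, warp_j, max_px_dist=10.0):
    """ORB correspondences between two grayscale frames warped onto the same
    plane, keeping the best third and rejecting pairs farther than
    `max_px_dist` apart in plane coordinates."""
    orb = cv2.ORB_create()
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    kp_i, des_i = orb.detectAndCompute(warp_i, None)
    kp_j, des_j = orb.detectAndCompute(warp_j, None)
    if des_i is None or des_j is None:
        return []
    matches = sorted(bf.match(des_i, des_j), key=lambda m: m.distance)
    matches = matches[: len(matches) // 3]            # keep the best 1/3 by score
    pairs = []
    for m in matches:
        p = np.array(kp_i[m.queryIdx].pt)
        q = np.array(kp_j[m.trainIdx].pt)
        if np.linalg.norm(p - q) <= max_px_dist:      # already roughly aligned
            pairs.append((p, q))
    return pairs if len(pairs) >= 3 else []           # unstable with < 3 matches
```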

Foreground-Background Optimization
Color Blending. Even though we resort to simple geometry, we can still create the illusion of a realistic environment by rendering the model with a high-resolution texture. We choose the resolution of each texture based on the physical dimensions of the individual planes. The color of each pixel is chosen in a manner similar to the generation of the mesh texture in Section 5.1, i.e., by warping the equalized colors of the frames and then taking the weighted median. Fig. 7 compares the different blending methods used to create texture from multiple frames. Compared to simple averaging (Fig. 7a), using frame weights (Fig. 7b) compensates for different white-balance values and consequently reduces the prominent boundaries of warped frames. Using a weighted median (Fig. 7c) eliminates additional ghosting and blurry details. However, there are still possible misalignments due to geometric errors or motion blur and, more importantly, missing values for the extrapolated geometry that is not directly observed. These artifacts are reduced after we process the texture with inpainting and Poisson blending, as described in the following paragraphs (Fig. 7d).
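For reference, the per-texel blend reduces to a weighted median per color channel; a minimal sketch follows, where the weights are assumed to come from the saturation- and geometry-based confidences mentioned above.

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median of per-frame observations for one texel and one channel."""
    order = np.argsort(values)
    v, w = np.asarray(values, float)[order], np.asarray(weights, float)[order]
    cdf = np.cumsum(w) / np.sum(w)
    return v[np.searchsorted(cdf, 0.5)]   # first value whose cumulative weight >= 0.5
```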
Background Inpainting. We take a two-step approach for the background and the foreground. The underlying assumption is that each plane has a dominant color (base texture or background color) that can be smoothly interpolated and used to inpaint the missing regions. On the other hand, there are high-frequency details on top of the base texture, which are assigned as the foreground. The texture refinement steps for the foreground and background regions are described in Fig. 8.
An example of the combined initial texture is shown in Fig. 8a. We first cluster the RGB values of the pixels in the generated texture, avoiding the edge regions. We collect the dominant clusters that cover more than 60 percent of the non-edge pixels and mark them as background pixels. An example of background clusters is shown in Fig. 8b. We use the selected pixels to create the inpainted background (Fig. 8c). We used the OpenCV implementation of inpainting [40], since our main focus is not on creating the best inpainting method; a better approach could be used to fill larger regions. The inpainted background region is merged with the initial texture to create a full texture without missing values (Fig. 8d).
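A rough sketch of this background step using OpenCV primitives: k-means clustering of non-edge texel colors, selection of dominant clusters until they cover roughly 60 percent of the pixels, and Telea inpainting of the rest. The cluster count, the color-distance threshold, and the inpainting radius are illustrative choices, not values from the paper.

```python
import cv2
import numpy as np

def inpaint_background(texture_bgr, edge_mask, n_clusters=8, coverage=0.60):
    """Cluster non-edge texel colors of an 8-bit BGR texture, keep the dominant
    clusters that together cover `coverage` of the non-edge pixels, and inpaint
    everything far from those clusters."""
    samples = texture_bgr[~edge_mask].astype(np.float32)          # (N, 3)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(samples, n_clusters, None, criteria, 3,
                                    cv2.KMEANS_PP_CENTERS)
    # Pick the largest clusters until they cover the requested fraction.
    counts = np.bincount(labels.ravel(), minlength=n_clusters)
    keep, covered = [], 0
    for c in np.argsort(-counts):
        keep.append(c)
        covered += counts[c]
        if covered >= coverage * len(samples):
            break
    # Pixels far from every kept cluster center are treated as "to be inpainted".
    dist = np.min(np.stack([np.linalg.norm(texture_bgr.astype(np.float32)
                                           - centers[c], axis=2)
                            for c in keep]), axis=0)
    hole_mask = (dist > 40).astype(np.uint8)
    return cv2.inpaint(texture_bgr, hole_mask, 3, cv2.INPAINT_TELEA)
```

As noted above, the inpainting call is a placeholder that could be swapped for a stronger texture-synthesis method when larger regions must be filled.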
Foreground Blending. After the texture is filled with the background, the blurry foreground regions are sharpened by blending in patches from the sharpest frame. As opposed to the background, the foreground regions have high-frequency details that need to be preserved. We first define the regions to be amended by finding connected components of detected edges. Samples of the regions are shown in Fig. 8d. For each connected component, we select the best candidate frame by choosing the warped texture that covers the most of the region. If multiple frames cover the entire region, we pick the least blurry frame based on the variance-of-Laplacian measure, as suggested by [41]. The region is then merged into the plane texture using Poisson blending [42] (Fig. 8f). As observed in Fig. 8e, the crisp details are successfully preserved.
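The frame-selection and blending steps map directly onto OpenCV calls; a brief sketch follows, with grayscale candidates for the sharpness test and `center` given as an (x, y) pixel location in the target texture. The function names are illustrative.

```python
import cv2
import numpy as np

def sharpest_frame(candidates_gray):
    """Pick the least blurry warped frame by the variance of the Laplacian."""
    scores = [cv2.Laplacian(img, cv2.CV_64F).var() for img in candidates_gray]
    return int(np.argmax(scores))

def blend_foreground(plane_texture, patch, mask, center):
    """Merge a foreground region from the selected frame into the plane texture
    with Poisson (seamless) blending; `mask` marks the connected component."""
    return cv2.seamlessClone(patch, plane_texture, mask, center, cv2.NORMAL_CLONE)
```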
The described approach fails when the underlying assumption does not hold, in particular when the background texture is not homogeneous, as depicted in Fig. 9. Following our foreground assumption, details are preserved for isolated pictures or decorations, as shown in the cropped region of Fig. 9b, but the regular pattern of blinds is smoothed out into blurry gray fields. There are possible ways to improve the texture generation: the inpainting method could be replaced by traditional texture synthesis [43], [44], or even deep-learning-based approaches using GANs [45]. Our approach also cannot recover from the misalignment of non-flat global details. Huang et al. [16] suggest careful non-rigid registration and a graph-cut-based method to create a coherent registration and assemble patches. However, our approach with a homogeneous background assumption works for most flat walls in ordinary rooms and offices. More importantly, the background regions are represented by a simple reflectance model and can be used to estimate the necessary light parameters.

LIGHT PARAMETER ESTIMATION
In computer graphics, rendering is the process that generates images from geometry, light, and material. Inverse rendering [46], which solves the inverse problem, decomposes an image into light, material, and geometry. As in intrinsic image decomposition [30], [47] or illumination decomposition [48], [49], which are closely related problems for real-world images, inverse rendering is an under-constrained problem and is solved using strong priors on one or more of the components, or by allowing user interaction.
In our setup, we solve for light parameters under the assumption of planar geometry with a homogeneous background region. Starting from the low-level reconstruction of a distorted mesh from RGB-D measurements, we extend the knowledge about planar geometry and create full 3D content that contains not only geometry, but also reflectance and light parameters representing both direct and indirect lighting. The geometric form factors of the radiance equation can be resolved using the simplified geometry, resulting in a plausible solution to an under-constrained problem. With directional light sources, the light varies significantly across different regions of the space, and our method captures these effects. This is in contrast to previous methods for light parameter estimation, which focus only on small regions where virtual objects are placed, or consider only directions through an environment map. With a fuller set of components representing the environment, we can alter or insert elements for rendering: we can seamlessly augment virtual objects, create realistic shadows, and change textures or lights.

Setting
Let us first begin with the well-known radiance equation [50]:

$$R(V \to x') = \int f(x \to V \to x')\, R(x \to V)\, G(x, V)\, dx.$$

The radiosity of a point $V$ toward another point $x'$ is the sum of all radiance $R$ received from other points $x$, multiplied by the BRDF $f$ and the visibility $G$. If we assume Lambertian reflectance, the radiance is constant over viewing directions, $R(V \to x) = R(V)$. This is approximated by the observed vertex radiance $C(p_k)$ found in Section 5.1, and the BRDF can be written as a constant $f(V)$. We further assume that the detected planes have a Lambertian, homogeneous reflectance $f(V) = \rho$ in the background region, in other words, where no high-frequency texture detail is observed. Then, for a non-emitting vertex $p_k$, the equation above can be re-written for a discretized mesh as

$$C(p_k) = \sum_j D(L_j \to p_k) + \rho\, x(p_k).$$

The pixel intensity $C(p_k)$ is therefore a combination of direct lighting $\sum_j D(L_j \to p_k)$ and indirect lighting $x(p_k)$. The models of the direct and indirect lights are described below.

Direct Light
The locations of direct light are detected as bright regions after the vertex colors are equalized as described in Section 5.1. For each detected location, we place a point-light source whose light distribution is axially symmetric about the vertical axis, where the vertical axis is defined by the normal of the ceiling and floor detected in Section 4.2. This follows how radially symmetric lights are described in the IES (Illuminating Engineering Society) format [51]. The profile can represent many types of lights found in residential indoor scenes, such as standing lights, ceiling lights, spotlights, and shaded lamps [33]. The light intensity depends on the angle $\theta_{L_j}$ to the vertical axis, and the angular variation is represented by a 32-bin discretization of $\theta_{L_j}$.
The actual brightness at $p_k$ depends on the reflectance of the wall, the angle $\theta_{p_k}$ with respect to the wall normal, and the distance $r$ to the point, but only when the path from the light source $L_j$ to the point $p_k$ is not occluded. Such occlusions are represented by a binary visibility function $G(p_k, L_j)$.
To summarize, the direct lighting can be written as

$$D(L_j \to p_k) = G(p_k, L_j)\; \rho\; I_{L_j}\, A_{L_j}(\theta_{L_j})\, \frac{\cos \theta_{p_k}}{r^2}.$$

We only need to solve for the light intensity, which is formulated as the combination of the RGB factor $I_{L_j}$ (3 dimensions) and the 32-bin angular profile $A_{L_j}(\theta_{L_j})$. The other terms ($G(p_k, L_j)$, $\theta_{L_j}$, $\theta_{p_k}$, and $r$) are geometric form factors and can be calculated from known information. For $G(p_k, L_j)$, we place a virtual camera at the light location, looking at each plane, and render the original input mesh; the depth buffer is used to detect visible points. For example, to find the light parameters for the hanging light in Fig. 10 (left), we can use the points on the table plane. The original mesh is first rendered with a virtual camera located at the light position, looking at the plane, as shown in Fig. 10 (middle). We compute the visibility by comparing distances against the depth map. The pre-calculated distance-based weights are depicted in Fig. 10 (right).
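A small sketch of evaluating this direct term for one light and one surface sample follows; the inverse-square and cosine factors and the argument names are assumptions consistent with the formula above rather than code from the paper.

```python
import numpy as np

def direct_light(p, n, L, rho, rgb, profile, visible, up=np.array([0.0, 0.0, 1.0])):
    """One IES-style point light's contribution D(L_j -> p_k) at surface point
    `p` with unit normal `n`: reflectance `rho`, per-light RGB intensity `rgb`,
    a 32-bin axially symmetric angular profile, cosine falloff at the receiver,
    inverse-square distance, and binary visibility."""
    if not visible:                                   # G(p_k, L_j) = 0
        return np.zeros(3)
    d = p - L                                         # light-to-point direction
    r = np.linalg.norm(d)
    d = d / r
    theta_L = np.arccos(np.clip(d @ up, -1.0, 1.0))   # angle to the vertical axis
    bin_idx = min(int(theta_L / np.pi * len(profile)), len(profile) - 1)
    cos_p = max(float(-d @ n), 0.0)                   # cos(theta_p_k) at the receiver
    return rho * rgb * profile[bin_idx] * cos_p / (r * r)
```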

Indirect Light
The physically correct way to simulate indirect lighting is to run ray tracing with the correct geometry and material until convergence. When converged, the field of indirect light can be represented by the position and direction at any point within the volume. Full ray tracing involves a prohibitive amount of computation and memory. Instead, indirect light is often pre-calculated or approximated, based on the fact that, unlike direct light, it is smoother and subtler. The focus in previous literature has been on effectively modeling global illumination with, for example, ambient occlusion, environment maps, or photon mapping, to name a few. Modern VR applications usually use an environment map, which indexes the normal directions to find the lighting on the faces of a virtual object. Such an approximation might work well when the environment is convex and confined, such as a small rectangular room, but cannot cope with more dynamic changes of location. Instead, we assign an unknown indirect lighting value $x(p_k)$ to each vertex and add a smoothness criterion between neighboring vertices. In other words, we minimize the following problem:

$$\min_{x} \sum_k \Big\| C(p_k) - \sum_j D(L_j \to p_k) - \rho\, x(p_k) \Big\|^2 + \lambda \sum_{(k,l) \in \mathcal{N}} \big\| x(p_k) - x(p_l) \big\|^2,$$

where $\mathcal{N}$ is the set of neighboring sample pairs. The first term matches the color at the vertex with the light and reflectance, and the second term enforces smoothness between neighboring indirect lighting values. For the implementation, we regularly sample vertices in the plane regions clustered as background.
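Once the lights and reflectance are fixed, this problem is linear in the indirect values, so it can be sketched as a sparse least-squares system; the single-channel form below, the neighbor list, and the smoothness weight are illustrative.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

def solve_indirect(C, D, rho, neighbors, lam=1.0):
    """Per-sample indirect lighting x(p_k) from the data term
    C(p_k) ~ sum_j D(L_j -> p_k) + rho * x(p_k) plus pairwise smoothness.
    C and D are length-n arrays (one channel); `neighbors` is a list of (k, l)."""
    n, m = len(C), len(neighbors)
    A = lil_matrix((n + m, n))
    b = np.zeros(n + m)
    for k in range(n):                        # data term: rho * x_k = C_k - D_k
        A[k, k] = rho
        b[k] = C[k] - D[k]
    for row, (k, l) in enumerate(neighbors):  # smoothness: x_k - x_l = 0
        A[n + row, k] = np.sqrt(lam)
        A[n + row, l] = -np.sqrt(lam)
    x = lsqr(A.tocsr(), b)[0]
    return np.clip(x, 0.0, 1.0)
```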

Implementation
We solve the optimization above with the lower and upper bounds of all unknowns set between 0 and 1. Because the problem is under-constrained with many unknowns, we impose constraints at initialization and run an alternating optimization over groups of variables. We initialize the light intensity using the absolute color difference across a nearby shadow boundary, constant over all directions. The reflectance is initialized as the median of the background texture, and the indirect lighting is set to $\{C(p_k) - \sum_j D(L_j \to p_k)\}/\rho$. We alternately optimize the light intensity, the reflectance, and the indirect lighting, and use the closest three visible planes per light. The imposed assumptions are sometimes violated due to varying reflectance, the simplification of indirect lighting, and incorrect geometry or texture; we therefore use the Huber function as a robust loss.
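The alternating scheme can be summarized by a generic block-coordinate loop; the residual function, the variable grouping, and the round count below are placeholders for illustration, with the [0, 1] bounds and Huber loss mentioned above.

```python
import numpy as np
from scipy.optimize import least_squares

def alternate(residual_fn, x0, rounds=5):
    """Block-coordinate (alternating) optimization sketch: each variable group
    (e.g., light intensity, reflectance, indirect light) is refined in turn
    with the others held fixed, under [0, 1] bounds and a Huber loss."""
    x = {k: np.clip(np.asarray(v, float), 0.0, 1.0) for k, v in x0.items()}
    for _ in range(rounds):
        for name in x:
            def res(v, name=name):
                trial = dict(x)
                trial[name] = v
                return residual_fn(trial)        # stacked residuals for all samples
            x[name] = least_squares(res, x[name], bounds=(0.0, 1.0),
                                    loss="huber", f_scale=0.1).x
    return x
```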

RESULTS
The pipeline has been implemented on a desktop machine with an Intel Core i7 3.6 GHz CPU and 16 GB of memory. We used the Ceres solver [52] for the optimization problems involved in the registration, color, and light estimation. The rendering of virtual scenes is implemented in the Unreal Engine with point-light sources. We tested the pipeline with sequences available from [16], BundleFusion [25], and the ScanNet [24] data set. The initial mesh is built with the VoxelHashing approach [53]. The details of the data sets used are available in Table 1. After about 2-3 hours of processing, the complete and lightweight representation is acquired with only 1-6% of the face elements. Our algorithm produces a comparable number of planes with a smaller number of faces; the major difference in reduction comes from the meshing approach, which can be seen by comparing the third and fifth rows of Figs. 14 and 15.

Table 2. Each of the components represents plane detection (Section 4.1), plane completion (Section 4.2), color equalization (Section 5), texture generation (Section 5.2), and light estimation (Section 6).

Fig. 11. Plane detection parses the complex geometry of indoor spaces, and the remaining objects can further be classified into isolated clusters.

Fig. 12. The detected non-background regions to be inpainted.

Fig. 13. Provided the acquired geometry and light information, the original video is augmented with a virtual object using differential rendering. Note the shadow resulting from the directional light above the virtual object.

Fig. 14. Comparison of visualization using VoxelHashing [53] (top rows) with our representation (middle and bottom rows). The volumetric reconstruction suffers from ghosting (yellow) or holes (green), and our approach alleviates these artifacts. Meanwhile, our approach erases non-planar objects (blue) and fills them with nearby background colors. Our representation uses a much smaller number of triangles (bottom rows), but exhibits crisp texture for the detected foreground.
Details of the timing are available in Table 2. The plane detection and light estimation steps require the most computation.
In this section, we focus on the visualization of the generated 3D content. (While the process is composed of multiple stages, the effects and performance of the individual components are described in their respective sections.) Figs. 14 and 15 show the qualitative comparison between the original volumetric reconstruction (first row) and our lightweight reconstruction (second and third rows). With a fraction of the elements, we can still convey the overall shape of the environment. Samples of reduced triangle faces are highlighted in the second and third rows of Figs. 14 and 15. More importantly, our pipeline significantly reduces the ghosting artifacts near depth boundaries (shown in yellow) and fills the holes (shown in green). Most of the details in the texture are crisp, and the colors stay equalized regardless of the varying white balance per frame. Our planar representation erases non-planar objects from the original reconstruction and fills the unknown texture of the revealed planar regions with the background color. Some examples are highlighted in blue in Figs. 14 and 15.
We also compared the geometry and texture of our reconstruction against 3DLite [16], whose results are presented in the fourth and fifth rows of Figs. 14 and 15. The reconstruction provided by the authors was in a slightly different coordinate system, and we roughly aligned the viewpoint with our reconstruction. Although 3DLite is similarly focused on dominant planes, the approach is slightly different. The most prominent difference is the choice of planes; for example, the white drawer in Fig. 14(a) and the bookshelf in Fig. 15(b) are detected only in our case, whereas the side-board of the table in Fig. 14(a) and the cupboard in Fig. 15(e) are only detected by 3DLite. The choice of retained planes differs largely because our algorithm emphasizes preserving the room structure, which is evident from the shape of the room in Fig. 15(e). On the other hand, 3DLite puts more emphasis on finding intersections of detected planes and has fewer floating structures such as monitors or chairs (Figs. 14(e), 15(a)(c)). In terms of texture, 3DLite preserves sharper texture when the underlying geometry is not exactly planar, as it allows non-rigid warping of image patches during texture generation. On the other hand, it does not correctly compensate for the light information, and there are prominent differences in white balance between different walls (Figs. 14(c)(e) and 15(b)). We focus on having the correct balance of reflectance and later use the color information to solve for the light information, which is not considered in 3DLite. The light information, in conjunction with the planar abstraction, can be used to augment the original video as in Fig. 13. Our HDR mesh combines multiple frames with different original exposures into a single, coherent texture with high dynamic range, and makes it possible to locate light sources. As a result of the light estimation, the effect of the directional light is visible in the shading and shadow of the augmented object in the AR example (more frames are available in the accompanying video).

Fig. 16. Samples of input frames and the rendering at the same view for the apt and office0 data sets.
With the retrieved planar proxies and light information, we have a simplified representation from which viewpoints can be quickly rendered. We compare the original input frames (first column) with renderings of the textured mesh in Fig. 16. Simply rendering the transformed initial mesh reveals imprecise texture due to errors in the reconstruction and the registration (second column). The re-texturing method using the weighted median of the equalized frames alleviates such artifacts (third column). We then find the background colors of the planes and use the light locations to solve the simplified inverse rendering problem. From the inferred values and the simplified planar geometry, we can create a virtual scene of an empty room that contains the detected loop of walls with their background colors (fourth column). The light variations near the light sources are captured by the color variation of the planes. The virtual scene can be freely altered for VR applications (fifth column).
One interesting by-product of the pipeline is the geometric decomposition of various elements. The geometry estimation creates a decomposition into planes, as depicted in Fig. 11. Most of the time, the decomposed planar elements agree with semantic segments, such as individual pieces of furniture, walls, ceilings, or floors. Non-planar objects can also be segmented after the planar objects are distinguished. Color estimation further enables foreground-background segmentation. The texture completion pipeline separates connected components for the foreground, which can be used to approximately locate different elements on the planar surface (Fig. 12).
There are a few exciting future directions. One is to incorporate the acquired room structure and the information about non-planar geometry to assist intelligent agents in navigating and attending to unknown components [54], [55], [56]. We could also incorporate a physical understanding of the scene to complete unseen parts of the environment [57]. Currently, structures that are only partially planar can remain floating in the air after our pipeline.

Limitations
The described pipeline works as a post-processing step on an initial 3D reconstruction captured by an RGB-D camera, which can itself be a challenge for ordinary users. One could extend the work into an online process with possible user interaction to quickly acquire lightweight 3D content using planar abstraction [58].
The major limitation of the work stems from its main contribution: the reliance on a very strong prior. The prior, while effective in many cases, limits the range of environments that can be reconstructed in terms of geometry, texture, and lighting. More general scenes could be represented by relaxing the restrictions on each of the three components.
In terms of geometry, our pipeline currently uses a single Manhattan-world frame and an enclosed room composed of a single ceiling and floor. This assumption excludes places such as large houses with multiple floors, or open spaces such as auditoriums or stadiums. The pipeline could be extended to cover general planes or other proxies, such as combinations with CAD models, and similarly fill the inevitable holes or artifacts.
The texture and lighting prior can be relaxed by applying techniques from intrinsic imaging. For example, we can apply texture segmentation and solve for texture, lighting, and shading. Currently, we only allow a simple point light assumption, but we can use more complex light models, including windows [33]. While the individual choices of the indoor scene prior can differ, we believe that jointly reasoning about geometry, texture, and light information is possible with a simplified representation of real-world geometry and can allow for light-weight, yet usable capture of the real spaces.

CONCLUSION
We presented a holistic pipeline that converts a captured indoor environment into 3D content ready for AR/VR applications, with full geometry, texture, and lighting information. We first focused on completing the room structure based on plane detection. Individual planes were further refined, detecting dominant colors for the background. The detected backgrounds are used to fill in unobserved regions and to extract the reflectance and shading cues for inverse rendering, while the remaining foreground is used to refine the registration and create a crisp texture on the recovered geometry. The generated representation can be used to visualize and navigate the captured environment. To our knowledge, we are the first to suggest an approach that jointly considers geometry, texture, and light information for an indoor environment and converts a real 3D indoor environment into a complete virtual 3D asset.