3D Reconstruction with Spherical Cameras

The goal of image-based 3D reconstruction is to establish a high-quality 3D representation of a scene from images. In order to achieve high-resolution, real-time 3D models, and inspired by the open-source COLMAP, we propose a novel framework (3DMAP) to reconstruct 3D scenes with spherical cameras. Unlike traditional methods, which fit a 3D surface with the Poisson function, our method addresses the three key processes of 3D reconstruction: locating the camera based on global features, estimating the scene's relative depth from monocular panoramic images, and obtaining a high-quality 3D surface reconstruction. For camera locating, we use a global descriptor augmentation model to build a labeled panorama dataset, GADP, in which the images are captured by our designed spherical cameras. For depth estimation, we propose a new network, UMDE, that can estimate the depth of both indoor and outdoor scenes. Finally, for 3D surface reconstruction, we turn the reconstruction problem into a graph optimization problem, called GraphFit, in which we optimize the point clouds with an s-t graph and a smoothing method successively. We conduct experiments on our own dataset to demonstrate the effectiveness of the proposed 3DMAP framework. Experimental results show that 3DMAP achieves good evaluation scores and visual quality.

To predict the depth from a panorama, Zioulis et al. [12] use ResNet [11], while Cao et al. [13] propose to formulate depth estimation as a pixel classification task. However, most of these models are trained either on indoor or on outdoor scenes, due to the big difference between the two kinds of scenes, while the demand for reconstruction of hybrid scenes has recently grown rapidly. As far as we know, there is no single model that fits these two kinds of scenes simultaneously, so we design a unified depth estimation model.

Traditional reconstruction algorithms [14,15] use the idea of regression, but their results are impacted by the quality of the point clouds. In order to improve the performance, some scholars transformed the reconstruction task into an optimization task [16,17]. Nan et al. [16] proposed an effective optimization algorithm called PolyFit, which achieved a great improvement, but it is difficult to balance its composite energy equation. Therefore, we propose a novel algorithm based on graph optimization to get a better result: it builds an s-t graph [18] and then smoothes the coarse 3D model to obtain a precise reconstruction.

Our work makes the following contributions:
• We propose a new solution, 3DMAP, to reconstruct 3D models with spherical cameras. It consists of three main components: GADP, a learning-based global descriptor augmentation panoramic image dataset; UMDE, a unified depth estimation algorithm that fits both indoor and outdoor scenes; and GraphFit, a surface reconstruction algorithm based on graph optimization.
• We train SuperGlue [19] on our panoramic dataset to improve feature matching and camera locating. Part of our data will be made available at https://github.com/4Dage-Tec/Dataset.

Generally, in a Structure from Motion (SFM) [20] system, the purpose of feature matching is to estimate the cameras' poses. Traditional algorithms such as SIFT [1], SURF [2], AKAZE [3], and ORB [21] extract key-points and calculate hand-crafted descriptors, and the works [4][5][6][7] are built on the same idea. Although these algorithms perform well in some fields, they cannot handle the global context well; moreover, in recent years they have been surpassed by learning-based local feature extraction algorithms [22]. To enhance the feature matching process, some works merge global information into key-point descriptors [23,24].

For indoor reconstruction, many methods rely on strong priors such as vertical walls [38][39][40] or the Manhattan world [39][40][41]. However, these assumptions bring in obvious limitations. In order to achieve a better reconstruction of indoor scenes, our method avoids relying on such assumptions.

MATERIALS AND METHODS
A. Overview
This section consists of four parts: the overview and the three key aspects of the reconstruction system. First, we build GADP, a learning-based global descriptor augmentation panoramic image dataset, and use it to locate the cameras. Then, UMDE, a unified learning-based monocular depth estimation model, estimates the depth of the panoramic images. Finally, in GraphFit, we use the graph-optimization 3D surface reconstruction algorithm to improve the precision of the 3D model. In the rest of this section, we mainly describe these three improved processes: a learning-based global descriptor augmentation panoramic image dataset, a unified learning-based monocular depth estimation model, and a surface 3D reconstruction algorithm.

B. Camera Locating Based on Panoramic Images
The limitations of existing perspective image-based methods. In the image-based 3D reconstruction process, camera locating is the basic step; generally, the camera is located through feature matching. The traditional patch feature extraction algorithms SIFT [1] and AKAZE [3] use an image pyramid to enhance their performance, but their hand-crafted descriptors cannot meet the requirements of locating the camera well. Though CNN-based algorithms [36,48,49] perform better, they are designed for limited-FOV perspective images. We compare matching on panoramic and perspective images, and the results are shown in Figure 2. We indicate the confidence of key-point matching with color: from blue to red, the confidence increases gradually, and the color bar is on the right side of Figure 2. From the figure, we can see that panoramic images contain more information than the three perspective images.

Camera locating method based on panoramic images. In order to get more correct matches on panoramic images, we perform the following four operations. Firstly, we split a panoramic image into six perspective images (as sketched below) and use a pretrained model (e.g., SuperGlue [19]) to infer feature matching results; we then reproject those matches back onto the panoramic images. Secondly, with the help of our SFM system, we first obtain the cameras' location information: we feed the matching results into an SFM pipeline to get reliable sparse point clouds, and to improve the reliability of the matches, we keep only those with more than four tracks (namely, more than four cameras that can observe the feature). Thirdly, we train a global learning-based feature extraction and matching model on the resulting dataset, as shown in Figure 3 (indoor) and Figure 4 (outdoor). Finally, we run inference with our trained model on our dataset to get more reliable corresponding key-point pairs and repeat the process.
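As a concrete illustration of the first operation, the following is a minimal sketch of splitting an equirectangular panorama into six 90° perspective faces; the helper name, face ordering, face resolution, and input path are our own assumptions, not the paper's implementation.

```python
# Resample an equirectangular panorama into six 90-degree-FOV perspective
# faces, so a perspective matcher (e.g. SuperGlue) can be applied per face.
import cv2
import numpy as np

def equirect_to_cube_face(pano, face_size, yaw, pitch):
    """Render one 90-deg-FOV perspective face looking at (yaw, pitch)."""
    h, w = pano.shape[:2]
    f = face_size / 2.0                        # focal length for 90-deg FOV
    # Pixel grid of the target face, centered at the principal point.
    xs, ys = np.meshgrid(np.arange(face_size) - f + 0.5,
                         np.arange(face_size) - f + 0.5)
    # Camera-frame viewing ray for every face pixel (z forward).
    dirs = np.stack([xs, ys, np.full_like(xs, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Rotate the rays toward the requested viewing direction.
    Ry = cv2.Rodrigues(np.array([0.0, yaw, 0.0]))[0]
    Rx = cv2.Rodrigues(np.array([pitch, 0.0, 0.0]))[0]
    dirs = dirs @ (Ry @ Rx).T
    # Spherical angles -> equirectangular pixel coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])           # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))      # [-pi/2, pi/2]
    u = ((lon / np.pi + 1.0) * 0.5 * w).astype(np.float32)
    v = ((lat / (np.pi / 2) + 1.0) * 0.5 * h).astype(np.float32)
    return cv2.remap(pano, u, v, cv2.INTER_LINEAR)

# Six faces: front, right, back, left, up, down (our assumed ordering).
views = [(0, 0), (np.pi / 2, 0), (np.pi, 0), (-np.pi / 2, 0),
         (0, -np.pi / 2), (0, np.pi / 2)]
pano = cv2.imread("pano.jpg")                  # hypothetical input path
faces = [equirect_to_cube_face(pano, 512, yaw, pitch) for yaw, pitch in views]
```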

C. Unified Mono Depth Estimation on Indoor and Outdoor Panoramic Scenes
The requirements of a unified depth estimation model. Depth estimation is an important step in image-based 3D scene reconstruction. Previous learning-based depth estimation models are trained either on indoor [16] or outdoor [13] scenes separately, due to the big differences between them, such as the lighting, the object distances, and the background distances. Moreover, in a mixed scene, the main prediction error comes from the distant view, so during training the model iterates to reduce the long-range error, thereby hurting the prediction accuracy at close range. As far as we know, there is no single model that suits both kinds of scenes at the same time. However, the demand for 3D reconstruction of hybrid scenes has recently grown rapidly, and a unified model is necessary for our 3D reconstruction system. In addition, in our study, we found that depth maps estimated from monocular images are more continuous and structured than those from binocular images (their point clouds are more visually pleasing, which is very important to our later reconstruction step). The reason is that binocular estimation relies heavily on the disparity map, which drifts a lot under a small disturbance of the transformation matrix between the left and right (or up and down) cameras. Furthermore, binocular data is more difficult to collect. Though some excellent works focus on binocular depth estimation (e.g., 360SD), they usually train and test on synthetic datasets, which are very difficult to transfer to real scenes. To this end, we propose a Unified Mono Depth Estimation (UMDE) network to mitigate the above problems.
The UMDE network. The aim of UMDE is to estimate the depth of outdoor or indoor scenes. For indoor scenes, a common depth estimation network (e.g., U-ResNet [13]) can predict the depth directly, but for outdoor scenes the depth map estimated by U-ResNet [13] (or another usual depth estimation network) contains distant views (such as the sky and distant objects) that harm the reconstruction. In order to weaken this impact, we introduce a mask network to filter the distant background. We therefore design the UMDE network: it chooses U-ResNet [13] as the backbone and uses the Binary Cross Entropy loss to train on our dataset, which includes indoor and outdoor panoramic images captured with our spherical camera [8]; the ground truth contains the image depth captured by LiDAR and the semantic information (distant or close view). The network architecture is shown in Figure 5. It includes two subnets, a mask sub-network and a depth estimation sub-network. The two subnets share the U-ResNet [13] and cooperate to predict the depth map of the input image. First, the mask sub-network trains a mask layer. Second, the depth estimation sub-network predicts a coarse depth of the input image with the network pretrained in the mask sub-network, and then computes the final depth map from the coarse depth and the mask layer.

The mask sub-network. The mask sub-network removes the distant view from the coarse depth. In essence, it is a semantic segmentation network whose main function is to segment the image into distant and non-distant parts. As described above, it uses U-ResNet [13] as the backbone and is trained with the Binary Cross Entropy loss on our own dataset. The loss function is defined as

$L_{mask} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]$  (1)

where $y_i$ is the label and $\hat{y}_i$ is the predicted value. The trained mask layer divides the image pixels into two categories, distant views and close views or objects: pixels of the distant view are set to 0, and the others are set to 1.

The depth estimation sub-network. In the depth estimation sub-network, we first take the U-ResNet [13] pretrained in the mask sub-network to estimate the coarse depth. Then we compute the mask layer with the softmax function, so that the pixels of the distant view become 0 and the others share the same probability. Finally, we take the element-wise product of the coarse depth map and the mask to filter out the distant background. The loss function is defined in terms of $d_i^n$ (Eq. 4) and $g_i^n$ (Eq. 5). With the mask layer removing the long-range view, not only is the long-range prediction error eliminated, but the network can also iterate in the direction of improving close-range accuracy. Additionally, we train the mask sub-network first, because we found that training the two subnets together makes convergence hard and tends to get stuck in local minima.
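To make the two-subnet design concrete, here is a minimal PyTorch sketch of the forward pass; the tiny convolutional backbone, layer sizes, and the sigmoid gating are illustrative stand-ins for the actual shared U-ResNet [13], and all names are hypothetical.

```python
# Sketch of UMDE's structure: a shared backbone feeds a mask head and a
# depth head; the final depth is the coarse depth gated by the mask.
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in for the shared U-ResNet feature extractor."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.net(x)

class UMDESketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = TinyBackbone()
        self.mask_head = nn.Conv2d(32, 1, 1)    # distant vs. close view
        self.depth_head = nn.Conv2d(32, 1, 1)   # coarse depth

    def forward(self, img):
        feat = self.backbone(img)
        mask_logit = self.mask_head(feat)
        coarse_depth = self.depth_head(feat)
        # Distant pixels -> ~0, close pixels -> ~1, then gate the depth.
        mask = torch.sigmoid(mask_logit)
        return coarse_depth * mask, mask_logit

# Stage 1 trains the mask subnet alone with BCE, as in Eq. (1).
model = UMDESketch()
img = torch.randn(2, 3, 256, 512)               # equirectangular batch
labels = torch.randint(0, 2, (2, 1, 256, 512)).float()
_, mask_logit = model(img)
mask_loss = nn.BCEWithLogitsLoss()(mask_logit, labels)
```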

D. Cubic Reconstruction From AI-DepthMaps (GraphFit)
In this section, we propose GraphFit, a novel algorithm for 3D surface reconstruction from multi-view depth maps. Most traditional methods treat reconstruction as a regression task, such as the Poisson function [15]. However, these methods require high-precision point clouds, while the point clouds of real scenes have many missing points and much noise, so it is difficult to reconstruct the scene well with traditional methods. For this reason, some researchers, for example Nan et al. [16] and Coudron et al. [17], focus on transforming the reconstruction problem into an optimization problem. Nan et al. [16] proposed PolyFit, which transforms the reconstruction task into an integer-programming task and achieves a great improvement in smoothness. However, it is difficult to balance the weights of the parts of its compound energy equation, composed of fitting, smoothing, and point coverage, because the three parts are mutually exclusive. We therefore take a divide-and-rule strategy to avoid this problem. Based on this, we propose GraphFit, consisting of three main parts shown in Figure 6. GraphFit transforms the reconstruction task into a graph optimization task, and the whole process contains three stages: generating polygons and polyhedrons, generating the fitting patches, and generating the smooth patches.

1) Generating Polygons and Polyhedrons
Polygons and polyhedrons. In this part, we generate polygons and polyhedrons from the multi-view depth maps estimated in the previous step. Firstly, we generate the free space from the depth maps and obtain the free space from the point clouds following Kazhdan et al. [15] (Figure 6-(a)), and extract the exterior point clouds. Secondly, we cluster the point clouds: we combine normal vectors, adjacency, and plane distribution to cluster them based on mean-shift [51] (Figure 6-(b)), as sketched below. Finally, we get the initial planes from the point clouds [52] and generate polygons and polyhedrons from the initial planes with [15] (Figure 6-(c)).
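The clustering step can be sketched roughly as follows; combining position and scaled normals into one feature vector is our simplification of the "normal vectors, adjacency, and plane distribution" criterion, and the weighting and bandwidth values are assumptions for illustration.

```python
# Group points into roughly planar regions with mean-shift clustering.
import numpy as np
from sklearn.cluster import MeanShift

def cluster_planar_regions(points, normals, normal_weight=5.0, bandwidth=0.5):
    """points, normals: (N, 3) arrays; returns one cluster label per point."""
    # Scale normals up so surface orientation dominates over raw position.
    feats = np.hstack([points, normal_weight * normals])
    labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(feats)
    return labels
```

Each resulting cluster is then fit with an initial plane, from which the candidate polygons and polyhedrons are generated.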

2) Generating the Fitting Patches
From the polygons and polyhedrons generated in the previous part, we select a subset of the polygons that optimally describes the geometry of the scene. To achieve this goal, we follow the divide-and-rule idea and propose two energy equations to deal with fitting and smoothing separately.

Building the graph. Any two neighboring polyhedrons share exactly one common polygon. We regard each polyhedron as a vertex and the polygons shared by adjacent polyhedrons as edges to build the s-t graph. In our s-t graph, the initial s-vertices are the polyhedrons containing a camera, and the initial t-vertices are those not in free space. Furthermore, we propose a fitting energy equation (Eq. 6), where $x$ is a polygon, $N$ represents the cameras that can observe the plane containing $x$, and $V_{ij}$ is the spatial direction of pixel $j$ on camera $i$. Then we use the min-cut algorithm to find a cut [S, T] with the smallest capacity, as in [14]; a sketch of this step is given below. From the min-cut, we obtain a polygon set forming a well-fitting surface structure. After that, we define a patch as a set of adjacent, coplanar polygons in the fitting surface structure and obtain an initial patch set. The result is shown in Figure 6-(d), where the red line marks our patch set.
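For illustration, here is a toy version of the min-cut selection, assuming networkx and hypothetical cells and capacities in place of the real polyhedron adjacency and the Eq. (6) energies.

```python
# Polyhedrons become vertices, shared polygons become edges whose capacity
# is the fitting energy; a minimum s-t cut separates free from solid space.
import networkx as nx

G = nx.DiGraph()
# Hypothetical cells: "s" = contains a camera (free), "t" = outside free space.
edges = [("s", "c1", 3.0), ("c1", "c2", 1.2),   # (cell, cell, fitting energy)
         ("c1", "c3", 4.0), ("c2", "t", 2.5), ("c3", "t", 0.8)]
for u, v, cap in edges:
    G.add_edge(u, v, capacity=cap)
    G.add_edge(v, u, capacity=cap)              # undirected adjacency

cut_value, (free_side, solid_side) = nx.minimum_cut(G, "s", "t")
# The polygons crossing the cut form the fitted surface.
surface = [(u, v) for u in free_side for v in G[u] if v in solid_side]
print(cut_value, surface)
```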

3) Generating the Smooth Patches
In the smoothing energy, $n$ represents the current patch number and $R_0$ is the mean original smoothing residual of each patch calculated from the min-cut. As we know, the smoothing process increases the fitting residuals; that is, the two are inverse to each other. In order to balance the fitting and smoothing residuals, we establish the relationship $R_0$ between them. $R_0$ is defined in terms of three parameters $a$, $b$, and $c$; according to our study on our dataset, we set $a=-0.0001$, $b=-10000$, $c=10000$ in order to smooth adaptively.

To smooth the 3D structure, we iterate over the patches (see the sketch below). We first select a patch randomly, and then divide the s-t graph into multiple s-vertex closures and t-vertex closures along the plane containing the selected patch. To decrease the number of patches, we cut some s-vertex closures off from the current s-vertices, or merge some t-vertex closures back into the current s-vertices. After that, we calculate the comprehensive energy of the larger section and of the whole model with formula (7) separately and compare the two values. If the energy of the larger section is smaller than that of the whole model, the iteration is successful; we keep the larger section and continue iterating until the comprehensive energy cannot diminish any further. In the comprehensive energy equation (7), $S$ is the patch set in the current model, and $n$ is the number of patches in $S$. The smooth result is shown in Figure 6-(e), where the red line marks the smooth patches.
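The acceptance logic of the smoothing iteration can be summarized in a few lines; `energy` and `propose_move` are placeholders standing in for Eq. (7) and the closure cut/merge operations, so this is a schematic, not the released algorithm.

```python
# Greedy energy minimization: keep a proposed cut/merge only if the
# comprehensive energy of the model decreases.
import random

def smooth_patches(patches, energy, propose_move, max_iters=10000):
    best = energy(patches)
    for _ in range(max_iters):
        # Randomly pick a patch and propose cutting/merging closures
        # along the plane that contains it.
        candidate = propose_move(patches, random.choice(patches))
        e = energy(candidate)
        if e < best:                  # accept only energy-decreasing moves
            patches, best = candidate, e
    return patches                    # energy stops improving -> converged
```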

Experiments
In the experiments, we mainly test our dataset, the depth estimation network, and the 3D surface reconstruction. Besides these three parts, this section describes the datasets used to estimate depth maps, the implementation details, and the ablation studies.

A. DATASETS
GADP. The dataset we generate contains over 1,000,000 high-resolution images from about 10,000 scenes covering rooms, corridors, streets, etc. The images are captured by a spherical camera [8], and the label information contains all key-points extracted from each panoramic image and an index of the corresponding key-points of image pairs (usually captured at nearby locations).

HSPD. The dataset we use to train our UMDE model contains 100,000 panoramic images, depth labels captured by LiDAR equipment, and the semantic information (distant or close view). The obtained sparse depth maps are projected back onto the RGB images. This dataset includes rich scenes and high-resolution images, and the ratio of outdoor to indoor scenes is 1:9.

B. Environment
Our network is implemented in PyTorch 1.5.0 and trained on an NVIDIA RTX 2080 Ti. The Python version is 3.6.9, and the operating system is Ubuntu 18.04 LTS.

C. Experimental Results and Analysis
This part conducts comparative experiments on three aspects: feature matching based on our dataset, depth estimation, and surface reconstruction.

1) Feature Matching and Camera Location with Our Dataset (GADP)
The feature matching results. To prove the effectiveness of our dataset, we first contrast the self-attention and cross-attention of SuperGlue [19] and of our model on the same panoramic image, as shown in Figure 7. Then, we compare the test results on the same panoramic images of SuperGlue [19] trained on a perspective image dataset and on our panoramic dataset separately; the indoor results are shown in Figure 8 and the outdoor results in Figure 9, in which green lines indicate correct matches and red lines indicate wrong matches, the same as in Figure 3.
Analysis. Figure 7 shows the comparison of SuperGlue [19] and our model tested on the same panoramic image, and the quantitative feature matching results can be seen in Table 1, from which we find that, given the same number of key-points, our model produces more matches than SuperGlue. We conclude that SuperGlue [19] pre-trained on perspective images focuses on a small area of the whole panorama, which limits its global information, while the model trained on our dataset can spread its attention over the whole image, making it more robust. Furthermore, Figure 8 and Figure 9 show comparisons of SuperGlue [19] trained on the different datasets and tested on the same indoor and outdoor images; the inlier counts are given in Table 2. From Figure 8, Figure 9, and Table 2, we can see that the model trained on our panoramic dataset yields many more inliers than the one trained on the perspective image dataset, in both indoor and outdoor scenes. Additionally, in Figure 8 the rotation between the image pair of scene 1 is large, and the translation between the image pair of scene 2 is long; the results prove that even in such strict situations (with sharp changes), the model trained on our panoramic dataset remains effective. Figure 9 shows similar behavior to Figure 8.

The camera location results. We train SuperGlue [19] on our panoramic dataset, combined with SFM, to obtain high performance, i.e., more correct feature matches and inliers on panoramic images. With the correct feature matches, we can locate the camera accurately. Figure 10 (indoor) and Figure 11 (outdoor) show the camera poses located through the matched features. Specifically, Figure 10 displays the camera locations of the images in scene 2 of Figure 8, and Figure 11 displays the camera locations of the images in scene 2 of Figure 9.
Analysis. The important goal of providing a panoramic dataset is to locate the spherical camera pose. From Figure 10 and Figure 11, we count the location results of the two scenes in Figure 8 and Figure 9 in Table 3. The results tell us that there are overlaps among the camera poses located with the model trained on the perspective image dataset, whereas the camera poses located with the model trained on our dataset are close to the ground truth. The overlaps seriously impact the following 3D reconstruction or even make it nearly impossible. The correct camera locations further prove the value of our GADP dataset.

TABLE 3. Number of located cameras.

                                                     Scene 2 in Figure 8   Scene 2 in Figure 9
Ground truth                                                  11                    10
SuperGlue trained on perspective image dataset                 5                     3
SuperGlue trained on our panoramic image dataset              11                    10

2) UMDE Depth Estimation Results and Analysis
The depth estimation results. In order to test the performance of our depth estimation network, we compare our model with Omnidepth [13] on four groups of indoor scenes and of outdoor scenes separately. The results are shown in Figure 12 (indoor) and Figure 13 (outdoor). To evaluate the predicted depth, we select the Mean Absolute Error (MAE), the Absolute Relative Difference (Abs_Rel), the accuracy under threshold δ proposed in [9], and the time cost to process a picture (Time(s)) as the evaluation indexes; the experimental results are shown in Table 4. A worked example of these metrics is given below.
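For clarity, the evaluation indexes can be computed as follows; we assume the usual δ < 1.25^k convention for the threshold accuracy, which we believe matches [9].

```python
# Standard monocular depth metrics over valid (close-view) pixels.
import numpy as np

def depth_metrics(pred, gt):
    """pred, gt: depth maps of equal shape with gt > 0."""
    pred, gt = pred.ravel(), gt.ravel()
    mae = np.mean(np.abs(pred - gt))                  # Mean Absolute Error
    abs_rel = np.mean(np.abs(pred - gt) / gt)         # Absolute relative diff
    ratio = np.maximum(pred / gt, gt / pred)
    delta = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return mae, abs_rel, delta
```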
Analysis. Figure 12 shows that our method preserves more of the detailed structure present in the ground truth, whereas the prediction of Omnidepth loses the detailed information. Furthermore, in mixed scenes, our model removes the distant view that impacts the reconstruction, but Omnidepth does not. In Figure 13, we find that our method removes the background precisely and retains the complete building information in outdoor scenes. On the contrary, Omnidepth [13] estimates the background wrongly: the building information is seriously lost because the background is not removed, and the building cannot be distinguished from the background.
As far as we know, there is almost no other panorama dataset containing indoor and outdoor scenes simultaneously. Although Matterport3D has indoor and outdoor scenes, the distant views in its images were removed by hand. We tested our model and Omnidepth on Matterport3D without retraining, and the results are not ideal.

The intermediate outputs of depth estimation.
In order to illustrate our depth estimation model in detail, we also show the intermediate outputs of our unified model in Figure 14.
Since our depth estimation model consists of two subnets, the two sub-networks predict the coarse depth and the semantic segmentation result in parallel, and then an element-wise product between the coarse depth and the semantic segmentation (the mask layer) is applied to get the final depth of the input image. We therefore choose the coarse depth (the depth without the mask) and the mask semantic segmentation as the intermediate outputs. In order to display the differences between the intermediate outputs and the final depth, we show the ground truth and the final depth at the same time.
Analysis. The advantage of our UMDE model is that it can deal with both indoor and outdoor scenes. From the comparison results of Figure 12 and Figure 13, we know that the advantage lies mainly in the outdoor scenes, so we display only the intermediate outputs of outdoor cases. From Figure 14, we can see that the depth predicted without the mask layer contains not only the near views but also the distant views, such as the sky, which would impact the following 3D reconstruction; hence the distant views in the coarse depth should be removed. Looking at the mask semantic segmentation, the pixel values of the distant views are exactly 0 (black). Following the UMDE process, the product of the coarse depth and the mask layer gives a depth map with the distant views removed, which would otherwise seriously impact the 3D reconstruction of large outdoor scenes. We can therefore conclude that the performance of UMDE depth estimation is high.

3D point cloud results. In order to prove that our depth estimation method (removing the distant view from the depth maps without any pre-processing) is efficient, we design another experiment that projects the depth maps to 3D point clouds (a sketch of this projection is given below). We choose four different scenes, including indoor and outdoor ones. The results are shown in Figure 15.
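A compact sketch of this projection, assuming an equirectangular depth map and our own axis convention:

```python
# Unproject every pixel of an equirectangular depth map along its viewing
# ray on the unit sphere; masked (zero-depth) pixels are dropped.
import numpy as np

def depth_to_point_cloud(depth):
    """depth: (H, W) equirectangular depth map -> (M, 3) points."""
    h, w = depth.shape
    lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi       # [-pi, pi]
    lat = (np.arange(h) + 0.5) / h * np.pi - np.pi / 2       # [-pi/2, pi/2]
    lon, lat = np.meshgrid(lon, lat)
    # Unit viewing ray per pixel, scaled by the predicted depth.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    pts = depth[..., None] * np.stack([x, y, z], axis=-1)
    return pts.reshape(-1, 3)[depth.ravel() > 0]             # drop masked pixels
```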
Analysis. From Figure 15, we find that in the outdoor scene, Omnidepth [13] did not remove the distant view: in its point clouds, objects are warped over the scene, which severely impacts the reconstructed model. On the contrary, the point clouds projected from the depth predicted by our model and from the ground truth have the distant view removed, and the reconstructed models are visually pleasing. From the results of Omnidepth tested on the different datasets (indoor, outdoor, and mixed), we learn that for indoor scenes our model produces results comparable to Omnidepth, while for outdoor scenes our model is much better. Obviously, this is because our mask removes from the result some things that are difficult for depth estimation (such as trees and other small outdoor objects). Though it is a bit unfair to compare on outdoor scenes, our goal is not to recover as much depth information as possible from the RGB image, but to extract as much useful information as possible to recover a visually pleasing model. As for the mixed dataset, the Omnidepth model does not perform better than ours; from the point cloud results, we know that Omnidepth cannot predict distant things very well, and to keep continuity it tries to warp the outdoor scene as a whole, which is not what we expect for our later pipeline. According to the results shown in Figure 15, the 3D point clouds projected from the depth maps estimated by our UMDE model are good, which further proves UMDE's high performance.

The failure cases of UMDE. To illustrate our model more comprehensively, we provide some failure cases in Figure 16. It contains three scenes from indoor and outdoor settings, including a nearly successful case, a failure case, and a total failure case.

Analysis. In fact, case 1 is not a total failure: our model predicted the mask successfully, but when removing the distant views from the image, it removed more than we wanted; as we can see, it also removed half of the building (the right one) from the result. The reason is that, although we collected a lot of outdoor data, it is not enough to train a very clean edge on our dataset; we need to add more annotated outdoor images. Case 2 was predicted by our model on the Matterport3D dataset. As we can see, our model did not predict the mask very well on this image. The reason is that our dataset, including our training data, is captured by real panoramic cameras, while the Matterport3D data, obtained in the same way as Unifuse [54] describes in their paper, consists more or less of rendered images and loses a lot of distant information in the RGB image. Case 3 is a total failure: it removed more than we wanted, and the building on the right was completely removed from the result. Besides the dataset issue, we will try other methods (such as Laplacian pyramid fusion) to avoid these failures in the future.

FIGURE 12. Qualitative results of UMDE tested on our own dataset (indoor). From left to right: input panoramic images, ground truth, Omnidepth [13], and our method.

FIGURE 13. Qualitative results of UMDE tested on our own dataset (outdoor). From left to right: input panoramic images, ground truth, Omnidepth [13], and our method.

3) GraphFit (3D Reconstruction) Results and Analysis
In order to evaluate the performance of our 3D surface reconstruction algorithm, we compare it with other state-of-the-art models, provide some failure cases, and analyze the time cost.

Comparison with other models and analysis. We compare our model with Poisson [14] and PolyFit [16]; the results are shown in Figure 17. We select four scenes and run four groups of tests. In the first group, the surfaces of the models generated by Poisson [14] are rough and undulating; although the reconstruction planes of the models generated by PolyFit [16] are flat, they lack the detailed information marked by the red circles. Additionally, there is a prominent part that Poisson [14] produces and PolyFit [16] fails to reconstruct, while our model reconstructs it clearly, and our reconstruction truthfully reflects the presence of bumps in this area. The second group supports the same conclusion. In the third group, we can see that the area in the red circle is not flat in Poisson [14], and is smooth but missing some information in PolyFit [16].
In contrast, the area is flat and clear in our model. The fourth group, in particular, is the aerial view of a two-story indoor scene; from the red circle, we can clearly see that the model reconstructed by our method has more detailed structure than PolyFit's [16]. Moreover, PolyFit [16] takes a long time to run on this fourth scene (the two-story indoor scene).

The failure cases and analysis. In order to better analyze the performance of our model, among our many experiments we found that open scenes, curved surfaces, and missing data may cause failures, so we selected three such scenes that our model fails to rebuild for analysis; these failure cases are shown in Figure 18. From the reconstruction results in the first row, we can see that both our algorithm and Poisson [14] reconstruct only the ground in the outdoor scene. The reason is that the essence of our algorithm is to find the boundary of the free space; in open outdoor scenes the boundary of the camera's field of view is infinite, so wrong walls and ceilings may be generated. From the red circle in the second row, we can see that our model does not reflect surface curvature as well as the Poisson reconstruction. In the third row, the red circles show that, compared with the Poisson reconstruction, our result loses some objects. The reason the surface cannot be reconstructed and objects are lost is that the clustering algorithm used in PolyFit's [16] preprocessing struggles to recover the true planes; since our algorithm uses the same clustering algorithm as PolyFit [16], the same problems occur.

The complexity analysis. In order to further study the performance of our algorithm, we compare its time cost with PolyFit [16] and Poisson [14]; the results are shown in Figure 19. From Figure 19, our algorithm is faster than PolyFit [16] but slower than Poisson [14]. In theory, the clustering part depends on the number of points and has complexity O(n²); the graph generation part depends on the number of polygons generated by the clustering, with complexity O(n); and the smoothing part depends on the number of polygon patches, with complexity O(n²). Compared with Poisson [14], which uses marching cubes, we use a deterministic model with low requirements on point cloud accuracy and good regression results. Compared with the binary optimization used by PolyFit [16], which relies on third-party optimization libraries whose running time and effect are hard to control, the efficiency of our algorithm is more stable.

D. ABLATION STUDIES
In order to further analyze our method, we design an ablation test to verify the effectiveness of our smoothing algorithm.
In this part, we compare two sets of experiments, with and without the smoothing algorithm: one set uses our complete model, and the other uses only our s-t graph to build the model, without smoothing. The compared results are shown in Figure 20 and Table 5. From them, we can clearly see that our optimization algorithm not only greatly smoothes the wireframe but also keeps the overall structure of the model. According to Table 5, our smoothing algorithm reduces the number of patches by 80% on average and increases the smoothness by four times. Furthermore, the three groups of results show that the total energy of the model, including the fitting energy and the smoothing energy, drops by about 40% after smoothing.

CONCLUSIONS
We propose an efficient solution, 3DMAP, to reconstruct scenes with a spherical camera. It includes a labeled panoramic image dataset (GADP) to locate the camera pose, a unified monocular depth estimation framework (UMDE), and a novel reconstruction algorithm (GraphFit) based on graph optimization. GADP is the first large labeled panorama dataset; it contains rich label information and will make an important contribution to computer vision, especially to 3D reconstruction. UMDE is the first unified framework to deal with indoor and outdoor scenes simultaneously, making the reconstruction more natural. Finally, GraphFit transforms surface reconstruction into graph optimization, which reduces the complexity and makes a simple solution possible for a difficult problem. Besides the detailed description of the proposed methods, we also provide extensive experiments and analysis to evaluate the performance of our solution for 3D reconstruction based on spherical cameras. Overall, this set of solutions integrates traditional and deep learning methods to achieve the desired effectiveness. In order to further improve the reconstruction, our future work will focus on the following:
• Enrich our datasets and improve the accuracy of labeling.
• Improve the accuracy of depth estimation in outdoor scenes based on our proposed method.
• Use a learnable method to determine the threshold of our smoothing algorithm.
• Explore more possible fields of application based on our dataset and algorithm.