Scene Target 3D Point Cloud Reconstruction Technology Combining Monocular Focus Stack and Deep Learning

To obtain the depth information of targets in a scene and realize three-dimensional (3D) reconstruction, this paper proposes a target reconstruction method that combines monocular focus stack images with a deep neural network. The method exploits the advantages of light field imaging technology and can generate an all-focus image. It first collects multiple frames of continuous images at different focal distances in the scene and then follows a divide-and-conquer strategy: the uplink uses the YOLO neural network to identify the target in 3D space and track its position information, while the downlink reconstructs four-dimensional (4D) light field data from the focus stack images by frequency-domain back projection and then uses light field imaging technology to invert the scene parallax, achieving scene depth estimation and reconstruction of the all-focus image. Finally, the uplink and downlink results are merged to reconstruct the 3D point cloud of the spatial target. Experimental results on real scenes show the effectiveness of the proposed algorithm.


I. INTRODUCTION
Scene target detection and three-dimensional reconstruction based on visual sensor terminals have received extensive attention in mobile robotics, industrial inspection, and exploration. First, appropriate sensors are needed to capture three-dimensional information about the world. Binocular and RGB-D cameras are widely used, but they are limited by their large size and incomplete depth information. Structured light obtains three-dimensional scene information through the triangulation principle, but it has a short detection distance and low reconstruction accuracy. Although a monocular camera is small, it suffers from scale uncertainty and can only record the two-dimensional (2D) appearance of the scene with almost no angular information; as a result, conventional 3D vision is sensitive to the local structure of the scene.
The associate editor coordinating the review of this manuscript and approving it for publication was Ye Duan.
As an emerging technology, light field imaging expands the field of computational imaging and computer vision and provides new methods for high-precision 3D vision sensing. Light field data record light irradiance and direction simultaneously; the coupling between light direction and scene depth carries rich depth information and allows the scene depth image to be reconstructed with higher accuracy than stereo vision [1], [2], which further enables 3D reconstruction of the scene. Light field data can be acquired directly with a main lens-microlens array [3] or a camera array [4], [5]; indirect methods such as coding masks [6], [7] or focus stack images [8] can also be used to reconstruct the light field. The focus stack is compressed light field information, obtained by shifting the lens while keeping the other imaging parameters constant; it allows flexible acquisition of real scenes and reconstruction of the light field at any angular resolution. Scene depth reconstructed from light field data is dense and pixel-by-pixel; at the same time, an all-focus image of the scene can be generated and projected into the scene to restore a 3D point cloud with rich colors and textures.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In the actual work of robots, 3D reconstruction is applied not to the whole scene but only to specific target areas; reconstructing the whole scene is time-consuming without an obvious benefit. In this case, machine learning becomes the best choice for recognizing targets in the scene; in recent years, target detection and positioning based on deep neural networks have developed rapidly and show clear advantages. Therefore, this paper proposes a fusion method to achieve accurate 3D reconstruction of the target area in the scene. The main contributions of the method include the following three aspects:
• An algorithm framework is proposed that fuses deep learning target detection with light field scene reconstruction through a parallel uplink-downlink method, and finally recovers the 3D point cloud of a specific target in the scene.
• A monocular passive 3D visual sensing technology suitable for small-scale robots is proposed. This technology back-projects the focus stack images to form a 4D light field and performs scene depth estimation to reconstruct the all-focus image.
• Experimental studies are carried out in two different real scenes, the influencing factors of target reconstruction are analyzed and discussed, and 3D point clouds of different set targets in the scenes are reconstructed.

II. RELATED WORK
Passive, contactless measurement of visual information as feedback is highly valued for robot exploration and rescue, where it enables scene depth estimation and 3D reconstruction. Singh et al. used a Kinect to obtain distance information and realized 3D scene reconstruction by classifying statistically thick and thin areas, but the TOF sensor was limited in resolution [9]. Yang et al. proposed a 3D reconstruction system combining binocular and TOF depth cameras to accurately identify the distance between objects and the camera, improve the scene acquisition resolution and stereo matching, and raise the reconstruction accuracy; however, the system adopts an active sensing method with a large scale and high power consumption [10]. Compared with binocular and RGB-D cameras, a monocular camera offers high adaptability and low power consumption. Newcombe et al. proposed a monocular visual pose estimation and sparse point cloud generation method that uses structure from motion (SFM) to predict a basic scene mesh, which is warped into a depth image to finally reconstruct the scene model; the algorithm needs to predict and update the optical flow of the scene [11]. Compared with traditional imaging, which preserves only 2D information, light field imaging obtains richer information for processing and reconstruction. The complete light field information of a scene is expressed by the seven-dimensional plenoptic function L(V_x, V_y, V_z, θ, ϕ, λ, t) proposed by Adelson in 1991. The two-plane parametric model of the light field proposed by Levoy and Gortler et al. approximates the seven-dimensional function by the four-dimensional light field L(x, y, u, v), where the plane (x, y) records spatial information and the plane (u, v) records directional information.
The two-plane parametric model can be used to generate images with different parameters and is widely used in actual imaging systems to achieve digital refocusing [12]-[14], depth reconstruction [15], [16], and high-precision 3D scene reconstruction [17], [18]. Ng acquired 4D light field data through a lens-microlens array, further reconstructed scene depth and arbitrarily focused images, and built the Lytro handheld light field camera; because a single sensor trades spatial resolution for angular resolution, the final imaging resolution is low [19]. Raghavendra et al. captured images with different focal points in the scene through a 4D light field camera, rendered the attributes of multiple depth images, and improved the performance of 3D reconstruction and recognition of faces in the scene with a new resolution-enhancement technique based on the discrete wavelet transform [20]. A camera array, using multiple cameras or a single camera combined with a precision mobile platform, can acquire light field data with rich viewing angles; thanks to the large equivalent aperture, the resolution can be higher and the sampling density can be flexibly controlled, but the system is large and costly. Wang et al. used a camera array system to render the unfocused areas in the scene based on the anisotropy of the depth estimation, and then generated refocused images through a reconstructed super-resolution method [5]. Xu et al. indirectly captured high-resolution light field data by using two attenuation masks and a two-dimensional (2D) camera sensor to encode and sample the four-dimensional (4D) spectrum [7]. Based on the focal stack projection model, Liu et al. proposed the filtered-back-projection (FBP) algorithm for reconstructing the 4D light field according to the reconstruction formula [21]. Tao et al. used gradient detection to establish a focus measure and then fused matching consistency for global optimization of depth reconstruction, which can reconstruct high-precision scene depth [22]. Alonso et al. proposed a refocusing method for regions of arbitrary shape and size in the focus stack image, which realizes high-resolution refocusing of defocused images by reorganizing the visual information of the depth-dependent point spread function [8].
The light field imaging system based on the focus stack offers good applicability to robot industrial detection and rescue in terms of theory, environmental adaptability, and miniaturization. Most existing application scenarios only deal with detection and rescue targets (such as target finding and location). A deep neural network identifies candidate boxes for multiple target regions in the scene, extracts features from the candidate boxes, classifies them, and finally regresses the coordinate information of the target. With the continuous development of deep neural networks and the update and iteration of various network types, the target detection rate and positioning accuracy on scene images are constantly improving. Combining the target detection information of a deep neural network with the light field imaging of the focus stack enables a variety of imaging effects through flexible computation, including depth estimation of various targets in the scene, refocusing, and scene reconstruction; there is great potential in many fields, such as Virtual Reality (VR), robot navigation, and simultaneous localization and mapping (SLAM).

III. ALGORITHM
The frame diagram of the 3D reconstruction algorithm for specific targets in the scene proposed in this paper is shown in Fig. 1.
The algorithm first feeds the collected series of scene focus stack images to the deep neural network and the scene reconstruction estimation model. The uplink uses the YOLO deep neural network to detect and locate the target area under test and obtain the optimal location information, which is introduced in Section III-A. The downlink generates the scene depth estimation and all-focus image from the focus stack images in two steps: first, based on the projection model, the analytical light field reconstruction algorithm (FBP) and the iterative algorithm (Landweber) are formed, as detailed in Section III-B1; then, light field imaging technology is used to reconstruct the scene parallax, estimate the depth information, and form the all-focus image of the scene, as introduced in Section III-B2. Finally, Section III-C introduces the fusion of uplink target detection and downlink scene depth, and the reconstruction of the 3D point cloud of the target under test by the perspective projection model.

A. SCENE TARGET DETECTION AND LOCATION OF UPLINK
The focus stack images within the visual range of the scene are captured by a precision servo that controls the micro-movement of the monocular camera lens; the schematic diagram of the image acquisition system and some of the focused images of the scene are shown in Fig. 2.
Each frame of the focused image is input into the deep learning neural network to realize target detection. Earlier algorithms are the two-stage R-CNN family (R-CNN [23], Fast R-CNN [24], Faster R-CNN [25]), which require a heuristic method or a region-proposal CNN before classification and regression. This paper uses the classic YOLO algorithm, which handles object detection and box positioning with a single end-to-end CNN and is computationally more efficient than the other methods. The algorithm mainly includes three aspects. First, the image to be detected is resized to 448 × 448 and divided into S × S grid cells; each cell is responsible for predicting whether an object center falls in it, and the cell containing the detected object center predicts the bounding box and confidence of the corresponding object. Each prediction includes five parameters, where (w, h) represents the box size, (x, y) represents the offset of the box center from the corresponding cell, and the confidence reflects the accuracy of the box and the probability that it contains an object. For object classification, each cell outputs a class probability vector representing the probability that the object in the box predicted by the cell belongs to each category. As shown in Fig. 3, the YOLO framework consists of 24 convolutional layers and 2 fully connected layers; the convolutional layers use two kernel sizes, 3 × 3 and 1 × 1, and the network outputs a tensor of size 7 × 7 × 30.
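To make the grid-cell prediction scheme concrete, the following minimal sketch decodes a YOLOv1-style 7 × 7 × 30 output tensor into boxes (the function name `decode_yolo_v1` and the confidence threshold are illustrative assumptions, not part of the paper's pipeline):

```python
import numpy as np

S, B, C = 7, 2, 20          # grid size, boxes per cell, classes (YOLOv1 defaults)

def decode_yolo_v1(output, conf_thresh=0.2):
    """Decode a YOLOv1 output tensor of shape (S, S, B*5 + C) into boxes.

    Each cell predicts B boxes (x, y, w, h, confidence) plus C class
    probabilities; (x, y) are offsets within the cell and (w, h) are
    relative to the whole image.  Returns boxes as
    (x_center, y_center, w, h, score, class) in normalized coordinates.
    """
    boxes = []
    for row in range(S):
        for col in range(S):
            cell = output[row, col]
            class_probs = cell[B * 5:]
            for b in range(B):
                x, y, w, h, conf = cell[b * 5: b * 5 + 5]
                cls = int(np.argmax(class_probs))
                score = conf * class_probs[cls]   # class-specific confidence
                if score < conf_thresh:
                    continue
                # cell-relative offset -> absolute normalized coordinate
                cx = (col + x) / S
                cy = (row + y) / S
                boxes.append((cx, cy, float(w), float(h), float(score), cls))
    return boxes
```

In practice the raw detections would still pass through non-maximum suppression before the multi-frame fusion described below.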
Target search and detection in the actual scene depend on the performance of the equipment carrying the monocular camera and on the target data set, which is composed of actual images of the scene and public data sets. From the target detection results obtained on the scene focus stack images, erroneous detections and position information with low confidence are eliminated. To ensure that the position information covers the complete target, the maximal extent of the boxes over the multiple sets of position information is taken as the final optimal target position.
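The box-maximization step can be sketched as follows (a minimal illustration, assuming filtered boxes are given as corner rectangles; `fuse_detections` is a hypothetical name):

```python
def fuse_detections(boxes):
    """Merge per-frame detections of the same target into one covering box.

    `boxes` is a list of (x_min, y_min, x_max, y_max) rectangles from the
    focus stack frames (low-confidence and erroneous detections already
    removed).  Following the paper's strategy, the final region takes the
    maximal extent over the frames so the whole target is covered.
    """
    xs_min, ys_min, xs_max, ys_max = zip(*boxes)
    return (min(xs_min), min(ys_min), max(xs_max), max(ys_max))
```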

B. SCENE DEPTH ESTIMATION AND RECONSTRUCTION OF ALL FOCUSED IMAGE BASED ON FOCUS STACK IMAGE
This part is realized in two steps. First, the focus stack images are used to reconstruct the 4D light field: a projection model is established to depict the relationship between the light field and the focus stack data space; under this projection model, the filtered-back-projection (FBP) method of light field reconstruction is derived, and the Landweber method is used to iteratively optimize the light field objective functional. Second, the scene parallax is reconstructed from the resulting light field data, and the parallax objective functional is iteratively minimized to complete high-precision scene depth estimation; finally, the depth is combined with the focus stack images to form the all-focus image of the scene. A 2D focused image is one focus description of the 3D scene, and each image is a 2D projection of the 4D light field. The 4D light field L_s(x_s, y_s, u, v) can be parameterized by the two planes (x_s, y_s) and (u, v) [26], [27]; as shown in Fig. 4, (x_s0, y_s0) and (x_s, y_s) denote the reference imaging plane and an arbitrary imaging plane, respectively, and (u, v) is the lens plane. The focused imaging diagrams of different imaging planes are shown in Fig. 5, where E(s, x_s, y_s) represents the focal stack imaging plane at depth s, and s and s_0 are the distances from the (u, v) plane to the (x_s, y_s) and (x_s0, y_s0) planes, respectively.
Since L(x, y, u, v) and L_s(x_s, y_s, u, v) represent the same ray recorded on the two imaging planes, the affine transformation expressions from (x, u) to (x_s, u) and from (y, v) to (y_s, v) are

x_s = u + (s/s_0)(x − u) (1)
y_s = v + (s/s_0)(y − v) (2)

Lemma: the forward process by which the four-dimensional light field L(x, y, u, v) forms the focus stack E(s, x_s, y_s) is the focused imaging process described by the projection operator

E(s, x_s, y_s) = P[L(x, y, u, v)] = ∬ L_s(x_s, y_s, u, v) du dv (3)
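For a discretized light field, the projection operator amounts to a shift-and-add over the sub-aperture views. The following is a minimal sketch under that assumption (the array layout and the name `refocus` are illustrative, not the paper's implementation; a real version would interpolate sub-pixel shifts):

```python
import numpy as np

def refocus(light_field, alpha):
    """Shift-and-add sketch of the projection operator P.

    `light_field` has shape (U, V, X, Y): sub-aperture images indexed by
    angular coordinates (u, v).  alpha = s / s0 selects the refocus plane.
    Each view is translated in proportion to its angular offset from the
    center, then the views are averaged -- a discrete analogue of
    integrating L_s(x_s, y_s, u, v) over the lens plane.
    """
    U, V, X, Y = light_field.shape
    uc, vc = (U - 1) / 2.0, (V - 1) / 2.0
    out = np.zeros((X, Y))
    for u in range(U):
        for v in range(V):
            # integer-pixel shift for simplicity (wrap-around at borders)
            du = int(round((u - uc) * (1.0 - 1.0 / alpha)))
            dv = int(round((v - vc) * (1.0 - 1.0 / alpha)))
            out += np.roll(light_field[u, v], (du, dv), axis=(0, 1))
    return out / (U * V)
```

With alpha = 1 no shift is applied and the result is simply the mean of the sub-aperture views, i.e., the photograph focused on the reference plane.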
Here, P denotes the focused imaging process and is a bounded linear projection operator. The forward projection establishes the projection model relationship from the 4D light field to the focus stack. CT (computed tomography) ideas and techniques are introduced into the inverse problem of reconstructing the 4D light field from the focus stack: on the one hand, the FBP method that back-projects the focus stack to reconstruct the 4D light field is derived from the Fourier slice theorem; on the other hand, starting from the integral equation corresponding to the focus stack, the Landweber method is used for reconstruction and iterative optimization.
The Fourier slice theorem shows that the two-dimensional Fourier transform of the image E(s, x_s, y_s) at depth s is a 2D slice of the four-dimensional Fourier transform of the light field L(x, y, u, v) [14], [21]:

F[E(s, x_s, y_s)] = L(ω_u, ω_v, ω_x, ω_y)|_slice (4)

where F[E(s, x_s, y_s)] is the 2D Fourier transform of the image E(s, x_s, y_s), and L(ω_u, ω_v, ω_x, ω_y) represents the 4D Fourier transform of the light field L(x, y, u, v).
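The slice relationship can be checked numerically in a reduced 2D analogue: summing an image along one axis (a projection, playing the role of the focal-stack integral) and taking the 1D Fourier transform of the result gives exactly the corresponding central slice of the image's 2D Fourier transform. This is only a sanity-check sketch of the projection-slice idea, not the paper's 4D derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))

# Projection of the image along axis 0 (the "integrated" direction)
proj = img.sum(axis=0)

# 1D Fourier transform of the projection ...
ft_proj = np.fft.fft(proj)

# ... equals the zero-frequency slice (along the integrated axis)
# of the 2D Fourier transform of the image.
ft_img = np.fft.fft2(img)
central_slice = ft_img[0, :]

assert np.allclose(ft_proj, central_slice)
```

In the 4D case, the refocus parameter s tilts this slice through the 4D spectrum, which is what makes the focus stack invertible back to the light field.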
The corresponding frequency-domain slice is selected as

ω_x = (s/s_0) ω_1, ω_u = (1 − s/s_0) ω_1, ω_y = (s/s_0) ω_2, ω_v = (1 − s/s_0) ω_2 (5)

The integral variables are substituted based on this slice selection: dω_u dω_x = J_1 dω_1 ds and dω_v dω_y = J_2 dω_2 ds, where J_1 and J_2 are the Jacobian determinants. Based on the frequency-domain projection relationship between the 4D light field space and the focus stack space, the analytical inversion expression is calculated to form the filtered-back-projection (FBP) algorithm [21]:

L(x, y, u, v) = ∫ F⁻¹[ |ω_1||ω_2| F[E(s, x_s, y_s)] ] ds (6)

Here, F and F⁻¹ represent the Fourier transform and inverse Fourier transform, respectively, and |ω_1||ω_2| is the optimized filter function. To solve the integral equation E(s, x_s, y_s) = P[L(x, y, u, v)], the Landweber method [28] adopted in this paper is a descent method for the quadratic objective functional ‖E(s, x_s, y_s) − P[L(x, y, u, v)]‖². In the actual calculation, it is converted into the approximate solution of the discrete linear equations AX = B, where A = (a_ij)_{M×N} is the projection matrix, X ∈ R^N is the N-dimensional finite vector of the reconstructed light field, and x_j is the j-th reconstructed pixel value. In the W-norm and V-norm, the discretized form of the equation is equivalent to the weighted least squares optimization problem

min_X ‖AX − B‖²_W (7)

The Landweber iterative expression for reconstructing the 4D light field is

X^(n+1) = X^n + α_n V Aᵀ W (B − A X^n) (8)

Here, α_n denotes a relaxation factor: the smaller α_n, the smaller the reconstruction artifacts; the larger α_n, the faster the convergence. V and W represent two positive-definite diagonal matrices.
The Landweber iterative algorithm first generates an initial light field image P_0[L(x, y, u, v)] with angular resolution u × u from the multi-frame focused images E(s, x_s, y_s) through the back-projection matrix. The initial light field image is then projected forward to the focused image positions to form the corresponding estimated focus images E′(s, x_s, y_s), and the error E(s, x_s, y_s) − E′(s, x_s, y_s), i.e., the correction term, is calculated. Finally, the error is back-projected onto the initial light field image to form a correction, completing one iteration. The relaxation factor α_n affects the convergence speed and the size of the artifacts; its optimal value is selected according to the reconstruction quality on actual scene images. The quality of the reconstructed light field image is not proportional to the number of iterations: after a certain number of iterations, continuing to iterate degrades the reconstruction quality.
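The iteration loop described above can be sketched generically for any linear system A x ≈ b (a minimal illustration with V and W taken as identity; `landweber` is a hypothetical name, and for the light field problem A would be the focal-stack projection operator rather than an explicit matrix):

```python
import numpy as np

def landweber(A, b, alpha, n_iter, x0=None):
    """Landweber iteration for A x ~= b.

        x_{n+1} = x_n + alpha * A^T (b - A x_n)

    Each pass is: forward-project the estimate (A x), compute the residual
    against the measured data (b - A x), and back-project the residual as
    a correction (A^T ...).  A small relaxation factor alpha damps
    artifacts; a larger one converges faster but diverges once alpha
    exceeds 2 / ||A^T A||.
    """
    x = np.zeros(A.shape[1]) if x0 is None else x0.astype(float).copy()
    for _ in range(n_iter):
        x = x + alpha * A.T @ (b - A @ x)
    return x
```

In the paper's setting, `A @ x` corresponds to projecting the current light field estimate to the focus image positions, and `A.T @ r` to back-projecting the residual onto the light field, matching the per-iteration description above.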

2) SCENE DEPTH ESTIMATION AND ALL FOCUS IMAGE RECONSTRUCTION USING LIGHT FIELD DATA
The depth information of the scene is further retrieved from the light field data. First, the scene parallax dis*(x, y) is reconstructed from the light field L(x, y, u, v), then the scene depth dep(x, y) is reconstructed, and finally dep(x, y) is combined with the focus stack images to generate the all-focus image.
In actual scene depth estimation, the estimation error consists mainly of a structural-variable error and a random error. As the real distance in the scene increases, defocus becomes stronger and the focusing ability weaker; at the same time, the lens rotation angle when collecting the focused images becomes smaller and the micro-lens structural-variable error increases, resulting in lower depth estimation accuracy at long distances. On the other hand, light field vision measurement is affected by random errors such as image noise and lighting conditions, and uniform areas such as weak textures in the scene make focus and defocus look similar, which affects the accuracy of scene depth estimation.

a: RECONSTRUCTION OF SCENE PARALLAX BY LIGHT FIELD DATA
Based on the multi-view advantage of the 4D light field data, a pixel-by-pixel scene depth estimate can be obtained, that is, each pixel contains a depth value. In this paper, an objective functional [29] with the matching term as the data term and the gradient and classification terms as regular terms is established; optimizing it smooths weak-texture regions while preserving image edges. It is expressed as

Func(dis) = ‖E(dis)‖_L2 + α‖TV(dis)‖_L1 + β‖Label(dis)‖_L1 (9)

Here, ‖E(dis)‖_L2 is the matching term, ‖TV(dis)‖_L1 is the gradient term, and ‖Label(dis)‖_L1 is the classification term; α and β represent the adjustment scales.
The expression of the classification term ‖Label(dis)‖_L1 is based on the piecewise labeling

Label(dis) = 0 if conf(dis) ≥ τ_1; 1 if τ_2 ≤ conf(dis) < τ_1; −1 if conf(dis) < τ_2 (10)

Here, conf(dis) is the confidence function, representing the number of pixels in the regional neighborhood W(x, y) that meet the matching conditions. By setting the threshold parameters τ_1 and τ_2, Label(dis) divides pixels into exact match and mismatch, where 1 represents the smooth mismatching area, −1 represents the occlusion mismatching area, and 0 represents the accurate matching area.
The reconstruction of scene depth is transformed into an iterative optimization of the objective functional, and the scene parallax dis*(x, y) is obtained by solving

dis*(x, y) = arg min (Func(dis(x, y))) (11)

The algorithm first obtains the initial parallax image by minimizing the block-matching term ‖E(dis)‖_L2 over the reconstructed light field images; on this basis, the scene parallax image is obtained by iteratively minimizing the regular terms ‖TV(dis)‖_L1 and ‖Label(dis)‖_L1.
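The initialization step, block matching between views, can be sketched as a plain SSD search over one pair of horizontally displaced sub-aperture views (a deliberate simplification of the full multi-view matching term; `block_match` and its parameters are illustrative assumptions):

```python
import numpy as np

def block_match(left, right, max_disp, block=3):
    """Minimal SSD block matching between two sub-aperture views.

    For each pixel in `left`, search the horizontally shifted candidate
    windows in `right` and keep the disparity with the lowest sum of
    squared differences -- the initial parallax image that the
    regularized functional then refines.
    """
    h, w = left.shape
    r = block // 2
    disp = np.zeros((h, w), dtype=int)
    for y in range(r, h - r):
        for x in range(r, w - r):
            patch = left[y - r:y + r + 1, x - r:x + r + 1]
            best, best_d = np.inf, 0
            for d in range(min(max_disp + 1, x - r + 1)):
                cand = right[y - r:y + r + 1, x - d - r:x - d + r + 1]
                cost = np.sum((patch - cand) ** 2)
                if cost < best:
                    best, best_d = cost, d
            disp[y, x] = best_d
    return disp
```

On its own this estimate is noisy in weak-texture areas, which is exactly what the TV and Label regular terms of (9) are introduced to suppress.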

b: SCENE DEPTH ESTIMATION AND ALL FOCUS IMAGE RECONSTRUCTION
The scene depth dep(x, y) can be calculated from the scene parallax image dis*(x, y) through the viewpoint interval:

dep(x, y) = f · d / dis*(x, y) (12)

Here, d represents the viewpoint interval. The schematic diagram of scene depth estimation is shown in Fig. 6. Although the regular terms of the scene depth image reconstructed from the 4D light field data are iteratively optimized, due to the complexity of actual scenes some pixels in block areas still have inaccurate depth estimates, which affects the accuracy of 3D reconstruction. In this paper, the non-local means (NL-means) filtering algorithm [30] is used to perform a second optimization of the depth estimation image: based on the block areas from the parallax reconstruction, similar areas are searched in the depth image and a weighted average of their pixels is used to filter and denoise, which is effective when structurally similar targets exist in the scene. The all-focus image is obtained by merging the focus stack images with the depth information of each point in the depth image; the schematic diagram of the all-focus reconstruction is shown in Fig. 7 (the darker the color in the figure, the sharper the focus), and every target in the all-focus image is in focus. The all-focus model expression is

I(x, y) = E(dep(x, y), x, y) (13)

where I(x, y) represents the all-focus image, and the pixel at (x, y) is in focus at depth dep(x, y). Assigning the corresponding color value of E(dep(x, y), x, y) to I(x, y) yields the all-focus image.
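The per-pixel selection behind I(x, y) = E(dep(x, y), x, y) can be sketched with a simple focus measure standing in for the reconstructed depth map (a minimal illustration: the squared discrete Laplacian is an assumed sharpness proxy, not the paper's depth-driven fusion):

```python
import numpy as np

def all_focus(stack):
    """Compose an all-focus image from a focus stack of shape (N, H, W).

    For each pixel, the frame with the largest local sharpness (squared
    Laplacian) supplies the pixel value; the per-pixel frame index plays
    the role of dep(x, y) in I(x, y) = E(dep(x, y), x, y).
    """
    n, h, w = stack.shape
    sharp = np.zeros_like(stack)
    for i in range(n):
        f = stack[i]
        # discrete Laplacian with wrap-around borders
        lap = (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
               np.roll(f, 1, 1) + np.roll(f, -1, 1) - 4 * f)
        sharp[i] = lap ** 2
    idx = np.argmax(sharp, axis=0)          # per-pixel "depth" index
    rows, cols = np.indices((h, w))
    return stack[idx, rows, cols], idx
```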

C. SCENE TARGET 3D POINT CLOUD RECONSTRUCTION WITH UPLINK AND DOWNLINK FUSION
The uplink target detection information is fused with the downlink scene depth estimation information to obtain the all-focus and depth information of the target in the scene, and the 3D point cloud of the target is reconstructed by the perspective projection model. The 3D point cloud coordinates (x_w, y_w, z_w) generated from the depth image dep(x, y) and the all-focus image I(x, y) are [31]:

z_w = dep(x, y), x_w = x · z_w / f, y_w = y · z_w / f (15)

where f is the focal length of the camera and (x, y) are image coordinates relative to the principal point. Each pixel value of the target all-focus image is rendered onto the target 3D point cloud coordinates to form the 3D point cloud image f(x_w, y_w, z_w) of the target:

f(x_w, y_w, z_w) = I(x, y) (16)
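The back-projection can be sketched with the standard pinhole model (a minimal illustration; the principal point defaulting to the image center and the name `depth_to_point_cloud` are assumptions for this sketch):

```python
import numpy as np

def depth_to_point_cloud(dep, f, cx=None, cy=None):
    """Back-project a depth image into 3D points with the pinhole model:

        z_w = dep(x, y)
        x_w = (x - cx) * z_w / f
        y_w = (y - cy) * z_w / f

    `f` is the focal length in pixels; (cx, cy) default to the image
    center.  Returns an (H*W, 3) array of points; the all-focus image
    can then be flattened the same way to color each point.
    """
    h, w = dep.shape
    cx = (w - 1) / 2.0 if cx is None else cx
    cy = (h - 1) / 2.0 if cy is None else cy
    ys, xs = np.indices((h, w))
    z = dep.astype(float)
    x = (xs - cx) * z / f
    y = (ys - cy) * z / f
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Restricting `dep` and the all-focus image to the fused target box from the uplink yields the target-only point cloud described above.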

IV. TEST VERIFICATION
The experiment uses a high-precision servo motor to drive the lens rotation to capture the focus stack images, and performs scene target detection, scene depth estimation, and target 3D point cloud reconstruction. The focus stack image acquisition system includes a camera module, a high-mobility servo module, and a power supply module. The camera module consists of a Point Grey industrial camera (GS3-U3-60S6M-C) and a Kowa fixed-focus lens (LM35JC10M) with a focal length of 25 mm; the servo module contains a 32-bit ARM embedded chip and a lens drive circuit. The exposure time of each captured image is 15 ms, the aperture value F is 1.6, and the gain is set to 0. The depth range of the first two actual scenes selected in the experiment was 5 meters, and focus images were collected at intervals of 300 mm; a total of 17 focus images were collected in each scene to form the scene focus stack. To analyze the influence of strongly defocused stack images on the proposed algorithm, 20 focused images were collected in a scene with a depth of field of 13 meters for comparative depth estimation and all-focus experiments. The target detection experiment on the uplink focus stack images was completed on a Dell PowerEdge R740 server with a discrete graphics card; the YOLO learning rate is set to 0.005, the batch size to 16 according to server performance, the optimizer momentum parameter to 0.95, and the number of iterations to 80 according to the size of the actual training target set of the scenes. The red boxes in Fig. 8a show the target detection and location on multiple focused images in the two actual scenes, and Fig. 8b shows the optimal target area determined from multiple sets of target positioning information; the four corner coordinates are obtained after target selection and box maximization.
The downlink light field reconstruction from the focus stack is realized by (7)-(8). In the experiment, the angular resolution of the back-projection light field reconstruction is typically 3 × 3, 5 × 5, or 7 × 7; Fig. 9 shows schematic diagrams of the light field reconstruction at angular resolutions of 5 × 5 and 3 × 3. As the number of iterations increases, the partially enlarged views show that the trees, walls, and buildings are clearer after 8 iterations than after 3 iterations. In this iterative algorithm, a sinusoidal window function is selected as the filter to improve the detail rendition of the original image and reduce blur. Fig. 11 shows the light field reconstruction after sinusoidal window filtering under the iteration counts of Fig. 10.
The iteration results under the sinusoidal window filter show an obvious improvement in clarity and detail compared with the original. This paper selects the average gradient of the reconstructed image as the evaluation index of light field reconstruction clarity; Table 1 and Fig. 12 show the average gradient of the light field image under different iteration counts.
The analysis shows that the average gradient of the image increases with the number of iterations for the first 8 iterations and decreases slightly thereafter; although the average gradient rises slightly at the 13th iteration, it remains lower than the result after 8 iterations.
Increasing the relaxation factor α_n speeds up the iteration convergence, but a larger α_n increases the correction error in each iterative correction, making the image prone to divergence and artifacts. The experiment ran 5 iterations under the sinusoidal window filter with α_n set to 0.1, 0.5, 1, 2, and 3 respectively; the results are shown in Fig. 13.
The figure shows that when α_n increases to 3, the artifacts in the reconstruction become very obvious; α_n is therefore set to 2 to reduce the computation time while keeping the influence of artifacts small.
The above 4D light field images are used to estimate the depth of the scene by (9)-(12). Fig. 14 shows the depth estimation with a viewpoint interval of 4 mm, λ = 1.5, and image block sizes of 3 × 3 and 5 × 5, respectively.
Both figures resolve the depth well in edge regions, and the depth can be reconstructed at the weak textures of the image. There are still erroneous depth estimation areas in the scene (the pixels in the rectangular and elliptical boxes); NL-means can be used for a second depth optimization, and Fig. 15 shows the optimization results for the block area parameters.
The NL-means optimization improves the consistency and accuracy of the depth information. The three marked points in the depth image were both depth-estimated and actually measured; Table 2 shows the results of the two depth estimations for the three target points.
The table shows that the depth information of the scene obtained by the 3D vision sensor is highly consistent and can achieve millimeter accuracy. To analyze the influence of strong defocus on depth estimation under a deeper depth of field, the same method is applied to scene 3; the original focused image and the depth estimation image of scene 3 are shown in Fig. 16. Although iterative optimization is set to reduce the defocusing effect in weak-texture areas, the increased depth of field and uneven illumination make the depth estimation accuracy of scene 3 lower than that of scene 1; in particular, the depth estimation error in distant uniform areas is large, and some pixel depth information is lost, as shown in the red box in the figure. The depth estimates of the three target points selected in the figure are compared with actual measurements; Table 3 shows the results of the two depth estimations for the three target points. The reconstruction of the all-focus image of the scene is realized by (13)-(14); Fig. 17 shows the all-focus images of the three scenes.
The figure shows that the four enlarged sections in each image have high resolution and are all in focus. In the experiment, the all-focus images of scene 1 and scene 3 and the initially collected focus images were selected to evaluate blur. The gradient values of the originally collected focus images in the two scenes are shown in Fig. 18a and Fig. 18c; Fig. 18b and Fig. 18d show the average gradient values of the all-focus images after the two scenes were reconstructed 5 times.
The gradient values of the original focus images in Fig. 18a are in the range 3.18∼4.16 with an average of 3.55, and those in Fig. 18c are in the range 1.09∼1.80 with an average of 1.67; Fig. 18d gives the gradient values of the five all-focus reconstruction images, which are 3.76, 3.72, 3.83, 3.65, and 3.84. The all-focus image thus has better global resolution and small-detail rendition than the original stack images. Affected by conditions such as the depth-of-field range and on-site lighting, the gradient values of the original focus images and the all-focus image collected at a deeper depth of field are smaller than those at a shallow depth of field, and the reconstruction accuracy and detail are lower.
Based on the above analysis, the algorithm framework proposed in this paper is better suited to scene depth estimation and target point cloud reconstruction within small scenes. Target 3D point cloud reconstruction experiments were carried out on the actual scenes 1 and 2; the experimental flow charts are shown in Fig. 19 and Fig. 20, respectively.
The 3D reconstruction of the targets in the two scenes comprises the following steps. Fig. 19a and Fig. 20a show the captured focus stack images of the two scenes; comparison shows that the focus moves from near to far. Multi-target detection and positioning are performed on image (a) by the YOLO neural network: scene 1 detects two targets, a building and a tree, in each frame, marked with light blue and red rectangular boxes, as shown in Fig. 19b; scene 2 detects a house in each frame, marked with a red rectangular box, as shown in Fig. 20b. Fig. 19d and Fig. 20d show the multi-frame target detections and the optimal position information. Fig. 19c and Fig. 20c are the depth estimation image and the all-focus image reconstructed from Fig. 19a and Fig. 20a. Fig. 19e and Fig. 20e show the target all-focus image and depth information obtained by fusing (c) and (d). Finally, the 3D point clouds of the two detected targets are reconstructed, as shown in Fig. 19f and Fig. 20f.
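The final fusion step, cropping the depth map with a detection box and lifting the pixels into a 3D point cloud, can be sketched as below. This assumes a standard pinhole camera model with known intrinsics; the function name, the box format, and the toy depth map are illustrative, not the paper's implementation.

```python
import numpy as np

def box_to_point_cloud(depth, box, fx, fy, cx, cy):
    """Back-project the depth pixels inside a detection box into 3D.

    depth : (H, W) depth map, e.g. from the focus-stack estimation
    box   : (u0, v0, u1, v1) pixel bounds from the detector
    fx, fy, cx, cy : pinhole camera intrinsics
    Returns an (N, 3) array of XYZ points in camera coordinates.
    """
    u0, v0, u1, v1 = box
    vs, us = np.mgrid[v0:v1, u0:u1]              # pixel grid inside the box
    z = depth[v0:v1, u0:u1].ravel()
    us, vs = us.ravel(), vs.ravel()
    valid = z > 0                                # drop pixels with lost depth
    z = z[valid]
    x = (us[valid] - cx) * z / fx                # inverse pinhole projection
    y = (vs[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Toy example: a 4x4 detection box over a constant-depth patch at 2 m.
depth = np.zeros((8, 8))
depth[2:6, 2:6] = 2.0
pts = box_to_point_cloud(depth, (2, 2, 6, 6), fx=100, fy=100, cx=4, cy=4)
assert pts.shape == (16, 3)
```

Pixels with missing depth (the lost regions noted in scene 3) are filtered out here rather than back-projected, so the resulting cloud contains only reliably estimated points.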

V. CONCLUSION
To effectively address target detection, depth estimation, and 3D reconstruction in a specific area of a scene, a fusion method based on deep learning for target detection and focus stack image reconstruction is proposed. In the algorithm framework, 3D visual perception acquires the focus stack image through a monocular lens; combining the advantages of deep learning and light field imaging, the method detects the specific target area in the stack image, reconstructs the 4D light field, estimates scene depth, and fuses the target location with the depth information of the scene to achieve 3D reconstruction of the target. The algorithm is affected by the depth of field of the scene and the on-site lighting environment: in long-distance scenes, the adjustment accuracy of the micro-lens is limited and the acquired information carries large errors, whereas in small-scale scenes more accurate depth estimation and reconstruction are obtained. The method applies not only to robot industrial detection and rescue but also to 3D face reconstruction and recognition in a scene. Previous related work introduced high-resolution reconstruction and recognition of human faces from images collected by a 4D light field camera; the theory and algorithm flow of face recognition and 3D reconstruction based on the scene focus stack image are consistent with this article. In practice, a reasonable number of focus images is selected according to the actual scene size to reconstruct the depth estimation image; the YOLO algorithm is then combined to identify different facial features in the scene and detect facial location information, finally achieving point cloud reconstruction of the face target.
Future work includes extending the algorithm framework to fuse multiple perspectives of the scene, conducting broader research on target detection and reconstruction grounded in 3D point cloud confidence assessment, and completing the reconstruction of the whole scene, thereby providing a theoretical basis and experimental verification for subsequent robot visual navigation and SLAM research.