Dense Feature Learning and Compact Cost Aggregation for Deep Stereo Matching

Recently, Convolutional Neural Network (CNN) based deep models have been successfully applied to the task of stereo matching. In this paper, we propose a novel deep stereo matching network based on the strategies of dense feature learning and compact cost aggregation, namely DFL-CCA-Net. It consists of three modules: Dense Feature Learning (DFL), Compact Cost Aggregation (CCA) and disparity regression. In the DFL module, a CNN backbone with Dense Atrous Spatial Pyramid Pooling (DenseASPP) is employed to extract multi-scale deep feature maps of the given left and right images respectively. An initial 4D cost volume is then obtained by concatenating the left feature maps with their corresponding right feature maps at each disparity level. In the following CCA module, each initial 3D cost volume component (i.e., the component along the left or right image feature channel dimension) is aggregated into a more compact one by using atrous convolution operations with different dilation rates. These updated 3D cost volume components are then fed into the disparity regression module, which consists of a 3D CNN with a stacked hourglass structure, to estimate the final disparity map. Comprehensive experimental results on the Scene Flow, KITTI 2012 and KITTI 2015 datasets show that the 3D cost volume components obtained by the proposed DFL and CCA modules generally contain more multi-scale semantic information and thus can largely improve the final disparity regression accuracy. Compared with other deep stereo matching methods, DFL-CCA-Net achieves very competitive prediction accuracy, especially in reflective regions and regions containing detailed information.

… aggregation step is included, traditional methods can be classified into local matching methods [8] and global matching methods [9].

In recent years, steps such as cost computation, cost aggregation, disparity computation, and disparity optimization … The 3D cost volume is formed by a correlation operation on left and right image features [10], and a 2D encoder-decoder structure with cascaded refinement is usually used to process the 3D cost volume and compute the disparity map. The 4D cost …

In this paper, we observe that many stereo matching networks in the 3D architecture commonly use the Spatial Pyramid Pooling (SPP) … Our main contributions can be summarized as follows:
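For illustration, the two cost-volume constructions mentioned above can be sketched in NumPy. The function names and the `max_disp` parameter are illustrative, not from our implementation; a correlation collapses the feature dimension into a 3D volume, while concatenation keeps it, yielding a 4D volume:

```python
import numpy as np

def correlation_cost_volume(feat_l, feat_r, max_disp):
    """3D cost volume (D, H, W): at disparity d, correlate each left-image
    feature vector with the right-image feature d columns to its left."""
    C, H, W = feat_l.shape
    cost = np.zeros((max_disp, H, W), dtype=feat_l.dtype)
    for d in range(max_disp):
        # columns x < d have no valid match and keep cost 0
        cost[d, :, d:] = np.mean(feat_l[:, :, d:] * feat_r[:, :, :W - d], axis=0)
    return cost

def concat_cost_volume(feat_l, feat_r, max_disp):
    """4D cost volume (2C, D, H, W): concatenate left features with the
    disparity-shifted right features instead of collapsing them."""
    C, H, W = feat_l.shape
    cost = np.zeros((2 * C, max_disp, H, W), dtype=feat_l.dtype)
    for d in range(max_disp):
        cost[:C, d, :, d:] = feat_l[:, :, d:]
        cost[C:, d, :, d:] = feat_r[:, :, :W - d]
    return cost
```

Because the 4D variant preserves the feature dimension, it must be processed by 3D convolutions downstream, which is the design our network follows.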

(1) We introduce the dense feature learning module by using DenseASPP [13] to replace SPP [12]. DenseASPP applies the idea of dense connectivity from DenseNet [14] to extract multi-scale dense image features. Thus, the DFL module can enlarge the receptive field of the network without losing image information.
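A minimal sketch of this dense-connectivity wiring, where a 3-tap dilated average stands in for a learned 3×3 atrous convolution (the function names and the channel-reduction step are illustrative assumptions, not our trained layers):

```python
import numpy as np

def dilated_avg(x, dilation):
    """3-tap dilated average along the last (width) axis with zero padding;
    a crude stand-in for a learned atrous convolution."""
    p = dilation
    xp = np.pad(x, [(0, 0)] * (x.ndim - 1) + [(p, p)])
    return (xp[..., :-2 * p] + xp[..., p:-p] + xp[..., 2 * p:]) / 3.0

def dense_aspp(x, dilations=(3, 6, 12, 18, 24)):
    """DenseASPP-style wiring: each atrous layer consumes the concatenation
    of the input and every previous layer's output, and all outputs are
    concatenated at the end."""
    feats = [x]
    for d in dilations:
        inp = np.concatenate(feats, axis=0)            # dense connection
        out = dilated_avg(inp, d).mean(axis=0, keepdims=True)
        feats.append(out)
    return np.concatenate(feats, axis=0)
```

The key point is that later layers with large dilation rates still see the densely sampled outputs of earlier layers, which is what preserves detail while the receptive field grows.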

(2) We design an efficient compact cost aggregation module that makes the updated cost volume more informative, which can largely improve the final disparity regression accuracy.

(3) We propose an end-to-end stereo matching network, namely DFL-CCA-Net, without any post-processing step. It achieves advanced prediction accuracy on the Scene Flow and KITTI datasets. In particular, DFL-CCA-Net is especially effective in reflective image regions and image regions containing a lot of detailed information.

II. RELATED WORK
Mayer et al. introduced the first end-to-end disparity regression network Disp-Net [10], which borrows its idea from the optical flow estimation network FlowNet [15]. First, the left and right image features are extracted using a siamese network, then a 1D correlation operation is performed to obtain the 3D cost volume, and finally a 2D encoder-decoder structure is used to process the 3D cost volume and regress the disparity map. Pang … extraction modules or by combining the idea of multi-task learning, such as [27], [28], [29], and [30].

… Figure 2, which is calculated as

(1)

Here, we use Equation (2) to describe the complete process of the left image going through the initial feature learning part, where I_l is the input left image, f is a mapping from image space to feature space, and F_l^feature ∈ R^… kernel size after DenseASPP. That is, the final dimension of the output feature map is …

The feature pyramids composed by DenseASPP help the network obtain a better disparity map. Compared to SPP and ASPP, DenseASPP has better scale diversity, a bigger equivalent receptive field and denser pixel sampling. For scale diversity, atrous convolutions with different dilation rates can extract features at different scales. For the receptive field, the equivalent receptive field size of an atrous convolutional layer is

R_{K,d} = (d − 1) × (K − 1) + K,

where d is the dilation rate and K is the kernel size. As shown in Figure 4, stacking atrous convolutional layers together gives a larger receptive field. Therefore, the final receptive field size of DenseASPP is

R = R_{3,3} + R_{3,6} + R_{3,12} + R_{3,18} + R_{3,24} − 4 = 128. (5)

For denser pixel sampling, we know that the pixel sampling rate of atrous convolutional layers with large dilation rates is very sparse. However, due to the use of dense connections, DenseASPP allows more pixels to be involved in the computation of the feature pyramid, so it retains more information while increasing the receptive field. In terms of effect, the scale diversity of features helps the network adapt to objects at different scales. A larger receptive field helps the network infer disparity in ill-posed regions, such as reflective regions, repetitive regions, weakly textured regions and plain color regions. And the denser sampling ensures our network can predict the disparity with more detailed information.
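The receptive-field bookkeeping above can be checked with a few lines of Python. This sketch assumes the common convention that a K-tap convolution with dilation d spans (K − 1)·d + 1 pixels and that stacking layers adds their spans minus the overlaps; off-by-one constants can differ between counting conventions:

```python
def atrous_rf(k, d):
    """Equivalent receptive field of one atrous convolution:
    k taps spaced d pixels apart span (k - 1) * d + 1 pixels."""
    return (k - 1) * d + 1

def stacked_rf(layers):
    """Receptive field of sequentially stacked (kernel, dilation) layers:
    the first layer's span plus (rf - 1) for every additional layer."""
    rfs = [atrous_rf(k, d) for k, d in layers]
    return sum(rfs) - (len(rfs) - 1)
```

For kernel size 3 this gives, for example, `atrous_rf(3, 3) == 7` and `atrous_rf(3, 24) == 49`, and stacking two layers combines their spans as in the derivation above.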
Finally, we obtain the updated cost volume component C′_i as …

FIGURE 6. Comparison of the cost value variances across different disparity levels at each pixel position of the initial and updated 3D cost volume components C_i (blue) and C′_i (orange). We can find that the variances of C_i are zero for all pixels, which means the cost values at different disparities at a given pixel position are always constant. However, the corresponding variances and cost values of C′_i are non-constant, indicating that by using the CCA module we can achieve more informative 3D cost volume components.
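The variance comparison in Figure 6 amounts to a single reduction over the disparity axis; a sketch with illustrative names:

```python
import numpy as np

def cost_variance_map(cost):
    """Per-pixel variance of cost values across the disparity axis
    of a (D, H, W) cost volume component."""
    return cost.var(axis=0)
```

A component whose cost profile is constant across disparities yields a variance map that is zero everywhere, which is exactly the degenerate case shown in blue.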
In the next step, we use the disparity regression module to process the updated 4D cost volume C′ and get the disparity map, as described in the next section.

The CCA module is designed to replace the cost aggregation step in traditional stereo matching methods, and it optimizes the initial 4D cost volume obtained in the previous step. In fact, as shown in Figure 6, after observing the cost values of a pixel in an initial cost volume component at different disparities, we found that the cost values at all disparities are identical, which obviously does not match reality. To this end, we change this constant distribution of cost values into a non-constant distribution through the CCA module. By using the CCA module, the cost volume becomes more informative, which makes it easier for the subsequent disparity regression module to calculate the accurate disparity.
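A minimal sketch of the aggregation idea: parallel dilated filters with different rates are averaged over a 3D cost volume component. The 3-tap dilated average, the dilation rates, and filtering along the disparity axis are illustrative stand-ins for the learned atrous convolutions, chosen here only to show how a disparity-constant cost profile becomes non-constant:

```python
import numpy as np

def dilated_avg(x, dilation):
    """3-tap dilated average along the last axis with zero padding."""
    p = dilation
    xp = np.pad(x, [(0, 0)] * (x.ndim - 1) + [(p, p)])
    return (xp[..., :-2 * p] + xp[..., p:-p] + xp[..., 2 * p:]) / 3.0

def aggregate_component(cost, dilations=(3, 6, 12)):
    """Aggregate a (D, H, W) component with multi-rate dilated filters
    along the disparity axis; padding and multi-rate mixing turn a
    constant cost profile over disparities into a non-constant one."""
    c = np.moveaxis(cost, 0, -1)                   # (H, W, D)
    out = np.mean([dilated_avg(c, d) for d in dilations], axis=0)
    return np.moveaxis(out, -1, 0)                 # back to (D, H, W)
```

Running this on a component whose cost values are identical at every disparity produces per-pixel cost profiles with non-zero variance, the qualitative behaviour shown in orange in Figure 6.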

The input of the disparity regression module is C′, and the output is the disparity map. Specifically, we first use a stacked hourglass structure, which is shown in Figure 7. The stacked hourglass network was proposed by Newell et al. [35]; it achieves a better mixing of global and local information through repeated downsampling and upsampling operations. In this paper, the hourglass block is shown in Figure 8, and the specific convolution settings of the stacked hourglass structure are shown in Table 1. From this table, we can see that the first two convolution layers are used to downsample the cost volume; each layer contains two 3D convolution layers with stride 2 and stride 1. Then two deconvolution layers are employed to restore the cost volume to its original size of …

VOLUME 10, 2022

The output of each hourglass block changes its dimension after two 3D convolutions and upsampling operations, which is noted as C_regression. We use the regression method to build the disparity map. Specifically, the predicted disparity is calculated by using the softmax operation σ(·) with the following equation:

d̂ = Σ_{d=0}^{D_max} d × σ(−c_d),

… where N denotes the total number of marked pixels, d_(x,y) denotes the true disparity at coordinate (x, y), and d̂_(x,y) denotes the predicted disparity.
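The softmax-based regression can be sketched as follows: a minimal NumPy version of the soft-argmax commonly used in this family of networks, together with a smooth-L1 helper that reflects the usual per-pixel loss choice (an assumption here, as the loss equation is not reproduced above):

```python
import numpy as np

def soft_argmax_disparity(cost):
    """Expected disparity under a softmax over negated costs:
    d_hat = sum_d d * sigma(-c_d), computed along axis 0 of (D, H, W)."""
    c = -cost
    c = c - c.max(axis=0, keepdims=True)           # numerical stability
    p = np.exp(c)
    p = p / p.sum(axis=0, keepdims=True)
    d = np.arange(cost.shape[0]).reshape(-1, *([1] * (cost.ndim - 1)))
    return (p * d).sum(axis=0)

def smooth_l1_loss(pred, gt):
    """Mean smooth-L1 loss over labelled pixels: quadratic for small
    errors, linear for large ones."""
    e = np.abs(pred - gt)
    return np.mean(np.where(e < 1.0, 0.5 * e ** 2, e - 0.5))
```

Because the expectation is differentiable in the costs, the regression module can be trained end to end, unlike a hard argmin over disparities.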

As shown in Figure 7 and Figure 8, while the output c_6 of each hourglass block is used as the input of the next hourglass block, we also use it to generate a disparity map. Therefore, in the disparity regression module, two intermediate disparity maps and one final disparity map are generated, and their losses are denoted as L_1, L_2 and L_3, respectively. The final loss function is generated by a weighted summation of L_1, L_2 and L_3:

L = λ_1 L_1 + λ_2 L_2 + λ_3 L_3.

… Continuing to increase the weights of L_1 and L_2 instead causes the error to rise. This indicates that the output of the deep hourglass block contains more valid information than the output of the shallow layers, so L_3 needs to be given a greater weight.

In order to verify the effects of the introduced DFL and CCA modules, we conduct ablation experiments and compare their effects on the Scene Flow dataset and the KITTI dataset, respectively. As shown in Table 3 …

TABLE 3. Ablation study showing the effectiveness of the DFL module and the CCA module in DFL-CCA-Net. We report the percentage of 3-pixel error on the KITTI 2015 validation set and the end-point error on the Scene Flow test set.

From Figure 9, it can be found that the disparity map generated by DFL-CCA-Net is significantly better than that of PSMNet, especially in repetitive texture areas and thinner areas such as lines and columns. This visually demonstrates the effectiveness of the DFL module and the CCA module. In addition, from the error maps presented in Figure 9, our proposed DFL module and CCA module can significantly improve the disparity prediction not only in the pathological … Table 5, and all data are taken from the KITTI test server. In Table 5 … detailed information such as the edges of signage, poles, and grasses as marked in the figures.
From the disparity maps and error maps in Figure 10 and Figure 11, our proposed DFL and CCA modules can significantly reduce the prediction errors in the background region, thus improving the accuracy of disparity prediction. … percentages for both all pixels (All) and non-occluded pixels (Noc). The pixels involved in the error calculation in Table 5 are the pixels of all regions, and Table 6 shows the test results for pixels in reflective regions. As shown in the last row of Tables 5 and 6, DFL-CCA-Net is also competitive on the KITTI 2012 dataset, especially reaching …