Unsupervised Learning for Stereo Matching Using Single-View Videos

This paper proposes an unsupervised approach to constructing a deep learning-based stereo matching method using single-view videos (SMV). From the videos, a set of corresponding points is computed between images, and image patches centered at the computed points are extracted. Negative and positive samples constitute a dataset for training a similarity network that is then used as a matching cost function. In addition, we propose a local-global matching cost network that combines the first feature maps (local features) with the last feature maps (global features) as the output feature of the network. The concatenated features are fed to fully-connected layers, and the network outputs a similarity measure for an image patch pair as a matching cost. The computed matching costs are aggregated using semi-global matching and cross-based cost aggregation, followed by sub-pixel interpolation, a left-right consistency check, and median and bilateral filtering. We evaluate the proposed stereo matching method on popular stereo matching datasets, including KITTI 2012, KITTI 2015, and Middlebury, and submit the disparity maps to their benchmark servers to evaluate the performance of SMV. We also compare the generalization of SMV and baseline methods using the training sets of the three datasets. The benchmark results show that SMV is the most accurate among the unsupervised approaches and that it even outperforms several deep learning-based stereo matching methods trained with supervision. The generalization results show that SMV is comparable to the baseline method, MC-CNN, which is trained with supervision.


I. INTRODUCTION
Stereo matching aims to reconstruct 3D information from stereo images. Given the left and right images, a stereo matching method estimates a disparity map, in which pixel intensities encode the depth from the cameras to the objects that the pixels belong to. Figure 1 illustrates stereo matching.
Stereo matching has been intensively researched for several decades because of its important applications in self-driving cars, 3-D reconstruction, view interpolation, and robot navigation [1], [2]. Scharstein and Szeliski [3] provided an excellent survey of stereo matching methods and divided them into local and global methods. Local stereo matching methods normally include matching cost computation, cost aggregation, and disparity computation steps, whereas global correspondence methods typically consist of matching cost computation and disparity optimization steps. Disparity refinement, such as sub-pixel interpolation via parabolic fitting, a left-right consistency check [4], and image filtering, can be used to improve the quality of the disparity map.
The associate editor coordinating the review of this manuscript and approving it for publication was Alma Y. Alanis.
Zbontar and LeCun [5] proposed the first deep learning-based stereo matching cost, which exploits a convolutional neural network. The matching costs are processed by cross-based cost aggregation (CBCA) [6] and semi-global matching (SGM) [7], followed by post-processing techniques including sub-pixel interpolation, a left-right consistency check, and median and bilateral filtering.
Since the dawn of deep neural networks, many deep learning-based methods have been proposed for matching cost computation [8]-[11], cost aggregation [12], and post-processing [13]. Other work [14]-[19] proposed stereo matching methods that unify deep learning-based components and are trained in an end-to-end fashion. Recently, disparity confidence methods [20]-[23] have been introduced to improve the performance of stereo matching methods.
However, current deep learning methods require domain data for training. Supervised methods require left and right stereo images and ground truth, whereas unsupervised training methods require only left and right stereo images. This paper proposes training a matching cost network without requiring domain data. Corresponding image patches are extracted from single-view videos and subsequently employed as the training data. Collecting stereo matching datasets for different situations is not an easy task, so our approach makes it easier to construct a stereo matching method.
In this paper, we propose an approach to learning a matching cost network from videos. From single-view videos, feature matching points between frames are computed, and image patches around the matching points are extracted to build a dataset of corresponding patches. The dataset is then used as training data. In addition, we propose a local-global matching cost network that takes advantage of local features from the first layer.
The contributions of this paper are as follows:
• This paper proposes an approach to training a matching cost network by using single-view videos. This approach needs neither stereo images nor ground truth.
• A local-global matching cost network is proposed to exploit the benefit of using the first layer, which can extract features similar to those of local binary patterns.

II. RELATED WORK
Traditional matching cost functions include the sampling-insensitive (SI) measure [24], absolute difference (AD), and squared difference (SD). These traditional functions assume that corresponding pixels in the stereo images have the same intensity values. Therefore, they perform poorly when the stereo images are radiometrically distorted. In many cases, intensity changes between stereo images are monotonically nonlinear, so that the order of the intensity values is preserved. Matching cost functions that exploit ordinal values rather than raw intensities can tolerate this kind of intensity transformation. These matching cost functions include the rank and census transforms [25], the support local binary pattern (SLBP) [26], the fuzzy encoding pattern [27], and the soft rank transform [28].
Han et al. proposed a gradient-based matching cost function [29]. Scharstein et al. [30] introduced a gradient-based measure that can operate under differences in camera gain and bias. Wei et al. [31] proposed an intensity- and gradient-based matching method using hierarchical Gaussian basis functions. Zhou and Boulanger [32] introduced a Gaussian-weighted sum of absolute differences based on relative gradients. Pinggera et al. [33] proposed dense gradient features for cross-modal stereo.
Mutual information can tolerate any global intensity changes and has been exploited as a matching cost function in stereo matching. Kim et al. [34] proposed a pixel-wise matching cost for stereo matching based on mutual information. Hirschmuller [7] introduced a stereo matching method based on semi-global matching and mutual information. Heo et al. [35] introduced a stereo matching method in which the matching cost function combines mutual information with the SIFT descriptor [36] in log-chromaticity color space.
P. N. Hong, C. W. Ahn: Unsupervised Learning for Stereo Matching Using Single-View Videos
Heo et al. [37] proposed adaptive normalized cross-correlation (ANCC), an improved version of normalized cross-correlation (NCC) that is invariant to radiometric distortion. RANCC [38] improves on ANCC by accounting for the effect of texture and noise on image regions. Dinh et al. [39] proposed a matching cost measure that addresses nonlinear intensity transformations of pixels between image patches.
A recent approach to computing the matching cost is to use a convolutional neural network to predict a matching value for a patch pair. Reference [5] introduced a convolutional neural network that is trained to measure the similarity of a patch pair. Reference [8] proposed a deep embedding model for predicting the matching cost, which explicitly maps intensity values into an embedding feature space to estimate pixel dissimilarities. References [5] and [8] need stereo images and ground truth for training.
Reference [9] proposed a fast matching cost network that uses a product layer in a siamese architecture. Reference [10] proposed an unsupervised approach to estimating the matching cost that exploits a left-right consistency check to guide the training process. Reference [11] proposed a weakly supervised technique for training patch similarity that uses properties of the optical sensor and rough scene knowledge. Li and Yuan [62] introduced an unsupervised, occlusion-aware stereo matching method. Joung et al. [63] proposed a stereo matching method that is trained in an unsupervised manner using confidential correspondence consistency. Tonioni et al. [61], [64] introduced stereo matching methods for domain adaptation using stereo images without ground truth.
The output of the matching cost computation step is a matching cost space C, in which C_d(p) is the matching cost of a pixel p in the reference image (e.g., the left image of a stereo pair) at a disparity hypothesis d. From C, a disparity value for p can be obtained by using a winner-takes-all strategy, as follows:
$$D_E(p) = \arg\min_{d} C_d(p),$$
where D_E is the estimated disparity map. Applying a winner-takes-all strategy is the simplest way to obtain a dense disparity map.
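The winner-takes-all step can be sketched in a few lines of NumPy. This is only an illustration: the array layout (disparity, row, column) and the function name `winner_takes_all` are assumptions made here, not part of the original method description.

```python
import numpy as np

def winner_takes_all(cost_volume: np.ndarray) -> np.ndarray:
    """Return, for every pixel, the disparity with the lowest matching cost.

    cost_volume has shape (num_disparities, height, width); entry [d, y, x]
    plays the role of C_d(p) for the pixel p = (x, y). The layout is an
    assumption of this sketch.
    """
    return np.argmin(cost_volume, axis=0)

# Toy cost volume: 3 disparity hypotheses for a 1 x 2 image.
costs = np.array([
    [[0.9, 0.2]],  # d = 0
    [[0.1, 0.5]],  # d = 1
    [[0.4, 0.7]],  # d = 2
])
disparity = winner_takes_all(costs)  # left pixel picks d = 1, right pixel d = 0
```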

III. SMV
A. DATASET CONSTRUCTION FROM VIDEOS
In this subsection, we present an approach to constructing a dataset from videos, which is then used to train a matching cost network. Given a video, we extract two frames. To reduce the scene correlation between frames, the two selected frames should not be consecutive in the video. We use SIFT to compute corresponding points between the frames, as shown in Fig. 2(a). For each pair of corresponding points, we extract image patches whose center pixels are the corresponding points, as shown in Fig. 2. According to [45], challenges in stereo matching include textureless regions, occlusion, illumination variations, snow, sun, and rain. Therefore, the extracted patches are processed to simulate these challenges. Each patch undergoes a pipeline of common image transformations, such as rotation, translation, elastic distortion, noise addition, and brightness and contrast changes, as shown in Fig. 3.
Brightness and contrast adjustment changes the brightness and contrast of the image patch P by setting
$$P \leftarrow P \cdot \mathit{contrast} + \mathit{brightness},$$
where addition and multiplication are element-wise operations. Rotation rotates the patch by rotation degrees, whereas translation translates the patch in the vertical direction by translation. Scaling scales the patch by scaling, and shearing shears the patch in the horizontal direction by shearing.
Elastic distortion [40] is commonly used in classification to generate images that are plausible and label-preserving. It distorts an image patch according to the intensity of the transformation, ED_alpha, and the smoothness of the transformation, ED_sigma. Noise block addition adds a block of random values to an image patch at a randomly selected position. Foreshortening is inspired by the different viewpoints of stereo cameras. In foreshortening, we first crop the left or right side of an image patch according to cropping and p_lr, and then resize the cropped patch to the size of the original patch. Fig. 4 illustrates elastic distortion and left and right foreshortening for an input patch.
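Three of the transformations described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: `cropping` is taken to be a number of columns, the choice between left and right cropping (governed by p_lr in the text) is passed in as a boolean, the resize uses nearest-neighbour column resampling, and the noise block is filled with uniform random values. None of these details is specified in the text.

```python
import numpy as np

def adjust_brightness_contrast(patch, contrast, brightness):
    """Apply P <- P * contrast + brightness element-wise."""
    return patch * contrast + brightness

def foreshorten(patch, cropping, crop_left):
    """Crop `cropping` columns from one side, then stretch back to full width.

    Nearest-neighbour column resampling is an assumption made for brevity.
    """
    height, width = patch.shape
    cropped = patch[:, cropping:] if crop_left else patch[:, :width - cropping]
    # Map every output column to its nearest source column in the cropped patch.
    src_cols = np.arange(width) * cropped.shape[1] // width
    return cropped[:, src_cols]

def add_noise_block(patch, block_size, rng):
    """Overwrite a randomly placed square block with uniform random values."""
    out = patch.copy()
    height, width = patch.shape
    top = rng.integers(0, height - block_size + 1)
    left = rng.integers(0, width - block_size + 1)
    out[top:top + block_size, left:left + block_size] = rng.random((block_size, block_size))
    return out
```

In a full pipeline, the parameters of each transformation would be drawn at random for every patch, so that the same input patch yields different outputs on each pass.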
To prepare training data consisting of positive and negative examples, each image patch extracted from an image is passed through the transformation pipeline twice, with different random parameter settings. The two transformed patches form a synthesized pair of corresponding image patches (a positive example). A negative example is created by extracting a new image patch that lies at a distance, data_distance, from the considered image patch.
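The construction of one positive and one negative example can be sketched as below. The helper names, the reduced transformation pipeline (random contrast and brightness only, with hypothetical ranges), and the choice of a horizontal shift with a random sign for the negative patch are all assumptions of this sketch. The text only states that the negative patch lies at a distance data_distance from the considered patch.

```python
import numpy as np

def extract_patch(image, center, size):
    """Cut a size x size patch centered at center = (row, col)."""
    half = size // 2
    row, col = center
    return image[row - half:row + half + 1, col - half:col + half + 1]

def random_transform(patch, rng):
    """Stand-in for the full pipeline: random contrast and brightness only."""
    contrast = rng.uniform(0.8, 1.2)     # hypothetical range
    brightness = rng.uniform(-0.1, 0.1)  # hypothetical range
    return patch * contrast + brightness

def make_examples(image, center, size, data_distance, rng):
    """Return (positive_pair, negative_pair) for one keypoint."""
    patch = extract_patch(image, center, size)
    # Positive: the same patch passed through the random pipeline twice.
    positive = (random_transform(patch, rng), random_transform(patch, rng))
    # Negative: pair the patch with one that lies data_distance pixels away.
    row, col = center
    shift = data_distance if rng.random() < 0.5 else -data_distance
    far_patch = extract_patch(image, (row, col + shift), size)
    negative = (random_transform(patch, rng), random_transform(far_patch, rng))
    return positive, negative

rng = np.random.default_rng(0)
image = rng.random((64, 64))
positive, negative = make_examples(image, (32, 32), 11, 10, rng)
```

A complete implementation would also have to skip keypoints for which either patch would fall outside the image.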

B. LOCAL-GLOBAL MATCHING COST NETWORK
We propose a local-global matching cost network that exploits the first convolution layer, as shown in Fig. 4. The first convolution layer extracts low-level, edge-like features of an image patch, and each of its convolution kernels often extracts a different feature. The features in the last layers are considered global features that capture high-level properties of the image patch.
In stereo matching, hand-crafted feature extractors, such as census, rank, and SLBP, have been applied successfully to stereo images under different conditions. Each of these feature extractors is designed to obtain different, highly discriminative features.
The feature maps of the first convolution layer are somewhat similar to the output of the hand-crafted feature extractors. They can even capture a larger number of features, because the number of feature maps is configurable (for example, 32 or 64) and the features are learned automatically.
As a result, our idea is to combine the local features (feature maps of the first convolutional layer) and the global features (output of the last layer) to increase the discriminative power. Fig. 4 shows the architecture of the proposed multi-patch matching cost network. Each sub-network consists of a number of convolution layers, each followed by a rectified linear unit (ReLU) layer. The resulting four vectors are concatenated and propagated forward through a series of fully-connected layers, each followed by a ReLU. The final output of the network is fed to a sigmoid activation function to produce a similarity score between the input patches.
The binary cross-entropy loss is used for training. Let x denote the output of the network for one training example and y denote the class of that training example; y = 1 if the example belongs to the positive class and y = 0 if the example belongs to the negative class. The binary cross-entropy loss L for that example is defined as
$$L = -\left[\, y \log x + (1 - y) \log (1 - x) \,\right].$$
The hyperparameters of the proposed network are the number of fully-connected layers (num_fc_layers), the number of units in each fully-connected layer (num_fc_units), the number of feature maps in each layer (num_fmaps), the number of convolutional layers (num_clayers), the size of the convolution kernels (ckernel_size), and the size of the input patch (input_patch_size).
VOLUME 8, 2020
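For reference, the binary cross-entropy of a single example in its standard form, L = -[y log x + (1 - y) log(1 - x)], can be computed as in the sketch below. The clamping constant `eps` is an assumption added here for numerical safety and is not mentioned in the text.

```python
import math

def binary_cross_entropy(x: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for a sigmoid output x in (0, 1) and a label y in {0, 1}."""
    x = min(max(x, eps), 1.0 - eps)  # keep log() away from 0 (assumption of this sketch)
    return -(y * math.log(x) + (1 - y) * math.log(1.0 - x))

# A confident correct prediction is penalized less than a confident wrong one.
loss_good = binary_cross_entropy(0.9, 1)
loss_bad = binary_cross_entropy(0.1, 1)
```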
The hyperparameters of the aggregation and post-processing methods include cbca_distance, cbca_num_iters_1, and cbca_num_iters_2, which denote the similarity threshold for pixel intensities, the number of iterations of cross-based cost aggregation before SGM, and the number of iterations of cross-based cost aggregation after SGM, respectively. sgm_P1, sgm_P2, sgm_Q1, and sgm_Q2 stand for the first smoothness parameter of SGM, the second smoothness parameter of SGM, the first factor used for changing sgm_P1/sgm_P2, and the second factor used for changing sgm_P1/sgm_P2, respectively. sgm_V denotes the factor by which sgm_P1 is reduced when considering the vertical direction, and sgm_D denotes the pixel intensity threshold for changing sgm_P1/sgm_P2. Finally, blur_sigma and blur_threshold stand for the standard deviation and the threshold of a post-processing filter, respectively.
In this paper, we use 11 × 11 image patches as input to the network. The first convolutional layer extracts feature maps from the input patches, which are then treated as local image features. The network has five convolutional layers with 3 × 3 kernels and 112 feature maps each. A 224-length vector is formed by concatenating the two 112-length feature vectors and is then passed through three fully-connected layers with 384 units each. The final fully-connected layer projects the output to a single number, the similarity score. The matching cost is simply the negative of the similarity score.

C. COST AGGREGATION AND POST-PROCESSING METHODS
The outcome of the local-global matching cost network is a matching cost space that is then aggregated and post-processed to produce the final disparity map. We follow the pipeline introduced in [41] (used later by MC-CNN [46]), as shown in Fig. 5. The pipeline uses CBCA and SGM to aggregate the matching costs. Sub-pixel interpolation and a left-right consistency check to detect invalid pixels are then applied, followed by median and bilateral filtering. Similar to MC-CNN, we use CBCA before and after SGM.
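Two of the post-processing steps named above, sub-pixel interpolation by parabola fitting and the left-right consistency check, can be sketched as follows. The array layout, the function names, and the tolerance `max_diff` are assumptions of this sketch. The exact handling of invalid pixels in the pipeline of [41] is not reproduced here.

```python
import numpy as np

def subpixel_refine(cost_volume, disparity):
    """Refine integer disparities by fitting a parabola through three costs."""
    num_disp, height, width = cost_volume.shape
    refined = disparity.astype(np.float64)
    for y in range(height):
        for x in range(width):
            d = int(disparity[y, x])
            if d == 0 or d == num_disp - 1:
                continue  # a neighbour is missing, keep the integer value
            c_minus, c0, c_plus = cost_volume[d - 1:d + 2, y, x]
            denom = c_minus - 2.0 * c0 + c_plus
            if denom > 0:  # proper minimum of the fitted parabola
                refined[y, x] = d - (c_plus - c_minus) / (2.0 * denom)
    return refined

def left_right_check(disp_left, disp_right, max_diff=1.0):
    """Flag left-image pixels whose disparity agrees with the right disparity map."""
    height, width = disp_left.shape
    valid = np.zeros((height, width), dtype=bool)
    for y in range(height):
        for x in range(width):
            xr = x - int(round(float(disp_left[y, x])))  # matching column on the right
            if 0 <= xr < width:
                valid[y, x] = abs(disp_left[y, x] - disp_right[y, xr]) <= max_diff
    return valid
```

Pixels flagged as invalid are typically treated as occlusions or mismatches and filled in from neighbouring valid disparities before filtering.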

IV. EXPERIMENTAL RESULTS
We evaluated the proposed stereo matching method using KITTI 2012, 2015, and Middlebury datasets. We uploaded the results for the three datasets to their online benchmark servers.
To evaluate the generalization performance of the tested stereo matching methods, we used different datasets for the training and testing steps and compared the results with MC-CNN, AD, and Census. All the tested methods use the same pipeline of cost aggregation and post-processing methods. We followed the parameter settings in [46] for MC-CNN, AD, and Census.
For the proposed matching cost network, we used grid search to select the parameter settings on a mixed dataset constructed from the KITTI and Middlebury training datasets. For each parameter, we first estimated a feasible range and a step size for the grid search. We then chose the parameter setting that had the best performance on the mixed dataset. Table 1 shows the parameter settings for the proposed stereo matching method; the parameters were fixed for all of our experiments.
We used the Cityscapes video dataset [42] to train the proposed matching cost network. Specifically, we used three single-view sequences (stuttgart_00, stuttgart_01, and stuttgart_02), which include about 2900 images in total at 2048 × 1024 resolution. Let i be the frame index of a video. We used the image pair I_i and I_{i+2} to compute corresponding points using SIFT. In total, about 12.5 million point pairs were detected, and hence about 25 million sample patches (including positive and negative samples) were extracted.
TABLE 2. KITTI 2012 benchmark results in error rate (%) for SMV. Out-Noc is the percentage of erroneous pixels in non-occluded regions, and Out-All is the percentage of erroneous pixels in total. Avg-Noc is the average disparity (end-point) error in non-occluded regions, and Avg-All is the average disparity (end-point) error in total.
We used stochastic gradient descent to minimize the cross-entropy loss when training the proposed network. The network was trained for 22 epochs, with the learning rate initially set to 0.003 and decreased by a factor of 10 at the 18th epoch. The training dataset was shuffled before each epoch, and the batch size was set to 128.
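The learning-rate schedule described above amounts to a single step decay. A minimal sketch, assuming 1-indexed epochs and that the reduced rate already applies during the 18th epoch (the text does not say whether the drop takes effect before or after that epoch):

```python
def learning_rate(epoch: int, base_lr: float = 0.003,
                  drop_epoch: int = 18, factor: float = 10.0) -> float:
    """Learning rate for a 1-indexed epoch, divided by `factor` from drop_epoch on."""
    return base_lr / factor if epoch >= drop_epoch else base_lr

# One value per epoch for the 22 training epochs.
schedule = [learning_rate(epoch) for epoch in range(1, 23)]
```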
Disparity maps were evaluated using the average proportion of erroneous pixels in all regions except occlusions. We used an error threshold of th = 3 pixels for KITTI and th = 1 for Middlebury. The error rate (%) was calculated as
$$\text{error rate} = \frac{100}{|I_{nocc}|} \sum_{p \in I_{nocc}} \left[\, |D_G(p) - D_E(p)| > th \,\right],$$
where [·] equals 1 if the condition holds and 0 otherwise, I_nocc is the set of all non-occluded pixels, |I_nocc| is the number of pixels in I_nocc, and D_G(p) and D_E(p) are the ground truth and estimated disparity at p, respectively.
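The error rate over non-occluded pixels can be computed as in the following sketch. The function name and the boolean-mask representation of I_nocc are assumptions. The strict inequality follows the usual convention that a pixel is erroneous when its absolute disparity error exceeds th.

```python
import numpy as np

def error_rate(disp_gt, disp_est, nocc_mask, th):
    """Percentage of non-occluded pixels whose disparity error exceeds th."""
    errors = np.abs(disp_gt - disp_est) > th
    return 100.0 * errors[nocc_mask].mean()

# Toy example: 4 pixels, the last one occluded and therefore ignored.
gt = np.array([[10.0, 20.0, 30.0, 40.0]])
est = np.array([[10.5, 24.0, 30.0, 0.0]])
mask = np.array([[True, True, True, False]])
rate = error_rate(gt, est, mask, th=3)  # one of three valid pixels is wrong
```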

A. QUANTITATIVE RESULTS USING STEREO MATCHING BENCHMARKS
The KITTI 2012 and 2015 datasets [43], [44] include outdoor stereo images with sparse ground truth (approximately 50% of the pixels). The KITTI 2012 dataset has 194 stereo pairs for training and 195 stereo pairs for testing, and the KITTI 2015 dataset provides 200 stereo images for training. Middlebury provides indoor stereo images with dense ground truth.
Since the KITTI and Middlebury servers limit the number of submissions, we used the servers to evaluate only the complete version of the proposed stereo matching method. Tables 2, 3, and 4 show the results of SMV in the KITTI 2012, KITTI 2015, and Middlebury benchmarks, respectively. The proposed stereo matching method significantly outperformed SGM and ELAS, which are considered baseline methods for the traditional stereo matching approach. In addition, in all three benchmarks, the proposed stereo matching method performed better than several deep learning-based stereo matching methods, even though it was constructed without using a single stereo pair. Figs. 6 and 7 show some disparity maps of SMV downloaded from the KITTI server for the KITTI 2012 and 2015 datasets, respectively.

B. GENERALIZATION
In this subsection, we compare the performance of SMV, MC-CNN, AD, and Census in terms of data generalization. In other words, MC-CNN, AD, and Census use one set of training data to train and/or tune the parameters of a method and are then evaluated on different data. For AD and Census, the parameters of the post-processing techniques were set the same as in the MC-CNN paper. Let MC-CNN_K12, MC-CNN_K15, and MC-CNN_MB denote MC-CNN with the accurate architecture trained using the KITTI 2012, KITTI 2015, and Middlebury training sets, respectively. In addition, to evaluate the effectiveness of the multi-patch matching cost network in SMV, we designed a version of SMV in which the number of input patches is set to 1, denoted SMV(-). Except for the cropped size of 9 × 9, the parameters of SMV(-) were set the same as those of SMV. Figs. 8 and 9 show the quantitative results of the tested stereo matching methods for the first 100 stereo pairs of the KITTI 2012 and 2015 training sets, respectively. SMV performed significantly better than AD and Census and was comparable to the MC-CNN variants, which require training data. Because it uses the multi-patch network, SMV performed much more robustly than SMV(-).
In addition, we computed the average performance of the tested stereo matching methods over the KITTI 2012 and 2015 training data. Fig. 10 shows the average error rates. AD and Census had the largest error rates, whereas SMV and the MC-CNN variants had similar performance. SMV(-), which does not use the multi-patch network, performed poorly, with error rates approximately double those of SMV. In all cases, even though SMV did not use stereo training data, its error rates were only slightly larger than those of the MC-CNN variants.

C. USING LOCAL BINARY PATTERNS
In this subsection, we evaluate the performance of combining handcrafted features with the feature maps of convolutional networks. Specifically, instead of combining feature maps from the first and last convolutional layers, we computed the census, rank, and SLBP transforms of the input images and concatenated them with the last convolutional feature maps. We denote this method as SMV_LBP. Figure 10 shows an illustration of the census and rank transforms for an image.
We used a 3 × 3 window for both the census and rank transforms. We normalized the transformed images before concatenating them with the feature maps computed from the last convolutional layer. Figure 11 shows the quantitative results of SMV_LBP using the KITTI 2012 and 2015 training datasets. SMV_LBP performed marginally better than SMV(-) and worse than SMV. The reason is that the census and rank transforms are just two instances of a 3 × 3 convolution matrix with fixed weight values. In contrast, SMV extracted 112 feature maps using 112 convolution matrices, whose weights were optimized for the training dataset.
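The census and rank transforms with a 3 × 3 window can be sketched as follows. The bit ordering of the census code, the use of a strict "darker than the center" comparison, and the omission of border pixels are conventions chosen for this sketch. The normalization applied before concatenation is not shown.

```python
import numpy as np

def _neighbours(image):
    """Yield the 8 shifted views of the image interior for a 3 x 3 window."""
    height, width = image.shape
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            yield image[1 + dy:height - 1 + dy, 1 + dx:width - 1 + dx]

def census_transform(image):
    """8-bit census code: one bit per neighbour that is darker than the center."""
    center = image[1:-1, 1:-1]
    code = np.zeros(center.shape, dtype=np.uint8)
    for neighbour in _neighbours(image):
        code = (code << 1) | (neighbour < center).astype(np.uint8)
    return code

def rank_transform(image):
    """Number of neighbours in the 3 x 3 window that are darker than the center."""
    center = image[1:-1, 1:-1]
    rank = np.zeros(center.shape, dtype=np.uint8)
    for neighbour in _neighbours(image):
        rank += (neighbour < center).astype(np.uint8)
    return rank

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
census = census_transform(image)  # the four darker neighbours set the four high bits
rank = rank_transform(image)
```

Both transforms depend only on the ordering of intensities within the window, which is why they tolerate monotonic intensity changes between the two views.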

D. SMV EVALUATION
We evaluated the stereo matching methods using their raw matching costs on the KITTI 2012 and 2015 datasets. In addition, we trained SMV in a supervised manner using the KITTI 2012, KITTI 2015, and Middlebury training datasets, denoted as SMV_K12, SMV_K15, and SMV_MB, respectively. For a fair comparison, we used the same data augmentation as in MC-CNN.
Table 6 shows the error rates for the KITTI 2012 (K12) and KITTI 2015 (K15) training datasets. AD and Census had the worst performance, with AD even outperforming Census. The performance of AD and Census in our experiments is similar to that reported in [46]. The supervised versions of SMV outperformed MC-CNN on all corresponding datasets, which validates the effectiveness of using the local and global CNN features in SMV.

E. SENSITIVITY ANALYSIS
In this subsection, we describe how the parameter values were selected and analyze the effect of different parameter configurations on SMV. As shown in Table 1, the SMV network has six parameters: input_patch_size, num_clayers, num_fmaps, ckernel_size, num_fc_layers, and num_fc_units.
For the kernel size ckernel_size, two stacked 3 × 3 kernels have the same receptive field as one 5 × 5 kernel. Therefore, a 3 × 3 kernel size is now commonly used for CNNs. SMV and MC-CNN share three common parameters: num_fmaps, num_fc_layers, and num_fc_units. In our work, we selected the values of these three parameters as recommended by the MC-CNN work, for two reasons. First, the three parameters were carefully selected using grid search in the MC-CNN work. Second, using the same values helps to show the effectiveness of exploiting the local-global features in SMV.
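The receptive-field argument can be checked with a short calculation. The sketch below assumes stride-1 convolutions without dilation, which the text does not state explicitly. Under that assumption, five stacked 3 × 3 layers cover exactly an 11 × 11 input patch.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1, undilated convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each layer widens the field by (kernel size - 1)
    return field

two_small = receptive_field([3, 3])       # two stacked 3 x 3 kernels
one_large = receptive_field([5])          # a single 5 x 5 kernel
five_layers = receptive_field([3] * 5)    # five 3 x 3 layers
```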

V. CONCLUSIONS
This paper proposed an approach to building a stereo matching method from single-view videos in an unsupervised manner. In addition, we proposed a matching cost network that explicitly exploits local and global features. The proposed stereo matching method was evaluated on datasets commonly used in stereo matching, including KITTI 2012, KITTI 2015, and Middlebury. Experimental results on the benchmarks showed that the proposed method had the best performance among unsupervised methods and outperformed several supervised methods. It also performed well across different datasets.
In future work, we plan to investigate image similarity functions in greater depth, both in traditional approaches and in learning-based ones. We also plan to study applications of similarity functions in computer vision and ways to construct them when datasets are available in different domains.