Aggregation of Convolutional Neural Network Estimations of Homographies by Color Transformations of the Inputs

The standard approach to the estimation of homographies consists in the application of the RANSAC algorithm to a set of tentative matches. More recently, strategies based on deep learning, notably convolutional architectures, have become available. In this work, a new algorithm for the estimation of homographies is developed. It is rooted in a convolutional neural network for homography estimation, which is supplied with a range of versions of the input pair of pictures. Such versions are generated by perturbation of the color levels of the input images. Each generated pair of images yields a distinct estimation of the homography, and then the estimations are combined to obtain a final, more robust estimation. Experiments have been designed and carried out to test the validity of our approach, including qualitative and quantitative performance measures. In particular, it is demonstrated that our approach consistently outperforms the baseline approach, which consists of using the output of the deep homography estimation network for the original input pair of images.


I. INTRODUCTION
One of the basic low-level processes to be performed in computer vision consists of estimating a homography between two pictures. This task involves finding the projective transformation that best relates the key points in both images, in order to reduce the correspondence error as much as possible [1], [2]. This is essential for plenty of high-level computer vision challenges such as mosaicing [3], line matching [4], foreground detection [5], camera motion estimation [6], action recognition [7] and multi-view tracking [8]–[11].
Feature-based methods have been the most used to date for homography estimation [12], [13]. Most techniques focus on locating points of interest in each image using detectors based on classical computer vision such as the Harris detector [14] or the SIFT method [15]. Subsequently, the invalid correspondence pairs are filtered using the Random Sample Consensus (RANSAC) method, providing the best possible homography estimation [16]–[18]. These methods perform better than traditional or direct methods [19]. Still, they may be ineffective when the number of points of interest is insufficient (mainly due to poorly textured and homogeneous images) or when mismatches occur due to substantial changes in lighting or significant differences in the viewpoints of both images. Due to this complexity, other techniques for extracting points of interest have been developed [20]; although they are more efficient, their performance is slightly lower. (The associate editor coordinating the review of this manuscript and approving it for publication was Md. Asikuzzaman.)
There are different deep learning approaches, from both supervised and unsupervised points of view. In [21], DeTone et al. designed a deep convolutional network with supervised learning for homography estimation from two images, called HomographyNet. In addition to the images, they use four points defined in both images as input for the training phase, instead of the standard 3 × 3 homography matrix [22]. Other approaches directly estimate the parameters of the homography matrix using deep convolutional models [23]. A modification of the DeTone et al. model is proposed in [24], by including hierarchical networks, in which the images are managed by independent deep networks and, like the classic iterative models, the estimation of the homography is progressively improved in several stages. In the same line, another similar model is proposed in [25], although lighter in complexity and number of parameters, whose novelty consists of the inclusion of a homography refinement module to improve the final homography matrix. This refinement is performed by minimizing the masked pixel-level photometric discrepancy between the warped image and the destination image using a gradient-descent algorithm. Compressed convolutional networks have also been applied to robust homography estimation [26]. On the other hand, in [27], Siamese networks are applied together with spatial pyramid pooling modules for supervised homography estimation. To avoid or reduce overfitting, they propose invertibility constraints, taking into account that the two homographies obtained from each pair of images (one in each direction) are inverse matrices of each other. With regard to unsupervised approaches, it is worth highlighting the Nguyen et al. paper [28], where the previous supervised proposals are improved using an unsupervised deep model in which it is not necessary to provide the ground truth labels (i.e., the correspondence points for each pair of images in the training phase). This model is the primary reference for our proposal.
Therefore, our objective is to improve the performance of deep homography estimation networks by applying an ensemble approach. This strategy, which has been successfully used in previous works [29], consists of presenting altered versions of the input to the network and combining all the outputs which the network produces by means of an aggregation function. These modified inputs are slightly disturbed versions of the original one. In this way, it is expected that the combination of the outputs produced by the deep model provides a more robust estimate of the homography than each individual one. Random color transformations are the type of input variation considered in this study, because they incorporate a significant amount of variability into the result set while the image quality is maintained.
The paper is structured as follows. Firstly, the methodology of this approach is described in Section II. The experimental results are detailed in Section III, whereas conclusions are drawn in Section IV.

II. METHODOLOGY
This section is devoted to the presentation of our method for the estimation of homographies. The rationale behind our proposal is as follows. Deep homography estimators output significantly different estimations for color transformed versions of the same input pair. Since the true homography does not vary under such color transformations, we propose to take advantage of this effect by aggregating the outputs associated with several random color transformations in order to yield a single estimated homography.
Let us denote by F the deep-learning-based homography estimator. The estimator accepts two images, I_A and I_B, of size M × N pixels as input, and returns an 8-component vector as output, which represents the two coordinates of each of the four corners that parameterize the homography:

h = F(I_A, I_B),  h ∈ R^8

Here it must be noted that tristimulus color values are considered, which lie in the interval [0, 255].

Since a tristimulus color space is assumed, it follows that the images have this arrangement:

I ∈ [0, 255]^(3 × M × N),  with components I(q, r, s)

where q is the color channel index, r is the pixel row coordinate, and s is the pixel column coordinate. Now, let us note ϕ a color transformation:

ϕ : [0, 255]^(3 × M × N) → [0, 255]^(3 × M × N)

As explained before, our proposal is to obtain an aggregated estimation by combining the homography estimations associated with several color transformations:

ĥ = ψ(S)

where ψ is an appropriate aggregation function for homographies, S is the set of homographies to be aggregated, and H is the number of elements of S:

S = {h_1, . . . , h_H}    (6)

An individual homography h_i is obtained as the output of the neural homography estimator applied to a pair of randomly transformed images:

h_i = F(ϕ_{i,1}(I_A), ϕ_{i,2}(I_B))

where the ϕ_{i,u} stand for the color transformations, and u ∈ {1, 2} indexes the two images of the pair. The procedure to randomly generate the color transformations ϕ_{i,u} is detailed next. Here we propose to consider color transformations which entail four phases to be executed sequentially. First, a gamma transformation γ_{i,u} is applied. Then, a brightness transformation β_{i,u} is done. Thirdly, a tone transformation τ_{i,u} is carried out. Finally, color clipping κ_{i,u} is performed. Hence the overall color transformation is given by the composition:

ϕ_{i,u} = κ_{i,u} ∘ τ_{i,u} ∘ β_{i,u} ∘ γ_{i,u}

The gamma transformation is defined as follows:

γ_{i,u}(I)(q, r, s) = 255 (I(q, r, s) / 255)^a

where a is a random number coming from the uniform probability distribution on the interval of real numbers [A_1, A_2]. Secondly, the brightness transformation is given by:

β_{i,u}(I)(q, r, s) = I(q, r, s) + b

where b is a random number coming from the uniform probability distribution on the interval of real numbers [B_1, B_2]. Thirdly, the tone transformation is given by:

τ_{i,u}(I)(q, r, s) = I(q, r, s) + c_q

where c_1, c_2 and c_3 are three random numbers obtained from the uniform probability distribution on the interval of real numbers [C_1, C_2]. Fourthly, the color clipping is computed as follows:

κ_{i,u}(I)(q, r, s) = min(max(I(q, r, s), 0), 255)

The color transformation considered here was previously considered by Nguyen et al. [28].
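As an illustration, the four-phase color transformation can be sketched in NumPy as follows. The parameter ranges used here are placeholders of this sketch, not the values used in the experiments (those are detailed in Table 1):

```python
import numpy as np

def random_color_transform(rng, a_range=(0.8, 1.2), b_range=(-25.0, 25.0),
                           c_range=(-25.0, 25.0)):
    """Build a random color transformation phi = kappa . tau . beta . gamma.

    The ranges stand in for [A1, A2], [B1, B2] and [C1, C2]; they are
    illustrative placeholders, not the paper's tuned values.
    """
    a = rng.uniform(*a_range)                   # gamma exponent a ~ U[A1, A2]
    b = rng.uniform(*b_range)                   # brightness offset b ~ U[B1, B2]
    c = rng.uniform(c_range[0], c_range[1], 3)  # tone offsets c_q ~ U[C1, C2]

    def phi(img):
        # img: float array of shape (3, M, N), tristimulus values in [0, 255]
        x = 255.0 * (img / 255.0) ** a          # gamma transformation
        x = x + b                               # brightness transformation
        x = x + c[:, None, None]                # tone transformation (per channel)
        return np.clip(x, 0.0, 255.0)           # color clipping
    return phi
```

Note that the random parameters are drawn once per call, so each returned `phi` is a fixed transformation that can be applied consistently to an image.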
The introduction of random color transformations proposed above implies that as many sample estimations of the true homography can be drawn as needed. As the number H of sample homographies grows, the aggregated homography which combines them all is expected to be closer to the true homography. In other words, the accuracy of the homography estimation procedure is expected to improve with respect to the application of the deep homography estimator to the original input pair of images alone.
The operation of the proposed methodology is summarized in Figure 1. First, an input pair of images (I_A and I_B) is given. Four points from one of these images (I_A) are also provided; these are the four corners whose positions are to be predicted in the other image (I_B). (These points are not illustrated in the figure for the sake of clarity.) The input pair of images (together with the four corners) forms the input of the proposed model. Then, H − 1 pairs of randomly color-transformed images are generated from the raw pair, so that H pairs of images are considered in total. The color shift process consists of a gamma, a brightness and a tone transformation followed by a color clipping step, in that order. After that, a base homography estimation method is employed to predict the homography for each pair of images. Thus, H predictions are obtained, where h_1 corresponds to the prediction for the raw pair of images and h_2, . . . , h_H correspond to the predictions for the randomly color-transformed pairs. Finally, all of these predictions are supplied to a selected aggregation function, whose output is the predicted homography (ĥ) of the proposed methodology. Note that the use of a single homography (so that H = 1) produces the same result as the prediction of the base homography estimation method alone.

FIGURE 1. An input pair of images is provided to the proposal, which is composed of H homography estimation methods. The four points of one of the images are also given, although the figure omits them for clarity. Each method offers its prediction h_i, where h_1 is the estimation for the raw input pair while the remaining predictions h_2, . . . , h_H are the estimations for the input pair after random color transformations. The proposal then predicts its output by combining the different predictions h_i with the selected consensus function.
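The procedure above amounts to test-time augmentation followed by aggregation. A minimal sketch, where the estimator `F` and the transformation factory `make_phi` are injectable callables (hypothetical stand-ins for the actual network and the random color shift):

```python
import numpy as np

def ensemble_homography(F, make_phi, img_a, img_b, H=20, psi=None):
    """Aggregate H estimates of the 8-vector (four corners) returned by F.

    F        : callable (img_a, img_b) -> np.ndarray of shape (8,)
    make_phi : callable () -> random color transformation for one image
    psi      : aggregation function over the stacked estimates
    """
    if psi is None:
        psi = lambda S: S.mean(axis=0)          # plain mean as the default psi
    estimates = [F(img_a, img_b)]               # h_1: raw input pair
    for _ in range(H - 1):                      # h_2 .. h_H: transformed pairs
        phi_a, phi_b = make_phi(), make_phi()   # independent phi_{i,1}, phi_{i,2}
        estimates.append(F(phi_a(img_a), phi_b(img_b)))
    return psi(np.stack(estimates))             # aggregated estimate h-hat
```

With H = 1 the loop body never executes and the function degenerates to a single call to the base estimator, as stated above.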

A. AGGREGATION FUNCTIONS
Several kinds of aggregation functions have been considered. In order to define each aggregation function, let S be the set of homographies, as defined in equation (6). Recall that H is the number of homographies to be aggregated by the aggregation function ψ, as previously defined.
• Mean except the worst (Mean-1). As previously defined, a homography h_i is composed of 4 points. Let µ_j be the mean of the j-th points of the homographies from S, with j ∈ {1, . . . , 4}, so that µ_j is a vector of dimension 2 × 1:

µ_j = (1/H) Σ_{i=1..H} q_{i,j}

where q_{i,j} denotes the j-th point of homography h_i. For each point q_{i,j}, the Euclidean distance between µ_j and that point is calculated.

VOLUME 8, 2020
Let h_e be the homography with the furthest j-th point from µ_j. After that, the Mean-1 aggregation function is defined as follows:

ψ_{Mean-1}(S) = mean(S \ {h_e})

• Median except the worst (Median-1). The process is the same as for Mean-1, except that the component-wise median replaces the mean:

ψ_{Median-1}(S) = median(S \ {h_e})

• Geometric Mean (GMean). The component-wise geometric mean of the homographies in S.

• Geometric Median (GMedian). The algorithm for the geometric median presented in [30] has been implemented.
• Mean and Geometric Mean, and then Mean (GMean + Mean):

ψ_{GMean+Mean}(S) = mean(mean(S), GMean(S))    (22)

• Mahalanobis per point (MahaPoint). Let C_j be the covariance matrix of the j-th points, so that C_j is a matrix of dimensions 2 × 2:

C_j = (1/H) Σ_{i=1..H} (q_{i,j} − µ_j)(q_{i,j} − µ_j)^T

where the operator T denotes the transpose of a vector. For each j-th point q_{i,j}, the Mahalanobis distance for that point is calculated:

d(q_{i,j}, µ_j) = sqrt((q_{i,j} − µ_j)^T C_j^{−1} (q_{i,j} − µ_j))    (24)

where C_j^{−1} stands for the inverse of the matrix C_j. Let T be the set of the H(1 − p) homographies with the furthest j-th points from their means µ_j, where p is a given real number with p ∈ [0, 1] which represents the fraction of homographies that will be considered to calculate the result of the aggregation function. Therefore, p is the fraction of homographies that is retained, while 1 − p is the fraction that will be excluded. After that, the MahaPoint aggregation function is defined as follows:

ψ_{MahaPoint}(S) = mean(S \ T)

• Mahalanobis per homography (MahaHomo). The procedure is similar to the one presented for MahaPoint.
While the MahaPoint aggregation function considers the points individually and discards the worst points according to their distance with respect to the mean µ_j, MahaHomo considers whole homographies instead of individual points. Thus, this aggregation function discards the worst homographies.
For each point q_{i,j}, with i ∈ {1, . . . , H} and j ∈ {1, . . . , 4}, the Mahalanobis distance between q_{i,j} and the mean µ_j of the j-th points of the homographies is calculated, following Eq. 24. After that, for each homography h_i, the sum of the distances D_i between its points q_{i,j} and their means µ_j is calculated:

D_i = Σ_{j=1..4} d(q_{i,j}, µ_j)

Let T be the set of the H(1 − p) homographies that are furthest (according to D_i) from the homography composed of the means µ_j, where p is a given real number with p ∈ [0, 1] which represents the fraction of homographies that will be considered to calculate the result of the aggregation function. Thus, p is the fraction of homographies that is retained, while 1 − p is the fraction that will be excluded. After that, the MahaHomo aggregation function is defined as follows:

ψ_{MahaHomo}(S) = mean(S \ T)

• Euclidean per point (EuclideanPoint). This aggregation function is defined in the same way as MahaPoint; the difference with respect to that aggregation function is the definition of the distance between two given points, which here is the Euclidean distance:

d(q_{i,j}, µ_j) = ‖q_{i,j} − µ_j‖_2    (28)

• Euclidean per homography (EuclideanHomo). This aggregation function is defined in the same way as MahaHomo, with the distance between two given points defined by Eq. 28.

• Manhattan per point (ManhattanPoint)
This aggregation function is defined in the same way as MahaPoint; the difference with respect to that aggregation function is the definition of the distance between two given points, which here is the Manhattan distance:

d(q_{i,j}, µ_j) = Σ_{u=1..2} |q_{i,j,u} − µ_{j,u}|    (29)

• Manhattan per homography (ManhattanHomo). This aggregation function is defined in the same way as MahaHomo, with the distance between two given points defined by Eq. 29.

Next, a brief discussion of the proposed aggregation functions is carried out. Many of them are based on the sample mean, which is a fast way to obtain an unbiased estimator of the distribution mean. Nevertheless, in the presence of outliers, the functions based on the sample median can yield more accurate results, at the expense of a slightly higher computational load. The introduction of the Mahalanobis distance in some of the functions compensates for the different scaling that may arise in the dimensions of the homographies. Finally, in several functions, the most extreme observations are removed prior to the computation of the final result, in order to filter out outliers. The computational complexity of our approach is linear with respect to the number of color transformations H, i.e., O(H), except for the aggregation functions based on the median, which have complexity O(H log H). Nevertheless, the computation of the median takes a negligible fraction of the overall runtime, so in practice the complexity is linear in H.
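Several of the aggregation functions above can be sketched in a few lines of NumPy. The following is an illustrative implementation, not the authors' code: the default value of p, the Weiszfeld iteration for the geometric median (standing in for the algorithm of [30]) and the positivity assumption in the geometric mean are choices of this sketch.

```python
import numpy as np

def psi_mean(S):
    # plain component-wise mean of the H estimated 8-vectors
    return S.mean(axis=0)

def psi_gmean(S):
    # component-wise geometric mean; assumes all coordinates are positive
    return np.exp(np.log(S).mean(axis=0))

def psi_gmedian(S, iters=200, eps=1e-9):
    # geometric median of the 8-vectors via Weiszfeld iterations
    y = S.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(S - y, axis=1), eps)
        w = 1.0 / d                                  # inverse-distance weights
        y = (w[:, None] * S).sum(axis=0) / w.sum()
    return y

def psi_euclidean_homo(S, p=0.8):
    # EuclideanHomo: discard the 1 - p fraction of homographies whose
    # summed point-to-mean Euclidean distance D_i is largest, then average
    Q = S.reshape(len(S), 4, 2)                      # points q_{i,j}
    mu = Q.mean(axis=0)                              # per-point means mu_j
    D = np.linalg.norm(Q - mu, axis=2).sum(axis=1)   # D_i per homography
    keep = np.argsort(D)[: max(1, int(round(p * len(S))))]
    return S[keep].mean(axis=0)
```

Each function maps the stacked set S (an H × 8 array) to a single aggregated 8-vector, so they are interchangeable as the ψ of the proposal.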

III. EXPERIMENTAL RESULTS
This section depicts the experiments that have been carried out and the obtained results. First, the software and hardware which have been employed are described in Subsection III-A. Then, Subsection III-B specifies the image dataset used. After that, the parameter selection of our proposed model is depicted in Subsection III-C. Finally, the results which have been obtained are reported in Subsection III-D.

A. METHODS
A well-known recent method from the state of the art has been selected in order to test the approach. The homography estimation model chosen for that purpose is the method (denoted Base) proposed in [28]. Given two input images, this method predicts the homography between them. The code is written in Python using the TensorFlow and OpenCV libraries, and can be downloaded from its website.1 Our proposed method is also implemented in Python, employing the same libraries mentioned above.
The reported experiments have been carried out on a 64-bit Personal Computer with an eight-core Intel i7 3.60 GHz CPU, 32 GB RAM, and an NVIDIA GeForce GTX 1080 Ti.

B. DATASET
The COCO dataset [31] has been used in the experiments. This is a public dataset intended for large-scale object detection, segmentation, and captioning.2 The version that we have used is the 2014 Test set, which is formed by 40,775 images and can be downloaded from its website.3 In order to test the performance of the competing homography estimation methods, a synthetic dataset has been created from the COCO dataset. The generation of this dataset is similar to the one used in other articles from the state of the art [21], [28], [32]–[34]. This way, 5,000 pairs of images have been generated to compose the testing dataset, where each pair is formed by two images: an image chosen from the COCO dataset (the first 5,000 images from the test set have been selected) and a synthetic image obtained by a random transformation of the chosen image. The applied transformation injects random color, brightness, and gamma shifts. Additionally, the amount of image overlap is controlled by a point perturbation parameter ρ. The objective of this image pair generation process is to produce a dataset that covers a wide range of adverse conditions (illumination variation, large image displacement, etc.).

1 https://github.com/tynguyen/unsupervisedDeepHomographyRAL2018
2 http://cocodataset.org/
3 http://images.cocodataset.org/zips/test2014.zip

FIGURE 2. Graphical example of the operation of the proposal. In (a), the input pair of frames between which the homography is to be predicted is presented. In (b), H predictions of that homography are shown (in this case, the number of predicted homographies H is 5), where each image is composed of two subimages: the left one shows the selected four points to be predicted, while the right one exhibits the ideal result (ground truth) and the predicted homography obtained by the considered estimator. Additionally, the first row of (b) exhibits the prediction for the raw pair of images (Base), while the remaining rows of (b) show the predictions provided by the homography estimator for the input pair after random color transformations. In (c), the combination of the predictions given in (b) by means of an aggregation function builds the homography estimation of the proposal. In this figure, Mean and Median are considered as examples of aggregation functions that can be applied to combine the predictions of (b); their predictions are illustrated in (c) to establish a visual comparison and observe how different they can be. Furthermore, the Base prediction (i.e., the result of the first row in (b)) as well as the ground truth are also reported.
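The pair generation step can be sketched with the standard four-point perturbation scheme of [21]: perturb the four image corners by up to ρ pixels and solve for the homography they induce. The generic DLT solver below is a stand-in for this sketch (in practice OpenCV's `cv2.getPerspectiveTransform` computes the same matrix); the image size and ρ value in the usage are illustrative.

```python
import numpy as np

def solve_homography(src, dst):
    # Direct Linear Transform: 3x3 matrix mapping the 4 src points to dst
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    Hm = Vt[-1].reshape(3, 3)       # null-space vector holds the 9 entries
    return Hm / Hm[2, 2]            # normalize so that H[2,2] = 1

def synth_pair_corners(rng, width, height, rho):
    # Perturb each of the four image corners by up to rho pixels and
    # return (original corners, perturbed corners, induced homography)
    corners = np.array([[0.0, 0.0], [width - 1, 0.0],
                        [width - 1, height - 1], [0.0, height - 1]])
    perturbed = corners + rng.uniform(-rho, rho, size=(4, 2))
    return corners, perturbed, solve_homography(corners, perturbed)
```

Warping the source image with the induced homography then yields the synthetic second image of each pair, with ρ controlling the amount of overlap between the two.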

C. PARAMETER SELECTION
A wide range of parameter values has been chosen in order to test the approach. As presented in Section II, several aggregation functions have been considered to compute the output of the ensemble. The number H of members of the ensemble is also analyzed, while the color transformation parameters, as well as the point perturbation used to generate the synthetic dataset, are detailed in Table 1.

D. RESULTS
In order to test the performance of the proposed approach, a comparison has been established, and the results of these experiments are shown from both a qualitative and a quantitative point of view. Figure 2 illustrates the operation of the proposal. First, a pair of input images is presented (a), where the aim is to predict the homography between them, as detailed in Section II. Secondly, this pair of images is given to the homography estimator, which returns the prediction of the homography between both images as output (b, first row).

1) QUALITATIVE RESULTS
After that, a random color transformation is applied to the pair of images. These transformed images are then supplied to the homography estimation method, which produces a new prediction as output (b, second row). This process of random color transformation and homography prediction is repeated H − 1 times, so that H predictions are obtained at the end of the process (in this case, H = 4). Each prediction output is composed of two subimages: the points (and their connections) to be predicted are shown in the left subimage, while the ground truth and the predicted homographies are shown in the right subimage. Finally, all predicted homographies are combined by an aggregation function, and a consensus homography is provided as output (c). In this case, the outputs of the Base method and of the Mean and Median aggregation functions are shown in order to observe how different the output can be. Figure 3 reports several qualitative results. In this figure, the homographies produced by the Mean and Median consensus proposals are compared visually against the Base method and the ideal result (ground truth). These Mean and Median aggregation functions have been chosen because they are the simplest ones considered in this work.
The selected pairs of images represent a wide range of situations in which Base does not achieve a high performance, while Mean and Median (especially Mean) surpass it. Usually, the predicted homographies do not exactly match the ideal result. However, in general, the predictions produced by the consensus offer a better performance than the Base method. As can be observed, the consensus homography predictions are close to the ground truth homography. In particular, Mean performs better than Median, with predictions closer to the ideal results.

2) QUANTITATIVE RESULTS
The performance of the selected methods has also been compared from a quantitative point of view. As detailed in Section II, a homography composed of 4 points with 2 coordinates per point can be represented as an 8-dimensional vector. This way, a quantitative performance measure can be obtained from the comparison of the predicted and the ideal homographies. This comparison has been established according to several well-known error measures chosen for that purpose. The selected measures quantify the error between the predicted and the ideal result for each pair of images, providing a real value from 0 onwards, where lower is better. Let h̄, ĥ ∈ R^8 be the ground truth and predicted homographies, respectively, and let K = 8 be the dimension of the homography vectors. The measures are defined as follows:

• Euclidean Distance Point Error (EDPE):

EDPE = (1/4) Σ_{j=1..4} ‖h̄_j − ĥ_j‖_2

where h̄_j, ĥ_j ∈ R^2 denote the j-th points of the ground truth and predicted homographies.

• Root Mean Squared Error (RMSE):

RMSE = sqrt((1/K) Σ_{k=1..K} (h̄_k − ĥ_k)^2)

• Mean Absolute Error (MAE):

MAE = (1/K) Σ_{k=1..K} |h̄_k − ĥ_k|

Table 2 shows the mean performance yielded by each approach over the dataset. Due to the large number of tuned configurations, only the best configuration is reported for several of the considered aggregation functions. The first key point is that all proposed approaches improve the performance of the Base method. As can be observed, not only are the mean errors lower, but the standard deviations of the errors are also lower. Thus, the proposed methodology with an aggregation function is more trustworthy than the employment of a single base method. Moreover, the GMean + Mean proposal achieves the best performance, and GMean and Mean also perform well. It is interesting to see that the approaches that use the median function to compute their prediction are worse than those that use the mean function.
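Under the 8-vector representation, the three error measures can be sketched as follows. The averaging over the four points in EDPE follows the definition above; the helper names are illustrative.

```python
import numpy as np

def edpe(h_true, h_pred):
    # mean Euclidean distance over the four corresponding corner points
    pt = np.reshape(h_true, (4, 2))
    pp = np.reshape(h_pred, (4, 2))
    return float(np.linalg.norm(pt - pp, axis=1).mean())

def rmse(h_true, h_pred):
    # root mean squared error over the K = 8 vector components
    return float(np.sqrt(np.mean((np.asarray(h_true) - np.asarray(h_pred)) ** 2)))

def mae(h_true, h_pred):
    # mean absolute error over the K = 8 vector components
    return float(np.mean(np.abs(np.asarray(h_true) - np.asarray(h_pred))))
```

All three return a non-negative real value, where lower is better, so they can be averaged directly over the 5,000 test pairs.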
It is also interesting to analyze the behavior of the proposal according to the number of members of the ensemble, i.e., the parameter H. Figure 4 reports the performance yielded by several selected approaches for the different tuned values of the parameter H. In order to show the behavior more clearly, the MahaHomo, EuclideanPoint, EuclideanHomo, ManhattanPoint and ManhattanHomo approaches have not been included in the figure, since they exhibit a similar behavior. Note that H = 1 represents the Base method. The shape of the curves is very similar for the considered measures. Additionally, for the best proposals, the higher the value of H, the better their performance. However, this effect is not present in the case of the median-based approaches, where the error slightly oscillates. It must be highlighted that the improvement of the proposal as H increases becomes less prominent when H > 20.

IV. CONCLUSION
In this paper, a new ensemble-based strategy is proposed to improve homography estimation methods. This proposal combines the outputs of a deep convolutional network for homography estimation whose inputs (pairs of images) are perturbed by color transformations. This combination is carried out by several aggregation functions. Among them, the one based on the geometric mean must be highlighted, whose results according to different metrics (EDPE, RMSE, and MAE) present a significant improvement with respect to a single execution of the neural homography estimator. Additionally, as the number of combined outputs grows, the quality of the homography estimate increases consistently. This fact proves the viability and stability of the presented proposal.