Exploring Performance Bounds of Visual Place Recognition Using Extended Precision

Recent advances in image description and matching have allowed significant improvements in Visual Place Recognition (VPR). The wide variety of methods proposed so far and the growing interest in the field have made the evaluation of VPR methods an important task. As part of the localization process, VPR is a critical stage for many robotic applications and it is expected to perform reliably in any location of the operating environment. To support the design of more reliable and effective localization systems, this letter presents a generic evaluation framework based on the new Extended Precision performance metric for VPR. The proposed framework allows assessment of the upper and lower bounds of VPR performance and finds statistically significant performance differences between VPR methods. The proposed evaluation method is used to assess several state-of-the-art techniques under a variety of imaging conditions that an autonomous navigation system commonly encounters on long-term runs. The results provide new insights into the behaviour of different VPR methods under varying conditions and help to decide which technique is most appropriate for the nature of the venture or the task assigned to an autonomous robot.


I. INTRODUCTION
Visual place recognition (VPR) is the ability of a robot to decide whether an image shows a previously visited place. VPR is a fundamental task in many robotic endeavours and has therefore seen great advances in recent times, both in existing algorithms and in new techniques [27]. VPR approaches are often compared with one another in order to better understand the advantages and disadvantages of each technique, so that each can attain its full potential when deployed. Among the state-of-the-art methods, only some have previously been compared with each other, and to a limited extent, in [56]. VPR techniques are often rated on their performance on different datasets, each exhibiting a different intensity of changing variables including illumination [34], [42], presence of dynamic objects [7], [55], viewpoint [19], [28] and seasonal variations [35], [38]. These factors change the appearance of places, which is the main reason why VPR remains a challenge in autonomous robotic navigation. Several experiments have shown that each VPR technique may have an edge on a particular dataset or type of appearance change [49], but a critical analysis and comparison of these performance differences remains largely unexplored.
This letter proposes a new performance metric denoted as Extended Precision (EP) and an evaluation framework which aims to tackle the potentially overlooked features in previous VPR performance comparisons. The evaluation process consists of two phases. The first explores the upper and lower performance bounds of VPR techniques across an environment in order to assess the reliability of the image matching in a changing environment. The second phase uses a statistical approach to identify performance differences between VPR methods. EP is obtained by combining several features of a Precision-Recall Curve into a scalar value, which is used in the evaluation framework to measure VPR performance and carry out statistical tests. The proposed framework is then employed to assess several state-of-the-art VPR methods over different datasets, each presenting different types of environmental changes. The results provide new insights into the behaviour of VPR under varying conditions and can indicate the most appropriate technique to employ according to the nature of the venture or the task assigned to an autonomous robot.
The rest of the letter is organized as follows. Section II provides an overview of related work. Section III describes the proposed framework and metric. The experimental results are presented and discussed in Section IV. Finally, conclusions are given in Section V.

II. RELATED WORK
Visual place recognition (VPR) is an arduous endeavour in the field of robotic navigation, with the primary goal of accurately recognizing a location from visual information. Despite the significant advancements in recent years, VPR still remains a perfectible task due to the extreme viewpoint and condition changes it faces in real-world situations. There have been several improvements to state-of-the-art VPR techniques, along with new additions to the list, over the last decade. The core problem of VPR is image matching; indeed, most of the effort of the research community in recent years has been directed towards robust image representation techniques. Early approaches were based on hand-crafted image descriptors. SIFT [25] is a local feature descriptor that is used for VPR in [48]. SIFT detects keypoints in an image using Difference-of-Gaussian (DoG) and uses Histogram-of-Oriented-Gradients (HOG) to compute a descriptor of their neighbourhood. SURF is inspired by SIFT but is more efficient, and it is used in a variety of VPR approaches [12], [36]. CenSurE [2] is used in FrameSLAM [23]. CenSurE detects keypoints in images using centre-surround filters across multiple scales at each pixel location. Other important hand-crafted techniques are Bag-of-Visual-Words (BOW) [40] and Vector of Locally Aggregated Descriptors (VLAD) [21]. They partition the feature space, such as that of SIFT descriptors, into a fixed number of visual words in order to obtain a more compact yet informative image representation that allows more efficient image matching. For example, FAB-MAP 2.0 [12] and [16] use BOW and VLAD respectively. Gist [39] is an example of a global image descriptor, which is used for image matching in [37], [50] and [46]. Gist extracts global features from an image using a set of Gabor filters at different orientations and frequencies. The results are then averaged to obtain an image representation in the form of a vector. Another global descriptor is Histogram-of-Oriented-Gradients (HOG) [13], [18]. It calculates the gradient of all image pixels and uses the results to create a histogram, with each bin representing a range of gradient angles and carrying the sum of the gradient magnitudes. McManus et al. used HOG for VPR in [31]. SeqSLAM [35] performs localization by comparing a sequence of images against previously visited places. SeqSLAM does not base the image representation on local or global features but uses intensity patch normalization instead.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
In recent years, several VPR approaches based on machine learning have been proposed. The features computed by a pre-trained AlexNet can be used for VPR [20]. In particular, the features extracted from the conv3 layer are the most robust to condition variations, while those from pool5 are better for viewpoint changes [57]. RegionVLAD [22] improves image retrieval speed and accuracy thanks to its computationally light CNN-based regional approach combined with VLAD. CALC [33] is a convolutional auto-encoder that takes a distorted version of the base image as input and regenerates the HOG descriptor. NetVLAD [3], an advanced version of VLAD that is commonly used for image retrieval, consists of two stages that are trainable end-to-end. The first is a CNN that extracts the features from an image and the second is a layer that combines them to form an image descriptor by mimicking the behaviour of VLAD. AMOSNet and HybridNet [8] have the same architecture as CaffeNet [24] but are trained differently. While AMOSNet has been trained from scratch on the Specific Places Dataset (SPED) [8], HybridNet uses the weights of the top-5 convolutional layers from CaffeNet, which is trained on the ImageNet dataset [43]. Cross-Region-Bow [11] is a cross-convolution technique that collects features from convolutional layers. It selects the 200 most energetic regions from the layers of the object-centric VGG-16 [45] and describes them using the activations from prior convolutional layers. The regional maximum activation of convolutions (R-MAC) [51] operates on the principle that region-based description of features can increase matching performance. It builds CNN-based descriptors, which have proved efficient for image search, by encoding several image regions through max-pooling; the results are further improved by geometric re-ranking and query expansion.
With the significant growth in the number of VPR techniques, the evaluation of performance differences between these algorithms has become increasingly important. Previously, a different assessment methodology has been proposed for each technique, mostly based on Precision-Recall Curves [5], [16], [29], [44]. Ehsan et al. [15] present a performance comparison that evaluates the limitations of image feature detectors using the repeatability measure; although it does not address VPR, it draws attention to the importance of performance analysis. Among the VPR techniques mentioned above, only a few have been compared before [56], using three standard metrics: Matching Performance, Matching Time and Memory Footprint. However, the datasets used in that experimental setup were only of moderate size, which limits the generality of the conclusions.

III. EVALUATION FRAMEWORK
The proposed evaluation framework consists of two main phases. The first determines the upper and lower performance bounds of a VPR method in a given operating environment, which reveals how consistent the performance is across that environment. The second phase is designed to compare VPR methods. Comments in [54] suggest that many evaluation approaches tend to emphasize beating the latest benchmark numbers without considering whether the improvement of a vision system over other methods is statistically significant. This consideration can be extended to VPR evaluation, where most methods seem to have confined themselves to particular test conditions to demonstrate their superiority over competing techniques. Driven by these motivations, the second phase of the evaluation framework uses McNemar's test to determine whether the performance differences are statistically significant or due to random artefacts in the data. The evaluation framework is based on the new Extended Precision (EP) to measure VPR performance. As detailed below, EP addresses several shortcomings of the existing metrics and can also be used independently of the proposed framework to evaluate VPR performance.

A. Extended Precision Measure
VPR tasks are characterized by datasets with a prominent skew, where positive matches for a query image are rare compared with negative matches. As Precision-Recall curves (PR-Curves) are preferable with imbalanced data, they are frequently used to evaluate VPR [14], [41]. Precision (P) and Recall (R) are computed from the outcome of a VPR algorithm: the correct matches are regarded as True Positives (TP), whereas incorrect matches reported as correct are regarded as False Positives (FP). The matches erroneously excluded from the query results are denoted as False Negatives (FN). Precision is the ratio between the correct matches and the total number of predicted positive matches. Recall denotes the proportion of real positive cases that are correctly identified as positive matches. Formally:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \qquad (1)$$

A PR-Curve shows the relation between Precision and Recall and can be obtained by varying an algorithm's parameters [35], the threshold to call a positive match [56] or the number of retrieved images [16]. A PR-Curve can be summarized with several indices. Area Under the PR-Curve (AUC) [14] indicates VPR performance with a value between 0 and 1. However, AUC does not retain any information regarding the features of the original PR-Curve, including whether Precision ever reaches 1 at any Recall value. R_P100 [6] is also an important performance indicator; it represents the Recall value at which the Precision drops from 100%, namely the highest Recall that can be reached without any FP. As a single FP may cause severe failures for many robotic applications [6], [30], R_P100 is considered a good performance indicator and is widely employed for VPR evaluation [6], [26]. However, R_P100 is not capable of determining the lower performance bounds of a VPR method. Indeed, R_P100 cannot be determined for those PR-Curves that never hit 100% Precision. To circumvent this problem we introduce Extended Precision:

$$EP = \frac{P_{R0} + R_{P100}}{2} \qquad (2)$$

where P_R0 denotes the Precision at the minimum Recall value and the factor '2' in the denominator ensures EP ∈ [0, 1]. If P_R0 < 1, R_P100 is set to 0 and EP depends only on the Precision at minimum Recall, while for P_R0 = 1, EP is greater than 0.5 and works similarly to R_P100. An example is given in Fig. 1. VPR1 has P_R0 = 1 and R_P100 = 0.6, so the corresponding EP is 0.8. The Precision of VPR2 is constantly below 1, thus R_P100 is undefined and set to 0. The resulting EP for VPR2 is 0.4. According to the EP definition, VPR1 outperforms VPR2. EP combines P_R0 and R_P100 into a single scalar value which provides more comprehensive insights into VPR performance than using them individually. P_R0 is determined by the number of FPs before the first TP and can only measure the precision of a method, without describing how the performance is affected by including more query results. R_P100 indicates the occurrence of the first FP but its applicability requires P_R0 = 1. Therefore, it cannot be computed for every PR-Curve and it cannot measure the lower performance bounds of VPR.
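The definition can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `extended_precision` and its argument names are hypothetical, and the formula EP = (P_R0 + R_P100) / 2 is taken from the description above (factor 2 in the denominator, R_P100 forced to 0 whenever P_R0 < 1).

```python
def extended_precision(p_r0: float, r_p100: float) -> float:
    """Combine P_R0 and R_P100 into a single score in [0, 1].

    p_r0   -- Precision at the minimum Recall value.
    r_p100 -- highest Recall reached without any false positive;
              ignored (treated as 0) when p_r0 < 1, as described in the text.
    """
    if not (0.0 <= p_r0 <= 1.0 and 0.0 <= r_p100 <= 1.0):
        raise ValueError("p_r0 and r_p100 must both lie in [0, 1]")
    if p_r0 < 1.0:
        # The curve never reaches 100% Precision, so R_P100 is undefined.
        r_p100 = 0.0
    return (p_r0 + r_p100) / 2.0


# The two hypothetical systems of Fig. 1. A P_R0 of 0.8 for VPR2 is implied
# by its reported EP of 0.4.
ep_vpr1 = extended_precision(1.0, 0.6)
ep_vpr2 = extended_precision(0.8, 0.0)
```

With these inputs the two scores come out at roughly 0.8 and 0.4, matching the values quoted for VPR1 and VPR2.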

B. Identification of the Upper and Lower Performance Bounds
In the first phase of the proposed framework, the upper and lower performance bounds of VPR techniques are identified. The evaluation process uses two sets of images: a reference dataset (I_REF), which represents the previously visited places, and a query dataset (I_e), which shows the same locations as the reference dataset but under different viewing conditions (e.g., from a different viewpoint). This phase consists of the following steps, which are repeated for every VPR method to assess.
i) Let v be a VPR technique and q a query image. The images in I_REF are ranked by their similarity with q using v. Then, a PR-Curve for q is computed by varying the number of retrieved images from 1 to the last image that corresponds to R = 1. For each step, a confusion matrix is computed using the ground truth; the corresponding P and R values form a point of the PR-Curve. This process is repeated for all q ∈ I_e to produce a set of curves for method v.
ii) For each PR-Curve from step i), the pair (P_R0, R_P100) is computed. Then equation (2) is used to compute the set of EP values for v:

$$E_v = \{EP_1, EP_2, \ldots, EP_n\} \qquad (3)$$

iii) The upper and lower performance bounds for v on the dataset I_e correspond to the highest and lowest EP values in E_v respectively:

$$EP_{Max} = \max(E_v), \qquad EP_{min} = \min(E_v) \qquad (4)$$

iv) The proposed approach considers precision crucial for VPR, as FPs might have a severe impact on robotic applications [6]. To this end, the ratio of query images with EP > 0.5 is a relevant performance indicator:

$$S_{P100} = \frac{|\{EP \in E_v : EP > 0.5\}|}{n} \qquad (5)$$

where n is the number of images in I_e.
It is worth mentioning that when VPR is cast as an image retrieval task, EP > 0.5 indicates that the first retrieved image ranked by similarity is a correct match. Thus, S_P100 represents the share of successful single matches in I_e and it is particularly useful to assess VPR-based systems that use a single image to perform localization.
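The four steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: each query is represented by a hypothetical list of booleans marking, in ranked order, whether each retrieved reference image is a true match. P_R0 is taken as the Precision at the lowest non-zero Recall, which is one reading of "the number of FPs before the first TP". The function names are invented for the example.

```python
from typing import List, Sequence, Tuple


def pr_curve(relevance: Sequence[bool]) -> List[Tuple[float, float]]:
    """Step i: (Precision, Recall) after retrieving 1, 2, ... images,
    stopping at the rank where Recall reaches 1."""
    total_pos = sum(1 for r in relevance if r)
    if total_pos == 0:
        raise ValueError("the query needs at least one true match")
    points, tp = [], 0
    for k, rel in enumerate(relevance, start=1):
        tp += 1 if rel else 0
        points.append((tp / k, tp / total_pos))
        if tp == total_pos:
            break
    return points


def ep_from_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Step ii: derive (P_R0, R_P100) from a PR-Curve and combine them."""
    # Assumption: P_R0 is the Precision at the lowest non-zero Recall.
    p_r0 = next(p for p, r in points if r > 0)
    # Highest Recall reached while Precision is still 1 (0 if it never is).
    r_p100 = max((r for p, r in points if p == 1.0), default=0.0)
    return (p_r0 + r_p100) / 2.0


def performance_bounds(rankings: Sequence[Sequence[bool]]) -> Tuple[float, float, float]:
    """Steps iii and iv: return (EP_Max, EP_min, S_P100) for one VPR method."""
    eps = [ep_from_curve(pr_curve(r)) for r in rankings]
    share = sum(1 for ep in eps if ep > 0.5) / len(eps)
    return max(eps), min(eps), share


# Three toy queries: two start with a true match, one starts with a false positive.
toy_rankings = [[True, True, False, True], [False, True], [True]]
ep_max, ep_min, s_p100 = performance_bounds(toy_rankings)
```

In this toy case the second query starts with a false positive, so its EP falls below 0.5 and only two of the three queries count towards S_P100.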

C. Identification of Statistically Significant Performance Differences
This part of the proposed evaluation framework is concerned with determining whether the performance differences between VPR techniques are statistically significant or are due to random artefacts in the data. Following the approach described in [15], the proposed evaluation method interprets the process of testing VPR against a sequence of query images as a series of success/failure trials on the same dataset. Under this assumption, the resulting distribution follows a binomial model and the comparison between two algorithms (v and w) can be addressed with McNemar's test [17], [32]:

$$\chi^2 = \frac{(|N_{sf} - N_{fs}| - 1)^2}{N_{sf} + N_{fs}} \qquad (6)$$

where the '−1' in the numerator is a continuity correction; N_sf denotes the number of trials where the algorithm v succeeded and w failed; N_fs denotes the number of trials where v failed and w succeeded. The proposed framework uses the Z score, which is obtained as the square root of equation (6):

$$Z = \frac{|N_{sf} - N_{fs}| - 1}{\sqrt{N_{sf} + N_{fs}}} \qquad (7)$$

When N_sf + N_fs ≥ 30, the test is reliable and χ² has a chi-squared distribution with one degree of freedom. As with the Chi-Square test, the cut-off point for the 95% significance level is 3.84, which corresponds to 1.96 for Z. Therefore, if the Z value is larger than 1.96, a difference of that size would arise from chance artefacts in the data only one time in twenty (p = 0.05).
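As a worked illustration of the test statistic, the sketch below computes the continuity-corrected Z score from the two discordant counts. The function name `mcnemar_z` is hypothetical. Clamping the numerator at zero (so that equal counts give Z = 0 instead of a small negative number) is a choice made here for the example and is not stated in the text.

```python
import math


def mcnemar_z(n_sf: int, n_fs: int) -> float:
    """Continuity-corrected McNemar Z score.

    n_sf -- trials where method v succeeded and method w failed.
    n_fs -- trials where method v failed and method w succeeded.
    """
    if n_sf < 0 or n_fs < 0:
        raise ValueError("counts must be non-negative")
    discordant = n_sf + n_fs
    if discordant == 0:
        return 0.0  # the two methods never disagree
    # Assumption: clamp at 0 so the correction cannot make Z negative.
    numerator = max(abs(n_sf - n_fs) - 1, 0)
    return numerator / math.sqrt(discordant)


# 30 discordant trials, 25 of them in favour of v.
z = mcnemar_z(25, 5)
is_significant = z > 1.96  # 95% level for a single test
```

Here Z is 19 / sqrt(30), about 3.47, so the difference would be called significant at the 95% level. The sample also satisfies the N_sf + N_fs ≥ 30 reliability condition.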
McNemar's test cannot be used to compare more than two VPR methods at the same time; thus, a series of pairwise comparisons is made. However, executing multiple statistical tests requires a correction to the significance level of each single test. Bonferroni correction is a well-known solution to this problem: let α be the significance level for the whole family of N tests; then each test needs to be performed with a significance level of α/N. To perform McNemar's test, it is necessary to determine when an algorithm fails or succeeds. In the proposed framework, a success occurs when EP is greater than a threshold t; otherwise, a failure is recorded. EP is characterized by two intervals: 0 to 0.5, where the value is determined by P_R0, and 0.5 to 1, where EP mimics the behaviour of R_P100. This feature of EP allows comparing VPR methods from different perspectives by using multiple thresholds. If McNemar's test is performed with a threshold ≤ 0.5, the VPR pair is compared on the basis of P_R0, which is determined by the number of FPs before the first occurrence of a TP. Conversely, with thresholds greater than 0.5, successes and failures are determined by R_P100, namely by the length of the TP sequence before the first FP occurrence in the retrieved images.
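The effect of the Bonferroni correction on the Z cut-off can be illustrated with the Python standard library. This is a sketch under assumptions: the family of tests is taken to be the set of pairwise comparisons between methods, and the test is treated as two-sided, consistent with Z = 1.96 for α = 0.05. The text does not state which family size was used in the experiments. The method names are placeholders and `bonferroni_critical_z` is a hypothetical helper.

```python
from itertools import combinations
from statistics import NormalDist


def bonferroni_critical_z(alpha: float, n_tests: int) -> float:
    """Two-sided critical Z for one test in a family of n_tests tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    per_test_alpha = alpha / n_tests  # Bonferroni: alpha / N
    return NormalDist().inv_cdf(1.0 - per_test_alpha / 2.0)


# Placeholder method names; five methods give ten pairwise comparisons.
methods = ["method_a", "method_b", "method_c", "method_d", "method_e"]
pairs = list(combinations(methods, 2))

single_test_z = bonferroni_critical_z(0.05, 1)
corrected_z = bonferroni_critical_z(0.05, len(pairs))
```

For a single test the cut-off is the familiar 1.96. With ten pairwise tests each comparison is run at α/N = 0.005 and the cut-off rises to about 2.81.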
Let T be the set of thresholds used to execute the McNemar's test variant:

$$T = \{t_1, t_2, \ldots, t_p\} \qquad (8)$$

For a pair of VPR methods to compare, a set of Z scores is computed using equation (7), one for each value t_i ∈ T:

$$Z_{v,w} = \{z_1, z_2, \ldots, z_p\} \qquad (9)$$

where v and w denote the tested VPR techniques and z_i is the value of Z obtained with the i-th threshold value in T. Although there is no specific selection criterion for T, a good practice is to select the threshold values so as to capture the entire spectrum of variations of the performance metric [15]. As detailed in Section IV, a good setup for EP uses nine evenly spaced thresholds between 0.1 and 0.9.
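The threshold sweep can be sketched as below. The inputs are hypothetical per-query EP lists for two methods, aligned so that position i refers to the same query image. Success is EP > t, as described above, and the Z helper clamps the corrected numerator at zero, which is an assumption of this example. The function names are invented.

```python
import math
from typing import List, Sequence

# Nine evenly spaced thresholds between 0.1 and 0.9.
THRESHOLDS = [round(0.1 * i, 1) for i in range(1, 10)]


def z_score(n_sf: int, n_fs: int) -> float:
    """Continuity-corrected McNemar Z (clamped at 0, an assumption)."""
    discordant = n_sf + n_fs
    if discordant == 0:
        return 0.0
    return max(abs(n_sf - n_fs) - 1, 0) / math.sqrt(discordant)


def z_scores_over_thresholds(ep_v: Sequence[float],
                             ep_w: Sequence[float],
                             thresholds: Sequence[float] = THRESHOLDS) -> List[float]:
    """One Z score per threshold for the method pair (v, w)."""
    if len(ep_v) != len(ep_w):
        raise ValueError("both methods must be evaluated on the same queries")
    scores = []
    for t in thresholds:
        # Discordant trials: exactly one of the two methods exceeds t.
        n_sf = sum(1 for a, b in zip(ep_v, ep_w) if a > t and b <= t)
        n_fs = sum(1 for a, b in zip(ep_v, ep_w) if a <= t and b > t)
        scores.append(z_score(n_sf, n_fs))
    return scores


# Toy data: v always scores 0.8 and w always scores 0.3 on ten queries.
scores = z_scores_over_thresholds([0.8] * 10, [0.3] * 10)
```

In this toy case the methods only disagree for thresholds from 0.3 to 0.7, where v succeeds and w fails on every query. Below 0.3 both succeed and from 0.8 upwards both fail, so Z is zero there.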

IV. RESULTS
The proposed evaluation framework is employed to compare several state-of-the-art VPR methods: AMOSNet, HybridNet [8], R-MAC [51], NetVLAD [3] and Cross-Region-Bow [11]. To test AMOSNet and HybridNet, we used the models trained on the SPED dataset [8] by their authors [9]. The implementation of R-MAC used for the experiments has been obtained from [52]. For a fair comparison, the geometric verification module has been deactivated for the tests. The MATLAB source of NetVLAD is available from [4] along with several sets of weights. The results presented in this section are obtained using the VGG-16 model trained on the Pittsburgh 250k dataset [53] and using a dictionary with 64 words. Cross-Region-Bow is also available as a MATLAB implementation [10]. For the experiments, the VGG-16 model pre-trained on the ImageNet dataset [43] has been used with a BoW dictionary of 10k words.
In order to obtain comprehensive results, VPR methods have been assessed under different image variations using the five datasets summarized in Table I and shown in Fig. 2. Berlin Halensee Strasse [11] includes two traverses of an urban environment. This dataset exhibits moderate to strong viewpoint variations and changes in appearance due to dynamic elements such as cars and pedestrians. The ground truth is obtained using GPS coordinates to build place-level correspondence, using a maximum distance of 25 meters as a criterion. For the experiments, the image set berlin-halenseestrasse-1 has been used as a reference and berlin-halenseestrasse-2 as a query dataset. Lagout and Corvin [29] are synthetic datasets consisting of several flybys around buildings. Lagout traverses at 0° and 15° are used as reference and query datasets respectively to test VPR techniques under moderate viewpoint changes. Similarly, the Corvin loops captured at ground level and at 45° are used to assess VPR methods under very strong viewpoint changes. The ground truth data for Lagout and Corvin are made available by their authors [1]. Gardens Point Dataset [8] consists of three traverses of the Queensland University of Technology (QUT). Two were recorded in daylight by walking on two opposite sides of the walking path (laterally changed viewpoint) and the third at night on the right side. The results are presented for illumination changes; thus, the right-day and right-night traverses are used as reference and query datasets respectively. The traverse footages are synchronized, so the ground truth is obtained by frame correspondences. For the test, a reasonable tolerance is to consider a match correct if the query and the retrieved images fall within a window of 5 frames [26], that is, a retrieved reference image must fall between n − 2 and n + 2, where n is the query image index. Nordland Dataset [47] is built from footage recorded in every season along a railroad in Norway. It shows extreme seasonal changes, especially between the summer and winter journeys, which are used as reference and query datasets to obtain the results presented in this letter. Similarly to Gardens Point, the footages are synchronized, but the train travels considerably faster than a walking human. Thus, the ground truth is built considering a tolerance of 11 frames as indicated in [26], that is, a reference image must fall between n − 5 and n + 5.
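The frame-based ground truth rule used for the synchronized traverses can be written as a small check. This is a minimal sketch with hypothetical names. It assumes the tolerance is given as an odd window size in frames (5 or 11 in the text) centred on the query index.

```python
def is_true_match(query_index: int, reference_index: int, window: int) -> bool:
    """Return True if the retrieved reference frame lies inside the
    tolerance window centred on the query frame index.

    window -- total window size in frames, assumed odd (5 gives n-2..n+2,
              11 gives n-5..n+5).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window is assumed to be a positive odd number of frames")
    half_width = window // 2
    return abs(reference_index - query_index) <= half_width


# A 5-frame window and an 11-frame window, as quoted in the text.
walk_ok = is_true_match(100, 102, window=5)    # inside n-2..n+2
walk_bad = is_true_match(100, 103, window=5)   # outside
train_ok = is_true_match(100, 95, window=11)   # inside n-5..n+5
```

The GPS-based criterion quoted for Berlin (a maximum distance of 25 meters) would need a distance computation between coordinates and is not covered by this sketch.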

A. Upper and Lower Performance Bounds Discussion
The results obtained with the first phase of the proposed framework are summarized in Fig. 3. Green bars represent the upper performance bounds of VPR methods and correspond to EP_Max. Similarly, the lower performance bounds EP_min are represented with red bars. The values of S_P100 are indicated in yellow and are read on the right-side y-axis.
In terms of EP_Max, the considered VPR methods exhibit comparable performance on Berlin, Gardens Point and Lagout, with EP values equal or close to 1. Thus, all the VPR techniques reach a good performance peak with dynamic objects, illumination and moderate to strong viewpoint changes. The prominent viewpoint variations of Corvin pose a hard challenge and none of the tested methods can reach EP = 1. At every location in Corvin, the assessed methods cannot recover all the true matches for a query image without including one or more FPs in the result set. EP_Max < 1 indicates that there are no easily recognizable places that a robot system can use to localize itself reliably. Similarly, Nordland is also a difficult environment, as we used the summer and winter traverses that exhibit prominent variations in appearance. Indeed, except for NetVLAD, EP never hits 1. This is because the datasets used to train the considered VPR methods are not meant to cope with extreme seasonal variations.
S_P100 indicates the share of places where a VPR technique successfully identifies a true match as the most similar image to the query. From the perspective of S_P100, the differences between VPR methods are more significant. NetVLAD is the best approach in the urban environment of Berlin-Halensee. This is not surprising, as the model has been trained using images captured from urban scenes. Cross-Region-Bow appears to be the most reliable VPR method for illumination and seasonal changes. It scores an S_P100 of 0.88 and 0.53 on Gardens Point and Nordland respectively. Cross-Region-Bow uses a network pre-trained on ImageNet, which does not emphasize any specific image transformation. Thus, its performance should be attributed to the approach used to combine features into a robust image representation. Corvin is confirmed to be the most difficult dataset also from the perspective of S_P100. The only technique that reaches at least S_P100 = 0.5 is R-MAC, which can be considered the most reliable VPR method on Corvin.
The lowest performance bound is close to zero in most of the tested scenarios. This means that in some places localization by means of visual features might be very difficult because of the frequent occurrence of FPs in the retrieved image set. The only exceptions are R-MAC, NetVLAD and Cross-Region-Bow, whose lower bounds are constantly above 0.5 on Lagout. As EP_min ≥ 0.5 requires P_R0 = 1, the reference image most similar to the query retrieved by these three methods is a TP at every place in Lagout.

B. Statistical Performance Comparison
This section presents a statistical performance comparison between the VPR methods evaluated in the previous phase. The results are obtained with the second phase of the proposed framework and are presented in Fig. 4. The threshold set T used for the experiments includes 9 values (p = 9) equally spaced between 0.1 and 0.9. HybridNet outperforms or achieves comparable performance to AMOSNet in most of the test scenarios (Fig. 4(u)). Our results are coherent with the performance analysis by their authors [8]. The Z score exhibits wide variations on Corvin for every VPR technique. In particular, R-MAC presents large positive Z values for thresholds between 0.1 and 0.5 against every other VPR technique (third row in Fig. 4). At larger thresholds, Z decreases and becomes negative against NetVLAD starting from 0.7 (Fig. 4(i)). Thus, R-MAC outperforms the other approaches when the evaluation is carried out by observing low EP values, which are mostly influenced by P_R0. As the threshold increases, the number of successes is determined by the contribution of R_P100. In such evaluation conditions, R-MAC is outperformed by NetVLAD, which proves capable of retrieving longer sequences of TPs on Corvin. The outcome of McNemar's test confirms and supports the bounds analysis presented in the previous section. As shown in Fig. 3, R-MAC has the best S_P100 while Cross-Region-Bow and NetVLAD reach higher EP_Max.

C. AUC as an Alternative to EP
AUC can be used as an alternative to EP to measure VPR performance. However, we consider AUC less appropriate than Extended Precision for use in the proposed evaluation framework. The most important reason is that AUC does not penalize top-ranked FPs in the query results. Indeed, AUC might be significantly increased by long sequences of TPs regardless of their position in the retrieved image ranking. In contrast, the P_R0 component of Extended Precision penalizes top-ranked false positives by forcing EP ≤ 0.5. In other words, a large AUC does not guarantee that the first images retrieved by a VPR technique are correct matches. For example, the blue curve in Fig. 5 has AUC = 0.38 and EP = 0.51. The green curve has a larger AUC (0.77) and a smaller EP (0.25). As described in Section III-B.iv, P_R0 = 1 is an important evaluation criterion for the proposed evaluation framework; thus, the blue curve is considered better than the green one despite its smaller AUC.
Finally, AUC is more difficult to interpret than P_R0. Except for 0 and 1, the value of AUC is not related to any specific condition or PR-Curve feature. For this reason, McNemar's test based on AUC is harder to understand. Fig. 6 shows the test results for a pair of VPR methods using both EP and AUC. The large negative score of HybridNet against AMOSNet at 0.5 on Corvin means that HybridNet's AUC cannot reach the threshold as often as AMOSNet's. However, a clear interpretation of this outcome is hard to give, as AUC ≥ 0.5 does not have any specific meaning related to VPR performance. Conversely, the positive Z value at 0.5 for the EP-based test means that the top-1 image retrieved by HybridNet on Corvin is more often a correct match than for AMOSNet.
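The contrast between AUC and EP can be reproduced on two toy rankings. The sketch below is illustrative only and does not use the curves of Fig. 5. Each ranking is a hypothetical list of booleans (true match or not, best match first). AUC is computed with the trapezoidal rule after extending the curve to Recall 0 at the Precision of its first point, which is one of several possible conventions. P_R0 is taken as the Precision at the lowest non-zero Recall.

```python
from typing import List, Sequence, Tuple


def pr_points(relevance: Sequence[bool]) -> List[Tuple[float, float]]:
    """(Precision, Recall) after each retrieved image, up to Recall = 1."""
    total_pos = sum(1 for r in relevance if r)
    if total_pos == 0:
        raise ValueError("the query needs at least one true match")
    points, tp = [], 0
    for k, rel in enumerate(relevance, start=1):
        tp += 1 if rel else 0
        points.append((tp / k, tp / total_pos))
        if tp == total_pos:
            break
    return points


def auc_pr(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under the PR points (assumed convention: the curve
    is extended to Recall 0 at the Precision of its first point)."""
    precisions = [points[0][0]] + [p for p, _ in points]
    recalls = [0.0] + [r for _, r in points]
    area = 0.0
    for i in range(1, len(recalls)):
        area += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2.0
    return area


def extended_precision(points: Sequence[Tuple[float, float]]) -> float:
    p_r0 = next(p for p, r in points if r > 0)
    r_p100 = max((r for p, r in points if p == 1.0), default=0.0)
    return (p_r0 + r_p100) / 2.0


# Ranking A: correct top-1 match, then many false positives before the rest.
ranking_a = [True] + [False] * 9 + [True] * 4
# Ranking B: wrong top-1 match followed by a long run of true matches.
ranking_b = [False] + [True] * 9

curve_a, curve_b = pr_points(ranking_a), pr_points(ranking_b)
auc_a, ep_a = auc_pr(curve_a), extended_precision(curve_a)
auc_b, ep_b = auc_pr(curve_b), extended_precision(curve_b)
```

Ranking B obtains the larger AUC even though its top-ranked image is a false positive, while EP places ranking A above 0.5 and ranking B below it. This is the behaviour the paragraph above describes.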

V. CONCLUSION
In this letter, a new framework to evaluate VPR performance is proposed. It consists of two phases: the first is designed to assess the consistency of a VPR method's performance across an environment and the second uses a variant of McNemar's test to identify the statistically significant performance differences between VPR methods. The proposed framework is based on the newly introduced Extended Precision measure for VPR performance. EP summarizes a PR-Curve by combining two of its most relevant features, P_R0 and R_P100, into an easy-to-read measure of VPR performance. EP addresses several shortcomings of AUC, which would produce less significant and harder to understand results if used with the proposed evaluation method. The proposed framework is then used to assess and compare several state-of-the-art VPR techniques using different datasets that include one or more appearance variations such as illumination and viewpoint changes. NetVLAD has shown solid and reliable performance in most of the test scenarios, and in urban environments in particular. Cross-Region-Bow has exhibited good performance too, especially with illumination and seasonal variations. AMOSNet and HybridNet achieved the worst performance among the considered methods, especially in dealing with strong viewpoint variations where, on the contrary, R-MAC proved to be the most reliable VPR approach.

Fig. 1. An example of how the proposed EP is computed for two hypothetical VPR systems. At a glance, the two curves suggest that VPR1 is better than VPR2. Indeed, the corresponding EP values are 0.8 and 0.4 respectively.

Fig. 2. A sample of the datasets used for the experiment. Reference images at the top and query images at the bottom.

Fig. 3. Upper and lower performance bounds for the assessed VPR methods. The green and red bars represent the maximum and minimum EP scored on each dataset (top x-axis). The yellow bars indicate S_P100 and the related y-axis is on the right side of the figure.

Fig. 4. Pair-wise comparison for the VPR methods tested. A sign convention is used to present the results: a positive value of Z indicates that the first method of the pair outperforms the second one, whereas a negative Z score has the opposite meaning.
The thresholds range from 0.1 to 0.9. As detailed in Section III-C, at low threshold values a successful trial is determined by the component P_R0. Conversely, for thresholds greater than 0.5, it is the component R_P100 of EP that determines successes and failures. This setup allows the comparison of VPR methods from different perspectives by exploring the complete range of variation of EP. In Fig. 4 a colour code is used to represent the Z values for all combinations of threshold and dataset. Although Z is always positive, a sign convention is used to indicate which VPR method obtains better performance. A positive Z score means that the first technique of the pair is better than the second, namely N_sf > N_fs. A negative value of Z indicates that the second VPR method of the pair outperforms the first one (N_sf < N_fs). It is worth noting that |Z| increases with the difference |N_sf − N_fs|; thus, Z can be interpreted as a performance gap between two VPR approaches. NetVLAD and Cross-Region-Bow outperform the other approaches on the Berlin-Halensee, Gardens Point and Nordland datasets, as confirmed by their large positive Z values (Fig. 4(b) to Fig. 4(h)). They have comparable performance on Corvin and Lagout, while NetVLAD is better than Cross-Region-Bow on Berlin-Halensee and worse on Nordland at every threshold (Fig. 4(a) and Fig. 4(e)).

Fig. 5. A comparison between three PR-Curves for NetVLAD on Corvin with their respective EP and AUC values.

TABLE I. DATASET VARIATIONS AND GROUND TRUTH TOLERANCE