Holistic Descriptors of Omnidirectional Color Images and Their Performance in Estimation of Position and Orientation

The use of visual sensors in robotic navigation tasks is a common approach, and numerous examples can be found in the literature. This work focuses on the problem of map building and localization using omnidirectional images as the only source of information. The main objective of this paper is to present a thorough comparison of global-appearance description techniques, including several ways of incorporating color information. Some of the descriptors have been widely tested in previous works using gray-level images. In the present work we concentrate on the role and efficiency of the color information. Other descriptors are presented for the first time. To carry out this study, a database captured in different areas of an office environment is used, including two different datasets: training and test datasets. The experimental results include the computational requirements of the map building and localization processes, and the accuracy of the pose estimation of the test images in a topological map, evaluating position and orientation separately. To complete the study, the behavior of the descriptors is tested when the images present noise or occlusions, with special attention to the effect on the color information.


I. INTRODUCTION
The autonomous navigation of mobile robots is a wide area of investigation. For this task, robots must gather and interpret information from their environment. In the literature, different approaches can be found depending on the kind of sensors used. Over the last few years, the use of visual sensors has become an important line of research [1], due to the many possibilities they offer and the richness of the information they provide. They are also well suited to this purpose because they consume less power than other sensors, which is important for the autonomy of navigation, and their cost is relatively low.
Visual systems can be classified depending on the number of cameras they use and their field of view. That way, we find examples of systems based on one camera [2], [3], stereo cameras [4], [5] that simulate human vision, trinocular systems [6] or even arrays of cameras that gather 90% of the spherical field of view around the robot [7]. If we also consider adaptations in the architecture of the visual sensors, catadioptric systems can be highlighted [8]. They use a reflective surface to expand the field of view [9], [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Shuhan Shen.
The richness of the visual information implies important memory and computational requirements to store and process the scenes. In real-time navigation tasks, this quantity of information might become unmanageable. For that reason, it is necessary to represent the images using descriptors that reduce the information to a vector of features, but preserve the ability to recognize the image among others in a database.
Such descriptors can be classified into two categories: local-features descriptors and global-features or holistic descriptors. On the one hand, local-features descriptors extract outstanding points from the images, which the robot can recognize easily. These features are also called landmarks. Landmarks can be artificial, as Okuyama et al. show in [11] using QR codes, or natural. Natural landmarks are extracted directly from the image, and usually correspond to recognizable points such as corners, doors or windows, as we can see in [12], [13]. Another example is found in [14], where a novel method for object recognition and pose estimation based on 3D point extraction using an RGB-D sensor is presented. The main disadvantages of these techniques are the complexity of extracting stable landmarks in real and changing environments, and the computational cost of processing the image to extract those features and compare them.
On the other hand, global-appearance descriptors extract the information of the image as a whole, avoiding any local pattern of the scene. Map building and localization with these descriptors is less complex than using 3D landmarks [15]. However, the size of the maps can be excessive, since they contain information of the entire image. For that reason, the study of global-appearance descriptors normally focuses on the kind and quantity of features they extract from the images. In contrast with the descriptors based on landmarks, they do not contain any metric information. Consequently, they are typically used in topological navigation approaches, in which the localization of the robot can be addressed as the problem of associating an image with the information in the map [16]. Several authors have addressed a variety of problems in autonomous vehicles using visual information and global-appearance descriptors. For example, Hu et al. [17] use holistic descriptors from images, with the purpose of recognizing signals in road environments. They build these holistic descriptors from local features and a method based on the k-nearest neighbours. Payá et al. [18] present a framework for topological map creation using global-appearance descriptors. Additionally, these description techniques can be combined with clustering algorithms in order to improve the maps and the localization process, as shown in [19].
Image retrieval plays an important role in robot localization, and this problem has been extensively addressed using grayscale images and holistic descriptors. Li et al. [20] study the image matching problem, with the objective of detecting loop closures in a SLAM (Simultaneous Localization and Mapping) application. They solve it by using a combination of clustering methods and descriptors built both from holistic and local features of grayscale images. Horst and Möller [21] focus on place recognition in mobile robotics using grayscale images. They investigate the effect of warping in place recognition and the NSAD (Normalized Sum of Absolute Differences) distance measure. Doan et al. [22] also study the problem of place recognition using visual information and exploiting the temporal continuity of the acquisition process. The image retrieval pipeline uses local features and an encoding method that represents each image as a single vector.
A recent approach shown in [23] demonstrates that it is possible to estimate relative positions between images using global-appearance techniques. The framework, called multi-scale analysis, uses plane projections of the omnidirectional images that permit estimating displacements between two positions of the robot using only visual information. That way, it improves the accuracy of the robot's localization in the map.
The descriptors included in this work are based on the Discrete Fourier Transform [24], the Histogram of Oriented Gradients [25] and gist [26]. Most of the descriptors that can be found in the literature are designed to be used with grayscale images. In the present work we explore the role of color information along with global-appearance descriptors [27] and we assess the performance of such information in a topological localization task, addressed as an image retrieval problem.
The remainder of the article is structured as follows: Section II introduces the global-appearance techniques used to describe the omnidirectional images in this work. Section III outlines the introduction of color features to the description techniques. Later, Section IV presents the sets of images used in the experiments. Section V details the experimental setup. After that, Section VI presents the results, and finally, Section VII summarizes the main conclusions.

II. VISUAL DESCRIPTORS
This section introduces the techniques used to describe globally the appearance of the panoramic images in the present work. Some of them have been extensively described in previous works: the Fourier Signature (FS) and the Histogram of Oriented Gradients (HOG) in [28] and Principal Components Analysis (PCA) and gist in [29]. In this work, this set of techniques is complemented with two additional descriptors based on the Discrete Fourier Transform and one additional gist descriptor based on color information [30]. These techniques are outlined in the next subsections. In all cases, the initial information is a set of $N$ panoramic images captured from several points in the ground plane, distributed along the environment to model, $F = \{f_1, f_2, \ldots, f_N\}$, where $f_j \in \mathbb{R}^{N_x \times N_y}$, $j = 1, \ldots, N$, represents each image of the map set. $N_x$ and $N_y$ denote, respectively, the number of rows and columns of the image $f_j$. In general, after describing each of these images, the result is a set of position descriptors $D_{pos}$, one per original scene, each in $\mathbb{R}^{k_{pos} \times 1}$, and a set of orientation descriptors $D_{or}$, also one per original scene. $k_{pos}$ is the size of the position descriptor and $k_{or}$ is the size of the orientation descriptor. Their specific values depend on each description technique, as described in Section V-B.

A. TECHNIQUES BASED ON THE DISCRETE FOURIER TRANSFORM
The Discrete Fourier Transform (DFT) converts the sequence of numbers $\{a_0, a_1, \ldots, a_{N_y-1}\}$ into the complex sequence $\{A_0, A_1, \ldots, A_{N_y-1}\}$ according to the equation:
$$A_k = \sum_{n=0}^{N_y-1} a_n \, e^{-j\frac{2\pi}{N_y}kn}, \quad k = 0, \ldots, N_y-1 \tag{1}$$
where $N_y$ is the number of components of the sequence. This transformation represents a discrete signal in the frequency domain. One relevant property for this work is the shift theorem, which states that a circular shift of the initial sequence produces a transformed sequence whose components have the same magnitude and whose arguments can be calculated with (2):
$$\mathcal{F}\left[\{a_{n-q}\}\right] = A_k \, e^{-j\frac{2\pi q k}{N_y}}, \quad k = 0, \ldots, N_y-1 \tag{2}$$
where $q$ is the amount of circular shift in the first sequence.

1) ONE-DIMENSIONAL DFT (1D-DFT)
Briggs et al. [31]-[33] propose a descriptor that reduces a panoramic image into a unidimensional vector for localization and navigation purposes in robotics. From these works, we develop the idea of creating a one-dimensional vector from the average values of the pixels of each column of the panoramic image. After that, we apply the DFT to the resulting vector. Fig. 1 shows the descriptor creation process. This process is applied individually to each panoramic image in the initial set F. If the movement of the robot is contained in the ground plane and the catadioptric vision system is mounted vertically, the 1D-DFT descriptor presents interesting properties when it is applied to the panoramic images obtained from this system. First, the most relevant information of the image is contained in the lowest frequency components, whereas the last components are usually affected by the presence of high-frequency noise in the original image. For that reason, we keep only the first components, which has a substantial compression effect. Second, since the transformed sequence is complex, the information can be separated into two vectors: one with the magnitudes and the other with the arguments.
Additionally, according to the shift theorem of the DFT in (2), the magnitudes vector is invariant against changes of the orientation of the robot in the ground plane and can be used for localization purposes, while the arguments vector retains phase information that is useful in the estimation of the relative orientation of the robot. In terms of this theorem, if the robot rotates θ degrees, the sequence that represents the original panoramic image circularly shifts q positions. Therefore, the magnitudes vector can be considered as the position descriptor (it contains information on the appearance of the environment as seen from a specific position, independently of the orientation), and the arguments vector can be considered as the orientation descriptor (it is useful to estimate the relative orientation of the robot with respect to a reference one).
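The properties described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in the experiments: the function names `dft1d_descriptor` and `estimate_shift`, the number of retained components and the synthetic random image are assumptions made only for illustration. A pure rotation is simulated as a circular shift of the columns, and the shift is recovered from the phase lag of the first non-constant component, following (2).

```python
import numpy as np

def dft1d_descriptor(image, k):
    """Magnitudes and arguments of the first k DFT components of the
    column-average signal of a grayscale panoramic image (hypothetical helper)."""
    column_means = image.mean(axis=0)          # one value per column (length N_y)
    spectrum = np.fft.fft(column_means)[:k]    # keep only the k lowest frequencies
    return np.abs(spectrum), np.angle(spectrum)

def estimate_shift(args_ref, args_rot, n_cols, component=1):
    """Circular shift q (in columns) estimated from the phase lag of one component.
    From the shift theorem: arg(rotated) - arg(reference) = -2*pi*q*k/N_y."""
    lag = args_ref[component] - args_rot[component]
    q = lag * n_cols / (2 * np.pi * component)
    return int(round(q)) % n_cols

rng = np.random.default_rng(0)
panorama = rng.random((128, 512))              # synthetic stand-in for a panoramic image
rotated = np.roll(panorama, 64, axis=1)        # pure rotation = circular column shift

mag_a, arg_a = dft1d_descriptor(panorama, k=16)
mag_b, arg_b = dft1d_descriptor(rotated, k=16)
q_est = estimate_shift(arg_a, arg_b, panorama.shape[1])
```

In this sketch the two magnitude vectors coincide up to numerical precision, while the estimated shift of 64 columns corresponds to a rotation of 64 · 360 / 512 = 45 degrees.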

2) FOURIER SIGNATURE (FS)
Ishiguro and Tsuji [34] proposed the creation of visual maps using the DFT of each row of a panoramic image. This descriptor is also used in [24] with the name of Fourier Signature (FS). The FS is a complex matrix and it is also calculated independently for each panoramic image in the initial set F. Using the same property as in the previous subsection, only the first terms of the transform of each row are retained. The resulting magnitudes and arguments matrices are arranged into two vectors to compose, respectively, the position and the orientation descriptor of each initial panoramic image.
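As an illustration, the FS can be sketched in a few lines of NumPy. The function name `fourier_signature`, the value of `k` and the random test image are assumptions; the row-major flattening of the matrices is also just one possible arrangement.

```python
import numpy as np

def fourier_signature(image, k):
    """Row-wise DFT of a panoramic image, keeping the first k terms per row.
    Returns the position (magnitudes) and orientation (arguments) vectors."""
    rows_dft = np.fft.fft(image, axis=1)[:, :k]   # one DFT per row
    position = np.abs(rows_dft).ravel()           # magnitudes matrix as a vector
    orientation = np.angle(rows_dft).ravel()      # arguments matrix as a vector
    return position, orientation

rng = np.random.default_rng(1)
panorama = rng.random((128, 512))                 # synthetic stand-in image
rotated = np.roll(panorama, 100, axis=1)          # simulated rotation of the robot

pos_ref, ori_ref = fourier_signature(panorama, k=8)
pos_rot, ori_rot = fourier_signature(rotated, k=8)
```

With 128 rows and 8 retained terms, each descriptor has 1024 components, and the position descriptors of the original and rotated images coincide.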

3) TWO-DIMENSIONAL DFT (2D-DFT)
Finally, it is also possible to apply the 2D-DFT directly over a digital image to transform the visual information into the frequency domain. If we represent an image with the discrete function $f(x, y)$, with $N_x$ rows and $N_y$ columns, the 2D-DFT is obtained as:
$$F(u, v) = \sum_{x=0}^{N_x-1} \sum_{y=0}^{N_y-1} f(x, y) \, e^{-2\pi j \left(\frac{ux}{N_x} + \frac{vy}{N_y}\right)}, \quad u = 0, \ldots, N_x-1, \; v = 0, \ldots, N_y-1$$
As in the previous DFT-based descriptors, the coefficients of the transform can be divided into two matrices: one with the magnitudes (or power spectrum), which is useful as a position descriptor, and another with the arguments, which is the orientation descriptor. A pure rotation of the robot in the floor plane produces a shift of the columns of the panoramic images. The shift theorem of the 2D-DFT is expressed as:
$$\mathcal{F}\left[f(x - x_0, y - y_0)\right] = F(u, v) \, e^{-2\pi j \left(\frac{u x_0}{N_x} + \frac{v y_0}{N_y}\right)}$$
where $x_0 = 0$ for a pure rotation, so only the arguments change, proportionally to the column shift $y_0$. In this case, to compose the final descriptor, a number of low-frequency components is retained (i.e. a submatrix starting from the first component of the transform).
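A short sketch of this descriptor follows, assuming NumPy's unnormalized `fft2` and a hypothetical helper `dft2d_descriptor`; the size of the retained submatrix is arbitrary. It checks numerically that a column shift leaves the magnitudes unchanged and that the phase lag of the component $(u, v) = (0, 1)$ equals $-2\pi y_0 / N_y$.

```python
import numpy as np

def dft2d_descriptor(image, k_rows, k_cols):
    """Magnitudes and arguments of the k_rows x k_cols low-frequency
    submatrix of the 2D-DFT (hypothetical helper)."""
    spectrum = np.fft.fft2(image)[:k_rows, :k_cols]
    return np.abs(spectrum).ravel(), np.angle(spectrum).ravel()

rng = np.random.default_rng(2)
panorama = rng.random((128, 512))            # synthetic stand-in image
y0 = 32                                      # column shift caused by the rotation
rotated = np.roll(panorama, y0, axis=1)

mag_ref, arg_ref = dft2d_descriptor(panorama, 8, 8)
mag_rot, arg_rot = dft2d_descriptor(rotated, 8, 8)

# Component (u, v) = (0, 1) is stored at flat index 1 of the 8 x 8 submatrix.
# The complex exponential wraps the phase difference into (-pi, pi].
phase_lag = np.angle(np.exp(1j * (arg_rot[1] - arg_ref[1])))
expected_lag = -2 * np.pi * y0 / panorama.shape[1]
```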

B. TECHNIQUES BASED ON PCA
Principal Component Analysis (PCA) is a widely used technique to extract the most relevant information from a set of data vectors. It consists in performing a transformation that projects these data vectors into a lower-dimensional space that preserves most of the variance of the data [35].
The pixels of an image can be arranged into a column vector $x \in \mathbb{R}^{M \times 1}$, with $M$ the number of elements of the image. Considering $N$ the number of images of the dataset, the matrix of data is denoted as:
$$X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{M \times N}$$
To perform PCA, we normalize the data by subtracting the average value from each image. We denote the new matrix as $\bar{X}$. From these data, the covariance matrix is obtained:
$$C = \bar{X}\bar{X}^{T}$$
From the eigenvectors of this matrix $u_j$, ordered by the relative importance of their associated eigenvalues, we obtain the transformation matrix:
$$U = [u_1, u_2, \ldots, u_N]$$
with $U \in \mathbb{R}^{M \times N}$. The projection of the original information in the new basis is:
$$y_j = U^{T}\bar{x}_j$$
where $y_j \in \mathbb{R}^{N \times 1}$ is the projection of $x_j$ in the new basis. In practice, we select only the first eigenvectors $u_j$ to build the new basis.
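The following sketch shows one common way to compute such a basis with NumPy. It assumes that the normalization subtracts the mean image (the average across the dataset), and it obtains the eigenvectors of the covariance matrix through a singular value decomposition of the centered data instead of building the M × M matrix explicitly. The helper name `pca_basis` and the data sizes are hypothetical.

```python
import numpy as np

def pca_basis(X, k):
    """X is an M x N data matrix with one image per column.
    Returns the mean image (M x 1) and the first k eigenvectors (M x k)."""
    mean = X.mean(axis=1, keepdims=True)            # assumption: mean image across the dataset
    X_centered = X - mean
    # Left singular vectors of the centered data are the eigenvectors of
    # X_centered @ X_centered.T, already sorted by decreasing eigenvalue (s**2).
    U, s, _ = np.linalg.svd(X_centered, full_matrices=False)
    return mean, U[:, :k], s[:k] ** 2

rng = np.random.default_rng(3)
X = rng.random((200, 10))                           # 10 synthetic "images" of 200 pixels
mean, U_k, eigenvalues = pca_basis(X, k=3)
Y = U_k.T @ (X - mean)                              # projections, one column per image

# Check the eigenvector property for the first component
C = (X - mean) @ (X - mean).T
residual = np.linalg.norm(C @ U_k[:, 0] - eigenvalues[0] * U_k[:, 0])
```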

1) ROTATIONAL PCA
PCA has proven to be a robust algorithm for the compression of information. However, if PCA were used directly, considering that the data vectors are panoramic images captured from different positions, then the projections would not be rotationally invariant. That is to say, two images captured from the same position but with different robot orientations would lead to completely different projections in the new space.
To solve this problem, Jogan and Leonardis [36], [37] propose the Eigenspace of Spinning-Images. The algorithm makes $R_{im}$ equally distributed artificial rotations of each original panoramic scene and builds the initial data matrix with them. Using the algorithm they propose, every image is transformed into a column vector (also named 'projection') whose components are complex numbers. Figure 2 shows that, in the case of a panoramic image and its evenly rotated siblings, every specific component of their projections has the same magnitude, and these components have a phase lag which is constant between consecutive rotated siblings. More specifically, the blue asterisks show the second component of the projection of a panoramic image and the second component of the projections of its rotated siblings. The magnitude of these second components is the same, and there is a phase lag between the second component of the projection of consecutive rotated siblings which is constant and equal to $\varphi_1$. The green asterisks show the same concept by representing the third component of the projection of a panoramic image and its rotated siblings. Again, there is a phase lag between the third component of the projection of consecutive rotated siblings which is constant and equal to $\varphi_2$. Therefore, the map only needs to contain the projection of one image per position (position descriptors), and the phase lag between the coefficients of consecutive rotated images (orientation descriptors). That way, we can artificially simulate the projections of the different rotations of the image. The magnitude will be used to find the location of the robot in the map, and the argument information to estimate the orientation. Additionally, it is necessary to store the transformation matrix $U$.
The angular resolution of the dataset will depend on the number of artificial rotations included in the map, according to (8). However, high resolutions will require an extremely high calculation time to obtain the projections.
$$\text{Min. Angle}(^\circ) = \frac{360}{R_{im}} \tag{8}$$

2) PCA OVER THE FOURIER SIGNATURE
As stated before, PCA is a technique which is not rotationally invariant. However, if the information of the data matrix X presents rotational invariance, the new representation will also keep this property, as stated in [38], [39]. For this reason, in this section we propose the following method. For each original panoramic image, we calculate the magnitudes matrix of its FS and we arrange the information in a column vector. The data matrix X will be composed of the column vectors obtained from the set of panoramic images, and PCA is subsequently performed with this matrix. The projections in the new space will be used for the robot localization. For the orientation estimation, the descriptor uses the arguments of the Fourier Signature, without any change of basis.
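A compact sketch of this idea is given below. It is only an illustration under several assumptions: the helper `fs_magnitude`, the number of retained Fourier terms and principal components, the use of the mean vector for centering and the Euclidean distance for retrieval are all choices made for the example. A rotated version of one map image is projected and retrieved among the map projections.

```python
import numpy as np

def fs_magnitude(image, k):
    """Magnitudes of the first k row-wise DFT terms, arranged as a vector."""
    return np.abs(np.fft.fft(image, axis=1)[:, :k]).ravel()

rng = np.random.default_rng(4)
images = [rng.random((32, 128)) for _ in range(20)]            # synthetic map images
X = np.stack([fs_magnitude(im, 8) for im in images], axis=1)   # (32*8) x 20 data matrix

mean = X.mean(axis=1, keepdims=True)
U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
U_k = U[:, :5]                                                 # first 5 principal directions
map_projections = U_k.T @ (X - mean)                           # 5 x 20

# Query: map image 7 seen with a different orientation (circular column shift)
query = np.roll(images[7], 40, axis=1)
y = U_k.T @ (fs_magnitude(query, 8)[:, None] - mean)
nearest = int(np.argmin(np.linalg.norm(map_projections - y, axis=0)))
```

Since the FS magnitudes do not change with the rotation, the projection of the query coincides with the projection stored for image 7.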

C. TECHNIQUES BASED ON HISTOGRAMS OF ORIENTED GRADIENTS
The Histogram of Oriented Gradients (HOG) [40] describes the image using the pixel intensity distribution in local areas. For that purpose, first, the gradient of the image is obtained.
If $I_x$ and $I_y$ represent the derivatives of the image with respect to the axes $x$ and $y$ respectively, it is possible to calculate the magnitude and orientation of the gradient as:
$$|G| = \sqrt{I_x^2 + I_y^2}$$
$$\theta = \arctan\left(\frac{I_y}{I_x}\right)$$
After that, the image is divided into cells and a histogram of oriented gradients per cell is compiled. The histogram of each cell is built from the gradient orientation of each pixel in the cell, weighted by the gradient magnitude of this pixel. To build the histogram, a number of bins must be defined. In this work, we divide the orientation range (0° to 180°) into 8 bins of 22.5° each.
In order to adapt the technique to localization and orientation estimation purposes, we create two different descriptors: one for position and another for orientation estimation. Since we work with panoramic images, which contain the same information per row independently of the robot orientation, we use horizontal cells (with the same width as the image) to obtain a position descriptor, which presents rotational invariance. Regarding the orientation, we use overlapped vertical cells (with the same height as the image), separated by a distance of $D$ pixels between consecutive cells. By shifting the histograms of these cells, we can simulate a rotation of the robot. The resolution in the phase estimation depends on $D$:
$$\text{Min. Angle}(^\circ) = \frac{360 \cdot D}{N_y} \tag{11}$$
Fig. 3 shows the division of the image into cells, both for the position and the orientation descriptors. The descriptor will contain the histograms of each cell, appended and arranged in a column vector.
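The construction of both descriptors, and the estimation of the orientation by shifting the histograms of the vertical cells, can be sketched as follows. This is an illustrative approximation and not the implementation evaluated later: the derivative along the columns is computed with a circular central difference so that the panorama wraps around 360 degrees, the cell sizes are arbitrary, and all function names are hypothetical.

```python
import numpy as np

def gradient_mag_ori(image):
    # Horizontal derivative with circular wrap-around; vertical derivative with np.gradient
    dx = (np.roll(image, -1, axis=1) - np.roll(image, 1, axis=1)) / 2.0
    dy = np.gradient(image, axis=0)
    magnitude = np.hypot(dx, dy)
    orientation = np.degrees(np.arctan2(dy, dx)) % 180.0   # unsigned orientation in [0, 180]
    return magnitude, orientation

def cell_histogram(mag, ori, n_bins=8):
    hist, _ = np.histogram(ori, bins=n_bins, range=(0.0, 180.0), weights=mag)
    return hist

def hog_position(image, n_cells, n_bins=8):
    """Horizontal cells spanning the full image width (rotation invariant)."""
    mag, ori = gradient_mag_ori(image)
    rows = np.array_split(np.arange(image.shape[0]), n_cells)
    return np.concatenate([cell_histogram(mag[r], ori[r], n_bins) for r in rows])

def hog_orientation(image, cell_width, D, n_bins=8):
    """Overlapping vertical cells spanning the full height, one every D columns."""
    mag, ori = gradient_mag_ori(image)
    n_cols = image.shape[1]
    hists = []
    for start in range(0, n_cols, D):
        cols = np.arange(start, start + cell_width) % n_cols   # wrap around the panorama
        hists.append(cell_histogram(mag[:, cols], ori[:, cols], n_bins))
    return np.array(hists)                                     # one row per vertical cell

rng = np.random.default_rng(5)
panorama = rng.random((64, 256))
D, m_true = 8, 5
rotated = np.roll(panorama, m_true * D, axis=1)                # rotation of 5 cells

pos_ref = hog_position(panorama, n_cells=8)
pos_rot = hog_position(rotated, n_cells=8)
ori_ref = hog_orientation(panorama, cell_width=32, D=D)
ori_rot = hog_orientation(rotated, cell_width=32, D=D)

# Try every cell shift and keep the one that best matches the rotated descriptor
errors = [np.linalg.norm(np.roll(ori_ref, m, axis=0) - ori_rot) for m in range(len(ori_ref))]
m_est = int(np.argmin(errors))
angle_est = m_est * D * 360.0 / panorama.shape[1]              # resolution given by (11)
```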

D. TECHNIQUES BASED ON GIST
To obtain the essential information from the image, the descriptors based on gist try to mimic the human perception system and its ability to recognize a scene through the identification of color or remarkable structures, avoiding the representation of specific objects or local features ([41], [42]). Therefore, they can be seen as global-appearance descriptors. In this work, we consider two approaches: gist-Gabor and gist-color.

1) GIST-GABOR
The gist-Gabor descriptor [26] is based on the use of Gabor filters, and collects frequency and orientation information from the images. The first step is to create a bank of Gabor filters, with orientations evenly distributed in the range [0°, 180°). Gabor masks are sinusoidal waves multiplied by a Gaussian function, so they are localized both in the frequency and in the space domain. In this work, two different spatial scales are considered to create the Gabor bank. Fig. 4 presents a sample panoramic image and the resulting images after filtering it with four different Gabor masks, changing both scales and orientations. Once the scene has been filtered with the different masks and scales, the algorithm divides each resulting image into a set of (a) non-overlapping horizontal blocks, to create the position descriptor, and (b) overlapping vertical blocks, to create the orientation descriptor, as seen in Fig. 3, and the average value of the pixels inside each block is calculated. As in the case of the HOG descriptor, the resolution of our descriptor in orientation estimation depends on the distance between consecutive vertical blocks $D$, as seen in (11).
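A minimal sketch of the position descriptor is given next. The kernel size, the wavelengths that play the role of the two scales, the ratio between the Gaussian width and the wavelength, and the number of blocks are assumptions, as are the function names. The filtering is implemented as a circular convolution through the FFT, which wraps around the image borders (also vertically, which is a simplification).

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """Real Gabor mask: a cosine wave along direction theta under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return envelope * np.cos(2.0 * np.pi * x_rot / wavelength)

def filter_image(image, kernel):
    """Circular convolution via the FFT; returns the absolute filter response."""
    padded = np.zeros_like(image, dtype=float)
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))  # centre the mask at the origin
    response = np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded))
    return np.abs(np.real(response))

def gist_gabor_position(image, n_orientations=4, wavelengths=(4.0, 8.0), n_blocks=8):
    feats = []
    for wavelength in wavelengths:                         # two hypothetical scales
        for i in range(n_orientations):
            theta = i * np.pi / n_orientations             # evenly spaced in [0, 180) degrees
            kernel = gabor_kernel(15, wavelength, theta, sigma=wavelength / 2.0)
            response = filter_image(image, kernel)
            blocks = np.array_split(response, n_blocks, axis=0)   # horizontal blocks
            feats.extend(block.mean() for block in blocks)
    return np.array(feats)

rng = np.random.default_rng(6)
panorama = rng.random((64, 256))
d_ref = gist_gabor_position(panorama)
d_rot = gist_gabor_position(np.roll(panorama, 70, axis=1))
```

The descriptor has 2 · 4 · 8 = 64 components and, because the blocks span the full width, it does not change when the columns are circularly shifted.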

2) GIST-COLOR
The second descriptor based on gist is gist-color [43]. This technique collects color, intensity and orientation information from each scene. The color features are extracted from a Gaussian pyramid of images, using the color channels proposed by Hering [44], which define three opposing color pairs: red/green, blue/yellow and black/white. The last one corresponds to the intensity of the pixel. The descriptor calculates five primary channels, R (Red), G (Green), B (Blue), Y (Yellow) and I (Intensity), from $r$, $g$ and $b$, the red, green and blue channels of the original RGB panoramic scene. The opposing color pairs RG and BY are obtained from these primary channels. After that, a Gaussian pyramid is used to carry out a set of center-surround operations with the three opposing color channels RG, BY and I (21). Fig. 5 shows a Gaussian pyramid with 8 scales, created from a sample panoramic image. In these operations, the center corresponds to the lower scales (with higher resolution), denoted by $c$ in (21). For the surrounding pixels ($s$), the lower resolution scales are used. The comparison between scales is represented with $\ominus$:
$$RG(c, s) = |RG(c) \ominus RG(s)|, \quad BY(c, s) = |BY(c) \ominus BY(s)|, \quad I(c, s) = |I(c) \ominus I(s)| \tag{21}$$
Using the center-surround operations, we obtain information at different scales which is expected to be robust against changes of lighting conditions, as Siagian et al. state in [45]. The scales used in the center-surround operations in this work are summarized in Table 1. Fig. 6 shows the resulting images after applying the center-surround operations with the three opposing color pairs of a sample image. The features related to the spatial distribution of the scenes are extracted using Gabor filters. For gist-color, we use 4 filter orientations ($\theta_i$ = 0°, 45°, 90°, 135°) applied to two different pyramid scales. Finally, all the resulting images (both those with the color and those with the orientation information) are individually divided into blocks. As in the previous subsection (gist-Gabor), two descriptors are created: one with the values of the horizontal cells for localization purposes, and another with the values of the vertical cells for orientation estimation.
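The color part of this pipeline can be sketched as follows. The exact channel definitions are not reproduced here, so the sketch assumes the broadly-tuned color channels commonly used in saliency models (R = r − (g+b)/2, G = g − (r+b)/2, B = b − (r+g)/2, Y = (r+g)/2 − |r−g|/2 − b, I = (r+g+b)/3) and the opponent pairs RG = R − G and BY = B − Y. The pyramid uses 2 × 2 averaging instead of a true Gaussian filter, the across-scale difference upsamples the surround by pixel repetition, and the pair of scales (c = 1, s = 3) is arbitrary; all of these are simplifying assumptions.

```python
import numpy as np

def opponent_channels(rgb):
    """Opponent pairs RG, BY and intensity I from an RGB image in [0, 1] (assumed definitions)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    R = r - (g + b) / 2.0
    G = g - (r + b) / 2.0
    B = b - (r + g) / 2.0
    Y = (r + g) / 2.0 - np.abs(r - g) / 2.0 - b
    I = (r + g + b) / 3.0
    return R - G, B - Y, I

def pyramid(channel, n_scales):
    """Dyadic pyramid by 2 x 2 averaging (a stand-in for a Gaussian pyramid)."""
    levels = [channel]
    for _ in range(n_scales - 1):
        c = levels[-1]
        h, w = (c.shape[0] // 2) * 2, (c.shape[1] // 2) * 2
        c = c[:h, :w]
        levels.append((c[0::2, 0::2] + c[1::2, 0::2] + c[0::2, 1::2] + c[1::2, 1::2]) / 4.0)
    return levels

def center_surround(levels, c, s):
    """|level(c) (-) level(s)|: upsample the coarse scale s to scale c and subtract."""
    factor = 2 ** (s - c)
    surround = np.repeat(np.repeat(levels[s], factor, axis=0), factor, axis=1)
    center = levels[c]
    h = min(center.shape[0], surround.shape[0])
    w = min(center.shape[1], surround.shape[1])
    return np.abs(center[:h, :w] - surround[:h, :w])

rng = np.random.default_rng(7)
rgb = rng.random((128, 512, 3))                       # synthetic color panorama
RG, BY, I = opponent_channels(rgb)
cs_map = center_surround(pyramid(RG, n_scales=5), c=1, s=3)

# A gray image (r = g = b) carries no color opponency under these definitions
gray = np.repeat(rng.random((128, 512, 1)), 3, axis=2)
RG_gray, BY_gray, I_gray = opponent_channels(gray)
```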

III. GLOBAL-APPEARANCE DESCRIPTORS AND COLOR INFORMATION
The descriptors included in Section II, with the exception of gist-color, extract the information from the scenes using only the gray-level intensity of each pixel. In fact, the great majority of global-appearance descriptors in the bibliography are applied only to grayscale images. However, if the images are captured with a color camera, the information provided by the different color channels can be used with the aim of improving the descriptors with more insightful information from the scene.
Initially, we can take advantage of the color information by applying the same description method separately to each of the three RGB channels. However, there is usually a high correlation among the information of these three channels. As a result, the different descriptors are also expected to be highly correlated. If that happens, they would not add any useful information with respect to grayscale. As an example, Fig. 7(a) shows the values of the HOG descriptor applied to the same image in grayscale, and applied to the R, G and B channels of the same scene separately. As shown, a high correlation exists between the four descriptors. In this case, creating the descriptor of each RGB channel is almost equivalent to repeating three times the information of the grayscale descriptor. Additionally, Fig. 7(b) presents the same comparison but using the HSV channels (Hue, Saturation and Value). As expected, the descriptor of channel V is the same as that of the grayscale image. However, H and S provide different information.
VOLUME 8, 2020
For this reason, we suggest other means of using the color information in order to extract useful features. In the literature, we can find several works that use the HSV color space. For example, Sablak and Bould [46] create a descriptor with the histograms of the image values in HSV space. Specifically, the descriptor is made up of the position of the local maxima of the histograms of channels H, S and V of the image separately. Suhasini et al. [47] also use HSV instead of RGB in order to obtain a descriptor based on the combination of SIFT (Scale Invariant Feature Transform) and ICH (Invariant Color Histogram), presenting an important improvement in image association tasks compared with the same algorithm applied to RGB. Junhua and Jing [48] show an image classification algorithm based on the Contourlet Transform using the H channel in the HSV space.
The color information of the scene can also be represented with the values of the pixels of each channel using histograms. These features are also independent of the scale and resolution of the image. With the aim of creating a useful descriptor, we propose to extract features by dividing the image into cells and building a histogram per cell using the information in the color channels. For localization, we divide the image into horizontal cells, as we do to obtain the HOG and gist descriptors (Fig. 3). This way, the resulting color descriptors are rotationally invariant since, from a specific position of the robot in the environment, they contain the same information, independently of the robot orientation.
Therefore, for each cell and channel of the color scene, a new histogram with the pixel intensity values is created. All these histograms are put together to create the final descriptor. We name this descriptor Color Histogram (CH). The bins that divide the histogram are equally distributed along the range of values of each channel. We also normalize the histograms by dividing the values of the bins by the number of pixels included in the cell. The size of the CH descriptor will directly depend on the number of cells of each image, and the bins of each histogram.
We can append the CH information to the descriptors that result from each of the techniques presented in Section II, obtaining complete descriptors that contain information both about the spatial distribution of the scene and about color. Specifically, we build a descriptor per scene using either a DFT-based method, HOG or gist, as presented in Section II, and subsequently append the CH information. Before appending the color information to compose the final descriptor, we normalize each vector separately. This way, we prevent either of the two parts from having an excessive weight due to its number of components or the different magnitudes of each part. Regarding the normalization of the color information, we take into account both the number of histograms included in the descriptor CH and the number of cells into which the image is divided.
We define $h_j^{H}$, $h_j^{S}$ and $h_j^{V}$ as the column vectors that contain the values of the histograms of the channels H, S and V, respectively, compiled in the cell $j$. Each histogram is divided by the number of pixels of the cell. Then, we define $h_j^{color}$ as the set of these histograms as:
$$h_j^{color} = \begin{bmatrix} h_j^{H} \\ h_j^{S} \\ h_j^{V} \end{bmatrix}$$
Finally, the descriptor with the color features includes the set of normalized histograms of all the horizontal cells into which the image is divided. If $n$ is the number of cells, we define the color histograms ($CH$):
$$CH = \frac{1}{3n} \begin{bmatrix} h_1^{color} \\ h_2^{color} \\ \vdots \\ h_n^{color} \end{bmatrix}$$
In the same way, the descriptors of the spatial distribution are normalized. We denote them by $D_{spatial}$ independently of the method used to obtain them (FS, HOG or gist). In the case of the descriptors based on FS, the normalization is carried out by dividing each row by its first component in the frequency domain, which corresponds to the average value of the row. It should be noted that this value is different for each row of the transformed image. The normalization of the descriptors based on HOG and gist is carried out by dividing the elements of the descriptor by the sum of all their values. Finally, the color and the spatial information can be weighted differently to compose the final descriptor:
$$D = \begin{bmatrix} w_{spatial} \, D_{spatial} \\ w_{color} \, CH \end{bmatrix}$$
where $w_{spatial}$ and $w_{color}$ are weighting factors. This work includes a complete and systematic comparison of the different descriptors based on the global appearance described in Section II applied to panoramic images, focusing on the utility of the color information. With this purpose, several options will be tested and compared in subsequent sections. Each description technique will be applied separately (a) to the grayscale image, (b) to each RGB channel, (c) to each HSV channel, (d) to both the RGB and the HSV channels to compose a single descriptor and (e) together with the vector CH, which is calculated and appended to each of the different descriptors as explained in this section.
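The construction of CH and its combination with a spatial descriptor can be sketched as follows. The sketch assumes an image already converted to HSV with the three channels scaled to [0, 1], a global factor 1/(3n) so that the components of CH add up to one, and arbitrary values for the number of cells, the number of bins and the weights. The names `color_histogram` and `combine` are hypothetical, and the spatial descriptor is a random vector normalized by the sum of its values, as a stand-in for a HOG or gist descriptor.

```python
import numpy as np

def color_histogram(hsv, n_cells, n_bins):
    """CH descriptor: one histogram per horizontal cell and per channel,
    each divided by the number of pixels of the cell, then scaled by 1/(3n)."""
    cells = np.array_split(hsv, n_cells, axis=0)             # horizontal cells
    hists = []
    for cell in cells:
        n_pixels = cell.shape[0] * cell.shape[1]
        for ch in range(3):                                   # H, S and V
            h, _ = np.histogram(cell[..., ch], bins=n_bins, range=(0.0, 1.0))
            hists.append(h / n_pixels)
    return np.concatenate(hists) / (3 * n_cells)

def combine(d_spatial, ch, w_spatial, w_color):
    """Weighted concatenation of the spatial and the color parts."""
    return np.concatenate([w_spatial * d_spatial, w_color * ch])

rng = np.random.default_rng(8)
hsv = rng.random((128, 512, 3))                               # synthetic HSV panorama
ch = color_histogram(hsv, n_cells=8, n_bins=16)
ch_rot = color_histogram(np.roll(hsv, 200, axis=1), n_cells=8, n_bins=16)

d_spatial = rng.random(64)
d_spatial = d_spatial / d_spatial.sum()                       # sum-normalized, as for HOG or gist
descriptor = combine(d_spatial, ch, w_spatial=0.7, w_color=0.3)
```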

IV. SETS OF IMAGES
This section presents the sets of images used to carry out the experiments. These sets were captured by the authors in different areas and offices of the second floor of the Innova building of the Miguel Hernández University and are accessible from [49], which provides more information about the dataset, including bird's eye views of the capture points of both the training and the test sets, the dimensions of every room and their distribution in a floor plan. Specifically, the datasets include images from a corridor (1), three offices with different configurations (2, 3, 4), a library (5) and a conference room (6). A catadioptric system is used to capture the datasets. It is composed of a color camera (Imaging Source model DFK-21BF04) and a hyperbolic mirror (Eizoh Wide70), and it captures omnidirectional color scenes with 640 × 480 pixel resolution.
Two datasets have been captured to test the performance of the descriptors: the training and the test ones. In the training dataset, the capture points compose a regular 40 cm × 40 cm grid on the floor, and all the captures are performed under real operating and lighting conditions. It is a challenging environment due to the presence of large windows that force us to reduce the gain of the camera to avoid the saturation of the image. For that reason, the histograms of the scenes are normally concentrated in the low area of the color range. Table 2 shows the number of images per area in the training dataset. The test dataset is composed of images captured in the same environment, which are used to carry out the position and orientation estimation experiments. While capturing the test images, 3 different cases were considered regarding the capture points with respect to the training grid: (1) the test image is captured very close to the position of a training image; (2) the test image is captured halfway between two map images and (3) it is captured approximately equidistant to four images of the grid. In the experimental part (Section V), the descriptors are evaluated in an image retrieval framework, in which the descriptor of each test image is compared with the descriptors of the training images and the most similar descriptor (nearest neighbour) is retained. In these experiments, the result is considered a correct retrieval if the nearest neighbour was captured at the geometrically nearest point in case (1), at one of the two nearest points in case (2) and at one of the four nearest points in case (3).
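The retrieval criterion can be written compactly as in the sketch below. It assumes the Euclidean distance between descriptors, which is only one of the possible distance measures, and uses a hypothetical 3 × 3 portion of the 40 cm grid with coordinates in centimeters. The names `nearest_neighbour` and `is_correct` are illustrative.

```python
import numpy as np

def nearest_neighbour(query, map_descriptors):
    """Index of the map descriptor closest to the query (Euclidean distance assumed)."""
    dists = np.linalg.norm(map_descriptors - query, axis=1)
    return int(np.argmin(dists))

def is_correct(retrieved_index, test_xy, map_xy, n_admissible):
    """A retrieval is correct if it is one of the n_admissible geometrically
    nearest capture points: 1, 2 or 4 for cases (1), (2) and (3)."""
    geo = np.linalg.norm(map_xy - np.asarray(test_xy, dtype=float), axis=1)
    admissible = np.argsort(geo)[:n_admissible]
    return retrieved_index in admissible

# Hypothetical 3 x 3 portion of the training grid, in cm (index = row * 3 + column)
coords = [0.0, 40.0, 80.0]
map_xy = np.array([(x, y) for y in coords for x in coords])

rng = np.random.default_rng(9)
map_descriptors = rng.random((9, 16))                      # one synthetic descriptor per point
query = map_descriptors[5] + 0.01 * rng.standard_normal(16)
retrieved = nearest_neighbour(query, map_descriptors)

case2_ok = is_correct(1, (20.0, 0.0), map_xy, n_admissible=2)    # halfway between points 0 and 1
case2_bad = is_correct(3, (20.0, 0.0), map_xy, n_admissible=2)
case3_ok = is_correct(4, (20.0, 20.0), map_xy, n_admissible=4)   # equidistant to points 0, 1, 3, 4
case3_bad = is_correct(2, (20.0, 20.0), map_xy, n_admissible=4)
```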
These test images have been captured at different times of the day and on different days of the year, under real operating conditions, which makes the task harder. As a result, the test images include perceivable changes in lighting conditions and in the position of some pieces of furniture with respect to the training images, and some people appear in the scenes. These facts make the database more challenging.
Additionally, from each test position, 16 different images were captured with different orientations in the ground plane, with a step of 22.5° between consecutive rotations. Table 3 shows the number of test images per area. The descriptors included in this work are defined to be used with panoramic images. For that reason, we obtain the cylindrical projection of the omnidirectional images, and the panoramic scenes are then obtained by changing from the cylindrical to the Cartesian coordinate system. The resolution of the panoramic images is 128 × 512 pixels. Fig. 8 includes a sample image from each area. It should be noted that, since it is an office environment, several elements appear repeatedly in the different rooms with a similar appearance. For that reason, the images might present visual aliasing. In that case, the descriptors may lose their capacity to distinguish images due to the existence of similar scenes, and one of the objectives of the experiments is to check whether any description method is able to cope robustly with this phenomenon.
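As a small worked example of the relation between the rotation step and the panoramic image, the sketch below assumes that the 512 columns of the panorama span the full 360° around the robot, so a rotation in the ground plane becomes a circular shift of the columns. The function name and the shift direction are assumptions made for illustration.

```python
def rotate_panorama(image, angle_deg):
    """Circularly shift the columns of a panoramic image (list of rows).

    Assumes the image width covers 360 degrees; the sign of the shift
    is an arbitrary convention chosen for this sketch.
    """
    width = len(image[0])
    shift = round(angle_deg / 360.0 * width) % width
    return [row[shift:] + row[:shift] for row in image]

panorama = [list(range(512))]          # one row, 512 columns
step = rotate_panorama(panorama, 22.5)
print(step[0][0])                      # 32: one 22.5 degree step equals 32 columns

# Applying the 22.5 degree step 16 times gives a full turn.
full_turn = panorama
for _ in range(16):
    full_turn = rotate_panorama(full_turn, 22.5)
print(full_turn == panorama)           # True
```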
As an example, Fig. 9 shows two scenes from the corridor, captured from two different positions that are 240 cm apart. We can see that their appearance is very similar. Fig. 9 (c) includes the Fourier Signature of both images. Both descriptors are very similar despite the fact that the scenes are different and separated.
Additionally, the robustness of the descriptors is tested when partial occlusions or noise appear in the test images, to check whether they would be able to operate if these phenomena occurred in real-operation situations. The occlusions are introduced as four vertical stripes whose width is varied so that they cover different percentages of the panoramic images. Regarding the noise, zero-mean Gaussian noise with different variances is artificially added to the different color channels. Fig. 10 presents a panoramic image with examples of the occlusions, which vary from 5% to 40% of the image, and with Gaussian noise, whose variance takes values between σ = 0.0025 and σ = 0.0200. They constitute especially challenging situations for the methods.
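The two perturbations can be sketched as follows. This is only an illustration under stated assumptions: the stripes are black, equally wide and evenly spaced (the text only fixes their number and the covered percentage), pixel values lie in [0, 1], and noisy values are clipped to that range. All function names are hypothetical.

```python
import math
import random

def occlude(image, fraction, n_stripes=4):
    """Black out n_stripes vertical stripes covering about `fraction` of the columns."""
    width = len(image[0])
    stripe_w = round(fraction * width / n_stripes)
    out = [list(row) for row in image]
    for s in range(n_stripes):
        start = s * width // n_stripes      # assumed: evenly spaced stripes
        for row in out:
            for c in range(start, start + stripe_w):
                row[c] = (0.0, 0.0, 0.0)
    return out

def add_gaussian_noise(image, variance, rng):
    """Add zero-mean Gaussian noise to every color channel, clipping to [0, 1]."""
    sigma = math.sqrt(variance)
    return [
        [tuple(min(1.0, max(0.0, ch + rng.gauss(0.0, sigma))) for ch in px) for px in row]
        for row in image
    ]

# Small grey RGB test image: 4 rows x 512 columns.
grey = [[(0.5, 0.5, 0.5)] * 512 for _ in range(4)]
occluded = occlude(grey, 0.40)
covered = sum(px == (0.0, 0.0, 0.0) for px in occluded[0])
print(covered)  # 204 columns, i.e. roughly 40% of 512
noisy = add_gaussian_noise(grey, 0.0200, random.Random(0))
```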

V. LOCALIZATION FRAMEWORK AND EVALUATION
The main objective of the paper is to carry out a comparative evaluation of the descriptors presented so far in a localization framework, focusing on the relevance of color information. This section is structured as follows. First, subsection V-A presents the localization framework implemented to carry out the tests and the measures used to check the performance of the descriptors. Then, subsection V-B details the main parameters of the descriptors, whose sensitivity is studied throughout the experiments.

A. ESTIMATING THE POSITION OF THE ROBOT
As outlined previously, the localization problem is addressed as an image retrieval problem. First, the descriptor of each training image is obtained. This set of descriptors is considered as the map. Second, for every test image, its descriptor is obtained and compared with all the descriptors stored in the map. The descriptor that presents the minimum Euclidean distance (nearest neighbour) is retained. This association is considered to be correct if the retrieved image is the one which was captured from the geometrically closest point. The performance of this process is evaluated throughout this work using two representations: recall-precision curves, and geometric distance between the capture point of the test image (ground truth) and the capture point of the retrieved map image.
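The retrieval step described above can be sketched in a few lines. Descriptors are represented here as plain lists of numbers, and the function name and sample values are hypothetical.

```python
import math

def nearest_neighbour(test_descriptor, map_descriptors):
    """Return the index of the map descriptor with minimum Euclidean
    distance to the test descriptor, together with all the distances."""
    distances = [math.dist(test_descriptor, d) for d in map_descriptors]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances

# Toy map with three two-dimensional descriptors.
toy_map = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
index, dists = nearest_neighbour([0.9, 1.2], toy_map)
print(index)  # 1
```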
Recall-precision curves [50], [51] permit evaluating the performance of the descriptors in image association tasks. The concepts of recall and precision are defined as:

$$\text{recall} = \frac{\#\ \text{of correct matches retrieved}}{\#\ \text{total of correct matches}} \tag{25}$$

$$\text{precision} = \frac{\#\ \text{of correct matches retrieved}}{\#\ \text{of matches retrieved}} \tag{26}$$

This way, recall represents the ability of the descriptor to find all the correct associations, and precision represents its ability to keep finding correct associations as the number of experiments grows. Their values lie between 0 (which would indicate that no correct match has been retrieved) and 1 (which would mean that the descriptor has found all the correct matches).
The process to obtain the recall-precision curves is the following:

1) The Euclidean distance between the descriptor of the test image and the descriptor of each image in the map is calculated, $l_{Test,j}$ with $j = 1, \ldots, N$, where $N$ is the number of images in the map. After this step, an array of distances is available: $l_{Test} = [l_{Test,1}, l_{Test,2}, \ldots, l_{Test,N}]$.
2) The match between the test image and the map is determined from the minimum image distance in the vector $l_{Test}$. Once the algorithm has calculated the Euclidean distance between the test image and all the images of the map, it selects the association with the minimum distance.
3) The algorithm determines whether the match is correct by checking if the capture point of the retrieved image is the one which is metrically closest to the capture point of the test image (ground truth).
4) After repeating this process for all the test images ($N_{Test}$ is the number of test images), we obtain a matrix with $N_{Test}$ rows and two columns. The first column contains the minimum Euclidean distance of each test image ($\min l_{Test}$), and the second column contains the result of the match (1 or 0 depending on whether it is correct or not).
5) Then, we sort the association list in ascending order of image distance, and obtain the values of recall and precision according to (25) and (26).

The distribution of the recall-precision curves provides information about the robustness of the descriptors against false positives when a threshold is applied to the image distance. Therefore, it is desirable that the precision stays near 1 for every recall value, since this means that there are fewer false positives under that threshold. As an example, Fig. 11 shows two different recall-precision curves. Although the final values are similar, the distribution of the blue curve shows a better performance of the descriptor. If we set a threshold distance corresponding to recall = 0.3, the precision of R-P 1 is 100%, whereas that of R-P 2 is 82%. This means that we are able to obtain 30% of the correct matches with a probability of 100% in the first case, and with a probability of 82% in the second one.
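The steps above can be sketched as follows. The sketch assumes the usual reading of the two measures on the sorted association list: after the first n associations, precision is the fraction of those n that are correct, and recall is the fraction of all correct associations found so far. The function name and the sample data are hypothetical.

```python
def recall_precision_curve(associations):
    """associations: one (min_image_distance, is_correct) pair per test image.

    Returns the recall and precision values obtained while sweeping a
    threshold over the image distance in ascending order.
    """
    ranked = sorted(associations, key=lambda a: a[0])
    total_correct = sum(1 for _, ok in ranked if ok)
    recall, precision, hits = [], [], 0
    for n_retrieved, (_, ok) in enumerate(ranked, start=1):
        hits += int(ok)
        recall.append(hits / total_correct if total_correct else 0.0)
        precision.append(hits / n_retrieved)
    return recall, precision

# Five hypothetical test images: (minimum distance, correct match?).
sample = [(0.2, True), (0.9, False), (0.4, True), (0.6, True), (1.1, False)]
recall, precision = recall_precision_curve(sample)
print(precision)  # [1.0, 1.0, 1.0, 0.75, 0.6]
```

In this toy case the three correct matches have the lowest image distances, so the precision stays at 1 until every correct match has been found, which is the desirable shape discussed above.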
To complement the results, we consider three different cases when creating the recall-precision curves: considering only the image with the minimum image distance (Nearest Neighbour or N.N.), looking for a correct match within the two images with the lowest image distance (Second Nearest Neighbour or S.N.N.), or within the three images with the lowest image distance (T.N.N.).
We consider it interesting to analyze S.N.N. and T.N.N. in addition to N.N. for the following reason. In this paper we focus on the role of color in the performance of the descriptors. Therefore, we address the localization task in a straightforward way: as a global localization problem, solved with an image retrieval approach (among all the images in the dataset). This assumes that we have no information about the previous pose of the robot while solving the problem, and only visual information is used. However, many local localization algorithms exist nowadays (e.g. probabilistic algorithms) that take into account the pose of the robot at the previous time instant to estimate the pose at the current time instant (in addition to the odometry and visual information). In such algorithms, not only the N.N. but also other k-NN neighbours could play an important role, owing to the fact that the previous pose is known and the new pose should be at a relative distance from it.
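The three cases differ only in how many of the lowest-distance map images are inspected. A minimal sketch, with hypothetical names and values, could look like this:

```python
def correct_within_k(distances, correct_indices, k):
    """True if any of the k map images with the lowest image distance
    belongs to the set of map images accepted as correct.

    k = 1 corresponds to N.N., k = 2 to S.N.N. and k = 3 to T.N.N.
    """
    ranked = sorted(range(len(distances)), key=distances.__getitem__)
    return any(i in correct_indices for i in ranked[:k])

# Image distances from one test image to four map images; map image 2
# is the geometrically correct one in this toy example.
distances = [0.8, 0.3, 0.5, 0.9]
results = [correct_within_k(distances, {2}, k) for k in (1, 2, 3)]
print(results)  # [False, True, True]
```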

B. PARAMETERS OF THE DESCRIPTORS
In Section II, we detail each of the compression techniques based on the global appearance that we include in this comparison. Next, we present a summary of the parameters of each one.
Regarding the descriptors based on the DFT, it is possible to select the number of elements of the transform in the frequency space. In the case of the 1D-DFT, the parameter is the number of elements retained from the transformed sequence. In the case of 2D-DFT, we can select the size of the submatrix that gathers the lower frequencies of the image. Finally, as for the FS, the parameter is the number of elements kept from every row. We select separately the number of elements retained from the magnitudes' matrix ($N_{pos}$), which allows us to carry out the estimation of the position of the robot in the map, and the number of elements retained from the arguments' matrix ($N_{rot}$), which provides information to estimate the orientation.
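As an illustration of how the two FS parameters act, the sketch below computes the DFT of every row of a small image and keeps the first coefficients of each row, split into magnitudes and arguments. It uses a naive DFT written for clarity rather than speed, and the function names and the toy image are hypothetical. The example also checks the property that makes the magnitudes suitable for position estimation: a circular shift of the columns (a rotation of the robot) leaves them unchanged.

```python
import cmath

def dft_first(row, n_coeffs):
    """First n_coeffs DFT coefficients of a sequence (naive implementation)."""
    n = len(row)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(row))
        for k in range(n_coeffs)
    ]

def fourier_signature(image, n_pos, n_rot):
    """Row-wise DFT keeping n_pos magnitudes and n_rot arguments per row."""
    magnitudes, arguments = [], []
    for row in image:
        coeffs = dft_first(row, max(n_pos, n_rot))
        magnitudes.append([abs(c) for c in coeffs[:n_pos]])
        arguments.append([cmath.phase(c) for c in coeffs[:n_rot]])
    return magnitudes, arguments

image = [[0.1, 0.4, 0.9, 0.3, 0.2, 0.7, 0.5, 0.6],
         [0.8, 0.2, 0.1, 0.6, 0.9, 0.3, 0.4, 0.5]]
rotated = [row[3:] + row[:3] for row in image]   # circular shift of 3 columns

mag_a, _ = fourier_signature(image, n_pos=4, n_rot=4)
mag_b, _ = fourier_signature(rotated, n_pos=4, n_rot=4)
max_diff = max(abs(a - b) for ra, rb in zip(mag_a, mag_b) for a, b in zip(ra, rb))
print(max_diff < 1e-9)  # True: the magnitudes do not change under rotation
```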
The technique that applies PCA over the FS is determined by the number of elements per row retained from the magnitudes' matrix ($N_{pos}$), and the number of main eigenvectors that compose the new projection basis ($V_{PCA}$). The orientation is estimated as in the case of FS ($N_{rot}$). Regarding rotational PCA, the main parameters of the descriptor are the number of rotations per image ($R_{im}$) and the number of eigenvectors that compose the new basis ($V_{PCA}$).
The parameters of HOG are the number of horizontal cells (cells with the same width as the panoramic image, denoted as $C_H$) used for localization purposes, and the width of the vertical cells (cells with the same height as the image, denoted as $S_V$) together with the distance between these vertical cells ($D_V$), used for the estimation of the orientation. Horizontal cells do not overlap, so their number is determined by their height and the number of rows of the image. However, vertical cells may overlap if the distance between consecutive cells is lower than their width. In the case of HOG, the number of bins per histogram is another parameter. We set this parameter to 8, since preliminary experiments showed that the precision does not improve when the number of bins is increased, whereas it decreases when the number of bins is lower.
The parameters of gist-Gabor are the number of masks the image is filtered with, and the number of cells used to divide the filtered images. For the localization descriptor, we use two spatial scales for Gabor filtering in order to limit the computational cost. The variables are the number of masks considered in each of the two spatial scales ($Masks_1$ and $Masks_2$), and the number of horizontal cells that divide each filtered image ($C_H$). The filtering direction of the Gabor masks depends on the number of masks, since they are equally distributed between 0° and 180°. For the orientation descriptor, we only use the information in the first level of Gabor spatial filtering, with a maximum of 4 masks. So, for orientation, the parameters that define the descriptor are the width of the vertical cells ($S_V$) and the distance between them ($D_V$).
Finally, gist-color always uses the same number of Gabor masks to filter the image, as stated in Section II, with 4 orientations. The filtering spatial scales are determined by the number of scales of the Gaussian pyramid. When a new image arrives, we create a pyramid with six levels. For the Gabor filtering, we use the first three levels of the pyramid. Regarding the color features, the six levels are used to carry out the comparison between opposite color channels, as shown in Table 1. The parameters of the position descriptor are the number of horizontal blocks used to blockify the information of the Gabor-filtered images, denoted as $C_{HG}$, and the number of cells used to blockify the color information ($C_{HC}$). For orientation, we use only the information of the Gabor masks. The parameters are the number of vertical cells ($S_V$) and the distance between them ($D_V$).
In Table 4, a summary of the different parameters of each descriptor is included.

VI. RESULTS
The experimental section focuses on the comparative evaluation of the performance of the global-appearance descriptors with color information. We study the precision in the pose estimation (both position and orientation), using all the images in the test dataset, and comparing them with the map built from the descriptors of the training images (section IV). We also include a comparison of the computational time of each descriptor in the map building and pose estimation tasks.
This section is structured as follows. First, subsection VI-A carries out a study using the initial description methods presented in Section II (FS, HOG, gist-Gabor and gist-color) to adjust the different descriptor parameters, and to make a first comparison of the computational requirements and performance of the descriptors. Then, subsection VI-B completes the experimental part by including the use of color information as described in Section III. Finally, the performance of the descriptors when noise or occlusions are present in the test images is evaluated in subsection VI-C.
All the algorithms and simulations have been developed using Matlab. The experiments have been performed using a computer with two quad-core processors at 2.8 GHz and 10 GB of RAM. It is necessary to point out that it has not been possible to run rotational PCA with the whole training dataset because of its excessively large computational requirements, especially in terms of RAM. For that reason, it appears with an asterisk in the graphs. The experiments with this descriptor use only three rooms of the training dataset, which correspond to the three offices, i.e. zones 2, 3 and 4 (Table 3). Only in the case of this descriptor, the reduced map is composed of 191 images, with 32 test locations, which means 512 test images considering their rotations. In the case of the other descriptors, all the training and test images are considered.
In the remainder of this work we use the term Precision with the meaning stated in (26), and Accuracy to refer to the performance of the localization algorithms, as far as geometric distance or orientation (measured in the ground plane between the pose of the test image and the pose of the nearest neighbour) are concerned.

A. RESULTS OBTAINED USING RAW DESCRIPTORS
In order to select the parameters of each descriptor and make a first feasibility study, we consider only the initial descriptors (as presented in Section II) and the original space of representation of each technique. This space is grayscale in the case of the descriptors based on DFT, HOG and gist-Gabor, and RGB in the case of gist-color. To tune the parameters of the descriptors, it is necessary to check both the performance in position and orientation estimation and the necessary calculation time. After a sensitivity analysis, the values selected for each parameter are shown in Table 5. The resulting recall-precision curves after solving the image retrieval problem with each of the different descriptors are shown in Fig. 12. They show that HOG and gist-color (Fig. 12(f) and 12(h)) perform better than the other descriptors. Moreover, their rates of false positives are very similar. The results of rotational PCA can also be highlighted, especially its low percentage of false positives up to a relatively high recall value.
Next, Fig. 13 presents the accuracy of the position estimation process. The legend shows the average geometric distance between the capture point of each test image (ground truth) and the capture point of the nearest neighbour of the map (measured as the Euclidean distance on the floor). The figure shows the percentage of experiments under a specific geometric distance. We can appreciate that the results behave similarly to the recall-precision curves, with HOG and gist-color being the descriptors that present the best performance, especially HOG. Although FS, 2D-DFT and gist-Gabor present similar final values of precision, the descriptor based on gist shows a better performance regarding the geometric distance to the image retrieved from the map, obtaining results similar to those of rotational PCA. It is important to highlight that, for some experiments, it is not possible to have an error d ≤ 10 cm, due to the resolution of the grid and the position of the capture points of the test images (e.g. some test images are in the middle of the 40 × 40 cm grid of the map). To understand how significant these rates are and the relative performance of each localization algorithm, the size of the whole environment, which is shown in [49], must be considered, since the experiments include all the images of the map and are not limited to individual rooms (i.e. the localization is approached as a global localization problem).
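Two quantities from this paragraph can be made concrete with a short sketch: the percentage of experiments whose geometric error is under a given distance, and the smallest error achievable for a test image captured at the centre of a 40 cm × 40 cm grid cell. The function name and the sample errors are hypothetical.

```python
import math

def percentage_under(errors_cm, thresholds_cm):
    """Percentage of experiments whose geometric error is <= each threshold."""
    n = len(errors_cm)
    return [100.0 * sum(e <= t for e in errors_cm) / n for t in thresholds_cm]

# Hypothetical geometric errors (cm) of five localization experiments.
errors = [0.0, 20.0, 28.3, 40.0, 120.0]
rates = percentage_under(errors, [10, 40, 100])
print(rates)  # [20.0, 80.0, 80.0]

# A test image at the centre of a grid cell is 20 cm away from its
# nearest map images along each axis, so its best possible error is:
best_centre_error = math.hypot(20.0, 20.0)
print(round(best_centre_error, 2))  # 28.28 cm, which is above 10 cm
```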
Regarding the orientation estimation, Fig. 14 presents the error in the estimation. To evaluate the performance of the descriptors in the orientation estimation, we consider only the associations whose geometric localization error is lower than 40 cm. That way, we avoid estimating the phase lag between images that are too far from each other, and the performance of each descriptor in orientation estimation is addressed more realistically, making this study less dependent on the accuracy of the position estimation. The techniques that present the best results in the orientation estimation are rotational PCA, gist-Gabor, gist-color and HOG. However, we should remark that in those descriptors the phase information is sampled, either by the number of rotations of the images included in the map (in rotational PCA) or by the number of vertical cells.
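The filtering step used in this evaluation can be sketched as follows. The sketch assumes that the orientation error is measured as the smallest angle between the estimated and the true orientation, so that 350° and 10° differ by 20°; the function names and the sample values are hypothetical.

```python
def angular_error(estimated_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(estimated_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)

def mean_orientation_error(results, max_position_error_cm=40.0):
    """results: (position_error_cm, estimated_deg, true_deg) per experiment.

    Only associations whose position error is lower than the limit are
    kept. Returns None if no association passes the filter.
    """
    kept = [
        angular_error(est, true)
        for pos_err, est, true in results
        if pos_err < max_position_error_cm
    ]
    return sum(kept) / len(kept) if kept else None

experiments = [(0.0, 350.0, 10.0), (28.3, 22.5, 45.0), (80.0, 0.0, 180.0)]
mean_error = mean_orientation_error(experiments)
print(mean_error)  # 21.25: the third experiment is discarded (80 cm away)
```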
The error in orientation estimation with FS and 2D-DFT is similar. FS and PCA over FS estimate the orientation using the same algorithm, since PCA is not applied to the phase information. The slight difference between both results is a consequence of the different associations made during the position estimation. 1D-DFT presents the highest error in the phase estimation. Even so, in 80% of the experiments it provides an error equal to or less than 10° using only 4 terms per image.
In these experiments, the map is composed of the descriptors of all the images from the different rooms. Fig. 15 shows the size of the map using the different descriptors, including the memory needed to store position and orientation information separately. The most compact descriptor is 1D-DFT, followed by HOG and the gist descriptors. Improving the orientation accuracy of HOG and gist would noticeably increase the size of the orientation descriptor. Regarding rotational PCA, the memory requirements include the projection basis with the selected eigenvectors, the projection of the original map onto the new basis, and the difference of phases between consecutive projections. As shown in Fig. 15, the orientation information is insignificant compared to the location information. However, to improve the accuracy in orientation estimation, more rotated siblings of each initial image should be considered, and the computational cost of the mapping process would be even greater. Finally, we can see that, after the projection onto the new basis, the position estimation database of FS is reduced from 29 Mbytes to 4 Mbytes when PCA is applied.
Fig. 16 shows the time spent in the map building and in the pose estimation of the robot, including both position and orientation. Regarding the map building (Fig. 16(a)), the techniques based on DFT can be considered the most efficient, except for PCA over FS: since PCA is a computationally expensive process, its map building time is 15 times greater than that of FS. Rotational PCA is the algorithm that spends the most time in the map building task. Additionally, HOG and the gist descriptors require more time than the descriptors based on DFT, especially gist-color.
The pose estimation time, shown in Fig. 16(b), includes the time necessary to create the descriptor of the test image and to estimate the position and the orientation. HOG and the gist descriptors present similar times to create the descriptor. FS and 2D-DFT present a relatively high pose estimation time compared to their map creation time. This is due to the estimation of the orientation, since it is a computationally complex process, especially in the case of FS. However, since 1D-DFT only uses 4 phase components, it is not affected by this fact. Finally, rotational PCA is one of the fastest algorithms in the pose estimation, since it only projects the image onto the new basis and calculates the image distance.

B. RESULTS OBTAINED USING COLOR INFORMATION IN THE DESCRIPTORS
In this section, the results of the position estimation and computational requirements of each algorithm using the color information are presented. The comparison includes the application of the methods 1D-DFT, FS, 2D-DFT, PCA over FS, rotational PCA, gist-Gabor and gist-color to different color channels, to obtain a variety of descriptors. The performance of each descriptor will be tested subsequently in a position and orientation estimation task.
The following combinations are tested: (a) applying the description method to each RGB channel, to obtain a unique descriptor per scene; (b) applying it to each HSV channel; (c) appending the descriptors obtained with (a) and (b) to create a unique descriptor; (d) applying the description method to the intensity channel and appending the color information using the Color Histograms (CH), as seen in (24). It is worth highlighting that the CH information has been used differently in each PCA-based method, to adapt it to each description method. In the case of PCA over FS, the FS descriptor and the CH are first created and appended to form a vector, and PCA is then performed with these data vectors. However, in the case of rotational PCA, the projection of the images onto the new basis is obtained first, and then the CH vector is appended to this projection. In order to complete the experimental evaluation, we create a version of gist-color for the grayscale space, in which the comparison among the color channels is replaced by the multiscale comparison in the grayscale space. As far as the experiments are concerned, we use the parameters included in Table 5. For the Color Histograms, we use 8 or 16 horizontal cells depending on the descriptor, and 32 bins per histogram. In order to limit the number of variables when comparing the performance of the different descriptors and color spaces, in the combination of the greyscale descriptor and the Color Histogram (combination (d), denoted as Greyscale+CH in the experiments) we set the weighting coefficients of the spatial and color information to the same value ($w_{spatial} = w_{color} = 0.5$). At the end of the work, we include a study of the effect of varying these weighting coefficients in the different descriptors. Fig. 17 shows the precision in the estimation of the position using all the combinations while building the descriptor.
According to the results, the descriptors improve their performance substantially when the color information is used, except FS with RGB, 2D-DFT with RGB+HSV and rotational PCA with RGB+HSV. The improvement of 1D-DFT applied to HSV is especially remarkable, since it triples its precision considering the T.N.N., from 19% to 57%. The performance of HOG with CH can also be highlighted, with a T.N.N. precision over 80%. The results with the RGB color space show no significant improvement with any descriptor compared to grayscale. This is due to the high correlation between channels R, G and B, as stated in Section III. The exception is gist-color. As stated before, this descriptor is specifically designed for the RGB space, since it includes the color opponency comparison of Hering (section II-D.2). When we add the information of the Color Histograms, the performance of all the descriptors improves. 2D-DFT, HOG and gist-Gabor especially benefit from the addition of this color information.
In general, the application of the descriptors to the HSV channels presents the same or better results than their application to the RGB channels. The improvement is especially significant in the case of 1D-DFT, FS and rotational PCA.
The memory necessary to store the map (position descriptors) using each combination is presented in the bar graph included in Fig. 18. As expected, the use of the RGB or HSV color spaces triples the memory of the map, and the joint use of both color spaces (RGB+HSV) multiplies it by six. The Color Histogram adds a fixed quantity of information. PCA over the FS is the only descriptor that does not increase the size of the map when different color spaces are used, since PCA is applied to all the information that composes the original basis, and a fixed number of eigenvectors is selected in all the cases.
Fig. 19 shows the time necessary to build the map (Fig. 19(a)) and to estimate the pose of the robot in this map (Fig. 19(b)). As far as the map building task is concerned, PCA over FS and rotational PCA are the algorithms that take the most time. This fact confirms that PCA is a computationally expensive process, especially when the size of the map increases due to the use of the RGB and HSV color spaces. Last, gist techniques require, in general, more time than methods based on DFT or HOG.
In the pose estimation task, methods based on PCA are remarkably fast. Additionally, except for 1D-DFT, all the techniques based on DFT take approximately the same time as the gist methods. HOG is a descriptor with a relatively low computational time. Regarding the color spaces, when we increase the number of color channels, the time rises, as expected. When we apply the descriptors to HSV, although the number of channels is the same as in RGB, the time increases slightly due to the color space transformation of the original image.
The calculation of the Color Histogram varies between about 0.05 and 0.1 seconds depending on whether we use 8 or 16 cells per image. In the case of 1D-DFT, PCA over FS and rotational PCA, when we add the CH, the estimation of the pose takes more time than when we use the other color channels. For FS and 2D-DFT, the pose estimation time using the CH is only lower than RGB+HSV.
In general, the precision when we use the color information is higher than using only greyscale space. HSV and grayscale+CH are the methods that present the best results, except for PCA over FS, that achieves the best performance using RGB+HSV.
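The Greyscale+CH combination used in this subsection can be illustrated with a minimal sketch. Equation (24) is not reproduced here, so the form below is an assumption: the color channel is divided into horizontal cells, a normalized 32-bin histogram is built per cell, and the final image distance is a weighted sum of the spatial and color distances. The function names, the choice of channel and the toy data are hypothetical.

```python
import math

def cell_histograms(channel, n_cells, n_bins=32):
    """Concatenated normalized histograms of the horizontal cells of a
    channel (list of rows with values in [0, 1])."""
    rows_per_cell = len(channel) // n_cells
    descriptor = []
    for c in range(n_cells):
        counts = [0] * n_bins
        for row in channel[c * rows_per_cell:(c + 1) * rows_per_cell]:
            for v in row:
                counts[min(int(v * n_bins), n_bins - 1)] += 1
        total = sum(counts)
        descriptor.extend(k / total for k in counts)
    return descriptor

def combined_distance(spatial_a, spatial_b, color_a, color_b,
                      w_spatial=0.5, w_color=0.5):
    """Assumed form: weighted sum of the two Euclidean distances."""
    return (w_spatial * math.dist(spatial_a, spatial_b)
            + w_color * math.dist(color_a, color_b))

channel = [[0.5] * 4 for _ in range(16)]       # toy 16 x 4 channel
ch = cell_histograms(channel, n_cells=8)
print(len(ch))                                 # 256 = 8 cells x 32 bins
d = combined_distance([0.0, 0.0], [3.0, 4.0], [0.0], [1.0])
print(d)                                       # 3.0 = 0.5 * 5 + 0.5 * 1
```

With 8 or 16 cells and 32 bins, the histogram part has a fixed length of 256 or 512 values per channel regardless of the spatial descriptor it is combined with.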

C. ROBUSTNESS AGAINST THE PRESENCE OF OCCLUSIONS OR NOISE IN THE TEST IMAGES
In order to complete this comparative evaluation, a set of additional experiments is performed to test the robustness of the descriptors. In these experiments, the effect of Gaussian noise and partial occlusions in the test images is tested. These experiments allow us to check the performance of the different techniques in challenging, complex and changing environments, as happens in real environments under real-operation conditions. Fig. 10 shows a test image with different percentages of occlusion and with Gaussian noise with different variances. These challenging test images are used in this section to test the performance of the descriptors. Fig. 20 shows the results of the occlusion experiments. The horizontal axis shows the different description combinations considered in the analysis and, for each combination, the percentage of occlusion in each test image.
According to the results, the descriptors that are most negatively affected by occlusions are 1D-DFT and rotational PCA, especially the latter. HOG and the gist descriptors are less sensitive to occlusions in the image, especially HOG using Greyscale+CH and gist-color using any color method. In general, the higher the percentage of occlusion, the lower the precision, as expected. However, some descriptors present a relatively good behaviour for occlusion percentages up to 10%, such as FS with HSV, 2D-DFT with HSV and gist-color.
Additionally, Fig. 21 shows the performance of the descriptors when the test images are affected by different levels of Gaussian noise. The horizontal axis shows the different description combinations considered in the analysis and, for each combination, the variance of the Gaussian noise which is present in each test image. We can see that the HSV color space is especially sensitive to Gaussian noise. Only HOG along with HSV presents a precision which is similar to the precision obtained with other color spaces. To obtain the images with noise, the Gaussian noise is added to the R, G and B channels of each original test image separately. Fig. 22 shows the channels H, S and V of a test image without noise and with Gaussian noise with mean equal to zero and σ = 0.0200. We can observe that channels H and S are especially affected by noise, making it almost impossible to recognize the original image. When the descriptors use the Color Histograms, the performance decreases when the images present noise. Comparing the results of grayscale and grayscale+CH, the CH improves the location precision only with 1D-DFT. As far as the description techniques are concerned, FS, 2D-DFT and rotational PCA present a better performance.
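One plausible contributing factor to the sensitivity of the H and S channels can be illustrated with a small numerical example. It is offered only as an illustration, not as an explanation given by the experiments: for dark, nearly grey pixels, a very small perturbation of the R, G and B values can change the hue completely while the value channel stays the same. The pixel values below are hypothetical and the conversion uses the standard `colorsys` module.

```python
import colorsys

# A dark, almost grey pixel and a copy perturbed by only 0.01 in two
# channels, far less than the standard deviation of the noise tested.
dark = (0.10, 0.10, 0.11)
perturbed = (0.11, 0.10, 0.10)

h0, s0, v0 = colorsys.rgb_to_hsv(*dark)
h1, s1, v1 = colorsys.rgb_to_hsv(*perturbed)

print(round(h0 * 360), round(h1 * 360))  # 240 0: the hue jumps from blue to red
print(v0 == v1)                          # True: the value channel is unchanged
```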
Next, we include the results of the orientation estimation when the test images are affected by occlusions (Fig. 23(a)) and by Gaussian noise (Fig. 23(b)). As in subsection VI-A, the error in the estimation of the orientation is only calculated in the experiments whose position error is equal to or less than 40 cm in the map. Moreover, the orientation information is calculated only from the greyscale space. First, regarding occlusions, methods based on the DFT present a significant increase of the error. Results show that, for 40% occlusion, the error doubles compared to the original test images (with no occlusion). This is especially significant in the case of 1D-DFT. Rotational PCA shows the best performance in the orientation estimation. In the case of gist-Gabor, the average error is below 3° for any occlusion level of the test image. HOG and gist-color present a similar precision, with a mean error lower than 8°. Second, regarding the presence of Gaussian noise, the most affected descriptors are gist-Gabor and PCA over FS. The orientation error in gist-Gabor is especially high, with the mean error multiplied by 6 when the noise variance is 0.020. Techniques based on DFT (except for PCA over FS) present little variation when the test image is affected by noise, with a similar error in all the cases. HOG and gist-color present a better performance in the orientation estimation when the image presents noise than when it presents occlusions.
Finally, an experiment has been conducted to study the effect of considering different weighting factors in (24) when combining the greyscale descriptor and the Color Histogram (denoted as Greyscale+CH). Fig. 24 includes the precision in localization for all the descriptors using Greyscale+CH when the weighting coefficients vary. In general, the precision is higher when the combination $w_{spatial}$ − $w_{color}$ is 0.5 − 0.5 or 0.6 − 0.4. However, in 1D-DFT we can see that the results improve when $w_{color}$ increases.
The reason is that this spatial descriptor is very compact and contains little information about the scene. On the contrary, the gist descriptors and PCA over FS present a better performance when $w_{spatial}$ is higher, especially in the Nearest Neighbour experiments.

VII. CONCLUSION
In this work, the role of color information in the construction of global-appearance descriptors has been explored. A complete comparative evaluation has been carried out to uncover the performance of a variety of description methods and color channels in a pose estimation task. This evaluation has included the calculation time and the memory requirements of the different descriptors in the creation of a dense map, and the time consumed and precision in the estimation of the position and orientation of a robot in this map. Also, the robustness of these methods against the presence of noise or occlusions has been studied.
Next, we gather the main conclusions of this evaluation:

Position estimation and computational requirements
• In general, the use of the color information improves the performance of the descriptor in the localization task. This happens with all the description techniques.
• The HSV color space provides better results than RGB, except when the test images are affected by Gaussian noise.
• When the Color Histogram information is appended to the descriptors, the percentage of correct localization improves (comparing it with the case of using only the information in the grayscale space).
• Except for the FS and 2D-DFT, appending the Color Histogram information has a higher computational cost than obtaining the color information by applying the descriptor over the RGB and/or HSV channels.
• The combined use of RGB+HSV does not improve the localization performance compared to the use of a single color space (RGB or HSV), but it significantly increases the computational requirements (both time and memory).
• Rotational PCA presents high precision in the pose estimation task, although its computational requirements make it infeasible for dense map building in large environments. Moreover, rotational PCA and PCA over FS are the only techniques that do not permit building the map incrementally (i.e. all the images to be included in the map must be available initially). When a new image must be added to the map, PCA must be carried out from scratch with all the images (including the new one).
• HOG presents a good trade-off between precision and computational requirements, especially when we introduce the Color Histogram to the descriptor. Moreover, it is a very compact descriptor.
• 1D-DFT is the most compact descriptor using any color space. Although the precision of position estimation is low when we use only the grayscale image and the RGB color space, the precision increases to 58% when we use HSV. It becomes an interesting descriptor if the application has important restrictions of time and, especially, memory.
• The FS and 2D-DFT present a reduced computational cost during the map building process. However, the memory needed to store the map is high compared to the rest of descriptors, and so is the calculation time for the pose estimation, because of the orientation estimation algorithm.
• Gist-color shows a better performance in the localization task than gist-Gabor, although it also needs more time during the map building and pose estimation processes.
• The gist descriptors need more time than the descriptors based on the DFT in the map building process, although they lead to maps whose size is substantially lower.
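The difference between incremental and non-incremental map building noted above can be sketched as follows. This is only an illustration under stated assumptions: the class names `IncrementalMap` and `PCAMap` are hypothetical, the PCA is computed with a plain SVD of the centered data, and neither class reproduces the descriptors evaluated in the paper.

```python
import numpy as np


class IncrementalMap:
    """Map whose descriptors are independent of each other (e.g. DFT, HOG or gist)."""

    def __init__(self):
        self.descriptors = []

    def add(self, descriptor):
        # Adding an image only stores its own descriptor; nothing is recomputed.
        self.descriptors.append(np.asarray(descriptor, dtype=float))


class PCAMap:
    """Map whose descriptors are projections onto a basis learned from all images."""

    def __init__(self, n_components):
        self.n_components = n_components
        self.raw = []   # every original vector must be kept for refitting
        self.fits = 0   # counts how many times the whole PCA was recomputed

    def add(self, vector):
        self.raw.append(np.asarray(vector, dtype=float))
        self._refit()

    def _refit(self):
        # The mean and the basis depend on every image, so both are rebuilt
        # from scratch, and all stored projections change with them.
        data = np.vstack(self.raw)
        self.mean = data.mean(axis=0)
        _, _, vt = np.linalg.svd(data - self.mean, full_matrices=False)
        self.basis = vt[: self.n_components]
        self.projections = (data - self.mean) @ self.basis.T
        self.fits += 1
```

In this sketch, inserting N images into `PCAMap` triggers N full decompositions, whereas `IncrementalMap` performs N constant-cost insertions. In practice, a PCA-based map would more likely be built once, when all images are available.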

Estimation of the orientation
• Regarding the estimation of the orientation, all the descriptors present an average error lower than 8° when the test image is not affected by occlusions or noise.
• The DFT-based techniques produce a higher error in orientation estimation compared to the other description techniques.
• However, it is worth noting that, in the case of gist, HOG and rotational PCA, the angular resolution depends directly on the information included in the descriptor, since this information is sampled. This resolution can be increased at the expense of increasing the size of the descriptor and the calculation time, both to build the map and to solve the orientation estimation problem. Therefore, in those descriptors, the phase information is less flexible and more sensitive than in the descriptors based on the DFT.
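The contrast between sampled and phase-based orientation estimation can be illustrated with a minimal sketch. It relies only on the shift theorem of the DFT (a circular shift of a panoramic row changes the phase of its Fourier components but not their magnitude) and on a brute-force shift search for the sampled case. Both functions are hypothetical illustrations, not the orientation algorithms evaluated in the paper.

```python
import numpy as np


def rotation_from_phase(row_ref, row_test):
    """Relative rotation (degrees) from the phase of the first DFT component.

    Assumes row_test is a circularly shifted copy of row_ref, as happens with
    the rows of a panoramic image when the robot rotates on the spot.
    """
    f_ref = np.fft.fft(row_ref)
    f_test = np.fft.fft(row_test)
    # Shift theorem: F_test[k] = F_ref[k] * exp(-2j*pi*k*q/N), so the phase of
    # F_ref[1] * conj(F_test[1]) equals 2*pi*q/N for a shift of q columns.
    phase = np.angle(f_ref[1] * np.conj(f_test[1]))
    return float(np.degrees(phase) % 360.0)


def rotation_from_shift_search(desc_ref, desc_test):
    """Relative rotation (degrees) by testing every circular shift of the cells.

    The result is quantized in steps of 360 / M degrees, where M is the number
    of cells, so a finer resolution requires a longer descriptor.
    """
    n_cells = len(desc_ref)
    errors = [
        np.linalg.norm(np.roll(desc_ref, s) - desc_test) for s in range(n_cells)
    ]
    return 360.0 * int(np.argmin(errors)) / n_cells
```

The phase-based estimate is not restricted to multiples of 360/M degrees, while the shift search costs one descriptor comparison per candidate angle. When there is no rotation, the phase-based value may come out marginally below 360 instead of exactly 0 because of floating-point rounding.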

Localization when the test images present partial occlusions
• In general, the effect of the occlusions is more significant when we use the color spaces than when we use grayscale images. However, the different descriptors still present a better performance using color spaces than using the grayscale space.
• Gist and HOG are the techniques least affected by occlusions. The results obtained with the combinations HOG using grayscale+CH, and gist-color over RGB and HSV, are especially remarkable.
• In the orientation estimation, the DFT-based descriptors are the least robust against the presence of occlusions in the test images, especially 1D-DFT. However, except for this descriptor, the average error remains below 8°.

Localization when the test images present noise
• Gaussian noise remarkably affects the Hue and Saturation channels of the HSV color space. This implies a significant reduction in the localization precision when we use the HSV and RGB+HSV color spaces.
• Descriptors based on the DFT, rotational PCA and gist-color present the smallest reduction in the precision when noise is present in the test images.

LUIS PAYÁ received the M.Eng. degree in industrial engineering in Spain, in 2002, and the Ph.D. degree in industrial technologies, in Spain, in 2014. He currently works as an Associate Professor with the Department of Systems Engineering and Automation, Miguel Hernández University, in Spain. He teaches some subjects related to the fields of automatic control, electronics, and robotics. He has authored several books, articles, and communications in his research topics. His current research interests include omnidirectional vision and global appearance algorithms, topological map building and localization of mobile robots, and also the implementation and testing of remote laboratories.
WALTERIO MAYOL-CUEVAS (Member, IEEE) received the B.Sc. degree from the National University of Mexico and the Ph.D. degree from the University of Oxford. He is a member of the Department of Computer Science, University of Bristol. His research with students and collaborators proposed some of the earliest versions of visual simultaneous localization and mapping (SLAM) and its applications to robotics and augmented reality. These include flagship humanoid robots and early commercial applications of visual mapping for wearable computing. His most recent works include novel concepts of human-robot interaction, fast computer vision methods for scene understanding, and algorithms for novel visual sensors and machine learning applications to assess its skill. He is the General Co-Chair of BMVC 2013 and the General Chair of the IEEE ISMAR 2016.
LUIS MIGUEL JIMÉNEZ received the degree in industrial engineering from the Polytechnic University of Madrid and the Ph.D. degree in robotics and automation from Miguel Hernández University, Elche, Spain. He is an Associate Professor with Miguel Hernández University, in the areas of systems engineering and automation. His research interests are focused in the fields of automation, robotics, and computer vision with the ARVC Research Group, Miguel Hernández University.
OSCAR REINOSO (Senior Member, IEEE) received the degree in industrial engineering and the Ph.D. degree from the Polytechnic University of Madrid (UPM), in 1991 and 1996, respectively. From 1994 to 1997, he worked with the Research and Development Department, Protos Desarrollo, in a visual inspection system. Since 1997, he has been with Miguel Hernández University, as a Professor, in control, robotics, and computer vision. He has authored several books, articles, and communications in his research topics. His research interests include robotics, teleoperated robots, climbing robots, visual servoing, and visual inspection systems. He is a member of the CEA-IFAC.