Rotation Equivariant Convolutional Neural Networks for Hyperspectral Image Classification

Detection of surface material based on hyperspectral imaging (HSI) analysis is an important and challenging task in remote sensing. It is widely known that spectral-spatial data exploitation performs better than traditional spectral pixel-wise procedures. Nowadays, convolutional neural networks (CNNs) have shown to be a powerful deep learning (DL) technique due their strong feature extraction ability. CNNs not only combine spectral-spatial information in a natural way, but have also shown to be able to learn translation-equivariant representations, i.e. a translation of input features into an equivalent internal CNN feature map. This provides great robustness to spatial feature locations. However, as far as we know, CNNs do not exhibit a natural way to exploit rotation equivariance, i.e. make use of the fact that data patches in a HSI data cube are observed in different orientations due to their orientation or on the varying paths/orbits of the airborne/spaceborne spectrometers. This article presents a rotation-equivariant CNN2D model for HSI analysis, where traditional convolution kernels have been replaced by circular harmonic filters (CHFs). The obtained results over three well-known HSI datasets showcase the potential of the approach.


I. INTRODUCTION
Imaging spectrometry data, in general, and hyperspectral imaging (HSI) in particular [1] are widely used remote sensing technologies for a large variety of applications due their abundant information about earth-surface material, such as: natural resources management [2], for instance in those activities related to forestry [3]- [5], geology/mineralogy [6]- [8] or hydrology [9]- [11]; precision agriculture [12]- [14], soil degradation [15]- [17] and crop stress/disease detection [18]- [20]; urban planning [21]- [23]; risk prevention [24]- [26], and disaster monitoring [27]- [29], among others. This information is gathered by operational HSI systems (aerial and satellite-based imaging spectrometers) [30], [31] in hundreds of channels that collect the surface solar reflectance by measuring the electromagnetic spectrum The associate editor coordinating the review of this manuscript and approving it for publication was Yongqiang Zhao . between the visible spectrum and the short-wave infrared with a very narrow bandwidth. As a result, a huge 3D image cube is obtained, where pixels represent narrow and continuous spectral signatures, which are unique for each captured land-cover material. This interesting peculiarity makes HSI data a powerful tool for target detection and land-cover categorization tasks.

A. TRADITIONAL MACHINE LEARNING METHODS FOR SPECTRAL-SPATIAL HSI PROCESSING
Traditionally, HSI information has been exploited in machine learning (ML) by pixel-wise methods which consider HSI data as a list of spectral vectors, assuming that each pixel is pure and typically labeled as a single land cover type [32]- [34]. In the current literature, there are abundant pixel-wise methods, such as the popular support vector machines (SVMs) [35], [36], K-nearest neighbor (KNN) [37], [38], multinomial logistic regression (MLR) [39], [40], VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ random forests (RFs) [41], [42] and standard artificial neural networks (ANNs) [43], among others. As these methods conduct only spectral processing to assign each pixel to its corresponding land-cover class, their reliability (in terms of classification accuracy) is strongly related to the number and spectral-quality of training samples. However, HSI scenes usually exhibit a low spatial resolution. For instance the AVIRIS instrument [30] has a ground sample distance of 20 mpp. Therefore, a pixel collects the reflectance of a very large area, which may contain different materials, such that the captured spectral signature will be very mixed. Moreover, uncontrolled changes (such as atmospheric conditions), introduce great variability in the spectral domain.
In this sense, pixel-wise classification methods suffer a high intra-class variability (for instance, class-centered pixels and frontier pixels that belong to the same class), and also the high inter-class similarity (frontier pixels that belong to different classes). Furthermore, the so-called ''salt and pepper'' noise problem appears frequently, since these methods ignore the spatial dependencies between neighboring pixels [44]. Recent investigation has shown that taking spatial information into account in pixel-wise HSI processing may improve classification performance [45]. This is based on the general assumption that adjacent pixels commonly belong to the same class, such that exploiting their information may reduce intra-class variance and labeling uncertainty due to the characterization of the contextual features. Literature discusses several ML techniques to perform spectral-spatial HSI classification [46]. For instance, Tarabalka et al. [47] and Zhang et al. [48] combined the SVM with Markov random field (MRF) modelling to process spectral and contextual information, reaching promising results. Also, Li et al. [49] and Sun et al. [50] applied MRFs to refine the classification results obtained by the MLR pixel-wise classifier.
Other spectral-spatial methods take advantage of the texture features, for instance Huo and Tang [51], and Seifi Majdar and Ghassemian [52] used Gabor filtering and SVMs to extract texture information from HSI data. Moreover, He et al. [53] presented the improved discriminative low-rank Gabor filtering (DLRGF) to extract suitable spectral-spatial features of the HSI data. Many other image filtering methods have been successfully applied for spectral-spatial HSI processing, for instance Kang et al. [54] introduced edge-preserving filtering (EPF) to refine the SVM classification map. Guo et al. [55] applied guided filtering (GF) to the KNN classifier, while Wang et al. [56] combined GF, principle component analysis (PCA) and deep neural networks to extract discriminative multi-features from HSI scenes. In [57], Liao and Wang fused two spatial-based filters, in particular curvature (CF) and domain transform recursive (DTRF) filtering to enhance the performance of HSI classification, and in [58] they implemented an adaptive manifold filter with spatial correlation feature (AMSCF) approach.
Morphological profiles (MPs) [59], [60] have also been successfully applied to extract relevant spatial information from the HSI data cube. For instance, Benediktsson et al. [61] built one MP (by applying a set of openings and closings) for each principal component extracted from the HSI data, and used all together in one extended MP (EMP), applying an ANN to perform the final classification. In [62], Plaza et al. applied multiscale sequences of EMP transformations to extract spatial relations within HSI data, while in [63] the authors proposed a new multi-channel MP approach with SVM. Fauvel et al. [64] combined morphological and spectral information within a SVM classifier, and Huang et al. [65] conducted an extensive study about several strategies for producing the base images for MPs, among traditional PCA. As an extension of MPs, attribute profiles (APs), extended APs (EAPs) and extended multi-attribute profiles (EMAPs) have been considered as powerful tools to better model the spatial information contained within the HSI data cube in comparison to conventional MP-based techniques, while reducing the computational complexity of MPs. In this context, Dalla et al. [66] enhanced their obtained accuracy results when conducting HSI classification through EAPs and EMAPs, while Falco et al. [67] conducted an independent component analysis (ICA) spectral analysis with reduced APs, in order to avoid the spatial information redundancy.
In recent years, the use of local binary patterns (LBPs) [68] has shown good spatial-classification performance in HSI classification. For instance, Li et al. [69] extracted spatial LBPs information and combined it with the spectral information through an extreme learning machine (ELM) classifier [70], [71]. Recently, Ye et al. [72] developed segmented LBPs in order to address the limitations of the original approach in the representation of physical meaning due to the existing noise bands, while preserving the intrinsic geometrical structure of original data. And Sidike et al. [73] used the interesting multi-scale completed LBP (CLBP) to enhance the spatial-processing of HSI scenes.
Image segmentation is quite interesting in spatial processing methods, where they split the HSI scene into non-overlapping homogeneous regions according to some partitioning criteria. In this sense, Tarabalka et al. [74] performed spectral-spatial HSI classification combining pixel-wise SVM classification results and the partitional clustering segmentation map. Furthermore, in [75], the authors combined the SVM classification results with watershed segmentation. Zhang et al. [76] adopted an active learning strategy with hierarchical segmentation to perform spectral-spatial HSI analysis and pixel classification. Also, Tarabalka et al. [77] proposed a multiple spectral-spatial classification (MSSC) scheme, which combines several pixel-wise and segmentation-based classifiers to process the HSI scene.
Among these approaches, many kernel-based techniques have been developed to perform spectral-spatial HSI classification [78]. For instance Camps-Valls et al. [79] presented composite kernels (CKs) as an efficient and flexible spectral-spatial processing framework, balancing between the spatial and spectral information comprised by HSI data. Li et al. [80] extended CKs by avoiding weight parameters (generalized composite kernel machines -G-CKs-), reaching promising results with a MLR classifier. Also, Camps-Valls et al. [81] generalized the CK behaviour by graph kernels (GKs), which compute spatial similarities in a proper feature space.
Despite their notable results (in terms of accuracy), these standard ML-based techniques have to face several challenges. In particular, their performance strongly depends on the parameters involved in the algorithm, the selected classification criteria or discriminative measurement, and/or some handcrafted features, which are very hard to obtain and are not flexible. In this sense, the success of these methods hinges largely on the user/programmer knowledge, involving a high computational burden (some of them) and being quite difficult to train.

B. DEEP LEARNING MODELS FOR SPECTRAL-SPATIAL HSI PROCESSING
In contrast to previous ML-based methods, deep learning (DL) models [82], [83] offer a bunch of quite interesting neural architectures to automatically process the spectral-spatial information contained in HSI data, without requiring a priori knowledge about the data distribution/features. In this context, deep neural networks (DNNs) can be interpreted as hierarchical stacks of L operational layers, where each layer l applies a transformation function, x (l) = f (l) (x (l−1) , P (l) ) over its inputs x (l−1) to obtain a certain output data representation x (l) , which will depend on the layer parameters P (l) (weights and biases). By adjusting their parameters, DNNs aim at automatically extracting those representations that best fit the problem, in order to optimize the architecture and achieve an adequate accuracy. This behaviour, combined with a large variety of architectures and their great versatility (DNNs are considered universal approximators), makes DNNs a pretty powerful tool for the analysis of complex non-linear data, exhibiting a greater ability of feature abstraction than standard ML models and shallow architectures.
Some HSI literature has tackled the spectral-spatial processing through deep architectures. For instance, Chen et al. [84] developed a stacked autoencoder (SAE) to perform deep spectral-spatial feature extraction and classification, concatenating several flattened principal components from a spatial pixel neighborhood with the corresponding spectral features of the pixel. Following the same procedure, Chen et al. introduced deep belief networks (DBN) [85] to extract high-level spectral-spatial features from the HSI data cube. These models overcome the limitations expressed by traditional ML models, as they automatically extract the most relevant spectral-spatial features when classifying HSI data. However, these models are originally based on fully-connected layers, which only process vectorized data, i.e. 1-dimensional arrays. Consequently, the spatial representations must be flattened and vectorized in order to be processed by these architectures, i.e. spatial relations (in terms of feature locations) are lost.
Among these methods, convolutional neural networks (CNNs) [86] appear to be a promising DL technique within image processing tasks, in general, and HSI data analysis in particular [87]. The corresponding model offers a natural way to combine spectral-spatial information due to its hierarchy of convolution filters and pooling operations, providing a flexible mechanism to process n-dimensional arrays. As a result, the CNN model has shown a good performance in HSI data classification [88]. Although CNNs are excellent tools for feature extraction, without enough training samples, they exhibit the typical deep learning limitation of overfitting. There are two main reasons: on the one hand, there are not enough samples to cover all the variability of the HSI scene, and on the other hand, the parameters are over-adjusted to the training data distribution. In this context, despite the great performance during the training stage, the behavior of the network deteriorates significantly during the inference phase due to its over-adjusted parameters and the lack of knowledge about unseen samples. Moreover, they are not really effective in exploring some spatial relations among features and their generalization power depends on the transformations present in the training data.

C. SPATIAL TRANSFORMATIONS EQUIVARIANCE OF CNN MODEL FOR HSI CLASSIFICATION
In this context, affine transformations such as shifting, scaling, translations and rotations can appear within the data, where objects that belong to the same class appear at different locations, with different sizes and orientations. In practical HSI data, scenarios can be observed with a varying orientation due to the capturing spectrometer position, the altitude at which it is located and the path followed by the airborne spectrometer. From this perspective, achieving a certain invariance/equivariance to these transformations can be quite useful for classifying HSI data, making classification models more reliable, improving their generalization ability and reducing the number of training parameters. Eq. (1) provides the mathematical formulation of transformationinvariance/equivariance, where f : X → Y defines the feature extraction function, which is invariant to any transformation g(·) of the multidimensional input data X (which belongs to a group of transformations G, i.e. g ∈ G) if the output of f (·) remain unchanged. On the contrary, f (·) is equivariant to every g ∈ G if there is another group H such that, for every g(·) ∈ G there exists a h(·) ∈ H that reflects the transformation. Therefore, the concept invariant means that there is no variation at all, while equivariant implies a variation in a similar or equivalent proportion.
Moreover, becoming robust to rotation transformations is a quite important topic within ML and DL due to the high VOLUME 8, 2020 FIGURE 1. CNN translation (left) and rotation (right) equivariance. In the first case, the translation g(·) of the HSI scene X is equivalent to the translation h(·) of the feature maps, i.e. the translation is preserved by each layer. However, the rotation of the input X by g(·) has not a equivalent transformation h(·), which could produce the same result if it is applied over the feature maps, and if an h(·) exits, it is very hard to find. complexity and the negative effects it introduces during data processing [68], [89]- [96]. Some interesting efforts in remote sensing HSI data analysis can be found in [97]- [100]. However, these approaches are usually based on traditional ML techniques and feature extractors such as SIFT. Focusing on deep CNN models, convolution kernels capture translations within the data, i.e. they are translation-equivariant due the particular form of parameter sharing, while pooling operations provide some invariance to small translations [83]. However, the CNN model is not naturally equivariant to rotation transformations. In fact, rotations in the inputs do not generate an equally logical-response in the outputs [94], as we can observe in Fig. 1. Therefore, other mechanisms have to be introduced in order to aid these transformations.
Traditionally, spatial regularization [101] and data augmenting techniques [102]- [104] have been developed as reinforcement to help CNN learning, using normally a large number of rotated data variations. However, this procedure works as a black box, where the learning mechanism does not exploit the characteristics of the problem [94]. Moreover, these techniques do not ensure local equivariances within the CNN model, which is compelled to learn several copies of the same filter to fit different feature rotations, increasing both the number of redundant parameters and the learning complexity.
Several methods have been developed for computer vision, optical image and remote sensing processing to include rotation-equivariance within CNN models, which can be classified into those that 1) rotate the inputs (input data or feature maps) to generate rotation invariant representations [91], [105] and 2) rotate the CNN filters or activation [106]. For instance, Fasel and Gatica-Perez [91] rotated the input data before sending it into several stacked CNNs, generating rotation invariant representations through pooling. Cohen and Welling [106] generalized the convolution layer behaviour to perform rotations and reflections equivariances by rotating kernels within each layer.
In this context, this article tackles the rotation-equivariance within CNNs based on 2D kernel for spectral-spatial HSI data processing and classification. In particular: • Instead of achieving rotation-equivariance by transforming the inputs, we propose to transform the standard convolution filters of the network into steerable filters [107].
In this way, we develop a new rotation-equivariant CNN2D architecture for remote sensing HSI data classification to deal with all local rotation information present in HSI data.
• Therefore, our deep model uses as structural element a kind of steerable filter, named circular harmonic filter (CHF) [94] to represent all rotated versions of a filter (i.e., 360 • -rotation equivariance) by considering a linear combination of a number of steering bases.
• In this way, the new architecture aims at reducing the high redundancy in the filters learned by the CNN, simplifying the learning process and improving the performance when conducting HSI data classification.
• Obtained results over three well-know HSI datasets demonstrate the potential of the approach.
The remainder of the paper is organized as follows. Section II describes the used methodology of the rotation-equivariance CNN model. Section III compares its performance with baseline CNNs over three widely-used HSI data set instances. Finally, Section IV concludes the paper with some remarks and possible future research lines.

A. THE CONTEXT OF THE CNN MODEL
A CNN architecture can be divided into two main parts: i) the feature extractor (FE), which is usually composed of one or more convolution layers, followed by data normalization, non-linearity function and, often, by a subsampling or pooling layer, which obtains high level representations (feature maps) of the original input data, and ii) the classification net, composed of several fully-connected layers.
The design of each FE-stage is intrinsically related to the input data structure, i.e., the dimensional arrangement of the input array. In this context, FE-stage layers can process n-dimensional arrays, taking the maximum advantage of the information contained in the data. Thus, considering a standard CNN model with L FE-stages, let × N (l−1) features and F (l−1) channels. 1 Therefore, the l-th FE-stage will apply its corresponding FE-function f (l) (·) to X (l−1) in order to obtain the output volume X (l) ∈ R N (l) H (·) and pooling f (l) P (·) layers, so we can reinterpret each FE-stage as [108]: where P (l) represents the learnable kernel parameters (i.e., weights W (l) and biases b (l) ). Each sub-mapping function is defined as follows: Eq. (3a) defines the convolution layer operation, which is composed of F (l) filters that are applied to small regions of X (l−1) . This region is defined by the local receptive field ) . In this sense, the corresponding weights ×F (l) are applied along the input like a sliding window, where each kernel application combines the features contained in the receptive field along the channels. Eq. (4) details the mathematical expression of f (l) C given by Eq. (3a). As we can observe, the convolution ( * ) between the weights W (l) and the data X (l−1) plus the bias vector b (l) obtains the output feature volume Z (l) ∈ R N (l) i,j,t is obtained after the affine transformation [109] between the corresponding weights and input data, where i ∈ {1, . . . , N (l) } and j ∈ {1, . . . , N (l) } index along spatial dimensions of the output and input arrays,î ∈ {1, . . . , K (l) } andĵ ∈ {1, . . . , K (l) } index along spatial dimensions of the convolution kernel, 1 Focusing on HSI processing, the first FE-stage l = 1 will receive the original HSI patch X ∈ R N ×N ×F , composed of N × N vector pixels, where pixel (i, j) contains the reflectance measurements within F spectral bands, } indexes along the spectral dimension of the convolution kernel and output volume, andt ∈ {1, . . . , F (l−1) } indexes along the spectral dimension of the input volume.
C , the data is usually normalized by the batch normalization function f (l) B in order to recover its original distribution. This limits the internal covariate shift effect introduced by f (l) C by forcing the resulting output distribution to have mean µ = 0 and variance σ 2 = 1 [110], while making the landscape of the optimizer more smooth [111]. Eq. (5) details the mathematical terms behind Eq. (3b), where the step-size provides certain numerical stability and learnable parameters γ and β control the distribution N (µ, σ 2 ), making it more flexible.
is then processed by a non-linear activation function f to extract the nonlinear representations from the data.
In this sense, f H is interpreted as the detector unit of the FE-stage [83]. Indeed, these activations indicate the presence of the features identified by the convolution kernel. Usually, the rectified linear unit (ReLU) [112]- [114] is implemented, soz P is conducted by adopting a downsampling strategy. In this sense, the pooling step summarizes the spatial information of each feature map inZ (l) , retaining the most relevant features by applying a max, average or sum operation within a neighborhood window [115]. For instance, the max-pooling operator selects the largest feature from the rectified volumeZ (l) , considering a spatial window of P × P. As a result, pooling summarizesZ (l) intoX (l) by dividing the spatial dimensions into N (l) /P parts. This reduces the number of parameters, which helps to control the overfitting problem.

B. LIMITATIONS OF CONVOLUTION KERNELS
As mentioned before, the l-th convolution kernel comprises F (l) learnable filters with size K (l) × K (l) , which are applied in a ''sliding window'' fashion across the entire layer input X (l−1) . In this regard, the weight sharing mechanism provides an easy way to reuse the same K (l) × K (l) weights given by filter t across the entire input, ∀t ∈ {1, . . . , F (l) }. This mechanism significantly reduces the number of weights that have to be learned. In addition, the filter is automatically adapted to detect a particular feature, such as a combination of edges. Therefore, if there are similar features within the input data, the filter will detect them independently of their location in the inputs, recording their position in the output volume and preserving the translation at each layer [106]. Thus, a convolution kernel is translation-equivariant, as the spatial translation in inputs generates a consequent translation in outputs [83], following Eq. (1): However, low-level features, such as borders and edges can appear with different orientations. This may affect the CNN performance dramatically, because filters cannot fit the data properly, as has been observed in filterbanks visualized by Zeiler and Fergus [116]. In this sense, the CNN is compelled to learn the rotated version of the same filter, investing a significant amount of computation to learn all these redundant weights, which are actually copies of one, which is rotated in different ways [94].
This behavior should not be misunderstood with the invariance offered by pooling layers. In this sense, the pooling layer helps the network model to obtain a data representation approximately invariant to small translations and rotations of the input, as we observe in Fig. 2.
However, none of these layers is naturally equivariant to rotation. Focusing on convolution layers, if X (l−1) is rotated, the feature vectors obtained by f (l) C do not rotate in an easily predictable way [94], while in pooling layers, the rotation equivariance is not skipped at all. In fact, it has to be approximated by small translations, normally requiring data augmentation (i.e. including in the training set data with every possible rotation) and leading to high sample complexity, while CNN models are highly sensitive to little input data perturbations [117].
Encoding these properties into the neural network architecture can enhance data understanding, improving the learning procedure and reducing the number of parameters [118].

C. CIRCULAR HARMONIC FILTERS
In order to encode the 360 • -rotation equivariance within the CNN for remote sensing HSI data classification, the developed spectral-spatial model uses CHF [94] as the structural element to perform the FE stage. The CHF belongs to the steerable filter family [107], [119], which defines a set of orientation-selective convolution kernels that can be constructed at any rotation by simply using a a linear combination of a finite number of appropriate base filters [120]. Within a continuous Euclidean space, the CHF is defined by a sinusoidal angular part multiplied with a radial function [119] W m (r, φ; R, β) = R(r)e i(mφ+β) , where (r, φ) denotes the polar coordinates of feature maps (r ≥ 0 is the radial distance and φ ∈ [0, 2π) is the horizontal plane angle), R : R + → R is the radial profile function to control the filter overall shape, and β ∈ [0, 2π) is a phase offset term that gives the filter orientation-selectivity [94]. In particular, R and β must be learned during the training step. An interesting parameter is the rotation order m ∈ Z, which limits the kind of transformation applied to the data. Hereof, following Eq. (6), we can consider g θ ∈ G as the θ-degree rotation 2 of the input X(r, φ) (in polar coordinates, superscripts have been avoided to facilitate the understanding of mathematical concepts), so X(r, g θ [φ]) = X(r, φ − θ). Thereby, the application of the CHF on the rotated input must be equivalent to the application of the same filter on the original data, considering h θ = e imθ , i.e.
Taking back the superscripts to indicate the FE stage in which we are, and considering the CNN as a deep stack of CHFs, it must be noted that the cross-correlation between the feature maps X with rotation order m j will lead to new feature maps X (l) m i +m j with rotation order m i +m j . However, considering the summation of two feature maps with the same order m, the obtained result remains of order m (non-linear activation functions do not affect the rotation order) [94]. Consequently, to construct a M -equivariant output, the chained cross-correlation constraint of rotation orders is applied along any data-path of the CNN model. Moreover, the equivariance condition given by Eq. (9) must be met at every feature map: Our deep rotation-equivariant CNN2D for spectral-spatial HSI data classification has been developed with a dual-path architecture, replacing standard convolution kernels W (l) by its CHF counterpart W (l) m , generating m-equivariant feature maps with rotation order m. In particular, each path represents a different stream with constant rotation order outputs. As can be observed from Fig. 4, the upper stream exhibits an m = 0 rotation order (i.e. its outputs are rotation-invariant), while the lower stream sets its rotation order to m = 1 (i.e. its outputs are rotation-equivariant). To keep the rotation order constant along streams, each one comprises several rotation order zero cross-correlation and non-lineal activation functions. Moreover, every cross-correlation output of each stream is combined within the opposite stream through cross-correlations of rotation order equal to the difference between those two streams, 4 holding at every connection the equivariance condition. 4 As we can see, to combine the outputs of stream m = 1 with the outputs of stream m = 0, filters of order m = −1 are needed. In this case, for negative orders, the conjugated weights are considered (Fig. 5), i.e. if we consider W −m = R(r)e i(mφ+β) as the negative order filter, its conjugated counterpart will be W m = R(r)e −i(mφ+β) . Furthermore, to translate the continuous cross-correlation function into the discrete space of real HSI data, the data is bandlimited and resampled before cross-correlation via Gaussian resampling [94]. Each feature of the resulting downsampled data is a center of equivariance where the feature maps will exhibit local rotation equivariance. Also, the complex cross-correlation function has been split into four real cross-correlation: Imaginary part (10) Following Eq. (10), the CHFs are implemented by resampling the polar-based filters into a grid-resampled version through the the trigonometric transformation, where any coordinate (r, φ) can be represented as (r cos(φ), r sin(φ)) [94]. We summarize the concept of the the network architecture. First of all, the spectral dimension of the HSI scene is reduced to one band, F = 1, applying Maximum Noise Fraction (MNF). Then, following the procedure outlined in [86], the HSI scene is cropped into spectral-spatial patches of size N × N × 1, where the label of a patch corresponds to that of the central pixel. These patches feed the network, whose architecture is composed of two streams with rotation orders m = 0 and m = 1, respectively. Each stream comprises three CHF-based layers, which are followed by data normalization and ReLU activation functions [94]. Moreover, these CHF-based layers contains 5×5×8, 5×5×16 and 5×5×32 filters, respectively. This is illustrated in Fig. 4, where each cross-correlation observes Eq. (9). At the end of the network, the two streams come together and are joined into a final layer with m = 0 and with kernel size 5 × 5 × C, where C is the number of different land cover classes. This final data representation is vectoriced to obtain the final classification result. The model has been optimized with ADAM optimizer, considering a starting learning rate of 1e − 3. This is updated every 50 epochs, where it is divided by two in order to refine the behavior of the optimizer. 250 epochs have been employed, using a batch size of 100 elements.

III. EXPERIMENTS A. EXPERIMENTAL ENVIRONMENT
In order to evaluate the performance of the developed approach called reCNN2D (rotation-equivariant CNN2D), we use as hardware environment a 6th Generation Intel R Core TM i7-6700K processor with 8M of Cache and up to 4.20GHz (4 cores/8 way multitask processing), 40GB of DDR4 RAM with a serial speed of 2400MHz, a graphical processing unit (GPU) NVIDIA GeForce GTX 1080 with 8GB GDDR5X of video memory and 10Gbps of memory frequency, a Toshiba DT01ACA HDD with 7200RPM and 2TB of capacity, and an ASUS Z170 pro-gaming motherboard. The used software environment consists of Ubuntu VOLUME 8, 2020  16.04.4 x64 as operating system and Python 2.7 as programming language.

B. EXPERIMENTAL HYPERSPECTRAL DATASETS
For the test instances, three well-know HSI datasets have been used. The first one is known as Indian Pines (IP) (see Table 1), which was gathered by AVIRIS [30] in 1992, over a set of agricultural fields with regular geometry and with multiple crops and irregular patches of forest in Northwestern Indiana. The IP scene has 145 × 145 pixels with 224 spectral bands in the range from 400 to 2500nm, with 10nm of spectral resolution, 20m spatial resolution and 16 bits radiometric resolution. After an initial analysis, 4 zero bands and another 20 bands with lower signal-to-noise ratio (SNR) due to atmospheric absorption have been removed, retaining only 200 spectral channels. Moreover, about half of the pixels in the HSI image (10249 of 21025) contain ground-truth information, which comes in the form of a single label assignment having a total of 16 ground-truth classes.
The second HSI dataset, known as Pavia University (PU) (see Table 1) was collected by ROSIS [123] during a flight campaign over Pavia, northern Italy. The dataset covers an urban environment with various solid structures (asphalt, gravel, metal sheets, bitumen, bricks), natural objects (trees, meadows, soil), and shadows (9 classes in total). Other objects with a composition which differs from the labeled ones are considered as clutter. The PU scene was collected over the university area. It contains 103 spectral bands of 610 × 340 pixels in the spectral range from 0.43 to 0.86µm, with a spatial resolution of 1.3m/pixel. About 20.62% of the pixels in the HSI image (42776 of 207400) contain ground-truth information.
The third HSI dataset was also collected by the AVIRIS instrument, in this case over Salinas Valley (SV), California ( Table 1). The covered area has 512 × 217 samples and the spatial resolution is 3.7m per pixel. 204 out of the 224 bands are kept after 20 water absorption bands are removed. The ground-truth is composed of 54129 pixels and 16 land-cover classes, including vegetables (such as lettuce and brocoli), bare soil, and vineyard fields.

C. EXPERIMENTAL DISCUSSION
Four experiments have been performed with the aim to test the developed method for HSI classification: • The first experiment performs a general stateof-the-art comparison, comparing the developed rotation-equivariant CNN2D model for HSI data classification with six popular HSI classifiers [87]: i) RF [41], ii) MLR [124], iii) SVM with radial basis function (RBF) [125], iv) multilayer perceptron (MLP) [126], v) CNN1D [127] and vi) CNN2D [128]. Apart from reCNN2D and the standard CNN2D (implemented similarly), the considered classifiers are traditional pixel-wise methods, whereas CNN2D-based architectures are spectral-spatial classifiers, whose inputs comprises patches of 31×31 pixels with one principal component extracted from IP, PU and SV datasets through PCA. The spatial size has been chosen following [86] in order to obtain enough spatial information The training percentage has been varied to test the accuracy performance of the developed method, in particular 3%, 5%, 10% and 5% of the IP dataset (which is spectrally more complex), and 1%, 3%, 5% and 10% of PU and SV datasets, respectively. The accuracy of each method was evaluated by three quantitative metrics: overall accuracy (OA), average accuracy (AA), and the Cohen's kappa (K) coefficient [129].
• The second experiment extends the comparison between the standard and rotation equivariant CNN.
In particular, we have compared the standard CNN2D, considering the IP, PU and SV datasets with rotated and non-rotated training-test samples, and the presented method with rotated inputs. The main goal is to observe the degradation in terms of OA of the standard CNN2D when rotated samples are fed to the model. Also in this experiment, after dividing the scenes into patches of size 31×31, they have been randomly rotated with angles 0 o , 90 o , 180 o and 270 o and then divided into training and test sets (using 3%, 5%, 10% and 5% of IP and 1%, 3%, 5% and 10% of the PU and SV datasets as training percentages) in order to measure the rotational effect over the standard CNN2D. Again, no data augmentation has been performed during the training of the models. VOLUME 8, 2020 • The third experiment performs a detailed comparison between the new rotation-equivariant network with CHFs for remote sensing HSI data classification, the standard CNN2D model and the regularized FE model proposed by Chen et al. [130]. Chen et al. obtain the representation of the deep features of the HSI scenes by implementing a standard CNN2D, performing the final classification through a logistic regression classifier, which is fed with the extracted feature vectors. The experiment uses IP and PU scenes, where both the spatial size of the input HSI patches (27 × 27) and the size of the training and test sets has been adjusted to those indicated in [130]  • As a last comparison, the fourth experiment compares the performance between the new rotation-equivariant CNN2D model, the standard CNN2D and the feature enhancement fully CNN2D model FEFCN) proposed by Li et al. in [131]. This network comprises several FE stages of standard convolution, deconvolution and pooling layers. Obtained feature vectors are processed by an ELM [71] for classification purposes. The experiment follows the training details described in [131], where 10% and 5% of training data is used for the IP and PU data sets respectively, using patches of 48 × 48. For the standard CNN2D and reCNN2D, the HSI data has been randomly rotated with angles of 0 o , 90 o , 180 o and 270 o .

1) FIRST EXPERIMENT: GENERAL STATE-OF-THE-ART COMPARISON
The obtained results of the first experiment can be observed in Table 2. With the exception of two isolated cases (PU and SV scenes with 1% of training samples), reCNN2D achieves the best classification result. Looking at the spectral classifiers (which are not affected by the random rotation of spatial information), they reach typical OA results, improving as more training samples are provided, where SVM and CNN1D exhibit the best classification results. However, the graphic results depicted in the classification maps of Fig. 6 show the traditional salt and pepper noise due to the incorrect classification of those pixels located in the interior areas of the classes.
On the contrary, those classifiers that introduce contextual information can overcome this limitation, reducing the uncertainty introduced by the spectral variability by taking advantage of the information provided by the neighbors pixels. These classifiers are based on a CNN2D model, and they have been denoted as CNN2D_wR and reCNN2D, respectively. The first one comprises the standard 2D convolution filters, receiving as input data, rotated and non-rotated spatial patches (hence the suffix wR, i.e. with rotation), while the second one implements the new methodology, receiving also rotated and non-rotated spatial patches.
Focusing on IP, one can observe that the reCNN2D achieves the best overall accuracy (OA, 78.93%) with least training data (1%). Comparing the traditional CNN2D_wR with reCNN2D we can observe how the last one performs 36.98 percentage points better than the CNN2D_wR model. It clearly appears more robust to rotational variations when few training samples are provided, understanding the concept of ''robustness'' as the ability of the neural model to cope with complex data (in particular, rotated data with few training samples) during classification, without its classification performance being adversely affected. This behavior is repeated with the rest of the percentages (for instance 37.64 and 30.33 percentage points better with 5% and 10% of raining samples), although as more training is added, the distance is reduced (18.81 percentage points better with 15%). This makes sense, since as more samples are introduced, networks are able to cover the data variability, learning better during the training stage and generalizing better over the unseen test samples.
Considering PU and SV datasets, SVM is the best classifier with only 1% of training data, while the new rotation-equivariant CNN2D is the third one (behind the CNN1D). However, for the remaining percentages, reCNN2D reaches the first position, exhibiting the best OA result. Moreover, reCNN2D exhibits a better generalization power than CNN2D with 1%, 3%, 5%, 10% and 15% of training data. While standard CNN2D hardly reaches the spectral classifiers performance, reCNN2D is able to overcome the OA values without difficulty. Regarding, the classification maps in Fig. 6, we can observe an important salt and pepper noise within spectral methods. In particular in SV scene, it is very interesting to note how the CNN2D-based network suffers more than pixel-wise models in the lettuce region, which is quite hard to identify due the high similarity between the spectral signatures of the lettuces at different stages of maturation. The same behaviour is observed between Vineyard-untrained and Grapes-untrained classes. In this sense, the problem of rotation is compounded by the lack of spectral information, thus the lack of spectral information between very similar spectra and the rotation of patches (in particular, the orientation of those patches with edges between different classes) make the CNN2D classifier negatively affected. On the contrary, reCNN2D provides a clearer map, indetifying these regions better than CNN2D_wR model.

2) SECOND EXPERIMENT: EVALUATING THE BEHAVIOR OF CONVOLUTIONAL MODELS WITH INPUT ROTATED PATCHES
This experiment is focused on studying how traditional convolution kernel is negatively affected by the introduction of rotated patches in both training and testing, while new reCNN2D CHF-based architecture is more robust and generalizes better than standard CNN2D.
By rotating both the training and test patches, we are introducing some ''spatial variability''. In this sense, traditional convolution filters will learn to identify the features associated with some orientation, compelling them to reserve many parameters that in the end are redundant, as they are rotated copies of the same filter. On the contrary, reCNN2D should reveal a better performance, since its filters can obtain rotation-equivariant responses, independently of the input degree of transformation, so it should generalize better in those unseen samples of the test set.
Obtained results can be graphically observed in Fig. 7. On the upper row, we compare the standard CNN2D model introducing randomly rotated (CNN2D_wR) and non-rotated (CNN2D_nR) patches within the training and test sets. The obtained results of CNN2D_nR are as expected [87], exhibiting a low performance when few training samples are available during the training stage, specially with IP and SV datasets, where first one the has an important spectral mixture of the pixels due to the low spatial resolution, while the second contains two problematic regions due to the similar spectral signatures of the lettuces at different stages of maturation and the vineyards crops. However, the CNN2D_nR improves its classification performance as more training samples are provided, reaching an OA higher than 95%. Despite these results, the standard CNN2D performance degrades significantly when rotated patches are introduced. Focusing on IP scene, the CNN2D_wR OA falls around 30 percentage points. This fall is slightly less in the other two HSI scenes, because they have more training samples than IP, so they can cover more information than IP, although the classification performance reduction is still remarkable.
On the bottom row, we compare the CNN2D_wR and the new network (denoted as reCNN2D), including rotated patches within the raining and test sets. In this sense, these images graphically depict the OA provided in Table 2. In PU the difference between CNN2D_wR and reCNN2D is slightly smaller than in IP and SV scenes. The main reason resides in the large number of structures, which are oriented and located in different areas of the scene, providing a direct way for the traditional filters to cover the features. However, IP and SV are more complex scenes not only due to their spectral complexity, but also because of the lack of structure. In this context, the CNN2D_wR suffers a dramatic OA reduction compared to the rotation-equivariant reCNN2D model.
On the other hand, if we compare the CNN2D_nR (without rotated inputs) and reCNN2D (with rotated inputs) we can observe that the obtained results are quite similar as more training samples are introduced, where reCNN2D obtains the best OA percentages when there are very few samples (for instance in IP and SV scene).

3) THIRD EXPERIMENT: COMPARISON AMONG CNN2D MODELS
The third experiment compares several CNN models. In particular, we compare the CNN2D-LR proposed by Chen et al. in [130] (called CNN2D-LR_nR, as it receives non-rotated patches), the standard CNN2D and reCNN2D. The same experimental configuration as described in [130] have been used. Table 3 provides the obtained results over IP and PU datasets. The first column provides the results obtained by the CNN2D-LR_nR model in [130], which has been trained with 1765 training samples from IP and 3930 from PU, using patches of 27 × 27. Following the same training procedure, the second and third columns show the results obtained by the CNN2D_wR with the same architecture as Fig. 4, implemented by traditional convolutional kernels, and the new method respectively. For this experiment, patches have been  [131] (considering non-rotated data), standard CNN2D and reCNN2D (both with rotated input data).
randomly rotated with angles 0 o , 90 o , 180 o and 270 o and then divided into training and test sets.
As we can observe, for both data sets the new neural model provides the best OA results. In fact, its OA results are 5.32 and 4.55 percentage points better than the second best model, the CNN2D-LR_nR, and 22.89 and 22.24 points better than the CNN2D_wR model.

4) FOURTH EXPERIMENT: COMPARISON BETWEEN ROTATION-EQUIVARIANCE CNN AND FULLY-CNN MODELS
The last experiment compares the efficiency in terms of OA, AA, and Kappa values between the feature enhancement fully CNN (FEFCNN) [131] model, the standard CNN2D model with rotated inputs (CNN2D_wR) and the newly designed rotation-equivariant approach reCNN2D. The obtained results are shown in Table 4 for both IP and PU data sets considering the experimental configuration provided in [131]. In this sense, the comparison have been conducted by utilizing the 10% and 5% of available labeled training samples randomly taken from IP and PU datasets. The FEFCNN network takes the input of overlapped patches of size 48 × 48 pixels with a stride of 15 for both IP and PU data sets. The second and third columns of Table 4 reports the results obtained by the CNN2D_wR with the backbone architecture shown in Fig. 4, which incorporates traditional convolutional kernels, and the new method respectively. In order to explore the rotational effect the extracted patches have been randomly rotated 0 o , 90 o , 180 o and 270 o angles and then the transform samples are divided into training and test sets. It can be seen that, for both data sets, the new rotation-equivariant network for HSI classification provides the best OA, AA, and Kappa values showing the robustness under rotational variations.

IV. CONCLUSION
This article presents a new rotation-equivariant convolutional neural network reCNN2D based on CHFs for classyfying HSI remote sensing data. Obtained results with three popular real HSI data sets varying the rotation in images in a systematic way, illustrate the performance of the new method with a small amount of training data, avoiding the overfitting problem exhibited by traditional CNN models. Moreover, comparison with several CNN2D models shows that the new approach exhibits a better robustness to data rotational variance, performing a better classification and providing a better generalization without requiring data augmenting techniques. ELIGIUS M. T. HENDRIX is currently a European Researcher and a Professor with 30 years of experience in mathematical modeling and optimization questions. He is also affiliated with Wageningen University, Wageningen, The Netherlands, and the Universidad de Málaga, Málaga, Spain. He has published more than 70 journal articles. His research interest includes how to use the mathematical structure of an optimization application in order to derive specific solution methods and algorithms. His current research interest includes how to adapt algorithms such that they can exploit modern computer structures.