Hyperspectral Image Classification-Traditional to Deep Models: A Survey for Future Prospects

Hyperspectral Imaging (HSI) has been extensively utilized in many real-life applications because it benefits from the detailed spectral information contained in each pixel. Notably, the complex characteristics of HSI data, i.e., the nonlinear relation between the captured spectral information and the corresponding object, make accurate classification challenging for traditional methods. In the last few years, Deep Learning (DL) has been established as a powerful feature extractor that effectively addresses the nonlinear problems that appear in a number of computer vision tasks. This has prompted the deployment of DL for HSI classification (HSIC), where it has shown good performance. This survey presents a systematic overview of DL for HSIC and compares state-of-the-art strategies on the topic. We first summarize the main challenges of traditional machine learning for HSIC and then explain how DL addresses these problems. The survey divides the state-of-the-art DL frameworks into spectral-feature, spatial-feature, and joint spatial-spectral-feature approaches in order to systematically analyze their achievements (and future research directions) for HSIC. Moreover, we take into account that DL requires a large number of labeled training examples, whereas acquiring such a number for HSIC is challenging in terms of time and cost. Therefore, this survey also discusses strategies to improve the generalization performance of DL models, which can provide some guidelines for future work.


I. INTRODUCTION
HYPERSPECTRAL IMAGING (HSI) is concerned with the extraction of meaningful information based on the radiance acquired by a sensor at short or long distances without physical contact with the object of interest [1]. HSI provides detailed spectral information by sampling the reflective portion of the electromagnetic spectrum, covering the 0.4-2.4 µm region (i.e., visible 0.4-0.7 µm to short-wave infrared 0.7-2.4 µm), in hundreds of narrow and contiguous spectral bands. HSI can also explore the (light) emission properties of objects in the mid- to long-wave infrared regions [2]. Despite the detailed information it provides, HSI brings several challenges, since traditional analysis techniques for monochromatic, RGB, and multispectral images cannot be directly exploited to extract meaningful information from hyperspectral ones. One reason is that HSI exhibits the unique statistical and geometrical properties of high-dimensional spectral/spatial data, i.e., the volume of a hypercube and of a hypersphere concentrates in the corners and in the outer shell, respectively.
arXiv:2101.06116v3 [eess.IV] 27 Apr 2022
HSI has been adopted in several real-world applications including, but not limited to, atmospheric, environmental, urban, agricultural, geological and mineral exploration, coastal zone, marine, forestry (e.g., tracking forest health), water quality and surface contamination, inland waters and wetlands, snow and ice, biological, medical, and food processing contexts [3]-[8]. There are also several military applications in camouflage, landmine detection, and littoral zone mapping. Furthermore, HSI has been used in space, air, and underwater vehicles to acquire detailed spectral information for a wide range of uses [9]-[12].
In-field collection and spectral library indexing of ground truth signatures for any of these applications are critical for many reasons. For instance, the spectral information of vegetation is influenced by a wide range of environmental conditions, which makes it challenging to satisfactorily represent variability without collecting site-specific field spectra. Nevertheless, the real potential of HSI remains mostly untapped: it allows analysis to go deeper than surface features, considering that each feature usually has a different spectral band. Indeed, HSI can capture more than 200 spectral bands, which helps practitioners discriminate objects that could not be discriminated before. A few HSI application examples are shown in Fig. 1, but several other domains (e.g., smart cities, Industry 4.0, Intelligent Transportation Systems) can greatly benefit from such an approach.
This survey aims to give an overview of the widely used DL-based techniques for HSIC. Specifically, we first summarize the main challenges of HSIC that cannot be effectively overcome by traditional machine learning (TML), and then list the advantages of DL in handling these issues. We then provide a framework that categorizes the corresponding works into 1) spectral and spatial feature learning, individually, and 2) spectral-spatial feature learning, in order to systematically review the achievements in DL-based HSIC. Finally, we discuss 3) future research directions for improving the generalization performance and robustness of DL models given the limited availability of reliable training samples. The remainder of this paper is structured as follows. Section II introduces the task of HSI Classification (HSIC) and briefly discusses the HSIC paradigm shift from Traditional (Conventional) Machine Learning to Deep Learning (DL) models, describing HSI data characteristics along with the advantages and limitations of DL when working with HSI. In Sections III and IV, we give an overview of different forms of HSI representation and basic machine learning strategies, respectively. Section V describes a few commonly used types of layers and reviews recent developments (specifically from 2017 onward) of some intensively utilized DL frameworks for HSIC. Sections VI, VII, VIII, and IX present the state-of-the-art developments of Convolutional Neural Networks (CNNs), Graph CNNs (GCNNs), Autoencoders (AEs), Deep Belief Networks (DBNs), and Recurrent Neural Networks (RNNs). In Section X, we briefly discuss various strategies to overcome the low generalization performance of HSIC caused by the limited availability of training data. Section XI presents the experimental results and a discussion of the results obtained using different deep learning strategies.
Section XII concludes the paper with a few future research directions related to joint exploitation of spectral-spatial features of HSI, limited training data, and computational complexity.

II. HYPERSPECTRAL IMAGE CLASSIFICATION (BACKGROUND AND CHALLENGES)
A. Traditional to DL Models
The main task of HSIC is to assign a unique label to each pixel vector of the HSI cube based on its spectral or spectral-spatial properties. Mathematically, an HSI cube can be represented as $X = [x_1, x_2, x_3, \ldots, x_B]^T \in \mathbb{R}^{B \times (N \times M)}$, where $B$ represents the total number of spectral bands, each consisting of $(N \times M)$ samples belonging to $Y$ classes, and $x_i = [x_{1,i}, x_{2,i}, x_{3,i}, \ldots, x_{B,i}]^T$ is the $i$-th sample in the HSI cube with class label $y_i \in \mathbb{R}^Y$. The classification problem can be considered as an optimization one, in which a mapping function $f_c(\cdot)$ takes the input data $X$ and, after applying some transformations to it, obtains the corresponding label $Y$, so as to reduce the gap between the obtained output and the actual one [69]:
$$Y = f_c(X, \theta)$$
where $\theta$ is a certain adjustable parameter that may be required to apply transformations on the input data $X$ such that $f_c : X \to Y$.
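One common way to make the "gap" between obtained and actual outputs explicit is empirical risk minimization. The formulation below is an illustrative sketch rather than the exact objective of [69]: the per-sample loss $\ell$ (for example, cross-entropy) and the estimate $\hat{\theta}$ are assumptions introduced here, while $f_c$, $\theta$, $x_i$, $y_i$, $N$, and $M$ follow the definitions above.

```latex
\hat{\theta} \;=\; \arg\min_{\theta} \; \frac{1}{NM} \sum_{i=1}^{NM} \ell\big(f_c(x_i, \theta),\, y_i\big)
```

Under this reading, training amounts to searching for the parameter $\theta$ whose predictions $f_c(x_i, \theta)$ disagree least, on average, with the known labels $y_i$.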
In the literature, substantial work has been done on HSIC, and there is a growing trend in the development of such techniques, as shown in Figure 3. Most HSIC frameworks seem to be influenced by the methodologies used in the computer vision domain [70]. Traditional machine learning-based HSIC approaches use hand-crafted features to train the classifier. These methods generally rely on engineering skills and domain expertise to design several human-engineered features, for instance, shape, texture, color, spectral, and spatial details. All these features are basic characteristics of an image and carry effective information for image classification. Commonly used hand-crafted feature extraction and classification methods include texture descriptors such as Local Binary Patterns (LBPs) [71], Histogram of Oriented Gradients (HOG) [72], Global Image Scale-invariant Transform / Global Invariant Scalable Transform (GIST) [73], Pyramid Histogram of Oriented Gradients (PHOG), Scale-invariant Feature Transform (SIFT) [74], Random Forests [75], kernel-based Support Vector Machine (SVM) [76], K-nearest Neighbours (KNN), and Extreme Learning Machine (ELM). Color histograms are simple and effective hand-crafted features used for image classification. They are easy to compute and invariant to small changes in images, e.g., translation and rotation. The major drawback of a color histogram is that it does not provide spatial contextual information; hence, it becomes difficult to distinguish between objects of the same color but different spatial distribution. Moreover, color histograms are sensitive to variations in illumination. HOG features represent the histogram of edge orientations of spatial sub-regions. HOG can effectively extract edge and local shape details and has been utilized in various remote sensing related works [46], [77]-[79].
Scale-invariant Feature Transform (SIFT) is a broadly used, robust feature descriptor applied to image classification tasks [80]-[83]. The advantage of the SIFT descriptor is that it is invariant to changes in image scale, rotation, illumination, and noise. SIFT is used to extract local features that describe a specific point in the image. Its disadvantage is that it is mathematically complex, which increases its computational cost. GIST provides a global description of important aspects of an image, that is, the scales and orientations (gradient information) of various subregions of the image. GIST builds a spatial envelope in terms of different statistical properties such as roughness, openness, and ruggedness [84]. Texture descriptors such as local binary patterns (LBPs) are used for remote sensing image analysis [71], [85]. LBPs describe the texture around each pixel by choosing pixels from its square neighborhood and thresholding the gray-level values of all neighborhood pixels with respect to the central pixel.
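As a rough illustration of the LBP thresholding step described above, the following sketch computes a basic 3 × 3 LBP code for every interior pixel of a gray-level image. The function name `lbp_3x3`, the clockwise bit order, and the use of `>=` for the comparison are assumptions made for this example; published LBP variants differ in these conventions.

```python
import numpy as np

def lbp_3x3(image):
    """Basic 8-neighbour LBP code for each interior pixel of a 2D gray-level image."""
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    center = image[1:-1, 1:-1]
    codes = np.zeros(center.shape, dtype=np.int64)
    # Neighbour offsets, visited clockwise from the top-left corner
    # (an arbitrary but fixed order chosen for this sketch).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dr, dc) in enumerate(offsets):
        # Shifted view of the image aligned with the interior pixels.
        neighbour = image[1 + dr:rows - 1 + dr, 1 + dc:cols - 1 + dc]
        # Threshold the neighbour against the central pixel and set one bit.
        codes += (neighbour >= center) * (1 << bit)
    return codes.astype(np.uint8)

example = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]])
example_code = lbp_3x3(example)  # single interior pixel (value 5)
```

A histogram of these codes over a window is what is typically used as the texture feature; that step is omitted here for brevity.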
Hand-crafted features can effectively represent the various attributes of an image and hence work well with the data for which they were designed. However, these features may be insubstantial in the case of real data; it is therefore difficult to balance robustness and discriminability, as the set of optimal features varies considerably between different data. Furthermore, human involvement in designing the features considerably affects the classification process, as it requires a high level of domain expertise to design hand-crafted features.
To mitigate the limitations of hand-crafted feature designing, a deep feature learning strategy was proposed by Hinton and Salakhutdinov in 2006 [108]. Deep learning (DL) based methods can automatically learn the features from data in a hierarchical manner, to construct a model with growing semantic layers until a suitable representation is achieved. Such models have shown great potential for feature representation in remote sensing image classification [109], [110].
DL architectures can learn the behavior of any data without any prior knowledge regarding the statistical distribution of the input data [111] and can extract both linear and non-linear features of input data without any pre-specified information. Such systems are capable of handling HSI data in both spectral and spatial domains individually, and also in a coupled fashion. DL systems possess a flexible architecture in terms of types of layers and their depth and are adaptive to various machine learning strategies like supervised, semi-supervised, and unsupervised techniques.

B. Hyperspectral Data Characteristics and DL Challenges
Despite the DL potential discussed above, there are still some challenges that need to be considered when applying DL to HSI data. Most of these challenges are related to the characteristics of HSI data, i.e., hundreds of contiguous and narrow spectral channels with very high spectral resolution and low spatial resolution throughout the electromagnetic spectrum, coupled with the limited availability of training data. Although pixels with rich spectral information are useful for classification purposes, the computation of such data takes a lot of time and resources.
Furthermore, processing such high-dimensional data is a somewhat complex task due to the increased number of parameters. This is known as the curse of dimensionality, which considerably influences classification performance, especially in the case of supervised learning [112]. The training data is often insufficient and/or unreliable (i.e., the training samples may not provide any new information to the model or may have similar patterns/structures) for properly training the classifier, which may lead the model to overfit. This is known as the Hughes phenomenon [113], which occurs when the amount of labeled training data is significantly smaller than the number of spectral bands present in the data. The lack of labeled HSI data is a major issue in HSIC, as labeling HSI is a time-consuming and expensive task that usually requires human experts or investigation of real-time scenarios.
In addition to high dimensionality, HSIC suffers from various other artifacts, such as high intra-class variability due to unconfined variations in reflectance values caused by several environmental interferers, and degradation of the data caused by instrumental noise during acquisition [114]. Furthermore, the addition of redundant bands due to HSI instruments affects the computational complexity of the model. Spectral mixing is another challenge related to the spatial resolution of HSI. HSI pixels with low to average spatial resolution cover vast spatial regions on the surface of the earth, leading to mixed spectral signatures, which result in high inter-class similarity in border regions. As a result, it becomes difficult to identify materials based on their spectral reflectance values [115]. The following are some of the main challenges encountered when DL is applied to HSIC:
• Complex Training Process: Training a Deep Neural Network (DNN) and optimizing it by tuning parameters is an NP-complete problem in which the convergence of the optimization process is not guaranteed [116]-[118]. Therefore, training a DNN is assumed to be very difficult [111], especially in the case of HSI, where a large number of parameters need to be adjusted/tuned. However, convergence has become somewhat easier thanks to the advancement of various optimization techniques for deep CNNs. Stochastic gradient descent (SGD) [119] and its momentum version (SGDM) [120], RMSProp [121], Adam [122], AdamW [123], diffGrad [124], RAdam [125], gradient centralization (GC) [126], and AngularGrad [127] are among the successful CNN optimization techniques widely used in classification problems.
• Limited Availability of Training Data: As discussed above, supervised DNNs require a considerably large amount of training data; otherwise, their tendency to overfit increases significantly [128], which leads to the Hughes phenomenon. The high-dimensional nature of HSI coupled with a small amount of labeled training data makes DNNs ineffective for HSIC, as it demands a lot of adjustments during the training phase [69].
• Model's Interpretability: The training procedure of DNNs is difficult to interpret and understand. The black-box nature of DNNs is considered a potential weakness and may affect the design decisions of the optimization process, although a lot of work has been done to interpret the model's internal dynamics.
• High Computational Burden: One of the main challenges of DNNs is dealing with a large amount of data, which involves increased memory bandwidth, high computational cost, and storage consumption [129]. However, advanced processing techniques such as parallel and distributed architectures [130], [131] and high-performance computing (HPC) [115] make it possible for DNNs to process large amounts of data.
• Training Accuracy Degradation: It is assumed that deeper networks extract richer features from data [132]; however, simply adding more layers does not yield higher accuracy for all systems, because as the network's depth increases, the problem of exploding or vanishing gradients becomes more prominent [133] and affects the convergence of the model [132].
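To make the vanishing and exploding gradient problem mentioned in the last item more tangible, the toy sketch below (not taken from [132] or [133]) multiplies the chain-rule factors of a stack of one-neuron sigmoid layers. The function `backprop_gradient_scale` and the simplification that every layer shares the same weight and pre-activation are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradient_scale(depth, weight=1.0, pre_activation=0.0):
    """Magnitude of d(output)/d(input) for `depth` chained one-neuron sigmoid layers.

    Toy setup: every layer is assumed to share the same weight and to operate
    at the same pre-activation value.
    """
    s = sigmoid(pre_activation)
    # Chain-rule factor contributed by a single layer: w * sigmoid'(z).
    local_factor = weight * s * (1.0 - s)
    return abs(local_factor) ** depth

# With weight 1 the per-layer factor is at most 0.25, so the product shrinks
# geometrically with depth (vanishing gradient).
shallow = backprop_gradient_scale(depth=2)
deep = backprop_gradient_scale(depth=20)

# With a large weight the per-layer factor exceeds 1 and the product grows
# geometrically instead (exploding gradient).
exploding = backprop_gradient_scale(depth=10, weight=8.0)
```

Real networks have matrices rather than scalar weights, but the same multiplicative effect across layers is what makes very deep models hard to train without additional measures.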

III. HSI REPRESENTATION
Hyperspectral data is represented in the form of a 3D hypercube, $X \in \mathbb{R}^{B \times (N \times M)}$, which contains the 1D spectral and 2D spatial details of a sample, where $B$ represents the total number of spectral bands and $N$ and $M$ are the spatial dimensions, i.e., width and height, respectively. The HSI cube is shown in Figure 4.
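The relation between the 3D cube and the B × (N × M) matrix form can be sketched with NumPy as follows. The array sizes, the random data, and the in-memory layout (first spatial axis, second spatial axis, band) are assumptions for this example; real datasets may store the axes in a different order.

```python
import numpy as np

# Hypothetical sizes: N x M pixels, B spectral bands.
N, M, B = 4, 3, 5
rng = np.random.default_rng(0)
cube = rng.random((N, M, B))  # assumed layout: (first spatial axis, second spatial axis, band)

# Flatten the two spatial axes so that each column is one pixel's spectrum:
# the result has shape B x (N*M), matching the matrix form used in the text.
X = cube.reshape(N * M, B).T

# Spectral view: the B-dimensional pixel vector at spatial position (r, c).
r, c = 2, 1
pixel_vector = X[:, r * M + c]

# Spatial view: the N x M image of a single band.
band_image = X[0].reshape(N, M)
```

The spectral and spatial representations discussed in the following subsections are simply these two ways of slicing the same array.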

A. Spectral Representation
In such representations, each pixel vector is isolated from the other pixels and processed based on its spectral signature only, which means that the pixel is represented only in spectral space, $x_i \in \mathbb{R}^B$, where $B$ can either be the actual number of spectral channels or just the relevant spectral bands extracted by some dimensionality reduction (DR) method. Usually, instead of using the original spectral bands, a low-dimensional representation of HSI is preferred for data processing in order to avoid redundancy and achieve better class separability without considerable loss of useful information.
Dimensionality reduction (DR) approaches for spectral HSI representation can be either supervised or unsupervised. Unsupervised techniques transform the high-dimensional HSI into a low-dimensional space without using class label information; examples include Principal Component Analysis (PCA) and locally linear embedding [134]. Supervised DR methods, on the other hand, utilize labeled samples to learn the data distribution, i.e., to keep data points of the same class close to each other and to separate data points of different classes. Examples include linear discriminant analysis (LDA), local Fisher discriminant analysis (LFDA) [135], local discriminant embedding (LDE) [136], and nonparametric weighted feature extraction (NWFE) [137]. LDA and LFDA provide better class separability by maximizing the inter-class distance of data points and minimizing the intra-class distance. However, due to the spectral mixing effect, in which the same material may appear with different spectra or different materials may have the same spectral signature, it becomes difficult to differentiate among classes based on spectral reflectance values alone.
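As a minimal sketch of unsupervised spectral DR, the block below applies PCA to a matrix of pixel spectra using the singular value decomposition. The helper name `pca_reduce`, the synthetic random data, and the choice of three components are assumptions for illustration only.

```python
import numpy as np

def pca_reduce(pixels, n_components):
    """Project pixel spectra onto their leading principal components.

    pixels: array of shape (num_pixels, B), one spectrum per row.
    Returns the (num_pixels, n_components) scores and the fraction of
    variance explained by each retained component.
    """
    centered = pixels - pixels.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return scores, explained[:n_components]

# Synthetic stand-in for 500 pixels with 20 spectral bands.
rng = np.random.default_rng(0)
pixels = rng.normal(size=(500, 20))
scores, explained_ratio = pca_reduce(pixels, n_components=3)
```

No class labels enter the computation, which is what distinguishes PCA from supervised alternatives such as LDA.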

B. Spatial Representation
To deal with the limitations of spectral representation, another approach is to exploit the spatial information of the pixels, in which the pixels in each band are represented in the form of a matrix, $x_i \in \mathbb{R}^{N \times M}$. Due to high spatial correlation, neighboring pixels have a higher probability of belonging to the same class. Therefore, in the case of spatial representation, the information of neighboring pixels is also considered, and the neighborhood of a pixel can be determined using a kernel or a pixel-centric window [138]. Some common methods to extract spatial information from the HSI cube are morphological profiles (MPs), texture features (such as Gabor filters, the gray-level co-occurrence matrix (GLCM), and local binary patterns (LBP)), and DNN-based methods. Morphological profiles are capable of extracting geometrical characteristics. A few extensions of MPs include extended morphological profiles (EMPs) [139], multiple-structure-element morphological profiles [140], and invariant attribute profiles (IAPs) [141].
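A pixel-centric window can be sketched as follows. The function `extract_patch`, the (rows, columns, bands) layout, the odd window size, and the use of reflect padding at the image border are all assumptions of this example rather than choices prescribed by [138].

```python
import numpy as np

def extract_patch(cube, row, col, window=5):
    """Return the window x window x B neighbourhood centred on (row, col).

    cube: array of shape (rows, cols, B). `window` is assumed to be odd.
    Border pixels are handled by reflect padding (one of several possible choices).
    """
    half = window // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="reflect")
    # After padding, original pixel (row, col) sits at (row + half, col + half),
    # so the slice below is centred on it.
    return padded[row:row + window, col:col + window, :]

cube = np.arange(6 * 7 * 2).reshape(6, 7, 2)
interior_patch = extract_patch(cube, 3, 3, window=3)
corner_patch = extract_patch(cube, 0, 0, window=3)
```

Each such patch, labeled with the class of its central pixel, is a typical input sample for spatial and spectral-spatial models.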
The texture of the image provides useful spatial contextual information about the HSI. For instance, a Gabor filter, a texture analysis technique, can efficiently obtain textural information at various scales and orientations. Similarly, LBP can provide a rotation-invariant spatial texture representation. The GLCM can effectively determine the spatial variability of HSI by exploiting the relative positions of neighborhood pixels. DNNs can also extract the spatial information of HSI by treating a pixel as an image patch instead of representing it as a spectral vector. The spatial information contained in HSI can also be extracted by combining several of the methods discussed above. For instance, [142] combined a Gabor filter and differential morphological profiles [143] to extract local spatial sequential features for a recurrent neural network (RNN) based HSIC framework.
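To show how a Gabor filter encodes scale and orientation, the sketch below builds the real part of a 2D Gabor kernel in its standard form: a Gaussian envelope multiplied by a cosine carrier. The function name `gabor_kernel` and all default parameter values are assumptions for illustration; they are not the settings used in [142] or in any other cited work.

```python
import numpy as np

def gabor_kernel(size=9, wavelength=4.0, theta=0.0, sigma=2.0, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor kernel on a size x size grid (size assumed odd).

    wavelength sets the scale of the cosine carrier, theta its orientation in
    radians, sigma the width of the Gaussian envelope, gamma its aspect ratio,
    and psi the phase offset.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate the coordinate system by theta.
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot ** 2 + (gamma * y_rot) ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_rot / wavelength + psi)
    return envelope * carrier

# A small filter bank: one kernel per orientation.
orientations = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
bank = [gabor_kernel(theta=t) for t in orientations]
```

Convolving a band image (or a principal component) with each kernel in such a bank yields one texture response map per scale and orientation.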

C. Spectral-Spatial Representation
This representation jointly exploits both the spectral and spatial information of the data. In such approaches, a pixel vector is processed based on spectral features while considering spatial-contextual information. The strategies that simultaneously use both spectral and spatial representations of HSI either concatenate the spatial details with the spectral vector [62], [144] or process the 3D HSI cube to preserve the actual structure and contextual information [145].
In literature, all these HSI representations are widely exploited for HSIC. Most of the DNNs for pixel-wise classification utilized the spectral representation of HSIs [146], [147]. However, to mitigate the limitations of spectral representation, many efforts have been made to incorporate the spatial information [148], [149]. Recently, joint exploitation of both spectral and spatial features has gained much popularity and led to improved classification accuracy [67], [150]- [154]. These HSI feature exploitation approaches, for HSIC, are further discussed in the following sections.

IV. LEARNING STRATEGIES
Deep learning models can adopt various learning strategies that can be broadly categorized into the following:

A. Supervised Learning
In a supervised learning approach, the model is trained on labeled training data, which means that the training data comprises a set of inputs and their corresponding outputs or class labels. During the training phase, the model iteratively updates its parameters in order to predict the desired outputs accurately. In the testing phase, the model is tested against new input/test data in order to validate its ability to predict the correct labels. If trained sufficiently, the model can predict the labels of new input data. However, supervised learning of DNNs requires a lot of labeled training data to fine-tune the model parameters. Therefore, supervised models are best suited to scenarios where plentiful labeled data is available. The details of various supervised learning techniques for DNNs are explained in the respective sections.

B. Unsupervised Learning
In contrast to the supervised learning approach, unsupervised learning techniques learn from input data with no explicit labels associated with it. These approaches try to identify the underlying statistical structure of the input representations or patterns in the absence of corresponding labels. As there is no ground truth available for the training data, it might be difficult to measure the accuracy of the trained model. However, such learning strategies are useful when we want to learn the inherent structure of datasets for which training data is scarce. Principal component analysis (PCA) is an unsupervised learning technique that can be used to learn a low-dimensional representation of the input. Similarly, k-means clustering is another unsupervised learning method that groups the input data into homogeneous clusters.
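The k-means idea mentioned above can be sketched with Lloyd's algorithm in a few lines of NumPy. The function `kmeans`, its random initialization from the data points, and the synthetic two-cluster "spectra" are assumptions made for this example.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centers) for `points` of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance of every point to every center: shape (n, k).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated synthetic groups of 3-band "spectra".
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=0.1, size=(20, 3))
group_b = rng.normal(loc=5.0, scale=0.1, size=(20, 3))
labels, centers = kmeans(np.vstack([group_a, group_b]), k=2)
```

No labels are used at any point; the cluster indices are arbitrary and would still need to be matched to semantic classes if ground truth became available.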

C. Semi-supervised Learning
The semi-supervised learning technique is halfway between unsupervised and supervised approaches. It learns from partially labeled datasets; that is, a small amount of labeled training data is utilized to label the rest of the unlabeled data. These techniques effectively utilize all available data instead of just the labeled data; therefore, they have gained much popularity in the research community and are widely used for HSIC [155]-[158]. The details of these methods are briefly described in Section X.

V. DEVELOPMENT OF DNNS (TYPES OF LAYERS)
In the following, we review recent developments of some widely used DNN frameworks for HSIC. We specifically survey the literature published from 2017 onward. DNNs offer a great variety of flexible and configurable models for HSIC that allow the incorporation of several types of layers. A few widely used types of layers are explained in the following.
A layer is the key building block of DNN and the type of layer has a decisive impact in terms of feature processing. A layer takes the weighted input, processes it through linear or non-linear transformation, and outputs these values to the next layer. Generally, a layer is uniform, as it has a single activation function. The first layer of the network is known as the input layer and the last layer as an output layer. All other layers in the network, in between the input and output layers, are known as hidden layers. These layers progressively find different features in the input data by performing various transformations. The choice of layer type depends on the task at hand, as some layers perform better for some tasks than others. The most commonly used layers for HSIC are explained below.

A. Fully Connected Layers
A fully connected (FC) layer connects every neuron in the lower layer to every neuron in the upper/next layer. Mostly, they are used as the last few layers of a model usually after convolution/pooling layers. FC takes the output of the previous layer and assigns weights to predict the probabilities for class labels. Due to a large number of connections, a large number of parameters need to be adjusted which significantly increases the computational overhead. Moreover, due to a large number of parameters, the model becomes more sensitive to overfitting [49]. However, to mitigate the effect of overfitting, a dropout method is introduced in [159].
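The parameter count of an FC layer and the effect of dropout can be illustrated as follows. The layer sizes are hypothetical, and the dropout shown is the commonly used "inverted" formulation; whether [159] uses exactly this scaling convention is not stated in the text above.

```python
import numpy as np

def fc_forward(x, weights, bias):
    """Fully connected layer: every input unit is connected to every output unit."""
    return x @ weights + bias

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units during training
    and rescale the survivors so that the expected activation is unchanged."""
    if not training or rate == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

# Hypothetical layer: 200 spectral bands in, 64 hidden units out.
n_in, n_out = 200, 64
n_params = n_in * n_out + n_out  # weights plus biases = 12864

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.01, size=(n_in, n_out))
bias = np.zeros(n_out)
batch = rng.random((8, n_in))

hidden = fc_forward(batch, weights, bias)
hidden_train = dropout(hidden, rate=0.5, rng=rng, training=True)
hidden_test = dropout(hidden, rate=0.5, rng=rng, training=False)
```

Even this small layer has more than twelve thousand trainable values, which helps explain why stacks of FC layers overfit easily when labeled samples are scarce.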

B. Convolutional Layers
The convolutional (CONV) layer convolves the input data or the feature maps from a lower layer with filters (kernels). A filter contains weights whose dot product is calculated with a subset of the input data as the filter moves across the width, height, and depth of the input region. The output of a filter is known as a feature map. The CONV layer provides spatial invariance via a local connectivity approach, in which a neuron in the feature map connects to a subset of the input from the previous layer rather than to every neuron. This reduces the number of parameters that need to be trained. To further reduce the number of parameters, the CONV layer uses parameter sharing, in which the same weights are used across a particular feature map.
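Local connectivity and parameter sharing can be seen in the following single-channel sketch. The function `conv2d_valid` is a hypothetical helper that, like most DL frameworks, computes a cross-correlation without padding; the image and kernel sizes are arbitrary choices for this example.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over a 2D input (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Local connectivity: each output value sees only a kh x kw region.
            # Parameter sharing: the same kernel weights are reused at every position.
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))
feature_map = conv2d_valid(image, kernel)

# Parameter comparison for a hypothetical 32 x 32 input and 30 x 30 output:
conv_params = 3 * 3                # one shared 3 x 3 kernel
fc_params = (32 * 32) * (30 * 30)  # dense connection between the same shapes
```

The shared kernel needs 9 weights, whereas a dense mapping between the same input and output sizes would need 921600, which is the saving the paragraph above refers to.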

C. Activation Layers
Activation layers are assumed to be the feature detector stage of DNNs [160]. FC and CONV layers provide linear representations of the input data; in other words, they work similarly to linear regressors, and data transformed by these layers is considered to be at the feature extraction stage [69]. Therefore, to learn non-linear features of the data, an activation layer must be used after FC and CONV layers. In the activation layer, feature maps from previous layers go through an activation function to form an activation map. Some commonly used activation functions are sigmoid, hyperbolic tangent (tanh), rectified linear unit (ReLU), LiSHT [161], and softmax. In HSI analysis, softmax and ReLU are the most widely employed activation functions [69]. Figure 5 presents a graphical representation of a few commonly utilized activation functions.

D. Pooling Layers
The pooling layer, also known as the sub-sampling or down-sampling layer, takes a certain input volume and reduces it to a single value, as shown in Figure 6. This provides invariance to small distortions in the data. The pooling layer helps the model control overfitting, as both the size of the data and the number of model parameters are reduced, which also decreases the computational time. The commonly used down-sampling operations are max-pooling, average-pooling, and sum-pooling. Recently, a wavelet-pooling technique was introduced in [162], whose performance is comparable to that of max-pooling and average-pooling. Alternatively, [163] proposed replacing the pooling layer with a CONV layer of increased filter stride.
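The activation functions and pooling operations named above can be sketched in NumPy as follows. The helper names (`relu`, `softmax`, `pool2d`), the non-overlapping 2 × 2 window, and the cropping of rows and columns that do not fill a complete window are assumptions of this example.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: keeps positive values, zeroes the rest."""
    return np.maximum(x, 0.0)

def softmax(logits):
    """Softmax over the last axis, shifted by the maximum for numerical stability."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping size x size pooling of a 2D feature map."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # crop so the map divides evenly
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    if mode == "sum":
        return blocks.sum(axis=(1, 3))
    raise ValueError(f"unknown pooling mode: {mode}")

feature_map = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(feature_map, mode="max")      # [[5, 7], [13, 15]]
avg_pooled = pool2d(feature_map, mode="average")  # [[2.5, 4.5], [10.5, 12.5]]
probabilities = softmax(np.array([2.0, 1.0, 0.1]))
```

Each 2 × 2 block of the 4 × 4 map collapses to one value, so the output has a quarter of the original entries regardless of the pooling mode.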

VI. CONVOLUTIONAL NEURAL NETWORK (CNN)
The architecture of the Convolutional Neural Network (CNN) is inspired by the biological visual system presented in [164]. Following the natural visual recognition mechanism proposed by Hubel and Wiesel [164], the Neocognitron [165] is regarded as the first hierarchical, position-invariant model for pattern recognition [166] and can be considered the predecessor of CNN [167]. The architecture of a CNN can be divided into two main stages: a Feature Extraction (FE) network, and a classification stage that operates on the feature maps extracted in the first stage.
The FE network consists of multiple hierarchically stacked CONV, activation, and pooling layers. The CONV layer extracts the features from input data by convolving a learned kernel with it. On each CONV layer, the kernel is spatially shared with whole input data which reduces the model's complexity and the network becomes easier to train as the number of parameters that need to be fine-tuned is reduced. Convolved results are then passed through an activation layer which adds nonlinearities in the network to extract non-linear features of the input. This is achieved by applying a non-linear function to the convolved results. Afterward, the resolution of the feature map is reduced by applying a pooling operation to achieve shift-invariance. Generally, the pooling layer is added with every CONV layer followed by the activation function.
The classification stage, consisting of FC layers and a softmax operator, gives the probability of the input pattern belonging to a specific class based on the feature maps extracted at the FE stage. An FC layer connects every neuron in the previous layer to every neuron in the current layer. In [168] and [169], the authors proposed that the FC layer can be replaced by a global average pooling layer. Softmax is commonly used for classification tasks [170], [171]; however, many works have also utilized SVM [172], [173] for this purpose.
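A classification head based on global average pooling followed by softmax might look like the sketch below. It assumes, for illustration only, that the last CONV layer produces exactly one feature map per class; the function names and the toy feature maps are hypothetical and do not reproduce the designs of [168] or [169].

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Collapse each feature map to its mean: (channels, height, width) -> (channels,)."""
    return feature_maps.mean(axis=(1, 2))

def softmax(logits):
    """Softmax with the usual max-shift for numerical stability."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def classify(feature_maps):
    """Parameter-free head: one feature map per class is assumed."""
    return softmax(global_average_pooling(feature_maps))

# Toy output of a last CONV layer for a 3-class problem on a 4 x 4 patch.
feature_maps = np.stack([
    np.full((4, 4), 0.2),
    np.full((4, 4), 1.5),
    np.full((4, 4), -0.3),
])
class_probabilities = classify(feature_maps)
predicted_class = int(np.argmax(class_probabilities))
```

Because the pooling step has no weights, this head adds no trainable parameters, in contrast to an FC layer of the same output size.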
In the following, we review three types of CNN architectures for HSIC: i) Spectral CNN, ii) Spatial CNN, and iii) Spectral-spatial CNN. Figure 7 illustrates the general architecture of these three frameworks.

A. Spectral CNN Frameworks for HSIC
Spectral CNN models only consider 1D spectral information ($x_i \in \mathbb{R}^B$) as input, where $B$ could either be the original number of spectral bands or the appropriate number of bands extracted by some dimensionality reduction method. In [174], a CNN structure was proposed to mitigate the overfitting problem; it achieved better generalization capability by utilizing 1 × 1 convolutional kernels and enhanced dropout rates. Moreover, a global average pooling layer was used in place of a fully connected layer in order to reduce the network parameters. To reduce the high correlation among HSI bands, [169] proposed a CNN architecture for HSIC that fully utilizes the spectral information by transforming the 1D spectral vector into a 2D feature matrix. By cascading composite layers consisting of 1 × 1 and 3 × 3 CONV layers, the architecture achieves feature reuse. Similar to [174], [169] also utilized a global average pooling layer to lower the number of trainable parameters and to extract high-dimensional features.
In [175], the authors presented a hybrid model for HSIC in which the first few CONV layers are employed to extract position-invariant middle-level features, and recurrent layers are then used to extract spectral-contextual details. Similarly, [146] used a hybrid architecture for classifying healthy and diseased wheat heads. For the input layer, they transformed the spectral information into a 2D data structure. In [176], a CNN proved to be more effective than SVM and KNN for the spectral-based identification of rice seed varieties. A similar application of CNN was explored in [147], where varieties of Chrysanthemum were identified using spectral data of the first five principal components (PCs) obtained by Principal Component Analysis (PCA). PCA is a dimensionality reduction method that is widely used in many DL applications to preprocess high-dimensional data. In [177], PCA was utilized to preprocess medical HSI, and the fusion of CNN kernels with Gabor kernels using a dot product was then used for classification.
The study [178] analyzed another dimensionality reduction technique Dynamic Mode Decomposition (DMD) which converted 3D HSI data to 2D and then this data is fed to vectorized CNN (VCNN) for classification. To overcome the noise effect in pixel-wise HSIC, a method of averaged spectra is used in [179] where an averaged spectra of a group of pixels belonging to bacterial colonies is extracted for further analysis.
Fig. 7: General architecture of Spectral CNN, Spatial CNN and Spectral-spatial CNN frameworks for HSIC.

B. Spatial CNN frameworks for HSIC
Spatial CNN models consider only spatial information. To extract it from HSI data, dimensionality reduction (DR) methods are applied in the spectral domain to lower the dimensionality of the original HSI data. For instance, [180] used PCA to extract the first PC with refined spatial information and fed it to a fully CNN framework for classification. Similarly, [181] trained a spatial-based 2D-CNN with one PC. In [182], PCA-whitened input data with three PCs is fed to a random patches network, which serves as a 2D-CNN classification framework. However, limited training samples with highly similar spectral features make DL models prone to overfitting. To overcome this, [183] proposed a probabilistic neighbourhood pooling based attention network (PNPAN) for HSI classification.
The method proposed in [184] cropped the patches from 2D input images (i.e. images from the different spectral bands) to train a 2D-CNN architecture that learns the data-adaptive kernels by itself. Furthermore, some authors also proposed the utilization of handcrafted features along with spectral-domain reduction. For example, [185] combined the Gabor filtering technique with 2D-CNN for HSIC to overcome the overfitting problem due to limited training samples. The Gabor filtering extracts the spatial details including edges and textures which effectively reduce the overfitting problem. The work [186] proposed a deformable HSIC network based on the concept of deformable sampling locations which can adaptively adjust their size and shape in accordance with HSI's spatial features. Such sampling locations are created by calculating 2D offsets for every pixel in the input image through regular convolutions by taking into account three PCs. These offsets can cover the locations of similar neighboring pixels possessing similar characteristics. Then structural information of neighboring pixels is fused to make deformable feature images. Regular convolution employed on these deformable feature images can extract more effective complex structures.
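Patch-based training of a spatial 2D-CNN, as in the patch cropping described above, requires a fixed-size window around every labeled pixel. A minimal sketch follows; the reflective padding at the image border and the name `extract_patches` are assumptions, not details taken from [184].

```python
import numpy as np

def extract_patches(image, coords, patch_size):
    """Crop square patches centered on the given pixels.

    image:      (rows, cols, channels), e.g. a few PC images
    coords:     iterable of (row, col) pixel positions
    patch_size: odd window size
    """
    r = patch_size // 2
    # Reflect-pad so that border pixels also get a full window (assumption).
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    # After padding, pixel (i, j) sits at (i + r, j + r), so the window starts at (i, j).
    return np.stack([padded[i:i + patch_size, j:j + patch_size] for i, j in coords])

image = np.arange(8 * 9 * 3, dtype=float).reshape(8, 9, 3)   # toy 3-channel image
coords = [(0, 0), (4, 5), (7, 8)]
patches = extract_patches(image, coords, patch_size=5)        # shape (3, 5, 5, 3)
```

The center of each patch holds the pixel whose label is used as the training target.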

C. Spectral-Spatial CNN frameworks for HSIC
Spectral-spatial pixel-wise HSIC can be achieved by integrating spatial features into spectral information. For instance, [187] presented an improved pixel pair feature (PPF) approach, called spatial pixel pair feature, which differs from traditional PPFs in two main aspects: first, in the selection of pixel pairs, only pixels from the immediate neighborhood of the central pixel can be used to form a pair; second, the label of each pixel pair is that of the central pixel.
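The two rules of the spatial pixel pair construction can be sketched as below. Taking the immediate neighborhood to be the 8-connected neighborhood and forming a pair by concatenating the two spectra are assumptions made for illustration; [187] may define both differently.

```python
import numpy as np

def spatial_pixel_pairs(cube, labels, row, col):
    """Pair the central pixel with each immediate neighbor.

    cube:   (rows, cols, bands) HSI data
    labels: (rows, cols) class labels
    Returns (pairs, pair_labels); every pair inherits the central pixel's label.
    """
    rows, cols, _ = cube.shape
    center = cube[row, col]
    pairs = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue                      # skip the central pixel itself
            r, c = row + dr, col + dc
            if 0 <= r < rows and 0 <= c < cols:
                # Assumed pair representation: concatenated spectra.
                pairs.append(np.concatenate([center, cube[r, c]]))
    pair_labels = np.full(len(pairs), labels[row, col])
    return np.stack(pairs), pair_labels

rng = np.random.default_rng(0)
cube = rng.normal(size=(4, 5, 10))
labels = np.arange(20).reshape(4, 5)
pairs, pair_labels = spatial_pixel_pairs(cube, labels, row=2, col=2)
```

An interior pixel yields eight pairs, while a corner pixel yields three, so the number of training instances grows without any additional labeling.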
To extract discriminative joint representation [188] introduced a Supervised Spectral-Spatial Residual Network (SSRN) that uses a series of 3D convolutions in the respective spectral and spatial residual blocks. An efficient deep 3D-CNN framework was proposed in [189] that simultaneously exploits both spectral and spatial information for HSIC.
Similarly, to reflect the variations of spatial context in different hyperspectral patches, [190] implemented an adaptive weight learning technique instead of assigning fixed weights to incorporate spatial details. Besides this, to make the convolutional kernel more flexible, [154] explored a new architectural design that can adaptively find an adjustable receptive field, followed by an improved spectral-spatial residual network for joint feature extraction. The discriminative power of the extracted features can be further improved by combining both the max and min convolutional features before the ReLU non-linearity, as reported in [191] for the classification task. Because CNNs fail to exploit rotation equivariance in a natural way, [192] introduced translation-equivariant representations of input features, which provide extra robustness to the spatial feature locations for HSIC.
Deeper networks may suffer from overfitting and vanishing gradients because of the small number of available labeled training samples; to overcome this shortcoming, lightweight CNNs have gained considerable attention in the HSIC community. The paper [193] introduced an end-to-end 3D lightweight convolutional neural network to tackle the limited number of training samples for HSI classification. To reduce the large gap between the massive number of trainable parameters and the limited labeled samples, [194] proposed to extract joint spatial-spectral information using spatial-spectral Schroedinger eigenmaps (SSSE), and then further reduced the dimensionality using a compression technique. Approximately 90% of the total trainable weights are used immediately after the flatten operation, i.e., in the fully connected layer, whereas only the remaining 10% are used in the preceding convolutional layers of the whole network. To overcome this, the paper [195] introduced a lightweight bag-of-feature learning paradigm into an end-to-end spectral-spatial squeeze-and-excitation residual network for HSIC.
The morphological operations i.e., erosion and dilation are powerful nonlinear feature transformations that are widely used to preserve the essential characteristics of shape and structural information of an image. Inspired by these the paper [196] introduced a new end-to-end morphological convolutional neural network (MorphCNN) for HSIC which utilizes both the spectral and spatial features by concatenating the outputs from spectral and spatial morphological blocks extracted in a dual-path fashion.
The work [190] proposed a two-stage framework for joint spectral-spatial HSIC which can directly extract both spectral and spatial features instead of independently concatenating them. The first stage of the proposed network is comprised of a CNN and softmax normalization that adaptively learns the weights for input patches and extracts joint shallow features. These shallow features are then fed to a network of Stacked Autoencoder (SAE) to obtain deep hierarchical features and final classification is performed with a Multinomial Logistic Regression (MLR) layer. A 3D-CNN model was introduced in [197] to jointly exploit spectral-spatial features from HSI and to validate its performance comparison is performed with spectral-based DBN, SAE, and 2D-spatial CNN for HSIC. The work [198] introduced a bilinear fusion mechanism over the two branches of squeeze operation based on the global and max-pooling whereas the excitation operation is performed with the fused output of squeeze operation.
The work [199] proposed a deep multiscale spectral-spatial feature extraction approach for HSIC which can learn effective discriminant features from the images with high spatial diversity. The framework utilizes the Fully Convolutional Network (FCN) to extract deep spatial information and then, these features are fused with spectral information by using a weighted fusion strategy. Finally, pixel-wise classification is performed on these fused features.
In [200], a dual-channel CNN framework was implemented for spectral-spatial HSIC. In the proposed approach, a 1D-CNN is used to hierarchically extract spectral features and a 2D-CNN to extract hierarchical spatial features. These features are then combined for the final classification task. Furthermore, to overcome the deficiency of training data and to achieve higher classification accuracy, the proposed framework is supported by a data augmentation technique that can increase the training samples by a factor of 6. In [201], a multiscale 3D deep CNN is introduced for end-to-end HSIC, which can jointly learn both 1D spectral and 2D multiscale spatial features without any pre-processing or post-processing techniques such as PCA. In order to reduce the band redundancy or noise in HSI, [202] explored a novel architecture for HSIC by embedding a band attention module in the traditional CNN framework. The study [203] proposed an HSIC architecture in which PCA-transformed images are used to obtain multi-scale cubes for handcrafted feature extraction by utilizing multi-scale covariance maps, which can simultaneously exploit the spectral-spatial details of HSI. These maps are then used to train a traditional CNN model for classification.
The work [204] combined CNN with metric learning-based HSIC framework which first utilizes CNN to extract deep spatial information using the first three PCs extracted by PCA. Then, in a metric learning-based framework, spectral and spatial features are fused for spectral-spatial feature learning by embedding a metric learning regularization factor for the classifier's training (SVM). Similarly, [205] combines multiscale convolution-based CNN (MS-CNN) with diversified deep metrics based on determinantal point process (DPP) [206] priors for (1D spectral, 2D spectral-spatial, and 3D spectral-spatial) HSIC. Multiscale filters are used in CNN to obtain multi-scale features and DPP-based diversified metric transformation is performed to increase the inter-class variance and decrease intra-class variance, and better HSI representational ability. Final classification maps are obtained by using a softmax classifier.
In a recent work [207], an HSIC framework is proposed to extract multi-scale spatial features by constructing a three-channel virtual RGB image from HSI instead of extracting the first three PCs through PCA. The purpose of using a three-channel RGB image is to utilize existing networks trained on natural images to extract spatial features. For multi-scale feature extraction, these images are passed to a fully convolutional network. These multi-scale spatial features are fused and further joined with PCS extracted spectral features for final classification via SVM.
A two-branch (spectral and spatial) DNN for HSIC was introduced in [208]. The spatial branch consists of a band selection layer and a convolutional and de-convolutional framework with skip architecture to extract spatial information of HSI, and in the spectral branch, a contextual DNN is used to extract spectral features. The paper [209] introduced an adaptive band selection based semi-supervised 3D-CNN to jointly exploit spectral-spatial features, whereas [210] explored a dual-attention based autoencoder-decoder network for unsupervised hyperspectral band selection followed by joint feature extraction for land cover class prediction. Similarly, in [211] spectral-spatial features are simultaneously exploited in an unsupervised manner using a 3D convolutional autoencoder. Pixel-wise land use and land cover (LULC) classification using traditional CNNs often suffers from wrong or noisy labels in the training set, and such models can easily overfit to the label noise. To overcome this problem, [212] proposed a lightweight heterogeneous kernel convolution (HetConv3D) for HSI classification with noisy labels, which effectively combines both spectral and spatial kernel features to produce discriminative and invariant feature maps for classification.
A hybrid 3D-2D-CNN architecture was presented by [213], in which a 3D-CNN is first used to extract joint spectral-spatial features and a 2D-CNN is then used to obtain more abstract spatial contextual features. The study [214] proposed to use an adaptive Markov random field for HSIC: the CNN first extracts joint spectral-spatial features, and then a smooth MRF prior is placed on the class labels to further refine the spatial details. Convolutional neural networks are greatly affected by overfitting and vanishing gradient problems; to overcome this, a separable attention network was introduced by [215], in which the input feature maps are divided into several groups and split along the channel dimension, and finally an attention mask encodes global contextual information by combining them. Recently, generalized gradient centralized 3D convolution (G2C-Conv3D) was introduced in [216] to combine both the intensity-level semantic information and the gradient-level detailed information extracted from raw HSIs during the convolution operation. To boost the performance of land-cover type classification, G2C-Conv3D can be easily plugged into existing HSI feature extraction networks.

D. GCN frameworks for HSIC
Graph Convolutional Networks (GCNs) [217] have been garnering increasing attention from researchers in various application fields, owing to their flexible and diversified network architecture, which is capable of processing non-grid high-dimensional data. Such properties provide new insight and possibilities for processing hyperspectral data more effectively and efficiently. In detail, GCNs enable the modeling of the relations between data (or samples). This naturally motivates the use of GCNs to capture the spatial relations of spectral signatures in HSIs. Due to the GCNs' limitations in graph construction [218], particularly for large graphs (which incur expensive computational cost), GCNs fail to classify or identify materials in large-scale hyperspectral scenes using normal PCs, which has made them less popular than CNNs in HSIC. For this reason, there have been only some tentative studies using GCNs for the HSIC task.
For example, a second-order GCN was proposed in [219] by modeling spatial-spectral relations on manifolds for HSIC, in an attempt to reduce the computational cost on graphs. The authors of [220] first applied superpixel segmentation techniques to HSIs and fed superpixels instead of pixels into GCNs. This enables the training of GCNs on a large number of pixels in HSIs, with application to the land cover classification task. Nevertheless, these methods still fail to fundamentally solve the problem of GCNs. To this end, Hong et al. [218] proposed a novel miniGCN. As the name suggests, miniGCN trains GCNs in a mini-batch fashion, in the same way as CNNs. The proposed miniGCN not only effectively reduces the computational cost but also makes possible a quantitative comparison and fusion with CNNs, further yielding FuNet for HSIC.
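To make the mini-batch idea concrete, the following sketch applies one graph convolution to a randomly sampled batch of pixels, so that the adjacency matrix is built only over the batch rather than over the whole scene. The propagation rule with a symmetrically normalized adjacency matrix is the commonly used GCN formulation; the distance-threshold graph, all sizes, and the function names are assumptions, and the code is not the implementation of [218].

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^(-1/2) (A + I) D^(-1/2) with self-loops."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, H, W):
    """One graph convolution followed by ReLU: max(A_norm H W, 0)."""
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(0)
n_pixels, n_bands, n_hidden, batch_size = 100, 20, 8, 25     # assumed sizes
spectra = rng.normal(size=(n_pixels, n_bands))
W = rng.normal(scale=0.1, size=(n_bands, n_hidden))

# Sample a mini-batch of pixels and build the graph on the batch only,
# giving a 25 x 25 adjacency matrix instead of a 100 x 100 one.
batch = rng.choice(n_pixels, size=batch_size, replace=False)
X_batch = spectra[batch]
dist = np.linalg.norm(X_batch[:, None] - X_batch[None, :], axis=-1)
A_batch = (dist < np.median(dist)).astype(float)   # assumed similarity graph
np.fill_diagonal(A_batch, 0.0)

H_batch = gcn_layer(normalize_adjacency(A_batch), X_batch, W)   # shape (25, 8)
```

Because every batch produces a fixed-size feature matrix, the output can be handled in the same training loop as CNN features, which is what makes a direct comparison or fusion feasible.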

E. Future directions for CNN-based HSIC
In the preceding section, we have reviewed the recent developments of CNNs for HSIC. Although CNN-based HSIC frameworks have achieved great success with respect to classification performance, there are still many aspects that need further investigation. For instance, further work is needed on models that can jointly employ spatial and spectral information for HSIC. Many of the above-surveyed frameworks use dimensionality reduction methods to achieve better spectral-spatial representation, but such approaches discard useful spectral information of HSI. Hence, the development of robust HSIC approaches that can preserve spectral information is required. However, such approaches increase the computational burden and slow down the training process; therefore, parallel processing of such networks using FPGAs and GPUs is desired in order to achieve computationally fast models that can even be suitable for mobile platforms without performance degradation.
Moreover, as CNNs become deeper and deeper, more labeled training data is required for accurate classification, and, as discussed before, there is a lack of labeled training data in HSI. In order to overcome this issue, more research is required to integrate CNNs with unsupervised or semi-supervised approaches. Furthermore, more attention should be paid to the generalization ability of CNNs, particularly with respect to the input data format (not limited to grid data). GCNs might be a good candidate to combine with CNNs to develop a more general CNN-based framework. Using this, we expect to be able to further break the performance bottleneck, yielding more efficient HSIC.

VII. AUTOENCODERS (AE)
Autoencoder (AE) is a popular symmetrical neural network for HSIC due to its unsupervised feature learning capability. AE itself does not perform a classification task; instead, it gives a compressed feature representation of high-dimensional HSI data. AE consists of an input layer, one hidden or encoding layer, one reconstruction or decoding layer, and an output layer, as shown in Figure 8. AE is trained to encode the input data into a latent representation from which the input can be reconstructed. To learn a compressed feature representation of the input data, AE tries to reduce the reconstruction error, that is, it minimizes the difference between the input and the output. The Stacked Autoencoder (SAE), in turn, is built by stacking multiple layers of AEs in such a way that the output of one layer serves as the input of the subsequent layer. Denoising autoencoder (DAE) is a variant of AE that has a similar structure to AE except for the input data. In DAE, the input is corrupted by adding noise to it, whereas the target output is the original input signal without noise. Therefore, DAE, unlike AE, can recover the original input from a noisy input signal.
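The difference between AE and DAE training can be made concrete with a small sketch. A single hidden layer with tanh activation, a mean squared reconstruction error, additive Gaussian corruption, and plain full-batch gradient descent are all assumptions chosen for brevity; the function name and hyperparameters are hypothetical.

```python
import numpy as np

def train_autoencoder(x, n_hidden=5, noise_std=0.0, steps=1000, lr=0.02, seed=0):
    """Train a one-hidden-layer autoencoder; noise_std > 0 gives a denoising AE.

    The target is always the clean input x; only the encoder input is corrupted.
    Returns the reconstruction loss recorded at every step.
    """
    rng = np.random.default_rng(seed)
    n_bands = x.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_bands, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, n_bands)); b2 = np.zeros(n_bands)
    losses = []
    for _ in range(steps):
        x_in = x + noise_std * rng.normal(size=x.shape)   # corrupted input (DAE)
        h = np.tanh(x_in @ W1 + b1)                       # encoder
        out = h @ W2 + b2                                 # decoder
        err = out - x                                     # compare with clean input
        losses.append(float(np.mean(err ** 2)))
        # Backpropagation of the mean squared error.
        d_out = 2.0 * err / err.size
        d_h = (d_out @ W2.T) * (1.0 - h ** 2)
        W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (x_in.T @ d_h); b1 -= lr * d_h.sum(axis=0)
    return losses

# Synthetic spectra with low-dimensional structure (assumed data).
rng = np.random.default_rng(1)
spectra = rng.normal(size=(128, 3)) @ rng.normal(size=(3, 20))
spectra /= spectra.std()

ae_losses = train_autoencoder(spectra)                    # plain AE
dae_losses = train_autoencoder(spectra, noise_std=0.2)    # denoising AE
```

After training, the hidden activations serve as the compressed features that a separate classifier would consume.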
To learn high-level representations from data, the work [221] proposed a combination of multi-layer AEs with maximum noise fraction, which reduces the spectral dimensionality of HSI, while a softmax logistic regression classifier is employed for HSIC. The study reported in [222] combined the multi-manifold learning framework proposed by [223] with the Contractive Autoencoder [224] for improved unsupervised HSIC. The work [225] jointly exploited spectral-spatial features of HSI through an unsupervised feature extraction framework composed of a recursive autoencoder (RAE) network. It extracts features from the neighborhood of the target pixel, and weights are assigned based on the spectral similarity between the target and neighboring pixels. A two-stream DNN with a class-specific fusion scheme was introduced in [226], which learns the fusion weights adaptively. One stream, composed of a stacked denoising autoencoder, is used to extract spectral features, and the second stream extracts spatial information using a Convolutional Neural Network (CNN), while the final classification is performed by fusing the class prediction scores obtained from the classification results of both streams.
Another work proposed a hybrid architecture for multi-feature based spectral-spatial HSIC which utilizes PCA for dimensionality reduction, guided filters [227] to obtain spatial information, and sparse AE for high-level feature extraction. The framework proposed in [228] exploited both spectral and spatial information for HSIC by adopting batch-based training of AEs, with features generated by fusing spectral and spatial information via a mean pooling scheme. Another work [229] developed a spectral-spatial HSIC framework by extracting an appropriate spatial resolution of HSI and utilizing stacked sparse AE for high-level feature extraction, followed by Random Forest (RF) for the final classification task.
Similarly, [230] also used stacked sparse AE for various types of representation that is spectral-spatial and multi-fractal features along with other higher-order statistical representations. A combination of SAE and extreme learning machine was proposed in [231] for HSIC, which segments the features of the training set and transform them via SAE, after transformation, feature subsets are rearranged according to the original order of the training set and fed to extreme learning machine-based classifiers, while Q-statistics is used for final classification result. This processing of feature subsets helps to improve variance among base classifiers [231]. Similarly, in a recent work [232] implemented a computationally efficient multi-layer extreme learning machine-based AE which learns the features in three folds, as proposed in [39] for HSIC.
To overcome the issue of high intra-class variability and high inter-class similarity in HSI, [233] developed an SAE-based HSIC framework which can learn compact and discriminative features by imposing a local Fisher discriminant regularization. Similarly, in the latest work [234], a k-sparse denoising AE is spliced with spectral-restricted spatial features that overcome the high intra-class variability of spatial features for HSIC. The study [235] proposed an HSIC architecture that first forms spectral segments of HSI based on a mutual information measure to reduce the computation time during feature extraction via SAE, while spatial information is incorporated by using extended morphological profiles (EMPs), and SVM/RF is used for final classification. Recently, [236] used SAE for the classification of an oil slick on the sea surface by jointly exploiting spectral-spatial features of HSI.

A. Future Directions for AE-based HSIC
In the above section, we have surveyed the recent developments of AE-based techniques for HSIC. Although such frameworks provide powerful predictive performance and show good generalization capabilities, more sophisticated work is still desired. Many of the discussed approaches do not fully exploit the abundant spatial information, so further techniques need to be developed that can fully employ joint spatial and spectral information for HSIC. Moreover, the issue of high intra-class variability and high inter-class similarity in HSI also hinders classification performance. Many of the above-reviewed works have addressed this issue, but further research is required to overcome it. One direction could be to further explore approaches such as pre-training, co-training, and adaptive neural networks for AE-based HSIC frameworks.

VIII. DEEP BELIEF NETWORK (DBN)
Deep Belief Network (DBN) [237] is a hierarchical DNN that learns features from the input in an unsupervised, layer-by-layer manner. The layers in a DBN are built from Restricted Boltzmann Machines (RBMs), each comprising a two-layer architecture in which visible units are connected to hidden units [238], as shown in Figure 9. A detailed overview of RBM can be found in [238]. To extract more comprehensive features from input data, the hidden units of one RBM can be fed to the visible units of another RBM. This type of layer-by-layer architecture builds a DBN, which is trained greedily and can capture deep features from HSI. The architecture of a three-layer DBN is shown in Figure 10. In the literature, several works implemented DBN for HSIC. For instance, [239] used DBN for land cover classification by combining spectral-spatial information and compared it with some other classification approaches. The usual learning process of DBN involves two steps: unsupervised pre-training with unlabeled samples, followed by supervised fine-tuning with the help of labeled samples. However, this training process may result in two problems: first, multiple hidden units may tend to respond similarly [240] due to co-adaptation [241]; second, some neurons may always be dead or always responding, which is linked to the sparsity and selectivity of neuron activations [242]. To mitigate these two problems, [243] introduced a diversified DBN model that regularizes the pre-training and fine-tuning processes by imposing a diversity prior to enhance the DBN's classification accuracy for HSI.
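The greedy, layer-by-layer pre-training described above can be sketched as follows. Binary (Bernoulli) units and one-step contrastive divergence (CD-1) are assumptions, since the text does not state which RBM training rule the cited works use; the class and function names are hypothetical, and supervised fine-tuning is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class RBM:
    """Restricted Boltzmann Machine with Bernoulli units (assumption)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.rng = rng

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0, lr=0.1):
        """One contrastive divergence (CD-1) update on a batch v0."""
        p_h0 = self.hidden_probs(v0)
        h0 = (self.rng.random(p_h0.shape) < p_h0).astype(float)   # sample hidden units
        p_v1 = self.visible_probs(h0)                              # reconstruction
        p_h1 = self.hidden_probs(p_v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
        self.b_v += lr * (v0 - p_v1).mean(axis=0)
        self.b_h += lr * (p_h0 - p_h1).mean(axis=0)

def train_dbn(data, layer_sizes, epochs=20, seed=0):
    """Greedy pre-training: each RBM is trained on the previous RBM's hidden output."""
    rng = np.random.default_rng(seed)
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(layer_input.shape[1], n_hidden, rng)
        for _ in range(epochs):
            rbm.cd1_step(layer_input)
        rbms.append(rbm)
        layer_input = rbm.hidden_probs(layer_input)   # feed hidden units to next RBM
    return rbms, layer_input

# Spectra scaled to [0, 1] (assumed preprocessing for Bernoulli visible units).
spectra = np.random.default_rng(0).random((50, 30))
rbms, deep_features = train_dbn(spectra, layer_sizes=[16, 8])
```

In a complete pipeline, `deep_features` would be passed to a classifier layer and the whole stack fine-tuned with labeled samples.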
To extract efficient texture features for HSIC, the work [244] proposed a DBN-based texture feature enhancement framework that combines band grouping and a sample band selection approach with a guided filter to enhance the texture features, which are then learned by a DBN model, and the final classification results are obtained by a softmax classifier. The work [245] implemented a parallel layers framework consisting of Gaussian-Bernoulli RBMs, which extracts high-level, locally invariant, and nonlinear features from HSI, and a logistic regression layer is used for classification.
To improve the classification accuracy, some works jointly exploit the spectral and spatial information contained in HSI. For instance, [246] introduced a DBN framework with a logistic regression layer and verified that the joint exploitation of spectral-spatial features leads to improved classification accuracy. Similarly, [247] proposed a spectral-spatial graph-based RBM method for HSIC which constructs the spectral-spatial graph through a joint similarity measurement based on spectral and spatial details; an RBM is then trained to extract useful joint spectral-spatial features from HSI, and finally, these features are passed to a DBN and a logistic regression layer for classification.

A. Future directions for DBN-based HSIC
In the preceding section, we have reviewed the latest developments of DBN-based HSIC frameworks. We have observed that, relative to other DNNs, very few works have utilized DBNs for HSIC. Therefore, there is a need to further explore robust DBN-based techniques that can jointly employ spatial and spectral features for HSIC. In addition, another research direction can be the regularization of the pre-training and fine-tuning processes of DBN to efficiently overcome the issue of dead or potentially over-tolerant (always responding) neurons.

IX. RECURRENT NEURAL NETWORK (RNN)
The architecture of the Recurrent Neural Network (RNN), shown in Figure 11, comprises loop connections, where the node activation of the next step depends on the previous step [248]. Therefore, RNNs are capable of learning temporal sequences. RNN models process the spectral information of HSI data as time sequence considering the spectral bands as time steps [249]. There are three basic models of RNN a) Vanilla, b) Long-Short-Term Memory (LSTM) and c) Gated Recurrent Unit (GRU).
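Treating spectral bands as time steps can be illustrated with a minimal vanilla RNN that reads one band value per step and keeps the final hidden state as the pixel representation. The weights are random and all sizes are assumptions; a real model would learn the weights and might use LSTM or GRU cells instead.

```python
import numpy as np

def rnn_encode_spectrum(spectrum, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a pixel's spectrum, one band per time step."""
    h = np.zeros(W_hh.shape[0])
    for value in spectrum:
        # The new state depends on the current band and the previous state.
        h = np.tanh(value * W_xh + W_hh @ h + b_h)
    return h

rng = np.random.default_rng(0)
n_bands, n_hidden, n_classes = 103, 16, 9      # assumed sizes
spectrum = rng.random(n_bands)                  # reflectance of one pixel

W_xh = rng.normal(scale=0.5, size=n_hidden)             # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden)) # hidden-to-hidden weights
b_h = np.zeros(n_hidden)
W_out = rng.normal(scale=0.1, size=(n_classes, n_hidden))

h_final = rnn_encode_spectrum(spectrum, W_xh, W_hh, b_h)
logits = W_out @ h_final                        # class scores for the pixel
```

The sequence length equals the number of bands, which is why long spectra make such models slow to train.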
Vanilla is the simplest RNN model and suffers from information degradation when processing high-dimensional data. LSTM models, composed of two states, overcome this issue by controlling the information flow through three gates: input, forget, and output gates. An LSTM learns the relevant information over time by discarding extraneous information. However, the gate controlling strategy makes the LSTM a considerably complex approach. The GRU variant of LSTM enjoys the simplicity of the Vanilla model and provides high performance similar to LSTM.
The work [70] proposed an RNN-based HSIC framework with a novel activation function (parametric rectified tanh) and GRU, which utilizes the sequential property of HSI to determine the class labels. In [142], a local spatial sequential (LSS) method based RNN framework was introduced, which first extracts low-level features from HSI by using a Gabor filter and differential morphological profiles [143] and then fuses these features to obtain LSS features. These LSS features are further passed to an RNN model to extract high-level features, while a softmax layer is used for final classification.
Keeping in view the usefulness of spatial information for achieving improved classification accuracies, the work [250] proposed a spectral-spatial LSTM-based network that learns spectral and spatial features of HSI by utilizing two separate LSTMs, each followed by a softmax layer for classification, while a decision fusion strategy is implemented to obtain joint spectral-spatial classification results. Similarly, [251] proposed a patch-based RNN with LSTM cells that incorporates multi-temporal and multi-spectral information along with spatial characteristics for land cover classification.
In literature, several works proposed CNN-based hybrid RNN architectures (CRNN) for HSIC. For instance, [175] implemented a convolutional RNN in which the first few CONV layers are employed to extract position invariant middle-level features, and then recurrent layers are used to extract spectral-contextual details for HSIC. Similarly, [252] utilized such a model for semi-supervised HSIC by using pseudo labels. The study [253] suggested an HSIC framework in which CNN is used to extract spatial features from HSI, then these features are passed to a GRU-based fusion network that performs feature level and decision level fusion.
Similarly, Luo et al. [254] exploited both spectral and spatial information contained in HSI by combining CNN with a parallel GRU-based RNN, which simplifies the training of GRU and improves performance. Bidirectional Convolutional LSTM (CLSTM) was proposed in [153] to jointly exploit spectral-spatial features of HSI for classification. The work [255] combined multiscale local spectral-spatial features extracted by a 3D-CNN with a hierarchical RNN which learns the spatial dependencies of local spectral-spatial features at multiple scales. Recurrent 2D-CNN and recurrent 3D-CNN for HSIC were proposed in [256], along with an interesting comparison of these frameworks with their corresponding 2D and 3D-CNN models, which validates the superiority of recurrent CNN. The work [257] integrated CNN with CLSTM, in which a 3D-CNN model is used to capture low-level spectral-spatial features and CLSTM recurrently analyzes this low-level spectral-spatial information. Recently, [70] introduced a cascade RNN for HSIC which consists of two layers of GRU-based RNN: the first layer is used to reduce the redundant spectral bands and the second layer is used to learn the features from HSI. Furthermore, a few convolutional layers are employed to incorporate the rich spatial information contained in HSI.

A. Future directions for RNN-based HSIC
In the above section, we have surveyed the recent developments of RNN-based techniques for HSIC. Although RNN-based HSIC frameworks have attracted considerable attention from the remote sensing community and achieved great success in classification performance, there are still many aspects that need further investigation, for instance, the construction of sequential input data for RNN. Most of the surveyed methods treat each HSI pixel as a sequence, that is, the pixel values from successive spectral bands form a data sequence. However, this makes the RNN's input sequence considerably long, which can lead to overfitting.
Moreover, processing such large data sequences increases the computational time and the learning process becomes slower. Therefore, the use of parallel processing tools needs to be further investigated to achieve good generalization performance of RNN-based HSIC. In addition, approaches like a grouping of spectral bands to decrease the data sequence length and utilization of the entire spectral signature to better discriminate between various classes can further be explored to construct the sequential input of the RNN model. Another interesting future direction may involve the implementation of RNN-based HSIC frameworks in a real multi-temporal HSI context.
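Grouping spectral bands to shorten the input sequence, as suggested above, can be sketched as follows. Grouping adjacent bands into fixed-size chunks and padding the last group by repeating the final band are assumptions; other grouping criteria (for example, correlation-based) are equally possible.

```python
import numpy as np

def group_bands(spectrum, group_size):
    """Reshape a spectrum into a shorter sequence of band groups.

    Each time step then carries `group_size` adjacent bands instead of one.
    The last group is padded by repeating the final band value (assumption).
    """
    n_bands = spectrum.shape[0]
    n_groups = -(-n_bands // group_size)                 # ceiling division
    pad = n_groups * group_size - n_bands
    padded = np.pad(spectrum, (0, pad), mode="edge")
    return padded.reshape(n_groups, group_size)

spectrum = np.linspace(0.0, 1.0, 103)     # toy spectrum with 103 bands
sequence = group_bands(spectrum, group_size=10)
```

With these values the RNN processes 11 time steps of 10 bands each rather than 103 single-band steps.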

X. STRATEGIES FOR LIMITED LABELED SAMPLES
Although DNNs have been successfully exploited for HSIC, they require a considerably large amount of labeled training data. As discussed earlier, however, the collection of labeled HSI is difficult and expensive due to numerous factors that demand either human experts or the exploration of real-time scenarios. The limited availability of labeled training data hinders classification performance. To overcome this issue, many effective strategies have been proposed in the literature. In this section, we will briefly discuss some of these strategies while focusing on active learning algorithms.

A. Data Augmentation
To combat the issue of limited training samples, data augmentation has proven to be an effective tool for HSIC. It generates new samples from the original training samples without introducing additional labeling costs. Data augmentation approaches can be categorized into two main strategies: i) data warping and ii) oversampling [258]. Data warping usually encodes several invariances (translational, size, viewpoint, and/or illumination) by conducting geometric and color-based transformations while preserving the labels, whereas oversampling-based augmentation methods inflate the training data by generating synthetic samples based on the original data distributions. Oversampling techniques include mixture-based instance generation, feature space augmentations [258], and Generative Adversarial Networks (GANs) [259].
Referring to HSIC literature, several data augmentation-based frameworks have been employed to improve the classification performance by avoiding potential overfitting, which is generally caused by the limited availability of training data. For instance, [260] enhanced the training data by using three data augmentation operations (flip, rotate, and translation), and this enhanced data was then exploited to train a CNN for HSIC. The work [261] presented a comprehensive comparison of various extensively utilized HSI data augmentation techniques and proposed a pixel-block pair-based data augmentation that utilized both spectral and spatial information of HSI to synthesize new instances for training a CNN model for HSIC. The work [262] compared the classification performance of a combination of CNN and AL with and without data augmentation techniques and demonstrated that data augmentation leads to higher classification accuracies. Similarly, in another comparison [263], a data augmentation-based CNN exhibited a 10% increase in HSIC accuracy when compared to a PCA-based CNN model.
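The flip and rotate operations mentioned above can be sketched as follows. This is a minimal illustration, not the pipeline of [260]: the function name `augment_patch`, the patch size, and the choice of 90-degree rotations are assumptions made for this example.

```python
import numpy as np

def augment_patch(patch):
    """Return label-preserving geometric variants of one HSI patch.

    patch has shape (height, width, bands). Only the two spatial axes are
    transformed, so every pixel keeps its full spectrum.
    """
    variants = [patch]
    variants.append(np.flip(patch, axis=0))   # vertical flip
    variants.append(np.flip(patch, axis=1))   # horizontal flip
    for k in (1, 2, 3):                        # 90, 180, 270 degree rotations
        variants.append(np.rot90(patch, k=k, axes=(0, 1)))
    return variants

# Hypothetical 5 x 5 patch with 10 spectral bands.
rng = np.random.default_rng(0)
patch = rng.random((5, 5, 10))
augmented = augment_patch(patch)
```

Each variant inherits the class label of the original patch, so one labeled patch yields six training instances here at no extra labeling cost.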
The above-discussed methods utilize offline data augmentation techniques that increase the training data by creating new instances during/before the training process of a model. Recently, a novel data augmentation framework for HSI was proposed in [264] which, rather than inflating the training data, generates the samples at test time; a DNN trained over the original training data, along with a voting scheme, is then used to obtain the final class label. To improve the generalization capability of DNN models, the work [264] also proposed two fast data augmentation techniques for high-quality data synthesis. A similar PCA-based online data augmentation strategy was proposed in [265], which also synthesizes new instances during inference instead of training.
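The idea of augmenting at test time and voting can be illustrated with a small sketch. It is an assumption-laden toy, not the scheme of [264]: `predict_with_tta` and `toy_classifier` are hypothetical names, the set of views is arbitrary, and a plain majority vote stands in for whatever voting rule the cited work uses.

```python
import numpy as np

def predict_with_tta(predict_fn, patch, n_classes):
    """Classify one patch by majority vote over augmented copies of it."""
    views = [
        patch,
        np.flip(patch, axis=0),
        np.flip(patch, axis=1),
        np.rot90(patch, k=2, axes=(0, 1)),
    ]
    votes = np.zeros(n_classes, dtype=int)
    for view in views:
        votes[predict_fn(view)] += 1
    return int(np.argmax(votes)), votes

def toy_classifier(patch):
    # Hypothetical stand-in for a trained DNN: it only looks at the
    # top-left pixel, which makes it sensitive to flips and rotations.
    return int(patch[0, 0].mean() > 0.5)

patch = np.zeros((3, 3, 4))
patch[0, 0] = 1.0  # a single bright pixel in the corner
label, votes = predict_with_tta(toy_classifier, patch, n_classes=2)
```

In this toy case the single prediction on the original patch is class 1, while three of the four views vote for class 0, so the voted label is 0. The training set is never enlarged; only inference cost grows with the number of views.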

B. Semi-Supervised/Unsupervised Learning
Semi-Supervised Learning (SSL) approaches learn data distribution by jointly exploiting both labeled and unlabeled data. These techniques expand the training data by utilizing unlabeled samples along with labeled ones in order to construct a relationship between feature space and class labels. Several SSL-based HSIC frameworks have been proposed in the literature that can mainly be categorized as follows: i) Co-training, ii) Self-training, iii) GANs, iv) Graph-based SSL models and v) Semi-supervised SVM. A recent comprehensive survey on these SSL techniques can be found in [266]. Moreover, another in-depth survey of SSL approaches is also presented in [267].
The SSL-based HSIC techniques are briefly summarized in [268], where authors also made a detailed comparison of these methods. The method presented in [252] used pseudo or cluster-labeled samples to pre-train a CRNN for HSIC and small-sized labeled data is used to fine-tune the network. Similarly, [156] proposed a semi-supervised HSIC framework that exploits PCA and extended morphological attribute profiles to extract pseudo-labeled samples which are fed to a CNN-based deep feature fusion network.
The work [269] proposed a dual strategy co-training approach based on spectral and spatial features of HSI. Similarly, [270] separately pre-trained two SAEs, one using spectral and the other using spatial features of HSI, and fine-tuning is achieved via a co-training approach. [271] proposed a region information-based self-training approach to enhance the training data. A graph-based self-training framework was developed in [272] where initial sampling is achieved through subtractive clustering. Recently, [157] improved the HSIC performance by pseudo-labeling the unlabeled samples through a clustering-based self-training mechanism and regulating the self-training by employing spatial constraints.
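A generic self-training loop with pseudo-labels can be sketched as below. This is an illustrative assumption rather than any of the cited methods: a nearest-centroid classifier stands in for the CNN/SAE models, the confidence threshold of 0.9 is arbitrary, and no spatial constraint is applied.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    # Assumes every class has at least one labeled sample.
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_proba(centroids, X):
    # Softmax over negative squared distances to the class centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def self_train(X_lab, y_lab, X_unlab, n_classes, threshold=0.9, rounds=5):
    """Iteratively move confidently predicted samples into the labeled set."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        centroids = fit_centroids(X_lab, y_lab, n_classes)
        proba = predict_proba(centroids, X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
        X_unlab = X_unlab[~confident]
    return fit_centroids(X_lab, y_lab, n_classes), X_lab, y_lab

# Two well-separated synthetic classes with one labeled sample each.
rng = np.random.default_rng(0)
X_lab = np.array([[0.0, 0.0], [10.0, 10.0]])
y_lab = np.array([0, 1])
X_unlab = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(10.0, 0.5, (20, 2))])
centroids, X_all, y_all = self_train(X_lab, y_lab, X_unlab, n_classes=2)
```

On real HSI the pseudo-labels are far noisier than in this toy setting, which is why the cited works add clustering or spatial constraints to regulate which samples are accepted.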

C. Generative Adversarial Network (GAN)
The GAN, proposed by [273], is comprised of two neural networks: one is known as the generator and the other as the discriminator (Figure 13). GANs can learn to replicate the samples by exploiting the data distribution details. The work [274] proposed a spectral feature-based GAN for SSL-based HSIC. Similarly, [275] proposed a GAN-based spectral-spatial HSIC framework, and [276] developed CNN-based 1D-GAN and 3D-GAN architectures to enhance the classification performance. A customized 1D GAN is used to generate the spectral features in [277], which are further used by a CNN for feature extraction, and then majority voting is performed for HSIC. Very recently, [278] introduced a spatial-spectral multi-class GAN (MSGAN) which utilizes two generators to produce spatial and spectral information with the help of multiple adversarial objectives. To address the data imbalance problem for HSI classification, [279] proposed a new semi-supervised model which combines a GAN with conditional random fields (CRFs).
Similarly, [280] investigated a Caps-TripleGAN model which effectively generates new samples using a 1D structure Triple Generative Adversarial Network (TripleGAN) and classifies the generated HSI samples using the capsule network (CapsNet). The work [281] proposed to utilize a 3D CNN-based generator network and a 3D deep residual network-based discriminator network for HSIC. To learn high-level contextual features, a discriminator model combining a capsule network and convolutional long short-term memory (ConvLSTM) has been proposed in [282] for HSIC.
The work [283] proposed to address the scarcity of training examples by utilizing a GAN model in which the performance of the discriminator is further improved by an auxiliary classifier to produce more structurally coherent virtual training samples. Besides this, to enhance the model performance, [284] proposed a generative adversarial minority oversampling-based technique for addressing the long-standing problem of class-wise data imbalance in HSIC.

D. Transfer Learning
Transfer learning enhances the performance of a model by using prior knowledge of a relevant primary task to perform a secondary task. In other words, information extracted from the relevant source domain is transferred to the target domain to learn unseen/unlabeled data. Therefore, transfer learning can be effectively employed in domains with insufficient or no training data. Based on the availability of labeled training instances, transfer learning frameworks can further be categorized as supervised or unsupervised transfer learning. Generally, both source and target domains are assumed to be related but not exactly similar. However, they may follow different distributions as in the case of HSIC where categories of interest are the same but data in two domains may vary due to different acquisition circumstances.
In DNN-based HSIC, the model learns features in a hierarchical manner, where lower layers usually extract generic features when trained on various images. Therefore, the features learned by these layers can be transferred to learn a new classifier for the target dataset. For instance, [285] pre-trained a two-branch spectral-spatial CNN model with an ample amount of training data from other HSIs and then applied the lower layers of the pre-trained model to the target network for the robust classification of the target HSI. To learn the target-specific features, higher layers of the target network are randomly initialized and the whole network is fine-tuned by utilizing limited labeled instances of the target HSI. Similarly, [286] proposed a suitable method to pre-train and fine-tune a CNN to utilize it for the classification of new HSIs. The study [287] combined data augmentation and transfer learning approaches to combat the shortage of training data in order to improve HSIC performance.
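The layer-transfer step described above (reuse the lower layers, re-initialize the higher ones) can be sketched framework-independently. Everything here is an assumption for illustration: weights are plain NumPy arrays in a dictionary, the layer names and shapes are hypothetical, and the subsequent fine-tuning on the target HSI is not shown.

```python
import numpy as np

def build_target_weights(source, layer_order, n_transfer, n_target_classes, rng):
    """Copy the first n_transfer layers; randomly re-initialize the rest.

    The final layer is resized so that its output matches the number of
    classes in the target scene.
    """
    assert n_transfer < len(layer_order), "the classifier layer must be re-initialized"
    target = {}
    last = layer_order[-1]
    for i, name in enumerate(layer_order):
        w = source[name]
        if i < n_transfer:
            target[name] = w.copy()          # generic lower-layer features
        else:
            shape = (w.shape[0], n_target_classes) if name == last else w.shape
            target[name] = rng.normal(0.0, 0.01, size=shape)
    return target

# Hypothetical source network pre-trained on a 9-class scene.
rng = np.random.default_rng(0)
source = {
    "conv1": rng.normal(size=(8, 3)),
    "conv2": rng.normal(size=(16, 8)),
    "fc": rng.normal(size=(16, 9)),
}
target = build_target_weights(source, ["conv1", "conv2", "fc"],
                              n_transfer=2, n_target_classes=6, rng=rng)
```

How many layers to transfer is a design choice: transferring more layers preserves more source knowledge but leaves less capacity to adapt to target-specific features.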
As discussed before, data in source and target domain may vary in many aspects, for instance, in the case of HSIs, the dimensions of two HSIs may vary due to the acquisition from different sensors. Handling such cross-domain variations and transferring the knowledge between them is known as heterogeneous transfer learning (a detailed survey of such methods can be found in [288]). In HSIC literature, several works have been proposed to bridge the gap for transferring the knowledge between two HSIs, with varying dimensions and/or distributions.
For example, [289] proposed an effective heterogeneous transfer learning-based HSIC framework that works well with both homogeneous and heterogeneous HSIs, and [290] used an iterative re-weighting mechanism-based heterogeneous transfer learning for HSIC. Similarly, a recent work [291] proposed a band selection-based transfer learning approach to pre-train a CNN, which retains the same number of dimensions for various HSIs. Furthermore, [292] proposed an unsupervised transfer learning technique to classify a completely unknown target HSI, and [293] demonstrated that networks trained on natural images can enhance the performance of transfer learning for remote sensing data classification as compared to networks trained from scratch using smaller HSI data.

E. Active Learning
Active Learning (AL) iteratively enhances the predictive performance of a classifier by actively increasing the size of the training data, in each training iteration, by utilizing an unlabeled pool of samples. In each iteration, AL enhances the training dataset by actively selecting the most valuable instances from the pool of unlabeled data, and an oracle (human or machine-based) assigns the true class labels to these instances. Finally, these useful instances are added to the existing training dataset and the classifier is retrained on this new training dataset. The process continues until a stopping criterion is met, which may be the size of the training dataset, the number of iterations, or the desired accuracy score. A general framework of AL is illustrated in Figure 14.
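The loop described above (select, query the oracle, add, retrain, stop) can be sketched as follows. This is a generic illustration under stated assumptions: a nearest-centroid model stands in for the classifier, least confidence is used as the selection score, the oracle is a lookup into held-back true labels, and a fixed number of iterations serves as the stopping criterion.

```python
import numpy as np

def centroid_proba(X_train, y_train, X, n_classes):
    # Stand-in classifier: softmax over negative squared centroid distances.
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in range(n_classes)])
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def active_learning(X_train, y_train, X_pool, oracle, n_classes, batch=2, max_iters=5):
    pool_idx = np.arange(len(X_pool))
    for _ in range(max_iters):                      # stopping criterion
        if len(pool_idx) == 0:
            break
        # "Retraining" happens inside centroid_proba on the current set.
        proba = centroid_proba(X_train, y_train, X_pool[pool_idx], n_classes)
        uncertainty = 1.0 - proba.max(axis=1)       # least-confidence score
        picked = pool_idx[np.argsort(-uncertainty)[:batch]]
        X_train = np.vstack([X_train, X_pool[picked]])
        y_train = np.concatenate([y_train, oracle(picked)])  # query labels
        pool_idx = np.setdiff1d(pool_idx, picked)
    return X_train, y_train, pool_idx

rng = np.random.default_rng(0)
X_pool = np.vstack([rng.normal(0.0, 1.0, (15, 2)), rng.normal(4.0, 1.0, (15, 2))])
y_pool = np.repeat([0, 1], 15)                      # known only to the oracle
X_init = np.array([[0.0, 0.0], [4.0, 4.0]])
y_init = np.array([0, 1])
X_tr, y_tr, remaining = active_learning(
    X_init, y_init, X_pool, lambda idx: y_pool[idx], n_classes=2)
```

Swapping the `uncertainty` line for another score changes the sampling strategy without touching the rest of the loop.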
The most useful/effective samples are selected so that they are both informative and representative of the overall input distribution, in order to improve accuracy. Based on the criteria for adding new instances to the training set, AL frameworks can be designated as either stream-based or pool-based. In stream-based selection, one instance at a time is drawn from an actual set of unlabeled samples and the model decides whether to label it or not based on its usefulness. In the pool-based strategy, samples are queried from a pool/subset of unlabeled data based on ranking scores computed from various measures that evaluate each sample's usefulness. The work [294] found that stream-based selection gives poorer learning rates as compared to pool-based selection, as the former tends to query extra instances. In pool-based selection, it is important to incorporate diversity in the pool of samples in order to avoid redundancy. Generally, the following three aspects are considered while selecting/querying the most valuable samples: heterogeneity behavior, model's performance, and representativeness of samples. A brief introduction of these sampling approaches is given below:
1) Heterogeneity-based selection: These approaches select the samples that are more heterogeneous to the already seen instances with respect to model diversity, classification uncertainty, and contention between a committee of various classifiers. Uncertainty sampling, expected model change, and query-by-committee are examples of heterogeneity-based models.
• Uncertainty Sampling: In this approach, the classifier iteratively tries to query the label of those samples for which it is most uncertain while predicting the label. The selection of new instances is based on ranking scores against a specified threshold, and the instances with scores closest to that threshold are queried for labels. One simple example of such a scheme is applying a probabilistic classifier to a sample in a binary classification scenario and querying its label if the predicted class probability is close to 0.5.
• Query-by-Committee: Such heterogeneity-based approaches perform the sampling process based on the dissimilarities in the predictions of various classifiers trained on the same set of labeled samples. A committee of various classifiers trained on the same set of training data is used to predict the class labels of unlabeled samples, and the samples on which the classifiers disagree most are selected for querying labels. The committee of different classifiers can either be built by using ensemble learning algorithms like Bagging and Boosting [295] or by changing the model parameters [296]. Generally, a small number of diverse classifiers is adequate for constructing a committee [295], [297].
• Expected Model Change: Such a heterogeneity-based approach chooses the instances which result in a significant change from the current model in terms of the gradient of the objective function. Such techniques attempt to query the label for those instances that are considerably different from the current model. These sampling techniques only fit models which follow gradient-based training procedures/optimization.
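Two of the heterogeneity-based scores above can be written down compactly. The sketch below is an illustration with made-up numbers: `least_confident` ranks samples by their top class probability, and `vote_entropy` is one common way (assumed here, not prescribed by the text) to measure committee disagreement.

```python
import numpy as np

def least_confident(proba, n_queries):
    """Indices of the samples whose highest class probability is lowest."""
    return np.argsort(proba.max(axis=1))[:n_queries]

def vote_entropy(committee_votes, n_classes):
    """Disagreement per sample; committee_votes has shape (members, samples)."""
    n_members = committee_votes.shape[0]
    entropy = np.zeros(committee_votes.shape[1])
    for c in range(n_classes):
        frac = (committee_votes == c).sum(axis=0) / n_members
        nonzero = frac > 0
        entropy[nonzero] -= frac[nonzero] * np.log(frac[nonzero])
    return entropy

# Binary example: the second sample has a probability closest to 0.5.
proba = np.array([[0.95, 0.05], [0.52, 0.48], [0.70, 0.30]])
query = least_confident(proba, n_queries=1)

# Four committee members voting on three samples.
votes = np.array([[0, 1, 0],
                  [0, 0, 0],
                  [0, 1, 1],
                  [0, 0, 0]])
disagreement = vote_entropy(votes, n_classes=2)
```

Here the committee is unanimous on the first sample (entropy 0) and split two against two on the second, so the second sample would be queried under both criteria.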

2) Performance-based Selection: Such methods consider the effect of adding queried samples on the model performance. They try to optimize the performance of the model by reducing variance and error. There are two types of performance-based sampling:
• Expected Error Reduction: This approach is related to uncertainty sampling in the following way: uncertainty measures maximize the label uncertainty of the sample to be queried, while expected error reduction reduces the label uncertainty of the queried sample. Referring to the already discussed example of the binary classification problem, the expected error reduction approach would choose the samples with a probability far away from 0.5 in order to reduce the error rate. Such techniques are also known as the greatest certainty models [296].
• Expected Variance Reduction: Reducing the variance of the model is guaranteed to reduce future generalization error [298]. Therefore, expected variance reduction techniques attempt to indirectly reduce the generalization error by minimizing the model variance. Such approaches query the instances that result in the lowest model variance. The Fisher information ratio is a well-known variance minimization framework.
3) Representativeness-based selection: Heterogeneity-based models are prone to include outliers and controversial samples, whereas performance-based approaches implicitly avoid such samples by estimating future errors. Representative sampling tends to query instances that are representative of the overall input distribution and hence avoids outliers and unrepresentative samples. These approaches give a higher weight to dense input regions during the querying process. Density-weighted techniques like information density are examples of representativeness sampling approaches that consider the representativeness of samples along with heterogeneity behavior, and are also known as hybrid models [296].
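A density-weighted score of the kind mentioned above can be sketched as an uncertainty score multiplied by the average similarity of a sample to the rest of the pool. The RBF similarity, the `gamma` and `beta` parameters, and the toy data are assumptions chosen for this illustration.

```python
import numpy as np

def information_density(uncertainty, X_pool, beta=1.0, gamma=1.0):
    """Weight each sample's uncertainty by how dense its neighborhood is."""
    d2 = ((X_pool[:, None, :] - X_pool[None, :, :]) ** 2).sum(axis=2)
    similarity = np.exp(-gamma * d2)        # RBF similarity in (0, 1]
    density = similarity.mean(axis=1)       # average similarity to the pool
    return uncertainty * density ** beta

# Four clustered samples and one distant outlier (last row).
X_pool = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [10.0, 10.0]])
uncertainty = np.array([0.45, 0.40, 0.40, 0.40, 0.50])

plain_choice = int(np.argmax(uncertainty))
weighted_choice = int(np.argmax(information_density(uncertainty, X_pool)))
```

Pure uncertainty sampling picks the outlier (index 4), whereas the density-weighted score prefers the slightly less uncertain sample inside the cluster (index 0), which illustrates how outliers are avoided.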
Recently, AL has been intensively utilized in HSIC. [299] proposed a feature-driven AL framework to define a well-constructed feature space for HSIC. [300] proposed a Random Forest-based semi-supervised AL method that exploits spectral-spatial features to define a query function to select the most informative samples as target candidates for the training set.
Spatial information has been intensively exploited in many AL-based HSIC frameworks. For instance, [301] presented an AL framework that splices together the spectral and spatial features of superpixels. Similarly, [302] considered the neighborhood and superpixel information to enhance the uncertainty of queried samples. In recent work, [303] exploited the attribute profiles to incorporate spatial information in an AL-based HSIC framework.
Batch-mode AL frameworks have been widely employed to accelerate the learning process. Such approaches select a batch of samples, in each iteration, to be queried for a label. Therefore, the diversity of the samples is extremely critical in batch mode AL techniques in order to avoid redundancy. A multi-criteria batch-mode AL method proposed by [304] defines a novel query function based on diversity, uncertainty, and cluster assumption measures. These criteria are defined by exploiting the properties of KNN, SVM, and K-means clustering respectively, and finally, genetic algorithms are used to choose the batch of most effective samples. Similarly, [305] proposed a regularized multi-metric batch-mode AL framework for HSIC that exploits various features of HSI.
A multiview AL (MVAL) framework was proposed in [306] that analyzes the object from various views and measures the informativeness of the sample through multiview intensity-based query criteria. Similarly, [307] also exploited the concept of multiview learning using the Fisher Discriminant Ratio to generate multiple views. In another work, [308] proposed a novel adaptive MVAL framework for HSIC which jointly exploits the spatial and spectral features in each view. Recently, [309] proposed an MVAL technique that utilizes pixel-level, subpixel-level, and superpixel-level details to generate multiple views for HSIC. Moreover, the proposed method exploits joint posterior probability estimation and dissimilarities among multiple views to query the representative samples.
In the HSIC literature, several works have combined the AL and DNN. For instance, [310] joined autoencoder with AL technique and [311] proposed a DBN-based AL framework for HSIC. Similarly, [312] coupled Bayesian CNN with AL paradigm for spectral-spatial HSIC. Recently, [262] proposed a CNN-based AL framework to better exploit the unlabeled samples for HSIC.
Many works integrated AL with transfer learning for HSIC. For example, [313] proposed an AL-based transfer learning framework that extracts the salient samples and exploits high-level features to correlate the source and target domain data. Another work, [314], proposed a Stacked Sparse AE-based Active Transfer Learning technique that jointly utilizes both spectral and spatial features for HSIC. Another work [315] combined domain adaptation and AL methods based on multiple kernels for HSIC.
AL-based HSIC offers some sophisticated frameworks to enhance the generalization capabilities of models. For instance, [35] proposed a fuzziness-based AL method to improve the generalization performance of discriminative and generative classifiers. The method computes the fuzziness of each instance and its distance from the estimated class boundary, and the instances having greater fuzziness values and smaller distances from class boundaries are selected as candidates for the training set. Recently, [316] proposed a non-randomized spectral-spatial AL framework for multiclass HSIC that combines the spatial prior Fuzziness approach with Multinomial Logistic Regression via a Splitting and Augmented Lagrangian classifier. The authors also made a comprehensive comparison of the proposed framework with state-of-the-art sample selection methods along with diverse classifiers.

XI. EXPERIMENTAL EVALUATION
Most research-oriented works published in the literature present a comprehensive experimental evaluation to highlight the pros and cons of the proposed work(s). However, to some extent, these works may have chosen different experimental settings; for instance, the training, validation, and test sets may have the same number or percentage of samples, but the samples themselves may differ because they are normally chosen randomly. Therefore, to make a fair comparison among different works proposed in the literature, one must use the same experimental settings.
These experimental settings include the same samples (the geographical locations should remain the same for all chosen models, not different ones) and the same number of samples selected for each round of training in the cross-validation process. Normally, these samples are chosen randomly; thus, it is highly likely that they differ between models if the models are executed at different times.
The other issue with most of the literature proposed in recent years is the overlap between training and test samples, i.e., training/validation samples are randomly selected (including or excluding the above point) for training and validation, but the entire dataset is passed in the testing phase. This leads to a highly biased model (as the training samples have already been seen by the model) and produces high accuracy. In this work, although the training/test samples are chosen randomly (which is acceptable because all the models have been executed at the same time), the above point has been taken seriously and the intersection among these samples remains empty.
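The two requirements above (identical samples for every model and an empty train/test intersection) can be enforced with a seeded, per-class split. The sketch below is an assumption about how such a split might be implemented, not the exact protocol used for the reported experiments; `disjoint_split` and the toy label vector are hypothetical.

```python
import numpy as np

def disjoint_split(labels, train_fraction, seed):
    """Per-class random split of labeled pixel indices into disjoint sets.

    A fixed seed yields the same split on every run, so all competing
    models can be trained and tested on identical samples.
    """
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_train = max(1, int(round(train_fraction * len(idx))))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return np.sort(np.array(train)), np.sort(np.array(test))

# Hypothetical labels for 60 labeled pixels in three imbalanced classes.
labels = np.repeat([0, 1, 2], [10, 20, 30])
train_idx, test_idx = disjoint_split(labels, train_fraction=0.2, seed=42)
overlap = np.intersect1d(train_idx, test_idx)
```

Evaluating only on `test_idx`, rather than on the whole scene, avoids the optimistic bias described above.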

A. Experimental Datasets
The Indian Pines (IP) dataset was gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) [317] over the Indian Pines test site in North-western Indiana. It contains 224 spectral bands within a wavelength range of 400 to 2500 nm, of which the 24 null and corrupted bands have been removed. The spatial size of the image is 145 × 145 pixels, and it comprises 16 mutually exclusive vegetation classes. The spatial resolution is 20 meters per pixel (MPP). The detailed class description and ground truth maps are presented in Figure 15a. Moreover, the disjoint Training/Test sample maps are presented in Figures 15b and 15c.
The Kennedy Space Center (KSC) dataset was gathered in 1996 by AVIRIS [317], with wavelengths ranging from 400 to 2500 nm. The image has 512 × 614 pixels and 176 spectral bands after removal of some low signal-to-noise ratio (SNR) bands. The KSC dataset comprises 5202 labeled samples, with a total of 13 upland and wetland classes. The detailed class description and ground truth maps are presented in Figure 16a. Moreover, the disjoint Training/Test sample maps are presented in Figures 16b and 16c.
The University of Pavia (UP) dataset was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor during a flight campaign over the university campus at Pavia, Northern Italy [318]. It consists of 610 × 340 pixels with 103 spectral bands in the wavelength range from 430 to 860 nm and 2.5 MPP. It comprises 9 urban land-cover classes. The detailed class description and ground truth maps are presented in Figure 17a. Moreover, the disjoint Training/Test sample maps are presented in Figures 17b and 17c.
The University of Trento (UT) dataset was gathered using the AISA Eagle sensor over the rural regions in the south of Trento, Italy. The HSI contains 63 spectral bands within a wavelength range of 0.42 to 0.99 µm [320]. The scene has 600 × 166 pixels and comprises 6 mutually exclusive vegetation land-cover classes, where the spectral resolution is 9.2 nm and the spatial resolution is 1 meter per pixel (MPP). In addition, the available samples are divided into disjoint training and test samples of 6 classes, and Fig. 19 lists the per-class number of samples for the six different land-covers.
To some extent, all the aforesaid works are based on Convolutional and Recurrent Networks and are evaluated on benchmark HSI datasets, namely IP, PU, KSC, Houston Scene, and the University of Trento. This survey only pays attention to the robustness of all these models while considering the small sample size of training data to classify HSI for joint spatial-spectral classification.
Here we present the experimental results along with a detailed discussion. The obtained accuracies for disjoint training and test samples are shown in Tables III, IV, V, and VI and Figures 20, 21, 22, and 23. All the results shown in the Tables and Figures are obtained using a 10-fold cross-validation process to compute the overall, average, and kappa (κ) accuracies for comparison purposes. For instance, in the case of the Pavia University results, the work [196] has the highest average, overall, and kappa (κ) accuracies, which are 95.51%, 93.95%, and 93.95% respectively, in comparison with the average, overall, and kappa (κ) accuracies for other comparative works; 92.55%, 89.94%,
Fig. 20: Classification Maps obtained by MLR [321], SVM [322], MLP [69], RNN [70], LSTM [323], GRU [324], CNN-1D [218], CNN-2D [325], CNN-3D [326] and MorphCNN [196] on the disjoint train-test dataset for the UP scene.
[Figure: classification maps on the disjoint train-test dataset for the IP scene; the listed methods include [75], MLR [321], SVM [322], MLP [69], RNN [70], LSTM [323], GRU [324], CNN-1D [218], CNN-2D [325], CNN-3D [326], HybridSN [213], and MorphCNN [196].]
Moreover, the overall accuracy for Grapes Untrained is lower than for the other classes due to the reasons mentioned above. In a nutshell, one can say that higher accuracy can be achieved by increasing the number of labeled training samples. Thus, a higher number of labeled training samples can produce better accuracies for all competing methods.
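The overall, average, and kappa (κ) accuracies used in this comparison can all be derived from a confusion matrix. The following sketch uses the standard definitions with a made-up set of predictions; it assumes every class occurs at least once in the ground truth, and it does not reproduce any number from the tables.

```python
import numpy as np

def accuracy_metrics(y_true, y_pred, n_classes):
    """Return overall accuracy, average accuracy, and Cohen's kappa."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)          # rows: truth, columns: prediction
    total = cm.sum()
    overall = np.trace(cm) / total              # fraction of correct samples
    per_class = np.diag(cm) / cm.sum(axis=1)    # recall of each class
    average = per_class.mean()
    # Chance agreement expected from the row and column marginals.
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (overall - expected) / (1.0 - expected)
    return overall, average, kappa

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])
oa, aa, kappa = accuracy_metrics(y_true, y_pred, n_classes=2)
```

In this toy case the overall accuracy is 0.7, the average accuracy is the mean of the per-class recalls 0.75 and 2/3, and kappa is 0.4 because half of the agreement is expected by chance. Average accuracy and kappa are the more informative measures when classes are imbalanced, as is typical for HSI scenes.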
Moreover, one can conclude that the AE-based models do not perform well as compared to the other models. Although the unsupervised methods do not require the samples to be labeled, these methods might learn nothing if there are no constraints. Moreover, AE has a symmetric architecture that leads to an explosion of training parameters, which increases the difficulty of training. The works [329] and [330] overcome the above-mentioned issues; however, the work [228] does not adopt the greedy layer-wise approach and thus produces the worst results, so there is room for improvement in such methods. In a nutshell, the classification results based on CNN are far better than those of AE-based methods when considering the limited availability of labeled training samples. Although the AEs can learn the internal structure of the unlabeled data, the final feature representation might not have task-driven characteristics, which might be the reason for not performing well as compared to the supervised learning models. Moreover, AL and/or SL benefit from the selection of the most important samples for training, which enables the model to focus more attention on indistinguishable samples for HSIC.
[Figure: classification maps on the disjoint train-test dataset for the KSC scene; the listed methods include [75], MLR [321], SVM [322], MLP [69], RNN [70], LSTM [323], GRU [324], CNN-1D [218], CNN-2D [325], CNN-3D [326], HybridSN [213], and MorphCNN [196].]
[Figure: classification maps on the disjoint train-test dataset for the KSC scene; the listed methods include [321], SVM [322], MLP [69], RNN [70], LSTM [323], GRU [324], CNN-1D [218], CNN-2D [325], CNN-3D [326] and MorphCNN [196].]
FSL, in contrast, benefits from exploring the relationship between samples to find a discriminative decision boundary for HSIC. TL makes good use of the similarity among different HSIs to reduce the quantity of data required for training, and also reduces the number of trainable parameters while boosting the model's robustness. DA operates on the raw data (i.e., processing the HSI without extracting/learning the features) and generates more samples, which brings diversity to the training set.

C. Experiments with Convolutional Feature Extractors
This section revisits several deep Hyperspectral feature extraction processes, i.e., a traditional convolutional process and a gradient centralized convolutional process. In this hierarchy, we have conducted several experiments using several state-of-the-art works published in recent years. This experiment is specifically designed to check the performance of the convolutional process rather than testing the model's performance. The baseline models apply convolutional feature extractors, which include a 2D convolutional neural network for HSI classification (Conv2D) introduced by Makantasis et al. [325] and the 3D convolutional approach for remote sensing image classification (Conv3D) proposed by Hamida et al. [326] (a traditional 3D convolutional feature extractor). Recently, Roy et al. introduced generalized gradient centralized 2D convolution (G2C-Conv2D) [216] and generalized gradient centralized 3D convolution (G2C-Conv3D) [216] to extract the fine-grained spectral-spatial feature representation. The G2C-Conv3D operation is designed by using a weighted combination of the vanilla and gradient centralized 3D convolutions (GC-Conv3D) to extract both the intensity-level semantic information and gradient-level information from the HSIs.
[Figure: classification maps on the disjoint train-test dataset for the UT scene; the listed methods include [321], SVM [322], MLP [69], RNN [70], LSTM [323], GRU [324], CNN-1D [218], CNN-2D [325], CNN-3D [326] and MorphCNN [196].]
All the aforementioned convolutional feature extractors have been evaluated on 5 different Hyperspectral datasets, namely the IP, PU, Trento, UH, and KSC datasets. The experimental results are illustrated in Table VIII. From these results, one can conclude that the G2C-Conv3D convolutional process outperformed Conv2D and Conv3D, followed by G2C-Conv2D. A similar trend has been observed for all datasets except the Trento dataset, on which the 3D convolutional process performed slightly better than the traditional Conv2D and G2C-Conv2D, respectively. The accuracy difference is not that high as compared to the G2C-Conv3D for other datasets. Most importantly, the G2C-Conv3D convolution operation is simple to implement and can easily be plugged into existing CNNs to boost both the robustness and classification performance.

XII. CONCLUSION AND FUTURE DIRECTIONS
The rich information contained in HSI data is a compelling factor that motivates the utilization of HSI technology in real-world applications. Moreover, advances in machine learning methods strengthen the deployment potential of such technologies. In this work, we surveyed recent developments of Hyperspectral Image Classification (HSIC) using state-of-the-art Deep Neural Networks (for instance, Auto-encoder (AE), Deep Belief Network (DBN), Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), Transfer Learning (TL), Few-shot Learning (FSL), Active/Self Learning (AL/SL), and Data Augmentation (DA)) in a variety of learning schemes (specifically, supervised, semi-supervised, and unsupervised learning). In addition, we also analyzed the strategies to overcome the challenges of limited availability of training data, such as Data Augmentation, Few-shot Learning (FSL), Transfer Learning, and Active Learning. According to the methodologies discussed above, we selected some of the representative works to conduct the experiments on benchmark HSI datasets.
Although the current HSIC techniques reflect rapid and remarkable progress in the sophistication of the task, further developments are still required to improve the generalization capabilities. The main issue of deep neural network-based HSIC is the lack of labeled data. HSI data is notorious for the limited availability of labeled data, whereas deep neural networks demand a sufficiently large amount of labeled training data. Section X discussed some widely used strategies to combat the aforesaid issue, but significant improvements are still needed to efficiently utilize limited available training data. One direction to solve this problem could be to explore the integration of various learning strategies discussed in Section X to exploit their joint benefits. Another way is to exploit few-shot or K-shot learning approaches that can accurately predict the class labels with only a few labeled samples. Moreover, there is a need to focus on the joint exploitation of spectral-spatial features of HSI to complement classification accuracies achieved from the aforementioned HSIC frameworks. Another future direction for HSIC is computationally efficient architectures. The issue of the high computational complexity of deep neural networks is of paramount importance, and it is crucial to implement parallel HSIC architectures to speed up the processing of deep neural networks to meet the computational requirements of time-critical HSI applications. In this direction, high-performance computing platforms and specialized hardware modules like graphical processing units (GPUs) and field-programmable gate arrays (FPGAs) can be used to implement the parallel HSIC frameworks. Hence, a new HSIC framework that assimilates the aforesaid aspects should appropriately utilize the limited training samples while considering joint spectral-spatial features of HSI and maintaining a low computational burden.
Manuel Mazzara is a professor of Computer Science at Innopolis University (Russia) with a research background in software engineering, service-oriented architectures and programming, concurrency theory, formal methods, software verification and Artificial Intelligence. Manuel received a PhD in computing science from the University of Bologna and cooperated with European and US industry, plus governmental and inter-governmental organizations such as the United Nations, always at the edge between science and software production.
The work conducted by Manuel and his team in recent years focuses on the development of theories, methods, tools and programs covering the two major aspects of Software Engineering and Artificial Intelligence: the process side, describing how we develop software, and the product side, describing the results of this process.
Salvatore Distefano is an Associate Professor at the University of Messina (Italy). He has authored or co-authored more than 250 scientific papers and contributions to international journals, conferences, and books. As a visiting scholar and professor, he has collaborated with top scientists at universities and research centers such as UMass Dartmouth, UCLA, Duke, Innopolis, and Kazan Federal University.
He has taken part in several national and international projects, such as Reservoir, Vision (EU FP7), SM-SCOM (EU FP7 ERC Advanced Grant), Beacon, and IoT-Open.EU (EU H2020). He is a member of international conference committees and serves on the editorial boards of IEEE Transactions on Dependable and Secure Computing, Journal of Cloud Computing, and International Journal of Big Data. His main research interests include non-Markovian modeling; Quality of Service/Experience; Parallel and Distributed Computing, Grid, Cloud, Autonomic, Volunteer, Crowd, Edge, Fog Computing; Internet of Things; Cyber-Physical Social Systems; Smart Cities; Intelligent Transportation Systems; Big Data, Stream Processing; Software-Defined and virtualized ecosystems; Hyperspectral Imaging; Machine Learning. During his research activity, he has contributed to the development of several tools such as WebSPN, ArgoPerformance, GS3 and Stack4Things. He is also one of the co-founders of the SmartMe.io start-up, a spin-off of the University of Messina established in 2017.