Configurable 2D–3D CNNs Accelerator for FPGA-Based Hyperspectral Imagery Classification

Convolutional neural networks (CNNs) are used efficiently for the classification of hyperspectral imagery (HSI). Both 3-D CNNs and hybrid 2D–3D CNNs perform better because they fully extract joint spatial-spectral features. Most previous acceleration research on field-programmable gate arrays (FPGAs) has concentrated on 2-D CNNs, and little of it addresses 3-D CNN acceleration. More importantly, the CNN models for HSI classification have unique computation and memory characteristics, different from those for object detection, so conventional accelerating approaches may not be applicable. To address this problem, we propose a dynamically configurable accelerator architecture suitable for both 2-D and 3-D CNNs, ensuring fast development for various networks. The multiple nested loops are delicately optimized and the convolutional layers are fully parameterized, making it easy to scale up the network. Furthermore, we develop a parallelism-oriented memory pattern and data access strategy to optimize the data path and the local buffer. Finally, we implement the proposed architecture for 2-D, 3-D, and hybrid CNNs, validating its effectiveness and reconfigurability. To demonstrate its extensibility, we also prototype the accelerator on two FPGA platforms. We achieve an average/maximum throughput of 18.2/166.18 giga operations per second (GOPS) for HybridSN. To compare with previous FPGA accelerators, a typical baseline 3-D model is also implemented with an average/maximum throughput of 160.25/225.87 GOPS; its performance efficiency reaches up to 1.23, with a 10.7% power efficiency gain.

I. INTRODUCTION

The conventional classification of hyperspectral data relies on hand-crafted features, which cannot tackle the high dimensionality and heterogeneity of HSI. Recent advances in deep learning have been enriching the solutions for HSI classification tasks [12], [13], [14], [15]. In general, the convolution operations widely employed in existing convolutional neural network (CNN) models for HSI classification can be divided into three categories: 1-D, 2-D, and 3-D CNN.
The 1-D CNN is a spectral-based classification approach, which extracts deep features along the radiometric dimension of the pixel vector. The 2-D CNN mainly extracts spatial information through the convolutional layers, capturing few of the radiometric spectral attributes. Theoretically, the 3-D CNN works well in extracting the spatial and spectral features simultaneously; however, it is hard to achieve the best classification performance when textures are similar over multiple spectral bands, and its high computation complexity cannot be ignored. The collaboration of 2-D and 3-D CNNs not only presents great advances in extracting both spectral and spatial features, but also achieves a valuable tradeoff between high accuracy and low computation cost. Roy et al. [18] proposed a classification model named HybridSN, which mainly utilizes 3-D CNN for joint spatial-spectral features and 2-D CNN for deeper spatial features subsequently. The results showed good performance and adaptability for hyperspectral image classification in different scenes.
Meanwhile, some time-critical applications urgently need immediate analysis and real-time processing, such as disaster and damage control, surveillance, precision farming [19], health care [20], and pharmaceuticals [21]. In recent years, graphics processing units (GPUs) have grown into one of the most efficient tools for deep learning processing. However, their high energy consumption cannot be neglected on embedded platforms. Compared with GPUs, field-programmable gate arrays (FPGAs) offer not only low power consumption but also high flexibility for dedicated hardware design [22], [23]. FPGAs also follow the rapid evolution of algorithms and sensors better than application-specific integrated circuits.
Although reconfigurable computing for hyperspectral data has attracted much research attention, most of the existing work focuses on applications like unmixing [24], [25], dimensionality reduction [26], [27], and target detection [28], [29]. To the best of our knowledge, studies on real-time hyperspectral image classification based on FPGAs remain limited. On the other hand, numerous studies have been carried out on object detection for traditional RGB images, such as VGG [30] and YOLO [31]. Since their framework and data pattern are quite different from those of HSI classification, this motivates us to realize real-time hyperspectral image classification with energy-efficient FPGAs as hardware accelerators.
One of the challenges for real-time classification of HSI is accelerating the CNN framework. Thanks to the great success of computer vision, abundant FPGA studies have been undertaken to accelerate 2-D CNNs [32], [33], [34], while those accelerating 3-D CNNs remain limited. Previous works have demonstrated that, despite the similar computation patterns of 2-D and 3-D CNNs, a 3-D CNN is more computationally complex and demands higher memory bandwidth. Therefore, a 3-D CNN accelerator cannot directly employ the existing approaches for 2-D CNNs, and requires more intensive and dedicated design efforts.
Another problem is that designing a dedicated accelerator to match a new version of a CNN-based HSI classification framework, or a new resource-constrained FPGA platform, is time-consuming, especially for complicated 2-D and 3-D CNNs. A parameterizable, algorithm-oriented configurable hardware accelerator is an excellent solution, taking full advantage of the reconfigurability of FPGAs.
Considering the impressive enhancement of HSI classification from the collaboration of 2-D and 3-D CNNs, as well as the need to implement on FPGAs in a short period of time, we design a configurable acceleration architecture that can be dynamically configured as a 2-D, 3-D, or hybrid framework. Fig. 1 shows a typical implementation of the architecture. Relying on the similar computation patterns of 2-D and 3-D CNNs, we build the processing engine (PE) with a dedicated data loader to satisfy the computation demands of both, rather than building separate accelerators for each.
Based on the multiple nested loops in the computation of CNNs, we design flexible configurable parallelism, in which the size of the PE array can be changed through the parallelization factors. The PE is designed at the register-transfer level (RTL), rather than with high-level synthesis (HLS). The fine-grained parallelism architecture not only enables efficient hardware resource utilization, but also enhances the performance of the accelerator. In addition to computation, high memory bandwidth is also essential to high performance. Matching the parallel processing, a data-path optimization is conducted by improving the memory pattern and data access, through which the local buffer demand is reduced and the data access latency can be hidden by prefetching.
As a case study, three well-known HSI classification models are implemented, including a 2-D CNN model, a 3-D CNN model, and HybridSN, with multiple datasets. The networks are retrained and quantized, with the weights and activations represented in 8-bit fixed point. Meanwhile, we prototype the accelerator on two FPGA platforms: AXU15EG and Zedboard. Running HybridSN, the solution achieves an average/maximum throughput of 18.2/166.18 giga operations per second (GOPS). For the typical 3-D CNN test case C3D, the average/maximum throughput is 160.25/225.87 GOPS, and the performance efficiency reaches up to 1.23 with a 10.7% power efficiency enhancement. Compared with previous dedicated solutions, the proposed architecture is significantly more competitive and easy to design.
The main contributions of our work can be summarized as follows.
1) We propose a dynamically configurable accelerator architecture suitable for both 2-D and 3-D CNNs on FPGAs, providing good applicability to multiple CNN algorithms for HSI classification. It also shows good extensibility to FPGAs with different resource constraints, supporting the full utilization of resources.
2) A parallelism-oriented memory pattern and data access strategy is designed, aimed at the complicated data patterns of feature maps and weights in 3-D CNNs for HSI classification. The weights are reused to minimize the local buffer, eliminating repeated data accesses to external memory.
3) A parameterized parallelism strategy is designed based on multiloop optimization methodologies, which enables the parallelization degree to be configured flexibly. Thus, the convolutional network can be rebuilt easily and the DSPs are no longer the limiting resource in hardware implementation.
4) The proposed architecture can be easily extended to widely used CNN frameworks for HSI classification: 2-D CNN, 3-D CNN, and HybridSN. Experimental results also show its high scalability and performance on two FPGA platforms under low resource utilization.
The rest of this article is organized as follows. Section II introduces the background of CNN networks for HSI classification and FPGA-based acceleration. Section III describes the proposed configurable accelerator architecture. Section IV presents the experimental results for the implementation of CNN networks, the performance on different FPGAs, and the comparison with state-of-the-art methods. Section V discusses the findings, and Section VI concludes this article.

II. BACKGROUND

A. CNN Networks for HSI Classification
The classification of hyperspectral imaging is one of the remarkable techniques of hyperspectral imagery (HSI) analysis, with a wide range of applications [8]. The aim is to classify each pixel into a predefined number of classes according to its spectral and spatial attributes. The spatial properties are associated with the neighboring pixels, and the spectral characteristics are derived from the reflected/emitted spectra at every pixel in multiple narrow spectral bands.
In the early stages, conventional classification methods mainly relied on spectral information. First, handcrafted features are extracted from the high-dimensional hyperspectral data; the obtained features are then sent to a classifier, such as a support vector machine (SVM) [9], k-nearest neighbor [10], or random forest (RF) [11].
More recently, deep learning approaches have emerged with quite promising classification results, including 1-D, 2-D, and 3-D CNNs. The 1-D CNN is a spectral-based classification approach, which extracts deep features along the radiometric dimension of the pixel vector. Hu et al. [12], for the first time, directly employed a multilayer CNN for the classification of HSI using only the spectral information.
In addition to spectral information, spatial information is also an important resource in hyperspectral images. Yan et al. [13] confirmed that introducing spatial information into the classifier significantly improves classification compared with using spectral information alone. Makantasis et al. [14] designed a classification model based on a 2-D CNN, which extracts spatial information mainly through the convolutional (CONV) layers. Their experimental results showed excellent classification performance compared with the SVM-based classification method.
Considering the rich representation capability of the spectral and spatial information of the hyperspectral image, HSI classification methods that extract both spectral and spatial features for the classifier have attracted more attention [16]. Regarding the hyperspectral image as a 3-D hypercube, the 3-D CNN is conceptually the most straightforward spectral-spatial classification approach, in which the joint spatial-spectral information can be utilized simultaneously. Hamida et al. [17] introduced a 3-D CNN for HSI classification, relying on the joint spectral-spatial properties without other preprocessing.
In summary, the 2-D CNN alone extracts weakly discriminative feature maps from spectral information. The 3-D CNN works well in extracting the spatial and spectral features simultaneously, but its computation complexity increases sharply. Taking into consideration both classification accuracy and computation cost, hybrid 2D-3D CNN frameworks have been proposed [18], presenting good performance and adaptability for hyperspectral image classification in different scenes.
The methodology for HSI classification based on 2D/3D CNNs is described as follows. Let the hyperspectral data cube be denoted by I ∈ R^{M×N×D}, where M and N are the width and height of the image, and D represents the number of spectral bands. To align with the specific nature of CNNs, the HSI data cube is divided into 3-D patches, each of which contains the spectral and spatial information for a specific pixel. The patches extracted from I are represented as P_{x,y} ∈ R^{S×S×D}, centered at the spatial location (x, y) and covering an S × S spatial extent over D spectral bands. The goal then becomes predicting the true label of each pixel p_{x,y} by feeding patch P_{x,y} into a CNN, which hierarchically builds high-level features and then executes the classification task.
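As a concrete illustration of this patchwise formulation, the following sketch extracts the S × S × D patches from a hypercube with NumPy; the zero-padding at the image borders is our own assumption, chosen so that every pixel yields a full-size patch.

```python
import numpy as np

def extract_patches(cube, S):
    """Split an HSI cube I (M x N x D) into one S x S x D patch per pixel.

    A minimal sketch of the patchwise formulation above (odd S assumed);
    zero-padding at the borders keeps every patch the same size.
    """
    M, N, D = cube.shape
    pad = S // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    patches = np.empty((M, N, S, S, D), dtype=cube.dtype)
    for x in range(M):
        for y in range(N):
            patches[x, y] = padded[x:x + S, y:y + S, :]
    return patches
```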
In 2-D CNN, the patch data are convolved with 2-D kernels, which stride over the input data to cover the full spatial dimension.
The operation is formulated as

z^{l,r}_{x,y} = f( b^{l,r} + \sum_{m} \sum_{i=0}^{I_l - 1} \sum_{j=0}^{J_l - 1} w^{l,r,m}_{i,j} \, z^{l-1,m}_{x+i,\, y+j} )    (1)

where z^{l,r}_{x,y} is the activation value at position (x, y) in the rth feature map (FM) of layer l, f(·) is the activation function, b^{l,r} is the bias parameter, and w^{l,r,m}_{i,j} is the weight parameter for the rth FM in the lth layer. The index m runs over the feature maps of layer (l − 1), and I_l and J_l are the width and height of the kernel.
The convolution operations of the 3-D CNN are very similar to those of the 2-D CNN. The main difference is that 3-D kernels are applied over multiple contiguous bands to capture the spectral information.
The 3-D convolution operation is formulated as

z^{l,r}_{x,y,z} = f( b^{l,r} + \sum_{m} \sum_{i=0}^{I_l - 1} \sum_{j=0}^{J_l - 1} \sum_{k=0}^{K_l - 1} w^{l,r,m}_{i,j,k} \, z^{l-1,m}_{x+i,\, y+j,\, z+k} )    (2)

where K_l is the depth of the 3-D kernel along the spectral dimension and k indexes that depth in the lth layer; the other parameters are as in (1). Furthermore, for HSI analysis, the redundancy from interband correlation is proved to be very high. The hundreds of spectral bands put pressure on the network model and consume a lot of computing resources. Traditional principal component analysis (PCA) is applied along the spectral bands in many studies [41], [42] to remove the spectral redundancy while maintaining the spatial information. The number of reduced spectral bands after PCA has been proved to affect the classification accuracy; it is recommended to be 15 for Pavia University (PU) and 30 for Indian Pines (IP), respectively [43].
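The PCA preprocessing can be sketched as follows; the helper name and the use of scikit-learn are our illustrative choices, not the paper's implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(cube, n_components):
    """Apply PCA along the spectral axis of an M x N x D cube,
    e.g., n_components = 15 for PU or 30 for IP as recommended in [43]."""
    M, N, D = cube.shape
    flat = cube.reshape(-1, D)                       # one spectrum per pixel
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(M, N, n_components)
```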

B. In-Depth Analysis of CNNs
In practice, the recommended dimensions of the 3-D patches differ across frameworks and even across datasets. As mentioned above, as the input data unit of the CNN, the dimensions of the patches directly affect the cost of calculation and memory, which has a significant impact on hardware acceleration. Therefore, as shown in Tables I-III, we summarize the characteristics of the most widely used supervised models: 2-D CNN [14], 3-D CNN [17], and HybridSN [18]. For the 2-D CNN model, the patch size is set to 5, and the numbers of principal components are set to 10 for PU and 30 for IP. In HybridSN, the numbers of principal components are recommended to be 15 for PU and 30 for IP, while the patch size is set to a larger value of 25 to achieve better performance. The 3-D CNN model is fed with the original data, without PCA preprocessing, and its patch size is also set to 5.
TABLE III LAYERWISE ANALYSIS OF HYBRIDSN

From Tables I-III, four observations can be drawn. First, previous studies have suggested that over 90% of the total computation is occupied by convolutional operations, and a similar conclusion can be drawn for the 3-D CNN model and HybridSN: their CONV layers are computationally intensive and occupy over 98% of the total computation. Second, the memory occupation of the FC layers differs across the three models: it is less than 18% in the 2-D and 3-D CNN models, but over 93% in HybridSN. It is noteworthy that the amount of weights and computation in the first FC layer is much larger, since the feature maps are flattened there, which is especially conspicuous in HybridSN. In the 3-D CNN model, however, the FC weights occupy only 12%, because the features are fully extracted after six CONV layers. Third, the processing for HSI classification is carried out patch by patch with a small patch size (e.g., 5 × 5 × 30, 25 × 25 × 30), so the amount of intermediate data transferred between layers is equivalent to or even less than the weights. This phenomenon differs from object detection and is significant for the design of memory access and data reuse. Fourth, though it has fewer CONV layers, HybridSN has a larger scale (20.5 MB weights, 495.4 MOP/patch) than the 2-D CNN model (1.2 MB weights, ∼1 MOP/patch) and the 3-D CNN model (∼120 KB weights, 4.2 MOP/patch), mainly due to the larger patch size (25 × 25 × 30). One direct consequence is the large amount of weights in the first FC layer (18.9 MB), whereas the rest amount to only 1.55 MB.
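The layerwise figures above can be reproduced with simple bookkeeping of MACs and weights; the helpers below are a sketch of that arithmetic (counting one MAC as two operations and ignoring bias terms), with names of our own choosing.

```python
def conv3d_stats(out_shape, kernel, in_ch, out_ch):
    """Operation and weight counts of one 3-D CONV layer.

    out_shape = (D_o, H_o, W_o) of one output channel; kernel = (kd, kh, kw).
    A 2-D layer is the special case kd = 1.
    """
    kd, kh, kw = kernel
    macs = (out_shape[0] * out_shape[1] * out_shape[2]
            * kd * kh * kw * in_ch * out_ch)
    weights = kd * kh * kw * in_ch * out_ch
    return 2 * macs, weights          # operations, weight count

def fc_stats(in_features, out_features):
    """Operation and weight counts of a fully connected layer."""
    return 2 * in_features * out_features, in_features * out_features
```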

C. CNN Acceleration on FPGAs
The classification accuracy of CNNs is enhanced through comprehensive assessment of the feature maps, at the cost of complex computation. In contrast with central processing units (CPUs) or GPUs, accelerating CNNs with FPGAs has distinct advantages in flexibility for customized hardware design, due to the various parallel logic resources and distributed on-chip buffers. FPGAs integrate a number of hard-wired digital signal processing (DSP) blocks suitable for multiply-accumulate (MAC) operations and a collection of on-chip memories, which reduce the data accesses to off-chip memory [44].
CNN accelerator designs generally focus on two techniques: the computing engine and the memory system. Since CNNs are highly computation-intensive, powerful computing engines are essential for acceleration [45]. One of the most common approaches is to exploit the parallelism of the MAC operations in the convolutions. Since the convolutions of each layer can be regarded as the computation of multiple nested loops, the loops can be dedicatedly designed as sources of parallelism. Depending on the convolutional layers' characteristics, effective loop optimization techniques can be applied, including loop unrolling, loop tiling, and loop interchange. The sources of parallelism can therefore be exploited in association with the size of the feature maps, the channels of the convolution kernels, and the on-chip resources of the FPGA. The inner architecture of the parallel cores can be further defined, such as the interconnections among MACs and registers and the temporal task scheduling.
As summarized in Section II-B, the amount of weights is comparable to or even larger than that of the intermediate data, so proper memory and access optimization is significant for overall performance. Typical FPGA-based CNN accelerators [35], [36] introduce several levels of memory hierarchy to minimize the latency between on-chip and off-chip memory. The loop optimization strategy and dedicated on-chip buffers are also significant in reducing the communication cost.
In recent years, more studies have been carried out to accelerate 3-D CNNs. Fan et al. [37] proposed a three-dimensional residual module and a hardware architecture accelerating different types of convolutional computations, based on algorithm-hardware codesign. However, the performance efficiency is only 1.02 with DSP utilization up to 90%. Shen et al. [38] proposed a uniform template-based architecture to accelerate 2-D and 3-D CNNs, based on the Winograd algorithm. Although the Winograd algorithm can reduce the multiplications in convolution computation significantly, it is only applicable when the kernel stride is 1, and the transformation matrix is very sensitive to the kernel size, directly affecting the performance of the accelerator. Liu et al. [39] accelerated 2-D and 3-D CNNs with a uniform architecture, which maps the convolution operation to matrix multiplication and uses a two-dimensional MAC array to improve efficiency. However, compared with RTL design, HLS design cannot control the hardware implementation as delicately, leaving room for further performance improvement.

III. PROPOSED ARCHITECTURE
To build a generic CNN accelerator architecture with good applicability to various HSI classification models, close attention should be paid to the network characteristics of typical HSI classification models, as summarized in Section II-B. Several challenges exist. First, both 2-D and 3-D CNNs, possibly with different kernel sizes, should be supported. Thus, we use traditional spatial convolution rather than transformation-based algorithms; in this way, the architecture can adapt well by adjusting only a few configuration parameters. Second, another important feature of an accelerator architecture is easy, fast migration and efficient use of resources across platforms. Based on the multiloop structure and the loop optimization method, this is achieved by flexible configuration of the parallelization degree. Furthermore, taking into account the complexity and the repeated data accesses of convolution computing, especially in 3-D CNNs, a data reuse opportunity is identified in the weight data, and the storage scheme of feature maps and weights is designed delicately to improve memory access efficiency and reduce the bandwidth demand.

A. Overview of Configurable Accelerator Architecture
Fig. 2 presents an overview of the proposed accelerator architecture on FPGAs. The whole architecture consists of two parts: the programmable logic (PL) and the processing system (PS). The PS consists of one or multiple general-purpose processors, such as ARM cores. All parameter configuration, data access, and the inference procedure are controlled by the PS, where the PCA preprocessing can also be performed. The advanced eXtensible interface is the high-speed communication bridge between PL and PS.
The main CNN acceleration is realized on the PL side, relying on the abundant PL resources for parallel acceleration. The network is first retrained, and the weights and activations are quantized to 8-bit fixed point. At system initialization, all the input data, including the input feature maps and weights, are stored in external memory, then transferred to the BRAM in the FPGA. To reduce the memory and bandwidth demand and simplify data prefetching, the storage pattern of feature maps and weights is optimized. A circular buffer for prefetching weights and FM data is introduced to overlap data computation and access. The data loader transfers the feature maps and weights to the PEs following a delicate load scheme, supporting the different CNN computation patterns, kernel sizes, and kernel strides. The output feature maps are written back to the FM buffer.
Furthermore, the computation of FC layers, which is a dense matrix-vector multiplication, can be represented as convolution computations in both 2-D and 3-D CNN networks. Thus, our proposed design is a good fit for both the CONV and FC layers. The pooling layers are also supported in the architecture and can be optionally bypassed in the network configuration.
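A small numerical check makes this equivalence concrete: a fully connected layer over a flattened C × H × W feature map computes exactly a valid convolution whose kernel spans the whole map. The shapes below are illustrative, not taken from the paper.

```python
import numpy as np

C, H, W, OUT = 4, 5, 5, 8
fmap = np.random.randn(C, H, W).astype(np.float32)
weights = np.random.randn(OUT, C, H, W).astype(np.float32)

# Fully connected layer: flatten the feature map and multiply.
fc_out = weights.reshape(OUT, -1) @ fmap.reshape(-1)

# Convolution with a kernel covering the whole feature map (no padding):
# each output channel is a single dot product, i.e., the same computation.
conv_out = np.array([(weights[o] * fmap).sum() for o in range(OUT)])

assert np.allclose(fc_out, conv_out, atol=1e-4)
```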

B. Parameterized Parallelism Strategy
A 2-D CNN can be regarded as a special form of 3-D CNN with a convolution kernel depth of 1; hence, we discuss the loop operation and parallelism design using 3-D CNN as an example. Fig. 3(a) presents the original loop computation for 3-D CNN, composed of eight nested loops. To support the efficient execution of CNNs on a hardware platform with limited resources, the loop computation needs to be optimized with techniques like loop unrolling, loop tiling, and loop interchange, as proposed in [46], [47], and [48].
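For reference, the eight-loop computation of Fig. 3(a) can be written out directly; the loop labels in the comments reflect our reading of the figure, and stride 1 without padding is assumed.

```python
import numpy as np

def conv3d_reference(ifm, w):
    """Eight-loop 3-D convolution in the spirit of Fig. 3(a).

    ifm: (C, D, H, W) input feature maps; w: (N, C, KD, KH, KW) kernels.
    """
    C, D, H, W = ifm.shape
    N, _, KD, KH, KW = w.shape
    ofm = np.zeros((N, D - KD + 1, H - KH + 1, W - KW + 1), dtype=ifm.dtype)
    for n in range(N):                                  # L1: output channels
        for z in range(ofm.shape[1]):                   # L2-L4: output volume
            for y in range(ofm.shape[2]):
                for x in range(ofm.shape[3]):
                    for c in range(C):                  # L5: kernel channels
                        for kd in range(KD):            # L6: kernel depth
                            for kh in range(KH):        # L7-L8: kernel plane
                                for kw in range(KW):
                                    ofm[n, z, y, x] += (
                                        w[n, c, kd, kh, kw]
                                        * ifm[c, z + kd, y + kh, x + kw])
    return ofm
```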
In general, the on-chip memory of an FPGA cannot hold the entire feature maps and weights, the amount of which increases rapidly as the network gets deeper. On the other hand, the sources of parallelism can be exploited in association with the characteristics of the convolutional layers, like the size of the feature maps and the channels of the convolution kernels. As summarized in Section II-B, the kernels are generally 3 × 3 or 1 × 1, which is not suitable for loop tiling. The width, height, and depth of the output feature maps vary greatly across layers, increasing the difficulty of identifying the optimal parallelism factors. Meanwhile, the depth of the kernels is generally 3, 5, or 7, and the channels of both the input and output feature maps are usually a multiple of 8 or 16; these are much more suitable for loop tiling to simplify the PE design and improve resource utilization. Thus, as shown in Fig. 3, we tile the loops L1a, L5a, and L6a, corresponding to the channels of the output feature maps, the kernel channels, and the kernel depths, by the parallelization degrees P_n, P_c, and P_d, respectively. The tile factors T_n, T_c, and T_d correspond to the parallelization degrees, as shown in (3). All the configurable parameters are listed in Table IV. In addition, we pipeline these three loops so that the processing of different tiles overlaps, achieving high throughput of the computation engine.
Different loop strategies correspond to different data reuse opportunities. To reuse the weights, we merge the loops L2a-L4a into the new loop L2b, corresponding to the depth, height, and width of the output feature maps. Further, the new loop L2b is tiled by the parallelization degree P_a and tile factor T_a, which means one weight datum is reused with T_a input feature data. In other words, the size of the PEs and the input buffer can be adjusted by T_a.
After loop tiling, to increase data locality by dividing the data into multiple blocks, we unroll the inner four loops L7b-L10b in Fig. 3(b), which are mapped onto a PE array. That means the PE array calculates the convolutions of T_c channels of the input feature maps and T_n channels of the output feature maps with a kernel depth of T_d in parallel. Thus, the size of the PE array can be initialized by the tile factors, which directly affects the consumption of computing resources.
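The tiled and unrolled schedule of Fig. 3(b) can be modeled behaviorally as follows: the outer loops step over tiles, while the innermost contraction stands for the unrolled loops L7b-L10b that the PE array executes in one clock. This is a software sketch under our reading of the figure, not RTL.

```python
import numpy as np

def tiled_conv3d(ifm, w, Tn, Tc, Td, Ta):
    """Tiled 3-D convolution (stride 1, no padding); ifm: (C, D, H, W),
    w: (N, C, KD, KH, KW). Output positions are flattened so that the
    merged loop L2b can be tiled by Ta."""
    C, D, H, W = ifm.shape
    N, _, KD, KH, KW = w.shape
    OD, OH, OW = D - KD + 1, H - KH + 1, W - KW + 1
    out = np.zeros((N, OD * OH * OW))
    for n0 in range(0, N, Tn):                     # L1b: output-channel tiles
        for a0 in range(0, OD * OH * OW, Ta):      # L2b: output-position tiles
            for c0 in range(0, C, Tc):             # L5b: kernel-channel tiles
                for kd0 in range(0, KD, Td):       # L6b: kernel-depth tiles
                    td = min(Td, KD - kd0)
                    for kh in range(KH):           # kernel-plane traversal
                        for kw in range(KW):
                            for a in range(a0, min(a0 + Ta, OD * OH * OW)):
                                z, r = divmod(a, OH * OW)
                                y, x = divmod(r, OW)
                                # unrolled L7b-L10b: Tn x Tc x td MACs/clock
                                out[n0:n0 + Tn, a] += np.einsum(
                                    'ncd,cd->n',
                                    w[n0:n0 + Tn, c0:c0 + Tc,
                                      kd0:kd0 + td, kh, kw],
                                    ifm[c0:c0 + Tc,
                                        z + kd0:z + kd0 + td,
                                        y + kh, x + kw])
    return out.reshape(N, OD, OH, OW)
```

The schedule can be checked against the plain eight-loop reference above on random tensors, since both compute the same convolution.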
Furthermore, in the spatial convolution, MAC operations are carried out along the width, height, and depth of the filter successively. Since the width, height, and depth of the filter can be configured at initialization, the accelerator architecture is applicable not only to 2-D and 3-D CNNs, but also to multiple kernel sizes.

C. Data-Path Optimization for Streaming Computations

1) Parallelism-Oriented Memory Pattern:
It is important for CNN acceleration on FPGAs to minimize the off-chip accesses of intermediate data, which are time-consuming and power-wasting. A data-path optimization that uses the temporary data efficiently can ease this problem. As described in Section III-B, we unroll the inner loops and map them onto the PE array, which performs the convolution in three parallelism dimensions: the channel of the output feature maps (P_n), the channel of the kernel (P_c), and the depth of the kernel (P_d). For the more complex data structures in 3-D CNN, the storage pattern of feature maps and weights is designed to reduce the on-chip SRAM and simplify data prefetching. To support the streaming design of the PEs and rely less on high bandwidth, weight reuse is further investigated.
As shown in Fig. 4, in 3-D CNN the feature maps can be represented as a four-dimensional tensor and the weights as a five-dimensional tensor, which is not friendly for prefetching the relevant data. As mentioned above, the channels of the kernel, which are equal to those of the feature maps, are tiled to T_c for processing in parallel. After reshaping, the data in the same tile are concentrated into a minimized block of (1 × 1 × T_c), which can be fed to the PEs directly.
With a similar approach, the five-dimensional weight tensor (W_w, H_w, D_w, C_w, N_w) can be rearranged into a new four-dimensional tensor (W_w, H_w, C_w × D_w, N_w), which also takes into account the parallelism tiling of T_c, T_d, and T_n.
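A minimal NumPy sketch of this rearrangement (tensor sizes are illustrative, not the paper's):

```python
import numpy as np

# The 5-D kernel tensor (Ww, Hw, Dw, Cw, Nw) becomes a 4-D tensor
# (Ww, Hw, Dw*Cw, Nw): within each depth slice the Cw channel values are
# contiguous, so one (1 x 1 x Tc) block can be fetched per cycle.
Ww, Hw, Dw, Cw, Nw = 3, 3, 7, 32, 64
w5 = np.arange(Ww * Hw * Dw * Cw * Nw).reshape(Ww, Hw, Dw, Cw, Nw)
w4 = w5.reshape(Ww, Hw, Dw * Cw, Nw)

# The first Cw entries of the merged axis are the channels of depth slice 0.
assert np.array_equal(w4[0, 0, :Cw, 0], w5[0, 0, 0, :, 0])
```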
2) Memory Access Promotion With Weight Reuse: The on-chip memory is not always large enough for the entire feature maps and weights. Since there is a great amount of repeated data access in convolution computing, an effective memory access and data reuse method is essential to reduce the demands on bandwidth and transmission power. Reuse of the input feature data or of the weights is usually applied, depending on the characteristics of data access in the CNN model.
If all the input feature data and weights are loaded from external memory without data prefetching, every MAC operation fetches one feature datum and one weight, so the total external memory access (EMA) in a convolutional layer can be represented as

EMA_total = 2 · (D_d H_d W_d) · (D_w H_w W_w C_w N_w).    (4)

With input feature data reuse, each loaded feature datum is reused with R_d weights, where R_d ∈ {0, 1, 2, ..., N_w}, and the total data access becomes

EMA_d = (D_d H_d W_d)(D_w H_w W_w C_w N_w)(1 + 1/R_d).    (5)

With weight reuse, each loaded weight is convolved with R_w feature data, where R_w ∈ {0, 1, 2, ..., D_d × H_d × W_d}, and the total data access becomes

EMA_w = (D_d H_d W_d)(D_w H_w W_w C_w N_w)(1 + 1/R_w).    (6)

Combined with the above-mentioned HSI classification models, the range of R_w is much larger than that of R_d. Compared with feature data reuse, the weight reuse method can therefore minimize the total amount of data loaded from the internal cache, so efficient memory access of the limited on-chip memory with weight reuse is investigated.
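Under our reading of (4)-(6), the access counts can be tabulated as follows; the function name and interface are our own.

```python
def ema_counts(Dd, Hd, Wd, Dw, Hw, Ww, Cw, Nw, Rd=1, Rw=1):
    """External memory accesses of one CONV layer under (4)-(6).

    Without reuse, every MAC fetches one feature datum and one weight (4);
    feature reuse divides the feature traffic by Rd (5), and weight reuse
    divides the weight traffic by Rw (6). Rd, Rw >= 1 assumed here.
    """
    macs = Dd * Hd * Wd * Dw * Hw * Ww * Cw * Nw
    return {"no_reuse": 2 * macs,
            "feature_reuse": macs / Rd + macs,
            "weight_reuse": macs + macs / Rw}
```

Since max(R_w) = D_d × H_d × W_d is typically far larger than max(R_d) = N_w for the models in Tables I-III, the weight-reuse term shrinks further, matching the conclusion above.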
In the weight reuse strategy, each weight is reused across the whole input channel. Since it is not practical to read the entire input feature maps into the buffers, block-based processing is an efficient solution that reduces the buffer size significantly. As shown in Fig. 4, after data reshaping, the minimal data element involved in each convolution step can be represented as a 1-D tensor of dimensions 1 × 1 × T_c. The optimized input and output data path is shown in Fig. 5. As we can see, the sliding of the filter in conventional spatial convolution is substituted by the traversal of the input feature data. Every clock, one weight element is reused with T_a feature data elements, which correspond to that weight in the plane dimension. The computation costs T_c × T_a MACs in parallel and generates a temporary T_a output tensor saved into a line buffer. In the next clock, a new weight element and its corresponding T_a feature data elements are computed. The convolutional outputs are accumulated with the corresponding temporary outputs in the line buffers, and the temporary outputs are then refreshed. The operations are repeated H_w × W_w times until all the weights are traversed. At this time, the values in the line buffers can be transferred to BRAM, where they may be the final output feature data or may be accumulated again with other line buffers, depending on the parallelization degree.
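This schedule can be modeled behaviorally for a single output channel and one kernel plane as below; buffer names and shapes are our illustrative choices, not RTL.

```python
import numpy as np

def conv_plane_weight_stationary(ifm_blocks, w_plane):
    """Weight-reuse schedule of Fig. 5 for one output channel.

    ifm_blocks: (Hw, Ww, Ta, Tc) feature elements grouped per kernel position;
    w_plane:    (Hw, Ww, Tc) one kernel plane. Each (kh, kw) step holds one
    (1 x 1 x Tc) weight element for Ta clocks, firing Tc MACs per clock and
    accumulating into the Ta-entry line buffer.
    """
    Hw, Ww, Ta, Tc = ifm_blocks.shape
    line_buf = np.zeros(Ta)
    for kh in range(Hw):                  # traverse the weight plane:
        for kw in range(Ww):              # Hw * Ww reuse periods in total
            for a in range(Ta):           # one clock per reused feature element
                line_buf[a] += ifm_blocks[kh, kw, a] @ w_plane[kh, kw]
    return line_buf                       # written back or further accumulated
```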
Note that, for clarity, only the parallelism in the channel direction is represented in Fig. 5. In the actual accelerator architecture, the processing is conducted in the three parallelism dimensions P_n, P_c, and P_d concurrently. Therefore, the input buffer size for the feature maps is T_a × T_c × T_d, and that for the weights is 1 × T_c × T_d × T_n. The output line buffer size is T_a. In practice, the feature maps and weights are mostly stored in off-chip memory, so a circular buffer for prefetching is suggested to hide the transmission latency. More precisely, the PE reads the T_a feature data elements directly from the buffer while, simultaneously, the following lines of feature data are loaded from off-chip memory. The computation of the current lines and the transfer of the following lines are executed in parallel thanks to the circular buffer design. The number of buffered lines should be evaluated accurately to achieve perfect pipelining, taking into account the processing speed and the transmission rate of the external memory. To sum up, the proposed scheme offers good support for the configured parallelism design and achieves smaller buffers with no system degradation.

D. Parallelism Design of Convolutional Layer
The PE is the central component of the architecture, supporting the acceleration of both the CONV and FC layers. To achieve a better tradeoff between accelerating performance and resource cost, the PE array is built according to the parallelism strategy.
We parallelize the channel of the output feature maps (P_n), the channel of the kernel (P_c), and the depth of the kernel (P_d) by unrolling the inner four loops L7b-L10b in Fig. 3(b). Fig. 6 shows the parallelization design of the PE array. The innermost loop L10b is mapped onto a PE with T_c MACs, and there are T_d × T_n PEs in total. Therefore, the size of the PE array can be flexibly changed to match the processing resources in hardware through the tiling factors T_d, T_n, and T_c.
For better understanding, Fig. 6 is explained in conjunction with Fig. 5. In Fig. 5, the input feature data and weights are prefetched into the circular buffers beforehand, and the data in the first line are read for processing. The additional lines of the circular buffer enable overlapping the computation of the current line with the data prefetching from external memory. The feature data elements (1 × 1 × T_c) are streamed into the PE clock by clock. The weight data remain constant for T_a clocks and are reused for the T_a data elements. The convolution for one data element takes one clock with multiple MACs, whose number equals the factor T_c. Since the convolution result corresponds to only one point of the weight block, it is stored in the line buffer temporarily. The partial convolution result is accumulated with that of the next weight in the next period. The traversal of the weight block takes H_w × W_w periods (e.g., 9 for a 3 × 3 kernel), after which the result is written to the output buffer. For 2-D CNN, the final convolution output is obtained at this point, while for 3-D CNN it is still partial. As shown in Fig. 6, the partial convolutions along the depth of the kernel are accumulated further to generate the final output. Due to weight reuse, the PE array only needs to be interconnected in dimension T_d, which reduces the complexity of interconnection significantly. In addition to convolution, the rectified linear unit is implemented by introducing comparison operators to the output buffers. Additional attention should be paid to the data width: because the bit width occupied by the intermediate data is larger than that of the final output, truncation is necessary before the output buffers.

IV. EXPERIMENTAL RESULTS

A. Experimental Setup
We use two FPGA platforms to evaluate our techniques: 1) the Alinx AXU15EG and 2) the Avnet Zedboard. The Alinx AXU15EG contains a high-performance Xilinx UltraScale+ MPSoC FPGA (ZU15EG, 16 nm) and four 1-GB DDR4 memories. The Zedboard is a low-cost development board integrating a Xilinx Zynq-7000 SoC (XC7Z020, 28 nm) and 512 MB of DDR3; it is used to evaluate the portability of our accelerator architecture on a platform with fewer computational resources. The FPGA implementation operates at a frequency of 100 MHz on both platforms. To measure the running power, we plug a power meter into the FPGA platform.
In the following, to discuss the wide applicability of the architecture, we first perform a case study using the well-known HSI classification models, including the 2-D CNN model, the 3-D CNN model, and HybridSN, with multiple datasets. The CNNs are retrained and quantized to 8-bit fixed point without significant loss of accuracy. Then we present the HybridSN implementation on both FPGA platforms, analyzing the performance and resource consumption. Finally, to compare our work with previous works, we evaluate our design using a more representative 3-D CNN model: C3D.
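The quantization step can be sketched as generic symmetric per-tensor 8-bit quantization; the paper's exact retraining and quantization recipe is not detailed here, so the scheme below is an assumption.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization to 8-bit fixed point."""
    scale = max(np.max(np.abs(x)) / 127.0, 1e-12)   # avoid dividing by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(64, 32, 3, 3).astype(np.float32)
w_q, s = quantize_int8(w)
print("max abs quantization error:", np.max(np.abs(w_q * s - w)))
```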

B. Implementation and Performance of HSI Classification Models
To evaluate the applicability of the architecture, we analyze the hardware implementations of the three well-known HSI classification models summarized in Section II-B. The 2-D CNN model [14] consists of two (3 × 3) 2-D convolutional layers. The 3-D CNN model [17] consists of six 3-D convolutional layers with kernel sizes of (3 × 3 × 3) and (1 × 1 × 3). HybridSN [18] consists of three 3-D convolutional layers and one 2-D convolutional layer. Together, these models allow us to verify the applicability and reconfigurability of the accelerator for multiple convolution types and kernel sizes. Furthermore, the recommended patch size differs across data sets, bringing size diversity to the feature maps. In our study, the developed framework is validated with three publicly available hyperspectral datasets: Pavia University (PU), Salinas Valley (SA), and Indian Pines (IP). The false color images of the data sets are shown in Fig. 7. The PU data set includes nine urban land-cover classes, with a spatial dimension of 610 × 340; after noisy spectral bands are eliminated, 103 spectral reflectance bands from 430 to 860 nm remain. The SA data set comprises 512 × 217 pixels with 224 spectral bands, covering the wavelength range from 360 to 2500 nm, and includes 16 classes. The IP data set consists of 145 × 145 pixels with 224 spectral bands from 400 to 2500 nm, labeled with 16 vegetation classes. We randomly chose 50% of the tagged samples; then 66.5%, 3.5%, and 30% of the chosen data were randomly divided into the training, validation, and testing sets, respectively.
Table V presents the performance of the three models on the different data sets. The 2-D CNN model has the lightest architecture and the minimum patch size, so it takes the least execution time. The 3-D CNN model takes raw data as input, without dimensionality reduction; four of its layers use 1 × 1 kernels in place of 3 × 3 kernels to achieve a lighter model, so the total time consumption is about 1 ms. Meanwhile, because of the 1 × 1 kernels, the parallel performance of the accelerator architecture cannot be fully utilized, and the throughput remains at a lower level. HybridSN outperforms the others in both throughput and accuracy: the simultaneous extraction of spatial and spectral features indeed promotes accuracy. HybridSN also entails more computation, because of both the larger kernel size and the larger patch size, so its execution time is longer.
Apart from the model architecture, the accelerator throughput is highly correlated with the input data: a larger patch makes the workload more computation-intensive, yielding higher throughput.
The HybridSN model on the IP dataset produces the highest throughput of 27.23 MOPS. However, the parallel processing capability is not utilized as fully as in the implementation for object or activity detection described in Section IV-D.
According to the accuracy results in Table V, the HybridSN model performs best and the 2-D CNN model ranks second. For a visual comparison, the classification maps of the different models are presented in Fig. 8. The classification map of HybridSN has the highest quality, while that of the 3-D CNN model shows the noisiest scattered points.
The accelerator architecture can be configured to adapt to various networks. Table VI shows the parallelism factors for the three models. The parallelism factors are empirically chosen in accordance with the operating frequency and the resource constraints on the FPGAs. Since a 2-D CNN can be regarded as a special form of 3-D CNN with a convolution kernel depth of 1, the factor T_d is set to 1 for the 2-D CNN model. The maximal kernel depth of the 3-D CNN model is 3, so we set T_d as 3. Since the kernel depth of HybridSN varies in the range {3, 5, 7}, we choose a compromise value of 4.
The resource utilization is also analyzed through the Xilinx Vivado tool, as shown in Table VI. In the hardware implementation of a network, the DSP resources are the strict constraint in most cases. Especially for a full-precision network, the Vivado synthesis tool tends to map multiplications to DSPs, keeping the DSPs in shortage. Because the weights and activations are quantized to 8-bit fixed point, the low-precision multiplications can be implemented using look-up tables (LUTs), reducing DSP consumption. Furthermore, based on the parallel design, the DSPs become abundant resources and the LUTs have a higher utilization.
To further investigate the layerwise performance of the networks and make a fair comparison, we verify the three typical network models on the same IP dataset; the details are shown in Tables VII-IX. The 2-D CNN model has only two 3 × 3 convolutional layers, which take up most of the computation and execution time. Interestingly, the two layers have the same amount of operations. The channels of the output feature maps in the two layers are 90 and 270, respectively, while T_n is set to 64. The convolution calculation therefore requires multiple serial passes of the PE array, resulting in better performance for the first layer; the throughput differs by approximately 4× between the two layers.
In the 3-D CNN network, the smaller kernel size (1 × 1) is introduced to make the network lighter, but it does not help the performance of parallel processing, and the small patch size is another negative factor. Evidently, the first three layers take up most of the computation, while only the second layer shows higher throughput. For the first convolutional layer, the channel count of the input feature map is 1; since the parallelism factor T_c is set to 32, this layer fails to fully utilize the parallel capability. The same phenomenon appears in the next test case, HybridSN.
The implementation of HybridSN achieves higher throughput with more powerful parallel processing. For the first three 3-D convolutional layers, as the channels of the output feature maps and of the kernels increase, the processing power also increases, peaking at layer Conv3d_3. The first layer shows the same phenomenon as in the 3-D CNN network. For the second layer, the channels of the input and output feature maps are 8 and 16, against factors T_c and T_n of 32, so it does not achieve the peak parallel capability. The 2-D convolutional layer has many output feature map channels and kernels, so the accelerator can fully carry out parallel computation. In the FC layer, the huge number of parameters costs much time in data access, leading to lower throughput.

C. Performances on Different FPGAs
The proposed accelerator architecture can also be easily configured for different FPGAs. To evaluate system portability, we implement HybridSN on two FPGA platforms: Zedboard and AXU15EG. The Zedboard is a low-cost development board with fewer logic resources. The AXU15EG contains a higher-performance MPSoC FPGA with far more LUTs (6.4×) and DSPs (16×) than the Zedboard, and is thus more likely to be chosen for CNN implementation. Therefore, lower parallelization factors (8, 8, 7) are set for the Zedboard, adapting to its limited logic resources, while higher parallelization factors (32, 32, 4) are set for the AXU15EG, achieving higher performance with its more abundant logic resources.
Table X presents the performance of each layer and the resource utilization of HybridSN on the two FPGA platforms. As we can see, the first layer shows relatively lower performance on both FPGAs, for the reason discussed in Section IV-B. For the latter three convolutional layers, the performance on the AXU15EG peaks in layer Conv2d_1 (166.18 GOPS), and the peak on the Zedboard appears in layer Conv3d_2 (12.54 GOPS). This occurs because the Zedboard implementation uses smaller parallelization factors due to its limited logic resources.
As shown in Table X, the DSPs are no longer the limiting resource in the hardware implementation, illustrating the feasibility of the parallelism design. Instead, LUTs dominate the resource consumption, since they are mainly used for multiplication and accumulation.

D. Comparison With State-of-the-Art Methods
To the best of our knowledge, there are few FPGA implementations of the typical HSI classification networks discussed above. To compare our architecture with other works, we choose the 3-D framework C3D, which has been implemented more widely. The weights and activations are quantized to 8-bit fixed point, and the parallelism factors (T_n, T_c, T_d) are set to (64, 32, 3). Fig. 9 presents the throughput of the convolutional layers. The PE array is under-utilized in the Conv1 layer because the input channel count is only 3, leading to a minimum throughput of 22.08 GOPS. Since the other convolutional layers have many more output feature map channels and kernels, the parallel capability of the accelerator architecture can be fully utilized; the peak throughput is 225.87 GOPS in the Conv2 layer.
Table XI compares the proposed design with previous works implementing C3D. The results show that our design achieves a power efficiency more than 10.67% higher than the other works. Since different FPGA platforms and DSP slices are used, we list the performance density (the number of arithmetic operations that one DSP executes in one cycle) for a fair comparison. Compared with [40] and [37], which also use spatial convolution, our design achieves more than 20% higher performance density. Shen et al. [38] use the Winograd algorithm to reduce the multiplications in convolution by 70.4%, at the cost of 101% more additions; because the additions are mapped to LUTs instead of DSPs, the implementation in [38] achieves the best performance density. The main drawback of the Winograd algorithm is the fixed kernel size and convolutional stride, which means the FPGA needs to be reprogrammed for different CNN networks. From a comprehensive perspective, the proposed design achieves a relatively desirable performance density, trading some performance for better applicability.

V. DISCUSSION
The in-depth analysis of CNNs illustrates that convolutional operations occupy over 98% of the total computation. The high computation complexity results in a high demand for computing resources, which in turn makes the DSP resources the strict constraint in most cases. The proposed dynamically configurable accelerator architecture with a parameterized parallelism strategy enables the parallelization degree to be configured flexibly, according to the FPGA resources and the CNN framework. Thus, the convolutional network can be rebuilt easily and the DSPs are no longer the limiting resource in hardware implementation.
Different from object detection, the amount of intermediate data in HSI classification is equivalent to or even less than the weights, because of the patchwise computation. In view of this characteristic, the proposed parallelism-oriented memory pattern and data access strategy, based on weight reuse, is beneficial for hiding the transmission latency.
Furthermore, since HSI classification is carried out patch by patch, the patchwise processing has a serious impact on the total computation. For example, despite its fewer CONV layers, the HybridSN framework requires up to 495.4 MOP/patch due to the larger patch size. For the IP data set with 145 × 145 spatial pixels, the inference computation for one frame reaches up to 10 415 GOP (145 × 145 = 21 025 patches at 495.4 MOP each), whereas the YOLOv3 model for object detection requires only 193 GOP per frame inference. It can be seen that HSI classification processing is extremely computation-intensive. Light-weight techniques will be investigated in our future research. Although 8-bit quantization proves effective in reducing computation in this article, quantization with less bias is promising for promoting processing efficiency. Meanwhile, weight pruning provides another effective solution to simplify the network structure and accelerate inference processing.

VI. CONCLUSION
In this work, a parameterized configurable accelerator architecture is presented, with good applicability to multiple CNN algorithms for HSI classification. We also develop a parallelism-oriented memory pattern and data access strategy to minimize the local buffer and eliminate repeated accesses to external memory. The proposed architecture is evaluated through implementations of 2-D, 3-D, and hybrid 2D-3D CNNs across multiple FPGA platforms. The experimental results demonstrate that our configurable accelerator architecture achieves desirable performance with good applicability and extensibility.

Fig. 1. Schematic diagram of the configurable acceleration architecture for HSI classification.

Fig. 4. Storage pattern of feature maps and weights.
TABLE VIII PERFORMANCE OF EACH LAYER FOR 3-D CNN ON IP DATASET

TABLE V PERFORMANCE OF IMPLEMENTATION FOR HSI CLASSIFICATION MODELS

TABLE VI PARALLELIZATION AND RESOURCE UTILIZATION OF HSI CLASSIFICATION MODELS

TABLE VII PERFORMANCE OF EACH LAYER FOR 2-D CNN ON IP DATASET

TABLE IX PERFORMANCE OF EACH LAYER FOR HYBRIDSN ON IP DATASET

TABLE X COMPARISON OF MULTIPLE FPGA PLATFORMS FOR HYBRIDSN

TABLE XI COMPARISON WITH OTHER FPGA IMPLEMENTATIONS OF C3D

Fig. 9. Performance of each layer in C3D.