FPGA Acceleration of a Composite Kernel SVM for Hyperspectral Image Classification

Hyperspectral image classification is one of the most important techniques for analyzing hyperspectral image that have hundreds of spectrum luminance values of near-infrared to visible light. For this classification, supervised learning methods are widely used, but in general, they typically trade off between their accuracy and computational complexity. Our approach is based on a composite kernel method, and the computation is simplified to achieve higher processing speeds by efficiently using the hardware resources of FPGA. The accuracy of this approach reaches 98.0% and 98.8% on two benchmark datasets, Indian Pines and Salinas via simulation, which is comparable to those in previous works. Two implementations, one with less hardware resources but more off-chip memory bandwidth, and another with more hardware resources but less off-chip memory bandwidth, are implemented on an FPGA and evaluated. The processing speeds of the two implementations are the same, which is 1.3 Mpixels/ $s$ for 2048 pixel wide images. This processing speed is fast enough for real-time processing, and faster than previous studies when normalized by hardware size and power consumption. We also introduce two more implementations that aim to reduce the on-chip memory usage of the second implementation within a reasonable increase of off-chip memory bandwidth, and we discuss which implementation is advantageous under what conditions.


I. INTRODUCTION
Hyperspectral images have many bands that range from visible to near-infrared light. Hyperspectral image (HSI) classification involves classifying each pixel in an image taken by a hyperspectral camera mounted on an airplane, satellite, or drone into categories such as field, forest, and even different stages of plant growth using these many bands, which is impossible with the three bands of RGB images. Hyperspectral image classification is used in many areas, such as crop analysis, monitoring plant growth, ecological investigation, geological mapping, mineral exploration, defense research, urban investigation, military surveillance, and locating people when a disaster such as an earthquake or hurricane occurs. For this classification, a prediction model obtained by supervised learning is typically used. Supervised learning is a method that enables machines to classify objects, problems, and situations based on related data fed to the machines.
For this supervised learning, the pixels in images and their true classifications are given, and the parameters of the model are tuned for the given data. Then, for other pixels, their categories are estimated using the obtained prediction model. Fig.1 shows two examples of hyperspectral image classification. The top row shows an image in the 'Indian Pines' dataset [1], and the bottom row shows an image in the 'Salinas' dataset [1]. These datasets are used for the evaluation in Section IV. The images on the left are the target images, and those on the right are their ground truth in 16 categories, such as Alfalfa, Corn-notill, Corn-mintill, and Corn for Indian Pines, and vegetables, bare soils, and vineyard fields for Salinas. Recently, the performance of hyperspectral cameras has constantly improved in terms of temporal, spatial, and wavelength resolution, and the amount of data in hyperspectral images is becoming increasingly large [2]. To achieve a higher classification rate, many algorithms have been proposed [3], [4]. For example, [5] succeeded in improving the accuracy by modeling the spectral feature as a graph using a graph convolution network and combining it with a CNN. Furthermore, they proposed a mini-batch method for the graph convolution network to improve learning efficiency.
Achieving real-time hyperspectral image classification is desirable, because after classification, the image size can be reduced to 1/200 or less: before classification, each pixel has over 200 bands, but after classification, it needs only a few bits (only 4 for Indian Pines and Salinas, because the number of classification categories is 16). This data size reduction is particularly important when the system is mounted on a satellite with limited data communication capacity. In addition, some application problems require real-time processing. For accelerating hyperspectral image classification, graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) are widely used, but when power consumption is considered, FPGAs have more advantages.
In this paper, we first describe an algorithm based on a composite kernel support vector machine (SVM) [6]. It requires less learning data, and its computational sequence is simple, which enables the efficient use of hardware resources. Then, we evaluate two FPGA implementations for real-time processing: one with fewer hardware resources but more off-chip memory bandwidth, and another with more hardware resources but less off-chip memory bandwidth. The outline of our algorithm and a preliminary evaluation of the first implementation were already presented in [7], where the accuracy of our algorithm was shown to be comparable to those in previous studies, and its processing speed was shown to be faster than previous studies when normalized by hardware size and operational frequency. The contributions of this paper are as follows: 1) a detailed evaluation of the two implementations in terms of circuit size, processing speed, and power consumption while changing the parallelism, together with a comparison with previous studies, 2) the introduction of two more implementations that aim to reduce the on-chip memory usage of the second implementation with a reasonable increase of off-chip memory bandwidth, based on the evaluation results, and 3) a clarification, through the evaluations and discussion, of which implementation is advantageous under what conditions. A detailed description of our algorithm, which is a simplified and modified version of a composite kernel SVM [6] for parallel processing with lower power consumption, is also provided.
The rest of the paper is organized as follows. Section II describes related work. Section III introduces the outline of a composite kernel SVM [6] that is the basis of our algorithm, and Section IV provides the details of our algorithm. Section V describes the two implementations, and they are evaluated in Section VI. In Section VII, two more implementations are introduced based on the evaluation results, and the four implementations are compared. Section VIII presents our conclusions.

II. RELATED WORKS
Many GPU systems for hyperspectral image classification have been proposed. Reference [9] accelerated the computation described in [6], which is the algorithm underlying our system, using a GPU. Reference [8] proposed a patch-free learning framework for HSI classification and accelerated the computation of its CNN using a GPU. In this approach, the teacher signals for each iteration are selected to avoid the redundant calculation caused by overlapping windows centered on target pixels.
In [11], parallel independent component analysis was implemented on an FPGA to reduce the number of bands, and in [10], the use of an FPGA for hyperspectral image classification on a satellite was discussed. Reference [12] proposed an FPGA-based cloud detection system on a satellite. These studies showed that an FPGA-based system has advantages in terms of processing speed when considering power consumption and customization flexibility.
Considering these advantages of FPGAs, such as lower power consumption than GPUs and more flexibility than dedicated LSIs, we focus on FPGA systems. However, FPGA systems for hyperspectral image classification are not abundant. Reference [13] implemented a region growth-based multi-resolution algorithm on an FPGA. With this approach, the power consumption was 20% of that of a CPU when processing a single band. However, although the algorithm was modified for hardware acceleration, its accuracy was not verified. In [14], a simple CNN for HSI classification was implemented and the performances on three different platforms, CPU, GPU, and FPGA, were compared. For the FPGA implementation, Xilinx Vitis AI tools were used. In these implementations, the computational cost of the CNN is reduced by a pruning process, and it was reported that the GPU achieved the fastest processing.
References [15] and [16] are two important studies on hyperspectral image classification on an FPGA aiming at use on a satellite. However, in these studies, the number of usable bands is limited, and those bands must be chosen manually before classification. This requirement limits usability, although a good classification rate was achieved. Reference [17] is another important study that proposed a high-speed, high-precision system considering hardware efficiency, and achieved real-time on-board HSI classification. The weights of all layers of the network are located in on-chip memory, which enables the CNN model to be processed with high throughput. Consequently, this system achieved high accuracy in HSI classification on popular benchmark datasets. However, the CNN requires many calculations and accordingly huge hardware resources.

III. COMPOSITE KERNEL SUPPORT VECTOR MACHINE
In this section, we introduce the outline of the composite kernel support vector machine (SVM) proposed in [6].

A. SUPPORT VECTOR MACHINE (SVM)
Let x_i ∈ R^N and y_i ∈ {−1, +1}. Given a labeled training dataset {(x_1, y_1), . . . , (x_n, y_n)} and a non-linear mapping function φ(·) to a high dimensional (or infinite dimensional) Hilbert space H, the SVM solves

  min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^{n} ξ_i    (1)

under the constraints

  y_i (⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, . . . , n,

where w and b define a linear classifier in the feature space, C is a parameter for the normalization, and ξ_i is a positive margin for handling errors. Cover's theorem [18] guarantees that data mapped by a non-linear mapping function φ(·) are more easily linearly separable in the feature space. The dimension of w is generally high, and equation (1) can be solved by solving the Lagrange dual problem. Thus, problem (1) can be rewritten as follows:

  max_α Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j ⟨φ(x_i), φ(x_j)⟩    (2)

where the auxiliary variable α_i (0 ≤ α_i ≤ C) is a Lagrange multiplier that constrains the range of variables, and the variables satisfy Σ_i α_i y_i = 0, i = 1, . . . , n.
All uses of φ(·) in the learning of the SVM appear in the form of an inner product. Thus, a kernel function K can be defined as

  K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩.    (3)

Using the kernel function, a non-linear SVM can be constructed without fixing φ(·). That is, by introducing equation (3) into (2), the Lagrange dual problem can be derived without defining φ(·), and by solving this problem, a decision function f(·) for any test vector x can be obtained as follows:

  f(x) = Σ_{i=1}^{n} α_i y_i K(x_i, x) + b.    (4)

In this equation, b is easily calculated from the α_i. To classify data x into one of two categories, the sign of the output of the decision function f(x) is used. When the number of categories is more than two, the decision function is applied with different support vectors for all pairs of categories (M(M−1)/2 pairs when the number of categories is M), and the category to which x is most frequently assigned is chosen as the category of x.
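As a concrete illustration of equation (4) and the one-vs-one voting described above, the following minimal Python sketch evaluates the decision function and tallies votes; the data layout (the pairwise_models dictionary and its field names) is purely illustrative and not part of any implementation in this paper.

```python
import numpy as np

def decision(x, sv, sv_y, alpha, b, kernel):
    # f(x) = sum_i alpha_i * y_i * K(sv_i, x) + b  (equation (4))
    return sum(a * y * kernel(s, x) for a, y, s in zip(alpha, sv_y, sv)) + b

def classify(x, pairwise_models, num_categories):
    # One-vs-one voting: apply each pairwise decision function and vote
    # for the winning category; the most-voted category is returned.
    votes = np.zeros(num_categories, dtype=int)
    for (cat_pos, cat_neg), m in pairwise_models.items():
        f = decision(x, m["sv"], m["sv_y"], m["alpha"], m["b"], m["kernel"])
        votes[cat_pos if f >= 0 else cat_neg] += 1
    return int(np.argmax(votes))
```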

B. COMPOSITE KERNEL FOR HYPERSPECTRAL IMAGE CLASSIFICATION
Here, we introduce the composite kernel for hyperspectral image classification. In [6], two kernel functions were used: one for the spatial feature x^s, and another for the spectral feature x^ω. Let x be a pixel value in the hyperspectral image. Then, x is used as x^ω, and the average (or the standard deviation) of the pixels in a window centered at x is used as x^s. The composite kernel is formulated as

  K(x_i, x_j) = µ K_s(x_i^s, x_j^s) + (1 − µ) K_ω(x_i^ω, x_j^ω),    (5)

where K_s and K_ω are the kernels for the spatial and spectral features respectively, and µ is a real number in 0 < µ < 1 that balances the spatial and spectral features. The value of µ is decided in the learning step. In [6], a polynomial kernel is used as the spectral kernel K_ω, and a radial basis function kernel

  K_s(x_i, x_j) = exp(−‖x_i − x_j‖² / C_0),    (6)

where C_0 is a constant, is used as the spatial kernel K_s.
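The following sketch expresses equations (5) and (6) in Python; the polynomial degree p is left as a free parameter here, since only the kernel family is fixed by [6].

```python
import numpy as np

def rbf(xi, xj, c0):
    d = xi - xj
    return np.exp(-np.dot(d, d) / c0)      # spatial kernel K_s, equation (6)

def poly(xi, xj, p):
    return (np.dot(xi, xj) + 1.0) ** p     # spectral kernel K_w, degree p

def composite(xi_s, xj_s, xi_w, xj_w, mu, c0, p):
    # Equation (5): mu-weighted sum of the spatial and spectral kernels.
    return mu * rbf(xi_s, xj_s, c0) + (1.0 - mu) * poly(xi_w, xj_w, p)
```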

IV. OUR ALGORITHM
In the original algorithm, x (the normalized pixel) is given as the argument of the polynomial kernel, poly(), and x̄, the normalized average (or standard deviation) of the pixels in the window centered at x, is given to the radial basis function kernel, rbf(). In our approach, however, the same argument, x̄, is given to both kernels. Fig.2 shows the flowcharts of the original and proposed algorithms. The separate normalization of x for poly() (the dotted line box in the original) is eliminated in our approach, and x̄ is given to poly() instead of x. In this figure, f(x) in equation (7) is written as f(x̄, x̄) to make the difference from the original algorithm f(x, x̄) clear. This approach does not follow the basic idea of separate spectral and spatial kernels, but it is used for the following two reasons. 1) It has a classification accuracy comparable to other methods, as discussed in Subsection IV-B. 2) Using the same argument, the calculations can be shared between the two kernels, which enables efficient computation with fewer hardware resources and lower power consumption.

A. OUR COMPUTATION ALGORITHM
First, in the learning phase, support vectors SV_i (i = 1, . . . , n_sv) are generated for each pairwise classification. For classification into M categories, M(M−1)/2 series of SV_i are generated.
In our system, this phase is executed on CPUs or GPUs in advance. Then, the circuit on the FPGA is tuned to perform the classification using the given support vectors. For classification on an FPGA, an algorithm is desirable that can 1) achieve almost the same classification accuracy as other methods with less computation, and 2) maintain that accuracy when the data width of the operations is reduced, which in turn reduces the circuit size.
Suppose that a pixel in the hyperspectral image is given as a vector x ∈ R^d, where d is the number of bands of each pixel. Then, the value of each band of x is normalized to the interval [0, 1]. x̄ ∈ R^d denotes the average of all pixels in a (2S + 1)² pixel window on the image centered at x.
Here, to classify x̄ using the composite kernel, we must calculate

  f(x̄) = Σ_{i=1}^{n_sv} α_i (µ · rbf(x̄, SV_i) + (1 − µ) · poly(x̄, SV_i)) + b,    (7)

where n_sv is the number of support vectors required to classify x̄ into one of two categories (here, the label y_i of each support vector is folded into α_i). The category of x is decided by the sign of the calculation result of equation (7). In this equation, the polynomial kernel is given as follows:

  poly(x̄, SV_i) = (⟨x̄, SV_i⟩ + 1)^p,    (8)

  s = ⟨x̄, SV_i⟩ = Σ_{j=1}^{d} x̄_j · SV_ij,    (9)

where p is the degree of the polynomial, and the RBF kernel is

  rbf(x̄, SV_i) = exp(−‖x̄ − SV_i‖² / C_0),    (10)

  ‖x̄ − SV_i‖² = Σ_{j=1}^{d} x̄_j² − 2⟨x̄, SV_i⟩ + Σ_{j=1}^{d} SV_ij².    (11)

Here, we consider calculating equation (11) first. This equation must be calculated n_sv times, as shown in equation (7).
1) The first term in equation (11), Σ_{j=1}^{d} x̄_j², does not depend on SV_i. This means that calculating this term is necessary only once for each pixel in the image; it does not depend on the number of support vectors.
2) The second term, ⟨x̄, SV_i⟩, must be calculated n_sv times for each pixel x. This term also appears in equation (9) and can be reused for calculating poly() in equation (8).
3) The last term, Σ_{j=1}^{d} SV_ij², must also be calculated for each support vector. However, this term is independent of the pixel data x. Therefore, it can be calculated before the classification of pixels starts, when the learning step completes. With this term, we must hold d + 1 values (d bands and this term) for each support vector. Thus, only the second term must be calculated during classification, and it can be used for the calculation of both kernels. This feature reduces the computational complexity, that is, the circuit size and power consumption, as illustrated in the sketch below.
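The following sketch summarizes this computation scheme: the first term of equation (11) is computed once per pixel, the last term (sv_sq) is assumed precomputed at learning time, and the inner product s is shared by both kernels. Variable names are illustrative only.

```python
import numpy as np

def f_composite(x_bar, SV, sv_sq, alpha, b, mu, c0, p):
    # First term of equation (11): computed once per pixel.
    x_sq = np.dot(x_bar, x_bar)
    total = 0.0
    for i in range(len(SV)):
        s = np.dot(x_bar, SV[i])             # equation (9): shared by both kernels
        dist_sq = x_sq - 2.0 * s + sv_sq[i]  # equation (11), sv_sq precomputed
        k = mu * np.exp(-dist_sq / c0) + (1.0 - mu) * (s + 1.0) ** p
        total += alpha[i] * k                # alpha_i with the label folded in
    return total + b                         # sign gives the pairwise category
```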

B. ACCURACY OF OUR ALGORITHM
To achieve high performance on an FPGA, the algorithm must be sufficiently accurate, and the accuracy must be maintained when the data width of the main operations is reduced. By reducing the data width, the circuit size can be reduced, and lower power consumption can be expected. We evaluate the accuracy when the data width of the operations is reduced and compare it with other algorithms.
To compare the accuracies, hyperspectral images from the Indian Pines and Salinas datasets, taken from an airplane, are used. Images from Indian Pines consist of 145 × 145 pixels and 224 spectral bands in the wavelength range from 0.4 × 10⁻⁶ to 2.5 × 10⁻⁶ m. We used the reduced version in which the number of bands is reduced from 224 to 200 by removing bands covering the region of water absorption. The pixels in the image are classified into 16 categories, such as corn field and soybean field, including different growth stages of soybean fields that cannot be distinguished by human sight. In this experiment, 20% of the pixels in the image are randomly chosen for learning, and the remainder is used for evaluation. During learning, the parameters C, µ, and C_0 are tuned within their respective search ranges. The accuracy of the classification was evaluated using 5-fold cross validation. Images from Salinas consist of 512 lines by 217 samples and 224 spectral bands. The number of bands is reduced in the same manner as with Indian Pines, and the same parameter tuning and evaluation method are used. In this case, the 16 categories include vegetables, bare soils, and vineyard fields.

Table 1 shows the classification accuracy evaluated by software simulation (the same calculation as the hardware implementations, executed by a program written in Python) when the width of the data given to equation (9) is reduced. This data width reduction is necessary to reduce the required on-chip memory size and the memory bandwidth of the external memory. The range of these values is [0, 1.0). For example, 8b in Table 1 means that all 8 bits are used for the decimal part (no sign bit and no integer part). In the calculation of equation (9), the data width is expanded as necessary starting from the reduced data width; for example, the result of 8b × 8b becomes 16b, and 16b + 16b becomes 17b (in this case, a 1b integer part is added), eventually reaching 77b (69b × 8b) in the last multiplication in Fig.3. The learning phase is also executed using the reduced data width. The accuracy worsens as the data width is reduced. However, if the data width is not smaller than 8b, the accuracy drop is insignificant (8b shows a better accuracy than 9b, but the difference is too small to be meaningful), and 8b is small enough for efficient implementation on an FPGA.

Table 2 compares our accuracies on Indian Pines and Salinas with other systems. Composite Kernels [6] is the algorithm used as the basis of our approach, and [9] is its GPU implementation with some improvements. References [15] and [17] are FPGA systems (their details are described in Section VI), and the others are top-rank software programs. As shown in this table, our accuracy is comparable to those systems.
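For reference, this data width reduction can be simulated as a simple truncation to w fractional bits, as in the following sketch (a minimal stand-in for the Python simulation, not the actual evaluation code):

```python
import numpy as np

def quantize(x, w=8):
    # Truncate values in [0, 1) to w fractional bits (no sign, no integer
    # part); w = 8 corresponds to the "8b" row of Table 1.
    scale = float(1 << w)
    return np.floor(np.clip(x, 0.0, 1.0 - 1.0 / scale) * scale) / scale

# Example: both x_bar and the support vectors would be quantized before
# evaluating equation (9) in the simulation.
x_bar_q = quantize(np.random.default_rng(0).random(200), w=8)
```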

V. IMPLEMENTATION METHOD
Our goal is to process 2048 pixel-wide images with 200 bands at 200 lines/s. This pixel configuration differs from the Indian Pines and Salinas images used for the accuracy evaluation, but in our FPGA implementation, the architecture and its processing speed are not affected by this difference, because the (2S + 1)² pixels around each target pixel are accessed for the calculation, and these pixels can be efficiently cached on the FPGA as described below. In the second implementation, on-chip memory proportional to the line width is required for this data caching, and it may become the critical point of the implementation; the memory usage is discussed in detail in Section VI. The target processing speed is fast enough for practical applications, because it is comparable to the data acquisition speed of current high-end line-scan hyperspectral cameras [19]. To achieve this processing speed with as little power consumption as possible, the algorithm is simplified and modified for parallel processing, and implemented on an FPGA. In the following discussion, let N_sv (the sum of n_sv over all category pairs) be the total number of support vectors used to classify the pixels into M categories, W be the width of images, and (2S + 1)² be the window size.

A. PARALLELISM AND THE BASIC UNIT
Hyperspectral image classification has abundant parallelism. In our implementation, the d bands are processed in parallel in both implementations discussed in the following, and P support vectors are also processed in parallel to achieve the target processing speed. If the d bands of one pixel were processed sequentially, the partial sum for the pixel would have to be maintained and managed on the FPGA, which requires a more complex control sequence and more hardware resources. Pixels are not processed in parallel, because to process k pixels in parallel, k sets of Σ_{j=1}^{d} x̄_j² and x̄ in equation (11) would have to be calculated in parallel, and this requires considerable hardware resources.

Fig.3 shows a block diagram for calculating one iteration of equation (7). In this unit, α_i (µ · rbf(x̄, SV_i) + (1 − µ) · poly(x̄, SV_i)) is calculated following equations (8) to (11) by giving the values of the five terms α_i, SV_i, Σ_{j=1}^{d} SV_ij², x̄, and Σ_{j=1}^{d} x̄_j² in advance. In this unit, the most computationally intensive part is the module that calculates s = Σ_{j=1}^{d} x̄_j · SV_ij (equation (9)). Here, we consider its efficient calculation. First, LUTs are used for the multiply operations to avoid using DSP slices (this becomes possible owing to the reduced data width), because a large FPGA would be required to supply 200 × P DSP slices. Then, the products are summed using a pipelined adder tree. These multiply and add operations performed by LUTs can be efficiently implemented as follows. Let the l-th bit of x̄_j be x̄_j[l] (x̄_j[l] is 0 or 1). Then, when the data width of x̄_j is w, x̄_j is given as follows:

  x̄_j = Σ_{l=1}^{w} x̄_j[l] · 2^{−l}.    (12)
Equation (9) can then be rewritten as follows:

  s = Σ_{j=1}^{d} x̄_j · SV_ij = Σ_{l=1}^{w} 2^{−l} ( Σ_{j=1}^{d} x̄_j[l] · SV_ij ).    (13)

In the calculation shown in Fig.4, the upper half corresponds to equation (12), and the lower half corresponds to (13). In Fig.4, the number on the shoulder of '+' is the depth of the adder tree. Here, suppose that, in each clock cycle, only the additions of the same level are executed, and their results are held in registers. In the upper half, the number of logic cells (and registers) required for each level is 10b × 16, 12b × 8, 16b × 4, 17b × 2, and 18b × 1, which is 372 in total, whereas in the lower half it is 9b × 16, 10b × 8, 12b × 4, 15b × 2, and 18b × 1, which is 320 in total. In practice, the data width is 10b and d = 200, so this difference becomes much larger.
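The equivalence of equations (9) and (13) can be checked numerically; the following sketch mirrors the reordering in the lower half of Fig.4 by summing, for each bit position of x̄, only the SV_ij whose corresponding bit is 1, and then weighting the partial sums:

```python
import numpy as np

rng = np.random.default_rng(0)
w, d = 8, 200
X = rng.integers(0, 1 << w, size=d)    # x_bar_j scaled by 2^w (integers)
SV = rng.integers(0, 1 << w, size=d)   # support vector components (integers)

direct = int(np.dot(X, SV))            # equation (9): direct products, then sum

bitwise = 0
for l in range(w):                     # bit of x_bar_j with weight 2^-(l+1)
    bits = (X >> (w - 1 - l)) & 1
    bitwise += int(np.dot(bits, SV)) << (w - 1 - l)   # equation (13)

assert direct == bitwise               # both orderings give the same s
```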
In the rbf unit, the calculation of exp() is necessary. First, its argument is calculated: s is shifted and subtracted from the sum of Σ x̄_j² and Σ SV_ij². Then, the result is given to the address generator to calculate the approximate value of exp(). In our implementation, a linear approximation is used:

  exp(x) ≈ G[int(x/C_R)] · x + I[int(x/C_R)],    (14)

where C_R is an integer, int() is a truncate function, and G[] and I[] are tables that store the gradients and y-intercepts, respectively. The address generator converts its input to the address of the two tables. From the input x, an index to the tables, whose range is decided by C_R, is generated and used to refer to the tables. With this approximation, for x in [C_R × m, C_R × (m + 1)) (m is an integer), G[m] × x + I[m] is used as exp(x). According to our experiments, the error caused by this approximation is less than 0.1%, which is sufficiently small.
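The following sketch reproduces this table-based approximation in Python; the interval width C_R and the table range below are illustrative assumptions (the coarse C_R = 1 used here gives a larger error than the tables in our implementation):

```python
import numpy as np

# Piecewise linear approximation of exp() (equation (14)): for x in
# [C_R*m, C_R*(m+1)), exp(x) ~= G[m]*x + I[m]. The argument of exp()
# is non-positive here, so the tables cover a negative range.
C_R, M_MIN = 1, -16
G, I = {}, {}
for m in range(M_MIN, 0):
    x0, x1 = C_R * m, C_R * (m + 1)
    G[m] = (np.exp(x1) - np.exp(x0)) / (x1 - x0)  # gradient over the interval
    I[m] = np.exp(x0) - G[m] * x0                 # y-intercept

def exp_approx(x):
    m = min(max(int(np.floor(x / C_R)), M_MIN), -1)  # address generation
    return G[m] * x + I[m]

assert abs(exp_approx(-3.2) - np.exp(-3.2)) < 0.01
```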

B. TWO DATA MAPPING APPROACHES
Storing all support vectors and all required pixels for the calculations on an FPGA is unfeasible, because a huge FPGA would be necessary just for the on-chip memory. However, to achieve the target processing speed, the support vectors or the pixels must be cached on the FPGA. We consider the following two implementations.
1) All support vectors are held on the FPGA. The pixels are stored in the off-chip memory banks once, and the required pixels are fetched while the N_sv support vectors are being applied to the current target pixel, P at a time.
2) The pixels from the camera are cached on the FPGA ((2S + 1) + 1 lines in total), and while P support vectors are being applied to the pixels on the target line, the next P support vectors are fetched from the off-chip memory.

Fig.5 shows a block diagram of the first implementation. In Fig.5, the pixels are processed one by one, and P support vectors are applied to each pixel in parallel using P f-units. Therefore, N_sv/P clock cycles are required to process one pixel. Each f-unit has its own on-chip memory bank, in which N_sv/P sets of SV_i, Σ_{j=1}^{d} SV_ij², and α_i are stored. In this implementation, while the current pixel is being processed, x̄ of the next pixel must be calculated by fetching the data of the (2S + 1)² pixels around the next pixel from the off-chip memory. These (2S + 1)² pixels are added sequentially, because N_sv/P is larger than (2S + 1)² at the target processing speed. This data fetching overhead can be reduced by keeping the data of the (2S + 1) × 2S pixels used for the previous pixel, and fetching only the (2S + 1) pixels in the next column for the next pixel. This caching approach provides sufficient time to fetch pixels from the off-chip memory, and to store the pixel data from the hyperspectral camera into the off-chip memory under the control of the FPGA.

Fig.6 shows a block diagram of the second implementation. In the second implementation, the support vectors are applied to the W pixels on the target line as follows.
a) First, the pixel data are given directly from the camera. In Fig.6, (2S + 1) lines are cached in the line buffers. Suppose that their center is the ith line.
b) Simultaneously, P sets of SV_i, Σ_{j=1}^{d} SV_ij², and α_i are fetched onto the FPGA from the off-chip memory banks.
c) The line buffers are scanned from left to right. In each clock cycle, i) the (2S + 1) pixels of the next column are read from the line buffers, and x̄ and x̄² of the center pixel are calculated using the (2S + 1) × 2S pixels already read and held in the register array, and ii) P support vectors are applied to the pixel in parallel (these calculations are pipelined). Here, x̄ and x̄² must be calculated every clock cycle. The same P support vectors are applied continuously to all W pixels on the ith line over W clock cycles, and the intermediate results are stored in the score buffer. This score buffer stores the partial sum of equation (7). Another buffer (the voting buffer in Fig.6) of size W × M (the number of categories) is also required to store the voting results for each category.
d) Meanwhile, during the W clock cycles required for Step c), the next P sets are fetched from the off-chip memory and stored in registers on the FPGA. P ≪ W, so P support vectors can easily be fetched from the off-chip memory in time.
e) Then, for the new P sets, Steps c) and d) are repeated.
f) When all support vectors have been applied to all pixels on the ith line, the category of each pixel is determined.
g) During Steps c) to f), the pixels of the (i + S + 1)th line are given to the FPGA from the camera and stored in another line buffer, which means that (2S + 1) + 1 line buffers are required in total.
h) Steps c) to g) are repeated until all lines are processed.

This implementation requires additional buffers on the FPGA, but enables us to achieve the target processing speed with less memory throughput from the off-chip memory banks. A sketch of the incremental window-average computation in Step c) is shown below.
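The following sketch models Step c) in Python: per-column sums stand in for the register array, and each step reads only the entering column and subtracts the leaving one (the array shapes are illustrative):

```python
import numpy as np

def scan_line(lines, S):
    # lines: (2S+1, W, d) array holding the cached lines centered on line i.
    # Yields x_bar for each pixel on line i, reusing the running sum of the
    # (2S+1) x 2S pixels already read, as in Step c).
    _, W, d = lines.shape
    win = 2 * S + 1
    col_sums = lines.sum(axis=0, dtype=np.int64)   # (W, d): per-column sums
    acc = col_sums[:win].sum(axis=0)               # first full window
    for x in range(S, W - S):
        yield acc / (win * win)                    # x_bar of the center pixel
        if x + S + 1 < W:
            acc = acc + col_sums[x + S + 1] - col_sums[x - S]
```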

VI. EXPERIMENTAL RESULTS
We evaluated the two implementations shown in Figs.5 and 6 on a Xilinx Kintex-7 XC7K355TFFG901-1, which has 356,160 logic cells, 1,440 DSP slices, and 25,740 Kb of memory (715 BRAMs). This FPGA was chosen because it is a typical moderately sized FPGA and is sufficiently large for all our implementations. The implementations are written in Verilog HDL (high-level synthesis tools are not used, to achieve higher performance with less hardware), and it is assumed that a DDR3 memory bank is attached to the FPGA. In both implementations, a) the data width is reduced to 8b as described in Section IV-B, b) the number of categories, M, is 16, c) the total number of support vectors, N_sv, is 1226, d) the target image width, W, is 2048, e) the window size, (2S + 1), is 5, and f) the operational frequency is 200MHz, to access the DRAM easily. In the experiment with the 'Indian Pines' and 'Salinas' datasets in Section IV-B, N_sv is 1226, and this size can be considered sufficiently large for classifying all pixels in the images into 16 categories. The data transfer speed of the target camera is 200 lines/s when the image width is 2048 pixels and the number of bands of each pixel is 200 [19]. This means that 488 clock cycles can be used for processing one pixel; that is, about 2.5 support vectors must be processed in parallel.
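The throughput budget follows directly from these figures, as the following back-of-the-envelope check shows:

```python
# Throughput budget for the target camera (an arithmetic check of the
# figures above, not new measurements).
freq_hz = 200e6                 # operational frequency
lines_per_s, width = 200, 2048  # camera line rate and image width
n_sv = 1226                     # total number of support vectors

cycles_per_pixel = freq_hz / (lines_per_s * width)  # ~488 clock cycles
print(cycles_per_pixel)                             # 488.28...
print(n_sv / cycles_per_pixel)                      # ~2.51 support vectors/cycle
```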
In the following, we evaluate the circuit size, processing speed, power consumption, and required off-chip memory bandwidth.

A. CIRCUIT SIZE
Table 3 shows the hardware usage when P = 2, 4, and 8. P = 4 is fast enough for real-time processing, but P = 2 is slightly slower than the target. The number of LUTs and registers in the first implementation is almost proportional to P, because most LUTs and registers are used for the P f-units. In the second implementation, however, it is not proportional: x̄ must be calculated every clock cycle (in the first implementation, it is calculated sequentially over (2S + 1)² clock cycles), and the size of this module (approximately 20 KLUTs) does not depend on P. Primarily because of this module, more hardware resources are required than in the first implementation.

The number of required DSP slices is the same in the two implementations, and it is proportional to P because the DSP slices are used only in the f-units.
The first implementation requires fewer block RAMs. In the first implementation, all block RAMs are used for the memory banks attached to the f-units. Their width is 1635b as shown in Fig.3, and the depth of each bank is N_sv/P. When P = 2, block RAMs configured as 36b × 1024 are used to store N_sv/2 = 613 support vectors. For one bank, ⌈1635/36⌉ = 46 block RAMs are used, which becomes 92 in total. When P = 4, the block RAMs are configured as 72b × 512 to store N_sv/4 = 307 support vectors. For one bank, ⌈1635/72⌉ = 23 block RAMs are used, which becomes 92 in total. When P = 8, we also use block RAMs configured as 72b × 512 because this is the shallowest configuration. In this case, 23 block RAMs are required for one bank and 184 in total, but only 30% of the memory is used.
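These counts follow from ceiling division of the 1635b entry width by the configured block RAM width:

```python
# Arithmetic behind the block RAM counts of the first implementation.
SV_WIDTH = 1635  # bits per support vector entry (SV_i, sum of SV_ij^2, alpha_i)

for P, cfg_width in [(2, 36), (4, 72), (8, 72)]:
    per_bank = -(-SV_WIDTH // cfg_width)  # ceiling division
    print(f"P={P}: {per_bank} BRAMs/bank, {per_bank * P} total")
# P=2: 46 BRAMs/bank, 92 total
# P=4: 23 BRAMs/bank, 92 total
# P=8: 23 BRAMs/bank, 184 total
```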
In the second implementation, block RAMs are mainly used for the (2S + 1) + 1 line buffers. Each pixel has 200 bands, and the data width is 8b. Using 89 block RAMs configured as 18b × 2048 (total data width 18b × 89 = 1602b > 1600b), one line buffer can be constructed. The total number of block RAMs for the 6 line buffers is 534, and it becomes 543 in total including the score and voting buffers. The number of block RAMs does not depend on P. Using a larger FPGA with more block RAMs, all support vectors could be stored in the on-chip memory as in the first implementation; however, in this case, only the usage of block RAMs becomes high, and the large FPGA cannot be used efficiently.

B. PROCESSING SPEED
Table 4 compares our processing speed with those of [15] and [17]. To the best of our knowledge, no other performance figures for FPGA systems for hyperspectral image classification have been reported. Reference [15] was implemented on an Altera Stratix V 5SGSMD8N2F45C2 FPGA on a Maxeler MAX4 DFE, and [17] was implemented on a Xilinx Zynq ZC706 board that consists of a Xilinx XC7Z045 FPGA with a dual-core ARM Cortex-A9 processor and 1 GB of DDR3 memory. The hardware platforms are different, and a fair comparison with our system is not straightforward. To compare the processing speed, Mpixels/s (the number of pixels processed in one second) is normalized by the number of logic cells (although the functions supported by logic cells differ between Xilinx and Altera FPGAs) and the operational frequency.

In Table 4, the fastest in Mpixels/s is ours with P = 8. When normalized by the number of LUTs and logic cells, our implementations are faster in all cases. This comparison favors [15], because it is normalized only by the number of LUTs, not considering the number of DSP slices or block RAMs. When further normalized by the operational frequency, our second implementation with P = 2 is slower than [15]. This is because the operational frequency of [15] (120MHz) is lower than ours (200MHz). In general, a higher operational frequency leads to a higher power consumption. Next, we evaluate the power consumption and the processing speed normalized by the power consumption.

C. POWER CONSUMPTION
Table 5 compares the power consumption and the processing speed normalized by the power consumption. In this table, 2'nd is the power-reduced version of the second implementation, and its details are described below. As shown in this table, our implementations are faster than those in previous studies in all cases when normalized by the power consumption. This efficiency in power consumption is a great advantage when the system is mounted on satellites. The power consumption of all our implementations is lower than that in [15], although our operational frequency is higher. This is because, in our system, the algorithm is simplified and modified for efficient parallel processing with less hardware. For a larger P, our power consumption becomes larger than that in [17], but the low power consumption of [17] is caused by its much slower processing speed, as shown in Table 4.

In 2'nd in Table 5, a technique to reduce power consumption is used, although it requires more computation time. In Fig.8, x̄ is calculated sequentially over (2S + 1)² clock cycles as in the first implementation. For this purpose, the registers that hold the P sets of SV_i, Σ_{j=1}^{d} SV_ij², and α_i for the current and next pixels in Fig.6 are replaced by P buffers, each holding two sets of (2S + 1)² support vectors. For each pixel, the (2S + 1)² sets are applied sequentially, and during this period, x̄ of the next pixel is calculated sequentially, and the next P sets are fetched from the off-chip memory. This approach requires dual-port buffers, but on Xilinx FPGAs, these shallow buffers can be efficiently realized using LUTs. However, by applying P × (2S + 1)² support vectors during one scan, the number of scans is reduced from N_sv/P to N_sv/P/(2S + 1)², and splitting loss occurs. The power consumption drops from 9.74W to 7.29W, but the processing speed becomes 5.5% slower when P = 4.
As shown in Table 5, 2'nd shows a faster processing speed than 2nd when normalized by the power consumption, but it requires a considerable number of shallow dual-port buffers, which cannot be realized efficiently on all types of FPGAs. In the following discussion, the second implementation is used without this power reduction technique.

D. OFF-CHIP MEMORY BANDWIDTH
The required off-chip memory bandwidth is an important factor in evaluating the usability of the system. The first implementation requires N_sv/P clock cycles to process one pixel, and during this period, 2S + 1 pixels have to be fetched from the off-chip memory to calculate x̄, and a new pixel from the camera must be stored. The required memory bandwidth therefore becomes

  (P/N_sv) · ((2S + 1) + 1) · 200 B/clock cycle,

ignoring the irregular computations when the target pixel moves to the next line. In the second implementation, the line buffers, whose width is W, are scanned N_sv/P times. During each scan, the next P support vectors are fetched from the off-chip memory bank. The new data from the camera can be given directly to the circuit. Thus, the required bandwidth is

  (P/W) · 1635 b/clock cycle.

This bandwidth does not depend on N_sv. Table 6 shows the required memory bandwidth. As shown in this table, the first implementation requires faster off-chip memory such as DRAMs, but the required memory bandwidth of the second implementation is low. Furthermore, no write operation to the off-chip memory is necessary during the computation. This means that any type of memory can be used in the second implementation.
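Evaluating these expressions for the configuration of Section VI gives a feel for the gap between the two implementations (the printed values are derived from the formulas above, not quoted from Table 6):

```python
# Bandwidth expressions evaluated for N_sv = 1226, W = 2048, S = 2,
# 200 bands of 8b per pixel, at 200 MHz.
n_sv, W, S, freq = 1226, 2048, 2, 200e6

for P in (2, 4, 8):
    bw1 = P / n_sv * ((2 * S + 1) + 1) * 200      # bytes/cycle, 1st implementation
    bw2 = P / W * 1635 / 8                        # bytes/cycle, 2nd implementation
    print(P, bw1 * freq / 1e6, bw2 * freq / 1e6)  # MB/s
```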

E. COMPARISON WITH OTHER PLATFORMS
Here, we compare our first implementation with other hardware platforms: Jetson Nano [20], one of the smallest embedded GPUs, and Myriad 2 [21] and Myriad X [22], dedicated low-power processors for image processing. The second implementation requires x̄, the normalized average of the 200 bands of the (2S + 1)² pixels centered at x, to be calculated every clock cycle, which cannot be done efficiently on these platforms based on SIMD instructions. In our first implementation, the following three approaches are used to achieve high performance: a) P × 200 multiplications (8b × 8b → 16b) are executed in parallel every clock cycle, and their results are added using pipelined adder trees, b) each multiplier is supplied with new data every clock cycle to prevent it from becoming idle, and c) other values such as poly() and rbf() in Fig.3, and the voting in Fig.5, are calculated every clock cycle in parallel with the multiplications and additions.
When P = 2, the number of multiplications and the total number of arithmetic operations are 80.0 and 197.2 GOPS (giga operations per second), as shown in Table 7. Jetson Nano has 128 cores that run at 921.6MHz. Since various calculation methods are possible on this GPU, we use the total performance for comparison. Its peak performance is 471.9 GOPS when half-precision floating point (FP16) is used, and its power consumption is 5-10W. This peak performance is 2.4 times our sustained total performance, and the power consumption is 2 to 3.5 times ours, as shown in Table 7, which means that when the peak performance is sustained, its performance normalized by power consumption is about the same as ours. On Jetson Nano, however, it is not possible to supply a sufficient amount of data to the cores, since its shared memory (64KB), a small but fast on-chip memory like a primary cache, is not large enough to store all support vectors (244.7KB) and the required pixel data. Furthermore, the values in c) cannot be calculated simultaneously. Considering the execution time of memory access operations to its global memory and the calculation time of these values, its actual normalized performance can be considered lower than ours.
In [12], the performance of Myriad 2 in a cloud detection system is compared with that of an FPGA; however, the computational complexity of that system is considerably lower than our hyperspectral image classification, and the results cannot be compared directly. Myriad 2 has 12 VLIW 128-bit vector units, called SHAVEs, that run at 600MHz, and its peak performance for FP16 multiplications is 57.6 GOPS, because a single SHAVE can perform eight multiplications. Here, we consider applying one support vector consisting of 200 bands to x̄ sequentially in each SHAVE, to avoid idle time due to data transfer between SHAVEs. Myriad 2 has hardware support for pipelined addition of multiplication results, and its on-chip memory size and maximum throughput in the specification are sufficient to achieve b). Thus, at maximum performance, the multiplications and additions can be performed in 25 (200/8) clock cycles, ignoring delay. However, as on Jetson Nano, the values in c) cannot be calculated in parallel with the multiplications and additions. To calculate these values, at least 20 clock cycles are needed, as can be seen from the flowcharts in Fig.3 and Fig.5 (these calculations cannot be accelerated by SIMD processing). Then, its maximum multiplication performance degrades to 32.0 GOPS. The power consumption of Myriad 2 reported in [12] is 1.8W. Given these facts, we can consider that our processing speed normalized by power consumption is faster than that of the Myriad 2 reported in [12]. Myriad X [22] is the successor to Myriad 2. Its peak multiplication performance is 89.6 GOPS, and the estimated maximum performance is 49.8 GOPS. If its power consumption is similar to Myriad 2, we can consider our implementation slightly faster for the same reasons.

VII. DISCUSSION
The two implementations evaluated thus far are typical: fewer hardware resources but higher off-chip memory bandwidth, and less bandwidth but more hardware resources. The required off-chip memory bandwidth in our implementations is considerably lower than the maximum bandwidth of DDR3 memory banks (the theoretical maximum is 12.8GB/s, and the effective bandwidth is 10-12GB/s when the operational frequency of the FPGA is 200MHz, though it depends on the memory interface module), as discussed below. However, when we consider building dedicated hardware, it becomes possible to reduce the data width of the DDR3 memory banks, namely the number of DDR3 chips, or to use other types of memory such as USB memory, which leads to a smaller system with lower power consumption.
Here, we consider two more implementations, which aim to reduce the number of block RAMs in the second implementation. To reduce the number of block RAMs used for the line buffers, two approaches are considered: a) using partial sums along the y axis, and b) changing the scan order.
In Fig.7, three line-width buffers are used instead of (2S + 1) + 1 line buffers. The first and third buffers are line buffers, and the second is a buffer that stores partial sums along the y axis. Before starting the calculations of the ith line, for each column, the pixels on the (i + S)th line (held in the first buffer) are added to the partial sums, and the pixels on the (i − S − 1)th line (held in the third buffer) are subtracted from them. This scan requires W clock cycles. During the following N_sv/P/(2S + 1)² scans, which take N_sv/P · W clock cycles, (1) the pixels on the (i + S + 1)th line from the camera are stored in the first buffer and the off-chip memory, and (2) the pixels on the (i − S)th line are fetched from the off-chip memory and stored in the third buffer. The overhead of the additional scan is W/(N_sv/P · W) = P/N_sv, which is small enough: 0.3% when P = 4. A sketch of this partial-sum update is shown below.
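The per-column partial-sum update can be sketched as follows (a minimal model, assuming the entering and leaving lines are available in the first and third buffers as described above):

```python
import numpy as np

def advance_partial_sums(col_sums, line_in, line_out):
    # col_sums: (W, d) running sums over the 2S+1 lines centered on line i.
    # When advancing to line i: add the entering (i+S)th line (first buffer)
    # and subtract the leaving (i-S-1)th line (third buffer).
    col_sums += line_in.astype(np.int64)
    col_sums -= line_out.astype(np.int64)
    return col_sums
```

From these column sums, x̄ of each pixel is then obtained by summing 2S + 1 consecutive columns and dividing by (2S + 1)², as in the sliding-window sketch of Section V.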
In Fig.9, all lines from the camera are stored in the off-chip memory once. Then, the lines are separated every L lines, and the L lines are divided into D blocks. The blocks are processed from left to right. The pixels on the top line in block-0 are processed first, and then the pixels on the next line are processed. After all pixels in block-0 are processed, the pixels in block-1 are processed in the same manner. The next L lines are stored while the D blocks are being processed. With this change of scan order, the depth of the line buffers in Fig.6 can be reduced to W/D. More precisely, it becomes W/D + 2S, but the 2S columns can be held in the register array. For a larger D, the buffer size can be smaller, but more off-chip memory bandwidth becomes necessary. To process the pixels of L lines, the pixels of S + L + S lines are required, which means that before starting the calculations of the pixels in each block, 2S line buffers must be filled (one line buffer can be filled while processing the previous block). No calculations can be executed during this period. By using a larger L, this overhead, 2S/(L + 2S), becomes relatively smaller, but the outputs of the system are delayed more.

Table 8 shows the computation losses of these two implementations (called the third and fourth implementations) compared with the first and second. The goal is real-time processing, and as long as the processing speed is faster than the data transfer speed from the camera, these losses are acceptable. The losses shown in this table are acceptable in most cases. Table 9 shows the circuit size of the third and fourth implementations when P = 4, as well as the second implementation on which they are based. The circuit size of the third and fourth implementations is almost the same as that of the second implementation, except that fewer block RAMs are used, as we intended. The power consumption of these implementations is less than that of the second implementation, and the lower-performance implementation consumes less power.

In the graph of the required off-chip memory bandwidth (#BW) and the number of block RAMs (#BRAMs) versus P, three lines are shown for the first implementation. Two of them are for when N_sv is ×1 and ×2, and the other, horizontal, line is for when the processing speed is maintained by using ×2, ×3, and ×4 f-units when N_sv becomes ×2, ×3, and ×4, respectively. The #BW and #BRAMs of the other three implementations do not depend on N_sv. Therefore, in these implementations, a larger P does not necessarily mean a higher processing speed.
In the first implementation, the #BRAMs is proportional to P because the depth of the block RAMs used in this case is 512 and cannot be shallower. The #BW is also proportional to P, and its gradient is steeper than those of the other three implementations, which means that a higher #BW becomes necessary for larger P than for the others. When N_sv is doubled (N_sv × 2), the #BW is halved because the processing time of each pixel doubles. The third line is plotted from another perspective: when the processing speed is maintained as described above, the #BW is kept constant, although the #BRAMs still increases proportionally to P.
The #BW and #BRAMs of the other three implementations do not depend on N_sv, that is, on the processing speed. For a larger or smaller N_sv, the processing speed becomes slower or faster, but the #BW and #BRAMs remain as shown in the graph. The #BW of the second implementation is the lowest, but its #BRAMs is the highest. The third implementation requires a higher #BW (still much less than the first implementation), but its #BRAMs can be reduced to almost half. The fourth implementation enables a further reduction of the #BRAMs, particularly when D = 4. However, it requires a higher #BW than the second and third implementations, and the gradient is steeper (but gentler than that of the first implementation).
If the available off-chip bandwidth is considerably low, the second implementation is the top candidate, although it requires more block RAMs. It also requires more logic cells, but the difference becomes smaller for a larger P, at a higher power consumption. The #BW of the third implementation is the second lowest, and its #BRAMs is the second highest. For a large P, this implementation can also be a candidate. If faster off-chip memory is available, the first implementation seems to be the best. The #BRAMs of the fourth implementation is small. This implementation becomes the top candidate if a higher processing speed is required under a very limited #BRAMs.

Here, we consider realizing our classification system on the space-grade FPGA Virtex-5QV, which has 130K LUTs and 298 block RAMs. It has no DSP slices, but in our implementations, only 8 DSP slices are used in each f-unit, and they can be realized using LUTs. A total of 130K LUTs allows up to P = 6 (a large N_sv is not allowed), which is much faster than real-time processing. However, because of the limitation on the #BRAMs, only the first and fourth implementations are available on this FPGA. The #BW of the fourth implementation with D = 2 is the smallest, but it uses almost all block RAMs. When D = 4, it requires a lower #BW and fewer #BRAMs than the first implementation (except for P = 4), but its processing speed is slightly slower.
With the latest space-grade FPGA, Kintex-XQ, all implementations become available. This FPGA has 331K LUTs and 1080 block RAMs, which allows up to 16 f-units. The superiority of the four implementations is mainly decided by the #BW for large P. The #BW of the second implementation can be further reduced by using more BRAMs. Using (2S + m) + m line buffers (and m score and voting buffers), the cached P or P × (2S + 1)² support vectors can be applied continuously to the pixels on m lines, and during this period, the next m lines from the camera are read into the m line buffers. With 1080 block RAMs, m = 2 or 3 is possible, which means that the #BW can be reduced to 1/2 or 1/3 (approximately 100MB/s when m = 3 and P = 16).
The two implementations from [15] and [17] cannot be realized on Virtex-5QV because of their circuit size (particularly the high usage of DSPs). Reference [17] can be implemented on Kintex-XQ, but its processing speed is not fast enough for real-time applications. Reference [15] is fast enough but requires more on-chip memory banks than Kintex-XQ provides.

VIII. CONCLUSION
In this paper, we proposed a hyperspectral image classification system on an FPGA. In our system, an algorithm based on a composite kernel SVM is used. Its accuracy is comparable with other approaches, and it is feasible to implement on an FPGA because of its simplicity. Two implementations, one with fewer hardware resources but more off-chip memory bandwidth, and another with more hardware resources but less off-chip memory bandwidth, were evaluated. Two additional implementations, which, based on the evaluation results, stand between the other two, were also introduced, and we discussed which implementation is advantageous under what conditions. All implementations are fast enough for real-time processing, and the best implementation can be determined according to the number of support vectors, the available off-chip memory bandwidth, and the size of the on-chip memory.
KENTO TAJIRI received the master's degree in engineering from the University of Tsukuba, in 2017, where he is currently pursuing the Ph.D. degree with the Graduate School of Systems and Information Engineering. His research interest includes reconfigurable parallel computing systems.
TSUTOMU MARUYAMA (Member, IEEE) received the Ph.D. degree in engineering from The University of Tokyo, in 1987. He is currently a Professor with the Graduate School of Systems and Information Engineering, University of Tsukuba. His research interest includes reconfigurable parallel computing systems.