Interpretable Deep Learning Framework for Land Use and Land Cover Classification in Remote Sensing Using SHAP

An interpretable deep learning framework for land use and land cover (LULC) classification in remote sensing using Shapley additive explanations (SHAPs) is introduced. It utilizes a compact convolutional neural network (CNN) model for the classification of satellite images and then feeds the results to a SHAP deep explainer so as to strengthen the classification results. The proposed framework is applied to Sentinel-2 satellite images containing 27000 images of pixel size $64 \times 64$ and operates on three-band combinations, reducing the model’s input data by 77% considering that 13 channels are available, while at the same time investigating on how different spectrum bands affect predictions on the dataset’s classes. Experimental results on the EuroSAT dataset demonstrate the CNN’s accurate classification with an overall accuracy of 94.72 %, whereas the classification accuracy on three-band combinations on each of the dataset’s classes highlights its improvement when compared to standard approaches with larger number of trainable parameters. The SHAP explainable results of the proposed framework shield the network’s predictions by showing correlation values that are relevant to the predicted class, thereby improving the classifications occurring in urban and rural areas with different land uses in the same scene.


I. INTRODUCTION
K NOWLEDGE of land use and land cover (LULC) is important for the conceptual design of infrastructure projects in urban and rural areas [1]. The acquisition of such knowledge can be difficult, due to the complexity of urban/rural areas; in remote sensing and specifically in highresolution Sentinel-2 satellite images, one pixel corresponds to 10 m on ground, meaning that a very small image, e.g., of size 64 × 64, covers a huge area, which is approximately 42 km 2 . Therefore, the ground sampling distance in such images may contain many land uses, for instance, crops with Manuscript received 16 [2], as wide areas of interest are investigated and analyzed with a "birds-eye-view." Geographic information systems (GISs) provide robust solutions for annotating fast and accurate LULC from EO data [3]. To improve their annotation, deep neural networks (DNNs) with an emphasis on convolutional neural networks (CNNs) for image classification are considered [4]. They are compelling for object detection in remote sensing data, covering several applications, including building extraction [5], deforestation [6], land cover change [7], and others.
In remote sensing, data can be very complex, as different objects belonging to the same category appear in the same scene [8], for instance, permanent crops, herbaceous vegetation, and forest areas, all belong to the vegetation category. DNNs, on the other hand, classify the output image, without further interpreting the results corresponding to a scene [9], [10]. Therefore, the establishment of techniques for black-box procedures to be more transparent and understandable has critical importance in remote sensing.
Explainable AI (XAI) is a technique for interpreting machine learning algorithms and DNNs models, making them more understandable to humans [10], [11]. Focusing on remote sensing data, XAI methods on images, DeepLIFT [12] and Grad-CAM [13], as well as on both images and features, local interpretable model-agnostic explanation (LIME) [14], have been used in the literature to provide insight into how a model is making decisions about LULC classification. In detail, in [9], a CNN is applied to SEN12MS [15] and to BigEarthNet [16] datasets along with selected XAI methods, such as LIME [14], DeepLIFT [12], Grad-CAM [13], Guided Grad-CAM [13], and others in [9] to show the correlations among the classes. In [17], a CNN is applied to EuroSAT [18], while LIME [14] is used to extract the correlations. However, the experimental results in [17] are limited to red green blue (RGB), and in addition, the CNN is trained using these channels only.
All the above methods discussed are constrained to local explanations, meaning that they may not be able to correctly capture information existing in the whole dataset used. Furthermore, the explanations in some cases are limited to specific This work proposes an XAI framework for remote sensing data utilizing Shapley additive explanations (SHAPs) [19]. Compared with the existing XAI approaches, the use of SHAP enables both local and global explanations, allowing for information between different spectral bands in a dataset to contribute toward the explanations. Compared with the existing approaches being limited to RGB channels, the proposed approach considers different band combinations for the classification and the explanation of their results so as to highlight the interference of information from different wavelengths of the spectrum affecting the classes. This improves both the classification accuracy of each individual class and the explanations' interpretability, leading to a better estimation of the channels' contribution to the final prediction.

II. PROPOSED DEEP SHAP FRAMEWORK
The high-level model capturing the operation of the proposed deep SHAP framework is shown in Fig. 1. Initially, a deep CNN is trained using a dataset containing LULC images, and once trained, it classifies any multichannel image in one of its classes. The classification's result along with the image are then fed to the SHAP explainer, which outputs the pixels' positive or negative correlation for each one of the K existing classes. The positive or negative correlation corresponds to what extend each feature contributes to each class, allowing for a better interpretation and understanding of the following: 1) many different objects existing within an input image and 2) which image spectral band combinations responded better in the LULC classification. In Sections II-A and II-B, the CNN architecture and the deep SHAP model used are explained.

A. CNN Architecture
The CNN architecture utilized by the proposed deep SHAP framework is illustrated in Fig. 2. The input image is of size k × l × c, where k and l are the number of rows and columns, respectively, and c is the number of the image's channels. It consists of five convolution-convolution-max-pooling layers connected sequentially, where each one downsamples the input image by increasing powers of 2, having the number of filters doubled in each layer. The convolution-convolutionmax-pooling layers are then followed by three fully connected (FC) layers, where each one has 512, 256, 128 neurons, respectively. The activation function used here in all layers is the Gaussian error linear units (GeLUs) [20], defined as follows: where (y) is the standard normal cumulative distribution function, Y ∼ N (0, 1), and y is the input to the activation function. The difference of GeLU over the rectifier linear unit (ReLU) originates from the stochasticity of the former; GeLU samples from the standard normal distribution according to (1), thus introducing a natural dropout regularization, which is not feasible with ReLU [20].

B. Deep SHAPs
The SHAP is a method for interpreting machine learning models, introduced in [19]. It maps inputs, x ′ , to the original ones, x, through a mapping function x = h x (x ′ ) so as to explain a prediction f (x). The simplified inputs allow for the interpretable model to ensure that for any feature z ′ ∈ R and whenever z ′ ≈ x ′ , then g(z ′ ) ≈ f (h x (z ′ )). The SHAP's additive feature attribution method is based on a linear function of binary values as follows: where N is the number of input features, n = 1, 2, . . . , N is the feature index, the values of φ n ∈ R are the model's coefficients and the values of z ′ n ∈ {0, 1} N denote the observation of a feature. Note that each z ′ n refers to a feature of z ′ . To explain the derivation of the coefficients φ n , we proceed with some definitions. Let N S be a feature subset, such that N S ⊆ N , where N is the set of all features with cardinality |N |. Assuming that x N S represents the values of the input features existing in N S and n is a feature, the model f x (N S ) used for the calculation of the SHAP values is defined as The model is trained two times; one including the feature n, i.e., f N S ∪{n} (x N S ∪{n} ), and one excluding it, i.e., f N S (x N S ). Predictions are then derived from their comparison f N S ∪{n} (x N S ∪{n} ) − f N S (x N S ), while the procedure is repeated for every possible subset, such that N S ⊆ N . Combining the above equations, the coefficients φ n from (2) are derived as follows: (3) Note that unique solutions within the class of additive feature attribution methods exist if and only if the following three key properties are satisfied [19]: 1) local accuracy; 2) missingness; and 3) consistency.
The deep SHAP framework of Fig. 1 combines the Shapley values calculated using (2), (3), and the DeepLIFT method. DeepLIFT is a compositional approximation of the Shapley values under the assumptions that the following hold: 1) the deep model is linear and 2) the input features are uncorrelated to one another. It is an additive feature attribution method that satisfies local accuracy and missingness, two of the three key properties for additive feature importance. Therefore, with the inclusion of the Shapley values, the consistency of the model is achieved [19].

A. Experimental Setup
The performance of the proposed framework is evaluated using the EuroSAT dataset proposed in [18]. EuroSAT contains LULC images taken from the Sentinel-2 satellite, covering 13 spectral bands and consisting of ten classes in total with 27 000 labeled and geo-referenced images. Out of the 13 spectral bands, we consider only the use of red, green, near infrared (NIR-Band 8), short-wave infrared (SWIR-Band 11), and the remote sensing normalized difference indexes stemming from them, including vegetation index (NDVI), buildup index (NDBI), and water index (NDWI).
In the experiments, the following different three-band combinations of the above selected bands are used, which are the following: 1) SWIR-NIR-RED; 2) NIR-RED-GREEN; and 3) NDBI-NDVI-NDWI. These combinations capture information existing in wavelengths that are able to identify vegetation, water bodies, soil, and man-made constructions, as their percentage reflectance is higher compared with other bands [18].
All the experiments are conducted using Google Colab Pro, Python 3, and TensorFlow. With respect to the training phase, the dataset is split into 70/10/20 train/validation/test sets, while the network is trained for approximately 70 epochs, using an early stopping criterion with patience of ten epochs, monitored using the validation loss. The batch size used is 64, the learning rate is 10 −3 and the seed value is 42. Moreover, the selected optimizer is layer-wise adaptive moments optimizer for batch training (LAMB) proposed in [27]. Compared with adaptive moment estimation (ADAM), it applies adaptive elementwise updating and layerwise learning rates on large batches of input data, hence speeding-up the training when large datasets are used.

B. Experimental Results
To compare the performance of the proposed framework, we consider several deep NN architectures applied on the EuroSAT dataset, including the following: 1) a Shallow CNN [21]; 2) GoogleNet [22]; 3) DenseNet121 [23]; 4) Inception V3 [24]; 5) ResNet50 [25]; 6) ResNet101 [25]; 7) VGG16 [26]; and 8) GeoSystemNet [21]. Note that in [21], a fusion of the initial EuroSAT dataset along with different-scaled imaged derived from MapBox application programming interface (API) [21] is used. In the comparisons, we use the following standard classification metrics: where TP, TN, FP, and FN denote, respectively, the true positive, true negative, false positive, and false negative values. The results of the precision, recall, and F1 score are cited in Table I, whereas the accuracy is cited in Table II. Note that the accuracy reported in Table II is calculated using all 13 bands so as to have a fair comparison between the works. It should be noted that among the classification metrics in (4), precision and recall are the most important ones in the LULC classification; precision reflects a model's ability in identifying correctly a single object among many, thereby strengthening its reliability, whereas recall measures a model's ability in identifying correctly a single object regardless of the rest, thereby strengthening its effectiveness. On the other hand, accuracy measures the model's overall performance without considering that many classes have similar spectral signatures; for instance, permanent crop and forests are different, but belong to the vegetation category.
According to Table I, the proposed framework yields the highest precision value in all classes, except from pasture in which GeoSystemNet is better. The recall values follow similar behavior to those of the precision, with the main difference being the forest class in which GeoSystemNet has lower value. With respect to the F1 score, it is observed that our framework results in the highest values except from the herbaceous and permanent crop, which is expected, since it is the harmonic mean of the precision and recall, hence affected by them. Yet, considering all the 13 bands, the classification accuracy can be improved. From the results cited in Table I, one can conclude that reducing the number of channel bands improves the classification accuracy of each class separately.
From Table I, it can be seen that the use of three channel combinations greatly improves the classification of each class separately. However, the classification accuracy reported in Table II for the proposed framework is reduced compared with the other models, which is reasonable given the number of its trainable parameters. It should be mentioned though that the SHAP by itself is the computationally expensive XAI technique, as it calculates the Shapley values for various features in a prediction instance [28]. Therefore, despite the smaller number of trainable parameters of the CNN used by the proposed framework, it results in faster (in time) extraction of the explanations, as less features are fed to the SHAP explainer.

C. XAI Results
According to Fig. 1, the CNN model and an image to be classified are fed to the SHAP deep explainer for the derivation of the SHAP values. Once derived, the SHAP image plot tool is used to visualize the positive (red) and negative (blue) correlations in each pixel of each of its classes. An example case of annual crop, river, and highway using the three different band combinations is illustrated in Fig. 3.
In the NDBI-NDVI-NDWI case of Fig. 3, as expected, positive correlations exist in the annual crop class given the CNN's correct classification. In the second case, SWIR-NIR-RED, positive correlations exist in the river class, while negative correlations are denser in the annual crop class, implying that it is less likely for annual crop to exist within the image. Of important interest is the final case, NIR-RED-GREEN, in which the positive correlations are intense in the area where the highway is present. Apart from the local explanations SHAP image plot depicting the impact of each pixel on the models predictions. SHAP image plots and classification from top to bottom. First row: Annual crop using NDBI-NDVI-NDWI, Second row: river using SWIR-NIR-RED, and Third row: highway using NIR-RED-GREEN. The red pixels indicate strong correlation among the predicted classes, whereas the blue ones indicate weak correlation. shown in Fig. 3, the proposed framework can be used to extract global explanations, as shown in Fig. 4. It can be seen that the selected bands red, green NIR, and SWIR 1 result in the highest average SHAP values for almost all classes, meaning that they are critical in the explainability of the classification results. Note that the cirrus band (B10) is not included in Fig. 4, as it does not contain surface reflectance information [29].

IV. CONCLUSION
In this work, we presented a deep XAI framework based on SHAP applied on satellite images of Sentinel-2. Experimental results on different spectral band combinations of the EuroSAT dataset demonstrated that the proposed framework improves the classification accuracy of each individual class among the existing ones, also shown with comparisons to the existing CNN methods from the literature. The local explanation results verified that the model predictions were correctly derived, highlighting for each class which pixels had positive correlation, whereas the global explanation results showed the contribution of each individual band toward the explanations. Therefore, using the proposed framework, the end user can classify satellite images in an automatic and reliable way, as the introduced qualitative visual pattern assists the quantitative metric of the classification accuracy. This improves the concept of multilabel LULC classification, especially when multiple objects exist in the same scene.