A Vision Transformer Model for Convolution-Free Multilabel Classification of Satellite Imagery in Deforestation Monitoring

Understanding the dynamics of deforestation and land uses of neighboring areas is of vital importance for the design and development of appropriate forest conservation and management policies. In this article, we approach deforestation as a multilabel classification (MLC) problem in an endeavor to capture the various relevant land uses from satellite images. To this end, we propose a multilabel vision transformer model, ForestViT, which leverages the benefits of the self-attention mechanism, obviating any convolution operations involved in commonly used deep learning models utilized for deforestation detection. Experimental evaluation in open satellite imagery datasets yields promising results in the case of MLC, particularly for imbalanced classes, and indicates ForestViT's superiority compared with well-established convolutional structures (ResNet, VGG, DenseNet, and MobileNet neural networks). This superiority is more evident for minority classes.


I. INTRODUCTION
COMPREHENDING and monitoring deforestation and its implications may have an impact on climate change through greenhouse gas emissions reduction [7]. Various studies explore the causal factors of deforestation; it is usually driven by drought, forest fire, use of fuelwood, the spread of extensive agriculture, and rapid urbanization, among others. Converting forests into agricultural land, for example, emits huge amounts of greenhouse gases. Using sustainable forest and land management practices can instead help these ecosystems retain and store significant amounts of carbon. To this end, it is important to detect the areas most susceptible to future deforestation based on spatial variables that represent its regional drivers. The current state of environmental conservation is associated with neighboring land uses, which often act as driving forces of deforestation. Understanding the dynamics of these changes can assist in planning and assessing future conservation actions to prevent or mitigate adverse impacts. In this work, we formulate deforestation as a multilabel classification (MLC) problem, attempting to capture the various land uses related to deforestation. MLC implies that an instance may be associated with multiple labels. For example, a part of an image may depict a forestry region, while another part shows land that has been deforested due to logging. Thus, this image will be assigned to two classes simultaneously.
Remote sensing imagery has been instrumental in the process of keeping track of deforestation worldwide [7]. Furthermore, deep learning (DL) has recently attracted increasing attention from remote sensing researchers because of its ability to automatically extract features from the image dataset, high-level semantic segmentation, nonlinear problem modeling, and mapping in complex environments [7]. Convolutional neural networks (CNNs) have emerged as a particularly powerful tool in remote sensing applications. However, recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the use of convolutional layers as the primary building block. Beyond helping CNNs handle long-range dependencies, attention can completely replace convolution and achieve state-of-the-art performance on vision tasks [6]. Inspired by its capacity to learn meaningful interdependencies between words, researchers have recently considered using self-attention in vision tasks. Self-attention enhances nonlocal relationships across the image and learns complex connections not only between neighbors but also to the neighbors' neighbors (long-range dependencies), which can be beneficial for our MLC problem.
In this work, we show that the self-attention mechanism achieves competitive or better results compared with well-established CNN methods in deforestation monitoring, especially as regards the less frequent classes in the dataset. We propose, design, and train a vision transformer model to identify the driving forces of deforestation (agriculture, habitation, infrastructure, and other drivers) of primary forest loss using satellite imagery in the Amazon rainforest. This task is challenging to automate due to the heterogeneity of drivers within images and driver classes and the rapid evolution and changing of landscapes. Here, we propose a transformer network architecture for an MLC problem in satellite imagery data, which has the comparative advantage of modeling long-range spatial dependencies and is capable of handling imbalanced datasets.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
A. Related Work

1) Remote Sensing in Deforestation: Remote sensing has proven to be an effective tool for deforestation detection and monitoring. Landsat and MODIS imagery data have been widely used for deforestation detection, as indicated by various relevant studies. Due to their good resolution and global coverage, Landsat images are analyzed in the work of Hansen et al. [10], where maps of global forest change with a 30-m spatial resolution are presented; whereas ML techniques have been utilized in a later work of Hansen et al. [11] for the Landsat and MODIS data with the aim to map the forest cover. The work of Banskota et al. [2] presents a review of different approaches applied to the Landsat data for forest monitoring. The spatial and spectral resolutions offered by Landsat and MODIS are well-suited for deforestation analysis; hence, there are various studies dealing with deforestation that analyze datasets coming from the Landsat and MODIS satellites ([7], [14], [19]). With remote sensing, deforestation monitoring systems use a number of automatic methods based on change detection techniques [13], e.g., image processing and analysis methods [16], [30] or ML techniques [12], [35].
2) Deep Learning in Deforestation: DL [32] techniques are popular in remote sensing applications [28], [38]; as such, there are already a few studies related to the application of DL methods to the deforestation detection problem [25]. Most of these works are dedicated to the Amazonian rainforest, which is the most impacted forest in the world [1]. The works of [17] and [36] deal with deforestation detection and forest change monitoring using classification [27] or segmentation methods, as in [18]. The method described in [21], implements a CNN for deforestation detection on satellite images. The study of [37] analyzes satellite images of deforestation in the Amazon using neural networks. Ortega et al. [27] reviewed several DL methods (early fusion CNN and Siamese CNN models) for deforestation detection in Amazon forests. Irvin et al. [17] develop a DL model called ForestNet to classify the drivers of primary forest loss in Indonesia, using Landsat 8 satellite images. The study in [22] classified landscape affected by human-induced deforestation using high-resolution remote sensing and a U-Net network, while Isaienkov et al. [18] presents the results of high-frequency forest change monitoring within Ukraine using data from the Copernicus Sentinel-2 mission and a U-Net model trained on image difference and concatenation.
3) Transformer Models in Computer Vision: Transformers [31] have recently demonstrated very good performance in a wide range of time-dependent 1-D data applications, such as language tasks, e.g., text classification and machine translation. The most popular ones include bidirectional encoder representations from transformers (BERTs) [8], generative pretrained transformer (GPT) [29] and robustly optimized BERT pretraining (RoBERTa) [23]. Transformer architectures are based on a self-attention mechanism that learns long-range temporal dependencies between elements of a sequence in the 1-D space. Thus, the self-attention layers consider causality in a given sequence by learning the relationships between the token set elements (e.g., words in language or patches in an image). In the 1-D space, transformers replace successfully the recurrent operations that process one local neighborhood at a time [34] and search for dependence (locally) at its previous time-related element. Moving from time (1-D) to 2-D space, the recently proposed vision transformer [9], is an interesting attempt to showcase how (convolution-free) transformers can replace standard convolutions in deep neural networks in a similar manner to the case of 1-D space, where transformers successfully replace recurrent neural networks and their variants. There, attention mechanisms detect nonlocalized patterns and long-range pixel interdependencies (long-range spatial dependencies) [6], [33]. Vision transformers are applied on large-scale computer vision datasets, forming a CNN-free image classification model, able to capture long-range spatial dependencies and achieve state-of-the-art performance on vision tasks.

B. Article Contribution
Multiple factors, either of human or natural origin, cause deforestation. The expansion of agriculture is a major driver of deforestation at the global level, followed by the construction of infrastructures, such as roads or dams, together with mining activities and urbanization. Identifying the key driving forces of deforestation (agriculture, urbanization, infrastructure, and so on) for the primary forest, using satellite imagery data, is a challenging process. This is due to the heterogeneity of the various deforestation factors within the captured images and the fact that an image is associated with multiple land uses. In this article, we tackle the problem of MLC of satellite imagery of the Amazon rainforest in domains, including atmospheric conditions, terrain, and indicators of human impact in order to survey and prevent deforestation. In particular, we provide a model to get an insight into the complex interactions and dependencies among visual features and labels for identifying regions at risk for deforestation.
In this article, we leverage vision transformer as an effective and scalable structure [9] with self-attention mechanisms, modeling long-range contextual/spatial relation (dependencies) between different image areas [3]. Self-attention allows the model to jointly process information from different image subspace representations without using convolutions, as traditional DL models do. Therefore, ForestViT addresses the heterogeneity and MLC issues mentioned earlier. The role of the self-attention mechanism is to differentially weigh the significance of each part of the input data. The ability of ForestViT to derive long-range contextual/spatial dependencies is combined with the MLC mechanism that simultaneously assigns multiple labels in an image, related to drivers of deforestation in near-the-forest areas. Thus, ForestViT exploits the complex dependencies among visual features and labels [20] in a satellite image, identifying forest areas at risk.
An additional difficulty in MLC for deforestation monitoring is the uneven distribution of labels. For example, in tropical rainforests, deforestation events are rare and sparse compared with the large forestry area, thereby creating imbalanced classes. Although these are rare, it is important to detect these events, the causes of deforestation, and prevent a future catastrophic impact. ForestViT deals with the imbalanced nature of the dataset and has markedly better performance in detecting the less common (rare) classes. In particular, we experimentally prove that ForestViT has a comparative advantage against CNN-based architectures, especially for imbalanced datasets in deforestation monitoring.

II. DEFORESTATION DETECTION AS A MULTILABEL CLASSIFICATION PROBLEM
MLC for deforestation detection in satellite images refers to the task of assigning multiple labels to satellite images. Let us denote by I = {I_1, . . . , I_N} a set of N images and by C = {c_1, . . . , c_M} a finite set of predefined classes, with M = |C|. For each image I_i, i = 1, . . . , N, the purpose is to decide, through a learning process, the subset of classes found in the ith image, encoded by the label vector y = (y_1, . . . , y_M) ∈ {0, 1}^M. Each label attribute y_j corresponds to the absence (0) or presence (1) of class c_j. The classes are related to deforestation, land use, and mangrove deforestation factors. In contrast to multiclass learning, the alternatives are not assumed to be mutually exclusive, such that multiple classes may be associated with a single image [26]. This characteristic is extremely useful for deforestation monitoring; the main concerns in deforestation are to alert for tree cover loss, detect illegal mining and logging, and protect biodiversity. The loss of natural areas (e.g., forests) and the increase of anthropogenic areas and activities (e.g., agriculture, habitation, and infrastructures) are tightly coupled. Thus, detecting natural forest areas together with land uses in images that are related to anthropogenic activities and could harm the forests (drivers) could be an indication of danger and risk of deforestation.
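The binary label vector described above can be sketched as follows. This is a minimal illustration, not the authors' code; the class list below is a hypothetical subset of the 14 labels in the dataset of Section IV-A:

```python
import numpy as np

# Hypothetical label subset for illustration; the actual dataset has 14 classes.
CLASSES = ["primary", "agriculture", "habitation", "logging", "mining"]

def encode_labels(tags, classes=CLASSES):
    """Binary indicator vector y = (y_1, ..., y_M): y_j = 1 iff class c_j is present."""
    return np.array([1 if c in tags else 0 for c in classes])

# An image showing both forest and logging gets two simultaneous labels,
# unlike the mutually exclusive labels of multiclass learning.
y = encode_labels({"primary", "logging"})
```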
The problem at hand is to detect the multiple classes of an image related to deforestation. Let H : I → Y be a multilabel classifier that estimates the label subset ŷ = (ŷ_1, . . . , ŷ_M) comprising the various land use classes that appear in each instance image x_i ∈ X; thus, ŷ = H(x_i). Taking into consideration the complexity of the deforestation problem and the difficulties arising when using satellite images as input to detect land uses related to deforestation leads us to apply DL methods to improve the classification accuracy.

III. FORESTVIT: PROPOSED VISION TRANSFORMER FOR DEFORESTATION DETECTION

Fig. 1 shows a schematic overview of the proposed ForestViT model. The standard transformer receives as input a 1-D sequence of token embeddings. To handle 2-D images, we reshape image I ∈ R^(h×w×b) into a sequence of flattened 2-D patches I_p ∈ R^(n×(p²·b)), where (h, w) is the spatial resolution of the original image, b is the number of bands/channels, (p, p) is the resolution of each image patch, and n = (h · w)/p² is the resulting number of patches, which also serves as the effective input sequence length for the transformer. We map the vectorized patches I_p into a latent D-dimensional embedding space using a trainable linear projection. To encode the patch spatial information, we learn specific position embeddings, which are added to the patch embeddings to retain positional information as follows [9]:

z_0 = [x_class; I_p^1 E; I_p^2 E; . . . ; I_p^n E] + E_pos    (1)

where E ∈ R^((p²·b)×D) is the patch embedding projection, E_pos ∈ R^((n+1)×D) denotes the position embedding, and x_class is a prepended learnable class token accounting for the extra row of E_pos. Then, the resulting sequence z_0 of embedding entities x = (x_1, . . . , x_n) serves as input to the transformer encoder. The transformer encoder has L encoder layers, and each encoder layer is composed of a multihead self-attention (MSA) layer and a feed-forward layer (see Fig. 2). The MSA layer consists of several attention layers running in parallel [31]. The goal of self-attention is to capture the interaction among all the embedding entities x by encoding each entity in terms of the global contextual information. This is done by defining three learnable weight matrices to transform queries (W_f ∈ R^(D×d_f)), keys (W_g ∈ R^(D×d_g)), and values (W_v ∈ R^(D×d_v)). The input sequence x is first projected onto these weight matrices to get f = x W_f, g = x W_g, and v = x W_v.
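The patch-flattening and embedding step can be sketched in numpy as follows. This is a minimal illustration under assumed values (image size, band count, patch size, and embedding dimension are not taken from the paper), and the class token is zero-initialized here only for simplicity:

```python
import numpy as np

h, w, b = 256, 256, 3    # assumed image resolution and number of bands
p, D = 32, 64            # assumed patch size and embedding dimension
n = (h * w) // p**2      # number of patches = effective sequence length

rng = np.random.default_rng(0)
img = rng.standard_normal((h, w, b))

# Reshape I in R^{h x w x b} into n flattened patches I_p in R^{n x (p^2 * b)}.
patches = (img.reshape(h // p, p, w // p, p, b)
              .transpose(0, 2, 1, 3, 4)
              .reshape(n, p * p * b))

E = rng.standard_normal((p * p * b, D)) * 0.02   # trainable linear projection E
x_class = np.zeros((1, D))                       # learnable class token (zero-init here)
E_pos = rng.standard_normal((n + 1, D)) * 0.02   # position embeddings E_pos

z0 = np.vstack([x_class, patches @ E]) + E_pos   # z_0 in R^{(n+1) x D}
```

In a trained model, E, x_class, and E_pos are learned parameters rather than random draws; the reshape/transpose is the only fixed part.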
The output z of the self-attention layer is then given by [31]

z = softmax( f g^T / √(d_g) ) v    (3)

For a given entity in the sequence, self-attention basically computes the dot product of its query with all keys, which is then normalized using the softmax operator to get the attention scores in (3). Each entity then becomes the weighted sum of all entities in the sequence, where the weights are given by the attention scores.
In order to encapsulate multiple complex relationships among different elements in the sequence, the multihead attention mechanism in every layer l comprises h self-attention blocks. Each block has its own set of learnable weight matrices {W_f^i, W_g^i, W_v^i}, i = 1, . . . , h. The h block outputs are concatenated into a single matrix [z^1; z^2; . . . ; z^h] ∈ R^(n×h·d_v) and projected onto a weight matrix W. Therefore, the output of the lth multihead self-attention (MSA) layer is

z'_l = MSA(LN(z_(l−1))) + z_(l−1),   l = 1, . . . , L    (4)

where LN(·) denotes the layer normalization operator and z'_l is the encoded image representation. Then, a fully connected feed-forward dense block follows in every encoder block (see Fig. 2)

z_l = FFN(LN(z'_l)) + z'_l,   l = 1, . . . , L    (5)

Finally, a multilabel classifier makes the final predictions ŷ (see Fig. 1). We use a feed-forward network (FFN) with two dense layers and a sigmoid activation function.

Fig. 1. The proposed ForestViT model, based on the vision transformer [9] and the encoder part of the NLP transformer. The standard transformer receives a 1-D token embedding sequence as input. Here, the images are split into fixed-size patches and fed into the model. A learnable positional embedding vector is assigned to every patch to utilize the order of the input sequence. The ForestViT model assigns the existing classes for each output; each image patch could be assigned to more than one class.
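The self-attention, multihead concatenation, and residual/normalization structure of one encoder layer can be sketched in numpy. This is an illustrative pre-norm sketch, not the authors' implementation; all weight shapes and the ReLU nonlinearity in the feed-forward block are assumptions:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wf, Wg, Wv):
    # Queries f, keys g, values v; scaled dot-product attention scores,
    # then each entity becomes the weighted sum of all entities.
    f, g, v = x @ Wf, x @ Wg, x @ Wv
    return softmax(f @ g.T / np.sqrt(g.shape[-1])) @ v

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def encoder_layer(z, heads, Wo, W1, W2):
    """One encoder layer: MSA over LN(z) with a residual, then a two-layer FFN."""
    zn = layer_norm(z)
    # Concatenate the h self-attention block outputs and project onto W_o.
    msa = np.concatenate([self_attention(zn, *hw) for hw in heads], axis=-1) @ Wo
    z_mid = msa + z                                       # residual connection
    return np.maximum(layer_norm(z_mid) @ W1, 0.0) @ W2 + z_mid
```

Stacking L such layers, as in the equations above, keeps the sequence shape (n+1) × D throughout the encoder.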

IV. EXPERIMENTAL EVALUATION
We compare our proposed convolution-free ForestViT model with traditional DL models that have convolutional layers as their core structure: VGG16 [24], ResNet50 [4], DenseNET121 [5], and MobileNET [15]. These are widely used models for remote sensing and deforestation applications and are often used as baselines to evaluate various vision transformer structures. In our multilabel experiment, all four baseline models are trained with an input image size of 256 × 256 and a batch size of 128.
The remainder of this section is structured as follows: first, we describe the dataset used in our experiments and provide the implementation details for the ForestViT model; the metrics used for the evaluation are described in Section IV-C, and finally, we present the experimental results.

A. Dataset Description
We utilize a dataset (URL: https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/data) published in a Kaggle competition (by the Planet company), containing imagery data from satellites with varying spatial resolution characteristics, i.e., the imagery has a ground-sample distance (GSD) of 3.7 m and an orthorectified pixel size of 3 m. The data come from Planet's Flock 2 satellites in both Sun-synchronous and ISS orbits and were collected in the time interval between January 1, 2016, and February 1, 2017. All of the images are derived from the Amazon basin. Mangrove deforestation in the Amazon forest is an intense phenomenon, and a plethora of factors that contribute to deforestation are observed there. Each entry contains imagery data in RGB plus the infrared band in geo-referenced .tiff format. In our experiment, the images are classified into 14 classes, and the labels are broken into three groups: atmospheric conditions, common land cover/land use phenomena, and rare land cover/land use phenomena (see Fig. 3). Each entry is assigned to one or more classes.

2) Vision Transformer Encoder:
The vision transformer encoder accepts the images as input and produces a 14 × 1 tensor containing each label's probabilities as output. The last activation function is sigmoid so that each distinct probability in the output tensor can take values in [0, 1] regardless of the probabilities of the rest of the labels (MLC problem). The transformer encoder consists of encoder blocks. Each encoder block contains two sublayers: MSA and positionwise feedforward networks, where a residual connection followed by layer normalization is employed around both sublayers. Our implementation contains four transformer encoder blocks, each one with an eight-head self-attention mechanism.
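A minimal sketch of such a sigmoid multilabel head follows; layer widths are illustrative assumptions, and this is not the trained model:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def multilabel_head(z_cls, W1, W2, threshold=0.5):
    """Two dense layers with a final sigmoid: each of the 14 output probabilities
    lies in [0, 1] independently of the others, unlike a softmax output."""
    probs = sigmoid(np.maximum(z_cls @ W1, 0.0) @ W2)
    return probs, (probs >= threshold).astype(int)
```

Thresholding each probability independently is what allows more than one class to be assigned to the same image.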

C. Accuracy Assessment
We employ several metrics to comparatively evaluate the proposed and existing methods.
1) Per-Class Accuracy: In order to objectively evaluate our results, the metrics of accuracy, precision, recall, and F1-score are considered. Accuracy (ACC_ci) is defined as

ACC_ci = (TP_ci + TN_ci) / (TP_ci + TN_ci + FP_ci + FN_ci)

where the numerator contains the true positive (TP_ci) and true negative (TN_ci) samples, while the denominator additionally contains the false positive (FP_ci) and false negative (FN_ci) samples. Precision (PR_ci), recall (REC_ci), and F1-score (F1_ci) are given as

PR_ci = TP_ci / (TP_ci + FP_ci),   REC_ci = TP_ci / (TP_ci + FN_ci),   F1_ci = 2 · PR_ci · REC_ci / (PR_ci + REC_ci)

2) Overall Accuracy: To measure the effectiveness in an MLC problem, averaged metrics are also required. Micro-averaging first sums TP_ci, TN_ci, FP_ci, and FN_ci over all classes c_i and then computes each metric from the pooled counts, e.g.,

PR_micro = Σ_i TP_ci / (Σ_i TP_ci + Σ_i FP_ci),   REC_micro = Σ_i TP_ci / (Σ_i TP_ci + Σ_i FN_ci)

3) Multilabel Accuracy: In MLC, a misclassification is no longer a hard wrong or right. A prediction containing a subset of the actual classes should be considered better than a prediction that contains none of them, i.e., predicting two of the three labels correctly is better than predicting no labels at all. Hamming loss is the fraction of labels that are incorrectly predicted. Given the ith image input to the model, an output vector with binary elements ŷ_{i,j} of dimension [1 × M], and the corresponding ground truth with elements y_{i,j}, the Hamming loss metric is

HL = (1 / (N · M)) Σ_{i=1}^{N} Σ_{j=1}^{M} 1(ŷ_{i,j} ≠ y_{i,j})

Table I demonstrates the proposed model's performance on the unseen (test) data. To verify the performance of our self-attention model, we use the unseen (test) set to assess the model on data entirely outside the training phase. The results have been obtained using the accuracy criterion (see Section IV-C1) on the test set, separately for each category. We can see that high-performance results are obtained. Our convolution-free ForestViT model has slightly better results compared with the ResNET, VGG16, DenseNET, and MobileNET approaches.
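The per-class, micro-averaged, and Hamming-loss metrics above can be computed from binary label matrices as in the following sketch (not the authors' evaluation code; array shapes and names are illustrative):

```python
import numpy as np

def multilabel_metrics(y_true, y_pred):
    """Per-class accuracy, micro-averaged precision/recall, and Hamming loss
    for binary label matrices of shape (N images, M labels)."""
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
    tn = ((y_true == 0) & (y_pred == 0)).sum(axis=0)
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)
    acc = (tp + tn) / (tp + tn + fp + fn)         # per-class accuracy ACC_ci
    micro_pr = tp.sum() / (tp.sum() + fp.sum())   # pool counts over all classes
    micro_rec = tp.sum() / (tp.sum() + fn.sum())
    hamming = (y_true != y_pred).mean()           # fraction of wrongly predicted labels
    return acc, micro_pr, micro_rec, hamming
```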

D. Evaluation of Deep Learning Techniques for Deforestation Detection

1) Per-Class Analysis: Focusing on the different classes of the deforestation dataset, the per-class recall and F1-score criteria are shown in Fig. 4. For the classes with the most frequent samples, such as "hazy," "primary," "agriculture," "clear," "water," "cloudy," "partly cloudy," and so on, the performance scores do not indicate a substantial difference between the proposed ForestViT model and the state-of-the-art techniques applying convolutional operations (e.g., the DenseNET, MobileNET, ResNET, and VGG16 networks). However, the comparative advantage of the proposed ForestViT method appears in Fig. 4(b), where we show the performance metrics of ForestViT and the well-established convolutional approaches for the classes containing infrequent samples (rare occurrences), such as "conventional mining," "blooming," "selective logging," and so on. In Fig. 4, the F1-score, which is a suitable measure of model performance on imbalanced datasets, shows better results for ForestViT, especially in the cases of the rare classes.
2) Model's Overall Accuracy Assessment: In Table II, we report the micro-averaged recall and precision metrics on the test set for the ForestViT, ResNET, VGG16, DenseNET, and MobileNET networks. Given that the micro-averaged multilabel performance metrics are defined by averaging over both labels and examples, they adequately capture the per-class performance imbalance also observed in Fig. 4. The overall micro-averaged precision reaches 0.80 for ForestViT and remains below 0.80 for the other models used for comparison. As far as computational complexity is concerned, ForestViT has almost the same computational requirements as the convolutional models (ResNET and VGG16) during the test phase, that is, when the model is applied to new incoming data in its operating phase, once training has been completed. Regarding the computational complexity of the training phase, the cost per epoch for ForestViT is around 3% lower than that of VGG16, whereas it is considerably greater (around 37%) than that of the ResNET model used for comparison. However, it should be noted that model training is executed once. The ResNET and VGG16 models are shown here since they are the next best models in terms of overall performance; the comparative advantage of our proposed method, as discussed in Figs. 5 and 6, is its ability to detect rare labels in highly imbalanced datasets.
3) Deforestation Risk Analysis Enabled by Multilabel Classification: Fig. 5 shows the ForestViT model performance in detecting land-use properties acting as drivers of deforestation, along with the existence of forest areas (class "primary"), in an MLC setting. The Hamming loss metric shows that our proposed ForestViT model performs better than the ResNET, VGG16, DenseNET, and MobileNET methods. Here, we should stress that datasets on deforestation drivers are inherently imbalanced. The change of intended land use from forest to nonforest (urban, agricultural, and so on) results in forest degradation and is often a gradual process. Therefore, in our dataset, nonforest land uses are rare compared with the "primary" land use. Even though rare, the detection of nonforest land uses near forestry areas is important for environmental conservation. However, the combined detection of a forestry plus a nonforestry area in an image poses additional challenges, given that nonforestry samples are rare in the dataset (imbalanced classes). As indicated by Fig. 5, for the Hamming loss, there is a certain range of missing labels over which ForestViT performs better than the convolutional architectures. ForestViT successfully detects the multiple classes of an image, regardless of whether those belong to frequent or infrequent classes.

Fig. 6. Probability P_prim of successfully detecting the primary class versus the probability P_prim,x of detecting both primary and a driver-related class, for the ForestViT, ResNET, VGG16, DenseNET, and MobileNET models.
As mentioned earlier, in MLC, a misclassification is no longer a hard wrong or right, given that a prediction containing a subset of the actual classes should be considered better than a prediction that contains none of them. However, in our application, the existence of the "primary" class in an image together with an additional label related to a land use that could act as a driving factor of deforestation could indicate areas with a high risk of deforestation. Thus, correctly assigning both classes simultaneously is of great importance for our application scenario.
In our last scenario, we consider seven different cases that contain images having at least two different labels. The primary forest label is included as the standard label for all cases, and the second label varies, being one of the selected drivers (agriculture, cultivation, mining, road infrastructure, habitation, logging, and bare ground) for each case. We compare the probability of detecting the primary forest label in those images

P_prim = n_prim / N_prim,x

with the probability of jointly detecting both the primary forest and the respective driver label

P_prim,x = n_prim,x / N_prim,x

where x stands for the drivers [agr: agriculture, cul: cultivation, min: mining, roa: road, hab: habitation, log: logging, and bar: bare ground], N_prim,x is the total number of images that include at least the primary forest label and the driver x label, n_prim is the subset of the N_prim,x images correctly identified as primary forest, and n_prim,x is the subset of the N_prim,x images correctly identified as both primary forest and the x label. Fig. 6 demonstrates the probability of successfully detecting primary versus the probability of detecting both primary and a driver-related class. In particular, we compare the P_prim and P_prim,x values per architecture. As observed, the primary forest class is accurately detected for all the examined cases and DL architectures (see also Fig. 4). The cases that include the agriculture and road labels are identified with high accuracy. The cultivation, habitation, and mining labels follow in accuracy performance, whereas the logging and bare ground labels appear to have the worst performance. Logging and bare ground, however, are rare occurrences (minority classes) in the dataset, which explains their low performance. We should hereby emphasize the results on the rare classes, for which ForestViT is significantly better than the CNN-based models (see Fig. 6).
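The two probabilities compared per driver reduce to simple ratios of counts, as in the following sketch (the counts below are made-up numbers for illustration, not values from the paper):

```python
def joint_detection_probs(n_prim, n_prim_x, N_prim_x):
    """P_prim = n_prim / N_prim,x and P_prim,x = n_prim,x / N_prim,x for one driver x."""
    return n_prim / N_prim_x, n_prim_x / N_prim_x

# Example with illustrative counts for a hypothetical driver x = "logging":
# 100 test images carry both labels; primary is found in 95, both labels in 60.
p_prim, p_joint = joint_detection_probs(n_prim=95, n_prim_x=60, N_prim_x=100)
```

By construction, P_prim,x can never exceed P_prim, since jointly detecting both labels requires detecting primary as well.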
In particular, as regards the logging case, the probability of ForestViT to detect both classes (i.e., the frequent "primary" forest and the rare "logging" class) is higher than the respective probability obtained using a CNN-based architecture. In addition, as regards the rare "bare ground" class, it is also observed that the ForestViT performance is remarkably better compared with the conventional CNN methods.

V. CONCLUSION
The fact that the human landscape is rapidly evolving emphasizes the need for the analysis of deforestation data, the update of deforestation risk maps, and the appropriate adaptation of mitigation strategies. In order to capture co-occurrence patterns among labels, this article proposes a multilabel vision transformer classifier, ForestViT, to detect dependencies among the output variables. We show that self-attention between neighboring image patches in ForestViT, without any convolution operations, achieves superior performance in the MLC of deforestation images compared with state-of-the-art DL models.