Deep Semantic Segmentation of Trees Using Multispectral Images

Forests can be efficiently monitored by automatic semantic segmentation of trees using satellite and/or aerial images. Still, several challenges can make the problem difficult, including the varying spectral signature of different trees, lack of sufficient labelled data, and geometrical occlusions. In this article, we address the tree segmentation problem using multispectral imagery. While we carry out large-scale experiments on several deep learning architectures using various spectral input combinations, we also attempt to explore whether hand-crafted spectral vegetation indices can improve the performance of deep learning models in the segmentation of trees. Our experiments include benchmarking a variety of multispectral remote sensing image sets, deep semantic segmentation architectures, and various spectral bands as inputs, including a number of hand-crafted spectral vegetation indices. From our large-scale experiments, we draw several useful conclusions. One particularly important conclusion is that, with no additional computation burden, combining different categories of multispectral vegetation indices, such as NVDI, atmospherically resistant vegetation index, and soil-adjusted vegetation index, within a single three-channel input, and using the state-of-the-art semantic segmentation architectures, tree segmentation accuracy can be improved under certain conditions, compared to using high-resolution visible and/or near-infrared input.


I. INTRODUCTION
F ORESTS are one of the principal factors that maintain Earth's climate stability [1]. Accurate monitoring of their state and sustainability is a global concern. Compared to other more traditional methods, such as aerial surveys or plot-based analyses, remote sensing has proven to be the most efficient way to monitor forest cover change processes. Conventionally, remote monitoring of processes, such as deforestation or forest degradation has been interpreted manually by expert analysts. However, with recent developments in the field of artificial intelligence, deep learning (DL) algorithms using satellite and/or Manuscript  aerial imagery are becoming the de facto tool in forest monitoring [2]. Remote sensing imagery can be categorized by the altitude of the aircraft/spacecraft or the type of the optical system (or, simply put, the sensor) [3]. These two factors largely determine the tradeoff between the area covered and the amount of detail (i.e., the resolution) of the constructed image. Depending on the type of the sensor, the optical system can sense a part of the electromagnetic spectrum, sample different spectral bands separately and simultaneously, and hence, create multi or hyperspectral imagery. In addition, various spectral bands are used to construct the so-called "spectral indices," which are mathematical combinations of spectral reflectance of different wavelengths. These hand-crafted indices are used to detect the presence of objects or situations, such as vegetation [4], water [5], fire [6], and landslide [7]. Although these hand-crafted indices are valuable features for pattern recognition, today's trend is moving toward building nontransparent, end-to-end DL models to detect any feature of interest, usually using supervised techniques [8].
In this article, we address the problem of segmenting forest areas (or simply trees) in both satellite and aerial images using deep semantic segmentation techniques and multispectral imagery. We attack the problem in multiple dimensions. To begin with, we examine the effect of altitude using satellite and aerial images with various metric and spectral resolutions. Second, we benchmark different deep semantic segmentation architectures, including models with pretrained convolutional encoders. We utilize different spectral bands, such as visible [red-green-blue (RGB)], near-infrared (NIR), and short-wave infrared (SWIR) (and their multispectral combinations), as raw inputs to these models. Moreover, we experiment with hand-crafted spectral vegetation indices (VIs) as input and analyze their performance compared to raw spectral input images. Our goal is to determine how to efficiently utilize DL-based architectures for tree segmentation, what practical issues arise with limited remote sensing data, and which methods should be utilized. In order to accomplish this, we conduct large-scale experiments.
The rest of this article is organized as follows. In the following section, we refer to the literature on the subject and underline our contribution. The third section provides the details of our benchmarking environment with respect to the image sets, the segmentation models, and the spectral indices we utilize in our experiment. Section IV presents the experimental results and covers related analyses and discussions. Finally, Section V concludes this article. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ II. RELATED LITERATURE Survey studies on object detection in optical remote sensing images [9], [10], [11] show that efforts to detect trees in satellite images date back to the early 90s [12], [13], [14]. Previously, traditional machine learning methods, such as multinomial logistic regression [15], random forest (RF) [16], [17], [18], and support vector machine [19], [20], [21] dominated the area of tree detection and segmentation within remote imaging. These studies relied not only on hand-crafted feature extractions but also on feature encoding [22] and feature pooling [23]. Because of the low-level characteristics of the extracted features, these methods were not able to capture high-level features that reside in complex high-dimensional (spatial, temporal, and spectral) satellite or aerial imagery. Recently, DL approaches have emerged to tackle such limitations by making use of hierarchical learning processes while extracting high level, complex abstractions from the data [24], [25], [26], [27], [28], [29]. Convolutional neural networks (CNNs) [30], with their proven ability to learn these high-level abstract feature representations with the help of convolutional and pooling layers, are the most commonly used DL models for semantic segmentation.
It is also worth mentioning that classical tree segmentation methods commonly utilized 3-D airborne light detection and/or range (LIDAR) data [46], [47], [48], [49], [50], [51], [52], [53], [54], [55]. For instance, when using 3-D LIDAR data to perform tree segmentation, data-specific methods that relied on crafted features were employed, such as canopy height model [56], based watershed algorithm [57], or normalized point cloudbased layer stacking algorithm [58]. Besides, in some studies both LIDAR and multispectral data were utilized by applying different algorithms for each data type. While implementing machine/DL algorithms that are well-suited to multispectral data, the studies concurrently applied methods native to LIDAR data [59], [60], [61]. Nevertheless, due to its flexibility in both spatial and spectral resolution, and its compatibility with different remote sensing systems (UAV, satellite, etc.), multispectral and/or hyperspectral imagery are today's main standards for not only tree segmentation, but for several other related problems, such as crop, water, and field segmentation [3].
The specific problem of detecting trees presents some challenges, such as geometrical complexity, color variations, and/or self-occlusion of the tree branches [62], just to name a few. As a result of these challenges, the problem proves to be considerably difficult. For example, regarding self-occlusion, even a slight change in the viewpoint could cause the same structures to appear significantly different. The tree segmentation problem is also challenging due to the environmental background, which may contain other types of green vegetation, such as small bushes. A closely related problem is differentiating between the view of a large canopy, hence, the overlap between adjacent individual trees. A satellite-retrieved spectral signature is the combination of different reflectance sources, such as tree crown, tree crown shadow, soil, and herbaceous vegetation [17].
Another key factor that limits individual tree mapping is that the spatial resolutions of most satellite/aerial sensors are usually not capable of capturing objects with sizes less than one square meter. However, thanks to commercial satellites and UAVs with multispectral cameras, the improved spatial resolution makes it possible to map every tree on large scale [63].
Instead of using individual spectral bands, VIs have been proposed in the literature [64]. VIs are derived from varying mathematical combinations of two or more spectral reflectance values to examine the presence of the vegetation within a pixel. VIs have several drawbacks, notably their inadequacy at interpreting the information content for forest canopy, thereby insufficiently quantifying the effects among different variables, and hence, making them less favorable for rapidly changing biophysical factors, such as chlorophyll content per unit leaf area, leaf angles, and fractional cover [65]. Performances of VIs are compared to that of the spectral mixture analysis in terms of tree canopy cover and are found to be poor [66]. Since there is a high correlation between many VIs, they can be used together and complement each other in different aspects by addressing specific needs that characterize tree canopy.
Most of the DL architectures that process satellite/aerial imagery for the segmentation of objects, including trees, are end-toend models that take multi-or hyperspectral bands as input channels [67]. However, due to both their multimodal, geolocated, and multitemporal nature and the insufficient amount of labelled remote-sensing data, applying DL models tends to be a challenging task [25]. Improvements in the high-performance computing architectures, such as GPUs lag far behind the growth of highdimensional big satellite/aerial data; ergo, recent efforts toward building large-scale DL models fall short of training big remotesensing data, unless a breakthrough is achieved in the acceleration of computing power. One solution to this problem, which we propose in this article, is to utilize VIs as a priori knowledge to the DL models. VIs by definition, provide insight into spectral characteristics of a surface by taking into account the interaction between electromagnetic radiation and vegetation, and hence, can be considered physics-based modelled features.
Our study explores the effect of fusing VIs as a physical model and various ED architectures as DL models in order to perform tree semantic segmentation tasks effectively and accurately. For this purpose, we implement various ED-based semantic segmentation architectures, such as U-Net [34], SegNet [35], DeepLabv3+ [36], DLinkNet [37], and DFANet [38], as well as a pre-DL era ensemble learning algorithm, namely RF [68]. We utilize specific VIs so as to integrate physics-based models as additional input information to the benchmarked models. A framework is constructed for satellite and aerial images that not only provides a comparison between different ED-based semantic segmentation architectures but also leads to insights about fusing various VIs or spectral bands for the purpose of pixel-wise tree segmentation.

III. METHODOLOGY
In this section, we provide a methodological background for the experiments conducted within our study. We commence by introducing the multispectral image sets utilized in our experiments. We then provide details on the benchmarked semantic segmentation architectures of our experiments and the types of input we provide to these models.

A. Multispectral Image Sets
The previous studies in the literature have proven that high spectral resolution is a major factor in solving the tree segmentation problem, as is the requirement of high spatial resolution [16]. The earlier airborne cameras and satellite-borne multispectral sensors provided high spatial resolution, while their spectral bands were limited to blue (0.45-0.51 μm), green (0.51-0.58 μm), red (0.6-0.69 μm), and NIR 1 (0.77-0.90 μm). Because several satellite-borne sensors, such as WorldView-2 and WorldView-3 were launched after 2009, numerous other spectral bands that are strongly related to vegetation properties have also been included. One is the Coastal band (0.40-0.45 μm), where the associated reflectance is highly bound up with the chlorophyll content of the vegetation. Another band, the red edge (0.71-0.75 μm), leads to intuitive discrimination of healthy trees, which is the primary task in precision agriculture. Aside from the detection of healthy trees, infected trees can also be exposed if the Yellow band (0.59-0.63 μm) is utilized to extract the yellowness of their crowns. Apart from the NIR 1 band, the NIR 2 band (0.86-1.04 μm) can provide additional information on the vegetation analysis since it is less influenced by atmospheric conditions [69]. The SWIR (SWIR) band (1.19-2.36 μm) is useful to characterize vegetation water content since it is sensitive to water absorption.
In order to conduct supervised tree segmentation experiments, multispectral image sets with pixel-label ground truth are required. Table I shows detailed information about the two image sets that we use in our conducted experiments. In Table I, ground sampling distance (GSD) refers to the distance between the centres of two adjacent pixels corresponding to their representation in the real world [70]. The improvement of segmentation performance is significantly related to the spatial resolution, with GSD being the most important descriptor [71]. Additionally, ground field-of-view (FOV) refers to the metric area covered by the sensor at the chosen operational altitude of the sensor during image collection. The reader may refer to [72], [73], [74], [75] for a list of available multispectral image sets with pixel labels. The two image sets used in our experiments are discussed as follows.  covers an area of 1 km × 1 km of the Earth's surface. In this image set, both 3-band and 16-band formats are provided with different spatial resolutions. The images are captured by WorldView-3, which is a commercial satellite. The WorldView-3 optical system provides very high-resolution satellite imagery with 31 cm panchromatic resolution, 1.24 m multispectral resolution, and 7.5 m SWIR resolution [77]. WorldView-3 satellite captures the same area with different types of satellite images, such as a panchromatic band (high-resolution), an RGB band (high-resolution), and a multispectral band (M -band) (lower resolution), and an SWIR band (A-band) (lowest resolution). The advantageous feature of having 11-and 14-bit as image color depths enables more information from each pixel to be used in a DL model. There are a total of 10 labeled object types, including trees.

2) RIT-18 (The Hamlin State Beach Park) Aerial Image Set:
The RIT-18 aerial image set [73] was acquired in 2017 by mounting the Tetracam MicroMCA6 multispectral sensor on board the DJI-S1000 octocopter. This unmanned aircraft system is much cheaper than those based on manned aircraft and satellite systems. RIT-18 has been collected with a ground sample distance of 0.047 m and includes six very high-resolution multispectral bands. Three of them are visible RGB bands, while the other three are NIR bands, with all having very high spatial resolution. Due to their high spatial resolution, 1 the NIR bands introduced by the RIT-18 are strengthening the discriminative power of DL architectures, especially for cases where vegetation exists [73].

B. Semantic Segmentation Architectures
In the past decade, studies on different semantic segmentation architectures have shown tremendous development [31]. Mostly driven by industry, the majority of these studies were designed for medical applications or unmanned systems, such as driverless cars. As mentioned in Section II, we benchmark different DL-based semantic segmentation architecture and a pre-DL ensemble learning method called the RF algorithm. In this part, we provide details on the semantic segmentation techniques that we utilize in our experiments.
The pioneering DL-based semantic segmentation architecture is the FCN [32]. This network is considered to be the ancestor of today's widely used ED architectures in semantic segmentation [31]. Although there are many different architectures and approaches in DL-based semantic segmentation literature, it is safe to say that the ED architecture is an off-the-shelf solution to many semantic segmentation problems.
Due to its role in recovering spatial resolution in extracted feature maps, the decoder is especially important to implementing ED architectures for multispectral imagery so that the high-resolution details can be recovered effectively. Given this effect, we considered various ED architectures, while paying attention to the different aspects of decoders. Upsampling layers are one of the most crucial parts of decoders, as they increase the spatial resolution [78]. Therefore, upsampling layers are the basis for selecting the proper architectures in this article.
Below we explain the five fundamental DL-based semantic segmentation architectures that we utilize in our experiments: U-Net [34], SegNet [35], DeepLabv3+ [36], DLinkNet [37], and DFANet [38]. The first two are considered to be the basic ED examples, while the latter three are advanced state-of-the-art models that have been proven to be more successful in today's semantic segmentation challenges [79]. Finally, we provide information about the RF algorithm, which is one of the strongest pixel-wise tree segmentation methods of the pre-DL era.
1) U-Net [34]: U-Net is the one of the pioneering and mostused ED approaches in the semantic segmentation literature [see Fig. 1(a)]. It is notable for its unique architecture, in which the whole feature map is transported from the encoder to the corresponding building block in the decoder via skip connections. Compared to state-of-the-art architectures, the decoder part of U-Net is computationally inefficient since it reconstructs the segmentation map, the size of which is equal to that of the original image through many up-scaling operations between high-level semantic information and precise local information. However, U-Net is still considered to be a de facto solution to any semantic segmentation problem.
The implemented U-Net model is a four-level-depth symmetric ED model. At each depth, two repeated blocks of convolutional layers are followed by a batch normalization operation during training and a rectified linear unit (ReLU) activation function in the experiment itself. Attached to these two blocks, each level is concluded by a 2 × 2 max-pooling layer (stride 2) and a dropout layer with rate = 0.5, hence halving the spatial dimension at each depth. For each depth, the feature channels vary among 64, 128, 256, and 512, which is also symmetrical in the decoder. In the decoder part of the process, instead of the max-pooling layer, there are up-convolution layers that help double the spatial dimensions. For all convolutional layers, kernel, stride, and padding sizes are set as 3 × 3, 1 × 1, and 1 × 1, respectively, thus keeping the spatial dimensions fixed within a block. In order to implement the binary classification of trees, the final decoder block is followed by a single depth 1 × 1 convolution layer and pixel-wise sigmoid layers, which provides an input-sized single channel map of the detection measure. Skip connections between each depth carry feature activation values, a trademark of the original U-Net architecture.
Among the many different ED architectures, we consider U-Net a fitting choice for benchmarking, given that it is a basic example of using transposed convolution as the up-sampling layer [78].
2) SegNet [35]: SegNet is very similar to U-Net architecture. The only difference is that, rather than passing the entire feature map of the encoder to the decoder, SegNet uses a very simple yet computationally efficient transfer of information. Instead of unpooled features, maximum pooling indices are passed [see Fig. 1(a)].
In the implemented SegNet model, the architecture and the related parameters are the same as in the implemented U-Net model described earlier. The only difference is, as the trademark of the SegNet architecture, skip connections carry only the pooling indices, which is why SegNet is considered to be an efficient implementation of the U-Net.
Rather than using the transposed convolution as the upsampling layer, such as U-Net, SegNet employs interpolation by implementing the unpooling operation [80], [81]. Therefore, the criterion for selecting SegNet is to compare the performance of this upsampling operation at the decoder.
3) DeepLabv3+ [36]: One of the most advanced, successful, and efficient semantic segmentation frameworks in the literature is the DeepLabv3+ [82] [see Fig. 1(b)], which is the final architecture of the DeepLab family [82], [83], [84]. It is originally trained with general-purpose image sets, such as PASCAL VOC 2012 [79] and COCO [85]. The main idea of DeepLabv3+ is to consolidate multiscale contextual information by using atrous spatial pyramid pooling layers. Compared to a default ED architecture, the decoder part is extremely lightweight, although with better segmentation accuracy.
As semantic segmentation architectures developed rapidly, recent studies have focused on utilizing state-of-the-art architectures, such as ResNet [86], Inception [87], and Xception [88] in the encoder, referred to as encoder backbone networks. The goals of this usage are to improve the generalization ability and increase the prediction accuracy of the model. Like many other semantic segmentation architectures, DeepLabv3+ can also function with pretrained encoders. In scenarios, where the labeled data is limited, different encoder backbone networks may help the model perform with increased efficacy. In the implemented DeepLabv3+ model, the first two levels of the ResNet34 or ResNet101 encoder are transferred for this purpose. However, because we utilized pretrained encoders that originally accepted three-channel (RGB) input, the experiments with this architecture are limited to spectral input with 3 or fewer channels.
We utilize DeepLabv3+ in our experiments because its upsampling layer is completely different from that of the other ED architectures. DeepLabv3+ employs bilinear upsampling + convolution in its decoder design [78], mainly for computational efficiency.
D-LinkNet [37]: Another state-of-the-art model in the literature is DLinkNet. This model is originally trained with RGB satellite images for pixel-wise road segmentation, which makes it special in the sense that it is one of the rare models specifically designed and trained using image sets with almost-zero zenith angles. The model is based on an approach focused on additional dilated convolution layers in the centre part of an ED architecture [see Fig. 1(c)]. DLinkNet is an improvement of the Linknet architecture [89], which is also an ED-based semantic segmentation model.
Similarly to the implemented DeepLabv3+ model, DLinkNet's utilized encoder is transferred and fine-tuned. The first two levels of the ResNet34 encoder is transferred to the implemented DLinkNet model in the conducted experiments. Hence, the experiments with this architecture are also limited to spectral input with less than or equal to 3 channels (please refer to Section III-C for model input types).
DLinkNet is also another example of the ED architecture that relies on transposed convolution in the decoder design [37]. However, it is not considered to be as computationally efficient as the DeepLab family. 4) DFANet [38]: The majority of recent semantic segmentation studies are targeted at real-time applications like autonomous driving that require fast understanding of the surrounding scenes while achieving high performance [90], [91], [92], [93]. In our study, we chose the state-of-the-art DFANet as a benchmark model because it has a sophisticated ED structure similar to DLinkNet, but is still real-time in performance.
The main contribution of DFANet is the so-called "Crosslevel feature aggregation based ED architecture" [see Fig. 1(d)] that aims to reduce the number of parameters, while preserving a certain semantic segmentation performance. DFANet encoder is constructed by first starting from the Xception model as a lightweight backbone, and then providing the high-level feature maps obtained from this backbone as an output to the next Xception backbone. This process, which is called subnetwork aggregation, continues for three parallel Xception backbones. The goal of adding a fully connected attention module to the tail of each backbone is to acquire the maximum receptive field. In addition to this subnetwork aggregation, another module, namely the substage aggregation, helps us to ensure the combination of multiscale information. By recovering the lost spatial information caused by deep architectures, the aim of the substage aggregation module is quite similar to that of the skip connections except that skip connections are not able to keep the large-scale object and edge information in very deep architectures. The substage aggregation module is based on the fusion of different stages of the same depth in subnetworks. In other words, the output of the layer at a certain resolution from the previous backbone is contributing to the input of the corresponding layer of the next backbone.
The encoder, which is composed of the aggregation of three lightweight Xception backbones by means of subnetwork and substage modules, is followed by a slight decoder designed particularly to tackle real-time inference concerns. This decoder is conceived as an efficient feature upsampling module that is characterized by the convolution and bilinear upsampling layers. This simple decoder structure fuses the high-level features and the low-level features of the three backbones within itself. After the high-level features are bilinearly upsampled by a factor of 4, these are added to the low-level features. Finally, this total sum is upsampled by a factor of 4 and the final prediction is obtained.
5) Random Forest [68]: In order to compare the data-driven learning power of deep neural networks relative to traditional machine learning algorithms, the RF method, which relies on hand-crafted features, is also introduced for segmentation performance evaluation. The RF method includes a set of decision trees in which each acts as a base classifier and depends on the independently sampled values of a random vector from the same distribution. As an ensemble learning algorithm, RF builds many decision trees at the time of training and relies on the maximum voting of the decision trees in the forest. Its robustness to overfitting and good generalization ability are among the advantages of the RF algorithm in semantic segmentation [94], [95].
The hand-crafted features that we fed to the RF model are basic image features, specifically derivatives (up to second order) in three different scales. The number of decision trees in the RF ensemble is selected as 20, whereas the minimum number of observations per tree leaf is set to 60.

C. Model Input
In this study, we are also interested in exploring the correlation of spectral bands and VIs with tree segmentation performance. For that purpose, the bench-marked models are trained and tested with a set of input data with varying channel depth. We categorize the utilized model input into two main categories, threechannel and multispectral input. The first, the three-channel input, includes RGB, single VIs (three channels being the same), the NIR input (including separate three NIR sub-bands) and a three-channel vegetation index combination, including NDVI, atmospherically resistant vegetation index (ARVI), and soiladjusted vegetation index (SAVI) indices (show below). Since this input group has a depth of three channels, DL models that comprise pretrained encoders, such as ResNet or Xception can be used in experiments.
The second group, multispectral input, includes image sets with a higher number of channel depths. The characteristics of the multispectral band input depend on the image set (i.e., the sensor) and are explained in detail as follows. As expected, experiments using this input group were only conducted for the U-Net and the SegNet DL models, which can accept any number of input channel depths. On the other hand, RF experiments are applied to both model input groups.
1) Red-Green-Blue: As the name implies, these images consist of only red, green, and blue channels. The difference between the reflected radiations in visible (i.e., RGB) and NIR wavelengths give more information about vegetation in particular. A large difference suggests that the pixels are more likely to belong to dense vegetation, such as forest area, while a small one is likely to indicate sparse vegetation, such as grassland [96]. On the other hand, for most commercial satellites (such as world-view-3), the spatial resolution of RGB channels is exceedingly higher than the original multispectral output of the sensor due to a resolution enhancement postprocess applied to visible band output.
2) Vegetation Indices: Vegetation covers exhibit a unique spectral behaviour, by which they can be differentiated from other ground elements [64]. Commonly, chlorophyll concentration is responsible for absorbing the radiation in the red band, while the leaf cellular structure is responsible for reflecting the radiation in the shorter NIR bands (700 to 750 nm). A deviation between the red and NIR is observed in the spectral reflectance curve, providing deeper insight into the existence of vegetation.
The key idea is to make use of this deviation to differentiate vegetation from bare soil based on the contrast between the spectral reflectance behaviours. Another spectral radiance difference measured between the red and the blue bands acts as a self-corrector by reducing the atmospheric scattering effects in the red band, making it possible to define atmospherically resilient VIs.
Current VIs are commonly categorized in the following three fundamental families: mean VIs, atmospherically resilient VIs, and soil-adjusted VIs [97]. For this reason, we utilized three widely-used indices from each family in our experiments.
The key characteristics of these indices are delineated in the following: 3) Normalized Difference Vegetation Index (NDVI): is selected to represent the mean vegetation index category, for that it is one of the most widely used and well-known indices of this category [98]. NDVI for the DSTL image set is defined by where the WorldView-3 NIR1 band covers the range 0.77-0.90 μm and the red band covers the range 0.63-0.69 μm.
In the case of the RIT-18 image set, NDVI is calculated as follows: where the RIT-18 NIR band covers the range 0.71-0.91 μm and the red band covers the range 0.67-0.68 μm.

4) Atmospherically Resistant Vegetation Index:
ARVI is particularly designed to correct atmospheric effects. In order to obtain such an atmospherically resilient VI, blue or green spectral bands are taken into account along with the red and NIR bands. There are also some other related VIs that reduce atmospheric turbidity [99]. ARVI for the DSTL image set is given by For the RIT-18 image set, ARVI calculation should be arranged according to the existing NIR bands due to the absence of the SWIR wavelengths. The following shows the calculation of ARVI for the RIT-18 image set where the RIT-18 covers the NIR range 0.71-0.91 μm, the red range 0.67-0.68 μm, and the blue range 0.48-0.49 μm; γ is taken as 1.0.

5) Soil-Adjusted Vegetation Index:
Besides the two categories described earlier, there are many other VIs that have been proposed to reduce the soil background noise on NDVI by making use of a parameter called L, which incorporates the area density factor. One of these, SAVI enables the soil brightness effect to be assessed where the vegetation density is low in the area of interest [100]. For the DSTL image set, SAVI is given by where L is taken as 0.5. Using the NIR range of 0.71-0.91 μm and a red range of 0.67-0.68 μm, SAVI can be formulated for the RIT-18 image set as follows: where L is taken as 0.5.

6) Mixed-Vegetation input (NDVIA + ARVI + SAVI):
In addition to utilizing the aforementioned VIs separately as input to the benchmarked models, we propose a new three-channel input, called mixed-vegetation input, which uses three different VIs to test the combined effect of minimized soil brightness influences, corrected atmospheric scattering effects, and minimized topographic effects. The VIs selected from each of the categories are as follows: NDVI from mean VIs, ARVI from atmospherically resilient VIs, and SAVI from soil-adjusted VIs. The combined three-channel VI image is fed into the DL architectures just like an RGB input. A false-color image of this fused input is depicted in Fig. 2.

7) Near Infrared input (NIRi):
Both the DSTL and RIT-18 datasets utilized in our experiments are shot with sensors that output three NIR channels, details of which are provided in Table I. It can be seen from this table that, the spectral resolution of the RIT-18 dataset (TetraCam MicroMCA6) is finer than DSTL's sensor (World-View-3). On the other hand, World-View-3 NIR bands cover a larger spectrum. Hence, we believe that testing these two 3-channel NIR input types in our experiments will provide us with an insight into the effect of NIR spectral resolution on tree segmentation.

8) Visible + Near Infrared input (VNIR):
VNIR is one of the multispectral input groups utilized in our experiments. DSTL VNIR input includes eight subbands (five visible + three NIR), whereas RIT-18 includes six subbands (three visible + three NIR) as a result of their individual sensory output (see Table I).
9) Visible + NIR + SWIR (VNIR + SWIR): VNIR + SWIR is the second multispectral input group utilized only for the DSTL image set in our experiments since SWIR bands are out of the RIT-18 sensor's spectral range. For this purpose, two separate groups are constructed, the first being the VNIR + SWIR 2 input that includes 8 DSTL + VNIR subbands and the two shortest SWIR bands (i.e., SWIR-1 and SWIR-2). The second group is referred to as the VNIR + SWIR 8 and includes all of the original multispectral bands of the World-View-3 sensor in 16 channels.

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS
Before presenting the experimental results and the related discussions, we provide the details of the experimental setup and our evaluation methodology as follows.

A. Experimental Setup
A CUDA-enabled NVIDIA Quadro RTX 5000 GPU with 16 GB memory is employed in the experiments. Other hardware configurations are: Intel(R) Xeon(R) Gold 6240R CPU @ 2.40 GHz and 128 GB RAM. The PyTorch [101] DL framework is used for training, validation, and testing on a Windows 10 operating system. In both of the image sets, a simple data preprocessing scheme is employed by dividing images into 224-by-224 image patches. The total number of image patches belonging to each image set can be seen in Table I. The DSTL image set is split into training, validation, and test sets in which 20% of the data is reserved as a test set to perform cross-validation. The remaining 80% of the data is further partitioned into 72% training and 8% validation sets so that hyperparameter tuning can also be implemented. Five-fold cross-validation is applied to obtain a less biased estimate of the benchmarked methods. In order to improve segmentation performance, manual hyperparameter tuning is applied. All ED architectures are trained using the adaptive moment estimation (Adam) [102] algorithm. For the DSTL image set, the initial learning rate is chosen as 10 −4 for U-Net and DFANet, while the initial value of 5 × 10 −5 is used for the rest of the architectures, SegNet, DLinkNet, and DeepLabv3+. For DLinkNet architecture, the learning rate is reduced by 9% in every ten iteration steps, while for the others it is reduced by the same amount in every five iteration steps. Except for the SegNet, the mini-batch size is set to 8 and Xavier uniform is chosen for initialization. A Mini-batch size of 4 is chosen for SegNet. By observing the learning curves, 70 epochs were found to be sufficient for convergence.
Due to the fact that test image labelling is not provided for the RIT-18 aerial imagery, it is possible to use the already partitioned training and validation sets [73], where the validation  II  RESULTS FOR THE DSTL IMAGE SET USING THE "THREE-CHANNEL INPUT" GROUP set is utilized as the test set in our experiments. For comparison, the same DSTL hyperparameters are also tuned for RIT-18. In order to handle the "black" (i.e., no data) regions in the image set (which are formed after image rectification), we did not utilize patches with more than 50% black region for training or testing. Thus, the training set is left with 814 image patches, while the test set contains 964. In the case of the RIT-18 image set, the same initial learning rate of 10 −4 is used for SegNet, DLinkNet, and DFANet architectures, while 5 × 10 −5 is utilized for U-Net and DeepLabv3+. SegNet is the only model with its learning rate reduced by 9% in every 5 iteration steps, while other architectures have the same drop rate in every 10 iteration steps. The mini-batch size takes the value of 4 for SegNet and the DSTL image sets, while it is 8 for the rest. Xavier uniform is used as the initialization method and 100 epochs are evaluated with respect to convergence.
1) Evaluation Metric: Intersection over Union (IoU or Jaccard Index) considers false alarms and missed values simultaneously by counting the total number of mislabeled pixels. The Jaccard Index is defined as Jaccard Index = TP TP + FP + FN (7) where TP denotes the true positive pixels, FP denotes the false positive pixels, and FN denotes the false-negative pixels. Jaccard Index is introduced as the evaluation metric since it is becoming the de-facto standard metric for semantic segmentation evaluation tasks in the literature. Since 2008, The Jaccard Index has been used in the PASCAL VOC challenge as the fundamental evaluation standard [103]. The average Jaccard Index (mJI) for all test images is provided in our results. Table II shows average Jaccard index values that are obtained from five-fold cross-validation results for the DSTL image set. As seen from the table, the best performance achieved is 0.571 (with a standard deviation of ±0.211 among all test images) by applying DLinkNet with the ResNet34 backbone to RGB images. The U-Net architecture applied to RGB images yields very close performance to that of DLinkNet, with 0.570 mJI. The closest performance of a non-RGB input type is again by the DLinkNet+ResNet34 architecture using the mixed VIs, which yields an mJI of 0.561. We believe that the strength of the RGB input type on the DSTL image set is mainly because of the enhanced spatial resolution of the World-View-3's RGB channels. Although obtained from the original (i.e., unenhanced, lower) spatial resolution of the World-View-3 sensor, the mixed VIs input type shows a competitive performance. These results show that for this sensor (i.e., with this spatial resolution and altitude) when fused within a three-channel signal, the hand-crafted VIs provide valuable low-level features to the DL architectures and, hence, perform as well as the enhanced RGB input type. We should note that the resolution enhancement postprocess of RGB images from the World-View-3 sensor is an additional computation, which is carried out offline and may not always be supported for edge systems like satellites or UAVs.

B. Results
On the other hand, even though the NIR reflectance information is quite important for assessing and distinguishing trees, the NIRi input type shows a low performance for all segmentation models for the DSTL image set. We believe that this is mainly due to two reasons: first, the aforementioned lower spatial resolution of the original NIR bands in the DSTL image set; and second, the coarse spectral resolution of the NIR subbands of the World-View-3 sensor, compared to the RIT-18's performance with finer NIR bands as discussed in the following.
The segmentation performance results belonging to the RIT-18 image set are presented in Table III. There is a clear trend in NIRi results, with the highest values for most of the architectures. Judging by these superior mJI values, NIR reflectance shows a promising performance for tree segmentation. The peak value of 0.921 mJI is observed in the NIR reflectance band when DLinkNet with Resnet-34 encoder is applied. NIR wavelengths also work well for the U-Net architecture, with 0.885 mJI, giving evidence of the contribution from both high spatial (GSD 0.047 m) and high spectral resolution (3 channels between 485-685 nm).
To this extent, the segmentation performance of DLinkNet (Resnet-34) in the RIT-18 image set is consistent with those obtained in the DSTL image set. DLinkNet pretrained with ResNet-34 reliably outperforms other ED architectures and is better able to preserve detailed spatial information. The combination of the VIs (NDVI + SAVI + ARVI) does not have a significant impact on the performance if an image set already has a very high spatial and spectral resolution for NIR reflectance; thus, the NIR band alone can be a better choice for aerial remote sensing.
Another significant difference between the results of the DSTL and the RIT-18 image set experiments is the performance of the ARVI input type. Although performing comparatively well on the DSTL dataset, ARVI shows the lowest mJI values  TABLE III  RESULTS FOR THE RIT-18 IMAGE SET USING THE "THREE-CHANNEL INPUT" GROUP   TABLE IV RESULTS FOR BOTH SETS USING THE "MULTISPECTRAL INPUT" GROUP for the RIT-18 dataset experiments. As explained in the previous sections, atmospheric correction requires information from the SWIR band. Hence, the lower performance of ARVI on the RIT-18 image set is due to the lack of SWIR spectral bands in the image set's sensory output. When calculated properly using the required spectral bands, we observe a clearly improved performance in the hand-crafted VIs for tree segmentation.
The results of the experiments that utilize multispectral input types are provided in Table IV. Even though there are a number of band fusion modules applied by slightly modifying the networks [104], [105], the motivation in this study is to implement only the original architectures so only U-Net, SegNet, and RF are used. Although tested for a limited number of architectures, the results are consistent with three-channel input type experiments. The lowest performance is obtained with the VNIR+SWIR 8 input type for the DSTL image set. We see that the effect of increasing the number of input channels by itself is not significant to improve the segmentation performance; however, the optimum combination of these input channels is significant [106], [107], [108]. The Literature already shows that because of an insufficient amount of labeled remote-sensing data, DL models are usually difficult to be trained in an end-to-end fashion using multispectral input [25]. It is one of our fundamental research questions in this article to address the need for using VIs as hand-crafted features, instead of end-to-end training. In [109], it has already been shown that adding the input of extra-bands causes an inference to the network parameter learning and this, in turn, causes a degradation of the network performance. We believe that even if the input of all band data can increase the global information, those extra bands may be insufficient to identify detailed small tree objects in the scene [110].
Another point that cannot be ignored is most of these multiple spectral bands, such as NIR and SWIR in the DSTL image set have low spatial resolutions compared to RGB, leading to a degradation in the overall segmentation performance [106], [111]. This is another indicator that supports the idea of using VIs as hand-crafted features when the spatial or spectral resolution of the visible and/or NIR bands is not sufficiently fine and the scale of the labelled image set is limited.
As obvious from Table III, the Jaccard index values of the RF algorithm are almost comparable to those of the ED architectures for the RIT-18 image set while this is not the case in Table II, i.e., for the DSTL image set. As already shown in [112], RF can considerably improve the performance by increasing the number of spectral bands in high spatial resolution images. Moreover, the RF algorithm not only offers the significant performance for dealing with multidimensional complex data [113], [114], but also requires only slight parameter tuning [115]. Therefore, RF is more likely to be robust to the performance degradation when multiple bands are used. This is why the best result of the multispectral input in Table IV belongs to the RF for the RIT-18 image set, where the spatial resolution is high.
The tree segmentation problem has many similarities to that of road segmentation in cases where multispectral remote sensing imagery is utilized. Occlusion is one of them, in which the object is partially or completely covered [116]. In the case of remote sensing imagery, the need for large receptive fields is crucial, since the input images are high resolution. The benefit of preserving detailed spatial information is additionally important, thus making the segmentation capable of acquiring complex geometries with highly discriminative feature information. We believe that these similarities, related to occlusions, high-resolution, and geometrical complexity, are the main reasons why a state-of-the-art road segmentation network, such as DLinkNet is able to provide excellent results for tree segmentation as well.
The similarities of our results can further be generalized to other road segmentation studies [80], [81], [117] as well. We observe that the performance of the DeepLabv3+ architecture employed with any of the state-of-the-art encoder backbones  is falling short of effective occlusion handling. DeepLabv3+ models are highly optimized to the particular problem definition and so it is unsurprising that they show performance degradation in diverse remote sensing imagery. Hence, DeepLabv3+ cannot handle the occlusion problem and may perform below its potential due to the high demand for training samples [81].
Since our selection criterion in the ED architectures is mainly related to the decoder design, segmentation performance results should also be appraised with respect to the types of upsampling layers. U-Net and DLinkNet are expected to provide broadly similar results, as they both use the same upsampling layer of transposed convolution. Segmentation performance results confirm that the architectures designed with transposed convolution, i.e., U-Net and DLinkNet, achieve superior performance over other ED architectures. This behaviour is persistent regardless of the image sets and no matter what kind of input is fed to these networks. DLinkNet achieves a comparable and even better performance than U-Net by exploiting dilated convolution to expand the receptive field and by using pretrained ResNet-34 as its encoder. The reason why SegNet fails to achieve the superior performance is that the recovery of high-resolution spatial details using "index-based" skip connections is not sufficiently effective [81]. In DeepLabv3+, the implementation of bilinear interpolation upsampling may lead to a smooth segmentation so that the decoder is unable to recover pixel-wise tree predictions accurately [118], [119]. This is mainly due to the fact that the decoder in Deeplabv3+ is not suitable for processing high-resolution remote sensing imagery, where the tree pixels are small compared to the overall image [120].
The performance of the DFANet architecture should be further examined against that of the other architectures, since it is specifically tailored for real-time semantic segmentation. Just like the Deeplabv3+, DFANet also employs bilinear upsampling operation in the decoder part such that both have similar segmentation performances. However, as shown in Table II, DFANet Jaccard index values are even lower than those of the Deeplabv3+ due to the low spatial resolution of the DSTL image set. As a result of its lack of network depth and reduced number of parameters, DFANet cannot capture high-dimensional features like U-Net [121]. The DFANet experimental results are notable for their relatively high Jaccard index values for NDVI+SAVI+ARVI combination, which is consistent with our results obtained from other DL architectures. Moreover, the experimental results in Table III show that the segmentation performance of DFANet architecture is almost close to that of U-Net architecture, since the RIT-18 image set has high spatial resolution. We believe that, for the DFANet to be able to exhibit better semantic segmentation performance in high-resolution remote sensing imagery input, some improvements on the DFANet architecture should be done which allow the model to better fit to this kind of data [122]. Obviously, such improvements will also affect its real-time nature.
Although it is highly preferable to draw conclusions from averaged results, such as those from Tables II-IV, it is worth noting that the single image results introduced in Figs. 3 and 4 depict explicit behavior in which ARVI introduces significant false alarms. The pixels marked with pink are the false alarms, which can be seen to be quite dense in ARVI compared to other VIs. ARVI becomes sensitive to all kinds of green vegetation other than trees while reducing atmospheric scattering effects and, hence, presents false alarms by brightening the pixels belonging to other types of green vegetation, including small bushes and grasslands.
Tables II-IV show that the RF algorithm achieves comparable segmentation performances, even though the results are not as good as those from DL-based ED architectures. Even the best of the RF algorithm results (RGB, with 0.371 mJI), shown in Table II for the DSTL image set, is still lower than all the DL-based results. On the other hand, the achieved results shown in Table III for the RIT-18 are quite high, with the highest being 0.866 mJI for NIR reflectance, indicating how well the RF algorithm can perform when spatial resolution is sufficiently high.
For all input types and utilized methods, results exhibit high standard deviations of Jaccard index values. This leads us to a conclusion that individual maximum or minimum performances of all input types and/or methods are statistically close to each other. We believe that the reason behind this fact is due to the tested images being significantly diverse in nature. In RIT-18 image set, some images cover only a car park, whereas some images only cover vegetation. In DSTL, some images cover only buildings, whereas some others cover only traffic and roads. Due to this extreme diversity, we believe that the high deviation in the individual results are justifiable. However, the deviation in Jaccard values is calculated based on the results that are obtained from thousands of images. The average Jaccard values (mJIs) obtained from this dataset are considerably different (reaching 0.3 mJI of difference between some input types and methods). We believe that, although there is a deviation of success related to the diverse nature of individual images, the mJI values obtained from thousands of samples are a true indicator of the capabilities of different input types and/or segmentation methods.

V. CONCLUSION
In this article, we study different semantic segmentation models and different multispectral input combinations for the pixelwise tree segmentation problem on remote sensing imagery. For this purpose, we utilize a set of comparative experiments where we benchmark the selected models and the given input types. The essence of the comparison for the segmentation models lies in the selection of ED architectures for tree segmentation. We specifically analyze the descriptive power of these models to overcome issues, such as tree occlusion and geometric complexity. Different decoder designs, which make use of highresolution information of remote sensing imagery and enable ED architectures to handle these problems, are explored. The comparative results procure significant conclusions, the most important of which are summarized as follows.
1) The spatial resolution of the remote sensing imagery is the most important factor and has a greater influence on segmentation performance than spectral resolution and applied architectures. It is necessary to utilize a very high-resolution remote sensing image set: otherwise, it is not possible to reach considerable segmentation accuracy, even when a powerful, DL-based ED architecture is implemented. Therefore, an aerial image set is a better choice than satellite imagery, which is not only insufficient in terms of spatial resolution but also expensive. 2) The second most important factor is the spectral resolution of the NIR reflectance, for increasing the spectral discrimination between trees and other green vegetation.
Although it is known that NIR reflectance information is significantly valuable for discriminating trees, the experiments demonstrate that if the spectral resolution of the NIR band is not sufficiently fine, sufficiently accurate tree segmentation cannot be obtained. 3) DLinkNet architecture consistently outperforms the compared semantic segmentation models, mainly for the cases including tree occlusion and grassland. Thanks to the decoder design with transposed convolution layers, further improved with dilated convolution and pretrained ResNet-34, the large receptive field of the DLinkNet is best suited to high-resolution remote sensing images. 4) It can be advantageous to use a combination of VIs in cases where the spatial resolution of the remote sensing image set is insufficient, such as the case we observe in our experiments with the DSTL image set. In Table II, we see that RGB and mixed-Veg input types perform similarly because the spatial resolution is not sufficiently high. This is not the case in Table III. Thus, we conclude that the following three factors should be taken into account when VIs are to be used for semantic segmentation of trees: a) the input signal should include a combination of VIs of diverse characters; b) the spatial resolution is relatively lower as in the case of satellite imagery; c) a DL-based model architecture with sufficiently descriptive strength should be utilized. 5) To assess trees in the presence of the most common types of aerosols, such as smoke or sulfates, it is especially essential to utilize the SWIR band while calculating VIs that reduce atmospheric effects, such as ARVI. Due to the sensitivity of the SWIR band to the liquid water content of the tree leaves, ARVI calculated with the SWIR band shows a better segmentation performance than the one that does not utilise SWIR reflectance. Regarding the future directions for this study, our first plan is to focus on improving the backbone architecture. The pretrained models, such as the ResNet are trained on RGB data, and they can potentially downgrade the performance when fine-tuning on small remote sensing datasets due to the large domain gap. Because, as we saw in our preliminary experiments, it is not possible to obtain convergence for training, when we train some deep architectures (such as DeepLabv3+ or DLinkNet) from scratch with a limited scale of data. Hence, the only ideal option can be to use encoders pretrained with satellite or aerial imagery with the same sensors, or in other words, the same multispectral bands trained with the same architectures. Such pretrained encoders, to the best of our knowledge, do not publicly exist. However, there are efforts to implement domain adaptation to multispectral images obtained from different sensors, so that larger-scale experiments can be carried out [123]. Following this direction, a domain (i.e., sensor) independent multispectral backbone can be obtained.
As an alternative, converting the RGB backbones to a compatible model that can process multispectral data is also a viable direction. Additive group normalization method discussed in [104] is a recent example of this approach. By analyzing which bands contribute more to segmentation, they attempt to obtain DL-based VI designs that will be compatible with three-channel backbones, such as the ResNet. Following this path, we may be able to utilize multispectral band input to advanced architectures, such as DLinkNet or DeepLabv3+, and achieve better performance.
Another promising direction, not only for this particular problem at hand but in DL is utilizing vision transformers (ViT) instead of convolution-based architectures. Following the very recent ViT design [124], computer vision problems are being attacked with transformer-based architectures. As of the time this article is prepared, many transformed-based multispectral applications including segmentation [125], and many others [126], [127], are proposed, and will likely increase in near future.