Comparative Study of Real-Time Semantic Segmentation Networks in Aerial Images During Flooding Events

Real-time semantic segmentation of aerial imagery is essential for unmanned aerial vehicle (UAV) applications, including military surveillance, land characterization, and disaster damage assessment. Recent real-time semantic segmentation neural networks promise low computation and inference time, making them appropriate for resource-limited platforms, such as edge devices. However, these methods are mainly trained on human-centric-view datasets, such as Cityscapes and CamVid, which are unsuitable for aerial applications. Furthermore, the feasibility of these models under adverse settings, such as flooding events, is unknown. To address these problems, we train the most recent real-time semantic segmentation architectures on the FloodNet dataset, which contains annotated aerial images captured after Hurricane Harvey. This article comprehensively studies several lightweight architectures, including encoder-decoder and two-pathway architectures, evaluating their performance on aerial imagery. Moreover, we benchmark the efficiency and accuracy of different models on the FloodNet dataset to examine their practicability for aerial image segmentation during emergency response. Some lightweight models attain more than 60% test mIoU on the FloodNet dataset and yield convincing qualitative results. This article highlights the strengths and weaknesses of current segmentation models for aerial imagery under low computation and inference-time requirements. Our experiments have direct applications during catastrophic events, such as flooding.


I. INTRODUCTION
Real-time semantic segmentation models are applicable to unmanned aerial vehicles (UAVs), enhancing situational awareness of aerial scenes. As shown in Fig. 1, the main task is to label each pixel of an aerial image with a corresponding class and generate a segmentation map. For example, pixels corresponding to grass are colored yellow on the segmentation map (see Fig. 1). This process occurs in real time on edge computing devices, such as portable embedded GPU processors mounted on UAVs. Consequently, the segmentation algorithms must be small in size, low power, and computationally efficient.
With the advent of deep learning, convolutional neural networks (CNNs) are prominently employed for scene segmentation. Semantic segmentation with a dilation backbone is the standard layout for many networks, including DeepLabV2 [13], DeepLabV3 [14], PSPNet [104], and DenseASPP [93]. These methods yield high accuracy due to dilated convolutions and wide receptive fields. However, their computational overhead is a bottleneck on UAV systems with limited computing resources. Furthermore, while heavyweight architectures and multiscale convolutions enhance the overall segmentation accuracy of networks, they are time-inefficient, especially during real-time inference. Moreover, they usually have large-scale backbones, leading to excessive parameters and high inference time. Thus, efficient semantic segmentation is essential to leverage the power of deep learning in real-time applications. Real-time semantic segmentation methods are classified into encoder-decoder architectures and two-pathway architectures [38]. Encoder-decoder architectures are the most common lightweight networks for aerial image segmentation in real-time settings. In this approach, the encoder gradually downsamples an original image to capture contextual information and produce a feature map. The decoder upsamples the feature map to generate a segmentation mask. Thus, lightweight encoders and decoders are crucial in implementing efficient networks. Although we may enhance the performance of segmentation tasks by scaling up the encoders, this increases operations, the number of trainable parameters, and inference time.

Fig. 2. Overlay of segmented classes from FloodNet [72] on UAV images captured after Hurricane Harvey with pixel-level annotations for nine different objects. Building-F and Road-F mean flooded building and flooded road, respectively.
Therefore, it is important for real-time semantic segmentation to employ very efficient structures, such as MobileNets [27], [40] or EfficientNets [80], as an encoder and simple multilayer perceptrons (MLPs) as a decoder. In addition to the encoder-decoder paradigm, recent real-time algorithms propose two-pathway architectures, usually comprising a spatial path and a context path [96]. The spatial path captures spatial details of the input image, and the context path extracts sufficient receptive fields, which encode high-level semantic context information. Such bilateral structures merge spatial and contextual information to produce pixel-level predictions.
We train semantic segmentation models on the FloodNet dataset [72]. Annotated high-resolution UAV datasets captured during catastrophic events are not abundant. However, they are essential to explore real-time semantic segmentation performances under adverse situations. Furthermore, high-resolution UAV datasets offer advantages as providing spatial details about the aerial scenes compared to satellite imagery, which obscures minor details of objects. Consequently, UAV imagery datasets are ideal for training the networks to segment damaged areas. In particular, the FloodNet dataset [72] contains aerial images captured at 200 feet above the ground-and pixel-level annotations for nine different classes, including flooded roads and flooded buildings. Fig. 2 shows the overlay of segmented classes on aerial images in the FloodNet dataset. The image on the left-hand side in Fig. 2 demonstrates flooded roads in dark green and flooded buildings in red.
Our main contribution is to develop a pipeline to benchmark two types of the most recent real-time semantic segmentation models on a postdisaster aerial imagery dataset: 1) two-pathway architectures; and 2) encoder-decoder architectures, as shown in Fig. 1. First, we study encoder-decoder models, which include the following methods. 1) UNet-based architectures [73] with lightweight encoders, such as MobileNet [27], [39] and EfficientNet [80]. 2) UNetFormer [86], an efficient transformer-based method for remote sensing semantic segmentation. 3) HarDNet [10], one of the most efficient methods in terms of computation. 4) SegFormer [90], an architecture that uses vision transformers in the encoder for real-time applications. Next, we study two-pathway architectures, which include the following networks. 1) BiSeNetV1 [96], a bilateral segmentation network (BiSeNet) for real-time semantic segmentation. 2) BiSeNetV2 [95], a bilateral network with guided aggregation for real-time semantic segmentation. 3) DDRNet [38], a deep dual-resolution network for real-time and accurate semantic segmentation. 4) PIDNet [91], a proportional-integral-derivative (PID) network, one of the latest architectures evolving from two-pathway models. To the best of our knowledge, this is the first time these models have been benchmarked on the FloodNet dataset [72]. Furthermore, we provide a detailed analysis of the accuracy and efficiency of real-time networks. This work reports accuracy for all classes, including special classes, such as flooded buildings and flooded roads, which are necessary to spot affected areas during flooding events. Most models attain more than 50% test mean intersection over union (mIoU) on the test set (see Table I) and yield convincing qualitative results (see Fig. 7). In addition, we demonstrate the efficiency of different methods in terms of the number of parameters, inference time, and computational complexity (see Table II and Figs. 8-11). Finally, we provide some strategies to improve our models' accuracy and discuss our trained models' generalizability.
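For reference, the mIoU metric reported throughout can be computed from a per-class confusion matrix; a minimal NumPy sketch (the two-class toy label maps are illustrative, not FloodNet data):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Compute per-class IoU and mIoU from integer label maps."""
    # Build the confusion matrix: rows = ground truth, cols = prediction.
    mask = (target >= 0) & (target < num_classes)
    cm = np.bincount(
        num_classes * target[mask].astype(int) + pred[mask],
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    inter = np.diag(cm)                     # true positives per class
    union = cm.sum(0) + cm.sum(1) - inter   # TP + FP + FN per class
    iou = inter / np.maximum(union, 1)
    return iou, iou[union > 0].mean()       # ignore classes absent from both maps

# Toy 2x2 "segmentation maps" with two classes.
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
iou, miou = mean_iou(pred, target, num_classes=2)  # iou = [0.5, 2/3], miou = 7/12
```

The same routine extends to nine classes by changing `num_classes`.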

II. RELATED WORKS

A. Remote Sensing Semantic-Segmentation-Related Works
Remote sensing semantic segmentation has been a fundamental research topic for decades. With the success of deep learning in computer vision, researchers have endeavored to utilize its power in remote sensing image analysis [41], [43], [71], [84]. Many of these works focus on the semantic segmentation of satellite images. Among others, deep learning techniques, such as CNNs and autoencoders, are employed for scene understanding from satellite imagery [99]. Deep fully CNNs (F-CNNs) are applied for road segmentation of satellite images [35]. In addition, new methods and frameworks have been developed for remote sensing semantic segmentation of aerial images. ResUNet is a deep learning framework for semantic segmentation of remotely sensed data [26]. Efficient patchwise semantic segmentation has been proposed for remote sensing images [58]. Symmetrical dense-shortcut deep fully convolutional networks (FCNs) have been developed for semantic segmentation of remote sensing images [11]. In addition, neural networks are exploited for object detection and image segmentation as an adaptive method in Earth observation [37]. Deep learning methods for semantic segmentation of remote sensing imagery have also been applied to nonconventional data, such as hyperspectral images and point clouds [98].
The datasets widely applied for remote sensing semantic segmentation are ISPRS Potsdam [1] and ISPRS Vaihingen [2]. These datasets contain true orthophoto (TOP) images with 2-D semantic labeling and are popular benchmarks for remote sensing segmentation models. RoadNet [57] is a high-resolution remote sensing dataset with pixelwise annotation. LoveDA [85] is a more recent land cover dataset for semantic segmentation and domain adaptation tasks. However, these datasets do not consider the segmentation of postdisaster imagery. Some research focuses specifically on semantic segmentation of postdisaster UAV datasets [15], [21], [30], [54]. There is a comprehensive study of semantic segmentation on UAV imagery for postdisaster damage assessment [18]. Segmentation for disaster analysis and anomaly detection is investigated in a comprehensive survey of deep learning in remote sensing [6]. Also, instance segmentation has been used to evaluate building damage from aerial videos [107]. Despite the extensive research in remote sensing and postdisaster semantic segmentation, these studies do not consider the most recent efficient segmentation models for postflooding damage assessment. Consequently, this article systematically compares real-time semantic segmentation models on remote sensing images of flooding events.

B. Semantic-Segmentation-Related Works
Early studies on semantic segmentation of aerial images employ classical machine learning algorithms and unsupervised learning to segment remote sensing images [8], [25], [33]. Newer unsupervised methods employ self-supervised feature learning for semantic segmentation. For example, PiCIE [51], pixel-level feature clustering using invariance and equivariance, is an unsupervised image segmentation method that tackles the pixelwise classification problem. SegSort [45] is a segmentation method using discriminative sorting of segments to maximize similarity within segments and minimize similarity between segments. Another work [101] uses segments obtained from self-supervised learning and hierarchical grouping. MaskContrast [31] is an unsupervised semantic segmentation method that learns pixel embeddings by contrasting features within and across proposed object masks. Furthermore, autoregressive unsupervised image segmentation [68] maximizes the mutual information between differently constructed views to output clusters related to semantic labels.
Semisupervised semantic segmentation allows a model to learn from a few annotated images and numerous unlabeled ones. Semisupervised semantic segmentation using a generative adversarial network (GAN) [77] produces additional images to improve the features learned by the deep network. In addition, semisupervised semantic segmentation based on adversarial learning [44] increases accuracy by treating the segmentation network as the generator in a GAN framework. Another work [65] uses high- and low-level consistency for semisupervised semantic segmentation. Also, some works use self-training [28] and strong, varied perturbations [29] for semisupervised segmentation. Semisupervised segmentation based on error-correcting supervision [64] and semisupervised segmentation with self-correcting networks [47] are among the recent approaches in this area.
Semantic segmentation models also apply supervised learning to classify pixels into different categories [59], [73], [104]. With the popularity of deep learning, the method of choice for image segmentation has shifted toward deep CNNs used as classifiers. In recent years, countless deep learning models have been introduced to increase the accuracy of semantic segmentation. For example, FCNs [59] significantly enhanced the state of the art by replacing fully connected layers with convolutional layers, enabling dense pixelwise prediction. Furthermore, pyramid pooling-based methods, such as PSPNet [104], achieve substantial improvements in accuracy. Subsequently, architectures with large dilated convolutions attain remarkable performance by increasing effective receptive fields. Therefore, models based on dilation backbones are considered accurate segmentation methods for standard tasks when computational cost and inference time are not determining factors [13], [14], [16], [93], [97], [104]. However, despite the high accuracy achieved by advanced techniques with dilated backbones, these approaches involve complex networks leading to computational overhead and time-consuming inference. On the other hand, real-time aerial-scene segmentation on UAV systems requires low computational cost, low memory consumption, and low latency. Consequently, lightweight segmentation models are indispensable for balancing accuracy and efficiency. Generally, encoder-decoder and two-pathway architectures are the common methods for real-time semantic segmentation [38]. We elaborate on standard real-time semantic segmentation models in the following sections.
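The receptive-field gain from dilation can be checked with simple arithmetic: for stacked stride-1 convolutions, each layer widens the receptive field by (k - 1) times its dilation rate. A small sketch:

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d  # each layer widens the field by (k - 1) * dilation
    return rf

# Three plain 3x3 convolutions vs. the same stack with dilations 1, 2, 4.
plain = receptive_field([3, 3, 3], [1, 1, 1])    # 7
dilated = receptive_field([3, 3, 3], [1, 2, 4])  # 15
```

The dilated stack more than doubles the receptive field with the same number of parameters, which is why dilation backbones are accurate but offer no savings in computation.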

C. Encoder-Decoder-Architectures-Related Works
Real-time encoder-decoder architectures are composed of efficient encoders and lightweight decoders to maintain real-time performance. As shown in Fig. 3, the encoder blocks downsample the original image (for example, to 1/2, 1/4, 1/8, 1/16, and 1/32 of the input image resolution), and the decoder blocks restore the resolution by upsampling the feature maps. Generally, the encoder-decoder structure combines low- and high-level features using skip connections to achieve higher accuracy. One of the first attempts at designing an efficient encoder-decoder architecture is SegNet [4], applying skip connections and transposed convolutions for upsampling. ENet [70] introduces an efficient multistage architecture based on a bottleneck module originating in ResNet [34]. Although ENet increases efficiency, especially in terms of frame rate [i.e., frames per second (fps)], it significantly deteriorates accuracy, yielding about 43% mIoU on FloodNet [72]. ICNet [103] proposes a cascade feature fusion unit combining extracted features to enhance accuracy. SwiftNet [82] upgrades the encoder-decoder design by reducing redundant computations with pixel-adaptive memory.
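A minimal encoder-decoder with one skip connection, in the spirit of the designs above, can be sketched in PyTorch (the layer widths and nine-class head are illustrative choices, not a network evaluated in this article):

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Minimal encoder-decoder: downsample to 1/4, upsample back, one skip."""
    def __init__(self, in_ch=3, num_classes=9):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, stride=2, padding=1),
                                  nn.BatchNorm2d(16), nn.ReLU())   # 1/2 resolution
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1),
                                  nn.BatchNorm2d(32), nn.ReLU())   # 1/4 resolution
        self.dec2 = nn.ConvTranspose2d(32, 16, 2, stride=2)        # back to 1/2
        self.dec1 = nn.ConvTranspose2d(16, 16, 2, stride=2)        # full resolution
        self.head = nn.Conv2d(16, num_classes, 1)                  # per-pixel logits

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        d2 = self.dec2(e2) + e1    # skip connection restores spatial detail
        return self.head(self.dec1(d2))

logits = TinySegNet()(torch.randn(1, 3, 64, 64))  # -> shape (1, 9, 64, 64)
```

The skip connection is what separates this layout from a plain autoencoder: fine spatial detail bypasses the bottleneck and is re-injected before upsampling.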
One of the most versatile encoder-decoder models is UNet [73]. As shown in Fig. 5, the UNet baseline architecture is composed of lightweight encoder blocks, decoder blocks, and concatenation blocks that fuse low- and high-level features. The encoder is called the contracting path, capturing context features from an input image. The contracting path increases feature information but usually loses spatial information due to the gradual downsampling of the feature maps. On the other hand, the decoder in UNet, or expansion path, localizes features and rebuilds the segmentation map. Concatenation blocks supplement this structure, combining low- and high-level features and improving the network's overall performance. Long skip connections pass fine-grained details from encoder to decoder to further optimize the architecture.
Many researchers have adapted the structure of the UNet model for better segmentation accuracy and efficiency. Among these adaptations, Feedback U-Net [76] includes a convolutional LSTM and a feedback process to extract features in two rounds; the features in the second round are extracted based on the features received in the first round. Similarly, recurrent UNet [87] and BConvLSTM U-Net [3] take advantage of the U-Net structure and recurrent networks. Circle-U-Net [79] is an efficient network with residual bottleneck and circle-connect layers that improves accuracy compared to UNet. Moreover, researchers have used UNet as the baseline for the semantic segmentation of remote sensing images. Context-transfer-UNet [55] applies dense boundary blocks to refine features and increase segmentation ability. Deep residual UNet [102] integrates the robustness of residual learning with the UNet structure. UNet++ [106] is a UNet-based structure using densely connected nested decoder subnetworks. It improves the feature extraction and accuracy of the original UNet, especially on medical image segmentation. Some researchers add an attention mechanism to the UNet architecture to enhance detailed feature extraction and feature representation. Attention U-Net [67] uses an attention mechanism to focus on target structures of varying shapes and sizes. Similarly, TransUNet [12] employs transformers to build strong encoders in the UNet architecture. Other works concentrate on computationally more efficient UNet structures [5], [9], [24]. For example, squeeze U-Net [7] is an efficient model that substitutes the upsampling and downsampling layers in U-Net with modules comparable to the fire modules in SqueezeNet [46], leading to low multiply-accumulate operations (MACs) and memory use.
Encoder-decoder architecture allows designing real-time models by incorporating lightweight encoders as a backbone. We applied this adaptability to create different lightweight UNet-based models in this work. There have been many attempts to design an efficient backbone encoder. For example, Xception [17] employs depthwise (Dwise) separable convolutions, which can be used in a lightweight encoder. Similarly, ShuffleNet [100] is another lightweight encoder that uses pointwise group convolution and channel shuffle to enhance network efficiency. Although lightweight encoders reduce the latency and memory usage of segmentation models, information must pass through the deep encoder and then reverse back through the decoder, which increases latency [91]. Besides, some lightweight encoders with Dwise separable convolutions are not well optimized on GPUs, reducing inference speed in spite of fewer FLOPs and parameters. Overall, real-time encoder-decoder semantic segmentation architectures have a wide range of applications in different fields, including autonomous driving [89], medical image segmentation [49], remote sensing, and emergency response for damage assessment [74].

D. Two-Pathway-Architectures-Related Works
Two-pathway architectures claim to increase accuracy while maintaining network efficiency [95], [96]. The main building blocks of these networks are two lightweight encoders or pathways: 1) one pathway obtains semantic information; and 2) the other provides rich spatial details. Fig. 4 shows that the context path downsamples the feature maps (for example, to 1/2, 1/4, 1/8, 1/16, and 1/32 of the input image resolution) to extract context information, and the spatial path downsamples features (for example, to 1/2, 1/4, and 1/8 of the input image resolution) to provide spatial details. The fusion of data from these two parallel paths achieves a good balance between accuracy and speed [38]. BiSeNet [96] includes a spatial path to generate high-resolution features and a context path with an efficient downsampling strategy to enlarge the receptive field. The features obtained from the two branches are combined efficiently to develop a more accurate segmentation map. BiSeNetV2 [95] improves BiSeNet by adding a guided aggregation block and a booster training strategy to enhance performance. The context-aggregated bilateral network [94] is a two-pathway structure with a high-resolution branch to capture spatial detail and a context branch with global aggregation capability. This architecture combines long-range and local contextual dependencies to provide accurate segmentation at low computational cost.
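The bilateral layout can be sketched as two parallel convolutional stacks, one stopping at 1/8 resolution and one reaching 1/32, fused before the classifier (a toy illustration of the idea, not BiSeNet itself; the widths and nine-class head are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn(cin, cout, stride):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU())

class TwoPathNet(nn.Module):
    """Spatial path keeps 1/8 resolution; context path reaches 1/32."""
    def __init__(self, num_classes=9):
        super().__init__()
        self.spatial = nn.Sequential(conv_bn(3, 32, 2), conv_bn(32, 64, 2),
                                     conv_bn(64, 128, 2))           # 1/8
        self.context = nn.Sequential(conv_bn(3, 16, 2), conv_bn(16, 32, 2),
                                     conv_bn(32, 64, 2), conv_bn(64, 128, 2),
                                     conv_bn(128, 128, 2))          # 1/32
        self.head = nn.Conv2d(256, num_classes, 1)

    def forward(self, x):
        sp = self.spatial(x)
        cx = self.context(x)
        cx = F.interpolate(cx, size=sp.shape[2:], mode="bilinear",
                           align_corners=False)   # bring context up to 1/8
        out = self.head(torch.cat([sp, cx], dim=1))
        return F.interpolate(out, size=x.shape[2:], mode="bilinear",
                             align_corners=False)

out = TwoPathNet()(torch.randn(1, 3, 64, 64))  # -> shape (1, 9, 64, 64)
```

Because the wide, deep context path runs on aggressively downsampled maps while the shallow spatial path runs at higher resolution, both paths stay cheap.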
Compared to encoder-decoder architecture, the two-pathway architecture demonstrates higher inference speed and proper characteristics for real-time applications. However, there are still challenges in restoring the lost information during the downsampling process in the two-pathway architecture. First, the multiscale context features from the early layers of the context path are not fully fused, which reduces the accuracy. Besides, the direct fusion of low-level details and high-level semantics harms the accuracy of two-branch models [91].

E. Flood Detection and Segmentation-Related Works
Flooding events are among the deadliest natural disasters, causing severe damage to property and loss of human lives [22], [52], [108]. Thus, automatic remote monitoring methods are quick and cost-effective for rescue teams estimating damage in the early stages of flooding incidents. In addition to classical machine learning algorithms developed for flood detection [48], [60], [62], [78], many deep learning methods have recently been exploited for this purpose. For example, the deep CNN VGG16 was applied to flood identification [50]. Dilated and deconvolutional networks perform flood identification in high-resolution remote sensing images [66]. F-CNNs are employed to map the flooding extent from Landsat satellite images [75]. Similarly, F-CNNs extract flooded areas from UAV imagery [32]. A deep neural network is utilized for flash-flood susceptibility mapping of aerial imagery [20]. However, few studies consider deep networks for the semantic segmentation of flooding images. Various CNNs have been applied to flood image semantic segmentation [69]. A comparative study of segmentation models on flooding events investigates the performance of multiple segmentation networks on flooding datasets [36], [74]. Furthermore, water [53] and river [61] segmentation with deep learning models have been considered for flood monitoring.
A wide range of optical and radar sensors provide valuable remote sensing data for training flood detection methods [60], [62], [63], [88]. However, UAV datasets for supervised flood segmentation are scarce. Therefore, in contrast to previous works on flood detection, our real-time segmentation models are trained on FloodNet [72], which provides a high-resolution UAV dataset for postflood scene understanding.

III. METHODOLOGY
This section describes the architectures of all real-time semantic segmentation models we developed and trained in this work. We divide all methods into encoder-decoder and two-pathway architectures.

Fig. 6. UNet-based architectures with lightweight encoders, such as MobileNet [39] or EfficientNet [80]. The downsampled feature maps are 1/2, 1/4, 1/8, and 1/16 of the original image. The top-right diagram shows that the encoder comprises blocks, so the network can be scaled up and down by adding or removing encoder blocks. The bottom-left block diagram illustrates that UMBV2 [27], [73] uses the inverted residual block and Dwise separable convolutions; the skip connections connect two narrow feature channels, skipping wider channels. The bottom-right diagram illustrates that UMBV3 [39], [73] uses squeeze-and-excitation and the hard swish activation function to increase efficiency.

A. Encoder-Decoder Architecture
1) UNet-Based Architectures: We developed the UNet [73] baseline architecture, identical to the original model. It consists of lightweight encoder blocks for downsampling, decoder blocks for upsampling, concatenation blocks that fuse low- and high-level features, and skip connections that improve accuracy while reducing computation. In addition, we benefit from new lightweight encoders for fewer operations and faster speed. Our main contribution is to benchmark the accuracy and efficiency of UNet using various lightweight encoders. As shown in the top-left diagram in Fig. 6, we designed an asymmetric UNet baseline with lightweight encoders, including MobileNet [27], [39] or EfficientNet [80]. The decoders for all UNet-based models are identical. As shown in Fig. 6, the downsampled feature maps are 1/2, 1/4, 1/8, and 1/16 of the original image. A decoder in the UNet baseline is a list of upsampling blocks consisting of a 2-D convolution layer and batch normalization, followed by a ReLU. These blocks act as the expansive path that upsamples the feature map and halves the number of feature channels. The variation in lightweight encoders highlights the impact of using MobileNets and EfficientNets in the UNet baseline architecture for remote sensing applications. Mainly, these lightweight models allow us to observe the effects of Dwise separable convolutions [40] and inverted residual blocks [27] on the efficiency and accuracy of real-time UNet-based models. To highlight the differences between these models, we explain the three architectures in more detail in the following sections. a) UMBV2: We named the UNet architecture [73] with a MobileNetV2 [27] encoder UMBV2. The main building block of MobileNetV2 is the Dwise separable convolution [40]. It consists of two distinct layers: i) Dwise convolution; and ii) pointwise convolution. Dwise convolution applies a single filter per input channel, and pointwise convolution creates new features by linearly combining the input channels. Dwise separable convolutions generally decrease computation compared to standard convolutional layers [27].
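The savings from the Dwise-plus-pointwise factorization can be verified by counting parameters; a short sketch comparing a standard 3 × 3 convolution with its depthwise-separable counterpart (the channel counts are arbitrary):

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

cin, cout, k = 64, 128, 3
standard = nn.Conv2d(cin, cout, k, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(cin, cin, k, padding=1, groups=cin, bias=False),  # Dwise: one filter per channel
    nn.Conv2d(cin, cout, 1, bias=False),                        # pointwise: mix channels
)

p_std = n_params(standard)   # 64 * 128 * 3 * 3 = 73728
p_sep = n_params(separable)  # 64 * 3 * 3 + 64 * 128 = 8768
```

Here the factorized layer uses roughly 12% of the parameters of the standard convolution; the MAC count shrinks by the same ratio at equal output resolution.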
MobileNetV2 contains inverted residual blocks, or mobile bottleneck convolutions (MBConv), with few parameters, enhancing efficiency. The bottom-left block diagram in Fig. 6 illustrates the inverted residual block, the central element in lightweight MobileNetV2. In this approach, the skip connections connect two narrow feature channels, skipping wider channels. Consequently, MBConv creates low feature dimensions while carrying information from the earlier layers. The output of MobileNetV2, including multiscale feature maps, goes to the decoder blocks and the output segmentation map. We initiated our training using MobileNetV2 pretrained on ImageNet [23] and trained the UMBV2 network on the FloodNet [72] dataset. b) UMBV3: We named the UNet architecture [73] with a MobileNetV3 [39] encoder UMBV3. As with UMBV2, we developed UMBV3 with MobileNetV3 as the contracting path and a list of decoder blocks as the expansive path (as shown in Fig. 6). MobileNetV3 [39] is the upgraded version of MobileNetV2, keeping its main building blocks. It adds squeeze-and-excitation blocks to improve the accuracy of the network without adding extra computational cost. As shown in the bottom-right diagram in Fig. 6, a squeeze-and-excitation network is a convolutional block that modifies each feature map's weights according to its significance. In addition to the squeeze-and-excitation block, MobileNetV3 redesigns some inefficient components to reduce computation. For example, it replaces the sigmoid activation function with a hard swish to build more efficient models. Similar to UMBV2, we initiated our training using MobileNetV3 pretrained on ImageNet [23] and trained the UMBV3 network on the FloodNet [72] dataset. UMBV3 is the most efficient network in terms of the number of parameters. c) UEffB0 and UEffB1: We named the UNet architecture [73] with an EfficientNet [80] encoder UEffB.
We developed UEffB0 and UEffB1, which consist of EfficientNet as the contracting path and a list of decoder blocks as the expansive path. EfficientNet is a family of lightweight encoders that we apply to the UNet baseline architecture. Similar to MobileNets, the primary building blocks of EfficientNet are MBConvs [27]. In addition, squeeze-and-excitation optimization is used in the structure of EfficientNet, similar to MobileNetV3 [39]. The main difference from the previous methods is a compound scaling method that simultaneously scales each convolutional block in the width, depth, and resolution dimensions. EfficientNet allows scaling the network up and down based on available computing resources. For example, EfficientNetB0 has the minimum number of MBConvs, and EfficientNetB1 to EfficientNetB7 scale up the network to increase accuracy. As shown in the top-right block diagram in Fig. 6, the encoder comprises a series of blocks; therefore, we can scale the network up and down by adding or removing encoder blocks. To maintain the efficiency and real-time characteristics of the network, we used EfficientNetB0 and EfficientNetB1 pretrained on ImageNet [23] in this work, because they are the most lightweight networks in the EfficientNet family. Similar to UMBV2 and UMBV3, the outputs of EfficientNet go to the decoder blocks to generate a segmentation map. We trained both the UEffB0 and UEffB1 networks on the FloodNet [72] dataset. 2) UNetFormer: We developed a framework for training UNetFormer [86], an efficient remote sensing semantic segmentation method, on the FloodNet [72] dataset. This architecture is specially designed for real-time urban scene segmentation. As in our other UNet-based architectures, we exploited the lightweight ResNet18 [34] as an encoder. The pretrained ResNet18 provides multiscale semantic feature maps at low computational cost.
Moreover, ResNet18 is a lightweight CNN-based encoder, making the network appropriate for real-time applications. The main novelty of this work is the structure of the decoder. The decoder is a transformer-based architecture that efficiently models global and local information. The decoder's global-local transformer block (GLTB) constructs two parallel branches to extract the global and local contexts. The semantic features generated by the encoder are fused with the features from the GLTB to produce generalized fusion features. In short, the transformer-based decoder captures global and local contexts at multiple scales while maintaining high efficiency. Consequently, our implementation is a real-time architecture containing lightweight ResNet18 as the contracting path and an efficient transformer-based decoder as the expansive path. We trained the network on FloodNet [72] and measured its accuracy and efficiency. Our result shows a 7.89-ms inference time for a 704 × 1056 image, confirming that our UNetFormer implementation is suitable for real-time applications.
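Inference times such as the 7.89 ms reported above are typically measured by averaging timed forward passes after warmup; a minimal sketch (not the benchmarking harness used in this article; the toy convolution stands in for a real network, and on GPU the timed region must be bracketed with torch.cuda.synchronize()):

```python
import time
import torch
import torch.nn as nn

@torch.no_grad()
def measure_latency_ms(model, shape=(1, 3, 704, 1056), warmup=3, runs=10):
    """Average forward-pass time in milliseconds on CPU."""
    model.eval()                        # disable dropout/batch-norm updates
    x = torch.randn(shape)
    for _ in range(warmup):             # warmup stabilizes caches and allocators
        model(x)
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) / runs * 1e3

toy = nn.Conv2d(3, 8, 3, padding=1)     # stand-in for a segmentation network
latency = measure_latency_ms(toy)       # value is hardware-dependent
```

Averaging over several runs, after warmup and with gradients disabled, avoids the common pitfalls of one-shot timing.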
3) HarDNet: HarDNet [10], i.e., the harmonic densely connected network, optimizes the DenseNet [42] structure by sparsifying the shortcuts and reducing the number of intermediate feature maps. The authors of HarDNet argue that an excessive number of feature maps and parameters increases dynamic random-access memory (DRAM) traffic for reading and writing parameters and feature maps. This, in turn, increases power consumption and reduces inference speed, negatively affecting high-resolution applications at the edge. Therefore, HarDNet minimizes the number of shortcuts between blocks, which lowers DRAM traffic. Furthermore, it amplifies essential layers by increasing their number of channels to compensate for the sparsified shortcuts. This method increases efficiency in terms of low MACs and memory traffic while maintaining accuracy. Our HarDNet-70 implementation uses a backbone encoder pretrained on the HELEN [56] dataset. After training the whole network on the FloodNet [72] dataset, our implementation requires 12.56 G multiply-accumulate operations (MACs), confirming the efficiency of the network in terms of MACs.
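MAC counts like the 12.56 GMACs above are obtained by summing per-layer multiply-accumulate operations; for a 2-D convolution the count is (C_in / groups) · C_out · k² · H_out · W_out. A small sketch (the layer shapes are arbitrary examples, not HarDNet layers):

```python
def conv_macs(cin, cout, k, h_out, w_out, groups=1):
    """Multiply-accumulate operations of one 2-D convolution layer."""
    return (cin // groups) * cout * k * k * h_out * w_out

# A standard 3x3 convolution, 64 -> 128 channels, on a 128x128 feature map:
standard = conv_macs(64, 128, 3, 128, 128)              # 1,207,959,552 (~1.21 GMACs)

# The depthwise variant of the same layer (one filter per channel):
depthwise = conv_macs(64, 64, 3, 128, 128, groups=64)   # 9,437,184 (~0.009 GMACs)
```

Summing `conv_macs` over every layer of a network yields the GMAC figures used in the efficiency comparison.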

4) SegFormer:
We developed a framework for training SegFormer [90], which redesigns the encoder-decoder structure by using a hierarchical transformer as the encoder and a lightweight MLP as the decoder. The encoder comprises mix transformer (MiT) encoders, i.e., MiT-B0 to MiT-B5, allowing the encoder to obtain multilevel features to generate the segmentation mask. Furthermore, this structure enables the MLP decoder to combine local and global attention to generate effective representations. The hierarchical transformer encoder has a larger effective receptive field than conventional CNN encoders, permitting a lightweight MLP as the decoder. This model enables network scalability such that we can change the transformer layers according to our resources. In this work, we avoid using the larger networks MiT-B1 to MiT-B5; instead, we use MiT-B0 to improve efficiency in real-time applications. Our implementation initiated training using the MiT-B0 block pretrained on the ADE20K [105] dataset. After training the network with only the MiT-B0 encoder on the FloodNet [72] dataset, we call the network SegFormerB0 and measure its accuracy and efficiency. This network highlights the impact of using MiT encoders to increase the accuracy of real-time networks. Furthermore, our results demonstrate that SegFormerB0 is the most accurate method among all trained models in this work, emphasizing the impact of MiT-B0 in improving accuracy while maintaining efficiency.
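The lightweight all-MLP decoder described above can be sketched as a per-level linear projection followed by upsampling and fusion (the stage widths are assumed here to match MiT-B0's published widths of 32, 64, 160, and 256; the toy feature sizes and nine-class head are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPDecoder(nn.Module):
    """Project each pyramid level with a linear layer, upsample, concat, fuse."""
    def __init__(self, in_chs=(32, 64, 160, 256), embed=64, num_classes=9):
        super().__init__()
        # A 1x1 convolution acts as a per-pixel linear (MLP) projection.
        self.proj = nn.ModuleList(nn.Conv2d(c, embed, 1) for c in in_chs)
        self.fuse = nn.Conv2d(embed * len(in_chs), embed, 1)
        self.head = nn.Conv2d(embed, num_classes, 1)

    def forward(self, feats):                  # feats: fine-to-coarse pyramid
        size = feats[0].shape[2:]              # upsample everything to the finest map
        ups = [F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.head(self.fuse(torch.cat(ups, dim=1)))

feats = [torch.randn(1, c, 64 // s, 64 // s)
         for c, s in zip((32, 64, 160, 256), (1, 2, 4, 8))]
logits = MLPDecoder()(feats)   # -> shape (1, 9, 64, 64)
```

Because the heavy lifting (large receptive fields) is done by the transformer encoder, the decoder can stay this simple without sacrificing accuracy.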

B. Two-Pathway Architecture
Although encoder-decoder architectures are feasible for real-time semantic segmentation, they lose partial information during repeated downsampling, and this information is hard to restore by upsampling. The two-pathway architecture, in contrast, claims to improve the performance of real-time semantic segmentation [38], [95], [96]. Since semantic information and rich spatial details are fused, this architecture achieves a good tradeoff between accuracy and speed [38]. This section explains two-pathway methods, including BiSeNetV1 [96], BiSeNetV2 [95], DDRNet [38], and PIDNet [91]. Note that PIDNet has three branches, but its basic architecture is similar to the two-pathway structure; consequently, we discuss its design under the two-pathway architecture.
1) BiSeNetV1: We developed a framework to train BiSeNetV1 [96], which consists of a spatial path, a context path, and a feature fusion module that combines the features of the two paths. The spatial path preserves rich spatial information by generating output feature maps at 1/8 of the original image resolution (see Fig. 4). The large spatial size of the feature maps encodes more spatial information, leading to high accuracy. The context path enlarges the receptive field while keeping computation efficient: it contains a lightweight model that downsamples the feature map, plus global average pooling layers that maximize the receptive field. An attention refinement module refines the output feature of each stage in the context path; it captures global context information by computing an attention vector that guides feature learning. Finally, the feature fusion module fuses the high-level features of the context path with the low-level features of the spatial path, combining and balancing the different feature scales of the two paths. After implementing and training BiSeNetV1 on FloodNet [72], BiSeNetV1 shows a lower inference time than most encoder-decoder structures, confirming the real-time characteristics of two-pathway architectures.
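As a minimal sketch of the spatial path described above, three stride-2 convolution blocks reduce the input to 1/8 resolution while keeping the feature maps wide; the exact channel widths here are illustrative assumptions, not the BiSeNetV1 reference implementation.

```python
import torch
import torch.nn as nn

def conv_bn_relu(c_in, c_out, stride):
    # Standard conv-batchnorm-ReLU block used throughout the sketch.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class SpatialPathSketch(nn.Module):
    """Hypothetical spatial path: three stride-2 blocks yield a wide
    feature map at 1/8 of the input resolution (channel widths assumed)."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            conv_bn_relu(3, 64, 2),    # 1/2 resolution
            conv_bn_relu(64, 128, 2),  # 1/4 resolution
            conv_bn_relu(128, 256, 2), # 1/8 resolution
        )

    def forward(self, x):
        return self.layers(x)
```

The wide output (256 channels at 1/8 scale in this sketch) is what lets the fusion module recover spatial detail lost in the heavily downsampled context path.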
2) BiSeNetV2: We developed a framework to train BiSeNetV2 [95], a two-pathway architecture comprising a detail branch and a semantic branch, with an aggregation layer that combines the information from the two branches. The detail branch extracts spatial data and low-level details through wide channels and shallow layers [95]. The high number of channels helps the detail branch capture considerable spatial information, and the shallow layers allow it to extract low-level features. In contrast, the semantic branch contains a lightweight and deep network. The lower number of channels in the light semantic branch promotes fast downsampling, while the deeper network enlarges the receptive field to acquire the high-level semantics of an input tensor. The aggregation layer is a convolutional layer that fuses the detail and semantic information. A bilateral guided aggregation layer is proposed to aggregate the feature responses of the two branches: it uses the semantic branch's high-level information to guide the detail branch's feature responses. Since different scales of guidance extract different feature representations, multiscale information can be obtained from this guidance approach. It is worth noting that the critical building block of BiSeNetV2 is the gather-and-expansion (GE) layer. It combines a convolution layer that expands features into a higher dimensional space, a depthwise convolution, and a projection layer that maps the output of the depthwise convolution back to a lower number of channels. Although the GE layer is more complicated than the inverted bottleneck in MobileNetV2, it is very efficient in computation and memory access costs. Furthermore, BiSeNetV2 enhances segmentation accuracy through a booster training strategy: auxiliary heads boost the feature representation during training but are discarded during segmentation inference.
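The GE layer described above can be sketched as follows: a 3x3 convolution expands the channels, a 3x3 depthwise convolution gathers spatial context cheaply, and a 1x1 projection restores the input width before a residual connection. This is a simplified stride-1 sketch with an assumed expansion ratio of 6, not the official BiSeNetV2 block.

```python
import torch
import torch.nn as nn

class GELayerSketch(nn.Module):
    """Hypothetical gather-and-expansion layer (stride 1, expansion assumed 6):
    expand -> depthwise gather -> 1x1 project, with a residual shortcut."""
    def __init__(self, channels=64, expand=6):
        super().__init__()
        mid = channels * expand
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 3, padding=1, bias=False),  # expansion
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            # groups=mid makes this a depthwise convolution.
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid),
            nn.Conv2d(mid, channels, 1, bias=False),             # projection
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))
```

The depthwise step is why the layer stays cheap despite the 6x channel expansion: its cost grows with `mid`, not `mid` squared.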
After implementing and training the model on FloodNet [72], BiSeNetV2 shows a lower inference time than most encoder-decoder structures. However, its accuracy does not improve over BiSeNetV1 [96]: the mIoU of BiSeNetV2 is lower than that of BiSeNetV1. This is not surprising, since we did not implement BiSeNetV2-Large in this work. Consistent with our result, BiSeNetV2 also shows lower accuracy than BiSeNetV1 when trained on Cityscapes [19] in the original work [95].
3) DDRNet: We developed a framework for training deep dual-resolution networks [38], a family of architectures that can be employed for real-time segmentation. The baseline model is a simplified version of HRNet [83], such that one branch captures high-resolution feature maps and the other branch enlarges the receptive field. This compact variant of HRNet improves inference speed and reduces memory consumption. The basic building blocks of both branches are residual blocks ending with a bottleneck block, improving the representation capability. In addition, this architecture introduces bilateral fusion, which combines high- and low-resolution features. Finally, a deep aggregation pyramid pooling module (DAPPM) is supplemented in this architecture to enrich the context extraction without a considerable effect on inference speed. In summary, one branch extracts high-resolution features and the other captures low-resolution features; these two sets of features are concatenated to improve the accuracy and overall performance of the real-time semantic segmentation method. We trained DDRNet-23 on the FloodNet [72] dataset; however, we did not initialize the parameters with a pretrained model, which contributed to lower accuracy. The results of our implementation show that DDRNet-23 is the most efficient network in terms of inference speed.

4) PIDNet: We developed a framework for training PIDNet [91], i.e., the proportional-integral-derivative network, a real-time network that modifies two-pathway architectures by adding a third branch. The authors of PIDNet suggest that the two-pathway architecture behaves as a PI controller, such that the direct fusion of low-level details and high-level semantics causes overshoot issues, meaning object boundaries are corroded. To alleviate this problem, PIDNet employs a boundary attention branch to guide the aggregation of the detail and context branches. In other words, PIDNet is composed of three branches that parse detail, context, and boundary information. The newly added boundary branch extracts high-frequency features to predict boundary regions. Also, a pixel-attention-guided fusion module is added to the detail branch to selectively learn useful semantic features from the context branch. Furthermore, the parallel aggregation pyramid pooling module (PAPPM) in PIDNet-s concatenates multiscale pooling maps; the PAPPM is faster than the DAPPM proposed in [38]. Finally, a boundary-attention-guided fusion module fuses the features of the three branches, balancing their contributions. We trained PIDNet-s on the FloodNet [72] dataset; however, we did not initialize the parameters with a pretrained model, which contributed to lower accuracy. Similar to the other two-pathway architectures, the results of our implementation show that the inference time of PIDNet-s is lower than that of most encoder-decoder structures.

IV. DATASET DESCRIPTION
FloodNet [72] is a pixelwise annotated dataset comprising 2343 images at 3000×4000 resolution. The images were captured after Hurricane Harvey, a category 4 hurricane that struck Texas and Louisiana in August 2017, and reflect the natural state of flooded areas during a disaster. DJI Mavic Pro quadcopters captured the high-resolution images 200 feet above ground level with a spatial resolution of 1.5 cm [72]. The images were captured at different geographical locations, including Missouri City, Katy, Sugar Land, Needville, Rosenberg, Wallis, and Richmond. Some special classes, such as flooded roads and buildings, are labeled, which is necessary for damage estimation of impacted areas. In addition, labeling the water and flooded areas allows differentiating between natural water and flood water. As shown in Fig. 2, the semantic segmentation labels are categorized into nine classes: flooded building, nonflooded building, flooded road, nonflooded road, water, tree, vehicle, pool, and grass (i.e., regions containing grass) [36], [72]. FloodNet captures the diversity and complexity of segmenting aerial objects captured by drones. It also highlights current challenges regarding the feasibility of semantic segmentation under challenging conditions. We evaluate the results of efficient semantic segmentation models on FloodNet, focusing on image segmentation for UAV systems.

A. Train Setting
For all models, the FloodNet dataset [72] is split into 1791 pixel-level annotated images for training, 317 images for validation, and 235 images for testing. We adopt the AdamW optimizer with a base learning rate of 0.001, a weight decay of 0.0001, and cross-entropy loss as the loss function. For image augmentation, we use scaling, random horizontal and vertical flipping, 20% grid distortion, random brightness, and Gaussian noise. All models are trained for 40 epochs. The batch size for all models is 3, except for UMBV3, UEffB0, and UEffB1, for which we use a batch size of 2. We resize the images to 704×1056 during training. We train all models on a single system with an NVIDIA GeForce RTX 3090 GPU and an Intel Core i9 processor. In short, apart from the batch sizes of UMBV3, UEffB0, and UEffB1, all training settings are identical, and results are reported on a single system.
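A minimal PyTorch sketch of the optimizer, loss, and epoch loop described above is shown below. The model and data loader are placeholders, and the helper names are illustrative; this is not our released training code.

```python
import torch
import torch.nn as nn

def make_training_objects(model):
    """Optimizer and loss matching the stated settings:
    AdamW, lr=0.001, weight decay=0.0001, cross-entropy loss."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()
    return optimizer, criterion

def train_one_epoch(model, loader, optimizer, criterion, device="cpu"):
    """One epoch over (image, mask) batches.
    images: (B, 3, H, W) floats; masks: (B, H, W) integer class indices."""
    model.train()
    for images, masks in loader:
        images, masks = images.to(device), masks.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), masks)  # per-pixel cross entropy
        loss.backward()
        optimizer.step()
```

In practice, the loop would run for 40 epochs with augmented 704×1056 inputs and batch size 3 (or 2 for the larger UNet-based models).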

B. Measure of Segmentation Performance
We report two metrics to evaluate the performance of semantic segmentation models: 1) pixel accuracy (PixAcc); and 2) mIoU. We compute true positive (TP), true negative (TN), false positive (FP), and false negative (FN) values to measure these metrics. The TPs denote pixels correctly predicted as belonging to a class. The TNs are pixels correctly predicted as not belonging to a specific class. Pixels that belong to a class but are mispredicted as a different class are denoted by FNs. Finally, pixels falsely predicted as belonging to the class are denoted by FPs.
1) Pixel Accuracy: It is the percentage of pixels that are correctly classified in the output segmentation mask. We measure PixAcc globally across all classes. The PixAcc over all categories is calculated using the following formula:

\mathrm{PixAcc} = \frac{\sum_{i=1}^{k} \mathrm{TP}_i}{\sum_{i=1}^{k} (\mathrm{TP}_i + \mathrm{FN}_i)}

where k is the total number of classes and i denotes each class.
2) mIoU: The intersection over union (IoU) measures the pixel-level similarity between the generated mask and the ground truth using the following formula:

\mathrm{IoU}_i = \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i}.

The mIoU on the FloodNet dataset [72] is our main accuracy metric for semantic segmentation. We compute the IoU for all nine classes: 1) flooded building; 2) nonflooded building; 3) flooded road; 4) nonflooded road; 5) water; 6) tree; 7) vehicle; 8) pool; and 9) grass (see Table I). Then, we measure the mIoU by averaging over the classes:

\mathrm{mIoU} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{IoU}_i.

We apply the abovementioned metrics to evaluate and compare the performance of the implemented segmentation models.
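Both metrics can be computed from a single confusion matrix, as the following NumPy sketch shows (the function names are ours, for illustration):

```python
import numpy as np

def confusion_matrix(pred, gt, k):
    """k x k confusion matrix; rows = ground truth, cols = prediction."""
    mask = (gt >= 0) & (gt < k)
    return np.bincount(k * gt[mask].astype(int) + pred[mask],
                       minlength=k * k).reshape(k, k)

def pixel_accuracy(cm):
    # Global PixAcc: correctly labeled pixels over all pixels.
    return np.diag(cm).sum() / cm.sum()

def miou(cm):
    # Per-class IoU = TP / (TP + FP + FN), averaged over the k classes.
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)  # guard against empty classes
    return iou.mean()
```

Accumulating one confusion matrix over the whole test set and reading both metrics from it avoids averaging artifacts across images.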

C. Measure of Efficiency
To measure the efficiency of our models, we used three metrics: 1) the number of trainable parameters; 2) computational complexity (MAC); and 3) inference time (see Table II).

1) Parameters: The total number of learnable parameters in a feedforward neural network is one metric used in this article to quantify efficiency.
2) MAC: The multiply accumulate (MAC) measures computational complexity based on the number of multiply-add operations in neural networks.
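For a convolution layer, the MAC count follows directly from the layer shape, as in this illustrative helper (a back-of-the-envelope formula, not a profiler such as the one used for Table II):

```python
def conv2d_macs(h_out, w_out, c_in, c_out, k, groups=1):
    """MACs for one 2-D convolution: each of the h_out*w_out*c_out output
    elements accumulates (c_in/groups)*k*k products."""
    return h_out * w_out * c_out * (c_in // groups) * k * k
```

Setting `groups=c_in` (depthwise convolution) shows why lightweight encoders achieve low MAC counts: the per-output cost drops from `c_in*k*k` to `k*k`.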
3) Time: The inference time is measured on a single RTX 3090 GPU with CUDA 11.1, CUDNN 8.1, and PyTorch 1.9.0. After running some dummy examples to initialize the GPU, we run the same network 300 times with an input resolution of 704×1056 and a batch size of 3, and report the average time. Standard video streams at 24 f/s; consequently, if a model processes an image in less than 41 ms, we consider it a real-time model.
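The timing protocol above can be sketched as a small harness (names and defaults are illustrative; on a GPU one would additionally synchronize the device, e.g., with torch.cuda.synchronize(), before reading the clock):

```python
import time

def average_inference_ms(fn, warmup=10, runs=300):
    """Warm up with dummy calls, then return the mean wall-clock latency
    of `runs` invocations of `fn`, in milliseconds."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / runs

def is_realtime(latency_ms, fps=24):
    # 24 f/s leaves a budget of 1000/24, roughly 41.7 ms per frame.
    return latency_ms < 1000.0 / fps
```

Warmup matters on GPUs because the first calls include kernel compilation and memory-allocator startup costs that would otherwise inflate the average.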

VI. RESULTS AND ANALYSIS
Tables I and II summarize the results for performance and efficiency, respectively. The qualitative results of different models are provided in Fig. 7. This section discusses our experiments, results, and some of the strategies to increase the accuracy of our models.

A. Results of Performance
The performance table (see Table I) provides the IoU for all nine classes of the FloodNet dataset [72]. Each model's overall performance and accuracy are reported in the mIoU and PixAcc columns. As shown in Fig. 7, it is evident that encoder-decoder and two-pathway architectures are feasible baselines for aerial imagery segmentation. SegFormerB0 [90] is an efficient method that outperforms all other real-time counterparts in terms of mIoU with 61.6% and a PixAcc of 89.5%. Notably, SegFormerB0 is the only model that exploits a vision transformer encoder. Our results confirm the importance of vision transformers in producing local attention and enlarging the effective receptive field, thus increasing accuracy. All other CNN models require a context module to increase the receptive field, which adds to their complexity [93]. Consequently, our experiments demonstrate that using vision transformers is rewarding for enhancing network accuracy.
UNet-based architectures, including UMBV2 [27], [73], UMBV3 [39], [73], UEffB0 [73], [80], and UEffB1 [73], [80], are among the most popular models for real-time applications. UMBV2 yields 52.4% mIoU and 86.6% PixAcc using 6.63-M parameters. This result is similar to UMBV3 and UEffB0, with 50.7% and 50.4% test mIoU and 84.7% and 83.8% PixAcc, respectively. Hence, to obtain a more lightweight architecture, we can replace the lightweight encoders EfficientNetB0 or MobileNetV2 with the newer, more efficient MobileNetV3. UMBV3 yields almost the same performance using only 2.9-M parameters; accordingly, UMBV3 offers a lower number of trainable parameters for resource-limited platforms. We can also increase segmentation performance by scaling up the UEffB0 encoder with a larger variant of EfficientNet. For example, the larger EfficientNetB1 encoder raises the performance of UEffB1 to 56.3% mIoU and 84.7% PixAcc, using 8.76-M parameters. In contrast, scaling down a UNet-based network harms accuracy; for example, the UNetFormer structure is scaled down for higher speed, lowering the accuracy to 47.2% mIoU and 83.0% PixAcc. Finally, the HarDNet-70 [10] architecture yields 57.9% mIoU and 84.8% PixAcc using 4.1-M parameters. HarDNet-70 is one of the most accurate models in our table while maintaining efficiency in terms of MACs and memory traffic.
Two-pathway architectures reduce the latency of the network by compromising accuracy. For example, PIDNet-s [91], DDRNet-23 [38], and BiSeNetV2 [95], with 41.8%, 46.1%, and 49.0% test mIoU, respectively, are among the least accurate methods. As shown in Table I, these models also yield lower PixAcc values (as low as 80.6%). These results illustrate the tradeoff between accuracy and efficiency when deciding on real-time semantic segmentation methods: two-pathway architectures are not the most accurate models for aerial image segmentation, but they reduce network latency and show the lowest inference times (see Table II). In summary, the segmentation experiments (see Table I) show that encoder-decoder architectures are more accurate baselines for the FloodNet dataset [72]. We attempted to train two-pathway architectures, but they yield lower accuracy; on the other hand, they provide high-speed networks (see Fig. 11). Fig. 7 shows the qualitative results, where the first row is the original aerial image and the second row is the ground truth segmentation mask. The remaining rows demonstrate the segmentation output of the real-time models. As shown in the last column of Fig. 7, real-time segmentation of images with large structured objects, such as rivers, grass, and trees, generates more accurate results. Similarly, the first column of Fig. 7 illustrates that nonflooded buildings are segmented well by real-time models. The second and third columns demonstrate that current real-time deep networks can detect flooded buildings (in red) and flooded roads (in olive green). Although segmenting flooded buildings across different images is feasible, distinguishing between flooded and nonflooded buildings in the same image is still challenging, as shown in the second column of Fig. 7. Table II gives our efficiency results, including the number of parameters, multiply-accumulate operations (MAC), and inference time (Time).
In Table II, the three efficiency metrics are denoted by params, MAC, and time. Also, Fig. 8 is a 3-D diagram providing a comprehensive performance evaluation of the different models, expressing overall accuracy versus efficiency. Higher points in the coordinate plane correspond to more accurate models, such as SegFormerB0 [90] or HarDNet [10]. Also, points on the left-hand side of the coordinate plane are faster than those on the right-hand side; for example, DDRNet [38] is much faster than UEffB1 [73], [80]. Finally, points closer to the front of the diagram are computationally more efficient, such as HarDNet [10], DDRNet [38], and PIDNet-s [91].

B. Results of Efficiency
Fig. 9 offers a clearer visualization of MACs versus mIoU. The following three facts are evident in Fig. 8 and Table II: 1) DDRNet [38] and PIDNet [91] are very efficient but do not show high accuracy on FloodNet [72]; 2) UNet-based models are not very efficient compared to the recent models; and 3) HarDNet [10] and SegFormerB0 [90] demonstrate above-average accuracy and efficiency on FloodNet. Figs. 9-11 show the performance of each model against the three efficiency metrics, where red dots represent two-pathway architectures and blue dots represent encoder-decoder architectures. As given in Table II and Fig. 9, HarDNet-70 [10] is the most efficient method in terms of computational complexity and demonstrates the lowest MAC count among all methods; the sparsification technique and the reduced number of feature maps in its architecture lead to lower MACs. Moreover, Fig. 9 shows the efficiency of DDRNet-23 [38] and PIDNet-s [91] in computational complexity, owing to their thin information flows. Regarding the number of parameters, Fig. 10 shows that UMBV3 [39], [73] has the lowest number of trainable parameters, rooted in the very lightweight MobileNetV3 [39]. In contrast, two-pathway structures have larger numbers of trainable parameters. Figs. 10 and 11 show that UNet-based architectures, including UMBV2 [27], [73], UMBV3, UEffB0, and UEffB1 [73], [80], have small numbers of trainable parameters but slower inference times. This confirms that lightweight encoders with depthwise separable convolutions do not optimize well on GPUs despite having fewer parameters. On the other hand, Fig. 11 indicates that two-pathway architectures, such as BiSeNetV1 [96], BiSeNetV2 [95], DDRNet-23, and PIDNet-s [91], significantly reduce network latency.

C. Improve Accuracy by Transfer Learning
One of the main benefits of encoder-decoder architectures is that they allow transfer learning using pretrained lightweight encoders. To observe the impact of pretrained encoders on segmentation performance, we trained HarDNet-70 [10] and SegFormerB0 [90] on the FloodNet dataset [72] with and without pretrained encoders. As given in Table III, these architectures yield 16.4% and 29.2% mIoU, respectively, when the encoders are not pretrained. We repeated the experiment for HarDNet-70 and SegFormerB0 with backbone encoders pretrained on the HELEN [56] and ADE20K [105] datasets, respectively; the results increased to 57.9% and 61.6% test mIoU.
Accordingly, in this research, we employed MobileNetV2 [27], MobileNetV3 [39], and EfficientNet [80] encoders pretrained on the ImageNet [23] dataset. This is one of the advantages of encoder-decoder architectures, leading to higher aerial segmentation accuracy through transfer learning.
Furthermore, pretrained encoders can elevate the accuracy of aerial image segmentation. The MiTs in SegFormer are pretrained on ADE20K [105], the HarDNet [10] encoder is pretrained on the HELEN [56] dataset, and the encoders used in the UNet-based architectures are pretrained on the ImageNet [23] dataset. Thus, we can transfer learning to aerial segmentation from encoders trained on different datasets, such as ImageNet, ADE20K, and HELEN, which helps enhance the accuracy of our networks. In contrast, the DDRNet-23 [38] and PIDNet-s [91] models, which use neither pretrained backbones nor transfer learning, produce the lowest accuracy (mIoU) among all methods, confirming the impact of transfer learning on accuracy. The amount of learning transferred from these datasets may vary with the size of the dataset and the types of objects in the training images; quantifying how much learning transfers to aerial image segmentation is a research question that this work does not touch upon and remains for future studies.
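The mechanics of such transfer learning are simple: the encoder's weights are initialized from a pretrained checkpoint while the decoder remains randomly initialized. The sketch below assumes a model with an `encoder` attribute and a checkpoint path; both names are placeholders for illustration.

```python
import torch
import torch.nn as nn

def load_pretrained_encoder(model: nn.Module, checkpoint_path: str):
    """Initialize only the encoder from a pretrained state dict.

    strict=False tolerates keys present in the checkpoint but absent from
    the encoder (and vice versa), e.g., classifier heads from the
    pretraining task; the returned key lists report any mismatch.
    """
    state = torch.load(checkpoint_path, map_location="cpu")
    incompatible = model.encoder.load_state_dict(state, strict=False)
    return incompatible.missing_keys, incompatible.unexpected_keys
```

After loading, the whole network is fine-tuned on FloodNet; inspecting the returned key lists is a cheap sanity check that the checkpoint actually matched the encoder.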

D. Improve Accuracy by Increasing Spatial Information
To increase the accuracy of efficient models, we must extract sufficient spatial information, which requires either a deeper network or larger receptive fields. As given in Table I, adding the deeper EfficientNetB1 encoder improves UEffB1 compared to UEffB0 [73], [80]. In addition, the vision transformer in SegFormerB0 [90] enlarges the effective receptive field, leading to the best accuracy among all models. Both strategies increase the spatial information available to the network and hence its accuracy. To further examine the impact of increasing spatial information on accuracy, we upsample the input image and compare the accuracy of a network under two different input sizes: we train UEffB0 and UEffB1 using input sizes of 704 × 1056 × 3 and 1024 × 2048 × 3. As given in Table IV, the accuracies of UEffB0 and UEffB1 increase from 50.4% and 56.3% to 57.5% and 59.4%, respectively. Although the larger input size increases the computational complexity of the networks, it prevents information loss and increases the accuracy of some of the networks.

E. Generalizing Segmentation Models
One of the main characteristics of real-time models trained on the FloodNet [72] dataset is their ability to generalize to flooded areas. This is important for real-time applications that must discover damaged regions during catastrophic events. Furthermore, the FloodNet dataset contains exceptional classes of objects, such as flooded buildings and flooded roads; consequently, the ability to segment flooded buildings and roads is essential for the trained models. We demonstrate the networks' ability to segment flooded areas in Figs. 12 and 15.
In addition, as given in Table I, the models achieve high accuracy for the water and grass classes, reaching 74.4% IoU for water and 85.5% IoU for grass. We split the dataset based on the distribution of the different classes; for example, we maintained a similar number of flooded roads in the training, validation, and test sets. However, the splits are not based on geographical location. Figs. 13 and 14 show tests of our models on randomly sampled images. The qualitative results of our models on random images, and even videos, show that models trained on FloodNet generalize to images captured from different angles.
Examining the accuracy on small or deformed objects in Table I indicates that there is still a gap between the segmentation of damaged objects and standard objects. In other words, recent segmentation models generalize better for structured items, such as intact buildings and roads, than for damaged objects, such as flooded buildings or roads. In addition, Figs. 13 and 14 show that real-time models generalize well across different viewing angles. Hence, they can help UAVs enhance situational awareness, providing a pixel-level understanding of the environment in which flying vehicles operate.

VII. CONCLUSION
This article investigated efficient architectures, including encoder-decoder and two-pathway architectures, trained on the FloodNet [72] dataset. We trained 12 real-time models on the aerial imagery dataset and obtained qualitative and quantitative results for all of them. We benchmarked the efficiency and accuracy of all models on FloodNet to examine their feasibility during emergency response for aerial image segmentation. To the best of our knowledge, it is the first time these models have been benchmarked on the FloodNet dataset. Moreover, we trained the SegFormerB0 [90], DDRNet-23 [38], and PIDNet-s [91] architectures on aerial imagery datasets for the first time. Our results demonstrate that two-pathway architectures provide faster inference; DDRNet-23 and PIDNet-s have the highest inference speed. The encoder-decoder architecture UMBV3 [39], [73] provides the lowest number of learnable parameters, and HarDNet [10] shows the lowest computational cost. SegFormerB0 [90] manifests the highest accuracy, indicating the importance of using vision transformers to enlarge the effective receptive field and enhance network accuracy. In short, HarDNet and SegFormerB0 demonstrate a good balance between accuracy and efficiency on FloodNet. We also examined strategies to improve the accuracy of segmentation models: scaling up the encoder, transfer learning from other datasets, and increasing spatial information. Our experiment has direct applications during natural disasters, such as flooding events.