Attentional Dense Convolutional Neural Network for Water Body Extraction From Sentinel-2 Images

Monitoring water bodies from remote sensing data is an essential task for supervising the actual condition of the available water resources, with applications in environmental conservation, sustainable development, and many other fields. Although Sentinel-2 images are among the most attractive data for this purpose, existing traditional index-based and deep learning-based water extraction methods still have important limitations when dealing with large heterogeneous areas, where many types of water bodies with different spatial-spectral complexities are to be expected. In this scenario, the optimal feature abstraction and neighborhood information may vary from one water pixel to another; however, existing methods are generally constrained by a fixed abstraction level and a fixed amount of land cover context. To address these issues, this article presents a new attentional dense convolutional neural network (AD-CNN) especially designed for water body extraction from Sentinel-2 imagery. On the one hand, the AD-CNN exploits dense connections to uncover deeper features while simultaneously characterizing multiple data complexities. On the other hand, the proposed model also implements a new residual attention module that dynamically focuses on the most relevant spatial-spectral features for classifying water pixels. To test the performance of the AD-CNN, a new water database of Nepal (WaterPAL) is also built. The conducted experiments reveal the competitive performance of the proposed architecture with respect to several traditional index-based and state-of-the-art deep learning-based water extraction models.

sis [1], [2], [3], [4]. In this way, monitoring water bodies becomes an essential task for supervising the actual condition of the available water resources, along with environmental conservation and sustainable development [5]. Its relevance is such that even small changes in the water distribution may have a huge impact on human lives, causing soil subsidence, inland inundation, and health hazards, among other critical issues. Besides, water is also an integral part of the thematic and topographic maps used for many different purposes. Under this scenario, timely updated data are required to effectively monitor water bodies, which tend to change over time, unlike more stable structures such as buildings or roads [6]. Unfortunately, this demand is difficult to cover using time-consuming in situ procedures, especially in developing countries [7].
With the expansion of remote sensing technologies, different satellites and constellations were designed to satisfy the regular provision of multispectral Earth observation data, which are particularly useful for water monitoring [8]. From the Moderate-Resolution Imaging Spectroradiometer (MODIS) [9] and Landsat [10] to many other open and commercial satellites (e.g., Sentinel, RapidEye, ZY-3, EnviSat, Corona, radar satellite (RADARSAT), Gaofen, etc.), multiple Earth observation data are available for analysis [11]. Among all the available alternatives, Sentinel-2 has proven to be one of the most suitable missions for the accurate detection of water bodies because of the advantages of its imaging products [12]: free availability, 13-band spectral resolution, and high spatial resolution of up to 10 m. Different water detection works published in the literature exemplify this fact, e.g., [13], [14], [15].
In general, two dominant trends can be identified when it comes to automatic water body extraction from remote sensing images [16]: traditional index-based methods and deep learning-based techniques. Despite their simplicity, which rests on spectral indices and thresholding, traditional water extraction approaches may have important constraints in accurately distinguishing water from snow, mountains, buildings, and shadows due to the inherent limitations of pixel-wise computations [17], [18]. Auxiliary data such as a digital elevation model (DEM) may help to relieve some of these issues [19], [20]. However, how to choose the most suitable threshold value to extract even small water bodies is still a major problem [21]. As a result, traditional approaches are often not the best solution at global scales since they are unable to integrate the shape and texture information characteristic of water pixels.
In contrast, deep learning-based methods take advantage of convolutional neural networks (CNNs) to uncover more discriminating spatial-spectral features for the better identification of water bodies [22]. In this respect, different CNN technologies have been successfully exploited, with the classification scheme being one of the most general mapping frameworks. For example, Pu et al. [23] propose a hierarchical CNN for water-quality classification. Analogously, Rezaee et al. [24] build a two-level network that also exploits high-level water features. Chen et al. [25] adopt an adaptive pooling to better preserve water context and boundary information. Other works also propose different multiresolution schemes for further improving the generalization capabilities of CNN-based features for water classification [26], [27].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Despite all the conducted research, there are still important challenges in terms of the abstraction level of the uncovered water features, which depends on the network design. At larger depths, deep learning models tend to suffer from the vanishing gradient problem, which rapidly degrades the learning process and, hence, the quality of the results [28]. Thus, many of the existing water classification networks, e.g., [23], [24], [25], try to control the number of layers and feature maps to relieve these negative effects. Nonetheless, this strategy may often reduce the abstraction capabilities of the extracted features while limiting the resulting classification performance, especially when considering rich spatial-spectral data as in the Sentinel-2 case. In this scenario, this article proposes a new CNN-based classification model (AD-CNN) especially designed for water body extraction from Sentinel-2 imagery, based on the following key aspects: dense connectivity, residual learning, and attention. On the one hand, dense connections are used to relieve vanishing gradients as well as an excessive expansion of receptive fields at very deep layers, with the objective of better preserving local water information when extracting deeper features. Besides, they also serve to jointly exploit lower to higher level features in order to deal with the numerous spatial-spectral complexities of water pixels at large scales. On the other hand, a new residual attention module (RAM) is implemented to dynamically focus on the most relevant spatial-spectral features when identifying water bodies. To evaluate the performance of the proposed approach, we first create a new dataset of Nepal (WaterPAL) made of Sentinel-2 images, DEM data, and ground-truth water information. Then, we conduct multiple experiments including several state-of-the-art index-based and CNN-based water extraction methods.
In summary, the main contributions of this work can be listed as follows:
1) We build a new database of Nepal (WaterPAL) composed of Sentinel-2 images, DEM data, and ground-truth water information.
2) We propose a novel water extraction architecture (AD-CNN) that jointly exploits dense connectivity, residual learning, and attention mechanisms to uncover more discriminating deep features from water bodies.
The remainder of this article is organized as follows. Section II reviews the related works. Section III describes the geographical location of the study area and the detailed steps of the dataset preparation. Section IV delineates the workflow and structure of the proposed methodology. Section V provides details about the experimental setup and results. Finally, Section VI concludes this article.

A. Traditional Index-Based Methods
In general, index-based methods focus on the spectral properties of water with the objective of defining single-band or multiband computations that isolate water pixels within a particular value range. In this way, a common practice consists in exploiting the conjugate ratio between green and red bands to segregate the spectral response of water [29]. To avoid noise from artificial constructions like buildings, this approach is often improved by using near-infrared (NIR) instead of red bands [30]. With these considerations in mind, different indices have been proposed and utilized in the literature to extract water bodies [31]. One of the most popular indices is the normalized difference water index (NDWI) [32], which is calculated using green and NIR bands as follows:

$$\mathrm{NDWI} = \frac{\rho_{\mathrm{green}} - \rho_{\mathrm{NIR}}}{\rho_{\mathrm{green}} + \rho_{\mathrm{NIR}}}$$

where $\rho_{\mathrm{green}}$ and $\rho_{\mathrm{NIR}}$ are the green and NIR reflectance bands, respectively. The values of this index range between −1 and 1, with positive values representing water bodies [33]. However, depending on the region of interest, some built-up areas could still generate noisy false positive results. For this reason, other authors propose the modified NDWI (MNDWI) [34] as

$$\mathrm{MNDWI} = \frac{\rho_{\mathrm{green}} - \rho_{\mathrm{MIR}}}{\rho_{\mathrm{green}} + \rho_{\mathrm{MIR}}}$$

where $\rho_{\mathrm{MIR}}$ is the midinfrared (MIR) reflectance band. With this change, built-up areas usually become negative, but some additional problems appear with mountain shadows and snow, making the MNDWI index mainly suitable for urban water extraction. Similarly, another index termed new water index (NWI) was proposed in [35], where green and NIR bands are replaced by blue and Landsat MIR bands (its exact expression is given in [35]).
In addition to these, other related indices have also shown prominent results in detecting water bodies from remote sensing data. This is the case of the normalized difference vegetation index (NDVI) [36], which employs the difference between NIR and red bands, following the same scheme as NDWI, to primarily extract vegetation while detecting water as negative values.
In fact, some works in the literature show the advantages of jointly exploiting both NDWI and NDVI for water body extraction, e.g., [37]. Other authors also propose using principal component analysis (PCA) to consider only the most informative image components when computing the index, as in the case of the enhanced water index (EWI) [21]. However, the high computational cost of PCA strongly limits the applicability of this scheme over large regions of interest.
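As an illustration of how the normalized-difference indices above are applied in practice, the following minimal sketch computes NDWI and MNDWI with NumPy and thresholds them at zero. The function names, the toy reflectance values, and the zero threshold are assumptions made for this example only; as noted earlier, the most suitable threshold generally depends on the scene.

```python
import numpy as np

def normalized_difference(a, b, eps=1e-12):
    """Return (a - b) / (a + b) element-wise.

    Assumes non-negative reflectances; eps only guards against 0 / 0.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return (a - b) / np.maximum(a + b, eps)

def ndwi(green, nir):
    """Normalized difference water index from green and NIR reflectance."""
    return normalized_difference(green, nir)

def mndwi(green, mir):
    """Modified NDWI: the NIR band is replaced by the MIR band."""
    return normalized_difference(green, mir)

def water_mask(index, threshold=0.0):
    """Pixels whose index exceeds the threshold are labeled as water."""
    return index > threshold

# Hypothetical reflectances: pixel 0 is water-like, pixel 1 is vegetation-like.
green = np.array([0.10, 0.08])
nir = np.array([0.02, 0.40])
idx = ndwi(green, nir)
mask = water_mask(idx)
print(idx, mask)
```

Because the computation is purely pixel-wise, it cannot use the shape or texture of the neighborhood, which is the limitation that motivates the CNN-based methods reviewed next.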

B. Deep Learning-Based Methods
Despite their efficacy, traditional index-based methods usually have important limitations when working at global scales since the optimal water detection ranges may often vary from one local scene to another [17]. To provide a more general solution, deep learning methods aim at exploiting the characteristic spatial-spectral information of water pixels via CNNs. In this regard, numerous approaches can be found in the related literature.
For instance, Yang et al. [38] propose using a stacked sparse autoencoder for extracting pixel-wise features that take into account neighborhood information through a feature expansion algorithm. In [22], the authors opt for developing a classification CNN, named DeepWaterMap, which is specifically trained to separate water from land, snow, ice, clouds, and shadows using Landsat images as input. Chen et al. [25] extend this concept to the ZY-3 and Gaofen satellites by adopting a self-adaptive pooling within the network to extract water features that are more robust to local terrain variations.
Despite the positive results achieved by these and other relevant deep learning methods, the high spatial diversity of water bodies may lead to highly boundary-dependent features that may eventually limit the generalization power and performance of the uncovered features. To relieve these effects, different multiresolution schemes have been proposed in the literature. For example, Wang et al. [39] present a multiscale CNN for extracting urban water from Landsat imagery. Zhang et al. [26] also define a multiresolution encoder-decoder network, which is intended to characterize water pixels regardless of the considered terrain conditions. Additionally, Pu et al. [23] propose a four-layer CNN with a hierarchical structure to accurately estimate nonoptically active parameters when classifying water quality levels. Following a similar scheme, Rezaee et al. [24] develop a two-level CNN for complex wetland classification from RapidEye images. Unlike these classification models, which work at pixel level, other works try to exploit different scene-based segmentation schemes to uncover water. In [40], the authors recommend using the correlations among multiresolution scales to refine the uncovered features. Xia et al. [41] take advantage of a U-shaped segmentation network (U-Net) to allow skip connections between different resolution levels. Zhang et al. [42] adopt a squeeze-and-excitation technique for the recalibration of feature channels when segmenting water. Nonetheless, these segmentation models have the disadvantage of requiring full-scene annotated data, in contrast to pixel-based water classification, which is more suitable for relieving the data scarcity problem in developing countries like Nepal.
In all water extraction models, it has been observed that initial layers tend to extract low-level features, like edges, whereas deeper layers focus on higher level features, like spatial-spectral patterns and textures. In this scenario, one may think that deeper features should provide more generalization capabilities for water extraction since the deeper the network, the higher the abstraction level. Nevertheless, this is not always the case due to the so-called vanishing gradient problem [43]. When it comes to CNN-based methods, many of the most successful water classification networks, e.g., [23], [24], [25], need to control the number of layers and filters to avoid a poor gradient propagation and, hence, a rapid performance saturation. Under these circumstances, the use of a reduced number of layers may constrain the abstraction capabilities when characterizing water bodies, especially when dealing with rich spatial-spectral data as in the Sentinel-2 case. Although some mechanisms, such as residual [44] or dense models [45], have been presented in the standard computer vision field to allow additional layers, how to effectively implement and exploit deeper features for outperforming state-of-the-art water classification models with remote sensing data is still an open issue. The same holds for the operational exploitation of the most recent CNN-based attention mechanisms, which dynamically focus on the most relevant features [46]. Beyond existing pixel-wise water classification networks, this article aims to design a novel CNN classification architecture for water body extraction from Sentinel-2 data by jointly exploiting the following three aspects: dense connections [45], residual learning [44], and attention [47]. Section IV provides all the corresponding details.

III. STUDY AREA AND DATASET
The study area comprises 18 districts of the Terai region located in the southern plains of Nepal. Specifically, it occupies about 28 402.98 km² between 26.42° and 29.07° North latitude and 80.47° and 87.01° East longitude in the WGS 1984 coordinate system. The Terai is considered the greenbelt of Nepal, being covered with grasslands, tropical monsoon forests, savanna, clay, and loam soil. In terms of biodiversity, the Terai is also home to 35 species of mammals, 111 of birds, 46 of herpetofauna, and 106 of fishes [48]. With 55.7% of its agricultural land within an altitude range from 60 to 300 m, the Terai is known as the rice bowl or agricultural production house of the country [48], [49]. Moreover, nearly 47% of Nepal's population lives in the Terai, with an increasing population density of around 350 people per km² [50]. All these factors make regional water resources a major concern for the global development of the country as well as the sustainability of its agricultural sector. In this sense, the Terai contains many seasonal and annual rivers that mostly originate from the Siwalik hills on the northern side of the region. Besides, the Terai features 163 wetlands and 4 Ramsar sites [51], which also make the automatic and remote detection of water a particularly relevant task. Fig. 1 shows the study area of this research and the corresponding Sentinel-2 tiles. Focusing on this region of interest, we build a water body extraction database (WaterPAL), made of Sentinel-2 images, DEM data, and ground-truth water information, as detailed in the following sections. The WaterPAL collection will be accessible on https://github.com/rufernan/ADCNN.

A. Sentinel-2 Images
Sentinel-2 images [52] contain 13 spectral bands with three different spatial resolutions of 10, 20, and 60 m. Blue (B), green (G), red (R), and near-infrared (NIR) bands are provided at 10 m, while the four vegetation red-edge bands and the two short wave infrared (SWIR) bands are provided at 20 m. The remaining channels, i.e., coastal aerosol, water vapor, and cirrus (SWIR), are provided at a 60-m resolution. Table I summarizes the list of bands acquired by the multi-spectral instrument carried by Sentinel-2.
Considering the nature of these data, a total of 11 cloud-free Level-2A Sentinel-2 products from 2020 were downloaded to cover the whole study area. For this task, we essentially used the Copernicus Open Access Hub (COAH) platform ([Online]. Available: https://scihub.copernicus.eu/). Initially, a preliminary tile inspection regarding the amount of clouds, cirrus, number of bands, spatial coverage, etc., was performed to avoid any issue in the selected products. However, some data problems (such as missing regions or bands) were found in some of the cloud-free tiles retrieved from COAH for the year 2020. Hence, the T45RUK, T45RUL, T45RTL, and T45RVK tiles were alternatively obtained from the United States Geological Survey Earth Explorer data portal (USGS-EE) ([Online]. Available: https://earthexplorer.usgs.gov/) to complete our dataset. For these scenes, Level-1C data were downloaded from USGS-EE and converted into Level-2A images by applying the Dark Object Subtraction atmospheric correction available in the Quantum Geographic Information System Desktop 3.14.15. Finally, all the downloaded products were processed by the Sentinel Application Platform to generate uniform data cubes at 10 m using a bicubic resampling kernel. Besides, the B01 and B10 bands were excluded since they are only useful for atmospheric correction purposes. It is important to note that CNN-based models require spatially homogeneous input data; thus, in this work, we used the standard bicubic interpolation to up-sample the lower resolution Sentinel-2 bands to the resolution of the finest bands (10 m).

B. DEM Data
DEM raster data covering the study area were downloaded from the United Nations Office for Humanitarian Affairs Services (UN-OCHA) at the following website. 3 Specifically, these data were provided by the NASA Shuttle Radar Topographic Mission and were last updated on November 10, 2019. In more detail, the downloaded data are represented at a 90-m spatial resolution with geographic latitude/longitude coordinates. Since the downloaded Sentinel-2 images use the Universal Transverse Mercator projection system (Zones 44N and 45N), the DEM data were accordingly projected, resampled to 10 m via a bicubic kernel, and converted to unsigned 16-bit integers in order to be integrated as an additional band in the corresponding Sentinel-2 products. For all these steps, we used the ArcGIS Pro 2.6 software with its default settings.

C. Ground-Truth Information
To obtain ground-truth information about the water bodies within the region of interest, we downloaded the River dataset from the UN-OCHA website. 4 This dataset was last updated on November 24, 2015, and contains different water body types in vector format. For the sake of simplicity, we binarized the available labels into water and non-water classes. Besides, we also reprojected, rasterized, and clipped the resulting data to generate a ground-truth water map for each Sentinel-2 tile. As in the case of the DEM data, we employed ArcGIS Pro 2.6 for all these steps. Fig. 2 shows a sample data product corresponding to the T44RQR Sentinel-2 tile. Additionally, Table II summarizes the considered number of training, validation, and test patch samples per category.

IV. METHODOLOGY
This section presents the proposed model for extracting water bodies from Sentinel-2 data. First, let us formulate the water extraction problem from a classification perspective. Let $I = \{I_1, \ldots, I_N\}$ be a collection of Sentinel-2 images (with the possibility of including DEM data as an additional band) covering a particular region of interest with a spatial-spectral size of $(I_1 \times I_2 \times B)$. Let $W = \{W_1, \ldots, W_N\}$ be their corresponding ground-truth water classification maps considering $C$ classes. In this scenario, it is possible to extract $M$ nonoverlapping patches from $I$ (using a $(P \times P)$ spatial size) in order to build the following set:

$$X = \{x_1, \ldots, x_M\}, \quad x_i \in \mathbb{R}^{P \times P \times B}$$

Considering that each patch is used for representing its central pixel, i.e., the $(P/2, P/2)$ spatial position, it is also possible to extract a label set $Y = \{y_1, \ldots, y_M\}$ with the class labels of the central pixels as one-hot-encoded vectors. Under this notation, the proposed AD-CNN architecture aims to approximate a function $F : X \rightarrow Y$, which essentially takes Sentinel-2 patches as input and classifies their central pixels as output. In this sense, the AD-CNN tries to relieve some limitations of current CNN-based water classification models by jointly exploiting two different elements: residual attention and dense connections. Now, let us describe these two components as well as the proposed network topology in detail.
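The construction of the patch and label sets can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function name `extract_patches`, the toy tile, and the synthetic label map are hypothetical, and border pixels that do not fill a complete patch are simply discarded.

```python
import numpy as np

def extract_patches(image, label_map, patch_size, num_classes):
    """Split an (H, W, B) image into non-overlapping P x P patches and label
    each patch with the one-hot class of its central pixel (P/2, P/2)."""
    P = patch_size
    H, W, _ = image.shape
    one_hot = np.eye(num_classes, dtype=np.float32)
    patches, labels = [], []
    for r in range(0, H - P + 1, P):
        for c in range(0, W - P + 1, P):
            patches.append(image[r:r + P, c:c + P, :])
            center_class = label_map[r + P // 2, c + P // 2]
            labels.append(one_hot[center_class])
    return np.stack(patches), np.stack(labels)

# Toy tile with B = 12 bands; the upper half is labeled as water (class 1).
rng = np.random.default_rng(0)
image = rng.random((32, 48, 12)).astype(np.float32)
label_map = np.zeros((32, 48), dtype=np.int64)
label_map[:16, :] = 1

X, Y = extract_patches(image, label_map, patch_size=16, num_classes=2)
print(X.shape, Y.shape)
```

Each row of `Y` depends only on the central pixel of its patch, so a patch that contains water away from its center is still labeled as non-water.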

A. Residual Attention Module
Both residual and attentional learning paradigms have proven to be excellent mechanisms for CNNs since they allow focusing on the most discriminating features along the learning process. On the one hand, residual blocks (RBs) [44] are able to provide better feature representations at deeper layers by using skip connections that allow the model to shortcut some convolutions when convenient. In this way, over-fitting and vanishing gradient problems can be relieved since unnecessary layers may be skipped while gradients are more easily propagated. On the other hand, attention [47] is another important tool that allows the network to dynamically focus on the most relevant feature maps and regions with respect to the desired output. Hence, an attention block can emphasize or suppress features with the objective of refining intermediate data representations.
Despite their potential, these two mechanisms have not yet been used in the context of extracting water bodies from remote sensing data, e.g., [23], [24], [25]. In this scenario, the proposed approach takes advantage of residual and attentional paradigms to define an RAM especially designed to extract water features from Sentinel-2 data. Specifically, RAM is made of several RBs, which consist of batch normalization layers (BN), rectified linear activation functions (ReLU), 2-D convolutional layers (Conv2D), and residual addition layers (Add). Fig. 3(a) shows a graphical visualization of the considered residual building block. As can be observed, three of the Conv2D layers use a (1 × 1) kernel size, whereas the other one employs (3 × 3) kernels. Additionally, we set $K_2$ to the spectral size of the block input and $K_1 = K_2/4$ in order to compress/decompress the number of feature maps within each RB. The objective of this diabolo shape is to simplify the spectral information coming from Sentinel-2 to better identify water signatures, which are typically more prominent in the visible spectrum, where Sentinel-2 has only a limited number of bands.
Using our RB as the basic building unit and inspired by the ideas presented in [47], we further define RAM based on additional max-pooling layers (MaxPool), up-sampling layers (Up), sigmoid activation functions (Sigmoid), and residual multiplication layers (Mult). Fig. 3(b) displays the defined RAM. In particular, MaxPool applies a maximum pooling operation with a (2 × 2) window size and Up performs a 2× up-scaling using a nearest neighbor filter. The first three RB units aim to extract a fundamental deep representation of the input data. Then, the four following elements simplify the spatial information at a higher abstraction level by down-scaling/up-scaling the corresponding feature maps. In this way, coarser texture patterns can be uncovered to better identify water pixels, which usually have rather homogeneous neighborhoods. Finally, the last RB is intended to remove possible spectral noise that could appear after weighting the feature maps and could be detrimental for water detection, given the limited spectral resolution of Sentinel-2 in visible wavelengths.
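To make the role of the MaxPool, Up, Sigmoid, and Mult elements more concrete, the following NumPy sketch implements only the down-scale/up-scale soft-mask weighting. It is an illustration under assumptions rather than the module of Fig. 3(b): the RBs are omitted, and the residual form (1 + mask) × features is borrowed from the general idea in [47].

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W, C) array (H and W even)."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def upsample_nearest_2x(x):
    """2x nearest-neighbor up-scaling on an (H, W, C) array."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_mask_weighting(features):
    """Weight features by a (0, 1) mask obtained from their pooled version.

    Assumption: residual multiplication (1 + mask) * features, so that no
    input information is discarded. The RBs of the actual RAM are omitted.
    """
    mask = sigmoid(upsample_nearest_2x(max_pool_2x2(features)))
    return (1.0 + mask) * features

x = np.arange(16, dtype=np.float64).reshape(4, 4, 1)
pooled = max_pool_2x2(x)
out = soft_mask_weighting(x)
print(pooled[..., 0])
print(out.shape)
```

Since the mask lies strictly between 0 and 1, each output value stays between one and two times its input, which emphasizes features without suppressing any of them completely.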

B. Dense Module
In general, increasing the number of convolutional layers in a network allows extracting higher level features that can help to achieve a better visual understanding [53]. However, standard feed-forward CNNs have two important limitations in this regard: vanishing gradients and receptive field expansion. On the one hand, the use of back-propagation requires computing the derivatives of the cost function to update the network parameters. Since the output of each layer depends on the former ones, the chain rule is used for unrolling these gradient computations. In this scenario, the deeper the network, the higher the number of nested derivatives and, hence, the higher the chances of canceling the propagated gradients and network updates. On the other hand, standard CNNs process the input data layer by layer. In this way, the selected kernel sizes determine the spatial neighborhoods (or receptive fields) involved in each convolution, so the considered area of the input image grows as more convolutional layers are sequentially stacked. As a result, very deep CNNs could also degrade the uncovered features due to an excessive increase of receptive fields.
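The receptive field expansion described above can be quantified with the standard recurrence for sequentially stacked layers: each layer with kernel size k enlarges the receptive field by (k − 1) times the product of the strides of all preceding layers. The short sketch below applies it; the function name and the example layer stacks are illustrative choices.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a plain sequential stack.

    layers: list of (kernel_size, stride) tuples, applied in order.
    """
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # growth scaled by accumulated stride
        jump *= stride
    return rf

# Twelve stacked 3x3 convolutions with stride 1.
plain_12 = receptive_field([(3, 1)] * 12)

# Same stack, followed by a 2x2 pooling (stride 2) and twelve more 3x3 layers.
with_pool = receptive_field([(3, 1)] * 12 + [(2, 2)] + [(3, 1)] * 12)

print(plain_12, with_pool)
```

For stride-1 layers the recurrence reduces to 1 + L(k − 1), so twelve plain 3 × 3 layers already span 25 input pixels per side, and the span grows faster once pooling is introduced.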
In order to overcome these limitations when extracting water bodies from Sentinel-2 data, we design a dense convolutional module (DM) by taking advantage of the connectivity scheme presented in [45]. Specifically, our DM is made of multiple sequential blocks with the following layers: ReLU, Conv2D, and a concatenation layer (Concat). Fig. 4 visualizes the defined DM. In more detail, DM contains a total of $D$ convolutional blocks with $K_3$ (3 × 3) kernels each. With this configuration, we densely propagate feature maps from shallow to deep layers in order to generate more consistent gradient computations during training while providing context information to deeper layers. In this manner, each input Sentinel-2 patch can be characterized by multiple receptive fields in order to improve its context information for a better prediction of water bodies.
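A minimal NumPy sketch of this connectivity pattern is given below, where each block applies ReLU, a (3 × 3) convolution, and a concatenation with its own input. The random kernels, the zero "same" padding, the absence of biases, and the helper names are assumptions made for illustration; only the ReLU, Conv2D, Concat ordering and the D and K_3 parameters come from the description above.

```python
import numpy as np

def conv3x3_same(x, kernels):
    """3x3 convolution (cross-correlation) with zero 'same' padding.

    x: (H, W, C_in); kernels: (3, 3, C_in, C_out).
    """
    H, W, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, kernels.shape[-1]))
    for dr in range(3):
        for dc in range(3):
            out += padded[dr:dr + H, dc:dc + W, :] @ kernels[dr, dc]
    return out

def dense_module(x, num_blocks, growth, rng):
    """Each block: ReLU -> Conv2D(growth kernels, 3x3) -> Concat with its input."""
    for _ in range(num_blocks):
        # Random weights stand in for trained parameters (assumption).
        kernels = rng.normal(scale=0.1, size=(3, 3, x.shape[-1], growth))
        new_maps = conv3x3_same(np.maximum(x, 0.0), kernels)
        x = np.concatenate([x, new_maps], axis=-1)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))
y = dense_module(x, num_blocks=12, growth=16, rng=rng)
print(y.shape)
```

Because every block concatenates rather than replaces, the module input survives unchanged in the first channels of the output, next to feature maps computed at every intermediate depth and therefore at several receptive field sizes.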

C. Proposed Attentional Dense CNN
In contrast to many of the existing CNN-based water extraction methods, e.g., [23], [24], [25], the proposed architecture takes advantage of the designed modules to focus on the most distinctive features of water while allowing very deep data representations. In general, water bodies have particular spatial-spectral features that play a fundamental role in their recognition. From a spectral perspective, water usually has spectral responses concentrated in the visible and near-infrared spectrum [54]. This is precisely what classical water indices try to exploit. However, fixing the bands for such computations can often be too rigid a strategy in heterogeneous inland scenarios, where different spectral mixtures may be expected. By contrast, existing CNN-based models take into account the whole spectral input, which may eventually introduce too much noise for detecting purer water. From a spatial perspective, a similar reasoning applies since water bodies tend to have specific rounded and smooth shapes and textures, with other spatial information being less useful. In this sense, the proposed architecture adopts the developed RAM to automatically pay more attention to those initial spatial-spectral features that are more relevant for identifying water pixels, but without neglecting any other input information. Moreover, the proposed network also integrates the defined DM within its topology to effectively uncover very deep features that gather multiple receptive fields, which may help to decide whether a pixel is water or not at different abstraction levels.
With all these considerations in mind, we define the proposed architecture according to Fig. 5. Specifically, the AD-CNN is made of the following components: head block (HB), RAM, DM, transition block (TB), DM, TB, DM, and end block (EB). As can be seen in Fig. 6, HB is made of only two layers: BN and Conv2D with $K_3$ (3 × 3) kernels. Besides, TB has a total of three layers: Conv2D with $K_3$ (1 × 1) kernels, average pooling (AvgPool) with a (2 × 2) window, and BN. Finally, EB contains ReLU, global average pooling (GAvgPool), and a dense layer (Dense) with $C$ units and a softmax activation function (Softmax). Let us describe the rationale behind the selected components in more detail. Initially, HB (1) processes the input data (i.e., a Sentinel-2 image patch with the possibility of including DEM information) to generate an initial low-level characterization of the normalized input. Then, these representations are passed through RAM (2) in order to generate a weighted version of the data, in both spatial and spectral dimensions, according to the objective task. After this process, the most relevant features for identifying water pixels are emphasized to drive the following higher level steps. Subsequently, three DMs separated by two intermediate TBs are used for extracting very deep features. In this case, transition blocks are used to reduce the data complexity since a large increase in the number of feature maps is generated within each DM. Additionally, TB also progressively reduces the spatial size by means of an average pooling operation. Once the corresponding deep features are obtained, the objective of EB is to project them onto the final label space to decide whether the central pixel of the input patch is water. To further prevent overfitting, a global average pooling operation is used to summarize each feature map into a single scalar before the final fully connected classification layer.
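The sequence of blocks can be followed by tracing the tensor shapes through the network. The sketch below does so in plain Python under several assumptions that the text does not state explicitly: convolutions use "same" padding, RAM preserves the shape of its input, and each DM outputs the concatenation of its input with the feature maps of all its D blocks. The default values K_3 = 16, D = 12, and C = 2 are those reported in the experimental settings.

```python
def adcnn_shapes(P, B, K3=16, D=12, C=2):
    """Trace output shapes through HB, RAM, DM, TB, DM, TB, DM, EB.

    Assumes 'same' padding, a shape-preserving RAM, and DMs that return
    their input concatenated with the K3 maps of each of their D blocks.
    """
    trace = [("input", (P, P, B))]
    h, w = P, P
    ch = K3                                  # HB: BN + Conv2D with K3 (3x3) kernels
    trace.append(("HB", (h, w, ch)))
    trace.append(("RAM", (h, w, ch)))        # re-weighted features, same shape
    for i in (1, 2, 3):
        ch += D * K3                         # DM: every block adds K3 feature maps
        trace.append((f"DM{i}", (h, w, ch)))
        if i < 3:
            ch = K3                          # TB: Conv2D with K3 (1x1) kernels
            h, w = h // 2, w // 2            # TB: (2x2) average pooling
            trace.append((f"TB{i}", (h, w, ch)))
    trace.append(("EB", (C,)))               # GAvgPool + Dense(C) + Softmax
    return trace

for name, shape in adcnn_shapes(P=16, B=12):
    print(f"{name:>5}: {shape}")
```

Under these assumptions, each DM expands 16 feature maps to 208, and each TB compresses them back to 16 while halving the spatial size, which illustrates why the transition blocks are needed to keep the data complexity bounded.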

A. Experimental Settings
In order to validate the proposed architecture in the task of extracting water bodies from Sentinel-2 images, we conduct multiple classification experiments with the dataset described in Section III. For comparison purposes, we consider some of the most popular methods used for water extraction, including classical index-based and more recent CNN-based models: NDWI [32], NDVI [36], NDWI-NDVI [37], EWI [21], water quality classification CNN (WQC-CNN) [23], complex wetland classification CNN (CWC-CNN) [24] and self-adaptive pooling CNN (SAP-CNN) [25]. To complete the experimental comparison, we also test the performance of other CNNs that have not been explicitly used for water extraction but they have some connections to this work: basic CNN (base-CNN) [55], dense CNN (DenseNet) [45], and residual attention CNN (At-tResNet) [47]. To validate the effectiveness of the proposed architecture, we also conduct an ablation study to compare the AD-CNN with a simplified version (named as D-CNN) that omits the RAM module. In this way, the improvements generated by the proposed dense architecture and attention mechanism can be fairly isolated. It is important to note that all CNN-based models (logically including the proposed approach) take as input an spatial-spectral patch, whereas water indices only require the spectral information of the central pixel to perform the corresponding classification.
Regarding the considered data, we selected two Sentinel-2 tiles (from the 11 tiles available in our dataset) for an external qualitative evaluation. From the remaining nine tiles, we extracted their patches (i.e., X) and labels (i.e., Y) for training and testing the models considering different patch sizes P = {8, 12, 16, 20}. Specifically, 60% of the data were used for training (with 20% of it reserved for validation) and the other 40% for testing. Since water/non-water classes may logically become highly imbalanced in inland scenarios like Nepal, we further balanced the data by means of random sampling to keep the ratio between the majority (non-water) and minority (water) classes at 2 : 1. Under these settings, we carry out the following experiments to study the performance of index-based and CNN-based methods as well as the contribution of Sentinel-2 and DEM data in the water extraction task.
1) Experiment 1: Index-based models using Sentinel-2 data as input. Note that index-based methods cannot be used with DEM data; hence, they are tested in isolation in this experiment.
2) Experiment 2: CNN-based models using only Sentinel-2 RGB channels as input data, i.e., B = 3.
3) Experiment 3: CNN-based models using all Sentinel-2 channels, i.e., B = 11 (note that two bands are removed by the atmospheric correction).
4) Experiment 4: CNN-based models using Sentinel-2 together with DEM data as input, i.e., B = 12 (DEM data are integrated as an additional input band).
With respect to the hyperparameters of the proposed architecture, we set K1 = 4, K2 = 16, K3 = 16, D = 12, and C = 2 according to the information provided in Section IV. In the case of the considered competitors, we used the settings described in their corresponding articles. For training all CNN-based models, we used the standard cross-entropy loss with the Adam optimizer and the following parameters: 100 epochs, a learning rate of 1e-3, and a batch size of 128. Additionally, we applied a learning rate decay (0.2 factor) on each validation loss plateau after 15 epochs. All the experiments were performed on a server with an Intel(R) Core(TM) i7-6850K processor, 64 GB of DDR4 RAM, and an NVIDIA GeForce GTX 1080 Ti. Besides, Ubuntu 20.04 x64, CUDA 10.1, TensorFlow 2.1.0, Keras 2.3.1, and Python 3.6 were used as the software environment. The code of this article will be available at https://github.com/rufernan/ADCNN.
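The data preparation described above can be sketched with the standard library alone. The function below is a hypothetical helper, not the released code: it randomly undersamples the non-water class to a 2 : 1 ratio and then divides the sample indices into training, validation, and test subsets. Balancing before splitting, and taking the validation subset as 20% of the training portion, are assumptions about how the description should be read.

```python
import random


def balance_and_split(labels, ratio=2, train_frac=0.6, val_frac=0.2, seed=0):
    """Return (train, val, test) index lists for binary labels (1 = water).

    Hypothetical helper. Assumes balancing is applied before splitting and
    that `val_frac` is a fraction of the training portion.
    """
    rng = random.Random(seed)
    water = [i for i, y in enumerate(labels) if y == 1]
    non_water = [i for i, y in enumerate(labels) if y == 0]

    # Randomly undersample the majority class to ratio : 1
    n_keep = min(len(non_water), ratio * len(water))
    kept = water + rng.sample(non_water, n_keep)
    rng.shuffle(kept)

    n_train_total = round(train_frac * len(kept))
    n_val = round(val_frac * n_train_total)

    val = kept[:n_val]
    train = kept[n_val:n_train_total]
    test = kept[n_train_total:]
    return train, val, test


if __name__ == "__main__":
    # Toy label vector: 100 water samples and 900 non-water samples
    toy_labels = [1] * 100 + [0] * 900
    tr, va, te = balance_and_split(toy_labels)
    print(len(tr), len(va), len(te))
```

With the toy labels above, 300 samples remain after balancing, of which 180 fall in the training portion (144 for fitting and 36 for validation) and 120 in the test subset.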

B. Results
Tables III-VI present the quantitative evaluation obtained for the considered experiments (i.e., Experiments 1-4, respectively). In more detail, Table III contains the results of index-based methods, whereas Tables IV-VI provide the quantitative assessment of CNN-based models when considering different combinations of the input data (i.e., only Sentinel-2 RGB bands, all Sentinel-2 bands, and Sentinel-2 bands together with DEM data). All the tables are organized with the tested methods in rows and the considered patch sizes and metrics in columns. In this regard, two different quantitative classification metrics are considered: overall accuracy (%) and class recall (%). For the sake of clarity in the visualization of the tables, all recall values are rounded to integer figures. Besides, the two best accuracy values for each patch size are highlighted in bold font, with the best result displayed on a gray background. Note that we use the label N/A to indicate that the corresponding result is not available, either because the model is unable to converge or because it cannot run with the considered patch size. For conducting a qualitative evaluation of the methods, Fig. 8 also shows some of the classification maps obtained over the external tiles when focusing on the first two experiments with P = 16.
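For reference, the two reported metrics can be computed as in the minimal sketch below, which uses the standard definitions of overall accuracy and per-class recall. The function names and the toy label vectors are illustrative only.

```python
def overall_accuracy(y_true, y_pred):
    """Percentage of samples whose predicted label matches the reference."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


def class_recall(y_true, y_pred, label):
    """Percentage of reference samples of class `label` that are retrieved."""
    total = sum(1 for t in y_true if t == label)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return 100.0 * hits / total if total else float("nan")


if __name__ == "__main__":
    # Toy example: 1 = water, 0 = non-water
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
    print(overall_accuracy(y_true, y_pred))   # 62.5
    print(class_recall(y_true, y_pred, 1))    # 50.0 (water)
    print(class_recall(y_true, y_pred, 0))    # 75.0 (non-water)
```

Reporting recall per class alongside overall accuracy is useful here because a model can reach a high accuracy while still missing a large share of the water pixels.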

C. Discussion
1) Experiment 1: According to the results reported in Table III, the use of NDWI, NDVI, and NDVI_NDWI achieved a maximum overall accuracy of 75% with recall values of 52% for water and 98% for non-water classes. In general, the performance of all the considered indices was approximately similar across all patch sizes. In more detail, NDVI slightly edged out the remaining indices in terms of overall accuracy and recall metrics. Besides, NDWI was found to have exactly the same performance as NDVI at patch size 16, which reveals the affinity of both indices for the study area. Finally, the highest recall values for the non-water class were obtained by NDVI_NDWI. In contrast to the other experiments, the considered traditional indices obtained the worst overall performance.
2) Experiment 2: As Table IV shows, the proposed model (AD-CNN) consistently achieved the best performance across all the considered patch sizes when using Sentinel-2 RGB bands as input. In general, it is possible to observe that the larger the patch the higher the accuracy since, logically, more local information is available for consideration. It is also important to note that the CWC-CNN was not able to converge when considering RGB bands with a small patch size of 8. Besides, AttResNet was only able to manage multiples of two as patch size due to the down-sampling operations performed inside this architecture. Regarding the other competitors, the WQC-CNN was always found to perform better than the SAP-CNN, and DenseNet obtained the third best general performance for all the considered patch sizes. In comparison to the remaining experiments, the use of RGB channels yielded the poorest results for all CNN-based models.

3) Experiment 3: In the case of Table V, it is possible to observe how all the networks were able to increase their performance by around 5% with respect to the previous experiment. Similarly, the AD-CNN provided the best results over all patch sizes, with a highest overall accuracy of 91.52% and a recall of 87% for water and 93% for non-water classes. Again, CWC-CNN was found to achieve the worst performance, followed by AttResNet and SAP-CNN, respectively. Besides, DenseNet and WQC-CNN obtained the most competitive results after those of the proposed model. In general, this experiment revealed a significant performance improvement when using the complete spectral information provided by Sentinel-2 for a more accurate characterization of water.
4) Experiment 4: In Table VI, it was found that the integration of DEM data as an additional input band was only able to improve the results by around 1%. This evidence indicates that full Sentinel-2 spectral information is certainly more important than DEM data to uncover water bodies over the region of interest. Overall, the proposed model again achieved the best performance, with metrics ranging from 89.66% to 91.34% of accuracy and from 83% to 87% of recall for water when increasing the patch size from 8 to 20. As in all the conducted experiments, the CWC-CNN was the least performing method. Moreover, DenseNet was followed by WQC-CNN and SAP-CNN in the quantitative evaluation. This time the general trend showed a moderate performance increase with respect to the previous experiment due to the inclusion of DEM data.
Taking into account that the study area has an important class imbalance, Fig. 7 further analyzes the proposed approach when considering different balance factors between the majority class (non-water) and minority class (water). As can be observed, the overall accuracy over the test set increases with the balance factor. However, the decreasing water recall reveals that these improvements are based on the underestimation of the minority class, since fewer water pixels are successfully retrieved. In this way, the considered balance factor (i.e., 2 : 1) shows a reasonable tradeoff between accuracy and water recall. Additionally, Table VII also provides the average number of trainable parameters and computational time (in seconds) per training/test epoch for each one of the considered CNN-based methods. As shown, the figures of the proposed approach are comparable to those of the best performing competitor (DenseNet).
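The interplay between the balance factor, overall accuracy, and water recall discussed for Fig. 7 follows from the fact that overall accuracy is a prevalence-weighted average of the per-class recalls. The sketch below illustrates this relation, assuming an evaluation set with r non-water samples per water sample. The recall values passed to the function are hypothetical and chosen only to show the effect.

```python
def accuracy_from_recalls(recall_water, recall_non_water, ratio):
    """Overall accuracy for a set with `ratio` non-water samples per water sample.

    Accuracy = (ratio * recall_non_water + recall_water) / (ratio + 1),
    with recalls expressed as fractions in [0, 1].
    """
    return (ratio * recall_non_water + recall_water) / (ratio + 1)


if __name__ == "__main__":
    # Hypothetical recalls: as the majority class dominates, a lower water
    # recall can coexist with a higher overall accuracy.
    print(accuracy_from_recalls(0.87, 0.93, ratio=2))
    print(accuracy_from_recalls(0.80, 0.96, ratio=4))
```

In the second call the water recall is lower, yet the overall accuracy is higher, because the non-water class carries more weight in the average. This is why accuracy alone is not a sufficient criterion for choosing the balance factor.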
From the qualitative results displayed in Fig. 8, several important observations can be made to support the conducted analysis. Regarding index-based methods, all three considered indices (i.e., NDWI, NDVI, and NDVI_NDWI) produced similar outputs in which water bodies are rather underestimated. As it Fig. 8(d)] tended to obtain a slightly better estimation but, in general, many water bodies were still missed with respect to the ground-truth data [see Fig. 8(b)]. In this sense, the particularly low water recall values reported in Table III also support these observations, since index-based methods tend to detect essentially the purest water spectral signatures. When inspecting the visual results of the six considered deep learning-based competitors [see Fig. 8(f)-(k)], we can observe a different trend. Overall, all the networks seemed to extract not only pure water bodies but also other neighboring regions such as river banks or partially dried streams. In detail, it was found that DenseNet [see Fig. 8(j)] was able to extract more true water pixels than the other competitors, being CWC-CNN [see Fig. 8
Although all the tested deep learning-based methods tend to overestimate water with respect to index-based ones, the need to use index thresholds according to the spectral properties of water often makes traditional indices fail to extract water bodies beyond pure water pixels. In contrast, CNNs take advantage of context information for characterizing water pixels with richer spatial-spectral features while providing a more general solution to water body extraction. Nonetheless, many of the existing water classification networks are only able to perform satisfactorily using a fixed abstraction level given by a relatively small number of convolutional layers, which may eventually saturate their learning performance.
Note that, when working with study areas as heterogeneous as the considered one, many types of water bodies with different complexities are naturally expected. Besides, the reasonably good spatial-spectral resolution of Sentinel-2 data also favors the exploration of deeper features. Hence, it becomes desirable to simultaneously learn from lower to higher water feature abstraction levels in order to solve from the simplest to the most challenging cases. Logically, feature abstraction and neighborhood information are important factors for identifying a pixel as water, but the optimal abstraction level and amount of context may certainly vary from patch to patch. The proposed AD-CNN model exploits precisely this idea by implementing attentional dense connections that allow transferring multiple characterization levels while focusing on the most relevant features for water extraction.

VI. CONCLUSION
This article presented a new CNN classification architecture (termed AD-CNN) especially designed for water body extraction from Sentinel-2 data. Unlike other models in the remote sensing literature, the AD-CNN adopts a novel attentional dense scheme that aims to effectively exploit deeper convolutional features for a better identification of water pixels. On the one hand, dense connections were implemented to allow extracting deeper features while characterizing multiple data complexities at once. On the other hand, a new RAM was designed to dynamically put the focus on the most relevant spatial-spectral features for classifying water pixels. In order to test the performance of the proposed model, a new water database of Nepal (WaterPAL) was built. The experiments conducted on WaterPAL revealed the competitive results achieved by the AD-CNN with respect to several traditional index-based and state-of-the-art CNN-based water extraction models.
According to the obtained results, several important conclusions can be drawn with regard to the use of Sentinel-2 data and the performance of the tested models. First, the most effective data configuration for water body extraction has proven to be the complete Sentinel-2 spectra together with DEM data. However, it is also important to highlight that the contribution of DEM is rather small with respect to multispectral Sentinel-2 information. Second, traditional index-based methods are generally unable to provide satisfactory results under heterogeneous large-scale scenarios, since mainly pure water signatures are detected. Third, deep learning-based methods provide more competitive results, although they also tend to be more prone to overestimating water. Fourth, the proposed attentional dense scheme allows extracting deeper and more complete features for a more accurate estimation of water bodies. Although the outcomes of this work are promising, there is still room for future improvements based on extending the proposed network to different intersensor platforms, multimodal data, and multitemporal stages.