Watershed-Based Attribute Profiles With Semantic Prior Knowledge for Remote Sensing Image Analysis

Abstract—In this article, we develop a novel feature extraction method that combines two well-established mathematical morphology concepts: watersheds and morphological attribute profiles (APs). In order to extract spatial-spectral features from remote sensing data, APs were originally defined as sequences of filtering operators on inclusion trees, i.e., the max- and min-trees, computed from the input image. In this study, we extend the AP paradigm to the more general framework of hierarchical watersheds. Moreover, we explore the semantic knowledge provided by labeled training pixels during different phases of the watershed-AP construction, namely within the construction of hierarchical watersheds from the raw image and later within the filtering of the resulting hierarchy. We illustrate the relevance of the proposed method with two applications: land cover classification and building extraction using optical remote sensing images. Experimental results show that the new profiles outperform various existing features on two public datasets (Zurich and Vaihingen), thus providing another high-potential feature extraction method within the AP family.


Deise Santana Maia, Minh-Tan Pham, and Sébastien Lefèvre, Senior Member, IEEE

I. INTRODUCTION
Mathematical morphology is an efficient tool with a long history in the analysis and processing of remote sensing images, as attested by early surveys on this topic [1], [2]. Over the last decade, particular attention has been given to a multilevel feature extraction method, namely morphological attribute profiles (APs) [3], which appeared and largely replaced the classical morphological profiles (MPs) [4] for the analysis of remote sensing images. Even though both of these methods are successful in conveying spatial-spectral features of such image data, APs have proved to be more general and efficiently scalable to large-scale data, thanks to their construction from tree-based hierarchical representations [5]. In this article, we study the relevance of hierarchical watersheds integrated in an AP processing framework for remote sensing applications. Watershed segmentation was proposed in the late 1970s and, since then, this concept has been extended to several frameworks and implemented through a variety of algorithms. The intuition behind the various definitions of watershed segmentation derives from the topographic definition of watersheds: dividing lines between catchment basins, which are, in their turn, areas where collected precipitation flows into the same regional minimum. These notions can be extended to gray-scale images and graphs, leading to different definitions of watershed segmentation. In this article, we focus on watershed-cuts and hierarchical watersheds defined in the context of edge-weighted graphs, as formalized in [6] and [7]. We present the notions of graphs, hierarchies of partitions, and hierarchical watersheds in Section III-A.
In addition to the construction of watershed-based APs, our second contribution in this work is to exploit semantic prior knowledge during different phases of the watershed-AP construction, so that the final extracted profiles better encode and characterize the spatial-spectral multilevel features of the image. In the context of image segmentation, a widespread method to introduce prior knowledge into the results is to consider user-defined markers, i.e., subsets of image pixels indicating the locations of objects of interest. Such markers guide the segmentation algorithm and ensure that the objects of interest are segmented into distinct regions. The notion of markers has been especially explored in watershed segmentation, in which catchment basins are grown from input markers instead of the regional minima of an image [8]. We provide more background notions and related studies on how semantic prior knowledge is exploited in watershed segmentation in Section II, as well as our proposed strategies to incorporate prior knowledge into watershed-based APs in Section IV-B.
It should be noted that this manuscript is an extension of our recent conference paper [9], which provided preliminary results of watershed-APs applied to panchromatic remote sensing image classification. In the present article, we investigate other methods to compute watershed-APs with prior knowledge from training pixels, and we present a more extensive evaluation of the proposed methods on land-cover classification. Moreover, considering that other methods in the literature, such as [10], [11], work well on binary pixel classification/segmentation of object/background, we decided to validate our method on another binary classification task relevant to remote sensing imagery, namely building extraction. Both land-cover classification and building extraction are evaluated using multispectral images from two well-known public datasets: Zurich and Vaihingen. The rest of this article is organized as follows. We first review related works on the use of semantic prior knowledge in Section II. Then, we present some basic definitions of graphs, hierarchical watersheds, and APs in Section III. In Section IV, we introduce the proposed watershed-APs and our strategies to integrate semantic knowledge within their construction. Experiments conducted on the two aforementioned remote sensing datasets are given in Section V, with extensive analysis and discussion in Section VI. Finally, Section VII concludes this article.

II. RELATED WORKS
In this section, we review some related studies on the use of prior knowledge and markers in the field of image processing and analysis, with a focus on watershed segmentation. We present the main findings of each study and discuss how they relate to our proposed watershed-AP.
As already mentioned, the idea behind watershed segmentation is to partition a surface/image into its different regional minima. To prevent oversegmentation, markers are often used to guide the computation of the watershed segmentation, in which catchment basins are grown from input markers instead of the image's regional minima. In this article, we show that the use of markers in a watershed segmentation can go beyond the introduction of new regional minima. The spectral-spatial information of the marked pixels can provide further knowledge about the objects we aim to segment. For instance, in [12], [13], the authors use the spectral signature of training samples in the construction of watershed segmentations of multispectral images. First, the spectral signature of training pixels is used to train a classifier, which is then applied to the whole image to obtain a probability map per class. Then, those maps are combined and used to obtain a single watershed segmentation. In the present work, we consider a similar approach to include prior knowledge in the watershed-AP construction, with the main difference being the construction of a multilevel watershed segmentation instead of a single segmentation. Moreover, we go one step further and exploit such probability maps during the image reconstruction/filtering phase of the watershed-AP construction.
Another use of supervised classification for watersheds has been proposed in [10], where the watershed segmentation is computed from user-defined markers combined with probability maps computed for each targeted class. More precisely, catchment basins are grown from different markers, and the probability maps, combined with the original data, are used simultaneously in the process. In remote sensing, the latter approach has been applied to the detection of buildings [14] and shorelines [15] in multispectral images. Similarly to [12] and [13], the method proposed in [10] deals with single-level watershed segmentation. Finally, in [11], prior knowledge from markers is employed in several interactive image segmentation methods, including watersheds, in the framework of edge-weighted graphs. Edge weights are defined as a linear combination of the weights obtained from two sources: the pixel values, and the classification probability maps computed from the markers that are incrementally provided by the users. More generally, knowledge from markers can be used by other kinds of preprocessing methods beyond watershed segmentation. Namely, spectral signatures of training pixels have been used in [16] to optimize the data preprocessing with alternating sequential filters. Hence, training pixels are used for preprocessing the input data, as well as for the final pixel classification. A related approach is proposed in [17], where training pixels are used to optimize vector orderings for morphological operations applied to hyperspectral images.
In the context of hierarchical segmentation, prior knowledge can play a role in defining which regions should be highlighted at different levels of a hierarchy. In [18], a marker-based hierarchical segmentation is proposed for hyperspectral image classification. Labeled markers are derived from a probability classification map, which is obtained from training samples, as done in [10], [12], [13]. Then, those labeled markers guide the construction of a hierarchical segmentation by preventing regions of different classes from being merged, and by propagating the labeled markers to unlabeled regions. Another related approach, proposed in [19], uses prior knowledge to keep the regions of interest from being merged early in the hierarchy, i.e., the details in the regions of interest are preserved at high levels of the hierarchy. This latter idea is also explored in the watershed-AP framework, in which we aim to filter out the regions with low probability of belonging to a given ground-truth semantic class. Finally, in [20], the authors propose a knowledge-based hierarchical representation for hyperspectral images. In their approach, a dissimilarity measure learned from training pixels is employed in the construction of α-trees. This last approach shares some common features with the one proposed in the present paper, the main differences being the family of hierarchies under consideration as well as the learning algorithm used to exploit prior knowledge from training pixels. Moreover, we also consider another way of using such prior knowledge, which has not been considered in [20], namely in the filtering step rather than in the construction phase of a hierarchy.
Finally, following the current trend of using deep learning for solving computer vision problems, various methods for coupling prior knowledge with deep-learning-based models have been explored in more recent works. For instance, in [21], edge information is combined with the output of a fully convolutional network (FCN) in order to refine the segmentation results given by the latter. In [22], crop classification in SAR time series is performed using autoencoders (AEs), convolutional neural networks (CNNs), and FCNs, and then postprocessed using prior knowledge regarding crop dynamics, i.e., experts' knowledge about which crop transitions may or may not occur over time in the same field. Though deep learning models generally perform well on computer vision tasks, including remote sensing imagery, there are advantages to using feature extraction methods based on hierarchical representations and morphological operators, such as the proposed watershed-APs. In particular, we can mention the interpretability of the method, compared to the black-box parameters of deep learning models, and the lower need for ground-truth data, which makes those morphological methods well adapted to datasets with scarce annotated samples.

III. BACKGROUND NOTIONS
In this section, we present some basic notions of graphs and hierarchical watersheds. Then, we recall the definition of morphological attribute profiles and their extensions in the literature.

A. Graphs and Hierarchical Watersheds
A weighted graph is a triplet G = (V, E, w), where V is a finite set, E is a subset of V × V, and w is a map from E into R. The elements of V and E are called vertices and edges (of G), respectively. Let G = (V, E, w) be a weighted graph and let G′ = (V′, E′, w′) be a graph such that V′ ⊆ V and E′ ⊆ E: we say that G′ is a subgraph of G. A sequence π = (x_0, . . ., x_n) of vertices of G′ is a path (in G′) from x_0 to x_n if {x_{i−1}, x_i} is an edge of G′ for any 0 < i ≤ n. If x_0 = x_n and if there are no repeated edges in π, we say that π is a cycle (in G′). The subgraph G′ of G is said to be connected if, for any x and x′ in V′, there exists a path from x to x′. Let G′ be a connected subgraph of G. We say that G′ is a connected component of G if 1) for any x and x′ in V′, if {x, x′} ∈ E then {x, x′} ∈ E′; and 2) there is no edge e = {y, y′} ∈ E such that y ∈ V \ V′ and y′ ∈ V′. Let G = (V, E, w) be a graph and let G′ = (V′, E′, w′) be a connected subgraph of G. If the weight of any edge in E′ is equal to a constant k and if w(e) > k for any edge e = {x, y} such that x ∈ V′ and y ∈ V \ V′, then G′ is a (local) minimum of G. For instance, Fig. 1(a) illustrates a weighted graph with four minima delimited by the dashed lines. We note that, in the remainder of this section, G = (V, E, w) denotes a connected weighted graph and n denotes the number of minima of G.
Let G′ = (V′, E′, w′) be a subgraph of G. A graph G″ = (V″, E″, w″) is a minimum spanning forest (MSF) of G rooted in G′ if: 1) V″ = V and G′ is a subgraph of G″; 2) every connected component of G″ contains exactly one connected component of G′; and 3) the sum Σ_{e∈E″} w(e) is minimal among all graphs which satisfy conditions (1) and (2). A partition of V is a set P of disjoint subsets of V such that the union of the elements in P is V. The partition of V induced by a graph G′ is the partition P such that every element of P is the set of vertices of a connected component of G′. A hierarchy of partitions of V is a sequence H = (P_0, . . ., P_n) of partitions of V such that P_n = {V} and such that, for any 0 < i ≤ n, every element of P_i is the union of elements of P_{i−1}. Any hierarchy of partitions H can be represented as a tree whose vertices correspond to the regions of H and whose edges link nested regions. For instance, Fig. 1(b) shows a tree representation of the hierarchy H = (P_0, P_1, P_2, P_3), where P_0 = {{a, b}, {c, d}, {e, f}, {g, h}}, P_1 = {{a, b}, {c, d}, {e, f, g, h}}, P_2 = {{a, b, c, d}, {e, f, g, h}}, and P_3 = {{a, b, c, d, e, f, g, h}}.
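The nesting property above can be checked programmatically. The following sketch (plain Python; the partitions are the toy example of Fig. 1(b)) verifies that each level of a hierarchy is a partition of V and that every region of a level is a union of regions of the level below:

```python
# Toy hierarchy from Fig. 1(b): four partitions of V = {a, ..., h}.
P0 = [{"a", "b"}, {"c", "d"}, {"e", "f"}, {"g", "h"}]
P1 = [{"a", "b"}, {"c", "d"}, {"e", "f", "g", "h"}]
P2 = [{"a", "b", "c", "d"}, {"e", "f", "g", "h"}]
P3 = [{"a", "b", "c", "d", "e", "f", "g", "h"}]
H = [P0, P1, P2, P3]

def is_partition(P, V):
    """P is a partition of V: nonempty, pairwise disjoint, union equals V."""
    union = set().union(*P)
    total = sum(len(R) for R in P)
    return union == V and total == len(V) and all(P)

def is_hierarchy(H, V):
    """Each level must be a partition of V, the last level must be {V},
    and every region of level i must be a union of regions of level i-1."""
    if not all(is_partition(P, V) for P in H) or H[-1] != [V]:
        return False
    for Pprev, Pcur in zip(H, H[1:]):
        for R in Pcur:
            # every finer region is either inside R or disjoint from R
            if not all(S <= R or S.isdisjoint(R) for S in Pprev):
                return False
    return True

V = set("abcdefgh")
print(is_hierarchy(H, V))  # True
```

Reordering the levels (e.g., placing P2 before P1) breaks the refinement property, and the check correctly rejects such a sequence.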
Let S = (M_1, . . ., M_n) be a sequence of the n distinct minima of G. The hierarchy of MSFs of G for S, also known as the hierarchical watershed of G for S, is a hierarchy H = (P_1, . . ., P_n) of partitions of V such that each partition P_i is the partition induced by an MSF of G rooted in the graph whose vertex and edge sets are the unions of the vertex and edge sets of the minima M_i, . . ., M_n. A hierarchical watershed of the graph G of Fig. 1(a) for the sequence (C, A, B, D) of minima of G is illustrated in Fig. 1(b).
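To make the rooted-MSF construction concrete, here is a minimal Kruskal-style sketch (plain Python; the six-vertex graph and its two minima are invented for illustration, not taken from Fig. 1). Edges are processed by increasing weight, and two components are merged only if they do not both already contain a minimum, so each tree of the resulting forest contains exactly one minimum, which yields the induced watershed-cut partition:

```python
# Vertices 0..5; the minima are the flat zones {0, 1} and {4, 5}.
edges = [
    (0, 1, 1),  # inside minimum A
    (4, 5, 1),  # inside minimum B
    (1, 2, 3),
    (2, 3, 5),  # highest "ridge" edge, left out of the forest
    (3, 4, 2),
]
minima = [{0, 1}, {4, 5}]

parent = list(range(6))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

has_min = [False] * 6
for M in minima:                       # pre-merge each minimum
    it = iter(M)
    first = next(it)
    has_min[find(first)] = True
    for v in it:
        ru, rv = find(first), find(v)
        parent[rv] = ru
        has_min[ru] = True

for u, v, w in sorted(edges, key=lambda e: e[2]):
    ru, rv = find(u), find(v)
    if ru == rv:
        continue
    if has_min[ru] and has_min[rv]:
        continue                       # would merge two minima: watershed edge
    parent[rv] = ru
    has_min[ru] = has_min[ru] or has_min[rv]

labels = [find(x) for x in range(6)]
regions = {}
for x, r in enumerate(labels):
    regions.setdefault(r, set()).add(x)
print(sorted(map(sorted, regions.values())))  # [[0, 1, 2], [3, 4, 5]]
```

Each catchment basin grows greedily around one minimum; the skipped high-weight edge (2, 3, 5) is the dividing line between the two basins.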

B. Attribute Profiles
In remote sensing image analysis, morphological APs [3] appear to be one of the most efficient multilevel feature extraction methods. To convey spatial-spectral features of remote sensing images, APs were initially defined as sequences of filtering operators on the max- and min-trees computed from the original data. Let X : P → Z, P ⊆ Z², be a gray-scale image. The calculation of APs on X is achieved by applying a sequence of attribute filters based on a min-tree (i.e., attribute thickening operators {φ^A_k}, k = 1, . . ., K) and on a max-tree (i.e., attribute thinning operators {γ^A_k}, k = 1, . . ., K) as follows:

AP(X) = (φ^A_K(X), . . ., φ^A_1(X), X, γ^A_1(X), . . ., γ^A_K(X))

where φ^A_k and γ^A_k are, respectively, the thickening and thinning operators with respect to the attribute A and to the threshold k, and K is the number of selected thresholds. More precisely, the thickening φ^A_k(X) of X (resp. thinning γ^A_k(X) of X) with respect to an attribute A and to a threshold k is obtained as follows: given the min-tree T (resp. max-tree T) of X, the attribute values of A (e.g., area, circularity, and contrast) are computed for the nodes of T. If the attribute A is increasing, the nodes whose attribute values are lower than k are pruned from the tree T; otherwise, other pruning strategies can be adopted [5]. Finally, the resulting image is reconstructed by projecting the gray levels of the remaining nodes from the pruned tree onto the pixels of X.
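As an illustration, the profile above can be sketched for the special case of binary images, where attribute thinning with the (increasing) area attribute reduces to removing small foreground components, and attribute thickening to filling small background components. This is a deliberate simplification of the min-/max-tree machinery, using only NumPy and SciPy; the image and thresholds are invented for illustration:

```python
import numpy as np
from scipy import ndimage

def area_thinning(binary, k):
    """Remove bright (foreground) connected components of area < k
    (binary counterpart of the attribute thinning gamma^A_k)."""
    labels, n = ndimage.label(binary)
    sizes = ndimage.sum(binary, labels, range(1, n + 1))
    keep = np.zeros(n + 1, dtype=bool)
    keep[1:] = sizes >= k
    return keep[labels]

def area_thickening(binary, k):
    """Fill dark (background) connected components of area < k
    (binary counterpart of the attribute thickening phi^A_k)."""
    return ~area_thinning(~binary, k)

# Toy binary image: one large square (area 16) and one small square (area 4).
X = np.zeros((10, 10), dtype=bool)
X[1:5, 1:5] = True   # area 16
X[7:9, 7:9] = True   # area 4

thresholds = [2, 8]
# AP(X) = (phi_8(X), phi_2(X), X, gamma_2(X), gamma_8(X))
profile = ([area_thickening(X, k) for k in reversed(thresholds)]
           + [X]
           + [area_thinning(X, k) for k in thresholds])
```

With these thresholds, gamma_2 keeps both squares, while gamma_8 removes the small one: the profile stacks progressively simplified versions of X around the original image.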
Since its appearance, the notion of APs has been extended to other hierarchical representations, including the tree of shapes and partition trees such as the α-tree and the ω-tree (see a comparative study of APs constructed from different trees in [23], and [24] for a more general survey on morphological trees). To obtain a profile from a partition tree instead of a component tree, some adaptations have to be made to the original definition of APs, as discussed in [25] and [26]. For instance, the nodes of a partition tree are not naturally associated with gray-level values, as is the case for component trees. The strategy adopted in [25] is to represent each node by its level in the tree or by the maximum, minimum, or average gray level of the leaf nodes (pixels) of this node. For more details about AP extensions, we invite readers to refer to a recent survey [5].

IV. WATERSHED AP
In this section, we present watershed-APs [9] and discuss different methods for introducing semantic prior knowledge during the generation and post-processing of hierarchical watersheds.

A. Watershed-Based APs
As mentioned in the previous section (see Section III-A), hierarchies of partitions, such as hierarchical watersheds, can equally be represented as a (partition) tree. Hence, the filtering strategy of watershed-APs is similar to the strategy described in [25] for the α- and ω-APs: if a node is filtered out, all of its descendants are also removed from the tree. As discussed in [25], image reconstruction from partition trees is not as straightforward as it is from component trees. For node representation, we adopt one of the solutions proposed in [25] and already mentioned in Section III-B, in which a node is represented by the average gray level of the pixels belonging to it. We highlight that, in the case of multichannel images, the average gray level computed on each band might lead to spectral values not present in the input image. However, in the context of APs used for pixel classification, our aim is not image filtering. Hence, the fact that new spectral values (a.k.a. false colors) are created is not a problem as long as they allow us to distinguish between different semantic classes.
Note that hierarchical watersheds are usually constructed from a gradient of the original image, which contains more information about the contours between salient regions than about the spectral signature of those regions. Hence, we consider the original pixel values, instead of the image gradient, to obtain the node representation.
Formally, let X : P → Z be a gray-scale image and let G = (V, E, w) be a weighted graph which represents a gradient of X, i.e., V = P and, for every edge e = {x, y} in E, the weight w(e) represents the dissimilarity between x and y (e.g., w(e) = |X(x) − X(y)|). Let S be a sequence of minima of G ordered according to a given criterion C, and let H be the hierarchical watershed of G for the sequence S. Given the tree representation T of H, a watershed-AP of X for the criterion C is constructed as a sequence of image reconstructions from filtered versions of T.
Fig. 2 summarizes the construction of the watershed-APs for a given gray-scale image I. The solid arrows represent mandatory steps for the watershed-AP construction, while the dashed arrows indicate optional steps to include prior knowledge, which will be discussed in the following section.

B. Watershed-APs With Semantic Prior Knowledge
As discussed in Section II, user-defined markers and prior knowledge from training pixels are valuable tools for optimizing the construction of various image representations, such as hierarchical segmentations, as well as for postprocessing such representations. In this article, we investigate the use of prior knowledge in the construction of watershed-APs. We present a general framework for including prior knowledge at different stages of the watershed-AP construction, followed by two instances of this framework that are later validated on multispectral remote sensing datasets.
APs and their variants are essentially unsupervised feature extraction methods, in which only the spectral values and the relative positions between pixels are taken into consideration. During the filtering and reconstruction steps of APs, low levels of prior knowledge regarding the shape and size of the objects of interest may be taken into account, but semantic knowledge from training pixels remains little exploited. In this context, we have identified two ways to incorporate prior knowledge into watershed-APs.
1) Hierarchical Watersheds With Semantic Prior Knowledge: In general, a hierarchical segmentation is a satisfactory representation of an image for a given task when the lowest level of the hierarchy contains all regions of interest for this specific task, and when regions are merged in a meaningful way, i.e., similar pairs of regions are merged before dissimilar pairs. In the context of feature extraction with watershed-APs, regions of interest are composed of neighboring pixels belonging to the same semantic class. Due to the interclass similarity and intraclass variability of pixels in remote sensing images, hierarchical segmentations based only on pixel values often do not reflect the inner semantic structure of the data. With that in mind, we rely on training pixels to obtain hierarchical watersheds whose regions partially reflect the semantic structure of the input data. In practice, we aim to enforce regional minima at the regions with high probability of belonging to any given ground-truth class. This is done through a combination of the methods proposed in [10], [12], [13], as described below.

Fig. 3. Semantic prior knowledge combined with the image gradient before the construction of a hierarchical watershed. Given a multispectral image I, a classifier trained on a subset of pixels of I provides probability maps per semantic class. Then, those maps are combined into a single map which is used to obtain an edge-weighted graph G_P. The graph G_P is then combined with the image gradient G_G for each channel of I. The resulting graph is given to step 2 of the watershed-AP pipeline presented in Fig. 2.
Given a dataset I (e.g., a panchromatic or an RGB image) and its training set composed of n classes, we compute its hierarchical watershed using prior knowledge as follows.
i) Training of a classifier on the training set of I and computation of a class probability map per semantic class.
ii) Combination of the class probability maps into a single probability map μ.
iii) Computation of a weighted graph G_P = (V, E, w_P) from the map μ.
iv) Computation of a weighted graph G_G = (V, E, w_G), which represents a gradient of I, i.e., edge weights indicate the dissimilarity between neighboring pixels.
v) Combination of the weight maps w_P and w_G into a map w_GP.
vi) Computation of the hierarchical watershed of G_GP = (V, E, w_GP) for a given sequence of minima of G_GP.
Readers may note that each of the above steps can be implemented in many different ways. In our experiments, our choices have been adopted based on solutions present in the literature and on empirical results on the tested datasets.
In the first step of our method, we are aware that: 1) there might be sample pixels of a given class whose spectral values are not represented in the training set, and 2) there might be pixels in the training set with very similar spectral signatures but which belong to distinct classes. In those cases, we expect the classifier to assign low class probabilities to such pixels. This means that the watershed segmentation at those regions will be mostly guided by the original gray levels of the image gradient. Then, in step (ii), we combine the class probability maps into a single probability map μ. We expect this combination to provide flat zones of pixels with high probability of belonging to any given class, i.e., subsets of pixels that should be merged early in the resulting hierarchical watershed. In the extreme case where the classifier assigns very high class probabilities to all pixels of I, we would have a single flat zone and, consequently, a hierarchy with a single segmentation level. In that case, we expect the pixel features extracted from the resulting AP to have little influence on the final classification results. In other words, the final results will be similar to the ones obtained with the original pixel values. In step (iii), a weighted graph (V, E, w_P) is obtained from the combined probability map μ. In our experiments, we chose to compute edge weights as the maximum between the probability values of neighboring pixels, based on a few experiments with the datasets described in the conference version of this article [9]. In steps (iv) and (v), a gradient (V, E, w_G) of I is computed and then combined with (V, E, w_P). In our experiments, this combination is simply implemented as a multiplication of edge weights, similarly to [10]. We note that the proposed method is related to the ones introduced in [19] and [20], the main differences being the type of hierarchy under consideration and how the original data is combined with the prior knowledge. Finally, in step (vi), we compute a hierarchical watershed H of G_GP for a given sequence of minima of G_GP. Those minima are often ordered according to regional attributes (e.g., area, volume) of the catchment basins associated with each minimum. Those regional attributes are known as extinction values [27], [28].
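Steps (iii)-(v) can be sketched with NumPy as follows. The toy maps are invented, and we assume μ is oriented so that low values mark pixels confidently assigned to some class (so that confident flat zones keep low edge weights, hence become regional minima, after the multiplication):

```python
import numpy as np

def grid_edges(shape):
    """4-adjacency edges of an image grid, as pairs of flat pixel indices."""
    h, w = shape
    idx = np.arange(h * w).reshape(h, w)
    horizontal = np.stack([idx[:, :-1].ravel(), idx[:, 1:].ravel()], axis=1)
    vertical = np.stack([idx[:-1, :].ravel(), idx[1:, :].ravel()], axis=1)
    return np.concatenate([horizontal, vertical])

# Toy inputs (invented): a combined map mu and a single-channel gradient image.
mu = np.array([[0.1, 0.1, 0.9],
               [0.1, 0.1, 0.9],
               [0.8, 0.9, 0.2]])
gradient = np.array([[1., 1., 6.],
                     [1., 2., 6.],
                     [5., 6., 1.]])

edges = grid_edges(mu.shape)
mu_flat, grad_flat = mu.ravel(), gradient.ravel()

# Step (iii): w_P(e) = maximum of the mu values of the two neighboring pixels.
w_P = np.maximum(mu_flat[edges[:, 0]], mu_flat[edges[:, 1]])
# Step (iv): w_G(e) = dissimilarity between neighboring pixel values.
w_G = np.abs(grad_flat[edges[:, 0]] - grad_flat[edges[:, 1]])
# Step (v): combination by multiplication, as in [10].
w_GP = w_P * w_G
```

The resulting weight vector `w_GP`, attached to the 4-adjacency edges, is what step (vi) would pass to the hierarchical watershed computation.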
We now analyze the time complexity of the proposed method as described in Algorithm 1. Given an image I and a training set S of I, whose samples are labeled into n classes, we aim to compute a hierarchical watershed of I with prior knowledge from training pixels. Given that the pixel features are simply their spectral values or are derived from a small window around each pixel, line 1 can be executed in linear time O(|I|). Then, the time complexity to build and train the classifier C depends on the chosen method. For instance, if C is a random forest (RF) composed of m trees, obtained from t training samples with f features, the time complexity to build such a forest is O(m × t × log(t) × f). Taking into account that f is a small constant in our experiments, the time complexity to build such a forest is O(m × t × log(t)). An illustration of the proposed method is given in Fig. 4. A crop of the image zh20 from the Zurich dataset [30], its ground truth composed of six semantic classes, and a class probability map (obtained as described in Algorithm 1) are shown in Fig. 4(a). Fig. 4(b) and (c) shows image reconstructions from filtered versions of two different hierarchical watersheds of zh20: in (b), we considered a hierarchical watershed obtained from the infrared channel of zh20, computed without any prior knowledge from training pixels, and, in (c), we considered a hierarchical watershed with prior knowledge from the map μ, as described in Algorithm 1. When comparing both results, we observe that the images in (c) preserve some of the boundaries between distinct semantic classes, as, for instance, between the regions belonging to the classes trees (represented in dark green in the ground truth) and grass (light green) in the lower right corner of the reconstructed images.
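Steps (i)-(ii) with an RF classifier can be sketched with scikit-learn as follows; the synthetic image, the random labels standing in for ground-truth training pixels, and the choice of the per-pixel maximum as the combined map μ are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for an image: 20x20 pixels, 3 spectral bands.
image = rng.random((20, 20, 3))
pixels = image.reshape(-1, 3)

# Step (i): train a classifier on a small labeled subset (random labels
# over 2 classes stand in for the ground-truth training pixels).
train_idx = rng.choice(pixels.shape[0], size=40, replace=False)
train_labels = rng.integers(0, 2, size=40)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(pixels[train_idx], train_labels)

# Per-class probability maps over the whole image.
proba = clf.predict_proba(pixels)            # shape (n_pixels, n_classes)
prob_maps = proba.T.reshape(-1, 20, 20)      # one 20x20 map per class

# Step (ii): combine the class maps into a single map mu, e.g., by taking
# the maximum class probability at each pixel (one possible choice).
mu = prob_maps.max(axis=0)
```

In practice the spectral values of the real training pixels replace the random labels, and μ is then turned into edge weights as described in step (iii).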
2) Hierarchy Filtering With Semantic Prior Knowledge: The illustration given in Fig. 4 shows that hierarchical watersheds computed with prior knowledge from training pixels can indeed provide regions which are more semantically coherent. On the other hand, considering that our objective is feature extraction at the pixel level, including prior knowledge in the hierarchy construction may suppress part of the finer regions present in the original data. In order to preserve the information provided by those finer regions, we can delay the utilization of prior knowledge in the watershed-AP pipeline to the filtering step (step 3 of the pipeline given in Fig. 2). The idea is to replace the handcrafted criteria usually employed at this step (e.g., area and moment of inertia thresholds) by markers obtained from class probability maps per semantic class, such as the ones given in Fig. 3. Given a dataset I and its training set composed of n classes, we propose the following pipeline.
1) Computation of a hierarchical watershed H of I.
2) Training of a classifier using the training set of I and computation of per-pixel class probabilities (p_1, . . ., p_n) for all pixels of I.
3) For each probability map p_i, computation of a list L_i = (p_i^1, . . ., p_i^k) of k binary thresholded versions of p_i such that, in each map p_i^m, white (one-valued) pixels are the ones whose class probabilities are larger than a given threshold.
4) Filtering out the nodes/regions of H which only contain black (zero-valued) pixels.
5) Reconstruction of an image from each of the filtered versions of H computed in the previous step.
The proposed pipeline is shown in Fig. 5 and described in Algorithm 2. In terms of time complexity, Algorithm 2 adds as much to the complexity of computing watershed-APs as does the method described in Algorithm 1. More precisely, both algorithms include the training of a classifier C and the computation of classification probability maps per semantic class, which constitute the most time-consuming parts of the algorithms. Then, in lines 7-17 of Algorithm 2, we perform the filtering step of the watershed-AP. In practice, filtering the input hierarchy is indeed similar to the standard filtering step of APs using criteria such as area and moment of inertia: attribute values are propagated from the leaves to the root node, and nodes are removed if their attribute values are below a given threshold value. Therefore, we can conclude that the proposed methods for including semantic prior knowledge into watershed-APs present equivalent time complexities. Fig. 6 illustrates the method described in Algorithm 2 on a cropped patch of an image from the Vaihingen dataset [31]. The image and its ground-truth regions are given in Fig. 6(a). In Fig. 6(b) and (c), we give thresholded versions of the probability map for the class trees (represented in green in the ground truth), along with the image reconstructions obtained from a hierarchical watershed filtered using each of those probability maps, as described in Algorithm 2. Similarly, Fig. 6(d) and (e) shows the results obtained for the class cars (represented in yellow in the ground-truth image). For better visualization, all image reconstructions are represented in false colors. In Fig. 6(c), we observe that the finer regions belonging to the class trees are preserved in all reconstructions, and the same is true for the class cars in Fig. 6(e).
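The marker-based filtering of steps 4 and 5 can be sketched on a parent-array tree representation (the toy tree, node values, and marker map below are invented for illustration): marker presence is propagated from the leaves to the root, nodes without any marked pixel are removed, and each pixel is reconstructed from its closest surviving ancestor:

```python
import numpy as np

# Toy partition tree: leaves 0..3 (pixels), inner nodes 4..6 (6 = root).
# parent[i] is the parent of node i; the root is its own parent.
# Parents always come after their children (leaves-to-root order).
parent = np.array([4, 4, 5, 5, 6, 6, 6])
n_leaves = 4
node_value = np.array([10., 20., 30., 40., 15., 35., 25.])  # e.g., mean gray level

marker = np.array([True, False, False, False])  # thresholded probability map

# Propagate marker presence from the leaves to the root.
contains_marker = np.zeros(parent.size, dtype=bool)
contains_marker[:n_leaves] = marker
for node in range(parent.size - 1):
    contains_marker[parent[node]] |= contains_marker[node]

# Reconstruction: each pixel takes the value of its closest surviving node.
def reconstruct(leaf):
    node = leaf
    while not contains_marker[node] and parent[node] != node:
        node = parent[node]
    return node_value[node]

reconstruction = np.array([reconstruct(leaf) for leaf in range(n_leaves)])
print(reconstruction)  # [10. 15. 25. 25.]
```

Leaf 0 is marked and keeps its own value, while the unmarked leaves inherit the values of their nearest ancestors containing a marked pixel, mirroring the behavior of Algorithm 2 at one threshold level.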

V. EXPERIMENTAL SETUP
In this section, we conduct experiments to evaluate the performance of the proposed watershed-APs (computed with and without prior knowledge) in the context of land-cover classification and building extraction from remote sensing images. We first describe the multispectral images considered in our study, as well as the experimental settings used for evaluation. In the following section, we provide a detailed analysis and show that watershed-APs often outperform APs and their variants, including SDAP [32], α-AP [25], and ω-AP [25], on both datasets.
Experiments were performed in Python using the open-source simple attribute profile (SAP) and Higra [33] libraries.

A. Datasets
Our experiments were conducted on the multispectral remote sensing Vaihingen [31] and Zurich [30] datasets.
The Vaihingen dataset [31] contains 38 images collected over the city of Vaihingen, Germany. Each image has a spatial resolution of 9 cm and is composed of four channels (near infrared, red, green, and blue) plus a digital surface model (DSM). As done in [34], we only consider the first three channels, namely near infrared, red, and green, in our experiments. Image widths and heights range in the intervals [1388, 3816] and [1281, 3313], respectively. Ground-truth annotations, which are provided for each of the 38 images, consist of at most six thematic classes: impervious surfaces, buildings, low vegetation, trees, cars, and background/clutter. Following previous works [34], we only consider the 16 images for which ground-truth annotations were provided during the ISPRS semantic labeling contest. Moreover, the background class is not used for land-cover classification. To perform building extraction, the ground-truth labels are divided into two classes: buildings and background (i.e., the union of all remaining semantic classes). Training samples were extracted from eleven images (image IDs: 1, 3, 5, 7, 13, 17, 21, 23, 26, 32, 37), and the remaining five images (image IDs: 11, 15, 28, 30, 34) were used for evaluation. For each semantic class, training samples were composed of 1% of randomly selected pixels in the training images, leading to the numbers of training and test pixels per semantic class given in Table I. The test images of the Vaihingen dataset and their ground truths are shown in Fig. 7.
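The per-class sampling of 1% of the labeled pixels can be sketched as follows; the helper name and its exact behavior (one draw per class, background ignored) are hypothetical, since the article does not specify the sampling code.

```python
import numpy as np

def sample_training_pixels(labels, fraction=0.01, ignore=0, seed=0):
    """Randomly draw `fraction` of the labeled pixels of each class.

    labels: 2-D ground-truth map; `ignore` marks unlabeled pixels.
    Returns flat pixel indices of the selected training samples.
    """
    rng = np.random.default_rng(seed)
    flat = labels.ravel()
    selected = []
    for cls in np.unique(flat):
        if cls == ignore:
            continue                       # skip background/unlabeled pixels
        idx = np.flatnonzero(flat == cls)
        n = max(1, int(round(fraction * idx.size)))
        selected.append(rng.choice(idx, size=n, replace=False))
    return np.concatenate(selected)
```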
The Zurich Summer dataset [30] is a collection of 20 images obtained from a QuickBird acquisition of the city of Zurich, Switzerland, in August 2002. The images in this dataset have various dimensions, with widths and heights in the [622, 1639] and [782, 1830] ranges, respectively, and are composed of four channels (near infrared, red, green, and blue). The ground truth provided for each image consists of at most nine thematic classes, namely: roads, buildings, trees, grass, bare soil, water, railways, swimming pools, and background. For the building extraction task, all classes except buildings are merged into a single background class. Following previous works [34], [35], training pixels were extracted from the first fifteen images of this dataset, and the remaining five images (zh16, zh17, zh18, zh19, and zh20) were used for evaluation. As for the Vaihingen dataset, the background class of the Zurich dataset is not considered for land-cover classification. In total, the training set is composed of 122,630 pixels, which corresponds to 1% of the labeled pixels randomly extracted for each class of each training image. The number of training and test pixels per semantic class is given in Table II. The test images and their ground truths are given in Fig. 8.

B. Experimental Settings
Most works in the literature evaluate AP and its variants in the following way: training and test pixels come from the same remote sensing image, which allows training and test features to be extracted from the same hierarchical representation of the image under study.As discussed in [5], this setting does not generalize well to more realistic scenarios, in which we may have a dataset composed of several images without any annotation.For that reason, we extend the evaluation on the Zurich dataset already presented in our conference paper [9].In the present article, training and test features are extracted from independent hierarchical representations computed from the training and test images.
AP and its variants, including watershed-APs, were computed with the usual area and moment of inertia (MoI) attributes. The following ten area thresholds, together with four MoI thresholds, were adopted for both datasets: λ_area = {25, 100, 500, 1000, 5000, 10000, 20000, 50000, 100000, 150000}. The only exceptions are the watershed-APs filtered using semantic prior knowledge, as described in Section IV-B2. To compute those watershed-APs, the hard-coded criteria based on the area and MoI attributes are replaced by each node's probability of belonging to a given ground-truth semantic class. In our experiments, the set T of threshold values used by Algorithm 2 is defined as an interval of evenly spaced numbers ranging from the minimum to the maximum value of each classification probability map. In order to obtain a number of features similar to the other APs, different numbers of threshold values were considered for each task and each dataset (see Table III).
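The evenly spaced threshold set T and the resulting binary maps can be sketched as follows (hypothetical helper name; pixels strictly above each level are set to one, matching the thresholding rule of Section IV-B2, so the highest level always yields an empty map):

```python
import numpy as np

def threshold_maps(prob_map, k):
    """Binarize a class-probability map at k evenly spaced levels ranging
    from its minimum to its maximum value.  Pixels strictly above a level
    are set to one, so the highest level always yields an empty map."""
    levels = np.linspace(prob_map.min(), prob_map.max(), k)
    return [prob_map > t for t in levels], levels
```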
The APs and their extensions are first computed independently on each of the NIR+RGB bands (resp. NIR+RG) of the Zurich (resp. Vaihingen) dataset and then concatenated. For each image I in the Zurich and Vaihingen datasets, hierarchical watersheds were computed from two four-connected edge-weighted graphs: the first, G_G = (V, E, w_G), obtained from a gradient of the original data I (without semantic prior knowledge from training pixels), and the second computed from the combination of G_G with the classification probability map obtained from the training set of I, as described in Section IV-A. Supervised pixel classification was performed twice, once for obtaining the classification probability map and then to provide the final land-cover or building extraction classification. Both were performed using an RF classifier with 100 trees. The number of variables used for training was set to the square root of the length of the feature vectors. To optimize the construction and filtering of hierarchical watersheds using prior knowledge, the first RF, which is used to obtain the classification probability maps, is trained on the features extracted from a 5 × 5 window around each training pixel.
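As a minimal sketch of the graph construction, the following hypothetical helper builds a four-connected edge-weighted graph over an image, using the absolute difference of adjacent pixel values as a simple stand-in for the article's gradient-based weights w_G (the exact gradient used is not reproduced here):

```python
import numpy as np

def four_adjacency_graph(values):
    """Edges and weights of the 4-connected graph of a 2-D image.  Each
    edge links two horizontally or vertically adjacent pixels; its weight
    is the absolute difference of the two pixel values (a stand-in for
    the article's gradient-based weights)."""
    h, w = values.shape
    idx = np.arange(h * w).reshape(h, w)
    pairs = [(idx[:, :-1], idx[:, 1:]),   # horizontal neighbors
             (idx[:-1, :], idx[1:, :])]   # vertical neighbors
    edges = np.concatenate([np.stack([a.ravel(), b.ravel()], axis=1)
                            for a, b in pairs])
    flat = values.ravel().astype(float)
    weights = np.abs(flat[edges[:, 0]] - flat[edges[:, 1]])
    return edges, weights
```

For an h × w image this produces h(w-1) + (h-1)w edges, i.e., the number of edges grows linearly with the number of pixels, consistent with the complexity analysis of Algorithm 1.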
For land-cover classification, the different approaches are compared using the overall accuracy (OA), the average accuracy per class (AA), and the κ coefficient, as done in [3]. To evaluate the proposed methods on building extraction, the following standard measures were considered: OA, precision, recall, the F1 score, and the mean intersection over union (mIOU). For each tested method, we report the mean and standard deviation of the classification scores over ten runs in the form mean ± standard deviation.
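For reference, all of the reported measures can be derived from a confusion matrix, as in the following sketch (hypothetical helper; precision, recall, and F1 are reported for class 1, i.e., buildings in the binary task):

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """OA, AA, kappa, then precision/recall/F1 of class 1 (e.g., the
    building class in the binary task), and the mean IoU over classes."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                      # rows: reference, columns: prediction
    total = cm.sum()
    oa = np.trace(cm) / total                          # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))         # mean per-class accuracy
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / total**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    tp, fp, fn = cm[1, 1], cm[0, 1], cm[1, 0]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    ious = np.diag(cm) / (cm.sum(axis=1) + cm.sum(axis=0) - np.diag(cm))
    return oa, aa, kappa, precision, recall, f1, ious.mean()
```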
As defined in Section III-A, hierarchical watersheds can be computed for any given ordering on the minima of a weighted graph.In our experiments, such orderings are obtained from extinction values [27], [28] based on the area, dynamics, and volume attributes.
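As a minimal illustration of the dynamics attribute underlying one of these orderings, the following hypothetical helper computes the dynamics of the minima of a 1-D signal, i.e., the smallest climb needed to reach a strictly lower value (the global minimum receives infinite dynamics, as it is flooded last); this naive O(n²) sketch is not the Higra routine used in the experiments.

```python
import numpy as np

def dynamics_1d(x):
    """Dynamics of the strict local minima of a 1-D signal: the smallest
    climb needed to reach a strictly lower value.  Returns {index: dynamics};
    the global minimum gets infinite dynamics."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dyn = {}
    for i in range(n):
        if (i > 0 and x[i] >= x[i - 1]) or (i < n - 1 and x[i] >= x[i + 1]):
            continue                       # not a strict local minimum
        best = np.inf
        for step in (-1, 1):               # walk left, then right
            climb, j = 0.0, i + step
            while 0 <= j < n:
                climb = max(climb, x[j] - x[i])
                if x[j] < x[i]:            # reached a strictly lower value
                    best = min(best, climb)
                    break
                j += step
        dyn[i] = best
    return dyn
```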

VI. EXPERIMENTAL RESULTS
A. Land-Cover Classification
Tables IV-IX present the results of land-cover classification on the Vaihingen and Zurich datasets. We compare the performance of the following methods: the baseline, in which every pixel is represented by its NIR+RG or NIR+RGB values; AP-maxT and AP-minT, obtained by filtering the max- and min-tree, respectively; AP [3], obtained as the concatenation of AP-maxT and AP-minT; SDAP [32]; α-AP and ω-AP [25]; and the watershed-APs computed/filtered with and without prior knowledge. To simplify the notation, the watershed-AP computed without prior knowledge is denoted as A-WS-AP, and the watershed-APs computed and filtered using prior knowledge are denoted as A-CPWS-AP and A-FPWS-AP, respectively, where A is the attribute used in the construction of the hierarchical watersheds, namely area, dynamics (Dyn), or volume (Vol).
On the Vaihingen dataset, as shown in Tables IV and V, the Area-WS-AP and Vol-WS-AP, which are computed without any prior knowledge, outperform all other methods in the literature in terms of OA and κ scores. The same is true for Area-CPWS-AP and Vol-CPWS-AP, as well as for all three FPWS-APs. Our best method, namely Vol-FPWS-AP, outperforms AP by 2.46% and 3.29% in terms of OA and κ scores, respectively. Regarding the use of prior knowledge for WS-APs, all three FPWS-APs performed better than their respective CPWS-AP and WS-AP counterparts. The most meaningful improvements were observed for Dyn-FPWS-AP, which presented OA, AA, and κ scores more than 2% higher than those of Dyn-WS-AP. Considering the classification results per semantic class given in Table V, we observe that the watershed-APs yielded the best results for the impervious surfaces, buildings, low vegetation, and tree classes, but fall behind AP on the classification of cars.
To provide further insight into the performance of APs in the context of semantic segmentation, we compare one of our results with a recent deep learning method [34] (see Table VI). In [34], the authors propose an adaptation of the FCN to handle sparsely annotated training data, and train their model with different types of scribbled annotations.

On the Zurich dataset, as shown in Table VII, the Area-WS-AP and Vol-WS-AP, computed without any prior knowledge, as well as the CPWS-APs, outperform all other methods in the literature in terms of OA and κ. Our best method, namely Vol-CPWS-AP, outperforms AP by 3.29%, 2.26%, and 4.17 in terms of OA, AA, and κ × 100, respectively. Considering the classification results per semantic class given in Table VIII, we observe that the watershed-APs yielded the best results for most classes, namely roads, buildings, trees, grass, and water, but, for the remaining three classes, none of the APs outperformed the baseline NIR+RGB. Concerning the use of prior knowledge, it led to consistent improvements when used in the construction of hierarchical watersheds: all three CPWS-APs outperformed their respective WS-APs by at least 1% in terms of OA, AA, and κ. The most significant improvement was observed for Dyn-CPWS-AP, which outperformed Dyn-WS-AP by 3.08%, 2.21%, and 3.89 in terms of OA, AA, and κ × 100, respectively. However, regarding FPWS-APs, we concluded that this method is not well adapted to this dataset and, hence, we do not present land-cover evaluation scores of FPWS-APs on Zurich. The reason is that the Zurich test images do not contain ground-truth pixels of all eight semantic classes and, unlike CPWS-APs, the FPWS-APs are constructed by considering the classification probability map of each class individually. For instance, the swimming pool class is present in only two of the five Zurich test images. Hence, the classification probability map for this semantic class would be meaningless for the remaining three test images, leading to flat or redundant image reconstructions.
Table IX compares the F1 scores per class of our best method, namely Vol-CPWS-AP, with the results of two FCN-based deep learning methods presented in [34]. As for the Vaihingen dataset, their best results on Zurich were achieved by the FCN-Festa-dCRF method trained on scribbled lines, which are composed of 330,767 annotated samples from all semantic classes, except for background/clutter. As we can see in Tables VII and IX, our Vol-CPWS-AP outperformed FCN-Festa-dCRF in terms of OA and F1 scores for all semantic classes, except for the bare soil class. We note that our results were achieved using less than half of their number of training samples. On the other hand, in contrast to our experiments, the authors of [34] do not consider the blue channel of the Zurich dataset, which may explain part of the gap between both methods. Moreover, APs still fall behind the state-of-the-art FCN model trained on dense annotations (i.e., trained on all pixels of the training set). Fig. 9 illustrates the classification results on the test image zh18 of the Zurich dataset. As already suggested by the scores given in Table VIII, we can see that none of the APs is able to improve the results for the bare soil class, but most of them provide better classification results for the roads (in black) and grass (in light green). When comparing the WS-APs and CPWS-APs, the most significant improvements are observed for the roads, trees, and grass classes, in the regions highlighted by the blue boxes in Fig. 9(m)-(o). The red boxes indicate a few regions in which classification results were worsened by the use of semantic prior knowledge.

B. Building Extraction
Other methods in the literature that also employ prior knowledge for image segmentation, such as [10], [11], perform well on the binary classification/segmentation of images into object and background regions. We therefore aim to validate our watershed-APs on a binary classification task that is meaningful for remote sensing imagery, namely building extraction. For evaluating our proposed methods on this task, we consider two semantic classes on the Zurich and Vaihingen datasets, namely buildings and background, the latter being the union of all remaining classes of each dataset, as mentioned in Section V-A. Evaluating our methods on this binary classification task may give us a clearer idea of their performance, since the semantic prior knowledge comes from only two classes. Moreover, we no longer have semantic classes, e.g., swimming pools, which are absent from some of the Zurich test images.
Table X reports the evaluation results of building extraction on the Vaihingen dataset. On this dataset, the highest F1 score among the APs was achieved with the Area-FPWS-AP, which outperformed one of the best methods in the literature, namely AP, by 0.51%, 1.90%, and 2.59% in terms of OA, F1, and mIOU, respectively. Regarding the use of prior knowledge in watershed-APs, both Area-CPWS-AP and Area-FPWS-AP (resp. Vol-CPWS-AP and Vol-FPWS-AP) outperformed Area-WS-AP (resp. Vol-WS-AP) with respect to all metrics except precision, which validates the interest of semantic prior knowledge in the context of building extraction. The largest improvement was observed for the watershed-AP filtered with area, with the Area-FPWS-AP being 1.24%, 7.70%, 4.08%, and 5.47% better than Area-WS-AP with respect to the OA, recall, F1, and mIOU scores, respectively. All three FPWS-APs presented very similar results and outperformed their respective WS-AP counterparts by more than 6%, 3%, and 4% in terms of recall, F1, and mIOU scores, respectively.
Classification results on one image of the Vaihingen dataset are illustrated in Fig. 10. The true positives, false negatives, and false positives are represented in white, red, and cyan, respectively. On this image, we can observe that the area and volume watershed-APs produce less noisy classification results than the other methods. In particular, Vol-CPWS-AP presents a higher precision than all other methods.
Table XI presents the evaluation of building extraction on the Zurich dataset. With respect to the baseline NIR+RGB, all APs (except AP-minT) provided better scores for most metrics. For each metric, the highest score was achieved by one of the watershed-APs, with Area-FPWS-AP giving the highest OA, F1, and mIOU scores among all tested methods. In general, FPWS-APs led to more pixels being classified as belonging to the building class, increasing the numbers of both true and false positives, as attested by the lower precision and higher recall scores of FPWS-APs when compared to CPWS-APs. Still, Area-FPWS-AP (resp. Dyn-FPWS-AP) achieved F1 scores at least 4% higher than Area-WS-AP and Area-CPWS-AP (resp. Dyn-WS-AP and Dyn-CPWS-AP).
In conclusion, watershed-APs computed with the help of semantic prior knowledge showed their usefulness in both land-cover classification and building extraction on multispectral data. On both tested datasets, the highest scores were achieved by one of the watershed-APs, whether CPWS-APs or FPWS-APs. Based on our experiments, the effectiveness of considering semantic prior knowledge either during the construction phase of hierarchical watersheds or during the filtering step of watershed-APs depends on the task and on the dataset under study. For land-cover pixel classification, CPWS-APs seem to be better adapted than FPWS-APs when there is a high imbalance in the number of semantic classes present in each test image, as in the Zurich dataset. On the other hand, FPWS-APs performed better than CPWS-APs on the Zurich dataset when the data was divided into only two classes: building and background.

VII. CONCLUSION
This article proposed the watershed-AP as an extension of AP to hierarchical watersheds computed from (edge-)weighted graphs. Besides, the relevance of using semantic prior knowledge was considered in the construction and filtering of such hierarchies. We evaluated and validated our approach on pixel classification and building extraction over two multispectral remote sensing datasets, which showed the potential of hierarchical watersheds in this field. On both datasets, the best results were achieved by watershed-APs computed or filtered with prior knowledge from training pixels.
As future work, we will explore the versatility of hierarchical watersheds by considering other methods to obtain the gradient of remote sensing images, as well as different ways of obtaining prior knowledge from training pixels and of incorporating such knowledge into the watershed-AP construction pipeline. The proposed approach also has great potential to be adapted and applied to other remote sensing data such as synthetic aperture radar images or LiDAR point clouds.

Fig. 1. (a) A weighted graph G = (V, E, w). (b) A tree representation of the hierarchical watershed H of G for the sequence (C, A, B, D) of minima of G. (a) G = (V, E, w). (b) H.

Fig. 2. General framework to compute watershed-APs. Here, the solid arrows represent mandatory steps, while the dashed arrows indicate optional steps to include prior knowledge. Details are provided in Section IV-B.
i) Training of a classifier using the training set of I and computation of per-pixel class probabilities (p_1, ..., p_n) for all pixels of I. ii) Combination of the class probabilities into a single map μ. iii) Computation of a weighted graph G_P = (V, E, w_P) from μ. iv) Computation of a weighted graph

In line 5, C(x) is thus computed in O(m × d), where d is the maximal depth of each tree. Considering that the number n of semantic classes is a small constant in most datasets, the for loops of lines 4-9 are executed in time O(|I| × m × d). The graph G can be constructed in time O(|I|), given that each vertex in V corresponds to a pixel in I and that the number of edges grows linearly with the number of vertices in a 4- or 8-connected graph. That being said, the three for loops in lines 14-21 are executed in time O(|I|). Finally, the hierarchical watershed H can be obtained in time O(|I| log |I|), as stated in [29]. Therefore, the overall time complexity of Algorithm 1 is O(m × t × log(t) + |I| × m × d + |I| log |I|) if we consider the learning step of the classifier C. However, since C is trained only once, we can simplify it to O(|I| × m × d + |I| log |I|), which is a function of the number m of trees of C, their depths, and the dimensions of the image I.

Fig. 4. Image reconstructions (represented in false colors) obtained from two hierarchical watersheds of the image zh20 (Zurich dataset, Section V-A) computed without semantic prior knowledge (b) and with semantic prior knowledge given by the probability map μ (c). The reconstructions given in (b) and (c) are computed by filtering their respective hierarchical watersheds with the area attribute and the following thresholds: 5k, 10k, and 20k. (a) Cropped RGB image zh20 from the Zurich dataset, its ground truth, and the combined probability map μ. (b) Images reconstructed from filtered versions of a hierarchical watershed computed from a gradient of I. (c) Images reconstructed from filtered versions of a hierarchical watershed computed from a combination of a gradient of I with μ.

Fig. 5. Semantic prior knowledge employed at the filtering step of watershed-AP. Given a multispectral image I, a classifier trained on a subset of pixels of I provides probability maps per semantic class. Then, those maps are thresholded at many different levels, resulting in binary maps whose white pixels have the highest probabilities of belonging to a given class. Finally, the thresholded probability maps are used as markers to guide the filtering of the input hierarchical watershed.

Fig. 6. Comparison between image reconstructions obtained with the filtering method illustrated in Fig. 5. To better distinguish between neighboring regions with similar gray levels, all image reconstructions are represented in false colors. (a) Cropped patch of the first image (denoted here as vn_1) from the Vaihingen dataset [31] and its ground-truth semantic classes. (b) Class probability map for the tree class (in green) thresholded at increasing values. (c) Image reconstructions (in false colors) obtained by filtering a hierarchical watershed of vn_1 using the maps given in (b). (d) Class probability map for the car class (in yellow) thresholded at increasing values. (e) Image reconstructions (in false colors) obtained by filtering a hierarchical watershed of vn_1 using the maps given in (d).

Fig. 8. Test images of the Zurich dataset. First line, from left to right: RGB images with IDs from 16 to 20. Second line: ground-truth labels including nine semantic classes: roads, buildings, trees, grass, bare soil, water, railways, swimming pools, and background.

TABLE I. NUMBER OF TRAINING AND TEST SAMPLES OF THE VAIHINGEN DATASET USED FOR LAND-COVER CLASSIFICATION AND BUILDING EXTRACTION

TABLE II. NUMBER OF TRAINING AND TEST SAMPLES OF THE ZURICH DATASET USED FOR LAND-COVER CLASSIFICATION AND BUILDING EXTRACTION

On this dataset, Vol-FPWS-AP approached the scores of FCN-Festa-dCRF on four classes, namely impervious surfaces, buildings, low vegetation, and trees, but presented much poorer results on the classification of cars. With respect to FCN, both Vol-FPWS-AP and FCN-Festa-dCRF performed worse in general, except for the trees class, for which Vol-FPWS-AP and FCN-Festa-dCRF outperformed FCN by

TABLE V. AVERAGE ACCURACY PER CLASS FOR THE TEST IMAGES OF VAIHINGEN OVER TEN DIFFERENT RANDOM TRAINING SETS

TABLE VI. COMPARISON OF PER-CLASS F1 SCORES OBTAINED WITH VOL-FPWS-AP FOR THE TEST IMAGES OF VAIHINGEN WITH THE RESULTS OF TWO DEEP LEARNING METHODS REPORTED IN [34]

TABLE VIII. AVERAGE ACCURACY PER CLASS FOR THE TEST IMAGES OF ZURICH OVER TEN DIFFERENT RANDOM TRAINING SETS

TABLE IX. COMPARISON OF PER-CLASS F1 SCORES OBTAINED WITH VOL-CPWS-AP FOR THE TEST IMAGES OF ZURICH WITH THE RESULTS OF TWO DEEP LEARNING METHODS REPORTED IN [34]

TABLE XI. AVERAGE CLASSIFICATION RESULTS OF BUILDING EXTRACTION ON THE ZURICH DATASET FOR THE IMAGES 16-20 OF ZURICH. The samples are divided into only two classes: background and buildings.