Crowdsourcing Experiment and Fully Convolutional Neural Networks for Coastal Remote Sensing of Seagrass and Macroalgae

Brandon Hobley, Michal Mackiewicz, Julie Bremner, Tony Dolphin, and Riccardo Arosio

Abstract—Recently, convolutional neural networks and fully convolutional neural networks (FCNs) have been successfully used for monitoring coastal marine ecosystems, in particular vegetation. However, even with recent advances in computational modeling and data acquisition, deep learning models require substantial amounts of good quality reference data to effectively self-learn internal representations of input imagery. The classical approach for coastal mapping requires experts to transcribe in situ records and delineate polygons from high-resolution imagery such that FCNs can self-learn. However, labeling by a single individual limits the training data, whereas crowdsourcing labels can increase the volume of training data, but may compromise label quality and consistency. In this article, we assessed the reliability of crowdsourced labels on a complex multiclass problem domain for estuarine vegetation and unvegetated sediment. An interobserver variability experiment was conducted in order to assess the statistical differences in crowdsourced annotations for plant species and sediment. The participants were grouped based on their discipline and level of expertise, and the statistical differences were evaluated using Cochran's Q-test and the annotation accuracy of each group to determine observation biases. Given the crowdsourced labels, FCNs were trained with majority-vote annotations from each group to check whether observation biases were propagated to FCN performance. Two scenarios were examined: first, a direct comparison of FCNs trained with transcribed in situ labels and crowdsourced labels from each group was established. Then, transcribed in situ labels were supplemented with crowdsourced labels to investigate the feasibility of training FCNs with crowdsourced labels in coastal mapping applications. We show that annotations sourced from discipline experts (ecologists and geomorphologists) familiar with the study site were more accurate than experts with no prior knowledge of the site and nonexperts, with our results confirming that biases in participant annotation were propagated in FCN performance. Furthermore, FCNs trained with a combined dataset of in situ and crowdsourced labels performed better than FCNs trained on the same imagery with in situ labels.

I. INTRODUCTION
Coastal ecosystems, such as wetlands, estuaries, and coral reefs, represent dynamic and important nurturing habitats for a wide variety of plants, fish, shellfish, and other wildlife [1]. With growing concerns over climate change, these coastal areas will be subject to changing atmospheric and ocean temperatures, sea levels, ocean chemistry, weather patterns, and the increased demands of a growing global population. This emphasizes the need to create and act on strategies that maintain a sustainable balance of coastal ecosystem health while also effectively managing the use of resources that are derived from these ecosystems [2], [3].
In coastal monitoring, remote sensing has provided a major platform for ecologists to assess and monitor sites in many applications [4], [5]. Satellite imagery can provide global to regional observations at regular sampling intervals with successful applications for coastal management [2]. However, this avenue of data acquisition often struggles with cloud contamination, oblique views, costs for data acquisition, and coarse resolution relative to the often narrow features of interest that stretch along the coast [6]. The shift to uncrewed aircraft systems (UASs) and commercially available cameras tackles the latter issues as it resolves coarse satellite resolution (typically 2-30 m) by collecting several overlapping very high resolution (VHR) images and stitching sensor outputs together using Structure from Motion (SfM) techniques to create high-resolution orthomosaics [7], [8] (commonly less than 0.1 m).
Parallel to the advancements in data acquisition, computer vision (CV) has also improved over the past decade with deep learning (DL) [9] and the introduction of convolutional neural networks (CNNs) [10]. These methods have surpassed previous state-of-the-art results in a wide variety of CV applications [10], [11], [12]. Traditionally, supervised machine learning (ML) methods can be defined by two separate components: feature extraction and model training. Instead, CNNs learn hierarchical abstract representations of input imagery in a self-learning fashion, which in effect combines feature learning and supervised classifier training in one optimization [9]. Fully convolutional neural networks (FCNs) are an adaptation of CNNs that perform per-pixel classifications and enable contextual features to be extracted within a wide receptive field while also preserving the spatial origin of these features to produce a fine-grained and spatially explicit segmentation of the object [11], [13]. This is appropriate for remote sensing mapping applications, where aerial imagery can be segmented into meaningful sets of classes in order to delineate objects or species of interest [14], [15], [16], [17], [18], [19].
This said, even with the advent of UASs and FCNs to map coastal environments, the quantity and quality of data labels is a pivotal concern in many real-world scenarios because DL models perform best with large, labeled, training datasets [9], [20]. In remote sensing, reference observations (FCN training data) are often acquired in situ, which involves high logistical effort, potential inaccuracies due to geolocation errors, as well as sampling and observation bias [21], [22]. Moreover, the volume of data generated with UAS imagery may cover a substantial spatially continuous area with respect to the real world, yet the ratio between the area covered via in situ surveying and the total area covered in imagery is often relatively small [16], [23]. Methods, such as transfer learning [24], data augmentation [25], and semisupervision [26], [27], can provide tools for FCNs to self-learn if there are limited amounts of labeled data, as is often the case for environmental monitoring. However, an alternative for efficient in situ data collection is visual identification and delineation of training data directly from orthomosaics [28], [29], [30], possible in UAS imagery because the resolution is sufficiently high that even features as small as 10 × 10 cm can often be accurately identified and labeled. Further to this, crowdsourced labels can provide an even more cost-effective alternative to laborious labeling procedures from aerial imagery involving individual domain-specific experts, with studies showing that aggregated labels can provide better quality generalization in ML modeling, which draws parallels with field-of-experts frameworks and ensemble learning [31], [32].
Remote sensing applications have also leveraged the use of crowdsourced labels to supplement aerial imagery datasets in a variety of manners [33]. Commonly, web-based applications prompt participants to classify binary tasks with known GPS information for accurate geolocation. This has led to successful workflows that combine DL and crowdsourcing for several study sites: Guatemala, Laos, and Malawi using MapSwipe [34]; the Missing Maps humanitarian project using OpenStreetMap [35]; settlements in Nigeria, Somalia, Pakistan, and Afghanistan using the Tomnod platform [36]; and crop mapping in South East India using Plantix [37]. Furthermore, coastal surveying has also leveraged crowdsourced annotations for DL applications of litter mapping on the shores of Xabelia beach in Lesvos, Greece [38], and shoreline change mapping in two open-coast sandy beaches located within the Sydney metropolitan area [39].
These studies focus on combining crowdsourced labels with DL models on binary problem domains to avoid ambiguity for participants and erroneous labeling [33]. In contrast, coastal mapping requires the identification of multiple feature classes, some of which are superficially similar depending on the situation (e.g., sand and mud, or seagrass and filamentous algae).
In this article, we tackled the problem of deriving crowdsourced training data for estuarine vegetation and unvegetated sediment ecosystems at Budle Bay (Northumberland, U.K.). We performed an interobserver experiment of crowdsourced annotations on a complex multiclass problem domain that includes intertidal coastal species, such as seagrass, saltmarsh, and macroalgae. The experimental population consisted of 12 participants split into 3 groups based on their discipline and level of expertise in habitat mapping. The experiment was analogous to crowdsourcing labeled data in remote sensing applications as participants were prompted to classify predetermined points. Our experimental setup comprised two sets of points: a set whereby the true semantic value of each human annotation was known according to an in situ survey of the study site conducted by the U.K. Centre for Environment, Fisheries and Aquaculture Science (Cefas) and U.K. Environment Agency (EA), and an extra set of points created through expert photointerpretation to balance class distribution (see Section II-D).
The analysis of our interobserver variability experiment uses Cochran's Q-test to assess the statistical differences of crowdsourced annotations from each group. Furthermore, the annotation accuracy and a per-class analysis of crowdsourced annotations were used to assess any potential observation biases.
Given the annotations from the interobserver experiment, the feasibility of FCNs trained with crowdsourced annotations was investigated in two scenarios. First, four FCNs were trained with different versions of labeled data on the same imagery: three FCNs were trained with labels based on majority-vote annotations from each participant group in the interobserver experiment, and the other FCN was trained with transcribed labels from the in situ survey. This scenario allows for a direct performance comparison between FCNs trained with in situ labels and crowdsourced labels, and evaluates whether biases in crowdsourced annotations were propagated in FCN performance. The second scenario investigates the feasibility of supplementing transcribed in situ labels with crowdsourced labels using two FCNs. For this scenario, one FCN was trained with the set of points described in Section II-D, whereas the other FCN was trained with a combination of transcribed in situ labels and crowdsourced labels on the same imagery. Consequently, we list the following contributions of this article.
1) Discipline experts (ecologists and geomorphologists) familiar with the study site were more accurate than experts with no prior knowledge of the site and nonexperts.
2) FCNs trained with crowdsourced labels from discipline experts familiar with the site had comparable performance to FCNs trained with in situ labels.
3) FCNs trained with a combined labeled set of in situ labels and crowdsourced labels were more accurate than FCNs trained with in situ labels on the same imagery.

The rest of this article is organized as follows. Section II-A details the study site, and Sections II-B through II-D describe the image collection, the in situ survey, and the class distribution. Section II-E describes the experimental setup and Section II-F describes the FCN model and parameter training. Sections III-A and III-B present the results of the interobserver experiment and FCN experiments, and Sections IV-A through IV-C present the analysis and discussion of the interobserver and FCN experiments. Finally, Section V concludes this article.

A. Study Site
The research focused on Budle Bay, Northumberland, U.K. (55.625°N, 1.745°W). Budle Bay is a large (c. 300 ha) estuarine embayment with a single tidal inlet [40], [41], [42]. Sinuous and dendritic tidal channels are present within the estuary, and bordering the channels are areas of seagrass and various species of macroalgae. The tidal range varies between 1 and 4 m for the majority of the year, and the estuary is fully drained on low spring tides.

B. Image Collection
Full details of the data collection can be found in [23]. Fig. 1 displays a VHR orthomosaic of Budle Bay created from the Cefas and EA RPA survey in September 2017 using Agisoft's MetaShape [43] and SfM. SfM techniques rely on estimating intrinsic and extrinsic camera parameters from overlapping imagery [44]. A combination of appropriate flight planning, in terms of altitude and aircraft speed, and the camera's field of view are important factors for producing good quality orthomosaics. For this work, a MicaSense RedEdge3 multispectral camera was used to capture the site. The camera consisted of five narrow-band filters for red (655-680 nm), green (540-580 nm), blue (459-490 nm), red edge (705-730 nm), and near-infrared (800-880 nm) channels at a ground sampling distance of approximately 8 cm.
The resulting VHR orthomosaic was orthorectified using GPS logs of camera positions and ground control markers spread out across the site. This process ensured that the mosaic was well aligned with respect to the real world and the ecological features present within the coastal site. The orthomosaic had 32 647 × 26 534 pixels in five image bands. For ease of processing, the orthomosaic was split into 24 nonoverlapping tiles of 6000 × 6000 pixel images, with each image containing geographic information for further processing.
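To make the tiling step concrete, the following is a minimal sketch of windowed tile extraction that preserves each tile's geographic metadata. The use of rasterio, the input file name, and the output naming scheme are assumptions for illustration; the paper does not state its tooling for this step.

```python
# Minimal sketch: split a georeferenced orthomosaic into nonoverlapping
# 6000 x 6000 pixel tiles, keeping per-tile geographic information.
# "budle_bay_ortho.tif" and the output names are hypothetical.
import rasterio
from rasterio.windows import Window, transform as window_transform

TILE = 6000

with rasterio.open("budle_bay_ortho.tif") as src:
    meta = src.meta.copy()
    for row_off in range(0, src.height, TILE):
        for col_off in range(0, src.width, TILE):
            win = Window(col_off, row_off,
                         min(TILE, src.width - col_off),
                         min(TILE, src.height - row_off))
            # Update width/height and the affine transform for this window
            meta.update(width=win.width, height=win.height,
                        transform=window_transform(win, src.transform))
            with rasterio.open(f"tile_{row_off}_{col_off}.tif", "w", **meta) as dst:
                dst.write(src.read(window=win))  # all five bands
```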

C. In Situ Survey and Class Domain
The accompanying ground survey identified 13 ecological classes grouped into background sediment, algae, seagrass, and saltmarsh.
Classes defining background sediment were rock, gravel, mud, and sand. The in situ measurements of unvegetated sediment were predominately in the presence of water and moisture. However, as parts of the orthomosaic included dry sand, an extra sediment class was added through photointerpretation (16 polygons). Two heuristics for delineating dry sand polygons were defined: first, the spectral reflectance of sand varies with the presence of surface moisture and presents higher reflectance intensity for patches of dry sand [45]. Therefore, polygons were delineated by examining bright unvegetated areas in Fig. 1. Second, each generated polygon was cross-checked with the topographic digital surface model (DSM) to ensure that the patches of dry sand only occur if the surface level was raised.
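The two heuristics translate naturally into a mask computation. The sketch below mirrors that logic under assumed inputs: the thresholds, band ordering, and function name are hypothetical, and the actual polygons in the paper were delineated manually rather than computed.

```python
# Sketch of the two dry-sand heuristics: high reflectance intensity (heuristic 1)
# and a raised surface in the DSM (heuristic 2). Thresholds are placeholders.
import numpy as np

def dry_sand_candidates(bands: np.ndarray, dsm: np.ndarray,
                        brightness_thresh: float,
                        elevation_thresh: float) -> np.ndarray:
    """bands: (5, H, W) reflectance, assumed ordered R, G, B, red edge, NIR;
    dsm: (H, W) surface elevation. Returns a boolean candidate mask."""
    brightness = bands[:3].mean(axis=0)        # mean visible-band reflectance
    bright = brightness > brightness_thresh    # heuristic 1: bright, dry-looking sand
    raised = dsm > elevation_thresh            # heuristic 2: raised surface level
    return bright & raised
```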
Algal classes include microphytobenthos, Enteromorpha sp., and other macroalgae (inc. Fucus sp.). Finally, the coastal vegetation classes were seagrass and saltmarsh. Thus, a total of seven classes were listed as follows.

1) Sediment: dry sand and other bareground (rock, gravel, mud, and wet sand).
2) Algae: microphytobenthos, Enteromorpha, and other macroalgae (including Fucus).
3) Seagrass: Zostera noltii and Zostera angustifolia merged into a single class.
4) Other plants: saltmarsh.

The in situ survey recorded 108 geographically referenced tags with the percentage cover of all listed ecological features within a 300-mm radius. The percentage cover was estimated in quadrat sampling fashion [46], [47]. For each in situ measurement, the class value with maximum percentage cover was chosen as the label.
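As a minimal illustration of this labeling rule, assuming a percentage-cover record keyed by class name (the record structure is hypothetical):

```python
# Reduce one in situ quadrat record to a single training label:
# the class with the maximum estimated percentage cover wins.
def label_from_cover(cover: dict[str, float]) -> str:
    return max(cover, key=cover.get)

print(label_from_cover({"seagrass": 60.0, "Enteromorpha": 25.0, "mud": 15.0}))
# -> "seagrass"
```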

D. Class Distribution
The class distribution of in situ measurements was not balanced, which may add cognitive bias and, consequently, skew results in human annotations for the experiment [48]. Recognizing biases during crowdsourced data collection efforts is an important step to countering the effect these may impose on model training and is an enabling factor for algorithmic fairness [49]. Therefore, a set of points from the in situ survey was combined with extra points added through expert photointerpretation (by the lead author) in order to balance the class distribution for the experimental setup. From the original set of 108 in situ points, a balanced set of 53 points was chosen (the remaining 55 in situ points were used for FCN performance evaluation, see Section III-B). Then, points added through photointerpretation were based on class-dependent heuristics.
First, no extra points for dry sand were added, as the set of photointerpreted polygons covered a substantial area, enough to generate points for both the experiment and FCN testing. Other bareground was a sediment class that comprised wet sediment features, such as wet sand and mud. Selected points presented a dark brown or gray color, rugged texture, and low elevation values relative to the rest of the site. Generally, added points were sampled within a close vicinity of known in situ records. However, this was not considered an important factor for other bareground points as long as color, texture, and elevation within a 300-mm (6 × 6 image patch) radius were consistent.
Vegetation classes were split into three sets: algae, seagrass, and saltmarsh. The geolocation of extra points for vegetation classes was always within the vicinity of known in situ points to establish a baseline for comparing color and texture. Saltmarsh points were found to be easily identifiable due to slight elevation changes in the DSM, but also because coastal saltmarsh occupies the interface between land and sea [50]. Therefore, saltmarsh points were mostly present on estuary borders. Identifying points for both species of intertidal seagrass was dependent on the following texture and color features: both species occur in mixed beds of waterlogged depressions between free-draining hummocks dominated by Zostera noltii, and presented sparse leaves with light yellow green or green color [51], [52], [53].
Microphytobenthos are microscopic organisms that inhabit the upper millimeters of illuminated wet sediments, typically appearing only as a subtle greenish shading [54]. Identifying extra points for microphytobenthos was only possible within very close vicinity of known in situ points, with color (greenish shading) used as the identifier. Extra points for Enteromorpha sp. had to present a bright green color, while other macroalgae (inc. Fucus), with a similar texture to Enteromorpha sp., presented a dark brownish color [55], [56]. Enteromorpha sp. and other macroalgae were spatially continuous compared with seagrass, which was more likely to be sparse. This further aided in distinguishing and picking extra points for these classes. While the vegetation species may be found in other circumstances (e.g., saltmarsh hummocks can grow amongst seagrass slightly away from estuary borders), our intent was to maximize our confidence that our selected points were classified correctly rather than to select across the range of possible appearances for each species. Overall, an extra 54 points were added through expert photointerpretation to maintain the class distribution balance. Therefore, the set of points to be annotated by each participant comprised 119 points, whereby 53 points were drawn from the in situ survey, an extra 54 were created through photointerpretation, and the remaining 12 points were randomly selected from dry sand polygons.

E. Experimental Setup
The goal of the experiment was to examine the variability in annotations from multiple participants with differing backgrounds in research and expertise with marine habitat mapping. Each participant was presented with a unique and random order of points to be annotated and a small set of labeled sample images representative of the vegetation classes to assist with identification. Figs. 2 and 3, respectively, display the set of labeled sample images presented to each participant and the user interface available to participants during the experiment. Participants used ArcMap 10.6.1 to visualize and annotate samples.
Each participant generated 119 annotations, with each annotation containing a semantic value corresponding to the class domain in Section II-D. The participant population was split into three groups based on their level of expertise to explore whether prior knowledge of the study site, research background, and/or previous experience with marine annotation could influence experimental results. The criteria separating each group were as follows.
1) Group A: Expert ecologist or geomorphologist, present at the in situ survey and/or with previous experience annotating marine biology for the study site.
2) Group B: Expert ecologist or geomorphologist, but not present at the in situ survey and/or without experience annotating marine biology for the study site.
3) Group C: Neither an expert ecologist nor a geomorphologist, and without experience annotating marine biology from aerial imagery.

Therefore, annotations were grouped into three sets based on the stated groupings; for the FCN experiments, each group's annotations were aggregated by majority vote, as sketched below.
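A minimal sketch of the majority-vote aggregation follows. The tie-breaking rule (first most-common class) is our own simplification, as the paper does not state how ties were resolved.

```python
# Majority vote across participants within one group.
from collections import Counter

def majority_vote(annotations: list[list[int]]) -> list[int]:
    """annotations[p][i] = class id given by participant p to point i.
    Ties resolve to the class seen first among the votes."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*annotations)]

group = [[0, 2, 3], [0, 2, 5], [1, 2, 3]]  # 3 participants, 3 points (toy data)
print(majority_vote(group))                 # -> [0, 2, 3]
```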
To evaluate the interobserver variability within each group, Cochran's Q-test was used to investigate the statistical significance of differences between K observations on the same n elements with binomial distribution [57], [58]. For this work, the K series of observations corresponded to participants within a group, and the elements for each observation were individual annotations of participants. Therefore, the null hypothesis was that annotations for participants within a group were drawn from one common dichotomous distribution, which would imply low variability in annotations. However, Cochran's Q-test requires that each annotation be dichotomous and represented as 0 or 1. Since the experimental annotation setup was a complex multiclass problem, each annotation was compared with the assigned label (either in situ or photointerpreted) and represented as 1 if correct; otherwise, the annotation was represented as 0.
Cochran's Q-test statistic with K − 1 degrees of freedom follows a χ² distribution and is given as

$$Q = \frac{K(K-1)\sum_{j=1}^{K}\left(C_j - \bar{C}\right)^{2}}{KS - \sum_{i=1}^{n} R_i^{2}} \qquad (1)$$

where $C_j$ is a column total, $R_i$ is a row total, $\bar{C}$ is the average column total, and $S$ is the total score, i.e., $S = \sum_i R_i = \sum_j C_j$.
In this context, a column total is the sum of correct annotations for a single participant, and a row total is the sum of correct annotations for a single point across all participants.
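A minimal sketch of (1) on a binary correctness matrix follows, including the 5% critical-value check used later in Section III-A. scipy's chi-squared quantile function is used, and the example matrix is random rather than the study's data.

```python
# Cochran's Q as in (1): rows are points, columns are participants,
# entries are 1 for a correct annotation and 0 otherwise.
import numpy as np
from scipy.stats import chi2

def cochrans_q(X: np.ndarray) -> tuple[float, float]:
    """X: (n_points, K_participants) binary matrix. Returns (Q, critical value)."""
    K = X.shape[1]
    C = X.sum(axis=0)                    # column totals (per participant)
    R = X.sum(axis=1)                    # row totals (per point)
    S = X.sum()                          # total score
    Q = K * (K - 1) * ((C - C.mean()) ** 2).sum() / (K * S - (R ** 2).sum())
    return Q, chi2.ppf(0.95, df=K - 1)   # 5% significance, K-1 degrees of freedom

X = np.random.default_rng(0).integers(0, 2, size=(119, 5))  # toy data, 5 annotators
Q, crit = cochrans_q(X)
print(f"Q = {Q:.2f}, reject H0: {Q > crit}")
```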

F. Fully Convolutional Neural Networks
CNNs have proven to surpass prior-art techniques in a large number of different CV applications since the introduction of AlexNet [10]. The shift from supervised traditional ML algorithms, whereby tailored feature extraction methods and classifier tuning are replaced with a joint optimization of both procedures, is an enabling factor for CNN success. The feature extraction process consists of repeated convolution and pooling operations that transform the input image into hierarchical abstract representations of data. The joint optimization is achieved by adjusting convolutional kernel weights and biases through the derivative chain rule that minimizes the error between network outputs and annotated labels [9].
FCNs [11], [13] are an adaptation of CNNs for semantic segmentation. The architecture of FCNs can be broken down into three parts: an encoder, a decoder, and a classification layer. The encoder network is a CNN without the final fully connected layer, the decoder network applies repeating upsample and convolution operations on feature maps created by the encoder network, and the classification layer consists of 1 × 1 convolution kernels and a softmax transfer function to produce per-pixel class probabilities. Fig. 4 displays the architecture used for this work. The overall architecture was a U-Net [11] and the encoder network a VGG-13 [60] pretrained on ImageNet. However, the weights in the input layer were randomly initialized and changed to handle a five-channel input image.
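A sketch of the encoder adaptation, assuming a recent torchvision; only the backbone swap is shown, and the U-Net skip connections and decoder wiring are omitted.

```python
# VGG-13 pretrained on ImageNet, with the first convolution re-initialized
# to accept five input channels (R, G, B, red edge, NIR).
import torch
import torch.nn as nn
from torchvision.models import vgg13

encoder = vgg13(weights="IMAGENET1K_V1").features          # convolutional backbone
encoder[0] = nn.Conv2d(5, 64, kernel_size=3, padding=1)    # random init, 5 bands

x = torch.randn(1, 5, 256, 256)
print(encoder(x).shape)   # torch.Size([1, 512, 8, 8]) after five max-pool stages
```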
1) Data Preprocessing and Training Parameters: FCNs were trained with segmentation maps that contain a one-to-one mapping of pixels encoded with a semantic value, with the goal to optimize this mapping [13]. Segmentation maps were generated using the geographic coordinates stored at each point and converting real-world coordinates to image coordinates. If one or more points resided within an image tile, then the candidate image was sampled into 256 × 256 image blocks centered on the labeled parts of the image. For each point, a bounding box corresponding to 300 mm was placed. Fig. 5 shows a gallery of sample imagery used for training FCNs.
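A sketch of the map generation, assuming rasterio for the coordinate conversion and the roughly 8-cm ground sampling distance from Section II-B; the tile name and point list are hypothetical.

```python
# Convert each point's real-world coordinates to pixel indices, stamp a box of
# roughly 300 mm around it with the class id, and crop 256 x 256 blocks.
import numpy as np
import rasterio

IGNORE, GSD, BLOCK = 255, 0.08, 256
half = int(round(0.3 / GSD))   # ~300 mm radius in pixels at ~8 cm/pixel

points_in_tile = [(412350.0, 625410.0, 3)]   # (easting, northing, class id), hypothetical

with rasterio.open("tile_0_0.tif") as src:
    image = src.read()                                    # (5, H, W)
    mask = np.full((src.height, src.width), IGNORE, dtype=np.uint8)
    for x_geo, y_geo, class_id in points_in_tile:
        r, c = src.index(x_geo, y_geo)                    # real-world -> image coords
        mask[max(0, r - half):r + half + 1,
             max(0, c - half):c + half + 1] = class_id

def block_around(r: int, c: int):
    """Crop a BLOCK x BLOCK window centred on a labeled point, clipped to bounds."""
    r0 = int(np.clip(r - BLOCK // 2, 0, mask.shape[0] - BLOCK))
    c0 = int(np.clip(c - BLOCK // 2, 0, mask.shape[1] - BLOCK))
    return image[:, r0:r0 + BLOCK, c0:c0 + BLOCK], mask[r0:r0 + BLOCK, c0:c0 + BLOCK]
```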
The loss was computed by processing a minibatch of images with the FCN, which results in per-pixel probabilities $P \in \mathbb{R}^{B \times K \times H \times W}$, and comparing network outputs with the corresponding annotated maps $Y \in \mathbb{Z}^{B \times H \times W}$, where B, K, H, and W are, respectively, the batch size, number of target classes, height, and width of the image. Then, the negative log-likelihood loss was calculated between segmentation maps and network probabilities

$$\ell(x) = -\log P_{Y(x)}(x) \qquad (2)$$

where $x \in \Omega$, $\Omega \subseteq \mathbb{Z}^2$, is a pixel location and $P_k(x)$ is the probability for the $k$th channel at pixel location $x$, with $\sum_{k=1}^{K} P_k(x) = 1$. For each image, the loss was the sum of all individual pixel losses using (2), averaged according to the number of labeled pixels within $Y$. Previous work on the same study site uses semisupervision methods to improve the generalization and performance of FCNs [23]. However, the use of an unsupervised loss term would influence the analysis of our experimental setup by allowing networks to adjust weights based on nonlabeled parts of the image, whereas our goal was to determine the effects of aggregated crowdsourced labels.
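In PyTorch terms, (2) averaged over labeled pixels only is a masked cross-entropy; a minimal sketch, assuming unlabeled pixels carry a reserved ignore index:

```python
# Cross-entropy evaluated only at labeled pixels; unlabeled pixels (IGNORE)
# contribute no loss and no gradient, and the mean is over labeled pixels.
import torch
import torch.nn.functional as F

IGNORE = 255

def masked_nll(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """logits: (B, K, H, W) raw scores; target: (B, H, W) class ids or IGNORE."""
    # log_softmax + nll_loss is the standard composition of softmax and (2)
    return F.nll_loss(F.log_softmax(logits, dim=1), target,
                      ignore_index=IGNORE, reduction="mean")

target = torch.randint(0, 7, (12, 256, 256))
target[:, :100] = IGNORE                       # unlabeled region of each map
loss = masked_nll(torch.randn(12, 7, 256, 256), target)
```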
During training, each image was augmented with stochastic transformations consisting of rotations up to 25° and horizontal or vertical flips. Each network was trained for 200 epochs with a batch size of 12 using the Adam optimizer. The optimizer learning rate was constant and set to 0.001. All FCNs were implemented and trained using PyTorch version 10.2.
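A sketch of the stated configuration follows; `model` and `train_dataset` are assumed to exist (e.g., the adapted U-Net above and a dataset yielding image-mask pairs), and `masked_nll` is the loss sketch above.

```python
# Stochastic augmentation (rotations up to 25 degrees, random flips) and the
# stated optimizer settings: Adam at a constant 1e-3, 200 epochs, batch size 12.
import random
import torch
from torch.utils.data import DataLoader
import torchvision.transforms.functional as TF

def augment(image: torch.Tensor, mask: torch.Tensor):
    """image: (5, H, W) float; mask: (H, W) long. Rotation pads the mask with IGNORE."""
    angle = random.uniform(-25.0, 25.0)
    image = TF.rotate(image, angle)   # nearest-neighbour by default for tensors
    mask = TF.rotate(mask[None].float(), angle, fill=float(IGNORE))[0].long()
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    if random.random() < 0.5:
        image, mask = TF.vflip(image), TF.vflip(mask)
    return image, mask

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # constant learning rate
loader = DataLoader(train_dataset, batch_size=12, shuffle=True)

for epoch in range(200):
    for images, masks in loader:
        optimizer.zero_grad()
        loss = masked_nll(model(images), masks)
        loss.backward()
        optimizer.step()
```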

A. Interobserver Experimental Results
Table I and Fig. 6 give the results of our experiment. The significance level for each control group was set to 5%, and the degrees of freedom were set according to the number of participants within a particular group. Therefore, the critical values according to a χ² distribution were 9.49, 12.59, and 7.81 for control groups A, B, and C, respectively. The test statistic described in (1) objectively evaluates the statistical significance of differences between K observations on the same n elements with a binomial distribution. By comparing each annotation with the known in situ label and representing correct annotations as 1 and incorrect as 0, Cochran's Q-test evaluates whether annotations, which can be correct or incorrect, were drawn from the same binomial distribution. Therefore, the test statistic for a group may not allow us to reject the null hypothesis, which would imply low interobserver variability, but participants within that group could still collectively annotate test points incorrectly. In fact, participants were more likely to be collectively incorrect than correct due to different incorrect annotations being represented as 0. For example, if the class label for a given point was dry sand but participants annotated the said point as other bareground and microphytobenthos, then both annotations were represented as 0, which would contribute to a smaller test statistic value. Hence, the test statistic was analyzed along with the annotation accuracy metrics so that emphasis was placed on groups that were collectively correct and also yielded a test statistic that did not reject the null hypothesis.

Fig. 4. U-Net architecture and loss calculation. The input channels were stacked and passed through the network. The encoder network applies repeated convolution and max pooling operations to extract feature maps, while the decoder network upsamples these and stacks features from the corresponding layer in the encoder path. The output is a segmented map, which was compared with the ground-truth mask using cross-entropy loss. The computed loss was used to train the network through gradient descent optimization.

TABLE I: PARTICIPANT ANNOTATION ACCURACY AND COCHRAN'S Q-TEST STATISTIC RESULTS

B. FCNs' Results
The metrics to quantify FCN performance were pixel accuracy, precision, recall, and F1-score. Pixel accuracy is the ratio between pixels that were classified correctly and the total number of labeled pixels in the test set for a given class. Equation (3) describes each metric, where TP, TN, FP, and FN are, respectively, the true positive, true negative, false positive, and false negative pixel classifications:

$$\text{pixel accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \quad \text{precision} = \frac{TP}{TP + FP}, \quad \text{recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \qquad (3)$$
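A sketch of (3) computed per class from a confusion matrix; the matrix layout (rows true, columns predicted) is an assumption for illustration.

```python
# Per-class metrics from a confusion matrix M, where M[i, j] counts pixels of
# true class i predicted as class j. Assumes class k occurs in the test set.
import numpy as np

def per_class_metrics(M: np.ndarray, k: int) -> dict[str, float]:
    TP = M[k, k]
    FP = M[:, k].sum() - TP
    FN = M[k, :].sum() - TP
    TN = M.sum() - TP - FP - FN
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    return {"pixel_accuracy": (TP + TN) / M.sum(),
            "precision": precision,
            "recall": recall,
            "f1": 2 * precision * recall / (precision + recall)}
```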
Our evaluation consisted of two different tests. The first test shows the effects of training several FCNs on different versions of labeled data based on majority-vote annotations from each group. This test evaluated whether errors in the annotation experiment were propagated to the FCN performance. For training the FCNs, we used the same points as in the interobserver variability experiment: a set of 53 randomly selected points from the in situ survey, an additional 54 points chosen through expert photointerpretation, and 12 points from dry sand polygons. The remaining 55 points recorded in situ, together with a further 12 points from dry sand polygons, were reserved for model testing. Therefore, FCNs were trained on the combined set of 119 points and the remaining 67 points comprised the test set. For our second test, the combined training set was reduced to the same initial set of 53 randomly selected in situ points, and the remaining 66 labels (54 from photointerpretation plus 12 points from dry sand polygons) were replaced with majority-vote annotations from each group. The goal of the second experiment was to determine whether supplementing a reduced training set with majority-vote annotations still achieves comparable results to models trained with in situ labels.

Fig. 7 shows the results of our first experiment, and Table II provides further insight into the class-specific performance of FCNs trained with in situ data versus FCNs trained with majority-vote annotations from group A. Fig. 8 shows the results of training FCNs on a reduced dataset of in situ labels versus FCNs trained on a combined training set of in situ labels and majority-vote annotations. The confusion matrices and tabulated metrics contain the average results of five sequential train and test runs.

IV. DISCUSSION

A. Interobserver Experimental Analysis
From our results, the null hypothesis that participant annotations were drawn from the same distribution was not rejected only in group A. Moreover, group A also exhibited the highest mean and lowest variance in annotation accuracy, at 72.43 ± 3.106%, which showed that participants in group A were more likely to be correct than the other two groups. The pre-exposure of participants in group A to the target classes at the study site justified the lowest test statistic for participant annotations within this particular group. Furthermore, the latter statement can also be supported by examining the majority-vote confusion matrix for group A (top-left matrix in Fig. 6), where the accuracy of the majority-vote annotations was 81.31% for group A, higher than the highest accuracy of any participant in the experiment. This illustrates that annotations for participants in group A were better if performed collectively and that group A, as a whole, comprised good candidates for crowdsourcing labels for this particular study site. Given the low variability in annotations for group A, examining Fig. 6 also informed us about the problematic classes to annotate from aerial imagery. Other bareground was a sediment class composed of rock, mud, and wet sand, and microphytobenthos typically appeared only as a subtle greenish shading on wet sediment [54], which could justify why both classes were mutually misannotated. The same reasoning can be applied to annotations for Enteromorpha sp. and seagrass, since both classes exhibit similar color and texture from an aerial point of view.
The null hypothesis for participants in group B was rejected by a significant margin. This could be due to the following: first, participants in this group were not familiar with annotating aerial imagery for this study site. In IR crowdsourcing, this is also known as the ambiguity effect, whereby missing information makes annotations appear more difficult and, consequently, less attractive [59]. Alternatively, the participant population contained experts from different disciplines who may have conflicting biases during annotation. If participants do not agree with each other, then the test statistic yields a high value based on whether annotations were correct or not. Specifically, the second highest overall annotation accuracy was from participant 9, while the lowest accuracy was from participant 6, both of whom belong to group B. In fact, participant 9 is a benthic ecologist with specific knowledge of identifying intertidal algae, while participant 6 is an expert in sedimentology. This contrast in discipline is reflected in annotation and, subsequently, in the test statistic due to correct or incorrect annotations on the same test points. The average accuracy, 57.50 ± 18.16%, was lower than in group A, and the majority-vote confusion matrix paints a similar picture: high variability and feature ambiguity led to erroneous labeling, with an overall normalized majority-vote accuracy of 64.41% (middle-right matrix in Fig. 6).
For participants in the final group C, the null hypothesis was also rejected, although by a smaller margin than group B. Again, this implies that participants within this group exhibit high interobserver variability. Both the average accuracy and majority-vote accuracy, 53.5 ± 10.82% and 60.75% (bottom-left matrix in Fig. 6), were the lowest of all groups, which also reflected low confidence in participant annotations. However, even with lower accuracy, participants within group C showed less variability in correct/incorrect annotations than group B participants. This could be due to participants in group C not having any prior knowledge of the study site or of annotating aerial imagery, and associating similar color and texture based on the sample images in Fig. 2 to the same class. The confusion matrix for group C provides insights into problematic target classes to annotate for subjects with the least experience. Algae classes, e.g., Enteromorpha sp. and other macroalgae, were often mutually mislabeled, while seagrass was often annotated as Enteromorpha sp. This implies that vegetation classes were hard to discern from an aerial point of view with no prior knowledge. Furthermore, and similarly to group A, other bareground, a sediment class that includes wet sand, was also incorrectly annotated as microphytobenthos, which again implies that these two classes are hard to discern from each other.
To sum up, this analysis covers three groups and assesses the interobserver variability of participants with different backgrounds and expertise, while also assessing the accuracy of each participant, the average group accuracy, and the majority-vote accuracy. Participants in group A showed low interobserver variability while also correctly annotating 81.31% of the points collectively. Participants in groups B and C exhibited high interobserver variability. Examining the criteria separating each group, discipline expertise, prior knowledge of the site, and/or previous experience annotating marine biology play an important role in minimizing interobserver variability and ensuring accurate annotation, and lack of exposure to these criteria leads to high variability and low confidence. However, our results also suggested that an expert ecologist or geomorphologist without in situ exposure produced annotations of similar overall accuracy to nonexperts; this was influenced by the individual accuracy result of participant 6, since the majority of participants within group B yielded a higher annotation accuracy than two of the three participants in group C. Finally, aggregating labels based on majority-vote annotations also draws parallels with field-of-experts frameworks in low-level image processing and ensemble learning [31], [32], [62], [63]. These frameworks model high-dimensional probability distributions by taking the product of several expert distributions, where each expert works on a low-dimensional subspace that is relatively easy to model. A similar effect held for aggregated annotations in all groups: aggregating labels increased accuracy scores by 8.88%, 6.91%, and 7.25%, respectively, for groups A, B, and C. This alludes to the specific and complementary nature of different research backgrounds aiding accurate annotation.

B. FCNs With Different Versions of Labeled Data
The first test in our evaluation considered four FCNs trained with different versions of the labeled data.
First, FCNs trained with in situ labels (top-left matrix in Fig. 7) were viewed as the baseline for the remaining FCNs trained on majority-vote annotations from each group. The normalized accuracy with in situ labels was 87.79%, and models exhibited high confidence and accurate predictions for dry sand, other macroalgae, seagrass, and saltmarsh. Other bareground proved to be a problematic class to model, with a majority of predictions confused with microphytobenthos and Enteromorpha sp. This paints a similar picture to majority-vote annotations for participants in group A (top-left matrix in Fig. 6), whereby microphytobenthos was mislabeled as other bareground. However, FCNs did not mutually mislabel seagrass with Enteromorpha sp., which implies that FCNs were better at discerning these two specific vegetation classes than participants from group A.
The normalized accuracy for FCNs trained with majority-vote annotations from participants in group A was 81.99% (top-right matrix in Fig. 7). This particular group exhibited low interobserver variability and accurate annotations, with the exception of microphytobenthos and other bareground, which may be due to both classes being present in wet sand. Furthermore, Enteromorpha sp. was mutually mislabeled with seagrass because both classes showed similar color and texture from an aerial point of view. The latter bias in annotations from participants in group A was propagated to FCN performance, where 23.3% of seagrass labels were predicted as Enteromorpha sp. (top right in Fig. 7). However, examining Enteromorpha sp. predictions showed that this particular class was over-represented due to erroneous predictions and confusion with other vegetation classes, such as saltmarsh, seagrass, and other macroalgae. Therefore, erroneous labels from participants in group A caused FCNs not only to mutually mislabel Enteromorpha sp. with seagrass but also resulted in cascading errors for other vegetation classes due to overfitting for Enteromorpha sp. Similarly to previous work using aerial imagery for annotation, this test also showed that empirical models can compensate for certain degrees of erroneous human annotation [19], [28].
FCNs trained with majority-vote annotations from participants in group B yielded a normalized accuracy of 63.72% (bottom-left matrix in Fig. 7). Annotations from participants in group B exhibited high interobserver variability, resulting in low confidence in majority-vote annotations. This was due to conflicting biases between experts, i.e., ecologists, geomorphologists, and sedimentologists, and the ambiguity effect through lack of exposure to the in situ survey or aerial annotation of marine vegetation species from the study site. The main trends in human annotations from this group were other bareground mislabeled as dry sand, and a general confusion of vegetation classes among Enteromorpha sp., other macroalgae, and seagrass. These errors were also propagated into FCN performance, as 64.1% of other bareground predictions were mislabeled as dry sand, and seagrass was severely misclassified and predicted as Enteromorpha sp. and other macroalgae, respectively, 60.4% and 35.6% (bottom-left matrix in Fig. 7).
The final set of majority-vote labels, from group C, yielded a normalized accuracy of 66.36% (bottom-right matrix in Fig. 7). Even though the average and majority-vote accuracies for annotations provided by group C were lower than the results yielded by group B, FCNs trained with majority-vote annotations from subjects in group C yielded a higher test set accuracy than FCNs trained with majority-vote annotations from group B. Our experiment showed that participants in group C presented high interobserver variability, but by less of a margin than group B (see Table I in Section III-A). The analysis also showed that nonexpert participants in group C exhibited low confidence annotations for other bareground, with 31.8% of points labeled as microphytobenthos (bottom-left matrix in Fig. 6). Similarly to participants in group B, they exhibited a general confusion in annotations for vegetation classes; in particular, seagrass and Enteromorpha sp. were often mutually misannotated. Again, these errors in human annotations were propagated to FCN errors, e.g., mutual misclassifications for the seagrass and Enteromorpha sp. classes.
Our analysis supports the hypothesis that errors in crowdsourced human annotation were propagated into the FCN performance. All groups had a similar trend whereby annotations for microphytobenthos were mislabeled with wet sediment classes. This bias was propagated into all models trained with majority-vote annotations, where other bareground was either under-represented (bottom-left matrix in Fig. 7), over-represented (bottom-right matrix in Fig. 7), or confused with dry sand (top-right matrix in Fig. 7). The mutual mislabeling of Enteromorpha sp. and seagrass points for participants in group A caused the FCN to misclassify all vegetation classes as Enteromorpha sp. This showed that poor annotations not only propagated errors into the FCN performance but also could cause cascading errors with classes that exhibit similar color and texture from an aerial point of view. This stresses the need for good quality labels, as FCNs optimize their weights and biases based on a nonlinear one-to-one mapping between image pixels and labeled maps [13]. However, our results also showed that FCNs trained with low interobserver variability and high confidence annotations, as shown with subjects in group A, can demonstrate comparable performance to the FCNs trained with in situ labels. Conversely, training with annotations from groups B or C, which manifested high interobserver variability and higher rates of erroneous labeling, severely degraded FCN performance.

C. FCNs With Balanced In Situ Only Versus Crowdsourced Supplemented Labeled Data
The second and final test in our evaluation considered two training configurations. In the first, a model was trained with only the balanced in situ labels, i.e., the initial balanced set of 53 random points with in situ labels (see Section II-D). In the second, the training set was supplemented with the remaining 66 photointerpreted points, whose labels were replaced with the semantic value of majority-vote crowdsourced annotations.
For comparison, we considered an FCN trained with just the balanced set of 53 in situ labels, which yielded a normalized test set accuracy of 82.9% (top-left matrix in Fig. 8). The accuracy was lower than that of FCNs trained with the combined full training set of 53 in situ labels and 66 photointerpreted labels (top left in Fig. 7). This was expected, as FCNs learn hierarchical representations of data through gradient descent [9], and if FCN kernel weight and bias adjustments were based on fewer image examples, then model performance and generalization also degrade. The main affected and under-represented class was seagrass, where the accuracy dropped from 99.5% (top-left matrix in Fig. 7) to 43.6% (top-left matrix in Fig. 8).
The normalized accuracy for FCNs trained with the in situ set supplemented with the labels from the participants in group A was 89.6% (top-right matrix in Fig. 8), which was also the highest accuracy of all FCNs in our analysis. This setting improved the test set accuracy compared with the model trained with just in situ labels. This was due to two reasons: first, supplementing the dataset allows for more unique samples to be incorporated into the training set, and second, the supplemented crowdsourced portion of the training set from group A exhibited low interobserver variability and accurate annotations. Furthermore, this particular result provided an interesting comparison with the FCN trained on in situ plus photointerpreted labels (top-left matrix in Fig. 7). Both FCNs yielded satisfactory results, which confirms that the aggregated labels from multiple annotators within group A were as good as the efforts of a single expert annotator (the lead author). This comparison also showed that in situ efforts can be combined successfully with aerial imagery annotation, which could reduce the costs and labor of in situ surveys.
The accuracy for FCNs trained using in situ labels supplemented with the labels from participants in groups B and C was, respectively, 73.34% and 68.7% (bottom-left and bottom-right matrices in Fig. 8). Our analysis of both datasets was performed jointly, as FCNs trained in both settings paint a similar picture. Both sets of models failed to achieve better results than models trained with just the balanced set of in situ labels (top left in Fig. 8), which again stresses the need for good quality crowdsourced labels. FCNs trained with majority-vote annotations from participants in group B over-represented seagrass and also misclassified all other macroalgae pixels, mostly as seagrass (bottom-left matrix in Fig. 8). A similar outcome occurred for models supplemented with the labels provided by group C: again, all other macroalgae class instances were misclassified, this time mostly as saltmarsh (bottom-right matrix in Fig. 8). In both settings, this would be due to the poor annotation performance of these two groups (see Fig. 6).

V. CONCLUSION
This work analyzed the feasibility of using crowdsourced annotations on a complex multiclass problem domain that includes intertidal coastal species, such as seagrass, saltmarsh, and macroalgae.
To assess the quality of crowdsourced annotations, an interobserver variability experiment was performed with a population of 12 participants that were split into 3 groups. The criteria for each group were based on discipline expertise and previous experience with either annotating aerial imagery for this study site or marine biology in general. The assessment was made possible by analyzing the statistical differences in crowdsourced annotations using Cochran's Q-test. Furthermore, the annotation accuracy and a per-class analysis were used to assess any potential observation biases.
The results of our experiment show that discipline experts familiar with the study site were more accurate than experts with no prior knowledge of the site and nonexperts. This confirms that discipline expertise, prior knowledge of the site, and/or previous experience annotating marine biology play an important role in minimizing interobserver variability and ensuring accurate annotation, and that lack of exposure to these criteria leads to high variability and low confidence. Furthermore, the results of our analysis also point to only a small performance gain for annotators with expert discipline knowledge over annotators with no previous experience in marine biology annotation or domain expertise. However, this may be skewed by the annotations from participant 6.
The experiment stressed the difficulty of labeling a complex multiclass marine biology problem. We therefore conclude that pre-exposure to the study site is important for intertidal classification if good quality labels are to be guaranteed, and that in situ ground truthing may be unavoidable to resolve classes that even site experts confuse, for instance, the general confusion between microphytobenthos and other bareground and between Enteromorpha sp. and seagrass (see Sections III-A and IV-B and IV-C).
For the experiment with FCNs trained with crowdsourced annotations, two scenarios were considered. The first was a direct comparison of FCNs trained with majority-vote crowdsourced annotations from each participant group against FCNs trained with transcribed in situ labels. This showed that annotations exhibiting low interobserver variability and high confidence, as shown with subjects in group A, yield FCN performance comparable to FCNs trained with in situ labels. Conversely, training with annotations from groups B or C, which manifested high interobserver variability and higher rates of erroneous labeling, severely degraded FCN performance. Therefore, we conclude that errors in crowdsourced human annotations were propagated into FCN performance. The second experiment considered two FCNs: one whereby the training set was the initial balanced set of 53 points with transcribed in situ labels (see Section II-D), and the other where the training set was the initial set of 53 points with in situ labels supplemented with majority-vote annotations from each participant group. In this scenario, FCNs supplemented with majority-vote annotations from participant group A reported a normalized accuracy of 89.6%, which was also the highest accuracy of all FCNs in our analysis. This showed that in situ efforts can be combined successfully with crowdsourced aerial imagery annotation, which could reduce the costs and labor of in situ surveys, given that crowdsourced labels are consistent and accurate. Similarly to the previous scenario, FCNs supplemented with majority-vote annotations from participant groups B and C were severely degraded, which again stresses the need for good quality crowdsourced labels.
However, this work does not advocate fully excluding in situ surveying; rather, it affirms that good quality labels can be collected in situ, while a healthy quantity of additional labels can be supplemented from aerial imagery, which would reduce in situ efforts and costs.

Fig. 1. Distribution of tags recorded during the in situ survey (left) and the full set of points to be annotated, comprising the in situ points plus those determined using expert photointerpretation (right).

Fig. 2. Sample images representative of vegetation classes used in the analyses.

Fig. 3. User interface for providing participant annotations during the experiment.

Fig. 6. Confusion matrices for the majority-vote annotations for each control group.


Fig. 7. Confusion matrices for FCN models trained using different versions of labeled data. Results for models trained on in situ labels (top left) and majority-vote annotations for group A (top right), group B (bottom left), and group C (bottom right).

Fig. 8. Confusion matrices for FCN models trained using a set of in situ labels (top left) and using the same in situ set supplemented with majority-vote annotations for groups A, B, and C (top right, bottom left, and bottom right).

TABLE II: PRECISION, RECALL, AND F1-SCORES FOR MODELS TRAINED WITH IN SITU LABELS AND FOR MODELS TRAINED WITH MAJORITY-VOTE ANNOTATIONS FROM GROUP A