A Framework to Assess Remote Sensing Algorithms for Satellite-Based Flood Index Insurance

Remotely sensed data have the potential to monitor natural hazards and their consequences on socioeconomic systems. However, in much of the world, inadequate validation data of disaster damage make reliable use of satellite data difficult. We attempt to strengthen the use of satellite data for one application—flood index insurance—which has the potential to manage the largely uninsured losses from floods. Flood index insurance is a particularly challenging application of remote sensing due to floods’ speed, unpredictability, and the significant data validation required. We propose a set of criteria for assessing remote sensing flood index insurance algorithm performance and provide a framework for remote sensing application validation in data-poor environments. Within these criteria, we assess several validation metrics—spatial accuracy compared to high-resolution PlanetScope imagery (F1), temporal consistency as compared to river water levels (Spearman's ρ), and correlation to government damage data (R2)—that measure index performance. With these criteria, we develop a Sentinel-1 flood inundation time series in Bangladesh at high spatial (10 m) and temporal (∼weekly) resolution and compare it to a previous Sentinel-1 algorithm and a Moderate Resolution Imaging Spectroradiometer (MODIS) time series used in flood index insurance. Results show that the adapted Sentinel-1 algorithm (F1avg = 0.925, ρavg = 0.752, R2 = 0.43) significantly outperforms previous Sentinel-1 and MODIS algorithms on the validation criteria. Beyond Bangladesh, our proposed validation criteria can be used to develop and validate better remote sensing products for index insurance and other flood applications in places with inadequate ground truth damage data.


I. INTRODUCTION
Remotely sensed data have the potential to monitor natural hazards, such as floods, and their consequences on socioeconomic systems. Floods cause more damage than any other disaster and accounted for 60% of the damages in agricultural production in Asia from 1960 to 2015 [2]. Flood exposure and damages are increasing due to worsening rainfall and storms, urbanization and development, and land subsidence [3], [4], [5], [6]. Despite the known high costs of floods, 92% of disaster costs are not covered by international aid in 77 of the poorest countries, and only 3% of this unmet cost is absorbed by insurance [7]. The intersection of an insurance gap and growing flood risk pushes people into poverty and causes setbacks to development as government budgets are stretched, people without financial protection are forced to sell assets, and historically successful adaptation strategies are undermined [8], [9], [10]. Expanding insurance coverage could reduce anticipated losses from floods and increase resilience [8], [11], [12], [13], [14]. Due to high costs and insufficient data, flood insurance penetration remains low (<1%) for climate-vulnerable populations in countries like Bangladesh [15]. Despite opportunities satellites provide to monitor floods by providing global data at regular intervals [16], applications and studies of remote sensing for flood insurance are scant [17], [18], [19] and often based on rainfall instead of inundated area [20]. Remotely sensed data for socioeconomic applications should be corroborated with ground-truth data [21], yet little publicly available spatially explicit data on flood damages exist at a high spatial resolution (e.g., household, farm, or neighborhood scales).

A. Index Insurance for Flood Resilience
Index insurance (or parametric insurance) is a novel type of insurance that is increasingly being advanced in data-poor environments due to its scalability and low cost. Index insurance contracts use observational data of a hazard (e.g., station or satellite-based rainfall, temperature, wind, or flood measurements) to assess hazard frequency and magnitude over time and create a damage index of that hazard. Thresholds are set on this index at various exceedance probabilities to trigger predetermined payouts based on expected financial losses. This methodology reduces transaction costs compared to traditional insurance, which requires individual assessment of each household, and thereby enables affordable premiums accessible to uninsured and low-income populations. As compared with drought [22], [23], [24], hurricane [25], and rainfall hazards [20], index-based flood insurance has been difficult to implement because floods are often fast-moving, often occur due to infrastructure failures not captured by river gauge measurements or traditional flood models, and occur at specific places (e.g., two sides of the same road could experience different flood impacts) [26]. Therefore, accurate estimates of flood damage for index insurance require high-resolution (approximately 3-20 m) spatial data with frequent observations (at least weekly) [27].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

B. Validating Satellite Algorithms for Insurance
A primary contribution of this article is to propose and operationalize a set of metrics and criteria for remote sensing scientists to develop and assess algorithms that quantify environmental hazards (e.g., floods, drought, and wind) suitable for index insurance. Assessing the value and quality of a data source to represent a hazard for index-based insurance is essential for the financial sustainability of the product (for the insurer) and to deliver on promised benefits for flood resilience (for the insured). A poor correlation between the index and insured losses is known as basis risk. Basis risk can lead to policyholders not receiving payouts, resulting in low demand for purchasing the insurance product and increasing premium prices, which could further exacerbate low insurance demand. Basis risk can result from inadequate length of data, poor design of an index that was not validated by loss and damage data [24], inadequate index measurement of the spatial variance of hazard [28], improper selection of a geographic scale [29], and others (see [14] for a more in-depth review). These sources of basis risk and data uncertainty are not typically considered in the development and validation of remote sensing algorithms for flood detection. Despite the growing reliance on satellites for agricultural insurance [14], most remote sensing flood algorithms are only validated for several hundred points [30] or small area contingency maps for specific flood events [31], [32], which is insufficient to minimize most sources of basis risk.
To minimize basis risk in agricultural index insurance, Benami et al. [14] suggested that indices fulfill at least four criteria: i) the index must be externally observable and trustworthy; ii) payments must be timely; iii) there exists a historical time series of at least 15 years to calculate an index; and iv) the index time series must match on-the-ground losses experienced by farmers. However, satellite flood detection algorithms are more complex than many other indices used for agricultural index insurance (e.g., rainfall and temperature) and thus require a more significant transformation from observed data (surface roughness and spectral reflectance) to index data (flood extent). With a flood index, data can be considered trustworthy if the data process is transparent, clear, and untampered with, as outlined for generalized insurance in [14], and if the algorithm used to calculate the index from observational data is accurate over time. We propose two additional criteria for flood index insurance algorithms: v) the index must be sufficiently spatially accurate such that the correct farmers (or districts, depending on the spatial unit of the contract) receive payouts; and vi) the satellite index is a temporally consistent measure. All public satellite data algorithms explored in this study are externally observable and trustworthy in that the data have not been tampered with (criterion i), and any automated and predetermined satellite algorithm has the potential for near real-time measurement (criterion ii). Differences in criteria iii-vi (historical time series, correlation with damage, spatial accuracy, and temporal consistency) comprise this study's examination of Sentinel-1's feasibility for flood index insurance in comparison to the Moderate Resolution Imaging Spectroradiometer (MODIS) satellite in Bangladesh.

C. Flood Detection With Sentinel-1
The optical satellites that are currently in use for flood index insurance contracts can detect inundated areas for large events by delineating water visible from space [34], [35], [36], [37], [38], [3] and are used for routine observations by NASA [39]. However, these satellites are susceptible to cloud cover, which is frequent in Bangladesh, and often image at comparatively low spatial resolution (∼500 m). Synthetic aperture radar (SAR) data, such as from Sentinel-1, provide an important alternative for flood detection due to their high resolution and ability to detect inundation under cloud cover [40]. Water can be identified by thresholding low backscatter values from active radar bands (in VV, HH, VH, and HV bands) on a single image [41], [42], [43], the difference in backscatter between two images [44], [45], or the variance of backscatter in a time series [30], [31]. Machine learning approaches, such as random forest [46], and deep learning approaches, such as convolutional neural networks, can also identify inundation in Sentinel-1 [47], [48], [49]. While these deep learning algorithms can yield additional gains in accuracy, the engineering expertise and the computational (and thus financial) cost of running and deploying them make them less desirable for insurance companies in Bangladesh, which today have low technical capacity in remote sensing. There are dozens of Sentinel-1 water detection algorithms available, but there are no globally agreed upon benchmark data that make it easy to discern which is most accurate or appropriate for one's use case.
Several approaches for Sentinel-1 flood detection have been optimized and tested in Bangladesh to generate a monthly time series. Singha et al. [50] used two multitemporal algorithms in Google Earth Engine (GEE) to generate high-confidence flood maps to identify monthly inundation in rice paddies, validating accuracy with Sentinel-2 flood maps. Ahmad et al. [51] used a fusion of Landsat 8, Sentinel-2, and Sentinel-1 to generate monthly inundated area maps, validated with PlanetScope imagery in three small wetlands. Uddin et al. [52] created monthly inundated area maps using an object-oriented approach by converting Sentinel-1 images into RGB imagery, clustering and identifying inundated objects, and comparing the result to one Landsat 8 image.
However, none of these previous approaches can be used to design index insurance, which requires 1) a consistent time series on a roughly weekly (instead of a monthly) time step to provide the quick payout farmers need [14] and 2) a fully automated algorithm such that an insurance policy could be preemptively written and held constant over many years without the need for additional parameter adjustment or image selection. We chose to select a previous algorithm with easily accessible code linked to the paper, and which was free and easy to run (e.g., not a deep learning algorithm requiring GPUs) in our application context.
We take this gap in Sentinel-1 algorithms and the need for robust validation methods in data-poor environments as an opportunity to 1) adapt an existing Sentinel-1 flood algorithm to serve as an effective flood index for insurance in Bangladesh and 2) validate and compare the proposed algorithm against an existing Sentinel-1 flood algorithm and a MODIS flood algorithm according to the six criteria for an effective flood index developed above. Two criteria are fulfilled in any validated and public remote sensing algorithm (externally observable and trustworthy, timely payments). One criterion is fulfilled by MODIS data, but not Sentinel-1 (historical data time series longer than 15 years). However, there are three criteria (correlation to damage reports, spatial accuracy, and temporal consistency) where the performance of the three algorithms is unknown. Given the limitations in previous Sentinel-1 flood algorithms and the above criteria, we ask the following questions. 1) What adjustments must be made to an existing Sentinel-1 flood detection algorithm to provide an automated, accurate, and consistent time series that performs well on the proposed validation criteria? 2) How well does our improved Sentinel-1 flood mapping algorithm optimized for insurance applications perform on these criteria (spatial accuracy, temporal consistency, and correlation to flood damage reports) in comparison to existing Sentinel-1 and MODIS-based algorithms? To address these questions, we adapt a Sentinel-1 time series from the algorithm proposed by DeVries et al. [30]. Specifically, we: 1) assess the spatial accuracy of the Sentinel-1 algorithm over four flood events with 3-m PlanetScope data; 2) compare the correlation between 42 river water level gauges and both a Sentinel-1 and a MODIS surface water time series to measure temporal consistency; and 3) compare Sentinel-1 and MODIS inundated area estimates to government damage reports from the July 2020 flood.
Our study has the following three contributions: 1) a set of criteria for other remote sensing scientists to assess the suitability of algorithms and sensors for index insurance applications relying on the convergence of evidence across (imperfect) reference data; 2) a publicly available and validated weekly inundated area dataset from 2017 to 2022 across several administrative aggregations in Bangladesh; 3) the code used to generate the time series of inundation that can run globally using GEE.

II. OVERVIEW OF BANGLADESH STUDY SITE
Bangladesh is one of the most flood-prone countries in the world. The highly populous nation is situated in the delta formed by the confluence of three major rivers: the Ganges, Brahmaputra, and Meghna. Due to this unique geography, significant flooding is a normal part of the monsoon season, which typically stretches from June to September [53]. However, exceptionally severe years can see flooding last into October [54], and flash flooding in the northeastern Haor (seasonal wetlands) region has the potential to arrive as early as April [55].
The total national flood extent in Bangladesh covers roughly 20%-25% of the country's area in a normal year and can reach over 60% of the country's area in an extreme year (see Fig. 1) [54]. Consistent with global trends, flood exposure in Bangladesh is expected to increase with climate change due to increases in frequency and magnitude of high riverine flows [56], increase in cyclonic storm surge inundation [57], sea-level rise [58], and in response to climatic signals, such as El Niño-Southern Oscillation (ENSO) [53] and the Indian Ocean dipole [59]. Flood flows representing a 100-year return period are expected to increase along the Brahmaputra river by 8%, 24%, and 63% for global warming levels of 1.5°C, 2°C, and 4°C, respectively [56].
Over the past half-century, increasing flood control infrastructure (e.g., river embankments) has led to less flooding on average [60] but has raised the risk for catastrophic infrastructure failure [61] and thus more damaging and unpredictable major floods [60]. Currently, the Bangladesh Flood Forecasting and Warning Center (FFWC) uses thresholds on water levels ("danger levels") to determine if a damaging flood has occurred. But this method often fails when damaging floods result from infrastructure failures, such as dam breaches. Measuring flood extent directly with satellites can detect damaging inundation occurring from both predictable and unpredictable events without reliance on imperfect water level assumptions. Generally, land use and housing in Bangladesh are adapted to typical flooding (Bengali barsha). However, especially large or mistimed floods (bonna) often damage crops and property [58]. Agricultural production in Bangladesh is highly dependent both on the extent and timing of monsoon season flooding. For example, Aman rice crops, which are sowed during the rainy season (kharif), can be destroyed by a second August-September flood peak [54], and Boro rice crops sowed during the dry season (rabi) can be destroyed by an early flood, as occurred in the Northeast Haor region flash floods of April 2017 [55].
Groundwater irrigation is increasingly used for sowing Boro rice in the dry season, in contrast to the monsoonal flooding used to irrigate rainy season rice (Aman) [62]. Because rice irrigation requires standing water, rice fields irrigated during the dry season can often appear to be inundated both from the ground (see Fig. 2) and, consequently, from satellites. This dry-season inundation complicates Sentinel-1 change detection algorithms for flood mapping, such as the one proposed by DeVries et al. [30], which compare flood events against nonflood baseline images that are specified through subjective user input. Improper selection of a baseline (in this case, choosing a baseline covering the dry season but failing to incorporate groundwater irrigation dynamics) could lead to significant irrigation in baseline images and thus to underprediction of flooding during a damaging flood event. In this study, we avoid the inclusion of this dry-season intentional inundation with a regionally adapted baseline selection method proposed in Section II-B2.
Bangladesh is the first country to pilot satellite-based flood index insurance for farmers, making it an ideal case study for the development and validation of an improved flood index. Past flood index insurance pilots required LIDAR-based digital elevation models to resolve complex topography and failed to scale due to prohibitive costs [15]. New satellite-based index insurance pilots based on inundated area estimates from MODIS [17], [33] are promising but so far demonstrate only limited validation over a few data points. The 2019 satellite index-based contract funded by OXFAM and implemented by IWMI in Bangladesh is at the Upazila (subdistrict) scale and triggers payments from thresholds on the fraction of Upazila flooded index over 7- or 14-day periods [17].
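As an illustration of how thresholds on a flooded-fraction index can trigger predetermined payouts, the following is a minimal sketch. The trigger levels and payout amounts are hypothetical placeholders, not the terms of the contract in [17], and the function name `payout` is introduced here only for illustration.

```python
from bisect import bisect_right

def payout(index_value, triggers, payouts):
    """Return the predetermined payout for an observed index value.

    triggers: ascending index thresholds (hypothetical values).
    payouts:  payout owed once each threshold is reached (same length).
    """
    if len(triggers) != len(payouts):
        raise ValueError("triggers and payouts must have the same length")
    tier = bisect_right(triggers, index_value)  # number of thresholds reached
    return 0.0 if tier == 0 else payouts[tier - 1]

# Hypothetical contract: fraction of a subdistrict flooded, three payout tiers.
TRIGGERS = [0.30, 0.45, 0.60]
PAYOUTS = [50.0, 100.0, 200.0]

for flooded_fraction in (0.25, 0.30, 0.50, 0.90):
    print(flooded_fraction, payout(flooded_fraction, TRIGGERS, PAYOUTS))
```

In a real contract, the trigger levels would be derived from the exceedance probabilities of the historical index, which is why the length and consistency of the time series matter.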

A. Satellite Data
A variety of satellite datasets were used in addition to Sentinel-1 data in this study. All data were acquired and processed in GEE (see Table I).
Sentinel-1: The European Space Agency (ESA) Copernicus Sentinel-1 mission is a polar-orbiting two-satellite constellation of SAR satellites composed of Sentinel-1A, launched in April 2014, and Sentinel-1B, launched in April 2016 [63]. According to observed availability in GEE, Sentinel-1 observations in Northern Bangladesh scaled up rapidly from an average of 17 images in 2016 to 67 images in 2017, and therefore only data from 2017 to present are sufficient for the development of an inundation time series (see Fig. 3). Fig. 3(B) additionally shows the dominant overpass interval in Bangladesh to be primarily 2 days followed by 10 days, with some regional variance. As of 2020, the average revisit time in Bangladesh was ∼4.2 days. The image resolution is 10 m. Of the four available Sentinel-1 imaging modes [64], only the interferometric wide-swath (IW) mode is available in Bangladesh. Images are preprocessed with the Sentinel-1 Toolbox to perform thermal noise removal, radiometric calibration, terrain correction, and log-scaling to decibel units (dB).

NASA-USDA Global Soil Moisture:
The NASA-USDA Global Soil Moisture dataset has been providing soil moisture measurements worldwide at 0.25° × 0.25° resolution with 3-day temporal frequency since 2010. Data are produced by integrating soil moisture observations into a two-layer Palmer model.

MODIS:
The NASA MODIS is a sensor measuring 36 spectral bands mounted on both the Terra and Aqua satellites, launched in 1999 and 2002, respectively. For this study, only the Terra satellite was used due to observed striping and missing data in Band 6 of the Aqua satellite [65]. The 8-day composite product (MOD09A1 v006) available in GEE was used throughout this study and scaled by a factor of 0.0001 as specified in GEE band details.
PlanetScope: PlanetScope is a private constellation of more than 180 satellites (as of December 2021) launched by Planet offering daily imaging of the entire Earth's surface at 3 m resolution. Scenes selected for the study offer four bands (red, green, blue, and near infrared) and are orthorectified and radiometrically corrected to a surface reflectance product [66].

B. Algorithms
We adapt an existing algorithm for a Sentinel-1 surface water time series in Bangladesh, which is validated for index insurance applications but useful in any capacity where a high-resolution water map time series is needed. The algorithm combines the z-score-based change detection approach proposed by DeVries et al. [30], implemented with a novel dry-season baseline selection method, an additional water threshold on the VH band of Sentinel-1, and smoothing post-processing.
1) Sentinel-1 Rapid and Automated Flood Detection Algorithm: An abbreviated explanation of the Sentinel-1 algorithm presented by DeVries et al. [30] is presented in the following.
A baseline stack of images from a roughly two-month dry period and a target flood image are chosen over a given region of interest (ROI) from the Sentinel-1 Ground Range Detected (GRD) image collection in GEE. Images are chosen to contain both VV and VH polarizations in IW mode. The baseline stack is filtered by ascending and descending satellite orbit passes. The mean and standard deviation of each pixel in its respective baseline image stack are computed for both the VV and VH bands. These pixelwise baseline mean and standard deviation images are then used to calculate the pixelwise z-score for each target image over the VV and VH bands ($Z_{VV}$ and $Z_{VH}$, respectively) as

$$Z_{VV} = \frac{t_{VV} - \mu_{VV}}{\sigma_{VV}}, \qquad Z_{VH} = \frac{t_{VH} - \mu_{VH}}{\sigma_{VH}}$$

where $t_{VV}$/$t_{VH}$, $\mu_{VV}$/$\mu_{VH}$, and $\sigma_{VV}$/$\sigma_{VH}$ are the target image, the mean image, and the standard deviation image, respectively, for VV/VH polarizations. Given a z-score threshold for each band ($z\text{-thd}_{VV}$ and $z\text{-thd}_{VH}$), pixels are marked as flooded with high confidence if ($Z_{VV} < z\text{-thd}_{VV}$ and $Z_{VH} < z\text{-thd}_{VH}$) and are marked as flooded with medium confidence if ($Z_{VV} < z\text{-thd}_{VV}$ or $Z_{VH} < z\text{-thd}_{VH}$). A threshold of $z\text{-thd}_{VV} = z\text{-thd}_{VH} = -2$, i.e., the target value is more than two standard deviations below the dry season mean, is shown to give the maximum accuracy of the three thresholds tested in the original study. This flood map is combined with a permanent water flag from the JRC Global Surface Water Mapping Layers as described by DeVries et al. [30].
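The z-score test described above can be sketched for a single pixel as follows. This is a minimal illustration rather than the implementation in [30]: it assumes a population standard deviation over the baseline stack (the text does not say which estimator is used), and the function names `z_score` and `classify` are hypothetical.

```python
from statistics import mean, pstdev

def z_score(target_db, baseline_db):
    """Z-score of a pixel's target backscatter (dB) against its baseline stack."""
    mu = mean(baseline_db)
    sigma = pstdev(baseline_db)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("baseline stack has zero variance")
    return (target_db - mu) / sigma

def classify(z_vv, z_vh, z_thd_vv=-2.0, z_thd_vh=-2.0):
    """Confidence class from the two z-scores: 'high', 'medium', or 'none'."""
    low_vv = z_vv < z_thd_vv
    low_vh = z_vh < z_thd_vh
    if low_vv and low_vh:
        return "high"    # both polarizations well below the dry-season mean
    if low_vv or low_vh:
        return "medium"  # only one polarization below its threshold
    return "none"

# Hypothetical single-pixel example with four baseline observations per band.
z_vv = z_score(-15.0, [-10.0, -11.0, -9.0, -10.0])
z_vh = z_score(-17.5, [-17.0, -18.0, -16.0, -17.0])
print(classify(z_vv, z_vh))
```

In practice the same arithmetic would be applied per pixel over whole image stacks, separately for ascending and descending orbit passes.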
2) Updated Sentinel-1 Algorithm: We propose a revised version of this algorithm, adding a method to estimate a dry-season baseline by region, an additional backscatter threshold to the VH band, and spatial smoothing post-processing. More importantly, because of the threshold on the VH backscatter image, the output of the updated algorithm is a surface water time series, not a flood water time series. Due to the difficulty of classifying permanent or seasonal water in the constantly changing landscape of Bangladesh, and for easier comparison with optical data, in this study, all satellite flood maps are binary surface water maps. Because an index of surface water represents a constant positive shift from an index of flood water, this distinction does not affect the methodology or results of this study.
We give an overview of our algorithm in Fig. 4 and describe each section in the following.
Step 1. Baseline selection: As addressed above, careful selection of a historical baseline period in Bangladesh is essential to the accurate mapping of damaging floods due to the widespread irrigated inundation of rice crops in the traditional dry season. This irrigation inundation varies by region (see Fig. 5), and so a regionally adapted approach is desired.
We use the NASA-USDA Global Soil Moisture dataset in this study to develop a regionally adapted and automated method for baseline selection. We pick baselines from soil moisture rather than precipitation because this inundation is often from groundwater irrigation. We linearly interpolate this 3-day overpass time series to create a daily time series for analysis. The 45-day window with the lowest average soil moisture in this processed district-level time series that also contains both ascending and descending Sentinel-1 passes is designated as the baseline period for that district. A 45-day baseline period was empirically chosen based on observations of irrigation timing, as it allowed sufficient Sentinel-1 overpasses (∼10 images on average) while not encroaching on residual monsoon water, irrigation inundation, or early season flooding. Fig. 5 shows the baseline selection over two districts, Sirajganj (North-Central) and Sylhet (North-East). Transplantation of rice seedlings into flooded fields generally occurs from the fourth week of January until the third week of February in Sirajganj and mid-November until mid-December in Sylhet, according to the Bangladesh Agro-Meteorological Information Portal [67]. Sensitivity analysis reveals that this algorithm decision does not affect spatial accuracy relative to a general dry-season baseline. However, we believe that it is still valuable to reduce the need for a priori knowledge of the study site and increase the generalizability of the algorithm.
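The baseline selection described above can be sketched as a sliding-window search. The sketch below is an assumption-laden illustration: days are integer offsets from an arbitrary start date, overpass days are passed in as plain lists, and the function names `interpolate_daily` and `driest_window` are hypothetical.

```python
def interpolate_daily(days, values):
    """Linearly interpolate observations at integer day offsets to a daily series.

    Returns a list where element i corresponds to day days[0] + i.
    """
    daily = []
    for d0, d1, v0, v1 in zip(days, days[1:], values, values[1:]):
        for d in range(d0, d1):
            daily.append(v0 + (v1 - v0) * (d - d0) / (d1 - d0))
    daily.append(values[-1])
    return daily

def driest_window(daily, asc_days, desc_days, window=45):
    """Find the lowest-mean window that contains both orbit directions.

    daily:     daily soil moisture, indexed by day offset.
    asc_days:  day offsets with an ascending Sentinel-1 pass.
    desc_days: day offsets with a descending Sentinel-1 pass.
    Returns (start_day, mean_soil_moisture), or (None, inf) if no window qualifies.
    """
    best_start, best_mean = None, float("inf")
    for start in range(len(daily) - window + 1):
        end = start + window - 1
        has_asc = any(start <= d <= end for d in asc_days)
        has_desc = any(start <= d <= end for d in desc_days)
        if not (has_asc and has_desc):
            continue  # the baseline must cover both orbit directions
        window_mean = sum(daily[start:start + window]) / window
        if window_mean < best_mean:
            best_start, best_mean = start, window_mean
    return best_start, best_mean
```

Applied per district, the returned start day defines the 45-day period from which the baseline image stack would be drawn.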
Step 2. Time series preparation: To prepare a multiyear time series over a given region from the algorithm presented in [30], we calculate a baseline period for that region as described in Step 1. With the pixelwise mean and standard deviation images for both VV and VH polarizations created from the baseline period, we calculate the z-score image in both polarizations for each Sentinel-1 image from 2017 to present as described in Section III-B1. When Sentinel-1 images partially cover the study area for a given overpass, pixels not covered are given the value of the previous overpass. We then have three images over the study region for each timestep: the VV z-score image, the VH z-score image, and the raw VH image.
Step 3. Pixel classification: Pixels are classified as water according to the medium confidence classification in [30] (flooded in either z-score band) or by an additional threshold on the Sentinel-1 VH band ($Z_{VV} < z\text{-thd}_{VV}$ or $Z_{VH} < z\text{-thd}_{VH}$ or $VH < \text{thd}_{VH}$). We choose an or condition because we qualitatively observe that over Bangladesh's widespread flood extents, algorithms with more opportunities to detect floodwater perform better than those with fewer opportunities, an observation supported by sensitivity analysis. An additional VH threshold was added to correct for the underprediction seen on windy days when the VV band is highly susceptible to wind-caused radar scattering. Compounding false positives caused by combining three flood detection methods with an or condition are subsequently corrected for in Step 4 (spatial smoothing). A threshold of $z\text{-thd}_{VV} = z\text{-thd}_{VH} = -2$ is used for z-score thresholding as described in Section III-B1. To determine $\text{thd}_{VH}$, a region-based thresholding method would likely deliver maximum accuracy. However, a region-dependent method requires a roughly bimodal histogram from a flood event in each region. To ensure the automated nature of the algorithm, we propose a universal threshold across northern Bangladesh of $\text{thd}_{VH} = -22$ dB. We derive this threshold from the average of an Otsu threshold (derived from minimizing the intraclass variance of a bimodal distribution [68]) run over the four spatial validation events ($\text{thd}_{avg} = -22.3$ dB, $\sigma = 1.48$ dB). Sentinel-1 thresholds across the world vary by region; therefore, this threshold may have to be adapted outside of Bangladesh. A review of previous Sentinel-1 VH thresholds shows values often fall between −20 and −25 dB [69], [70], [71]. We conduct sensitivity analysis over this range of thresholds and find no significant change in results in the range $\text{thd}_{VH} = -20$ to $-25$ dB ($\Delta_{\max} F1 = 0.01$). Sensitivity analysis also reveals that the addition of this threshold increases accuracy by 0.05 (F1).
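To make the thresholding step concrete, the following sketch shows a histogram-based Otsu threshold and the three-way or condition. It is an illustration under stated assumptions, not the study's code: the bin count, the sample values, and the function names `otsu_threshold` and `is_water` are hypothetical.

```python
def otsu_threshold(values, bins=64):
    """Histogram-based Otsu threshold for a roughly bimodal sample (e.g., VH in dB)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("values must not be constant")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (k + 0.5) * width for k in range(bins)]
    total = len(values)
    total_sum = sum(c * x for c, x in zip(counts, centers))
    best_k, best_between = 0, -1.0
    w0, sum0 = 0, 0.0
    for k in range(bins - 1):
        w0 += counts[k]
        sum0 += counts[k] * centers[k]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (total_sum - sum0) / w1
        # Maximizing between-class variance is equivalent to
        # minimizing the intraclass variance.
        between = w0 * w1 * (m0 - m1) ** 2
        if between > best_between:
            best_between, best_k = between, k
    return lo + (best_k + 1) * width  # upper edge of the last "dark" bin

def is_water(z_vv, z_vh, vh_db, z_thd=-2.0, thd_vh=-22.0):
    """Water if either z-score or the raw VH backscatter falls below its threshold."""
    return z_vv < z_thd or z_vh < z_thd or vh_db < thd_vh

# Hypothetical bimodal VH sample: dark (water-like) and bright (land-like) pixels.
sample = [-26.0, -25.0, -24.0] * 10 + [-16.0, -15.0, -14.0] * 10
print(otsu_threshold(sample))
```

Averaging such per-event thresholds is one way to arrive at a single universal value, at the cost of ignoring regional differences in backscatter.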
Step 4. Spatial smoothing: Radar speckle noise is a common issue in radar remote sensing data [72]. This radar noise especially makes SAR change detection approaches less accurate because these errors increase statistical overlapping between changed (flooded) and unchanged (nonflooded) classes [73]. As described in the Sentinel-1 overview (see Section III-A), some noise correction is performed on the raw Sentinel-1 scenes before ingestion to GEE. However, preliminary flood maps in this study suggested that speckle noise in the target backscatter image still led to noise in the flood image.
Therefore, we employ a smoothing filter to increase algorithm accuracy. From the Boolean flood map created as described above, we calculate the circular focal mean (unweighted by distance) and apply a threshold of 0.6, which acts similarly to a majority filter. We tested a variety of kernel radii and thresholds and found a 30-m radius with a threshold of 0.6 to be the most accurate. Although all smoothing filters create some loss of accuracy at the pixel scale, we find that this filter acts primarily as a noise reduction filter, and therefore we still report the algorithm scale at 10 m. This smoothing process is demonstrated in Fig. 6 over the April 2017 Sylhet flash flood. Sensitivity analysis also reveals that the addition of this smoothing step increases accuracy by 0.03 (F1).
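The focal-mean smoothing can be sketched on a small Boolean grid as follows. This is a minimal illustration with several assumptions that the text does not specify: a 30-m radius is taken as 3 pixels at 10-m resolution, pixels outside the image are ignored at the edges, and a pixel is kept when the focal mean is greater than or equal to the threshold. The function names are hypothetical.

```python
def circular_offsets(radius_px):
    """Row/column offsets of all pixels within a circular kernel."""
    return [(dy, dx)
            for dy in range(-radius_px, radius_px + 1)
            for dx in range(-radius_px, radius_px + 1)
            if dy * dy + dx * dx <= radius_px * radius_px]

def smooth(water, radius_px=3, threshold=0.6):
    """Threshold the unweighted circular focal mean of a Boolean water map."""
    rows, cols = len(water), len(water[0])
    offsets = circular_offsets(radius_px)
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Assumption: only in-bounds neighbors contribute to the mean.
            vals = [water[r + dy][c + dx]
                    for dy, dx in offsets
                    if 0 <= r + dy < rows and 0 <= c + dx < cols]
            row.append(sum(vals) / len(vals) >= threshold)
        out.append(row)
    return out

# An isolated "water" pixel (speckle-like noise) is removed by the filter.
noisy = [[False] * 7 for _ in range(7)]
noisy[3][3] = True
print(any(any(row) for row in smooth(noisy)))
```

The same filter also fills isolated gaps inside large water bodies, which is why it behaves mainly as noise reduction rather than as a change of scale.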
3) MODIS Water Detection Algorithm: To detect surface water with the MODIS sensor, we implement the algorithm for flood time series development in Bangladesh proposed by Islam et al. [74]. This algorithm masks cloudy pixels and calculates three flood indices: Enhanced Vegetation Index (EVI), Land Surface Water Index (LSWI), and difference value between EVI and LSWI (DVEL). Thresholds are applied to the indices to return three water classifications: water pixel, mixed pixel (water and land), and water-related pixel (water + mixed pixels). We test both the water and water-related classifications in this study.
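The three indices can be computed from scaled surface reflectance as sketched below. The EVI and LSWI formulas are the standard definitions, and the band roles (red, near infrared, blue, and shortwave infrared, the last corresponding to Band 6) are stated as assumptions about the inputs. The classification thresholds from [74] are not reproduced here, and the function name `modis_indices` is hypothetical.

```python
SCALE = 0.0001  # MOD09A1 reflectance scale factor, as stated in the text

def modis_indices(red, nir, blue, swir, scale=SCALE):
    """Compute EVI, LSWI, and DVEL from raw MOD09A1 band values.

    Inputs are unscaled integers for the red, near-infrared, blue, and
    shortwave-infrared bands (assumed band roles).
    """
    red, nir, blue, swir = (b * scale for b in (red, nir, blue, swir))
    # Standard EVI formulation with the usual coefficients.
    evi = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)
    # LSWI contrasts near-infrared with shortwave-infrared reflectance.
    lswi = (nir - swir) / (nir + swir)
    # DVEL is the difference value between EVI and LSWI.
    return {"EVI": evi, "LSWI": lswi, "DVEL": evi - lswi}

# Hypothetical water-like pixel: low near-infrared and very low shortwave infrared.
print(modis_indices(red=500, nir=300, blue=400, swir=100))
```

Open water tends to give a low EVI together with a high LSWI, so DVEL becomes strongly negative, which is the behavior the thresholds exploit.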

C. Validation
We assess the performance of the proposed Sentinel-1 algorithm, the Sentinel-1 time series proposed by DeVries et al. [30], and the MODIS time series over criteria iv-vi of the validation criteria proposed above. We compute spatial accuracy with comparisons to four high-resolution PlanetScope flood event images (criterion v), temporal consistency with correlation with river water levels (criterion vi), and correlation to government damage reports (criterion iv). Using a convergence of evidence framework, we assess these three imperfect but valuable metrics to compare the three algorithms as flood indices for insurance.
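For the temporal consistency metric, Spearman's ρ between a river water level series and an inundated area series can be computed as the Pearson correlation of their ranks. The sketch below is illustrative only: the data are hypothetical, ties receive average ranks, and the function names are introduced here for illustration.

```python
from math import sqrt
from statistics import mean

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical weekly gauge levels (m) and inundated area (km^2).
water_level = [10.2, 10.8, 12.1, 13.0, 12.4, 11.0]
flooded_area = [35.0, 41.0, 80.0, 120.0, 95.0, 50.0]
print(spearman_rho(water_level, flooded_area))
```

A rank-based correlation is a natural choice here because the relationship between water level and inundated area is monotonic but not necessarily linear.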

1) PlanetScope Spatial Event Validation:
To evaluate the spatial accuracy of the Sentinel-1 inundation time series, we perform accuracy analysis over high-resolution (3 m) PlanetScope images from four flood events across Northern Bangladesh. We aim to extend a standard remote sensing validation practice of random sampling over several hundred points to a larger analysis across an entire ROI. For each flood event, an image overpass and ∼600 km² ROI were chosen based on minimal cloud cover, closeness of Sentinel-1 and MODIS overpasses, and the aim of sampling a variety of flood locations and times. For each PlanetScope image, the closest Sentinel-1 and MODIS images are selected for comparison. Because it is an 8-day composite, the MODIS image is selected based on its pixelwise median date. While this leads to an 8-day uncertainty window for each pixel, persistent cloud cover on most rainy-season days means that pixels usually come from on or near the median date. The four flood events are shown in Fig. 7, summarized in Table II, and described as follows.

1) Sylhet District, April 2017 Flash Flood:
This major flash flood occurring across the Haor basin of Northern Bangladesh was especially damaging due to its relatively early timing compared to the typical monsoon season. The flood is estimated to have caused about 800 000 tons of losses to Boro rice crops planted during the dry season, which were nearly ready for harvest at the time of the flood [55].

2) Natore/Naogaon District, August 2017 Monsoon Flooding:
This flood event centered around the confluence of the Atrai and Nagar rivers occurred during the 2017 monsoon season. Floodwater in the Natore district increased in August 2017 while flooding in other areas started to decrease, with Atrai river water levels running 70 cm above danger levels [75]. The studied image is centered around the intersection between the Natore and Naogaon districts.

3) Sirajganj/Pabna District, July 2019 Monsoon Flooding:
Monsoon season flooding in 2019 was likely not as widely severe as in 2017; however, there was greater localized damage in certain regions. The Sirajganj district was one of the heavily affected regions, with an estimated 11%-20% of the population affected [76]. The studied image centers around the border between the Sirajganj and Pabna districts near the intersection of the Brahmaputra and Atrai rivers.

4) Jamalpur District, July 2020 Monsoon Flooding:
Monsoon season flooding in 2020 was especially severe in and around the Jamalpur district due to record water levels and long-lasting flood extent [77]. The studied flood event centers around the Brahmaputra (Jamuna) river and its overflow over the Jamalpur district near the intersection of the Brahmaputra and Old Brahmaputra Rivers.
Although events were chosen based on minimal cloud cover, clouds and cloud shadows are still present in PlanetScope images and must be masked to avoid systematic validation errors. To mask clouds in each PlanetScope image, we calculate albedo as the sum of all bands and empirically threshold albedo over each image to create a cloud mask layer (7000, 7000, 9000, and 7000 for events 1-4, respectively). To mask cloud shadows, we iteratively shift the cloud mask in 1-m steps in the direction opposite the sun azimuth, up to the maximum cloud shadow length. Although the cloud and cloud shadow masks overmask in places, this only reduces the sample of studied pixels (a minimum of 5 million, in Sirajganj) and therefore should not significantly affect results. Remaining observed clouds and cloud shadows are masked by hand. We remove permanent water bodies from analysis with the JRC Global Surface Water Mapping Layers (v1.3) [34]. Water bodies are masked if present in more than 80% of monthly images from 2017 to 2021. We calculate the Normalized Difference Water Index (NDWI) [78] from each PlanetScope image to derive the validation water map.
We compare each PlanetScope validation map to the Sentinel-1 water maps and the MODIS water map with two metrics: F1-Score to measure algorithm accuracy and bias to measure algorithm overprediction or underprediction. The F1-Score is a common metric used to assess classification algorithms and is defined as the harmonic mean of the algorithm precision (P) and recall (R):
$$F1 = \frac{2PR}{P + R}$$
The F1-Score is measured in the range [0, 1], where 1 represents perfect accuracy. We use the bias metric proposed by Sampson et al. [79] and defined as
$$\mathrm{Bias} = \frac{A_S}{A_V}$$
where $A_S$ represents the total surface water area in the study map and $A_V$ represents the total surface water area in the validation map. Bias values fall in the range (0, ∞), where >1 represents a tendency to overpredict, <1 represents a tendency to underpredict, and 1 represents no tendency.
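As a minimal sketch of the two spatial metrics, the snippet below computes the F1-Score and the bias from a pair of co-registered binary water maps. It assumes equal-area pixels, so that the ratio of water areas reduces to a ratio of pixel counts, and it takes the bias as the study-map water area divided by the validation-map water area, consistent with the range and interpretation described above. The function name and the optional `valid` mask (for example, cloud-free and non-permanent-water pixels) are illustrative choices, not part of the published code.

```python
def f1_and_bias(study, validation, valid=None):
    """Compare two flattened binary water maps (1 = water, 0 = dry).

    `valid` optionally flags pixels to include in the comparison.
    Returns (f1, bias); bias is study water area / validation water area.
    """
    tp = fp = fn = 0
    for i, (s, v) in enumerate(zip(study, validation)):
        if valid is not None and not valid[i]:
            continue  # skip masked pixels
        if s and v:
            tp += 1   # water in both maps
        elif s:
            fp += 1   # water only in the study map
        elif v:
            fn += 1   # water only in the validation map
    if tp == 0 or (tp + fn) == 0:
        raise ValueError("no overlapping or validation water pixels in the sample")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    bias = (tp + fp) / (tp + fn)
    return f1, bias


# Synthetic six-pixel example for illustration only.
study_map = [1, 1, 1, 0, 0, 1]
validation_map = [1, 1, 0, 0, 1, 1]
f1, bias = f1_and_bias(study_map, validation_map)
```

In this toy case one pixel is falsely flagged and one is missed, so the bias is 1 even though the F1-Score is below 1, which is why the two metrics are reported together.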
2) Water Level Time Series Validation: Quantifying accuracy over only a few select flood events is an established and helpful form of algorithm validation, but it does not prove index performance over time, which is a key necessity for reliable index insurance and for reducing basis risk over time. In the development of the proposed Sentinel-1 algorithm, we tested a version that required flooding to appear in the VV band. Because this band is highly sensitive to radar scattering from wind, the resulting time series in very windy regions often demonstrated high spatial accuracy but were too temporally inconsistent to be usable for insurance (e.g., Sunamganj district, Fig. 9). This anecdotal example highlights the importance of studying performance over the entire time series and not just a small sample of events.
To evaluate the temporal consistency of the three studied time series, we calculate correlation statistics between river water level data and surrounding inundation for each surface water time series. The aim of this metric is to show the validity of the proposed algorithm over all possible dates and thus measure the consistency of a possible flood index for insurance payouts over time. River water levels do not provide a perfect proxy for surrounding inundation, but we do generally expect the water level to have a positive and monotonic relationship with flooding. Because water level data are available at many locations with at least daily frequency, they serve as a useful comparison point for remotely sensed flood algorithms, especially to quantify aggregate performance over time.
Water level data through 2020 are obtained from the Bangladesh FFWC for 53 stream gauges in Northern Bangladesh. Of these 53 gauges, 42 are chosen for analysis; the remaining 11 have missing or poor data (Fig. 10). We filter outliers by removing all values below a z-score of z = −3 or above a z-score of z = 5. We chose these thresholds empirically to remove outliers without removing rainy season water level peaks. Dry-season agricultural inundated area (from irrigation) is not expected to correlate with river water levels. We thus only assess the correlation between water levels and the satellite-derived inundated area within a broad rainy season window defined from May to October, inclusive. We run each satellite algorithm during 2017-2020 (inclusive) within a 15-km-radius circle centered at each gauge.
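The gauge preprocessing described above can be sketched as follows. This is an illustrative reading rather than the published code: it assumes z-scores are computed per gauge over that gauge's full record using the population standard deviation, and that the outlier filter is applied before the May to October window. The function name and the synthetic record are hypothetical.

```python
from datetime import date, timedelta
from statistics import mean, pstdev


def filter_water_levels(records, z_low=-3.0, z_high=5.0, months=range(5, 11)):
    """Keep (day, level) pairs inside the z-score band and the rainy season.

    records: list of (datetime.date, water_level) for a single gauge.
    months:  calendar months retained (May-October inclusive by default).
    """
    levels = [lvl for _, lvl in records]
    mu, sigma = mean(levels), pstdev(levels)
    kept = []
    for day, lvl in records:
        # A constant record has sigma == 0; treat every value as z = 0.
        z = (lvl - mu) / sigma if sigma > 0 else 0.0
        if z_low <= z <= z_high and day.month in months:
            kept.append((day, lvl))
    return kept


# Synthetic daily record for 2019 with one spurious spike (illustration only).
start = date(2019, 1, 1)
records = [(start + timedelta(days=i), 5.0 + (i % 7) * 0.1) for i in range(365)]
spike_day = date(2019, 7, 15)
records = [(d, 500.0 if d == spike_day else lvl) for d, lvl in records]

kept = filter_water_levels(records)
```

The asymmetric band (−3 below, 5 above) mirrors the intent stated in the text: genuine monsoon peaks sit well above the mean and should survive the filter.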
For each river gauge, we correlate the water level and inundation time series on coincident days (days with both a water level measurement and a MODIS or Sentinel-1 observation, respectively) using Spearman's rank correlation. We chose Spearman's rank correlation because the relationship between water level and inundation is expected to be monotonic but nonlinear: once a river overflows its banks, inundation rapidly increases with only slight increases in water level.
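A short sketch of the coincident-day rank correlation follows. It is written with the standard library only, computing Spearman's ρ as the Pearson correlation of average ranks; in practice a library routine such as `scipy.stats.spearmanr` would also return the p-value. The function names, the dictionary-based inputs, and the synthetic values are illustrative assumptions.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_on_coincident_days(level_by_day, area_by_day):
    """Spearman's rho using only days present in both time series."""
    days = sorted(set(level_by_day) & set(area_by_day))
    levels = [level_by_day[d] for d in days]
    areas = [area_by_day[d] for d in days]
    return pearson(average_ranks(levels), average_ranks(areas))


# Synthetic example: inundation jumps once the river overtops its banks.
level_by_day = {1: 2.0, 2: 2.5, 3: 3.0, 4: 3.2, 5: 3.3}
area_by_day = {2: 1.0, 3: 1.5, 4: 20.0, 5: 80.0, 6: 90.0}
rho = spearman_on_coincident_days(level_by_day, area_by_day)
```

The synthetic relationship is strongly nonlinear yet perfectly monotonic, so the rank correlation is 1 where a linear correlation would be lower; this is the behavior that motivates the choice of Spearman's rank correlation.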
3) Comparisons to Government Damage Data: To estimate the correlation of Sentinel-1 inundated area to flood damage, we compare measured satellite inundation to government damage reports from the 2020 Bangladesh monsoon floods. These government damage reports, or D-Forms, are created by inspection and surveying of flooded areas during or after major flood events. They were obtained by Oxfam and translated into English by IRI. These data provide "total damaged area" per Union (fourth-level administrative district), defined as an area affected by harmful inundation. Data are available for 15 unions distributed across two Upazilas along the Jamuna river: Fulchhari in Gaibandha district and Ulipur in Kurigram district (Fig. 11). These types of government data are the standard for ground-truth flood data in Bangladesh. However, they are still susceptible to human error and political influence and therefore cannot be considered gold-standard "ground truth." Despite these limitations, the data are still essential for ensuring remotely sensed results correlate with on-the-ground reality and are an important component in our convergence of evidence approach.
Satellite data are obtained from the overpass with maximum flood extent in each sensor in 2020. To measure the relationship between these damage data and the satellite flood data, we fit a linear relationship between fractional inundated area and fractional damaged area in each Union. Across the Unions, we obtain an R² value for the strength of the relationship, a p-value for significance, and an RMSE value to add context to the R² value.
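The linear comparison can be sketched as an ordinary least-squares fit across unions. The snippet below is illustrative only: it assumes fractional inundated area is the predictor and fractional damaged area the response, defines RMSE as the root of the mean squared residual, and omits the p-value, which a routine such as `scipy.stats.linregress` would supply. The function name and the four synthetic data points are hypothetical and are not the study's data.

```python
from math import sqrt


def linear_fit_stats(x, y):
    """Ordinary least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r2, rmse), where rmse is the root of the
    mean squared residual (dividing by n, an assumption of this sketch).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    r2 = 1.0 - ss_res / ss_tot
    rmse = sqrt(ss_res / n)
    return slope, intercept, r2, rmse


# Synthetic per-union fractions for illustration only.
frac_inundated = [0.0, 0.1, 0.2, 0.3]
frac_damaged = [0.0, 0.1, 0.1, 0.2]
slope, intercept, r2, rmse = linear_fit_stats(frac_inundated, frac_damaged)
```

Reporting RMSE alongside R² matters here because both variables are fractions bounded by 0 and 1: a fit can have a modest R² while its typical error is still small in absolute terms.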

IV. RESULTS
A. Spatial Event Validation Results
In all four PlanetScope events, we find high accuracy and low bias of the proposed Sentinel-1 algorithm. Average F1-score is 0.920 (σ = 0.025) and average bias is 1.026 (σ = 0.164). Slight overprediction is seen in the Natore/Naogaon flood event (bias = 1.209), and slight underprediction is seen in the Jamalpur flood event. Results for each event are shown in Fig. 12. We find comparatively lower accuracy and systematic underprediction (F1 avg = 0.812, bias avg = 0.6035) in the original z-score-based algorithm proposed by DeVries et al. [30]. Of the two classifications in the studied MODIS algorithm, we find slightly higher accuracy with the "water-related" classification (F1 avg = 0.824) than with the "water" classification (F1 avg = 0.773). Therefore, we use results from the "water-related" classification to represent the MODIS sensor throughout this study. MODIS classifications resulted in lower accuracy than the proposed Sentinel-1 algorithm and in systematic overprediction (F1 avg = 0.824, bias avg = 2.296). Accuracy metrics are given in Table III and summarized in Fig. 13.

B. Water Level Correlation Metric Results
We observe a statistically significant rank correlation between both Sentinel-1 algorithms and water levels for all 42 stream gauges studied (p < 0.0001). The proposed algorithm has an average correlation of 0.752 (σ = 0.107), and the Sentinel-1 algorithm by DeVries et al. [30] has an average correlation of 0.755 (σ = 0.143). While there is no significant difference between the mean correlation of the two Sentinel-1 algorithms, the algorithm proposed by DeVries et al. [30] shows slightly greater variation. We, therefore, compare only the proposed Sentinel-1 algorithm and the MODIS algorithm in subsequent results. The MODIS time series demonstrates systematically lower correlation values than Sentinel-1, with an average correlation of 0.568 (σ = 0.196) and a statistically significant correlation (p < 0.05) in 38/42 gauges. In addition, the correlations between water levels and MODIS vary greatly, indicating large temporal uncertainty and inconsistency in MODIS flood detection, which we suspect is due to cloud cover. A histogram of Spearman's rank correlations over all 42 stream gauges for both Sentinel-1 algorithms and MODIS is given in Fig. 14, and the spatial distribution of correlation values for Sentinel-1 is given in Fig. 15. We observe clear spatial trends in both Sentinel-1 and MODIS correlation maps. The highest correlations in both sensors are

C. Government Damage Comparison Results
Flooded area by union from the proposed Sentinel-1 algorithm demonstrates a moderate and statistically significant positive correlation with reported fractional damaged area from government damage statistics (R² = 0.43, p = 0.01, RMSE = 0.10). The proposed algorithm demonstrates a higher correlation than the Sentinel-1 algorithm proposed by DeVries et al. [30] (R² = 0.29, p = 0.04, RMSE = 0.06).
We find no positive correlation between the reported damaged area and the inundated area detected by MODIS. The full set of unions yields a negative trendline (R² = 0.37, p = 0.02, RMSE = 0.12). Inspection of the MODIS flood image reveals that the two unions with the lowest fractional flooded area despite high reported damage (Saheber Alga and Begumganj) are obscured by clouds. When these two unions are removed, the data show no significant correlation (R² = 0.03, p = 0.55, RMSE = 0.06). We therefore interpret the original correlation as primarily an artifact of cloud cover in the MODIS image and not a meaningful result; the negative trendline is likely driven by these two outliers.

V. DISCUSSION
In this article, we demonstrate how validation with several imperfect but helpful reference data sources can build confidence in, and improve, remote sensing applications for flood insurance and similar socioeconomic applications. We first develop a set of criteria to assess remote sensing algorithms for flood index insurance. We then adapt a Sentinel-1 flood algorithm over Bangladesh based on those criteria. Using these criteria, we compare the adapted algorithm with a previous Sentinel-1 algorithm and a MODIS algorithm. Results of the comparison over each of the six requirements for an index insurance trigger are summarized in Table IV. We find that the proposed Sentinel-1 algorithm significantly improves on the previous Sentinel-1 algorithm on two metrics and on the MODIS algorithm on all three metrics. It demonstrates increased spatial accuracy (F1 = 0.925 versus 0.867 and 0.838, respectively) with bias closer to 1 (bias = 1.065 versus 0.778 and 1.758, respectively). Notably, the proposed algorithm updates yield roughly double the improvement over the previous Sentinel-1 algorithm (ΔF1 = 0.058) that the previous Sentinel-1 algorithm shows over MODIS (ΔF1 = 0.028). The updated algorithm demonstrates high temporal consistency, similar to the previous Sentinel-1 algorithm (ρ = 0.752 versus ρ = 0.755) and higher than MODIS (ρ = 0.568). Finally, the updated algorithm leads to the highest correlation to reported damage data (R² = 0.43 versus R² = 0.29 for the previous Sentinel-1 algorithm and no positive relationship for MODIS). Here, algorithm changes lead to a 48% relative increase in R² over the previous Sentinel-1 algorithm. The proposed algorithm is fully automated, requiring no additional event-based or region-based thresholding or adjustment, and yields high temporal consistency with low variance (σ = 0.12). These metrics and the following analysis of quantitative results show with confidence that the proposed Sentinel-1 algorithm delivers concrete improvements over past algorithms in Bangladesh and that Sentinel-1 is preferable to MODIS as a sensor for flood index insurance applications.
A deeper analysis of the three validation exercises gives greater confidence and context to these conclusions and demonstrates the criteria's usefulness in assessing a flood index. In spatial accuracy comparisons, clear accuracy advantages of the proposed algorithm, and of Sentinel-1 more generally, over MODIS translate to less basis risk because more affected policyholders are correctly identified as flooded. Equally important is the degree to which MODIS severely and inconsistently overpredicts flood extent (bias = 1.758, σ bias = 0.525), which translates to inflated payouts, high basis risk for insurers, and ultimately an unsustainable insurance product. Meanwhile, a less pronounced but still important underprediction from the DeVries et al. [30] algorithm would lead to underpayment and poor confidence from policyholders.
Analysis of temporal consistency results also gives important insights, beyond the quantitative metrics, into the advantages of cloud-free Sentinel-1 data over MODIS. Fig. 16 shows that the MODIS time series features severe oscillations from serious overprediction, which prevent accurate estimates of flood timing and duration. The MODIS sensor's higher variability (σ = 0.196 versus σ = 0.107 for Sentinel-1) and clear spatial trends (see Fig. 15) highlight how this temporal inconsistency likely results from persistent cloud cover and thus underprediction of flood extent in the rainy season. Northeastern Haor regions where MODIS demonstrates the poorest performance are known for persistent clouds in the monsoon; the name of the neighboring Indian state of Meghalaya translates to "The Abode of Clouds." Finally, the MODIS sensor's poorer performance also shows that cloud cover in optical imagery is a far greater concern for temporal consistency in Bangladesh than the frequent 10-day gaps in Sentinel-1 data.
Results from government damage report correlations reinforce the above insights and connect algorithm performance back to index insurance and basis risk. The proposed algorithm's 48% improvement over the DeVries et al. [30] algorithm (from R² = 0.29 to R² = 0.43) shows that it better represents the flood extent that causes damage and thus reduces basis risk. In the MODIS image for the July 2020 event, clouds can be observed in the northeast part of Fulchhari (see Fig. 11), corresponding to the unions with the greatest underprediction, whereas inundation is systematically overpredicted in noncloudy unions and is not related to actual flood extent. Therefore, the union flooded area from MODIS in this damage report example proves to be primarily a function of cloud extent rather than actual flood extent (Fig. 17).
Each of these three results raises concerns about the use of MODIS flood algorithms in index insurance contexts. The sensor's significant overprediction (bias = 1.758) and high variability of spatial accuracy and bias (σ F1 = 0.080, σ bias = 0.606) are a major cause of basis risk and would lead to excess insurance payouts in regions where flooding was falsely identified. Excess payouts make a policy far less attractive to insurers, who must cover them and charge higher rates, and thus make index insurance based on the MODIS sensor less feasible. The inconsistent time series of inundation from MODIS shown in temporal consistency tests could underestimate the duration of flooding, a key factor for flood damage in Bangladesh, and lead to insufficient payouts. Finally, the lack of correlation between MODIS and damage metrics, arising from a combination of spatial overprediction and cloud cover, suggests that MODIS is a poor proxy for flood damage and thus carries sizable basis risk as a flood index.
Despite displaying significant improvements over MODIS in spatial accuracy, temporal consistency, and correlation to reported damage, Sentinel-1 does not fulfill the index insurance trigger requirement of a historical time series longer than 15 years, which is necessary to create a historical risk distribution and price premiums [14]. Because Sentinel-1 data in Bangladesh only reached sufficient frequency in 2017 (see Fig. 3), future work is needed to fuse Sentinel-1 with other long time series sensors and datasets and provide the necessary risk distribution of historical flooding. Fusion possibilities could include MODIS, passive microwave [81], river water levels, consistent damage reports from the government, news media, social media [82], or other data sources that can give a signal for flood damage (e.g., ENSO-based indices worked as a flood proxy in Peru [83]). Promising approaches to fuse higher resolution with lower resolution sensors have been demonstrated by combining Sentinel-1 with passive microwave sensors [84] or Landsat with MODIS [85]. Other machine learning approaches, especially generative adversarial networks, have been successful in fusing Landsat and MODIS [86] or Sentinel-2 with PlanetScope [87] and could also be employed. Fused time series would also have to be validated and correlated with flood damage to reduce basis risk.
One limitation of this study is that the results demonstrated here may only hold for certain types of longer-duration floods (e.g., the riverine floods tested here); the proposed index may not be suitable for dense urban floods, flash floods, or rapidly receding storm surges. Urban areas present complications for flood detection with Sentinel-1 due to the sensor's side-looking nature. Small areas of flood extent common in cities are often blocked by buildings and other structures, and backscatter values are often increased due to "double-bounce" scattering from reflection off both water and buildings [88], [89]. The proposed algorithm only examines decreases in backscatter and therefore might show decreased performance in urban areas. Further research could attempt to detect these flood areas with positive deviations in backscatter z-score. Coastal areas present additional challenges due to the unique causes of flooding. While flooding in Northern Bangladesh is usually caused by river overflow or rainfall, flooding in coastal areas is often caused by storm surges and tidal behavior [90]. Because storm surges can cause a rapid increase and decrease of flood levels and tidal behavior happens on a subdaily time scale, floods in coastal Bangladesh have the potential to inflict significant damage in a time too short to be measured by infrequent satellite data. More work is needed to assess and validate the studied algorithms for urban and coastal flooding.
Another limitation of this study is the error inherent to each validation data source used. As described above, spatial accuracy in only four large flood events does not equate to spatial accuracy across all flood events. Although they cover a longer period with more points in time, water levels are not a perfect proxy for flooding or flood damage, and analog water level measurements in Bangladesh can be subject to human error. Finally, the government damage data report damaged area and not necessarily monetary damage values, which would be the gold standard for addressing basis risk. In addition, they are subject to political influence, and the sites examined for damage correlations include many chars (sandy islands) within the Brahmaputra/Jamuna river, which may not be broadly representative of damage in other areas. This article attempts to demonstrate how imperfect data sources can be used in tandem to analyze specific elements of a flood index and ultimately gain greater confidence in data-poor regions with a "convergence of evidence" [91] approach. Further efforts to validate damage with farmers' experiences (e.g., via crowdsourcing) and monetary damage values could add further confidence to any data proxy chosen for index insurance [24]. While this study focuses only on Bangladesh because of the immediate application, future work could attempt flood algorithm development or validation simultaneously in data-rich and data-poor regions to build more confidence in these methods across regions. Future work could also include new or upcoming SAR missions such as NISAR that are greatly expanding radar-based flood mapping capabilities, and integrate new high spatial resolution commercial sensors such as Capella [92] and ICEYE.
Data from this study and the code used to generate them are publicly available. Data can be accessed from https://doi.org/10.17605/OSF.IO/TGRX4. The code necessary to create these data can be found at https://github.com/mitchellthomas1/S1-Flood-Bangladesh.git.

VI. CONCLUSION
In this article, we propose a "convergence of evidence" validation methodology for satellite-based flood index insurance in data-poor environments and examine its use on adapted and existing Sentinel-1 flood data and MODIS flood data. The proposed methodology shows that the adapted Sentinel-1 time series demonstrates high spatial and temporal accuracy and is a more suitable flood index than both MODIS and the previous Sentinel-1 algorithm. Alongside data fusion methods to assess hazard frequency and exceedance probabilities [19], this validation methodology, which goes beyond limited spatial comparisons, could help expand insurance and financial protection from floods in Bangladesh and elsewhere. The demonstrated use of a "convergence of evidence" approach across several imperfect but helpful data sources could aid validation of remote sensing algorithms in data-poor environments for a variety of social and economic applications.