
CloudCast: A Satellite-Based Dataset and Baseline for Forecasting Clouds



Abstract:

Forecasting the formation and development of clouds is a central element of modern weather forecasting systems. Incorrect cloud forecasts can lead to major uncertainty in the overall accuracy of weather forecasts due to their intrinsic role in the Earth's climate system. Few studies have tackled this challenging problem from a machine learning point-of-view due to a shortage of high-resolution datasets with many historical observations globally. In this article, we present a novel satellite-based dataset called “CloudCast.” It consists of 70 080 images with 10 different cloud types for multiple layers of the atmosphere annotated on a pixel level. The spatial resolution of the dataset is 928 × 1530 pixels (3 × 3 km per pixel) with 15-min intervals between frames for the period January 1, 2017 to December 31, 2018. All frames are centered and projected over Europe. To supplement the dataset, we conduct an evaluation study with current state-of-the-art video prediction methods such as convolutional long short-term memory networks, generative adversarial networks, and optical flow-based extrapolation methods. As the evaluation of video prediction is difficult in practice, we aim for a thorough evaluation in the spatial and temporal domain. Our benchmark models show promising results but with ample room for improvement. This is the first publicly available global-scale dataset with high-resolution cloud types on a high temporal granularity to the authors' best knowledge.
Page(s): 3485 - 3494
Date of Publication: 02 March 2021

SECTION I.

Introduction

Cloud forecasting remains one of the major unsolved challenges in meteorology, where cloud errors have wide-reaching impacts on the overall accuracy of weather forecasts [1], [2]. Due to the vertical and horizontal nature of clouds, there is an intrinsic difficulty in measuring clouds quantitatively and evaluating the performance of cloud forecasts. This inability to accurately parameterize and thus quantify clouds, convective effects, and aerosols on a subgrid scale in weather models is one reason model estimates can carry major uncertainties [2].

The primary source of quantitative weather forecasts comes from numerical weather prediction (NWP) systems. For these numerical methods, we model the future using governing equations from the field of atmospheric physics [3]. Over the past decades, there have been tremendous improvements in weather prediction owing to increased computational power, integration of new theory, and assimilation of large amounts of data. Regardless, these atmospheric simulations are still computationally expensive and operate on coarse spatial scales (9 × 9 km or above per pixel) [4], [5]. Furthermore, the current amount of atmospheric data collection exceeds hundreds of petabytes per day [6], implying that data collection far outpaces our ability to analyze and assimilate it. As a consequence, the authors of [4] argue that we face two substantial challenges in this field for the future: 1) gaining knowledge from these extreme amounts of data and 2) developing models that tend to be more data-driven compared to traditional approaches while still abiding by the laws of physics. One recent application found discrepancies in climate models’ estimation of photosynthesis in the tropical rainforests, which ultimately led to a more accurate description of these processes globally [7], [8]. Ideally, similar insights can be discovered from data-driven methods for cloud dynamics, but obtaining adequate observations of clouds globally has been a substantial obstacle for developing data-driven cloud forecasting methods to date.

To tackle this problem and spark further research into data-driven atmospheric forecasting, we introduce a novel satellite-based dataset called “CloudCast” that facilitates the evaluation of cloud forecasting methods with a global perspective. This approach has been paramount to progress in state-of-the-art methods in the computer vision literature with datasets such as MNIST [9], ImageNet [10], and CIFAR10 [11]. Current datasets for global cloud forecasting exhibit coarse spatial resolution (9  × 9 to 31  × 31 km) and low temporal granularity (one to multiple hours between images) [5], [12]–​[15]. We overcome both these issues by using geostationary satellite images, arguably the most consistent and regularly sampled global data source for clouds [1]. Since these satellites can obtain images every 5–15 min with a relatively high spatial resolution (1  × 1 to 3  × 3 km), they provide an essential ingredient for developing data-driven weather systems, which is an abundance of historical observations. It is possible to achieve higher accuracy in the vertical dimension with radar- and lidar-based profiling methods [16], but these fall short on the temporal resolution due to not being geostationary. Our contributions are as follows.

  1. We present a novel satellite-based dataset designed for cloud forecasting. The dataset has 10 different cloud types for multiple layers of the atmosphere annotated on a pixel level. It consists of 70 080 images with a spatial resolution of 928 × 1530 pixels (3 × 3 km) and 15-min sampling intervals from January 1, 2017 to December 31, 2018. All frames are centered and projected over Europe. To the authors’ best knowledge, no equivalent dataset with high spatial and temporal resolution exists for evaluating multilayer cloud forecasting methods globally.

  2. We evaluate four video prediction methods to serve as benchmarks for our dataset by predicting 4 h into the future. Two of these are based on recent advancements in machine learning methods specifically for applications in atmospheric forecasting.

  3. To evaluate our results, we present an evaluation study for measuring cloud forecasting accuracy in satellite-based systems. The evaluation design is based on best practices from the World Meteorological Organization when conducting cloud evaluation studies [1], which includes widely tested statistical metrics for categorical forecasts. Furthermore, we implement the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) from the computer vision literature. The combination of these two domains should provide the best and most fair evaluation of our results.

The remainder of the article is organized as follows. Section II provides an overview of the related work. Section III describes the new dataset. Section IV provides the evaluation study for measuring cloud forecasting accuracy in satellite-based systems. Finally, conclusions are drawn in Section V.

SECTION II.

Related Work

We start by briefly reviewing related datasets commonly used in the cloud forecasting literature. After presenting related datasets, we will review methods for video prediction that are particularly suitable for forecasting in the spatiotemporal domain.

A. Related Datasets

We want to introduce a dataset to the community that is particularly suitable for developing data-driven methods with a global perspective. As a result, we only consider 1) geostationary satellite and 2) model-based observations. While other cloud datasets do exist with very high resolution and accuracy for localized atmospheric analysis, such as those from sky-imagers or radar satellites, these technologies generally provide poor spatial coverage, limiting their global use-case.

1) Satellite-Based Cloud Observations

Before introducing related satellite datasets, it is essential to differentiate between cloud detection and cloud typing for meteorological purposes such as forecasting, as the literature varies considerably between the two. Generally speaking, satellite-based cloud detection, typically with the objective of cloud removal, is a relatively established field with many accurate methods [17], [18]. For meteorological purposes, most methods typically deal with binary cloud masks [19]. As outlined earlier, we are specifically interested in multilayer cloud types for forecasting purposes, which limits the amount of related literature and datasets considerably. When we focus on multilayer cloud types, satellite-based cloud observations can be divided into raw infrared brightness temperature and satellite-derived cloud measurements [1].

Raw satellite brightness temperature acts as a proxy for cloud top height and is available during both day and night. This proxy is not a perfect indicator of multilayer cloud types, as the temperature of clouds is also affected by other factors such as the specific cloud type and seasonal variations.

Satellite-derived cloud measurements typically involve a brightness-based algorithm that can extrapolate variables such as cloud mask, type, and height from multispectral images [12]. This is the approach we adopt for our novel dataset, as it can classify multilayer cloud types with relatively high spatial and temporal resolution in near real time [12], [20]. To the authors’ best knowledge, only a few related datasets for satellite-derived multilayer clouds exist. For the European Meteosat Second Generation (MSG) satellites, these are Cloud Analysis and Cloud Analysis Image published by EUMETSAT [14], [15]. These products exhibit either coarse spatial resolution (9  × 9 km) or infrequent temporal sampling (1–3 h between images) [12]. As we derive our cloud types directly from the raw satellite images, we can maintain the high resolution (3  × 3 km) and temporal granularity (15-min sampling) of the raw satellite images. While our dataset is made from the MSG satellites, our approach is not limited to any specific geostationary satellite system and could be extended for other constellations. Outside Europe, similar datasets exist for 1) the American GOES-R satellites called ABI Cloud Height [21] and 2) the Japanese Himawari-8 called Cloud Top Height Product [22]. Due to their geographical coverage not extending to Europe, they are not directly comparable to ours or those of EUMETSAT.

2) Model-Based Cloud Observations

Model-based cloud observations are derived from the output of an NWP model and are the most directly comparable alternative to satellite observations due to their global coverage. Two commonly used global NWP models are the European ECMWF atmospheric IFS model [5] and the American GFS model [13]. The resolution varies between the two, but the ECMWF model offers the higher spatial resolution with 9 × 9 km grid spacing [5]. As both models are global, they can be used interchangeably. The advantage of using NWP model output is that a physics-based simulation of the future exists, while the clear disadvantage is the coarse spatial resolution compared to satellite observations. Other NWP models exist on a much finer spatial scale, but these are restricted to local areas, usually on a country basis [23], [24].

Given that we want to establish a global reference dataset for data-driven methods, the operational ECMWF model is considered the best global model-based multilayer cloud dataset. Compared to CloudCast, the ECMWF dataset is inferior in spatial resolution (9  × 9 vs. 3  × 3 km) and temporal resolution (1 h vs. 15 min). Furthermore, the operational ECMWF is not open-source, making it less relevant for the machine learning and computer vision community.

B. Methods

Producing accurate and realistic video predictions in pixel space is an open problem to date. Extrapolating frames in the near future can be done relatively accurately, but as the future sequence length grows, so does the inherent uncertainty of the predicted pixel values. Several approaches have been proposed for solving this complex and high-dimensional task: spatiotemporal-transformer networks [25], variational autoencoders [26], generative adversarial networks (GANs) [27]–[29], and recurrent-convolutional neural networks (CNNs) [30], [31]. In the video prediction literature, the tasks are often governed by relatively simple physics, such as the Moving MNIST dataset [32]. However, for predicting atmospheric flow, the task becomes bound by much more complex physics. Therefore, our chosen methods focus on applications that have been explicitly applied for atmospheric forecasting, which justifies the chosen benchmark models for our dataset. These are 1) convolutional long short-term memory networks (ConvLSTM) [31], 2) multistage dynamic generative adversarial networks (MD-GAN) [29], and 3) TV-$L^1$ optical flow (TVL1) [33].

1) Convolutional and Recurrent Neural Networks (ConvLSTM)

ConvLSTM was originally developed for precipitation nowcasting using radar images. It is considered the seminal paper for atmospheric forecasting using deep learning, making it relevant as a baseline for our dataset. While newer LSTM-based video prediction methods have been proposed following the ConvLSTM paper, such as PredRNN++ [34] and Eidetic-3D LSTM [30], these were not applied or evaluated on any atmospheric-related datasets, meaning they are outside the scope defined in Section I.

2) Optical Flow-Based Video Prediction

While optical flow is a classical topic in the computer vision literature, it is also one of the most important methods for global data assimilation in meteorology [35]. Optical flow has been applied for video prediction in several papers [33], [36]. In [33], the authors implement the TV-$L^1$ optical flow method introduced by [37] to capture cloud motions on multiple spatial scales based on two subsequent raw satellite images. Once estimated, the flow fields are inverted and applied to extrapolate multiple steps into the future to serve as a forecast of the effective cloud albedo.

3) Generative Adversarial Networks

GANs have been applied for video prediction in several recent papers [28], [29], [38]. One of these [29] achieved state-of-the-art results for generating 32-frame time-lapse videos with 128  × 128 resolution of cloud movement in the sky using only one frame as input. The authors use a two-stage generative adversarial network-based approach (MD-GAN), where the first-stage model is responsible for generating an initial video of realistic photos in the future with coarse motion. The second stage model then refines the initially generated video by enforcing motion dynamics using the Gram matrix in the intermediate layers of the discriminator.

SECTION III.

Dataset Description

The CloudCast dataset contains 70 080 cloud-labeled satellite images with 10 different cloud types corresponding to multiple layers of the atmosphere, as seen in Table I. As stated in Section II, we apply a satellite-derived cloud measurement approach. The procedure for generating our dataset is as follows (see Fig. 1 for a visual inspection).

  1. Acquire 70 080 samples for the period January 2017 to December 2018. The samples originate from the MSG satellites with four different channels per sample (280 320 satellite images), with 15-min sampling intervals and 3  × 3 km spatial resolution.

  2. Collect hourly NWP output for the entire period using the ECMWF operational model (exact variables to be elaborated shortly).

  3. Annotate the 70 080 samples on a pixel-level using the multilayer segmentation algorithm (to be described shortly).

  4. Conduct postprocessing to account for short-term missing observations.

  5. Generate and publish 1) a standardized version of the full-resolution dataset and 2) a spatially downsampled version, in addition to the raw dataset, to serve as benchmarks for future studies.

Fig. 1. Processing chain going from raw data to our final published dataset, CloudCast. As climatology and land-sea masks are of different resolutions, N refers to the specific resolution of each data type, respectively.

TABLE I. Description of the Various Cloud Types Present in the CloudCast Dataset

We will now elaborate on the above steps in more detail. As stated above, we start by collecting the 70 080 raw multispectral satellite images from EUMETSAT. These images come from a satellite constellation in geostationary orbit centered at zero degrees longitude and arrive in 15-min intervals. The resolution is 3712  × 3712 pixels for the full-disk of the Earth, which implies that every pixel corresponds to a space of dimensions 3  × 3 km. In the remote sensing community, it is well known that infrared channels can observe clouds differently than visible light, meaning infrared is necessary for low and medium cloud detection. Therefore, we sample one visible channel, two infrared channels, and one water vapor channel for each observation to enable multilayer cloud detection. The size of the entire raw satellite dataset is around 16 TB. Due to download and request limits imposed by EUMETSAT, we can only process a certain number of samples at any given time, which means we had to divide this process over a couple of months.

Next, we annotate each sample on a pixel level using a segmentation algorithm originally developed by [20] under the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT) Satellite Application Facility on Support to Nowcasting and Very Short Range Forecasting (NWCSAF) project [39]. This algorithm is essentially a threshold algorithm applied at the pixel level to our multispectral satellite images. To improve multilayer cloud detection in the segmentation algorithm, we include climatological variables and metadata such as geographical land-sea masks and viewing geometry, which have been shown to improve low- and mid-level cloud detection considerably [20]. We also include NWP output to further improve the segmentation algorithm with information not observable from satellite data. We collect the NWP data from the ECMWF operational model, which includes surface temperature, air temperatures at five different heights (950, 850, 700, and 500 hPa, and the tropopause level), total water vapor content of the atmosphere, and metadata for the ECMWF model grid.

Having established all the required datasets for the segmentation algorithm, we now outline how the thresholds are calculated for the major cloud types; they are primarily based on illumination conditions, viewing geometry, geographical location, and the NWP data. We will not list all the specific threshold values due to their sheer number, which varies between daytime, nighttime, and twilight conditions. Readers interested in the specific values are referred to [20].

The first set of clouds is high semitransparent (thin) clouds versus opaque clouds and fractional clouds. To separate these, we use the differences between 1) the infrared channels at 8.7 versus 10.8 μm and 2) the channels at 10.8 versus 12.0 μm. This is based on the insight that these differences are typically larger for cirrus clouds than for thick clouds [20]. To improve the separation during daytime, where visible light is available, we apply an additional threshold based on the illumination from the 0.6-μm channel. This is possible because cirrus clouds have lower reflectance values than opaque clouds with the same radiative temperature.

Once we have identified semitransparent and fractional clouds, we classify the remaining cloudy pixels into the low-, mid-, and high-level classes found in Table I. This separation is simpler; hence, we can calculate the thresholds based on the 10.8-μm brightness temperature, which correlates directly with cloud top height. Atmospheric instability still impacts this separation, which implies that the thresholds need to be corrected using the previously outlined NWP air temperatures at different pressure levels. These allow us to separate very low from low clouds, low from medium clouds, medium from high clouds, and high from very high clouds.

We can define the specific thresholds as
\begin{align*} vh &= 0.4 \cdot T_{500\,\text{hPa}} + 0.6 \cdot T_{\text{tropo}} - 5\ \text{K} \\ hi &= 0.5 \cdot T_{500\,\text{hPa}} - 0.2 \cdot T_{700\,\text{hPa}} + 178\ \text{K} \\ me &= 0.8 \cdot T_{850\,\text{hPa}} + 0.2 \cdot T_{700\,\text{hPa}} - 8\ \text{K} \\ lo &= 1.2 \cdot T_{850\,\text{hPa}} - 0.2 \cdot T_{700\,\text{hPa}} - 5\ \text{K}. \tag{1} \end{align*}

These thresholds were found in [20] to yield the best results. Having defined them, we can now state the annotation rule for the remaining cloud types, with $T_{10.8\,\mu\text{m}}$ denoting the 10.8-μm brightness temperature:
\begin{equation*} f(x) = {\begin{cases}\text{VeryHigh} & \text{if } T_{10.8\,\mu\text{m}} < vh \\ \text{High} & \text{if } vh \leq T_{10.8\,\mu\text{m}} < hi \\ \text{Medium} & \text{if } hi \leq T_{10.8\,\mu\text{m}} < me \\ \text{Low} & \text{if } me \leq T_{10.8\,\mu\text{m}} < lo \\ \text{VeryLow} & \text{if } lo \leq T_{10.8\,\mu\text{m}}. \end{cases}} \tag{2} \end{equation*}

In practice, the 7.3-μm channel and the secant of the satellite zenith angle are used to reduce the risk of classifying a low cloud as a medium cloud. For notational simplicity, we have left these out of the above, but they can be found in [20].
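For illustration only, the following NumPy sketch shows how the thresholds in (1) and the decision rule in (2) could be applied per pixel. The array names and NWP temperature inputs are our own assumptions, and the 7.3-μm and viewing-geometry corrections mentioned above are deliberately omitted.

```python
import numpy as np

def classify_cloud_height(bt_108, t_500, t_700, t_850, t_tropo):
    """Assign height-based cloud classes from the 10.8-um brightness
    temperature (bt_108, in K) using the NWP-derived thresholds in (1).
    All inputs are 2-D arrays of identical shape; the returned int8 array
    uses 0=VeryLow, 1=Low, 2=Medium, 3=High, 4=VeryHigh."""
    vh = 0.4 * t_500 + 0.6 * t_tropo - 5.0
    hi = 0.5 * t_500 - 0.2 * t_700 + 178.0
    me = 0.8 * t_850 + 0.2 * t_700 - 8.0
    lo = 1.2 * t_850 - 0.2 * t_700 - 5.0

    labels = np.zeros(bt_108.shape, dtype=np.int8)  # default: VeryLow (bt >= lo)
    labels[bt_108 < lo] = 1                         # Low
    labels[bt_108 < me] = 2                         # Medium
    labels[bt_108 < hi] = 3                         # High
    labels[bt_108 < vh] = 4                         # VeryHigh (coldest cloud tops)
    return labels
```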

While the segmentation algorithm is considered accurate, there are a few limitations. The primary one is that low clouds are sometimes classified as medium clouds in cases of, e.g., strong thermal inversion, despite the corrections made with the 7.3-μm channel and the solar zenith angle. Nevertheless, we want to highlight that extensive validation of the segmentation algorithm has been carried out using both space-borne lidar and ground-based observations, which are considered among the most accurate sources of ground-truth cloud observations [20].

As a final postprocessing step, we interpolate missing observations, which can arise for numerous reasons such as scheduled outages or sun outages. More specifically, we interpolate the missing observations linearly from neighboring values, which is only done for short gaps (below 6 h). The list of outages at the satellite level can be found on the EUMETSAT website [40]. One specific example is October 17, 2017 from 11:30 to 12:30 UTC, where the outage is due to Sun colinearity.
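As a rough illustration of this postprocessing step, the pandas sketch below reindexes the label stack onto a regular 15-min grid and fills gaps of at most 6 h; the variable names and in-memory layout are assumptions and not part of the released dataset tooling.

```python
import numpy as np
import pandas as pd

def fill_short_gaps(frames, times, freq="15min", max_steps=24):
    """Linearly interpolate missing 15-min frames, but only for gaps of at
    most `max_steps` frames (24 x 15 min = 6 h). `frames` is a (T, H, W)
    array and `times` the matching DatetimeIndex of available observations."""
    t, h, w = frames.shape
    full_index = pd.date_range(times[0], times[-1], freq=freq)
    df = pd.DataFrame(frames.reshape(t, h * w).astype(float), index=times)
    df = df.reindex(full_index)                  # missing time steps become NaN
    df = df.interpolate(method="time", limit=max_steps, limit_area="inside")
    filled = df.to_numpy().reshape(len(full_index), h, w)
    return filled, full_index
```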

As stated in Section I, current global datasets [5], [13] for cloud forecasting and evaluation come with either low temporal granularity (one to multiple hours between images) or coarse spatial resolution (9  × 9 to 31  × 31 km) (see Table II). This demonstrates the need for our novel high-resolution dataset. In addition to the raw dataset, we also publish a standardized version for future studies, where we center and project the final annotated dataset to cover Central Europe, which implies a final resolution of 728  × 728 pixels. An example observation can be seen in Fig. 2. To support small-scale experiments and analysis, we also publish a downsampled low-resolution dataset of 15  × 15 km, which is significantly smaller in size compared to the full dataset.1

TABLE II. Comparison to Existing Global-Scale Datasets for Multilayer Cloud Types or Cloud Cover
Fig. 2. Example observation from the CloudCast dataset for April 1, 2017 13:00 UTC. From left to right: raw map of the area under investigation; multispectral raw satellite RGB composite consisting of two visible light images (0.6 and 0.8 μm) and one infrared image (10.8 μm); satellite-labeled CloudCast dataset. The 10 cloud types (“no clouds or missing data” is not assigned a cloud type for visualization purposes) can be seen in the colorbar below the right image.

SECTION IV.

Experiments

As an initial baseline study for our CloudCast dataset, we include several of the video prediction methods from our review in Section II. These methods have seen considerable success in similar atmospheric nowcasting studies recently [31], [33]. To match the resolution of most state-of-the-art video prediction methods [41], [42], we crop and transform our dataset using a stereographic projection to cover Central Europe with a spatial resolution of 128 × 128 pixels. We still use the full temporal resolution of 15-min intervals, compared to the hourly observations of other datasets mentioned in Section II. Several definitions of nowcasting exist, but they generally vary between 0–2 and 0–6 h [43], [44]. We select the forecast horizon to be 4 h ahead in 15-min increments (16 time steps), which lies in the middle of most definitions. While forecasting beyond 6 h is theoretically possible, we expect performance to deteriorate over time unless we incorporate additional variables that cannot be observed from satellite data alone to explain medium- to long-term cloud dynamics.
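To make the experimental setup concrete, the sketch below cuts the 15-min label sequence into 16-frame input windows paired with 16-frame targets (4 h in, 4 h out); the function and array names are ours and purely illustrative.

```python
import numpy as np

def make_windows(labels, in_len=16, out_len=16, stride=1):
    """Slice a (T, 128, 128) array of cloud labels into (input, target)
    pairs: `in_len` past frames predict the next `out_len` frames."""
    inputs, targets = [], []
    for start in range(0, labels.shape[0] - in_len - out_len + 1, stride):
        inputs.append(labels[start:start + in_len])
        targets.append(labels[start + in_len:start + in_len + out_len])
    return np.stack(inputs), np.stack(targets)
```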

We have divided the dataset into 1.5 years (75%) of training and 0.5 years (25%) of testing. Ideally, we would want our test data to cover all seasons of the year; however, the frequency distributions of the training and test sets are relatively similar for most classes, as seen in Table III. We also group the 10 cloud types into four classes based on height (a mapping sketch follows the list below):

TABLE III. Statistical Description of the CloudCast Dataset

1) no clouds;

2) low clouds;

3) medium clouds; and

4) high clouds.

This ensures a more natural ordering of the classes and enables us to focus on the major cloud types also present in the global NWP models [5], [13].
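A minimal sketch of this grouping follows, assuming an illustrative numbering of the 10 CloudCast labels; the actual ids follow Table I, which is not reproduced here.

```python
import numpy as np

# Hypothetical id assignment: 0 = no clouds/missing, 1-3 = low-level types,
# 4-5 = mid-level types, 6-9 = high-level types. Adjust to the ids in Table I.
GROUP_OF_TYPE = np.array([0, 1, 1, 1, 2, 2, 3, 3, 3, 3], dtype=np.int8)

def group_labels(label_img):
    """Map a (H, W) image of the 10 cloud-type ids onto the 4-class setting
    (0 = no clouds, 1 = low, 2 = medium, 3 = high)."""
    return GROUP_OF_TYPE[label_img]
```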

A. Benchmark Models

We present an initial benchmark for our dataset based on the reviewed methods in Section II-B. The results of the baseline models will be presented in Section IV-C along with the advantages and disadvantages of the chosen methodologies.

As stated in Section II-B, our chosen methods focus on applications that have been explicitly applied for atmospheric forecasting. This is the motivation behind the first three computer vision benchmark models that we outlined in Section II-B. The final benchmark is the simple persistence model typically used and recommended as a baseline in meteorology studies [1]. As these models are all suitable for the problem of cloud forecasting, they should provide good baselines for our dataset.

1) Autoencoder ConvLSTM (AE-ConvLSTM)

For our first baseline, we implement a variant of the ConvLSTM model from [31], where we introduce an autoencoder architecture with 2-D CNNs and use the ConvLSTM layers on the final encoded representation instead of directly on the input frames. This helps us to 1) encode the relevant spatial features from the input images before we start encoding and decoding the temporal representation, and 2) make training more memory efficient as ConvLSTM layers are memory-intensive. The autoencoder uses skip connections similar to UNet [45]. The motivation behind including skip connections for video prediction is to transfer static pixels from the input to the output images, making the model focus on learning the movement of dynamic pixels instead [41].

We start by reconstructing the first 16 input frames to initialize a spatiotemporal representation of the past cloud movement time-series. To predict 16 frames into the future, we use an autoregressive approach, where we feed the predicted output as input recursively to predict the next 16 steps. This is similar to the approach of other video prediction papers [28]. To improve the sharpness of our results without introducing an adversarial loss function, we have chosen to use the $\ell_1$ loss. Furthermore, we use the Adam optimizer with a batch size of 4, and we implement the optimization schedule used in the transformer paper [46], where we increase the learning rate linearly for a number of warmup rounds before decreasing it proportionally to the inverse square root of the current step number. We run the training schedule for 200 epochs with an initial learning rate of $\varepsilon = 2 \times 10^{-4}$, momentum parameters $\beta_1 = 0.90$ and $\beta_2 = 0.98$, and 400 warmup rounds. We notice only a slight improvement in our loss function after 100 epochs.
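The schedule can be expressed compactly; the PyTorch sketch below is an assumption about implementation details not given in the text (framework, and treating the stated rate as the peak of the schedule), but it reproduces the linear warmup followed by inverse-square-root decay with the stated hyperparameters.

```python
import torch

def make_optimizer(model, peak_lr=2e-4, warmup=400):
    """Adam with a Transformer-style schedule: the learning rate rises
    linearly for `warmup` steps and then decays proportionally to the
    inverse square root of the step number, peaking at `peak_lr`."""
    opt = torch.optim.Adam(model.parameters(), lr=peak_lr, betas=(0.90, 0.98))

    def lr_lambda(step):
        step = max(step, 1)
        return min(step / warmup, (warmup / step) ** 0.5)

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched  # call sched.step() after every optimizer step
```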

2) Multistage Dynamic Generative Adversarial Networks

For training and optimizing the MD-GAN model, we follow the original authors [29] with some differences. Since the MD-GAN paper focused on video generation rather than video prediction, we make necessary adjustments to the experimental design to account for this. Instead of cloning one input frame into 16 and feeding them to the generator, we feed the previous 16 images to the generator.

Besides these changes, we largely followed the approach in [29]. We found that keeping the learning rate fixed at 0.0002 did not produce satisfying results and often caused mode collapse for the generator. Instead, we employed the technique from [47], in which the discriminator is given a higher learning rate (0.0004) than the generator (0.0001). This overcomes situations where early mode collapse causes training to stall and instead incentivizes smaller steps for the generator to fool the discriminator.
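A short sketch of this two learning-rate setup, assuming PyTorch and commonly used GAN momentum terms (the betas are our assumption; only the two learning rates come from the text):

```python
import torch

def make_gan_optimizers(generator, discriminator):
    """Separate Adam optimizers with a faster discriminator (4e-4) than
    generator (1e-4), as described above."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.5, 0.999))
    return opt_g, opt_d
```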

We find that the training procedure is inherently unstable, a frequent issue for GANs [48]. This issue arises particularly for the second stage training, where training seems to stall after around 10–20 epochs.

3) TV-$L^1$ Optical Flow

We implement the TV-$L^1$ optical flow algorithm similar to the authors of [33]; it can capture cloud motion on multiple spatial scales and is one of the most popular optical flow algorithms for meteorological purposes [33]. One of the underlying assumptions of optical flow is constant pixel intensity over time [33]. This assumption is violated by cloud formation and dissipation; hence, the presence of convective clouds will negatively impact the accuracy of optical flow algorithms. As stated in Section II-B2, we can extrapolate multiple steps ahead in time by applying the estimated optical flow recursively to the predicted cloud images. The TV-$L^1$ algorithm effectively has 11 parameters. The authors of [33] optimized these by finding the lowest absolute bias among 21 different parameter settings, calculated using a variant of the area under the curve of absolute bias as a function of forecast time [33]. Instead, we conduct a grid search over a hyperspace of 360 different combinations chosen relative to the default values and the optimal hyperparameters found in [33]. As the estimated flows (and, by extension, the predicted pixel values) lie in a continuous real-valued space, we round our predictions to the nearest integer in our four-class setting.
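A simplified sketch of this extrapolation scheme is given below, using the TV-$L^1$ implementation shipped with opencv-contrib-python; the full recursive scheme, hyperparameter grid search, and class rounding described above are not reproduced, and the warping details are our assumption.

```python
import cv2
import numpy as np

def tvl1_extrapolate(prev_frame, curr_frame, steps=16):
    """Estimate TV-L1 optical flow between the two most recent 8-bit,
    single-channel frames and apply it recursively to produce `steps`
    future frames under a constant-motion assumption."""
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    flow = tvl1.calc(prev_frame, curr_frame, None)          # (H, W, 2), prev -> curr

    h, w = curr_frame.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Backward warping with the inverted flow moves the clouds forward in time.
    map_x = (grid_x - flow[..., 0]).astype(np.float32)
    map_y = (grid_y - flow[..., 1]).astype(np.float32)

    preds, frame = [], curr_frame
    for _ in range(steps):
        frame = cv2.remap(frame, map_x, map_y, interpolation=cv2.INTER_NEAREST)
        preds.append(frame)
    return np.stack(preds)
```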

4) Persistence

One of the recommended benchmark models in cloud evaluation studies is the persistence model [1]. Persistence refers to taking the most recent observation, in this case the 15-min lagged cloud-labeled satellite image, and replicating it 16 steps into the future. In the case of limited cloud motion, we expect this model to perform relatively well, but it is obviously naïve and will not work in dynamic weather situations. The most challenging part of video prediction is usually realistic motion generation; therefore, comparing other models to the persistence model shows how well a model has captured and predicted future cloud motion dynamics. Hence, the persistence model will serve as the baseline for skill score calculations.
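For completeness, the persistence baseline reduces to simple frame replication; the one-function sketch below assumes the cloud-label image as input.

```python
import numpy as np

def persistence_forecast(last_frame, steps=16):
    """Repeat the most recent cloud-label image `steps` times (4 h ahead)."""
    return np.repeat(last_frame[None, ...], steps, axis=0)
```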

B. Evaluation Metrics

Standardized evaluation metrics for video prediction with an emphasis on atmospheric applications are hard to come by. As stated in Section I, we select our evaluation metrics from the World Meteorological Organization [1]. Among the numerous available metrics, we select the ones with the highest ranking score in the referenced paper. As several of these are common in the computer vision and machine learning literature, we only go through the nonstandard metrics. Frequency bias, typically called the “bias score,” measures the total number of predicted events relative to observed events; any value above (below) 1 indicates that the model tends to overforecast (underforecast) events.

The first nonstandard metric is called the “Brier Score.” In this case, the Brier Score refers to the MSE between estimated probabilistic forecasts and binary outcomes. To extend it to the categorical multiclass setting, we sum the individual MSEs for all categorical probabilistic forecasts relative to the one-hot target class variable as follows: \begin{equation*} \text{BS}= \frac{1}{M} \frac{1}{N} \sum _{k=1}^{M} \sum _{t=1}^{N}\left(f_{t,k}-y_{t,k}\right)^{2} \tag{3} \end{equation*}

where $f_{t,k}$ denotes the predicted probabilities for all class-$k$ pixels in a given image at time $t$, $y_{t,k}$ the actual binary outcomes, $N=16$ the number of future time steps, and $M=4$ the number of classes. The second is the Brier Skill Score, calculated from the MSE of a given model relative to some benchmark forecast. As mentioned in Section IV-A4, we use the persistence forecast as the benchmark in skill score metrics. The formula for the Brier Skill Score is \begin{equation*} \text{BSS}= 1 -\frac{\text{BS}_{\text{model}}}{\text{BS}_{\text{persistence}}} \tag{4} \end{equation*}
where any value above (below) 0 implies superior (inferior) performance by the proposed model.
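A small NumPy sketch of (3) and (4) follows; whether the squared error is summed or averaged over the pixels of each image is not spelled out above, so the per-pixel mean used here is our assumption.

```python
import numpy as np

def brier_score(probs, onehot):
    """Multiclass Brier Score as in (3). `probs` holds predicted class
    probabilities and `onehot` the one-hot targets, both of shape
    (N, M, H, W) = (time steps, classes, height, width)."""
    n, m = probs.shape[:2]
    per_class_mse = np.mean((probs - onehot) ** 2, axis=(2, 3))  # shape (N, M)
    return per_class_mse.sum() / (n * m)

def brier_skill_score(bs_model, bs_persistence):
    """BSS as in (4); values above 0 beat the persistence baseline."""
    return 1.0 - bs_model / bs_persistence
```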

In addition to these metrics, we also include video prediction metrics from the computer vision literature, which, taken together with the meteorology metrics, should constitute the fairest evaluation in this complex setting. These include the SSIM and the PSNR.
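Both metrics are available in scikit-image; the snippet below shows how they could be evaluated on a single predicted label frame, with `data_range=3` reflecting the 0-3 class values of the four-class setting. Applying them to categorical label maps mirrors the evaluation described here, but the exact call pattern is our assumption.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def psnr_ssim(pred, target, data_range=3):
    """PSNR and SSIM between one predicted and one ground-truth frame of
    class labels in {0, 1, 2, 3}."""
    psnr = peak_signal_noise_ratio(target, pred, data_range=data_range)
    ssim = structural_similarity(target, pred, data_range=data_range)
    return psnr, ssim
```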

C. Results

The results of our baseline methods can be found in Table IV. We also include an example of model forecasts relative to ground truth in Fig. 3.

TABLE IV. Key Evaluation Metrics Based on [1] for Our Proposed AE-ConvLSTM Model, the Second-Stage MD-GAN Model, the TV-$L^1$ Optical Flow Model, and the Simple Persistence Model. The Reference Forecast Used in the Brier Skill Score Metric Is the Persistence Model. SSIM and PSNR Refer to the Structural Similarity Index and Peak Signal-to-Noise Ratio, Respectively
Fig. 3. Example forecast for all models relative to ground truth. t refers to forecast time in 15-min increments. We include two input images corresponding to 2 and 1 h before, and four output images corresponding to 1–4 h ahead predictions in 1-h increments.

Despite the proposed models showing relatively high overall accuracies, it is quite clear that none of the models show consistent performance across time and space. This is also evident when looking at the decline in accuracy across time. This suggests that we need to 1) develop models more suitable for this particular problem, or 2) incorporate other data sources or variables to make more reasonable and causal predictions for the complex setting of multilayer cloud movement and formation.

We include a visualization of the worst predictions from the test set measured by mean accuracy in Fig. 4. Looking at the failure cases for MD-GAN S2 and AE-ConvLSTM, we observe that they struggle with situations where clouds are primarily scattered. This is unsurprising given that these models tend to generate predictions that are generally clustered and moderately blurry. The TVL1 model projects a considerable movement of clouds that is incorrect. The underlying reason could be the dissipation of clouds from the input images used in the optical flow estimation, which would violate the constant pixel intensity assumption. The persistence model achieves poor performance in situations with substantial motion as in Fig. 4.

Fig. 4. Worst predictions from our test dataset on CloudCast using the proposed benchmark models compared to ground truth. Worst is defined as having the lowest mean accuracy among all test images for each model. The difference plots are calculated using the absolute difference between the predicted and ground truth images.

1) Autoencoder ConvLSTM

The AE-ConvLSTM method achieves the highest accuracy on our dataset measured both temporally and spatially on all but medium clouds. For the BSS metric, we notice superior performance relative to the persistence model with a value of 0.11. This implies that ConvLSTM layers applied to cloud-labeled satellite images do capture spatiotemporal motion to some extent. On the other hand, we see in Fig. 3 that predictions become increasingly blurry over time. This is in alignment with the discussion in Section II-B1.

2) Multistage Dynamic Generative Adversarial Networks

The MD-GAN model outperforms the persistence model with a Brier Skill Score of 0.07. The categorical accuracy is not captured well, as MD-GAN achieves the lowest accuracy for medium clouds among all our models. The temporal accuracy closely matches that of the AE-ConvLSTM model, especially for the 2-h forecast. Thus, by improving the stability and the initial forecasting accuracy of the MD-GAN model, we expect it could become the best and most consistent model.

3) TV-$L^1$ Optical Flow

The TVL1 algorithm shows marginally superior performance relative to the persistence model with a BSS of 0.02. The primary reason behind the close performance of TVL1 and persistence relates to the choice of hyperparameters for the TVL1 algorithm, where hyperparameters yielding more static movement generally implied better performance across time. We believe the underlying reason for this result is the complexity of forecasting multilayer clouds 16 steps ahead combined with the violation of the optical flow assumption of having constant brightness intensity over time. Compared to AE-ConvLSTM and MD-GAN, it achieves lower overall and temporal accuracy but does reach higher accuracy for medium clouds. While optical flow methods have been popular for atmospheric forecasting, as stated in Section II-B2, their application to multilayer cloud types has not been fully researched yet. Hence, the proposed machine learning methods currently seem more appropriate for this task given their superior performance.

4) Persistence

The simple persistence model achieves relatively good results. The high short-term accuracy is not surprising given the limited cloud movement for 1-h ahead. Due to its static nature, however, it achieves the lowest accuracy near the end of our forecasting horizon.

SECTION V.

Conclusion

We introduce a novel dataset for cloud forecasting called CloudCast, which consists of pixel-labeled satellite images with multilayer clouds of high temporal and spatial resolution. The dataset facilitates the development and evaluation of methods for atmospheric forecasting and video prediction in both the vertical (height) and horizontal (latitude–longitude) domains. Four different cloud nowcasting models were evaluated on this dataset based on recent advancements in the machine learning literature for video prediction and traditional methods from the meteorology and computer vision literature. Several evaluation metrics based on best practices in cloud forecasting studies were proposed in addition to the PSNR and SSIM. The four models provided an initial benchmark for this dataset but showed ample room for improvement, especially for predictions near the end of our forecasting horizon. Hybrid methods combining machine learning and NWP could be interesting approaches to address medium to long-term forecasting in a future study.

We hope this novel dataset will help advance and stimulate the development of new data-driven methods for atmospheric forecasting in a field heavily dominated by physics and numerical methods.
