CloudCast: A Satellite-Based Dataset and Baseline for Forecasting Clouds

Forecasting the formation and development of clouds is a central element of modern weather forecasting systems. Incorrect cloud forecasts can lead to major uncertainty in the overall accuracy of weather forecasts due to their intrinsic role in the Earth's climate system. Few studies have tackled this challenging problem from a machine learning point-of-view due to a shortage of high-resolution datasets with many historical observations globally. In this article, we present a novel satellite-based dataset called “CloudCast.” It consists of 70 080 images with 10 different cloud types for multiple layers of the atmosphere annotated on a pixel level. The spatial resolution of the dataset is 928  × 1530 pixels (3  × 3 km per pixel) with 15-min intervals between frames for the period January 1, 2017 to December 31, 2018. All frames are centered and projected over Europe. To supplement the dataset, we conduct an evaluation study with current state-of-the-art video prediction methods such as convolutional long short-term memory networks, generative adversarial networks, and optical flow-based extrapolation methods. As the evaluation of video prediction is difficult in practice, we aim for a thorough evaluation in the spatial and temporal domain. Our benchmark models show promising results but with ample room for improvement. This is the first publicly available global-scale dataset with high-resolution cloud types on a high temporal granularity to the authors’ best knowledge.


I. INTRODUCTION
C LOUD FORECASTING remains one of the major unsolved challenges in meteorology, where cloud errors have wide-reaching impacts on the overall accuracy of weather forecasts [1], [2]. Due to the vertical and horizontal nature of clouds, there is an intrinsic difficulty in measuring clouds quantitatively and evaluating the performance of cloud forecasts. This inability to accurately parameterize and thus quantify clouds, convective effects, and aerosols on a sub-grid scale in weather models is one reason model estimates can carry major uncertainties [2].
The primary source of quantitative weather forecasts comes from numerical weather prediction (NWP) systems. For these numerical methods, we model the future using governing equations from the field of atmospheric physics [3]. Over the past decades, there have been tremendous improvements in weather prediction owing to increased computational power, integration of new theory, and assimilation of large amounts of data. Regardless, these atmospheric simulations are still computationally expensive and operate on coarse spatial scales (9x9 km or above per pixel) [4], [5]. Furthermore, the current amount of atmospheric data collection exceeds hundreds of petabytes per day [6], implying that data collection far outpaces our ability to analyze and assimilate it. As a consequence of this, the authors behind [4] argue that we face two substantial challenges in this field for the future; 1) gaining knowledge from these extreme amounts of data and 2) developing models that tend to be more data-driven compared to traditional approaches while still abiding the laws of physics. One recent application found discrepancies in climate models' estimation of photosynthesis in the tropical rainforests, which ultimately led to a more accurate description of these processes globally [7], [8]. Ideally, similar insights can be discovered from datadriven methods for cloud dynamics, but obtaining adequate observations of clouds globally has been a substantial obstacle for developing data-driven cloud forecasting methods to-date.
To tackle this problem and spark further research into datadriven atmospheric forecasting, we introduce a novel satellitebased dataset called "CloudCast" that facilitates the evaluation of cloud forecasting methods with a global perspective. This approach has been paramount to progress in state-of-the-art methods in the computer vision literature with datasets such as MNIST [9], ImageNet [10], and CIFAR10 [11]. Current datasets for global cloud forecasting exhibit coarse spatial resolution (9x9km to 31x31km) and low temporal granularity (one-to multiple hours between images) [5], [12]- [15]. We overcome both these issues by using geostationary satellite images, arguably the most consistent and regularly sampled global data source for clouds [1]. Since these satellites can obtain images every 5 to 15 minutes with a relatively high spatial resolution (1x1km to 3x3km), they provide an essential ingredient for developing data-driven weather systems, which is an abundance of historical observations. It is possible to achieve higher accuracy in the vertical dimension with radarand lidar-based profiling methods [16], but these falls short on the temporal resolution due to not being geostationary. Our contributions are as follows: • We present a novel satellite-based dataset designed for cloud forecasting. The dataset has 10 different cloud types for multiple layers of the atmosphere annotated on a pixel level. It consists of 70,080 images with a spatial resolution of 928 x 1530 pixels (3x3 km) and 15min sampling intervals from 2017-01-01 to 2018-12-31. All frames are centered and projected over Europe. To the authors' best knowledge, no equivalent dataset with high spatial-and temporal resolution exists for evaluating multi-layer cloud forecasting methods globally.
• We evaluate four video prediction methods to serve as benchmarks for our dataset by predicting four hours into the future. Two of these are based on recent advancements in machine learning methods specifically for applications in atmospheric forecasting.
• To evaluate our results, we present an evaluation study for measuring cloud forecasting accuracy in satellitebased systems. The evaluation design is based on best practices from the World Meteorological Organization when conducting cloud evaluation studies [1], which includes widely tested statistical metrics for categorical forecasts. Furthermore, we implement the Peak Signalto-Noise Ratio and Structural Similarity Index from the computer vision literature. The combination of these two domains should provide the best and most fair evaluation of our results. The remainder of the paper is organized as follows. Section II provides an overview of the related work. Section III describes the new dataset. Section IV provides the evaluation study for measuring cloud forecasting accuracy in satellitebased systems. Finally, conclusions are drawn in Section V.

II. RELATED WORK
We start by briefly reviewing related datasets commonly used in the cloud forecasting literature. After presenting related datasets, we will review methods for video prediction that are particularly suitable for forecasting in the spatiotemporal domain.

A. Related Datasets
We want to introduce a dataset to the community that is particularly suitable for developing data-driven methods with a global perspective. As a result, we only consider 1) geostationary satellite and 2) model-based observations. While other cloud datasets do exist with very high resolution and accuracy for localized atmospheric analysis, such as those from sky-imagers or radar satellites, these technologies generally provide poor spatial coverage, limiting their global use-case.
1) Satellite-Based Cloud Observations: Before introducing related satellite datasets, it is essential to differentiate between cloud detection versus cloud types for meteorological purposes such as forecasting, as the literature varies considerably between the two. Generally speaking, satellite-based cloud detection, typically with the objective of cloud removal, is a relatively established field with many accurate methods [17], [18]. For meteorological purposes, most methods typically deal with binary cloud masks [19]. As outlined earlier, we are specifically interested in multi-layer cloud types for forecasting purposes, limiting the amount of related literature and datasets quite remarkably. When we focus on multi-layer cloud types, satellite-based cloud observations can be divided into raw infrared brightness temperature and satellite-derived cloud measurements [1].
Raw satellite brightness temperature acts as a proxy for cloud top height, which will be available during day and night.
This proxy is not a perfect indicator for multi-layer cloud types, as the temperature of clouds can also be explained by other factors such as the specific cloud type and seasonal variations.
Satellite-derived cloud measurements typically involve a brightness-based algorithm that can extrapolate variables such as cloud mask, type, and height from multispectral images [12]. This is the approach we adopt for our novel dataset, as it can classify multi-layer cloud types with relatively high spatial and temporal resolution in near real-time [12], [20]. To the authors' best knowledge, only a few related datasets for satellitederived multi-layer clouds exist. For the European Meteosat Second Generation (MSG) satellites, these are Cloud Analysis and Cloud Analysis Image published by EUMETSAT [14], [15]. These products exhibit either coarse spatial resolution (9x9km) or infrequent temporal sampling (one to three hours between images) [12]. As we derive our cloud types directly from the raw satellite images, we can maintain the high resolution (3x3 km) and temporal granularity (15-minute sampling) of the raw satellite images. While our dataset is made from the MSG satellites, our approach is not limited to any specific geostationary satellite system and could be extended for other constellations. Outside Europe, similar datasets exist for a) the American GOES-R satellites called ABI Cloud Height [21] and b) the Japanese Himawari-8 called Cloud Top Height Product [22]. Due to their geographical coverage not extending to Europe, they are not directly comparable to ours nor those of EUMETSAT.
2) Model-Based Cloud Observations: Model-based clouds are measured using output from an NWP model and is the most directly comparable alternative to satellite observations due to its global coverage. Two commonly used global NWP models are the European ECMWF atmospheric IFS model [5] and the American GFS model [13]. The resolution varies between the two, but the ECMWF model offers the highest spatial resolution with 9x9 km grid spacing [5]. As both models are global, they can be used interchangeably. The advantage of using NWP model output is that a physics-based simulation of the future exists, while the clear disadvantage is the coarse spatial resolution compared to satellite. Other NWP models also exist on a much finer spatial scale, but these are restricted to local areas, usually on a country-basis [23], [24].
Given we want to establish a global reference dataset for data-driven methods, the operational ECMWF model is considered the best global model-based multi-layer cloud dataset. Compared to CloudCast, the ECMWF dataset is inferior in spatial resolution (9x9 km vs. 3x3 km) and temporal resolution (1 hour vs. 15 minutes). Furthermore, the operational ECMWF is not open-source, making it less relevant for the machine learning and computer vision community.

B. Methods
Producing accurate and realistic video predictions in pixelspace is an open problem to date. Extrapolating frames in the near future can be done relatively accurately, but once the future sequence length grows, so does the inherent uncertainty of the predicted pixel values. Several approaches have been proposed for solving this complex and high-dimensional task: spatiotemporal-transformer networks [25], variational autoencoders [26], generative adversarial networks (GANs) [27]- [29] and recurrent-convolutional neural networks [30], [31]. In the video prediction literature, the tasks are often governed by relatively simple physics, such as the Moving MNIST dataset [32]. However, for predicting atmospheric flow, the task becomes bound by much more complex physics. Therefore, our chosen methods focus on applications that have been explicitly applied for atmospheric forecasting, which will justify the chosen benchmark models for our dataset. These are; 1) Convolutional Long Short-Term Memory Networks (ConvLSTM) [31], 2) Multi-Stage Dynamic Generative Adversarial Networks (MD-GAN) [29] and TV -L 1 Optical Flow (TVL1) [33].

1) Convolutional-and Recurrent Neural Networks (ConvL-STM):
ConvLSTM was originally developed for precipitation nowcasting using radar images. It is considered the seminal paper for atmospheric forecasting using deep learning, making it relevant as a baseline for our dataset. While newer LSTMbased video prediction methods have been proposed following the ConvLSTM paper, such as PredRNN++ [34] and Eidetic-3D LSTM [30], these were not applied nor evaluated on any atmospheric-related datasets, meaning they are outside the scope defined in Section I.
2) Optical Flow-Based Video Prediction: While optical flow is a classical topic in the computer vision literature, it is also one of the most important methods for global data assimilation in meteorology [35]. Optical flow has been applied for video prediction in several papers [33], [36]. In [33], the authors implement the T V -L 1 optical flow method introduced by [37] to capture cloud motions on multiple scales spatially based on two subsequent raw satellite images. Once estimated, the flow fields are inverted and applied for extrapolating multiple steps in the future to serve as a forecast of the effective cloud albedo.
3) Generative Adversarial Networks: GANs have been applied for video prediction in several recent papers [28], [29], [38]. One of these [29] achieved state-of-the-art results for generating 32-frame time-lapse videos with 128 x 128 resolution of cloud movement in the sky using only one frame as input. The authors use a two-stage generative adversarial network-based approach (MD-GAN), where the first-stage model is responsible for generating an initial video of realistic photos in the future with coarse motion. The second stage model then refines the initially generated video by enforcing motion dynamics using the Gram matrix in the intermediate layers of the discriminator.

III. DATASET DESCRIPTION
The CloudCast dataset contains 70,080 cloud-labeled satellite images with 10 different cloud types corresponding to multiple layers of the atmosphere, as seen in Table I. As stated in Section II, we apply a satellite-derived cloud measurement approach. The procedure for generating our dataset is as follows (see Figure 1 for a visual inspection): • Acquire 70,080 samples for the period 2017-01 to 2018-12. The samples originate from the MSG satellites with Mid-level clouds 4 High opaque clouds 5 Very high opaque clouds 6 Fractional clouds 7 High semitransparent thin clouds 8 High semitransparent moderately thick clouds 9 High semitransparent thick clouds 10 High semitransparent above low or medium clouds four different channels per sample (280,320 satellite images), with 15 minute sampling intervals and 3x3 km spatial resolution.
• Collect hourly NWP output for the entire period using the ECMWF operational model (exact variables to be elaborated shortly).
• Annotate the 70,080 samples on a pixel-level using the multi-layer segmentation algorithm (to be described shortly).
• Conduct post-processsing to account for short-term missing observations.
• Generate and publish a 1) standardized version of the full resolution dataset and a 2) spatially downsampled version in addition to the raw dataset to serve as benchmark for future studies.
We will now elaborate on the above steps in more detail. As stated above, we start by collecting the 70,080 raw multispectral satellite images from EUMETSAT. These images come from a satellite constellation in geostationary orbit centered at zero degrees longitude and arrive in 15-minute intervals. The resolution is 3712 x 3712 pixels for the full-disk of the Earth, which implies that every pixel corresponds to a space of dimensions 3x3km. In the remote sensing community, it is well known that infrared channels can observe clouds differently than visible light, meaning infrared is necessary for low-and medium cloud detection. Therefore, we sample one visible channel, two infrared channels and one water vapor channel for each observation to enable multi-layer cloud detection. The size of the entire raw satellite dataset is around 16 TB. Due to download and request limits imposed by EUMETSAT, we can only process a certain number of samples at any given time, which meant we had to divide this process over a couple of months.
Next, we annotate each sample on a pixel level using a segmentation algorithm originally developed by [20] under the European Organisation for Meteorological Satellites -Satellite Application Facility on Support to Nowcasting and Very Short Range Forecasting (NWCSAF) project [39]. This algorithm is essentially a threshold algorithm applied at the pixel level for our multispectral satellite images. To improve multi-layer cloud detection in the segmentation algorithm, we include climatological variables and metadata such as geographical land-sea masks and viewing geometry, which have shown to improve low and mid-level cloud detection considerably [20]. Additionally, we also include NWP output to further improve the segmentation algorithm by having data not observable from satellite data. We collect the NWP data from the ECMWF operational model, which includes surface temperature, air temperatures at five different heights (950 hPa, 850 hPa, 700 hPa, 500 hPa, and the tropopause level), total water vapor content of the atmosphere, and metadata for the ECMWF model grid.
Having established all the required datasets for the segmentation algorithm, we will outline how the thresholds are calculated for the major cloud types, which are primarily based on illumination conditions, viewing geometry, geographical location, and the NWP data. We will not list all the specific threshold values due to the sheer number of threshold values, which varies between daytime, nighttime, and twilight. If the reader is interested in these specific values, please see [20].
The first set of clouds is high semitransparent clouds versus opaque (thin) clouds, also called fractional clouds. To separate these, we use the differences between a) the infrared channels 8.7µm versus 10.8µm and b) channels 10.8µm versus 12.0µm. This is based on the insight that these differences are typically larger for cirrus clouds compared to thick clouds [20]. To improve separation during daytime where we have visible light available, we apply an additional threshold based on the illumination from the 0.6µm channel. This can be done due to cirrus clouds having lower reflectance values than opaque clouds with the same radiative temperature.
Once we have identified semitransparent and fractional clouds, we classify the remaining cloudy pixels into either low-, mid and high clouds found in Table I. This separation is simpler; hence we can calculate the threshold based on the 10.8µm brightness temperature, which directly correlates with cloud height. Atmospheric instability still impacts this separation, which implies we need to correct the previously outlined NWP forecast of air temperatures at different pressure levels. These allow us to separate very low from low clouds, low from medium clouds, medium from high clouds, and high from very high clouds.
We can define the specific thresholds as These thresholds were found in [20] to yield the best results.
Having defined these thresholds, we can now define our cloud annotation procedure for the remaining cloud types: In practice the 7.3µm channel and the secante of the satellite zenith angle is used to reduce the risk of classifying a low cloud as a medium cloud. For notational simplicity we have left these out from the above but they can be found in [20].
While the segmentation algorithm is considered accurate, there are a few limitations. The primary limitation is the scenario where low clouds are sometimes classified as medium clouds in case of e.g., strong thermal inversion, despite the corrections being made with the 7.3µm channel and solar zenith angle. Nevertheless, we want to highlight that extensive validation of the segmentation algorithm has been carried out using both space-born lidar and ground-based observations to verify its accuracy, which are considered among the most accurate methods for ground-truth observations of clouds [20].
As a final post-processing step, we interpolate missing observations that can arise due to numerous reasons such as scheduled outages or sun outages. More specifically, we interpolate the missing observations from neighboring values linearly, which only happens for short-term periods (below 6 hours). The list of outages at the satellite level can be found at the EUMETSAT website [40]. One specific example is the 2017-10-17 from 11.30 to 12.30 UTC, where the outage is due to Sun co-linearity.
As stated in Section I, current global datasets [5], [13] for cloud forecasting and evaluation come with either low temporal granularity (one-to multiple hours between images) or coarse spatial resolution (9x9km to 31x31 km) (see Table  II). This demonstrates the need for our novel high-resolution dataset. In addition to the raw dataset, we also publish a standardized version for future studies, where we center and project the final annotated dataset to cover Central Europe, which implies a final resolution of 728 x 728 pixels. An example observation can be seen in Figure 2. To support small-scale experiments and analysis, we also publish a downsampled lowresolution dataset of 15x15 km, which is significantly smaller in size compared to the full dataset. Our novel dataset can be found at https://vision.eng.au.dk/cloudcast-dataset/.

IV. EXPERIMENTS
As an initial baseline study for our CloudCast dataset, we include several of the video prediction methods from our review in Section II. These methods have seen considerable success in similar atmospheric nowcasting studies recently [31], [33]. To match the resolution of most state-of-the-art video prediction methods [41], [42], we crop and transform our dataset using a stereographic projection to cover Central Europe with a spatial resolution of 128x128. We still use the full temporal resolution of 15-minute intervals compared to hourly observations for other datasets, as mentioned in Section II. Several different definitions of nowcasting exist, but they generally vary between 0-2 hours and 0-6 hours [43], [44]. We select the future time frame to be four hours ahead in 15minute increments (16 time steps), which is somewhere in the middle of most definitions. While forecasting beyond 6 hours is theoretically possible, we expect performance to deteriorate over time unless we incorporate additional variables that cannot be observed from satellite data alone to explain the more medium to long-term cloud dynamics. We have divided the dataset into 1.5 years (75%) of training and 0.5 years (25%) of testing. Ideally, we would want our test data to cover all seasons of the year. However, the frequency distribution between training and test are relatively similar for most classes as seen in Table III. We also group the 10 cloud types into four based on height: a) no clouds, b) low clouds, c) medium clouds, and d) high clouds. This ensures a more natural ordering of the classes and enables us to focus on the major cloud types also present in the global NWP models [5], [13].

A. Benchmark Models
We present an initial benchmark for our dataset based on the reviewed methods in Section II-B. The results of the baseline models will be presented in Section IV-C along with the advantages and disadvantages of the chosen methodologies.
As stated in Section II-B, our chosen methods focus on applications that have been explicitly applied for atmospheric forecasting. This is the motivation behind the first three computer vision benchmark models that we outlined in Section II-B. The final benchmark is the simple persistence model typically used and recommended as a baseline in meteorology studies [1]. As these models are all suitable for the problem of cloud forecasting, they should provide good baselines for our dataset.

1) Autoencoder ConvLSTM (AE-ConvLSTM):
For our first baseline, we implement a variant of the ConvLSTM model from [31], where we introduce an autoencoder architecture with 2D CNNs and use the ConvLSTM layers on the final encoded representation instead of directly on the input frames. This helps us to a) encode the relevant spatial features from the input images before we start encoding and decoding the temporal representation, and b) make training more memory efficient as ConvLSTM layers are memory-intensive. The autoencoder uses skip connections similar to UNet [45]. The motivation behind including skip connections for video prediction is to transfer static pixels from the input to the output images, making the model focus on learning the movement of dynamic pixels instead [41].
We start by reconstructing the first 16 input frames to initialize a spatiotemporal representation of the past cloud movement time-series. To predict 16 frames into the future, we use an autoregressive approach, where we feed the predicted output as input recursively to predict the next 16 steps. This is similar to the approach of other video prediction papers [28]. To improve the sharpness of our results without introducing an adversarial loss function we have chosen to use the ℓ 1 loss. Furthermore, we use the Adam optimizer with batch size = 4, and we implement an optimization schedule used in the Transformer paper [46], where we increase the learning rate linearly for a number of warmup rounds before decreasing it proportionally to the inverse square root of the current step number. We run the training schedule for 200 epochs with an initial learning rate of ε = 2e −4 and momentum parameters β 1 = 0.90, β 2 = 0.98 and 400 warmup rounds. We notice only a slight improvement in our loss function after 100 epochs.

2) Multi-Stage Dynamic Generative Adversarial Networks (MD-GAN):
For training and optimizing the MD-GAN model, we follow the original authors [29] with some differences. Since the MD-GAN paper focused on video generation rather than video prediction, we make necessary adjustments to the experimental design to account for this. Instead of cloning one input frame into 16 and feeding them to the generator, we feed the previous 16 images to the generator.
Besides these changes, we largely followed the approach in [29]. We found that having the learning rate fixed at 0.0002 did not produce satisfying results and often caused mode collapse for the generator. Instead, we employed the technique found in the paper [47], where you set a higher learning rate for the discriminator (0.0004) than the generator (0.0001). This overcomes situations where early mode collapse causes training to stall, and instead incentivizes smaller steps for the generator to fool the discriminator. We find that the training procedure is inherently unstable, a frequent issue for GANs [48]. This issue arises particularly for the second stage training, where training seems to stall after around 10-20 epochs.
3) TV-L 1 Optical Flow (TVL1): We implement the optical flow algorithm T V -L 1 similar to the authors of [33], which can capture cloud motion on multiple spatial scales and is one of the most popular optical flow algorithms for meteorological purposes [33]. One of the underlying assumptions of optical flow is constant pixel intensity over time [33]. This is violated due to cloud formation and dissipation. Hence, the presence of convective clouds will negatively impact the accuracy of optical flow algorithms. As stated in Section II-B2, we can extrapolate multiple steps ahead in time using the estimated optical flow recursively on the predicted cloud images. The T V -L 1 algorithm effectively has 11 parameters. The authors [33] optimized these by finding the lowest absolute bias among 21 different parameter settings calculated using a variant of the area-under-the-curve calculation over absolute bias as a function of forecast time [33]. Instead, we conduct a grid search over a hyperspace of 360 different combinations chosen relative to the default values and the optimal hyperparameters found in [33]. As estimated flows (and by extension, the predicted pixel values) will lie in a continuous real-valued space, we round our predictions to the nearest integer in our four-class setting. 4) Persistence: One of the recommended benchmark models in cloud evaluation studies is called a persistence model [1]. Persistence refers to the most recent observation, which in this case is the 15-minute lagged cloud-labeled satellite image, replicated 16 steps into the future. Under the case of limited cloud motion, we expect this model to perform relatively well, but obviously it is naïve and will not work in dynamic weather situations. The most challenging part of video prediction is usually realistic motion generation, and therefore comparing other models to the persistence model shows how well the model has captured and predicted future cloud motion dynamics. Hence, the persistence model will serve as the baseline for skill score calculations.

Input Frames
Output Frames Ground Truth

MD-GAN S2
AE-ConvLSTM TVL1 Persistence Fig. 3. Example forecast for all models relative to ground truth. t refers to forecast time in 15-minute increments. Here we include two input images corresponding to two-and one-hour before. We also include four output images corresponding to one-to four hours ahead predictions in one-hour increments.

B. Evaluation Metrics
Standardized evaluation metrics for the video prediction domain emphasizing atmospheric applications are hard to come by. As stated in Section I, we select our evaluation metrics from the World Meteorological Organization [1]. Due to numerous available metrics, we select the ones with the highest ranking score in the referenced paper. As several of these are common in the computer vision and machine learning literature, we only go through the non-standard metrics. Metrics such as Frequency Bias is typically called "bias score", and measures the total number of predicted events relative to observed events. Any value above (below) one indicates the model tends to overforecast (underforecast) events.
The first non-standard metric is called "Brier Score". In this case, the Brier Score refers to the MSE between estimated probabilistic forecasts and binary outcomes. To extend it to the categorical multi-class setting, we sum the individual MSE's for all categorical probabilistic forecasts relative to the one-hot target class variable as follows where f t,k is the predicted probabilities for all class k pixels in a given image at time t, y t,k the actual binary outcomes, N = 16 the number of future time steps and M = 4 the number of classes. The second is called Brier Skill Score, and it is calculated using the MSE of a given model relative to some benchmark forecast. As mentioned in Section IV-A4, we use the persistence forecast as the benchmark in skill score metrics. The formula for the Brier Skill Score is where any value above (below) zero implies superior (inferior) performance by the proposed model. In addition to these metrics, we also include video prediction metrics from the computer vision literature, which, taken together with the meteorology metrics, should constitute the fairest evaluation in this complex setting. These include the Structural Similarity Index (SSIM) and the Peak Signal-to-Noise Ratio (PSNR).

C. Results
The results of our baseline methods can be found in Table  IV. We also include an example of model forecasts relative to ground truth in Figure 3.
Despite the proposed models showing relatively high overall accuracies, it is quite clear that none of the models show consistent performance across time and space. This is also evident when looking at the decline in accuracy across time. This suggests that we need to a) develop models more suitable for this particular problem or b) incorporate other data sources or variables to make more reasonable and causal predictions for the complex setting of multi-layer cloud movement and formation.
We include a visualization of the worst predictions from the test set measured by mean accuracy in Figure 4. Looking at the failure cases for MD-GAN S2 and AE-ConvLSTM, we observe that they struggle with situations where clouds are primarily scattered. This is unsurprising given that these models tend to generate predictions that are generally clustered and moderately blurry. The TVL1 model projects a considerable movement of clouds that is incorrect. The underlying reason could be the dissipation of clouds from the input images used in the optical flow estimation, which would violate the constant pixel intensity assumption. The Persistence model achieves poor performance in situations with substantial motion as in Figure 4.

1) Autoencoder ConvLSTM (AE-ConvLSTM):
The AE-ConvLSTM method achieves the highest accuracy on our dataset measured both temporally and spatially on all but medium clouds. For the BSS metric, we notice superior performance relative to the persistence model with a value of 0.11. This implies the application of ConvLSTM layers for cloud-labeled satellite images do capture spatiotemporal motion to some extent. On the other hand, we see in Figure  3 that predictions become increasingly blurry over time. This is in alignment with the discussion in Section II-B1.

2) Multi-Stage Dynamic Generative Adversarial Networks (MD-GAN):
The MD-GAN model outperforms the persistence model with a Brier Skill Score of 0.07. The categorical accuracy is not captured well, as MD-GAN achieves the lowest accuracy for medium clouds between all our models. The temporal accuracy is closely matched to the ConvLSTM model, especially for the 2-hour forecast. Thus, by improving the stability and the initial forecasting accuracy of the MD-  Fig. 4. Worst predictions from our test dataset on CloudCast using the proposed benchmark models compared to ground truth. Worst is defined as having the lowest mean accuracy among all test images for each model. The difference plots are calculated using the absolute difference between the predicted and ground truth images.
GAN model, we expect it could become the best and most consistent model.

3) TV-L 1 Optical Flow (TVL1):
The TVL1 algorithm shows marginally superior performance relative to the persistence model with a BSS of 0.02. The primary reason behind the close performance of TVL1 and Persistence relates to the choice of hyperparameters for the TVL1 algorithm, where hyperparameters yielding more static movement generally implied better performance across time. We believe the underlying reason for this result is the complexity of forecasting multi-layer clouds 16-steps ahead combined with the violation of the optical flow assumption of having constant brightness intensity over time. Compared to AE-ConvLSTM and MD-GAN, it achieves lower overall-and temporal accuracy but does reach higher accuracy for medium clouds. While optical flow methods have been popular for atmospheric forecasting, as stated in Section II-B2, their application to multi-layer cloud types has not been fully researched yet. Hence, the proposed machine learning methods currently seem more appropriate for this task given their superior performance. 4) Persistence: The simple persistence model achieves relatively good results. The high short-term accuracy is not surprising given the limited cloud movement for one-hour ahead. Due to its static nature, however, it achieves the lowest accuracy near the end of our forecasting horizon.

V. CONCLUSIONS
We introduce a novel dataset for cloud forecasting called CloudCast, which consists of pixel-labeled satellite images with multi-layer clouds of high temporal and spatial resolution. The dataset facilitates the development and evaluation of methods for atmospheric forecasting and video prediction in both the vertical (height) and horizontal (latitude-longitude) domains. Four different cloud nowcasting models were evaluated on this dataset based on recent advancements in the machine learning literature for video prediction and traditional methods from the meteorology-and computer vision literature. Several evaluation metrics based on best practices in cloud forecasting studies were proposed in addition to the Peak Signal-to-Noise Ratio and Structural Similarity Index. The four models provided an initial benchmark for this dataset but showed ample room for improvement, especially for predictions near the end of our forecasting horizon. Hybrid methods combining machine-learning and NWP could be interesting approaches to address medium to long-term forecasting in a future study.
We hope this novel dataset will help advance and stimulate the development of new data-driven methods for atmospheric forecasting in a field heavily dominated by physics and numerical methods.