Introduction
Cloud forecasting remains one of the major unsolved challenges in meteorology, where cloud errors have wide-reaching impacts on the overall accuracy of weather forecasts [1], [2]. Because clouds extend both vertically and horizontally, they are intrinsically difficult to measure quantitatively, which in turn makes cloud forecasts hard to evaluate. This inability to accurately parameterize, and thus quantify, clouds, convective effects, and aerosols on a subgrid scale in weather models is one reason model estimates can carry major uncertainties [2].
The primary source of quantitative weather forecasts is numerical weather prediction (NWP) systems, which model the future atmosphere using governing equations from atmospheric physics [3]. Over the past decades, there have been tremendous improvements in weather prediction owing to increased computational power, integration of new theory, and assimilation of large amounts of data. Nevertheless, these atmospheric simulations are still computationally expensive and operate on coarse spatial scales (9 × 9 km or above per pixel) [4], [5]. Furthermore, current atmospheric data collection exceeds hundreds of petabytes per day [6], implying that data collection far outpaces our ability to analyze and assimilate it. Consequently, the authors of [4] argue that we face two substantial challenges in this field going forward: 1) gaining knowledge from these extreme amounts of data and 2) developing models that are more data-driven than traditional approaches while still abiding by the laws of physics. One recent application found discrepancies in climate models' estimation of photosynthesis in tropical rainforests, which ultimately led to a more accurate description of these processes globally [7], [8]. Ideally, similar insights can be discovered by data-driven methods for cloud dynamics, but obtaining adequate global observations of clouds has been a substantial obstacle to developing data-driven cloud forecasting methods to date.
To tackle this problem and spark further research into data-driven atmospheric forecasting, we introduce a novel satellite-based dataset called "CloudCast" that facilitates the evaluation of cloud forecasting methods with a global perspective. Public benchmark datasets such as MNIST [9], ImageNet [10], and CIFAR10 [11] have been paramount to progress in state-of-the-art computer vision methods. Current datasets for global cloud forecasting exhibit coarse spatial resolution (9 × 9 to 31 × 31 km) and low temporal granularity (one to multiple hours between images) [5], [12]–[15]. We overcome both issues by using geostationary satellite images, arguably the most consistent and regularly sampled global data source for clouds [1]. Since these satellites obtain images every 5–15 min at relatively high spatial resolution (1 × 1 to 3 × 3 km), they provide an essential ingredient for data-driven weather systems: an abundance of historical observations. Higher accuracy in the vertical dimension is possible with radar- and lidar-based profiling methods [16], but such instruments are not geostationary and hence fall short on temporal resolution. Our contributions are as follows.
We present a novel satellite-based dataset designed for cloud forecasting. The dataset has 10 different cloud types for multiple layers of the atmosphere, annotated on a pixel level. It consists of 70 080 images with a spatial resolution of 928 × 1530 pixels (3 × 3 km) and 15-min sampling intervals from January 1, 2017 to December 31, 2018. All frames are centered and projected over Europe. To the authors' best knowledge, no equivalent dataset with high spatial and temporal resolution exists for evaluating multilayer cloud forecasting methods globally.
We evaluate four video prediction methods to serve as benchmarks for our dataset by predicting 4 h into the future. Two of these are based on recent advancements in machine learning methods specifically for applications in atmospheric forecasting.
To evaluate our results, we present an evaluation study for measuring cloud forecasting accuracy in satellite-based systems. The evaluation design is based on best practices from the World Meteorological Organization for cloud evaluation studies [1], including widely tested statistical metrics for categorical forecasts. Furthermore, we implement the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) from the computer vision literature. The combination of these two domains should provide a comprehensive and fair evaluation of our results.
The remainder of the article is organized as follows. Section II provides an overview of related work. Section III describes the new dataset. Section IV presents our experiments and the evaluation study for measuring cloud forecasting accuracy in satellite-based systems. Finally, conclusions are drawn in Section V.
Related Work
We first briefly review datasets commonly used in the cloud forecasting literature and then review video prediction methods that are particularly suitable for forecasting in the spatiotemporal domain.
A. Related Datasets
We want to introduce a dataset to the community that is particularly suitable for developing data-driven methods with a global perspective. As a result, we only consider 1) geostationary satellite observations and 2) model-based observations. While other cloud datasets with very high resolution and accuracy do exist for localized atmospheric analysis, such as those from sky imagers or radar satellites, these technologies generally provide poor spatial coverage, limiting their use in a global setting.
1) Satellite-Based Cloud Observations
Before introducing related satellite datasets, it is essential to differentiate between cloud detection and cloud typing for meteorological purposes such as forecasting, as the literature varies considerably between the two. Generally speaking, satellite-based cloud detection, typically with the objective of cloud removal, is a relatively established field with many accurate methods [17], [18]. For meteorological purposes, most methods deal with binary cloud masks [19]. As outlined earlier, we are specifically interested in multilayer cloud types for forecasting purposes, which considerably limits the amount of related literature and datasets. Focusing on multilayer cloud types, satellite-based cloud observations can be divided into raw infrared brightness temperature and satellite-derived cloud measurements [1].
Raw satellite brightness temperature acts as a proxy for cloud top height and is available during both day and night. This proxy is not a perfect indicator for multilayer cloud types, as the temperature of clouds is also affected by other factors such as the specific cloud type and seasonal variations.
Satellite-derived cloud measurements typically involve a brightness-based algorithm that derives variables such as cloud mask, type, and height from multispectral images [12]. This is the approach we adopt for our novel dataset, as it can classify multilayer cloud types with relatively high spatial and temporal resolution in near real time [12], [20]. To the authors' best knowledge, only a few related datasets for satellite-derived multilayer clouds exist. For the European Meteosat Second Generation (MSG) satellites, these are Cloud Analysis and Cloud Analysis Image published by EUMETSAT [14], [15]. These products exhibit either coarse spatial resolution (9 × 9 km) or infrequent temporal sampling (1–3 h between images) [12]. As we derive our cloud types directly from the raw satellite images, we maintain the high spatial resolution (3 × 3 km) and temporal granularity (15-min sampling) of the raw satellite images. While our dataset is built from the MSG satellites, our approach is not limited to any specific geostationary satellite system and could be extended to other constellations. Outside Europe, similar datasets exist for 1) the American GOES-R satellites, called ABI Cloud Height [21], and 2) the Japanese Himawari-8, called the Cloud Top Height Product [22]. As their geographical coverage does not extend to Europe, they are not directly comparable to ours or those of EUMETSAT.
2) Model-Based Cloud Observations
Model-based clouds are derived from the output of an NWP model and are the most directly comparable alternative to satellite observations due to their global coverage. Two commonly used global NWP models are the European ECMWF atmospheric IFS model [5] and the American GFS model [13]. The resolution varies between the two, but the ECMWF model offers the highest spatial resolution with 9 × 9 km grid spacing [5]. As both models are global, they can be used interchangeably. The advantage of using NWP model output is that a physics-based simulation of the future exists, while the clear disadvantage is the coarse spatial resolution compared to satellites. Other NWP models exist on a much finer spatial scale, but these are restricted to local areas, usually on a country basis [23], [24].
Given that we want to establish a global reference dataset for data-driven methods, the operational ECMWF model is considered the best global model-based multilayer cloud dataset. Compared to CloudCast, the ECMWF dataset is inferior in spatial resolution (9 × 9 vs. 3 × 3 km) and temporal resolution (1 h vs. 15 min). Furthermore, the operational ECMWF data are not openly available, making them less accessible for the machine learning and computer vision community.
B. Methods
Producing accurate and realistic video predictions in pixel space remains an open problem. Extrapolating frames into the near future can be done relatively accurately, but as the future sequence length grows, so does the inherent uncertainty of the predicted pixel values. Several approaches have been proposed for this complex and high-dimensional task: spatiotemporal-transformer networks [25], variational autoencoders [26], generative adversarial networks (GANs) [27]–[29], and recurrent-convolutional neural networks (CNNs) [30], [31]. In the video prediction literature, the tasks are often governed by relatively simple physics, such as the Moving MNIST dataset [32]. However, predicting atmospheric flow is bound by much more complex physics. Therefore, our chosen methods focus on applications that have been explicitly applied to atmospheric forecasting, which justifies the chosen benchmark models for our dataset. These are 1) convolutional long short-term memory networks (ConvLSTM) [31], 2) multistage dynamic generative adversarial networks (MD-GAN) [29], and 3) optical flow-based video prediction [33].
1) Convolutional and Recurrent Neural Networks (ConvLSTM)
ConvLSTM was originally developed for precipitation nowcasting using radar images. It is considered the seminal paper for atmospheric forecasting using deep learning, making it a relevant baseline for our dataset. While newer LSTM-based video prediction methods have been proposed following the ConvLSTM paper, such as PredRNN++ [34] and Eidetic-3D LSTM [30], these were not applied or evaluated on any atmosphere-related datasets, meaning they are outside the scope defined in Section I.
2) Optical Flow-Based Video Prediction
While optical flow is a classical topic in the computer vision literature, it is also one of the most important methods for global data assimilation in meteorology [35]. Optical flow has been applied for video prediction in several papers [33], [36]. In [33], the authors implement the TV-L1 optical flow algorithm for atmospheric nowcasting.
3) Generative Adversarial Networks
GANs have been applied for video prediction in several recent papers [28], [29], [38]. One of these [29] achieved state-of-the-art results for generating 32-frame time-lapse videos of cloud movement in the sky at 128 × 128 resolution using only one frame as input. The authors use a two-stage generative adversarial network-based approach (MD-GAN), where the first-stage model generates an initial video of realistic frames with coarse motion. The second-stage model then refines the initially generated video by enforcing motion dynamics using the Gram matrix in the intermediate layers of the discriminator.
Dataset Description
The CloudCast dataset contains 70 080 cloud-labeled satellite images with 10 different cloud types corresponding to multiple layers of the atmosphere, as seen in Table I. As stated in Section II, we apply a satellite-derived cloud measurement approach. The procedure for generating our dataset is as follows (see Fig. 1 for a visual overview).
1) Acquire 70 080 samples for the period January 2017 to December 2018. The samples originate from the MSG satellites with four different channels per sample (280 320 satellite images in total), with 15-min sampling intervals and 3 × 3 km spatial resolution.
2) Collect hourly NWP output for the entire period using the ECMWF operational model (exact variables are elaborated below).
3) Annotate the 70 080 samples on a pixel level using the multilayer segmentation algorithm (described below).
4) Conduct postprocessing to account for short-term missing observations.
5) Generate and publish a standardized version of the full-resolution dataset and a spatially downsampled version, in addition to the raw dataset, to serve as benchmarks for future studies.
Fig. 1. Processing chain going from raw data to our final published dataset, CloudCast.
We will now elaborate on the above steps in more detail. As stated above, we start by collecting the 70 080 raw multispectral satellite images from EUMETSAT. These images come from a satellite constellation in geostationary orbit centered at zero degrees longitude and arrive in 15-min intervals. The resolution is 3712 × 3712 pixels for the full disk of the Earth, which implies that every pixel covers approximately 3 × 3 km. In the remote sensing community, it is well known that infrared channels observe clouds differently than visible light, making infrared necessary for low and medium cloud detection. Therefore, we sample one visible channel, two infrared channels, and one water vapor channel for each observation to enable multilayer cloud detection. The size of the entire raw satellite dataset is around 16 TB. Due to download and request limits imposed by EUMETSAT, we could only process a certain number of samples at any given time, which meant spreading this process over a couple of months.
Next, we annotate each sample on a pixel level using a segmentation algorithm originally developed by [20] under the European Organisation for Meteorological Satellites—Satellite Application Facility on Support to Nowcasting and Very Short Range Forecasting (NWCSAF) project [39]. This algorithm is essentially a threshold algorithm applied at the pixel level to our multispectral satellite images. To improve multilayer cloud detection in the segmentation algorithm, we include climatological variables and metadata such as geographical land-sea masks and viewing geometry, which have been shown to improve low- and mid-level cloud detection considerably [20]. Additionally, we include NWP output to further improve the segmentation algorithm by adding information that is not observable from satellite data. We collect the NWP data from the ECMWF operational model, which includes surface temperature, air temperatures at five different heights (950, 850, 700, and 500 hPa, and the tropopause level), total water vapor content of the atmosphere, and metadata for the ECMWF model grid.
Having established all the required datasets for the segmentation algorithm, we now outline how the thresholds are calculated for the major cloud types; they are primarily based on illumination conditions, viewing geometry, geographical location, and the NWP data. We do not list all the specific threshold values due to their sheer number, which varies between daytime, nighttime, and twilight. Readers interested in the specific values are referred to [20].
The first set of clouds comprises high semitransparent (thin) clouds versus opaque clouds and fractional clouds. To separate these, we use, among other features, the difference between the infrared channels at 8.7 and 10.8 μm.
Once we have identified semitransparent and fractional clouds, we classify the remaining cloudy pixels into the low, mid, and high clouds found in Table I. This separation is simpler; hence, we can calculate the thresholds directly from the 10.8-μm brightness temperature and the NWP air temperatures.
We define the very high (vh), high (hi), medium (me), and low (lo) separation thresholds as
\begin{align*}
vh &= 0.4 \cdot T_{500\,\text{hPa}} + 0.6 \cdot T_{\text{tropo}} - 5\,\text{K} \\
hi &= 0.5 \cdot T_{500\,\text{hPa}} - 0.2 \cdot T_{700\,\text{hPa}} + 178\,\text{K} \\
me &= 0.8 \cdot T_{850\,\text{hPa}} + 0.2 \cdot T_{700\,\text{hPa}} - 8\,\text{K} \\
lo &= 1.2 \cdot T_{850\,\text{hPa}} - 0.2 \cdot T_{700\,\text{hPa}} - 5\,\text{K}. \tag{1}
\end{align*}
Denoting the 10.8-μm brightness temperature by $T_{10.8}$, each cloudy pixel is then assigned a height category as
\begin{equation*}
f(T_{10.8}) = {\begin{cases}
\text{VeryHigh} & \text{if } T_{10.8} < vh \\
\text{High} & \text{if } vh \leq T_{10.8} < hi \\
\text{Medium} & \text{if } hi \leq T_{10.8} < me \\
\text{Low} & \text{if } me \leq T_{10.8} < lo \\
\text{VeryLow} & \text{if } lo \leq T_{10.8}
\end{cases}}. \tag{2}
\end{equation*}
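To make the classification concrete, below is a minimal per-pixel sketch of (1) and (2) in NumPy. The function and variable names are ours, and the many daytime/nighttime/twilight corrections of the operational algorithm [20] are deliberately omitted:

```python
import numpy as np

def cloud_height_category(bt_108, t_500, t_700, t_850, t_tropo):
    """Assign cloud-height categories from the 10.8-um brightness
    temperature (K) and NWP air temperatures (K), following (1)-(2).
    All inputs are NumPy arrays of the same shape (one value per pixel)."""
    # Thresholds from (1)
    vh = 0.4 * t_500 + 0.6 * t_tropo - 5.0
    hi = 0.5 * t_500 - 0.2 * t_700 + 178.0
    me = 0.8 * t_850 + 0.2 * t_700 - 8.0
    lo = 1.2 * t_850 - 0.2 * t_700 - 5.0

    # Piecewise assignment from (2); np.select picks the first true
    # condition, so categories are ordered coldest (highest cloud) first.
    return np.select(
        [bt_108 < vh, bt_108 < hi, bt_108 < me, bt_108 < lo],
        ["VeryHigh", "High", "Medium", "Low"],
        default="VeryLow",
    )
```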
In practice, the 7.3-μm water vapor channel is also used to correct the separation between low and medium clouds.
While the segmentation algorithm is considered accurate, there are a few limitations. The primary limitation is that low clouds are sometimes classified as medium clouds in cases of, e.g., strong thermal inversion, despite the corrections made with the 7.3-μm channel.
As a final postprocessing step, we interpolate missing observations, which can arise for numerous reasons such as scheduled outages or sun outages. More specifically, we interpolate the missing observations linearly from neighboring values, which only happens for short-term periods (below 6 h). The list of outages at the satellite level can be found on the EUMETSAT website [40]. One specific example is October 17, 2017 from 11:30 to 12:30 UTC, where the outage was caused by sun colinearity.
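As an illustration of this step, the following sketch fills short gaps by linear interpolation in time, assuming missing frames are stored as NaN placeholders in a (T, H, W) array; the layout and names are ours, not the production pipeline:

```python
import numpy as np
import pandas as pd

def fill_short_gaps(frames, timestamps, max_gap_hours=6):
    """Linearly interpolate missing frames (NaN placeholders) from their
    temporal neighbors, capped at gaps below max_gap_hours.
    frames: (T, H, W) float array; timestamps: sequence of datetimes."""
    limit = int(max_gap_hours * 60 / 15)  # number of 15-min steps per gap
    t, h, w = frames.shape
    flat = pd.DataFrame(frames.reshape(t, h * w),
                        index=pd.DatetimeIndex(timestamps))
    # Time-weighted linear interpolation; 'limit' caps consecutive fills.
    # For categorical cloud labels, the result would subsequently be
    # rounded to the nearest valid class index.
    flat = flat.interpolate(method="time", limit=limit)
    return flat.to_numpy().reshape(t, h, w)
```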
As stated in Section I, current global datasets [5], [13] for cloud forecasting and evaluation come with either low temporal granularity (one to multiple hours between images) or coarse spatial resolution (9 × 9 to 31 × 31 km) (see Table II). This demonstrates the need for our novel high-resolution dataset. In addition to the raw dataset, we also publish a standardized version for future studies, where we center and project the final annotated dataset to cover Central Europe, which implies a final resolution of 728 × 728 pixels. An example observation can be seen in Fig. 2. To support small-scale experiments and analysis, we also publish a downsampled low-resolution dataset of 15 × 15 km, which is significantly smaller in size compared to the full dataset.
Fig. 2. Example observation from the CloudCast dataset for April 1, 2017, 13:00 UTC. From left to right: raw map of the area under investigation; multispectral raw satellite RGB composite consisting of two visible light images (0.6 and 0.8 μm).
Experiments
As an initial baseline study for our CloudCast dataset, we include several of the video prediction methods from our review in Section II. These methods have seen considerable success in similar atmospheric nowcasting studies recently [31], [33]. To match the resolution of most state-of-the-art video prediction methods [41], [42], we crop and transform our dataset using a stereographic projection to cover Central Europe with a spatial resolution of 128 × 128 pixels. We still use the full temporal resolution of 15-min intervals, compared to the hourly observations of other datasets mentioned in Section II. Several definitions of nowcasting exist, generally varying between 0–2 and 0–6 h [43], [44]. We select the forecast horizon to be 4 h ahead in 15-min increments (16 time steps), which lies in the middle of most definitions. While forecasting beyond 6 h is theoretically possible, we expect performance to deteriorate over time unless we incorporate additional variables that cannot be observed from satellite data alone to explain medium- to long-term cloud dynamics.
We have divided the dataset into 1.5 years (75%) of training and 0.5 years (25%) of testing. Ideally, our test data would cover all seasons of the year; however, the frequency distribution between training and test is relatively similar for most classes, as seen in Table III. We also group the 10 cloud types into four based on height:
1) no clouds;
2) low clouds;
3) medium clouds; and
4) high clouds.
This ensures a more natural ordering of the classes and enables us to focus on the major cloud types also present in the global NWP models [5], [13].
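In practice, this grouping amounts to a simple label lookup. The sketch below uses hypothetical class indices, as the exact order of the Table I labels is not reproduced here:

```python
import numpy as np

# Hypothetical assignment of a no-cloud class (index 0) and the 10
# CloudCast cloud types (indices 1-10) to the four height-based groups;
# the exact correspondence to Table I is an assumption.
GROUP_OF_TYPE = np.array(
    [0,              # 0: no clouds
     1, 1, 1,        # 1-3: low cloud types
     2, 2, 2,        # 4-6: medium cloud types
     3, 3, 3, 3],    # 7-10: high cloud types
    dtype=np.uint8)

def group_labels(label_frame):
    """Map a (H, W) frame of fine-grained cloud type indices to the
    four-category labels used in our experiments."""
    return GROUP_OF_TYPE[label_frame]
```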
A. Benchmark Models
We present an initial benchmark for our dataset based on the reviewed methods in Section II-B. The results of the baseline models will be presented in Section IV-C along with the advantages and disadvantages of the chosen methodologies.
As stated in Section II-B, our chosen methods focus on applications that have been explicitly applied for atmospheric forecasting. This is the motivation behind the first three computer vision benchmark models that we outlined in Section II-B. The final benchmark is the simple persistence model typically used and recommended as a baseline in meteorology studies [1]. As these models are all suitable for the problem of cloud forecasting, they should provide good baselines for our dataset.
1) Autoencoder ConvLSTM (AE-ConvLSTM)
For our first baseline, we implement a variant of the ConvLSTM model from [31], where we introduce an autoencoder architecture with 2-D CNNs and use the ConvLSTM layers on the final encoded representation instead of directly on the input frames. This helps us to 1) encode the relevant spatial features from the input images before we start encoding and decoding the temporal representation, and 2) make training more memory efficient as ConvLSTM layers are memory-intensive. The autoencoder uses skip connections similar to UNet [45]. The motivation behind including skip connections for video prediction is to transfer static pixels from the input to the output images, making the model focus on learning the movement of dynamic pixels instead [41].
We start by reconstructing the first 16 input frames to initialize a spatiotemporal representation of the past cloud movement time series. To predict 16 frames into the future, we use an autoregressive approach, where we feed the predicted output back as input recursively to predict the next 16 steps. This is similar to the approach of other video prediction papers [28]. To improve the sharpness of our results without introducing an adversarial loss function, we have chosen to use the ℓ1 loss, which is known to encourage sharper predictions than the ℓ2 loss.
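A minimal sketch of this autoregressive rollout is shown below in PyTorch; the one-step interface of `model` and the tensor layout are our assumptions, not the exact AE-ConvLSTM implementation:

```python
import torch

@torch.no_grad()
def rollout(model, context, steps=16):
    """Autoregressive prediction: feed the model's own outputs back as
    input. context: (B, 16, C, H, W) tensor of the 16 most recent frames;
    model maps a frame window to the next frame (B, C, H, W)."""
    frames = context.clone()
    predictions = []
    for _ in range(steps):
        next_frame = model(frames)  # one-step-ahead prediction
        predictions.append(next_frame)
        # Slide the window: drop the oldest frame, append the prediction
        frames = torch.cat([frames[:, 1:], next_frame.unsqueeze(1)], dim=1)
    return torch.stack(predictions, dim=1)  # (B, steps, C, H, W)
```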
2) Multistage Dynamic Generative Adversarial Networks
For training and optimizing the MD-GAN model, we follow the original authors [29] with some differences. Since the MD-GAN paper focused on video generation rather than video prediction, we make necessary adjustments to the experimental design to account for this. Instead of cloning one input frame into 16 and feeding them to the generator, we feed the previous 16 images to the generator.
Besides these changes, we largely followed the approach in [29]. We found that fixing the learning rate at 0.0002 did not produce satisfying results and often caused mode collapse for the generator. Instead, we employed the technique from [47] and set a higher learning rate for the discriminator (0.0004) than for the generator (0.0001). This overcomes situations where early mode collapse causes training to stall and, instead, incentivizes smaller steps for the generator to fool the discriminator.
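In PyTorch, these asymmetric learning rates amount to two optimizers (a sketch; the Adam betas are our assumption):

```python
import torch

# Higher learning rate for the discriminator than the generator [47],
# discouraging early mode collapse by letting the generator take
# smaller, more stable steps.
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))
```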
We find that the training procedure is inherently unstable, a frequent issue for GANs [48]. This issue arises particularly during second-stage training, which tends to stall after around 10–20 epochs.
3) TV-L1 Optical Flow
We implement the TV-L1 optical flow algorithm for video prediction following [33], estimating a dense flow field from the most recent input frames and using it to warp the last observed frame recursively 16 steps into the future.
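A sketch of one possible implementation using OpenCV's TV-L1 estimator is given below; the recursive backward-warping scheme is our simplification and not necessarily the exact setup of [33]:

```python
import cv2
import numpy as np

def extrapolate(prev_frame, curr_frame, steps=16):
    """Estimate TV-L1 flow between the two most recent frames and warp
    the last frame forward repeatedly. Frames are uint8 (H, W) images."""
    tvl1 = cv2.optflow.createOptFlow_DualTVL1()  # needs opencv-contrib-python
    flow = tvl1.calc(prev_frame, curr_frame, None)

    h, w = curr_frame.shape
    grid = np.stack(np.meshgrid(np.arange(w), np.arange(h)),
                    axis=-1).astype(np.float32)
    predictions, frame = [], curr_frame
    for _ in range(steps):
        # Backward warping: sample each output pixel from where the
        # (assumed constant) flow points; nearest keeps labels discrete.
        coords = (grid - flow).astype(np.float32)
        frame = cv2.remap(frame, coords[..., 0], coords[..., 1],
                          interpolation=cv2.INTER_NEAREST)
        predictions.append(frame)
    return predictions
```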
4) Persistence
One of the recommended benchmark models in cloud evaluation studies is the persistence model [1]. Persistence refers to replicating the most recent observation, in this case the 15-min lagged cloud-labeled satellite image, 16 steps into the future. When cloud motion is limited, we expect this model to perform relatively well, but it is obviously naive and will not work in dynamic weather situations. The most challenging part of video prediction is usually realistic motion generation; therefore, comparing other models to the persistence model shows how well a model has captured and predicted future cloud motion dynamics. Hence, the persistence model serves as the baseline for skill score calculations.
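The persistence forecast itself is a one-liner (a sketch under our array layout):

```python
import numpy as np

def persistence_forecast(last_frame, steps=16):
    """Replicate the most recent cloud-labeled frame across the horizon.
    last_frame: (H, W) label array; returns (steps, H, W)."""
    return np.repeat(last_frame[np.newaxis], steps, axis=0)
```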
B. Evaluation Metrics
Standardized evaluation metrics for the video prediction domain emphasizing atmospheric applications are hard to come by. As stated in Section I, we select our evaluation metrics from the World Meteorological Organization [1]. Because numerous metrics are available, we select those with the highest ranking score in the referenced study. As several of these are common in the computer vision and machine learning literature, we only go through the nonstandard metrics. The frequency bias metric, typically called the "bias score," measures the total number of predicted events relative to observed events. Any value above (below) 1 indicates that the model tends to overforecast (underforecast) events.
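For a single cloud category, the frequency bias can be computed from the predicted and observed label frames as follows (a sketch; treating each pixel of a given category as an event is our assumption):

```python
import numpy as np

def frequency_bias(pred_labels, true_labels, category):
    """Ratio of predicted to observed event counts for one category.
    Values > 1 mean the category is overforecast, < 1 underforecast."""
    predicted_events = np.sum(pred_labels == category)
    observed_events = np.sum(true_labels == category)
    return predicted_events / observed_events
```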
The first nonstandard metric is the "Brier Score." Here, the Brier Score refers to the MSE between estimated probabilistic forecasts and binary outcomes. To extend it to the categorical multiclass setting, we sum the individual MSEs for all categorical probabilistic forecasts relative to the one-hot target class variable as follows:
\begin{equation*}
\text{BS}= \frac{1}{M} \frac{1}{N} \sum _{k=1}^{M} \sum _{t=1}^{N}\left(f_{t,k}-y_{t,k}\right)^{2} \tag{3}
\end{equation*}
where $f_{t,k}$ is the forecast probability for class $k$ at time $t$, $y_{t,k}$ is the corresponding one-hot target, $M$ is the number of classes, and $N$ is the number of forecasts. The Brier Skill Score (BSS) then measures the relative improvement over the persistence baseline
\begin{equation*}
\text{BSS}= 1 -\frac{\text{BS}_{\text{model}}}{\text{BS}_{\text{persistence}}} \tag{4}
\end{equation*}
where values above zero indicate skill relative to persistence.
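Equations (3) and (4) translate directly into a few lines of NumPy (a sketch under our array layout, with per-class probabilities and one-hot targets of shape (N, M)):

```python
import numpy as np

def brier_score(forecast_probs, targets_onehot):
    """Multiclass Brier Score, (3): squared error between class
    probabilities (N, M) and one-hot targets (N, M), averaged over
    both samples and classes."""
    return np.mean((forecast_probs - targets_onehot) ** 2)

def brier_skill_score(bs_model, bs_persistence):
    """Brier Skill Score, (4): relative improvement over the persistence
    baseline; positive values indicate skill over the baseline."""
    return 1.0 - bs_model / bs_persistence
```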
In addition to these metrics, we also include video prediction metrics from the computer vision literature, which, taken together with the meteorology metrics, should constitute the fairest evaluation in this complex setting. These include the SSIM and the PSNR.
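Both metrics are available off the shelf, e.g., in scikit-image; the sketch below assumes frames scaled to [0, 1]:

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# PSNR and SSIM between a ground truth and a predicted frame (H, W),
# here assumed to be float arrays scaled to [0, 1].
psnr = peak_signal_noise_ratio(true_frame, pred_frame, data_range=1.0)
ssim = structural_similarity(true_frame, pred_frame, data_range=1.0)
```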
C. Results
The results of our baseline methods can be found in Table IV. We also include an example of model forecasts relative to ground truth in Fig. 3.
Fig. 3. Example forecast for all models relative to ground truth.
Despite the proposed models showing relatively high overall accuracies, it is quite clear that none of the models show consistent performance across time and space. This is also evident when looking at the decline in accuracy across time. This suggests that we need to 1) develop models more suitable for this particular problem, or 2) incorporate other data sources or variables to make more reasonable and causal predictions for the complex setting of multilayer cloud movement and formation.
We include a visualization of the worst predictions from the test set, measured by mean accuracy, in Fig. 4. Looking at the failure cases for MD-GAN S2 and AE-ConvLSTM, we observe that they struggle in situations where clouds are primarily scattered. This is unsurprising given that these models tend to generate predictions that are spatially clustered and moderately blurry. The TVL1 model predicts considerable cloud movement that turns out to be incorrect. The underlying reason could be the dissipation of clouds across the input images used in the optical flow estimation, which would violate the constant pixel intensity assumption. The persistence model performs poorly in situations with substantial motion, as seen in Fig. 4.
Fig. 4. Worst predictions from our test dataset on CloudCast using the proposed benchmark models compared to ground truth. Worst is defined as having the lowest mean accuracy among all test images for each model. The difference plots are calculated using the absolute difference between the predicted and ground truth images.
1) Autoencoder ConvLSTM
The AE-ConvLSTM method achieves the highest accuracy on our dataset, measured both temporally and spatially, on all but medium clouds. For the BSS metric, we notice superior performance relative to the persistence model with a value of 0.11. This implies that ConvLSTM layers applied to cloud-labeled satellite images do capture spatiotemporal motion to some extent. On the other hand, we see in Fig. 3 that predictions become increasingly blurry over time, in alignment with the discussion in Section II-B1.
2) Multistage Dynamic Generative Adversarial Networks
The MD-GAN model outperforms the persistence model with a Brier Skill Score of 0.07. The categorical accuracy is not captured well, as MD-GAN achieves the lowest accuracy for medium clouds among all our models. The temporal accuracy closely matches the ConvLSTM model, especially for the 2-h forecast. Thus, by improving the stability and the initial forecasting accuracy of the MD-GAN model, we expect it could become the best and most consistent model.
3) TV-L1 Optical Flow
The TVL1 algorithm shows marginally superior performance relative to the persistence model with a BSS of 0.02. The primary reason behind the close performance of TVL1 and persistence relates to the choice of hyperparameters for the TVL1 algorithm, where hyperparameters yielding more static movement generally implied better performance across time. We believe the underlying reason for this result is the complexity of forecasting multilayer clouds 16 steps ahead combined with the violation of the optical flow assumption of having constant brightness intensity over time. Compared to AE-ConvLSTM and MD-GAN, it achieves lower overall and temporal accuracy but does reach higher accuracy for medium clouds. While optical flow methods have been popular for atmospheric forecasting, as stated in Section II-B2, their application to multilayer cloud types has not been fully researched yet. Hence, the proposed machine learning methods currently seem more appropriate for this task given their superior performance.
4) Persistence
The simple persistence model achieves relatively good results. The high short-term accuracy is not surprising given the limited cloud movement within 1 h. Due to its static nature, however, it achieves the lowest accuracy near the end of our forecasting horizon.
Conclusion
We introduce a novel dataset for cloud forecasting called CloudCast, which consists of pixel-labeled satellite images with multilayer clouds of high temporal and spatial resolution. The dataset facilitates the development and evaluation of methods for atmospheric forecasting and video prediction in both the vertical (height) and horizontal (latitude–longitude) domains. Four different cloud nowcasting models were evaluated on this dataset based on recent advancements in the machine learning literature for video prediction and traditional methods from the meteorology and computer vision literature. Several evaluation metrics based on best practices in cloud forecasting studies were proposed in addition to the PSNR and SSIM. The four models provided an initial benchmark for this dataset but showed ample room for improvement, especially for predictions near the end of our forecasting horizon. Hybrid methods combining machine learning and NWP could be interesting approaches to address medium to long-term forecasting in a future study.
We hope this novel dataset will help advance and stimulate the development of new data-driven methods for atmospheric forecasting in a field heavily dominated by physics and numerical methods.