Forecasting El Niño and La Niña Using Spatially and Temporally Structured Predictors and a Convolutional Neural Network

El Niño–Southern Oscillation (ENSO) is characterized by large-scale fluctuations of sea surface temperature (SST) in the central and eastern tropical Pacific, accompanied by changes in the atmospheric circulation. ENSO events are of two main types: El Niño and La Niña. The Oceanic Niño Index (ONI) is the three-month running mean of SST anomalies in the Niño 3.4 region (5°S–5°N, 170°W–120°W). El Niño is a phenomenon in the equatorial Pacific Ocean characterized by an ONI value of greater than 0.5 °C, and La Niña by an ONI value of less than –0.5 °C. The lingering of either of these two phenomena can induce severe droughts, whereas one following the other can cause massive floods. In both cases, deaths and substantial pecuniary losses are unavoidable, making their forecast of great significance. This study takes on the challenge of forecasting these two phenomena with a one-year lead time, which has proven difficult in the literature. This research's contribution is restructuring the predictors of ENSO events, including SST, sea level pressure, surface wind speed, and wind stress, in a spatially and temporally meaningful way, and designing a convolutional neural network that takes advantage of this structure to forecast ENSO events of different types (i.e., Central Pacific and Eastern Pacific) in the next year. Not only was a high precision achieved in the forecasts, but it was also shown that the proposed model has the potential to achieve higher recall if more samples from the positive class become available.

On one hand, tropical processes, including the positive air–sea interaction [1] and oceanic dynamics [2], play an important role in the occurrence of ENSO. On the other hand, a growing number of recent studies have demonstrated that atmospheric and oceanic forcings outside the tropical Pacific also play a crucial role in modulating the onset, development, and amplitude of ENSO events, such as the North Pacific Oscillation (NPO) (an intrinsic atmospheric variability over the extratropical North Pacific) [3], [4], the Aleutian Low [5], the Arctic Oscillation (AO) (the dominant atmospheric variability over the extratropical Northern Hemisphere) [6]–[8], North Atlantic SST [9], East Asian monsoon variability, the atmospheric circulation anomalies in the Southern Hemisphere, and the SST variability over the Indian and Atlantic Oceans [10]. For instance, Vimont et al. [3] and Chen et al. [4] showed that, via the seasonal footprinting mechanism, the NPO can exert significant impacts on the outbreak of ENSO events. The spring AO, by modulating the westerly wind bursts over the tropical western Pacific, significantly influences ENSO events in the following winter. Chen et al. [10] reviewed these forcings outside the tropical Pacific on the occurrence of ENSO events. They showed that the westerly (easterly) wind anomaly over the tropical western Pacific plays a key role in the occurrence of an El Niño (a La Niña) event. The wind anomalies over the tropical western Pacific are also essential in relaying the impacts of the atmosphere–ocean forcings outside the tropical Pacific on ENSO variability.
ENSO is the strongest air–sea coupled system in the world [1], [11], [12], and it can exert substantial impacts on global temperature, precipitation, weather, climate, marine fisheries, and ecosystems around the Pacific and in remote regions [13]–[23]. For instance, ENSO events can significantly influence East Asian monsoon activity and related precipitation anomalies [13], [17], [19], climate and weather variations over Europe and North America [18], [20], [21], SST variations over many remote regions [14]–[16], and tropical cyclone activities over the western North Pacific and Atlantic [22], [23]. Due to the substantial impacts of ENSO, it is of great importance to improve the prediction of ENSO events.
The Oceanic Niño Index (ONI) is the three-month running mean of SST anomalies in the Niño 3.4 region (5°S–5°N, 170°W–120°W). El Niño is a phenomenon in the equatorial Pacific Ocean characterized by an ONI value of greater than 0.5 °C, and La Niña by an ONI value of less than −0.5 °C [24]. El Niño and La Niña each have two types, one over the Central Pacific (CP) and one over the Eastern Pacific (EP). El Niño and La Niña have profound climate impacts across the world. However, the impacts differ considerably between the cold (i.e., La Niña) and warm (i.e., El Niño) phases, as well as among different evolution patterns [25], [26]. Two prominent examples of these evolution patterns are the lingering and transitional patterns. The lingering pattern happens when an El Niño or La Niña event lingers for two years [27], [28], which has caused prolonged droughts in the U.S. [29]. The transitional pattern happens when an El Niño event decays and then changes into a La Niña event, or vice versa, which has caused flash floods in Bangladesh and China [30], [31]. Therefore, predicting El Niño and La Niña events is of paramount importance in climatology and disaster management [32], [33]. Existing literature has shown that forecasting ENSO events with long lead times, such as one year, can only reach low accuracies [34], [35]. Additionally, the majority of studies on forecasting ENSO events lack proper handling of their spatial and temporal structure. Therefore, this study's objective is to structure the environmental observations at different points across the world over the past year in spatial-temporal layers and apply a convolutional neural network (CNN) to learn the spatial behavior of these features and their temporal relationship with ENSO events happening the following year.
The rest of this article is structured as follows. Section II reviews state-of-the-art literature in forecasting ENSO events and highlights their differences with the current study. Section III describes our dataset. Section IV lays out the proposed forecast model, elaborates on its details, and justifies its selection. Section V reports and discusses the results. Finally, Section VI concludes this article and provides future directions.

II. RELATED WORK
Forecasting ENSO events at lead times of more than one year remains a challenge [34], as does forecasting the type of the ENSO event (EP or CP). The closest work to ours is the CNN model applied by Ham et al. [34] to predict ENSO events at lead times of greater than one year, based on SST and oceanic heat content (OHC) anomaly maps over three consecutive months. They established the superiority of the CNN over other major forecasting models. Not only are their input features different from ours, but so is the structure of their CNN. Additionally, their main model predicts the ENSO event but not its type (EP or CP); they developed a separate CNN to forecast the event's type and reported the correlation coefficient of the forecasts with the true values. However, in predicting El Niño or La Niña, there is an imbalance between the class sizes, with the positive class being much smaller than the negative one. Therefore, a high correlation could be achieved even if the classifier's precision is low. We report the precision and recall of our model, in addition to its overall accuracy, to underscore its higher precision than recall.
Dynamic models take advantage of the physical laws that govern how the atmosphere and the ocean interact to mathematically forecast the ONI. Statistical models, such as our model, learn patterns from historical atmospheric and oceanic data to statistically forecast the ONI. Hong et al. [35] combined a quadratic time-series dynamic model with a self-memorization component (a statistical model) to forecast SST anomalies in the EP. The forecast is made based on SST, the Southern Oscillation index, and the East Asian winter monsoon index, which is defined by the meridional 850 hPa winds averaged over the region of 20°–40°N, 100°–140°E. When used to forecast El Niño and La Niña events with lead times of one year or greater, their model's accuracy fell, compared to forecasting SST anomalies. Zhang et al. [36] combined three statistical models (Constructed Analogues [37], Canonical Correlation Analysis [38], [39], and a Markov Model [40]) and one dynamic model (Climate Forecast System Version 2 [41]) using a Gaussian mixture model, whose parameters are learned with the expectation-maximization algorithm. The statistical models use SST, sea level, and wind-related factors as input features, and forecasts are made for lead times of up to seven months. They showed that their combination is more accurate than any of the individual models and that the statistical models' accuracy falls for lead times of greater than four months.
Other similar works include a standard neural network (SNN) applied by Lubkov et al. [42] to forecast El Niño and La Niña based on a set of global climatic indices of atmosphere-ocean system oscillations, with a lead time of three months. Petersik and Dijkstra [43] applied a Gaussian density neural network and a quantile regression neural network to forecast the ONI, for lead times of up to 21 months. Their input features include previous ONI values, the volume of water above the 20 °C isotherm in the tropical Pacific, the dipole mode index, the zonal wind stress anomaly, the sea surface height anomaly, and the season. Their model's accuracy fell for longer lead times, especially during El Niño and La Niña periods. Guo et al. [44] used a long short-term memory (LSTM) network to predict the ONI as a time series based on its previous values, after filtering out the noise using ensemble empirical mode decomposition. The LSTM requires each data sample to be vectorized. They trained an autoencoder with a 1-D time series of ONI values as its input and output, and used the hidden layers of the trained autoencoder as the vectorized version of the ONI values, to be input to the LSTM. Their model outperformed traditional time-series forecast models, for lead times of up to a year.
The main difference between our work and the aforementioned studies is that we structure each feature (or predictor) spatially in a map and feed these feature maps to the prediction model, rather than a feature's average over the entire region or a set of its values with no spatial arrangement. We then apply a CNN that learns not only the relationship between the output (ENSO events) and the different features but also the spatial structure of the features, in a hierarchical framework, and its relationship with the output, where there is a one-year gap between the input features and the output.

III. DATA DESCRIPTION
The ONI and oceanic-atmospheric variables are available from 1949 to 2014, with one observation per year per data point. The data are recorded at 4697 different data points, positioned from 180°E to 70°W (i.e., 290°E) longitude and from 65°S to 85°N latitude, with 2.5° longitudinal and latitudinal spacing, displayed in Fig. 1. Therefore, the dataset contains a total of 310,002 records. Multiple measurements are available at each data point throughout three consecutive months: December, January, and February. However, only the average of each variable over the three months is reported for each year. For example, the sea level pressure (SLP) is measured at a data point from December 1990 to February 1991, yet only the average SLP over these three months is available in the dataset, reported as the SLP in January 1991 for that specific point.
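As a quick sanity check, the record count follows directly from the grid size and the time span stated above (a sketch; the figures are taken from the text):

```python
# 4697 grid points, one averaged observation per point per year, 1949-2014.
n_points = 4697
years = range(1949, 2015)          # both endpoints inclusive -> 66 years
n_records = n_points * len(years)
print(len(years), n_records)       # 66 310002
```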
At each data point, the detrended anomalies of the following variables are available.
1) SST in °C, obtained from version 3b of the extended reconstructed SST dataset [45].
2) SLP in hectopascals (hPa), obtained from the National Centers for Environmental Prediction (NCEP) Reanalysis 1 dataset [46].
3) Zonal wind stress, known as the U component of the wind stress (UFLX), in pascals (Pa), obtained from the NCEP Reanalysis 1 dataset [46].
4) Zonal wind speed at 10 m, known as the U component of the wind (UWND), in m/s, obtained from the NCEP Reanalysis 1 dataset [46].
5) Meridional wind speed at 10 m, known as the V component of the wind (VWND), in m/s, obtained from the NCEP Reanalysis 1 dataset [46].
SST and UFLX fields are defined only over the ocean, whereas SLP, UWND, and VWND are defined over the entire domain: land and ocean.
An El Niño happens when SSTs in the central and eastern tropical Pacific are above average and the tropical easterly winds weaken. In other words, El Niño is a phenomenon in the equatorial Pacific Ocean characterized by an ONI value of greater than 0.5 °C [24]. A La Niña happens when SSTs in the eastern tropical Pacific are below average and the tropical easterly winds strengthen. In other words, La Niña is a phenomenon in the equatorial Pacific Ocean characterized by an ONI value of less than −0.5 °C [24]. We converted the ONI, available in the dataset, to a categorical variable with three categories: below −0.5, which indicates a La Niña; between −0.5 and 0.5; and above 0.5, which indicates an El Niño.
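The three-way categorization above maps directly to the 0.5 °C thresholds; a minimal sketch (the function name and class labels are illustrative, not taken from the dataset):

```python
def oni_category(oni: float) -> str:
    """Classify an ONI value into the three classes used in this study [24]."""
    if oni > 0.5:
        return "El Nino"
    if oni < -0.5:
        return "La Nina"
    return "Neutral"

print(oni_category(1.3), oni_category(-0.9), oni_category(0.2))
```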
Some studies have suggested the existence of two types of El Niño: 1) the traditional El Niño or EP El Niño and 2) the CP El Niño, also referred to as El Niño Modoki or date-line El Niño. The EP El Niño is distinguished from the CP El Niño based on its SST anomaly pattern [47]–[54] and global climate impacts [55]–[66]. Other studies have suggested that the CP La Niña can also be distinguished from the traditional or EP La Niña by analyzing the ocean surface currents [67] and climate impacts [68], such as the spatial distribution of the SST anomaly [69].
In this study, we distinguish between EP and CP ENSO events based on the geographical location of the largest-amplitude sea surface temperature anomalies. When the anomalies are concentrated toward the eastern portion of the tropical Pacific basin, the ENSO event is referred to as EP, and when they are concentrated in the central portion of the basin, it is referred to as CP. The EP and CP ENSO events are estimated based on two types of ONI, which are separately reported in the dataset, one across the EP (120°W–145°W) and one across the CP (145°W–170°W). Details of how these indices are calculated are described in [70].
Although it is hard to separate EP and CP ENSO events, especially La Niña events, based only on the ONI, as indicated by Song et al. [71], these indices offer a close enough estimate of the events for this study's scope. Table I displays the presence or absence of ENSO events each year.
The objective is to use the entire set of measurements (SST, SLP, UFLX, UWND, and VWND) across the data points from one year to determine the type of the ENSO event, if any, in the next year. It is noteworthy that this is a classification problem, because only the ONI category, rather than its exact value, is forecasted. SST has been used in almost all previous ONI forecast models. SLP and wind-related measures have been shown to impact ONI events [72]–[74].
SST is missing for 25% of the records. Two strategies for addressing SST's large number of missing values are compared in our experiments: dismissing this predictor entirely and filling its missing values with the average SST.

IV. METHODOLOGY
Values of the five features (SST, SLP, UFLX, UWND, and VWND) have different ranges. Therefore, all features are standardized using (1), where x denotes a feature and x_i denotes an individual value of this feature at one data point; that is, each value is rescaled as x̂_i = (x_i − min x)/(max x − min x), so that all standardized features range from 0 to 1.

The data are only available in a plain text file, where each record contains the environmental variables at one data point and one year, the geographical coordinates of the point, the year, and the ONI. In a conventional machine learning problem, the features at each record would be the inputs to the machine to predict the output, which is whether or not an El Niño or La Niña happened. However, this approach ignores the structure of this spatial-temporal dataset and the proper role of location and time in the forecast. The most important property of this dataset is that the presence or absence of an ENSO event is not determined per data point but as a whole, across the tropical Pacific. In other words, all records from the same year in the dataset have the same output, regardless of geographical coordinates. Consequently, the predictions should be made across the tropical Pacific, rather than at each data point separately. This also means that all the records in the dataset, across all data points, for the same year must be used as input features at the same time, rather than only one record. In other words, a sample is not an individual record in the dataset; all the records belonging to the same time constitute one sample. Since the dataset ranges from 1949 to 2014, there are 66 samples. The simplest approach would be to merge all the observations from different grid points for the same year into one vector and consider that as the input to the machine [75]. This is equivalent to taking the pixels of an image apart, putting them next to each other in a vector, and classifying the vector rather than the original matrix.
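The standardization step can be sketched as min–max scaling, inferred from the stated 0-to-1 range of the standardized features:

```python
import numpy as np

def minmax_scale(x: np.ndarray) -> np.ndarray:
    """Rescale a feature so that all of its values lie in [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

slp = np.array([1008.0, 1013.0, 1018.0])   # toy SLP values in hPa
print(minmax_scale(slp))                    # [0.  0.5 1. ]
```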
This approach would miss the spatial structure of the dataset. Instead, we construct an image for each sample, where the number of channels in the image equals the number of features (five here). Our program reads through the dataset and constructs a grid/matrix for each feature (e.g., SST), based on the geographical locations of the data points (see Fig. 2). This grid represents the values of that feature across all data points at one snapshot in time. A separate grid is created for each feature. By stacking these grids, a tensor of features is created for that specific time. This tensor, or image, is used as the input to the machine to predict whether or not an ENSO event will happen.
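The gridding step can be sketched as follows (an illustrative sketch; the record field names are assumptions, not the dataset's actual column names):

```python
import numpy as np

def build_feature_tensor(records, lats, lons, features):
    """Arrange one year's flat records into a (lat, lon, feature) tensor.

    Each record is assumed to be a dict with 'lat', 'lon', and one entry
    per feature; grid points with no value stay NaN (e.g., SST over land).
    """
    lat_idx = {v: i for i, v in enumerate(lats)}
    lon_idx = {v: i for i, v in enumerate(lons)}
    tensor = np.full((len(lats), len(lons), len(features)), np.nan)
    for rec in records:
        i, j = lat_idx[rec["lat"]], lon_idx[rec["lon"]]
        for k, name in enumerate(features):
            tensor[i, j, k] = rec.get(name, np.nan)
    return tensor
```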
The last point is that predicting whether or not an ENSO event happens in a specific year, based on the data from the same year, is not the same as forecasting an ENSO event for the next year. Our objective is to forecast the occurrence of El Niño and La Niña for the following year based on data from the current year; for instance, forecasting whether or not an El Niño or La Niña would happen in 1950 based on atmosphere-oceanic factors in 1949. However, in the original dataset, the ONI reported at each record is for the same year in which the other environmental variables are observed. Therefore, after constructing the feature tensor for a specific year, we ensure that the corresponding output for that sample is the occurrence of El Niño or La Niña in the next year. In other words, the predictand (the ENSO event) and the predictors are lagged by a year. As a result, although the data are available for 66 years (from 1949 to 2014), the number of training samples is 65. This is because, while we have the output (the occurrence of an ENSO event) for 1949, we do not have the feature tensor for 1948, and while we have the feature tensor for 2014, we do not have the output for 2015.
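The one-year lag between predictors and predictand amounts to a simple pairing of years, which also reproduces the count of 65 samples:

```python
# Feature tensors from 1949..2013 are paired with ENSO labels from
# 1950..2014, yielding 65 (input, output) samples.
years = list(range(1949, 2015))
pairs = list(zip(years[:-1], years[1:]))   # (predictor year, predictand year)
print(len(pairs), pairs[0], pairs[-1])     # 65 (1949, 1950) (2013, 2014)
```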
CNNs have outperformed other neural network architectures, machine learning models, and image processing approaches in object detection and image classification [76]–[81]. This is mostly because a CNN does not depend on hand-crafted visual features and has excellent abstraction and semantic capabilities. A CNN makes strong and mostly correct assumptions about the nature of images: locality of pixel dependencies and stationarity of statistics. Therefore, a CNN has far fewer connections and parameters than an SNN with similar-size layers, which makes it easier to train. The CNN, originally designed for image classification, is the most suitable prediction model for the problem at hand: it is designed to hierarchically capture the spatial variation of feature values across the image. Due to the similarity of images and spatial data, this enables the CNN to capture spatial autocorrelation [82], which states that "data from locations near one another in space are more likely to be similar than data from locations remote from one another" [83]–[85].
AlexNet [86] is a deep CNN architecture with eight layers, which won the ImageNet Large Scale Visual Recognition Challenge in 2012; we use this model in our study. AlexNet includes five convolutional layers followed by three fully connected layers (as shown in Fig. 2). The settings for the five convolutional layers are: l_1 = 11 × 11 × 5, t_1 = 4, c_1 = 96; l_2 = 5 × 5 × 96, t_2 = 1, c_2 = 256; l_3 = 3 × 3 × 256, t_3 = 1, c_3 = 384; l_4 = 3 × 3 × 384, t_4 = 1, c_4 = 384; l_5 = 3 × 3 × 384, t_5 = 1, c_5 = 256, where l_k, t_k, and c_k denote the size, stride, and number of filters in the kth layer, respectively. Of the three fully connected layers, the first two have 4096 neurons each and the third has as many neurons as the number of classes. Results for networks with a reduced number of convolutional layers are also reported in the following section. Pooling is a function that shrinks a group of neighboring neurons into one neuron. Each pooling layer in our network is of size 3 × 3 with a stride of 2. This size is proposed by Krizhevsky et al. [86] and Zeiler and Fergus [87]: a small pooling window results in overfitting, whereas a large window would introduce noise into training [87]. Because the stride is smaller than the pooling region's size, the pooling windows overlap. According to Krizhevsky et al. [86], overlapping pooling windows reduce overfitting and boost the generalization accuracy. The first, second, and fifth convolutional layers in our network are followed by pooling layers. The maximum pooling function, which takes the maximum value in the kernel map as the output and is conventionally used in AlexNet [86], is employed in our network. In AlexNet, the rectified linear unit (ReLU) [88], [89] is the activation function at all layers but the last one; softmax is the activation function for the last layer. It is noteworthy that the pooling layer, if any, follows the activation function. Fig. 2 displays the structure of the CNN for forecasting El Niño and La Niña based on environmental variables.
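The weight counts implied by the filter settings quoted above can be checked directly (a sketch; biases and the fully connected layers are omitted):

```python
# (filter size, input depth, number of filters) for conv layers 1-5,
# taken from the l_k and c_k settings in the text.
conv_layers = [
    (11, 5, 96),     # l_1 = 11x11x5,  c_1 = 96
    (5, 96, 256),    # l_2 = 5x5x96,   c_2 = 256
    (3, 256, 384),   # l_3 = 3x3x256,  c_3 = 384
    (3, 384, 384),   # l_4 = 3x3x384,  c_4 = 384
    (3, 384, 256),   # l_5 = 3x3x384,  c_5 = 256
]
weights = [k * k * d * c for k, d, c in conv_layers]
print(weights)
```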
Two architectural features of a CNN, compared to an SNN, not only make it more suitable for spatial data classification but also significantly improve its training efficiency, reduce its computational complexity, and consequently reduce the size of the required training data.

A. Partial Connectivity Rather Than Full Connectivity
A node in a CNN is connected only to a small number of nodes in the previous layer, whereas the same node in an SNN is connected to all nodes in the previous layer. This means that far fewer synaptic weights need to be calculated in a CNN than in an SNN. Assume we use a 3 × 3 convolution window in the CNN, shown on the left side of Fig. 3. This means a node in the first hidden layer, for instance, is only connected to nine pixels in the image. The same node in the SNN, shown on the right side of Fig. 3, is connected to all 270 pixels of the image. In other words, the number of synaptic weights is 30 times smaller in the CNN than in the SNN. Of course, this number depends on the size of both the image and the convolution window. If the image is n × m and the convolution window is z × z, the number of synaptic weights in the CNN is (n × m)/z² times smaller than in the SNN. We showed this only for the first hidden layer, but the same is true for all convolutional hidden layers. This has two advantages. First, far fewer unknown parameters (synaptic weights) can be learned more quickly (less computational complexity) and accurately by the machine, with a significantly reduced chance of overfitting. Second, deriving the value of each node in the next layer from only a small number of neighboring pixels, rather than the entire image, rests on the assumption that the relationship between two distant pixels is probably less significant than that between two close neighbors. This assumption holds for spatial data.
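The 30x reduction follows directly from the pixel and window counts (n and m here are one factorization of the 270-pixel image in Fig. 3, chosen only for illustration):

```python
n, m, z = 15, 18, 3      # an image of n*m = 270 pixels, 3x3 convolution window
snn_weights = n * m      # one SNN node connects to every pixel
cnn_weights = z * z      # one CNN node connects only to a z x z neighborhood
print(snn_weights // cnn_weights)   # 30, i.e., (n * m) / z**2
```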

B. Shared Weights
We mentioned that n × m synaptic weights need to be learned for one node in the first hidden layer of an SNN. With k nodes in the first hidden layer, a total of n × m × k synaptic weights must be calculated, because each node in the first hidden layer has its own synaptic weights, which are different from those of other nodes. In a CNN, however, the number of synaptic weights that need to be learned remains z², because nodes in the first hidden layer do not have different synaptic weights but share the same weights. Therefore, regardless of how many nodes exist in the first hidden layer, only z² weights need to be learned. Consequently, the number of synaptic weights in the CNN is (n × m × k)/z² times smaller than in the SNN for the first hidden layer. This is referred to as the weight-sharing property and is depicted in Fig. 4. Although this explanation concerned the first hidden layer, it is true for all convolutional hidden layers. This property gives the CNN two advantages over the SNN: the first is even fewer parameters for the machine to learn, and the second is enabling the CNN to look for certain objects in any part of the image, regardless of their position.
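Weight sharing can be illustrated with a minimal stride-1 convolution, where one z × z kernel is reused at every output position (a sketch, not the network's actual implementation):

```python
import numpy as np

def shared_weight_conv(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Apply one z x z kernel at every valid position (stride 1, no padding).

    Every output node reuses the same z**2 weights, so the parameter
    count is independent of the number of output nodes.
    """
    z = kernel.shape[0]
    h, w = image.shape
    out = np.empty((h - z + 1, w - z + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + z, j:j + z] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
out = shared_weight_conv(image, np.ones((2, 2)))
print(out.shape)   # (3, 3): 9 output nodes, yet only 4 shared weights
```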

V. RESULTS AND DISCUSSION
Tables II and III report the tenfold cross-validation accuracy, precision, recall, and F1 score for forecasting El Niño and La Niña in the CP and EP. We experimented with different numbers of convolutional layers and evaluated the CNN's accuracy. We also evaluated the CNN's accuracy under two scenarios for addressing SST's missing values: dismissing SST as a predictor and replacing its missing values with the average SST, which is zero. The highest accuracy achieved in each category is shown in bold.
Overall, CNN with one convolutional layer performed best. CNN with one convolutional layer produced the highest accuracies in forecasting CP-El Niño, EP-El Niño, and CP-La Niña. CNN with five convolutional layers produced the highest accuracies in forecasting EP-La Niña.
Two approaches were compared for handling SST's missing values: dismissing SST as a predictor and replacing SST's missing values with the average SST. Table IV summarizes which strategies produced the highest accuracies in forecasting each type of ENSO event. The first approach produced a higher accuracy in forecasting CP-El Niño and EP-La Niña. The second approach produced a higher accuracy in forecasting EP-El Niño and CP-La Niña. However, this has more to do with when the SST values are missing than with the SST variable itself. In other words, the reason that dismissing SST as a feature results in a higher accuracy in forecasting CP-El Niño and EP-La Niña is not that SST is a bad predictor, but that the SST values are missing at critical times, where replacing them with the average SST confuses the training process. Table V displays the highest accuracies achieved in forecasting each type of ENSO event. All models have high overall accuracies because of the larger size of the negative class. In other words, as long as a model forecasts no events in all cases, it will achieve an overall accuracy of better than 66%. For the same reason, all models have higher precision than recall, because they are better trained for the negative class than for the positive class. The low recall means that the forecast model would miss the majority of ENSO events and would not be able to forecast them. The high precision means that if the model forecasts an ENSO event, it will most probably happen. In this research, we avoided undersampling or oversampling of classes, as these dataset manipulations expose the forecast model to other sources of bias and undermine its accuracy in the face of real-world challenges. Additionally, undersampling is not the best method for small datasets such as ours.
Duplicating samples from the small class would not have much impact on CNN's accuracy as it is already trained multiple times with each sample over the training epochs.
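The roughly 66% accuracy of a trivial all-negative forecaster follows from the class sizes (using the largest positive class, the 22 CP-El Niño events among the 65 samples):

```python
# An all-negative forecaster on the most frequent positive class:
# 22 CP-El Nino events among 65 samples.
n_samples, n_positive = 65, 22
accuracy = (n_samples - n_positive) / n_samples
recall = 0.0                        # no positive event is ever forecast
print(round(accuracy, 3), recall)   # 0.662 0.0
```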
The highest recall was achieved in forecasting CP-El Niño (23%), meaning that the model missed 77% of CP-El Niño events. The highest precision was achieved in forecasting EP-El Niño (100%), meaning that the model never falsely forecasted an EP-El Niño. The F1 score is the harmonic mean of recall and precision. The highest F1 score was achieved in forecasting CP-El Niño, followed by CP-La Niña, EP-La Niña, and EP-El Niño. This order matches the number of events from each class: there are 22 CP-El Niño events in the dataset, 20 CP-La Niña events, 18 EP-La Niña events, and 13 EP-El Niño events. This indicates the strong impact of the number of samples from the positive class on the forecast accuracy and endorses the potential of the CNN, with the proposed predictors, to achieve even higher accuracies in forecasting ENSO events with a one-year lead time, given a larger dataset.

VI. CONCLUSION AND FUTURE DIRECTIONS
The seemingly low accuracy of the proposed model in forecasting ENSO events for the next year should be considered in light of three facts: the training dataset is small, with only 65 samples; SST (an important predictor) was missing for 25% of the records; and forecasting ENSO events with a one-year lead time using machine learning has proven to be a challenging task in the literature. This research focused on restructuring the spatial-temporal dataset and designing a proper machine learning model that takes into account the idiosyncrasies and nuances of forecasting ENSO events. Not only was a high precision achieved in the forecasts, but it was also shown that the proposed model has the potential to achieve higher recall if more samples from the positive class become available. It is noteworthy that, despite its low recall, the high precision of our model makes it of significant value in climatology and disaster management. Our future work will focus on feature selection, experimenting with other predictors that are available over more years, and designing other machine learning models for this problem. For instance, upper OHC, or warm water volume, has been shown to be a strong predictor of the EP types of ENSO events [90], [91]. Additionally, two recent studies have indicated that Arctic sea ice variation over the Greenland-Barents Seas in winter has a significant impact on the occurrence of ENSO events in the next year [92], [93]. Therefore, it is a potential predictor of ENSO events with a lead time of one year.