A windowed correlation based feature selection method to improve time series prediction of dengue fever cases

The performance of data-driven prediction models depends on the availability of data samples for model training. A model that learns about dengue fever incidence in a population uses historical data from that corresponding location. Poor performance in prediction can result in places with inadequate data. This work aims to enhance temporally limited dengue case data by methodological addition of epidemically relevant data from nearby locations as predictors (features). A novel framework is presented for windowing incidence data and computing time-shifted correlation-based metrics to quantify feature relevance. The framework ranks incidence data of adjacent locations around a target location by combining the correlation metric with two other metrics: spatial distance and local prevalence. Recurrent neural network-based prediction models achieve up to 33.6% accuracy improvement on average using the proposed method compared to using training data from the target location only. These models achieved mean absolute error (MAE) values as low as 0.128 on [0,1] normalized incidence data for a municipality with the highest dengue prevalence in Brazil's Espirito Santo. When predicting cases aggregated over geographical ecoregions, the models achieved accuracy improvements up to 16.5%, using only 6.5% of incidence data from ranked feature sets. The paper also includes two techniques for windowing time series data: fixed-sized windows and outbreak detection windows. Both of these techniques perform comparably, while the window detection method uses less data for computations. The framework presented in this paper is application-independent, and it could improve the performances of prediction models where data from spatially adjacent locations are available.


INTRODUCTION
A CCURATE time series prediction of dengue fever outbreaks can be useful in planning mitigation strategies for hundreds of tropical and subtropical regions around the world. Data-driven models such as neural networks are flexible in design and can ease the difficulties of estimating unknown parameters that mechanistic models often require [1]. However, the prediction accuracy of such models depends on the quality and the quantity of training data. For dengue fever outbreaks, the availability of incidence data varies across regions. A data aggregation center in a region may not have adequate data to achieve an acceptable level of accuracy in out-of-sample projections. In such cases, selected incidence data from adjacent centers in the same region could improve model performance as additional features. We propose a framework with quantitative methodologies to rank and select nearby case data as supplementary features. Our method uses windowed timelagged cross-correlation combined with distance and prevalence metrics to identify relevances and potential causal Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. The conclusions in this report are those of the authors and do not necessarily represent the views of the USDA. USDA is an equal opportunity provider and employer.

relationships.
Dengue virus is primarily spread by several species of mosquito vectors (Aedes aegypti and Aedes albopictus), and the outbreaks infect 390 million people every year [2]. The viral strains also cause about 40,000 annual deaths with hemorrhagic fever, and dengue shock syndrome [3]. Dengue virus transmission is prevalent in regions where competent vector mosquitoes are present. In those regions, the mosquito abundance varies throughout the year and depends on factors including air temperature, precipitation, vegetation, and urbanization [4], [5], [6]. The availability of extensive data on these factors makes statistical and machine learning analyses feasible. However, researchers experience missing data, uneven reporting intervals, lack of granularity, inadequacy, and inaccuracy with dengue case counts for many regions around the globe. These factors limit the prediction performance of data-driven models. To complicate the situation further, time series outbreak data for most diseases are non-stationary (e.g., the underlying processes evolve with time). In many regions, climatic variations affect Aedes vector populations [7]. In addition to that, outbreak start times and sizes may vary because of imported cases caused by short and long-range mobility. Cocirculation of multiple viral strains adds to the complexity. Despite those issues, correlations exist between outbreaks in adjacent human populations (county, municipality, district, etc.) for most communicable diseases, including dengue [8]. While a correlation may not always imply causation, the use of incidence data from adjacent regions can improve the model training because of similarities in meteorological arXiv:2104.10289v1 [cs.LG] 21 Apr 2021 factors, host population density, and a high probability of mobility-based viral transmissions.
For sequential data (e.g., multivariate time series), different methods of feature selection have been used in the past including correlation-based filters [9], [10], Granger causality tests [11], [12], genetic algorithm [13], and several other methods [14], [15]. The applications include forecasts of electrical energy consumption, meteorological variables, financial markets, etc. Correlation-based methods are widely used in machine learning problems [16], [17]. Few works extend feature sets for disease outbreak prediction using incidence data from spatially adjacent locations. Such geospatial clustering techniques rely on similarity metrics to improve model performance. A recent work [18] uses pair-wise correlation to cluster location data to train models for COVID-19 outbreak projection in Chinese provinces. Another work targeted towards dengue also uses correlationbased similarity measures to extend the training feature set [19]. However, these implementations assume an instantaneous correlation of incidence data between regions and do not consider the temporal order of outbreaks (whether one location is leading or lagging the other). There can be a considerable amount of time delay between outbreaks occurring in adjacent regions. Time lagged cross-correlation can help identify such phase relations [20]. The phase information may help quantify relationships between outbreaks of adjacent locations. We hypothesize that if the temporal incidence dynamics of one outbreak lead the incidence dynamics of another, the former outbreak might have a causal influence on the latter, given that these places lie spatially close enough to affect one another, and the outbreaks are of a reasonable size. We weigh the available features (incidence data) based on a combination of these factors: leading phase correlation, distance, and prevalence. Because of the population's dynamically changing immunity patterns caused by the cocirculation of multiple dengue virus strains, the regions experiencing large outbreaks may also evolve. A single computation of the correlation coefficient over the entire timeline of data may mislead the analysis. Hence, splitting the time series into multiple windows and comparing sequences at each time window for correlation and phase analyses can provide better insights into the dengue outbreak's seasonal patterns. A limitation with some existing methods is that clustering large data sets for model training can make the process complex and inefficient, while unrelated features may deteriorate prediction performance. This phenomenon is known as the curse of dimensionality [21], [22]. This work aims to reach optimal model performance with the smallest subset of highly relevant features based on their ranks. There are two broad categories of feature selection methods: filters and wrappers [23]. Wrapper approaches [24] are computationally expensive as the search space for optimal feature subsets increases exponentially with the number of available features. Our method is primarily a filter approach to rank features.
In this work, we present a framework of feature enhancement for data-driven prediction of dengue outbreaks. To achieve that goal, this paper details: i) a windowed time-shifted cross-correlation method to compute correlation weights, ii) two correlation-window allocation methods, iii) a procedure of ranking feature using metrics based on correlation, distance, and prevalence, and iv) analysis of prediction performance across windowing schemes, prediction models, and spatial aggregations. This work's novel contribution lies in how we interpret and process incidence data to compute correlation metrics using our knowledge of how outbreaks spread.

Definitions
In supervised learning problems, we collect data on multiple variables. A target variable (label) is the designated output of a machine learning model for prediction. A feature is an input variable that is expected to influence the target variable. Each instance of data (e.g., a point in time) contains several feature values and usually a single label value. A data set comprises many such instances. For a supervised learning problem, a data set is split into 3 subsets in order to: train, evaluate, and test the models. With the training subset fed in batches (collection of instances), a supervised model learns to predict targets based on features in an iterative process. It optimizes parameters by minimizing a loss function. A neural network comprises artificial neurons (cells) in one or multiple layers. Each neural cell is a node in the network with connections to other nodes across layers, inputs, or outputs. The recurrent neural networks (RNN) are special neural cells that can store internal states in their memory, which helps them predict sequence data (data points that are temporally related) better. Model training aims to get optimal parameter values to generalize beyond the training data and perform well with test data. Sometimes, a model can over-fit the training data and perform poorly with unseen test examples. To prevent such scenarios, we use regularization techniques.

Time series prediction of outbreaks
Time series forecasting is a popular research area because of its applicability in many disciplines, such as forecasting weather patterns, stock prices, market trends, and resource allocation. In epidemiology, these methods enable the prediction of future outbreaks by fitting models with past disease incidence data and carefully chosen covariates. Such predictions come at varying degrees of accuracy and depend on the characteristics of the target variable, quality of sample data available for fitting, the covariates (predictors) being used, and the models themselves. There is rarely a single model that works best for every application. Classical forecasting methods such as exponential smoothing, autoregressive integrated moving average (ARIMA), and seasonal autoregressive integrated moving average (seasonal ARIMA) have been widely used to predict time series data [25]. The ARIMA model can handle nonstationary data, which is an important advantage. Besides that, the seasonal ARIMA model can incorporate repeating patterns in the data to enable forecasting of diseases that show seasonal patterns [26]. However, these models have tendencies to follow the mean values of past data, and it is not easy to associate these with rapidly changing processes [27]. In addition to that, many classical models require manual tuning of their parameters and may fail to capture complex nonlinear interactions. Data-driven forecasting of vector-borne diseases such as dengue fever is complicated due to complex interactions of several factors with disease dynamics. A list of these factors include but is not limited to seasonally dependent Aedes mosquito growth and feeding patterns [28], co-circulations of multiple viral strains [29], environmental (e.g., temperature) effects on dengue virus transmission [30], and human mobility patterns [31]. Neural networks can automatically interpret features from observable variables and can model complex nonlinear phenomena. Hybrid methodologies that combine neural networks with classical models (e.g., ARIMA) are also popular and have been used for dengue outbreak forecasting [32]. In recent times, long-short term memory (LSTM) networks [33] (a type of recurrent neural networks), and its derivatives [34], [35] have shown a considerable amount of success in predicting sequential data and has frequently outperformed other classical and machine learning methods [36]. These recurrent architectures have also performed well to predict disease outbreaks [37], [38], [39]. Hence, we consider these viable candidates to test the performance of the proposed feature enhancement framework in this paper.

Factors that affect dengue disease dynamics
Dengue virus is primarily spread by female mosquito vector species: (Aedes aegypti and Aedes albopictus). Hence, the outbreaks depend on the abundance of such vectors. The relationships between Aedes mosquitoes and environmental variables (i.e., temperature, rainfall) are already well characterized by many researchers. Environmental temperature affects the growth, host-seeking, blood-meal intakes of mosquitoes. Aedes aegypti cannot develop below 16 • C or above 34 • C [40]. Within that range, the development from larva to adult was found to be faster at higher temperatures (30 • C) compared to lower temperatures (21 • C) [41]. Aedes albopictus can develop in wider temperature ranges and can survive better in lower temperatures [42] compared to Aedes aegypti. The optimum flight temperature for Aedes aegypti females was found to be 21 • C [43]. Studies have observed that a large diurnal temperature range decreases female fecundity [44]. Temperature fluctuations also affect extrinsic incubation periods (EIP) of dengue viruses. An experiment with DEN-2 strain found that EIP was 12 days at 30 • C and reduced to 7 days for 32 • C and 35 • C [45]. Besides temperature, rainfall has a significant role in dengue outbreaks. Rainwater stuck in different places creates breeding spaces for Aedes mosquitoes. Previous works have studied the association of dengue transmission with rainfall [4], [46]. In this work, we use several variables, including observed and reanalyzed temperatures, precipitations, relative humidity, and surface-level pressure. These can directly or indirectly affect mosquito vector suitability and dengue outbreak dynamics.

Data acquisition
To test the proposed framework for dengue outbreak predictions, we collect data for several regions of Brazil. The InfoDengue project [47] monitors outbreak data on over 700 municipalities of Brazil. Their server contains weekly dengue fever case counts with a surveillance period starting from 2010. Besides dengue incidence data, we collect weather observation data from NOAA (National Oceanic and Atmospheric Administration) ground weather station database and reanalysis data from the NCEP /NCAR Reanalysis 1 dataset published by the NOAA physical sciences laboratory (PSL) database (NCEP and NCAR stand for National Centers for Environmental Prediction and National Center for Atmospheric Research, respectively). We list all the variables in Table 1. We use the weekly case counts as labels and further process the remaining variables to use those as baseline features.

Re-sampling and feature engineering
The available raw data cannot be readily used in the machine learning methods. All the features and labels are matched and aligned in both spatial and temporal dimensions. We align data based on available labels (i.e., case data). For each municipality of Brazil (smallest spatial unit available), we search for the nearest ground weather station to collect weather data. We extract reanalysis data from NCEP/NCAR Reanalysis 1 data sets using each municipality's centroid's coordinates. Once the spatial granularity is taken care of, we fix the temporal dimension mismatches by converting all data to match incidence data resolution. As incidence data are available in weekly intervals and the remaining variables are available daily, this process involves context-aware down-sampling of those remaining variables (temperature, precipitation, humidity, etc.). During this process, we derive 4 additional features from observed weather data of ground stations: average diurnal temperature range of the week, minimum diurnal temperature range of the week, maximum diurnal temperature range of the week, and the number of rainy days in the week. We compute these from daily observed temperature and precipitation data. In total, we have 12 feature variables related to weather and environment: i) 4 observed variables: temperature (average, minimum, and maximum) and precipitation, ii) 4 derived variables based on the time interval (week): diurnal temperature range (average, minimum, and maximum) and the number of rainy days, iii) 4 reanalysis variables: temperature, relative humidity, pressure, precipitable water (all are averages and at earth surface level).

Data splitting and normalization
The complete set of data is a two-dimensional array X with features on one dimension and time on the other. The set of features consists of: i) the variables listed in section 2.4.2, ii) dengue incidence data of the target location, and iii) dengue incidence data of locations selected as predictors by the methods presented in this paper. We split X along its time dimension into three parts: training (X train ), validation (X val ), and test (X test ). The validation set is required during the training phase as we implement early stopping [48] as a regularization mechanism. We normalize all three sets of data before model training and evaluation. The normalization formula is the following, All three data sets (training, validation, and test) are normalized using the mean and the standard deviation computed from training data. The complete data set's summary statistics are not used here to prevent the machine learning models from gaining statistical insights about validation and test data sets.

Sequence model specifications
Whether it is training or evaluation, the time series data are split into smaller batches and fed into the recurrent neural network models. From a macroscopic perspective, a sliding window moves over batches of data. Because of the sequential nature of dengue case data, the batches are fed according to the time order without randomization. The sliding window is configured with two integer parameters: input length (t in ), and output length (t out ). The window is depicted in Figure 1. A trainable model would take t in time steps of feature data as input and predict t out time steps of target/label data (e.g., dengue case counts) as outputs every iteration. In the configurations used in this paper, there is no temporal overlap between input and output sequences. In this configuration, the models make single shot projections (all the t out data points are predicted at once every iteration).
Each batch (window) of data is (t in + t out ) steps long in the time dimension. A total of 32 batches are stacked together for model training and evaluation. One batch differs from another by a single time-step (hence, there are temporal overlaps between batches). Each batch is further split in time and data (variable) dimensions to separate inputs (t in of features) and outputs (t out of labels).
This work focuses on performance gains obtainable using recurrent neural network (RNN) models due to their proven track record in predicting time series data [36]. A recurrent unit's temporal behavior is illustrated in the lower part of Figure 1. We can see that information is passed through time (also called cell state), enabling the model to Input Output t in t out predict values based on insights gained from past inputs. We use two popular recurrent neural network cell types: LSTM (long-short term memory) [33] with forget gates [49] and GRU (gated recurrent unit) [50]. We also test with a simple linear neural network model as a trivial baseline. Using the TensorFlow [51] package, we configure the models as described below to produce the results: We initialize the weight metrics of the models as zeros in the beginning. Only the recurrent models (LSTM and GRU) can train and predict based on entire input sequences. The Linear model predicts based on the last input time step. While training, we use the mean squared error (MSE) as the loss function to optimize the model using Adam optimizer [52]. For predictions, we use the mean absolute error (MAE) metric to evaluate model performance. We regularize our training process with the early stopping [48] mechanism, which monitors loss within validation data and stops training if performance does not improve. Based on our tests on different data sets, we configure the training to run for 120 iterations (epochs).

METHODS
To describe the methodology, first, we define our spatial units. We designate the term infection center (IC) to indicate a spatial building block of the model. An IC is a point in the space (regional map) where observed or estimated incidence data on disease outbreaks are available. The spatial granularity of an IC is not fixed for the model. It can be a country, a state, a city, a suburb, or an administrative unit with some resolution in space based on available disease incidence data. The basic structure of our proposed framework is shown in Figure 2. The circles inside the shaded region are the infection centers. One of the infection centers, marked as IC, indicates the target infection center where we intend to make predictions of a designated label (e.g., dengue cases). The map's remaining infection centers are marked as peripheral infection centers (P IC). These are locations where similar observations on the designated label of the target IC are available. The temporal resolution of the IC and the P ICs are aligned before performing any comparative analysis of the data. Some locations may have daily, weekly, or monthly observations. Some data may need re-sampling in the time domain before they can be compared (e.g., convert daily weather data to weekly values). The target infection center is marked as IC, while the k th peripheral infection center is P IC k . Using the proposed windowed crosscorrelation method (section 3.2), a phased cross-correlation matrix is computed (I), which is eventually reduced to a correlation weight, γ C (k) (II). We also compute a prevalence weight, γ P (k), using cumulative case data (III) and a geographic distance weight, γ D (k), using location data (IV). The three weight metrics are combined to compute (V) the predictor metric of P IC k , Γ(k). The P ICs are ranked using these Γ values, and their incidence data are selected accordingly to be stacked together with the IC feature set (VI).
Assuming that there are N peripheral infection centers (P IC) on the map. Once we match the spatial and the temporal dimensions of the label data (e.g., weekly dengue cases in a city), we use a windowed-time shifted crosscorrelation analysis on each IC − P IC k pair (for all k ∈ N ) and compute a correlation weight, γ C (k). We also consider the cumulative cases of each P IC and compute a prevalence weight, γ P (k). Finally, the geodesic distance of each IC − P IC k pair is taken into account in the form of a distance metric, γ D (k). All three metrics are normalized and lie within the range [0, 1] for the selected region. These are combined as following to compute a predictor strength metric for each P IC k ,

Windowing incidence data
We compare the time series of disease incidence data (e.g., weekly cases per 100k people) to find correlations. The comparisons are made for all IC−P IC k pairs with available data for a given region. The key intuition behind this approach is, if a P IC in the region has an infectious outbreak at some point in time, t = t 0 , this may initiate or affect the course of an outbreak for the target IC at t ≥ t 0 .
A leading P IC outbreak may not always imply causal influence depending on the geographic location and population behavior. Despite that, a P IC having an outbreak will influence adjacent ICs, as it acts as an infection source. This also assumes that a strict isolation measure is not in place and the control measures are not 100% effective due to the vector-borne nature of the infection. It is also important to note that, despite seasonal patterns, outbreaks can occur irregularly. A P IC may lead the target IC in one season and lag in other seasons due to complex interactions of multiple viral strains. Hence, we divide the time series into smaller time windows. We propose two methods for windowing incidence data: i) fixed-length window allocation and ii) variable-length window detection. Both of these methods are depicted in Figure 3. A straightforward approach is to divide the entire time series into a fixed number of intervals, M f . If the time series is T units (day, week, or month) long in total, then fixed window, w f m (where m ∈ [0, M f − 1]) will have T /M f units of data. While this is the simplest way to divide the series for correlation analysis, it may not be the most efficient. Setting the appropriate value of M f remains an open problem, although we apply some intuitions from the seasonality patterns of dengue outbreaks in our case. The fixed windows might not be appropriately placed to contain the time series curve's meaningful dynamics, and some windows may even cover regions without an outbreak.
A second approach is to detect windows based on the time series itself. To do this, we normalize the incidence rate of the target IC to be ranged between [0, 1]. A typical outbreak curve has irregularities in its shape that make the analysis cumbersome. We use a Savitzky-Golay filter [53] to smooth out the irregularities while preserving the shape of the outbreaks. Our window detection method has two parameters: the incidence threshold   [53] smoothing is applied to the data before window detection takes place.
Once the windows are selected (either by assignment of fixed number or detection), the following procedures are identical. Hence, we will ignore the superscripts (f and d) in this paper's next sections for brevity. M will depict the total number of windows. w m (where m ∈ [0, M − 1]) will depict the (m + 1) th window.

Time-shifted cross correlation
Let i 0 (t) and i k (t) be the disease incidence (new cases at time step t) of the target IC and the k th P IC respectively. We perform bivariate computations of time shifted Pearson's correlation coefficients [54] using windowed (w m ) incidence data of the target IC (i wm 0 (t)) and the k th P IC (i wm k (t)). The formula used to compute the coefficients is shown in Equation 3. This measure is also known as time lagged (or phased) cross-correlation [55] in statistics and signal processing.
Here, R wm k (θ) is the correlation coefficient computed for subsets of the time series i 0 (t) and i k (t) selected by the time window w m when one series is shifted by an amount θ with respect to another. The Pearson's correlation coefficient function is indicated by r() in Equation 3.
For the two time series curves shown in Figure 4, the time-lagged correlation matrix (obtained by computing 8]) is plotted as a heatmap in Figure 5. We use the location depicted in Figure  3 as the target IC (i 0 (t)) and another location from the same state (Espirito Santo) of Brazil as the k th P IC (i k (t)). For this demonstration, the time series curves were split using fixed length windows (M f = M = 5). The heatmap depicted in Figure 5 can be visually verified by comparing with Figure  4. As expected, the two curves are almost in phase in w 0 , negatively correlated in w 2 , IC leads in w 1 , and P IC leads in w 3 and w 4 . The two curves shown here correspond to the target IC (i 0 (t)) and the k th P IC (i k (t)) and these are from the municipalities: Cachoeiro de Itapemirim and Vitória respectively [47]. The m th time window is marked by the symbol wm. Note, in the first window (w 0 ), the outbreaks of IC and P IC are almost in phase, whereas in the second window (w 1 ), the P IC outbreak is lagging in time compared to the IC. Heatmaps like Figure 5 are computed for all P IC k with k ∈ [1, N ]. To determine if a P IC k is leading in a window (w m ), we find the location (θ) of the peak correlation as shown in Equation 4.
We define the correlation strength S wm k to be the mean correlation measure computed around the peak (θ wm k ), extending by the amount θ E in both directions (Equation 5).
Equation 5 has an additional condition on the values θ such that R wm k (θ wm k + θ) exists for the given parameters. A P IC k leads the IC if the peak of correlation lies on the right half of the heatmap shown in Figure 5, which translates to θ wm k > 0. We only consider if a P IC k is at least in phase with the IC and discard the cases where any P IC k lags the IC. This is how we compute the predictor probability matrix P with dimensions M × N . The individual predictor probabilities (∀m ∈ [0, M −1], ∀k ∈ [1, N ]) are are computed as shown in Equation 6.
The predictor probabilities are averaged across all windows (Equation 7) to compute overall predictive abilities of all P IC k . For a particular region (k ∈ [1, N ]), the predictive ability metrics are normalized between [0,1]. The final measure, γ C (k), is defined as the correlation weight of P IC k as shown in Equation 8.γ

Distance and prevalence metrics
With increasing distance, the impact of a P IC on the target IC is likely to be reduced due to decreased travel between the locations, increasing differences in environmental conditions (e.g., temperature, rainfall, vegetation), etc. It is intuitive to use a metric proportional to the inverse distance for strengthening the predictive ability measures of P ICs. Let, d k be the geodesic distance [56] (shortest path on the surface of the earth, assuming earth to be an ellipsoid) between the target IC and P IC k . The normalized [0, 1] distance of a P IC k in the region is, We want the metric to be inversely proportional to the distance. Hence, the distance metric of P IC k is defined as, The outbreak history of a location is an important criterion that indicates the viral pathogen and endemic scenarios' persistence. For a P IC k , we compute the prevalence I k by taking a sum of the incidence data i k (t) for the entire timeline (∀t ∈ [0, T ]).
The prevalence metric is normalized [0, 1] across the region.

RESULTS
We present here the results in several stages. The outcomes of the feature analysis are presented first. This is followed by an analysis of prediction performance using the proposed methods. The results are generated using municipality-wise weekly dengue case data between 2010-2019 from Brazil's Espírito Santo state. We obtained data for 78 municipalities of Espírito Santo and ranked them based on the total number of cases recorded for the entire time period. The top 5 municipalities based on prevalence are listed in Table 2.
The location IDs shown in the table are the IBGE (Instituto Brasileiro de Geografia e Estatística) codes for Brazil [57].

Feature selection and analysis
The P ICs are sorted and ranked for each IC, based on the predictor strength metric, Γ (Equation 2), which is computed from the three individual metrics: correlation weight (γ C ), prevalence weight (γ P ), and distance weight (γ D ). We  Table 3 with the corresponding weights. While our method prioritizes the correlation weight more than others, P ICs with relatively lower correlation weight can still be favored because of the following factors: i) having a significant number of cases or ii) being in proximity of the target IC. This is evident in Table 3 as 3201209 (P IC #1) is chosen over 3205200 (P IC #2) and 3205101 (P IC #4) is chosen over 3200607 (P IC #5). Note, the numbers (#) used in 'P IC #' indicate rank. This should not be confused with arbitrary indices (k) used to compute metrics (P IC k ). This effect is also illustrated in Figure 6, which shows the ranked P ICs listed in Table 3. Although P IC #1 lies farthest from the IC among the five (lowest γ D ), it is ranked at the top due to significantly higher values in the other two factors (γ C and γ P ). A time series plot in Figure 7 shows that the top P ICs are mostly correlated with the IC, Vitória. Among the P ICs displayed here, PIC #4 (3205101) shows the weakest correlation with the IC. It will be clear in the upcoming results, the proximity and the high incidence fraction of this location help with prediction performance. The variability of the incidence curves prevents our method from classifying a single P IC as the optimal predictor for all time ranges. However, the combination of top-ranked P ICs will improve prediction accuracy.  6. A geospatial map showing an incidence center (IC) and 5 ranked peripheral incidence centers (P IC). The circles' radii are proportional to the total number of cases reported during 2010-2019 [47]. The shades of the fill color are generated from a color gradient (green-yellow-red) which is proportional to the fraction of cases with respect to the local population of each location (IC or P IC) during 2010-2019 [47]. The greenish shades indicate smaller infected fractions, while the reddish shades indicate larger fractions. Map generated using Folium [58] with OpenStreetMap [59]. Basemap tiles provided by CartoDB [60].

Prediction performance
After selecting features with the proposed methods, we train and evaluate the prediction performances using the three models (Linear, LSTM, and GRU) described in section 2.5. The time series data are split into 3 distinct sets with ratios of 50:30:20 for model training, evaluation, and testing. A sliding window with input length (t in ) of 8 and output length (t out ) 4 is used for a single shot prediction of the next 4 weeks using data of the past 8 weeks every step.

Individual IC prediction
For the IC of Vitória, the P ICs are added gradually according to their computed ranks (section 4.1), and the models' prediction performances are evaluated. The mean absolute error (MAE) values on the normalized test data are plotted in Figure 8 for varying number of additional features, N P IC . For both LSTM and GRU models, the addition of the first Fig. 7. Time series plots of weekly dengue cases per 100,000 people for the IC and top 5 ranked P ICs ( Table 3). The plots depict cases only between 2010-2015 for improved clarity, but the metrics were computed based on the entire series (2010-2019).
two PICs deteriorates the model performance. However, the subsequent additions keep improving the outcomes. The plots quickly reach their minima, after which errors increase. The first few additions increase error due to high variability in the incidence data, as evident in Figure 7. Further additions of P IC create averaging effects on the predictor data set and cause performance improvements. The GRU model eventually reaches a minimum MAE value of 0.128, which is lower than the best optimal LSTM prediction (0.1415) by about 9.54%. The Linear model reaches an optimum MAE value of 0.3384, which is nowhere close to the recurrent models. After reaching the minima, all three error curves rise again. This increase in MAE with larger feature sets (N P IC ) can be attributed to the curse of dimensionality [21], [22]. LSTM and GRU models perform optimally with 4 and 6 additional P ICs, respectively. Predicting based on present input and historical context (internal states of LSTM and GRU) of data certainly puts recurrent models ahead in performance, which is evident even without a P IC (N P IC = 0 in Figure 8) in the feature set. However, the linear model significantly benefits from feature addition as case data from P ICs strengthen inductive bias. Fig. 8. Prediction performances of the Linear, LSTM, and GRU models in predicting normalized test data for varying number of features (N P IC ). P IC data are added to the feature set based on ranks dictated by computed predictor strengths (Table 3), after which models are trained and evaluated over the test data to compute the mean absolute error (MAE) metrics (lower is better).
After determining the optimum number of features (N P IC ) to be added to the predictor data set for an IC, recurrent models are trained with the extended feature set and are used to predict dengue cases for the test data subset of the time series. Figure 9 compares the predictions with actual incidence data for both LSTM and GRU models using Vitória as the IC and choosing the optimum number of P ICs for the two models (N P IC = 4 for LSTM and 6 for GRU). Fig. 9. Predicted weekly dengue cases per 100,000 people using test data with the recurrent models (LSTM and GRU). These results were obtained for the IC, Vitória, with 4 and 6 additional P IC data joined with LSTM and GRU's feature set, respectively.

Effect of window selection methods
The impacts of various correlation window configurations are analyzed under the two proposed windowing methods: i) fixed-length window assignment and ii) variable-length window detection. For the first method (fixed), we vary the number of windows (M f ) between 5, 10, 20, and 40. For the second method (detection), we vary the minimum window size (∆ M IN ) between 5, 10, 20, and 30 weeks. In both methods, we are consequently varying the number of windows, window lengths, and where the windows are located. In total, we run tests under 8 different scenarios and compare the prediction performances using the 3 models (Linear, LSTM, and GRU). Instead of working with a single IC, we run the tests over multiple locations and report the average performance metrics (e.g., mean of MAE). We take the top 20 ICs based on total cases per 100,000 people and average the performance metrics across ICs. Among the 20 ICs, there were 4 ICs where none of the models could predict with reasonable accuracy (MAE < 1) with or without additional features. In those locations, either the data were too limited or the outbreaks were too random for our models to generalize beyond training data. We filter out these 4 locations and take the remaining 16 locations to evaluate our methods.
For a model, predicting on a given IC, the optimal MAE is defined as the minimum mean absolute error (MAE) obtained by varying the number of additional features (N P IC ) from the P IC ranked list produced by a given windowing method (i.e., minima of the curves in Figure 8 are the optimal MAEs of three models). The average optimal MAE (across 16 ICs) are plotted in Figure 10 for all 8 windowing schemes. The recurrent models perform significantly better than the Linear model: on average, LSTM and GRU outperform the Linear model by 60% and 61%, respectively. LSTM and GRU outperform the best-performing linear model by 41% and 43%, respectively. We expect this kind of advantage, given the importance of considering historical data in predicting future dengue cases. The performances across different windowing schemes with LSTM and GRU models are comparable (both models have an approximate standard deviation of 0.02 around their means of 0.3569 and 0.3435, respectively), with a small trend of increasing mean error values with windowing method configurations. In the case of linear models, there are some stark contrasts when using window detection methods. Using the window detection method with a large ∆ M IN value increases performance sharply. For example, using a ∆ M IN of 30 weeks shows 41.9% improvement over fixed window schemes' average performance. This indicates that features selected with a small number of large correlation windows detected based on outbreak locations in the time series are more effective than the remaining schemes that use a relatively large number of smaller windows. A similar trend is also visible with fixed window methods, although it is not statistically conclusive. The lower values of M f , which translates to having a lower number of large windows, show slightly better performance over higher values of M f . To understand how beneficial our proposed methods are over the baseline (without feature enhancement, N P IC = 0), we analyze the improvements in prediction accuracy. The term, improvement in accuracy is defined as the relative decrease in MAE (i.e., increase in prediction performance) with optimum P IC selection ( Figure 10) compared to MAE without additional features (N P IC = 0). We compute the accuracy improvements across all 16 ICs and plot the mean improvement in accuracy as percentages in Figure 11. The linear model demonstrates average improvements ranging from 18.97% to 27.13% depending on the window selection schemes. The LSTM model improvements range from 22.13% to 33.6% and the GRU model range from 10.79% to 31.92% depending on the window selection schemes. We should be careful in interpreting these values; greater improvements do not translate to better optimal performance. These values are relative to their baselines (N P IC = 0). With some incidence centers (IC), a model can already predict with higher accuracy than other incidence centers. Our methods help minimize those gaps and still provide some improvements over the baselines (ranging from 10.79% to 33.6% depending on the model and the scheme). Figure 10 demonstrates that, in general, GRU and LSTM both perform well when we deal with average values. For individual ICs, however, a conclusive determination of the best performing model can be done (either GRU or LSTM).

Performance on aggregated data
It is sometimes reasonable to aggregate data to larger scales based on geographic adjacency and environmental similarities (e.g., weather). From a macroscopic point of view, predicting for an ecoregion may be more meaningful for policymakers to interpret the outcomes. According to Omernik (2004), ecoregions are defined as areas within which there is a spatial coincidence in characteristics of geographical phenomena (e.g., geology, physiography, vegetation, land use, climate, hydrology, terrestrial and aquatic fauna, etc.) associated with differences in the quality, health, and integrity of ecosystems [61]. We use the terrestrial ecoregions defined by The Nature Conservancy(TNC) in this work [62]. The following analysis is performed for Brazilian locations with available data in the Bahia Coastal Forest ecoregion.
Comparing the prediction performance on aggregated data shows that the advantages of recurrent models compared to the Linear model (which we had for individual ICs) are diminished. The optimal performances across prediction models become more comparable, as shown in Figure 12. The mean optimal MAE values obtained for 4 fixed window schemes are 0.38, 0.40, and 0.36 when using Linear, LSTM, and GRU models, respectively. The mean optimal MAEs for 4 window detection schemes are 0.31, 0.35, and 0.33 using the same three models. We observe a distinct advantage of the window detection method for selecting P ICs. On average, the window detection methods improve over their baselines (N P IC = 0) by 16.50%, 12.68%, and 10.07% for Linear, LSTM, and GRU models, respectively. Variable window allocation based on outbreak window detection outperforms fixed window allocation methods.
In other words, windowed cross-correlation on a few important outbreak regions performs better than comparing over the entire time series. As the target data is aggregated from all P ICs in the region, the optimally trained Linear models sometimes perform better than recurrent models.
One key takeaway is that cases for the entire region can be predicted with improved accuracy while using a small subset (P ICs) of data as features. For the results shown in Figure 12, the optimal MAE values can be reached for N P IC values between 2 and 5. This aggregation provides a fast way to predict cases, with a small fraction of regional outbreak data. While the accuracy achieved at the ecoregion level is not on par with the accuracy achieved for single ICs, this prediction is useful to generate risk maps. We construct a risk map for the Bahia Coastal Forest ecoregion using the predicted cases and weights of the topranked P IC in Figure 13. The top 40 P ICs are included in the map, and the P ICs that contribute more towards the weight (Γ) are shown as high-risk regions (e.g., red). The aggregated prediction was done using the optimal Linear model with ∆ M IN = 30 and N P IC = 4.

CONCLUSION
In this work, we develop a method to select relevant incidence data from peripheral locations as features to improve the prediction of dengue fever outbreaks. In order to rank features, we use windowed cross-correlation analysis on dengue case data. We propose two methods for allocating correlation windows (position and size) over the time series to compute correlation weights. For a target location (IC), peripheral locations (P IC) are ranked based on a combination of correlation, distance, prevalence metrics. The predictive models benefit from the ranked feature sets, as these reach model and location-specific optimal performances with a relatively small subset of features. We tested shown here is marked with red borders. The risk of infection is shown as a heatmap with color shades ranging from blue (low risk) to red (high risk). The heatmap is constructed by combining the ranked P ICs in the ecoregion, with higher-ranked P ICs contributing more to the risk. The aggregated data was predicted using the Linear model with an optimum number of features (N P IC = 4) selected with window detection method (∆ M IN = 30). This heatmap depicts the risk on the date of 09-June-2019, with 40 top-ranked P IC. Map generated using Folium [58] with OpenStreetMap [59]. Basemap tiles provided by CartoDB [60]. three predictive models using dengue case data from Brazil, showing different levels of accuracy gains.
On average, the proposed feature enhancement methods improve prediction performance by 10.79% to 33.6% over the baseline feature set for the locations we tested, depending on the prediction model and the window allocation scheme. For the location with the highest total cases (2010-2019) in the Espírito Santo region of Brazil, we could get MAE values as low as 0.13 (normalized case data) using the GRU model with data from just 6 locations added to the feature set. In a test across multiple locations, both RNN models (LSTM and GRU) performed with comparable accuracy (average MAE ranging from 0.3435 to 0.3569) when using an optimal number of additional features. The Linear model also benefited (18.97% to 27.13% improvement over the baseline) from windowed correlation-based feature enhancements, although its performance never got close to recurrent models. When compared with the respective optimal number of features, the best performing recurrent models outperformed the Linear model by at least 41% in terms of prediction accuracy. The window detection methods showed performance comparable to fixed window allocation. This can be advantageous when working with extensive sets of data, as the detected windows only compare a subset of important time steps instead of the entire series. For municipality-level dengue case prediction, GRU was the best performing model, closely followed by LSTM. The performance gaps between these two models diminished after feature set optimization. When predicting aggregated data for the entire region using a subset of constituent locations, our methods reach optimal performance with the addition of only 2-5 locations (out of 77), depending on the model and window selection scheme. This is especially useful for situations where the lack of trainable data hinders forecasting. For example, dengue risk for a region can be predicted if a small subset of data from epidemically important region locations is available. This addresses the issue of incomplete data and eases the 'curse of dimensionality' while improving training efficiency.
Future efforts in this area can focus on gaining further insights based on epidemiological, environmental, and economic characteristics of locations and combining them to improve feature ranks. While case data are indicators of the severity of an outbreak, these are not always adequate to explain future possibilities. Machine learning models cannot generalize beyond training data for every location with the same degree of accuracy. Complex interactions of dengue viral strains and changes in host immunity patterns may evolve outbreak characteristics over the years. Several random factors, including host travel patterns, natural disasters, and lifestyle changes because of other infectious outbreaks (e.g., COVID-19 pandemic), may affect dengue outbreaks. While the metrics used in our method capture such factors' long-term characteristics, these do not account for randomness. While keeping the feature sets reasonably small, our method can improve outbreak prediction. The proposed method can be generalized and used for projecting any infectious outbreak where temporal data at a reasonable spatial resolution are available.