DeepCOVIDNet: An Interpretable Deep Learning Model for Predictive Surveillance of COVID-19 Using Heterogeneous Features and their Interactions

In this paper, we propose a deep learning model to forecast the range of increase in COVID-19 infected cases in future days and we present a novel method to compute equidimensional representations of multivariate time series and multivariate spatial time series data. Using this novel method, the proposed model can both take in a large number of heterogeneous features, such as census data, intra-county mobility, inter-county mobility, social distancing data, past growth of infection, among others, and learn complex interactions between these features. Using data collected from various sources, we estimate the range of increase in infected cases seven days into the future for all U.S. counties. In addition, we use the model to identify the most influential features for prediction of the growth of infection. We also analyze pairs of features and estimate the amount of observed second-order interaction between them. Experiments show that the proposed model obtains satisfactory predictive performance and fairly interpretable feature analysis results; hence, the proposed model could complement the standard epidemiological models for national-level surveillance of pandemics, such as COVID-19. The results and findings obtained from the deep learning model could potentially inform policymakers and researchers in devising effective mitigation and response strategies. To fast-track further development and experimentation, the code used to implement the proposed model has been made fully open source.


Introduction
COVID-19 has had an unprecedented social and economic impact worldwide. With more than 13 million infected cases and more than half a million deaths as of mid-July, the pandemic is still accelerating globally without showing any signs of nearing an end. In dealing with COVID-19 and future pandemics, it is imperative to design reliable intervention strategies and to implement effective mitigation efforts. The design of such strategies hinges on effective surveillance of the spatiotemporal evolution of disease. Hence, a reliable and relatively interpretable method of forecasting the spread of the virus could significantly improve the predictive surveillance capability and help in designing policies for disease containment.
Facing this need, prior research has proposed and tested various statistical, epidemiological, and machine learning-based forecasting models for COVID-19 [1] [2] [3] [4] [5] [6] [7] based on features such as the number of existing infections, deaths, and recoveries. Other forecasting models proposed for other epidemics [8] [9] [10] rely on features like human mobility and within-season and between-season observations. Although these models show potential for predicting the initial outbreak and growth trajectories, their capability to capture the various temporally dynamic and spatially variant features affecting disease spread is limited [11]. For example, mathematical models such as susceptible-infectious-recovery (SIR) models account for only a small subset of the features identified as responsible for the spread of the virus, as shown below.
Existing studies [12] have shown that the spread of disease depends on many factors; thus, reliable forecasting models must capture all major factors that might influence the spread of infection. Specifically, a multitude of factors including, but not limited to, human mobility [13], social distancing guidelines [14] [15], weather [16], population density [17], and demographics [18] affect the spread of COVID-19. In other fields of study where prediction also depends on a large number of features, researchers often find that identifying interactions between input features is essential to obtaining satisfactory results [19] [20]. Although, to the best of our knowledge, no published literature yet identifies specific feature interactions for the spread of COVID-19, we hypothesize that many of the features upon which the spread depends interact in complex ways. Hence, an exhaustive examination of the relevant features and their possible interactions is essential for an effective prediction model of disease spread.
Advances in deep learning enable contemporary models to use a large number of features and to account for possible interactions. Several deep learning models [21] [22] [23] used for online ad click prediction are particularly known for taking multiple heterogeneous features as input and learning interactions among them; however, since many features used for epidemic forecasting are multivariate time series or multivariate spatial time series, an effective method to compute representations of such features that accounts for both the temporal and spatial structure of the data is essential before they can be used as input to the aforementioned models. Perhaps due to the challenges associated with building such representations for heterogeneous input features, existing deep learning models for epidemic forecasting rely on more conventional recurrent architectures that do not explicitly account for feature interactions and consider only a few input features [5] [1] [24].
To complement existing epidemiological models, we propose a deep learning model based on the high-level framework of DeepFM [21] that takes in multiple features, accounts for interactions between them and forecasts growth in the number of infected cases in all U.S. counties. For effective processing of the many input features, the model includes a novel method to compute equidimensional representations (also called embeddings) of heterogeneous features such as multivariate time series, multivariate spatial time series, and multidimensional time-independent variables. Furthermore, we perform feature importance evaluations to identify the most influential features for predicting the growth of infected cases. Also, since the proposed model accounts for possible interactions among input features, we perform an analysis to estimate the relative amount of second-order interaction between pairs of input features. The results show that the model obtains satisfactory performance. In addition, the highly interpretable feature importance results can also help policy makers develop control strategies in response to the rapidly evolving pandemic situation. To fast-track future research and experimentation with new features or models, we have also made our code fully open source.

Standard disease-spread models
To estimate the growth of infected populations, epidemiologists and mathematical modelers have developed multiple statistical and mathematical models to simulate the spread of disease in terms of susceptible, infected, recovered, and deceased populations. These models include the susceptible-infectious-recovery (SIR) model and its derivatives, such as the susceptible-exposed-infectious-recovery (SEIR) model [25] [26]. These standard models attempt to capture the community spread of disease through the contact activities of people [27]. While these models provide useful insights regarding the initial outbreak and growth trajectories, they are limited in the number of influencing factors and complex relationships they can capture. For example, existing models can include only a limited number of features (primarily the infection rate) to forecast the spread of infection; however, research has shown that disease spread is related to a large number of factors (such as socio-demographic factors, mobility, population density, and visits to points of interest), which possibly interact in complex ways. Data-driven models can be adopted to capture various dynamic features and their interactions to complement the standard disease-spread models.
Existing studies also reveal that population flow drives the spatiotemporal distribution of COVID-19 cases [28], and travel bans were projected to be successful in slowing epidemic spread, as demonstrated in the context of COVID-19 in China [29]. To complement existing mathematical disease-spread models, the global epidemic and mobility model [9] was developed to incorporate the movement of people (which may hasten transmission of the disease across different areas) and predict the spread of disease. This model focuses mainly on cross- and within-community transmission of the disease through the analysis of global human mobility; however, as shown in recent literature, the spread of the disease is affected not only by human contact, but also by multiple additional factors, such as population density, human shopping activities, and directives of local government [14] [15] [13] [30]. Existing models have a limited capability to capture the effects of these factors; hence, developing a model that can take various relevant factors into account, predict the spread of disease, and attribute importance to each factor could be informative for pandemic surveillance and associated policy making.

Deep learning models
Several researchers have identified that predictive models based on deep learning could be effective in aiding decision making in response to COVID-19 [32] [33] [34]. Huang et al. [4] proposed a convolutional neural network to forecast the spread of the virus using six input features related to the numbers of confirmed, recovered, and deceased people. Their model predicts the spread one day ahead using data from the last five days. Although their work shows that deep learning can be effective in forecasting the pandemic, the model cannot be directly deployed in practice because it predicts only one day into the future and considers only a limited number of features affecting the spread of the disease.
Chimmula et al. [5] proposed a long short-term memory (LSTM) model [35] to forecast the spread of the virus based on the past numbers of confirmed, deceased, and recovered cases. Similarly, Tian et al. [24] also used an LSTM model to forecast the spread based on similar features but normalized by population. The authors also compared their results with a Hidden Markov Model and a Hierarchical Bayes Model, but concluded that the LSTM exhibits the best performance of the three, demonstrating the potential impact deep learning can have on pandemic forecasting.
Tian et al. [36] also proposed a custom model to forecast cumulative confirmed cases and deaths by combining LSTM [35] and GRU [37] cells. Their model used the past numbers of confirmed cases and deaths, along with five other time-independent features, as input. They reported their model's performance only in terms of relative error, so it is difficult to judge its effectiveness in absolute terms. Further, the authors did not clarify why some features, such as violent crime rate, were used in the model or how these features could contribute to the prediction of infections and deaths from COVID-19.
Further, an important and closely related work is the DeepCOVID model [38]. The associated researchers developed a deep learning-based forecasting model using "syndromic, clinical, demographic, mobility, and point-of-care data" to forecast mortality and hospitalization in the United States; however, only limited information (such as accuracy, model architecture, and feature importance) about this work has been made public so far.
To the best of our knowledge, the proposed DeepCOVIDNet model is among the first significant deep learning models for COVID-19 forecasting to be completely open source. Moreover, a notable distinction of our work is that we perform feature analysis to understand which features are important in forecasting the growth of the virus, an analysis not performed in prior studies. Further, we also account for possible interactions among our many input features and identify pairs of features with a relatively higher amount of second-order interaction between them, which, again, prior models have not done.

Input Feature Groups
To have a comprehensive understanding of the situation and characteristics of counties, it is important to examine several features for each county. To this end, we extensively surveyed the literature and identified certain "influencing factors" that might affect the spread of infection. We then identified specific feature groups that corresponded with the set of influencing factors identified earlier. A feature group is simply a set of similar features grouped together to facilitate further processing. A brief description of all feature groups is presented in table 1. In this section, we describe the process of feature collection, organization, and inclusion in the proposed model in detail.

Feature Collection
Our extensive analysis of the literature helped us identify four factors that may influence the spread of COVID-19 both spatially and temporally. These "influencing factors" are population attributes (such as population density and age- or race-based population distribution), population activities (such as visits to certain types of points of interest and adherence to social distancing guidelines), mobility (movement from more infected places to less infected ones), and disease spread attributes (such as the reproduction number and the past growth of infections). The feature groups used in correspondence with each influencing factor are shown in table 2, and a more complete discussion of all feature groups is provided in appendix 10. In the following, we provide evidence for the importance of each influencing factor in predicting the growth of the virus.
Features related to population attributes, such as population size and density, are important in predicting the growth of infection, as shown by the research studies discussed below. Rocklöv and Sjödin [17] demonstrated that the spread of the virus in a geography is directly proportional to its population density. Dyer [39] and Kirby [40] showed that the impact of the pandemic has been disproportionately higher in black and other minority communities. Dowd et al. [18] illustrated how population age distribution and intergenerational contacts (possibly captured by household type and size) affect the impact of the virus in a community. Together, these studies provide a strong rationale for using features that capture the population attributes of counties by showing that these unchangeable population characteristics contribute significantly to understanding the spread and impact of the virus. As shown in table 2, to account for population attributes, we use census features, population density, and other engineered features built by the Surgo Foundation to assess the vulnerability of each county to COVID-19 (see appendix 10).
Multiple research studies have found that population activities are important in predicting the growth of future cases. Benzell et al. [41] identified risks associated with visits to venues at which people are placed in proximity and showed that particularly high risk was associated with visits to restaurants, grocery stores, fast food establishments, cafes, and gyms. Lai et al. [42] found a high risk of small disease outbreaks within residential facilities for the elderly, indicating that examining the number of visits to such facilities could help in forecasting the growth of infection. Moreover, since research has shown that COVID-19 is transmitted via airborne particles, especially indoors, visits to hospitals, especially by healthcare workers, could be risky [43] [44] and are therefore important to consider when determining the growth of infection. Further, Gao et al. [45] showed that adherence to social distancing orders helps reduce the spread of infection; Cohen and Kupferschmidt [46] and Sen-Crowe et al. [47] also acknowledged the importance of social distancing in slowing the spread of the virus. Béland et al. [48] demonstrated that workers in occupations in which they are more likely to work in proximity are more affected by the virus, indicating that the percentage of people working full- or part-time could influence infection spread. Together, these studies show the importance of capturing population activities as input features. As shown in table 2, we use points of interest (POI) visitation patterns, Venables distance [31], and social distancing metrics to capture these activities. As explained in appendix 10, visitation patterns capture the number of visits to important types of POIs, Venables distance captures the agglomeration of population activities in a county, and social distancing metrics show adherence to stay-at-home orders.
Together, these three feature groups show the extent to which people are likely to come in close contact and potentially facilitate the spread of infection.
Further, the dynamics of urban mobility are also important in predicting the spatial spread of the virus. Kraemer et al. [13] showed that human mobility played a major role in explaining the initial spatial distribution of infections in China: more than 50% of initial cases identified outside Wuhan could be traced back to travel from Wuhan. Sirkeci et al. [49] also confirmed that human mobility from more infected places to less infected places is a significant factor in predicting the spatial spread of the virus, and other studies [50] [28] concur regarding the role of mobility in the spread of the virus. Therefore, as shown in table 2, we use data on incoming traffic from other counties to capture mobility in the proposed model.
We further augment the mobility information by including the number of infections in source counties, because travel statistics alone are not sufficient to inform the model about counties from which travel could be more dangerous due to a high prevalence of infection. Finally, disease spread attributes, such as the growth of cases over the past few days and the reproduction number, are also important factors in determining the future spread of the virus. As mentioned in section 2, several existing models [5] [24] [4] obtain satisfactory results in forecasting future cases by using features similar to the growth of cases in past days. Further, epidemiology-based studies [51] [52] [53] have reported that estimating the reproduction number can be essential to understanding the behavior of virus spread, making it an important input feature to consider. Based on these studies, we include both the growth of cases in past days and the reproduction number (estimated by the method proposed by Fan et al. [51]) as input features to the model.

Feature Organization
As discussed above, we combine similar features into a feature group to better process and organize data. We define three types of feature groups based on their spatial and temporal characteristics as described below: constant, time-dependent, and cross-county time-dependent feature groups. The exact feature groups used in the model are shown in table 1 and are fully explained in the appendix 10.
Constant Feature Groups: Constant feature groups contain features that do not vary significantly over the analysis period. For example, the set of census features for a county (such as population size and population density) are considered constant features because their values do not change significantly within a few months.
Time-Dependent Feature Groups: This group contains features whose values change over time. For example, the number of people who visit grocery stores on a particular day in a given county is a time-dependent feature, since its value changes from day to day. A time-dependent feature group is essentially a multivariate time series.
Cross-County Time-Dependent Feature Groups: In cross-county time-dependent feature groups, features capture a time-dependent interaction between two counties. For example, the number of people traveling from one county to another on a particular day is a cross-county time-dependent feature. A cross-county time-dependent feature group is essentially a multivariate spatial time series.

DeepCOVIDNet
We have designed a novel deep learning model, DeepCOVIDNet, to estimate the range of increase in the number of infected cases on a particular day given multiple constant, time-dependent, and cross-county time-dependent feature groups as input. The "projection interval" of the model is 7 days, meaning the model forecasts the range of increase in cases 7 days into the future.
The model comprises two modules: the embedding module and the DeepFM module. The embedding module takes as input the various heterogeneous feature groups and outputs an equidimensional embedding corresponding to each feature group. These embeddings are then passed to the DeepFM module, which computes second- and higher-order interactions between them. Finally, a shallow, fully connected network takes as input the computed interactions and the sum of the feature embeddings (to improve gradient flow) and outputs n probabilities corresponding to n binary classification tasks, as explained in section 5. We transform the n probabilities into a probability distribution of the rise in cases over n + 1 ordered ranges and expect the rise in cases to lie within the range with the greatest probability. Figure 1 shows a schematic representation of this process.

Embedding Module
The novel embedding module used in this study produces embeddings of the same dimension for all feature groups using a generalizable process, described in detail below for each feature group type. The fundamental idea behind extracting embeddings from the various feature groups is to exploit all available structure in the data. For example, for time-dependent features, we would like the model to treat the same feature on different dates more similarly than different features on different dates. We accomplish this by sharing parameters, a technique that is arguably at the heart of the success of both recurrent and convolutional neural networks.

Constant Feature Embeddings
Since constant feature groups do not have a time dimension, the shape of a group of constant features is simply [n], where n is the number of features in the group. As shown in figure 1, the embeddings of constant features are calculated simply by using a fully connected layer that converts the input tensor of shape [n] to a tensor of shape [e] where e is the embedding dimension.
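As a concrete illustration, the constant-feature embedding is a single dense layer. The following is a minimal NumPy sketch with arbitrary dimensions and random weights (illustrative only; the actual model learns these weights, and the activation shown here is an assumption):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def constant_embedding(x, W, b):
    """Map a constant feature group of shape [n] to an embedding of
    shape [e] with a single fully connected layer."""
    return relu(x @ W + b)

rng = np.random.default_rng(0)
n, e = 5, 8                       # n input features, embedding size e
x = rng.normal(size=n)            # one county's constant features
W = rng.normal(size=(n, e))       # learnable weights (random here)
b = np.zeros(e)                   # learnable bias
emb = constant_embedding(x, W, b)
print(emb.shape)                  # (8,)
```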

Time-Dependent Feature Embeddings
Since the time dimension of time-dependent feature groups needs to be taken into account, the shape of a time-dependent feature group is [t, n], where t represents the number of time steps and n the number of features per time step. From the n input features, our method computes a holistic feature score by performing a weighted summation of the n original features at every time step. This holistic score can be thought of as a new time-dependent feature, engineered by the model as per its needs. Next, a weighted summation is conducted over all time steps to learn the influence of different time steps on the final output. As shown in equation 1, when the embedding size is e, the model has the chance to engineer e new time-dependent features and learn how each behaves over time.
In a formal formulation, let $F_t \in \mathbb{R}^{t \times n}$ be a time-dependent feature matrix of shape [t, n]. Let $(F_t)_{ij}$ represent the $j$th feature at the $i$th time step. Let the output embedding be $E_t \in \mathbb{R}^e$ and let $(E_t)_k$ represent the $k$th element of the embedding. Then, we calculate $(E_t)_k$ as

$$(E_t)_k = \sigma\left(\sum_{i=1}^{t} (W_T)_{ik} \, \sigma\left(\sum_{j=1}^{n} (F_t)_{ij} (W_F)_{jk}\right)\right), \tag{1}$$

where $\sigma$ is any activation function, and $W_F \in \mathbb{R}^{n \times e}$ and $W_T \in \mathbb{R}^{t \times e}$ are learnable weights in the network. In part 2 of figure 1, we show exactly how equation 1 is implemented in a vectorizable manner.
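Equation 1 vectorizes naturally: the inner sum is a matrix product over the feature dimension, and the outer sum is an element-wise weighted reduction over time. A minimal NumPy sketch follows, with random weights, sigmoid as an example activation, and illustrative shapes (none of these choices come from the trained model):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def time_dependent_embedding(F, W_F, W_T, sigma=sigmoid):
    """Embed a [t, n] time-dependent feature group into shape [e].

    scores[i, k] = sigma(sum_j F[i, j] * W_F[j, k])   (holistic scores)
    E[k]         = sigma(sum_i W_T[i, k] * scores[i, k])
    """
    scores = sigma(F @ W_F)                   # [t, e]
    return sigma((W_T * scores).sum(axis=0))  # [e]

rng = np.random.default_rng(0)
t, n, e = 13, 4, 8
F = rng.normal(size=(t, n))        # 13 days of 4 features
W_F = rng.normal(size=(n, e))      # feature-combination weights
W_T = rng.normal(size=(t, e))      # time-step weights
E_t = time_dependent_embedding(F, W_F, W_T)
print(E_t.shape)                   # (8,)
```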

Cross-County Time-Dependent Feature Embeddings
Since each feature value of a cross-county time-dependent feature group is associated with all counties and all time steps, the shape of a cross-county time-dependent feature group is [t, c, n], where t represents the total time steps, c the total number of counties in the U.S., and n the number of features. As shown in figure 1, let $F_{cc} \in \mathbb{R}^{t \times c \times n}$ represent the raw features, so that $(F_{cc})_{ijk}$ is the $k$th feature of the $j$th county at the $i$th time step. Further, if we let $E_{cc} \in \mathbb{R}^e$ represent the final embedding of this feature group and $(E_{cc})_p$ the $p$th element of $E_{cc}$, then we compute $(E_{cc})_p$ using an extension of equation 1:

$$(E_{cc})_p = \sigma\left(\sum_{i=1}^{t} (W_T)_{ip} \, \sigma\left(\sum_{j=1}^{c} (W_C)_{jp} \, \sigma\left(\sum_{k=1}^{n} (F_{cc})_{ijk} (W_F)_{kp}\right)\right)\right), \tag{2}$$

where $\sigma$ is any activation function, and $W_F \in \mathbb{R}^{n \times e}$, $W_C \in \mathbb{R}^{c \times e}$, and $W_T \in \mathbb{R}^{t \times e}$ are learnable weights in the network. Note that equation 2 is an extension of equation 1 over the county dimension. As in the previous case, the innermost summation computes a holistic feature score for every county and time step using the same weights. Next, an overall score for each time step is computed by a weighted sum of the holistic feature scores over all counties. Finally, a weighted combination of all time steps is computed.
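Equation 2 follows the same pattern with one extra reduction over the county dimension. A minimal NumPy sketch under the same illustrative assumptions (random weights, sigmoid activation, small shapes chosen for demonstration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_county_embedding(F, W_F, W_C, W_T, sigma=sigmoid):
    """Embed a [t, c, n] cross-county time-dependent feature group into [e]."""
    s_feat = sigma(F @ W_F)                                 # [t, c, e] holistic scores
    s_time = sigma((W_C[None, :, :] * s_feat).sum(axis=1))  # [t, e] per-time-step scores
    return sigma((W_T * s_time).sum(axis=0))                # [e] final embedding

rng = np.random.default_rng(0)
t, c, n, e = 13, 6, 3, 8           # toy sizes; c is ~3000 in practice
F = rng.normal(size=(t, c, n))
W_F = rng.normal(size=(n, e))
W_C = rng.normal(size=(c, e))
W_T = rng.normal(size=(t, e))
E_cc = cross_county_embedding(F, W_F, W_C, W_T)
print(E_cc.shape)                  # (8,)
```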

DeepFM Module
We hypothesize that several interactions exist between different feature groups. An interaction between two features exists when their values together convey information that cannot be extracted by considering their values individually. For example, there could be an interaction between the percentage of the population staying indoors in a county and the total number of infected cases in the county: a higher incidence of new cases is expected if few people remain indoors and there is already a high number of infected cases, but the model may not be able to predict the number of new cases as effectively if only one of the two feature values is given. Hence, it is important for the model to identify and learn the many such interactions that could exist among different features. In this section, we provide an overview of how the model computes second- and higher-order interactions among input features using the DeepFM [21] framework (with slight modifications).
The embeddings of all feature groups obtained from the embedding module described above serve as input to the DeepFM module. Note that the raw feature groups have different sizes, but their embeddings have the same dimension and are therefore easy for the DeepFM module to process further, as shown in figure 1.
In the model, the dot product between a pair of embeddings represents the second-order interaction between the corresponding two feature groups. To identify higher-order interactions, we concatenate all embeddings and process them through a self-normalizing neural network [54]. The network comprises a sequence of dense layers with the same output dimension, each followed by the SELU (scaled exponential linear unit) non-linearity and an alpha dropout layer [54]. We treat the number of dense layers and their output dimension as hyperparameters of the model, tuned using Bayesian optimization as explained in section 5.
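The two interaction pathways can be sketched as follows in NumPy (random weights, alpha dropout omitted, and dimensions chosen for illustration rather than taken from the tuned hyperparameters):

```python
import numpy as np

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    # Scaled exponential linear unit from Klambauer et al. [54].
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def deepfm_interactions(embeddings, dense_weights):
    """Second-order terms: dot products between all pairs of embeddings.
    Higher-order terms: concatenated embeddings passed through SELU
    dense layers (the alpha dropout layers are omitted in this sketch)."""
    m = len(embeddings)
    second = np.array([embeddings[i] @ embeddings[j]
                       for i in range(m) for j in range(i + 1, m)])
    h = np.concatenate(embeddings)
    for W in dense_weights:
        h = selu(h @ W)
    return second, h

rng = np.random.default_rng(0)
e, m, hidden = 8, 4, 16            # 4 feature-group embeddings of size 8
embs = [rng.normal(size=e) for _ in range(m)]
layers = [rng.normal(size=(m * e, hidden)), rng.normal(size=(hidden, hidden))]
second, higher = deepfm_interactions(embs, layers)
print(second.shape, higher.shape)  # (6,) (16,)
```

With 4 embeddings there are 4 choose 2 = 6 second-order terms, one per pair of feature groups.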

Implementation Details
Loss function. The proposed model predicts a range within which the rise in the number of infected cases is expected to lie. In this section, we describe how the model is trained for this output. Let $C \in \mathbb{Z}^n$ be a sorted list containing the boundaries of the ranges used for prediction. If the model predicts class $i \in [0, n]$, the rise in cases is expected to lie within the interval $[C_i, C_{i+1})$, under the conventions $C_0 = 0$ and $C_{n+1} = \infty$. Naturally, $C_i \leq C_{i+1}$ for all $i$.
Since C is sorted, we can treat the output of the model as an ordinal variable and perform ordinal regression using the method described by Frank et al. [55]. In this method, $n$ binary classifiers are trained such that classifier $i \in [1, n]$ outputs the probability $P(x > c_i)$, where $x$ is the rise in the number of cases and $c_i$ is a constant. After the $n$ probabilities are obtained from these $n$ classifiers, we can easily find $P(c_j \leq x < c_{j+1})$ for all $j \in [0, n]$. Finally, the rise in cases is expected to lie in the interval $[c_k, c_{k+1})$ for the $k \in [0, n]$ such that $P(c_k \leq x < c_{k+1}) \geq P(c_i \leq x < c_{i+1})$ for all $i$. Frank et al. [55] suggest training the model using only a binary cross-entropy loss on the $n$ binary classifiers. However, this is arguably non-ideal because the final goal is to predict the interval $[c_j, c_{j+1})$ rather than $P(x > c_i)$ for all $i$. Therefore, for the proposed model, a multi-class cross-entropy loss on $P(c_i \leq x < c_{i+1})$ for all $i$ is added to the binary cross-entropy loss on the binary classifiers. Our experiments show that results are slightly better when this loss term is added.
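The conversion from the n binary outputs to interval probabilities is a difference of adjacent cumulative probabilities. A small self-contained sketch, with made-up classifier outputs (assumed non-increasing, as ordinal regression requires):

```python
import numpy as np

def interval_probs(p_greater):
    """Given n probabilities P(x > c_i) for i = 1..n (assumed
    non-increasing), return the n + 1 interval probabilities
    P(c_j <= x < c_{j+1}), using P(x > c_0) = 1 and P(x > c_{n+1}) = 0."""
    p = np.concatenate(([1.0], np.asarray(p_greater, dtype=float), [0.0]))
    return p[:-1] - p[1:]

# n = 3 binary classifiers -> 4 ordered ranges
p_greater = [0.9, 0.6, 0.2]
probs = interval_probs(p_greater)      # approximately [0.1, 0.3, 0.4, 0.2]
predicted_class = int(np.argmax(probs))
print(predicted_class)                 # 2
```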
Class ranges for outputs. As described above, the proposed method predicts a range within which the increase in the number of cases is expected to lie. It is important to discuss how the range boundaries (i.e., the list C) are chosen. We start by finding the rise in the number of cases in every county from April 5 through June 28 over the same projection interval as the model. We then find the 33rd, 67th, and 90th percentiles of the rise in cases during those dates; in raw numbers, these turn out to be 1, 13, and 93. The output classes therefore correspond to the following ranges: 0 to 1 (negligible increase), 2 to 13 (moderately low increase), 14 to 93 (moderately high increase), and above 93 (significantly high increase). These numbers denote the rise in cases in a county during one week. Note that, due to the distribution of the labels, a naive model that always predicts the same class would achieve at most about 33% accuracy; this helps contextualize our results. We achieved about 64% accuracy on the testing data (June 12 through June 28), roughly twice that of a naive model. In light of this and the feature analysis results, we can be confident that the model has learned useful information. These class boundaries can easily be changed to make the model's predictions finer or coarser for different use cases.
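The boundary computation and the resulting class assignment can be sketched as follows, with toy data standing in for the actual county-level rises (only the [1, 13, 93] boundaries come from the text above; the generated data and the inclusive-upper-end convention are illustrative assumptions):

```python
import numpy as np

# Toy stand-in for the observed 7-day rises across counties and dates.
rng = np.random.default_rng(0)
rises = rng.lognormal(mean=2.0, sigma=1.5, size=10_000).astype(int)

# Class boundaries at the 33rd, 67th, and 90th percentiles (the real
# April 5 - June 28 data yields 1, 13, and 93).
boundaries = np.percentile(rises, [33, 67, 90]).astype(int)

def classify(rise, bounds):
    """Map a rise in cases to one of len(bounds) + 1 ordered classes,
    treating each boundary as the inclusive upper end of its class
    (0-1, 2-13, 14-93, >93 for bounds [1, 13, 93])."""
    return int(np.searchsorted(bounds, rise, side="left"))

paper_bounds = np.array([1, 13, 93])
print([classify(r, paper_bounds) for r in (0, 1, 2, 13, 14, 93, 94)])
# [0, 0, 1, 1, 2, 2, 3]
```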
Train and test splits. The data used spans April 5 through June 28; 68% was used for training, 12% for validation, and 20% for testing. Equivalently, data from April 5 through June 1 was used for training, June 2 through June 11 for validation, and June 12 through June 28 for testing. Although data since January 21 was available, we used data only from April 5 onward because of the lack of widespread testing availability before then. Since the model was trained on labels derived from COVID-19 testing data, it was essential to use data that was relatively accurate and did not underestimate the number of infections due to a lack of testing [11].
Amount of Past Data Used. As explained in section 3, the values of many of the input feature groups used in the model change with time (the time-dependent and cross-county time-dependent feature groups). For all such features, we used 13 days of past data. We experimentally found that increasing the number of past days to 21 or 28 had an insignificant effect on model accuracy; therefore, for reasons of computational efficiency, we chose to use 13 days of past data. To be clear, when predicting the range of increase in infected cases on, for example, June 21, we use input data from June 1 to June 13. Features from June 14 through June 20 are not given as input because that is the interval over which the model is predicting the rise in cases.
(Figure 2 caption: the difference between the predicted and actual class for each county, so a value of 0 means the model's prediction was correct, 1 means it was one class away from the ground truth, and so on. The accuracy of the model on the given date is shown above each subfigure. The few counties not shown in these plots either have no infected cases or did not have sufficient data.)
Hyperparameter Tuning. We used Bayesian optimization for 30 iterations with expected improvement as the acquisition function to choose hyperparameters for the model [56].
Training Method. As shown in figure 2, the model's testing results become less accurate as it is tested on dates further in time from the dates on which it was trained. Since the dates of the training, validation, and testing set are in ascending order, we hypothesized that training on the validation set as well should improve testing performance, since more recent dates will be used in training. In light of this belief, the model is trained in two steps.
First, we keep the original training and validation sets intact and use Bayesian optimization as described above to perform hyperparameter tuning and choose the model with the best performance on the validation set. We note the best hyperparameters and the number of epochs needed to train the best model. Second, we train a fresh model on both the training and validation sets with the hyperparameters chosen in the first step, for the same number of epochs. The model obtained in the second step is the final model. Note that the first step serves two purposes: (1) choosing the best hyperparameters, and (2) deciding a stopping point for the second step to avoid or reduce overfitting.
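The two-step scheme can be summarized in a minimal sketch (hypothetical helper names; `train` and `evaluate` stand in for the real training loop and validation metric, which are not shown in the text):

```python
# Two-step training: tune on the original split, then retrain on
# train + validation data with the chosen settings.
def two_step_training(train_set, val_set, candidate_hparams, train, evaluate):
    # Step 1: pick the hyperparameters that score best on the validation set,
    # recording the epoch budget that produced the best model.
    best = max(
        (train(train_set, hp) for hp in candidate_hparams),
        key=lambda result: evaluate(result["model"], val_set),
    )
    # Step 2: train a fresh model on train + validation data with the
    # chosen hyperparameters, for the same number of epochs.
    final = train(train_set + val_set, best["hparams"], epochs=best["epochs"])
    return final["model"]
```

In the paper, step 1 additionally uses Bayesian optimization rather than exhaustive enumeration of `candidate_hparams`; the control flow is otherwise the same.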
Feature Analysis. For the model to be informative for policymaking, it should be explainable. To this end, we adopted a simple and popular method to evaluate feature importance [57]. We first evaluated the accuracy of the model on a small section of the training set. Next, we looped through every feature in every feature group, randomized its values, and re-evaluated the model's accuracy on the same section of the training set. We assigned higher importance to features that, when randomized, cause a greater decrease in model accuracy.
A similar analysis was performed to determine the time steps whose feature values are most important in determining the final output. Here, we randomized all features at a given time step and assigned greater importance to the time steps that, when randomized, cause a greater decrease in performance.
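The permutation procedure described in the two paragraphs above can be sketched as follows (an illustrative sketch with a toy model, not the authors' implementation):

```python
# Permutation importance: features whose randomization hurts accuracy
# most are deemed most important.
import numpy as np

def permutation_importance(predict, X, y, rng=None):
    rng = rng or np.random.default_rng(0)
    baseline = np.mean(predict(X) == y)
    importances = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # randomize one feature column
        importances.append(baseline - np.mean(predict(Xp) == y))
    return np.array(importances)

# Toy model whose predictions depend only on feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
predict = lambda A: (A[:, 0] > 0).astype(int)
imp = permutation_importance(predict, X, y)
# imp[0] is large; imp[1] and imp[2] are 0, since the model ignores them.
```

The time-step variant is the same loop, except that all feature columns belonging to one time step are permuted together.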
Feature Interaction Analysis. As described in section 4, the DeepCOVIDNet model explicitly computes second-order interactions between input feature groups. Due to this explicit representation of second-order interactions, it is possible to identify pairs of features between which a high amount of interaction is observed. To perform this analysis, we evaluated the network on a section of the training set and tracked the magnitude of activations in the vector representing the second-order interactions of input features. We concluded that the observed interaction between two feature groups is high when the average magnitude of their second-order interaction is also high. This implicit assumption, that neurons activate strongly when they capture a pattern to which they respond, is common in deep learning and is used in other interpretation techniques, such as activation maximization [58].
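The readout of interaction strength can be sketched as follows (an assumed layout: each feature-group pair owns a slice of the second-order interaction vector; the activations and slices below are illustrative):

```python
# Score each feature-group pair by the mean absolute activation of its
# slice of the second-order interaction vector, averaged over examples.
import numpy as np

def interaction_scores(activations, pair_slices):
    """activations: (n_examples, interaction_dim) array; pair_slices maps
    a (group_a, group_b) pair name to its slice of the interaction vector."""
    return {
        pair: float(np.mean(np.abs(activations[:, sl])))
        for pair, sl in pair_slices.items()
    }

acts = np.array([[0.1, -0.2, 3.0, -2.5],
                 [0.0,  0.3, 2.0, -3.5]])
slices = {("census", "mobility"): slice(0, 2),
          ("census", "social_distancing"): slice(2, 4)}
scores = interaction_scores(acts, slices)
# The (census, social_distancing) pair shows the larger mean |activation|.
```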

Predictive Performance
In this section, we discuss the performance of the proposed model on the test set. As discussed above, the test set contains 17 days of data from June 12 through June 28. The average accuracy of the model on these 17 days is 63.7% when using four output classes to categorize the growth in the number of new cases for each county (i.e. negligible increase, moderately low increase, moderately high increase, and significantly high increase).
Further, as shown in figure 2, the performance of the model, which is trained from April 5 through June 11, decreases as we evaluate on dates further from June 11. The same trend of the model performance decreasing over time can also be observed in figures 3 and 4. This could be due to two possible reasons. One, the COVID-19 situation is highly dynamic and the behavior of people and the adaptive strategies they use change frequently. For example, mask use in the United States has increased over time, especially after the Centers for Disease Control and Prevention (CDC) advised it [59]. So if the model were trained on data before mask use became prevalent, it would learn trends that will not hold true after masks become more widespread. Two, the testing capacity of the United States continues to ramp up with time. As testing becomes more accessible, the trends a model may have learned earlier may not hold, as more people would be tested, resulting in more infected cases being found. Therefore, during deployment, a good strategy is to let the model forecast only 7 days beyond the end of dates used in training, as new features are released about every 7 days. To project further into the future, the model must be retrained on the most recent training data available to ensure that the knowledge it learns holds for the dates it is asked to forecast on. It should be noted that research [12] has shown that predictive performance of many other forecasting models also declines with time. Therefore, since the model will only forecast 7 days into the future in practice, it is important to know its "use case accuracy" or "7-day accuracy," which is the model's accuracy from June 12 through June 18 (7 days of test data). The model's "use case accuracy" is much higher at 70.8%. Figure 3 shows the predictive performance of the model in all U.S. counties on a series of dates. 
The figure shows that the majority of the predicted results are consistent with, or one class away from, the actual growth in the number of cases. The predicted growth for only a few counties is two or three classes away from the actual situation. This shows that the model generally performs well in predicting the growth of infection in a county: even when its predictions are incorrect, they are usually just one category away from being correct. Further, as discussed above, the figure also shows that the accuracy of the model decreases for dates further away from those used in the training set. Figure 4 shows the model predictions over time for different counties based on the growth of infected cases in them. We see the same general trend as in figure 3: the model makes more inaccurate predictions as we go further into the future.
(Figure caption: the results indicate that feature values from the most recent days are most important in predicting the growth of cases 7 days in the future; refer to section 6.2 for more details.)

Feature Importance Evaluation
In the next step, we evaluated the importance of different county features in terms of their contribution to the prediction of the number of new cases in each county. Figure 5 shows some of the most salient features identified by the model using the method described in section 5. As shown in figure 5, the most important feature is the past increase in infected cases in the current county. This implies that the past trend of the growth of infection in a county is a strong determinant of future growth. Moreover, the second and third most important features are the cumulative infected cases in all counties and the number of people from other counties visiting the county under study. As explained in appendix 10, these are both cross-county time-dependent features. Together, these features could imply that the phenomenon of people traveling from more infected counties to other counties is associated with the growth of infected cases in the destination county. In other words, the results show that inter-county travel risk could be high for the destination county, consistent with previous studies about the risks of travel [60] [61]. Other important features, such as the median home dwell time, the percentage of people working full- or part-time, the percentage of people staying home, and the number of people traveling to work for less than 5 minutes, all show that the model has learned the importance of staying home and of social distancing in general, which is again consistent with findings reported in previous studies [62]. In addition, the model also shows that various manually engineered features are effective and indicative of the growth of infected cases.
For example, the COVID-19 Vulnerability Index (CCVI) score, socioeconomic vulnerability, epidemiological vulnerability, and healthcare system factors are engineered features designed by the Surgo Foundation [63] to identify communities that are at a greater risk of infection, and our results show that all these features are important. Note that other features, such as the estimated reproduction number (equal to the number of other infections caused by one infected person) and population density, also make the list of important features. Figure 6 shows the results of the time-step analysis, which identifies the past days whose feature values are most influential in predicting the final output. As discussed earlier, all time-dependent and cross-county time-dependent features use 13 days of past data. Since the projection interval of the model is 7 days, the most recent day used for input features is 7 days before the prediction date (also called the projected date).
As shown by figure 6, the most recent days are the most important in predicting the growth of cases. Figure 7 shows second-order interactions between all feature groups computed by the method described in section 5. The values in the cells represent the mean activations of the second-order interactions of corresponding feature groups. As discussed in section 5, higher values represent higher observed interactions. An important observation from the figure is that almost all feature groups have notable interaction with census features. This result indicates that all other county-level features must be interpreted in the context of the population attributes of that county. For example, two counties with different population attributes may require varying amounts of adherence to social distancing orders to produce the same dampening effect in the growth of disease spread. Note also that figure 5, which shows important features identified by the model, does not include many census features, indicating that census features alone do not contribute much to prediction of the growth in future cases.

Feature Interactions
Also, the results show that many feature groups have relatively high interaction with cross-county mobility and infections. For example, a relatively high interaction exists between cross-county mobility & infections and social distancing metrics. This result indicates that travelers from other counties have a different kind of impact on destination counties with higher adherence to social distancing than on those with lower adherence. This analysis validates our initial hypothesis that interactions among the many input feature groups exist and aid in predicting the rise in the future number of cases in counties.

Discussion and Concluding Remarks
In this paper, we propose a novel deep learning model to examine heterogeneous county-level features and to predict the growth of infected cases in the future. The proposed method extracts embeddings from multivariate time series and multivariate spatial time series data in a novel way by utilizing both the temporal and spatial (if available) structure of the data. This process of extracting embeddings could be employed in other deep learning research to process similar types of data. Further, unlike many existing models [1], the proposed model identifies the most influential features for predicting the growth of infection. Similarly, the results of identifying pairs of features with high second-order interactions show that interactions among the many input features exist and are important for future researchers to account for explicitly in their forecasting models. Further, this analysis also opens up opportunities for data scientists and statisticians to explain more clearly the reasons for such interactions and their possible implications for epidemiologists studying the spread of COVID-19.
In addition, we believe that data-driven models like DeepCOVIDNet mark the advent of an era of extensively using deep learning for pandemic forecasting, which can complement existing epidemiological models. Already, there is increasing interest in using recurrent networks to forecast the spread of COVID-19 [64] [36] [65]. To the best of our knowledge, the proposed model is among the earliest non-recurrent, non-conventional deep learning models used for pandemic prediction, which should encourage other researchers to develop more creative models. Finally, this work also presents one of the first attempts to make the results of a deep learning based COVID-19 forecasting model interpretable to policymakers and the public.
Further improvements to the prediction accuracy and interpretability of the model are important future directions for this work. For example, although we explicitly model second-order interactions between feature groups, we still lack a good way to interpret these interactions at the individual feature level. In other words, although we may be able to tell that there is high interaction between census features and social distancing features, we cannot tell which exact census features interact with which exact social distancing features. A strong understanding of which individual features interact would be of great value to epidemiology researchers and policymakers. Further, it is important to continue to add and engineer new features for the model and to improve its predictive performance in future studies.

Data and Code Availability
The data that support this study are available from SafeGraph, Mapbox, and The New York Times GitHub repository containing information about the past number of cases. Restrictions apply to the availability of some of these data, which were used under license for the current study, and so are not publicly available. However, the full code used to implement the proposed DeepCOVIDNet model is available on GitHub: https://github.com/urban-resilience-lab/covid-county-prediction.

Appendix: Feature Descriptions
In this section, we provide detailed descriptions of the feature groups used in the model. As shown in section 3, the feature groups are categorized based on four influencing factors; below, we provide the definition and detailed calculation of all feature groups under each influencing factor.

Population Attributes
As mentioned in table 2, we use the following features to capture population attributes, including features that measure the capacity, strength (measured by county spending on health and research quality), and preparedness of the health care system in a county using 8 different features. (g) COVID-19 Vulnerability Index (CCVI). Finally, the above 6 features are combined with equal weighting to create the COVID-19 vulnerability index, which is intended to identify communities "with a limited ability to mitigate, treat, and delay the transmission of" the virus.

Population Activities
We use the mobility data collected by SafeGraph to capture population activities. The SafeGraph Patterns [68] and SafeGraph Social Distancing Metrics [69] are adopted in our use case. The former includes information related to the number of visits to certain points of interest, and the latter provides data to show adherence to stay-at-home guidelines.
1. Visitation Patterns: SafeGraph Patterns data has information about the number of people that visit different types of points of interest (e.g., grocery stores, health facilities) on a given day. We consider visits to the following types of places in our model: amusement parks and arcades, colleges/universities/professional schools, living facilities for the elderly, department stores, hospitals, merchandise stores/supercenters, grocery stores, and restaurants/eating places.
3. Venables Distance: Venables distance captures the concentration of population activities in a county. Venables distance, D_V(t), is defined formally as [31]:

D_V(t) = ( ∑_{i<j} s_i(t) s_j(t) d_{ij} ) / ( ∑_{i<j} s_i(t) s_j(t) ),

where s_i(t) and s_j(t) represent the population activity intensity in cells i and j, respectively, and d_{ij} represents the distance between the two cells. In our analysis, we define a cell to be a 4 km² area. We compute D_V for each day and county by using the daily average values of s_i(t) and s_j(t). Louail et al. [31] describe this equation in more detail. To extract the values of s_i(t) and s_j(t) for cells, we use Mapbox digital trace telemetry data, which uses aggregated cell phone data to estimate population activity in small spatial regions of each county.
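The Venables distance is the activity-weighted average pairwise distance between cells, which can be sketched in numpy (a hedged sketch; the cell coordinates and intensities below are illustrative, not Mapbox data):

```python
# Venables distance: activity-weighted mean distance between cell pairs.
import numpy as np

def venables_distance(coords, s):
    """coords: (n, 2) array of cell centers; s: (n,) activity intensities."""
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))        # pairwise distances d_ij
    w = np.outer(s, s)                      # weights s_i * s_j
    iu = np.triu_indices(len(s), k=1)       # pairs with i < j
    return float((w[iu] * d[iu]).sum() / w[iu].sum())

# Two equally active cells 2 km apart: D_V is simply their separation.
coords = np.array([[0.0, 0.0], [2.0, 0.0]])
print(venables_distance(coords, np.array([1.0, 1.0])))  # → 2.0
```

Concentrating activity in nearby cells drives D_V down, which is why it serves as a measure of how concentrated population activity is within a county.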

Mobility
To capture mobility across regions, we employed inter-county travel data from SafeGraph as described below.

Cross County Mobility & Infections:
The SafeGraph data [68] include the census blocks where the travelers visiting a certain point of interest come from. With this information, we can estimate the total number of travelers from one county to another. We further augment the mobility data by also adding the cumulative number of infections in source counties, so the model can realize that visitors from a more infected county could be more dangerous than visitors from a less infected county.
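The augmentation step can be sketched as follows (an assumed data layout, not the authors' code: visits are county-to-county traveler counts, and each destination county receives both total inflow and an infection-weighted inflow):

```python
# Augment cross-county mobility with cumulative infections at the origin,
# so a destination county "sees" how infected its visitors' sources are.
def augment_mobility(visits, cumulative_cases):
    """visits: {(src_county, dst_county): n_travelers};
    cumulative_cases: {county: total_infected}."""
    features = {}
    for (src, dst), n in visits.items():
        inflow, weighted = features.get(dst, (0, 0))
        # Weight each traveler by the cumulative infections at the origin.
        features[dst] = (inflow + n, weighted + n * cumulative_cases[src])
    return features

visits = {("A", "B"): 10, ("C", "B"): 5}
cases = {"A": 100, "C": 2}
print(augment_mobility(visits, cases))  # → {'B': (15, 1010)}
```

Under this weighting, 10 visitors from a heavily infected county contribute far more to the destination's risk feature than 5 visitors from a lightly infected one.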

Disease Spread Attributes
1. Past Rise in Infected Cases: We consider the weekly rise in the number of confirmed cases in past days as a feature to represent the past pandemic situation in a county. This data is obtained from the New York Times GitHub repository [70]. Since the proposed model predicts the growth of cases a week in the future, the weekly rise in cases for the past few days as an input feature can inform the model about the recent trend of the variable that it has to predict.

Reproduction Number:
The basic reproduction number is also used, which estimates how fast the number of cases is expected to increase. Formally, it is the estimated number of other cases caused by one infected person. Fan et al. [51] describe the exact method of estimating the reproduction number using a simple epidemic model. They assume that an infected person infects another R_0 people after time τ. Therefore, if the number of infected people at time step 0 is i(0), then at time step t, the number will grow to i(t) = i(0) R_0^{t/τ}. Simple algebraic manipulation shows that:

R_0 = ( i(t) / i(0) )^{τ/t}.

Fan et al. [51] set τ to 5.1 days and provide justifications for this choice in their work. As table 1 shows, we use the reproduction number as a time-dependent feature in the model. Therefore, for this analysis, we estimate the daily reproduction number based on 10 days of past data. In other words, to estimate R_0 at day d, we use the following formula:

R_0(d) = ( i(d) / i(d − 10) )^{τ/10}.
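The daily estimator above can be written directly in code (a sketch of the reconstructed formula, with illustrative case counts; τ = 5.1 days follows Fan et al. [51]):

```python
# Daily reproduction-number estimate: R_0(d) = (i(d) / i(d - 10)) ** (tau / 10).
TAU = 5.1  # serial interval in days, per Fan et al.

def estimate_r0(cases, d, window=10, tau=TAU):
    """cases: sequence of cumulative infected counts indexed by day."""
    return (cases[d] / cases[d - window]) ** (tau / window)

# Sanity check: if cases grow exactly as i(t) = i(0) * R0 ** (t / tau)
# with R0 = 2, the estimator recovers 2.
i0, r0 = 100.0, 2.0
cases = [i0 * r0 ** (t / TAU) for t in range(11)]
print(round(estimate_r0(cases, 10), 6))  # → 2.0
```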