Predicting County-Level Population From VIIRS Nighttime Light Imagery With Deep Learning

Nighttime light (NTL) imagery has been widely used to predict economic impacts, electricity consumption, and light pollution. In this research, we evaluated NTL imagery for predicting county-level population in the United States using deep learning methods. Rather than condensing the brightness information into a single statistic (e.g., mean annual brightness per county), this research aims to exploit the temporal nature of NTL imagery. Monthly composites of Visible Infrared Imaging Radiometer Suite (VIIRS) imagery were analyzed for each county to create quantiles of brightness. These quantiles were provided as input to two neural network architecture variants: a feedforward neural network (FFNN) and a recurrent neural network variant (long short-term memory, LSTM). A rigorous leave-one-group-out (division holdout) analysis showed that the LSTM achieved a higher weighted average $R^2$ (0.68) across all divisions than the baseline (−0.01) and FFNN (0.26) models, suggesting that treating NTL data as sequences can yield more accurate predictions than treating them as static tabular data. This approach of relating temporal NTL data to population can be generalized to other parts of the world where data collection is sparse or uncertain.


I. INTRODUCTION
NIGHTTIME light (NTL) imagery collection and analysis started in the seventies when the United States (U.S.) Air Force launched a spacecraft into orbit under the Defense Meteorological Satellite Program (DMSP) [1]. Initially, images were stored on film strips and were made available to the public in 1973. The first-ever NTL image of the entire Earth was put together in 1991 by Sullivan [2]. This worldview NTL image is in the Mercator projection, and it was built by combining 40 separate photographs taken from 1974 to 1984. The image had limitations due to incompatibilities among the 40 separate images, which were collected over a decade. In 1992, the National Oceanic and Atmospheric Administration/National Geophysical Data Center (NOAA/NGDC) built a digital archive of global NTL imagery data from the DMSP program, the original series of NTL data spanning 1992 to 2013 [3]. Recently, a NOAA discovery revealed that the F15 and F16 satellites collected predawn data from 2012 through 2018. Utilizing this predawn nighttime data, the Earth Observations Group (EOG) extended the annual DMSP NTL time series to 2019 [4]. Thereafter, numerous versions and resolutions (spatial and temporal) of NTL imagery data have been widely employed by researchers to solve scientific problems at the global, country, and regional scales.
There are various sources of NTL imagery such as the DMSP's Operational Line-scan System (DMSP-OLS), Suomi National Polar Partnership Visible Infrared Imaging Radiometer Suite (NPP-VIIRS), International Space Station (ISS), EROS-B, JL1-3B, JL1-07/08, and LJ1-01 [5], [6]. Among these sources, researchers have extensively used NTL imagery data from DMSP-OLS and NPP-VIIRS as they are openly available and accessible for broader time periods. DMSP-OLS serves as the singular source that provides global NTL data for the longest time frame, ranging from 1992 to 2019. A plethora of literature has been published using DMSP-OLS data, and researchers have effectively demonstrated various applications of NTL imagery. However, NPP-VIIRS provides better features than DMSP-OLS in terms of higher spatial resolution, lower detection limit, finer quantization (14-bit), and in-flight calibration [7].
Population estimates are major drivers of the sustainable development goals for economic growth, education, health, industry, innovation, infrastructure, jobs, transportation, and urban planning [26], [27], [28]. Census estimates provided by governments are very valuable, but they have limitations. NTL models that are validated in countries with rich data environments such as the U.S. may prove useful in several ways: 1) they can be applied in locations where a census is not conducted frequently, is of poor quality, or is held closely by a given country; 2) they may identify rapidly changing populations due to migration, wars, and natural disasters that a census cannot capture in a timely manner; 3) an accurate census is a very expensive, labor-intensive, and time-consuming enterprise, and deep learning-based approaches using alternative data such as NTL can serve as a partial substitute for traditional census count methods; 4) the U.S. census is conducted every ten years and traditionally provides local estimates on an annual basis, whereas NTL models can provide estimates on a subannual basis or for ad hoc purposes at minimal additional cost; 5) current census methodology often cannot use remote sensing data as effectively as NTL models that incorporate compatible spatial platforms; and 6) the proposed approach is applicable not only for estimating population at the administrative unit level but also for any region of interest, including one that spans two or more administrative units (counties).
The methodology presented in this study uses a shapefile (geographic boundary) and extracts the NTL to estimate the population within that boundary. The shapefile is not constrained to symmetrical forms like squares or circles; it may encompass any irregular shape. We demonstrated this with over 3000 county shapefiles of different shapes, sizes, and areas. Unlike the proposed method, the Census does not provide the population of regions of interest that span administrative units [27]. The Census does provide population for geographies smaller than counties, but these would need to be aggregated and may not align exactly with the region of interest.
Fig. 1 reveals a qualitative relationship between infrastructure, NTL, and population. Higher populations were observed in regions with higher NTL. While many studies have demonstrated the correlation between population and NTL, Sutton et al. [28] were among the first to propose predicting population density based on NTL. Later, the same problem was addressed by multiple researchers with different modeling strategies. A study by Chu et al. [29] used a geographically weighted regression model to predict population density. Other statistical techniques, such as kernel density estimation, have been implemented to predict population density with NTL data [30]. Machine learning techniques such as random forests and support vector machines [24] have also been used to predict the population density of cities. Population density was calculated from the percent of lit areas to study population dynamics [31]. The volume of NTL was used to analyze urban population dynamics [32].
While most studies have focused on population density and dynamics, some studies have predicted population counts based on NTL. A random forest (a machine learning approach) was adopted to predict population counts for nine central urban districts in Chongqing, China [25]. Zhao et al. [33] implemented a random forest model to predict the logarithm of the total population based on NTL data for U.S. counties with a population of more than 50 000, and they identified that age and race can contribute to population bias. Another study used regression techniques to predict population but focused on specific zones of interest in Zhejiang, the tenth most populated province, located on the southeastern coast of China [34]. Huang et al. [35] calculated the urban population in China using regression models that illustrated a significant correlation between NTL and population. The primary limitation of previous research is that it is based on particular areas of interest that might exhibit similar characteristics and may not generalize to an entire country or larger geographic region [30]. The primary contributions of our research are as follows: 1) our approach of estimating the population of U.S. counties in a generalized fashion, without constraints on county selection, and validating the models through a division holdout strategy is novel; and 2) we implemented deep learning models to predict the county-level population as estimated by the American Community Survey (ACS) using the distribution of NTL, and we then evaluated the performance (in terms of prediction error) and analyzed the behavior of the deep learning models in terms of under- and overestimates at the division and county levels.

II. MATERIALS

A. Study Area
Three thousand one hundred thirty-nine counties in the U.S. were selected as our study area, including the continental U.S., Alaska, and Hawaii. The U.S. is the world's third-largest country in terms of population and fourth in terms of area. The Census Bureau groups U.S. states into four regions (Northeast, Midwest, South, and West), which are further divided into nine divisions. According to the 2020 Census, the U.S. has a population of over 330 million people. The U.S. has a total area of 9.8 million square km, encompassing 9.1 million square km of land and 0.6 million square km of water. Tile 1 of VIIRS DNB provides NTL data for most of the North American continent. For this study, we selected only the portion of Tile 1 corresponding to the U.S. (masked with green color in Fig. 2). A sample of raw VIIRS DNB NTL extracted from Tile 1 for the U.S. is shown in Fig. 2 (enclosed in a red bounding box).

B. Data Sources
The data used for the study comprise monthly composites of VIIRS DNB NTL, U.S. county shapefiles, county area metadata (area of land and water in sq. km), and population estimates of U.S. counties as estimated by the ACS. Visible Infrared Imaging Radiometer Suite (VIIRS) Day/Night Band (DNB) NTL data are provided by the EOG, Payne Institute for Public Policy, Colorado School of Mines. EOG provides VIIRS DNB NTL imagery data of the Earth on a monthly and annual basis in six tiles. Each tile spans 120° of longitude, and the grid is split by the equator to form the six distinct tiles. The six tiles originate at 75°N/180°W, 75°N/060°W, 75°N/060°E, 00°N/180°W, 00°N/060°W, and 00°N/060°E, respectively. Monthly and annual cloud-free composite imagery was generated by calculating an average of all the images over the respective timescale. The data are filtered for noise-producing sources such as stray light, lightning, lunar illumination, and cloud cover. However, the data do contain lights from sources such as aurora, fires, boats, and other ephemeral lights. Monthly data are available in two configurations: one with stray light eliminated and another with stray light corrected. Each configuration includes two sets of files: one set contains radiance values in units of nW/cm²/sr, and the other contains the total number of cloud-free observations used to develop the NTL imagery. For this study, stray light-corrected cloud-free data from Tile 1 were selected, as this tile covers most of the North American continent. The VIIRS DNB data for five years (2013, 2014, 2015, 2016, and 2017) were downloaded from the internet (available at https://eogdata.mines.edu/products/vnl/) [36].
Aside from NTL, the U.S. Census Bureau served as a major data source for boundary shapefiles and county-level population estimates [37]. The U.S. Census Bureau provides cartographic boundary shapefiles from 1990 onward for various demographic boundaries such as counties, states, regions, and divisions. In this study, we used the U.S. county shapefile from the Census at a resolution level of 500k (1:500 000) (available at https://www.census.gov/geographies/mappingfiles/time-series/geo/carto-boundary-file.2017.html) [38]. The county shapefile comes with an attribute table consisting of county-level demographics, area of land, and area of water information. The 2010 county-level percent urban (https://www.census.gov/programs-surveys/geography/guidance/geo-areas/urban-rural/2010-urban-rural.html) [39] and poverty rate (https://www.census.gov/library/visualizations/time-series/demo/census-poverty-tool.html) [40] were downloaded from the Census as well. We used education data provided on the U.S. Department of Agriculture website (available at https://www.ers.usda.gov/dataproducts/county-level-data-sets/county-level-data-setsdownload-data/) [41].

III. METHODS
The pipeline of the study includes two steps, data preprocessing and modeling, as presented in Fig. 3. The study began with downloading Tile 1 of the VIIRS DNB NTL monthly data for five years, retrieving 60 (12 × 5) files. Subsequently, an R program [42] was developed to iteratively extract NTL images of each county from the 60 individual files by clipping with the U.S. county shapefile. The extraction resulted in 60 clipped NTL images for each of the 3139 counties. To represent NTL, 13 quantiles were calculated from the county NTL images. Two other features, area of land and area of water, were included in the county feature space. Two neural network architectures, a feedforward neural network (FFNN) and a long short-term memory (LSTM) network, were implemented to estimate the population of U.S. counties for 2017. The data were rearranged to match the input requirements of the two models: the FFNN is provided with 782 variables [(60 × 13) + 2] to predict the county population, whereas the LSTM is given 15 features at a time over 12 months for each of 5 years. A division holdout strategy is employed to validate the robustness of both the FFNN and LSTM models. Predicted values from the models are then collected separately to perform evaluations and model comparisons.
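The two input arrangements described above can be sketched with hypothetical arrays (the variable names and random values here are illustrative only; the shapes follow the [(60 × 13) + 2] = 782 wide format for the FFNN and the 15-features-per-timestep sequence format for the LSTM):

```python
import numpy as np

# Hypothetical stand-in for the extracted features: 3139 counties,
# 60 monthly composites (12 months x 5 years), 13 brightness quantiles,
# plus 2 static per-county features (area of land, area of water).
rng = np.random.default_rng(0)
quantiles = rng.random((3139, 60, 13))   # per-month quantile vectors
areas = rng.random((3139, 2))            # static land/water areas

# FFNN: flatten the time dimension into one wide row per county,
# then append the 2 area features -> (60 * 13) + 2 = 782 columns.
ffnn_X = np.concatenate([quantiles.reshape(3139, -1), areas], axis=1)

# LSTM: keep the sequence structure and repeat the static areas at
# every timestep -> 13 + 2 = 15 features per month over 60 timesteps.
areas_seq = np.repeat(areas[:, None, :], 60, axis=1)
lstm_X = np.concatenate([quantiles, areas_seq], axis=2)

print(ffnn_X.shape, lstm_X.shape)
```

The same underlying measurements feed both models; only the arrangement differs, which is what allows the LSTM to treat the months as a sequence.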

A. Data Preprocessing
Instead of considering a single NTL value like the mean brightness per county as in other studies [24], [31], [33], [34], [35], we calculated quantiles of the distribution of brightness for each county for each month and year. Thirteen quantile values, Q0 (min), Q0.01, Q0.025, Q0.05, Q0.1, Q0.25, Q0.5, Q0.75, Q0.9, Q0.95, Q0.975, Q0.99, and Q1 (max), were calculated from the county NTL images using the general sample quantile estimator $\hat{Q}(p) = (1-\gamma)X_{(j)} + \gamma X_{(j+1)}$, where $t \le p < t + 1/n$, $t = (j-m)/n$, $n$ is the sample size, $X_{(j)}$ is the $j$th order statistic, $m$ is a constant determined by the estimator type, and $0 \le \gamma \le 1$. Hyndman and Fan [43] proposed this general notation for sample quantiles and argued that statistical packages should adopt a standard definition for sample quantiles so that all packages provide the same answer. Quantiles were calculated using the R programming language implementation of the Hyndman and Fan equation [44]. A small portion of the quantiles produced negative values because of the presence of negative pixels in the VIIRS DNB NTL county images. We agree with the assumption of Shi et al. [45] that negative values arise from noise and data processing, and we similarly recoded negative VIIRS pixel values to zero.
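A minimal sketch of the per-county quantile computation, assuming numpy's default quantile estimator (Hyndman and Fan type 7, which is also the default in R's `quantile()`; the exact type used in the study is not stated) and the recoding of negative pixels to zero:

```python
import numpy as np

# The 13 probabilities used for each county image (min ... max).
PROBS = [0.0, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5,
         0.75, 0.9, 0.95, 0.975, 0.99, 1.0]

def county_quantiles(pixels):
    """Brightness quantiles for one clipped county image.

    Negative radiance pixels (noise / processing artifacts) are
    recoded to zero before the quantiles are taken, following the
    treatment described in the text.
    """
    flat = np.clip(np.asarray(pixels, dtype=float).ravel(), 0.0, None)
    # np.quantile's default method is the Hyndman-Fan type 7 estimator.
    return np.quantile(flat, PROBS)

# Toy 2 x 3 "image" with one negative noise pixel.
q = county_quantiles([[-0.2, 0.0, 1.5], [3.0, 7.5, 12.0]])
print(q)
```

Each county-month thus contributes a 13-value summary of its full brightness distribution rather than a single statistic.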
Once the data were prepared, we first built a baseline model for all other models to be compared to. This simple model used the same network architecture as the FFNN but was provided only the median NTL value per county rather than the full set of quantiles.

B. Modeling
Two different neural network variants were used to predict the county-level population of 2017: an FFNN and a recurrent neural network (RNN) variant called long short-term memory (LSTM). These two variants were used because each model relates the input features to the output in different ways. While the FFNN treats each feature independently, the LSTM treats each variable as part of a sequence and then makes a prediction. The data preparation for the FFNN requires one row per county with each year and brightness value as a feature (column). The LSTM uses sequence data and requires data prepared as 3-D samples: one sample per county (3139 counties), with 60 timesteps (one timestep for each month across five years) and one column for each calculated percentile (13 columns).
1) FFNN: Neural networks nonlinearly relate input features to a target variable. The FFNN, also known as a "deep neural network," is comprised of an input layer, several dense (hidden) layers, and an output layer. Input features are passed across the network via dot products with intermediate weight matrices between each of the hidden layers and are transformed with nonlinear activation functions. Weight matrices are initialized randomly and updated via backpropagation such that a more accurate prediction is derived after each iteration.
The number of nodes in the input layer is equal to the number of input features. The input features for this model consisted of 67 features (5 years × 13 percentiles + 2 land variables = 67). An FFNN can use sequential time series data as features so long as the data are in wide format (one row per county with multiple columns). This destroys the sequential nature of the original time series and transforms the task into a traditional supervised learning problem.
The FFNN in this study (refer to Table V for a model summary) was developed with six hidden layers and an output layer with one node to generate the final prediction. The first hidden layer contained 1600 nodes, and each subsequent hidden layer consisted of exactly half the number of nodes of the previous layer, so the hidden layers decreased from 1600 to 800, 400, 200, 100, and finally 50 nodes. A rectified linear unit (ReLU) was used as the activation function for all the hidden layers to learn nonlinear relationships between the input features and the target variable [46]. For the output layer, a linear activation was used to produce the predicted county populations. The loss for this network was measured by mean squared error, and the Adam optimizer was used to fit the model. A dropout rate of 0.2 was applied between hidden layers to provide weight regularization and encourage a more robust network. An early stopping callback was implemented to cease training when the network did not show improvement in loss after 300 epochs.
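The halving dense stack can be illustrated as a plain numpy forward pass. This sketch shows only the layer shapes and activations; the weights are random and untrained, and dropout, the Adam optimizer, and early stopping are omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
# Input width, the halving hidden stack, and the single output node.
widths = [67, 1600, 800, 400, 200, 100, 50, 1]

# Randomly initialized weight matrices and zero biases, one pair per layer.
layers = [(rng.standard_normal((n_in, n_out)) * 0.01, np.zeros(n_out))
          for n_in, n_out in zip(widths[:-1], widths[1:])]

def ffnn_forward(x):
    """One forward pass: ReLU on hidden layers, linear output layer."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:       # all but the output layer
            x = np.maximum(x, 0.0)    # ReLU activation
    return x

batch = rng.standard_normal((4, 67))  # 4 hypothetical counties
preds = ffnn_forward(batch)
print(preds.shape)
```

The linear output layer is what lets the network emit an unbounded population count rather than a value squashed by an activation.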
2) RNNs: RNNs are ideal for numeric sequence data such as a time series of NTL brightness. RNNs require data to be prepared as 3-D samples of (samples, lookback (rows), columns); they read each timestep one at a time and update a hidden state (like having an FFNN in a for-loop). Simple RNNs have one FFNN in the recurrent cell, while advanced variants can have more than one. The recurrent layer generates one hidden state per timestep and uses the final hidden state as input to downstream dense layers before making a final prediction. Recurrent layers can be stacked, which requires the entire output sequence of hidden states to be the input to the next recurrent layer. The final hidden state of the final recurrent layer is then used as input to the final dense layer. Sequence data can be read from the oldest to the newest timestep and can also be read from the newest to the oldest. The use of a bidirectional layer wrapped around a recurrent layer allows two RNNs to be built simultaneously, with one reading the sequence forward and one reading it backward. The final states of the two are concatenated to yield the final hidden state or hidden state sequence.
While there are many variants of RNNs, the one we used in this article is an LSTM model (refer to Table VI for a model summary). At a high level, the LSTM has four FFNNs included in the recurrent layer instead of one, which helps ameliorate vanishing gradients. With four FFNNs in the LSTM layer, the LSTM model takes longer to run than simpler RNN architectures, but the hope is that the model can learn more difficult patterns.
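The "four FFNNs" inside the recurrent layer are the input, forget, and output gates plus the candidate network. A minimal numpy sketch of one LSTM cell follows; the sizes (15 monthly features, 32 units) match the dimensions used in this study, but the weights are random and untrained, so this illustrates the mechanics only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, params):
    """One LSTM timestep: the four 'FFNNs' are the i/f/o gates and
    the candidate network g, each with its own weight matrices."""
    Wi, Ui, bi, Wf, Uf, bf, Wo, Uo, bo, Wg, Ug, bg = params
    i = sigmoid(x @ Wi + h @ Ui + bi)   # input gate
    f = sigmoid(x @ Wf + h @ Uf + bf)   # forget gate
    o = sigmoid(x @ Wo + h @ Uo + bo)   # output gate
    g = np.tanh(x @ Wg + h @ Ug + bg)   # candidate cell update
    c = f * c + i * g                   # new cell (memory) state
    h = o * np.tanh(c)                  # new hidden state
    return h, c

rng = np.random.default_rng(2)
n_in, n_hid = 15, 32                    # 15 monthly features, 32 units
params = []
for _ in range(4):                      # four gate networks
    params += [rng.standard_normal((n_in, n_hid)) * 0.1,
               rng.standard_normal((n_hid, n_hid)) * 0.1,
               np.zeros(n_hid)]
h = c = np.zeros(n_hid)
for t in range(60):                     # scan the 60-month sequence
    h, c = lstm_step(rng.standard_normal(n_in), h, c, params)
print(h.shape)
```

The cell state `c` is the additive memory path that lets gradients flow across many timesteps, which is why the variant resists vanishing gradients.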
To exploit even more information from the NTL time series, feature processing techniques from computer vision, such as convolution and pooling, are used to generate a richer sequence for learning. The kernel moves along the multivariate time series and convolves the features into a transformed sequence that may yield a better prediction. The number of timesteps remains consistent following convolution, but the number of features will be smaller, depending on the stride and kernel size.
Gers et al. [47] showed that LSTMs outperform FFNNs and traditional RNNs on temporal processing tasks. The gate networks of the LSTM enable it to carry information across timesteps in its memory unit, in contrast to an FFNN, which has no mechanism for retaining sequence context. The network architecture for our LSTM also included a convolutional layer with 32 filters, a pooling layer with a pool size of 2, a bidirectional LSTM with 32 units to provide the network with both backward and forward information for each sequence, a dense layer, and dropout layers between each of them with a dropout rate of 0.2. A nonlinear activation function, ReLU, was used for the convolution and the bidirectional LSTM to speed up the training process and keep the gradients from vanishing. As with the FFNN, the Adam optimizer was used to minimize the mean squared error loss, and an early stopping callback with best-weight restoration was implemented while training the network for up to 300 epochs.

C. Model Evaluations
To evaluate the model performance geographically, a leave-one-group (division) out holdout strategy was implemented. As shown in Fig. 2, each county belongs to one of nine geographic divisions (such as "New England" or "Pacific"). In a set of divisions d_i (where i ranges from 1 to 9), for the first iteration (i = 1), all counties in division d_1 are selected as a holdout, and the models are trained on counties in the remaining eight divisions (d_2 to d_9). A traditional approach such as k-fold cross-validation could have been used; however, we believe this can result in a form of "data leakage" where nearby counties can be used to predict a county of interest and artificially inflate the performance of the model. To illustrate the data leakage phenomenon, we provide the distribution of the training dataset for different k folds (where 1 ≤ k ≤ 9) across all the divisions in Fig. 5. We selected k = 9 because there are nine divisions in the U.S. In k-fold cross-validation, unlike the leave-one-division-out holdout, every fold contains at least some counties from every division, which can be used in training to predict the population of other counties within the same division during model testing. This data leakage is more pronounced for divisions with a higher number of counties, such as the East North Central, South Atlantic, West North Central, and West South Central divisions. This issue is resolved with the leave-one-division-out holdout strategy, which allows for testing on counties of a division unseen during training. It also helps better evaluate the generalizability of our methodology.
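A minimal sketch of the leave-one-division-out split (the division names below are a toy subset; the study iterates over all nine Census divisions):

```python
import numpy as np

def division_holdout(divisions):
    """Yield one (held-out division, train indices, test indices) triple
    per division, so each fold tests on counties from a division that
    was entirely unseen during training."""
    divisions = np.asarray(divisions)
    for d in np.unique(divisions):
        yield d, np.where(divisions != d)[0], np.where(divisions == d)[0]

# Toy example: six counties across three hypothetical divisions.
divs = ["Pacific", "Pacific", "Mountain",
        "Mountain", "New England", "New England"]
folds = list(division_holdout(divs))
for d, train, test in folds:
    # No division ever appears on both sides of a split (no leakage).
    assert not set(np.asarray(divs)[train]) & set(np.asarray(divs)[test])
print(len(folds))
```

In contrast, a plain k-fold split over counties would typically place counties from the same division in both the training and test sides of a fold.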
Population predictions for the $i$th county ($\hat{y}_i$) were compared to the actual population ($y_i$) of the county. For each of the 3139 counties ($1 \le i \le 3139$), we calculated the error (E) and absolute percentage error (APE) as in the following equations: $E_i = \hat{y}_i - y_i$ and $\text{APE}_i = (|\hat{y}_i - y_i| / y_i) \times 100$. Similarly, the coefficient of determination ($R^2$) for each holdout division with mean actual population $\mu$ was calculated as $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \mu)^2$.
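These metrics can be computed directly; the sketch below uses hypothetical populations, with the error signed so that positive values are overestimates:

```python
import numpy as np

def eval_metrics(y, y_hat):
    """Per-county signed error and APE, plus R^2 for the group."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    error = y_hat - y                         # E_i (+ = overestimate)
    ape = np.abs(y_hat - y) / y * 100.0       # APE_i, in percent
    mu = y.mean()
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - mu) ** 2)
    return error, ape, r2

# Three hypothetical counties in one holdout division.
y = [10_000, 50_000, 200_000]
y_hat = [12_000, 45_000, 210_000]
error, ape, r2 = eval_metrics(y, y_hat)
print(error, ape, r2)
```

Note that this $R^2$ can be negative when the model predicts worse than the division's mean population, which is how the baseline's weighted average of −0.01 arises.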

IV. RESULTS
The coefficient of determination (R²) between the actual and predicted population was calculated from the model predictions under the division holdout strategy. Table I provides the R² values (rounded to two decimal places) for each of the nine divisions from the three models: baseline, FFNN, and LSTM. Generally, the LSTM outperforms the other two models, followed by the FFNN and the baseline, in terms of weighted average R² (refer to Table I).
The FFNN (R² = 0.57) and baseline (R² = 0.53) models provided better R² values than the LSTM (R² = 0.45) for the Middle Atlantic division. Both the baseline and FFNN models exhibited negative R² for the South Atlantic division. Additionally, the baseline model had a negative R² for the Pacific and West North Central divisions. In contrast, the LSTM had no negative R² values across the divisions, with its lowest in the Middle Atlantic division (R² = 0.45).

TABLE II
ERROR METRICS OF MODELS FOR THE U.S. CENSUS DIVISIONS. TEST HOLDOUT RESULTS ONLY
We calculated mean errors for each of the divisions and provide the results in columns 2 and 3 of Table II. As there is substantial variation among county populations (varying by orders of magnitude from thousands to millions), to minimize the effect of a few highly populated counties influencing the outcome, we also calculated median errors, which are provided in Table II as well. From Table II, the FFNN and LSTM models did not exhibit the same behavior and performance across all the divisions. Based on the mean and median errors, both the FFNN and LSTM underestimated the East North Central, Middle Atlantic, New England, and Pacific divisions (refer to columns 2-5). Conversely, both models produced overestimates for the East South Central, West North Central, and West South Central divisions. The median absolute percent error (refer to columns 5 and 6) shows that the FFNN and LSTM share the East North Central division as the best-predicted division, while West North Central and Mountain were the worst-predicted divisions for the FFNN and LSTM, respectively. Similar mean absolute percent error values were observed for the FFNN and LSTM in the New England and Pacific divisions.
Fig. 6 shows scatter plots of the actual versus predicted population of all U.S. counties for the FFNN and LSTM models, without and with log scales. Points in the scatter plots are colored from yellow (high density of points) to purple (low density of points). From Fig. 6(a) and (b), we can observe that the FFNN points are scattered away from the 45° line, while the LSTM points are more evenly scattered about the line, indicating better model predictions by the LSTM. However, these two panels do not allow analysis of most of the counties (colored yellow at the bottom left corner of the plots), which have populations in the thousands compared to a few counties with populations in the millions. Therefore, to address the skewness and increase the visibility of the plot, we applied a log transformation to the actual and predicted population values, and the results are presented in Fig. 6(c) and (d). For most points (the yellow region), the FFNN results exhibit a positive bias (systematic overestimation), with predictions above the 45° line, while the LSTM demonstrated less biased results with an even scatter.
The distributions of the FFNN and LSTM error metrics for all counties are represented as box plots in Fig. 7. The FFNN and LSTM models both over- and underestimated county populations, with a similar range of error distributions (≈120 000) [refer to Fig. 7(a) and (b)]. The FFNN overestimated more counties than the LSTM (which has a near-zero median error). The distribution of percent absolute error (upper whisker, interquartile range (IQR), and lower whisker) for the LSTM is narrower than for the FFNN, indicating a better model fit [refer to Fig. 7(c) and (d)]. Collectively, the results suggest that the LSTM, with fewer trainable parameters than the FFNN, makes more accurate county-level population predictions (refer to Tables V and VI).
Fig. 8 shows error maps of the FFNN and LSTM models for the contiguous U.S. counties. The common color scale varies with the error generated by the model in a particular county, from blue (large underestimates) to red (large overestimates). Excessive underestimation was observed in the Northwest (group 5) and Southwest U.S. (group 4), compared to excessive overestimation in the South Central (groups 1 and 2) and North Central U.S. (group 3). By comparing Fig. 8(a) and (b), one can observe that the LSTM has fewer biased counties than the FFNN in these groups.
To identify the characteristics of counties over- and underestimated by the FFNN and LSTM models, we performed k-means clustering [48] on county-level model errors for 2017 (data missing for 35 counties), the 2017 population, and socioeconomic data that included the 2010 poverty rate, 2010 percent urban, and the percent of adults with a college degree or higher (during 2007-2011). We performed standard scaling on the abovementioned variables before clustering because unsupervised methods such as k-means tend to yield biased results if the variables are not on a comparable scale (for example, features with a larger dynamic range will dominate the clustering algorithm). For each variable, the standard scaler subtracts the mean and divides the difference by the standard deviation, such that the scaled data represent the number of standard deviations above (+) or below (−) the mean. The cluster centers for each of the variables are provided in Tables III and IV. The FFNN (3083 counties) and LSTM (3066 counties) models underestimated the majority of counties, which fall within clusters 1 and 2. These counties are highly populated, with a lower poverty rate and a higher percentage of urban and educated population. In contrast, the counties in cluster 3, which the models overestimated, are relatively less populated, with a higher poverty rate and a lower percentage of urban and educated population.

Distribution of error at the county level for the LSTM for all 3139 counties. The green line and red star represent the median error and mean error, respectively.
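The standard scaling step can be sketched as follows (the toy county table is illustrative; the study scaled the model errors, population, and socioeconomic variables before k-means):

```python
import numpy as np

def standard_scale(X):
    """Column-wise z-scores: (x - mean) / std, so each scaled value is
    the number of standard deviations above (+) or below (-) the
    column mean."""
    X = np.asarray(X, float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Toy county table: population, poverty rate (%), percent urban.
X = np.array([[5_000, 22.0, 35.0],
              [80_000, 12.0, 70.0],
              [900_000, 8.0, 95.0]])
Z = standard_scale(X)
print(Z.mean(axis=0), Z.std(axis=0))
```

Without this step, the population column (spanning hundreds of thousands) would dominate the Euclidean distances that k-means minimizes, swamping the percentage-scale variables.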
Last, as the LSTM performed best among the models, we conducted an experiment to observe how the model would perform over time in a real-world setting with less training data. Specifically, we evaluated how the LSTM would predict the population of the next year when provided with the previous two years of NTL data. For example, 2013 and 2014 NTL data were used to predict the population of 2015. Box plots of the population error without outliers for the years 2015, 2016, 2017, and 2018 are presented in Fig. 9. The model appears stable for all years, with the IQR of errors consistently in the range of −20 000 to +20 000. Overall, the model performed consistently across time, albeit with overestimates in 2016.
V. DISCUSSION

Table I demonstrates that the FFNN was better than the baseline model despite having the same network architecture. The difference between these models lies in their exposure to the data: the baseline model was provided only the median value of NTL, whereas the FFNN was provided a distribution of NTL in the form of quantiles. This richer distributional data contributes to the superior performance of the FFNN. Fig. 8 provides insights into where the FFNN and LSTM models are positively or negatively biased. Groups 1, 2, and 3 comprise counties with overestimation, and groups 4 and 5 contain underestimated counties. We believe that gas flares may help explain the presence of the overestimation. Cushing et al. [49] identified eight basins with a high density of gas flares, of which Williston, Permian, and Western Gulf are the three basins with the highest total number of gas flares. Counties with overestimation from groups 1, 2, and 3 are proximal to these top three basins with high-density flares. Researchers have detected gas flares using different NTL imagery sources such as DMSP-OLS and VIIRS [10]. This suggests that flares can light up the sky during the night, which is captured in NTL, leading to overestimates in these counties. Even though the LSTM performed better than the other models, there are geographic variations in both performance and behavior at the division and county levels. Additional information, such as the number of gas flare sites, could be added to further increase the performance of the models.
There are several contributions of this research that may extend the utility of NTL for estimating the human population and its distribution among counties. First, the approach combines measures of the central tendency of NTL with the monthly distribution of NTL over multiple years as a predictive matrix of variables. This contrasts with previous research, which tends to use only measures of central tendency or the sum of NTL [24], [31], [33], [34], [35]. Second, unlike previously adopted models [32], we used and compared deep learning models (FFNN and LSTM) as an analytical tool that links NTL with the human population of U.S. counties. Third, the methodology can potentially be extended to a wide range of geographies and resolutions, such as cities, counties, states, and nations, depending on the availability of NTL. Fourth, this study explores the generalizability of deep learning models with a robust holdout strategy that prevents data leakage. In the past, researchers have achieved promising results by limiting their study area, a constraint that can help bolster a more accurate model [25], [33], [34], [35]. We found that our deep learning models robustly estimate counties in a U.S. Census division even when data from that division were not used to train the model. This suggests that the model can predict the population of counties without census data by using NTL alone.
Our study has some limitations in terms of modeling and spatial resolution. First, the model weights are randomly initialized, which can result in slight variations in accuracy across runs. However, random initialization of weights is a common practice that aids the training of a neural network by resolving the symmetry-breaking problem: if all the initial weights of a network are assigned a constant value, every unit receives the same weight updates in the backpropagation step. Second, the deep learning models are data-driven; estimating the population of counties in a country other than the U.S. may require fine-tuning of model parameters, whereas our study held the architecture constant for each holdout experiment. Third, the models in this article estimate the population at a coarse resolution rather than at a granular resolution such as that of the LandScan gridded data [50]. However, during disasters such as earthquakes and tsunamis, the primary utility of gridded population datasets is to estimate populations at risk. In such cases, the impacted populations span regions significantly larger than 3 or 30 arc-second grid cells, which requires aggregating the population of individual cells within the region of interest or impact. In terms of usability, both the proposed approach and gridded population datasets converge, albeit with different orders of operations for estimating the population within a region of interest: gridded population datasets estimate the population of individual grid cells first and then aggregate the cells within the region, whereas the proposed approach first extracts the NTL within the region of interest and subsequently estimates the population for that region. Finally, this study depends primarily on NTL data, and inaccuracies in NTL can affect model performance. However, we used VIIRS DNB NTL imagery, one of the highest resolution products available, with noise filtered from multiple sources. In a future study, we suggest addressing these limitations to further improve the model's efficiency.
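The symmetry-breaking role of random initialization, and how seeding removes the run-to-run variation noted above, can be illustrated with a minimal sketch; the layer dimensions and weight distribution are illustrative, not the study's actual configuration:

```python
import numpy as np

def init_weights(shape, seed):
    # Small random (non-constant) weights break symmetry: each hidden
    # unit starts different, so each receives different gradient updates
    # during backpropagation.
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 0.1, size=shape)

w1 = init_weights((4, 2), seed=0)
w2 = init_weights((4, 2), seed=0)
# Reusing the seed reproduces the same initialization exactly,
# eliminating initialization-driven accuracy variation between runs.
assert np.array_equal(w1, w2)
assert np.unique(w1).size > 1  # random, not a constant matrix
```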

VI. CONCLUSION
Estimating the population of a county from NTL has advantages over traditional population counting techniques, primarily the monthly availability of NTL for the entire world and reduced operating costs. In this study, we demonstrated the use of VIIRS DNB NTL to predict the county-level population of the U.S. without any selection criteria on counties. We implemented a baseline model and two deep learning models: an FFNN and an LSTM. The models were evaluated against U.S. Census Bureau county population data as ground truth. The FFNN and LSTM models successfully captured the temporal aspects of NTL through quantile calculation and, in turn, improved performance. The robustness of the models is demonstrated with the division hold-out strategy (the ability to predict the population of counties in an unseen division). The performance and behavior of the FFNN and LSTM models are not the same across all the divisions, although for most divisions the models exhibit the same bias (overestimation or underestimation). Poor model performance was observed in highly populated counties with a higher percentage of urban and educated population. Among the models, the LSTM performed best; it was also able to predict the population with a consistent error margin across time when trained on the previous two years of NTL data. Furthermore, we estimated the county-level population of the U.S. without constraints on the selection of counties. However, in counties with gas flare sites, deep learning models may produce overestimates. We believe this method can be beneficial for countries where census data are not available.

Fig. 1. Three visualizations for New York County (Manhattan) from 2017. (a) Satellite imagery with 0.3 m resolution from ESRI, updated on 30th November. (b) VIIRS NTL imagery with average DNB radiance (nW/cm²/sr) for November 2017 from NPP-VIIRS. (c) Annual gridded population extracted from Oak Ridge National Laboratory's (ORNL) LandScan™ dataset with grid cells having an approximate resolution of 1 km.

Fig. 2. Occupied portion of the U.S. in Tile 1 of VIIRS DNB NTL with nine divisions.
consisted of an FFNN fed with only the median NTL brightness per county per month (12 months × 5 years = 60 median NTL values), the land area, and the water area per county. As expected, the results produced by the baseline model were quite poor (average weighted R² = −0.01), and we attribute this performance to incomplete information about the distribution of NTL within a county. This encouraged us to better capture the distribution of NTL rather than relying on singular values such as the sum, median, or mean NTL of each county. Using a quantile calculation for every month of each year enabled us to better capture the distribution of NTL in the following ways: 1) discern interannual variations in NTL; Fig. 4(a) shows the distribution of NTL for the month of November in 2013, 2014, 2015, 2016, and 2017; and 2) capture monthly variations in NTL; Fig. 4(b) represents NTL quantiles for January, February, October, and November 2017; note that January and February were brighter than October and November.
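A minimal sketch of the quantile computation for one county and one monthly composite is shown below; the radiance values and the specific quantile levels are assumptions for illustration, not the levels used in the study:

```python
import numpy as np

# Synthetic stand-in for the DNB radiance values of one county's pixels
# in a single monthly VIIRS composite.
rng = np.random.default_rng(42)
county_pixels = rng.lognormal(mean=1.0, sigma=1.5, size=500)

# Summarize the within-county brightness distribution with quantiles
# instead of a single statistic such as the median.
levels = [0.1, 0.25, 0.5, 0.75, 0.9]
monthly_features = np.quantile(county_pixels, levels)

# Repeating this for every month over five years yields 60 rows of
# quantile features per county, which can be flattened for the FFNN
# or fed as a 60-step sequence to the LSTM.
```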

Fig. 5. Distribution of the training dataset for k-fold (k varies from 1 to 9) cross validation across the U.S. divisions.

Fig. 6. Scatter plots of division-holdout results for actual versus predicted county population for (a) FFNN, (b) LSTM, (c) FFNN with log scales, and (d) LSTM with log scales. The red dotted line is the 45° line.

Fig. 7. County-level box plots of error and absolute percent error for FFNN and LSTM for all 3139 counties. A red line joining the notches on either side of a box represents the median value. Boxes are color-coded to differentiate between the error metrics. Sparse and dense hatch fills denote error metrics of FFNN and LSTM, respectively. (a) Error_FFNN. (b) Error_LSTM. (c) Absolute percent Error_FFNN. (d) Absolute percent Error_LSTM.

Fig. 8. County-level population error maps for (a) FFNN and (b) LSTM. Both error maps share a common color scale. Five circled regions represent groups of counties with extreme estimates.

Fig. 9. Distribution of error at the county level for LSTM for all 3139 counties. The green line and red star represent the median error and mean error, respectively.

TABLE I
COUNTY-LEVEL PERFORMANCE (R²) OF MODELS FOR THE U.S. CENSUS DIVISIONS. TEST HOLDOUT RESULTS ONLY

TABLE III
CLUSTERING ANALYSIS ON 2017 FFNN ESTIMATES

TABLE IV
CLUSTERING ANALYSIS ON 2017 LSTM ESTIMATES