Granularity at Scale: Estimating Neighborhood Socioeconomic Indicators From High-Resolution Orthographic Imagery and Hybrid Learning

Many areas of the world are without basic information on the socioeconomic well-being of the residing population due to limitations in existing data collection methods. Overhead images obtained remotely, such as from satellite or aircraft, can help serve as windows into the state of life on the ground and help “fill in the gaps” where community information is sparse, with estimates at smaller geographic scales requiring higher resolution sensors. Concurrent with improved sensor resolutions, recent advancements in machine learning and computer vision have made it possible to quickly extract features from and detect patterns in image data, in the process correlating these features with other information. In this work, we explore how well two approaches—a supervised convolutional neural network and semisupervised clustering based on bag-of-visual-words—estimate population density, median household income, and educational attainment of individual neighborhoods from publicly available high-resolution imagery of cities throughout the United States. Results and analyses indicate that features extracted from the imagery can accurately estimate the density (R$^{2}$ up to 0.81) of neighborhoods, with the supervised approach able to explain about half the variation in a population's income and education. In addition to the presented approaches serving as a basis for further geographic generalization, the novel semisupervised approach provides a foundation for future work seeking to estimate fine-scale information from aerial imagery without the need for label data.


I. INTRODUCTION
Censuses and other surveys administered to collect socioeconomic data are expensive and time-consuming [1]. For this reason, there is often an undesirably long gap between surveys in developing countries, hindering the appropriate formulation of public policies. The ability to measure socioeconomic metrics is essential for evaluating progress toward targets (such as the United Nations' Sustainable Development Goals [2]), promoting accountability, enabling evidence-based decision-making, and providing a basis for informed actions and interventions to improve human well-being [3], [4]. Measurements help determine areas and populations that require the most attention and resources [5]. By quantifying development indicators such as urbanization, education levels, healthcare access, and income, policymakers and development practitioners can identify the most vulnerable and disadvantaged groups and design targeted interventions to address their specific needs and optimize resource allocation [6], [5], [7].
E. Brewer is with Spectral Sciences, Inc., Burlington, MA, USA. E-mail: ebrewer@spectral.com
G. Valdrighi and J. Poco are with Fundação Getúlio Vargas, Rio de Janeiro, RJ, Brazil. E-mails: {giovani.valdrighi, jorge.poco}@fgv.br.
P. Solunke, J. Rulff, Y. Piadyk, and C. Silva are with New York University, New York, NY, USA. E-mails: {parikshit.s, jlrulff, ypiadyk, csilva}@nyu.edu.
Z. Lv is with William & Mary, Williamsburg, VA, USA. E-mail: zlv@wm.edu.
Such metrics are traditionally measured through national accounts data, household surveys, and administrative records such as tax filings [1]. Even in regions where such data are collected, they are expensive [8]. Furthermore, at the neighborhood level, socioeconomic conditions can undergo drastic changes in very short periods of time [9]. Hence, it is imperative to explore faster and more cost-effective techniques for estimating vital socioeconomic outcomes across neighborhoods. Since the 1990s, analysis of nighttime light intensity from remote sensing technologies, such as sensors onboard satellites and aircraft, has effectively contributed to approximating development and socioeconomic metrics at large scales [10], [11]. Beginning around 2016, analysis of remotely sensed data (often high-resolution$^{1}$ daytime imagery) with machine learning methods, particularly neural networks, has rapidly grown in popularity and enabled finer-scale approximations. These techniques have broadly illustrated that analyzing aerial imagery with machine learning is an effective strategy to remotely monitor the natural and built environment and to estimate and track development and socioeconomic statistics [12]. Technical challenges arise in this line of research because aerial images can cover vast areas and, at higher resolutions, be too large to input directly into computer vision networks. This complexity is further compounded by the often irregular shapes of neighborhoods and the aggregation of survey statistics at different levels across various geographical areas.
Our objective. In this paper, we investigate the potential of machine learning models trained on high-resolution aerial imagery to estimate the following metrics at the U.S. Census block group level (approximately the size of a neighborhood):
• Population density
• Median Household Income (MHI)
• Educational attainment (% of the population with at least a bachelor's degree)
This paper is a feasibility study of how well aerial photography and machine learning can detect where and how people live across neighborhoods. UN sustainable development goals that may benefit specifically from the measurement of neighborhood density, income, and education conditions include goals 1 (poverty reduction), 4 (well-being improvement), and 10 (within-country inequality reduction) [2]. We carry out our investigation by training and testing models on 94 of the 100 largest U.S. cities by gross domestic product (GDP). By automatically extracting spatial features in urban settings, variations in city infrastructure, such as roads, parks, and buildings, can be quantified and related to the census variables. Two methodologies are employed: one driven by a supervised convolutional neural network (CNN) and the other by a semi-supervised framework utilizing bag-of-visual-words (BoVW) to generate simplified but interpretable representations of census blocks.
The most notable contributions of this paper are:
1) Demonstration of the ability of contemporary aerial imagery to resolve features related to socioeconomic variables at the scale of a neighborhood;
2) Finding that supervised learning and semi-supervised clustering of image patches can respectively explain 81% and 61% of the variation in neighborhood population density.
Paper structure. The remainder of this paper is organized as follows. Section II provides an overview of the existing literature on estimating poverty, population, and other socioeconomic indicators from aerial imagery with machine learning. In Section III, we detail how the image and annotation data are acquired, fused, and processed. We describe our methods in Section IV, including supervised and semi-supervised learning approaches. Next, we present results in Section V, and analyze and discuss the limitations of our work in Section VI. Finally, concluding remarks are offered in Section VII.

II. RELATED WORK
Throughout much of the world, there is a lack, or a complete absence, of data on the social and economic well-being of people due to conflict, natural disasters, pandemics, and the effort, expense, and time periods between surveys [13], [1]. In recent years, remotely sensed images in combination with machine learning have helped fill in these critical information gaps. Use of such techniques has extended into the estimation of population [14], wealth [15], poverty [13], conflict [16], migration [17], education [18], land use [19], and infrastructure [20], [21], among other applications [22], [23]. In this section we focus on studies related to poverty, population density, education, and related metrics. Income and wealth are correlated with self-reported happiness and well-being [24], [25]. Post-secondary educational attainment is associated with higher levels of income [26], [27], satisfaction with life [28], and lifelong well-being [29]. Educational attainment for women of reproductive age is linked to reduced child and maternal mortality, lower fertility, and improved reproductive health [30]. Existing studies show a mixed correlation between population density and quality of life [31], [32], [33]. Findings within the city of Oslo, Norway suggest that, compared to residents of lower-density neighborhoods, residents in higher-density neighborhoods have higher levels of personal relationship satisfaction and perceived physical health, similar levels of leisure satisfaction, but lower levels of emotional response to neighborhood and higher levels of anxiety driven primarily by noise and safety concerns [33].
Poverty. Gaining momentum in the mid-2010s, several studies have focused on estimating poverty. In [34], a fully convolutional neural network was trained to predict nighttime light intensity from daytime imagery, simultaneously learning features that are useful for poverty prediction at 1 km resolution in Uganda. The model identified different terrains and man-made structures, including roads, buildings, and farmlands, without supervision beyond nighttime lights. Their results approached the predictive performance of survey data collected in the field. [13] showed that a CNN trained on Google Maps daytime imagery and existing survey data can identify image features that can explain 37-75% of the variation in local-level economic outcomes such as wealth and consumption across countries in Africa. In [35], household wealth in Bangladesh was estimated at 10 km resolution with random forest regression from multiple sources such as nighttime lights, daytime imagery, and land cover maps. [15] showed that multispectral 30 m Landsat imagery can help estimate African village wealth in countries where the model was not trained, with errors comparable to existing ground data. Other related work to identify poverty via remote sensing and machine learning includes [36], which validated a Slum Severity Index using Grey Level Co-occurrence Matrix (GLCM) features extracted from high-resolution satellite images of Mexico, [37], which used high-resolution imagery and geospatial covariates to characterize degrees of intra-urban deprivation in Nairobi, Kenya (R$^{2}$ = 0.65), and [38], in which deprivation in Liverpool, UK was measured by extracting features from Google Earth images (R$^{2}$ = 0.54). Additional related poverty studies include [39], [40], [41], and [42].
Population. Commonly used techniques for small-area population estimation typically redistribute population "top-down" from higher to lower administrative units using areal weighting interpolation or dasymetric mapping techniques [43]. Open-source population products that use this approach include LandScan, Meta's High Resolution Settlement Layer (HRSL), Gridded Population of the World (GPW), WorldPop, and Global Human Settlement Layer (GHSL). Existing studies have focused on redistributing population counts using a random forest-based weighting scheme in Cambodia, Vietnam, and Kenya [44], redistributing population density in Peru using satellite imagery-based covariates employing regression and tree-based methods [45], and downscaling population counts using one billion mobile phone call records from Portugal and France [46]. Most population density studies do not validate the accuracy of their estimates against a census [43]. Other studies have used coarse nighttime lights for large administrative areas [47], 3D city models [48], or focused on a subset of the population (such as children under 5 years of age) [49]. Moving further, [43] estimated local population density for in-between census years in Bangladesh by combining household surveys with geospatial data, including an assortment of satellite imagery-based indicators. The data were analyzed with Poisson regression models, with out-of-sample results approximating the density of sub-districts (larger than a village) with an R$^{2}$ of up to 0.83.
Additional metrics. Other closely related work includes [50], in which a Siamese-like Convolutional Neural Network, integrating ridge regression and Gaussian process regression, was developed for the estimation of income for districts and zip codes in New York City. Their model makes use of a pairwise comparison of location-based house price information, daytime satellite images, street views, and spatial location information, achieving an R$^{2}$ of 0.72 at the census tract level. [1] used daytime and nighttime satellite imagery and transfer learning to estimate average income, GDP per capita, and a water index at the city level in two Brazilian states, explaining up to 64% of the variation in the target variables. In [51], the authors estimate American Community Survey socioeconomic variables such as income, race, education, and voting patterns in 200 US cities at the zip code and precinct level solely through 50 million images of street scenes from Google Street View and computer vision detection of the make, model, and year of all motor vehicles present in the images. Finally, [30] explored educational inequalities across Africa by estimating years of schooling across a 5x5 kilometer grid based on geocoded survey data, generating estimates of average educational attainment by age and sex.
Bag-of-visual-words. Widely used in natural language processing, bag-of-words is a numerical representation of text obtained by counting individual words [52]. Despite being a simple formulation, this methodology has shown positive results in diverse language tasks. Inspired by it, computer vision studies have proposed an adaptation called bag-of-features or bag-of-visual-words that has shown positive results in natural scene classification [53], [54]. By creating a set of low-level visual "features" that describe the images, the frequency of the visual features in each image can be used for predictive tasks. The method has also shown positive results [55], [56], [57] in the domain of remote sensing imagery. Despite this existing work, these techniques have not been tested to estimate high-resolution census variables.
Most existing remote sensing studies analyze variables for entire nations, states, or cities, potentially obscuring neighborhood-level prosperity and inequality patterns.Our study pushes beyond the limitations of existing work by investigating population density, income, and education at an unprecedentedly precise scale across a country, using free, publicly available data.

III. DATA
In this section, we detail how the data for the 94 cities are collected and processed.
Imagery. For the United States, orthographic imagery is retrieved from the National Agriculture Imagery Program (NAIP) [58], administered by the U.S. Department of Agriculture. For a given point in the U.S., RGBIR aerial imagery is acquired approximately every 2-3 years at a resolution of 60-centimeter ground sample distance during the agricultural growing season, or "leaf on" conditions. The images are orthorectified, which combines the image characteristics of an aerial photograph with the georeferenced qualities of a map. We utilize the most recent NAIP tiles (2019-2021) for 94 prominent cities across the United States, including the ten largest (by GDP). Our selection process involved filtering cities of the 100 largest metropolitan statistical areas (MSAs) based on 2021 GDP, followed by identifying the largest city by area within each MSA. This selection is used to study a diverse set of cities while considering the most significant ones.
Annotation data. Census data for the United States are acquired through the American Community Survey (ACS) [59]. Every year, the U.S. Census Bureau contacts approximately 3.5 million households (1 in 40 total households) across the country to participate in the ACS, with a 2021 response rate of 85.3%. The survey includes various demographic, social, economic, and housing data on residents such as age, race, occupation, income, disability status, housing type (e.g., single-family, multi-unit), languages spoken, and highest degree earned. The resulting data products are aggregated at various levels from country, to state, to county, to census tract, to block group. In this study, data are examined at the finest possible level, block group, to extract the maximum benefit from the resolution of the imagery. Block groups contain an average of approximately 1,500 residents and may henceforth be referred to as neighborhoods.
The following 5-year$^{2}$ ACS variables for neighborhoods are downloaded via API for all counties containing the cities (since the ACS aggregates by county but not city) from the year their associated imagery was captured:
• Total population, $P_t$
• Population >25 years old, $P_{25}$
• Median Household Income, MHI
• Four education variables: population >25 years old whose highest degree completed is (1) bachelor's, $P_b$, (2) master's, $P_m$, (3) professional, $P_p$, (4) doctoral, $P_d$
An educational attainment metric, $E$, representing the percent of the population with at least a bachelor's degree is calculated with
$$E = \frac{P_b + P_m + P_p + P_d}{P_{25}} \times 100$$
The population density metric, $D$, representing people per square kilometer is calculated with
$$D = \frac{P_t}{A \times 10^{-6}}$$
where $A$ is the geographic area of a neighborhood in square meters.
$^{2}$ 5-year estimates aggregate data from the preceding 60-month period. For example, a 5-year estimate from the 2021 ACS aggregates data from 01Jan2017 to 31Dec2021 [60].
See Table I for all specifications of the neighborhoods analyzed.
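As a minimal sketch of how the two derived metrics could be computed, assume that $E$ is the share of $P_{25}$ holding any of the four degree types and that $D$ converts the area from square meters to square kilometers. The function names and example values below are hypothetical and are not taken from the study's code or data.

```python
def educational_attainment(p_b, p_m, p_p, p_d, p_25):
    """Percent of the population over 25 with at least a bachelor's degree."""
    if p_25 <= 0:
        raise ValueError("P_25 must be positive")
    return 100.0 * (p_b + p_m + p_p + p_d) / p_25


def population_density(p_t, area_m2):
    """People per square kilometer, given an area in square meters."""
    if area_m2 <= 0:
        raise ValueError("area must be positive")
    return p_t / (area_m2 / 1_000_000.0)


# Hypothetical block group: 1,500 residents on 0.5 km^2,
# 300 of 1,000 adults over 25 hold a bachelor's degree or higher.
example_e = educational_attainment(200, 80, 15, 5, 1000)
example_d = population_density(1500, 500_000)
```

The guards against non-positive denominators mirror the step described next, in which neighborhoods with zero population are dropped.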
Image-label pairing. To generate geographic boundaries for the imagery, shapefiles of the neighborhoods are downloaded from the U.S. Census Bureau's TIGER archive [61] (see Fig. 1). This geographic information (polygons of neighborhoods) is then merged with their corresponding ACS variables. In the process, neighborhoods containing census errors or zero population are dropped.
Fig. 2A shows all the cities examined (in orange outlines) overlaid with the ACS median household income by neighborhood, as an illustration. The cities of New York and Chicago are enlarged in Figs. 2B and 2C to provide a more detailed view.
Next, the imagery is cropped by neighborhood based on the bounding boxes of the neighborhood polygons. This results in a total of 43,497 images (see Table I for other specifications on the neighborhood crops).
Crop processing. The crops are processed for CNN input for the supervised method in two ways, "patching" and "resizing" (an example is visualized in Fig. 3).
Patching. With this technique, neighborhoods are split into 512x512 patches, as in Fig. 3A. If either of the original dimensions of an image is not a multiple of 512, it is padded by zeros before being split. Only patches composed of >50% nonzero pixels are kept. This results in a total of 339,413 patches. Patching allows an image of a neighborhood to retain its resolution and shape, but results in the CNN treating each patch as a separate image, thus breaking apart the spatial relationship within a neighborhood.
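The patching rule can be illustrated with a short sketch. It is written in pure Python on a single-channel image stored as a list of rows, whereas a real pipeline would more likely use an array library. Padding on the right and bottom edges is an assumption, since the text does not say where the zeros are added, and `split_into_patches` is a hypothetical name.

```python
def split_into_patches(image, patch=512, min_nonzero=0.5):
    """Zero-pad a 2-D image to a multiple of `patch`, cut it into square
    tiles, and keep tiles with more than `min_nonzero` nonzero pixels."""
    height, width = len(image), len(image[0])
    padded_h = -(-height // patch) * patch  # round up to a multiple of patch
    padded_w = -(-width // patch) * patch
    # Assumption: zeros are appended on the right and bottom edges.
    padded = [row + [0] * (padded_w - width) for row in image]
    padded += [[0] * padded_w for _ in range(padded_h - height)]

    kept = []
    for top in range(0, padded_h, patch):
        for left in range(0, padded_w, patch):
            tile = [r[left:left + patch] for r in padded[top:top + patch]]
            nonzero = sum(1 for r in tile for v in r if v != 0)
            if nonzero > min_nonzero * patch * patch:
                kept.append(((top, left), tile))
    return kept


# Toy example: a 6 x 9 image of ones with a patch size of 4 instead of 512.
toy = [[1] * 9 for _ in range(6)]
tiles = split_into_patches(toy, patch=4)
kept_origins = [origin for origin, _ in tiles]
```

In the toy case only the two tiles that lie fully inside the image survive; the edge tiles are at most half nonzero and are discarded by the strict >50% rule.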
Resizing. With this technique, neighborhoods are resized (through bilinear interpolation) to the median size of a neighborhood, i.e., a width of 1353 pixels and a height of 1350 pixels. Resizing allows a neighborhood to be read as a single image by the CNN, but results in upsampling/downsampling for crops that have one or both dimensions less/greater than the median, and shape distortion for crops with a width-to-height ratio different from 1353/1350.
Semi-supervised. In the semi-supervised methodology, the imagery is cropped into a square grid, each cell measuring 112x112 pixels. Therefore, each neighborhood is composed of a mosaic of patches. This high granularity serves the purpose of separating distinct urban structures within each patch. Because the NAIP tiles cover areas not examined in this study, only the subset of patches that intersect a neighborhood are used. Additionally, considering that some neighborhoods are very large (i.e., they have low population density), further filtering is implemented to limit the maximum number of patches to 50 per neighborhood to reduce computational costs. The patches are semi-randomly selected with a probability proportional to the percentage of their area contained in the neighborhood. As a result of these filtering measures, around two million 112x112 patches are generated (see Table II). Fig. 4 depicts two example neighborhoods and their division of patches.
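One way to realize the area-weighted selection is weighted sampling without replacement. The sketch below uses Efraimidis-Spirakis random keys, a standard technique chosen here for illustration. The exact sampling procedure is not specified in the text, so the function `sample_patches` and its parameters are assumptions.

```python
import random


def sample_patches(patch_ids, area_fractions, max_patches=50, seed=0):
    """Pick up to `max_patches` patches without replacement, favoring
    patches whose area lies mostly inside the neighborhood."""
    rng = random.Random(seed)
    # Patches with no overlap (weight 0) are never eligible.
    candidates = [(pid, w) for pid, w in zip(patch_ids, area_fractions) if w > 0]
    if len(candidates) <= max_patches:
        return [pid for pid, _ in candidates]
    # Each patch receives the key u ** (1 / w) with u uniform in [0, 1);
    # keeping the largest keys gives a weighted sample without replacement.
    keyed = [(rng.random() ** (1.0 / w), pid) for pid, w in candidates]
    keyed.sort(reverse=True)
    return [pid for _, pid in keyed[:max_patches]]


# Hypothetical neighborhood: 100 interior patches and 100 boundary patches.
ids = list(range(200))
fractions = [1.0] * 100 + [0.1] * 100
chosen = sample_patches(ids, fractions)
small = sample_patches([1, 2, 3], [0.5, 0.0, 1.0])
```

Neighborhoods with 50 or fewer eligible patches keep all of them, so the cap only affects large, low-density neighborhoods.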
For both supervised and semi-supervised approaches, the resulting data are separated into training, validation, and testing sets for model input in 70-15-15% splits.Dataset sizes are shown in Table II.
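A 70-15-15% split can be sketched as follows. Splitting by neighborhood identifier, so that patches of one neighborhood stay in the same set, and the fixed seed are both assumptions, since the text does not state the unit or randomization of the split.

```python
import random


def split_ids(ids, train=0.70, val=0.15, seed=0):
    """Shuffle ids and cut them into training, validation, and test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])  # remainder forms the test set


# Hypothetical set of 1,000 neighborhood identifiers.
train_ids, val_ids, test_ids = split_ids(range(1000))
```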

IV. METHODS
We now present the formulations of each methodology and their details.

A. Supervised
The overall supervised methodology for both the image patching and resizing techniques is executed in the following manner:
1) A ResNet50-based [62] architecture (shown in Fig. 5) is trained in separate instances on each of the three target variables (density, MHI, education);
2) The trained networks are evaluated on the independent test set not included in the training process.
The ResNet50-based architecture of Fig. 5 has its base model pre-trained on ImageNet.$^{3}$ In this study, seven fully-connected layers are added after the base model to gradually scale down the feature space to a single output estimation. For model training on each metric, updating all weights in all layers for patching and updating only the fully-connected layers for resizing produces the best results. A batch size of

B. Semi-supervised
As discussed in Section II, we use ideas from bag-of-visual-words to produce a rich set of features from the high-resolution imagery that are interpretable and correspond to distinct urban infrastructures. These features are later used to fit a supervised model with the target variables. The methodology (Fig. 6) can be separated into two steps: clustering of patches and calculation of the cluster distribution of each neighborhood.
Clustering of patches. While cities in the U.S. exhibit variations in culture and environment, there are shared characteristics in their urban infrastructure that can be organized into clusters. However, due to the high dimensionality of images and the distances defined between pixel colors, clustering algorithms can present better results when applied to a learned representation of the images. Diverse methodologies have already used the representational power of deep neural networks to improve clustering results [63]. A simple and effective technique is to run k-means in the latent representation learned from an autoencoder. Deep Embedding Clustering (DEC) [64] is a more sophisticated technique that trains an autoencoder in two stages: after the first stage optimizes for reconstruction, the embeddings are clustered by k-means and the resulting centroids are added as parameters jointly with the encoder. In the second stage, the loss penalizes the distance between the embeddings and their respective centroids.
In our work, we evaluate using both k-means in a regular autoencoder and k-means in an autoencoder trained with DEC. For both techniques, a ResNet50 pre-trained on ImageNet is used as a feature extractor from patches, generating a 2048-dimension representation for each. The autoencoder is defined as a feed-forward network with the architecture depicted in Fig. 6, i.e., four layers in the encoder and decoder. The choice of the latent space dimension ($d_Z$) hyperparameter is crucial to balance the expressive power of the autoencoder and the effectiveness of the k-means clustering. A higher dimension allows for lower reconstruction loss, but it may hinder the clustering process since k-means relies on the Euclidean distance between embeddings. Therefore, the latent dimension is tested using {32, 64}. A second necessary parameter is the number of clusters, k, which is selected from {50, 100, 200} through experimentation.
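The k-means step that both variants rely on can be sketched with plain Lloyd's algorithm on Euclidean distance. This is a toy illustration on two-dimensional points rather than the 32- or 64-dimensional autoencoder embeddings, and the initialization, iteration count, and function names are assumptions rather than the configuration used in the study.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every embedding.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


# Toy 2-D "embeddings" forming two well-separated groups.
embeddings = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids, labels = kmeans(embeddings, k=2)
```

Because the assignment step depends only on Euclidean distance, distances become less informative as the latent dimension grows, which is the trade-off behind testing small values of $d_Z$.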
Cluster distribution. In this step, we use the patch clusters to build two different sets of features for the neighborhoods. The first set of features is the frequency of each of the k clusters among the patches. The second set of features calculates the distance in the latent space of the patches to each of the k centroids defined in the latent space. Both methods attempt to describe each neighborhood as a composition of clusters, i.e., a composition of distinguished urban infrastructures. The features can then be used in a regression model. We choose to evaluate with random forest.
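The two feature sets can be sketched as follows. Normalizing the frequencies by the number of patches and averaging the patch-to-centroid distances over a neighborhood are both assumptions made for illustration, since the text does not specify how the per-patch values are aggregated.

```python
import math
from collections import Counter


def frequency_features(patch_labels, k):
    """Share of a neighborhood's patches assigned to each of the k clusters."""
    counts = Counter(patch_labels)
    total = len(patch_labels)
    return [counts.get(c, 0) / total for c in range(k)]


def distance_features(patch_embeddings, centroids):
    """Mean Euclidean distance from the neighborhood's patches to each centroid."""
    features = []
    for centroid in centroids:
        dists = [math.dist(e, centroid) for e in patch_embeddings]
        features.append(sum(dists) / len(dists))
    return features


# Hypothetical neighborhood with five patches and k = 4 clusters.
freq = frequency_features([0, 0, 2, 2, 2], k=4)
# Hypothetical 2-D embeddings of two patches and two centroids.
dist = distance_features([[0.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [3.0, 0.0]])
```

Either vector has one entry per cluster, so every neighborhood is described by k features regardless of how many patches it contains.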
As mentioned, the methodology has hyperparameters that need to be selected: the dimension of the latent space $d_Z$, the number of clusters k, the type of the set of features, and random forest hyperparameters. The autoencoder and random forest are trained using only the training dataset, with hyperparameters selected based on performance on the validation set. We present separate results comparing k-means and DEC.
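The selection procedure amounts to a grid search over the stated values. In the sketch below, `toy_score` is an arbitrary placeholder for "train on the training set, score on the validation set"; its values are not results from the study, and the random forest hyperparameters are omitted for brevity.

```python
from itertools import product

LATENT_DIMS = (32, 64)
CLUSTER_COUNTS = (50, 100, 200)
FEATURE_TYPES = ("frequency", "distance")


def select_configuration(validation_score):
    """Return the best (d_Z, k, feature_type) tuple and the full grid."""
    grid = list(product(LATENT_DIMS, CLUSTER_COUNTS, FEATURE_TYPES))
    return max(grid, key=validation_score), grid


def toy_score(config):
    """Arbitrary placeholder score; not a result from the study."""
    d_z, k, feature_type = config
    bonus = 0.1 if feature_type == "frequency" else 0.0
    return d_z / 64 - abs(k - 100) / 100 + bonus


best, grid = select_configuration(toy_score)
```

With two latent dimensions, three cluster counts, and two feature types, the grid contains twelve configurations per target variable before any random forest settings are considered.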

V. RESULTS

A. Supervised
Table III displays supervised results on the test data, in terms of mean absolute error (MAE) and R$^{2}$, for the patching and resizing image processing techniques. In Table III, bold fonts highlight the most accurate MAE and R$^{2}$ results. Models performed best at measuring density, with models trained on resized neighborhoods able to explain 81% of the variation in density across the study area. These models were able to estimate density to within 461 people/km$^{2}$, on average (for reference, the density variable has a ground truth standard deviation of 2519 people/km$^{2}$). With resizing, models trained to measure median household income and educational attainment are able to explain about half the variance in the ground truth (R$^{2}$ of 0.48 and 0.51, respectively). Models trained on patches performed as well as those trained on resized images for estimating education level; however, they were about 7-8% worse (in terms of R$^{2}$) at estimating density and MHI than their resize-based counterparts. Additionally, it is worth noting that while the R$^{2}$ score is similar for estimating education level, the MAE is lower using patches compared to resizing.
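For reference, the two reported metrics can be computed as in the sketch below, assuming the standard definitions of mean absolute error and the coefficient of determination. The example values are made up and unrelated to Table III.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between ground truth and estimates."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up densities (people per square kilometer) for four neighborhoods.
truth = [1000.0, 2000.0, 3000.0, 4000.0]
estimate = [1100.0, 1900.0, 3200.0, 3800.0]
mae = mean_absolute_error(truth, estimate)
r2 = r_squared(truth, estimate)
```

MAE is expressed in the units of the target, while R$^{2}$ is unitless, which is why one technique can have the lower MAE without having the higher R$^{2}$.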

B. Semi-supervised
Table IV presents MAE and R$^{2}$ results obtained on the test data with the best parameter settings from Table V. Results show that k-means in the latent space performs better than DEC. Only small values of k were evaluated in [64], and using a larger k (>100) could result in cluster collapse during training (since not all clusters have samples linked to them). The designed bag-of-visual-words features are able to explain some degree of variation in density (R$^{2}$ = 0.61); however, they are not well suited for estimating income and education, as we discuss later in Section VI. Also in Section VI, we discuss the capabilities of this method and apply explainability techniques.

Fig. 6. Overall steps of the semi-supervised methodology (panel labels: patches from block, cluster distribution, blocks prediction). First, an unsupervised clustering algorithm is used to cluster small patches of neighborhoods from aerial imagery. The clustering uses ResNet50 as a feature extractor and an autoencoder. The second step is supervised regression on the target variables using the distribution of clusters composed of neighborhood patches.

VI. DISCUSSION

A. Supervised
As shown in Table III, resizing the neighborhoods generally produces more accurate results than splitting them into patches. The mean absolute errors in measuring population density and education are better through patching, but their R$^{2}$ values are lower than with resizing, indicating that those models do not generalize and explain the variation in density and education as well. A possible explanation for the performance gap between the processing techniques is that resizing the neighborhoods, as opposed to splitting them up, retains more of the spatial and geographic relationships throughout the image that are amenable to inferring the target metrics. This makes more sense when taken to the logical extreme: a model trained and tested on individual pixels will perform no better than random.
CNN interpretation. A frequent criticism of deep learning models is the difficulty of interpreting the relative importance of features in prediction. To help explore what factors contribute to a density model's estimates, we apply SHAP (SHapley Additive exPlanations) saliency map visualizations [65] on patches from two neighborhoods of contrasting density (Fig. 7). In Fig. 7, the first column shows patches in high-density (top) and low-density (bottom) neighborhoods. The second column displays the SHAP values at the pixel level. SHAP values represent the relative importance of features within the selected images. In this example, red pixels represent those that contribute to a higher density estimate while blue pixels contribute to a lower estimate. Fig. 7 is one of many examples showing man-made structures contributing to a higher density estimate. In the examples, the outlines of smaller dwellings, in particular, are relatively important features. In comparison, in the low-density area, the model demonstrates a tendency to assign a relatively lower density value to larger single-family homes and other features within this more rural area. It is important to note these features would be much less visible from lower-resolution imagery such as Sentinel-2.
Effect of resizing. To better understand the effect of resizing on model estimation, for each image in the density test set, absolute error is plotted against the degree to which the image is resized (Fig. 8). In Fig. 8, width (and height) deviation is the difference between an original image's width (and height) and the median value the images are resized to, i.e., 1353 (and 1350) pixels. Interestingly, the error decays exponentially as the original images become larger. A possible explanation for this is that larger neighborhoods contain more information, both in the amount of geographic context and in the number of pixels (i.e., in the bilinear interpolation resizing process, a neighborhood image smaller than the median must upsample while an image greater than the median must downsample). This phenomenon also occurs with MHI and education metrics, though to a lesser degree.
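The deviation plotted against error can be computed as in the sketch below. The sign convention (original size minus resize target) is an assumption based on the description of Fig. 8, and the example crop sizes are hypothetical.

```python
MEDIAN_WIDTH, MEDIAN_HEIGHT = 1353, 1350


def resize_deviation(width, height):
    """Signed difference between a crop's original size and the resize target."""
    return width - MEDIAN_WIDTH, height - MEDIAN_HEIGHT


def scale_factors(width, height):
    """Factors applied by the resize; values above 1 mean upsampling."""
    return MEDIAN_WIDTH / width, MEDIAN_HEIGHT / height


# A small crop must be upsampled; a large crop must be downsampled.
small = resize_deviation(451, 675)
large = resize_deviation(2706, 2700)
small_scale = scale_factors(451, 675)
```

Under this convention, negative deviations correspond to crops that are stretched to the target size and positive deviations to crops that are shrunk.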

B. Semi-supervised
Hyperparameters. The semi-supervised method utilizes two important hyperparameters: the dimension of the latent space, $d_Z$, and the number of clusters, k. As previously mentioned, different values for the parameters are evaluated, and the ones that provided the best results on the validation data are selected (and displayed in Table V). Focusing on the results through k-means, it can be seen that the optimal number of clusters is k = 100 for both MHI and educational attainment, in contrast with the density variable, which obtained optimal results with k = 50. This is intuitive because more detailed clusters ("visual words") are necessary to predict the small variations in MHI and educational attainment. The latent dimension with the best results is $d_Z$ = 64 (the larger of the two tested), and this result indicates that smaller dimensions do not represent the information in the images as accurately.
Clustering. By studying the output clusters, it can be seen that clustering primarily results in groups of undeveloped (natural) geographic areas and groups of urban infrastructure without substantial differentiation within the two groups (though some clusters exhibit congregation of certain feature subtypes such as roads and bodies of water). The method clusters most prominently on natural versus built environment, i.e., the degree of urbanization and development, which are closely correlated with population density but not income or education (at least not within city boundaries in the U.S.). As an illustration, Fig. 9 shows random samples from two clusters, with cluster 7 containing patches from urban areas but without a particular infrastructure type. Despite its shortcomings, the semi-supervised approach can take advantage of a large unlabeled dataset to create a corpus of bag-of-visual-words features, which may be an interesting aspect to exploit in future work.
Interpretation. The proposed semi-supervised methodology involves two learning steps that make use of complex models. First, a deep neural network is employed for clustering, and then a random forest model is used for regression. A t-SNE projection [66] is used to visualize the latent representation learned by the autoencoder and to understand and validate the neighboring relations learned by the network. As in the analysis of the supervised method, we use SHAP to study how the random forest regression model uses cluster features to generate estimates. As an example, we select two neighborhoods in New York, one with high density and one with low density.
In the low-density neighborhood, the most important features are from cluster 3, which closer inspection reveals to be a cluster of water patches (shown in Fig. 9). In the high-density neighborhood, the most important features are related to clusters 31, 39, and 7 (despite the neighborhood having no patches in cluster 39). Similarly, inspecting the patches of these clusters shows that they are densely built areas with attributes such as high-rise apartments (Fig. 9).
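To illustrate the idea behind Shapley-value attributions on cluster features, the sketch below computes exact Shapley values by brute-force enumeration against a fixed baseline. This is a simplified stand-in, not the SHAP library's tree explainer used with a random forest; the toy linear model, its weights, and the instance and baseline histograms are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    n = len(instance)

    def value(subset):
        mixed = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Standard Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_density_model(hist):
    """Hypothetical regressor over [water, built, mixed] cluster fractions."""
    water, built, mixed = hist
    return 1000.0 * built - 500.0 * water + 200.0 * mixed


instance = [0.0, 0.8, 0.2]   # hypothetical neighborhood with no water patches
baseline = [0.3, 0.4, 0.3]   # hypothetical "average" neighborhood
phis = shapley_values(toy_density_model, instance, baseline)
```

In this toy setting the water cluster receives a positive attribution even though the neighborhood contains no water patches, because its absence differs from the baseline. A similar effect may be one reason a cluster with zero patches can still rank among the most important features. The attributions also sum to the difference between the model output for the instance and for the baseline.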

C. Limitations
For supervised learning, only one resize dimension and one patch size are studied, so conclusions can only be drawn for the configurations presented. That is, a larger patch size, for example, may lead to better or worse performance than the one chosen (512x512). Our experiments show that although the semi-supervised approach exhibits interesting results, it does not surpass the performance obtained with supervised learning, particularly when estimating income and education. Clustering is unable to separate nuanced variations in urban infrastructure, which could have been helpful in estimating certain socioeconomic variables such as income and education. Conversely, the supervised models are trained directly on the predicted variables, directly associating image features with metric values. Overall, the income and education results from both methods highlight a general limitation of machine learning in extracting useful features from aerial imagery that correlate with neighborhood-level census variables. It also remains an open question to what extent the results gathered here in the United States generalize to different social and ecological environments, with or without fine-tuning the models. Finally, it should be noted that many factors contribute to human well-being and not all of them can be quantified, including from aerial or remotely sensed data [67]. To most accurately understand well-being and development, a holistic approach that considers a complex mesh of personal freedoms, institutional capacity and stability, mental and physical health, cultural values, etc. is required [68]. Nevertheless, the techniques employed in this study can serve as a foundation for further refinement, allowing for more precise estimates of socioeconomic variables at a finer granularity than in the existing literature. This progress can pave the way for future endeavors aimed at estimating the well-being of neighborhoods.

VII. CONCLUSION
Census data require resources and coordination to collect, and are therefore produced relatively infrequently in developing countries. Such data are also usually disseminated with a lag, making it difficult to rapidly assess changes in living standards, especially at local levels. In this work, we explored how well CNNs trained on census data and a semi-supervised clustering approach can estimate census variables in urban neighborhoods throughout the United States. Results show promise in accurately approximating certain metrics (i.e., density), while uncovering limitations for others (i.e., income, education).
Our findings raise several questions for further research, including generalizability, whether changes in aerial imagery could be used to forecast changes in neighborhood metrics over time, and how the latest aerial data could be fused with survey-based measures (including "now-casting" [42]). Different methods may also be explored at this scale, such as the use of semantic models to explicitly extract features such as roads (and their quality), number of buildings (and their type), amount of vegetation, etc. (possibly compiled into an "urban well-being index") for a downstream regressor. As more and more high-resolution aerial imagery products become available ([69], [70]), including from cheaply produced unmanned aerial vehicles (UAVs) deployed outside the U.S. ([71], [72]), the techniques introduced here provide a foundational benchmark for researchers and reveal potentially fruitful avenues for future work.

Fig. 1. (A) Illustration of the neighborhoods in states containing cities analyzed in this study. (B) Blow-up of neighborhoods in the state of Florida. (C) Blow-up of neighborhoods in Hillsborough County, Florida, which contains the city of Tampa.

Fig. 2. (A) Illustration of median household income (in 2021 USD) of counties containing the 94 cities examined. City boundaries are in orange. (B) Expanded view of New York City, in which its boroughs are coterminous with counties. (C) Expanded view of Chicago, in which its city limits are within Cook and DuPage counties (mostly Cook).

Fig. 3. Processing of a typical neighborhood (this one is in San Jose, CA) for the two processing methods for supervised learning. (A) Patching: The image is split into six 512x512 patches. (B) Resizing: The image is resized to 1353x1350 pixels (the median width and height of a neighborhood).

Fig. 4. For the semi-supervised approach, examples of 112x112 patches for neighborhoods in different cities. Patch boundaries are denoted with white borders, and census block groups (i.e., neighborhoods) with black borders. In the New York neighborhood, all patches overlapping with the neighborhood are used. For the larger Houston neighborhood, only 50 samples are selected.

Fig. 5. Visual representation of the ResNet50-based architecture used in the supervised approach. 30% dropout layers are embedded after the first four fully-connected layers.

Fig. 7. SHAP of two patches from the supervised approach: (top row) within a high-density area of 1814 ppl/km$^{2}$, and (bottom row) within a low-density area of 11 ppl/km$^{2}$.

Fig. 8. 3D plot of the absolute error as a function of width and height deviation from the resized values (when estimating density using the supervised resizing approach).

Fig. 9. Analysis of the most important features of two neighborhoods in New York using SHAP. The most important cluster for the low-density neighborhood is related to water, and for the high-density neighborhood, two clusters corresponding to densely built areas.