Siamese-Like Convolutional Neural Network for Fine-Grained Income Estimation of Developed Economies

Estimating the per-capita income and the household income at a fine-grained geographical scale is critical but challenging, even across the developed economies. In this article, a novel Siamese-like Convolutional Neural Network, integrating Ridge Regression and Gaussian Process Regression, has been developed for fine-grained estimation of income across different parts of New York City. Our model (the GP-Mixed-Siamese-like-Double-Ridge model) makes good use of the pairwise comparison of location-based house price information, daytime satellite image, street view and spatial location information as the inputs. Taking the per-capita income and the median household income in New York City as the ground truths, our model outperforms (R2 = 0.72–0.86 for five-fold validation) other state-of-the-art income estimation models and achieves good performance in cross-district and cross-scale validation. We also find that models which partially share our model architecture, including the Spatial-Information-GP and the Mixed-Siamese-like model, perform well under certain spatial granularity and data availability. Since such models rely on fewer data input types and simpler architectures, they can be used to save resources on data collection and model training. Hence, using our model for fine-grained income estimation does not mean excluding these models that share similar architectures. Our fine-grained income estimation model allows the per-capita and the household income data generated at fine-grained resolution to be coupled with other types of data of the same scale, such as air pollution or epidemic data, so that location-specific socio-economic studies and evidence-based decision-making can be conducted at the fine-grained resolution. Future research will focus on extending our model for fine-grained income estimation in developing metropolises, and for estimating other socio-economic indicators.


I. INTRODUCTION
Measuring income¹ distribution at a high spatial resolution is critical but challenging, even for developed economies [1]-[3]. Accurate income data are mainly obtained from field surveys, which can be highly capital intensive [2]. Over the past few decades, attempts have been made to overcome data scarcity and to estimate fine-grained income distribution across developing or non-urban areas [4]-[7]. Few studies have attempted to make good use of proxy data and deep learning models for high accuracy, fine-grained income estimation in developed and urban contexts. Such studies should advance our understanding of income distribution and variation at the fine-grained geographical level, so far as the developed and urban contexts are concerned [2], [8].

¹ According to the definition of the American Community Survey, "Total income" refers to the sum of incomes reported separately for wage or salary income; net self-employment income; interest, dividends, or net rental or royalty income, or income from estates and trusts; Social Security or Railroad Retirement Income; Supplemental Security Income (SSI); public assistance or welfare payments; retirement, survivor, or disability pensions; and all other incomes [3].

The associate editor coordinating the review of this manuscript and approving it for publication was Jinjia Zhou.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Income is an important indicator critical for socioeconomic studies in the developed world. First, income can largely reflect citizens' accessibility to a number of goods and services in most developed economies [9]. Second, income is closely related to one's living standards in developed economies. Given better welfare allocation (e.g. retirement plans, free health care, unemployment compensation), citizens of developed economies have fewer incentives to save their income to mitigate future financial risks due to illness, unemployment, etc., and more incentives to spend their income over the short term to maintain their standards of living [10]-[13]. In the United States, the savings rate fell substantially to below 3% during the late 2000s [14], [15]. Third, collecting income data is a relatively easy task across the developed economies when field survey resources/services are freely provided/supported by the government and other NGOs [9].
In this article, our fine-grained income estimation study estimates income at the district-level of a city. Estimating income at such a level is beneficial for our understanding of the relationship between income and other socio-economic variables, such as air pollution exposure or the COVID-19 pandemic. Such fine-grained analysis can allow policymakers to provide recommendations on any socio-economic related environmental/public health challenges that are location-specific [16]. However, the validity of the analysis will ultimately depend upon the accuracy of income estimation at the fine-grained resolution.
Collecting accurate fine-grained income data/conducting accurate income estimation is crucial for developed economies. First, as compared to developing economies characterized mostly by low-income distribution, developed economies are facing a higher risk of intra-city income inequality. Specifically, some citizens of developed economies may earn extremely high levels of incomes, whilst other citizens who lack the needed capabilities may be forced to accept extremely low levels of incomes [17]. Second, developed economies usually are associated with a higher level of democracy, and a higher social awareness and demand for data transparency [18]. Publishing fine-grained income data can meet the public demand and can facilitate better understanding of such issues as socio-economic-related environmental exposure inequality or COVID-19 infection imbalance.
In any developed economy, such as the United States, the income data obtained via large-scale surveys are not immediately updated, and data collection is highly expensive [19], [20]. In fact, the United States spends more than USD 250 million per year on conducting the American Community Survey (ACS), a door-to-door survey that collects statistics such as per-capita income and household income [21]. Due to the high manpower requirement, smaller geographical units (areas having <65,000 residents) are investigated less frequently, and the income data surveyed are not published until one or two years later [19]. Delays in data-reporting may impede timely policy decisions and weaken the effectiveness of public resource allocation [22].
To reduce the manpower needed for fine-grained income surveys and to speed up fine-grained income data collection, researchers have used house price as a proxy for income. Previous studies have identified a positive correlation between house price and income [23]- [29], whilst house price data are easily accessible and downloadable online in the developed world. However, estimation models that depend on house price as the input and income as the output have yielded a low estimation accuracy. A study that estimates yearly household income with a kernel regression model, using as inputs the household-level house price information of six cities across the United States, has achieved very low estimation performance [30]. The Spearman rank correlation between house price and income at the household-level has achieved a correlation coefficient as low as 0.38 to 0.52 [30]. In another study, a polynomial model is used to estimate the household income in London, also taking house price as an explanatory variable, but no validation accuracy has been provided [31]. Furthermore, in most developed metropolises, house price data distribution is uneven. Some parts of the city may have more house price data points than others. Due to data skewness, income estimation using house price as the input may be inaccurate. More obstacles have to be overcome when alternative advanced machine learning techniques that use house price as the input are being considered.
In addition, as house price to income ratio can vary greatly across different times and spaces [32], [33], other auxiliary factors may need to be properly taken into account. To better capture the interactions between house price and other variables, scholars have suggested that an advanced machine learning method may be useful for overcoming the compounding effect/multi-collinearity of input variables, which are commonly found in traditional statistical models [34].
Apart from the house price-based income estimation model, other resource-efficient methods for fine-grained district-level income estimation in the developed economies have been identified. In Table 1, we classify these models into four different categories. The first category is based on the visual appearance of the district, which is normally captured by the night-time/daytime satellite image or street view [2], [35]-[37]. The assumption is that buildings, roads, vegetation and nightlight intensities vary from place to place as the income levels of these places vary [2], [35]-[37]. Some researchers have claimed that combining the visual data with the spatial data of individual districts can contribute to higher income estimation accuracy [1]. The second category focuses on transportation [19], [38]-[40]. Some studies have extracted features of human mobility or car attributes of a small area to represent an average income level of that area [19], [32], [38]-[41]. The third category is based on the quality of local food restaurants/stores, as an area with low-end restaurants or food stores is assumed to have lower-income residents [42], [43]. The fourth category is based on data collected from the online social network platforms [44]. It assumes that people having more complex social networks may earn higher incomes due to better access to higher-paid jobs [45]-[49], or better entrepreneurship opportunities [50], [51].
However, such fine-grained income estimation models have yet to address the following issues. First, these studies are highly dependent on non-public data, some of which are not easily obtainable and may invite privacy concerns. For instance, transportation card records [40] have been used to extract features of social network structures and human mobility patterns, but can hardly be accessed without making prior agreements with the relevant organizations, such as the transport authorities. 
The models that rely on social media records, such as Twitter [44], might expose the personal information of Twitter users and raise privacy concerns.
Second, some of these estimation studies have been based on indicators which have low correlations with the district-level incomes. For example, indicators derived from the distributions of fast-food restaurants [43] and business/restaurant reviews, or profiles from Yelp [42], have relatively low correlations with district-level incomes. Besides, the nightlight intensity has been widely used to estimate Gross Domestic Product (GDP) in developing economies [52]-[57], but its application to fine-grained income estimation tends to achieve low accuracy across developed economies [36], [58]-[61]. As both rich and poor parts of cities in the developed world have been equipped with sufficient lighting facilities, nightlight intensity can hardly be used to differentiate the poor areas from the rich areas in the developed cities [36], [58]. To cite an example, in New York City (NYC), the nightlight intensity is high across all districts; hence, the intensity variation appears to be too low to signify any change in income at the fine-grained resolution [59]. Some researchers have developed a nightlight-based transfer learning methodology relevant for estimating assets [62] and consumer expenditures in African countries [4], but such studies may not be directly applicable to cities in the developed economies [2].
Third, regarding district-level income estimation, previous studies have yet to combine daytime satellite image with street view as a model input. Some studies have shown that combining daytime satellite image with street view can contribute to good model performance in house price estimation [63], which may be extended to income estimation. Further, previous visual-based income estimation methods have yet to exploit features extracted from both aerial and ground-level street view [1], [2], [35], [37].

II. NOVELTY
Given such background, we propose the adoption of a transfer learning methodology for fine-grained per-capita income and median household income estimation in developed economies, which outperforms state-of-the-art models and achieves a higher estimation accuracy at a district-level of a city. Specifically, our proposed method combines four data categories, including house price, daytime satellite image, street view, and spatial information (latitude and longitude of district centroid) as data inputs. Based on pair-wise comparison results of house price information, we develop a novel Siamese-like Convolutional Neural Network (CNN) to enhance the effectiveness of image feature extraction. The model does not require one to input all house price information from all parts of a city, which may solve the problem of house price data sparsity due to information skewness. Our model presents high generalizability.
The rest of the paper is organized as follows. Section III details the methodology of our Siamese-like CNN model. Section IV reveals and discusses our income estimation model results. We further compare the performance of our model with our selected state-of-the-art income models. Finally, Section V concludes our study and puts forward suggestions for future research.

III. METHODOLOGY
Our overall methodology consists of four parts (see Fig. 1). In Part 1, we develop a Siamese-like CNN to extract house price-related features from the daytime satellite images and the street views collected from NYC. In Part 2, our image features are averaged at the district-level and taken as the inputs to the Ridge Regression model for district-based income estimation. Since house price is positively correlated with income, it is expected that our house price-related features would be correlated with the income values (ground truths) obtained from NYC. Given the richer information derivable from the daytime satellite images and the street views, they are expected to outperform house price information in income estimation, given their better spatial representation. In Part 3, we take the latitude and the longitude of a district centroid as the inputs to a Gaussian Processes (GP) model to extract a scalar value from the spatial information for income estimation. In Part 4, we concatenate the scalar outputs generated by Part 2 and Part 3 and feed them into another Ridge Regression model for final income estimation. We take NYC in the United States, a metropolis with a highly developed economy as our case study. We use a Siamese-like CNN to estimate the district-based income levels of NYC in 2018.

A. LABELLED DATA
The ground truths of the district incomes in 2018 in NYC are obtained from the 2014-2018 American Community Survey (a 5-year estimate), a national-level door-to-door field survey [21]. Two types of district-level average income data in NYC are used as labels: per-capita income and median household income. The average income data at two different geographical levels are used to test the model performance at different granularities: the tract-level (2067 tracts), and the ZIP code-level (211 ZIP codes). Specifically, the data are gathered from Census Reporter [64].

B. INPUT DATA
Four types of inputs are used in this research: the house price, the daytime satellite image, the street view, and the spatial location information of each district (see Fig. 2). The house price information in 2018 is obtained from the NYC Department of Finance, which is the official and a highly comprehensive information source [65]. Each piece of house price data corresponds to one real estate transaction, and the exact location of the building is used. After data cleaning (excluding records with a sale price of 0, gross square feet of 0, or a location that cannot be found), 21,144 items of house price information (sale price divided by gross square feet) are extracted. The latitude and the longitude of each building are located by the official map searching tool [66]. Daytime satellite images, captured in 2018, are gathered from the NYC government [67]. All satellite images are gathered at zoom-level 18; successive images are taken approximately every 0.001 degree with no overlap (a total of 89,889 images, at 256 × 256 pixels per image). The spatial resolution of the daytime satellite image is approximately 4 × 10⁻⁷ degrees per pixel. Although many previous studies have obtained the daytime satellite image from the Google Static Maps API, we do not collect our images from this source, as the API does not provide the exact year in which an image was captured. Our street views are directly obtained from the Google Street View Static API, which provides the year in which each street view was captured. One image is taken with a change in view every 0.001 degree, if it exists (within a default 50-meter searching radius; 640 × 640 pixels per image; 54,246 images). Data cleaning is conducted to filter out any street views that are invalid/dark/interior/blurred/duplicated/obstructed by an object [63], [68]. Only the street views taken between 2017 and 2019 are used. It is assumed that the physical appearance of NYC did not change significantly from 2017 to 2019. 
The spatial location information refers to the latitude and the longitude of any district centroid in NYC, which is obtained from the district boundary shapefile via Census Reporter [64].
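The house price cleaning rule described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the record field names (`sale_price`, `gross_sqft`, `lat`, `lon`) are hypothetical, and the sample records are made up solely to exercise the three exclusion criteria stated in the text.

```python
from typing import Optional


def price_per_sqft(record: dict) -> Optional[float]:
    """Return sale price per gross square foot, or None if the record is excluded.

    Field names are hypothetical. Non-positive values are rejected, which
    covers the zero-valued records that the text says are removed.
    """
    price = record.get("sale_price", 0)
    sqft = record.get("gross_sqft", 0)
    located = record.get("lat") is not None and record.get("lon") is not None
    if price <= 0 or sqft <= 0 or not located:
        return None
    return price / sqft


# Made-up records: one valid, three that each violate one exclusion rule.
records = [
    {"sale_price": 900000, "gross_sqft": 1500, "lat": 40.71, "lon": -74.00},
    {"sale_price": 0, "gross_sqft": 1200, "lat": 40.72, "lon": -73.99},
    {"sale_price": 500000, "gross_sqft": 0, "lat": 40.73, "lon": -73.98},
    {"sale_price": 750000, "gross_sqft": 1000, "lat": None, "lon": None},
]
cleaned = [v for v in map(price_per_sqft, records) if v is not None]
```

Only the first record survives the filter in this toy input; the other three are dropped for a zero sale price, zero floor area and a missing location, respectively.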

C. TRANSFER LEARNING
Transfer learning is a machine learning technique that learns certain knowledge during one process of problem-solving, then transfers such knowledge to another area of problem-solving [69]. In this study, we adopt the method of transfer learning and extract image features to compare house prices in NYC, then apply the knowledge learnt from house price to income estimation. The overall framework of transfer learning consists of four steps, as detailed below (see Fig. 1).
The first step of our study aims at extracting the house price-related features from the daytime images and the street views. Before training, each piece of house price data is matched with the nearest daytime satellite image and street view. To extract image features, an intuitive approach is to establish a Regression model between the image and the house price information (normalized by the maximum value). Following this method, a model based on CNN is constructed. As shown in Fig. 3, the image is fed into a ResNet-50 (a 50-layer residual CNN) and a tensor consisting of 2048 features is extracted [70]; the predicted house price is generated after a dense layer (with the tanh activation function). The mean square error is used as the loss function. However, this method has limitations during the training process. In particular, the loss function can hardly converge unless the outliers of the house prices are excluded. Experimental results show that the features extracted are not highly correlated with the actual income values.
To improve the feature extraction performance, we design a novel Siamese-like CNN to more effectively extract house price-related features for fine-grained income estimation. The traditional Siamese CNN is a few-shot learning technique, designed originally for image classification [71]. As shown in Fig. 4, the inputs to a Siamese CNN normally consist of a pair of images; each image can be treated by one CNN, and the outputs of the two CNNs can be concatenated. After some fully connected layers with the Rectified Linear Unit (ReLU) activation function, the model produces a scalar value indicating the similarity between the two images [71]. A unique characteristic of a Siamese-like CNN is that the two CNN models always share the same architecture and weights. Researchers have designed a Siamese-like CNN to predict the human judgment of pairwise image comparisons [72]. In this study, we develop a novel Siamese-like CNN for extracting house price-related image features for fine-grained income estimation. In our architecture (see Fig. 5), the two image feature sets are subtracted element-wise and fed into a dense layer to generate a 3-element vector indicating which image was captured in the location with a higher house price. In Fig. 5, p1, p2 and p3 represent the three elements; each of them ranges from 0 to 1 and represents the likelihood of one of the three possible results: Image 1 is higher in house price, Image 1 and Image 2 are equivalent in house price, or Image 2 is higher in house price. The label used in our study is transformed to a one-hot vector with 3 values ([1,0,0] indicates that Image 1 is higher in value, [0,1,0] indicates that Image 1 and Image 2 are equivalent in value, and [0,0,1] indicates that Image 2 is higher in value).

FIGURE 5. Architecture of Siamese-like CNN.
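The pairwise labelling and the comparison head described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the tolerance `tol` that decides when two prices count as "equivalent" is hypothetical (the text does not say how ties are defined), the toy weights are hand-picked, and the 2-element feature vectors stand in for the 2048 ResNet-50 features.

```python
import math
from typing import List


def pair_label(price_1: float, price_2: float, tol: float = 0.0) -> List[int]:
    """One-hot label: [1,0,0] image 1 higher, [0,1,0] equivalent, [0,0,1] image 2 higher.

    `tol` is a hypothetical tie tolerance; the source does not specify one.
    """
    if price_1 - price_2 > tol:
        return [1, 0, 0]
    if price_2 - price_1 > tol:
        return [0, 0, 1]
    return [0, 1, 0]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def comparison_head(feat_1, feat_2, weights, bias) -> List[float]:
    """Element-wise difference of the two feature sets, then dense layer + softmax."""
    diff = [a - b for a, b in zip(feat_1, feat_2)]
    logits = [sum(w * d for w, d in zip(row, diff)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def cross_entropy(probs: List[float], one_hot: List[int]) -> float:
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)


# Toy example: 2 features per image instead of 2048, hand-picked weights.
feat_1, feat_2 = [0.5, 1.0], [0.1, 0.2]
weights = [[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]
probs = comparison_head(feat_1, feat_2, weights, bias)   # (p1, p2, p3)
loss = cross_entropy(probs, pair_label(300.0, 200.0))
```

Because both branches share weights, only the feature difference reaches the dense layer, so the head is trained to rank the two locations rather than to regress an exact price.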
There are reasons why features extracted by Siamese-like CNN can outperform non-Siamese-like CNN for fine-grained income estimation. Siamese-like CNN is a classification model and converges more easily as compared to Regression. Non-Siamese-like CNN requires that the image features extracted estimate the exact house price, whereas Siamese-like CNN relaxes this requirement and allows the image features to be less strongly correlated with the exact house price. Such difference leads to the next question: Are features that are good for house price estimation also good for income estimation? In reality, studies have shown that given the same house price, the difference in the income level can still be significant [73], [74]. Hence, the expectation of a one-to-one correspondence between house price and income is unrealistic (which is the underlying assumption taken by non-Siamese-like CNN), whereas it is much more likely that any district having a higher house price would have a higher income level (which is the underlying assumption of the pairwise comparison adopted by Siamese-like CNN). Hence, though the features extracted by Siamese-like CNN are less strongly correlated with the exact house price values, they can better capture any factors that simultaneously influence both house price and income (instead of factors that only influence house price), and eventually achieve a higher correlation with the actual income. The better performance of Siamese-like CNN (see Section IV) also confirms our intuition that by relaxing some irrelevant and redundant restrictions on feature extraction, the classification model can obtain features more relevant to the income values of the local contexts.
Our model consists of four steps:
Step 1: We train our Siamese-like CNN. The comparison results of 100,000 house price-related image pairs are randomly generated for the daytime satellite images and the street views separately. The cross-entropy loss is used as the loss function for classification, and the ResNet-50 is initialized with weights pre-trained on ImageNet [75]. The batch size is 32, the number of training epochs is 10, and ReLU is used as the activation function, except for the softmax layer that is used for calculating the cross-entropy loss. The optimizer is Stochastic Gradient Descent (SGD) (momentum = 0.9; initial learning rate = 0.001, reduced by a factor of 10 when the loss value increases; minimum learning rate = 0.0000001), and L2 regularization (0.01) is applied.
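The learning-rate rule stated in Step 1 can be illustrated with a small sketch. The text does not say whether the loss is compared per epoch or per batch, or whether it is the training or a validation loss, so the epoch-level comparison below and the loss values themselves are assumptions made purely for illustration.

```python
def next_learning_rate(lr, prev_loss, curr_loss, factor=10.0, min_lr=1e-7):
    """Divide lr by `factor` when the loss increased, never going below `min_lr`.

    Assumption: the comparison is made once per epoch against the previous
    epoch's loss; the source only says the rate is reduced when the loss increases.
    """
    if prev_loss is not None and curr_loss > prev_loss:
        return max(lr / factor, min_lr)
    return lr


lr = 0.001                       # initial learning rate from the text
prev = None
history = []
for loss in [1.10, 0.95, 0.97, 0.90, 0.93]:   # made-up epoch losses
    lr = next_learning_rate(lr, prev, loss)
    history.append(lr)
    prev = loss
```

In this toy run the rate is cut twice, once at each epoch where the loss rises, and the floor of 0.0000001 is only reached after repeated increases.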
Step 2: Second, a Ridge Regression model is trained for dimension reduction via supervised learning, to reduce overfitting. Ridge Regression can be taken as Linear Regression with L2 regularization (without penalizing the intercept term) [76], which has been verified as an effective model in tackling a large number of image features extracted for income estimation [2]. The district income gathered from a field survey can be used as the label of the Regression model, and the model can generate a scalar output. Our inputs contain two sets of data: the first input is obtained from the daytime satellite images, and the second input is obtained from the street views. As each district (tract/ZIP code) contains multiple daytime satellite images, 2048 features are calculated for each district, by averaging the features of all images within the same district. The features of street views are calculated in the same way.
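Step 2 combines two operations: averaging image features within each district, and fitting a Ridge Regression whose intercept is not penalized. A minimal pure-Python sketch of both is given below. It is an assumption-laden illustration, not the authors' code: the function names, the penalty name `alpha` and the tiny data are hypothetical, and a real implementation would use a numerical library rather than the hand-written solver.

```python
def average_by_district(image_features, image_districts):
    """Average per-image feature vectors over all images in the same district."""
    sums, counts = {}, {}
    for feats, district in zip(image_features, image_districts):
        if district not in sums:
            sums[district] = [0.0] * len(feats)
            counts[district] = 0
        sums[district] = [s + f for s, f in zip(sums[district], feats)]
        counts[district] += 1
    return {d: [s / counts[d] for s in sums[d]] for d in sums}


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ridge(x, y, alpha):
    """Ridge regression with an unpenalized intercept.

    Centering x and y removes the intercept from the penalized problem;
    the coefficients then solve (Xc^T Xc + alpha I) beta = Xc^T yc.
    """
    n, p = len(x), len(x[0])
    x_mean = [sum(row[j] for row in x) / n for j in range(p)]
    y_mean = sum(y) / n
    xc = [[row[j] - x_mean[j] for j in range(p)] for row in x]
    yc = [v - y_mean for v in y]
    gram = [[sum(r[i] * r[j] for r in xc) + (alpha if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * t for r, t in zip(xc, yc)) for i in range(p)]
    coef = solve(gram, rhs)
    intercept = y_mean - sum(c * mu for c, mu in zip(coef, x_mean))
    return coef, intercept


# Toy data: three images in two districts, then a 1-feature ridge fit.
district_features = average_by_district(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], ["A", "A", "B"])
coef, intercept = fit_ridge([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0], alpha=5.0)
```

With `alpha = 0` the toy fit recovers the exact line y = 2x + 1; a positive `alpha` shrinks the slope towards zero while the intercept absorbs the difference, which is the regularizing behaviour the step relies on when 2048 features are fitted to a few hundred or thousand districts.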
Step 3: Third, a GP model is used to extract a scalar value from the spatial information for income estimation. The GP model is a non-linear model built upon a Bayesian approach which specifies a Gaussian prior over the parameters [77]. Suel et al. have pointed out that adding the spatial data by the GP model can further enhance the income estimation accuracy of the district [1]. The inputs to the GP model cover both the latitudes and the longitudes of the district centroids. The labels of the model are the district income collected by field surveys. The GPy package is used to fit GP with the Matern-3/2 kernel [78], based on [1]. Following the default settings in the package sample codes, training has been repeated twice and the mean output scalar value is further used for income estimation.
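For reference, the Matern-3/2 kernel and the resulting GP predictive mean can be written in their standard textbook form, with each input being the (latitude, longitude) pair of a district centroid. The symbols below (signal variance \sigma_f^2, length-scale \ell, noise variance \sigma_n^2) and the zero-mean prior are notation and assumptions introduced here for illustration; the source does not report the fitted hyperparameter values.

```latex
% Matern-3/2 covariance between two centroids x and x'
k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \left(1 + \frac{\sqrt{3}\, r}{\ell}\right)
    \exp\!\left(-\frac{\sqrt{3}\, r}{\ell}\right),
\qquad r = \lVert \mathbf{x} - \mathbf{x}' \rVert

% Predictive mean at a new centroid x_*, given training incomes y
\hat{y}(\mathbf{x}_*) = \mathbf{k}_*^{\top} \left(K + \sigma_n^2 I\right)^{-1} \mathbf{y},
\qquad K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j),
\quad (\mathbf{k}_*)_i = k(\mathbf{x}_*, \mathbf{x}_i)
```

Because the kernel decays with distance, the predicted income of a district is dominated by the incomes of nearby training districts, which is how the GP exploits spatial autocorrelation.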
Step 4: Finally, a new Ridge Regression model, the Image-Spatial-Info-Ridge-Regression model, is used to combine the image features with the spatial features, and to estimate the final income via supervised learning. The scalar outputs generated from Step 2 and Step 3 are concatenated for each district and taken as the inputs. The district income data collected from the field surveys are taken as the ground truths. Ridge Regression is used here due to its ability to avoid overfitting and achieve good estimation performance.
Three metrics are used to evaluate model performance: R2, the Root Mean Square Error (RMSE), and the Mean Absolute Error (MAE) [2], [4], [79], [80]. Three types of validation are conducted. First, a five-fold validation is used to evaluate the model's overall income estimation accuracy [4]. Here, our five-fold validation, which masks the income data representing one-fifth of the districts in each fold, is limited to Steps 2 to 4, since the Siamese-like CNN trained in Step 1 does not require any income data input. To ensure comparability, the same set of five-fold district division is used across all experiments. Second, a cross-district validation is utilized to test our model's spatial generalizability. Here, in each fold, all four parts of our model are trained on the data obtained from only one-fifth of the districts and validated using the data from the rest of the districts. Third, a cross-scale validation is conducted by applying the ZIP code-level (coarse-scale) model to tract-level (fine-scale) income estimation [2]. The hyperparameter of the Ridge Regression model in each fold is determined by a grid-searching procedure that aims at maximizing the R2 of another five-fold validation conducted on the training dataset [2], [4].
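The evaluation metrics and the two district-splitting schemes described above can be sketched as follows. This is a minimal illustration under stated assumptions: the fixed `seed` stands in for "the same set of five-fold district division", the function names are hypothetical, and the example values are made up.

```python
import math
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def five_fold_indices(n_districts, seed=0):
    """Shuffle district indices once and deal them into five folds.

    A fixed seed (an assumption here) keeps the division identical across experiments.
    """
    idx = list(range(n_districts))
    random.Random(seed).shuffle(idx)
    return [idx[k::5] for k in range(5)]


def splits(folds, cross_district=False):
    """Yield (train, validation) index lists.

    Five-fold validation: train on four folds, validate on the held-out fold.
    Cross-district validation: train on one fold, validate on the other four.
    """
    for k, fold in enumerate(folds):
        rest = [i for j, f in enumerate(folds) if j != k for i in f]
        yield (fold, rest) if cross_district else (rest, fold)


# Made-up example values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
scores = (r2_score(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
folds = five_fold_indices(10)
```

Swapping the roles of the held-out fold and the remaining folds is the only difference between the two schemes, which makes the cross-district setting a much harder test: the model sees 20% of the districts instead of 80%.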

A. FIVE-FOLD VALIDATION

1) COMPARISON OF SIAMESE-LIKE CNNS VS NON-SIAMESE-LIKE CNNS OF DIFFERENT DATA INPUTS AND ARCHITECTURES
In Table 2, the models of different data inputs and architectures are compared. It shows that our proposed GP-Mixed-Siamese-like-Double-Ridge model achieves outstanding performance (R2 = 0.72–0.86), as compared to other models, which only use part of the available data.
Specifically, we compare our Siamese-like CNN models with different input image datasets. The Mixed-Siamese-like model is based on imagery features from both the daytime satellite images and the street views, whereas the Satellite-Siamese-like model and the Street-view-Siamese-like model are each based on only the one type of image indicated by their names. It is observed that the Mixed-Siamese-like model always attains the highest R2 value at both the tract-level and the ZIP code-level. Besides, the Mixed-Siamese-like model achieves a higher R2 on the per-capita income-level as compared to the median household income-level. In addition, it should be mentioned that the Satellite-Siamese-like model has a wider applicability as compared to the Mixed-Siamese-like model, as street views may not be available across all small districts, while satellite images are. This would not present a major challenge for NYC, as the street views are available across most of the districts.
We also compare with the models that are only based on the house price or the spatial location information. The House Price model only estimates district income based on the local average house price. The Satellite-Siamese-like model, the Street-view-Siamese-like model and the House Price model all generate the estimated income value by Ridge Regression in the final step, but the former two perform much better than the House Price model. This indicates that the house price-related image features extracted by Siamese-like CNN can outperform the house price for income estimation. It is worth noting that, when GP is used, the Spatial-Information-GP model [1], which only takes the latitudes and the longitudes of district centroids as the inputs, achieves a high five-fold validation accuracy, especially for the per-capita income estimation. Intuitively, there is hardly any direct causal relationship between the latitudes, the longitudes and income. Hence, the efficacy of the spatial information for income estimation can be attributable to the spatial autocorrelation of the income distributions. Specifically, our results have shown that for the people living in nearby districts, their per-capita income might be more similar.
The performance of models with and without Siamese-like CNN architecture is compared. We compare the perfor-

We combine Ridge Regression and GP as a multi-step regressor in our study (see Steps 2 to 4). With the prior knowledge that an individual image feature (among 4096 image features) is less important than an individual spatial feature (i.e. the latitude or the longitude) for income estimation, we believe a multi-step approach is desirable. If all features are fed into a regressor indiscriminately, we will not be able to make full use of this prior knowledge. Specifically, we test the Mixed-Siamese-like-Random-Forest model by taking all 4098 features (i.e. 4096 image features and 2 spatial features) as the inputs to a Random Forest model in a single step [81]. The model tends to overfit. In addition, we compare our Mixed-Siamese-like-Double-Ridge model with the Mixed-Siamese-like-GP model, which excludes Step 3 and Step 4 from our model, and takes the scalar output of Step 2, the latitude and the longitude as the inputs to a GP model for income estimation. The results show that our proposed model outperforms the Mixed-Siamese-like-GP model, indicating that Step 3 and Step 4 are capable of reducing overfitting.
We test the Mixed-Spatial-Siamese-like model, which takes the spatial information as an input to a Siamese-like CNN. In this model, the latitude and the longitude of each image are combined with 2048 image features by a dense layer in each branch of a Siamese-like CNN, and the dense layer produces 2050 outputs. The rest of the architecture of this Siamese-like CNN is the same as the one described in Section III. All features are then taken as the inputs to a Ridge Regression model for income estimation. It performs less well than the Spatial-Information-GP model. One reason is that the Siamese-like CNN aims at extracting the house price-related features. Taking the latitude and the longitude as the inputs to the model transforms the spatial information into house price-related features, which makes it more difficult for the model to directly capture the spatial autocorrelation of the income distributions.

2) COMPARISON OF SIAMESE-LIKE CNNS WITH STATE-OF-THE-ART MODELS
To compare our Siamese-like CNNs with state-of-the-art models, we select five district-level income estimation Regression models by the following criteria. For each of the four methods shown in Table 1, at least one model is selected, and the model shall deploy state-of-the-art methodology and achieve outstanding performance when compared to other studies of the same class. The architectures of the selected models are summarized in Tables 3 and 4.
The comparison analysis shows that our model outperforms the five state-of-the-art Regression models. Specifically, as shown in Table 3, model performance is compared across four dimensions: data availability, privacy protection, transferability between the developed and the developing countries, and the overall expected estimation accuracy. Details of our proposed model's major advantages over the other models are outlined in Table 4. In addition, given the available data, we obtain the performance of three state-of-the-art models on income estimation in NYC, and show that their validation R² values are lower than those of our proposed model (see Table 5). Benchmark Model 1 utilizes a CNN for street view feature extraction and uses the spatial location information and GP to perform a residual regression task that boosts the income estimation accuracy [1]. That model takes four street views of each location as the inputs (views of the north, east, south, and west), whereas our model takes only one street view of each location (the view facing the searching center). Hence, we keep the street view locations constant between the two models to ensure model comparability. Benchmark Model 2 extracts 7480 pre-defined features from each street view (i.e. GIST [82], texton, and color histogram features [83]) and deploys a Support Vector Regression model for income estimation [35]. The same street views deployed in our model are taken as the input. The implementation is slightly different from the original work due to the training speed limitation.
The original work first trains an image-level Support Vector Regression model and then averages the income estimation of each image at the district-level. Since we feed in a large number of images, training an image-level model becomes extremely slow. Hence, we first average the image features at the district-level, then train a district-level Support Vector Regression model to generate district-level income estimation. Benchmark Model 3 is an income estimation model that utilizes the daytime satellite image and different CNN techniques [2]. To ensure comparability, the same daytime satellite image dataset used by our model is used as the input dataset for Benchmark Model 3. Specifically, Benchmark Model 3 is based on a transfer learning technique, which uses a CNN to first extract imagery features that are useful in estimating nightlight intensity, then applies the features to income estimation by Ridge Regression [2], [4]. Table 5 shows that the GP-Mixed-Siamese-like-Double-Ridge model significantly outperforms the benchmark models.
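The district-level averaging used in our implementation of Benchmark Model 2 can be sketched as follows. This is only an illustration of the aggregation step; the function name and the data layout (one district identifier and one feature vector per image) are assumptions, and the subsequent Support Vector Regression is not shown.

```python
from collections import defaultdict

def average_features_by_district(district_ids, features):
    """Average per-image feature vectors within each district.

    district_ids: one district identifier per image.
    features: one equal-length feature vector per image.
    Returns a dict mapping each district to its mean feature vector.
    """
    sums = {}
    counts = defaultdict(int)
    for district, vector in zip(district_ids, features):
        if district not in sums:
            sums[district] = [0.0] * len(vector)
        sums[district] = [s + v for s, v in zip(sums[district], vector)]
        counts[district] += 1
    return {d: [s / counts[d] for s in total] for d, total in sums.items()}
```

With this ordering, the regressor is trained on one sample per district rather than one sample per image, which is what reduces the training time.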

B. CROSS-DISTRICT VALIDATION
The cross-district validation results are presented in Table 6. The cross-district validation is conducted by randomly separating the small districts (tract/ZIP code) into five sets, each containing the same number of districts. Each time, the model is trained on input data (the daytime satellite image, the street view, the house price and the spatial location information) from one set of districts and is evaluated on a validation dataset composed of all other districts. The average R², RMSE and MAE of the five models trained on the five sets of district data are calculated and presented in Table 6.
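The cross-district validation procedure described above can be sketched as follows. This is a minimal illustration rather than our exact implementation: the function names are hypothetical, `fit` stands for any training routine that returns a prediction function, and districts left over when the total is not divisible by five are simply dropped in this sketch.

```python
import math
import random

def split_into_sets(district_ids, n_sets=5, seed=0):
    """Randomly split districts into n_sets equally sized sets.
    Assumption of this sketch: any remainder districts are dropped."""
    ids = list(district_ids)
    random.Random(seed).shuffle(ids)
    size = len(ids) // n_sets
    return [ids[i * size:(i + 1) * size] for i in range(n_sets)]

def r2_rmse_mae(y_true, y_pred):
    """Return (R^2, RMSE, MAE) for paired ground truths and estimates."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n), mae

def cross_district_validation(income, fit, n_sets=5, seed=0):
    """Train on one set of districts, validate on all the other sets,
    and average the three metrics over the n_sets trained models.

    income: dict mapping district id -> ground-truth income.
    fit: callable taking the training district ids and returning a
         function that estimates the income of a given district id.
    """
    sets = split_into_sets(sorted(income), n_sets, seed)
    scores = []
    for train_ids in sets:
        held_out = [d for s in sets if s is not train_ids for d in s]
        predict = fit(train_ids)
        y_true = [income[d] for d in held_out]
        y_pred = [predict(d) for d in held_out]
        scores.append(r2_rmse_mae(y_true, y_pred))
    return tuple(sum(s[k] for s in scores) / len(scores) for k in range(3))
```

Note that, unlike a conventional five-fold validation, each model here sees only one-fifth of the districts during training and is validated on the remaining four-fifths.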
Here we find that models that partially share the architecture of our model, including the Spatial-Information-GP model and the Mixed-Siamese-like model, may outperform other models under different circumstances. Specifically, the Spatial-Information-GP model achieves very good tract-level cross-district validation performance when such validation is based on the per-capita income. However, it performs less well in ZIP code-level cross-district validation. As the optimal performance of the Spatial-Information-GP model depends on the availability of a relatively large training dataset, one-fifth of the ZIP code-level regions is too small a training dataset to maintain the model performance. In contrast, the Mixed-Siamese-like model, which relies on satellite images, street views and house prices, achieves an outstanding performance on ZIP code-level cross-district validation. Hence, when data of different spatial granularities are available, different models may be preferred. We discuss this further in Section IV Part E. Further, our results also imply that when performing income estimation, instead of conducting a labor-intensive door-to-door survey across all districts, researchers can develop a CNN that estimates the income levels of a subset of all districts, say one-fifth, then use the trained CNN to estimate the income levels across the remaining four-fifths of the districts of the city.
Table 7 presents the R², RMSE and MAE when the model trained on a less fine-grained spatial scale (ZIP code-level) is applied to income estimation at the more fine-grained scale (tract-level). For the same statistical significance, the less fine-grained scale model requires a smaller number of households to be interviewed, and is hence more resource-friendly. The comparison results imply that the combined use of satellite images and street views can enhance the cross-scale validation accuracy.
The Spatial-Information-GP model achieves a good performance on cross-scale validation based on both the per-capita and the median household income.
Fig. 6 presents the estimated income distribution generated by the GP-Mixed-Siamese-like-Double-Ridge model following a five-fold validation procedure. We notice that our model generates a negative estimated median household income for a tract in the center of NYC. We investigate this problem by first checking the intermediate model outputs generated by the images and the spatial information. We find that the scalar output generated by the image features in Step 2 of our model is negative for this tract, and this leads to the final negative estimation. Hence, we further check the images in this tract. The street views in this tract look normal, whereas the satellite images in this tract contain some large shadows cast by high buildings. This indicates that dark building shadows might introduce extra noise and reduce the model's estimation accuracy. We verify this idea by checking the estimated median household income generated by the Satellite-Siamese-like model and the Street-view-Siamese-like model. The Satellite-Siamese-like model also generates a negative estimated median household income for this tract, whereas the Street-view-Siamese-like model generates a normal positive income estimation. These results also support the view that abnormal satellite images lead to low income estimation accuracy. Hence, we suggest that scholars filter out the daytime satellite images with large shadows before using our model, and we also encourage scholars to further investigate more specific criteria for filtering abnormal daytime satellite images. Besides, any district that does not have ground truth data is not included in the figure. Since some of these areas are covered/surrounded by water, whereas our model is trained on the ground truths collected from the land areas, the estimated income values in these districts are considered not credible and are discarded.
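The suggestion to filter out daytime satellite images with large shadows could be prototyped as below. This is purely an illustrative sketch: the study does not specify a filtering criterion, so the dark-pixel rule, the function names, and both thresholds (`dark_threshold`, `max_dark_fraction`) are hypothetical and would need to be tuned and validated on real imagery.

```python
def dark_fraction(gray_image, dark_threshold=40):
    """Fraction of pixels darker than dark_threshold.

    gray_image: 2-D list of grayscale intensities in [0, 255].
    dark_threshold is a hypothetical value, not taken from the study.
    """
    pixels = [p for row in gray_image for p in row]
    return sum(1 for p in pixels if p < dark_threshold) / len(pixels)

def filter_shadowed(images, max_dark_fraction=0.3, dark_threshold=40):
    """Keep only images whose dark-pixel fraction does not exceed
    max_dark_fraction (also a hypothetical value)."""
    return [img for img in images
            if dark_fraction(img, dark_threshold) <= max_dark_fraction]
```

A simple intensity rule of this kind cannot distinguish building shadows from other dark surfaces such as water, which is one reason why more specific criteria remain worth investigating.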

E. DISCUSSION
Our study brings forth an important question: can we input additional data types as proxies for income, or integrate more powerful techniques into our model, to improve our income estimation accuracy? First, as illustrated in Section I, other types of data that have been used in the existing literature present various challenges; transportation card records are not open to the public [40], social media records can raise privacy concerns [44], nightlights have a low correlation with incomes in developed countries [36], [58]-[61], and restaurant/business information exhibits a low estimation power in previous studies [42], [43]. Second, more powerful techniques or additional data input types do not guarantee a higher estimation accuracy. For instance, studies from Stanford University have shown that a powerful Generative Adversarial Network (GAN) with multiple types of data inputs may perform worse than both a simple CNN and a GAN with fewer input types, and have suggested that the failure may be attributable to the model's propensity to overfit, as the model may be fitted to noise [80]. Third, an important performance evaluation criterion is whether the model maintains a good balance between simplicity and accuracy [84]. Our model achieves a relatively high R² of 0.72-0.86 (under a five-fold validation), indicating that integrating more powerful models or providing extra inputs to this study may increase model complexity, but hardly improve estimation accuracy. However, income estimation models tend to overfit when combining image features with spatial information. Hence, estimation models that better fit the spatial information with other regional features without overfitting, such as the Graph Neural Network (GNN), are recommended for future study [85].
Besides, the computational burdens of different models are compared and presented in Table 8. We divide the computational burden of our assessed models into three levels: 'high', 'medium' and 'low'. Models of a 'high' computational burden are models that concatenate a number of sub-models and take a long time to train and make inferences (>6 hours). Models of a 'low' computational burden consist of a single sub-model and generate results within a short time (<1 hour). Other models falling between the two extremes are categorized as 'medium'. Among our baseline models, many have been rated as 'high' in computational costs. Unlike some real-time model training and inference operations, regional income estimation usually does not require high computational speed. Hence, this shortcoming might not seriously affect income estimation, even though efforts to reduce the computational costs can be pursued in future studies.
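The three-level categorization above can be written as a simple rule. The sketch below is only one reading of the stated definitions; the function name and the way the two conditions (number of sub-models and running time) are combined are assumptions.

```python
def computational_burden(n_sub_models, hours):
    """Classify a model's computational burden.

    'high':   several concatenated sub-models and more than 6 hours
              of training and inference.
    'low':    a single sub-model and less than 1 hour.
    'medium': everything between the two extremes.
    """
    if n_sub_models > 1 and hours > 6:
        return "high"
    if n_sub_models == 1 and hours < 1:
        return "low"
    return "medium"
```

For example, under this reading a model that chains three sub-models and runs for eight hours is rated 'high', while a single regressor that finishes in half an hour is rated 'low'.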
Although our GP-Mixed-Siamese-like-Double-Ridge model has achieved a good estimation performance, several models that partially share our proposed architecture can achieve comparable or even better performance under different circumstances (i.e. different spatial granularities and data availability). Hence, when data of different spatial granularities are available, different models may be preferred (see Table 9). Specifically, the high-performance models under high income data availability are selected according to the five-fold validation results (where field survey-based income data from 80% of districts are available). The high-performance models suitable for use under low income data availability are selected according to the cross-district validation results (where 20% of the ground truth income data are available).

V. CONCLUSION AND FUTURE RESEARCH
We propose a novel methodology, Mixed Siamese-like CNN, which integrates Ridge Regression and GP for the fine-grained, district-level per-capita income and median household income estimation for NYC in the United States. Our new model (the GP-Mixed-Siamese-like-Double-Ridge model) makes good use of a rich array of data types, including the house price, the daytime satellite image, the street view and the spatial location information. Our model outperforms other state-of-the-art income estimation Regression models (R² = 0.72-0.86 under a five-fold validation). Good performance has also been achieved in cross-district and cross-scale validation, which suggests that the model can be used to replace field surveys and thereby save manpower and financial resources. We also identify models that partially share the architecture of our model, including the Spatial-Information-GP model and the Mixed-Siamese-like model. Each of them can perform better than other baselines under certain spatial granularity and data availability. Since each of those models relies on fewer data input types and a simpler architecture, utilizing them can save resources spent on data collection and model training.
We recommend that these model architectures be flexibly utilized under different circumstances to optimize the estimation performance. Although our income estimation model is applicable to the developed economies, it can be modified and extended to developing countries where no historical fine-grained income data is readily available. First, instead of estimating the exact income, due to the lack of accurate ground truths, one can transform this model into an income classification model based on unsupervised learning. Second, other types of fine-grained data inputs can be utilized in the developing metropolises. Third, field surveys can be conducted in some districts of the targeted developing metropolis to verify model accuracy. To apply our model to estimate the fine-grained income values in other cities, it is necessary to perform verification. First, when the cities to be examined lie within the same country, one may wish to check whether the geographical features, the population densities, the house prices and the socio-economic status show similar characteristics. If the differences are significant, the model parameters trained in one city may not be directly transferable. Even if the city characteristics are similar, field surveys are best conducted beforehand across some parts of the new city to be examined. Second, for cities located in a different country, it is desirable to first examine whether the general socio-economic features of the two countries are similar.
In the future, we plan to study how to transfer our proposed Siamese-like CNN to other unsupervised learning-based models. Our proposed Siamese-like CNN enjoys the following advantages. First, training a Siamese-like CNN does not need income data collected from field surveys, rendering it suitable for unsupervised learning. Second, house price is highly correlated with other socio-economic indicators, such as wealth [86]. Hence, the house price-related features extracted from the Siamese-like CNN can facilitate the estimation of composite indicators that represent multi-dimensional concepts [87]. In reality, comprehensive house price information like that available in NYC may not be readily available in all cities. Nevertheless, our model can still work well when only a fraction, for instance one-fifth, of the house price information is available (see the cross-district validation result). Further, our Siamese-like CNN can also be used for estimating other socio-economic indicators. Depending on the nature and the type of the socio-economic values to be estimated, instead of using the house price for image feature extraction, multi-modal big data can be used for transfer learning. The features extracted from the relevant dimensions can then serve as the inputs to an unsupervised learning model (e.g. Principal Component Analysis [88]).