Multidimensional Population Health Modeling: A Data-Driven Multivariate Statistical Learning Approach

Population health is multidimensional in nature, having complex relationships with the various health determinants. However, most previous studies investigate a single dimension of population health using linear models, failing to capture the nonlinearity in the data and interdependence of multiple dimensions in health outcomes. In this paper, we propose a data-driven multivariate statistical learning approach to simultaneously model various aspects of population health—characterizing the length and quality of life—as a function of health behaviors, clinical care, socioeconomic factors, physical environment, and demographics. We also propose a novel percentile-based variable selection for multivariate regression, without compromising the model’s generalization performance. We demonstrate the applicability of our proposed data-driven methodological framework using the New York State as a case study. Leveraging cross-validation techniques and statistical hypothesis tests, the results indicate that multivariate tree boosting method outperforms the traditionally-used univariate linear regression model and random forest in modeling multidimensional population health. The variable importance heat-map illustrates the relative influence of the key health determinants on the various dimensions of population health. Partial dependence plots are used to quantify the marginal effects and the nonlinear relationships between the health outcomes and health inputs. Our results show that teen birth rate is strongly associated with both length of life (e.g., child mortality) and quality of life (e.g., physically unhealthy days). Socioeconomic status is the key indicator to predict child and infant mortality. Our proposed framework can be used as a decision support tool for accurately assessing and predicting multivariate population health.


I. INTRODUCTION
Human health and wellbeing is the key to a thriving and equitable society [1]. Population health is often conceptualized as the health status and health outcomes within a group of people, instead of considering the individual health at a time [2]. In the last decade, efforts have been made to enhance the overall population health by not only improving the overall or mean population health of a community, but also eliminating health disparities within that population [3]- [5]. Recently, the ''Healthy People 2030'', developed by the U.S. Department of Health and Human Services Advisory Committee on National Health Promotion and Disease Prevention for 2030, sets data-driven national The associate editor coordinating the review of this manuscript and approving it for publication was Dost Muhammad Khan . objectives to improve health and well-being over the next decade [1]. Even though increasing attention has been paid to enhancing population health, several challenges still exist towards quantitatively assessing population health.
Most importantly, population health has a multidimensional construct, characterized by length of life (a.k.a., mortality) and quality of life (a.k.a., morbidity) [5], [6]. Although a number of studies focus on assessing population health, most of them consider only one of the dimensions such as health-related quality of life [7], [8], life expectancy [9] or the mortality rate [10] in their analysis. Such a siloed approach, however, fails to provide the comprehensive and holistic picture of the health outcomes within a population.
In addition, the recent studies show that the relationships between health outcomes and health variables such as social VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ and economic factors are highly complex and nonlinear [11]- [13]. However, there is a lack of systematic quantitative approach in modeling the complex nonlinear characteristics of population health [14], [15]. The traditionally-used linear models often fall short in accurately estimating the health outcomes [5], [6], [16], thus underestimating the risks to population health. Moreover, population health is affected by a wide range of factors including health behaviors, clinical care, social and economic conditions, physical environment and demographics [16], [17]. Presently, a weighted average technique is used to account for the relative contribution of such factors on population health. However, the weighting mechanism is simplistic where the weights of the factors in contribution to the health outcomes are often assumed to be equally distributed, or are assigned based on expert opinions [4], [18], [19]. Recognizing the complex interactions of the health determinants to population health, such traditional weighing mechanisms could lead to a biased estimate of the population health, leading to sub-optimal decision making.
To address the above-mentioned challenges and gaps, in this paper, we propose to develop a novel data-driven multivariate framework to model the population health at county-level leveraging the advanced statistical learning theory, while considering its multidimensional construct. Specifically, in this proposed framework, nine different dimensions of population health outcomes characterizing length of life and quality of life are considered. Population health of a county is then modeled as function of a wide range of factors including health behaviors, clinical care, social and economic characteristics, physical environment and demographics. A suite of statistical learning models including linear regression and non-linear ensemble tree-based models along with their multivariate and univariate versions are implemented to evaluate the population health. Additionally, we propose a percentile-based variable selection method for multivariate analysis without compromising the generalization performance of the model. Finally, we present the visualization tools including a variable importance heat-map and partial dependence plots of the key predictors to explain the underlying relationships of the important variables with the population health outcomes.
We implemented our proposed framework for the state of New York (NY) as a case study because [20]: • NY state's population is the fourth-largest in the United States; • NY state has one of the most diverse demographic characteristics; • the minority populations in NY state increased as a percentage of the total population (39.8% in 2006 to 44.5% in 2016); • NY population is aging-e.g., the median age of NY population increased from 37.4 years in 2006 to 23.4 in 2016; in fact, the percentage of population aged 65 and over increased to 15.3% from 13.1%, while that aged 19 and younger decreased to 23.8% from 26.3%.
Therefore, NY presents a unique case to study the populations health and its key determinants. Although we present the applicability of our proposed framework for the state of NY, this methodological framework is generalized enough that can be implemented to any other states in the US or states of other nations, provided data are available.
The major contributions of our proposed data-driven framework are two-fold. First, it provides a systematic approach to develop and select a model that can best capture the complex relationships of the various determinants of health with the multidimensional population health. Second, it helps to identify and evaluate the key health factors affecting the overall health outcomes, which can help policymakers, community leaders, and public health officials in informed decision making towards improving overall population health.
The rest of this paper is organized as follows. Section II presents the literature review regarding population health assessment, highlighting the gaps in the body of knowledge. Section III presents the collection and preprocessing of the data. Section IV introduces our proposed research methodology. Section V provides the results including model comparison and selection, variable selection, and statistical inference. Finally, the discussion and conclusion are exhibited in Section VI and VII respectively. The Appendix provides additional information for variable descriptions and results on hypothesis testing.

II. LITERATURE REVIEW
The concept of multidimensional population health provides a holistic picture of integrating all major aspects of health outcomes and various health determinants. This concept recently has gained more and more attention to better understand and improve overall population health [21], [22]. Generally speaking, length of life and quality of life describe different aspects of population health, and a combination of these two characteristics can provide more comprehensive and holistic analysis of health outcomes of the population [5], [6].
Length of life is one of the critical dimensions of population health [5], [16]. Various studies indicated that length of life can be represented by the overall life expectancy that measures how long people can live on average [9], [23]. In recent years, the average life expectancy in the United States has been declining [24], [25], even falling behind other wealthy countries [10]. Besides life expectancy, mortality rate is another metrics for length of life. Preston et al. found that different age groups exhibit different mortality rates. For example, in 2017 the US witnessed a sharp increase in mortality rate among adults aged 20 to 34, in contrast to a decline in mortality rates among people aged 85 and up, compared to other similar European countries [10].
Quality of life, on the other hand, indicates the extent to which people feel physically, mentally, socially, and emotionally healthy [5], [16]. Quality of life, therefore, is measured by various characteristics of a population such as number of unhealthy days, prevalence of diabetes within the population, low birth weight, psychological distress, etc. However, most of the existing studies have focused on evaluating a single measure depicting the quality of life. For instance, Slabaugh et al. used number of healthy days to measure quality of life in order to understand both mental and physical well-being of a population [26]. Prevalence of diabetes within a population, as another measure of quality of life, has been studied in various existing literature [27]- [29]. The epidemic of diabetes is one of the most prevalent and costly chronic diseases in the US, which adversely affects the quality of life in majority of the population [30], [31]. Longer duration of any type of diabetes was found to be correlated with poor quality of life [27], [32]. Some other studies included low birthweight [33] and psychological distress [34] in the evaluation of quality of life.
In addition to being multidimensional in nature, recent studies demonstrate that the associations between population health and the various health determinants are complex and exhibit nonlinear characteristics, and that the linear models are inadequate to capture such nonlinear behaviors [11]- [13]. However, most of the existing studies in the population health assessment domain have used linear models to analyze the relationships between the health determinants and the population health outcomes. For example, to determine the relationships between health-related behaviors and the health outcomes of people with chronic conditions, Hinnell et al. implemented Logistic regression model. The authors found that physical inactivity was significantly correlated with epilepsy disorder [35]. Another study by Khedmat et al., which aimed to predict health-related quality of life, leveraged ordinal logistic regression to characterize the socioeconomic and demographic conditions associated with the perceived physical and mental health conditions of the participants [7]. Veen et al., in a different study, used linear regression analysis to investigate the degree of variance in quality of life that contributed to the risk of facial palsy [8]. In another study, Michel et al. examined the disparities of age and gender in the quality of life among teenagers using multilevel regression analyses. The authors found that girls reported a significant deterioration in quality of life than boys with increasing age [36]. Thus, although linear models are predominantly used to model population health outcomes because of their easier interpretability and low computational cost, the theory of these models are founded on a set of rigid assumptions regarding the underlying distribution of the data such as linearity and normality [37]. However, in reality such assumptions often do not hold, leading to poor generalization performance of the model [38]. Recently, data-driven techniques are developed and demonstrated to have potential values in discovering the complex nonlinear relationships in the field of public health [39], [40].
Population health is also affected by a number of factors. For example, socioeconomic determinants of health including unemployment and income are found to be strongly associated with the population health [41]. Moriarty et al. showed that people from poor socioeconomic condition are likely to suffer from more unhealthy days, impacting their quality of life-an important dimension of population health [42]. Lin et al. separately investigated the relationships between demographic backgrounds, and the physical and mental wellbeing of US adults aged over 65 years. The authors found that people living in the rural areas are likely to suffer more from poor physical health rather than mental illness [43]. Wu et al. examined a range of demographic and socioeconomic factors in relation to the mental wellbeing of adolescents. The authors pointed out that African-American youths, girls, and children from low income family are the most vulnerable to mental illness [44]. Physical environment also plays a critical role in shaping population health. Samet et al. investigated the relationships between air pollutants and mortality rates in the major metropolitan areas in the US. The authors found that the level of PM 10 was positively linked to a higher risk of death from all causes [45]. Zanobetti et al. examined the association between air temperature and mortality rate, and concluded that a 10 • F increase in air temperature was associated with an 1.8% increase in mortality rate [46]. Thus, although it is evident that population health is dependent on a multitude of factors, most of the studies focused on understanding the association of a particular type of a determinant/factor on population health, neglecting their complex interactions and the holistic effects of such factors on the population health.
It is noteworthy that some of the recent studies have developed the population health assessment framework, integrating all the different types of health determinants to the population health outcomes. For example, Kindig et al. proposed a framework to assess the population health outcomes by considering five determinants including medical care, individual behaviors, social environment, physical environment and genetics. However, the authors assumed that all the determinants are equally weighted, i.e., they make equal contributions to the health outcomes [4]. Similarly, the County Health Ranking (CHR) framework developed by the University of Wisconsin Population Health Institute and the Robert Wood Johnson Foundation, assumes a fixed weight for all the health determinants. The CHR framework delineates four main categories of health determinants including health behaviors, clinical care, socioeconomic factors, and physical environment [16], [17], where the weight of each health determinant is determined by expert knowledge, and is fixed for all counties [16]. However, the interactions of health determinants to population health are highly complex [47], and the simple weighing mechanisms can lead to suboptimal decision making where the decision makers cannot adequately prioritize the key risk factors affecting the population health that needs attention.
To summarize, the gaps in the current body of literature aiming to evaluate the population health are as follows: • the state-of-the-art modeling approach undermines the importance of the multidimensional aspect of the population health; VOLUME 10, 2022 • the traditionally-used linear models cannot capture the nonlinearities and the complex interactions between the various factors and the population health, thereby underestimating the risk factors of population health; • most of the existing studies use silo-ed approaches, i.e., focus on the effects of a specific type of a determinant, instead of leveraging a holistic approach that can consider a wide range of predictors for the analysis; and, • finally, few of the studies that have considered a range of various factors in assessing population health, used a fixed or equal weighting system assuming that all the predictors have equal contribution towards estimating population health, which is often times not true. To the best of our knowledge, no previous studies have simultaneously modeled all major dimensions of population health as a predictive function of a wide range of health determinants. Therefore, to address this research gap, we propose a systematic data-driven multivariate approach to model the multiple dimensions of population health (such as quality of life and length of life) as a function of various determinants of health, leveraging advanced statistical learning algorithms. Our framework can not only provide a robust way to select the optimal model for understanding and predicting the population health outcomes, but also help decision makers identify and quantify the focal health factors influencing population health.

III. DATA COLLECTION AND PRE-PROCESSING
In this section, we describe the detailed data preprocessing procedures for response and input variables used in this study. The population health data are collected at the county-level for each of 62 counties in New York State during 2010 to 2020. The mean of population per county is 0.3 million, and the total population across all counties is 19.5 million (2020 U.S. Census).

A. DATA PREPROCESSING
The health-related population data used in this study are provided by County Health Rankings & Roadmaps (CHR) program [48], and weather information are collected through National Center for Environmental Information (NOAA) [49]. Then, we conducted two following steps to process the collected raw data.
The first step is to address missing values, which is a common issue in data-driven analytics [50]. We found that few health features (e.g., income inequality, insufficient sleep) in NY had more than half of data missing in our study period 2010-2020. This was mostly due to small reported sample size or other uncertainties [5]. To avoid introducing bias in the analysis, we only kept the variables that had at least 50% of non-missing observations. For the variable that had less than or equal to 50% of the observations missing, we imputed the missing values leveraging the multivariate imputation by chained equations (MICE) technique using the R package 'mice' [51].
The second step is to detect the correlation between variables. The Pearson correlation coefficient (denoted as ρ) between two variables in each category of predictors and responses is calculated. If two variables are demonstrated to exceed a correlation of 0.90, the one with higher percentage of missing values is excluded from our analysis. For example, we found that the premature age-adjusted mortality is highly correlated with years of potential life lost (YPLL) with ρ = 0.93, and the proportion of missing data for the premature age-adjusted mortality is 27% compared to 0% missing values in YPLL. Then, the YPLL is kept in the analysis. To identify the highly correlated variables are essential for statistical inference to avoid ''masking effect'', which has been successfully used in previous research studies [52], [53].

B. RESPONSE VARIABLES
To capture the multifaceted length of life in a population, we included three measures of mortality, namely, years of potential life lost (YPLL), child mortality, and infant mortality. Here, YPLL can capture overall mortality of the population. Child and infant mortality can inform policymakers to design prevention strategies for meeting United Nations' Sustainable Development Goals by 2030 in the reduction of preventable deaths of newborns and children under age five [54].
To characterize the quality of life, six measures are used including poor or fair health, poor physical health days, poor mental health days, low birth weight, diabetes, and HIV prevalence. These variables have been used and validated by the CHR framework to assess the quality of life in population [5], [16]. Note that, CHR program also provides additional measures of quality of life including physical and mental distress among adults. Since these two variables have missing data for six years (i.e., over 50% of the data is missing), we did not include them in our study.
The summary statistics of the total of nine multivariate responses are displayed in Table 1. The description of all response variables is exhibited in Table 5 from Appendix A, and their dependencies and distributions are depicted in Fig. 13 from Appendix B.

C. INPUT VARIABLES
A wide range of input variables in relation to population health are analyzed in our study. These health factors are grouped into five categories: health behaviors, clinical care, social and economic factors, physical environment and demographics, based on the CHR framework [48]. Each variable has been validated by researchers and experts to satisfy several criteria such as: a) their importance to population health, b) their applicability to future population health, and c) availability and reliability of the indicators [5]. Additionally, we collected weather-related variables (precipitation, snowfall, etc.) from NOAA, and added to the category of physical environment for better characterizing the impacts of physical surroundings on population health. All the input variables used in our analysis are displayed in Table 2, and the detailed description of variables are presented in Table 5 from Appendix A.
To summarize, we have 67 variables (9 responses variables, 58 input variables including 57 health measures and one proxy variable 'year') for each county in New York State from 2010 to 2020, that are used for model development.

IV. METHODOLOGY A. RESEARCH FRAMEWORK
In this study, we propose a data-driven multivariate framework to assess and predict the multidimensional population health outcomes, leveraging the advanced predictive algorithms. The framework is depicted in Fig. 1, where each component is explained as follows.
The first component in the framework is data collection and pre-processing (described in Section III). Specifically, nine different variables are used as health outcomes to represent length of life (mortality) and quality of life (morbidity) in population. Input variables including health behaviors, clinical care, social and economic factors, physical environment and demographics are used in the analysis. Then, a sequence of data pre-processing procedures are implemented including screening for missing data and correlated variables. Additionally, we applied the min-max normalization to scale the response values to the range [0, 1] so that the performance of various models are comparable across response variables [55]. Finally, the data are aggregated for each county at a given year, which is used for model development.
The second component in the framework is model development, evaluation, and selection as described in Sections IV-B. The final model selection consists of two steps-Step-1: a library of multivariate models including multivariate linear regression, multivariate random forest, and multivariate tree boosting model have been developed. Following that, the best multivariate model is chosen based on the generalization performance; Step-2: based on the multivariate model selection, the corresponding univariate model is implemented; following this, the generalization performances were again compared to select the final model for statistical inferencing. Note that, generalization performance is obtained through 5-fold cross validation approach. Further, hyper-parameter tuning process is executed based on the selected model. Specifically, our proposed two-step model selection framework aims to test two hypotheses as follows: • Hypothesis (1): multivariate tree-based models can better capture the complex nonlinear effects and interactions between population health and the various predictor variables than the traditionally used linear model (outcome of Step-1); • Hypothesis (2): the multivariate model can better predict the multi-dimensional population health outcomes simultaneously, compared to corresponding univariate models for capturing each health outcome separately (outcome of Step-2). The third component in the framework is model interpretation and inference as presented in Section IV-C. To further reduce model complexity without sacrificing the generalization performance, the percentile-based variable selection for multivariate regression (PVS-MR) approach is proposed to select the optimal subset of features. The statistical insights of key factors related to health outcomes are provided, with the help of the variable importance heatmap and partial dependence plots.

B. MODEL IMPLEMENTATION
In this section, different types of models are introduced including parametric and nonparametric models, multivariate and univariate models. Specifically, we implement tree-based models, including random forest and gradient boosted model in the context of univariate and multivariate constructs, and compare their performances to that of the traditionally-used linear regression model. The rationale for selecting tree-based models is that they are able to capture the non-linearity and the complex interactions between the health outcomes and the corresponding health factors [13], [40]. Then, we provide the details of the multivariate tree boosting model. The generalization performance is also presented to help select the final model.

1) PARAMETRIC VS. NON-PARAMETRIC MODELS
The objective of supervised learning is to estimate the function f that maps the input vector X to the response Y . Mathematically it can be written as Y = f (X ) + , where is the irreducible error term that captures unobserved heterogeneity from the data [56], [57].
Supervised learning algorithms vary differently in their degree of complexity and flexibility depending on the construction of function f [58], [59]. Broadly speaking, parametric and non-parametric are the common types of learning models. The widely-used generalized linear regression belongs to the family of parametric models, where the model parameters are predetermined and estimated from data. One of the major advantages of parametric model is easier interpretability from explicit formulae of the function [60]; however, it comes at the cost of predictive accuracy. On the contrary, non-parametric model does not VOLUME 10, 2022  require prior knowledge about the form of mapping function, and has the flexibility to fit in any structure of the data. Thus, it can better capture the complex nonlinear relationships but comes at the cost of interpretability [37], [58]. Grounded in ensemble theory, non-parametric tree ensembles are robust to outliers and noise in the data, and exhibit superior predictive accuracy [37], [61]. Therefore, in this paper we implemented ensemble tree-based models including random forest and gradient tree boosting in our analysis.

2) MULTIVARIATE VS. UNIVARIATE MODELS
Depending on the number of response variables in the analysis, the statistical model can be either univariate or multivariate. Specifically, multivariate model involves multivariate response Y ∈ R N ×Q where N ≥ 1 is the total observations and Q > 1 denotes the cardinality of response variables, and it can be reduced to univariate regression when Q = 1. Multivariate analysis is often implemented in the cases where the covariances between multiple response variables are dependent on a set of input variables [62]. By utilizing the covariance structure of the response variables, it can allow us to predict multivariate response simultaneously, and to capture the variability of responses in order to improve model's predictive accuracy [62], [63]. Therefore, in this study, we implemented both multivariate and univariate models to investigate if the covariance between the nine different response variables representing population health contributes to the overall accuracy of the population health assessment and prediction model.

3) MULTIVARIATE TREE BOOSTING ALGORITHM
The gradient boosted trees [64] (viz., gradient boosting machines [65]) leverage the gradient boosting technique to iteratively fit ensembles of decision trees to minimize the loss function, and it can be applied for both classification and regression problems. Specifically, gradient boosting algorithm constructs several decision trees sequentially, where each tree is grown by utilizing information (i.e., residuals) from the previous trees in each iteration [66]. Essentially, the model performance can be ''boosted'' by adding more penalty on bad predictions [67]. The gradient boosted tree model has several advantages: (1) it does not require any pre-determined form of the function, (2) the nonlinear effects and interactions among variables can be captured in the process of growing trees, and (3) the treebuilding process uses previous information (i.e., each tree is grown sequentially on residuals), which helps the model to be efficient and robust [64], [67].
The multivariate tree boosting algorithm is an extension of the gradient boosted tree algorithm, and it helps to predict the multivariate response simultaneously, given a set of input variables. Particularly, the multivariate tree boosting algorithm sequentially fits the regression trees to simultaneously minimize the loss function L for each response variable and maximize the covariance discrepancy D in the multivariate response. Here, the mean squared error (a.k.a., squared L 2 norm) is represented as loss function, which is given by whereŶ i is the predicted value of the multivariate response using the i-th observation, and Y i is the actual value.
In addition, covariance discrepancy D can be mathematically written as [68] D m,q = || where (m−1) is the covariance matrix of the response variables at the previous step m − 1, and (m,q) is the covariance matrix at the current step m after fitting a decision tree to the response variable y (q) . The discrepancy D (m,q) is used to measure the amount of covariance explained in the multivariate responses by the predictors selected by the tree to fit in y (q) in step m. Basically, D denotes the improvement in model fitting in the sample covariance matrix during each iteration [68]. The steps of multivariate ensemble tree boosting algorithm are displayed in Algorithm (1).

4:
end for 5: Select the response y (q) corresponding to the regression tree with the maximum covariance discrepancy D m,q .

6:
Update residuals by subtracting the predictions of the tree fitted to y (q) , multiplied by step-size. 7: end for

4) MODEL EVALUATION
The generalization performance of a learned model is evaluated leveraging a robust cross-validation technique [57]. In this study, we leveraged a 5-fold cross validation technique, where the data are randomly divided into five different folds of roughly equal size. During each of the five crossvalidation loops, the model is trained using the four folds (i.e., 80% of data) and tested using the remaining fold (i.e., 20% of data). This procedure is repeated five times until every fold is utilized for model testing. During each loop, the insample goodness-of-fit performance is estimated using the training set, while the out-of-sample predictive accuracy is evaluated using the test set. This procedure can guarantee every single data point is being utilized for both training and test, to produce the generalized and robust model results. Note that, to control other factors (i.e., sampling bias) that could affect model performance, all the models are fitted into the same training set, and tested using identical test set during each cross validation loop.
We apply root mean square error (RMSE) and R 2 to be representative of in-sample and out-of-sample model performance metrics. The overall model performance is evaluated through the averaged RMSE and R 2 across all validation loops for each population health dimension. The model with the highest R 2 , and the lowest RMSE for both in-sample and out-of-sample is then selected as the final model, which is used to conduct the statistical inference. Mathematically, the RMSE and R 2 are given by where q represents the q-th response variable to be calculated, k is the number of times cross validation performed, u indicates the number of observations from either in-sample or out-of-sample. For response q under the k-th iteration,ŷ (q) i,k describes the predicted value at the observation i, y i,k is the actual value, andȳ (q) k is the mean value of the response q.

C. STATISTICAL INFERENCE
Inferential statistical techniques are provided here based on the final multivariate tree boosting model. First, variable importance is introduced to assess how useful health predictor is at predicting population health outcome variables. Second, we proposed a feature selection approach for multivariate analysis. Third, the partial dependence analysis is presented to quantify the marginal effect of health determinants on the health outcomes.

1) VARIABLE IMPORTANCE
To identify and evaluate the impacts of key predictors on population health outcome variables, the relative importance of each input variable is measured. A predictor with larger relative importance is considered to be of significance in contributing to the overall model performance. In ensemble tree-based methods, the relative importance can be computed in two steps [65]. First, the importance of a predictor j VOLUME 10, 2022 in a single tree T is measured by the number of times this predictor is used for splitting in growing of the tree, weighted by the squared improvement of the model as a result of each split, as denoted I 2 j (T ) in (4). Secondly, the relative importance can be obtained by averaging all importance over ensembles of trees {T m } M 1 , as denoted I 2 j in (5). Mathematically, the importance in a single tree can be written as: Here, the summation is over the non-terminal nodes t of the J -terminal node tree T , v t is the splitting variable associated with node t, andî 2 t is the improvement in squared error as a result of the split in the tree. For an ensemble of trees, the relative importance of the predictor j can be given by where M is the number of trees in the model. More details can be referred to [65].

2) VARIABLE SELECTION
The variable selection (a.k.a., feature selection) is used to select a subset of original features to reduce the dimensions of data, and it is computationally expensive [69], [70]. The naive brute-force method for feature selection aims to exhaustively search over all possible combinations of variables, which require massive computation power [71]. Some greedy search strategies, which are computationally advantageous, can be generally classified as either forward selection or backward elimination, which can iteratively add (or delete) one variable at a time to (or from) the existing subset [72]. Variable importance scores can also help in key variable selection based on the ranking of variable importance obtained from the model [73]. Most of the feature selection methods are developed for univariate models, which fall short in selecting key features in the context of multivariate analysis. Therefore, in this paper, we propose a novel approach for feature selection, named Percentile-based Variable Selection for Multivariate Regression (PVS-MR), to strategically select the optimal subset of input variables for multivariate analysis without compromising the generalization performance of the model. The proposed PVS-MR algorithm leverages backward elimination to iteratively eliminate variables, based on the quantile of the variables' relative importance obtained from multivariate tree boosting model. Specifically, during each model run, one or more variables can be removed from the model if their relative importance is universally below a pre-determined quantile threshold across all responses. This process is repeatedly executed with a small increment in threshold to its upper bound. Finally, the optimal subset of features can be obtained based on the generalization performance of the model. The PVS-MR algorithm is displayed in Algorithm (2).
Note that the proposed PVS-MR is particularly developed for multivariate analysis, and has the following advantages. First, it is computationally efficient as one or more variables are deleted at each iteration. Second, the model's generalization performance can be guaranteed, in the sense that model error is evaluated in each step of variable selection. Third, variables are assessed simultaneously across all responses in the context of multivariate analysis. Finally, decision makers can adjust their threshold and increment to control the process of feature selection.  for q in 1, · · · , Q response variable do 5: for j in 1, · · · , J input variable do 6: Calculate I 2 qj based on (5).  10: for j in 1, · · · , J input variable do 11: if q θ qj = 0 then 12: Delete variable j, and update J ← J − {j}. 13: end if 14: end for 15: p ← p + δ. 16: end while

3) PARTIAL DEPENDENCE ANALYSIS
The partial dependence plot (PDP) is a widely-used method to assess the marginal effects of an input variable to the response variable in non-parametric statistical learning models. Presence of non-linearity can be easily detected leveraging PDPs, where the estimated values of the function are produced by changing the value of the predictor, while keeping rest of predictors constant (i.e., ceteris paribus condition) [65]. Mathematically, the partial dependency can be written as: Here, x −j represents all the predictor variables except j. The estimated partial dependency of predictor x j is calculated by integrating the functionf when x j is fixed and x −j varies over its marginal distribution. The PDPs could inform the changing direction and functional form of the marginal effect of the predictor on the response. Thus, it can be used to facilitate the model inference [68].

V. RESULTS
In this section, we present the results from our case study to illustrate the applicability of our proposed data-driven multivariate framework for modeling the multidimensional population health outcomes.

A. MODEL SELECTION
In this section, first, we discuss and compare the performance of the various multivariate models, following which we select the multivariate model that outperforms all the other models in terms of their generalization performance. Second, we discuss and compare the performance of the selected multivariate model and the corresponding univariate model, following which we select the final model for statistical inferencing. Table 3 displays the generalization performances of the three multivariate models-linear, random forest and treeboosting. The RMSE and R 2 are calculated for each response variable in terms of both in-sample goodness of fit and out-of-sample predictive accuracy. The goodness of fit indicates how well the model fits the data, and predictive accuracy describes the prediction power of the model in an unseen dataset. Higher values of R 2 , conversely lower values of RMSE, indicate superior performance of the model. In Table 3, we observe that the tree-based models (multivariate random forest and multivariate tree boosting) demonstrate superior performance on average for each of the nine response variables, over the multivariate linear regression model. This pattern is observed across both the goodness of fit and predictive accuracy performances. Additionally, we implemented t-test to verify if this pattern is statistically significant. Our results show that p < 0.01 for each of the response variables (shown in Table 6 from Appendix C), indicating that there is a significant statistical difference between the performances of the linear versus the non-linear ensemble tree-based models. Therefore, the Hypothesis-1 holds good, concluding that non-parametric tree-based models better capture the non-linear effects and interactions between population health outcomes and the predictors than the parametric linear model. We can also observe from Table 3 that multivariate tree boosting model outperforms the multivariate random forest model. Specifically, multivariate tree boosting model has improved the goodness-of-fit by 24% over linear regression, and the predictive accuracy by 11% over random forest. This illustrates that the multivariate tree boosting method best models the county-level population health outcomes. Additionally, we compared the model results to the null (a.k.a. mean-only) model, where the average values of each response variable is used to fit the data.

1) COMPARING PERFORMANCE AMONG MULTIVARIATE MODELS
Comparing the results, we found that multivariate tree boosting model significantly reduces the training and test errors, i.e., improves the goodness of fit and predictive accuracy by 91% and 54% in YPLL, 87% and 36% in child mortality, 86% and 35% in infant mortality, 90% and 48% in poor or fair health, 87% and 40% in poor physical health days, 88% and 37% in poor mental health days, 93% and 62% in low birth weight, 88% and 45% in diabetes, 96% and 80% in HIV prevalence, over the null model. Therefore, we selected multivariate tree boosting model as the best multivariate model to compare with its corresponding univariate model. Table 4 shows the in-sample and out-of-sample model performance of the multivariate tree boosting and the corresponding univariate tree boosting models. The corresponding t-test results are shown in Table 6 in Appendix C. The results show multivariate tree boosting model performs significantly better than the corresponding univariate model across all response variables in terms of goodness of fit, but this significance is relatively weak with respect to predictive accuracy. Specifically, in case of predicting the responses YPLL, infant mortality, diabetes and HIV prevalence, there is no significance in predictive accuracy between multivariate and univariate models. But for the remaining five responses, the multivariate model is statistically better than the corresponding univariate model in predictive performance. Overall we concluded that multivariate tree boosting model VOLUME 10, 2022  outperforms the univariate tree boosting models on average, confirming that our Hypothesis-2 holds good.

B. VARIABLE SELECTION
After selecting the final model, we then applied a grid search technique to fine tune two parameters in the multivariate tree boosting model, namely the number of trees T and depth of each of the trees D. The rationale for choosing these two parameters are: a) they are critical for controlling the generalization performance of the model, and b) they can be easily modified by decision-makers to modulate computational complexity [68], [74]. Instead of searching the whole two-dimensional space Z 2 + to find the optimal values, we limited the bound for parameter T is the range of [500, 5000] with a 500 unit grid-step, and the bound for parameter D is the range of [2,20] with a grid-step of 2 units. In total, there are 100 = 10 × 10 combinations for those two parameters. Among those, we aim to find the optimal values of T and D that can achieve the minimal generalization error of the model. Leveraging this method, we finally determined T = 3500 and D = 16 in the model, which are used for further variable selection and statistical inferencing.
We applied the proposed PVS-MR method on the final model to select a subset of important variables, and to further reduce model complexity by mitigating the curse of dimensionality. The training error (denoted by black curve) and test error (denoted by red curve) shown in Fig. 2, are obtained by averaging the RMSE across all the nine response variables using the final multivariate tree boosting model. From the Fig. 2, the model with exact 29 variables (denoted by blue vertical line) yields the least generalization error. Additionally, we observed that the model performance improves while downsizing the number of input variables; but after a certain threshold as further variables are removed, the model performance deteriorates. This indicates that optimal selection of variables plays a pivotal role in accurately and sufficiently modeling the multidimensional population health. Thus, without compromising model generalization performance, we included only 29 variables as the key influencing factors of the population health outcomes in the model.  Fig. 3 presents the heat-map of relative importance of the selected 29 key predictor variables (indicated on the horizontal axis) to health outcomes (indicated on the vertical axis). Each number in the heat-map indicates the relative importance obtained from (5), as a fraction out of 100 for each health outcome variable. The larger numbers associated with the darker blue in grid cells, indicate higher degree of importance, while smaller numbers displayed in the lighter blue grid cells, represent lower degree of importance. For example, two most influencing factors for HIV prevalence are female population (with a relative importance of 68) and African American (with a relative importance of 14.8), which are consistent with the previous research findings [75], [76]. Even though some of the health factors have relatively small importance scores to the response variables such as excessive drinking and preventable hospital stays (as shown in the last two columns in the heat-map), these predictors are deemed to be important for the overall population health of a region, as they contribute to the overall performance of the multivariate model. This is established by the fact that deletion of these variables can deteriorate the overall model generalization performance as depicted in Fig. 2. From Fig. 3, the key factors that are significantly associated with more than one health outcome can be also identified. For example, the median household income is the most significant factor in contributing to both infant mortality and YPLL, indicating that economic condition is strongly linked to the length of life of the population, on average. Another health variable, teen birth rate, plays the most critical role in predicting both poor physical health days and child mortality. This highlights that adolescent pregnancy significantly affects both quality of life and length of life.

2) PARTIAL DEPENDENCE ANALYSIS
To quantify the association of the health determinants with the population health outcomes, partial dependence plot (PDP) is implemented. PDP is used to unravel the marginal effect of a particular variable by keeping other variables constant (ceteris paribus condition). Note that, there is a total of 261 potential combinations (i.e., 29 predictors variables for each of the nine outcome variables) of PDPs, that can be constructed. For the sake of brevity, we only explained the PDPs of the top two key predictors for each of the nine population health outcome variables, based on the relative importance scores in Fig. 3. Figs.4-12 depict the PDPs of the top two important predictors for each of the nine health outcome variables.  percentage of adults with obesity problem increases, the prevalence of diabetes in the community also increases. This finding is consistent with the previous studies where the prevalence of obesity and overweight is closely linked to the diabetes [77]- [79]. In addition to the correlation, our analysis reveals that this relationship is nonlinear. We observe that when the adult obesity rate ≤25%, the prevalence of diabetes is insensitive to the obesity rate within the community. However, the diabetes prevalence rate sharply increases at adult obesity rate ≥25%. Thus, threshold for adult obesity rate can be considered to be 25%, i.e., efforts to be given such that the overall percentage of the population suffering from obesity is <25%. Another second-most important predictor of diabetes is free lunch. This predictor describes the percentage of children enrolled in public school who are eligible for free or reduced priced lunch; thus, it is often used as an indicator for family poverty levels [80]. From Fig. 4(b), we observe that the prevalence of diabetes in a community increases as the percentage of school children eligible to VOLUME 10, 2022  get free meals crosses the 30% threshold. This finding is in line with a previous study showing that children attending schools located in deprived areas had the higher prevalence of diabetes compared to those in non-deprived areas [81]. In this study, schools were viewed as being situated in a deprived region, if over 21% of children received free lunch [81].

b: POOR MENTAL HEALTH
Our results shows that the two most significant predictors of poor mental health days are the variables ''year'' and ''free lunch''. Fig. 5(a) shows that the population mental health is worsening over time. Especially after the year of 2017, prevalence of poor mental health has significantly increased within the communities. This re-establishes the fact that mental health burden is crippling the established market economies such as the US, where an estimated 26% of Americans aged 18 and older-about 1 in 4 adults-suffers from a clinical mental disorder in a given year [82]. Free lunch as another important predictor of poor mental health. Fig. 5(b) displays the positive association between poor mental health days and the percentage of children receiving free lunch in their schools. As the percentage of children receiving free lunch ≥20% in a community, the mental health days, on average, increases from 3.45 days to 3.75 days. Since higher values of free lunch indicates higher poverty rates in a community, our results indicate that population in deprived areas suffer from poor mental health. In fact, a previous study found that children receiving free or reduced-priced lunch were more likely to suffer mental illness such as depression, anxiety and pessimism [83].

c: POOR OR FAIR HEALTH
Our analysis indicate the two most important predictors of the poor or fair population health to be ''the percentage of children in poverty'' and the ''African American population''. It can be observed from Fig. 6(a) that as the percentage of children in poverty increases to ≥15%, the percentage of overall population in a community suffering from poor/fair health condition increases to 14.8%, from 14.0%. This finding is in line with the existing study that shows poverty to be the key risk factor of child health and development [84], which in turn is a key determinant of the overall poor health of a population in a community. Racial disparities in population health is a well-studied area. Our results show that higher percentages of people reporting poor or fair health are from communities having higher percentages of African American population (≥15%) on average. Fig. 6(b) re-emphasizes that racial health disparities exist in population health.

d: POOR PHYSICAL HEALTH
Our results indicate that the number of poor physical health days reported by the population in a county is positively associated with ''teen births'', (i.e., adolescent pregnancy) (see From Fig. 7(a)) and ''percentages of children from a single-parent family'' (see Fig. 7(b)). More specifically, communities with ≥1500 teen births per 100K of the population or number of children from single-parent families ≥30% of the population, witness a sharp increase in the number of poor health days experienced by the population on average. It is established in literature that adolescent pregnancy could have adverse social and economic impacts on mothers, children, their families [85], [86], which can be linked to the overall poor physical wellbeing of a population in that community, on average. In addition, as the percentage of children living with their single parent increases in a community, the number of poor physical health days reported by the population in a community on average also increases. This finding is also consistent with the previous studies [87], [88].

e: CHILD MORTALITY RATE
The top two predictors of child mortality rate are found to be ''teen births'' and ''Chlamydia rate''. Fig. 8(a) shows that the child mortality rate is positively associated with the teen birth  rate, which is consistent with the previous studies [89], [90]. We found that as the teen birth rate increases to ≥500 per 100K population in a community, the child mortality rate increases from 35-50 per 100K of the population. In fact, the burden of child mortality still remains unevenly distributed globally [91]. On the other hand, chlamydia rate is also found to be positively correlated with child mortality. As observed from Fig. 8(b), with increasing Chlamydia rate, the child mortality rate increases monotonically from 40-48 per 100K of the population in a community, on average. Our finding is in line with the previous research outcomes that established chlamydia infection to be the most widely reported sexually transmitted disease in the US, especially among females aged from 15 to 24 [92]. Chlamydia infection increases the risk of still births and infant mortality rates significantly [93], [94].

f: LOW BIRTH WEIGHT
The two most important factors of low birth weightone of the dimensions of population health-are found to be ''rates of homicides'' and ''population of a county''. The positive association between low birth weight and homicide is exhibited in Fig. 9(a). We found that as the prevalence of homicide increases beyond 2 per 100K population in a county, the percentage of low birth weight grows rapidly from 6.8% to 7.8%. This positive correlation between the homicide rate and low birth weight can be attributed to the presence of a confounding variable such as socioeconomic condition of a region that influences both low birthweight and homicides. For example, communities with lower socioeconomic status have been witnessing higher homicide rates in general [95]. On the other hand, low birth weight babies are mostly outcomes of adolescent pregnancy or poor socioeconomic status of a region [96]. Prenatal poverty, in fact, is a key determinant of low birthweight [97]- [99]. Population size is also strongly correlated with low birth weight rate, which is displayed in Fig. 9(b). This strong correlation between low birthweight and population density may arise due to a confounding factor such as air pollution. Higher population density is a surrogate for urban and suburban areas which experience higher levels of air pollution. Air pollution, in turn, has a confounding effect on public health and has a strong link with low birth weight [100], [101].

g: INFANT MORTALITY RATE
Factors such as ''median household income'' and ''percentage of Asian population'' are found to be the top two predictors of infant mortality rate. Fig. 10(a) shows that the median household income has negative correlation with the infant mortality rate, indicating that income inequality plays a critical role in affecting the survival and health of newborns. Specifically, communities with median household income ≤80, 000 USD, witness a sharp increase in infant mortality rates from 500-640 per 100K population, on average. This finding is in line with a meta-analysis study which found that a significant inverse relationship exists between household income and mortality rate among infants and children [102]. Similarly, proportion of Asian population in a community is another key predictor showing negative association with infant mortality rate, as shown in Fig. 10(b). This observation is consistent with findings from an existing study that shows infant mortality rate is lower among Asian population, compared to the White population [103].

h: YEARS OF POTENTIAL LIFE LOST (YPLL)
This dimension of population health is a measure of premature mortality, that provides an estimate of the average years a person would have lived if they had not died prematurely [104]. Our analysis indicates that ''median household income'' and ''injury deaths (per 100K)'' are the top two significant predictors of YPLL. Fig. 11(a) shows negative relationship between median household income and VOLUME 10, 2022 FIGURE 11. Partial dependence plots of top two factors (median household income and injury deaths) on YPLL. YPLL, indicating that prevalence of premature death is less in families from high income group. Specifically, communities with median household income ≤80, 000 USD witness higher risk of premature deaths; however, risk of premature death is insensitive to median household income ≥80, 000 USD. In fact, previous studies indicated that higher rates of income inequality is strongly linked to the preventable or immediate death rates across the US cities [105]. Number of injury deaths is another key predictor of premature deaths. As can be observed from Fig. 11(b), as the number of injury deaths increases, the prevalence of YPLL in the community also increases monotonically, which is expected. In fact, the unintentional injury deaths contribute to be the number one leading cause of premature death among people aged below 44 years, in the US [106].

i: HIV PREVALENCE
Prevalence of HIV is one of the key dimensions of population health. Our analysis indicates that female population is strongly associated with the HIV prevalence. From Fig. 12(a), it can be noticed that communities with higher proportions of females (≥52%), witness a sharp increase in HIV prevalence. This sharp increase can be attributed to the right-skewed distribution of HIV prevalence in the state of NY (see Fig. 13 in Appendix B). In fact, NY is ranked top in the US with the highest female HIV infection rate in 2017, where the Bronx County in NY alone contributes to the 27% of the total HIV cases in the state of NY [107]. In addition, racial disparity is again observed where the African American population is significantly affected by the HIV. Fig. 12(b) shows that increased prevalence of HIV is associated with higher percentages of African Americans in a community, which is in line with the previous findings [108].

VI. DISCUSSIONS
The data-driven framework developed in this paper could potentially help enhance the overall population health of a community (e.g., county), by transforming information from routinely collected data into informed decisions. To facilitate decision making, this paper provides variable importance heat-map and partial dependence plots to identify and assess the associations of the various factors with the multidimensional population health that would help communicate datadriven results to the policy makers. The heat-map of variable importance reveals the key health inputs that jointly influence various dimensions of population health. It is beneficial for state and federal health agencies to identify focal variable(s) and develop informed public health intervention strategies to enhance the overall population health considering its various dimensions. For example, teen birth rate needs to pay more attention, because it is strongly associated with both the quality of life (e.g., poor physical health) and length of life (e.g., child mortality). Similarly, the free lunch variable (i.e., the percentage of children who are eligible for free lunch) significantly correlates with both the incidence of diabetes and poor mental health in a community. The partial dependence plots are used to assess the marginal effects and quantify the complex nonlinear relationships of the essential health determinants and the various dimensions of population health. It could provide a holistic picture of the overall trend in population health with respect to changes in specific health determinant, given all the other factors remain constant (ceteris paribus condition). For instance, child mortality is more sensitive to an increase in teen births compared to other factors such as chlamydia rate in the community. Additionally, years of potential life loss (YPLL) is highly correlated with injury deaths in a nonlinear form, where the growth rate of the YPLL becomes faster as the number of injury deaths increases. The racial disparity gap still persists in New York State. African American groups are more likely to suffer from poor of fair health and high HIV infections, which could further deteriorate their quality of life. This finding is consistent with the previous research [108].
Although we applied our framework to the New York State, our framework is generalized enough that can be easily applied to other regions of interest, provided adequate data is available. The model selection and variable selection techniques will also remain unchanged. However, the associations between the health determinants and the health outcomes might be different in reflection of the regional health disparities. To validate different statistical learning models, we implemented a robust cross-validation technique to evaluate the generalization performance of the models which is used as an input of the model selection process. Our hypotheses of the superior performance of multivariate tree boosting model over linear method and are generalized enough to be equally valid for any future data provided that the unobserved heterogeneities remain the same [57], [58].

VII. CONCLUSION
Accurate prediction and evaluation of population health plays a vital role in the development of a thriving and equitable society. In this paper, we propose a data-driven multivariate framework to simultaneously model nine dimensions (outcomes) of population health-characterizing the length of life and quality of life-as a nonlinear function of health behaviors, clinical care, socioeconomic factors, physical environment and demographics. We also developed a novel percentile-based variable selection for multivariate regression (PVS-MR) method for dimension reduction, without compromising model's generalization performance. To validate different statistical learning models, we implemented an iterative cross validation approach to generalize model's performance, and a statistical significance test for model's results. Furthermore, variable importance heat-map and partial dependence plots are provided to inform decision makers for understanding underlying health determinants and pathways in population.
Using NY state as a case study, we established the applicability of the framework, and quantified the associations linking mortality and mobility to some of key influencing factors in NY. Our numerical analyses suggest that the multivariate tree boosting algorithm best captures non-linearity relationships and interdependence of population health across multiple dimensions. Our findings also indicate that socioeconomic factors and health behaviors are the most predictors of population health in NY.
This study is focused towards modeling various aspects of population health and assessing the key determinants of population health at county-level. Clearly there are some future work directions for multidimensional health studies. First, as individual-level relevant data become available, similar data-driven methodological frameworks can be applied to further study the key determinants of individual health. Such studies can help the state and federal health agencies to design individual-level health intervention strategies. Second, the correlations between health factors and health outcomes revealed in this paper only imply association, and not causation. Future research could utilize the key influencing factors identified in this work along with longitudinal studies, to better examine the casual mechanism of how the key factors affect population health. .

APPENDIX A VARIABLES DESCRIPTION
See Table 5 for description of all variables in the model.

APPENDIX B CORRELATIONS BETWEEN RESPONSE VARIABLES
See Fig. 13 for the correlation and distribution for nine multivariate response variables.

APPENDIX C HYPOTHESIS TESTING
See Table 6 for the statistical results for our two hypotheses. with University at Buffalo-The State University of New York, USA. She conducts cutting-edge interdisciplinary research to address the resiliency and sustainability challenges related to the vulnerable communities and socio-technical systems leveraging advanced machine learning algorithms and robust optimization techniques. More specifically, she develops quantitative models to examine systemic/stochastic impacts of various chronic/acute shocks on the interdependent sociotechnical systems, develop risk-informed decision models, and investigate cost-effective adaptation measures to advance resilience and sustainability of our communities and critical infrastructure systems. VOLUME 10, 2022