Sparse Variable Selection on High Dimensional Heterogeneous Data With Tree Structured Responses

We consider the problem of sparse variable selection on high dimension heterogeneous data sets, which has been taking on renewed interest recently due to the growth of biological and medical data sets with complex, non-i.i.d. structures and huge quantities of response variables. The heterogeneity is likely to confound the association between explanatory variables and responses, resulting in enormous false discoveries when Lasso or its variants are naïvely applied. Therefore, developing effective confounder correction methods is a growing heat point among researchers. However, ordinarily employing recent confounder correction methods will result in undesirable performance due to the ignorance of the convoluted interdependency among response variables. To fully improve current variable selection methods, we introduce a model, the tree-guided sparse linear mixed model, that can utilize the dependency information from multiple responses to explore how specifically clusters are and select the active variables from heterogeneous data. Through extensive experiments on synthetic and real data sets, we show that our proposed model outperforms the existing methods and achieves the highest ROC area.


I. INTRODUCTION
V Ariable selection is one of the central tasks in statis- tics and has been studied for decades [1], [2].Modern machine learning problems, especially biological or medical applications often seek solutions in the existing statistical approaches.Lasso [3] is an example of those widely adopted methods in a variety of areas for sparse variable selection tasks.However, the increasing volume of data sets often requires the data to be collected from multiple batches and then integrated together.This procedure is particularly harmful to the biological [4] and medical [5], [6] data sets, which are sensitive to the data sources, like populations, hospitals or even experimental devices.This sensitivity results in the heterogeneity, therefore, breaks one of the most fundamental assumptions (i.i.d.assumption) that most statistical machine learning methods make.More importantly, due to the expensiveness of biological and medical data, different batches of data are gathered for different purposes from distinctly differ-ent sources, such as samples of the control group are mostly collected from volunteers from several different undeveloped regions.Consequently, the heterogeneity often induces confounding factors between explanatory variables and response variables, resulting in numerous false positive selected variables when classical variable selection techniques are applied [7].
To deepen understanding of the challenge that heterogeneity is introduced to biological or medical data sets and define the problem, consider that we have data samples in the format of (X , Z , Y ), where X stands for the explanatory variables, Y stands for the responses, Z stands for the indicator of the data source.The dependency between X and Y is the premise of any variable selection tasks [8], and the dependency between X and Z is induced through heterogeneity [9], [10].The data collection procedure we mentioned brings the dependency between Z and Y .In the real world, this problem may be even intractable, for the origin of different samples is lost either through data compression or experimental necessity in most cases.Nowadays genetic association studies are rarely aware of the origin of the samples listed.Z becomes the confounding factor between X and Y [11]- [14].One challenge of the heterogeneous data variable selection problem is to mitigate the confounding effects brought by Z .
Aside from challenges above, many of the real world biological and medical data sets are collected along with multiple response variables.These responses are often more closely related and could share common relevant covariates than others and then form the tree or other kinds of structures [15]- [18].For instances, in genetic association analysis, which aims to select the single-nucleotide polymorphism (explanatory variables) that could affect the phenotype (response variables), the genes in the same pathway pretend to share the common set of relevant explanatory variables than other genes.
Thus, to improve the performance of the variable selection, incorporating the complex correlation structure in the responses is under our consideration.In this paper, we extend the recent solutions of sparse linear mixed model [8], [9] that can correct confounding factors and perform variable selection simultaneously further to account the relatedness between different responses.We propose the tree-guided sparse linear mixed model, namely TgSLMM, to correct the confounder and incorporate the relatedness among response variables simultaneously.With TgSLMM, we are capable to improve the performance of the variable selection when considering the statistical criterion, incorporating the complex tree-based correlation structure in the traits under our consideration.Eventually, we examine our model through plenty of repeated experiments and show that our method is superior to other existing approaches and able to discover the real genome association in the real data set.

II. RELATED WORK
Recent years have witnessed the great advances in the variable selection area.The most classical approach is ℓ 1 -norm regularization (i.e.Lasso regression [3]).Further, studies have extended the model capability by introducing various regularizers [15].Examples including the Smoothly Clipped Absolute Deviation (SCAD) [19], the Local Linear Approximation (LLA) [20], the Minimax Concave Penalty (MCP) [21], and the Precision Lasso [22] have been introduced since then, which all overcome a variety of limitations of Lasso [19].Some other variable selection methods like [23] ignore underlying multidimensional structure, leading to severe small dataset problems.[24] imposes a rank constraint into ℓ 1 regularization to factor matrices and promotes sparsity in variable selection, which hurts the interpretability.The liabilitythreshold mixed linear model overcomes the limitation of Linear Mixed Model (LMM) in case-control ascertainment [25].[26] proposed a unsupervised variable selection method.But both of them cannot apply to high dimensional data with heterogeneity.
Besides these, in the non-i.i,dsetting, confounders could raise a challenge in variable selection when the data set is originated from different sources.Corresponding solutions have been studied for decades.Principal components analysis (PCA) [27], [28] and linear mixed model [29], [30] are two popular and efficient approaches to alleviate the confounding effect.The latter provides a more fine-grained way to model the population structure and won its prominence in the animal breeding literature, where it was used to reveal the underlying kinship and family structure [11], [31].Many extensions have been developed, however, these measures such as LMM-Select [32] LMM-BOLT [33] and Liability-threshold mixed linear model (LTMLM) [33] along with other algorithms [34]- [36] only rely on univariate testing to select the variable once uncovering the confounding factor.Attempts have been made to propose multi-variable testing model [8], [9], [37], [38] these days, but their performances fall short while tackling with the challenge that takes the relatedness between responses into account.[39]- [45] are proposed to identify significant associations, which is to be contrasted with the related problem of estimating heritability.However, they also lack accounting for the relatedness between different traits [46].[47] helps improve association methods for kinship estimation, but it could not construct the convoluted phenotypic architecture in a dataset originated from different populations in the real world like [16].The challenges show the desire to have a method, which requires no prior knowledge of the individual relationship and is capable of uncover the structured pattern in a way that is properly calibrated to the degrees of traits' relatedness.

III. TREE-GUIDED SPARSE LINEAR MIXED MODEL
Throughout this paper, X denotes the n×p matrix for explanatory variables for individuals, Y denotes the n × k matrix for response variables, and β denotes the p × k matrix for effect sizes.We use subscripts to denote rows and superscripts to denote columns, for example, β k and β k are the k-th column and k-th row of β respectively.
In this section, we begin by examining the sparse linear mixed model.Next, we demonstrate how we leverage the technique to uncover relationships between traits.Finally, we transform this approach into a regression problem and employ efficient methods to address associated challenges.

A. SPARSE LINEAR MIXED MODEL
The linear mixed model (LMM) is an extension of the standard linear regression model that explicitly describes the relationship between response variables and explanatory variables incorporating an extra random term to account for confounding factors.To introduce the sparse linear mixed model, we briefly revisit the classical linear mixed model as Equation 1: where Z is an n × t matrix for the random effect.u is the confounding influences with implicitly identity correlation information, ϵ denotes observed noise and they both follow the independent Gaussian distribution with the zero means.Intuitively, Zu models the covariance between the observations y i .Assuming that ϵ ∼ N ( 0, σ 2 ϵ I ), u ∼ N (0, σ 2 g I ), K = ZZ T and represents the covariance between the responses and σ g represents the magnitude of confounder factors, we can rewrite the formula as Equation 2to simplify mathematical derivation: Assuming the priori distribution of β could be expressed as e −Φ(β) , we can define log likelihood function as Equation 3: Based on the sparsity of β, it's reasonable to assume that β follows Laplace shrinkage prior.Such assumptions lead to the sparse linear mixed model.However, sparse LMM fails to consider the relatedness among response variables.The defect drives us to the tree-guided sparse linear mixed model.

B. TREE-GUIDED SPARSE LINEAR MIXED MODEL
To incorporate the relatedness among responses simultaneously, we use Tree-Lasso as Equation 4.
where λ is a tuning parameter that controls the amount of sparsity in the solution and β Gv j is a vector of regression coefficients {β k j |k ∈ G v }.The overlaps of groups of Tree-Lasso and the number of the trees is determined by the hierarchical clustering tree.Each node v ∈ V of the j-th tree is associated with the group G v whose members are the response variables at the nodes of the same subtree.Each group of subtree regression coefficients β Gv j is weighted with w v , which is defined as the Equation 5.In general, h v in the Equation 5represents the weight for selecting relevant covariates separately for the responses associated with each child of node v, whereas the 1 − h v represents the weight for selecting relevant covariates jointly for the responses for all of the children of node v, and the value of h v ranges from 0 to 1. Assuming K is the number of response variables and |V | is equivalent to the number of nodes in one tree, since a tree associated with K responses has 2K − 1 nodes, |V | appears in the tree-lasso penalty is upper-bounded by 2K .
To simplify the computation process of Tree-Lasso, we can calculate the separate penalty from the root of each tree iteratively as Equation 6: C. PARAMETER LEARNING Overall, optimizing Equation 3 with hyper-parameter {Θ = σ 2 g , σ 2 ϵ , λ, w v } is a non-convex optimization problem aside with weights β.Hence, we could apply the null-model fitting method first to correct the confounding factors and then solve Tree-Lasso regression problem using smoothing proximal gradient method [48].

1) Null-model
Due to sparsity of β, null-model fitting method by first optimizing σ 2 g , σ 2 ϵ while ignoring individual explanatory variables, can yield near-identical result as an exact method [30].By using the computational trick [34] that introduces the ratio of the random effect and the noise variance, δ = σ 2 ϵ /σ 2 g , we could transform the equation as Equation 7: The genetic effects are treated as fixed effects, whereas the confounding influences are modeled as random effects.We carry out a log likelihood optimization with regard to δ and then σ g in closed form.

2) Reduction to Tree-Lasso Regression Problem
In general, we first compute the spectral decomposition of K = U diag(d)U T , where U for eigenvector matrix and diag(d) for eigenvalue matrix.Having the yielded δ, we use the U to reweight the data such that the covariance matrix becomes isotropic.Assume Ỹ and X are the resulting rescaled data, which can be calculated by the following equation: Using this transformation, the equation eventually ends up with a standard Tree-Lasso regression problem since it is free of population structure and has been alleviated the confounding factor.In the following step, we can obtain the β tree as Equation 8: where || • || F denotes the matrix Frobenius norm, and Φ is determined by the Equation 4, then we can easily employ the smoothing proximal gradient descent method.

IV. SYNTHETIC EXPERIMENTS
In this section, we evaluate the yielded results of the TgSLMM versus Tree-Lasso, LMM-Lasso and some techniques mentioned above, which is shown in the receiver operating characteristic (ROC) curves 1 .
A. DATA GENERATION First, we simulate a sparse tree-structured vector as β.An illustrated example is shown in Figure 1.To construct β, the generation rules are listed below: • The righter columns have fewer non-zero elements.
• The elements from righter columns have bigger value.
• Some non-zero elements are shattered discretely in β to increase the complexity and mimic the real situation.Then we generate centroids of m different distributions.With c j as the centroid of j-th distribution, we generate explanatory variable data from a multivariate Gaussian distribution as follows: where x i denotes the i-th data or information bore by one individual and originates from j-th distribution chosen from m different distributions c.Then we generate an immediate response vector r from X matrix with ϵ ∼ (0, σ2 ϵ ): To get the final response matrix Y , we introduce a covariance matrix K to simulate correlation between different responses: where σ 2 y is to control the magnitude of the variance.Assuming C is the matrix formed by stacking the centroids c j , we choose K = CC T to simulate the correlation between observations.
Using the data generation method described earlier, our synthetic dataset can effectively mimic real-world heterogeneous datasets, capturing the desired trait relatedness. 1 The problem can be regarded as classification problem-identifying the active response variables from all genes.For each threshold, we select the response variables whose absolute effect sizes are greater than the threshold.If the selected explanatory variable has value above the threshold in ground truth effect size, it will be the true positive.

B. EXPERIMENT RESULTS
We assessed the ability of TgSLMM in our synthetic data sets.The experimental setting is listed in Table 1.To evaluate the performance of the proposed model in identifying active variables in different data sets, Tree-Lasso, LMM-Lasso, MCP, SCAD, Lasso 2 , BOLT-LMM, LTMLM and LMM-Select are also tested.The baselines we choose in this paper are all highly cited and have been proven effective by many scholars.In general, our method exceeds all the other methods.The results are shown in Figure 2 considering the golden criterion ROC curves.
In Figure 2(a) as n increases, and in Figure 2(b) as p decreases, the ratio of p n gets smaller and the performance gets better as expected.Compared to Tree-Lasso along with other methods, our method is more robust with big data sets, which suits the real-world situation.As we increase the number of response variables in Figure 2(c), increase the number of distributions in Figure 2(d), or decrease the proportion of active variables in β as Figure 2(e), the problem becomes more challenging.Figure 2(f) and Figure 2(g) show that our method is more flexible to different magnitudes of covariance of explanatory variables and response variables.In Figure 2(e), we notice that when the proportion of active variables in β is large, the performance of TgSLMM and LMM-Lasso is similar.However, it contradicts the background of our research that the active variables should be sparse among data.Through our experiments, it is hard for Tree-Lasso to identify the active variables on high dimensional heterogeneous data.
TgSLMM also performs best in most cases in the figure of Precision-Recall curves.These figures are shown in Appendix.

C. ANALYSIS OF YIELDED β AND Y
We use the same experimental setting3 as in Figure 1 to perform the ablation studies.The results are shown in Figure 3    Figure 3 shows that TgSLMM recovers both the values and structure of ground truth effect size, revealing the supreme ability of TgSLMM in variable selection.LMM-Lasso has trouble finding enough useful information.Trapped into the confounding factors, the Tree-Lasso discovers too many false positives.Tree-Lasso also falls short when the data set becomes complicated in the Figure 2. Based on Figure 4, both prediction performance of TgSLMM and Tree-Lasso are convincing, LMM-Lasso fails as reported before.Unsurprisingly, the proposed TgSLMM also behaves the best in estimating β with respect to mean-squared error through almost all the experimental settings.The other approaches cannot discover any meaningful information.
By using the proposed method, we are able to detect weak signals and reveal clear groupings in the patterns of associations between explanatory variables and responses and apply our method to many applications, such as variable selection, effect sizes estimation, and response prediction.

V. REAL GENOME DATA EXPERIMENTS
Having shown the capacity of TgSLMM in recovering explanatory variables of synthetic data sets, we now demonstrate how TgSLMM can be used in real-world genome data and discover meaningful information.To evaluate the method, we focus on some practical data sets, Arabidopsis thaliana, Heterogeneous Stock Mice and Human Alzheimer Disease.Since Arabidopsis thaliana and Heterogeneous Stock Mice have been studied for over a decade, the scientific community has reached a general consensus regarding these species [49].With such authentic golden standard, we could plot the ROC curve and assess the model's performance using the area under it.However, since Alzheimer's disease is a very active area of research with no ground truth available, we list the genetic variables identified by our proposed model and verify the top genetic variables by searching the relevant literature.

A. DATA SETS 1) Arabidopsis Thaliana
The Arabidopsis thaliana data set we obtained is a collection of around 200 plants, each with around 215,000 genetic variables [50].We study the association between these genetic variables and a set of observed responses.These plants were  gathered from 27 different countries in Europe and Asia, so that geographic origin served as a potential confounding factor.For example, different sunlight conditions in different regions may affect the observed responses of these plants.We test the genetic associations between genetic variables with 44 different responses such as days to germination, days to flowering, etc.

2) Heterogeneous Stock Mice
The heterogeneous stock mice data set contains measurements from around 1700 mice, with 10,000 genetic variables [51].These mice were raised in cages by four generations over a two-year period.In total, the mice came from 85 distinct families.The obvious confounding variable is genetic inheritance due to family relationships.We study the association between the genetic variables and a set of 27 response variables that could possibly be affected by inheritance.These 27 response variables fall into six different categories, relating to the glucose level, insulin level, immunity, EPM, FN and OFT respectively.

3) Human Alzheimer Disease
We use the late-onset Alzheimer's Disease data provided by Harvard Brain Tissue Resource Center and Merck Research Laboratories [52].It consists of measurements from 540 patients with 500,000 genetic variables.We test the association between these genetic variables and 28 responses corresponding to a patient's disease status of Alzheimer's disease.

4) Preprocessing of Real Genomic Data
Each element of the explanatory variables X takes values from {0,1} according to the number of minor alleles frequency (MAF) at the given locus in each individual.We also standardized the traits data, according to the analysis and statistics law.In the experiments, we found that the standardizing process is very crucial to the performance of the model.

B. ARABIDOPSIS THALIANA
Since we have access to a validated gold standard of the Arabidopsis thaliana data set, we compare the alternative algorithms in terms of their ability in recovering explanatory variables with a true association. Figure 5 illustrates the area under the ROC curve for each response variable for Arabidopsis thaliana.By analyzing the results, we conclude that TgSLMM equals or exceeds the other methods for all of responses.TgSLMM allows for dissecting individual explanatory variable effects from global genetic effects driven by population structure.
Further, we simply apply linear regression and crossvalidation to evaluate the proposed model's ability of response prediction versus all the algorithms.Using the explanatory variables our proposed method selects, 61.4% of prediction for Arabidopsis thaliana is better than using origin data set, 56.8% is better than using the data after employing Tree-Lasso, 79.5% is better than applying LMM-Lasso, 84.1% is better than MCP and SCAD, 66.0% is better than  Lasso, 91.0% is better than LMM-BOLT, 56.7% is better than LTMLM.Our method only works worse than LMM-Select while considering prediction.

C. HETEROGENEOUS STOCK MICE
For Heterogeneous Stock Mice data set, ground truth is also available so that we could evaluate the methods based on the area under their ROC Curve as Figure 6.TgSLMM behaves as the best one on 22.2% of the traits and achieves the highest ROC area for the whole data set as 0.627.The second best model is MCP with the area of 0.604.The areas under ROC of Tree-Lasso, Lasso and SCAD are 0.582, 0.591 and 0.590 respectively.The areas of the remaining models are all around 0.5, showing little ability to process such complex data sets.On traits Glucose_75, Glucose_30, Glucose.DeadFromAnesthetic, Insulin.AUC, Insulin.Delta and FN.postWeight, our method TgSLMM behaves the best.The results are interesting: the left side of the figure mostly consists of traits regarding glucose and insulin in the mice, while the right side of the figure consists of traits related to immunity.This raises the inspiring question of whether or not immune levels in stock mice are largely independent of family origin.

D. HUMAN ALZHEIMER DISEASE
Finally, we proceed to the Human Alzheimer's Disease data set and report the top 99 genetic variables our model discovered in Table 2 to foster further research.
Due to space limitation, we only verify the top 10 reported genetic variables with prior research.The 1 st discovered genetic variable is corresponded to apoB gene, which can influence serum concentration [53] in Alzheimer's disease [54].The 2 nd one is associated with ARHGAP10 gene (also called GRAF2), which is an important paralogon of ARHGAP26 that closely related to the Alzheimer's disease [55] and affects the developmentally regulated expression of the GRAF proteins that promote lipid droplet clustering and growth, and is enriched at lipid droplet junctions [54], [56].The 3 rd SNP GNPDA2 is discovered to show the environment and gene association with obesity.They have impact on neurodegenerative and neurodevelopmental diseases [57].The 4 th SNP is expressed by the SYNPO2, which influences hypercholesterolemia or hypertension that has a identified a link between cognitive deficits [58].The 6 th SNP known as the LY75, has close relation with the significantly differentially expression in the time-series paired analysis involving APOE4 carriers and non-carriers, which could affect Alzheimer's disease [59].The 7 th genetic variable is associated with AGAP1.AGAP1 can regulate membrane trafficking, actin remodeling [60] and is reported to be associated with Alzheimer's disease.The 8 th one is coded by gene FAM114A1.Biologists have found that FAM114A1 is highly expressed in the developing neocortex [61].Also, from ''the amyloid hypothesis", betaamyloid accumulation is mainly cause Alzheimer's disease [62].The 9 th is corresponded with gene CNTNAP2 and the direct downregulation of CNTNAP2 by STOX1A is associated with Alzheimer's disease [63].

B. RUNTIME
To evaluate its effectiveness and practicability, we have empirically measured the runtime on the Arabidopsis thaliana dataset mentioned in our paper.On a four-core computer (3GHz 12MB L2-Cache, 8GB Memory), TgSLMM required about 4 hours CPU time.In this paper, we show that our method is scalable to large genetic dataset.

VII. CONCLUSION
In this paper, we aim to solve the challenging task of sparse variable selection when the data are not i.i.d.This type of situation often occurs in genomics since different batches of medical data are collected from different sources for different purposes.Due to such confounding factors, naively applying the traditional variable selection methods will result in a huge number of false discoveries.In addition to that, existing algorithms ignore the convoluted interdependency among responses, hence a joint analysis that can utilize such relatedness information in a heterogeneous data set is crucial.To address these problems, we propose the tree-guided sparse linear mixed model for sparse variable selection.Apart from extending the recent solutions of LMM that can correct confounding factors, we can perform variable selection simultaneously further to account the relatedness between different responses.By conducting extensive experiments, we compare our method with state-of-art methods and deeply analyze how confounding factors from the high dimensional heterogeneous data set influence the capability of the model to identify active variables.We show that traditional methods easily fall into the trap of utilizing false information, whereas our proposed model outperforms other existing methods in both the synthetic data set and real genome data set.We make our source code available 4 .

A. THE PRECISION-RECALL CURVE OF SYNTHETIC EXPERIMENT
The Figure 7 shows the full images of Precision-Recall curves in synthetic experiments to compare our method with other existing methods by using the same parameters in our paper.
For each configuration, the reported curve is drawn over five random seeds.And we can see that TgSLMM behaves almost always best.

B. ESTIMATION OF β
The Figure 8 shows the β vectors yielded by methods we used in our paper together with the ground truth β vector generated in the synthetic experiments.The figures show that  TgSLMM yields the best result with the number about 0.95 of the area under ROC curves.The area of Tree-Lasso is about 0.84, that of LMM-Lasso is around 0.71.The area under ROC of MCP, SCAD, Lasso, LMM-BOLT, LTMLM and LMM-Select is 0.81, 0.81, 0.80, 0.57, 0.50 and 0.41 respectively.

C. PREDICTION OF Y
Figure 9 shows the Y results recovered.TgSLMM also yields the best result.5

D. THE ROC CURVE OF SYNTHETIC EXPERIMENT
The Figure 10 shows the remaining images of ROC curves in synthetic experiments to compare our method with other existing methods by using the same parameters in our paper.
For each configuration, the reported curve is drawn over five random seeds.And we can see that TgSLMM behaves almost always best.

FIGURE 1 .
FIGURE 1.The simulated ground-truth β vector.For the illustration purpose, we choose the experimental setting of n = 250, p = 500 and k = 50.

TABLE 1 .
Default experimental setting in the simulated experiments.
(a) Different number of samples (b) Different number of explanatory variables (c) Different number of response variables (d) Different number of distributions (e) Different percentage of active variables (f) Different magnitude of variance of explanatory variables (g) Different magnitude of variance of response variables (h) Different magnitude of noise

FIGURE 2 .
FIGURE 2. ROC curves for experiments with different parameters.We show the full image of ROC curves to compare our method with previous methods.For each configuration, the reported curve is drawn over five random seeds.

FIGURE 4 .
FIGURE 4. The simulated responses matrices and the predicted responses results by different models.

FIGURE 5 .
FIGURE 5. Area under ROC curve for the 44 traits of Arabidopsis thaliana.

FIGURE 6 .
FIGURE 6. Area under ROC curve for the 27 traits of mice.
(a) Different number of samples (b) Different number of explanatory variables (c) Different number of response variables (d) Different number of distributions (e) Different percentage of active variables (f) Different magnitude of variance of explanatory variables (g) Different magnitude of variance of response variables (h) Different magnitude of noise

FIGURE 7 .
FIGURE 7. Precision-Recall curves for experiments with different parameters.

FIGURE 9 .
FIGURE 9.The simulated responses matrix and the yield responses results.

TABLE 2 .
Discovered Genetic Variables with TgSLMM.Second, we employ a smoothing proximal gradient method that is originally developed for structuredsparsity-inducing penalties.By using the efficient method, the convergence rate of the algorithm is O(1 s explanatory variables.The runtime is reduced from O(nk 2 ) to O(n 2 s k).ϵ ), given the desired accuracy ϵ and the time complexity per iteration of the smoothing proximal gradient for the Tree-Lasso is O(p 2 k +p v ∈ V |G v |).Thus the overall complexity for our method is O(n