Incremental Learning of Latent Forests

In the analysis of real-world data, it is useful to learn a latent variable model that represents the data generation process. In this setting, latent tree models are useful because they are able to capture complex relationships while being easily interpretable. In this paper, we propose two incremental algorithms for learning forests of latent trees. Unlike current methods, the proposed algorithms are based on the variational Bayesian framework, which allows them to introduce uncertainty into the learning process and work with mixed data. The first algorithm, incremental learner, determines the forest structure and the cardinality of its latent variables in an iterative search process. The second algorithm, constrained incremental learner, modifies the previous method by considering only a subset of the most prominent structures in each step of the search. Although restricting each iteration to a fixed number of candidate models limits the search space, we demonstrate that the second algorithm returns almost identical results for a small fraction of the computational cost. We compare our algorithms with existing methods by conducting a comparative study using both discrete and continuous real-world data. In addition, we demonstrate the effectiveness of the proposed algorithms by applying them to data from the 2018 Spanish Living Conditions Survey. All code, data, and results are available at https://github.com/ferjorosa/incremental-latent-forests.


I. INTRODUCTION
Real-world data are often complex and high-dimensional. In this setting, latent variables have proven to be useful in analyzing their generation process, as they are able to represent underlying data concepts by grouping observed variables. A latent variable model that has received considerable attention is the latent tree model (LTM) [1], [2]. The LTM is a tree-structured probabilistic graphical model whose leaf nodes are observed variables and whose internal nodes can be either observed or latent variables.
LTMs are appealing because they can capture complex relationships with a simple tree structure that is easily interpretable. Furthermore, they allow for exact probabilistic inference in linear time [3]. For these reasons, LTMs have proven to be valuable in many areas, such as classification [4]- [6], topic detection [7], probabilistic inference [8] and cluster analysis [9]- [13].
The associate editor coordinating the review of this manuscript and approving it for publication was Benyun Shi .
Existing literature includes several approaches that address the problem of learning LTMs from data [10]- [21]. However, the majority of them rely on point estimates of the parameters, which do not provide a good assessment of a model's generalization given the data, and do not allow for the incorporation of expert knowledge into the learning process. We address these two problems by making use of the variational Bayes (VB) framework [22]. In this framework, point estimates over parameters are replaced with distributions, thus naturally reflecting the uncertainty over parameter values given the data.
Forests of LTMs are appealing owing to their ability to separate groups of similar variables into different trees. Although several methods for learning latent forests have been studied [15]- [18], they are currently limited to either discrete or continuous data. In this work, we propose two new algorithms for learning latent forests that are based on the VB framework, which intrinsically allows them to work with discrete, continuous, and mixed datasets.
Our first algorithm, which we refer to as incremental learner (IL), hill-climbs the space of latent forests in a two-phase iterative process. In its first phase, the forest structure is incremented with a new arc or latent variable. In its second phase, the cardinalities of latent variables are estimated. Our second algorithm, which we name constrained incremental learner (CIL), modifies the previous procedure by considering only a subset of candidate structures in each step of the search. We show that by doing this, it is able to return almost identical results for a fraction of the computational cost.
The remainder of this paper is organized as follows. In Section II, we formally introduce LTMs and describe how they can be learned from data using the VB framework. In Section III, we present our approach for incrementally learning forests of LTMs. Then, in Section IV, we explore the performance of our proposed approach with respect to the state of the art for various experimental settings: (i) discrete data, (ii) continuous data, and (iii) mixed data. Finally, we report our conclusions and propose future research directions in Section V.

II. PRELIMINARIES
In this section, we describe LTMs and present an overview of how to learn their parameters and structure from data. For notation, we use capital letters, such as X , Y , Z , to denote variable names, and lower-case letters, such as x, y, z, to denote specific values taken by these variables. Sets of variables are indicated by bold capital letters, such as X, Y, Z, and assignments of values to these variables are indicated by bold lower-case letters, such as x, y, z.
For a dataset with N instances, we denote the observed variables and their values in the nth instance by X[n] and x[n], respectively, and denote the latent (or hidden) variables and their possible assignments in the nth instance by H[n] and h[n], respectively. We use D = ∪ n x[n] to denote all observed values, and use H = ∪ n h[n] to denote all possible assignments to all hidden values in the dataset. Thus, the pair (D, H) defines an assignment to all variables in all data instances.

A. LATENT TREE MODELS
A Bayesian network (BN) B is a probabilistic graphical model [3], [23] that represents a joint probability distribution over a set of observed X = {X_1, X_2, . . . , X_S} and latent H = {H_1, H_2, . . . , H_L} random variables. It is defined by a directed acyclic graph G, which represents the network structure that encodes the conditional independencies between variables, and a set of parameters θ, which comprises the conditional probability distributions of each variable given its parents in G, that is, p(x_s | pa_{x_s}) and p(h_l | pa_{h_l}), where pa_{x_s} denotes a value of the set of X_s's parents, Pa_{X_s} (and analogously for H_l). Thus, the joint distribution is factorized as

p(x, h) = ∏_{s=1}^{S} p(x_s | pa_{x_s}) ∏_{l=1}^{L} p(h_l | pa_{h_l}).    (1)

LTMs are tree-structured BNs whose leaf nodes are observed variables and whose internal nodes can be either observed or latent variables. Subclasses of LTMs, such as phylogenetic trees [24] and latent class models (LCMs) [25], [26], have been studied for decades. In an LCM, all observed variables are conditionally independent given a single latent variable. The hierarchical latent class model (HLCM) generalizes the LCM by allowing multiple latent variables, where all observed variables are leaf nodes. Therefore, an HLCM is a restricted version of an LTM whose internal nodes cannot be observed variables. Fig. 1 presents examples of these models, including an LTM with no restrictions.

To learn the structure of an LTM, the following information must be determined: (i) the number of latent variables, (ii) the cardinality of each latent variable, and (iii) the connections between variables. To learn the structure of a latent forest, it is necessary to determine the structure of each tree that forms it. Structure learning algorithms fall into the following three categories [1]: search-based methods, variable clustering methods, and distance-based methods.
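As a toy illustration of this factorization, consider a hypothetical LCM (the simplest LTM) with one binary latent variable H and two discrete children X1 and X2. The joint distribution is the product of the local conditionals, and marginalizing H recovers a proper distribution over the observed variables. All probability values below are made up for illustration:

```python
import itertools

# Hypothetical toy LCM: binary latent H with two discrete children X1, X2.
p_h = {0: 0.6, 1: 0.4}
p_x1_given_h = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # key: (x1, h)
p_x2_given_h = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.1, (1, 1): 0.9}  # key: (x2, h)

def joint(x1, x2, h):
    """p(x, h): product of the local conditional distributions."""
    return p_h[h] * p_x1_given_h[(x1, h)] * p_x2_given_h[(x2, h)]

def marginal(x1, x2):
    """p(x): marginalize out the latent variable."""
    return sum(joint(x1, x2, h) for h in p_h)

# The marginals over the observed variables sum to 1, as they should.
total = sum(marginal(x1, x2) for x1, x2 in itertools.product((0, 1), repeat=2))
```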
Search-based methods construct LTMs one local move at a time, such as by introducing a latent variable, removing an arc, and increasing the cardinality of a latent variable. Zhang [10] was the first to propose an approach of this kind, called the double hill-climbing (DHC) algorithm. DHC starts with an LCM and then explores the space of HLCMs. At each step of the search, it employs a hill-climbing procedure to identify the best HLCM that can be produced by introducing a single arc or latent variable. Once this model is found, DHC optimizes the cardinalities of its latent variables by employing a second hill-climbing procedure (hence the name of the algorithm). However, this results in a very high computational complexity. For this reason, a less computationally intensive alternative called the single hill-climbing (SHC) algorithm [14] was proposed. SHC balances model quality and computational efficiency by employing a single hill-climbing procedure to explore the space of HLCMs. This method was later improved in [11] by dividing each search step into three stages: expansion, adjustment, and simplification. Each stage uses its own set of operators (e.g., during the expansion stage, the LTM is modified only by introducing a new arc or latent variable). For this reason, this algorithm is called EAST (expansion, adjustment, simplification until termination).
The aforementioned algorithms were all designed to work with discrete data. An extension of the EAST algorithm for continuous data was later proposed in [12]. Additionally, for continuous data, Galimberti and Soffritti proposed a greedy method for learning forests of unconnected LCMs. Both methods assume that each observed variable in the model follows a Gaussian distribution.
Although search-based methods usually find high-scoring models, they are computationally expensive, as they involve the evaluation of a large number of models. Methods based on variable clustering are much faster alternatives with a small compromise in model quality. All clustering-based methods rely on two key points: grouping variables to identify new latent variables and constructing models in a hierarchical manner using a bottom-up strategy. Three types of structures have been studied: HLCMs via the bridged-islands (BI) algorithm [13], binary forests (in which each tree node can have at most two children) via the BIN-G and BIN-A algorithms [16], and non-binary forests (in which each tree node has no restrictions on the number of children) via the CFHLC algorithm [17]. These algorithms are all limited to work with discrete data.
Phylogenetic tree reconstruction led to the development of distance-based methods [19], [20]. These algorithms learn an LTM from the information distances of observed variables, connecting variables with a new arc or latent variable. The main advantage of these algorithms is their ability to recover the correct latent structure under some mild conditions. However, they have several drawbacks. First, when applied to discrete data, all variables must share the same state space. Second, in the presence of continuous data, all variables must be continuous; that is, mixing continuous and discrete variables is not possible. The work of Huang et al. [21] circumvents these issues by defining a new distance metric and a new parameter learning method based on moments and tensor decompositions [27]. However, similar to its predecessors, it is currently limited to learning a single tree.
The majority of search-based, clustering-based, and distance-based methods estimate model parameters using the expectation-maximization (EM) algorithm [28]. Each iteration t of this algorithm is divided into two steps: (i) the expectation step, in which the current parameter estimates θ^(t) are used to compute the expected sufficient statistics of the latent variables; and (ii) the maximization step, in which new estimates θ^(t+1) are chosen to maximize the expected log-likelihood. The likelihood function is not appropriate for model selection because complex model structures usually have a large number of parameters and thus score higher, leading to overfitting. For this reason, a penalized version of the likelihood function can be used instead, such as the Bayesian information criterion (BIC) [29]. BIC applies a penalty term to the log-likelihood that depends on the model complexity. For an LTM with dim(G) parameters, BIC is defined as

BIC(G | D) = log p(D | θ*, G) − (dim(G)/2) log N,    (2)

where θ* denotes the maximum likelihood estimate of the parameters.

B. BAYESIAN LEARNING WITH LATENT VARIABLES
By using a point estimate of the parameters, there is no measure of uncertainty, and no prior knowledge can be incorporated into the learning process. In Bayesian statistics, prior knowledge is introduced via a prior distribution, and uncertainty is reflected in the posterior distribution, which encodes updated beliefs once prior knowledge and data have been taken into consideration. For a fixed structure G, two quantities are of interest. The first is the posterior distribution over the sets of parameters and latent variables:

p(H, θ | D, G) = p(D, H | θ, G) p(θ | G) / p(D | G).    (3)

The second quantity is the marginal likelihood, which acts as a normalizing constant for the posterior distribution and is a key element in model selection:

p(D | G) = ∫ Σ_H p(D, H | θ, G) p(θ | G) dθ.    (4)

In the Bayesian learning problem, we assume that the learner has a prior distribution p(G) over the set of structures. Furthermore, in theory, we should average over all possible structures when estimating the posterior distribution in (3). However, in practice, constraints on storage and computation or ease of interpretability may lead us to select the most probable model structure.
In this setting, if we assume that all structures are equally probable for the given data, then model selection is equivalent to choosing a structure with the highest marginal likelihood. Unlike p(D|θ ) in the BIC score (2), the marginal likelihood automatically penalizes complex models by integrating out the parameters and latent variables. This is called the Bayesian Occam's razor [30]. The marginal likelihood is tractable to compute for certain types of models, such as fully observed discrete BNs [31]. However, when latent variables are present, such as in LTMs, the marginal likelihood becomes intractable to compute even for moderately sized datasets. In such situations, approximation schemes are required. Two of the most popular approximation schemes are the Markov chain Monte Carlo (MCMC) method [32], [33] and variational inference (VI) [34]. MCMC is a nonparametric method that produces asymptotically exact results but at a high computational cost. In contrast, VI has the advantage of being faster by using a parametrized approximation, in which the Bayesian inference problem is solved via optimization. In this paper we focus on the VI approach.
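For reference, the BIC score discussed above is straightforward to compute from a model's log-likelihood, parameter count, and sample size. This is a minimal sketch, not the paper's implementation:

```python
import math

def bic_score(loglik, dim_g, n):
    """BIC = log p(D | theta_hat) - dim(G)/2 * log N.
    Higher is better; the second term penalizes model complexity."""
    return loglik - 0.5 * dim_g * math.log(n)
```

For two models with equal log-likelihood, the one with fewer parameters scores higher, which is the explicit complexity penalty that the marginal likelihood applies implicitly.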

C. VARIATIONAL BAYES FRAMEWORK
The goal of VI is to find an approximate distribution q(H, θ) from some tractable family Q that closely approximates the true posterior distribution p(H, θ | D, G). For simplicity, we denote these distributions q and p, respectively. The key principle of VI is to solve this problem via optimization, in which a set of variational parameters ϕ that makes q closest to p is identified. The usual cost function for this optimization problem is the reverse Kullback-Leibler (KL) divergence:

KL(q || p) = E_q[ log q(H, θ) − log p(H, θ | D, G) ].    (5)

However, we cannot minimize this function directly, as that would require computing the marginal likelihood in (4). Instead, we can maximize an alternative objective that differs from the negative reverse KL divergence only by an added constant, log p(D | G). This function is called the evidence lower bound (ELBO):

ELBO(q) = E_q[ log p(D, H, θ | G) ] − E_q[ log q(H, θ) ] = log p(D | G) − KL(q || p).    (6)

Two conclusions can be drawn from (5) and (6). First, maximizing the ELBO is equivalent to minimizing KL(q || p). Second, as its name suggests, the ELBO is a lower bound of log p(D | G), which follows from the fact that KL(·) ≥ 0. This relationship with the marginal likelihood has led the ELBO to be used as a model selection criterion. However, when applied to models with discrete latent variables (e.g., LTMs), an adjustment must be made to account for the parameters' lack of identifiability. To better understand this problem, consider the conditional distribution of a variable whose parent is a discrete latent variable with K labels. There are K! equivalent settings of the parameters of this conditional distribution, which differ merely by permuting the parent labels. Two common approaches exist for addressing this redundancy: (i) using asymmetric priors, and (ii) introducing a small penalty into the ELBO score. We use the second approach when expert knowledge is lacking in the learning process. In this context, a simple penalty involves subtracting the term log K! from the lower bound [35].
We can generalize this penalty to a model with multiple discrete latent variables H_l, each with cardinality K_l, and use it to define a penalized version of the ELBO:

p-ELBO(q) = ELBO(q) − Σ_{l=1}^{L} log K_l!.    (7)

The complexity of maximizing the ELBO (and therefore the p-ELBO) is determined by the complexity of the variational family Q. In this paper, we use the VB framework [22], which assumes the following factorization of the variational posterior: q(H, θ) = q(H) q(θ). The VB framework iteratively maximizes the ELBO with respect to q(H) and q(θ). This results in an iterative algorithm that is directly analogous to EM (called the VBEM algorithm) and is guaranteed to monotonically increase the ELBO. The exact form of the variational expectation and maximization equations depends on the functional form of the conditional distributions in the model (e.g., discrete BNs [36]). However, deriving a set of specific update equations for each type of conditional distribution is an arduous task. Fortunately, the variational message passing (VMP) framework [37] provides a set of general-purpose update equations that work for any BN in which all conditional distributions are in the exponential family and all parent distributions are conjugate. A model in which both of these constraints hold is known as a conjugate-exponential (CE) model.
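The label-permutation penalty described above is cheap to compute once the ELBO value is available from VBEM. A minimal sketch (the function name is illustrative):

```python
import math

def p_elbo(elbo, latent_cardinalities):
    """Penalized ELBO: subtract log(K_l!) for every discrete latent variable
    H_l, accounting for its K_l! equivalent label permutations."""
    penalty = sum(math.lgamma(k + 1) for k in latent_cardinalities)  # lgamma(k+1) = log k!
    return elbo - penalty
```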
In this paper, we combine the VMP framework with the VBEM algorithm to learn LTMs with discrete and/or continuous variables. Continuous variables are assumed to follow a Gaussian distribution, which belongs to the CE family. The only restriction imposed by the VMP framework is that a discrete variable cannot have a Gaussian parent.

III. INCREMENTAL LEARNING OF LATENT FORESTS
In this section, we propose a search-based method that hill-climbs the space of latent forests using the VB framework. Instead of exploring this space directly, we approach the search as an iterative process with the following two phases:
• Structure phase. The forest structure is incrementally built by connecting variables via a new arc or latent variable.
• Cardinalities phase. The cardinalities of the latent variables involved in the previous phase are estimated.
We use two search operators (node addition and arc addition) to modify the forest structure and another two operators (state introduction and state removal) to estimate latent cardinalities.

A. SEARCH OPERATORS
Node addition (NA) involves two variables (observed or latent) and generates a new model by introducing a latent variable as their parent. The cardinality of this new variable is 2. To reduce computational complexity, we limit this operator to pairs of variables. This restriction is compensated for by the following operator.
Arc addition (AA) generates a new model by introducing an arc from one variable to another. In this operator, both directions are considered. However, there is an intrinsic restriction given by the CE family that a continuous variable cannot be the parent of a discrete variable. Therefore, these arcs are ignored by the operator.
State introduction (SI) creates a new model by adding a state to a latent variable. In contrast, state removal (SR) creates a new model by removing a state from a latent variable. The SI operator is not applicable if the variable already has K_max states, where K_max can be specified by the user. The SR operator is not applicable if the variable has only two states.
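A hedged sketch of how these four operators could enumerate candidate moves (the data structures, variable names, and K_MAX value are illustrative assumptions, not the paper's implementation):

```python
from itertools import combinations

K_MAX = 10  # illustrative user-specified bound on latent cardinality

def structure_candidates(work_set, is_continuous):
    """Enumerate NA and AA candidates over the working set W.

    NA proposes a new binary latent parent for each pair; AA proposes an arc
    in each direction, skipping continuous -> discrete arcs, which the
    conjugate-exponential family forbids.
    """
    candidates = []
    for a, b in combinations(work_set, 2):
        candidates.append(("NA", a, b))            # new latent parent of (a, b)
        if not is_continuous[a] or is_continuous[b]:
            candidates.append(("AA", a, b))        # arc a -> b
        if not is_continuous[b] or is_continuous[a]:
            candidates.append(("AA", b, a))        # arc b -> a
    return candidates

def cardinality_candidates(k):
    """SI/SR applicability for a latent variable with k states."""
    ops = []
    if k < K_MAX:
        ops.append(("SI", k + 1))  # add a state
    if k > 2:
        ops.append(("SR", k - 1))  # remove a state
    return ops
```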

B. SEARCH PROCESS
The search starts with an unconnected forest M in which no latent variables are present and whose observed variables are independent. These observed variables form the working set W. The parameters of this model are learned via the VBEM algorithm and its corresponding score is stored as a baseline for comparison.
In the structure phase, the NA and AA operators are applied to each pair of variables in W, resulting in the M_NA and M_AA sets of candidate models. Each candidate model is then evaluated by learning its parameters and storing its p-ELBO score. When this process is completed, the highest-scoring model M_S is selected, and all of its variables that were involved in the selected operator are stored in a new set of variables V. We refer to this phase as the structure subroutine; it is formally described in Algorithm 1.

In the cardinalities phase, the SI and SR operators are applied to the latent variables in V, resulting in the M_SI and M_SR sets of candidate models. In contrast to the structure phase, this is done iteratively. The process starts with M_S; at each iteration, it applies the SI and SR operators and selects the best candidate model M_C. If the p-ELBO of M_C is greater than that of the current model, the candidate takes its place and the process continues. Once the score ceases to improve, the resulting model is returned as M_C. We refer to this phase as the cardinalities subroutine; it is formally described in Algorithm 2.
After the cardinalities phase, the scores of M and M_C are compared. If there is an improvement over the previous iteration, then M_C becomes the current best model, and the working set W is updated. The update process depends on the operator selected in the structure phase: in the case of node addition, the children variables are removed from W and the new latent variable is added; in the case of arc addition, the child variable is removed from W. The algorithm alternates between these two phases until the score ceases to increase or only one variable remains in W. We call this method IL; it is formally described in Algorithm 3.

Fig. 2 provides an example execution of the IL algorithm. It starts with an empty latent forest with two discrete and two continuous observed variables, {X_1, X_2} and {X_3, X_4}, respectively. In the first iteration, the AA operator is selected and X_3 becomes the new parent of X_4, resulting in the removal of X_4 from W. In the second iteration, the NA operator is selected and a new latent variable H_1 is introduced as the parent of X_2 and X_3, thus removing both X_2 and X_3 from W. The cardinality of H_1 is estimated and remains at its initial value, 2. In the third and final iteration, the AA operator is selected again, and H_1 is set as the parent of X_1. Unlike the previous iteration, the cardinality of H_1 is increased to 4. Given that only one variable remains in W, the process stops, and if the p-ELBO of the model is greater than that of the previous iteration, then M is updated and returned.

However, IL can be very time-consuming due to two factors. First, each iteration requires the evaluation of a large number of models. In the structure phase, the NA and AA operators evaluate a total of O(3(|W|^2 − |W|)/2) candidate models. In the cardinalities phase, the number of evaluations depends on the previously selected operator. The maximum number of evaluations occurs when a latent variable is the new parent of two other latent variables, resulting in O(3K_max) evaluations for each of SI and SR.
Therefore, the total number of candidate models evaluated at each iteration is O(3(|W|^2 − |W|)/2 + 6K_max). The second time-consuming factor is that each candidate evaluation requires running the VBEM algorithm, which is computationally expensive.
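This worst-case count can be sketched as a back-of-the-envelope helper (purely illustrative, not part of the algorithm):

```python
def il_candidates(w_size, k_max):
    """Worst-case candidate evaluations in one IL iteration:
    3(|W|^2 - |W|)/2 structure candidates (one NA and two AA per pair of
    variables) plus up to 6 * K_max cardinality candidates."""
    return 3 * (w_size**2 - w_size) // 2 + 6 * k_max
```

For example, with |W| = 10 and K_max = 5, a single iteration may require evaluating on the order of 165 candidate models, each with its own parameter-learning run.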
To address these challenges, we propose an alternative search procedure that considers fewer candidates in Section III-C, and in Section III-D, we propose an alternative to the VBEM algorithm that speeds up the model evaluation.

C. CONSTRAINED SEARCH
A straightforward approach for increasing the speed of the proposed method is to introduce restrictions in the search process [38], [39]. We achieve this by only evaluating certain structures. More specifically, in the structure phase, instead of considering all pairs of variables in W, we select the pair with the highest mutual information (MI). The rationale for this approach stems from the following observation. Let p(w) be the probability distribution over the set of variables in W. Knowing that all variables in this set are currently considered to be independent of each other, we can approximate this distribution with another distribution, p̂(w), that models the joint distribution of two variables W_i and W_j and the marginal distributions of the remaining variables:

p̂(w) = p(w_i, w_j) ∏_{k ≠ i,j} p(w_k).    (8)

Then, our objective is to select the pair of variables whose connection (via an arc or latent variable, represented by their joint probability distribution p(w_i, w_j)) makes p̂(w) as close to p(w) as possible. Evaluating this separation using the KL divergence, we have

KL(p || p̂) = Σ_k H(W_k) − H(W) − I(W_i, W_j),    (9)

where H(·) denotes entropy and I(W_i, W_j) is the MI between W_i and W_j under the distribution p(w_i, w_j), defined as

I(W_i, W_j) = Σ_{w_i} Σ_{w_j} p(w_i, w_j) log [ p(w_i, w_j) / (p(w_i) p(w_j)) ],    (10)

with the sums replaced by integrals for continuous variables. By selecting the pair of variables with the highest MI, we minimize the KL divergence between p(w) and p̂(w), since the remaining terms in (9) do not depend on the chosen pair. However, (10) is often computationally intractable given that W_i and W_j can be either discrete or continuous. For this reason, approximate MI estimators, such as binning, k-nearest neighbors [40], or kernel density estimation [41], are required. Because the MI is approximated, we cannot ensure that the selected pair of variables results in the truly minimal KL(p || p̂). A simple way to alleviate this problem is to consider additional pairs of variables; we propose considering the α pairs with the highest MI. We apply this idea to IL, resulting in the CIL method, which is formally described in Algorithm 4.
CIL is very similar to IL; the main differences in execution occur (i) at the beginning, in line 2, where CIL estimates the MI matrix for each pair of variables in W; (ii) in the structure phase, in line 6, where CIL considers only the α pairs of variables in W with the highest MI; and (iii) at the end of each iteration, in line 12, where CIL updates the MI matrix. The update removes the variables that have been dropped from W and estimates the MI of new variables in W. In this method, α represents a compromise between model quality and computational cost.
CIL generates fewer candidate models than IL at each step of the search. While its cardinalities phase is identical to that of IL, its structure phase depends only on α, which results in a total of O(2α + 6K_max) evaluations at each step of the search.

D. LOCAL VBEM
Repeatedly evaluating candidate models can become prohibitive when each evaluation involves running the VBEM algorithm. Previously, we addressed this problem by reducing the number of candidates. However, when the number of variables is high, this reduction may be insufficient. One solution that has been successfully applied in the LTM [11], [16] and phylogenetics [42] literature involves optimizing some parameters of the model while the remaining parameters stay constant.
Following this idea, we propose replacing the VBEM algorithm with a more efficient procedure that estimates the variational posterior distribution q(H, θ) of only a subset of variables while the remaining variables stay constant. We refer to this method as local VBEM. Thus, when evaluating a candidate model using local VBEM, we estimate the variational posteriors of (i) the variables involved in the search operator, and (ii) the variables belonging to the Markov blankets (MBs) of the variables in (i). For any variable in a BN, its MB consists of the set of all its parents, children, and spouses (parents of its children).
To illustrate the behavior of local VBEM, let us examine the rightmost example in Fig. 2, which is produced by applying the AA operator on H 1 and X 1 . Evaluating this model requires the estimation of variational posteriors of X 1 and H 1 as well as X 2 and X 3 , which belong to the MB of H 1 . Incidentally, we do not require X 4 because it does not belong to the MB of either H 1 or X 1 .
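A small sketch of the MB computation used to decide which variational posteriors local VBEM updates; the child → parents map below encodes the final structure of the Fig. 2 example (H_1 parent of X_1, X_2, X_3; X_3 parent of X_4):

```python
def markov_blanket(node, parents):
    """Markov blanket of `node` in a BN given as a child -> parents map:
    its parents, its children, and its children's other parents (spouses)."""
    children = {c for c, ps in parents.items() if node in ps}
    spouses = {p for c in children for p in parents.get(c, ()) if p != node}
    return (set(parents.get(node, ())) | children | spouses) - {node}

# Final structure of the Fig. 2 example.
fig2 = {"X1": {"H1"}, "X2": {"H1"}, "X3": {"H1"}, "X4": {"X3"}}
```

Consistent with the text, the blanket of H_1 contains X_1, X_2, and X_3 but not X_4.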
One iteration of local VBEM is computationally much cheaper than one iteration of VBEM because it updates fewer variational parameters. This also implies that a run of local VBEM requires fewer steps to converge than a run of VBEM. However, parameter estimates provided by local VBEM may deviate from those provided by VBEM. To prevent this from affecting the quality of the IL and CIL results, we can perform a run of VBEM after several search steps or before returning the model. In this paper, we run VBEM before returning the model.
Finally, like VBEM, local VBEM may get trapped at local maxima. To avoid this, we use a variant of the multiple-restart approach proposed by Chickering and Heckerman [43] and adapt it for the variational case. First, we sample C initial configurations of the variational parameters ϕ. Next, we perform one VBEM step and retain C/2 of the configurations that lead to the largest score values. Then, we perform two VBEM steps and retain C/4 configurations. We continue this procedure, doubling the number of VBEM steps in each iteration until only one configuration remains.
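The restart schedule can be sketched as follows; this is a simplified illustration in which a configuration is a (score, parameters) pair and `vbem_step` stands in for one real VBEM update:

```python
def multiple_restart(init_configs, vbem_step):
    """Variational adaptation of the Chickering-Heckerman restart scheme.

    Run 1 VBEM step on every configuration and keep the best half; run 2
    steps and keep half again; double the step count until one remains.
    """
    configs = list(init_configs)
    steps = 1
    while len(configs) > 1:
        configs = [run_steps(c, steps, vbem_step) for c in configs]
        configs.sort(key=lambda c: c[0], reverse=True)  # best scores first
        configs = configs[: max(1, len(configs) // 2)]
        steps *= 2
    return configs[0]

def run_steps(config, steps, vbem_step):
    for _ in range(steps):
        config = vbem_step(config)
    return config
```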

E. PRIOR SPECIFICATION
A key aspect in Bayesian learning is the specification of prior distributions. This includes two aspects: their form and parameters. Because IL and CIL are designed to work with CE models, the forms of prior distributions are already established, and it is only necessary to focus on their parametrization.
Tuning prior parameters is largely dependent on the availability of expert information. This information can be obtained from a person or from other directly related studies (see Bayesian meta-analysis [44]). When expert information is available, prior parameter values can be selected to best reflect the expert knowledge. When this information is unavailable, we propose using the following strategy, which varies for observed and latent variables. First, for observed variables, we use an empirical Bayes [45] approach and assign maximum likelihood estimates to their prior parameters. Second, for latent variables, we assume a symmetric Dirichlet prior with a total concentration of 1.
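A minimal sketch of this default parametrization, assuming Gaussian observed variables (the function names are illustrative):

```python
import statistics

def observed_gaussian_prior(samples):
    """Empirical Bayes for a Gaussian observed variable: center the prior on
    the maximum likelihood estimates (sample mean and MLE variance)."""
    mu = statistics.fmean(samples)
    return {"mean": mu, "variance": statistics.pvariance(samples, mu)}

def latent_dirichlet_prior(k):
    """Symmetric Dirichlet over k states with total concentration 1,
    i.e., a pseudo-count of 1/k per state."""
    return [1.0 / k] * k
```

Keeping the total concentration at 1 regardless of k means the prior carries the weight of a single pseudo-observation, so it is quickly dominated by the data.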

IV. EXPERIMENTS
In this section, we evaluate the performance of IL and CIL in terms of data fitting and computational complexity. To this end, we first conducted a comparative study using both discrete (Section IV-A) and continuous (Section IV-B) real-world data. Then, as described in Section IV-C, we analyzed real social data from the 2018 Spanish Living Conditions Survey, which contains both discrete and continuous variables.
In these experiments, IL and CIL were compared with several state-of-the-art LTM methods. However, given that most of these algorithms are specific to the discrete data domain, we complemented the study with two other approaches that are not based on LTMs but can work with mixed datasets. The first, called MSPN [46], combines sum-product networks [47] with nonparametric estimation to learn hierarchically structured latent variable models that do not require the specification of the variables' parametric forms. The second is a version of kernel density estimation that can work with continuous and discrete variables and uses Silverman's rule of thumb for bandwidth selection [48]. Experiments were conducted on a computer with an Intel Core i7-6700K CPU at 4.00 GHz and 64 GB of RAM, running Windows 10 Enterprise. All experimental runs were performed on a single thread to allow a fair comparison between methods. For reproducibility, all code (including implementations and state-of-the-art executables), data, and results can be downloaded from our repository at https://github.com/ferjorosa/incremental-latent-forests.

A. COMPARATIVE STUDY ON DISCRETE DATA
The algorithms we considered in this comparative study are summarized as follows. The LCM algorithm starts with a latent class model of cardinality 2 and greedily increases its cardinality until the score ceases to improve. The BIN algorithm generates a binary latent forest following the agglomerative clustering variant proposed in [16]. The EAST algorithm learns an HLCM and refers to the method of the same name proposed in [11]. The BI algorithm learns an HLCM and refers to the method of the same name presented in [13]. The MSPN algorithm learns a mixed sum-product network and refers to the method presented in [46]. Finally, the IL and CIL algorithms are the methods proposed in this paper. For CIL, we used α values of 1 and 10. The goal of using these values was to evaluate whether an increase in α produced better results or simply an increase in computational complexity.
To conduct the study, we used 15 real-world datasets of varied dimensionalities (S) and sample sizes (N). The majority of these datasets have been used in previous studies. We also added several publicly available datasets from the UCI repository (https://archive.ics.uci.edu/ml/index.php) to obtain a balanced set of examples in terms of S and N. None of these datasets provided expert knowledge with which to set Bayesian priors. For this reason, we set the prior parameters of the IL and CIL models using an empirical Bayes approach. As a performance measure, we used the 10-fold cross-validated predictive log-likelihood (CVPLL). That is, we divided each dataset into 10 equal-sized folds, trained a model on nine of them, and computed the predictive log-likelihood on the remaining fold. Tables 1 and 2 display the average results for the CVPLL and learning time, respectively. The maximum time allowed per fold was 24 hours. For a better comparison between IL and CIL, we colored in blue those CIL results that were in second place behind IL; our objective was to evaluate whether CIL could produce top results in less time than IL. Due to space limitations, we could not include the standard deviation values in the tables. This information is provided in the supplementary material.
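For concreteness, the CVPLL procedure described above can be sketched as follows; `fit` and `log_lik` are placeholders for any model's learning routine and predictive log-likelihood, respectively.

```python
import numpy as np

def cvpll(data, fit, log_lik, k=10, seed=0):
    """k-fold cross-validated predictive log-likelihood.

    fit(train) returns a fitted model; log_lik(model, test) returns the
    total predictive log-likelihood of `test` under that model. The
    returned score is the average held-out log-likelihood over folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = data[folds[i]]
        train = data[np.concatenate([folds[j] for j in range(k) if j != i])]
        scores.append(log_lik(fit(train), test))
    return float(np.mean(scores))
```

A higher (less negative) CVPLL indicates that the learned model generalizes better to unseen data, which is why we use it to compare model quality across methods.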
Tables 1 and 2 indicate that IL and CIL produced competitive results in terms of both model quality and execution time. Table 1 shows that IL and CIL had the best performance on 5 out of 15 datasets. In addition, CIL with α = 10 was in second place on another dataset. EAST and IL were unable to compete on high-dimensional datasets due to their computational complexity. In contrast, both BI and CIL produced competitive results on all types of datasets. Finally, in terms of speed, the MSPN algorithm was the fastest of the seven methods evaluated, followed by CIL with α = 1.
It can be observed that there were no substantial score differences between IL and CIL. This observation is independent of the selected α value because the MI estimation is exact for discrete data (see Section III-C). In fact, for all datasets except for Car-evaluation, the results of CIL with α = 1 were almost identical to those of CIL with α = 10 and IL.

B. COMPARATIVE STUDY ON CONTINUOUS DATA
The algorithms we considered in this comparative study are summarized as follows. The KDE algorithm corresponds to the kernel density method proposed in [48]. The GS algorithm corresponds to the method proposed by Galimberti and Soffritti in [15], which produces a forest of unconnected LCMs whose leaf variables are all Gaussian. The GEAST algorithm [12] refers to the Gaussian version of the EAST algorithm. The MSPN algorithm is the same method used in the previous comparative study.
As with discrete LTMs, we ran experiments on 15 real-world datasets taken from the UCI repository and from state-of-the-art papers. We also used CVPLL as the evaluation measure and allowed a maximum learning time of 24 hours per fold. In addition, given the absence of expert knowledge, we used an empirical Bayes approach to set the prior parameters of the IL and CIL models. Tables 3 and 4 display the average results of this study. As with discrete LTMs, we were unable to include the standard deviation values in the tables due to space limitations; this information is provided in the supplementary material.
Tables 3 and 4 indicate that, in terms of model quality, GEAST and IL had the highest performance. The performance of IL was similar to that in the discrete case in terms of the number of top results: IL had the best performance on 4 out of 15 datasets. However, it was outperformed by the Gaussian version of EAST, which performed considerably better than its discrete counterpart, achieving the best performance on 8 out of 15 datasets. Nevertheless, GEAST still suffered from the same computational problems with high-dimensional datasets and was unable to complete four of the experiments. In terms of speed, the KDE algorithm was the fastest method on all of the datasets. However, this high performance in terms of execution time was not accompanied by high performance in terms of model quality.
Although there were no substantial differences between the performance of IL and CIL with discrete data, the opposite behavior was observed with continuous data. More specifically, there was only one dataset (Buddymove) for which CIL with α = 1 resulted in the same performance as IL. This coincides with our intuition from Section III-C. When handling continuous data, we must rely on approximate MI methods, which may result in suboptimal solutions (we used the k-nearest neighbors approach proposed in [40]). As previously mentioned in Section III-C, increasing the number of candidate models in each step improves the quality of the results, as illustrated in Table 3. There were only small differences between IL and CIL with α = 10, which makes CIL particularly promising for high-dimensional data for which IL takes a much longer time to complete.
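For reference, a minimal version of a KSG-style k-nearest-neighbors MI estimator in the spirit of [40], restricted to two one-dimensional samples, is sketched below. The implementation details (max-norm neighborhoods, the digamma series approximation, clamping at zero) are our own simplifications for illustration.

```python
import math
import numpy as np

def _digamma(x):
    # Recurrence to shift x above 6, then asymptotic series.
    r = 0.0
    while x < 6:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def ksg_mi(x, y, k=3):
    """k-NN (Kraskov-style) estimate of I(X; Y) for 1-D continuous samples."""
    n = len(x)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    dz = np.maximum(dx, dy)              # max-norm distance in the joint space
    np.fill_diagonal(dz, np.inf)
    eps = np.sort(dz, axis=1)[:, k - 1]  # distance to the k-th joint neighbor
    nx = (dx < eps[:, None]).sum(axis=1) - 1  # marginal counts inside eps
    ny = (dy < eps[:, None]).sum(axis=1) - 1
    mi = (_digamma(k) + _digamma(n)
          - np.mean([_digamma(v + 1.0) for v in nx])
          - np.mean([_digamma(v + 1.0) for v in ny]))
    return max(mi, 0.0)  # MI is nonnegative; clamp small negative estimates
```

Because such estimates are noisy, the pair ranked first by approximate MI may not lead to the best structure, which is precisely why increasing α lets CIL recover from ranking errors.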

C. SPANISH LIVING CONDITIONS DATASET
The Spanish Living Conditions Survey is an annual survey conducted by Eurostat to analyze income, poverty, social exclusion, and living conditions in the European Union. The objective of this survey is the cross-sectional study of Spanish households; this information is later used in a comparative study that includes other European countries. In this experiment, we used data from the most recently published survey (2018). Our objective in using these data was twofold: (i) to test the performance of our proposed methods on a mixed dataset; and (ii) to obtain knowledge from this study by learning a latent forest.
After preprocessing, the data consisted of 13,222 instances and 22 observed variables. Variables were organized into two main groups: (i) house quality (more specifically, the location, furniture, and appliances); and (ii) family economy, which includes household income and the ability to afford essential services and products. For a better comparison of different types of households according to the number of individuals that constitute them and the individuals' ages, the concept of equivalent income was used. This concept standardizes households according to the number of equivalent consumption units that constitute them. The number of units in each household was determined using the OECD-modified equivalence scale.
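For clarity, the OECD-modified scale assigns 1.0 consumption units to the first adult, 0.5 to each additional household member aged 14 or over, and 0.3 to each child under 14; equivalent income is then household income divided by the resulting number of units. A minimal sketch (function names are ours):

```python
def oecd_modified_units(n_adults, n_children):
    """OECD-modified equivalence scale: 1.0 for the first adult (aged 14+),
    0.5 for each additional adult, 0.3 for each child under 14."""
    if n_adults < 1:
        raise ValueError("a household needs at least one adult")
    return 1.0 + 0.5 * (n_adults - 1) + 0.3 * n_children

def equivalent_income(household_income, n_adults, n_children):
    """Household income standardized by equivalent consumption units."""
    return household_income / oecd_modified_units(n_adults, n_children)
```

For example, a household of two adults and two children counts as 2.1 equivalent units, so an income of 42,000 corresponds to an equivalent income of 20,000 per unit.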
Most LTM algorithms used in the comparative studies of Sections IV-A and IV-B are specific to discrete or continuous data. Therefore, in this experiment we evaluated the performance of IL and CIL with respect to other methods that can also handle mixed datasets, namely KDE, MSPN, and LCM. The LCM algorithm starts with a latent class model of cardinality 2 and greedily increases its cardinality until the score ceases to improve.
As in the previous experiments, we lacked an expert who could provide us with prior information about the observed variables in the model. Instead, we based our prior information on another study: the Continuous Survey of Spanish Households. This annual survey collects information from more than 100,000 households, including the number of family members, the number of rooms in the house, and its tenure regime. We used this information to set the parameter values of the prior distributions for the corresponding variables in our study. For variables in the living conditions survey that did not have an analogous variable (e.g., equivalent_income, problem_pollution, afford_meal), we set their prior parameters using an empirical Bayes approach. A more detailed explanation of each variable and the specification of its prior parameter values are provided in the supplementary material.
The comparative results of this experiment (both CVPLL scores and execution times) are displayed in Table 5. As in previous studies, we were unable to include the standard deviation values in the table due to space limitations; this information is provided in the supplementary material. Table 5 indicates that even though the MSPN and KDE algorithms were much faster than the rest, their resulting models were unable to correctly represent the underlying probability distribution. Among the proposed methods, in terms of model quality there was a clear underperformance by CIL with α = 1, which coincides with our reasoning on the inherent problems of approximate MI methods. IL and CIL with α = 10 produced almost identical results, although CIL obtained a slightly better average CVPLL. In terms of speed, CIL with α = 1 behaved similarly to previous studies and was the fastest of the three.

Fig. 3 displays the latent forest inferred by CIL with α = 10. We selected this model due to its good performance in terms of score and speed. The learned forest was formed by a single tree with three latent variables. Theoretically, each latent variable represents a soft partition of the data, where the meaning of each partition is determined by the conditional probability distributions of the latent variable's children. To analyze this model and its latent variables, we used Genie (https://www.bayesfusion.com/genie/), a common tool for interpreting BNs. Some of our findings are as follows.
H1 divides households into two groups: those with an income and those without an income. Unsurprisingly, those without an income are more likely to be unable to afford a meal (the probability of being unable to afford a meal increases from 3% to 14%).
H2 divides households into two groups according to the number of rooms in the house: small-to-medium-sized houses (whose conditional Gaussian distribution has a mean value of 4.29 with a variance of 0.69) and large houses (mean value of 6.00 with a variance of 0.00). In addition, H2 relates the size of the house to other aspects, such as the possession of a car and delays in paying house bills. We observed that families with a larger house had a smaller probability of delay (3% versus 7%).
H3 divides households into three groups according to the number of individuals living in the house: single family, couple, and family with children or relatives (more than two family members). H3 also relates family size to the degree of urbanization of the city. We observed that larger families usually preferred less populated areas rather than highly populated areas (48% versus 54%).
There were other interesting aspects of the model that were not directly related to the latent variables. For example, we observed that houses located in highly polluted areas had an increased probability of vandalism (from 16% to 55%) and noise (from 11% to 37%).
The results of this study demonstrate the ability of latent forests to group variables in meaningful ways and extract insightful information. The models generated by CIL with α = 10, as well as those of the remaining methods, can be found in our GitHub repository in XDSL format, which is directly supported by Genie.

V. CONCLUSION
In this paper, we propose an incremental method that, in combination with the VB framework, is able to learn forests of LTMs from data composed of discrete and/or continuous variables. Considering that directly searching this space requires the evaluation of a large number of candidate structures, a constrained variant that only evaluates the most prominent α candidates of each iteration is also developed. As demonstrated by the experimental results, the constrained approach is a valid alternative to the brute-force search method. It returns almost identical results for discrete data experiments and displays only slightly worse performance for continuous data. We believe that these differences are caused by the use of approximate MI estimation, and demonstrate that they can be reduced by increasing α.
Although restricting the structure search to an incremental process limits the space of possible models, our experiments demonstrate that this restriction still leads to competitive results. Furthermore, thanks to this restriction, our proposed method is able to work effectively with both low- and high-dimensional datasets. Other methods (e.g., GEAST) may perform better by considering a larger number of candidate models at each iteration, but their computational complexity makes them infeasible when the number of variables is large. Additionally, in contrast to current LTM methods, by taking advantage of the variational Bayesian framework, we are able to provide a means of incorporating prior knowledge and to produce a superior evaluation of the generalization properties of a model given data.
There are various future research directions. First, inspired by factor analysis, our methods can be modified to work with Gaussian latent variables, where the number of factors can be estimated analogously to the cardinality of discrete latent variables. Second, we can extend parameter estimation by using more flexible variational families [49], [50] and nonconjugate priors [51], at the cost of a more difficult variational optimization problem. Third, we can add more flexibility to the structure search by replacing the arc addition operator with a variational version of Friedman's structural EM [52]. This would allow us to efficiently add and remove arcs without relying on an incremental process. Finally, selecting the single most probable latent forest may not be suitable when we are interested in quantifying our confidence in the forest structure or when we have few data instances. For these cases, Bayesian inference can also be introduced into the search process by averaging over the set of possible latent forests (i.e., by using Bayesian model averaging [53], [54]). To this end, we can define a Markov chain over the space of possible latent forests (given our search operators) and then perform a random walk on this Markov chain.