Model-Based Clustering of Mixed Data With Sparse Dependence

Mixed data refers to a combination of continuous and categorical variables. Clustering mixed data is a long-standing statistical problem. The latent Gaussian mixture model, a model-based approach to this problem, has received attention owing to its simplicity and interpretability. However, such approaches are prone to dimensionality problems: parameters must be estimated for each group, and the number of covariance parameters grows quadratically in the number of variables. To address this, we propose "regClustMD," a novel model-based clustering method that can model sparse dependence among variables. We consider a sparse latent Gaussian mixture model in which the precision matrix between variables has sparse nonzero elements. We propose maximizing a penalized complete log-likelihood using the Monte Carlo expectation-maximization (MCEM) algorithm. Our numerical experiments and real data analyses demonstrate that our method outperforms a counterpart algorithm in both accuracy and failure rate under correlated data structures.


I. INTRODUCTION
In clustering problems, observations are grouped so that members of a group share similar features. These features are commonly observed as mixed continuous and categorical data types (ordinal, nominal, or binary). Mixed-type datasets are prevalent in many applications, for example, finance, marketing, medicine, and the healthcare sciences [1]. Although there is a wealth of clustering approaches, they often face challenges in correctly and simultaneously explaining the correlation structure in large datasets that consist of mixed types of variables. One of the main issues is the choice of the most appropriate distance or model to handle both data types simultaneously. When an acceptable or reasonable model for the cluster structure in the data can be found, model-based clustering provides persuasive results [2]. It is a popular toolkit and a principled statistical approach to clustering. In particular, the Gaussian mixture model is popular for model-based clustering of continuous data [3], [4]. (The associate editor coordinating the review of this manuscript and approving it for publication was Md. Abdur Razzaque.)
Recently, various model-based approaches have been proposed for categorical or mixed data analysis. Gaussian mixture models in the presence of categorical variables were proposed in [5] and [6] for binary data, in [7] for ordinal data, and in [8] for the combination of binary and continuous data. In addition, [9] and [10] considered mixtures of latent trait and factor models, respectively. Reference [11] then proposed a Gaussian mixture model in the latent variable framework ClustMD. ClustMD considers six specific covariance structures for the latent variables. However, these structures assume uncorrelated variables, which may limit their practical use. The adoption of the uncorrelatedness assumption is likely due to the high dimensionality of the parameters when an arbitrary covariance structure is assumed.
In this study, we developed a novel model-based clustering algorithm called "regClustMD" (regularized model-based clustering for mixed data), which can model sparse but arbitrary dependence structures for mixed data while addressing the dimensionality problem. We consider a latent Gaussian mixture model for the observed data and maximize an ℓ1-penalized version of the complete log-likelihood. By imposing the ℓ1 penalty on the inverse covariance matrix, we aim to capture strong partial correlations among variables while reducing the number of active parameters. We propose a Monte Carlo expectation-maximization (MCEM) algorithm to maximize the penalized complete log-likelihood.
One key advantage of regClustMD is its ability to naturally incorporate the dependence structure between variables into the algorithm, even when it varies across different groups. As a result, compared to ClustMD, the proposed regClustMD can better estimate the within-covariance structure of each cluster. The improvement in estimating cluster-specific covariance can enhance the quality of clustering, as we will demonstrate in both our simulation experiments and real data examples.
The remainder of this paper is organized as follows. Section II reviews related work. In Section III, we introduce our probability model, a Gaussian mixture model with latent continuous variables for mixed data. In Section IV, we describe the objective function and the detailed procedure of the regClustMD algorithm. In Section V, we propose a model selection procedure based on the BIC. In Sections VI and VII, we demonstrate the usefulness of the proposed method through simulations and real data examples. Section VIII summarizes this paper.
The code used for the simulation and real data analyses is available at: http://github.com/shahn63/regclustMD.

II. RELATED WORK
A. MODEL-BASED CLUSTERING ALGORITHMS FOR MIXED DATA
The work of McParland and Gormley [11], which proposed a clustering algorithm for mixed continuous, binary, ordinal, and nominal variables, is perhaps the most relevant to our method. It employs a latent variable model in which the observed categorical variables are considered a discretization of continuous latent variables. The latent variables are then assumed to follow a mixture of Gaussian distributions. To reduce the number of parameters, it imposes a diagonal covariance structure on the latent Gaussian variables. A major drawback of this approach is its inability to model the dependence structure between variables.
Several other papers have proposed clustering methods for a narrower scope of mixtures of specific data types. Reference [8] proposed a clustering algorithm for mixed binary and continuous variables, where each binary attribute is generated by a latent continuous variable dichotomized at a suitable threshold value. Reference [5] introduced a latent variable model for binary data, with heterogeneity accounted for by replacing the traditional assumption of Gaussian-distributed factors with a finite mixture of multivariate Gaussians. Reference [6] proposed a mixture of latent trait models with common slope parameters for model-based clustering of high-dimensional binary data. Reference [7] developed a mixture model for ordinal data using a pairwise likelihood approach; they considered the observed categorical variables as a discretization of an underlying finite mixture of Gaussians estimated within the EM framework. Reference [9] assumed a model for the categorical response variables that depends on both a categorical latent class and a continuous latent trait variable; the discrete latent class accommodates the group structure, and the continuous latent trait captures dependence within these groups. Last but not least, [10] presented a latent-variable-based algorithm for a mixture of binary, ordinal, and nominal response data.

B. REGULARIZED MODEL-BASED CLUSTERING ALGORITHMS FOR CONTINUOUS DATA
For continuous-type datasets, the Gaussian mixture model and its variants have been popular for model-based clustering; we refer to [3] and [4] for a comprehensive review. Regularized mixture models have been proposed to deal with the dimensionality problem when the data are of the continuous type only. References [12] and [13] proposed penalizing the inverse covariance matrix in the log-likelihood of the Gaussian mixture model. Several variants exist; for example, the group membership variable can be relaxed from a binary indicator to continuous weights that sum to one [14], [15]. In addition, the data distribution can be extended to multivariate t-mixtures [16]. We refer the reader to [17] for a more comprehensive and extensive review of model selection strategies for Gaussian mixture model-based clustering. However, as mentioned previously, these methods are limited to continuous data.

C. NON-MODEL-BASED CLUSTERING ALGORITHMS FOR MIXED DATA
The extensive literature on non-model-based clustering algorithms encompasses a wide variety of methods, such as K-means, hierarchical clustering, and density-based algorithms, among others [18], [19], [20], [21]. Some of these methods rely on a measure of distance between data values. A common approach for non-model-based clustering of mixed data involves calculating a "generalized distance": a weighted combination of a distance between the continuous variables and a dissimilarity measure for the categorical variables, such as the Gower distance. One can run hierarchical clustering algorithms on mixed data by employing this generalized distance instead of a conventional distance measure. Reference [22] proposed the K-prototype algorithm, a variant of K-means that iteratively calculates modes of temporary clusters. However, one major drawback of these approaches is the difficulty in balancing the weight between the distances for the categorical and continuous variables. These are controlled by a weighting factor which, if not set appropriately, may cause one type of variable to dominate the other, leading to suboptimal clustering.
75946 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
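To make the "generalized distance" idea concrete, the following sketch combines a range-normalized distance over the continuous variables with a simple mismatch proportion over the categorical ones, mixed by a weighting factor. The helper `mixed_distance` and its weighting scheme are our own illustration, not a method from the cited references:

```python
import numpy as np

def mixed_distance(x, y, cont_idx, cat_idx, ranges, w=0.5):
    """Gower-style generalized distance between two mixed-type records.

    cont_idx / cat_idx: positions of continuous / categorical attributes.
    ranges: dict mapping each continuous position to its observed range,
            used to normalize absolute differences to [0, 1].
    w: weighting factor balancing the two parts (a key tuning choice).
    """
    x, y = np.asarray(x, dtype=object), np.asarray(y, dtype=object)
    # Continuous part: mean of range-normalized absolute differences.
    cont = (np.mean([abs(float(x[j]) - float(y[j])) / ranges[j]
                     for j in cont_idx]) if cont_idx else 0.0)
    # Categorical part: proportion of mismatching categories.
    cat = np.mean([x[j] != y[j] for j in cat_idx]) if cat_idx else 0.0
    return w * cont + (1.0 - w) * cat
```

As the surrounding text notes, the result is sensitive to `w`: an ill-chosen weight lets one variable type dominate the distance.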

III. MODEL
We denote the observed record of the i-th subject as y_i = (y_{i1}, …, y_{ip})^T ∈ R^p, i = 1, …, N. We denote the hidden cluster membership by the random variable g_i ∈ {1, …, G}. In addition, we define l_i = (l_{i1}, …, l_{iG})^T as a one-hot representation of group membership, that is, l_{ig} = I(g = g_i), g = 1, …, G.

A. MODELING LATENT VARIABLES
We employ the latent-variable setting described in [11]. For completeness, we define the latent variables as follows. Assume that the j-th variable of the i-th subject, y_{ij}, has an associated latent random variable z_{ij}. We assume that z_{ij} is always continuous, whereas y_{ij} can be continuous, binary, ordinal, or nominal. For each case, we postulate the relationship between y_{ij} and z_{ij} as described below.

1) CONTINUOUS y_{ij}
We assume that the observed y_{ij} exactly matches the latent variable; that is, y_{ij} = z_{ij}.

2) ORDINAL y_{ij}
It is assumed that the observed category y_{ij} is obtained by splitting the univariate z_{ij}. Suppose the j-th variable consists of ordinal categories 1, …, K_j. We assume that there exist K_j − 1 finite threshold values; we include ±∞ as thresholds for notational convenience. Let the partition −∞ = c_{j,0} < c_{j,1} < · · · < c_{j,K_j} = ∞ split the support of z_{ij}. Then we suppose that y_{ij} = k if and only if c_{j,k−1} ≤ z_{ij} < c_{j,k}; in other words, y_{ij} = Σ_{k=1}^{K_j} k · I(c_{j,k−1} ≤ z_{ij} < c_{j,k}). The mean and variance parameters of z_{ij} are unknown and free to change; thus, for identifiability, the threshold values c_{j,0}, c_{j,1}, …, c_{j,K_j} are predetermined. Following the literature [10], [11], we fix c_{j,k} = Φ^{−1}(γ_{j,k}), where γ_{j,k} is the observed cumulative proportion of categories 1, …, k of the j-th variable and Φ is the cumulative distribution function of the standard normal distribution.
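Under the standard-normal identification above, the fixed thresholds are quantiles of Φ evaluated at the observed cumulative proportions. A minimal Python sketch (the helper names are ours):

```python
import numpy as np
from scipy.stats import norm

def ordinal_thresholds(probs):
    """Fixed thresholds c_{j,k} = Phi^{-1}(cumulative proportion through k),
    with -inf and +inf appended for notational convenience."""
    cum = np.cumsum(probs)[:-1]          # drop the final cumulative value 1.0
    return np.concatenate(([-np.inf], norm.ppf(cum), [np.inf]))

def discretize(z, thresholds):
    """Map a latent draw z to its ordinal category k: c_{k-1} <= z < c_k."""
    return int(np.searchsorted(thresholds, z, side='right'))
```

For example, with category proportions (0.25, 0.5, 0.25), the finite thresholds are the first and third quartiles of the standard normal, matching the simulation settings of Section VI.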

3) NOMINAL y_{ij}
Because nominal categories are unordered, a nominal variable requires a multivariate representation. With a slight abuse of notation, we represent a nominal variable y_{ij} with K_j categories by z_{ij} := (z_{ij}^1, …, z_{ij}^{K_j−1})^T ∈ R^{K_j−1} as follows:

y_{ij} = argmax_{l} z_{ij}^l if max_{l} z_{ij}^l ≥ 0, and y_{ij} = K_j otherwise.   (1)

In other words, if at least one z_{ij}^l (l = 1, …, K_j − 1) is nonnegative, then y_{ij} is the index l that maximizes z_{ij}^l; otherwise, y_{ij} takes the reference value K_j.
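The decision rule for a nominal variable can be sketched as follows, returning the 1-based category index (a small illustration with names of our own choosing):

```python
import numpy as np

def nominal_category(z):
    """Map a latent vector z in R^{K-1} to a nominal category in {1, ..., K}:
    the argmax index if the maximum is nonnegative, else the reference K."""
    z = np.asarray(z, dtype=float)
    K = z.size + 1
    return int(np.argmax(z)) + 1 if z.max() >= 0.0 else K
```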

4) BINARY y_{ij}
Binary y_{ij} can be viewed as a special case of both the ordinal and nominal variables with K_j = 2; in this case, the ordinal and nominal formulations coincide.

B. PROBABILITY MODEL AND OBJECTIVE FUNCTION
We impose a normal mixture assumption on the latent space of z_i. Suppose that P(g_i = g) = π_g (equivalently, l_i ∼ Multinom(1, (π_1, …, π_G))), and let Λ_g denote the precision matrix for group g, so that z_i | (g_i = g) ∼ MVN(μ_g, Λ_g^{−1}). Then the log-likelihood for the complete data is

log L_C = Σ_{i=1}^{N} Σ_{g=1}^{G} l_{ig} { log π_g + log N(z_i | μ_g, Λ_g^{−1}) }.

The unknown parameters to be estimated are Θ = {π_g, μ_g, Λ_g : g = 1, …, G}. In [11], Λ_g is strictly restricted to a class of diagonal matrices. We impose only sparsity on Λ_g; that is, Λ_g is assumed to have sparse nonzero off-diagonal elements. This relaxes the diagonal assumption and is advantageous because nonzero off-diagonal elements can capture the dependency between variables. To reflect the sparsity assumption in the estimation procedure, we penalize Λ_g, g = 1, …, G, when maximizing log L_C. If complete data were available, the penalized log-likelihood would be

log L_C − Σ_{g=1}^{G} P_g(Λ_g),

where P_g(Λ_g) is a sparse penalty function for Λ_g, for example, the vectorized ℓ1 norm of Λ_g multiplied by a tuning parameter. Because the observed data D = {y_i}_{i=1}^{N} are incomplete, we instead maximize the expected value of the objective function,

Q(Θ | Θ^{(t)}) = E[ log L_C | D, Θ^{(t)} ] − Σ_{g=1}^{G} P_g(Λ_g),   (2)

where Θ^{(t)} is the t-th update of the MCEM loop. For simplicity, we write Θ = Θ^{(t)} when there is no confusion.
IV. THE regClustMD ALGORITHM
To conveniently refer to the continuous and categorical parts without loss of generality, we write y_i^T = (y_i^{αT}, y_i^{βT}), where y_i^α ∈ R^C and y_i^β ∈ R^{p−C} collect the continuous and categorical variables, respectively. For the continuous part, the construction of the latent space yields y_i^α = z_i^α. For the categorical part, let k_s (s = 1, …, q) be the possible values that y^β can take, and let I_s ⊆ R^{D−C} be the set of values of z^β ∈ R^{D−C} that generate the outcome k_s. We partition the other notation accordingly. To evaluate (2), the required expected values are E(l_{ig} | D, Θ), E(l_{ig} z_i | D, Θ), and E(l_{ig} z_i z_i^T | D, Θ). We define τ_{ig} = E(l_{ig} | D, Θ). With a slight abuse of notation, let N(z | μ, Σ) be the density function of the multivariate normal distribution with mean vector μ and covariance matrix Σ. Assuming that the observed value of y_i^β is k_s, Bayes' rule gives

τ_{ig} ∝ π_g ∫_{I_s} N( (y_i^{αT}, z^{βT})^T | μ_g, Λ_g^{−1} ) dz^β,   (3)

normalized so that Σ_{g=1}^{G} τ_{ig} = 1. The truncated integrals of the multivariate normal densities are evaluated numerically, for example, by the minimax tilting method proposed by [23], implemented in the mvNcdf routine of the R package TruncatedNormal. Next, let

m_{ig} = E(z_i | l_{ig} = 1, D, Θ),   (4)
V_{ig} = E(z_i z_i^T | l_{ig} = 1, D, Θ),   (5)

so that E(l_{ig} z_i | D, Θ) = τ_{ig} m_{ig} and E(l_{ig} z_i z_i^T | D, Θ) = τ_{ig} V_{ig}. Note that, over the categorical coordinates, m_{ig} and V_{ig} are the first and second moments of a truncated multivariate normal distribution. Their calculation is implemented in the mtmvnorm function of the R package tmvtnorm; we refer the reader to [24] for the closed-form formula of the moments.
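The conditional moments m_ig and V_ig can be approximated by simple rejection sampling from the truncated region, as in this Python sketch. The paper instead uses the minimax-tilting sampler of [23] and closed-form moments [24]; `truncated_mvn_moments` is our illustrative stand-in, practical only when the acceptance probability of the region is not too small:

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_mvn_moments(mu, Sigma, lower, upper, n=100_000):
    """Monte Carlo first and second moments of a multivariate normal
    restricted to the box region [lower, upper] (rejection sampling)."""
    z = rng.multivariate_normal(mu, Sigma, size=n)
    keep = np.all((z >= lower) & (z <= upper), axis=1)
    zk = z[keep]
    m = zk.mean(axis=0)              # approx E[z | z in region]
    V = zk.T @ zk / zk.shape[0]      # approx E[z z^T | z in region]
    return m, V, keep.mean()         # last value: approx P(z in region)
```

For a standard bivariate normal truncated to the half-space {z_1 ≥ 0}, this recovers the known half-normal mean E[z_1 | z_1 ≥ 0] = √(2/π) ≈ 0.798 and acceptance probability 1/2.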
Let τ_{ig}^{(t)}, m_{ig}^{(t)}, and V_{ig}^{(t)} be the values resulting from (3), (4), and (5), respectively, when the given parameter is Θ^{(t)}, for g = 1, …, G. In each M-step, we propose to encourage sparsity on the Λ_g's by maximizing the ℓ1-penalized expected complete log-likelihood, that is,

Q(Θ | Θ^{(t)}) = Σ_{i=1}^{N} Σ_{g=1}^{G} { τ_{ig}^{(t)} log π_g + E[ l_{ig} log N(z_i | μ_g, Λ_g^{−1}) | D, Θ^{(t)} ] } − λ Σ_{g=1}^{G} N_g^{(t)} |Λ_g|.   (6)

Here, |A| = Σ_{i≥j} |a_{ij}| is the vectorized ℓ1 norm of a matrix A = (a_{ij}) and λ ≥ 0 is a tuning parameter. For each group g, we weight the penalty λ|Λ_g| by N_g^{(t)} = Σ_{i=1}^{N} τ_{ig}^{(t)} to ensure that the parameters of different groups are penalized by the same relative amount.
We now derive Θ^{(t+1)} := argmax_Θ Q(Θ | Θ^{(t)}). It is noteworthy that (6) is separable over g, and for each g, the maximization over π_g and μ_g is independent of the maximization over Λ_g. It is straightforward to derive

π_g^{(t+1)} = N_g^{(t)} / N,   (7)

and

μ_g^{(t+1)} = (1 / N_g^{(t)}) Σ_{i=1}^{N} τ_{ig}^{(t)} m_{ig}^{(t)}.   (8)

To update Λ_g, the maximization of (6) becomes a weighted version of the graphical lasso problem. To be precise,

Λ_g^{(t+1)} = argmax_{Λ ≻ 0} { log det Λ − tr( S_g^{(t+1)} Λ ) − λ |Λ| },   (9)

where S_g^{(t+1)} = (1 / N_g^{(t)}) Σ_{i=1}^{N} τ_{ig}^{(t)} ( V_{ig}^{(t)} − m_{ig}^{(t)} μ_g^{(t+1)T} − μ_g^{(t+1)} m_{ig}^{(t)T} + μ_g^{(t+1)} μ_g^{(t+1)T} ). To solve (9), we can use off-the-shelf statistical software, for example, the graphical lasso [25] or QUIC [26].
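Given latent draws, the precision update amounts to a graphical-lasso fit on a weighted scatter matrix. A minimal Python sketch using scikit-learn's `graphical_lasso` solver (the weighting scheme and helper name are our own; the paper works with the truncated moments rather than raw draws):

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def m_step_precision(z, tau_g, mu_g, lam):
    """One group's precision update: form the tau-weighted scatter matrix
    around mu_g, then solve the l1-penalized precision problem (9)."""
    w = tau_g / tau_g.sum()               # normalized responsibilities
    d = z - mu_g                          # centered latent draws
    S = (w[:, None] * d).T @ d            # weighted covariance S_g
    cov, prec = graphical_lasso(S, alpha=lam)
    return prec
```

With independent data and a moderate penalty, the estimated precision matrix is driven to a (sparse) diagonal form, which is the behavior the ℓ1 penalty is meant to induce.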

Remark. To encourage sparsity, one may consider other penalty functions, such as the smoothly clipped absolute deviation (SCAD) penalty or the minimax concave penalty (MCP). An advantage of the ℓ1 penalty is that each M-step becomes a concave maximization problem, which facilitates the scalability of our proposed algorithm.
Remark. The per-iteration time complexity of the algorithm is dominated by the Monte Carlo sampling in the E-step and the precision matrix estimation in the M-step; with M Monte Carlo samples, it is O(NM(D − C) + min(GD^3, GND^2)). Compared to ClustMD, the benchmark algorithm in the simulation and real data analysis sections, our algorithm incurs an additional cost of O(min(GD^3, GND^2)) due to the estimation of the precision matrices.

V. MODEL SELECTION
The regClustMD procedure requires tuning G and λ, which determine the number of clusters and the sparsity of the estimated precision matrices, respectively.
We consider BIC-based model selection, which is widely employed in model-based clustering [3], [11], [13], [27]. Let Θ̂ = Θ̂(G, λ) be the estimate of Θ given G and λ. In EM-based procedures, the BIC value is twice the expected negative complete log-likelihood plus DF · log(N). Similarly, we propose the BIC value

BIC(G, λ) = −2 E[ log L_C | D, Θ̂ ] + DF · log(N),

where, as in the sparse estimation literature [28], DF is the number of nonzero elements of the estimated parameters; that is, letting Θ̂ = {π̂_g, μ̂_g, Λ̂_g}_{g=1}^{G},

DF = (G − 1) + G·D + Σ_{g=1}^{G} #{ (i, j) : i ≥ j, [Λ̂_g]_{ij} ≠ 0 },

where [Λ̂_g]_{ij} is the (i, j)-th element of Λ̂_g. One advantage of our method is that it directly maximizes the expected complete likelihood, so no additional approximation procedure is required, unlike in [11]. Finally, we select the G and λ that minimize BIC(G, λ).
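The BIC computation reduces to counting nonzero parameters. A sketch under one plausible reading of the DF definition (the paper's exact accounting may differ):

```python
import numpy as np

def regclustmd_bic(exp_cloglik, pis, mus, precisions, N):
    """BIC = -2 * (expected complete log-likelihood) + DF * log(N), where DF
    counts G-1 free mixing proportions, G*D mean parameters, and the nonzero
    lower-triangular (incl. diagonal) entries of each precision estimate."""
    G = len(pis)
    D = mus[0].shape[0]
    df = (G - 1) + G * D
    for P in precisions:
        df += int(np.count_nonzero(np.tril(P)))   # i >= j entries only
    return -2.0 * exp_cloglik + df * np.log(N)
```

Sparser solutions (larger λ) reduce DF and hence the penalty term, so the criterion trades data fit against the number of active dependence parameters.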

VI. NUMERICAL STUDY
A. EXPERIMENTAL SETTINGS
To assess the performance of the proposed method, we compared the clustering accuracy rate and the number of failures.
Our model was compared with ClustMD [11], a latent Gaussian mixture model approach that does not allow for association between variables. Although BIC is known to identify the true model consistently across a range of applications (Section V), the number of clusters was fixed at its true value of two (G = 2), so that the selection of the number of clusters would not confound the performance comparison. With G = 2, the clusters produced by each method were relabeled to best match the true memberships. We considered 100 replications, each consisting of N = 100 subjects with p = 10 variables drawn from a 2-cluster model with mixing probabilities π_1 = π_2 = 0.5. Seven variables were continuous, two were ordinal with two and three levels, and one was nominal with three levels. The three categorical variables were obtained by categorizing part of the latent z_{ij}. As in Section III, we used the median as the threshold for the binary variable and the first and third quartiles for the ordinal variable with three levels. For the nominal variable, we generated K_j = 3 levels from a (K_j − 1) = 2-dimensional latent continuous variable with threshold 0 using (1).
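The categorization step for one cluster can be sketched as follows, using the median threshold for a binary variable and quartile thresholds for a three-level ordinal variable (the small dimension layout here is our own simplification of the experimental design):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def draw_mixed(n, mu, Sigma):
    """Draw latent Gaussians and categorize the last two coordinates:
    columns 0-1 stay continuous, column 2 becomes binary (median split),
    column 3 becomes a 3-level ordinal (quartile splits)."""
    z = rng.multivariate_normal(mu, Sigma, size=n)
    y_cont = z[:, :2]                                        # y = z
    y_bin = (z[:, 2] >= 0.0).astype(int) + 1                 # categories 1, 2
    cuts = norm.ppf([0.25, 0.75])                            # quartile thresholds
    y_ord = np.searchsorted(cuts, z[:, 3], side='right') + 1  # categories 1-3
    return y_cont, y_bin, y_ord
```

For a standard normal latent column, these thresholds put roughly half the draws in each binary level and a 25/50/25 split across the ordinal levels.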
Fixing G = 2 and π_1 = 0.5, we generated 50 subjects for the first cluster from MVN(μ_1, Λ_1^{−1}) with mean vector μ_1 and covariance matrix Λ_1^{−1}, and similarly 50 subjects for the second cluster from MVN(μ_2, Λ_2^{−1}). For simplicity, we fixed the mean vectors of the mixture components as μ_1 = 0_p and μ_2 ∈ {1_p/2, 1_p}, where 1_p denotes the p-dimensional vector of ones; the norm of μ_2 determines the separability of the two clusters. Finally, we considered two sparse covariance structures:
• AR(1) model: the precision matrices of both clusters follow an AR(1) structure with parameter ρ, with off-diagonal entries equal to zero outside the AR(1) band.
• Random model: the precision matrices are generated by adding random noise ϵ_{ij} to the off-diagonal elements, where, for g = 1, 2, ϵ_{ij} ∈ [−1, 1] is drawn uniformly and independently but fixed over replications.
Here, ρ ∈ {0, 0.2, 0.4, 0.6, 0.8} controls the dependency and sparsity of the covariance matrix; a larger ρ indicates stronger dependence under both the AR(1) and Random models. The covariance matrices of both models were scaled to have unit diagonal elements. All analyses were conducted using R version 4.0.2 [29].

B. RESULTS
Tables 1 and 2 show the results across the association levels ρ under the AR(1) and Random models, with μ_2 = 1_p/2 and 1_p corresponding to medium and large separability, respectively. We compared the two methods in terms of the average clustering accuracy and the failure rate over the 100 replications. In terms of clustering accuracy, regClustMD tended to become more accurate as the dependency between variables grew stronger (larger ρ). Conversely, as expected, ClustMD tended to lose accuracy in most settings as ρ increased. We hypothesize that the inefficiency of ClustMD for larger ρ arises because ClustMD assumes an independent covariance structure and does not account for dependency.
When comparing the two methods in each setting, the proposed regClustMD was more accurate than ClustMD in most cases, except in the independent covariance structure case (ρ = 0), which favors ClustMD. In terms of the simulation failure rate, the proposed regClustMD outperformed ClustMD, which cannot estimate correlated data structures (e.g., with ρ = 0.8 under the Random model, ClustMD's failure rates are 23% in Table 1 and 43% in Table 2).

C. SENSITIVITY ANALYSIS AGAINST THE CHOICE OF EXPERIMENTAL PARAMETERS
We evaluated the proposed method across different experimental parameters: the mixing probability π_1, the number of groups G, the sample size n, and the number of variables p. We varied one parameter at a time while fixing the others as in Table 2.

1) MIXING PROBABILITY
Table 3 shows the results with mixing probabilities π_1 = 0.75 and π_2 = 0.25, compared to Table 2 with π_1 = π_2 = 0.5. In terms of both clustering accuracy and failure rate, the proposed regClustMD consistently outperformed ClustMD, although the accuracy levels were lower than the corresponding cases in Table 2.

2) NUMBER OF GROUPS
Table 4 presents the results with larger numbers of groups, G = 3 and 4, compared to Table 2 with G = 2. We again fixed the mean vectors μ_1 = 0_p and μ_2 = 1_p and considered μ_3 = (1_{p−q}^T, −1_q^T)^T and μ_4 = −1_p, where q denotes the number of categorical variables. As the number of groups increased with a fixed sample size n = 100, both methods tended to show decreased accuracy and an increased failure rate. In particular, ClustMD failed in most cases with G = 4, whereas regClustMD consistently outperformed ClustMD with a small failure rate for both G = 3 and 4.

3) SAMPLE SIZE
Table 5 illustrates the results with different sample sizes, n = 50 and 200, compared to Table 2 with n = 100. As the sample size grew, accuracy also increased, except for ClustMD under the Random model. The proposed method consistently outperformed ClustMD, and the gap in accuracy between the two methods tended to widen as n increased under the Random model, which requires estimating a more sophisticated correlated data structure.

4) NUMBER OF VARIABLES
Table 6 illustrates the results with different numbers of variables, p = 5 and 20, compared to Table 2 with p = 10. Accuracy was positively correlated with the number of variables p. The results were consistent in accuracy, and the differences between the two methods were negligible for the large p = 20, where both methods performed almost perfectly.

VII. DATA EXAMPLE
A. PROSTATE CANCER DATA
First, we considered the prostate cancer data introduced by [30]. The data consist of 12 mixed-type variables: eight continuous variables (Age, Diastolic blood pressure, Index of tumor stage and histologic grade, Serum hemoglobin, Serum prostatic acid phosphatase, Size of the primary tumor, Systolic blood pressure, Weight), three ordinal variables (Bone metastases, Cardiovascular disease history, Performance rating), and one nominal variable (Electrocardiogram code). We preprocessed the 12 variables as in [11]. Exploration of the correlation structure revealed dependency across variables; for example, the absolute values of the correlation coefficients ranged from 0.011 to 0.797, and the entries of the inverse covariance matrix of the data ranged from 0.000 to 1.461.
As in Section VI, we fitted the proposed method and ClustMD with BIC-based model selection. A line plot of the BIC values estimated using our proposed method (the left panel of Figure 1) reveals that the two-cluster model (G = 2) attains the minimum BIC. The right panel of Figure 1 shows the estimated group means for regClustMD with G = 2. The patients in group 2 have a larger primary tumor size, higher serum prostatic acid phosphatase levels, a higher index of tumor stage and histologic grade, and lower serum hemoglobin levels than those in group 1.
In addition, we computed validation indices widely used in cluster analysis [31]. Specifically, we calculated eight validation indices measuring cluster quality (C, Dunn, Gamma, Gplus, Mcclain, Ptbiserial, Silhouette, Tau) for the ClustMD solution with three clusters (selected by the BIC) and the regClustMD solution with two clusters. Table 7 summarizes the clustering results of each method in terms of the eight cluster validity indices. Six of the eight performance measures indicate that our proposed method with two clusters fits better than ClustMD with three clusters.

B. AUSTRALIAN INSTITUTE OF SPORTS DATA
We also considered the Australian Institute of Sport (AIS) data [32]. The data contain 202 observations on 13 mixed-type variables. The variables consist of nine continuous variables (Red blood cell count, White blood cell count, Hematocrit, Hemoglobin concentration, Plasma ferritin, Body mass index, Sum of skin folds, Percent body fat, Lean body mass), one binary variable (Sex), and one nominal variable (Sport). We recategorized the 'Sport' variable into three categories for computational efficiency: Basketball, Netball, and Tennis formed the first group; Water polo and Rowing the second; and the remaining nine sports the third. The 13 variables exhibited pairwise dependency, with the absolute values of the correlation coefficients ranging from 0.080 to 0.964.
For our proposed method, the two-cluster model attained the minimum BIC (Figure 2). The right panel of Figure 2 shows that the subjects in group 2 have higher lean body mass, body mass index, plasma ferritin, hemoglobin concentration, hematocrit, white blood cell count, and red blood cell count; lower percent body fat and sum of skin folds; and are more likely to be female than those in group 1.
We again computed the eight validation indices measuring clustering quality to compare the ClustMD result with three clusters (also selected by the BIC) with the regClustMD result with two clusters. As summarized in Table 8, five of the eight indices show that the proposed regClustMD with two clusters outperforms ClustMD with three clusters.

VIII. CONCLUDING REMARKS
It is well known that modeling the correlation structure between variables improves clustering results, and real data are typically correlated regardless of the data type. In this study, we proposed the regClustMD algorithm, a model-based clustering method for mixed data that can accommodate sparse dependence between variables. Our probability model postulates that categorical variables are generated by categorizing latent continuous variables, and we considered a sparse latent Gaussian mixture model on this latent space. Through simulation studies and real data analyses, we showed that regClustMD outperformed an existing approach that does not model dependence.
Our study adopted the ℓ1 penalty for implementational simplicity. However, the ℓ1 penalty generally produces biased estimates, and nonconvex penalties such as SCAD and MCP may help mitigate this bias. Another future direction is to accelerate the proposed algorithm through parallelization, because many of its computational routines can be executed in parallel. Finally, since our method relies on a Gaussian assumption, it may be susceptible to outliers. One systematic approach to handling outliers is to develop a latent t-mixture model suitable for mixed data; an example of such a model is presented in [16], where a multivariate t-mixture model is proposed for clustering continuous-type data.