Loosely speaking, clustering refers to the problem of grouping data, and plays a central role in statistical machine learning and data analysis. One way to formalize the clustering problem is in terms of a mixture model, where the mixture components represent clusters within the data. We consider a semi-parametric formulation, in which a random vector $X \in \mathbb{R}^d$ is modeled by the distribution

$$F_X(x) = \sum_{\alpha=1}^{m} \omega_\alpha F_\alpha(x - \mu_\alpha), \qquad (1)$$

with $m$ components. Here $\omega_\alpha \in (0, 1)$ is the weight on mixture component $\alpha$. The mean vectors $\mu_\alpha \in \mathbb{R}^d$ are the (parametric) component of interest, whereas the dispersion distributions $F_\alpha$ are a non-parametric nuisance component, on which we impose only tail conditions (e.g., sub-Gaussian or sub-exponential tail decay).

Given $n$ independent and identically distributed samples from the mixture model (1), we consider the problem of "learning" the mixture. More formally, for parameters $(\delta, \epsilon) \in (0, 1) \times (0, 1)$, we say that a method $(\epsilon, \delta)$-learns the mixture if it correctly determines the mixture label of all $n$ samples with probability greater than $1 - \delta$, and estimates the mean vectors to accuracy $\epsilon$ with high probability.

This conference abstract provides an overview of the results in our full-length paper. We derive both necessary and sufficient conditions on the scaling of the sample size $n$ as a function of the ambient dimension $d$, the minimum separation $r(d) = \min_{\alpha \neq \beta} \|\mu_\alpha - \mu_\beta\|_2$ between mixture components, and the tail decay parameters. All of our analysis is high-dimensional in nature, meaning that we allow the sample size $n$, the ambient dimension $d$, and other parameters to scale in arbitrary ways. Our necessary conditions are information-theoretic in nature, and provide lower bounds on the performance of any algorithm, regardless of its computational complexity. Our sufficient conditions are based on analyzing a particular form of spectral clustering.
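To make the setup concrete, the following NumPy sketch draws samples from a mixture of the form (1) and computes the minimum separation $r(d)$. All specific parameter values (two components, $d = 10$, equal weights, standard Gaussian dispersion) are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance of model (1): m = 2 components in d = 10 dimensions.
d, m, n = 10, 2, 500
weights = np.array([0.5, 0.5])               # mixture weights omega_alpha in (0, 1)
means = np.stack([np.ones(d), -np.ones(d)])  # mean vectors mu_alpha in R^d

# Draw a component label for each sample, then add sub-Gaussian dispersion
# (here simply standard normal noise, one admissible choice of F_alpha).
labels = rng.choice(m, size=n, p=weights)
X = means[labels] + rng.standard_normal((n, d))

# Minimum separation r(d) = min_{alpha != beta} ||mu_alpha - mu_beta||_2.
r = min(np.linalg.norm(means[a] - means[b])
        for a in range(m) for b in range(m) if a != b)
```

In this instance $r(d) = 2\sqrt{d}$, so the separation grows with the dimension; the paper's scaling conditions relate $n$, $d$, and $r(d)$ rather than fixing them as done here.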
For mixture models without any constraints on the mean vectors $\mu_\alpha$, we show that standard spectral clustering, based on sample means and covariance matrices, can achieve the information-theoretic limits. We also analyze mixture models in which the mean vectors are "sparse", and derive information-theoretic lower bounds for this setting. For such models, spectral clustering based on sample means and covariances is highly sub-optimal, but modified spectral clustering algorithms using thresholding estimators are nearly optimal.
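As a rough illustration of the standard (non-thresholded) spectral approach, the sketch below projects centered samples onto the leading eigenvector of the sample covariance and splits them by the sign of the projection. This is a minimal two-component caricature under an assumed well-separated Gaussian instance, not the algorithm analyzed in the paper.

```python
import numpy as np

def spectral_cluster_two(X):
    """Sketch of standard spectral clustering for two components: project the
    centered samples onto the top eigenvector of the sample covariance, then
    split by the sign of the projection."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(X)             # sample covariance matrix
    _, eigvecs = np.linalg.eigh(cov)     # eigenvectors in ascending eigenvalue order
    top = eigvecs[:, -1]                 # leading eigenvector
    return (Xc @ top > 0).astype(int)

# Hypothetical well-separated instance to exercise the sketch.
rng = np.random.default_rng(1)
d, n = 20, 400
labels = rng.integers(0, 2, size=n)
means = np.where(labels[:, None] == 1, 5.0, -5.0) * np.ones(d) / np.sqrt(d)
X = means + rng.standard_normal((n, d))

pred = spectral_cluster_two(X)
# Clustering accuracy up to a global permutation of the two labels.
acc = max(np.mean(pred == labels), np.mean(pred != labels))
```

For sparse mean vectors, the modified algorithms in the paper would instead threshold the empirical quantities before the eigendecomposition, suppressing the noise contributed by the many zero coordinates; that variant is not shown here.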