
Spectral clustering in high-dimensions: Necessary and sufficient conditions for dense and sparse mixtures


Author: Martin J. Wainwright, Department of Statistics and Department of Electrical Engineering and Computer Science, UC Berkeley, CA 94720, USA

Loosely speaking, clustering refers to the problem of grouping data, and plays a central role in statistical machine learning and data analysis. One way in which to formalize the clustering problem is in terms of a mixture model, where the mixture components represent clusters within the data. We consider a semi-parametric formulation, in which a random vector X ∈ ℝ^d is modeled by a distribution with m components,

F_X(x) = Σ_{α=1}^{m} ω_α F_α(x − μ_α).   (1)

Here ω_α ∈ (0, 1) is the weight on mixture component α. The mean vectors μ_α ∈ ℝ^d are the (parametric) component of interest, whereas the dispersion distributions F_α are a non-parametric nuisance component, on which we impose only tail conditions (e.g., sub-Gaussian or sub-exponential tail decay). Given n independent and identically distributed samples from the mixture model (1), we consider the problem of "learning" the mixture. More formally, for parameters (δ, ε) ∈ (0, 1) × (0, 1), we say that a method (ε, δ)-learns the mixture if it correctly determines the mixture label of all n samples with probability greater than 1 − δ, and estimates the mean vectors to accuracy ε with high probability. This conference abstract provides an overview of the results in our full-length paper. We derive both necessary and sufficient conditions on the scaling of the sample size n as a function of the ambient dimension d, the minimum separation r(d) = min_{α≠β} ‖μ_α − μ_β‖₂ between mixture components, and tail decay parameters. All of our analysis is high-dimensional in nature, meaning that we allow the sample size n, ambient dimension d, and other parameters to scale in arbitrary ways. Our necessary conditions are information-theoretic in nature, and provide lower bounds on the performance of any algorithm, regardless of its computational complexity. Our sufficient conditions are based on analyzing a particular form of spectral clustering.
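As a concrete (and entirely hypothetical) instance of the semi-parametric model (1), the sketch below draws samples from a three-component mixture with non-Gaussian but sub-Gaussian dispersion, and computes the minimum separation r(d). The weights, means, and noise distribution here are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical instance of model (1): m = 3 components in R^d with
# weights omega_alpha, means mu_alpha, and a dispersion F_alpha that is
# sub-Gaussian but non-Gaussian (here: uniform noise on [-1, 1]^d).
d, m, n = 10, 3, 200
omega = np.array([0.5, 0.3, 0.2])          # mixture weights, sum to 1
mu = rng.standard_normal((m, d)) * 5.0     # component means mu_alpha

labels = rng.choice(m, size=n, p=omega)    # hidden component label alpha
noise = rng.uniform(-1.0, 1.0, size=(n, d))  # draw from dispersion F_alpha
X = mu[labels] + noise                     # samples distributed as F_X in (1)

# Minimum separation r(d) = min over alpha != beta of ||mu_alpha - mu_beta||_2
r = min(
    np.linalg.norm(mu[a] - mu[b])
    for a in range(m) for b in range(a + 1, m)
)
print(f"samples: {X.shape}, minimum separation r(d) = {r:.2f}")
```

The learning task described above is to recover all n hidden `labels` and estimate the rows of `mu` from `X` alone.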
For mixture models without any constraints on the mean vectors μ_α, we show that standard spectral clustering (that is, clustering based on sample means and covariance matrices) can achieve the information-theoretic limits. We also analyze mixture models in which the mean vectors are "sparse", and derive information-theoretic lower bounds. For such models, spectral clustering based on sample means/covariances is highly sub-optimal, but modified spectral clustering algorithms using thresholding estimators are nearly optimal.
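The standard spectral approach referred to above can be sketched on a toy two-component mixture: the leading eigenvector of the sample covariance aligns with the mean-difference direction when the separation is large relative to the noise, so clustering by the sign of the projection recovers the labels. This is an illustrative simplification under Gaussian noise, not the exact algorithm or conditions analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-component mixture in R^d with means +mu and -mu and unit
# Gaussian dispersion; separation r = ||2 mu||_2 = 6 here (illustrative).
d, n = 50, 500
mu = np.zeros(d)
mu[0] = 3.0
labels = rng.integers(0, 2, size=n)                    # hidden labels
X = (2 * labels[:, None] - 1) * mu + rng.standard_normal((n, d))

# Spectral step: the sample covariance has a spike along the
# mean-difference direction; its leading eigenvector estimates it.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / n
eigvals, eigvecs = np.linalg.eigh(cov)                 # ascending order
v = eigvecs[:, -1]                                     # leading eigenvector

# Cluster by the sign of the projection onto v.
pred = (Xc @ v > 0).astype(int)

# Accuracy up to the global label permutation (v's sign is arbitrary).
acc = max(np.mean(pred == labels), np.mean(pred != labels))
print(f"clustering accuracy: {acc:.2f}")
```

In the sparse-means setting discussed above, one would first threshold the entries of the sample means/covariance before the eigenvector step, which is the role of the thresholding estimators in the modified algorithms.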

Published in:

2008 46th Annual Allerton Conference on Communication, Control, and Computing

Date of Conference:

23-26 Sept. 2008