|
This last module in the series discusses just one approach to the interesting and important problem of clustering in very large (VL) data. The target audience is graduate students majoring in engineering and science, together with practicing engineers and scientists interested in either research on, or applications of, clustering in very large real-world problems that occur in data mining, image analysis and bioinformatics. Almost none of the subject matter in this course is available in textbooks; almost all of it is the object of my own current research, and as such it reflects my own biases, prejudices, background and interests. I have supplied references that contain pointers to many nice papers on these topics that use related or competing methods proposed and studied by others.
I begin with a characterization of VL data. For me, this means any data set that you cannot load into your computer. This is not an objective definition, but it is easy to understand and practical: whatever computer you use, there is a data set too big for it, and that data set is VL for you. There are two main approaches to clustering in VL data: distributed clustering, and progressive sampling followed by extension. I discuss the first approach briefly, but it seems much more difficult to me than the second.
Next, I define progressive sampling followed by (non-iterative) extension. This idea is pretty general: it can accelerate most (but not all) iterative algorithms that estimate parameters with loadable data (this is true for both clustering and classifier design), and it provides a means of approximating the outputs of many algorithms for unloadable data. So, one of the main points of this third course is to establish the basic ideas of progressive sampling and extension.
The method of clustering in VL data by (sampling + extension) is developed and illustrated with four clustering algorithms: (i) extended fast fuzzy c-means (eFFCM) for segmentation of VL images; (ii) generalized fast fuzzy c-means (geFFCM) for clustering in VL object data (VL sets of feature vectors in p dimensions); (iii) generalized fast expectation maximization (geFEM) for clustering by Gaussian mixture decomposition in VL object data; and (iv) extended non-Euclidean relational fuzzy c-means (eNERF) for clustering in VL (square) relational data. These four methods are presented in the spirit of active research: parts of them clearly need improvement and more testing, and I expect much of this material to be replaced by better approaches as our understanding of clustering by sampling and extension matures.
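The (sampling + extension) idea can be illustrated with a minimal Python sketch. It is not eFFCM, geFFCM, geFEM or eNERF: it assumes ordinary fuzzy c-means with Euclidean distance, and it replaces progressive sampling (which would grow the sample until it passes some acceptance test) with a single simple random sample of fixed size. All function names and parameters here are hypothetical, chosen only for illustration.

```python
import numpy as np


def memberships(X, V, m=2.0):
    """Fuzzy c-means memberships of the rows of X with respect to centers V."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # squared distances
    d2 = np.maximum(d2, 1e-12)            # guard against division by zero
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)  # each row sums to 1


def fcm(X, c, m=2.0, iters=100, tol=1e-5, seed=0):
    """Plain (literal) fuzzy c-means on loadable data; returns the c centers."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)]  # initial centers
    for _ in range(iters):
        W = memberships(X, V, m) ** m
        V_new = (W.T @ X) / W.sum(axis=0)[:, None]    # weighted means
        shift = np.linalg.norm(V_new - V)
        V = V_new
        if shift < tol:
            break
    return V


def extend(chunks, V, m=2.0):
    """Non-iterative extension: one pass over the data, one chunk at a time."""
    for chunk in chunks:
        yield memberships(chunk, V, m)


def make_demo_data(n=2000, seed=1):
    """Two Gaussian blobs standing in for a data set too big to load."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=(n // 2, 2))
    b = rng.normal(10.0, 1.0, size=(n // 2, 2))
    return rng.permutation(np.vstack([a, b]))


if __name__ == "__main__":
    X = make_demo_data()
    rng = np.random.default_rng(2)
    # Assumption: one simple random sample instead of progressive sampling.
    sample = X[rng.choice(len(X), size=200, replace=False)]
    V = fcm(sample, c=2)                   # iterate only on the sample
    chunks = np.array_split(X, 10)         # pretend only one chunk fits in memory
    labels = np.concatenate([U.argmax(axis=1) for U in extend(chunks, V)])
    print("centers estimated from the sample:\n", V)
    print("cluster sizes after extension:", np.bincount(labels, minlength=2))
```

The iterative work touches only the sample; every other point is visited exactly once by the closed-form membership formula, which is what makes the extension step non-iterative.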
|