This course, the first in a series of three, provides a foundation for understanding the field of cluster analysis in unlabeled data. The target audience comprises undergraduate and graduate students majoring in engineering and science, as well as practicing engineers and scientists interested in either research on clustering or its application to real-world problems such as data mining, image analysis and bioinformatics. The subject matter is widely available in a number of standard textbooks, which are listed in the references below. The course begins with a discussion of the general nature of clustering. Three problems are identified: tendency assessment, partitioning and validation. Two types of data are discussed: object vector data and pairwise relational data. Next, I develop the mathematical structure needed to carry out clustering algorithms, discussing the notions of similarity, label vectors, partition matrices (U) and point prototypes (V). The second part of the course contains a description (with pseudocode) of one algorithm from each of the four major categories of clustering methods. Specifically, I discuss each of the following and illustrate it with a numerical example: (i) the U only model for single linkage clustering; (ii) the V only model for clustering with Kohonen's self-organizing map; (iii) the (U,V) model for clustering with the hard and fuzzy c-means models; and (iv) the (U,V,+) model for clustering using the expectation-maximization algorithm for Gaussian mixture decomposition.
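
To make the (U,V) notation concrete, here is a minimal sketch of hard c-means in Python. It is an illustration under stated assumptions, not the pseudocode presented in the course: the function name `hard_c_means`, the c-by-n layout of U (one row per cluster, one column per object), the squared Euclidean distance, the caller-supplied initial prototypes, and the stopping rule (stop when U no longer changes) are all choices made for this example.

```python
def hard_c_means(X, V, max_iter=100):
    """Alternate between updating the partition matrix U and the prototypes V.

    X: list of n object vectors. V: list of c initial prototypes (assumed given).
    Returns U as a c x n matrix of 0/1 memberships and the final prototypes V.
    """
    c, n = len(V), len(X)
    V = [list(v) for v in V]
    U = None
    for _ in range(max_iter):
        # U update: each object gets membership 1 in the cluster of its
        # nearest prototype (squared Euclidean distance) and 0 elsewhere.
        new_U = [[0] * n for _ in range(c)]
        for k, x in enumerate(X):
            dists = [sum((a - b) ** 2 for a, b in zip(x, v)) for v in V]
            new_U[dists.index(min(dists))][k] = 1
        if new_U == U:  # partition unchanged, so the iteration has settled
            break
        U = new_U
        # V update: each prototype becomes the mean of the objects assigned to it.
        # An empty cluster keeps its previous prototype (an assumption of this sketch).
        for i in range(c):
            members = [X[k] for k in range(n) if U[i][k] == 1]
            if members:
                V[i] = [sum(col) / len(members) for col in zip(*members)]
    return U, V


# Toy data: two well-separated pairs of points in the plane.
X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
U, V = hard_c_means(X, V=[(0.0, 0.0), (5.0, 5.0)])

# A label vector is the same partition written as one cluster index per object.
labels = [[U[i][k] for i in range(len(U))].index(1) for k in range(len(X))]
```

Each column of U contains exactly one 1, which is what makes the partition hard (crisp). The fuzzy c-means model relaxes this so that memberships lie in [0, 1] and each column sums to 1.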