Cluster Analysis

Cluster Analysis, also called data segmentation, has a variety of goals that all relate to grouping or segmenting a collection of objects (i.e., observations, individuals, cases, or data rows) into subsets or clusters. These clusters are grouped in such a way that the observations included in each cluster are more closely related to one another than objects assigned to different clusters. The most important goal of cluster analysis is the notion of the degree of similarity (or dissimilarity) between the individual objects being clustered. There are two major methods of clustering: hierarchical clustering and k-means clustering.

This chapter explains the k-Means Clustering algorithm. The goal of this process is to divide the data into a set number of clusters (k), and to assign each record to a cluster while minimizing the distribution within each cluster. A non-hierarchical approach to forming good clusters is to specify a desired number of clusters, say, k, then assign each case (object) to one of k clusters to minimize a measure of dispersion within the clusters. A very common measure is the sum of distances or sum of squared Euclidean distances from the mean of each cluster. The problem can be set up as an integer programming problem but because solving integer programs with a large number of variables is time consuming, clusters are often computed using a fast, heuristic method that generally produces good (but not optimal) solutions. The k-means algorithm is one such method.