On the XLMiner ribbon, from the Applying Your Model tab, select Help - Examples, then Forecasting/Data Mining Examples, and open the example file Wine.xlsx. As shown in the figure below, each row in this example data set represents a sample of wine taken from one of three wineries (A, B, or C). In this example, the Type variable representing the winery is ignored, and the clustering is performed simply on the basis of the properties of the wine samples (the remaining variables).
Select a cell within the data set, and then on the XLMiner ribbon, from the Data Analysis tab, select XLMiner - Cluster - k-Means Clustering to open the k-Means Clustering Step 1 of 3 dialog.
From the Variables list, select all variables except Type, then click the > button to move the selected variables to the Selected Variables list.
Click Next to advance to the Step 2 of 3 dialog.
At # Clusters, enter 8. This is the parameter k in the k-means clustering algorithm. The number of clusters should be at least 1 and at most the number of observations -1 in the data range. Set k to several different values and evaluate the output from each.
Leave #Iterations at the default setting of 10. The value for this option determines how many times the program will start with an initial partition and complete the clustering algorithm. The configuration of clusters (and data separation) may differ from one starting partition to another. The program will go through the specified number of iterations, and select the cluster configuration that minimizes the distance measure.
Set Random starts to 5. When this option is selected, the algorithm starts building the model from any random point. XLMiner generates five cluster sets and generates the output based on the best cluster.
Set seed is selected by default. This option initializes the random number generator that is used to calculate the initial cluster centroids. Setting the random number seed to a non-zero value (default 12345) ensures that the same sequence of random numbers is used each time the initial cluster centroids are calculated. When the seed is zero, the random number generator is initialized from the system clock, so the sequence of random numbers are different each time the centroids are initialized. Set the seed to view successive runs of the clustering method as comparable.
Select the Normalize input data option to normalize the data. In this example, the data will not be normalized. Select Next to open the Step 3 of 3 dialog.
Select Show data summary (default) and Show distances from each cluster center (default), then click Finish.
The k-Means Clustering method starts with k initial clusters as specified. At each iteration, the records are assigned to the cluster with the closest centroid, or center. After each iteration, the distance from each record to the center of the cluster is calculated. These two steps are repeated (the record assignment and distance calculation) until the redistribution of a record results in an increased distance value.
When a random start is specified, the algorithm generates the k cluster centers randomly, and fits the data points in those clusters. This process is repeated for all specified random starts. The output is based on the clusters that exhibit the best fit.
The worksheet KM_Output1 is inserted immediately to the right of the Data worksheet. In the top section of the output worksheet, the selected options are listed.
In the middle section of the output worksheet, XLMiner has calculated the sum of the squared distances and determined the start with the lowest Sum of Square Distance as the Best Start (#5). After the Best Start is determined, XLMiner generates the remaining output using the Best Start as the starting point.
In the bottom portion of the output worksheet, XLMiner has listed the Cluster Centers (shown below). The upper box shows the variable values at the Cluster Centers. Cluster 8 has the highest average Alcohol, Total_Phenols, Flavanoids, Proanthocyanins, Color_Intensity, Hue, and Proline content. Compare this cluster to Cluster 2, which has the highest average Ash_Alcalinity and Nonflavanoid_Phenols.
The lower box shows the distance between the Cluster Centers. From the values in this table, it is determined that Cluster 3 is very different from Cluster 8 due to the high distance value of 1,176.59, and Cluster 7 is close to Cluster 3 with a low distance value of 89.73.
The Data Summary (below) displays the number of records (observations) included in each cluster and the average distance from cluster members to the center of each cluster. Cluster 6 has the highest average distance of 42.79, and includes 24 records. Compare this cluster to Cluster 2, which has the smallest average distance of 29.66, and includes 26 members.
Click the KM_Clusters1 worksheet. This worksheet displays the cluster to which each record is assigned and the distance to each of the clusters. For the first record, the distance to Cluster 6 is the minimum distance of 23.205, so this first record is assigned to Cluster 6.