Using Hierarchical Clustering

The following options appear on the Hierarchical Clustering dialogs. See the Common Options section of the Introduction to Analytic Solver Data Science for descriptions of options appearing on the Step 1 of 3 dialog. The Data type argument can be found below.

Analytic Solver Data Mining: Hierarchical Clustering Data tab

Data Type

The Hierarchical Clustering method can be used on raw data as well as the data in Distance Matrix format. Choose the appropriate option to fit your data set. If Raw Data is chosen, Analytic Solver computes the similarity matrix before clustering.

Analytic Solver Data Mining: Hierarchical Clustering Parameters tab

Normalize input data

Normalizing the data is important to ensure that the distance measure accords equal weight to each variable. Without normalization, the variable with the largest scale dominates the measure.

Similarity Measures

Hierarchical Clustering uses the Euclidean distance as the similarity measure for working on raw numeric data. When the data is binary, the remaining two options, Jaccard's coefficients and Matching coefficients, are enabled.

Suppose we have binary values for xij. See the table below for individual i and j values.

Binart Table

The most useful similarity measures in this situation are:

Jaccard's coefficient = d/(b+c+d). This coefficient ignores zero matches.

The matching coefficient = (a + d)/p.

Clustering Method

The goal of clustering is to reduce the amount of data by categorizing or grouping similar data items together. See the introduction to this section for a description of all clustering methods used in Analytic Solver.

Analytic Solver Data Mining: Hierarchical Clustering Parameters tab

Draw Dendrogram

When this option is selected, Analytic Solver creates a dendrogram to illustrate the clustering process.

Maximum Number of Leaves

If Draw Dendrogram is selected, this option is enabled. Use this option to define the maximum number of leaves in the dendrogram tree. The default setting is equal to 10 or if the number of records in the dataset is less than 10; then the default is the Minimum between the Number of Rows in the dataset and your current licensed limit.

Show cluster membership

Select this option to display the cluster number (ID) to which each record is assigned by the routine.

# Clusters

The agglomerative method of Hierarchical Clustering continues to form clusters until only one cluster is left. Use this option to stop the process at a given number of clusters.