This example illustrates the use of XLMiner's k-Nearest Neighbors Classification method. On the XLMiner ribbon, from the Applying Your Model tab, select Help - Examples, then Forecasting/Data Mining Examples, and open the example workbook Iris.xlsx. This data set was introduced by R. A. Fisher, and reports four characteristics of three species of the Iris flower.

Iris.xlsx 

First, partition the data using a standard partition with percentages of 60% Training and 40% Validation (the default settings for the Automatic choice). For more information on how to partition a data set, see the Data Mining Partition section. 

On the XLMiner ribbon, from the Data Mining tab, select Partition - Standard Partition to open the Standard Data Partition dialog.

Standard Data Partition Dialog
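For readers who want to see what a 60%/40% random partition amounts to outside the dialog, the following is a minimal Python sketch. It is not XLMiner's implementation: the function name standard_partition and the seed value are assumptions chosen for illustration, and row indices stand in for the 150 Iris records.

```python
import random

def standard_partition(rows, train_fraction=0.6, seed=12345):
    """Randomly split rows into a training list and a validation list.

    The seed is an arbitrary value (an assumption) used only to make
    the split repeatable.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Row indices 0..149 stand in for the 150 records of the Iris data set.
train, valid = standard_partition(range(150))
print(len(train), len(valid))  # 90 60
```

Every record lands in exactly one of the two sets, so the Validation Set can be used to judge the model on records it was not built from.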

Select a cell on the Data_Partition worksheet, then on the XLMiner ribbon, from the Data Mining tab, select Classify - k-Nearest Neighbors Classification to open the k-Nearest Neighbors Classification - Step 1 of 3 dialog.

From the Variables In Input Data list, select Petal_width, Petal_length, Sepal_width, and Sepal_length, then click > to move them into the Selected Variables list. Select Species_name as the Output Variable (variable to be classified).

Note: Since the variable Species_No is perfectly predictive of the output variable, Species_name, it will not be included in the model.

Once the Output Variable is selected, # Classes (3) is filled in automatically. Since the Output Variable contains more than two classes, Specify “Success” class (for Lift Chart) and Specify initial cutoff probability for success are disabled.

k-Nearest Neighbors Step 1 of 3 Dialog 

Click Next to advance to the k-Nearest Neighbors Classification - Step 2 of 3 dialog.

Select Normalize input data to have XLMiner normalize the data by expressing each variable in terms of standard deviations from its mean. When this is done, the distance measure is not dominated by a variable with a large magnitude. In this example, the values for Petal_width are between 0.1 and 2.5, while the values for Sepal_length are between 4.3 and 7.9. When the data is normalized, each actual value (for example, 4.3) is replaced with the number of standard deviations by which it differs from the mean of that variable. This option is not selected by default.
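The normalization described here is commonly computed as a z-score, z = (x - mean) / standard deviation. The Python sketch below shows that calculation on a handful of made-up Sepal_length values that fall within the 4.3 to 7.9 range; it is an illustration, not XLMiner's code. Whether XLMiner divides by the sample or the population standard deviation is not stated in this example, so the use of the sample version here is an assumption.

```python
from statistics import mean, stdev

def standardize(values):
    """Replace each value with its distance from the mean, in standard deviations."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (an assumption, see above)
    return [(v - m) / s for v in values]

# Illustrative values only, not the actual Sepal_length column.
sepal_length = [4.3, 5.0, 5.8, 6.4, 7.9]
z = standardize(sepal_length)

for raw, score in zip(sepal_length, z):
    print(f"{raw:4.1f} -> {score:+.3f}")
```

After this transformation every variable has mean 0 and standard deviation 1, so a one-unit difference carries the same weight in the distance calculation regardless of the variable's original scale.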

For Number of nearest neighbors (k), enter 10. This number is based on standard practice from the literature. This is the parameter k in the k-Nearest Neighbors algorithm. If the number of observations (rows) is less than 50, then the value of k should be between 1 and the total number of observations (rows). If the number of rows is greater than 50, then the value of k should be between 1 and 50. Note that if k is chosen as the total number of observations in the Training Set, then for any new observation, all the observations in the Training Set become nearest neighbors. The default value for this option is 1.
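To make the role of k concrete, here is a minimal Python sketch of k-nearest-neighbors classification using Euclidean distance and a simple majority vote. The function name knn_predict and the seven two-variable training rows are hypothetical, and the sketch does not claim to reproduce XLMiner's distance measure or tie-breaking rules.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k):
    """Return the most common label among the k training rows closest to query."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (petal width, petal length) rows, for illustration only.
train_rows = [(0.2, 1.4), (0.2, 1.5), (0.3, 1.3),
              (1.3, 4.5), (1.5, 4.7),
              (2.1, 5.8), (2.3, 6.0)]
train_labels = ["Setosa", "Setosa", "Setosa",
                "Versicolor", "Versicolor",
                "Virginica", "Virginica"]

query = (1.4, 4.6)
print(knn_predict(train_rows, train_labels, query, k=3))  # Versicolor
print(knn_predict(train_rows, train_labels, query, k=7))  # Setosa
```

The second call sets k to the size of the training data, so every row votes and the prediction is simply the most frequent training class, whatever the query looks like. This is the situation the note above about choosing k equal to the total number of observations describes.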

Under Scoring Option, select Score on best k between 1 and specified value. When this option is selected, XLMiner displays the output for the best k between 1 and the value entered for Number of nearest neighbors (k). If Score on specified value of k is selected, the output is displayed for the specified value of k.

Under Prior Class Probabilities, confirm that According to relative occurrences in training data is selected. When this option is selected, XLMiner incorporates prior assumptions about how frequently the different classes occur, and assumes that the probability of encountering a particular class in the data set is the same as the frequency with which it occurs in the Training Set. (See below for information on the two remaining options.)
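As an illustration of "relative occurrences in training data", the Python sketch below estimates each class's prior probability as its share of the training rows. The class counts are hypothetical (they merely sum to the 90 rows of a 60% partition of 150 records), and how XLMiner applies these priors internally is not covered here.

```python
from collections import Counter

def empirical_priors(labels):
    """Estimate each class's prior as its relative frequency in the labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

# Hypothetical class counts for a 90-row Training Set.
training_labels = ["Setosa"] * 30 + ["Versicolor"] * 32 + ["Virginica"] * 28
priors = empirical_priors(training_labels)

for cls, p in priors.items():
    print(f"{cls}: {p:.3f}")
```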

XLMiner V2015 provides the ability to partition a data set from within a classification or prediction method by selecting Partition Options on the Step 2 of 3 dialog. If this option is selected, XLMiner partitions the data set (according to the options set) before running the classification or prediction method. If partitioning has already occurred on the data set, this option is disabled. For more information on partitioning, please see the Data Mining Partitioning chapter.

k-Nearest Neighbors Classification - Step 2 of 3 Dialog

Click Next to advance to the k-Nearest Neighbors Classification - Step 3 of 3 dialog.

Under Score Training Data and Score Validation Data, Summary Report is selected by default. Select Detailed Report under both Score Training Data and Score Validation Data. XLMiner creates detailed and summary reports for both the Training and Validation Sets.

Lift charts are disabled since there are more than two categories in the Output Variable, Species_name. Since a test partition was not created, the options under Score Test Data are disabled. For information on how to create a test partition, see the Data Mining Partition section.

k-Nearest Neighbors Classification - Step 3 of 3 Dialog

Click Finish, then click the KNNC_Output tab to view the Output Navigator located at the top of this worksheet. Click the links to navigate to other areas of the output.

k-Nearest Neighbors Output: Output Navigator

Scroll down on the KNNC_Output worksheet to view the Validation error log.

Validation error log for different k

The Validation Error Log lists the % Errors for all values of k for both the Training and Validation Sets. The k with the smallest % Error is selected as the Best k. Scoring is performed later using this best value of k.
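The selection of the Best k can be sketched in Python as follows: compute the validation % Error for each k from 1 up to the specified value and keep the k with the smallest error. The helper names, the tiny one-variable data set, and the rule of preferring the smaller k when errors tie are all assumptions made for illustration, not a description of XLMiner's internals.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training records nearest to query."""
    nearest = sorted((math.dist(x, query), label) for x, label in train)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def percent_error(train, valid, k):
    """Percentage of validation records whose predicted class is wrong."""
    wrong = sum(knn_predict(train, x, k) != label for x, label in valid)
    return 100.0 * wrong / len(valid)

def best_k(train, valid, max_k):
    """Return (best k, {k: % error}); ties go to the smaller k (an assumption)."""
    errors = {k: percent_error(train, valid, k) for k in range(1, max_k + 1)}
    best = min(errors, key=lambda k: (errors[k], k))
    return best, errors

# Hypothetical one-variable records; (2.9,) is a deliberately noisy "a".
train = [((1.0,), "a"), ((1.2,), "a"), ((1.4,), "a"), ((2.9,), "a"),
         ((3.0,), "b"), ((3.2,), "b"), ((3.4,), "b")]
valid = [((1.1,), "a"), ((2.8,), "b")]

k, errors = best_k(train, valid, max_k=3)
for candidate, err in errors.items():
    print(f"k={candidate}: {err:.1f}% error")
print("Best k:", k)
```

In this toy data the noisy record misleads the model at k = 1, while k = 3 lets the two correct neighbors outvote it, which is why a larger k can give a lower validation error.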

Further down on the KNNC_Output worksheet are the Training and Validation Data Scoring tables.

Training/Validation Data Scoring - Summary Report 

This Summary Report tallies the actual and predicted classifications. Predicted classifications were generated by applying the model to the Validation Set. Correct classification counts are along the diagonal of the table from the upper-left to the lower-right. In this example, there were no misclassification errors in either the Training or Validation Sets.
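A table of this kind is usually called a confusion matrix. The Python sketch below builds one from lists of actual and predicted classes; the five records are hypothetical and, as in this example's output, contain no misclassifications, so every off-diagonal count is zero.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

def misclassified(matrix):
    """Count the records that fall off the upper-left to lower-right diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return total - correct

classes = ["Setosa", "Versicolor", "Virginica"]
# Hypothetical records in which every prediction matches the actual class.
actual = ["Setosa", "Setosa", "Versicolor", "Versicolor", "Virginica"]
predicted = list(actual)

matrix = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(f"{name:<11}", row)
print("Misclassified:", misclassified(matrix))  # Misclassified: 0
```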

On the Output Navigator, click the Train. Score - Detailed Rpt and Valid. Score - Detailed Rpt. links to be routed to the Training Data and Validation Data Detailed Reports.

These tables show the predicted class for each record, the percent of the nearest neighbors belonging to that class, and the actual class. The class with the highest probability is highlighted in yellow. Mismatches between Predicted and Actual class are highlighted in green, if present.

Classification of Training Data 

Classification of Validation Data