The following example illustrates how to use the Discriminant Analysis classification algorithm. On the XLMiner ribbon, from the Applying Your Model tab, select Help - Examples, then Forecasting/Data Mining Examples, and open the example data set Boston_Housing.xlsx.
This data set includes 14 variables pertaining to housing prices from census tracts in the Boston area, as collected by the U.S. Census Bureau.
First, create a standard partition using percentages of 80% for the Training Set and 20% for the Validation Set. The Data_Partition worksheet is inserted at the beginning of the workbook. For more information on how to partition a data set, see the Data Mining Partitioning section.
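As a rough illustration of what an 80/20 standard partition does, the following Python sketch randomly assigns row indices to a training list and a validation list. The function name partition_rows, the seed value, and the use of Python's random module are assumptions made for illustration; XLMiner's own partitioning routine may work differently. The row count of 506 is the sum of the training and validation record counts reported in the confusion matrices of this example.

```python
import random

def partition_rows(n_rows, train_fraction=0.8, seed=12345):
    """Randomly split row indices 0..n_rows-1 into training and validation lists."""
    rng = random.Random(seed)            # fixed (hypothetical) seed for a reproducible split
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_rows, valid_rows = partition_rows(506)
print(len(train_rows), len(valid_rows))  # 405 101
```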
Select a cell on the Data_Partition worksheet, then on the XLMiner ribbon, from the Data Mining tab, select Classify - Discriminant Analysis to open the Discriminant Analysis - Step 1 of 3 dialog.
From the Variables In Input Data list, select the CAT. MEDV variable, then click > to select it as the Output Variable. The options for Classes in the Output Variable are enabled. #Classes is prefilled as 2 since the CAT. MEDV variable contains two classes, 0 and 1.
Specify Success class (for Lift Chart) is selected by default, and Class 1 is to be considered a success or the significant class in the Lift Chart. Note: This option is enabled when the number of classes in the output variable is equal to 2. For Specify initial cutoff probability for success, enter a value between 0 and 1. If the calculated probability for success for an observation is greater than or equal to this value, then a success (or a 1) will be predicted for that observation. If the calculated probability for success for an observation is less than this value, then a non-success (or a 0) will be predicted for that observation. The default value is 0.5. Note: This option is only enabled when the # of Classes is equal to 2.
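The cutoff rule described above can be sketched in a few lines of Python. The function name classify_with_cutoff and the sample probabilities are hypothetical and only illustrate the comparison; they are not XLMiner output.

```python
def classify_with_cutoff(prob_success, cutoff=0.5):
    """Return 1 (success) if the probability meets or exceeds the cutoff, else 0."""
    return 1 if prob_success >= cutoff else 0

# Hypothetical success probabilities for four observations.
probabilities = [0.10, 0.49, 0.50, 0.93]
predictions = [classify_with_cutoff(p) for p in probabilities]
print(predictions)  # [0, 0, 1, 1]
```

Raising the cutoff makes the rule more conservative: with a cutoff of 0.7, an observation with probability 0.6 would be predicted as a non-success.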
From the Variables In Input Data list, select CRIM, ZN, INDUS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, and B, then click > to move to the Selected Variables list. CHAS, LSTAT, and MEDV should remain in the Variables in Input Data list as shown below.
Click Next to advance to the Discriminant Analysis - Step 2 of 3 dialog.
Under Analysis Method Options, select Canonical Variate for XLMiner to produce the canonical variates for the data based on an orthogonal representation of the original variates. This has the effect of choosing a representation that maximizes the distance between the different groups. For a k class problem, there are k-1 canonical variates. Typically, only a subset of the canonical variates is sufficient to discriminate between the classes. For instance, in a problem with two canonical variates and four original predictors, replacing the four original predictors with just two predictors, X1 and X2 (which are linear combinations of the four original predictors), gives a discrimination that performs similarly to the discrimination based on the original predictors. In this example, CAT. MEDV has two classes, so there is a single canonical variate.
Three options appear under Prior Class Probabilities: According to relative occurrences in training data, Use equal prior probabilities, and User specified prior probabilities.
If According to relative occurrences in training data is selected, the discriminant analysis procedure incorporates prior assumptions about how frequently the different classes occur: XLMiner assumes that the probability of encountering a particular class in the large data set is the same as the frequency with which it occurs in the training data.
If Use equal prior probabilities is selected, XLMiner assumes that all classes occur with equal probability.
If User specified prior probabilities is selected, manually enter the desired class and probability value. Under the Probability list, enter 0.7 for Class 1, and 0.3 for Class 0.
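The three prior options can be compared with a short Python sketch. The helper names empirical_priors and equal_priors are hypothetical. The label counts (68 successes and 337 failures) are the training class totals implied by the confusion matrix reported later in this example; the user-specified values are the 0.7 and 0.3 entered above.

```python
from collections import Counter

def empirical_priors(labels):
    """Priors equal to each class's relative frequency in the training data."""
    counts = Counter(labels)
    return {cls: counts[cls] / len(labels) for cls in sorted(counts)}

def equal_priors(labels):
    """Every class receives the same prior probability."""
    classes = sorted(set(labels))
    return {cls: 1 / len(classes) for cls in classes}

# Training labels: 68 records in class 1 and 337 in class 0.
training_labels = [1] * 68 + [0] * 337

relative_priors = empirical_priors(training_labels)
uniform_priors = equal_priors(training_labels)
user_priors = {1: 0.7, 0: 0.3}   # user-specified values from this step

print(relative_priors, uniform_priors, user_priors)
```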
XLMiner provides the option of specifying the cost of misclassification when there are two classes: the cost incurred when the success class is judged as a failure, and the cost incurred when a non-success is judged as a success. XLMiner takes into consideration the relative costs of misclassification, and attempts to fit a model that minimizes the total cost. Leave these options at their defaults of 1.
XLMiner V2015 provides the ability to partition a data set from within a classification or prediction method by selecting Partition Options on the Discriminant Analysis - Step 2 of 3 dialog. If this option is selected, XLMiner partitions the data set (according to the partition options set) immediately before running the classification method. If partitioning has already occurred on the data set, this option will be disabled. For more information on partitioning, see the Data Mining Partitioning section.
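One common way to take misclassification costs into account is to predict the class with the lower expected cost. The Python sketch below shows that textbook decision rule under assumed names (min_cost_class and its parameters); it is not a description of how XLMiner itself incorporates the costs.

```python
def min_cost_class(prob_success,
                   cost_success_as_failure=1.0,
                   cost_failure_as_success=1.0):
    """Predict the class (1 or 0) with the lower expected misclassification cost."""
    # Expected cost of predicting failure: we are wrong whenever the record is a success.
    cost_if_predict_failure = prob_success * cost_success_as_failure
    # Expected cost of predicting success: we are wrong whenever the record is a failure.
    cost_if_predict_success = (1 - prob_success) * cost_failure_as_success
    return 1 if cost_if_predict_success <= cost_if_predict_failure else 0

print(min_cost_class(0.4))                               # equal costs
print(min_cost_class(0.4, cost_success_as_failure=2.0))  # missing a success costs twice as much
```

With both costs left at 1, this rule reduces to a cutoff of 0.5 on the success probability.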
Click Next to advance to the Discriminant Analysis - Step 3 of 3 dialog.
Under Output Options, select Linear Discriminant Functions to include the Linear Discriminant Functions in the output. Select Canonical Variate loadings for XLMiner to produce the canonical variates for the data based on an orthogonal representation of the original variates. This has the effect of choosing a representation that maximizes the distance between the different groups. For a k class problem, there are k-1 canonical variates, and typically only a subset of them is sufficient to discriminate between the classes.
Under Score Training Data and Score Validation Data, select all four options. When Detailed Report is selected, XLMiner creates a detailed report of the Discriminant Analysis output. When Summary Report is selected, XLMiner creates a report summarizing the Discriminant Analysis output. When Lift Charts is selected, XLMiner includes Lift Chart and ROC curves in the Discriminant Analysis output.
The values of the variables X1 and X2 for the ith observation are known as the canonical scores for that observation. With two canonical variates, the pair of canonical scores for each observation represents the observation in a two-dimensional space. The purpose of the canonical score is to separate the classes as much as possible. Thus, when the observations are plotted with the canonical scores as the coordinates, the observations belonging to the same class are grouped together. When the canonical scores option is selected, XLMiner reports the scores of the first few observations.
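To make the idea of a canonical score concrete, the following Python sketch computes a single discriminating direction for a two-class problem with two predictors, using Fisher's classical formula (inverse of the pooled within-class scatter times the difference of class means). The data points and function names are hypothetical, and the direction is defined only up to scaling, so XLMiner's reported loadings and scores may be normalized differently.

```python
def mean_vector(rows):
    """Column means of a list of 2-element rows."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(2)]

def scatter_matrix(rows, mean):
    """2x2 sum of outer products of deviations from the class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class0, class1):
    """Weights w = Sw^-1 (m1 - m0) for two classes and two predictors."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter_matrix(class0, m0), scatter_matrix(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def canonical_score(w, row):
    """Projection of one observation onto the discriminating direction."""
    return w[0] * row[0] + w[1] * row[1]

# Hypothetical observations for two classes.
class0 = [(1, 2), (2, 1), (2, 3), (3, 2)]
class1 = [(6, 5), (7, 6), (7, 4), (8, 5)]

w = fisher_direction(class0, class1)
scores0 = [canonical_score(w, r) for r in class0]
scores1 = [canonical_score(w, r) for r in class1]
print(w, scores0, scores1)
```

In this toy data every class 1 score is larger than every class 0 score, which is the kind of separation the canonical scores are meant to expose.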
Since we did not create a test partition, the options for Score Test Data are disabled. For more information about how to create a test partition, see the Data Mining Partitioning section.
For information on scoring data, see the Scoring New Data section.
Click Finish to view the output. The output worksheets are inserted at the end of the workbook. The first output worksheet, DA_Output, contains the Output Navigator that can be used to navigate to various sections of the output.
Scroll down to view the Summary Reports. In this example, we are classifying the price of houses in a census tract based on the features of the houses in the tract. The output variable, CAT. MEDV, is 1 if the median cost of houses in a census tract is greater than $30,000, and 0 if not. In this example, our Success class is the class containing housing tracts with a higher median price.
A Confusion Matrix is used to evaluate the performance of a classification method. This matrix summarizes the records that were classified correctly and those that were not.
TP stands for True Positive. This is the number of cases classified as belonging to the Success class that were members of the Success class. FN stands for False Negative. This is the number of cases that were classified as belonging to the Failure class when they were members of the Success class (e.g., patients who were told they did not have cancer when they actually did). FP stands for False Positive. These cases were assigned to the Success class, but were actually members of the Failure class (e.g., patients who were told they tested positive for cancer when in fact their tumors were benign). TN stands for True Negative. These cases were correctly assigned to the Failure class.
In the Training Set, we see that 62 records belonging to the Success class were correctly assigned to that class, while six records belonging to the Success class were incorrectly assigned to the Failure class. Additionally, 294 records belonging to the Failure class were correctly assigned to this same class, while 43 records belonging to the Failure class were incorrectly assigned to the Success class. The total number of misclassified records was 49 (43+6), which results in an error equal to 12.10%. In the Validation Set, 16 records were correctly classified as belonging to the Success class, while 73 cases were correctly classified as belonging to the Failure class. Twelve records were incorrectly classified as belonging to the Success class when they were members of the Failure class. This resulted in a total classification error of 11.88%.
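The error percentages quoted above can be checked with a short Python sketch. The function name error_rate is hypothetical. The training counts come directly from the text; for the Validation Set the text does not state the number of false negatives, so the value 0 used here is an inference from the reported 11.88% error (12 misclassified out of 101 records).

```python
def error_rate(tp, fn, fp, tn):
    """Fraction of records that were misclassified."""
    return (fn + fp) / (tp + fn + fp + tn)

train_error = error_rate(tp=62, fn=6, fp=43, tn=294)
valid_error = error_rate(tp=16, fn=0, fp=12, tn=73)   # fn=0 is inferred, not stated

print(f"{train_error:.2%}")  # 12.10%
print(f"{valid_error:.2%}")  # 11.88%
```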
Precision is the probability that a record assigned to the Success class actually belongs to the Success class (e.g., the probability that a patient identified as having cancer actually has cancer). Recall (or Sensitivity) measures the percentage of actual positives that are correctly identified as positive (e.g., the proportion of people with cancer who are correctly identified as having cancer). Specificity (also called the true negative rate) measures the percentage of failures correctly identified as failures (e.g., the proportion of people with no cancer who are categorized as not having cancer). The F-1 score, which ranges between 0 and 1 (a perfect classification), is a measure that balances precision and recall.
Precision = TP / (TP + FP)
Sensitivity or True Positive Rate (TPR) = TP / (TP + FN)
Specificity (SPC) or True Negative Rate = TN / (FP + TN)
F1 = 2 * (Precision * Recall) / (Precision + Recall)
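As a worked example, the Python sketch below applies these four formulas to the Training Set counts reported above (TP = 62, FN = 6, FP = 43, TN = 294). The function name classification_metrics is hypothetical, and the printed values are computed from the formulas rather than copied from XLMiner's report.

```python
def classification_metrics(tp, fn, fp, tn):
    """Precision, recall (sensitivity), specificity, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (fp + tn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

metrics = classification_metrics(tp=62, fn=6, fp=43, tn=294)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```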
From the Output Navigator, click the LDA Train - Detail Rept. link to view the Classification of training data on the DA_TrainingScoreLDA worksheet. This section of the output shows how each training data observation was classified. Similarly, the Classification of Validation Data on the DA_ValidationScoreLDA worksheet displays how each validation data observation was classified. The probability values for success in each record are shown after the predicted class and actual class columns. Records whose predicted class differs from the actual class are highlighted in blue.
On the Output Navigator, click the Class Funs link to view the Classification Function table. In this example, there are two functions, one for each class. Each record is assigned to the class whose function yields the higher value.
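The assignment rule can be sketched in Python as follows. Each class has a linear function (a constant plus a weighted sum of predictor values), and a record goes to the class whose function returns the larger score. The constants, coefficients, and records below are invented for illustration and are not the values in the Classification Function table; only the predictor names RM and CRIM come from the data set.

```python
def classification_score(function, record):
    """Evaluate one class's linear classification function for a record."""
    return function["constant"] + sum(
        coef * record[name] for name, coef in function["coefficients"].items()
    )

def assign_class(functions, record):
    """Return the class whose classification function gives the highest score."""
    scores = {cls: classification_score(f, record) for cls, f in functions.items()}
    return max(scores, key=scores.get)

# Hypothetical classification functions for classes 0 and 1.
functions = {
    0: {"constant": -10.0, "coefficients": {"RM": 3.0, "CRIM": 0.1}},
    1: {"constant": -20.0, "coefficients": {"RM": 5.0, "CRIM": -0.2}},
}

large_home_tract = {"RM": 6.5, "CRIM": 0.05}
small_home_tract = {"RM": 4.5, "CRIM": 2.0}
print(assign_class(functions, large_home_tract))  # 1
print(assign_class(functions, small_home_tract))  # 0
```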
On the Output Navigator, click the Canonical Variate Loadings link to navigate to the Canonical Variate Loadings section.
Canonical Variate Loadings are a second set of functions that give a representation of the data that maximizes the separation between the classes. The number of functions is one less than the number of classes (i.e., one function). If you were to plot the cases in this example on a line, where xi is the ith case's value for variate1, you would see a clear separation of the data. This output is useful in illustrating the inner workings of the discriminant analysis procedure, but is not typically needed by the end-user analyst.
Lift Charts consist of a lift curve and a baseline. After the model is built using the Training Set, the model is used to score the Training Set and the Validation Set (if one exists). Then the data set(s) are sorted using the predicted output variable value. After sorting, the actual outcome values of the output variable are cumulated, and the lift curve is drawn as the number of cases (x-axis) versus the cumulated value (y-axis). The baseline (red line connecting the origin to the end point of the blue line) is drawn as the number of cases versus the average of actual output variable values multiplied by the number of cases. The greater the area between the lift curve and the baseline, the better the model.
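The construction of the lift curve and its baseline can be sketched in Python. The function names and the eight sample records are hypothetical; the steps (sort by predicted value, cumulate the actual outcomes, compare with average times number of cases) follow the description above.

```python
def lift_curve(actual, predicted_prob):
    """Cumulative count of actual successes after sorting by predicted probability."""
    order = sorted(range(len(actual)), key=lambda i: predicted_prob[i], reverse=True)
    cumulative, running = [], 0
    for i in order:
        running += actual[i]
        cumulative.append(running)
    return cumulative

def baseline(actual):
    """Expected cumulative successes if cases were selected at random."""
    average = sum(actual) / len(actual)
    return [average * (k + 1) for k in range(len(actual))]

# Hypothetical actual classes and predicted success probabilities.
actual = [1, 0, 1, 0, 0, 1, 0, 0]
predicted_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.3, 0.05]

curve = lift_curve(actual, predicted_prob)
reference = baseline(actual)
print(curve)      # [1, 2, 3, 3, 3, 3, 3, 3]
print(reference)
```

Both lines end at the same point (the total number of successes); the model is useful to the extent that the lift curve rises above the baseline before that point.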
In the Lift Chart (Training Set) below, the red line starting at the origin and connecting to the point (400, 65) is a reference line that represents the expected number of CAT. MEDV predictions if XLMiner selected random cases (i.e., no model was used). This reference line provides a yardstick against which the user can compare the model performance. From the Lift Chart below, we can infer that if we assigned 200 cases to class 1, about 65 1s would be included. If 200 cases were selected at random, we could expect about 30 1s.
The decile-wise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's mean output variable value. The bars in this chart indicate the factor by which the Discriminant Analysis model outperforms a random assignment, one decile at a time. Refer to the validation graph below. In the first decile, taking the most expensive predicted housing prices in the data set, the predictive performance of the model is about 5.8 times better than simply assigning a random predicted value.
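A decile-wise lift value can be illustrated with the Python sketch below. It uses a common definition, the mean actual outcome within each decile of the sorted cases divided by the overall mean outcome, which is an assumption here and may not match XLMiner's exact calculation. The function name and the 20 sample records are hypothetical.

```python
def decile_lift(actual, predicted_prob, n_bins=10):
    """Mean outcome in each bin of sorted cases, relative to the overall mean.

    Assumes there are at least n_bins cases so that no bin is empty.
    """
    order = sorted(range(len(actual)), key=lambda i: predicted_prob[i], reverse=True)
    sorted_actual = [actual[i] for i in order]
    overall_mean = sum(actual) / len(actual)
    n = len(sorted_actual)
    lifts = []
    for b in range(n_bins):
        chunk = sorted_actual[b * n // n_bins:(b + 1) * n // n_bins]
        lifts.append((sum(chunk) / len(chunk)) / overall_mean)
    return lifts

# Hypothetical data: 20 cases, 4 successes, probabilities already in descending order.
predicted_prob = [1.0 - 0.05 * i for i in range(20)]
actual = [1, 1, 1, 0, 1] + [0] * 15

lifts = decile_lift(actual, predicted_prob)
print([round(value, 2) for value in lifts])
```

A first-decile value of 5 in this toy data means the top-scored 10% of cases contain five times as many successes as a random 10% would be expected to contain.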
Receiver Operating Characteristic (ROC) curves plot the performance of binary classifiers by graphing true positive rates (TPR) versus false positive rates (FPR) as the cutoff value grows from 0 to 1. The closer the curve is to the top left corner of the graph, and the smaller the area above the curve, the better the performance of the model.
In an ROC curve, we can compare the performance of a classifier with that of a random guess, which would lie at a point along a diagonal line (red line) running from the origin (0, 0) to the point (1, 1). This line is sometimes called the line of no-discrimination. Anything to the left of this line signifies a better prediction, and anything to the right signifies a worse prediction. The best possible prediction performance would be denoted by a point at the top left corner of the graph, at (0, 1), where the false positive rate is 0 and the true positive rate is 1. This point is sometimes referred to as the perfect classification. Area Under the Curve (AUC) is the space in the graph that appears below the ROC curve. This value is reported at the top of the ROC graph. AUC is a value between 0 and 1. The closer the AUC value is to 1, the better the performance of the classification model. In this example, the AUC is very close to 1 in both the Training and Validation Sets, which indicates that this model is a good fit.
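The following Python sketch shows one way to trace ROC points and approximate the AUC with the trapezoidal rule. The function names and the six sample records are hypothetical, and the sketch sweeps the cutoff from high to low, which traces the same curve as increasing the cutoff from 0 to 1 but in the opposite direction. It assumes both classes are present in the data.

```python
def roc_points(actual, predicted_prob):
    """(FPR, TPR) pairs obtained by lowering the cutoff through each distinct score."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(predicted_prob), reverse=True):
        tp = sum(1 for a, p in zip(actual, predicted_prob) if p >= cutoff and a == 1)
        fp = sum(1 for a, p in zip(actual, predicted_prob) if p >= cutoff and a == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical actual classes and predicted success probabilities.
actual = [1, 1, 0, 1, 0, 0]
predicted_prob = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, predicted_prob)
auc = area_under_curve(points)
print(round(auc, 4))  # 0.8889
```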
On the Output Navigator, click the Training Canonical Scores link to navigate to the DA_TrainCanonScore worksheet. Canonical Scores are the values of each case for the function. These are intermediate values useful for illustration, but are generally not required by the end-user analyst.
For information on stored model sheets such as DA_Stored, see the Scoring New Data section.