This example illustrates how to create a classification tree using the single classification tree method using the Boston_Housing.xlsx example dataset.
Click Help – Examples, then Forecasting/Data Mining Examples to open the Boston_Housing.xlsx dataset. This dataset includes fourteen variables pertaining to housing prices from census tracts in the Boston area collected by the US Census Bureau.
All supervised algorithms include a new Simulation tab. This tab uses the functionality from the Generate Data feature (described in the What’s New section of this guide and then more in depth in the Analytic Solver Data Mining Reference Guide) to generate synthetic data based on the training partition, and uses the fitted model to produce predictions for the synthetic data. The resulting report, CFBM_Simulation, will contain the synthetic data, the predicted values and the Excel-calculated Expression column, if present. In addition, frequency charts containing the Predicted, Training, and Expression (if present) sources or a combination of any pair may be viewed, if the charts are of the same type. Since this new functionality does not support categorical variables, these types of variables will not be present in the model, only continuous variables.
First, we partition the data into training and validation sets using the Standard Data Partition defaults of 60% of the data randomly allocated to the Training Set and 40% of the data randomly allocated to the Validation Set. For more information on partitioning a dataset, see the Data Mining Partitioning chapter.
With the STDPartition worksheet selected, click Classify – Classification Tree to open the Classification Tree dialog. Note: A cell must be selected within the Data Range, A1:O57.
Select CAT. MEDV as the Output variable. Then select all remaining variables except CHAS, MEDV and under Variables in Input Data, then click > to move them to the Selected Variables field.
Note: MEDV is not included in the Input since CAT. MEDV, derived from MEDV, is included in the model. CHAS is not included in the Input since this is a categorical variable. Recall that the new Simulation functionality included in Analytic Solver Data Mining does not support categorical variables.
Choose the value that will be the indicator of “Success” by clicking the down arrow next to Success Class. In this example, we will use the default of 1.
Enter a value between 0 and 1 for Success Probability Cutoff. If the Probability of success (probability of the output variable = 1) is less than this value, then a 0 will be entered for the class value, otherwise a 1 will be entered for the class value. In this example, we will keep the default of 0.5.
Click Next to advance to the Classification Tree – Parameters dialog.
As discussed in previous sections, Analytic Solver Data Mining includes the ability to partition and scale a dataset from within a classification or prediction method by clicking Partition Data and/or Rescale Data on the Parameters dialog. Analytic Solver Data Mining will partition and/or rescale your dataset (according to the partition and rescaling options you set) immediately before running the classification method. If either partitioning and/or rescaling has already occurred on the dataset, the option(s) will be disabled. For more information on partitioning, please see the Data Mining Partitioning chapter. For more information on scaling your data, please see the Transform Continuous Data chapter.
In the Tree Growth section, leave all selections at their default settings. To limit the growth of Tree Levels, Nodes, Splits or the number of Records in a Terminal Node, select the desired component(s) and enter the desired value(s), i.e. if 10 is entered for Levels, the tree will be limited to 10 levels.
Click Prior Probability. Three options appear in the Prior Probability Dialog: Empirical, Uniform and Manual.
- If the first option is selected, Empirical, Analytic Solver Data Mining will assume that the probability of encountering a particular class in the dataset is the same as the frequency with which it occurs in the training data.
- If the second option is selected, Uniform, Analytic Solver Data Mining will assume that all classes occur with equal probability.
- Select the third option, Manual, to manually enter the desired class and probability value.
Click Done to accept the default section, Empirical, and close the dialog.
Select Prune (Using Validation Set). (This option is enabled when a Validation Dataset exists.) Analytic Solver Data Mining will prune the tree using the validation set when this option is selected. (Pruning the tree using the validation set reduces the error from over-fitting the tree to the training data.)
Click Tree for Scoring and select Fully Grown.
Select Show Feature Importance to include the Feature Importance Data Table in the output. This table shows the relative importance of the feature measured as the reduction of the error criterion during the tree growth.
Leave Maximum Number of Levels at the default setting of 7. This option specifies the maximum number of levels in the tree to be displayed in the output.
Select Trees to Display to select the types of trees to display: Fully Grown, Best Pruned, Minimum Error or User Specified.
Select Fully Grown to “grow” a complete tree using the training data.
Select Best Pruned to create a tree with the fewest number of nodes, subject to the constraint that the error be kept below a specified level (minimum error rate plus the standard error of that error rate).
Select Minimum error to produce a tree that yields the minimum classification error rate when tested on the validation data.
To create a tree with a specified number of decision nodes select User Specified and enter the desired number of nodes.
Select Fully Grown, Best Pruned, and Minimum Error. Then click Done to close the dialog.
Click Next to advance to the Classification Tree - Scoring dialog.
Select Detailed report, Summary report, Lift charts and Frequency Chart under both Score Training Data and Score Validation Data. Analytic Solver Data Mining will create a detailed report, complete with the Output Navigator for ease in routing to specific areas in the output, a report that summarizes the regression output for both datasets, and lift charts, ROC curves, and Decile charts for both partitions.
When Frequency Chart is selected under both Score Training Data and Score Validation Data, a frequency chart will be displayed when the CT_TrainingScore and CT_ValidationScore worksheets are selected. This chart will display an interactive application similar to the Analyze Data feature, explained in detail in the Analyze Data chapter that appears earlier in this guide. This chart will include frequency distributions of the actual and predicted responses individually, or side-by-side, depending on the user’s preference, as well as basic and advanced statistics for variables, percentiles, six sigma indices.
Since we did not create a test partition when we partitioned our dataset, Score Test Data options are disabled. See the chapter “Data Mining Partitioning” for details on how to create a test set.
For information on scoring in a worksheet or database, please see the “Scoring New Data” chapter in the Analytic Solver Data Mining User Guide.
Click Next to advance to the Simulation tab.
Select Simulate Response Prediction to enable all options on the the Simulation tab.
Simulation tab: All supervised algorithms include a new Simulation tab. This tab uses the functionality from the Generate Data feature (described earlier in this guide) to generate synthetic data based on the training partition, and uses the fitted model to produce predictions for the synthetic data. The resulting report, CT_Simulation, will contain the synthetic data, the predicted values and the Excel-calculated Expression column, if present. In addition, frequency charts containing the Predicted, Training, and Expression (if present) sources or a combination of any pair may be viewed, if the charts are of the same type.
Evaluation: Select Calculate Expression to amend an Expression column onto the frequency chart displayed on the CT_Simulation output tab. Expression can be any valid Excel formula that references a variable and the response as [@COLUMN_NAME]. Click the Expression Hints button for more information on entering an expression.
For the purposes of this example, leave all options at their defaults in the Distribution Fitting, Correlation Fitting and Sampling sections of the dialog. For Expression, enter the following formula to display census tracts with an average number of bedrooms greater than or equal to 5.
IF[@RM]>5,[@CAT.MEDV],"Tracts <= 5 Rooms")
Note that variable names are case sensitive.
For more information on the remaining options shown on this dialog in the Distribution Fitting, Correlation Fitting and Sampling sections, see the Generate Data chapter that appears earlier in this guide.
Click Finish to run Classification Trees on the example dataset. Output worksheets are inserted to the right of the STDPartition worksheet.
Output containing the results from Classification Trees will be inserted into the active workbook to the right of the STDPartition worksheet and also in the Model tab of the task pane under Reports – Clasification Tree.
This result worksheet includes 4 segments: Output Navigator, Inputs, Training Log, Prune Log and Feature Importance.
Output Navigator: The Output Navigator appears at the top of all result worksheets. Use this feature to quickly navigate to all reports included in the output.
Inputs: Scroll down to the Inputs section to find all inputs entered or selected on all tabs of the Classification Tree dialog.
Training Log and Prune Log: The training log shows the misclassification (error) rate as each additional node is added to the tree, starting with 0 nodes and ending with 17.
Analytic Solver Data Mining chooses the number of decision nodes for the pruned tree and the minimum error tree from the values of Validation MSE. In the Prune log shown above, the smallest Validation MSE error belongs to the trees with 4, 5, 6, 7, 8, 9, 10, 11 and 12 decision nodes. Where there is a tie, meaning when multiple trees have the exact same Error Rate, the tree with the smaller number of nodes is selected. In this case, the tree with four (4) decision nodes is the Minimum Error Tree – the tree with the smallest misclassification error in the validation dataset.
Feature Importance: Select Feature Importance to include the Features Importance table in the output. This table displays the variables that are included in the model along with their Importance value. The larger the Importance value, the bigger the influence the variable has on the predicted classification. In this instance, the census tracts with homes with many rooms will be predicted as having a larger selling price.
Click CT_FullTree to view the full tree.
Recall that the objective of this example is to classify each case as a 0 (low median value) or a 1 (high median value). Consider the top decision node (denoted by a circle). The label above this node indicates the variable represented at this node (i.e. the variable selected for the first split) in this case, RM (Average # of Rooms ). The value inside the node indicates the split threshold. (Hover over the decision node to read the decision rule.) If the RM value for a specific record is greater than or equal to 6.78 (RM >= 6.78), the record will be assigned to the right node. If the RM value for the record is less than 6.78, the value will be assigned to the left node. There are 51 records with values for the RM variable greater than or equal to 6.78 while 253 records contained RM values less than 6.78. We can think of records with an RM value less than 6.78 (RM < 6.78) as tentatively classified as "0" (low median value). Any record where RM >= 6.78 can be tentatively classified as a "1" (high median value).
Let’s follow the tree as it descends to the left for a couple levels. The 253 records with RM values less than 6.78 are further split as we move down the tree. The second split occurs with the LSTAT variable (percent of the population that is of lower socioeconomic status). The LSTAT values for 4 records (out of 253) fell below the split value of 4.07. These records are tentatively classified as a “1” – high median value. The LSTAT values for the remaining 249 records are greater than or equal to 4.07, and are tentatively classified as “0" – low median value.
Following the tree to the left, the 4 records with a LSTAT value < 4.07 are split on the CRIM variable node, CRIM = per capita crime rate by town. Records with CRIM values greater than or equal .33 are classified as a 1 and records with CRIM values less than .33 are classified as a 0, in the terminal nodes. No further splits occur on terminal nodes.
The structure of the full tree will be clear by reading the Full – Grown Tree Rules also on the CT_FullTree tab.
Node ID 1: The first entry in this table shows a split on the RM variable with a split value of 6.776 (rounded to 6.78). The 304 total records in the training partition and 202 records in the validation partition were split between nodes 2 (LeftChild ID) and 3 (Rightchild ID).
Node ID 2:
- In the training partition, 253 records were assigned to this node (from node 1) which has a “0” value (Response). These cases were split on the LSTAT variable using a value of 4.07: 249 records were assigned to node 5 and 4 records were assigned to node 4.
- In the Validation Partition, 155 records were assigned to this node (from node 1). These cases were split on the same variable (LSTAT) and value (4.07): 154 records were assigned to node 5 and 1 record was assigned to node 4.
Node ID 4:
- In the training partition, 4 records, assigned from Node 2, were split on the CRIM variable using a value of 0.33. This node has a tentative classification of 1 (Response). Three records were assigned to node 9 and classified as 1. 1 record was assigned to node 8 and classified as a 0. Both nodes 8 and 9 are terminal nodes.
- In the validation partition, 1 node was assigned from Node 2. This record was assigned to terminal node 8 using the CRIM variable and a value of 0.33 and classified as 0.
The table can be used to follow the tree all the way down to level 33.
Click the CT_BestTree tab to view the Best Pruned Tree and the Rules for the Best Pruned Tree.
The Best Pruned Tree is based on the validation data set, and is the smallest tree whose misclassification rate is within one standard error of the misclassification rate of the Minimum Error Tree
The Validation Partition records are split in the Tree according to the following rules:
Node 1: 202 cases were split using the RM variable with a value of 6.78.
- 155 records were assigned to Node 2, a terminal node, and classified as 0.
- 47 records were assigned to Node 3, a decision node, and tentatively classified as 1.
Node 3: 47 cases were split using the LSTAT variable with a value of 9.65.
- 43 records were assigned to Node 4, a terminal node, and classified as 1.
- 4 records were assigned to Node 5, a decision node, and tentatively classified as 0.
Node 5: 4 records were split using the RAD variable with a value of 5.5.
- All 4 records were assigned to node 7 and classified as 0.
Click CT_MinErrorTree to view the Minimum Error Tree.
The "minimum error tree" is the tree that yields a minimum classification error rate when tested on the validation data. The misclassification (error) rate is measured as the tree is pruned. The tree that produces the lowest error rate is selected. The Min Error Tree Rules can also be found on the CT_MinErrorTree sheet.
The Validation Partition records are split in the Min Error Tree according to the following rules:
Node 1: 202 records were split using the RM variable with a value of 6.78.
- 155 records were assigned to Node 2, a decision node, and tentatively classified as 0.
- 47 records were assigned to Node 3, a decision node, and tentatively classified as 1.
Node 2: 155 records were split using the LSTAT variable with a value of 4.07.
- 1 record was assigned to Node 4, a terminal node, and classified as 1.
- 154 records were assigned to Node 5, a terminal node, and classified as 0.
Node 3: 47 records were split using the LSTAT variable (again) with a value of 9.65.
- 43 records were assigned to node 6, a terminal node, and classified as 1.
- 4 records were assigned to node 7, a decision node, and classified as 0.
Node 7: 4 records were split using the RAD variable using a value of 5.5.
- All 4 records were assigned to node 9, a terminal node, and classified as 0.
Click the CT_TrainingScore tab to view the newly added Output Variable frequency chart, the Training: Classification Summary and the Training: Classification Details report. All calculations, charts and predictions on this worksheet apply to the Training data.
Note: To view charts in the Cloud app, click the Charts icon on the Ribbon, select a worksheet under Worksheet and a chart under Chart.
Frequency Charts: The output variable frequency chart opens automatically once the CT_TrainingScore worksheet is selected. To close this chart, click the “x” in the upper right hand corner of the chart. To reopen, click onto another tab and then click back to the CT_TrainingScore tab. To move, click the title bar on the dialog and drag the chart to the desired location.
Frequency: This chart shows the frequency for both the predicted and actual values of the output variable, along with various statistics such as count, number of classes and the mode.
Click the down arrow next to Frequency to switch to Relative Frequency, Bin Details or Chart Options view.
Relative Frequency: Displays the relative frequency chart.
Bin Details: Use this view to find metrics related to each bin in the chart.
Chart Options: Use this view to change the color of the bars in the chart.
To see both the actual and predicted frequency, click Prediction and select Actual. This change will be reflected on all charts.
Classification Summary: In the Classification Summary report, a Confusion Matrix is used to evaluate the performance of the classification method.
- TP stands for True Positive. These are the number of cases classified as belonging to the Success class that actually were members of the Success class.
- FN stands for False Negative. These are the number of cases that were classified as belonging to the Failure class when they were actually members of the Success class
- FP stands for False Positive. These cases were assigned to the Success class but were actually members of the Failure group
- TN stands for True Negative. These cases were correctly assigned to the Failure group.
- True Positive: All 47 records belonging to the Success class were correctly assigned to that class
- True Negative: All 257 records belonging to the Failure class were correctly assigned to this same class
The were no misclassified records. The metrics below reflect the accuracy of the algorithm.
The following metrics are computed using the values in the confusion matrix.
- Accuracy (#Correct and %Correct): 100% - Refers to the ability of the classifier to predict a class label correctly.
- Specificity: 1 - Also called the true negative rate, measures the percentage of failures correctly identified as failures
- Specificity (SPC) or True Negative Rate =TN / (FP + TN)
- Recall (or Sensitivity): 1 - Measures the percentage of actual positives which are correctly identified as positive (i.e. the proportion of people who experienced catastrophic heart failure who were predicted to have catastrophic heart failure).
- Sensitivity or True Positive Rate (TPR) = TP/(TP + FN)
- Precision: 1 - The probability of correctly identifying a randomly selected record as one belonging to the Success class
- Precision = TP/(TP+FP)
- F-1 Score: 1 - Fluctuates between 1 (a perfect classification) and 0, defines a measure that balances precision and recall.
- F1 = 2 * TP / (2 * TP + FP + FN)
- Success Class and Success Probability: Selected on the Data tab of the Discriminant Analysis dialog.
Classification Details: This table displays how each observation in the training data was classified. The probability values for success in each record are shown after the predicted class and actual class columns.
Click the CT_ValidationScore tab to view the newly added Output Variable frequency chart, the Validation: Classification Summary and the Validation: Classification Details report. All calculations, charts and predictions on this worksheet apply to the Validation data.
Frequency Charts: The output variable frequency chart opens automatically once the CT_ValidationScore worksheet is selected. To close this chart, click the “x” in the upper right hand corner. To reopen, click onto another tab and then click back to the CT_ValidationScore tab.
Click the Frequency chart to display the frequency for both the predicted and actual values of the output variable, along with various statistics such as count, number of classes and the mode. Selective Relative Frequency from the drop down menu, on the right, to see the relative frequencies of the output variable for both actual and predicted. See above for more information on this chart.
Classification Summary: This report contains the confusion matrix for the validation data set.
- True Positive: 33 records belonging to the Success class were correctly assigned to that class
- False Negative: 3 records belonging to the Success class were incorrectly assigned to the Failure class.
- True Negative: 156 records belonging to the Failure class were correctly assigned to this same class
- False Positive: 9 records belonging to the Failure class were incorrectly assigned to the Success class.
There were 13 cases misclassified in the validation dataset resulting in a % error of 6.44%.
Classification Details: This table displays how each observation in the validation data was classified. The probability values for success in each record are shown after the predicted class and actual class columns. Records assigned to a class other than what was predicted are highlighted in red.
CT_TrainingLiftChart and CT_ValidationLiftChart
Click the CT_ValidationLiftChart tab to find the Lift Chart, ROC Curve, and Decile Chart for the Validation partition. Click the CT_TrainingLiftChart tab to display these same charts created using the training partition.
Lift Charts and ROC Curves are visual aids that help users evaluate the performance of their fitted models. Charts found on the CT_Training LiftChart tab were calculated using the Training Data Partition. Charts found on the CT_ValidationLiftChart tab were calculated using the Validation Data Partition. It is good practice to look at both sets of charts to assess model performance on both datasets.
Note: To view these charts in the Cloud app, click the Charts icon on the Ribbon, select CT_TrainingLiftChart or CT_ValidationLiftChart for Worksheet and Decile Chart, ROC Chart or Gain Chart for Chart.
After the model is built using the training data set, the model is used to score on the training data set and the validation data set (if one exists). Then the data set(s) are sorted in decreasing order using the predicted output variable value. After sorting, the actual outcome values of the output variable are cumulated and the lift curve is drawn as the cumulative number of cases in decreasing probability (on the x-axis) vs the cumulative number of true positives on the y-axis. The baseline (red line connecting the origin to the end point of the blue line) is a reference line. For a given number of cases on the x-axis, this line represents the expected number of successes if no model existed, and instead cases were selected at random. This line can be used as a benchmark to measure the performance of the fitted model. The greater the area between the lift curve and the baseline, the better the model. In the Training Lift chart, if we selected 100 cases as belonging to the success class and used the fitted model to pick the members most likely to be successes, the lift curve tells us that we would be right on all of them. Conversely, if we selected 100 random cases, we could expect to be right on about 15 of them.
The decilewise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's mean output variable value. This bars in this chart indicate the factor by which the model outperforms a random assignment, one decile at a time. Refer to the validation graph above. In the first decile, taking the most expensive predicted housing prices in the dataset, the predictive performance of the model is about 5 times better as simply assigning a random predicted value.
The Regression ROC curve was updated in V2017. This new chart compares the performance of the regressor (Fitted Predictor) with an Optimum Predictor Curve and a Random Classifier curve. The Optimum Predictor Curve plots a hypothetical model that would provide perfect classification results. The best possible classification performance is denoted by a point at the top left of the graph at the intersection of the x and y axis. This point is sometimes referred to as the “perfect classification”. The closer the AUC is to 1, the better the performance of the model. In the Validation Partition, AUC = .94 which suggests that this fitted model is a good fit to the data.
In V2017, two new charts were introduced: a new Lift Chart and the Gain Chart. To display these new charts, click the down arrow next to Lift Chart (Original), in the Original Lift Chart, then select the desired chart.
Select Lift Chart (Alternative) to display Analytic Solver Data Mining's new Lift Chart. Each of these charts consists of an Optimum Predictor curve, a Fitted Predictor curve, and a Random Predictor curve. The Optimum Predictor curve plots a hypothetical model that would provide perfect classification for our data. The Fitted Predictor curve plots the fitted model and the Random Predictor curve plots the results from using no model or by using a random guess (i.e. for x% of selected observations, x% of the total number of positive observations are expected to be correctly classified).
The Alternative Lift Chart plots Lift against the Predictive Positive Rate or Support.
Click the down arrow and select Gain Chart from the menu. In this chart, the True Positive Rate or Sensitivity is plotted against the Predictive Positive Rate or Support.
As discussed above, Analytic Solver Data Mining generates a new output worksheet, CT_Simulation, when Simulate Response Prediction is selected on the Simulation tab of the Discriminant Analysis dialog.
This report contains the synthetic data, the predicted values for the training partition (using the fitted model) and the Excel – calculated Expression results, if populated in the dialog. A chart is displayed with the option to switch between the Predicted Simulation and Training sources and the Expression results for the Simulation and Training data, or a combination of two as long as they are of the same type.
Note the first column in the output, Expression. This column was inserted into the Synthetic Data results because Calculate Expression was selected and an Excel function was entered into the Expression field, on the Simulation tab of the Discriminant Analysis dialog
IF[@RM]>=5,[@CAT.MEDV],"Tracts < 5 Rooms")
The results in this column are either 0, 1, or Tracts <= 5 Rooms.
The remainder of the data in this report is synthetic data, generated using the Generate Data feature described in the chapter with the same name, that appears earlier in this guide.
The chart that is displayed once this tab is selected, contains frequency information pertaining to the output variable in the training partition and the synthetic data. In the screenshot below, the bars in the darker shade of blue are based on the synthetic data. The bars in the lighter shade of blue are based on the predictions for the training partition. In the synthetic data, a little over 70% of the census tracts are predicted to have a classification equal to 0, or low median value, while almost 30% of census tracts are predicted to have a classification equal to 1, or high median value.
Click Prediction (Simulation) / Prediction (Training) to change the chart view to Prediction (Training)/Expression (Training).
The chart above reports the predictions from the training partition and compares them to the results of the expression evaluated on the training partition. About 83% of the records in the training partition were classified as CAT. MEDV = 0. About 15% of the records in the training partition were classified as CAT. MEDV = 1. The Expression columns (the lighter blue columns) report the number of records that were classified as 0 or 1 where RM > 5. This chart clearly shows that a small number of rooms in the training partition have values for RM less than or equal to 5.
Click the down arrow next to Frequency to change the chart view to Relative Frequency or to change the look by clicking Chart Options. Statistics on the right of the chart dialog are discussed earlier in this section. For more information on the generated synthetic data, see the Generate Data chapter that appears earlier in this guide.
For information on Stored Model Sheets, in this example DA_Stored, please refer to the “Scoring New Data” chapter within the Analytic Solver Data Mining User Guide.