XLMiner V2015 includes four different methods for creating classification trees: boosting, bagging, random trees, and single tree. The first three (boosting, bagging, and random trees) are ensemble methods that are used to generate one powerful model by combining several weaker tree models. Single tree is used to create a single classification tree.
This example illustrates how to create a classification tree using the boosting ensemble method. The boosting method starts by training a single tree, then examines the records misclassified by that tree to train a successive tree. This process repeats until a high level of accuracy is obtained. We will use the Boston_Housing.xlsx data set to illustrate this method.
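XLMiner's internal implementation is not exposed, but the boosting workflow described above can be sketched with scikit-learn's AdaBoost implementation (the data here is synthetic, not Boston_Housing.xlsx):

```python
# A rough scikit-learn analogue of the boosting method described above.
# Each weak learner is a shallow tree; records misclassified by earlier
# trees receive more weight when the next tree is trained.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for a binary classification data set.
X, y = make_classification(n_samples=500, random_state=0)

model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X, y)
print(len(model.estimators_))  # number of weak learners trained (up to 50)
```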
On the XLMiner ribbon, from the Applying Your Model tab, select Help - Examples, then Forecasting/Data Mining Examples to open the Boston_Housing.xlsx data set. This data set includes 15 variables pertaining to housing prices from the Boston area collected by the U.S. Census Bureau.
The following figure displays a portion of the data; observe the last column (CAT. MEDV). This variable has been derived from the MEDV variable by assigning a 1 for MEDV levels of 30 or above (>= 30) and a 0 for levels below 30 (< 30).
First, partition the data into Training and Validation Sets using the Standard Data Partition defaults of 60% of the data randomly allocated to the Training Set, and 40% of the data randomly allocated to the Validation Set. For more information on partitioning a data set, see the Data Mining Partition section.
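The same 60/40 random split can be illustrated in scikit-learn terms (this is an analogy, not XLMiner's partitioning code; the data is synthetic):

```python
# Illustrative 60/40 random partition, analogous to the Standard Data
# Partition defaults. 506 records mirrors the size of the Boston data.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))       # 13 input variables
y = rng.integers(0, 2, size=506)     # binary CAT. MEDV-style target

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.6, random_state=0
)
print(len(X_train), len(X_valid))    # roughly 60% / 40% of 506 records
```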
On the XLMiner ribbon, from the Data Mining tab, select Partition - Standard Partition to open the Standard Data Partition dialog. Select a cell on the Data_Partition worksheet, then click OK.
On the XLMiner ribbon, from the Data Mining tab, select Classify - Classification Tree - Boosting to open the Classification Tree Boosting - Step 1 of 3 dialog.
Select CAT. MEDV as the Output Variable, then move all remaining variables except MEDV to the Selected Variables list. (The MEDV variable is not included in the input because the CAT. MEDV variable is derived from it.)
At Specify "Success" class (for Lift Chart), select the down arrow to select the value that will be the indicator of Success. In this example, we will use the default of 1.
At Specify initial cutoff probability for success, enter a value between 0 and 1. If the Probability of success (probability of the output variable = 1) is less than this value, then a 0 will be entered for the class value; otherwise, a 1 will be entered for the class value. In this example, we will keep the default of 0.5.
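The cutoff rule the dialog describes amounts to a simple threshold on the predicted probability of success. A minimal sketch, with made-up probabilities:

```python
# Applying the initial cutoff: probabilities below the cutoff become
# class 0, probabilities at or above it become class 1.
probs = [0.12, 0.55, 0.50, 0.49, 0.91]   # illustrative success probabilities
cutoff = 0.5                              # the default used in this example

classes = [1 if p >= cutoff else 0 for p in probs]
print(classes)  # [0, 1, 1, 0, 1]
```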
Click Next to advance to the Classification Tree Boosting - Step 2 of 3 dialog.
When Normalize Input Data is selected, XLMiner normalizes the data to determine if linear combinations of the input variables are used when splitting the tree. Keep this option unchecked.
Leave the default selection for Boosting Algorithm, AdaBoost.M1 (Breiman). The difference in the algorithms is the way in which the weights assigned to each observation or record are updated. See the Ensemble Methods section.
In AdaBoost.M1 (Freund), the constant is calculated as:

αb = ln((1 - eb)/eb)

In AdaBoost.M1 (Breiman), the constant is calculated as:

αb = (1/2)ln((1 - eb)/eb)

In SAMME, the constant is calculated as:

αb = (1/2)ln((1 - eb)/eb) + ln(k - 1), where k is the number of classes

When the number of categories is equal to 2, ln(k - 1) = 0 and SAMME behaves the same as AdaBoost.M1 (Breiman).
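The three constants can be computed directly from these formulas. In this sketch, the weighted error e_b = 0.2 and k = 2 are illustrative values, not output from XLMiner:

```python
# Computing the weak-learner constant under each formula, for an
# illustrative weighted error e_b = 0.2 and k = 2 classes.
import math

e_b, k = 0.2, 2
alpha_freund = math.log((1 - e_b) / e_b)
alpha_breiman = 0.5 * math.log((1 - e_b) / e_b)
alpha_samme = 0.5 * math.log((1 - e_b) / e_b) + math.log(k - 1)

# With k = 2, ln(k - 1) = 0, so SAMME reduces to the Breiman constant.
print(alpha_samme == alpha_breiman)  # True
```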
Leave the Number of weak learners at the default of 50. This option controls the number of weak classification models that will be created. The ensemble method stops when the number of classification models created reaches the value set for the Number of weak learners. The algorithm then computes the weighted sum of votes for each class and assigns the winning classification to each record.
If Use Re-weighting is selected, the Adaboost algorithm calculates a weight for each record and updates that weight on each iteration, while assigning higher weights to misclassified records. If Use Re-sampling is selected, the AdaBoost algorithm chooses a sample of records in each iteration and assigns higher probabilities to misclassified records so that those records are favored in the next sample selection.
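One re-weighting iteration can be sketched as follows. The weights and the Breiman-style constant here are illustrative, not XLMiner's exact update:

```python
# One re-weighting step, sketched: misclassified records have their
# weights increased so the next weak learner focuses on them.
import math

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, False, True]   # record 3 was misclassified
e_b = sum(w for w, c in zip(weights, correct) if not c)  # weighted error
alpha = 0.5 * math.log((1 - e_b) / e_b)                  # Breiman-style constant

# Up-weight misses, down-weight hits, then renormalize to sum to 1.
raw = [w * math.exp(alpha if not c else -alpha) for w, c in zip(weights, correct)]
total = sum(raw)
weights = [w / total for w in raw]
print(weights[2] > weights[0])  # True: the misclassified record now weighs more
```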
In the Trees section, leave the default selection of levels and enter 7 for Maximum number of levels. The tree may also be limited by the number of splits or nodes by clicking the drop-down next to levels. If levels is chosen, the tree contains the specified number of levels. If splits is selected, the number of times the tree is split is limited to the value entered, and if nodes is selected, the number of nodes in the entire tree is limited to the value specified.
Set Minimum # records in a terminal node to 30. XLMiner stops splitting the tree when all nodes contain a minimum of 30 records.
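These two growth limits have direct analogues in scikit-learn's tree parameters (an analogy only, not XLMiner's API): 7 levels roughly corresponds to max_depth=7, and the 30-record terminal-node minimum to min_samples_leaf=30.

```python
# Growth limits analogous to the dialog settings above, sketched with a
# scikit-learn tree on synthetic data.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
tree = DecisionTreeClassifier(max_depth=7, min_samples_leaf=30, random_state=0)
tree.fit(X, y)
print(tree.get_depth() <= 7)  # True: splitting stops at the depth limit
```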
XLMiner V2015 provides the ability to partition a data set from within a classification or prediction method by selecting Partition Options on the Step 2 of 3 dialog. If this option is selected, XLMiner partitions the data set (according to the partition options) immediately before running the prediction method. If partitioning has already occurred on the data set, this option is disabled. For more information on partitioning, see the Data Mining Partition section.
Click Next to advance to the Classification Tree Boosting - Step 3 of 3 dialog.
Under both Score Training Data and Score Validation Data, Summary Report and Lift charts are selected by default. Select Detailed Report (under both Score Training and Score Validation Sets) to produce a detailed assessment of the performance of the tree in both sets. Since we did not create a test partition, the options for Score test data are disabled. See the Data Mining Partition section for information on how to create a test partition.
Click Finish. Worksheets containing the output of the Ensemble Methods algorithm are inserted at the end of the workbook. Click the CTBoost_Output worksheet to view the Output Navigator. Click any link in this section to navigate to various sections of the output.
Scroll down on the CTBoost_Output worksheet to view the Details of the boosting tree ensemble. The number of Weak Learners is equal to 50, which matches our input on the Step 2 of 3 dialog for the Number of weak learners option. XLMiner assigns a weight to each weak learner. The Importance percentage for each Variable measures the variable's contribution in reducing the total misclassification error.
Click the Ensemble Details link to navigate to the Details of the boosting tree ensemble table, which displays each Weak Learner and the final weight assigned to each by the AdaBoost algorithm. Note: 50 weak learners were utilized, which corresponds to our setting for the Number of weak learners option on the Step 2 of 3 dialog.
Each variable and its Importance are also displayed here. Importance is defined as the measure of the contribution of each variable in reducing the misclassification error of the model.
Scroll down to Training Data scoring - Summary Report to view the Confusion Matrix.
The Confusion Matrix displays counts for cases that were correctly and incorrectly classified in the Training and Validation Sets. No records were misclassified in the Training Set and only six records were misclassified in the Validation Set, resulting in a % error of 2.97.
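The reported % error follows directly from the confusion-matrix counts. Assuming a 202-record Validation Set (40% of the 506 records), the arithmetic is:

```python
# Reproducing the % error figure: 6 misclassified records out of an
# assumed 202-record Validation Set.
misclassified = 6
n_validation = 202
pct_error = 100 * misclassified / n_validation
print(round(pct_error, 2))  # 2.97
```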
On the Output Navigator, click the CTBoost_TrainScore tab to view the Predicted Class, Actual Class, Probability for 0, and Probability for 1 (success) values in the Training Set. If the value for Probability for 1 is greater than 0.5, the record is assigned a classification of 1.
The same applies to the CTBoost_ValidScore tab.
On the Output Navigator, click the CTBoost_TrainLiftChart and CTBoost_ValidLiftChart tabs to navigate to the Lift Charts, shown below.
Lift Charts consist of a lift curve and a baseline. After the model is built using the Training Set, the model is used to score on the Training Set and the Validation Set (if one exists). Then the data set(s) are sorted using the predicted output variable value. After sorting, the actual outcome values of the output variable are cumulated, and the lift curve is drawn as the number of cases (x-axis) versus the cumulated value (y -axis). The baseline (red line connecting the origin to the end point of the blue line) is drawn as the number of cases versus the average of actual output variable values multiplied by the number of cases. The greater the area between the lift curve and the baseline, the better the model.
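The construction described above can be sketched in a few lines. The actual outcomes and predicted values here are a small made-up example, not the Boston data:

```python
# Building lift-curve points: sort records by predicted value (descending),
# then cumulate the actual outcomes; the baseline is the average outcome
# times the number of cases.
actual = [1, 0, 1, 1, 0, 0]
predicted = [0.9, 0.2, 0.8, 0.4, 0.6, 0.1]

order = sorted(range(len(actual)), key=lambda i: predicted[i], reverse=True)
cumulated = []
total = 0
for i in order:
    total += actual[i]
    cumulated.append(total)          # lift curve: cases (x) vs cumulated actuals (y)

avg = sum(actual) / len(actual)
baseline = [avg * (n + 1) for n in range(len(actual))]  # random-selection line
print(cumulated)  # [1, 2, 2, 3, 3, 3]
```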
In the Lift Chart (Training Set) pictured below, the red line originating from the origin and connecting to the point (300, 47) is a reference line that represents the expected number of CAT. MEDV predictions if XLMiner simply selected random cases (i.e., no model was used). This reference line provides a yardstick against which the user can compare the model performance. From the Lift Chart below, we can infer that if we assigned 200 cases to class 1, about 47 1s would be included. If 200 cases were selected at random, we could expect about 30 1s.
The decile-wise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's mean output variable value. The bars in this chart indicate the factor by which the model outperforms a random assignment, one decile at a time. Refer to the validation graph below. In the first decile, which contains the cases with the highest predicted values, the predictive performance of the model is about 6.5 times better than simply assigning a random predicted value.
ROC (receiver operating characteristic) curves plot the performance of binary classifiers by graphing true positive rates (TPR) versus false positive rates (FPR) as the cutoff value grows from 0 to 1. The closer the curve is to the top left corner of the graph (i.e., the smaller the area above the curve), the better the performance of the model.
In an ROC curve, we can compare the performance of a classifier with that of a random guess, which would lie at a point along a diagonal line (red line) running from the origin (0, 0) to the point (1, 1). (This line is sometimes called the line of no-discrimination.) Anything to the left of this line signifies a better prediction, and anything to the right signifies a worse prediction. The best possible prediction performance would be denoted by a point at the top left of the graph at the intersection of the x and y axis. This point is sometimes referred to as the perfect classification. Area Under the Curve (AUC) is the space in the graph that appears below the ROC curve. (This value is reported at the top of the ROC graph.) AUC is a value between 0 and 1. The closer the AUC value is to 1, the better the performance of the classification model. In this example, the AUC equals 1 in the Training Set, which means that XLMiner achieved perfect classification in the Training Set. In the Validation Set ROC, the AUC is very close to 1 (AUC = 0.9956), which indicates that this model fits the Validation Set very well too.
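The TPR/FPR points and the AUC value can be computed with scikit-learn's metrics (an illustration with made-up labels and scores, not the XLMiner output):

```python
# Computing ROC points and the AUC, analogous to the charts described
# above. Labels and scores are illustrative.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one point per cutoff
auc = roc_auc_score(y_true, y_score)               # area under that curve
print(auc)
```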
Since the number of trees produced when using an Ensemble method can potentially be in the hundreds, it is not practical for XLMiner to draw each tree in the output.
XLMiner generates the CTBoost_Stored worksheet along with the other output sheets. For details, see the Scoring New Data section.