This example illustrates the Bagging ensemble method with the data set Boston_Housing.xlsx, and compares the results.
On the XLMiner ribbon, from the Data Mining tab, select Classify - Classification Tree - Bagging, then select the Data_Partition worksheet.
At Output Variable, select CAT. MEDV, then, for the Selected Variables list, select all remaining variables except MEDV. (CAT. MEDV is a categorical variable derived from the MEDV variable, so MEDV itself is left out of the inputs.)
Choose the value that will be the indicator of Success by clicking the down arrow next to Specify Success class (for Lift Chart). In this example, we will use the default of 1.
Enter a value between 0 and 1 for Specify initial cutoff probability for success. If the Probability of success (probability of the output variable = 1) is less than this value, a 0 is entered for the class value; otherwise, a 1 is entered for the class value. In this example, we will keep the default of 0.5.
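The cutoff rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this step, not XLMiner's internal code; the function name classify and the sample probabilities are hypothetical.

```python
def classify(prob_success, cutoff=0.5):
    """Return 0 when the success probability is below the cutoff, else 1."""
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must lie between 0 and 1")
    return 0 if prob_success < cutoff else 1

# Illustrative probabilities only; these are not taken from the data set.
probabilities = [0.12, 0.50, 0.73]
labels = [classify(p) for p in probabilities]
print(labels)  # [0, 1, 1]
```

Note that a probability exactly equal to the cutoff is assigned to class 1, since only values below the cutoff receive a 0.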
Click Next to advance to the Classification Tree Bagging - Step 2 of 3 dialog.
When Normalize Input Data is selected, XLMiner normalizes the input data before building the trees. Normalization matters only if linear combinations of the input variables are used when partitioning (splitting) the tree. Keep this option unchecked.
Leave the Number of weak learners at the default of 50. This option controls the number of weak classification models that will be created. The ensemble method stops when the number of classification models created reaches the value set for the Number of weak learners. The algorithm then computes the weighted sum of votes for each class, and assigns the winning classification to each record.
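To make the idea of combining weak learners concrete, here is a minimal sketch of bagging in plain Python. It rests on several assumptions that do not come from XLMiner: each weak learner is a one-split tree (a "stump") rather than a full classification tree, every learner's vote carries equal weight (a simple majority vote rather than a weighted sum), and the function names and toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels):
    """Fit a one-split tree: pick the feature/threshold with the fewest errors."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if r[j] <= t else right) != y
                             for r, y in zip(rows, labels))
                if best is None or errors < best[0]:
                    best = (errors, j, t, left, right)
    _, j, t, left, right = best
    return lambda r: left if r[j] <= t else right

def fit_bagging(rows, labels, n_learners=50, seed=12345):
    """Train n_learners stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_learners):
        # Bootstrap sample: draw n record indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx]))
    return models

def predict(models, row):
    """Assign the class that receives the most (equally weighted) votes."""
    votes = Counter(m(row) for m in models)
    return votes.most_common(1)[0][0]

# Toy data (not from Boston_Housing.xlsx): one input, class 1 for larger values.
rows = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
models = fit_bagging(rows, labels)
print(predict(models, [1.5]), predict(models, [7.5]))
```

Training stops once 50 learners have been built, mirroring the Number of weak learners setting, and each new record is classified by the vote across all of them.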
Leave the default of 12345 for Bootstrapping random seed. This value seeds the random number generator used to draw the bootstrap samples, so the same observations are chosen each time the method is run with the same seed.
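The effect of a fixed seed can be illustrated with Python's standard random module. This sketch assumes nothing about XLMiner's own random number generator; the helper bootstrap_indices is hypothetical and simply shows that the same seed reproduces the same draw.

```python
import random

def bootstrap_indices(n_records, seed):
    """Draw n_records indices with replacement using a seeded generator."""
    rng = random.Random(seed)
    return [rng.randrange(n_records) for _ in range(n_records)]

first = bootstrap_indices(10, seed=12345)
second = bootstrap_indices(10, seed=12345)
print(first == second)  # True: the same seed gives the same sample
```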
In the Tree Growth section, leave the defaults of levels and 7 for Maximum number of trees. The tree can instead be limited by the number of splits or nodes by clicking the drop-down arrow next to levels. If levels is chosen, the tree is limited to the specified number of levels. If splits is selected, the number of times the tree is split is limited to the value entered, and if nodes is selected, the number of nodes in the entire tree is limited to the value specified.
Set Minimum # records in a terminal node to 30. XLMiner stops partitioning the tree when all nodes contain a minimum of 30 records.
XLMiner V2015 provides the ability to partition a data set from within a classification or prediction method by selecting Partition Options on the Step 2 of 3 dialog. If this option is selected, XLMiner partitions the data set (according to the partition options) before running the prediction method. If partitioning has already occurred on the data set, this option is disabled. For more information on partitioning, see the Data Mining Partition section.
Click Next to advance to the Classification Tree Bagging - Step 3 of 3 dialog.
Under Score Training Data and Score Validation Data, Summary Report and Lift Chart are both selected by default. Select Detailed Report under both Score Training Data and Score Validation Data to produce a detailed assessment of the performance of the tree in both sets. Since we did not create a test partition, the options for Score Test Data are disabled. See the Data Mining Partition section for information on how to create a test partition.
Click Finish to run the ensemble method. Worksheets containing the output of the Ensemble Methods algorithm are inserted at the end of the workbook. Click the CTBag_Output worksheet to view the Output Navigator. Click any link in this section to navigate to various sections of the output.
Click the Ensemble Details link, or scroll down the CTBag_Output worksheet, to view the Details of the bagging tree ensemble table. The variables included in the model are listed in order of importance, with a percentage for each. Importance measures each variable's contribution to reducing the total misclassification error of the model.
Scroll down to the Training Data Scoring - Summary Report to view the Confusion Matrix.
The Confusion Matrix displays counts for cases that were correctly and incorrectly classified in the Training and Validation Sets. Eight records were misclassified in the Training Set, resulting in an error of 2.63%, and seven records were misclassified in the Validation Set, resulting in an error of 3.46%.
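As a rough illustration of how a confusion matrix and its error percentage are tallied, consider the sketch below. The labels are made up for the example and are not the Boston Housing results; the function names are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a binary 0/1 problem."""
    counts = Counter(zip(actual, predicted))
    return {(a, p): counts[(a, p)] for a in (1, 0) for p in (1, 0)}

def error_rate(matrix):
    """Share of records whose predicted class differs from the actual class."""
    total = sum(matrix.values())
    wrong = matrix[(1, 0)] + matrix[(0, 1)]
    return wrong / total

# Made-up labels for illustration only.
actual    = [1, 1, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 0, 1, 1, 0, 0]
cm = confusion_matrix(actual, predicted)
print(cm)
print(f"{100 * error_rate(cm):.2f}%")  # 25.00%
```

The error percentage is simply the misclassified count divided by the number of records in the partition, which is how the percentages in the Summary Report relate to the counts in the matrix.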
On the Output Navigator, click the CTBag_TrainingScore tab to view the Predicted Class, Actual Class, Probability for 0, and Probability for 1 (success) values in the Training Set. If the value for Probability for 1 is greater than or equal to 0.5 (the cutoff probability set earlier), the record is assigned a classification of 1; otherwise, it is assigned a classification of 0.
The same applies to the CTBag_ValidationScore tab.
On the Output Navigator, click the CTBag_TrainingLiftChart and CTBag_ValidationLiftChart tabs to navigate to the Lift Charts, shown below.
Lift Charts consist of a lift curve and a baseline. After the model is built using the Training Set, the model is used to score on the Training Set and the Validation Set (if one exists). Then the data set(s) are sorted using the predicted output variable value. After sorting, the actual outcome values of the output variable are cumulated, and the lift curve is drawn as the number of cases (x-axis) versus the cumulated value (y-axis). The baseline (red line connecting the origin to the end point of the blue line) is drawn as the number of cases versus the average of actual output variable values multiplied by the number of cases. The greater the area between the lift curve and the baseline, the better the model.
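The construction just described (sort by predicted value, accumulate the actual outcomes, and compare against a straight baseline) can be sketched as follows. The data are invented for illustration and the function name lift_curve is hypothetical; records are sorted here by predicted probability of success, which is an assumption about the sort key.

```python
def lift_curve(actual, prob_success):
    """Return (cumulative successes, baseline) after sorting by prediction."""
    # Sort record indices from highest to lowest predicted probability.
    order = sorted(range(len(actual)),
                   key=lambda i: prob_success[i], reverse=True)
    cumulative, running = [], 0
    for i in order:
        running += actual[i]          # accumulate the actual outcomes
        cumulative.append(running)
    avg = sum(actual) / len(actual)   # average of the actual output values
    baseline = [avg * (k + 1) for k in range(len(actual))]
    return cumulative, baseline

# Invented data: 8 records, 3 of which are actual successes.
actual = [0, 1, 0, 1, 0, 0, 1, 0]
prob   = [0.2, 0.9, 0.1, 0.8, 0.3, 0.4, 0.6, 0.05]
curve, baseline = lift_curve(actual, prob)
print(curve)     # [1, 2, 3, 3, 3, 3, 3, 3]
print(baseline)
```

Both lines end at the same point (the total number of successes), so the area between them reflects how early the model surfaces the successes.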
In the Lift Chart (Training Set) pictured below, the red line originating from the origin and connecting to the point (300, 47) is a reference line that represents the expected number of records with CAT. MEDV = 1 if XLMiner simply selected cases at random (i.e., no model was used). This reference line provides a yardstick against which the user can compare the model performance. From the Lift Chart below, we can infer that if we assigned 200 cases to class 1, about 47 1s would be included. If 200 cases were selected at random, we could expect about 30 1s.
The decile-wise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's mean output variable value. The bars in this chart indicate the factor by which the bagging ensemble model outperforms a random assignment, one decile at a time. Refer to the validation graph below. In the first decile, which contains the records with the highest predicted probability of success, the predictive performance of the model is about 6.5 times better than simply assigning a random predicted value.
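A decile-wise lift calculation can be sketched as below. This uses one common definition, the mean actual value within each decile divided by the overall mean, which is an assumption and may differ in detail from the calculation XLMiner performs; the data and the function name decile_lift are hypothetical.

```python
def decile_lift(actual, prob_success, n_bins=10):
    """Mean actual outcome per decile, relative to the overall mean."""
    if len(actual) < n_bins:
        raise ValueError("need at least one record per decile")
    # Sort record indices from highest to lowest predicted probability.
    order = sorted(range(len(actual)),
                   key=lambda i: prob_success[i], reverse=True)
    overall = sum(actual) / len(actual)
    lifts = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        end = (b + 1) * len(order) // n_bins
        chunk = [actual[i] for i in order[start:end]]
        lifts.append(sum(chunk) / len(chunk) / overall)
    return lifts

# Invented data: 20 records, the 4 highest-scored ones are actual successes.
prob = [i / 20 for i in range(20)]
actual = [1 if i >= 16 else 0 for i in range(20)]
lifts = decile_lift(actual, prob)
print([round(x, 2) for x in lifts])
```

A bar height of 5 in the first decile would mean that decile contains five times as many successes as a random selection of the same size.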
ROC curves plot the performance of binary classifiers by graphing true positive rates (TPR) versus false positive rates (FPR), as the cutoff value grows from 0 to 1. The closer the curve is to the top left corner of the graph (in other words, the smaller the area above the curve), the better the performance of the model.
In an ROC curve, we can compare the performance of a classifier with that of a random guess, which would lie at a point along a diagonal line (red line) running from the origin (0, 0) to the point (1, 1). This line is sometimes called the line of no-discrimination. Anything to the left of this line signifies a better prediction, and anything to the right signifies a worse prediction. The best possible prediction performance would be denoted by a point at the top left of the graph, at (0, 1), where the false positive rate is 0 and the true positive rate is 1. This point is sometimes referred to as the perfect classification. Area Under the Curve (AUC) is the space in the graph that appears below the ROC curve. (This value is reported at the top of the ROC graph.) AUC is a value between 0 and 1. The closer the AUC is to 1, the better the performance of the classification model. In this example, the AUC equals 0.995281 in the Training Set, which means that XLMiner achieved almost perfect classification in the training set. In the Validation Set ROC, the AUC is also very close to 1 (AUC = 0.994676), which indicates that this model is a very good fit to the Validation Set as well.
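For readers who want to see how the ROC points and the AUC arise, here is a minimal sketch. It sweeps the cutoff from high to low (which traces the same curve in the opposite direction to a cutoff growing from 0 to 1) and uses the trapezoidal rule for the area. The data and function names are hypothetical, and both classes are assumed to be present.

```python
def roc_points(actual, prob_success):
    """(FPR, TPR) pairs as the cutoff sweeps from high to low."""
    pos = sum(actual)                 # number of actual 1s (assumed > 0)
    neg = len(actual) - pos           # number of actual 0s (assumed > 0)
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(prob_success), reverse=True):
        tp = sum(1 for a, p in zip(actual, prob_success)
                 if p >= cutoff and a == 1)
        fp = sum(1 for a, p in zip(actual, prob_success)
                 if p >= cutoff and a == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Invented scores: one negative record is ranked above one positive record.
actual = [1, 1, 0, 1, 0, 0]
prob   = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
points = roc_points(actual, prob)
print(round(auc(points), 4))  # 0.8889
```

In this toy case, 8 of the 9 positive/negative pairs are ranked correctly, which is why the area is 8/9; a model that ranked every positive above every negative would reach an AUC of 1.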
Since the number of trees produced when using an Ensemble method can potentially be in the hundreds, it is not practical for XLMiner to draw each tree in the output.
XLMiner generates the CTBag_Stored worksheet along with the other output sheets. For information on scoring data, see the Scoring New Data section.