Introduction

Ensemble methods, introduced in XLMiner V2015, are powerful techniques capable of producing strong classification tree models. XLMiner V2015 now features three of the most robust ensemble methods available in data mining: Boosting, Bagging, and Random Trees. The sections below introduce each technique and explain when it is most appropriate. Before discussing these methods, however, it is useful to review classification trees, pruning, and how these concepts relate to the use of ensemble methods.

 

Classification Trees

Classification tree algorithms use binary recursive partitioning to progressively split the data into branches, with the aim of classifying or predicting outcomes. This process generates rules that are easy to understand and can be translated into SQL or a natural query language. The classification tree method uses labeling, recording, and the assignment of variables to discrete classes in order to provide confidence in the accuracy of each classification.

Sample Decision Tree
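
To make the idea concrete outside of XLMiner, the minimal Python sketch below (using the open-source scikit-learn library and its built-in iris dataset, both illustrative assumptions rather than part of XLMiner) grows a small classification tree and prints the rules it learned:

    # Minimal classification-tree sketch using scikit-learn (not XLMiner).
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_iris()

    # Grow a shallow tree via binary recursive partitioning.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(data.data, data.target)

    # A fitted tree can be expressed as human-readable if/then rules.
    print(export_text(tree, feature_names=list(data.feature_names)))

Each printed rule corresponds to one root-to-leaf path, which is what makes tree models straightforward to restate as SQL-style conditions.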

 

Tree Pruning

In nature, pruning is the process of removing branches and leaves in order to promote the overall growth of a plant. The analogy to data mining classification trees is apt: pruning likewise improves the performance of decision trees by removing branches that do not generalize accurately. This is necessary because the classification tree algorithm, by its nature, splits the data progressively, starting at the root node; consequently, the population of records represented at each node decreases with every subsequent split. From an analysis standpoint, the deepest splits capture highly specific patterns that cannot be accurately applied to larger populations; they are simply not relevant. The solution is to allow the classification tree to grow to its full size and then prune it (using the validation dataset) to remove branches that do not contribute effectively.
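
The sketch below illustrates the same grow-then-prune idea in Python with scikit-learn; it uses cost-complexity pruning and a held-out validation split, which may differ in detail from the pruning procedure XLMiner applies:

    # Grow a full tree, then prune it using a validation set (scikit-learn sketch).
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

    # Candidate pruning strengths derived from the fully grown tree.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

    # Refit at each pruning strength and keep the tree that scores best
    # on the held-out validation data.
    best_tree, best_score = None, -1.0
    for alpha in path.ccp_alphas:
        pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
        score = pruned.score(X_valid, y_valid)
        if score > best_score:
            best_tree, best_score = pruned, score

    print("validation accuracy of the pruned tree:", best_score)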

 

Ensemble Methods

In short, ensemble methods are used to create stronger (i.e., more accurate) classification tree models by combining multiple weak classification tree models. XLMiner offers three robust ensemble methods: Bagging, Boosting, and Random Trees. These methods differ in how data is sampled from the training dataset, how the weak models are generated, and how their outputs are combined into a stronger classification tree model. One method may be more appropriate than another in a given situation, depending on the nature of the dataset (e.g., its size) and whether the computation needs to be parallelized. In addition to the trio of ensemble methods, XLMiner also features a simple classification tree algorithm (i.e., Single Tree) that can be used to find a model that classifies the data effectively. However, Bagging, Boosting, and Random Trees are more powerful options that often provide more meaningful classifications.

 

How to Access Ensemble Methods in Excel

  1. Launch Excel.
  2. In the toolbar, click XLMINER PLATFORM.
  3. In the ribbon's Data Mining section, click Classify.
  4. In the drop-down menu, hover the mouse cursor over Classification Tree to reveal a sub-menu.
  5. Select Boosting, Bagging, Random Trees, or Single Tree as needed.


 

Classification Tree Modeling Methods

 

Bagging

Bagging (also known as bootstrap aggregating) is a simple but powerful ensemble algorithm that improves the stability and accuracy of classification models. Bagging works by generating multiple training datasets via random sampling with replacement, applying the algorithm to each dataset, and then taking the majority vote amongst the models to determine the classification of each record. Bagging is particularly popular because it reduces variance, helps to prevent overfitting (i.e., fitting the model to noise that does not generalize), and can easily be parallelized, which makes it well suited to large datasets.
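
A rough from-scratch sketch of the procedure, written in Python with scikit-learn trees as the weak models (the dataset and the choice of 25 trees are illustrative assumptions), looks like this:

    # Bagging sketch: bootstrap samples, one tree per sample, majority vote.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    rng = np.random.default_rng(0)

    # 1. Generate training sets by random sampling with replacement,
    #    and fit one full tree to each.
    trees = []
    for _ in range(25):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap indices
        trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

    # 2. Classify each record by majority vote across all 25 trees.
    votes = np.stack([t.predict(X) for t in trees])   # shape: (25, n_samples)
    majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    print("training accuracy of the bagged vote:", (majority == y).mean())

Because each bootstrap sample and its tree are independent of the others, the loop over the 25 models can be distributed across processors, which is why Bagging scales well to large datasets.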

 

Boosting

Boosting is a robust ensemble algorithm that reduces both bias and variance and converts weak learners (i.e., classifiers only weakly correlated with the true classification) into strong learners (i.e., well-correlated classifiers). Boosting builds strong classification tree models by training each successive model to concentrate on the records misclassified by the previous models; all of the classifiers are then combined by a weighted majority vote. At each step, the process increases the weight of incorrectly classified records and decreases the weight of correctly classified ones, effectively forcing subsequent models to place greater emphasis on the misclassified records. The algorithm then computes the weighted sum of votes for each class and assigns the winning classification to the record. Boosting frequently yields better models than Bagging, but it cannot be parallelized because each model depends on the one before it; consequently, if the dataset is very large (i.e., many weak learners are required), Boosting may not be the most appropriate ensemble method.
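
The weighting scheme described above is the one implemented by the AdaBoost family of algorithms. The sketch below uses scikit-learn's AdaBoostClassifier on a synthetic dataset; it illustrates the general approach and is not necessarily the exact boosting variant XLMiner implements:

    # Boosting sketch: each stage reweights records the previous stages misclassified.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 100 sequentially trained weak learners, combined by a weighted majority vote.
    booster = AdaBoostClassifier(n_estimators=100, random_state=0)
    booster.fit(X_train, y_train)
    print("test accuracy:", booster.score(X_test, y_test))

Each stage depends on the errors of the stage before it, so the training loop is inherently sequential; this is the reason Boosting does not parallelize the way Bagging does.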

 

Random Trees

Random Trees (also known as random forests) combines the Bagging concept of bootstrap sampling with random feature selection to construct decision tree models with controlled variance. The Random Trees method creates multiple decision trees from training datasets and then uses the mode (majority vote) of their class predictions to form a strong classifier. The benefits of the Random Trees method are that, like Bagging, it is parallelizable, it helps to prevent overfitting, and it generates models faster than Bagging. On the downside, the increased speed comes from the limited number of features considered at each split; consequently, the end result may not be as comprehensive.
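
Outside of XLMiner this method is commonly called a random forest. A minimal scikit-learn sketch follows (the parameter values shown are illustrative assumptions):

    # Random forest sketch: bagged trees that also sample features at each split.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(
        n_estimators=100,      # number of trees voting
        max_features="sqrt",   # features considered at each split (the "limited" subset)
        n_jobs=-1,             # trees are independent, so fit them in parallel
        random_state=0,
    )
    forest.fit(X_train, y_train)
    print("test accuracy:", forest.score(X_test, y_test))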

 

Ensemble Methods Summary

  • Ensemble methods are a trio of techniques that create stronger classification tree models by combining weak models.
  • Bagging: Generates multiple training datasets via random sampling with replacement, applies the algorithm to each dataset, and takes the majority vote amongst the models to determine each classification.
  • Boosting: Trains each successive model to concentrate on the records misclassified by previous models, then combines the classifiers via a weighted majority vote.
  • Random Trees: Creates multiple decision trees from training datasets and uses the mode (majority vote) of their class predictions to form a strong classifier.

 

Resources