One issue when fitting a model is how well the newly created model behaves when applied to new data. To address this issue, the data set can be divided into multiple partitions: a training partition used to create the model, a validation partition used to test the performance of the model, and a test partition used to estimate performance on completely unseen data. Partitioning is performed either randomly, according to proportions specified by the user, which protects against a biased partition, or according to rules that depend on the data set type. For example, when creating a time series forecast, data is partitioned in chronological order.
The Training Set is used to train or build a model. For example, in a linear regression, the training set is used to fit the linear regression model (i.e., to compute the regression coefficients). In a neural network model, the training set is used to obtain the network weights. After fitting the model on the Training Set, the performance of the model should be tested on the Validation Set.
Once a model is built using the Training Set, the performance of the model must be validated using new data. If the Training Set itself were used to compute the accuracy of the model fit, the result would be an overly optimistic estimate of the accuracy of the model. This is because the training or model fitting process makes the accuracy of the model on the training data as high as possible, so the model is specifically suited to the training data. To obtain a more realistic estimate of how the model would perform with unseen data, we must set aside a part of the original data and not include this set in the training process. This data set is known as the Validation Set.
To validate the performance of the model, XLMiner measures the discrepancy between the actual observed value and the predicted value for each observation. This discrepancy is known as the error in prediction, and is used to measure the overall accuracy of the model.
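The error in prediction described above can be illustrated with a minimal Python sketch. The source does not say which summary measures XLMiner reports, so the root mean squared error and mean absolute error used here are assumptions chosen as common examples, and the function names are hypothetical.

```python
import math


def residuals(actual, predicted):
    """Return the error in prediction (actual minus predicted) per observation."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [a - p for a, p in zip(actual, predicted)]


def rmse(actual, predicted):
    """Root mean squared error: one common summary of overall accuracy (assumed here)."""
    errs = residuals(actual, predicted)
    return math.sqrt(sum(e * e for e in errs) / len(errs))


def mean_absolute_error(actual, predicted):
    """Mean absolute error: another common summary (also an assumption)."""
    errs = residuals(actual, predicted)
    return sum(abs(e) for e in errs) / len(errs)


# Illustrative values only, not taken from any real data set.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
print(residuals(actual, predicted))
print(rmse(actual, predicted), mean_absolute_error(actual, predicted))
```

Computing these measures on the Validation Set rather than the Training Set is what makes the estimate realistic.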
The Validation Set is often used to fine-tune models. For example, you might try out neural network models with various architectures and test the accuracy of each on the Validation Set to choose the best performer among the competing architectures. Once a model is chosen this way, its accuracy on the Validation Set is still an optimistic estimate of how it would perform with unseen data, because the final model was selected precisely for having the highest accuracy on the Validation Set. As a result, it is a good idea to set aside another portion of the data that is used in neither training nor validation. This set is known as the Test Set. The accuracy of the model on the test data gives a realistic estimate of the performance of the model on completely unseen data.
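The selection process can be sketched in Python as follows. This is only an illustration under stated assumptions: the candidate models are hypothetical stand-ins (simple functions with fixed slopes rather than trained networks), the data values are made up, and mean squared error is assumed as the accuracy measure.

```python
def mean_squared_error(model, data):
    """Average squared prediction error of `model` over (x, y) pairs."""
    return sum((y - model(x)) ** 2 for x, y in data) / len(data)


def select_and_assess(candidates, validation, test):
    """Pick the candidate with the lowest validation error, then score it on test data."""
    best_name = min(
        candidates,
        key=lambda name: mean_squared_error(candidates[name], validation),
    )
    best_model = candidates[best_name]
    return (
        best_name,
        mean_squared_error(best_model, validation),  # optimistic: used for selection
        mean_squared_error(best_model, test),        # realistic: never used before
    )


# Hypothetical competing models, already "fitted" for the sake of the example.
candidates = {
    "slope_1": lambda x: 1.0 * x,
    "slope_2": lambda x: 2.0 * x,
    "slope_3": lambda x: 3.0 * x,
}
validation = [(1, 2.0), (2, 4.0), (3, 6.5)]
test = [(4, 8.5), (5, 9.0)]

name, valid_mse, test_mse = select_and_assess(candidates, validation, test)
print(name, valid_mse, test_mse)
```

In this made-up example the winning model's error on the test data is larger than its error on the validation data, which is the effect the Test Set is meant to expose.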
XLMiner provides two methods of partitioning: Standard Partitioning and Partitioning with Oversampling. There are two approaches to standard partitioning: random partitioning and user-defined partitioning.
In simple random sampling, every observation in the main data set has equal probability of being selected for the partition data set. For example, if you specify 60% for the Training Set, then 60% of the total observations are randomly selected for the training set. In other words, each observation has a 60% chance of being selected.
Random partitioning uses the system clock as a default to initialize the random number seed. Alternatively, the random seed can be manually set, resulting in the same observations being chosen for the Training/Validation/Test Sets each time a standard partition is created.
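A seeded random partition of this kind can be sketched in Python as below. This is not XLMiner's implementation: the function name, its parameters, and the shuffle-then-slice approach are assumptions made for illustration.

```python
import random


def random_partition(n_rows, train_frac, valid_frac, seed=None):
    """Randomly split row indices 0..n_rows-1 into training, validation, and test sets.

    Whatever is left after the training and validation fractions goes to test.
    With seed=None, Python seeds the generator from system entropy or the clock;
    with a fixed seed, the same rows are chosen on every call.
    """
    if train_frac < 0 or valid_frac < 0 or train_frac + valid_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)  # every row is equally likely to land in any position
    n_train = round(n_rows * train_frac)
    n_valid = round(n_rows * valid_frac)
    train = sorted(indices[:n_train])
    valid = sorted(indices[n_train:n_train + n_valid])
    test = sorted(indices[n_train + n_valid:])
    return train, valid, test


# 60% training, 20% validation, remaining 20% test, with a fixed seed.
train, valid, test = random_partition(100, 0.6, 0.2, seed=12345)
print(len(train), len(valid), len(test))
```

Calling the function again with the same seed returns the same three sets, which is what makes a partition reproducible.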
In user-defined partitioning, the partition variable specified is used to partition the data set. This is useful when you have already pre-determined the observations to be used in the Training, Validation, or Test Sets. This partition variable takes the value: t for training, v for validation and s for test. Rows with any other values in the Partition Variable column are ignored. The partition variable serves as a flag for writing each observation to the appropriate partition(s).
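The flag-based assignment can be sketched in Python as follows. The row layout (a list of dictionaries), the column name "Partition", and the exact, case-sensitive matching of flag values are assumptions for illustration; the source does not say how XLMiner treats upper-case flags.

```python
def partition_by_flag(rows, flag_column):
    """Assign each row to a partition according to its partition variable.

    't' -> training, 'v' -> validation, 's' -> test.
    Rows with any other value in the flag column are ignored.
    """
    partitions = {"t": [], "v": [], "s": []}
    for row in rows:
        flag = row.get(flag_column)
        if flag in partitions:
            partitions[flag].append(row)
    return partitions["t"], partitions["v"], partitions["s"]


# Made-up rows; "Partition" is a hypothetical column name.
rows = [
    {"id": 1, "Partition": "t"},
    {"id": 2, "Partition": "v"},
    {"id": 3, "Partition": "s"},
    {"id": 4, "Partition": "t"},
    {"id": 5, "Partition": "x"},  # unrecognized value: ignored
]
train, valid, test = partition_by_flag(rows, "Partition")
print([r["id"] for r in train], [r["id"] for r in valid], [r["id"] for r in test])
```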
Partition with Oversampling
This method of partitioning is used when the percentage of successes in the output variable is very low (e.g., callers who opt in to a short survey at the end of a customer service call). Typically, the number of people who finish the survey is very low, so the information connected with these callers is minimal. As a result, it would be almost impossible to formulate a model based on these callers. In these types of cases, we must use Oversampling (also called weighted sampling). Oversampling can be used when there are only two classes, one of much greater importance than the other (e.g., callers who finish the survey as compared to callers who simply hang up).
XLMiner takes the following steps when partitioning with oversampling.
1. The data is partitioned by randomly allocating 50% of the success values for the output variable to the Training Set. The output variable must be limited to two classes that can either be numbers or strings.
2. XLMiner randomly selects as many failure records as are required for the Training Set to reach the % success in Training Set value specified by the user.
3. The remaining 50% of successes are randomly allocated to the Validation Set.
4. If the % validation data to be taken away as test data option is selected, XLMiner will create an appropriate test set from the Validation Set.
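The four steps above can be sketched in Python as follows. This is an interpretation, not XLMiner's code: the function and parameter names are hypothetical, and because the steps do not say which failure records join the Validation Set, the sketch places all failures not used for training there, which is an assumption.

```python
import random


def partition_with_oversampling(labels, success, pct_success_train,
                                pct_valid_to_test=0.0, seed=None):
    """Return (train, valid, test) lists of row indices for a two-class output."""
    if not 0 < pct_success_train <= 100:
        raise ValueError("pct_success_train must be in (0, 100]")
    rng = random.Random(seed)
    successes = [i for i, y in enumerate(labels) if y == success]
    failures = [i for i, y in enumerate(labels) if y != success]
    rng.shuffle(successes)
    rng.shuffle(failures)

    # Step 1: randomly allocate 50% of the successes to the Training Set.
    half = len(successes) // 2
    train_successes = successes[:half]

    # Step 2: add the number of failures needed so that successes make up
    # pct_success_train percent of the Training Set.
    n_fail = round(half * (100 - pct_success_train) / pct_success_train)
    if n_fail > len(failures):
        raise ValueError("not enough failure records for the requested percentage")
    train = train_successes + failures[:n_fail]

    # Step 3: the remaining successes go to the Validation Set.
    # ASSUMPTION: all failures not used for training go there as well.
    valid = successes[half:] + failures[n_fail:]

    # Step 4: optionally take a percentage of the Validation Set as the Test Set.
    rng.shuffle(valid)
    n_test = round(len(valid) * pct_valid_to_test / 100)
    test, valid = valid[:n_test], valid[n_test:]
    return train, valid, test


# Made-up data: 20 successes (1) among 200 records.
labels = [1] * 20 + [0] * 180
train, valid, test = partition_with_oversampling(labels, 1, 50, 25, seed=7)
print(len(train), len(valid), len(test))
```

With 20 successes and a requested 50% success rate, the Training Set in this example holds 10 successes and 10 failures, so the rare class is far better represented there than in the original data.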
In XLMiner V2015, it is no longer always necessary to partition a data set before running a classification or prediction algorithm. Rather, you can now perform partitioning in the Step 2 of 3 dialog for each classification or prediction method. An example of the Classification Tree Boosting - Step 2 of 3 dialog is shown below. If the active worksheet is an unpartitioned data set, the Partition Data option will be enabled. If the active worksheet is a partitioned data set, the Partition Data option will be disabled.
If a data partition is used to train and validate several different classification or prediction algorithms that will be compared for predictive power, it may be better to use the Ribbon Partition choices to create a partitioned data worksheet. But if the data partition will be used with a single algorithm, or if it isn't crucial to compare algorithms on exactly the same partitioned data, Partition-on-the-fly offers the following advantages.
User interface steps are saved, and the Excel workbook is not cluttered with partition worksheets.
Partition-on-the-fly is much faster than creating a standard partition, and then running an algorithm.
Partition-on-the-fly can handle larger data sets without exhausting memory, since the intermediate Excel worksheet for the partitioned data is never created.