This method of partitioning is used when the percentage of successes in the output variable is very low in the data set, but we want to train a model on data with a particular percentage of successes. Oversampling is executed as follows:
For illustration we use the data set Catalog_multi.xls. It contains responses to a direct mail offer and is published by DMEF, the Direct Marketing Educational Foundation. In this data set, "Target dependent variable:buyer(yes=1)" is the output variable, and the success rate is less than 1%. In some applications we prefer to train on data with a success rate of around 50%, so we use the oversampling utility.

Open Catalog_multi.xls and choose XLMiner --> Partition Data --> Partition with Oversampling. You get a dialog that, alongside Data Range and First Row Contains Headers, offers the following options.

Variables: This box lists all the variables present in the data set.

Variables in the partitioned data: This list box contains the names of the variables that you selected from the Variables list.

Randomization options: Check "Set Seed" and enter the desired number.

Select all the variables in the Variables list and transfer them to Variables in the partitioned data by clicking the transfer button.

We can now select the output variable. Click Target dependent variable:buyer(yes=1) in the list. As soon as you select it, the selection button under "Output variable" is activated. Click it to set the output variable. Remember, the output variable should have only two distinct classes.

The options under "Output options" show the values relevant to the output variable we have chosen.

#Classes: This shows how many classes (distinct values) are present in the output variable.

Specify Success class: Select which value of the output variable you want to treat as success. XLMiner immediately shows the percentage of that value in the data set next to % of success in data set.

Specify % success in training set: Select the percentage of successes you would like the training set to have. XLMiner randomly selects the required number of successes and fills the remainder of the training set with randomly selected failures.
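The checks and the training-set draw described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not XLMiner's implementation: the function names `output_summary` and `draw_training_rows` are hypothetical, the number of successes placed in the training set (`n_success`) is passed in explicitly, and the labels are toy data rather than Catalog_multi.xls.

```python
import random
from collections import Counter


def output_summary(labels, success_class):
    """Return (#classes, % of success) for a two-class output variable."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("output variable must have exactly two distinct classes")
    pct_success = 100.0 * counts[success_class] / len(labels)
    return len(counts), pct_success


def draw_training_rows(labels, success_class, n_success, pct_success, seed):
    """Pick n_success successes at random, then enough random failures so that
    successes make up pct_success percent of the training set.

    Assumption: n_success is chosen by the caller; the seed plays the role
    of the "Set Seed" option so that the draw is repeatable.
    """
    rng = random.Random(seed)
    successes = [i for i, y in enumerate(labels) if y == success_class]
    failures = [i for i, y in enumerate(labels) if y != success_class]
    n_failure = round(n_success * (100 - pct_success) / pct_success)
    return sorted(rng.sample(successes, n_success) + rng.sample(failures, n_failure))


# Toy data (hypothetical): 4 successes among 100 records.
labels = [1] * 4 + [0] * 96
print(output_summary(labels, success_class=1))   # (2, 4.0)

train = draw_training_rows(labels, success_class=1, n_success=2,
                           pct_success=50, seed=12345)
print(len(train))                                # 4 rows: 2 successes, 2 failures
```

Fixing the seed means the same rows are drawn on every run, which is the usual reason for setting one before partitioning.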
Once the training set is made, XLMiner assigns the remaining successes to the validation set at random and adds as many randomly selected failures as are needed to keep the percentage of successes equal to that of the original data set.

Specify % validation data to be taken away as test data: Here we specify what percentage of the validation set will be used as a test set, if we want one.

Let us make the following selection and select OK. We get the following output.

Let us see how XLMiner arrived at the number of records shown in the oversampling sheet. The output variable in the data set contains 576 1's, and we have taken 1 as the success class. We specified 50% successes in the training set, so XLMiner randomly takes 50% of the 1's (288 records) for the training partition and an equal number of randomly selected 0's. The training set therefore has 576 records.

The oversampling sheet shows that the percentage of successes in the data set is 0.989605704. XLMiner maintains this percentage in the validation set. It allocates the remaining 1's (50% of 576, i.e. 288 in all) to the validation set, then selects enough 0's at random that the success percentage in the validation set is the same as in the original data set. The total number of records in the validation set comes to 29102, since 288 is 0.989605704% of roughly 29102 records. On allotting 50% of these to the test set, the validation set and the test set each have 14551 rows.
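The record counts in this example can be reproduced with a short Python calculation. This is a sketch of the arithmetic only, not XLMiner's code: the function name `oversampling_sizes` and its parameters are hypothetical, and truncating the validation total to a whole number is an assumption made to match the 29102 reported in the oversampling sheet.

```python
def oversampling_sizes(n_success, n_success_train, pct_success_train,
                       pct_success_data, pct_test):
    """Return partition sizes for an oversampled training set.

    Percentages are given as percent values, e.g. 50 for 50%.
    """
    # Training set: the chosen successes plus enough failures to reach
    # the requested success percentage.
    n_train = round(n_success_train * 100 / pct_success_train)

    # Remaining successes go to validation, padded with failures so the
    # success rate matches the original data set.
    n_success_rest = n_success - n_success_train
    # Assumption: the fractional part is dropped.
    n_valid_total = int(n_success_rest * 100 / pct_success_data)

    # A share of the validation records is taken away as the test set.
    n_test = n_valid_total * pct_test // 100
    n_valid = n_valid_total - n_test
    return {"training": n_train, "validation": n_valid, "test": n_test}


sizes = oversampling_sizes(n_success=576, n_success_train=288,
                           pct_success_train=50,
                           pct_success_data=0.989605704, pct_test=50)
print(sizes)   # {'training': 576, 'validation': 14551, 'test': 14551}
```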