To further emulate the results of the journal article discussed in the Feature Selection Example, the Random Trees Ensemble Classification method will be used to investigate whether a machine learning algorithm can predict a patient's survival using the top two or three features as ranked by the Feature Selection tool.

First, click Partition – Standard Partition to partition the dataset into Training and Validation Sets using the default percentages of 60% allocated to the Training Set and 40% allocated to the Validation Set.

Figure 11:  Standard Data Partition dialog

Then click OK to create the two partitions.  A new worksheet STDPartition is inserted to the right of the dataset. The number of records allocated to the Training partition is 179 and the number of records allocated to the Validation partition is 120.
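The record counts above follow directly from the 60%/40% split of the 299 records in the dataset. The following is a minimal Python sketch of a random partition, not the tool's own implementation: the function name standard_partition, the seed value, and the choice to truncate (rather than round) the training count are assumptions made for illustration.

```python
import random

def standard_partition(n_records, train_fraction=0.6, seed=12345):
    """Randomly split record indices into training and validation sets.

    The seed and the truncation of the training count are illustrative
    assumptions; the actual tool may select records differently.
    """
    indices = list(range(n_records))
    rng = random.Random(seed)
    rng.shuffle(indices)
    n_train = int(n_records * train_fraction)  # 299 * 0.6 = 179.4, kept as 179
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, valid_idx = standard_partition(299)
print(len(train_idx), len(valid_idx))  # 179 120
```

Every record lands in exactly one partition, so the two counts always sum to 299.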

Figure 12:  Standard Data Partitioning results

The first time that the model is fit, only two features (ejection_fraction and serum_creatinine) will be utilized.

Click Classify – Ensemble – Random Trees to open the Random Trees: Classification dialog.

Select the two variables under Variables In Input Data (ejection_fraction and serum_creatinine) and click the right-pointing arrow to the left of Selected Variables to add them to the model.  Then take similar steps to select DEATH_EVENT as the Output Variable.

Leave Success Class as "1" and Success Probability Cutoff at 0.5 under Binary Classification.
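As a rough illustration of what these two settings control, the sketch below assigns a record to the success class when its estimated success probability reaches the cutoff. The function name classify is hypothetical, and whether a probability exactly equal to the cutoff counts as a success (>= rather than >) is an assumption, not something stated in this guide.

```python
def classify(success_probability, cutoff=0.5, success_class=1, other_class=0):
    """Map an estimated success probability to a class label.

    Assumption: a probability equal to the cutoff is treated as a success.
    """
    return success_class if success_probability >= cutoff else other_class

# Hypothetical probabilities for three records (not taken from the dataset)
example_probabilities = [0.12, 0.50, 0.81]
labels = [classify(p) for p in example_probabilities]
print(labels)  # [0, 1, 1]
```

Lowering the cutoff would classify more records as deceased (class 1), which tends to raise Sensitivity at the expense of Specificity.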

The Random Trees:  Classification dialog should be similar to the one pictured in Figure 13 below.

Figure 13:  Random Trees: Classification dialog with Selected Variables (serum_creatinine and ejection_fraction) and Output Variable (DEATH_EVENT) selected.

Click the Scoring tab to advance to the Random Trees:  Classification Scoring dialog. 

For more information on Random Trees parameters, see the Random Trees Classification Options section below.

Summary Report is selected by default.  Select Detailed Report for both Score Training Data and Score Validation Data and then click Finish.

Figure 14:  Random Trees:  Classification dialog with output choices selected

Four worksheets are inserted to the right of the STDPartition tab:  CRandTrees_Output, CRandTrees_TrainingScore, CRandTrees_ValidationScore and CRandTrees_Stored.

CRandTrees_Output reports the input data, output data, and parameter settings.

CRandTrees_TrainingScore reports the confusion matrix, calculated metrics and the actual classification by row for the training partition. 

CRandTrees_ValidationScore reports the confusion matrix, calculated metrics and the actual classification by row for the validation partition. 

CRandTrees_Stored contains the stored model which can be used to apply the fitted model to new data.  See the Scoring chapter within the Analytic Solver Data Mining User Guide for an example of scoring new data using the stored model. 

Click CRandTrees_TrainingScore to view the Classification Summary for the Training partition. 

Figure 15:  Training: Classification Summary

The overall error for the training partition was 17.32%, with 14 surviving patients reported as deceased and 17 deceased patients reported as survivors.

Accuracy:  82.7% – Accuracy is the proportion of all records for which the classifier predicted the class label correctly.

Specificity:  0.880 – (True Negatives)/(True Negatives + False Positives)

Specificity is defined as the proportion of actual negative cases that were classified as negative, or the fraction of actual survivors that were classified as survivors.  In this model, 103 actual surviving patients were classified correctly as survivors.  There were 14 false positives, or 14 actual survivors classified incorrectly as deceased.

Sensitivity or Recall:  0.726 – (True Positives)/(True Positives + False Negatives)

Sensitivity is defined as the proportion of positive cases that were classified correctly as positive, or the proportion of actually deceased patients that were classified as deceased.  In this model, 45 actual deceased patients were correctly classified as deceased.  There were 17 false negatives, or 17 actual deceased patients incorrectly classified as survivors.

Note:  Since the objective of this model is to correctly identify which patients will succumb to heart failure, Sensitivity is an important statistic: a physician needs to be able to accurately predict which patients require mitigation.

Precision:  0.763 – (True Positives)/(True Positives + False Positives)

Precision is defined as the proportion of positive classifications that are true positives.  In this model, 45 actual deceased patients were classified correctly as deceased.  There were 14 false positives, or 14 actual survivors classified incorrectly as deceased.

F-1 Score:  0.744 – 2 x (Precision x Sensitivity)/(Precision + Sensitivity)

The F-1 Score provides a statistic that balances Precision and Sensitivity, which is especially useful when the class distribution is uneven, as in this example (117 survivors vs 62 deceased).  The closer the F-1 Score is to 1 (the upper bound), the better the precision and recall.
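The training metrics above can be reproduced from the four confusion matrix counts reported in the text (45 true positives, 17 false negatives, 14 false positives, 103 true negatives). The short Python sketch below is only an arithmetic check; the variable names are chosen for illustration and are not part of the software's output.

```python
# Confusion matrix counts for the training partition, as reported in the text
tp, fn = 45, 17    # deceased patients classified as deceased / as survivors
fp, tn = 14, 103   # survivors classified as deceased / as survivors
total = tp + fn + fp + tn                      # 179 training records

overall_error_pct = 100 * (fp + fn) / total    # misclassified records, in percent
accuracy = (tp + tn) / total
specificity = tn / (tn + fp)
sensitivity = tp / (tp + fn)                   # also called recall
precision = tp / (tp + fp)
f1 = 2 * (precision * sensitivity) / (precision + sensitivity)

print(round(overall_error_pct, 2))  # 17.32
print(round(accuracy, 3))           # 0.827
print(round(specificity, 3))        # 0.88
print(round(sensitivity, 3))        # 0.726
print(round(precision, 3))          # 0.763
print(round(f1, 3))                 # 0.744
```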

Success Class and Success Probability simply report the settings for these two values as entered on the Random Trees: Classification Data dialog.

View individual records and their classifications beneath Training:  Classification Details. 

Click the CRandTrees_ValidationScore tab to view the Summary Results for the Validation partition.

Figure 16:  Validation:  Classification Summary

The overall error for the validation partition was 24.17%, with 19 false positives (surviving patients reported as deceased) and 10 false negatives (deceased patients reported as survivors).

Note the following metrics:

Accuracy: 75.83%

Specificity: 0.779

Sensitivity or Recall:  0.706

Precision: 0.558

F-1 Score:  0.623
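The text reports only the false positive and false negative counts for the validation partition. As a consistency check, the sketch below infers the remaining two counts from the reported Sensitivity and the 120 validation records, then recomputes every metric. The inferred true positive and true negative counts are a back-calculation made here, not values read from the worksheet.

```python
# Values reported in the text for the validation partition
n_records = 120
fp, fn = 19, 10
reported_sensitivity = 0.706

# Back-calculation (an inference): sensitivity = tp / (tp + fn), solved for tp
tp = round(fn * reported_sensitivity / (1 - reported_sensitivity))
tn = n_records - tp - fp - fn

overall_error_pct = 100 * (fp + fn) / n_records
accuracy_pct = 100 * (tp + tn) / n_records
specificity = tn / (tn + fp)
sensitivity = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * (precision * sensitivity) / (precision + sensitivity)

print(tp, tn)                       # 24 67
print(round(overall_error_pct, 2))  # 24.17
print(round(accuracy_pct, 2))       # 75.83
print(round(specificity, 3))        # 0.779
print(round(sensitivity, 3))        # 0.706
print(round(precision, 3))          # 0.558
print(round(f1, 3))                 # 0.623
```

The recomputed values agree with the reported metrics. Note that Precision drops from 0.763 on the training partition to 0.558 here, because 19 of the 43 patients flagged as deceased were in fact survivors.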

These steps were repeated several times, each time adding further Selected Variables in order of the importance or significance found by Feature Selection.  The results are summarized in the table below.

The lowest Overall Error in the Validation Partition for any of the variable combinations occurs when just two variables, ejection_fraction and serum_creatinine, are present in the fitted model.  This fitted model also exhibits the highest Accuracy, Sensitivity, Precision and F-1 Score metrics.  These results suggest that, by obtaining these two measurements for a patient, a physician can determine whether the patient should undergo some type of mitigation for their heart failure diagnosis.