This example illustrates how to fit a model with Data Mining's Logistic Regression algorithm, using the Boston_Housing dataset.
Click Help - Example Models on the Data Mining ribbon, then Forecasting/Data Mining Examples and open the example file, Boston_Housing.xlsx.
This dataset includes fourteen variables pertaining to housing prices from census tracts in the Boston area, collected by the US Census Bureau. The figure below displays a portion of the data; observe the last column (CAT. MEDV). This variable has been derived from the MEDV variable by assigning a 1 for MEDV levels at or above 30 (>= 30) and a 0 for levels below 30 (< 30). It serves as the output variable in this example.
First, we partition the data into training and validation sets using the Standard Data Partition defaults of 60% of the data randomly allocated to the Training Set and 40% of the data randomly allocated to the Validation Set. For more information on partitioning a dataset, see the Data Mining Partitioning chapter.
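A random 60/40 split of this kind can be sketched in a few lines of Python. This is only an illustration, not how Analytic Solver partitions data internally: the record count of 506 is the commonly cited size of the Boston housing data, and the function name and seed are hypothetical.

```python
import random

def partition(n_records, train_fraction=0.6, seed=12345):
    """Randomly split record indices into training and validation sets."""
    indices = list(range(n_records))
    rng = random.Random(seed)          # hypothetical seed, for repeatability only
    rng.shuffle(indices)
    n_train = round(n_records * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

# 506 is assumed here as the number of records in the dataset.
train_idx, valid_idx = partition(506)
print(len(train_idx), len(valid_idx))  # 304 202
```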
This example develops a model for predicting whether the median price of a house in a census tract in the Boston area is at or above 30 (in $1000's).
Click Classify - Logistic Regression on the Data Mining ribbon. The Logistic Regression dialog appears.
The categorical variable CAT.MEDV has been derived from the MEDV variable (Median value of owner-occupied homes in $1000's) by assigning a 1 for MEDV levels at or above 30 (>= 30) and a 0 for levels below 30 (< 30). This will be our Output Variable.
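The derivation of CAT.MEDV can be written as a one-line rule. The sketch below is illustrative only; the function name and the sample MEDV values are hypothetical and are not rows from Boston_Housing.xlsx.

```python
def cat_medv(medv, threshold=30.0):
    """Return 1 when MEDV (in $1000's) is at or above the threshold, else 0."""
    return 1 if medv >= threshold else 0

# Hypothetical MEDV values, chosen to show both sides of the threshold.
sample_medv = [24.0, 30.0, 34.7, 21.6]
labels = [cat_medv(value) for value in sample_medv]
print(labels)  # [0, 1, 1, 0]
```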
Select the nominal categorical variable, CHAS, as a Categorical Variable. This variable is a 1 if the housing tract is located adjacent to the Charles River.
Select the remaining variables as Selected Variables.
One major assumption of Logistic Regression is that each observation provides equal information. Analytic Solver Data Mining offers an opportunity to provide a Weight Variable. Using a Weight Variable allows the user to allocate a weight to each record. A record with a large weight will influence the model more than a record with a smaller weight. For the purposes of this example, a Weight Variable will not be used.
Choose the value that will be the indicator of “Success” by clicking the down arrow next to Success Class. In this example, we will use the default of 1.
Enter a value between 0 and 1 for Success Probability Cutoff. If the calculated probability of success for a record is less than this value, then a 0 will be entered for the class value; otherwise a 1 will be entered for the class value. In this example, we will keep the default of 0.5.
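The cutoff rule described above amounts to a single comparison. The following is a minimal sketch; the function and parameter names are hypothetical, and the defaults mirror the settings used in this example (Success Class of 1, cutoff of 0.5).

```python
def classify(prob_success, cutoff=0.5, success_class=1, failure_class=0):
    """Assign the success class unless the probability falls below the cutoff."""
    return failure_class if prob_success < cutoff else success_class

# Hypothetical predicted probabilities of success.
predictions = [classify(p) for p in (0.12, 0.49, 0.50, 0.93)]
print(predictions)  # [0, 0, 1, 1]
```

Raising the cutoff makes the model more conservative about assigning records to the Success class; lowering it does the opposite.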
Click Next to advance to the Logistic Regression - Parameters dialog.
Keep Fit Intercept selected, the default setting, to fit the Logistic Regression intercept. If this option is not selected, Analytic Solver will force the intercept term to 0.
Keep the default of 50 for the Maximum # iterations. Estimating the coefficients in the Logistic Regression algorithm requires an iterative non-linear maximization procedure. You can specify a maximum number of iterations to prevent the program from getting lost in very lengthy iterative loops. This value must be an integer greater than 0 and less than or equal to 100 (1 <= value <= 100).
Click Prior Probability to open the Prior Probability dialog.
Analytic Solver will incorporate prior assumptions about how frequently the different classes occur in the partitions.
- If Empirical is selected, Analytic Solver will assume that the probability of encountering a particular class in the dataset is the same as the frequency with which it occurs in the training data.
- If Uniform is selected, Analytic Solver will assume that all classes occur with equal probability.
- If Manual is selected, the user can enter the desired class probability value.
For this example, click Done to select the default of Empirical and close the dialog.
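The difference between the Empirical and Uniform options can be illustrated with a short sketch. The function names and the label list are hypothetical; this shows only what the two assumptions mean, not how Analytic Solver applies them when fitting.

```python
from collections import Counter

def empirical_priors(labels):
    """Class probabilities equal to the class frequencies in the training data."""
    counts = Counter(labels)
    return {cls: counts[cls] / len(labels) for cls in sorted(counts)}

def uniform_priors(labels):
    """Every class is assumed to occur with equal probability."""
    classes = sorted(set(labels))
    return {cls: 1 / len(classes) for cls in classes}

# Hypothetical training labels: three failures and one success.
training_labels = [0, 0, 0, 1]
print(empirical_priors(training_labels))  # {0: 0.75, 1: 0.25}
print(uniform_priors(training_labels))    # {0: 0.5, 1: 0.5}
```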
Select Variance - Covariance Matrix. When this option is selected, Analytic Solver will display the coefficient covariance matrix in the output. Entries in the matrix are the covariances between the indicated coefficients. The “on-diagonal” values are the estimated variances of the corresponding coefficients.
Select Multicollinearity Diagnostics. At times, variables can be highly correlated with one another which can result in large standard errors for the affected coefficients. Analytic Solver will display information useful in dealing with this problem if this option is selected.
Select Analysis of Coefficients. When this option is selected, Analytic Solver will produce a table with all coefficient information, such as the Estimate, Odds, Standard Error, etc. When this option is not selected, Analytic Solver will only print the Estimates.
When you have a large number of predictors and you would like to limit the model to only the significant variables, click Feature Selection to open the Feature Selection dialog and select Perform Feature Selection at the top of the dialog.
Keep the default selection for Maximum Subset Size. This option can take on values of 1 up to N where N is the number of Selected Variables. The default setting is N. Use some caution when setting this option to a low value, especially if your model contains categorical variables. If the number of total features (continuous variables + encoded categorical variables) is substantially larger than this option setting, then this feature will filter out all subsets (resulting in a blank Feature Selection table).
Analytic Solver offers five different selection procedures for selecting the best subset of variables.
- Backward Elimination in which variables are eliminated one at a time, starting with the least significant. If this procedure is selected, FOUT is enabled. A statistic is calculated when variables are eliminated. For a variable to leave the regression, the statistic's value must be less than the value of FOUT (default = 2.71).
- Forward Selection in which variables are added one at a time, starting with the most significant. If this procedure is selected, FIN is enabled. On each iteration of the Forward Selection procedure, each variable is examined for eligibility to enter the model. The significance of variables is measured as a partial F-statistic. Given a model at the current iteration, we perform an F Test, testing the null hypothesis that the regression coefficient would be zero if the variable were added to the existing set of variables, against the alternative hypothesis stating otherwise. Each variable is examined to find the one with the largest partial F-Statistic. The decision rule for adding this variable into the model is: Reject the null hypothesis if the F-Statistic for this variable exceeds the critical value chosen as a threshold for the F Test (FIN value), or Accept the null hypothesis if the F-Statistic for this variable is less than the threshold. If the null hypothesis is rejected, the variable is added to the model and selection continues in the same fashion; otherwise the procedure is terminated.
- Sequential Replacement in which variables are sequentially replaced and replacements that improve performance are retained. When this method is selected, the Stepwise selection options FIN and FOUT are disabled.
- Stepwise Selection is similar to Forward selection except that at each stage, Analytic Solver considers dropping variables that are not statistically significant. When this procedure is selected, the Stepwise selection options FIN and FOUT are enabled. In the stepwise selection procedure a statistic is calculated when variables are added or eliminated. For a variable to come into the regression, the statistic's value must be greater than the value for FIN (default = 3.84). For a variable to leave the regression, the statistic's value must be less than the value of FOUT (default = 2.71). The value for FIN must be greater than the value for FOUT.
- Best Subsets where searches of all combinations of variables are performed to observe which combination has the best fit. (This option can become quite time consuming depending on the number of input variables.) If this procedure is selected, Number of best subsets is enabled.
Click Done to accept the default choice, Backward Elimination with a FOUT setting of 2.71, and return to the Parameters dialog, then click Next to advance to the Scoring dialog.
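The Forward Selection decision rule described in the list above can be sketched as follows. This is an illustration under stated assumptions, not Analytic Solver's implementation: it uses the textbook partial F-statistic for adding a single term, and the function names, variable names (X1, X2) and RSS values are all hypothetical. Only the FIN default of 3.84 comes from the dialog.

```python
def partial_f(rss_without, rss_with, n_obs, n_params_with):
    """Textbook partial F-statistic for adding one term to a model."""
    return (rss_without - rss_with) / (rss_with / (n_obs - n_params_with))

def forward_step(rss_current, candidates, n_obs, n_params_current, fin=3.84):
    """Return the candidate to add, or None when no candidate exceeds FIN."""
    best_name, best_f = None, float("-inf")
    for name, rss_with in candidates.items():
        f_stat = partial_f(rss_current, rss_with, n_obs, n_params_current + 1)
        if f_stat > best_f:
            best_name, best_f = name, f_stat
    # Reject the null hypothesis (add the variable) only above the FIN threshold.
    return best_name if best_f > fin else None

# Hypothetical RSS values for a model that currently holds only the constant.
first = forward_step(40.0, {"X1": 30.0, "X2": 39.8}, n_obs=300, n_params_current=1)
# After X1 enters, the remaining candidate barely reduces the RSS.
second = forward_step(30.0, {"X2": 29.9}, n_obs=300, n_params_current=2)
print(first, second)  # X1 None
```

Stepwise Selection adds a second check of the same form with FOUT, so that a variable already in the model can be dropped when its statistic falls below that threshold.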
Select Detailed Report, Summary Report and Lift Charts under both Score Training Data and Score Validation Data. Analytic Solver will create a detailed report, complete with the Output Navigator for ease in routing to specific areas in the output, a report that summarizes the regression output for both datasets, and lift charts, ROC curves, and Decile charts for both partitions.
Since we did not create a test partition when we partitioned our dataset, Score Test Data options are disabled. See the chapter “Data Mining Partitioning” for details on how to create a test set.
For information on scoring in a worksheet or database, please see the “Scoring New Data” chapter in the Analytic Solver Data Mining User Guide.
Click Finish. The logistic regression output is to the right of the STDPartition worksheet. Use the Output Navigator on LogReg_Output to navigate through the output.
Click the Training: Classification Summary link to open the Training: Classification Summary.
A Confusion Matrix is used to evaluate the performance of a classification method. This matrix summarizes the records that were classified correctly and those that were not.
TP stands for True Positive. This is the number of cases classified as belonging to the Success class that actually were members of the Success class. FN stands for False Negative. This is the number of cases that were classified as belonging to the Failure class when they were actually members of the Success class (e.g. patients with cancerous tumors who were told their tumors were benign). FP stands for False Positive. These cases were assigned to the Success class but were actually members of the Failure group (e.g. patients who were told they tested positive for cancer when, in fact, their tumors were benign). TN stands for True Negative. These cases were correctly assigned to the Failure group.
In the Training Dataset, we see 40 records belonging to the Success class were correctly assigned to that class while 7 records belonging to the Success class were incorrectly assigned to the Failure class. In addition, 250 records belonging to the Failure class were correctly assigned to this same class while 7 records belonging to the Failure class were incorrectly assigned to the Success class. The total number of misclassified records was 14 (7+7), which results in an error equal to 4.61%.
Precision is the proportion of records classified as belonging to the Success class that actually belong to it (e.g. the proportion of patients told they have cancer who actually have cancer). Recall (or Sensitivity) measures the percentage of actual positives which are correctly identified as positive (e.g. the proportion of people with cancer who are correctly identified as having cancer). Specificity (also called the true negative rate) measures the percentage of failures correctly identified as failures (e.g. the proportion of people with no cancer being categorized as not having cancer). The F-1 score, which ranges between 1 (a perfect classification) and 0, defines a measure that balances precision and recall.
Precision = TP/(TP+FP)
Sensitivity or True Positive Rate (TPR) = TP/(TP + FN)
Specificity (SPC) or True Negative Rate = TN / (FP + TN)
F1 = 2 * TP / (2 * TP + FP + FN)
Scroll down to view the Training: Classification Summary table.
Click the Validation: Classification Summary link to open the Validation Classification Summary.
In the Validation Dataset, 32 records were correctly classified as belonging to the Success class while 6 cases were incorrectly assigned to the Failure class. 155 cases were correctly classified as belonging to the Failure class. Nine (9) records were incorrectly classified as belonging to the Success class when they were, in fact, members of the Failure class. This resulted in a total classification error of 7.43%.
Scroll down to view the Validation: Classification Summary table. Again, misclassified records appear in red.
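As a check on the formulas, the validation counts quoted above (TP = 32, FN = 6, TN = 155, FP = 9) can be plugged into a small Python sketch. The function name is hypothetical; the formulas are the ones listed for Precision, Sensitivity, Specificity and F1.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute summary metrics from the four confusion matrix counts."""
    total = tp + fn + tn + fp
    return {
        "error": (fn + fp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),          # sensitivity, true positive rate
        "specificity": tn / (fp + tn),     # true negative rate
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Validation partition counts from the Classification Summary above.
metrics = classification_metrics(tp=32, fn=6, tn=155, fp=9)
print(round(metrics["error"] * 100, 2))   # 7.43
print(round(metrics["precision"], 4))     # 0.7805
print(round(metrics["recall"], 4))        # 0.8421
print(round(metrics["specificity"], 4))   # 0.9451
print(round(metrics["f1"], 4))            # 0.8101
```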
Click the Predictor Screening hyperlink in the Output Navigator to display the Model Predictors table. In V2015, a new preprocessing feature selection step was added to prevent predictors that cause rank deficiency of the design matrix from becoming part of the model. Included and excluded predictors are shown in the table below. In this model there were no excluded predictors. All predictors were eligible to enter the model, passing the tolerance threshold of 5.2587E-10. This denotes a tolerance beyond which a variance-covariance matrix is not exactly singular to within machine precision. The test is based on the diagonal elements of the triangular factor R resulting from Rank-Revealing QR Decomposition. Predictors that do not pass the test are excluded.
Note: If a predictor is excluded, the corresponding coefficient estimates will be 0 in the regression model and the variance-covariance matrix will contain all zeros in the rows and columns that correspond to the excluded predictor. Multicollinearity diagnostics, variable selection and other remaining output will be calculated for the reduced model.
The design matrix may be rank-deficient for several reasons. The most common cause of an ill-conditioned regression problem is the presence of feature(s) that can be exactly or approximately represented by a linear combination of other feature(s). For example, assume that among predictors you have 3 input variables X, Y and Z where Z = a * X + b * Y where a and b are constants. This will cause the design matrix to not have a full rank. Therefore, one of these 3 variables will not pass the threshold for entrance and will be excluded from the final regression model.
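The Z = a * X + b * Y case can be reproduced with a small sketch. Note the substitutions: this uses a plain (unpivoted) Gram-Schmidt QR written in pure Python rather than the Rank-Revealing QR Decomposition used by Analytic Solver, and the tolerance, function name and data values are hypothetical. It shows only why a diagonal element of R collapses toward zero for a redundant column.

```python
import math

def screen_predictors(columns, tolerance=1e-9):
    """Return indices of columns whose R diagonal entry falls below tolerance."""
    basis, excluded = [], []
    for index, column in enumerate(columns):
        residual = list(column)
        for q in basis:
            # Remove the part of the column already explained by earlier columns.
            projection = sum(qi * ri for qi, ri in zip(q, residual))
            residual = [ri - projection * qi for ri, qi in zip(residual, q)]
        norm = math.sqrt(sum(ri * ri for ri in residual))  # diagonal entry of R
        if norm < tolerance:
            excluded.append(index)
        else:
            basis.append([ri / norm for ri in residual])
    return excluded

# Hypothetical predictors with Z = 2 * X + 3 * Y.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 0.0, 3.0]
z = [2 * xi + 3 * yi for xi, yi in zip(x, y)]
print(screen_predictors([x, y, z]))  # [2]
```

Without pivoting, the column that gets excluded depends on the order in which the predictors are processed; here Z is last, so Z is the one flagged.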
Since we selected Perform Feature Selection on the Feature Selection dialog, Analytic Solver has produced the following output on the LogReg_FS tab which displays the variables that are included in the subsets. This table contains the three best subsets of variables with up to a maximum of 12 features (plus the constant). This table contains the five subsets with the highest Residual Sum of Squares values.
In this table, every model includes a constant term (since Fit Intercept was selected) and one or more variables as the additional coefficients. We can use any of these models for further analysis simply by clicking the hyperlink under Subset ID in the far left column. The Logistic Regression dialog will open. Click Finish to run Logistic Regression using the variable subset as listed in the table.
The Best Subsets Details includes three statistics: RSS (Residual Sum of Squares), Mallows's Cp and Probability. RSS is the residual sum of squares, or the sum of squared deviations between the predicted probability of success and the actual value (1 or 0). #Coefficients is the number of features included in each subset. Mallows's Cp is a measure of the error in the best subset model, relative to the error incorporating all variables. Adequate models are those for which Cp is roughly equal to the number of parameters in the model (including the constant), and/or Cp is at a minimum.
To compute the Probability statistic, an F Test is performed to determine whether the full model (F) provides a significantly better fit than a reduced model (R). "Significant" here refers to statistical inference, since in terms of RSS the model with more predictors would always provide a fit at least as good as the reduced model. The null hypothesis of this F test is that the full model (F) does not provide a significantly better fit compared to the reduced model (R).
The "Probability" metric corresponds exactly to the p-value for the computed F-Statistic, i.e. the complement of the CDF of the F distribution. Given a threshold, say T = 0.05, we reject the null hypothesis if the p-value is less than T; otherwise, there is insufficient evidence to reject the null hypothesis. Rejecting the null hypothesis means that the full model (F) does provide a significantly better fit than the reduced model (R).
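The "roughly equal to the number of parameters" guideline follows from the usual textbook definition of Mallows's Cp, sketched below. Whether Analytic Solver uses exactly this form is an assumption, and the function name, RSS values and record count are hypothetical.

```python
def mallows_cp(rss_subset, n_params_subset, rss_full, n_params_full, n_obs):
    """Textbook Mallows's Cp; parameter counts include the constant."""
    sigma2_full = rss_full / (n_obs - n_params_full)  # error variance, full model
    return rss_subset / sigma2_full - n_obs + 2 * n_params_subset

# Hypothetical values: a full model with 13 parameters and a subset with 8.
cp_full = mallows_cp(20.0, 13, rss_full=20.0, n_params_full=13, n_obs=300)
cp_subset = mallows_cp(21.0, 8, rss_full=20.0, n_params_full=13, n_obs=300)
print(round(cp_full, 2), round(cp_subset, 2))
```

For the full model, Cp always equals its own parameter count, which is why a subset with Cp close to its number of parameters is considered adequate.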
Note: N/A values that the user observes for full models (F) simply indicate that the F-statistic and p-value were not computed. These statistics are not computed for the full model (F), containing all predictors, since the test results in 0 degrees of freedom for the F distribution, which is not defined.
In this example, we cannot reject the null hypothesis. Note that it is up to the user on how to use or interpret this information for his/her application, especially when comparing p-values that are well outside of "rejecting" range.
Note: A blank Feature Selection table can be returned in the results if the Feature Selection Maximum Subset Value is too small. If you notice a blank table in your results, increase the Maximum Subset Value on the Feature Selection dialog (on the Logistic Regression Parameters tab).
Model terms are shown in the Coefficients output on the LogReg_Output sheet.
This table contains the coefficient estimate, the standard error of the coefficient, the p-value, the odds ratio for each variable (which is simply e^x, where x is the value of the coefficient) and the confidence interval for the odds.
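The relationship between a coefficient and its odds ratio can be sketched as follows. The interval shown is the common Wald form, exp(coefficient ± z × standard error); that Analytic Solver computes its confidence interval this way is an assumption, and the coefficient and standard error are hypothetical values.

```python
import math

def odds_ratio(coefficient, std_error, z=1.96):
    """Odds ratio e^x with an approximate 95% Wald confidence interval."""
    odds = math.exp(coefficient)
    lower = math.exp(coefficient - z * std_error)
    upper = math.exp(coefficient + z * std_error)
    return odds, lower, upper

# Hypothetical coefficient estimate and standard error.
odds, ci_lower, ci_upper = odds_ratio(0.5, 0.1)
print(round(odds, 4), round(ci_lower, 4), round(ci_upper, 4))
```

An odds ratio above 1 means that a one-unit increase in the variable raises the odds of the Success class; a confidence interval that includes 1 suggests the effect may not be distinguishable from zero.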
Summary statistics, found directly above in the Regression Summary, show the residual degrees of freedom (#observations - #predictors), a standard deviation type measure for the model (which typically has a chi-square distribution), the number of iterations required to fit the model, and the Multiple R-squared value.
The multiple R-squared value shown here is the R-squared value for a logistic regression model, defined as
R^2 = (D_0 - D) / D_0,
where D is the deviance based on the fitted model and D_0 is the deviance based on the null model. The null model is defined as the model containing no predictor variables apart from the constant.
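A minimal sketch of this calculation is shown below. It assumes the standard binomial deviance, -2 times the log-likelihood, and a null model that predicts the overall success rate for every record; the function names and the data are hypothetical.

```python
import math

def deviance(actuals, probabilities):
    """Binomial deviance: -2 times the log-likelihood of the 0/1 outcomes."""
    log_likelihood = sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(actuals, probabilities)
    )
    return -2.0 * log_likelihood

def deviance_r_squared(actuals, probabilities):
    """R-squared defined as (D0 - D) / D0."""
    base_rate = sum(actuals) / len(actuals)      # null model: constant only
    d0 = deviance(actuals, [base_rate] * len(actuals))
    d = deviance(actuals, probabilities)
    return (d0 - d) / d0

# Hypothetical outcomes and fitted probabilities of success.
y = [1, 0, 1, 0, 0, 1]
p = [0.9, 0.2, 0.7, 0.1, 0.3, 0.6]
print(round(deviance_r_squared(y, p), 2))  # 0.6
```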
Note: If a variable has been eliminated by Rank-Revealing QR Decomposition, the variable will appear in red in the Regression Model table with a 0 Coefficient, Std. Error, CI Lower, CI Upper, and RSS Reduction and N/A for the t-Statistic and P-Values.
Collinearity Diagnostics help assess whether two or more variables so closely track one another as to provide essentially the same information.
The columns represent the variance components (related to principal components in multivariate analysis), while the rows represent the variance proportion decomposition explained by each variable in the model. The eigenvalues are those associated with the singular value decomposition of the variance-covariance matrix of the coefficients, while the condition numbers are the ratios of the square root of the largest eigenvalue to all the rest. In general, multicollinearity is likely to be a problem with a high condition number (more than 20 or 30), and high variance decomposition proportions (say more than 0.5) for two or more variables.
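The condition numbers can be computed directly from the eigenvalues. The sketch below reads the description as the usual condition index, the square root of the largest eigenvalue divided by each eigenvalue; that reading, the function name and the eigenvalues are assumptions for illustration.

```python
import math

def condition_numbers(eigenvalues):
    """Square root of the ratio of the largest eigenvalue to each eigenvalue."""
    largest = max(eigenvalues)
    return [math.sqrt(largest / value) for value in eigenvalues]

# Hypothetical eigenvalues; the last one is very small.
indices = condition_numbers([4.0, 1.0, 0.0025])
flagged = [round(value, 1) for value in indices if value > 30]
print([round(value, 1) for value in indices], flagged)
```

With these values the third component has a condition number of about 40, above the 20 to 30 range mentioned above, so it would be worth checking which variables have high variance proportions on that component.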
Click the LogReg_TrainingLiftChart and LogReg_ValidationLiftChart tabs to navigate to the Training and Validation Data Lift Charts, Decile Charts and ROC Curves.
Lift Charts and ROC Curves are visual aids that help users evaluate the performance of their fitted models. Charts found on the LogReg_TrainingLiftChart tab were calculated using the Training Data Partition. Charts found on the LogReg_ValidationLiftChart tab were calculated using the Validation Data Partition. It is good practice to look at both sets of charts to assess model performance on both datasets.
Note: To view these charts in the Cloud app, click the Charts icon on the Ribbon, select LogReg_TrainingLiftChart or LogReg_ValidationLiftChart for Worksheet and Decile Charts, ROC Charts or Gain Charts.
Decile-wise Lift Chart, ROC Curve and Lift Charts for Training Partition
Decile-wise Lift Chart, ROC Curve and Lift Charts for Validation Partition
After the model is built using the training data set, the model is used to score the training data set and the validation data set (if one exists). Then the data set(s) are sorted in decreasing order of the predicted output variable value. After sorting, the actual outcome values of the output variable are cumulated and the lift curve is drawn as the cumulative number of cases in decreasing probability (on the x-axis) vs. the cumulative number of true positives (on the y-axis). The baseline (red line connecting the origin to the end point of the blue line) is a reference line. For a given number of cases on the x-axis, this line represents the expected number of successes if no model existed, and instead cases were selected at random. This line can be used as a benchmark to measure the performance of the fitted model. The greater the area between the lift curve and the baseline, the better the model.
In the Training Lift chart, if we selected 100 cases as belonging to the success class and used the fitted model to pick the members most likely to be successes, we would be right on about 45 of them. Conversely, if we selected 100 random cases, we could expect to be right on about 15 of them. In the Validation Lift chart, if we selected 100 cases as belonging to the success class and used the fitted model to pick the members most likely to be successes, we would be right on about 37 of them. If we selected 100 random cases, we could expect to be right on about 15 of them.
The decile-wise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's mean output variable value. The bars in this chart indicate the factor by which the model outperforms a random assignment, one decile at a time. Refer to the validation graph above. In the first decile, taking the most expensive predicted housing prices in the dataset, the predictive performance of the model is about 4.5 times better than simply assigning a random predicted value.
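The construction of the lift curve and its baseline can be sketched in a few lines. The function names and the six records are hypothetical and serve only to show the sort-then-accumulate procedure described above.

```python
def lift_curve(probabilities, actuals):
    """Cumulative successes after sorting cases by decreasing probability."""
    order = sorted(range(len(actuals)), key=lambda i: probabilities[i], reverse=True)
    cumulative, running = [], 0
    for i in order:
        running += actuals[i]
        cumulative.append(running)
    return cumulative

def baseline(n_cases, total_successes):
    """Expected cumulative successes when cases are selected at random."""
    return [total_successes * k / n_cases for k in range(1, n_cases + 1)]

# Hypothetical predicted probabilities and actual classes (1 = success).
probs = [0.9, 0.1, 0.8, 0.4, 0.3, 0.7]
actual = [1, 0, 0, 1, 0, 1]
print(lift_curve(probs, actual))   # [1, 1, 2, 3, 3, 3]
print(baseline(6, sum(actual)))    # [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
```

Both curves end at the same point, the total number of successes; what matters is how far the lift curve rises above the baseline along the way.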
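One common way to compute a decile-wise lift is to divide the mean actual outcome within each group by the overall mean, as in the sketch below. That this matches Analytic Solver's calculation is an assumption; the function name and data are hypothetical, and the example uses five groups instead of ten to keep the data short.

```python
def decile_lift(probabilities, actuals, n_bins=10):
    """Mean outcome per bin (highest probabilities first) over the overall mean."""
    order = sorted(range(len(actuals)), key=lambda i: probabilities[i], reverse=True)
    ranked = [actuals[i] for i in order]
    overall_mean = sum(ranked) / len(ranked)
    size = len(ranked) // n_bins  # assumes the record count divides evenly
    return [
        (sum(ranked[b * size:(b + 1) * size]) / size) / overall_mean
        for b in range(n_bins)
    ]

# Hypothetical scores (already in decreasing order) and actual classes.
scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
outcomes = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
lifts = decile_lift(scores, outcomes, n_bins=5)
print([round(value, 2) for value in lifts])  # [2.5, 1.25, 1.25, 0.0, 0.0]
```

A bar height of 2.5 in the first group means the model finds successes there at 2.5 times the rate of random selection.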
The Regression ROC curve has been updated in V2017. This new chart compares the performance of the regressor (Fitted Predictor) with an Optimum Predictor Curve and a Random Classifier curve. The Optimum Predictor Curve plots a hypothetical model that would provide perfect classification results. The best possible classification performance is denoted by a point at the top left of the graph, where the true positive rate is 1 and the false positive rate is 0. This point is sometimes referred to as the "perfect classification". The closer the AUC (area under the curve) is to 1, the better the performance of the model. In the Validation Partition, AUC = 0.97, which suggests that this fitted model is a good fit to the data.
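The AUC has a convenient interpretation: it is the probability that a randomly chosen success receives a higher predicted probability than a randomly chosen failure. The sketch below computes it by comparing every such pair, which is a standard equivalent of the area under the ROC curve and not necessarily the method Analytic Solver uses. The function name and data are hypothetical.

```python
def auc(probabilities, actuals):
    """Fraction of (success, failure) pairs ranked correctly; ties count half."""
    successes = [p for p, a in zip(probabilities, actuals) if a == 1]
    failures = [p for p, a in zip(probabilities, actuals) if a == 0]
    wins = 0.0
    for s in successes:
        for f in failures:
            if s > f:
                wins += 1.0
            elif s == f:
                wins += 0.5
    return wins / (len(successes) * len(failures))

# Hypothetical predicted probabilities and actual classes (1 = success).
print(round(auc([0.9, 0.1, 0.8, 0.4, 0.3, 0.7], [1, 0, 0, 1, 0, 1]), 4))  # 0.7778
print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))                            # 1.0
```

A value of 0.5 corresponds to the Random Classifier curve and a value of 1.0 to the Optimum Predictor Curve.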
In V2017, two new charts have been introduced: a new Lift Chart and the Gain Chart. To display these new charts, click the down arrow next to Lift Chart (Original), in the Original Lift Chart, then select the desired chart.
Select Lift Chart (Alternative) to display Analytic Solver Data Mining's new Lift Chart. Each of these charts consists of an Optimum Predictor curve, a Fitted Predictor curve, and a Random Predictor curve. The Optimum Predictor curve plots a hypothetical model that would provide perfect classification for our data. The Fitted Predictor curve plots the fitted model and the Random Predictor curve plots the results from using no model or by using a random guess (i.e. for x% of selected observations, x% of the total number of positive observations are expected to be correctly classified).
The Alternative Lift Chart plots Lift against the Predictive Positive Rate or Support.
Lift Chart (Alternative) and Gain Chart for Training Partition
Lift Chart (Alternative) and Gain Chart for Validation Partition
Click the down arrow and select Gain Chart from the menu. In this chart, the True Positive Rate or Sensitivity is plotted against the Predictive Positive Rate or Support.
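The points on the Gain Chart and the alternative Lift Chart can be sketched together, since both use the Predictive Positive Rate (Support) on the x-axis. The sketch assumes that Support is the fraction of records selected and that Lift is the True Positive Rate divided by Support; these definitions, the function name and the data are assumptions for illustration.

```python
def gain_and_lift(probabilities, actuals):
    """Return (support, true positive rate, lift) for each top-k selection."""
    order = sorted(range(len(actuals)), key=lambda i: probabilities[i], reverse=True)
    n_cases, total_successes = len(actuals), sum(actuals)
    points, true_positives = [], 0
    for k, i in enumerate(order, start=1):
        true_positives += actuals[i]
        support = k / n_cases                   # share of records selected
        tpr = true_positives / total_successes  # share of successes captured
        points.append((support, tpr, tpr / support))
    return points

# Hypothetical predicted probabilities and actual classes (1 = success).
points = gain_and_lift([0.9, 0.1, 0.8, 0.4, 0.3, 0.7], [1, 0, 0, 1, 0, 1])
for support, tpr, lift in points:
    print(round(support, 2), round(tpr, 2), round(lift, 2))
```

Every curve ends at a Support of 1 with a True Positive Rate of 1 and a Lift of 1, because selecting all records captures all successes at exactly the random rate.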
See the chapter "Score New Data" within the Analytic Solver Data Mining User Guide for more information on the LogReg_Stored worksheet.