1. Open or upload the Boston_Housing.xlsx example dataset. A portion of the dataset is shown below. The last variable, CAT.MEDV, is a discrete classification of the MEDV variable and will not be used in this example.
On the XLMiner ribbon, from the Applying Your Model tab, select Help - Examples, then Forecasting/Data Mining Examples to open the Boston_Housing.xlsx from the data sets folder. A portion of the data set is shown below.
2. First, we partition the data into training and validation sets using the Standard Data Partition defaults with percentages of 60% of the data randomly allocated to the Training Set and 40% of the data randomly allocated to the Validation Set. For more information on partitioning a dataset, see the Data Mining Partitioning chapter within the Analytic Solver Data Mining Reference Guide.
Standard Data Partition Dialog
3. With the STDPartition worksheet selected, click Predict – Linear Regression to open the Linear Regression Data tab.
4. Select MEDV as the Output Variable, CHAS as a Categorical Variable and all remaining variables (except CAT. MEDV) as Selected variables.
All supervised algorithms include a new Simulation tab. This tab uses the functionality from the Generate Data feature (described in the What’s New section of this guide and then more in depth in the Analytic Solver Data Mining Reference Guide) to generate synthetic data based on the training partition, and uses the fitted model to produce predictions for the synthetic data. The resulting report, LinReq_Simulation, will contain the synthetic data, the predicted values and the Excel-calculated Expression column, if presentSince this new functionality does not support categorical variables, these types of variables will not be present in the model, only continuous variables.
If the new Simulation feature is not invoked, the categorical variable, CHAS may be added to the model as a categorical variable. However, in this specific instance, since the CHAS variable is the only Categorical Variable selected and it's values are binary (0/1), the output results will be the same as if you selected this variable as a Selected (or scale) Variable. If either 1. there were more categorical variables, 2. the CHAS variable was non-binary, or 3, another algorithm besides Mulitple Linear Regression or Logistic Regression was selected, this would not be true.
Linear Regression Dialog, Data tab
5. Click Next to advance to the Parameters tab.
6. If Fit Intercept is selected, the intercept term will be fitted, otherwise there will be no constant term in the equation. Leave this option selected for this example.
7. Under Regression: Display, select all 6 display options to display each in the output.
Under Statistics, select:
- Variance-Covariance Matrix
- Multicollinearity Diagnostics
Under Advanced, select:
- Analysis of Coefficients
- Analysis of Residuals
- Influence Diagnostics
- Confidence/Prediction Intervals
8. When you have a large number of predictors and you would like to limit the model to only the significant variables, click Feature Selection to open the Feature Selection dialog and select Perform Feature Selection at the top of the dialog. The default setting for Maximum Subset Size is 12. This option can take on values of 1 up to N where N is the number of Selected Variables. The default setting is N.
Feature Selection Dialog
Analytic Solver Data Mining offers five different selection procedures for selecting the best subset of variables.
- Backward Elimination in which variables are eliminated one at a time, starting with the least significant. If this procedure is selected, FOUT is enabled. A statistic is calculated when variables are eliminated. For a variable to leave the regression, the statistic’s value must be less than the value of FOUT (default = 2.71).
- Forward Selection in which variables are added one at a time, starting with the most significant. If this procedure is selected, FIN is enabled. On each iteration of the Forward Selection procedure, each variable is examined for the eligibility to enter the model. The significance of variables is measured as a partial F-statistic. Given a model at a current iteration, we perform an F Test, testing the null hypothesis stating that the regression coefficient would be zero if added to the existing set if variables and an alternative hypothesis stating otherwise. Each variable is examined to find the one with the largest partial F-Statistic. The decision rule for adding this variable into a model is: Reject the null hypothesis if the F-Statistic for this variable exceeds the critical value chosen as a threshold for the F Test (FIN value), or Accept the null hypothesis if the F-Statistic for this variable is less than a threshold. If the null hypothesis is rejected, the variable is added to the model and selection continues in the same fashion, otherwise the procedure is terminated.
- Sequential Replacement in which variables are sequentially replaced and replacements that improve performance are retained.
- Stepwise selection is similar to Forward selection except that at each stage, Analytic Solver Data Mining considers dropping variables that are not statistically significant. When this procedure is selected, the Stepwise selection options FIN and FOUT are enabled. In the stepwise selection procedure a statistic is calculated when variables are added or eliminated. For a variable to come into the regression, the statistic’s value must be greater than the value for FIN (default = 3.84). For a variable to leave the regression, the statistic’s value must be less than the value of FOUT (default = 2.71). The value for FIN must be greater than the value for FOUT.
- Best Subsets where searches of all combinations of variables are performed to observe which combination has the best fit. (This option can become quite time consuming depending on the number of input variables.) If this procedure is selected, Number of best subsets is enabled.
9. Click Done to accept the default choice, Backward Elimination with a Maximum Subset Size of 3 and an F-out setting of 2.71, and return to the Parameters tab, then click Next to advance to the Scoring tab.
Linear Regression Dialog, Parameters tab
10. Click Next to proceed to the Scoring tab.
11. Select all four options for Score Training/Validation data.
When Detailed report is selected, Analytic Solver Data Mining will create a detailed report of the Discriminant Analysis output.
When Summary report is selected (the default), Analytic Solver Data Mining will create a report summarizing the Discriminant Analysis output.
When Lift Charts is selected, Analytic Solver Data Mining will include Lift Chart and ROC Curve plots in the output.
When Frequency Chart is selected, a frequency chart will be displayed when the LinReg_TrainingScore and LinReg_ValidationScore worksheets are selected. This chart will display an interactive application similar to the Analyze Data feature, explained in detail in the Analyze Data chapter that appears earlier in this guide. This chart will include frequency distributions of the actual and predicted responses individually, or side-by-side, depending on the user’s preference, as well as basic and advanced statistics for variables, percentiles, six sigma indices.
Since we did not create a test partition, the options for Score test data are disabled. See the chapter “Data Mining Partitioning” for information on how to create a test partition.
See the Scoring New Data chapter within the Analytic Solver Data Mining User Guide for more information on Score New Data in options.
LinReg dialog, Scoring tab
12. Click Next to advance to the Simulation tab.
13. Select Simulation Response Prediction to enable all options on the Simulation tab of the Linear Regression dialog.
Simulation tab: All supervised algorithms include a new Simulation tab. This tab uses the functionality from the Generate Data feature (described earlier in this guide) to generate synthetic data based on the training partition, and uses the fitted model to produce predictions for the synthetic data. The resulting report, LinReg_Simulation, will contain the synthetic data, the predicted values and the Excel-calculated Expression column, if present. In addition, frequency charts containing the Predicted, Training, and Expression (if present) sources or a combination of any pair may be viewed, if the charts are of the same type.
Linear Regression dialog, Simulation tab
Evaluation: Select Calculate Expression to amend an Expression column onto the frequency chart displayed on the LinReg_Simulation output sheet. Expression can be any valid Excel formula that references a variable and the response as [@COLUMN_NAME]. Click the Expression Hints button for more information on entering an expression.
For the purposes of this example, leave all options at their defaults in the Distribution Fitting, Correlation Fitting and Sampling sections of the dialog. For Expression, enter the following formula to display the price per room of each record.
Note that variable names are case sensitive.
For more information on the remaining options shown on this dialog in the Distribution Fitting, Correlation Fitting and Sampling sections, see the Generate Data chapter that appears earlier in this guide.
14. Click Finish to run Discriminant Analysis on the example dataset.
This result worksheet includes 8 segments: Output Navigator, Inputs, Regression Summary, Predictor Screening, Coefficients, ANOVA, Variance-Covariance Matrix of Coefficients and Multicollinarity Diagnostics.
Output Navigator: The Output Navigator appears at the top of all result worksheets. Use this feature to quickly navigate to all reports included in the output.
On the Output Navigator, click the Predictors hyperlink to display the Model Predictors table. In Analytic Solver Platform, Analytic Solver Pro, XLMiner Platform, and XLMiner Pro V2015, a new pre-processing feature selection step has been added to prevent predictors causing rank deficiency of the design matrix from becoming part of the model. Included and excluded predictors are shown in the Model Predictors table. In this model, there were no excluded predictors. All predictors were eligible to enter the model passing the tolerance threshold of 5.23E-10. This denotes a tolerance beyond which a variance-covariance matrix is not exactly singular to within machine precision. The test is based on the diagonal elements of the triangular factor R resulting from Rank-Revealing QR Decomposition. Predictors that do not pass the test are excluded.
If a predictor is excluded, the corresponding coefficient estimates will be 0 in the regression model and the variable-covariance matrix would contain all zeros in the rows and columns that correspond to the excluded predictor. Multicollinearity diagnostics, variable selection, and other remaining output is calculated for the reduced model.
The design matrix may be rank-deficient for several reasons. The most common cause of an ill-conditioned regression problem is the presence of feature(s) that can be exactly or approximately represented by a linear combination of other feature(s). For example, assume that among predictors you have three input variables X, Y, and Z, where Z = a * X + b * Y, where a and b are constants. This will cause the design matrix to not have a full rank. Therefore, one of these three variables will not pass the threshold for entrance and will be excluded from the final regression model.
See the following Model Predictors table example with three excluded predictors: Opening Theatre, Genre_Romantic, and Studio_IRS.
On the Output Navigator, click the Train. Score - Detailed Rep. link to open the Multiple Linear Regression - Prediction of Training Data table. Of primary interest in a data-mining context, will be the predicted and actual values for each record, along with the residual (difference) and Confidence and Prediction Intervals for each predicted value.
XLMiner produces 95% Confidence and Prediction Intervals for the predicted values. Typically, Prediction Intervals are more widely utilized as they are a more robust range for the predicted value. For a given record, the Confidence Interval gives the mean value estimation with 95% probability. This means that with 95% probability, the regression line will pass through this interval. The Prediction Interval takes into account possible future deviations of the predicted response from the mean. There is a 95% chance that the predicted value will lie within the Prediction interval.
XLMiner displays The Total sum of squared errors summaries for both the Training and Validation Sets on the MLR_Output worksheet. The total sum of squared errors is the sum of the squared errors (deviations between predicted and actual values), and the root mean square error (square root of the average squared error). The average error is typically very small, because positive prediction errors tend to be counterbalanced by negative ones.
Because the optin was selected on the Multiple Linear Regression - Advanced Options dialog, a variety of residual and collinearity diagnostics output is available.
On the Output Navigator, click the Variable Selection link to display the Variable Selection table that displays a list of models generated using the selections from the Variable Selection table. When Backward elimination is used, Multiple Linear Regression may stop early when there is no variable eligible for elimination, as evidenced in the table below (i.e., there are no subsets with less than 12 coefficients).
The error values calculated are
RSS: The residual sum of squares, or the sum of squared deviations between the predicted probability of success and the actual value (1 or 0).
Cp: Mallows Cp (Total squared error) is a measure of the error in the best subset model, relative to the error incorporating all variables. Adequate models are those for which Cp is roughly equal to the number of parameters in the model (including the constant), and/or Cp is at a minimum
R-Squared: R-squared Goodness-of-fit
Adj. R-Squared: Adjusted R-Squared values
Probability is a quasi hypothesis test of the proposition that a given subset is acceptable; if Probability < .05 we can rule out that subset.
Compare the RSS value as the number of coefficients in the subset decreases from 13 to 12 (6784.366 to 6811.265). The RSS for 12 coefficients is just slightly higher than the RSS for 13 coefficients suggesting that a model with 12 coefficients may be sufficient to fit a regression.
On the Output Navigator, click the Regress. Model link to display the Regression Model table.
The Regression Model table contains the coefficient, the standard error of the coefficient, the p-value and the Sum of Squared Error for each variable included in the model. The Sum of Squared Errors is calculated as each variable is introduced in the model, beginning with the constant term and continuing with each variable as it appears in the data set.
Summary statistics (to the above right) show the residual degrees of freedom (#observations - #predictors), the R-squared value, a standard deviation type measure for the model (i.e., has a chi-square distribution), and the Residual Sum of Squares error.
The R-squared value shown here is the r-squared value for a logistic regression model, defined as
R2 = (D0-D)/D0 ,
where, D is the Deviance based on the fitted model and D0 is the deviance based on the null model. The null model is defined as the model containing no predictor variables apart from the constant.
If a variable has been eliminated by Rank-Revealing QR Decomposition, the variable appears in red in the Regression Model table with a 0 Coefficient, Std. Error, CI Lower, CI Upper, and RSS Reduction and N/A for the t-Statistic and P-Values. The following example Regression Model table displays the results when three predictors (Opening Theaters, Genre_Romantic Comedy, and Studio_IRS) are eliminated.
On the Output Navigator, click the Collinearity Diags link to display the Collinearity Diagnostics table. This table assesses whether two or more variables so closely track one another as to provide essentially the same information. As you can see, the NOX variable was ignored.
The columns represent the variance components (related to principal components in multivariate analysis), while the rows represent the variance proportion decomposition explained by each variable in the model. The eigenvalues are those associated with the singular value decomposition of the variance-covariance matrix of the coefficients, while the condition numbers are the ratios of the square root of the largest eigenvalue to all the rest. In general, multicollinearity is likely to be a problem with a high condition number (more than 20 or 30), and high variance decomposition proportions (say more than 0.5) for two or more variables.
Lift Charts and RROC Curves (on the MLR_TrainingLiftChart and MLR_ValidationLiftChart, respectively) are visual aids for measuring model performance. Lift Charts consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model. RROC (regression receiver operating characteristic) curves plot the performance of regressors by graphing over-estimations (predicted values that are too high) versus underestimations (predicted values that are too low.) The closer the curve is to the top-left corner of the graph (the smaller the area above the curve), the better the performance of the model.
After the model is built using the Training Set, the model is used to score on the Training Set and the Validation Set (if one exists). Then the data set(s) are sorted using the predicted output variable value. After sorting, the actual outcome values of the output variable are cumulated and the lift curve is drawn as the number of cases versus the cumulated value. The baseline (red line connecting the origin to the end point of the blue line) is drawn as the number of cases versus the average of actual output variable values multiplied by the number of cases. The decile-wise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's mean output variable value. This bars in this chart indicate the factor by which the MLR model outperforms a random assignment, one decile at a time. Refer to the validation graph below. In the first decile, taking the most expensive predicted housing prices in the dataset, the predictive performance of the model is about 1.7 times better as simply assigning a random predicted value.
In an RROC curve, we can compare the performance of a regressor with that of a random guess (red line) for which over-estimations are equal to under-estimations. Anything to the left of this line signifies a better prediction, and anything to the right signifies a worse prediction. The best possible prediction performance would be denoted by a point at the top-left of the graph at the intersection of the x and y axis. This point is sometimes referred to as the perfect classification. Area Over the Curve (AOC) is the space in the graph that appears above the ROC curve and is calculated using the formula: sigma2 * n2/2 where n is the number of records The smaller the AOC, the better the performance of the model. In this example, we see that the area above the curve in both data sets, or the AOC, is fairly small, which indicates that this model is a good fit to the data.
For information on the MLR_Stored worksheet, see the Scoring New Data section.