Introduction

Non-numeric data values can be text, alphanumeric (mix of text and numbers), or numeric values with no numerical significance (such as postal codes). Such variables are called categorical variables, where every unique value of the variable is a separate category.

Working with categorical data imposes some limitations. For example, if the data contains too many categories, several categories may need to be combined into one. In addition, a data mining technique may require numeric data rather than categorical data. The features included in the Transform group address these requirements.

Analytic Solver Data Science provides options to transform data in the following ways:

Create Dummy Variables: When this feature is used, a string variable is transformed into a set of dummy variables. Analytic Solver Comprehensive and Data Science can handle string variables with an unlimited number of distinct values. (If using Analytic Solver Upgrade, variables are limited to 30 distinct values.) Imagine a variable called Language that has data values English, French, German, and Spanish. Running this transformation will result in the creation of four new variables: Language_English, Language_French, Language_German, and Language_Spanish. Each of these variables takes a value of either 0 or 1, depending on the value of the Language variable in the record. For instance, if a particular record has the value German, then among the dummy variables created, Language_German will be 1 and the others will be 0.

Create Category Scores: With this feature, a string variable is converted into a new numeric categorical variable.

Reduce Categories: This utility creates a new categorical variable that reduces the number of categories. When using Analytic Solver Comprehensive or Data Science, you can reduce the number of categories by frequency or manually.

When using Analytic Solver Upgrade to reduce the number of categories in a particular variable, this utility creates a new categorical variable that reduces the number of categories to 30.

Two options are available for reducing the number of categories.

Option 1 assigns categories 1 through n so that categories 1 through n - 1 represent the most frequently occurring categories, and category n is assigned to all remaining categories. (If using Analytic Solver Upgrade, n must be less than or equal to 30.)

Option 2 maps multiple distinct category values in the original column to a new categorical variable with values between 1 and n, where n is the number of observations. (If using Analytic Solver Upgrade, n must be less than or equal to 30.)
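
The three transformations described above can be illustrated outside the product with a short Python sketch. This is not Analytic Solver's implementation: the function names (dummy_variables, category_scores, reduce_by_frequency, reduce_manually) are hypothetical, the sample Language data is made up for illustration, and the ordering of category codes and the tie-breaking rule are assumptions chosen to make the output deterministic.

```python
from collections import Counter


def dummy_variables(values, prefix):
    """Create one 0/1 column per distinct value, named <prefix>_<value>."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }


def category_scores(values):
    """Convert strings to integer codes 1, 2, ...

    Assumption: codes follow the sorted order of the distinct values;
    the product may number categories differently.
    """
    codes = {cat: i for i, cat in enumerate(sorted(set(values)), start=1)}
    return [codes[v] for v in values], codes


def reduce_by_frequency(values, n):
    """Option 1 sketch: the n - 1 most frequent categories get 1..n-1,
    and every remaining category is assigned n."""
    counts = Counter(values)
    # Assumption: ties in frequency are broken alphabetically.
    ranked = sorted(counts, key=lambda cat: (-counts[cat], cat))
    top = {cat: i for i, cat in enumerate(ranked[: n - 1], start=1)}
    return [top.get(v, n) for v in values]


def reduce_manually(values, mapping):
    """Option 2 sketch: apply a user-supplied mapping from original
    category values to new category numbers."""
    return [mapping[v] for v in values]


# Made-up sample data for the Language variable.
language = ["English", "German", "French", "English",
            "Spanish", "English", "German"]

dummies = dummy_variables(language, "Language")
scores, codes = category_scores(language)
by_freq = reduce_by_frequency(language, 3)
manual = reduce_manually(
    language, {"English": 1, "German": 2, "French": 3, "Spanish": 3}
)

print(dummies["Language_German"])  # 1 only where the record is German
print(scores)
print(by_freq)
print(manual)
```

In this sketch, reducing by frequency with n = 3 keeps English and German as categories 1 and 2 and groups French and Spanish into category 3, which is the same grouping the manual mapping spells out explicitly.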