Processing New Documents Based on an Existing Text Mining Model

Analytic Solver Data Science can process new text documents based on an existing text model, provided that the Write Text Mining Model option was selected on the Output Options tab of the Text Miner dialog when the baseline (existing) model was created. If LSI was performed, the TM_Model output worksheet contains the information needed to map the text data from the new documents into the existing semantic space. The basis for this mapping is the fixed vocabulary extracted from the baseline model. Synonyms, phrases, normalizations, and stopwords are also stored in the model to ensure proper mapping to the baseline vocabulary. This example uses 200 additional text documents (100 each for electronics and autos) extracted from the Newsgroup data set (complete data set downloadable from here).
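The mapping idea described above can be sketched in a few lines of Python. This is a minimal illustration only, not Analytic Solver Data Science's implementation: the vocabulary, stopwords, synonyms, term-by-concept loadings (U_K), and singular values are all made-up values, and the projection shown is the standard LSI "fold-in" formula, which is assumed here rather than confirmed by the product documentation.

```python
import re
from collections import Counter

# Hypothetical baseline model. Every value below is invented for illustration.
VOCABULARY = ["engine", "car", "circuit", "voltage"]  # fixed term order
STOPWORDS = {"the", "a", "and", "of", "is"}
SYNONYMS = {"automobile": "car", "auto": "car"}       # variant -> vocabulary term
# Term-by-concept loadings: one row per vocabulary term, k = 2 concepts.
U_K = [
    [0.7, 0.0],
    [0.7, 0.1],
    [0.0, 0.7],
    [0.1, 0.7],
]
SINGULAR_VALUES = [2.0, 1.5]


def term_vector(text):
    """Count occurrences of baseline vocabulary terms in a new document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [SYNONYMS.get(t, t) for t in tokens if t not in STOPWORDS]
    counts = Counter(tokens)
    # Terms outside the fixed vocabulary are simply ignored.
    return [counts.get(term, 0) for term in VOCABULARY]


def fold_in(vec):
    """Project a term vector into the existing concept space (d^T U_k S_k^-1)."""
    return [
        sum(vec[i] * U_K[i][j] for i in range(len(VOCABULARY))) / SINGULAR_VALUES[j]
        for j in range(len(SINGULAR_VALUES))
    ]


doc = "The automobile engine is loud and the car is old"
vec = term_vector(doc)
concepts = fold_in(vec)
print(vec)       # term counts in baseline vocabulary order
print(concepts)  # coordinates in the 2-concept space
```

Because the vocabulary is fixed, words such as "loud" and "old" contribute nothing to the new document's representation, and every new document yields a vector with the same length as the baseline documents.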

On the Analytic Solver Data Science ribbon, from the Applying Your Model tab, select Help - Examples, then select Examples to open the Text Mining Files.zip archive. Extract the files to C:\Program Files\Frontline Systems\Analytic Solver Platform\Datasets (or C:\Program Files (x86)\Frontline Systems\.. if using 32-bit Microsoft Excel with 64-bit Microsoft Windows). For directions on extracting and importing the files into Analytic Solver Data Science, see the Data - Get Data - File Folder section.

On the Analytic Solver Data Science ribbon, from the Data tab, select Sample - Import from File Folder to open the Import From File System dialog. Click Browse and navigate to the location of the additional electronics text files (C:\Program Files\Frontline Systems\Analytic Solver Platform\Datasets\Additional electronics). Set the file type to All Files, then select all 100 files in the folder. In the Import From File System dialog, click >> to move all files to the Selected Files list. Repeat these steps to load the additional auto documents from the Additional autos folder. You should now have 200 documents listed under Selected Files.

From the selected files, select Sample, then enter 100 for Desired Sample Size. For Output, keep Write file paths, then click OK. When Write file paths is selected, pointers to the file locations are stored on the XLM_SampleFiles worksheet. If Write file contents is selected, the content of each text document is written to a cell on the XLM_SampleFiles worksheet, up to a maximum of 32,767 characters.
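The difference between the two output choices can be sketched as follows. This is an illustrative approximation, not the add-in's own code: the function name sample_documents and its parameters are hypothetical, and simple random sampling is assumed. The 32,767-character cap corresponds to the maximum number of characters a single Excel cell can hold.

```python
import random
from pathlib import Path

EXCEL_CELL_LIMIT = 32767  # maximum characters a single Excel cell can hold


def sample_documents(paths, sample_size, write_contents=False, seed=0):
    """Return a random sample of documents (hypothetical helper).

    write_contents=False mimics "Write file paths": only pointers are kept.
    write_contents=True mimics "Write file contents": text is read and
    truncated to the Excel cell limit.
    """
    rng = random.Random(seed)
    chosen = rng.sample(list(paths), sample_size)
    if not write_contents:
        return [str(p) for p in chosen]
    return [Path(p).read_text(errors="replace")[:EXCEL_CELL_LIMIT] for p in chosen]
```

Storing paths keeps the worksheet small and leaves the full text on disk, whereas storing contents makes the workbook self-contained at the cost of truncating any document longer than the cell limit.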

Import from File System

After you click OK, the worksheet XLM_SampleFiles1 is inserted into the workbook. Sort the documents by type (electronics or autos) using Microsoft Excel's Sort functionality (on the Data menu).

Sampling Output

On the Analytic Solver Data Science ribbon, from the Data Analysis tab, select Text to bring up the Text Miner dialog. Under the Variables list, select TextVar, and click > to move the variable to the Selected Text Variables list. Click the Models tab.

Text Miner Data Source Dialog

Select Map text variables to an existing model. Notice that TM_Model has been selected for Worksheet under Select Model. Recall that the TM_Model worksheet is where Analytic Solver Data Science writes the text mining model.

Click Match By Name to match TextVar under Selected Text Variables with TextVar under Model Text Variables.

To map text variables from the newly imported files to the existing text mining model manually instead, select TextVar under both Selected Text Variables and Model Text Variables, then click Match Selected. If Match Sequentially is used, Analytic Solver Data Science matches variables in the order in which they appear. To unmatch a single pair of variables, highlight the desired variables in the Model Text Variables list, then click Unmatch Selected. To unmatch all variables, click Unmatch All.
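The two automatic matching strategies can be pictured with a short Python sketch. It is inferred only from the descriptions above and is an assumption about the behavior, not the dialog's actual logic; the function names are hypothetical.

```python
def match_by_name(selected, model):
    """Pair each selected variable with the model variable of the same name."""
    model_names = set(model)
    return {name: name for name in selected if name in model_names}


def match_sequentially(selected, model):
    """Pair variables by position: first with first, second with second, ..."""
    return dict(zip(selected, model))


print(match_by_name(["TextVar"], ["TextVar"]))
print(match_sequentially(["NewText"], ["TextVar"]))
```

Matching by name is the safer choice when the new data uses the same variable names as the baseline model; sequential matching is useful when the names differ but the column order is the same.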

Text Miner Models Dialog

Click the Pre-Processing tab.

Options on both the Pre-Processing and Representation tabs are disabled. Options that were used and defined in the existing Text Miner model (created previously) are prefilled. The vocabulary for the new collection is taken from the existing model and is used to map the new documents to the existing space of terms. Therefore, some pre-processing options that affect vocabulary reduction and term normalization are not applicable in this mode and are not prefilled. To change any of these options, create a new baseline Text Miner model.

Click the Output Options tab to choose which outputs to produce, then click Finish to run Text Miner. (For explanations of each option and its output, see the previous section.) Notice that Scree Plot, Maximum number of concepts, Concept importance, and Term importance are disabled, because Analytic Solver Data Science does not store comprehensive information about all of the original concepts extracted from the existing (baseline) model. All matrices in the output have the same dimensionality as, and conform to the structure of, the baseline model's representations. This means that output from a classification or prediction method based on the existing Text Miner model can be compared directly with output from a classification or prediction method based on the new output. Although the newly produced reports look similar to the baseline model's reports, their content and interpretation are different. To help distinguish the two visually, each new report displays the following red text beneath its heading.

Text Miner Report