Get Data

Simple Random Sampling

This is probably the simplest method for obtaining a good sample. A simple random sample of size n is chosen so that every set of n items from the population has an equal chance of being selected. Use the Data Sampling utility of Analytic Solver Data Science to choose the sample size, the seed for randomization, and sampling with or without replacement.
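The idea can be illustrated outside the product with a minimal Python sketch. The function name simple_random_sample, its parameters, and the seed value 12345 are assumptions made for this illustration; they are not part of Analytic Solver Data Science.

```python
import random

def simple_random_sample(population, n, seed=12345, with_replacement=False):
    """Draw n items at random from population (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    items = list(population)
    if with_replacement:
        # Each draw is independent, so the same item may appear more than once.
        return [rng.choice(items) for _ in range(n)]
    if n > len(items):
        raise ValueError("sample size exceeds population size")
    # Without replacement, every set of n items is equally likely.
    return rng.sample(items, n)

population = list(range(1, 101))  # illustrative population of 100 record IDs
sample = simple_random_sample(population, n=10, seed=12345)
```

Running the function again with the same seed returns the same sample, which is the purpose of exposing the seed as an option.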

Stratified Random Sampling

In this technique, the population is first divided into groups of similar items. These groups are called strata. Each stratum is sampled using simple random sampling, and the samples are then combined to form a stratified random sample. Use the Data Sampling utility to choose a sorting seed for randomization and sampling with or without replacement. Depending on which stratified random sampling method is chosen, the desired sample size can be fixed in advance. Analytic Solver Data Science allows sampling from a worksheet or a database.
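The following Python sketch illustrates the general procedure under one assumed allocation rule: the same fraction is drawn from every stratum, without replacement. The function stratified_sample, the stratum_of callback, and the sample records are hypothetical and do not reproduce the product's own stratification methods.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, seed=12345):
    """Sample the same fraction from each stratum (hypothetical helper)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)  # divide into strata

    sample = []
    for key in sorted(strata):  # fixed stratum order keeps the result reproducible
        members = strata[key]
        k = max(1, round(len(members) * fraction))  # at least one item per stratum
        sample.extend(rng.sample(members, k))  # simple random sample within the stratum
    return sample

# Illustrative data: three strata of unequal size.
records = ([("A", i) for i in range(60)]
           + [("B", i) for i in range(30)]
           + [("C", i) for i in range(10)])
sample = stratified_sample(records, stratum_of=lambda r: r[0], fraction=0.1)
```

With these inputs the combined sample preserves the 60/30/10 proportions of the strata, which a simple random sample of the same size would not guarantee.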

Importing From a File System

To run the new Text Miner tool, you must first import the text into Analytic Solver Data Science. Text may be present in a worksheet as a column variable (where the cell in each row contains a comment), in a paragraph, as free-form text, or as a text field in a database. In each of these cases, the text in each row/observation is associated with other structured input fields and, for supervised learning, with an outcome variable.

Text may also be present in a series of document files on a disk or in a network folder, where each document represents an observation. This example uses this method to import a large list of text documents into Excel. Select Get Data - File Folder to open the Import From File System dialog.

If the documents are relatively small (no more than 32,767 characters, the limit on the length of a string in a single worksheet cell), the document contents may be written to the worksheet. Otherwise, Analytic Solver Data Science imports only the document filenames/paths into the worksheet. The document contents are then read during the text mining operation, where document size is limited only by memory and time.
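The size rule described above can be sketched in Python as follows. The function import_folder, the "*.txt" file pattern, and the choice to apply the limit to each document separately are assumptions made for this illustration; the product's dialog may apply the rule differently.

```python
import tempfile
from pathlib import Path

EXCEL_CELL_CHAR_LIMIT = 32_767  # maximum characters in a single worksheet cell

def import_folder(folder, pattern="*.txt", encoding="utf-8"):
    """Return one row per document: its path, plus contents if they fit a cell."""
    rows = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding=encoding, errors="replace")
        fits = len(text) <= EXCEL_CELL_CHAR_LIMIT
        # Oversized documents keep only their path; contents are read later.
        rows.append({"path": str(path), "contents": text if fits else None})
    return rows

# Illustrative run against a temporary folder holding one short and one long file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "short.txt").write_text("A brief comment.", encoding="utf-8")
    Path(tmp, "long.txt").write_text("x" * 40_000, encoding="utf-8")
    rows = import_folder(tmp)
```

Each resulting row corresponds to one observation, matching the one-document-per-row layout of the imported worksheet.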

Note: To import data from online or on-premises databases into Excel's spreadsheet data model, we recommend Microsoft's free Power Query add-in, or the facilities built into the free Power Pivot add-in (which has no limit other than memory on the number of rows). Analytic Solver Data Science may then be used to bring a random sample of this data onto a worksheet for analysis and model-building.

The worksheet produced when importing from a file system may be used as input to the text mining operation (refer to the Text Mining section of this Help for further information). However, if the data set contains other structured fields or an outcome variable, you must assemble a worksheet that associates each document with the correct observation for the remaining fields. As this process is not automatic, we recommend using one of Excel's many tools for editing rows, columns, and sheets.
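For readers who prepare the data outside Excel, the association step might look like the Python sketch below. It assumes that both tables share a "filename" column to match on; that column, the function attach_fields, and the sample values are hypothetical.

```python
def attach_fields(doc_rows, structured_rows, key="filename"):
    """Match each imported document to its structured fields by a shared key."""
    by_key = {row[key]: row for row in structured_rows}
    merged, unmatched = [], []
    for doc in doc_rows:
        fields = by_key.get(doc[key])
        if fields is None:
            unmatched.append(doc[key])  # no observation found for this document
        else:
            merged.append({**fields, **doc})  # one combined row per observation
    return merged, unmatched

# Illustrative data: three imported documents, two with structured fields.
docs = [
    {"filename": "a.txt", "path": "docs/a.txt"},
    {"filename": "b.txt", "path": "docs/b.txt"},
    {"filename": "c.txt", "path": "docs/c.txt"},
]
structured = [
    {"filename": "a.txt", "rating": 5, "outcome": "positive"},
    {"filename": "b.txt", "rating": 1, "outcome": "negative"},
]
merged, unmatched = attach_fields(docs, structured)
```

Reporting the unmatched documents separately makes it easier to spot rows that would otherwise be paired with the wrong observation.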