Introduction
In today's world, data is ubiquitous — it's everywhere, from shopping membership cards to CCTV systems. Machines are gathering information about all facets of the world that we live in: 24 hours a day, 7 days a week. Predictably, the end-result is massive amounts of content that may (or may not) contain valuable insights about the targeted subject matter. Technology is working hard to store all of this data for analysis... unlike the past, terabytes are now commonplace. So the challenge for data scientists doesn't begin with analysis but, rather, how to obtain access to the data? Its sheer size is intimidating, to say the least.
Sampling Data
XLMiner supports access to four different data sources: importing data from a worksheet, from a database, from a file folder, and from Apache Spark. The fourth option, Apache Spark, is a unique open-source technology that lends itself very well to handling Big Data — we'll discuss this option below. First, though, let's have a look at some classic data sources.
Data importation from worksheets, databases, and file folders may be large — perhaps comprised of tables with millions of rows. This size is significant and makes handling all available data impossible... it's just too much. Given this, XLMiner uses the same approach as statisticians: taking a sampling of the data (i.e., manageable size) that, when analyzed, represents the population as a whole. By "population," we mean the entire dataset. There are multiple methods available to sample data; XLMiner offers two of the most prevalent options: simple random sampling and stratified random sampling.
Types of Sampling
Simple Random Sampling: This is a straightforward method that simply retrieves randomized content from the population. The goal with this method is to avoid bias and to give every possible sample an equal chance of being selected. XLMiner facilitates this process by enabling users to configure the sample size, seed for randomization, and the option to sample with or without replacement.
Stratified Random Sampling: This method includes simple random sampling, but also has an additional initial step. First, the population is divided into groups of similar items — these groups are called "strata." Each strata is then sampled using the simple random sampling process, as described above, and all data is then combined to form a stratified random sample. Like simple random sampling, XLMiner enables users to choose a sorting seed for randomization, sampling with or without replacement, and selection of a sample size.
Importing Data from a Worksheet
Before XLMiner can retrieve data from a worksheet, it is necessary to specify some details, such as the data source, the variables to be included in the sampled data, and a sampling option. Note that if you opt for using Stratified Random Sampling, then you need to configure the sampling in terms of the variable to use and its proportionality to the stratum size.
Importing Data from a Database
Like sampling from a worksheet, sampling data from a database is as straightforward as specifying the data source and sampling options. While variables are selected in worksheet importation, fields are selected in a database import. XLMiner supports connections to MS-Access, SQL Server, Oracle, and PowerPivot.
Importing Data from a File Folder
XLMiner supports the import of delimited text files; that is, files in which data values are separated by a character. The process is similar to worksheet and database imports, with the exception that the sampling choice is Simple Random Sampling (with or without replacement).
Importing Big Data from Apache Spark
Frontline Systems empowers organizations to take advantage of the benefits of Big Data analysis via its data mining solution, XLMiner. The XLMiner technology enables users to sample and summarize Big Data stored across an Apache Spark cluster where a Frontline Systems access server is installed. Representative samples of Big Data can then be drawn from all nodes in the cluster and conceptualized in Excel using data & text mining models. In short, this allows individual users to access and summarize Big Data from the comfort of their desktops.
In XLMiner, there are three steps involved in accessing and summarizing Big Data: sampling, getting results, and summarizing. We'll have a brief look at all three below:
Sampling Big Data
The Sampling stage enables users to specify data locations, such as the data source file location and Spark REST server URL. Users can also specify how they would like their data to be sampled: exact or approximate. The former enables users to specify the exact sample size while the latter lets users specify a fractional sample size (e.g., 0.0001 of 25 million records = 2,500 records in sample size). From a convenience standpoint, approximate samples are much faster and, if needed, alterations can be made in Excel in the event of excessive data.
After submitting the sampling details, identification details are automatically created that enable the user to view the current status of the job — this status information, along with the results of the sample, can be requested using the Get Results tool.
Getting Results
The Get Results interface enables users to check the status of a sampling job (Job Identifier, Get Info button) and, when the status is set to "Finished," users can view the results directly within Excel. The report consists of an Inputs section, which conveys details about the sampled data used, and a Results section that lists all variables used along with the sample data.
Summarize
The final stage, summarization, provides an opportunity to delve into the sample data in order to derive its meaning and relevance. The power and complexity of large-scale cluster computing via Apache Spark is made understandable through the integration of XLMiner with Microsoft Excel. Excel's user-friendly and familiar interface essentially breaks down the barriers of understanding in terms of making data science tools accessible.
The new Summarization feature in XLMiner is a powerful tool that enables data professionals to rapidly extract key metrics within your sampled data. Users of all profiles—not necessarily just data scientists—can now use a powerful Big Data summarization solution to compose reports, create visualizations, and build models for subsequent analysis.
Results can be customized using a combination of XLMiner's data handling functions, such as variable selection and aggregation type selection, with Excel's native data presentation functions, such as sorting and filtering. The end-result is content that can be used to provide answers to highly-specific questions. In the graphic below, specific variables (Year, Unique Carrier, Count) are selected using XLMiner and columnar data is sorted in descending order using Excel — the result is a table that lists airline carriers ranked by market share: