Can you work with Big Data in Excel?  From the barrage of recent news, white papers, and sales calls about Big Data, you would think not.  A common theme is that “spreadsheets can’t handle Big Data and advanced analytics,” and that companies need to “move up” to new tools, that the vendors with the white papers offer -- implicitly, the benefits outweigh the expense and steep learning curve.  But in this blog post, we'll show that you can work with Big Data in Excel (with help from our Apache Spark link), without a lot of expense or a steep learning curve.

To do this, we'll study the same airline Big Data dataset used by HortonWorks, one of the best-known Big Data firms, in an excellent series of tutorial blog posts published about seven months ago.  We'll answer the same kinds of questions, and build a similar predictive model to the one built by HortonWorks data scientists.  But we'll do all of this "point and click", without any programming, in desktop Excel with XLMiner Platform, our powerful Excel add-in -- and one other tool, Microsoft's free Power Map add-in, to produce the image below.

Flight delay times by airport, visualized with Power Map for Excel

20+ Years of Airline Data

The public dataset used in this example consists of flight arrival and departure information for all commercial flights within the USA dating from October 1987 to April 2008, obtained from the Research and Innovative Technology Administration (RITA) which coordinates the U.S. Department of Transportation research programs.  For supplemental data including the location of each airport, plane type, and meteorological data pertaining to each flight, click here.  The data was obtained from 29 commercial airlines and 3,376 airports, and consists of nearly 120 million records (12 GB storage uncompressed), including 3.2 million cancelled flights and 25 million flights at least 15 minutes late.  The US airline industry generates about $157 billion in annual revenue, so it’s no wonder that both travelers and airlines incur great costs when nearly 20% of  flights are delayed or cancelled.

Goals of This Tutorial

We won’t aim to replicate each step of the case study presented by Hortonworks, nor will we aim to show all the advanced machine learning methods available in XLMiner. Instead, we'll show how quickly a business analyst with Excel and XLMiner can get essentially the same results as a team of data scientists and programmers equipped with the full set of Big Data tools, programming languages and statistical packages (Hadoop, Pig, Python, Pandas, NumPy, MatPlotLib and SciKit-Learn) -- in a matter of minutes, not hours or days.  We'll walk through these steps:

  1. Drawing a representative data sample from an Apache Spark Big Data cluster into Excel
  2. Summarizing and visualizing the airline data for immediate insights about flight delays
  3. Preprocessing raw data to exclude cancelled and diverted flights
  4. Building a Logistic Regression model for predicting flight delays
  5. Using Feature Selection to identify the important flight attributes for predicting delays
  6. Exploring the effects of eliminating redundant features from the model

Parts of this example are covered in our XLMiner Online Help. You can follow the case study on your own PC if you have Analytic Solver Platform or XLMiner V2015-R2 or later -- if you're registered and logged in on Solver.com, you can download a free trial here. Faculty members teaching MBA students with our software are invited to contact us about access to Frontline's Apache Spark Big Data cluster on Amazon Web Services, where this and other example datasets are pre-loaded.

One take-away from this blog post is that a statistically representative sample can be "just as good" as the full dataset for many kinds of analysis.  This is hardly a new idea -- statisticians and practitioners have applied sampling methods to great effect for many years -- but it seems to have been lost in the hype over Big Data.