Over the past decade, thousands of organizations have made their data available to the public, either by posting an extract online or through an API. These data sets range from government records to restaurant reviews to song metadata. One of the best-known of these releases is the New York City taxi data, posted by the NYC Taxi and Limousine Commission in 2013 in response to a Freedom of Information Law (FOIL) request.

A number of analysts and computer scientists have since dug into the data to produce a wide range of analyses, from interactive visualizations to the identification of encryption flaws. One hacker was able to de-anonymize and share the data, raising serious privacy concerns. A blogger named Anthony Tockar was then able to match up license plates from the de-anonymized data with paparazzi pictures of celebrities to figure out exactly how much celebrities like Bradley Cooper and Jessica Alba tipped their cab drivers.

Other, less privacy-invasive analyses have also demonstrated the predictive power of the NYC taxi data. A computer science student named Jose Mazo was able to classify, with 72% accuracy, whether a cab driver would be tipped above or below 20%. Similarly, Girish Nathan, a Senior Data Scientist at Microsoft, published a tutorial on classifying tips using Microsoft Azure Machine Learning models with an Azure HDInsight Hadoop cluster running Hive.

With the NYC taxi data set containing 24 files of detailed information about 14,000 taxis over the course of 170 million trips, one might assume that programming or extensive data science skills would be needed to analyze the data. But with XLMiner Platform, working either in Excel or on the web at XLMiner.com, and its built-in link to Apache Spark, any business analyst can easily explore the NYC taxi data and create supervised learning models.

Predictive Analytics with XLMiner in Excel

Frontline Systems created a case study to demonstrate how an analyst can use XLMiner to build a classification model on the NYC taxi data that predicts how much a cab driver will be tipped on a ride. Previous studies relied on programming tools like Python and Big Data tools like Hive, and in practice analyzed only a subset of the data (one month's worth, in Jose Mazo's thesis). With XLMiner and Apache Spark, drawing a representative sample across the entire data set is a point-and-click operation -- and training a model on this random sample yields just as much predictive power as the earlier studies.

Calculating Summary Statistics

The new Summarization Tool in XLMiner allows a user to calculate summary statistics on a large data set using the Apache Spark cluster, all within Excel -- or with a web browser using XLMiner.com. For example, if a user wants to see whether certain cab drivers consistently receive high tips, he can use the graphical user interface to look at the average fare and tip amounts for each cab driver, and create a histogram of the distribution of average tip percentages. These calculations, run across 170 million records, complete within seconds.
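For readers who want to see what this summarization step computes, here is a hypothetical pandas sketch of the same per-driver aggregation, run on a tiny made-up sample rather than the 170-million-record set (the column names are assumptions, not XLMiner's actual field names):

```python
# Illustrative pandas equivalent of the Summarization Tool's per-driver stats.
# The data below is synthetic; column names are assumed for illustration.
import pandas as pd

trips = pd.DataFrame({
    "hack_license": ["A", "A", "B", "B", "B"],   # anonymized driver IDs (made up)
    "fare_amount":  [10.0, 14.0, 8.0, 12.0, 20.0],
    "tip_amount":   [2.0, 2.8, 0.8, 1.2, 4.0],
})

# Average fare and tip per driver, as the Summarization Tool would report them.
per_driver = trips.groupby("hack_license")[["fare_amount", "tip_amount"]].mean()

# Average tip percentage per driver -- the input to the histogram described above.
tip_pct = (trips.assign(tip_pct=100 * trips.tip_amount / trips.fare_amount)
                .groupby("hack_license")["tip_pct"].mean())

print(per_driver)
print(tip_pct.round(1))
```

XLMiner pushes this same aggregation down to the Spark cluster, which is why it finishes in seconds at full scale.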

Sampling The Data

Another way to explore the data in Excel is with the Sample Tool in XLMiner. Instead of scrolling through millions of rows at once, the user can use the point-and-click interface to quickly pull a random sample of the data to investigate. Taking an 86,000-record sample of the taxi data set requires one click and completes in just a few seconds. Once the sample is available, other features such as XLMiner's Chart Wizard can be used to explore it further and visually identify patterns -- for example, histograms of the frequency of various payment types, and scatter plots to identify the most common tip percentages.
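The effect of the Sample Tool can be sketched in pandas as a simple random draw without replacement; this is a hypothetical stand-in using a small synthetic table, not XLMiner's Spark-side implementation:

```python
# Illustrative sketch of random sampling, as the Sample Tool performs it.
# XLMiner draws the sample on the Spark cluster; pandas does it in memory.
import pandas as pd

trips = pd.DataFrame({
    "trip_id": range(1000),                    # synthetic stand-in data
    "payment_type": ["CSH", "CRD"] * 500,
})

# Draw a reproducible random sample (without replacement) of 100 records.
sample = trips.sample(n=100, random_state=42)

print(len(sample))
print(sample["payment_type"].value_counts())
```

A sample like this is small enough to chart interactively while still reflecting the distributions of the full data set.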

Pre-Processing The Data

While visually exploring the data, the user can also employ XLMiner to clean it in preparation for modeling. For example, examining the GPS coordinates of the pick-up and drop-off locations may reveal erroneous coordinates that place a taxi cab outside New York, or even in the ocean; the user can then exclude these records from the data set.
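This cleaning step amounts to a bounding-box filter. Here is a hypothetical sketch in pandas; the box limits are assumptions chosen to roughly cover New York City, not values from the case study:

```python
# Illustrative cleaning step: drop records whose pickup coordinates fall
# outside a rough NYC bounding box. The box limits below are assumptions.
import pandas as pd

trips = pd.DataFrame({
    "pickup_longitude": [-73.98, -73.95, 0.0, -74.01],
    "pickup_latitude":  [40.75, 40.78, 0.0, 40.71],
})

# (0.0, 0.0) is a common GPS error value that places the cab in the ocean.
in_nyc = (trips.pickup_longitude.between(-74.3, -73.7) &
          trips.pickup_latitude.between(40.5, 41.0))
clean = trips[in_nyc]

print(len(clean))  # the erroneous record is excluded
```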

Before creating a model, additional features can also be derived from the original set of available variables. For this classification model, the analyst can first create a derived feature that calculates the tip percentage, and a second feature, called tip percent class, that buckets the tip percentage into three ranges to serve as the output variable of the model.
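The derivation of these two features can be sketched as follows. Note that the three cut points used for the tip percent class buckets are assumptions for illustration; the case study does not state the actual range boundaries:

```python
# Illustrative feature derivation: tip percentage, plus a three-way
# "tip percent class" bucketing. The bin edges below are assumptions.
import pandas as pd

trips = pd.DataFrame({
    "fare_amount": [10.0, 10.0, 10.0],   # synthetic example rows
    "tip_amount":  [0.5, 1.8, 3.0],
})

# Derived feature 1: tip as a percentage of the fare.
trips["tip_pct"] = 100 * trips.tip_amount / trips.fare_amount

# Derived feature 2: bucket the percentage into three classes.
trips["tip_pct_class"] = pd.cut(trips.tip_pct,
                                bins=[0, 10, 20, 100],
                                labels=["low", "medium", "high"],
                                include_lowest=True)
print(trips)
```

In XLMiner both derivations are formula-driven transforms; tip percent class then becomes the target the classification tree predicts.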

Building a Classification Tree

Once the data has been explored and cleaned, the user can build a model. Through XLMiner's graphical user interface, the analyst can choose to build a classification tree and specify various parameters, such as the sizes of the training and test sets and the maximum number of levels in the resulting tree. He can then select the input variables he'd like to use, such as distance of the trip, pick-up and drop-off latitude and longitude, time of day, day of week and payment type. Finally, he can set the output variable to the feature he created in the pre-processing step, tip percent class.
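For comparison, the same modeling step can be sketched with scikit-learn's DecisionTreeClassifier in place of XLMiner's classification tree. Everything here is a hypothetical stand-in: the data is synthetic, the toy tipping rule is invented, and the feature list is abbreviated from the variables named above:

```python
# Illustrative scikit-learn analogue of XLMiner's classification tree step.
# Synthetic data; the "card payers tip high" rule is an invented toy pattern.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(0.5, 20, n),    # trip distance (miles)
    rng.uniform(0, 24, n),      # hour of day
    rng.integers(0, 2, n),      # payment type (0 = cash, 1 = card)
])
y = np.where(X[:, 2] == 1, "high", "low")   # toy target: tip percent class

# Partition into training and test sets, as in XLMiner's setup dialog.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# max_depth mirrors XLMiner's "levels of the tree" parameter.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(f"test accuracy: {tree.score(X_test, y_test):.2f}")
```

Because the toy label depends only on payment type, the tree separates the classes with a single split; real taxi data would, of course, be far noisier.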

Interpreting the Results

After running the model, the summary report shows that XLMiner created a classification model with a 78% accuracy rate. Although this model was trained on only a sample of the NYC taxi data, the results surpass those of computer science student Jose Mazo, whose classification model on the full data set achieved 72% accuracy. XLMiner also outputs a decision tree diagram that shows the most influential variables in classifying the tip amount, including trip distance and drop-off location.
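In scikit-learn terms, the "most influential variables" that XLMiner's tree diagram surfaces correspond to a fitted tree's feature importances. A hypothetical sketch, with made-up data in which trip distance drives the label:

```python
# Illustrative sketch: reading variable influence from a fitted tree.
# XLMiner presents this as a tree diagram; scikit-learn exposes it as
# feature_importances_. Data and feature names below are made up.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 400
trip_distance = rng.uniform(0.5, 20, n)
dropoff_lat = rng.uniform(40.5, 41.0, n)
noise = rng.uniform(0, 1, n)

# Toy labels driven entirely by trip distance.
y = np.where(trip_distance > 10, "high", "low")

X = np.column_stack([trip_distance, dropoff_lat, noise])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

for name, imp in zip(["trip_distance", "dropoff_latitude", "noise"],
                     tree.feature_importances_):
    print(f"{name}: {imp:.2f}")
```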

XLMiner enables any business analyst to create supervised learning models using a point-and-click interface, without extensive machine learning or programming knowledge. In this case study, the analyst was able to quickly calculate aggregate statistics and sample data from the 170-million-record NYC taxi data set, then create and interpret a classification model -- something that could previously be done only by experienced IT pros and data scientists.