When the New York City Taxi and Limousine Commission released its 2013 NYC taxi data in response to a Freedom of Information Law (FOIL) request, the result was significant buzz about privacy and the implications of public data. The dataset contained 24 files with features describing GPS pickup and drop-off locations, time of day, number of passengers, and tip and fare amounts, for 14,000 taxis over 170,000,000 trips. Analysts have used this public dataset to produce fascinating visualizations such as “NYC Taxis: A Day in the Life”, but have also revealed privacy flaws affecting taxi drivers and passengers, since both can be de-anonymized. All of this was made possible by the data request made by Chris Whong.
At a less privacy-invasive level, some published analyses have demonstrated the predictive power of features such as pickup and drop-off locations for tipping patterns. One example is a post from Jose Mazo, who published his analysis on GitHub. Using a variety of open-source tools, he built a classifier that predicted, with an accuracy of 71.74%, whether the tip would be greater than or less than 20% of the fare. Similarly, Girish Nathan, a Senior Data Scientist at Microsoft, published a tutorial showing how to address the same problem using Microsoft Azure ML, with an Azure HDInsight cluster running Hive. More recently, Dan Kikuchi from IBM posted a short video demonstrating how IBM Analytics for Apache Spark could be used to cluster the data and determine the top 3 drop-off spots for taxi cabs in New York.
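To make the classification target concrete, here is a minimal Python sketch of how the binary "tip above 20% of the fare" label could be derived from one trip record. It is an illustration only, not the method used in any of the analyses cited above. The field names `fare_amount` and `tip_amount`, the helper `tip_label`, and the handling of edge cases (a tip of exactly 20% falls in the lower class, and trips with a non-positive fare are left unlabeled) are all assumptions made for this example.

```python
from typing import Optional


def tip_label(fare_amount: float, tip_amount: float,
              threshold: float = 0.20) -> Optional[int]:
    """Return 1 if the tip exceeds `threshold` of the fare, else 0.

    Returns None when the fare is not positive, because the tip
    percentage is undefined for such records (an assumption of this sketch).
    """
    if fare_amount <= 0:
        return None
    return 1 if tip_amount / fare_amount > threshold else 0


# Hypothetical trip records, invented for illustration (not real data).
trips = [
    {"fare_amount": 10.0, "tip_amount": 2.5},   # 25% tip
    {"fare_amount": 20.0, "tip_amount": 2.0},   # 10% tip
    {"fare_amount": 0.0, "tip_amount": 0.0},    # undefined tip percentage
]

labels = [tip_label(t["fare_amount"], t["tip_amount"]) for t in trips]
print(labels)  # [1, 0, None]
```

A classifier is then trained to predict this label from the other features, such as pickup and drop-off locations.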
This case study shows how a business analyst can use Excel and XLMiner, with its built-in link to the open Big Data platform Apache Spark, to explore the NYC Taxi data and build a similar supervised learning model (with even greater accuracy) for classifying taxi tip range. We show how easily business analysts can tap the power of Apache Spark to crunch huge volumes of data through a simple point-and-click user interface, with no programming expertise or other advanced tools required. You can follow the case study on your own PC if you have Analytic Solver Platform or XLMiner V2015-R2 or later. If you're registered and logged in on Solver.com, you can download a free trial here.
The joined taxi trip + fare data is about 48 GB in CSV format and consists of 173,183,895 records and 21 columns. Summarized information about the dataset is available here, and a complete description of the data is available here. Faculty members teaching MBA students with our software are invited to contact us about access to Frontline's Apache Spark Big Data cluster on Amazon Web Services, where this and other example datasets are pre-loaded.