Data Mining

Introduction: Data Mining

In short, data mining is the process of discovering knowledge via data analysis. Data mining is much more than that, however, and it will be useful for us to delve into the subject in a bit more detail.

Data is everywhere, from consumer shopping habits to the frequency of a patient's heartbeat. The last few years have seen a rapid increase in our capability to gather and store this data: typical data volumes have grown from kilobytes to terabytes. The sheer quantity of available data, however, has exceeded our capacity to make sense of it unaided. Given this, we have turned to computers to help us extract, sort, cleanse, format, and analyze these massive amounts of data. Raw data gradually turns into information (data with meaning), which can then be converted into knowledge (the application of that meaning). This process of extracting meaning from raw data and putting it to use is called Data Mining.

Data Mining Methods

There are multiple methods involved when mining data. Some of these include:

Clustering: Finding and documenting groups of facts;
Path Analysis / Sequencing: Discovering patterns in which one event is followed by, and may lead to, subsequent events;
Association: Discovering patterns where events are connected;
Predictive Analytics / Forecasting: Using data patterns to create predictions about future events.
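
As an illustration of the Association method listed above, the sketch below computes two common measures for co-occurring items: support (how often a pair of items appears together) and confidence (how often one item appears given that another is present). The baskets, item names, and the helper functions pair_support and confidence are invented for this example and are not taken from any particular tool.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def pair_support(transactions):
    """Return the fraction of transactions that contain each item pair."""
    counts = Counter()
    for items in transactions:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: n / total for pair, n in counts.items()}

def confidence(transactions, antecedent, consequent):
    """Estimate how often `consequent` appears when `antecedent` is present."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    both = sum(1 for t in with_antecedent if consequent in t)
    return both / len(with_antecedent)

support = pair_support(baskets)
```

With this toy data, bread and milk appear together in half of the baskets, and every basket that contains butter also contains bread. Real association mining would also filter pairs by minimum support and confidence thresholds.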

Uses of Data Mining

Data Mining has made a profound impact on organizations across multiple industries, both in terms of understanding growth challenges and increasing corporate revenue. The ability to understand Big Data on a deeper level has resulted in significant changes to the way business has traditionally been done:

Retail Industry: Data Mining is used for market segmentation (identifying customer shopping characteristics), preventing customer churn (predicting which customers will not remain loyal consumers), fraud detection (identifying fraudulent transactions), and multiple marketing-based implementations (e.g., determining/comparing marketing initiative success rates, basket analysis, and trend analysis over time).
Biological Data Analysis: Data Mining has played a vital role in both genomics and biomedical research. Similarity and indexing methodologies used in Data Mining have enabled scientists to map nucleotide sequences, discover gene structural patterns, and analyze genetic networks.
Financial Sector: The growth of data warehouses has enabled organizations to perform detailed analysis of BI-centric Big Data. Real-world applications of these technologies have resulted in more accurate credit policy analysis, loan re-payment prediction capabilities, detection of money laundering, watchlist monitoring, and enterprise fraud detection.

Who uses Data Mining?

As Data Mining has grown in popularity, its accessibility across multiple user profiles has similarly increased. In the early days, Data Mining was the province of data scientists working in budding corporate-sponsored data labs. Older language stacks, such as SAS and SPSS, enabled data professionals to analyze large volumes of data, but accessibility and collaboration between IT and non-IT-oriented users were in their infancy.

Over time, though, the need for user-friendly knowledge mining solutions has made Big Data more accessible to non-IT and novice data professionals. The technology has evolved to include newer language stacks, such as R, Python, Spark, Pig, and Hive. This evolution has likewise resulted in more accessible data mining functionalities, many of which are now Web-based and/or take advantage of existing and familiar technologies (such as XLMiner / Excel).

Typical data miners now include not only data scientists, but also business analysts, software engineers, and marketing specialists. This has resulted in collaborative environments that, ultimately, produce more meaningful models and visualizations.

Data Mining Usage in XLMiner

Data Mining in XLMiner consists of four powerful functionalities that enable users to partition, classify, predict, and discover data associations.

Partition: Used to partition data into training, validation, and/or test sets;
Classify: Used to categorize a set of observations into pre-defined classes based on a set of variables. There are six different classification methods available for selection: discriminant analysis, logistic regression, k-nearest neighbors, classification tree, naïve Bayes, and neural network;
Predict: Used to predict the response variable value based on one or more predictor variables. There are four different prediction methods available: multiple linear regression, k-nearest neighbors, regression tree, and neural network;
Associate: Used to recognize associations or correlations among variables in the dataset.
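
To make the Partition step above concrete, here is a minimal Python sketch of splitting a dataset into training, validation, and test sets. It is not XLMiner's implementation: the function name partition, the 60/20/20 default split, and the fixed seed are assumptions chosen for the example.

```python
import random

def partition(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test lists.

    The fractions and seed are illustrative defaults, not XLMiner settings.
    Whatever is left after the training and validation sets goes to the test set.
    """
    if train_frac < 0 or valid_frac < 0 or train_frac + valid_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# Ten hypothetical record identifiers stand in for real observations.
train, valid, test = partition(range(10))
```

A model would then be fitted on the training set, tuned or compared on the validation set, and assessed once on the test set, so that the final accuracy estimate comes from data the model has not seen.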

In addition, XLMiner V2015 has introduced a set of powerful techniques that are capable of producing strong classification tree models. XLMiner V2015 now features three of the most robust ensemble methods available in data mining: Boosting, Bagging, and Random Trees.
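
The idea behind Bagging can be sketched in a few lines of Python. This is a simplified illustration and not XLMiner's algorithm: to keep it short, the base learner is a nearest-class-mean classifier on a single variable rather than a classification tree, and the data set, function names, and parameters (n_models, seed) are all hypothetical.

```python
import random
from collections import Counter

def fit_centroids(sample):
    """Base learner: store the mean x value of each class in the sample."""
    sums, counts = Counter(), Counter()
    for x, label in sample:
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in counts}

def predict_centroids(centroids, x):
    """Predict the class whose mean is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def fit_bagging(data, n_models=25, seed=0):
    """Train each base learner on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [
        fit_centroids(rng.choices(data, k=len(data)))
        for _ in range(n_models)
    ]

def predict_bagging(models, x):
    """Combine the base learners by majority vote."""
    votes = Counter(predict_centroids(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical (x, class) observations, invented for illustration.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
models = fit_bagging(data)
```

Each model sees a slightly different resampling of the data, and the vote averages out their individual quirks. Boosting and Random Trees build their collections of models differently, but they share the goal of combining many weak models into a stronger one.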

Resources

Classification Methods: Used to categorize a set of observations into pre-defined classes based on a set of variables.
Prediction Methods: Used to predict the response variable value based on one or more predictor variables.
Ensemble Methods: Used to create stronger and more accurate classification tree models.