Introduction

Text Mining is the automated analysis of text, or of collections of text (a corpus), with the intent of deriving high-quality information by discerning patterns and trends within the content. The input text, often unstructured, is first given structure via parsing; the structured text is then analyzed for the presence of trends & patterns, and finally the output is evaluated & interpreted.

 

Common Text Mining Tasks

Frequently implemented tasks during the text mining process include:

  • Text Categorization: The assignment of text to categories/classes.
  • Sentiment Analysis: The identification and extraction of subjective content from text.
  • Named Entity Recognition: The extraction and classification of text elements according to pre-defined categories.
  • Text Clustering: The organization of text into groups (i.e., clusters) based on similarity. Text can be clustered using hierarchical algorithms or k-means clustering.
  • Concept Extraction: The mapping of text to subjective concepts using methods such as semantic similarity and linguistic analysis of the text.
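As an illustration of the clustering task above, here is a minimal sketch of k-means clustering over term-frequency vectors with cosine similarity. The toy corpus, the farthest-point seeding, and all function names are invented for the example; production tools use far more robust implementations.

```python
import math
from collections import Counter

def vectorize(doc, vocab):
    """Term-frequency vector over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[t] for t in vocab]

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, iters=10):
    """Plain k-means; farthest-point seeding keeps the toy example deterministic."""
    seeds = [0]
    while len(seeds) < k:
        seeds.append(min(range(len(vectors)),
                         key=lambda i: max(cosine(vectors[i], vectors[s])
                                           for s in seeds)))
    centroids = [list(vectors[i]) for i in seeds]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assign each document to the most similar centroid.
        assign = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

docs = [
    "stock market prices rise",
    "market prices fall on stock news",
    "team wins the football match",
    "football team loses the match",
]
vocab = sorted({w for d in docs for w in d.lower().split()})
labels = kmeans([vectorize(d, vocab) for d in docs], k=2)
```

On this corpus the two finance documents end up in one cluster and the two football documents in the other, since the topics share no vocabulary.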

 

Uses of Text Mining

Text Mining is a rapidly growing industry, with Big Data as a whole growing at a rate of 40% annually.1 The sheer volume of data captured across all varieties of public & private sectors is without precedent; the capabilities of modern technology and storage media have, in no small part, facilitated this growth. A significant percentage of this data is text-based: corporations capture trillions of bytes of customer transactions, the global research community produces over 1.5 million scholarly articles per year, and over 1.3 billion pieces of content are shared daily on social networking websites such as Facebook. Such a large volume of textual data, coupled with software capable of mining & analyzing it, has made Text Mining a top priority for organizations in multiple industries. Some of those industries include:

  • Retail Sector: The ability to make sense of billions of customer transactions has enabled companies in the retail space to understand the nature of consumer buying habits, particularly in terms of competitive intelligence. In addition, the analysis of customer textual data (e.g., consumer addresses, telephone numbers, feedback) has enabled the retail sector to create in-depth profiles of their customers, their levels of engagement, and how to secure loyalty.
  • Cybersecurity: Text Mining has transformed the capability of cybersecurity organizations, both at the private and government level, to understand existing and potential threats. Different sectors have been able to benefit from the extraction and analysis of raw text data; for example, financial institutions use text mining software to perform comprehensive investigations on customers (e.g., watchlist monitoring, asset tracing) to determine instances of financial crime. 
  • Academia / Research / Science / Healthcare: Higher learning institutions, laboratories, pharmaceutical companies, think tanks, and life science research organizations have all benefitted from the use of Text Mining. Biomedical texts and research articles can now be parsed and automatically analyzed for specific applications in the life sciences, such as gene research. The healthcare sector uses Text Mining to structure and analyze clinical trial data in order to gain deeper insights into results and to refine the designs of future clinical trials.

1: McKinsey Global Institute's (MGI) 'Big Data' report

 

Text Mining Usage in XLMiner

XLMiner features a set of powerful tools to process, normalize, analyze, and visualize textual data. Some key functionality is outlined below:

Pre-Processing

  • Examination of all terms (e.g., words, numbers, e-mail addresses, URLs);
  • Addition or removal of terms to be considered for mining with the option to disregard all remaining terms;
  • Ability to disregard starting & ending terms/phrases;
  • Management of lists of commonly removed words (i.e., add, edit, remove) and maintenance of a term exclusion list.
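A minimal sketch of the stop-word and term-exclusion handling described above. The word list and function name are invented for illustration; XLMiner manages these lists through its own interface.

```python
# Illustrative list of commonly removed words (stop words).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def preprocess(text, exclusions=frozenset()):
    """Tokenize, lowercase, and drop stop words plus user-excluded terms."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS and t not in exclusions]

kept = preprocess("The analysis of the corpus", exclusions={"analysis"})
```

Here `kept` contains only `"corpus"`: the stop words are removed by the standing list, and `"analysis"` by the caller's exclusion list.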

 

Term Normalization

  • Exclusion of terms based on character count;
  • Removal of HTML tags and normalization of URLs, e-mail addresses, numbers, and monetary amounts;
  • Comprehensive vocabulary reduction (e.g., using a single term for large groups of synonyms);
  • Phrase combination (e.g., treating every instance of “forum post” as the single term “post”, rather than as the two distinct terms “forum” and “post”);
  • Definition of a maximum vocabulary size and maximum character length;
  • Stemming of words down to their root terms via an intelligent stemming algorithm;
  • Removal of low-occurring terms by percentile.
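The normalization steps above can be sketched as a small pipeline. The regexes and the crude suffix stripper are simplified stand-ins invented for the example; real normalizers and stemmers (e.g., Porter-style algorithms) are considerably more careful.

```python
import re

def normalize(text):
    """Simplified normalization pass; each regex is a rough stand-in."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)          # remove HTML tags
    text = re.sub(r"https?://\S+", "url", text)   # normalize URLs
    text = re.sub(r"\S+@\S+", "email", text)      # normalize e-mail addresses
    text = re.sub(r"\$\d[\d,.]*", "money", text)  # normalize monetary amounts
    text = re.sub(r"\b\d+\b", "num", text)        # normalize bare numbers
    return text.replace("forum post", "post")     # phrase combination

def stem(word):
    """Crude suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

out = normalize("<b>Visit</b> https://example.com costs $5 for 3 forum posts")
```

After normalization, the markup is gone and the URL, price, and count collapse to the placeholder tokens `url`, `money`, and `num`, so they count as single vocabulary entries.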

 

Term Significance Analysis

  • XLMiner supports the selection of pre-defined weighting/normalization options or user-defined weighting/normalization schemes;
  • Weighting options available include Presence/Absence (indicates existence of a term), Term Frequency (counts the number of term occurrences), Term Frequency–Inverse Document Frequency (TF-IDF) (product of scaled term frequency and inverse document frequency), and Scaled Term Frequency (normalizes the number of term occurrences in the documents);
  • XLMiner also supports the use of Latent Semantic Indexing (LSI) to detect patterns in associations between terms and concepts — XLMiner can also create LSI visualizations to aid in the interpretation process.
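The TF-IDF weighting mentioned above can be sketched as follows. This uses one common formulation, tf × log(N / df); XLMiner's exact scaling and normalization options may differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term by term frequency times log(N / document frequency)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for doc in tokenized for t in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in tokenized]

weights = tf_idf(["stock market news", "market analysis"])
```

Note that "market", which appears in every document, receives weight 0: a term present everywhere carries no discriminating information, which is exactly what IDF is designed to capture.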

 

Visualization Options

After analyzing the content, XLMiner is capable of producing a wide variety of matrices, plots, and charts to help you understand the underlying data. All output options, such as plot selection and term frequency choices, can be configured before performing the text analysis.

Some key visualizations produced from the text mining process are as follows:

Final List of Terms: The top 20 terms occurring in the document collection, the number of documents that include the term, and the top 5 document IDs where the term appears most frequently.


Document Concepts Scatter Plot: A visualization of how documents are explained by pairs of concepts. Data points located far from the origin indicate that a concept pair explains the document well; proximity to zero, however, indicates that it does not.


Scree Plotting: This type of plot conveys the relative importance of each concept, and is effective for quickly determining which concepts best explain the document content.


Summary Reports: Error and performance reports quickly convey the success (or failure) of classification analysis.


Lift Charts and ROC Curves: These charts, coupled with summary report data, help determine whether a model is a good fit for the data.

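As background on how an ROC curve is built, here is a minimal sketch that computes true/false-positive rates at every score threshold and the area under the curve via the trapezoidal rule. The helper names and toy data are invented for illustration; this is not XLMiner's implementation.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points at every score threshold, scores descending."""
    pairs = sorted(zip(scores, labels), reverse=True)
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Perfectly ranked toy scores: every positive outscores every negative.
points = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2])
```

For the toy data the curve rises straight to (0, 1) before moving right, giving an AUC of 1.0; a classifier no better than chance would trace the diagonal with an AUC of 0.5.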

 

Resources