Text as Data - Barry DeVille - E-Book

Text as Data E-Book

Barry DeVille

Description

Text As Data: Combining qualitative and quantitative algorithms within the SAS system for accurate, effective and understandable text analytics

The need for powerful, accurate and increasingly automatic text analysis software in modern information technology has dramatically increased. Fields as diverse as financial management, fraud and cybercrime prevention, pharmaceutical R&D, social media marketing, customer care, and health services are implementing more comprehensive text-inclusive analytics strategies. Text as Data: Computational Methods of Understanding Written Expression Using SAS presents an overview of text analytics and the critical role SAS software plays in combining linguistic and quantitative algorithms in the evolution of this dynamic field.

Drawing on over two decades of experience in text analytics, authors Barry deVille and Gurpreet Singh Bawa examine the evolution of text mining and cloud-based solutions, and the development of SAS Visual Text Analytics. By integrating quantitative data and textual analysis with advanced computer learning principles, the authors demonstrate the combined advantages of SAS compared to standard approaches, and show how approaching text as qualitative data within a quantitative analytics framework produces more detailed, accurate, and explanatory results.

* Understand the role of linguistics, machine learning, and multiple data sources in the text analytics workflow
* Understand how a range of quantitative algorithms and data representations reflect contextual effects to shape meaning and understanding
* Access online data and code repositories, videos, tutorials, and case studies
* Learn how SAS extends quantitative algorithms to produce expanded text analytics capabilities
* Redefine text in terms of data for more accurate analysis

This book offers a thorough introduction to the framework and dynamics of text analytics – and the underlying principles at work – and provides an in-depth examination of the interplay between the qualitative-linguistic and the quantitative, data-driven aspects of data analysis. The treatment begins with a discussion of expression parsing and detection and provides insight into the core principles and practices of text parsing and of theme and topic detection. It includes advanced topics such as contextual effects in numeric and textual data manipulation, fine-tuning text meaning, and disambiguation. As the first resource to leverage the power of SAS for text analytics, Text as Data is an essential resource for SAS users and data scientists in any industry or academic application.


Page count: 253

Year of publication: 2021




Table of Contents

Cover

Title Page

Copyright

Dedication

Preface

Acknowledgments

About the Authors

Introduction

CHAPTER 1: Text Mining and Text Analytics

BACKGROUND AND TERMINOLOGY

TEXT ANALYTICS: WHAT IS IT?

NOTES

CHAPTER 2: Text Analytics Process Overview

TEXT ANALYTICS PROCESSING

PROCESS BUILDING BLOCKS

PROCESS DESCRIPTION

LINGUISTIC PROCESSING

INTERNAL REPRESENTATION AND TEXT PRODUCTS

NOTES

CHAPTER 3: Text Data Source Capture

TEXT MINING DATA SOURCE ASSEMBLY

CONSUMING LINGUISTICS TEXT PRODUCTS

NOTES

CHAPTER 4: Document Content and Characterization

AUTHORSHIP ANALYTICS: EARLY TEXT INDICATORS AND MEASURES

A CASE STUDY IN GENDER DETECTION

SUMMARIZATION AND DISCOURSE ANALYSIS

FACT EXTRACTION

CONCLUSION

NOTES

CHAPTER 5: Textual Abstraction: Latent Structure, Dimension Reduction

TEXT MINING DATA SOURCE ASSEMBLY

LATENT STRUCTURE AND DIMENSIONAL REDUCTION

ROUGH MEANING – APPROXIMATION FOR SINGULAR VALUE DIMENSIONS

CONCLUSION

NOTES

CHAPTER 6: Classification and Prediction

USE CASE SCENARIO

IDENTIFYING DRIVERS OF TEXTUAL CONSUMER FEEDBACK USING DISTANCE-BASED CLUSTERING AND MATRIX FACTORIZATION

NOTES

CHAPTER 7: Boolean Methods of Classification and Prediction

RULE-BASED TEXT CLASSIFICATION AND PREDICTION

EXAMPLE OF BOOLEAN RULES APPLIED TO TEXT MINING VACCINE DATA

SUMMARY

NOTES

CHAPTER 8: Speech to Text

INTRODUCTION

PROCESSING AUDIO FEEDBACK

FURTHER ANALYSIS: SENTIMENT AND LATENT TOPICS

CONCLUSION

NOTES

Appendix A: Mood State Identification in Text

ORIGINS OF MOOD STATE IDENTIFICATION

NOTES

Appendix B: A Design Approach to Characterizing Users Based on Audio Interactions on a Conversational AI Platform

AUDIO-BASED USER INTERACTION INFERENCE

IMPLEMENTATION SCENARIO: VOICE-BASED CONVERSATIONAL AI PLATFORM

NOTE

Appendix C: SAS Patents in Text Analytics

Glossary

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1 Trains and Boats Example: Document Collection

Table 1.2 Entropy Calculation for Trains and Boats Example

Chapter 2

Table 2.1 N-gram Illustration

Table 2.2 Advantage of Using n-grams vs. Unigrams

Table 2.3 Example Parse Result for the First Line of Mark Antony Oration

Table 2.4 Internal Representation of Text Products.

Table 2.5 Terms by Document Word Frequency

Table 2.6 Term by Document Representation

Chapter 3

Table 3.1 Format of the Conference Proceedings Input File

Chapter 4

Table 4.1 Function Words

Table 4.2 Feature Types and Machine Learning Methods

Table 4.3 Baby Names and Likely Gender (from US Social Security Records)

Table 4.4 Textual Features as Predictors from Research

Table 4.5 Summary of Extracted Text Products Used in Document Characterization

Chapter 5

Table 5.1 Example Term by Document Table

Table 5.2 Example Term-Document Co-occurrence Table

Table 5.3 Example Newswire-like text to Illustrate LSA

Table 5.4 Term x Document Matrix Representation for Newswire Data

Table 5.5 Binary Co-Occurrence Matrix for Newswire Data

Table 5.6 Clustering Term-Documents by Term Co-Occurrence

Table 5.7 Singular Value Weights for Newswire Terms – Three Dimensions

Table 5.8 Singular Value Dimensional Scores for Each Document Based on All Weigh...

Table 5.9 Absolute Singular Value Dimensional Scores for Each Document

Table 5.10 Performance of Cluster Derived and SVD Derived Semantic Categories

Table 5.11 Columns in the Voted Kaggle Data Set

Table 5.12 Document Topic Weights

Table 5.13 Topic Distribution across Documents

Table 5.14 Top Twenty Word Distributions by Topic

Table 5.15 Features for the Neural Network Setup

Table 5.16 Arguments for Neural Network Function

Table 5.17 Relative Importance of Terms

Table 5.18 Relative Ranked Importance of Terms in Respective Subsets

Table 5.19 Top Terms in Respective Subsets

Chapter 6

Table 6.1 Record Layout of Warranty Action with FAILDES Text Target Field

Table 6.2 Claim Code Distributions in Host and Sample Data Sets

Table 6.3 Fixed Field Data Used in Prediction

Table 6.4 Sample Data from Amazon

Table 6.5 Field Descriptors Amazon Feedback Data

Table 6.6 Top Bigrams from Feedback in Retailer 1

Table 6.7 Top Bigrams from Feedback in Retailer 2

Table 6.8 Average Sentiment Polarity of Customer Feedback

Table 6.9 Example Input Matrix for the Boolean Document Model

Table 6.10 Feature Probabilities with Respect to Negative Assessment, Retailer 1

Table 6.11 Feature Probabilities with Respect to Negative Assessment, Retailer 2

Chapter 7

Table 7.1 Example VAERS Data

Table 7.2 Text Clusters Identified in VAERS Data

Table 7.3 Boolean Rule (Conjunctions) Derived from VAERS Data

Chapter 8

Table 8.1 Improvement in Accuracy Across Methods for Gender Classification

Appendix A

Table A.1 Mood State Dimensional Labels and Polar Composites

Table A.2 Textual Indicators of Six Positive Mood States from Web Search

Table A.3 Example Term Weight Adjustments

Table A.4 Sample Document with Raw Mood State Scores

Table A.5 Mood Score Adjusted by Document Word Length

Table A.6 Mean and Standard Deviation Measures for the Mood State Dimensional Sc...

Table A.7 Standardized Scores for the Dimensional Mood States

Table A.8 Bucket Values and Standard Score Cutoffs for the “Agreeable” Dimension

Table A.9 Dimensional Scores Mapped into Ordered Categories (Based on Standard S...

Table A.10 Overall Document Mood Score Based on Average Calculation Score

Table A.11 Overall Mood State for the Example Collection on Six Dimensions

List of Illustrations

Chapter 1

Figure 1.1 Traffic sign in Cherokee syllabary, Tahlequah, Oklahoma.

Figure 1.2 Example of cuneiform recording the distribution of beer in southern...

Figure 1.3 Shang oracle bone script for character “Eye.” Modern character is 目...

Figure 1.4 Modern Chinese representation of “eye” (mù).

Figure 1.5 Encode–decode send–receive communications model.

Chapter 2

Figure 2.1 Main stages of the text-mining process.

Figure 2.2 Category-oriented folder structure.

Figure 2.3 Text treatments, transformations, derivations, and extractions.

Figure 2.4 Excerpt of Marc Antony's address from Shakespeare's Julius Caesar.

Chapter 3

Figure 3.1 Average number of papers at SUGI-SGF 1989–2012.

Figure 3.2 Example snippet of text input to conference proceedings analysis.

Figure 3.3 A high-level semantic map of conference proceedings, 1989–2012.

Chapter 4

Figure 4.1 Assessment of various word features as predictors of gender in WebM...

Figure 4.2 Illustrative results of gender predictors in the WebMD data set.

Figure 4.3 Example of medical notes used as input.

Figure 4.4 Result of text parsing to match notable document features for summa...

Figure 4.5 Pro-forma text summary production based on the structured textual s...

Figure 4.6 Example summary record (hospitalization).

Figure 4.7 Example of text analytics categorization.

Figure 4.8 Example date definition (using standard Perl regular expressions).

Figure 4.9 Illustration of the relationship between extracted fields of data a...

Figure 4.10 Example report applying approach to VAERS data.

Figure 4.11 A Process flow diagram for VAERS report.

Figure 4.12 Fact extraction (information in context).

Figure 4.13 Example of sentiment extraction.

Figure 4.14 Example of conditional inference.

Figure 4.15 Typical pro-forma report output.

Figure 4.16 Example of raw input and recognition features.

Figure 4.17 Results of conditional inference.

Figure 4.18 Source-target production of the pro forma summary.

Figure 4.19 Pro forma summary.

Chapter 5

Figure 5.1 The factorization process.

Figure 5.2 SVDs for a term-document frequency matrix.

Figure 5.3 Text variance explained by singular value dimensional products for ...

Figure 5.4 Illustration of effect of rotation in the SVD projections in a coll...

Figure 5.5 Example of adjustments to SVD computation in the construction of to...

Figure 5.6 Snapshot of Kaggle test data.

Figure 5.7 Plot of fitted neural network on the entire dataset (screen shot).

Figure 5.8 Relative importance for each of the explanatory variables (subset 1...

Figure 5.9 Neural network for subset 1 (screen shot).

Figure 5.10 Neural network for subset 2 (screen shot).

Chapter 6

Figure 6.1 Test scenario record structure illustration.

Figure 6.2 Captured response comparison of four analytical approaches.

Chapter 7

Figure 7.1 Results of preliminary text clustering in VAERS incident data.

Figure 7.2 Accuracy comparison of numeric data vs. text data model.

Figure 7.3 Boolean rule tree display of VAERS model.

Chapter 8

Figure 8.1 Audio sample plot (amplitude vs. time).

Figure 8.2 Periodogram using Fast Fourier Transforms (FFTs).

Figure 8.3 Illustrative functionality of Mel Filterbank.

Figure 8.4 Mel Filterbank showing overlapping frequency patterns.

Figure 8.5 MFCC features vs. time (without scaling).

Figure 8.6 MFCC features vs. time (scaled).

Figure 8.7 MLP architecture.

Figure 8.8 Typical CNN architecture.

Figure 8.9 Variable importance (from Random Forest).

Figure 8.10 Process flow of generating value from audio feedback.

Appendix A

Figure A.1 Social media monitoring mood through week.

Figure A.2 Mood state dimensional components and polarity.

Figure A.3 Mood state score development process.

Figure A.4 Example web search for mood state synonyms.

Figure A.5 Text and target metric mapping.

Appendix B

Figure B.1 Implementation scenario.

Figure B.2 Component process flow.

Figure B.3 Acoustic analytic record construction.

Figure B.4 Audio signal codification optimizer.

Figure B.5 Textual latent value extractor.

Figure B.6 Textual latent value extractor (detail).



Wiley and SAS Business Series

The Wiley and SAS Business Series presents books that help senior level managers with their critical management decisions.

Titles in the Wiley and SAS Business Series include:

The Analytic Hospitality Executive: Implementing Data Analytics in Hotels and Casinos

by Kelly A. McGuire

Analytics: The Agile Way

by Phil Simon

The Analytics Lifecycle Toolkit: A Practical Guide for an Effective Analytics Capability

by Gregory S. Nelson

Anti-Money Laundering Transaction Monitoring Systems Implementation: Finding Anomalies

by Derek Chau and Maarten van Dijck Nemcsik

Artificial Intelligence for Marketing: Practical Applications

by Jim Sterne

Business Analytics for Managers: Taking Business Intelligence Beyond Reporting (Second Edition)

by Gert H. N. Laursen and Jesper Thorlund

Business Forecasting: The Emerging Role of Artificial Intelligence and Machine Learning

by Michael Gilliland, Len Tashman, and Udo Sglavo

The Cloud-Based Demand-Driven Supply Chain

by Vinit Sharma

Consumption-Based Forecasting and Planning: Predicting Changing Demand Patterns in the New Digital Economy

by Charles W. Chase

Credit Risk Analytics: Measurement Techniques, Applications, and Examples in SAS

by Bart Baesens, Daniel Roesch, and Harald Scheule

Demand-Driven Inventory Optimization and Replenishment: Creating a More Efficient Supply Chain (Second Edition)

by Robert A. Davis

Economic Modeling in the Post Great Recession Era: Incomplete Data, Imperfect Markets

by John Silvia, Azhar Iqbal, and Sarah Watt House

Enhance Oil & Gas Exploration with Data-Driven Geophysical and Petrophysical Models

by Keith Holdaway and Duncan Irving

Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection

by Bart Baesens, Veronique Van Vlasselaer, and Wouter Verbeke

Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards (Second Edition)

by Naeem Siddiqi

JMP Connections: The Art of Utilizing Connections in Your Data

by John Wubbel

Leaders and Innovators: How Data-Driven Organizations Are Winning with Analytics

by Tho H. Nguyen

On-Camera Coach: Tools and Techniques for Business Professionals in a Video-Driven World

by Karin Reed

Next Generation Demand Management: People, Process, Analytics, and Technology

by Charles W. Chase

A Practical Guide to Analytics for Governments: Using Big Data for Good

by Marie Lowman

Profit from Your Forecasting Software: A Best Practice Guide for Sales Forecasters

by Paul Goodwin

Project Finance for Business Development

by John E. Triantis

Smart Cities, Smart Future: Showcasing Tomorrow

by Mike Barlow and Cornelia Levy-Bencheton

Statistical Thinking: Improving Business Performance (Third Edition)

by Roger W. Hoerl and Ronald D. Snee

Strategies in Biomedical Data Science: Driving Force for Innovation

by Jay Etchings

Style and Statistics: The Art of Retail Analytics

by Brittany Bullard

Text as Data: Computational Methods of Understanding Written Expression Using SAS

by Barry deVille and Gurpreet Singh Bawa

Transforming Healthcare Analytics: The Quest for Healthy Intelligence

by Michael N. Lewis and Tho H. Nguyen

Visual Six Sigma: Making Data Analysis Lean (Second Edition)

by Ian Cox, Marie A. Gaudard, and Mia L. Stephens

Warranty Fraud Management: Reducing Fraud and Other Excess Costs in Warranty and Service Operations

by Matti Kurvinen, Ilkka Töyrylä, and D. N. Prabhakar Murthy

For more information on any of the above titles, please visit www.wiley.com.

Text as Data

Computational Methods of Understanding Written Expression Using SAS

By Barry deVille and Gurpreet Singh Bawa

Copyright © 2022 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our website at www.wiley.com.

Library of Congress Cataloging-in-Publication Data is Available:

9781119487128 (hardback)

9781119487173 (ePDF)

9781119487159 (ePub)

Cover Design: Wiley

To all those who unconditionally love and support authors and their writing processes – especially our life partners, Maya McNeilly and Dilpreet Kaur, who go above and beyond.

Preface

This book provides an end-to-end description of the text analytics process with examples drawn from a range of case studies using various capabilities of SAS text analytics and the associated SAS computing environment. Qualitative and quantitative approaches within the SAS environment are covered across the entire text analytics life cycle, from document capture, document characterization, and document understanding through to operational deployment.

We cover procedure-based, engineering approaches to text analytics, as well as more discovery-based quantitative approaches. Since much of the text analytics process depends on the text capture and text preprocessing environment, these aspects of text analytics are covered as well.

Acknowledgments

This work was initiated and promoted by Julie Palmieri, serving as editor-in-chief of SAS Press. James Allen Cox has consistently offered advice and review throughout and gave a detailed review of early versions of the draft. Tom Sabo gave advice and review and made significant contributions to the chapter on Boolean rules. Our colleagues Saratendu Sethi, Terry Woodfield, and Sanford Gayle have provided decades of advice on text analytics in general. Elisha Benjamin of John Wiley & Sons was a great source of advice and assistance throughout the project. Wiley executive editor Sheck Cho is the consummate professional and both a rock and a beacon for us aspiring authors.

As authors, we acknowledge their invaluable advice, assistance, encouragement, and also humbly acknowledge that any remaining faults are ours alone.

About the Authors

Barry deVille is a practitioner, developer, and author in the fields of statistics, data science, and text analytics. During a decades-long career at SAS, he collaborated extensively with the text analytic R&D development team, deploying text mining solutions to a variety of global clients in various industrial, financial, health, and social media applications. This work resulted in the award of numerous US patents on decision tree induction algorithms and multidimensional text analytics. Prior to joining SAS, he worked with the National Research Council and other government and commercial entities in Canada in the development and commercialization of statistical and machine learning algorithms.

Gurpreet Singh Bawa has practiced internationally in the areas of statistics with an emphasis on artificial intelligence (AI) and machine learning (ML). He was awarded a PhD at Panjab University, Chandigarh, India, in the fields of AI and ML. He has authored numerous publications in national and international journals. His research in the area of unstructured data analysis has led to numerous patent applications and awards (including one with co-author deVille on social community identification and automatic document classification). He also works in breakeven analysis and portfolio optimization. He is currently authoring a book on advanced mathematics.

Introduction

Text analytics is a collection of computer methods that use semantic and numerical processing to convert collections of text into identified components that carry meaning and function and can be manipulated quantitatively. Meaning assignment is a semantic process that leads to greater understanding of the text. Numerical manipulation leads to a range of data summarization approaches that typically reduce complexity, capture multiple relationships, and highlight tendencies. Text analytics incorporates semantic and numerical text processing in a synergistic process that leads to greater understanding of various collections of text.

In this treatment we also touch on speech applications so we can see how spoken words, like written words, can be transformed into representations that can be manipulated and summarized quantitatively.

Chapter 1 expands our definition of text analytics and provides some background on the development of written language and systems of writing that are used to capture and communicate meaning.

Chapter 2 provides an overview of the end-to-end process of text analytics. A generic template is described that can enhance our understanding of the various aspects of text analytics and that can also serve as an organizing framework for discussing text analytics. These processes are further described in Chapter 3.

Linguistic processing and associated forms of document characterization are discussed in Chapter 4. Linguistic processing is the front-end text analytics intake process to read and parse the incoming text stream to identify useful and interesting textual components such as parts of speech, phrases, expressions, and special terms.

Chapter 5 shows how numerical approaches to data, including the production of dimensional summaries and data reduction approaches, can be productively applied to creating meaningful textual summaries and dimensional products, like text topics, that help us understand the content of text collections.

In Chapter 6 we provide examples of how quantitative text products can be used for classification and prediction tasks. A real-world industrial use case is discussed.

Chapter 7 discusses the architecture within SAS that unifies linguistic and quantitative processing and so blends the strengths of these two approaches. We show how Boolean rules are constructed, how these are derived from quantitative operations, and how they serve a linguistic purpose.

Chapter 8 provides a case study in speech processing and shows how audio signals can be analyzed and manipulated much like text products to create analytical reports.

There is also a glossary of specialized terms and three appendices. Appendix A expands on the discussion of text characterization and provides an example of how mood state extracted from text can be used in text analytics. Appendix B provides a discussion and architectural approach to using audio processing to infer end user persona characteristics in the construction of artificial intelligence computer-user interaction interfaces. Appendix C provides an annotated summary description of critical patents that have been assigned to SAS. A range of important patents are covered, including an initial patent awarded to extract dimensional products from text and some of the more recent patents that address the unified approach to linguistic and numerical processing.

CHAPTER 1: Text Mining and Text Analytics

This chapter describes some of the background and recent history of text analytics and provides real-world examples of how text analytics works and solves business problems. This treatment provides examples of common forms of text analytics and examples of solution approaches. The discussion ranges from a history of the analytical treatment of text expression up to the most recent developments and applications.

BACKGROUND AND TERMINOLOGY

The analysis of written and spoken expression has been developing as a computer application over several decades. Some of the earliest research in machine learning and artificial intelligence dealt with the problem of reading and interpreting text, as well as with text translation (machine translation). These early activities gave rise to a field of computer science known as natural language processing (NLP). The recent rapid development of computer power – including processing power, large data, high bandwidth communication, and cloud-based, high-capacity computer memory – has provided a major new (and considerably broadened) emphasis on computerized text processing and text analysis.

TEXT ANALYTICS: WHAT IS IT?

Text processing and text analysis are components of the developing area of understanding written and spoken expression. Commonly occurring text documents – such as traditional newspapers, journals and periodicals, and, more recently, electronic documents, such as social media posts and emails – are forms of written expression. This active, multilayered area in current computer applications joins well-established, traditional fields such as linguistics and literary analysis to form the outline of the emerging field we call text analytics.

Current approaches to text analytics operate in two reinforcing directions that incorporate traditional forms of linguistic and literary analysis with a wide range of statistical, artificial intelligence (AI), and cognitive computing techniques to effectively process written and spoken expressions. The decoded expressions are used to drive a wide range of computer-mediated inference tasks that include artificial intelligence, cognitive computing, and statistical inference. An everyday example is when we speak or type in a destination in order to receive an optimal driving route. Similarly, a call center agent might decipher multiple forms of common requests in order to construct the most effective solution approach.

Our treatment throughout the chapters to come includes examples of common forms of text analytics and examples of solution approaches. The discussion ranges from a history of the analytical treatment of text expression up to the most recent developments and applications. Since speech is quickly becoming an important form of unstructured data, a final chapter takes up the topic of rendering speech to text.

Computer science and AI emerged as formal disciplines in the aftermath of World War II. An early application of computers to the analysis of written expression, natural language processing, took a universal approach, designed to apply regardless of what language the text was written in – English, Spanish, or Chinese. The techniques that have been developed also apply regardless of the source of the text to be analyzed. With the widespread availability of speech-to-text engines, it is also possible to consider a wide variety of spoken documents as potential sources for text analytics.

An important goal of NLP is to decompose text constructs (sentences, paragraphs, articles, chapters) into various kinds of entities, verbs, semantic constructs (like articles and conjunctions), and so on. The sentence “See Spot run” may be processed and encoded into an NLP representation as: declarative sentence (intransitive); Spot – Subject (Animal/Dog); run – Verb (motion).
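The decomposition described above can be illustrated with a minimal Python sketch. It is not how SAS or any production NLP parser works: the lexicon is a hypothetical, hand-built lookup table, and the role and class assigned to "see" are an assumption added here for completeness. Only the labels for "Spot" and "run" come from the example in the text.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical hand-built lexicon: word -> (role, semantic class).
# The entries for "spot" and "run" follow the example in the text;
# the entry for "see" is an assumption made for this sketch.
LEXICON = {
    "see": ("Verb", "perception"),
    "spot": ("Subject", "Animal/Dog"),
    "run": ("Verb", "motion"),
}


@dataclass
class Token:
    text: str
    role: str
    semantic_class: str


def encode(sentence: str) -> List[Token]:
    """Split a sentence on whitespace and look each word up in the lexicon."""
    tokens = []
    for word in sentence.split():
        cleaned = word.strip(".,!?")
        role, semantic_class = LEXICON.get(cleaned.lower(), ("Unknown", "unknown"))
        tokens.append(Token(cleaned, role, semantic_class))
    return tokens


encoded = encode("See Spot run")
for token in encoded:
    print(f"{token.text} - {token.role} ({token.semantic_class})")
```

A real parser would derive these labels from grammar rules or a trained model rather than a fixed table; the point of the sketch is only the shape of the output, in which each word carries a grammatical role and a semantic class.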

Historically, NLP relied on various linguistic analysis capabilities, including extensive logical processing and reasoning capabilities. As computing capabilities have expanded, NLP has increasingly relied on a range of computational approaches to enhance the range of NLP results. An emerging area of NLP includes statistical natural language processing (SNLP). This form of NLP can be used to craft high-level representations of textual documents so that relationships between and among the documents can be computed statistically. The statistical capability also improves the accuracy of the NLP processing itself.
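One simple way to compute a statistical relationship between documents is to represent each document as a vector of term counts and compare the vectors with cosine similarity. The Python sketch below assumes that representation; the three short documents are invented for illustration, and cosine similarity is just one common choice, not necessarily the measure used in SAS.

```python
import math
from collections import Counter


def term_counts(text):
    """Represent a document as a bag of lower-cased word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    shared_terms = set(a) & set(b)
    dot = sum(a[term] * b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented example documents (an assumption, not data from the book).
docs = {
    "d1": "the train left the station",
    "d2": "the train reached the station late",
    "d3": "a boat crossed the river",
}
vectors = {name: term_counts(text) for name, text in docs.items()}

sim_12 = cosine_similarity(vectors["d1"], vectors["d2"])
sim_13 = cosine_similarity(vectors["d1"], vectors["d3"])
print(f"d1 vs d2: {sim_12:.3f}")
print(f"d1 vs d3: {sim_13:.3f}")
```

The two train documents share most of their terms and score higher than the train and boat pair, which overlap only on "the". In practice, common function words would usually be down-weighted or removed before comparing documents.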

One recent area of written language processing includes statistical document analysis (SDA). Like SNLP, SDA enables us to show the statistical relationships between and among the various components of a textual document. Further, it enables us to summarize the document using multivariate statistical techniques like cluster analysis and latent class analysis. Predictive analytics such as regression analysis, decision trees, and neural networks can also be used.

As computer processing and storage have continued to grow, so too have a variety of deep learning applications. One such application is the Bidirectional Encoder Representations from Transformers (BERT), a deep-learning model developed by researchers at Google AI Language [i].

BERT can be leveraged for tasks such as categorization, entity extraction, and natural language generation. Deep learning approaches require significant computing power and training. As the area of text analytics continues to unfold, we will likely see how deep learning approaches complement the capabilities offered in traditional text analytics, which are less computationally intensive and more than adequate for a wide range of tasks.

The fields of text mining and text analytics are recent applied areas of SDA used in a variety of general-purpose social and economic settings. Text mining often refers to the construction of statistical or numerical models or predictions. Common sources of data include customer service logs and emails, as well as customer use records for warranty issue analysis and defect detection. Text analytics often refers to semantically based applications – for example, customer analytics (who talks to whom and what do they say?), competitive analysis (brand metrics, mentions), and content management (the creation of taxonomies, web page characterization).

Brief History of Text

Language is a form of communication, and text is a written form of language. Text comes in a variety of symbolic forms. In addition to the alphabetic representation we see capturing the written expression in this text, there are other encoding systems such as syllabaries that capture spoken syllables and logograms that capture pictographic representations. Linguistics distinguishes between phonograms – which capture parts of words like syllables in written expression – and logograms – which capture entire concepts.

Figure 1.1 Traffic sign in Cherokee syllabary, Tahlequah, Oklahoma.

Source: Shot November 11, 2007. By Uyvsdi. License: Public Domain.

Figure 1.1 shows an example of a pictographic representation – the STOP sign itself – an alphabetic representation (in Latin script) that spells the word “STOP” and a syllabary – in this case, one used to record the Cherokee language.

One of the earliest true writing systems, dating to the third millennium BCE,