Text As Data: Combining qualitative and quantitative algorithms within the SAS system for accurate, effective and understandable text analytics

The need for powerful, accurate and increasingly automatic text analysis software in modern information technology has dramatically increased. Fields as diverse as financial management, fraud and cybercrime prevention, pharmaceutical R&D, social media marketing, customer care, and health services are implementing more comprehensive, text-inclusive analytics strategies. Text as Data: Computational Methods of Understanding Written Expression Using SAS presents an overview of text analytics and the critical role SAS software plays in combining linguistic and quantitative algorithms in the evolution of this dynamic field. Drawing on over two decades of experience in text analytics, authors Barry deVille and Gurpreet Singh Bawa examine the evolution of text mining and cloud-based solutions, and the development of SAS Visual Text Analytics. By integrating quantitative data and textual analysis with advanced computer learning principles, the authors demonstrate the combined advantages of SAS compared to standard approaches, and show how approaching text as qualitative data within a quantitative analytics framework produces more detailed, accurate, and explanatory results.
* Understand the role of linguistics, machine learning, and multiple data sources in the text analytics workflow
* Understand how a range of quantitative algorithms and data representations reflect contextual effects to shape meaning and understanding
* Access online data and code repositories, videos, tutorials, and case studies
* Learn how SAS extends quantitative algorithms to produce expanded text analytics capabilities
* Redefine text in terms of data for more accurate analysis

This book offers a thorough introduction to the framework and dynamics of text analytics – and the underlying principles at work – and provides an in-depth examination of the interplay between qualitative-linguistic and quantitative, data-driven aspects of data analysis. The treatment begins with a discussion on expression parsing and detection and provides insight into the core principles and practices of text parsing, theme, and topic detection. It includes advanced topics such as contextual effects in numeric and textual data manipulation, fine-tuning text meaning and disambiguation. As the first resource to leverage the power of SAS for text analytics, Text as Data is an essential resource for SAS users and data scientists in any industry or academic application.
Page count: 253
Year of publication: 2021
Cover
Title Page
Copyright
Dedication
Preface
Acknowledgments
About the Authors
Introduction
CHAPTER 1: Text Mining and Text Analytics
BACKGROUND AND TERMINOLOGY
TEXT ANALYTICS: WHAT IS IT?
NOTES
CHAPTER 2: Text Analytics Process Overview
TEXT ANALYTICS PROCESSING
PROCESS BUILDING BLOCKS
PROCESS DESCRIPTION
LINGUISTIC PROCESSING
INTERNAL REPRESENTATION AND TEXT PRODUCTS
NOTES
CHAPTER 3: Text Data Source Capture
TEXT MINING DATA SOURCE ASSEMBLY
CONSUMING LINGUISTICS TEXT PRODUCTS
NOTES
CHAPTER 4: Document Content and Characterization
AUTHORSHIP ANALYTICS: EARLY TEXT INDICATORS AND MEASURES
A CASE STUDY IN GENDER DETECTION
SUMMARIZATION AND DISCOURSE ANALYSIS
FACT EXTRACTION
CONCLUSION
NOTES
CHAPTER 5: Textual Abstraction: Latent Structure, Dimension Reduction
TEXT MINING DATA SOURCE ASSEMBLY
LATENT STRUCTURE AND DIMENSIONAL REDUCTION
ROUGH MEANING – APPROXIMATION FOR SINGULAR VALUE DIMENSIONS
CONCLUSION
NOTES
CHAPTER 6: Classification and Prediction
USE CASE SCENARIO
IDENTIFYING DRIVERS OF TEXTUAL CONSUMER FEEDBACK USING DISTANCE-BASED CLUSTERING AND MATRIX FACTORIZATION
NOTES
CHAPTER 7: Boolean Methods of Classification and Prediction
RULE-BASED TEXT CLASSIFICATION AND PREDICTION
EXAMPLE OF BOOLEAN RULES APPLIED TO TEXT MINING VACCINE DATA
SUMMARY
NOTES
CHAPTER 8: Speech to Text
INTRODUCTION
PROCESSING AUDIO FEEDBACK
FURTHER ANALYSIS: SENTIMENT AND LATENT TOPICS
CONCLUSION
NOTES
Appendix A: Mood State Identification in Text
ORIGINS OF MOOD STATE IDENTIFICATION
NOTES
Appendix B: A Design Approach to Characterizing Users Based on Audio Interactions on a Conversational AI Platform
AUDIO-BASED USER INTERACTION INFERENCE
IMPLEMENTATION SCENARIO: VOICE-BASED CONVERSATIONAL AI PLATFORM
NOTE
Appendix C: SAS Patents in Text Analytics
Glossary
Index
End User License Agreement
Chapter 1
Table 1.1 Trains and Boats Example: Document Collection
Table 1.2 Entropy Calculation for Trains and Boats Example
Chapter 2
Table 2.1 N-gram Illustration
Table 2.2 Advantage of Using n-grams vs. Unigrams
Table 2.3 Example Parse Result for the First Line of Mark Antony Oration
Table 2.4 Internal Representation of Text Products.
Table 2.5 Terms by Document Word Frequency
Table 2.6 Term by Document Representation
Chapter 3
Table 3.1 Format of the Conference Proceedings Input File
Chapter 4
Table 4.1 Function Words
Table 4.2 Feature Types and Machine Learning Methods
Table 4.3 Baby Names and Likely Gender (from US Social Security Records)
Table 4.4 Textual Features as Predictors from Research
Table 4.5 Summary of Extracted Text Products Used in Document Characterization
Chapter 5
Table 5.1 Example Term by Document Table
Table 5.2 Example Term-Document Co-occurrence Table
Table 5.3 Example Newswire-like text to Illustrate LSA
Table 5.4 Term x Document Matrix Representation for Newswire Data
Table 5.5 Binary Co-Occurrence Matrix for Newswire Data
Table 5.6 Clustering Term-Documents by Term Co-Occurrence
Table 5.7 Singular Value Weights for Newswire Terms – Three Dimensions
Table 5.8 Singular Value Dimensional Scores for Each Document Based on All Weigh...
Table 5.9 Absolute Singular Value Dimensional Scores for Each Document
Table 5.10 Performance of Cluster Derived and SVD Derived Semantic Categories
Table 5.11 Columns in the Voted Kaggle Data Set
Table 5.12 Document Topic Weights
Table 5.13 Topic Distribution across Documents
Table 5.14 Top Twenty Word Distributions by Topic
Table 5.15 Features for the Neural Network Setup
Table 5.16 Arguments for Neural Network Function
Table 5.17 Relative Importance of Terms
Table 5.18 Relative Ranked Importance of Terms in Respective Subsets
Table 5.19 Top Terms in Respective Subsets
Chapter 6
Table 6.1 Record Layout of Warranty Action with FAILDES Text Target Field
Table 6.2 Claim Code Distributions in Host and Sample Data Sets
Table 6.3 Fixed Field Data Used in Prediction
Table 6.4 Sample Data from Amazon
Table 6.5 Field Descriptors Amazon Feedback Data
Table 6.6 Top Bigrams from Feedback in Retailer 1
Table 6.7 Top Bigrams from Feedback in Retailer 2
Table 6.8 Average Sentiment Polarity of Customer Feedback
Table 6.9 Example Input Matrix for the Boolean Document Model
Table 6.10 Feature Probabilities with Respect to Negative Assessment, Retailer 1
Table 6.11 Feature Probabilities with Respect to Negative Assessment, Retailer 2
Chapter 7
Table 7.1 Example VAERS Data
Table 7.2 Text Clusters Identified in VAERS Data
Table 7.3 Boolean Rule (Conjunctions) Derived from VAERS Data
Chapter 8
Table 8.1 Improvement in Accuracy Across Methods for Gender Classification
Appendix A
Table A.1 Mood State Dimensional Labels and Polar Composites
Table A.2 Textual Indicators of Six Positive Mood States from Web Search
Table A.3 Example Term Weight Adjustments
Table A.4 Sample Document with Raw Mood State Scores
Table A.5 Mood Score Adjusted by Document Word Length
Table A.6 Mean and Standard Deviation Measures for the Mood State Dimensional Sc...
Table A.7 Standardized Scores for the Dimensional Mood States
Table A.8 Bucket Values and Standard Score Cutoffs for the “Agreeable” Dimension
Table A.9 Dimensional Scores Mapped into Ordered Categories (Based on Standard S...
Table A.10 Overall Document Mood Score Based on Average Calculation Score
Table A.11 Overall Mood State for the Example Collection on Six Dimensions
Chapter 1
Figure 1.1 Traffic sign in Cherokee syllabary, Tahlequah, Oklahoma.
Figure 1.2 Example of cuneiform recording the distribution of beer in southern...
Figure 1.3 Shang oracle bone script for character “Eye.” Modern character is 目...
Figure 1.4 Modern Chinese representation of “eye” (mù).
Figure 1.5 Encode–decode send–receive communications model.
Chapter 2
Figure 2.1 Main stages of the text-mining process.
Figure 2.2 Category-oriented folder structure.
Figure 2.3 Text treatments, transformations, derivations, and extractions.
Figure 2.4 Excerpt of Marc Antony's address from Shakespeare's Julius Caesar.
Chapter 3
Figure 3.1 Average number of papers at SUGI-SGF 1989–2012.
Figure 3.2 Example snippet of text input to conference proceedings analysis.
Figure 3.3 A high-level semantic map of conference proceedings, 1989–2012.
Chapter 4
Figure 4.1 Assessment of various word features as predictors of gender in WebM...
Figure 4.2 Illustrative results of gender predictors in the WebMD data set.
Figure 4.3 Example of medical notes used as input.
Figure 4.4 Result of text parsing to match notable document features for summa...
Figure 4.5 Pro-forma text summary production based on the structured textual s...
Figure 4.6 Example summary record (hospitalization).
Figure 4.7 Example of text analytics categorization.
Figure 4.8 Example date definition (using standard Perl regular expressions).
Figure 4.9 Illustration of the relationship between extracted fields of data a...
Figure 4.10 Example report applying approach to VAERS data.
Figure 4.11 A Process flow diagram for VAERS report.
Figure 4.12 Fact extraction (information in context).
Figure 4.13 Example of sentiment extraction.
Figure 4.14 Example of conditional inference.
Figure 4.15 Typical pro-forma report output.
Figure 4.16 Example of raw input and recognition features.
Figure 4.17 Results of conditional inference.
Figure 4.18 Source-target production of the pro forma summary.
Figure 4.19 Pro forma summary.
Chapter 5
Figure 5.1 The factorization process.
Figure 5.2 SVDs for a term-document frequency matrix.
Figure 5.3 Text variance explained by singular value dimensional products for ...
Figure 5.4 Illustration of effect of rotation in the SVD projections in a coll...
Figure 5.5 Example of adjustments to SVD computation in the construction of to...
Figure 5.6 Snapshot of Kaggle test data.
Figure 5.7 Plot of fitted neural network on the entire dataset (screen shot).
Figure 5.8 Relative importance for each of the explanatory variables (subset 1...
Figure 5.9 Neural network for subset 1 (screen shot).
Figure 5.10 Neural network for subset 2 (screen shot).
Chapter 6
Figure 6.1 Test scenario record structure illustration.
Figure 6.2 Captured response comparison of four analytical approaches.
Chapter 7
Figure 7.1 Results of preliminary text clustering in VAERS incident data.
Figure 7.2 Accuracy comparison of numeric data vs. text data model.
Figure 7.3 Boolean rule tree display of VAERS model.
Chapter 8
Figure 8.1 Audio sample plot (amplitude vs. time).
Figure 8.2 Periodogram using Fast Fourier Transforms (FFTs).
Figure 8.3 Illustrative functionality of Mel Filterbank.
Figure 8.4 Mel Filterbank showing overlapping frequency patterns.
Figure 8.5 MFCC features vs. time (without scaling).
Figure 8.6 MFCC features vs. time (scaled).
Figure 8.7 MLP architecture.
Figure 8.8 Typical CNN architecture.
Figure 8.9 Variable importance (from Random Forest).
Figure 8.10 Process flow of generating value from audio feedback.
Appendix A
Figure A.1 Social media monitoring mood through week.
Figure A.2 Mood state dimensional components and polarity.
Figure A.3 Mood state score development process.
Figure A.4 Example web search for mood state synonyms.
Figure A.5 Text and target metric mapping.
Appendix B
Figure B.1 Implementation scenario.
Figure B.2 Component process flow.
Figure B.3 Acoustic analytic record construction.
Figure B.4 Audio signal codification optimizer.
Figure B.5 Textual latent value extractor.
Figure B.6 Textual latent value extractor (detail).
The Wiley and SAS Business Series presents books that help senior level managers with their critical management decisions.
Titles in the Wiley and SAS Business Series include:
The Analytic Hospitality Executive: Implementing Data Analytics in Hotels and Casinos
by Kelly A. McGuire
Analytics: The Agile Way
by Phil Simon
The Analytics Lifecycle Toolkit: A Practical Guide for an Effective Analytics Capability
by Gregory S. Nelson
Anti-Money Laundering Transaction Monitoring Systems Implementation: Finding Anomalies
by Derek Chau and Maarten van Dijck Nemcsik
Artificial Intelligence for Marketing: Practical Applications
by Jim Sterne
Business Analytics for Managers: Taking Business Intelligence Beyond Reporting (Second Edition)
by Gert H. N. Laursen and Jesper Thorlund
Business Forecasting: The Emerging Role of Artificial Intelligence and Machine Learning
by Michael Gilliland, Len Tashman, and Udo Sglavo
The Cloud-Based Demand-Driven Supply Chain
by Vinit Sharma
Consumption-Based Forecasting and Planning: Predicting Changing Demand Patterns in the New Digital Economy
by Charles W. Chase
Credit Risk Analytics: Measurement Techniques, Applications, and Examples in SAS
by Bart Baesens, Daniel Roesch, and Harald Scheule
Demand-Driven Inventory Optimization and Replenishment: Creating a More Efficient Supply Chain (Second Edition)
by Robert A. Davis
Economic Modeling in the Post Great Recession Era: Incomplete Data, Imperfect Markets
by John Silvia, Azhar Iqbal, and Sarah Watt House
Enhance Oil & Gas Exploration with Data-Driven Geophysical and Petrophysical Models
by Keith Holdaway and Duncan Irving
Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection
by Bart Baesens, Veronique Van Vlasselaer, and Wouter Verbeke
Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards (Second Edition)
by Naeem Siddiqi
JMP Connections: The Art of Utilizing Connections in Your Data
by John Wubbel
Leaders and Innovators: How Data-Driven Organizations Are Winning with Analytics
by Tho H. Nguyen
On-Camera Coach: Tools and Techniques for Business Professionals in a Video-Driven World
by Karin Reed
Next Generation Demand Management: People, Process, Analytics, and Technology
by Charles W. Chase
A Practical Guide to Analytics for Governments: Using Big Data for Good
by Marie Lowman
Profit from Your Forecasting Software: A Best Practice Guide for Sales Forecasters
by Paul Goodwin
Project Finance for Business Development
by John E. Triantis
Smart Cities, Smart Future: Showcasing Tomorrow
by Mike Barlow and Cornelia Levy-Bencheton
Statistical Thinking: Improving Business Performance (Third Edition)
by Roger W. Hoerl and Ronald D. Snee
Strategies in Biomedical Data Science: Driving Force for Innovation
by Jay Etchings
Style and Statistics: The Art of Retail Analytics
by Brittany Bullard
Text as Data: Computational Methods of Understanding Written Expression Using SAS
by Barry deVille and Gurpreet Singh Bawa
Transforming Healthcare Analytics: The Quest for Healthy Intelligence
by Michael N. Lewis and Tho H. Nguyen
Visual Six Sigma: Making Data Analysis Lean (Second Edition)
by Ian Cox, Marie A. Gaudard, and Mia L. Stephens
Warranty Fraud Management: Reducing Fraud and Other Excess Costs in Warranty and Service Operations
by Matti Kurvinen, Ilkka Töyrylä, and D. N. Prabhakar Murthy
For more information on any of the above titles, please visit www.wiley.com.
By Barry deVille and Gurpreet Singh Bawa
Copyright © 2022 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our website at www.wiley.com.
Library of Congress Cataloging-in-Publication Data is Available:
9781119487128 (hardback)
9781119487173 (ePDF)
9781119487159 (ePub)
Cover Design: Wiley
To all those who unconditionally love and support authors and their writing processes – especially our life partners, Maya McNeilly and Dilpreet Kaur, who go above and beyond.
This book provides an end-to-end description of the text analytics process with examples drawn from a range of case studies using various capabilities of SAS text analytics and the associated SAS computing environment. Qualitative and quantitative approaches within the SAS environment are covered across the entire text analytics life cycle, from document capture, document characterization, and document understanding through to operational deployment.
We cover procedure-based, engineering approaches to text analytics, as well as more discovery-based quantitative approaches. Since much of the text analytics process depends on the text capture and text preprocessing environment, these aspects of text analytics are covered as well.
This work was initiated and promoted by Julie Palmieri, serving as editor-in-chief of SAS Press. James Allen Cox has consistently offered advice and review throughout and gave a detailed review of early versions of the draft. Tom Sabo gave advice and review and made significant contributions to the chapter on Boolean rules. Our colleagues Saratendu Sethi, Terry Woodfield, and Sanford Gayle have provided decades of advice on text analytics in general. Elisha Benjamin of John Wiley & Sons was a great source of advice and assistance throughout the project. Wiley executive editor Sheck Cho is the consummate professional and both a rock and a beacon for us aspiring authors.
As authors, we acknowledge their invaluable advice, assistance, encouragement, and also humbly acknowledge that any remaining faults are ours alone.
Barry deVille is a practitioner, developer, and author in the fields of statistics, data science, and text analytics. During a decades-long career at SAS, he collaborated extensively with the text analytic R&D development team, deploying text mining solutions to a variety of global clients in various industrial, financial, health, and social media applications. This work resulted in the award of numerous US patents on decision tree induction algorithms and multidimensional text analytics. Prior to joining SAS, he worked with the National Research Council and other government and commercial entities in Canada in the development and commercialization of statistical and machine learning algorithms.
Gurpreet Singh Bawa has practiced internationally in the areas of statistics with an emphasis on artificial intelligence (AI) and machine learning (ML). He was awarded a PhD at Panjab University, Chandigarh, India, in the fields of AI and ML. He has authored numerous publications in national and international journals. His research in the area of unstructured data analysis has led to numerous patent applications and awards (including one with co-author deVille on social community identification and automatic document classification). He also works in breakeven analysis and portfolio optimization. He is currently authoring a book on advanced mathematics.
Text analytics is a collection of computer methods that use semantic and numerical processing to convert collections of text into identified components that carry meaning and function and can be manipulated quantitatively. Meaning assignment is a semantic process that leads to greater understanding of the text. Numerical manipulation leads to a range of data summarization approaches that typically reduce complexity, capture multiple relationships, and highlight tendencies. Text analytics incorporates semantic and numerical text processing in a synergistic process that leads to greater understanding of various collections of text.
In this treatment we also touch on speech applications so we can see how spoken words, like written words, can be transformed into representations that can be manipulated and summarized quantitatively.
Chapter 1 expands our definition of text analytics and provides some background on the development of written language and systems of writing that are used to capture and communicate meaning.
Chapter 2 provides an overview of the end-to-end process of text analytics. A generic template is described that can enhance our understanding of the various aspects of text analytics and that can also serve as an organizing framework for discussing text analytics. These processes are further described in Chapter 3.
Linguistic processing and associated forms of document characterization are discussed in Chapter 4. Linguistic processing is the front-end text analytics intake process to read and parse the incoming text stream to identify useful and interesting textual components such as parts of speech, phrases, expressions, and special terms.
Chapter 5 shows how numerical approaches to data, including the production of dimensional summaries and data reduction approaches, can be productively applied to creating meaningful textual summaries and dimensional products, like text topics, that help us understand the content of text collections.
In Chapter 6 we provide examples of how quantitative text products can be used for classification and prediction tasks. A real-world industrial use case is discussed.
Chapter 7 discusses the architecture within SAS that unifies linguistic and quantitative processing and so blends the strengths of these two approaches. We show how Boolean rules are constructed, how these are derived from quantitative operations, and how they serve a linguistic purpose.
Chapter 8 provides a case study in speech processing and shows how audio signals can be analyzed and manipulated much like text products to create analytical reports.
There is also a glossary of specialized terms and three appendices. Appendix A expands on the discussion of text characterization and provides an example of how mood state extracted from text can be used in text analytics. Appendix B provides a discussion and architectural approach to using audio processing to infer end user persona characteristics in the construction of artificial intelligence computer-user interaction interfaces. Appendix C provides an annotated summary description of critical patents that have been assigned to SAS. A range of important patents are covered, including an initial patent awarded to extract dimensional products from text and some of the more recent patents that address the unified approach to linguistic and numerical processing.
This chapter describes some of the background and recent history of text analytics and provides real-world examples of how text analytics works and solves business problems. This treatment provides examples of common forms of text analytics and examples of solution approaches. The discussion ranges from a history of the analytical treatment of text expression up to the most recent developments and applications.
The analysis of written and spoken expression has been developing as a computer application over several decades. Some of the earliest research in machine learning and artificial intelligence dealt with the problem of reading and interpreting text as well as with text translation (machine translation). These early activities gave rise to a field of computer science known as natural language processing (NLP). The recent rapid development of computer power – including processing power, large data, high bandwidth communication, and cloud-based, high-capacity computer memory – has provided a major new (and considerably broadened) emphasis on computerized text processing and text analysis.
Text processing and text analysis are components of the developing area of understanding written and spoken expression. Commonly occurring text documents – such as traditional newspapers, journals and periodicals, and, more recently, electronic documents, such as social media posts and emails – are forms of written expression. This active, multilayered area in current computer applications joins well-established, traditional fields such as linguistics and literary analysis to form the outline of the emerging field we call text analytics.
Current approaches to text analytics operate in two reinforcing directions that incorporate traditional forms of linguistic and literary analysis with a wide range of statistical, artificial intelligence (AI), and cognitive computing techniques to effectively process written and spoken expressions. The decoded expressions are used to drive a wide range of computer-mediated inference tasks that includes artificial intelligence, cognitive computing, and statistical inference. An everyday example is when we speak or type in a destination in order to receive an optimal driving route. Similarly, a call center agent might decipher multiple forms of common requests in order to construct the most effective solution approach.
Our treatment throughout the chapters to come includes examples of common forms of text analytics and examples of solution approaches. The discussion ranges from a history of the analytical treatment of text expression up to the most recent developments and applications. Since speech is quickly becoming an important form of unstructured data, a final chapter takes up the topic of rendering speech to text.
Computer science and AI emerged as formal disciplines in the aftermath of World War II. An early application of computers to the analysis of written expression, natural language processing, took a universal approach, designed to apply regardless of what language the text was written in – English, Spanish, or Chinese. The techniques that have been developed also apply regardless of the source of the text to be analyzed. With the widespread availability of speech-to-text engines, it is also possible to consider a wide variety of spoken documents as potential sources for text analytics.
An important goal of NLP is to decompose text constructs (sentences, paragraphs, articles, chapters) into various kinds of entities, verbs, semantic constructs (like articles and conjunctions), and so on. The sentence “See Spot run” may be processed and encoded into an NLP representation as: declarative sentence (intransitive); Spot – Subject (Animal/Dog); run – Verb (motion).
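The "See Spot run" encoding above can be sketched in a few lines of Python. This is only a minimal illustration, not the representation used by SAS or any particular NLP engine: the `LEXICON` table, the `Token` class, and the `encode` function are hypothetical names, and the entry for "see" is our own assumption, since the example above labels only "Spot" and "run".

```python
from dataclasses import dataclass
from typing import List

# Hypothetical hand-built lexicon mapping a word to (role, category).
# The entries for "spot" and "run" follow the example in the text;
# the entry for "see" is an assumption added for illustration.
LEXICON = {
    "see": ("Verb", "perception"),
    "spot": ("Subject", "Animal/Dog"),
    "run": ("Verb", "motion"),
}


@dataclass
class Token:
    text: str      # the word as written
    role: str      # grammatical role assigned by the lexicon
    category: str  # coarse semantic category


def encode(sentence: str) -> List[Token]:
    """Split a sentence on whitespace and label each word from the lexicon."""
    tokens = []
    for word in sentence.split():
        clean = word.strip(".,!?")
        role, category = LEXICON.get(clean.lower(), ("Unknown", "Unknown"))
        tokens.append(Token(clean, role, category))
    return tokens


parsed = encode("See Spot run")
for tok in parsed:
    print(f"{tok.text} - {tok.role} ({tok.category})")
```

A production parser would assign these labels with trained models and grammar rules rather than a lookup table, but the output has the same shape: each unit of text becomes a record that can be counted, filtered, and compared.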
Historically, NLP relied on various linguistic analysis capabilities, including extensive logical processing and reasoning capabilities. As computing capabilities have expanded, NLP has increasingly relied on a range of computational approaches to enhance the range of NLP results. An emerging area of NLP includes statistical natural language processing (SNLP). This form of NLP can be used to craft high-level representations of textual documents so that relationships between and among the documents can be computed statistically. The statistical capability also improves the accuracy of the NLP processing itself.
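To make the idea of statistically computed relationships between documents concrete, here is a small sketch that represents each document as a vector of term counts and compares documents with cosine similarity. The three documents are invented for illustration, and the function names `term_counts` and `cosine` are our own; cosine similarity is one common choice of measure, not necessarily the one used in the SAS procedures discussed later.

```python
import math
from collections import Counter


def term_counts(text: str) -> Counter:
    """Represent a document as a bag of lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical mini-collection, invented for this sketch.
docs = {
    "d1": "the train left the station",
    "d2": "the train arrived at the station",
    "d3": "the boat left the harbor",
}
vectors = {name: term_counts(text) for name, text in docs.items()}

sim_12 = cosine(vectors["d1"], vectors["d2"])
sim_13 = cosine(vectors["d1"], vectors["d3"])
print(f"d1 vs d2: {sim_12:.3f}")
print(f"d1 vs d3: {sim_13:.3f}")
```

In this toy collection the two train documents score as more similar to each other than the train and boat documents do, which is the kind of relationship a statistical representation is meant to surface. Raw counts give heavy weight to common words such as "the", which is one reason practical systems apply term weighting before comparing documents.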
One recent area of written language processing includes statistical document analysis (SDA). Like SNLP, SDA enables us to show the statistical relationships between and among the various components of a textual document. Further, it enables us to summarize the document using multivariate statistical techniques like cluster analysis and latent class analysis. Predictive analytics such as regression analysis, decision trees, and neural networks can also be used.
As computer processing and storage have continued to grow, so too have a variety of deep learning applications. One such application is Bidirectional Encoder Representations from Transformers (BERT), a deep learning model that came out of research at Google AI Language.[i]
BERT can be leveraged for tasks such as categorization, entity extraction, and natural language generation. Deep learning approaches require significant computing power and training. As the area of text analytics continues to unfold, we will likely see how deep learning approaches complement the capabilities offered in traditional text analytics, which are less computationally intensive and more than adequate for a wide range of tasks.
The fields of text mining and text analytics are recent applied areas of SDA used in a variety of general-purpose social and economic settings. Text mining often refers to the construction of statistical or numerical models or predictions. Common sources of data include customer service logs and emails, as well as customer use records for warranty issue analysis and defect detection. Text analytics often refers to semantically based applications – for example, customer analytics (who talks to whom and what do they say?), competitive analysis (brand metrics, mentions), and content management (the creation of taxonomies, web page characterization).
Language is a form of communication, and text is a written form of language. Text comes in a variety of symbolic forms. In addition to the alphabetic representation we see capturing the written expression in this text, there are other encoding systems such as syllabaries that capture spoken syllables and logograms that capture pictographic representations. Linguistics distinguishes between phonograms – which capture parts of words like syllables in written expression – and logograms – which capture entire concepts.
Figure 1.1 Traffic sign in Cherokee syllabary, Tahlequah, Oklahoma.
Source: Shot November 11, 2007. By Uyvsdi. License: Public Domain.
Figure 1.1 shows an example of a pictographic representation – the STOP sign itself – an alphabetic representation (in Latin script) that spells the word “STOP” and a syllabary – in this case, one used to record the Cherokee language.
One of the earliest true writing systems, dating to the third millennium BCE,
