Discover how to achieve business goals by relying on high-quality, robust data
In Data Quality: Empowering Businesses with Analytics and AI, veteran data and analytics professional Prashanth H. Southekal delivers a practical, hands-on discussion of how to accelerate business results using high-quality data. In the book, you’ll learn techniques to define and assess data quality, discover how to ensure that your firm’s data collection practices avoid common pitfalls and deficiencies, improve the level of data quality in the business, and guarantee that the resulting data is useful for powering high-level analytics and AI applications.
The author shows you how to:
An essential resource for data scientists, data analysts, business intelligence professionals, chief technology and data officers, and anyone else with a stake in collecting and using high-quality data, Data Quality: Empowering Businesses with Analytics and AI will also earn a place on the bookshelves of business leaders interested in learning more about what sets robust data apart from the rest.
Page count: 429
Year of publication: 2023
Cover
Title Page
Copyright
Foreword
Preface
ABOUT THE BOOK
QUALITY PRINCIPLES APPLIED IN THIS BOOK
ORGANIZATION OF THE BOOK
WHO SHOULD READ THIS BOOK?
REFERENCES
Acknowledgments
PART I: Define Phase
CHAPTER 1: Introduction
INTRODUCTION
DATA, ANALYTICS, AI, AND BUSINESS PERFORMANCE
DATA AS A BUSINESS ASSET OR LIABILITY
DATA GOVERNANCE, DATA MANAGEMENT, AND DATA QUALITY
LEADERSHIP COMMITMENT TO DATA QUALITY
KEY TAKEAWAYS
CONCLUSION
REFERENCES
CHAPTER 2: Business Data
INTRODUCTION
DATA IN BUSINESS
TELEMETRY DATA
PURPOSE OF DATA IN BUSINESS
BUSINESS DATA VIEWS
KEY CHARACTERISTICS OF BUSINESS DATA
CRITICAL DATA ELEMENTS (CDEs)
KEY TAKEAWAYS
CONCLUSION
REFERENCES
CHAPTER 3: Data Quality in Business
INTRODUCTION
DATA QUALITY DIMENSIONS
CONTEXT IN DATA QUALITY
CONSEQUENCES AND COSTS OF POOR DATA QUALITY
DATA DEPRECIATION AND ITS FACTORS
DATA IN IT SYSTEMS
DATA QUALITY AND TRUSTED INFORMATION
KEY TAKEAWAYS
CONCLUSION
REFERENCES
PART II: Analyze Phase
CHAPTER 4: Causes for Poor Data Quality
INTRODUCTION
DATA QUALITY RCA TECHNIQUES
TYPICAL CAUSES OF POOR DATA QUALITY
KEY TAKEAWAYS
CONCLUSION
REFERENCES
CHAPTER 5: Data Lifecycle and Lineage
INTRODUCTION
BUSINESS-ENABLED DLC STAGES
IT BUSINESS-ENABLED DLC STAGES
DATA LINEAGE
KEY TAKEAWAYS
CONCLUSION
REFERENCES
CHAPTER 6: Profiling for Data Quality
INTRODUCTION
CRITERIA FOR DATA PROFILING
DATA PROFILING TECHNIQUES FOR MEASURES OF CENTRALITY
DATA PROFILING TECHNIQUES FOR MEASURES OF VARIATION
INTEGRATING CENTRALITY AND VARIATION KPIs
KEY TAKEAWAYS
CONCLUSION
REFERENCES
PART III: Realize Phase
CHAPTER 7: Reference Architecture for Data Quality
INTRODUCTION
OPTIONS TO REMEDIATE DATA QUALITY
DataOps
DATA PRODUCT
DATA FABRIC AND DATA MESH
DATA ENRICHMENT
KEY TAKEAWAYS
CONCLUSION
REFERENCES
CHAPTER 8: Best Practices to Realize Data Quality
INTRODUCTION
OVERVIEW OF BEST PRACTICES
BP 1: IDENTIFY THE BUSINESS KPIs AND THE OWNERSHIP OF THESE KPIs AND THE PERTINENT DATA
BP 2: BUILD AND IMPROVE THE DATA CULTURE AND LITERACY IN THE ORGANIZATION
BP 3: DEFINE THE CURRENT AND DESIRED STATE OF DATA QUALITY
BP 4: FOLLOW THE MINIMALISTIC APPROACH TO DATA CAPTURE
BP 5: SELECT AND DEFINE THE DATA ATTRIBUTES FOR DATA QUALITY
BP 6: CAPTURE AND MANAGE CRITICAL DATA WITH DATA STANDARDS IN MDM SYSTEMS
KEY TAKEAWAYS
CONCLUSION
REFERENCES
CHAPTER 9: Best Practices to Realize Data Quality
INTRODUCTION
BP 7: RATIONALIZE AND AUTOMATE THE INTEGRATION OF CRITICAL DATA ELEMENTS
BP 8: DEFINE THE SoR AND SECURELY CAPTURE TRANSACTIONAL DATA IN THE SoR/OLTP SYSTEM
BP 9: BUILD AND MANAGE ROBUST DATA INTEGRATION CAPABILITIES
BP 10: DISTRIBUTE DATA SOURCING AND INSIGHT CONSUMPTION
KEY TAKEAWAYS
CONCLUSION
REFERENCES
PART IV: Sustain Phase
CHAPTER 10: Data Governance
INTRODUCTION
DATA GOVERNANCE PRINCIPLES
DATA GOVERNANCE DESIGN COMPONENTS
IMPLEMENTING THE DATA GOVERNANCE PROGRAM
DATA OBSERVABILITY
DATA COMPLIANCE – ISO 27001, SOC1, AND SOC2
KEY TAKEAWAYS
CONCLUSION
REFERENCES
CHAPTER 11: Protecting Data
INTRODUCTION
DATA CLASSIFICATION
DATA SAFETY
DATA SECURITY
KEY TAKEAWAYS
CONCLUSION
REFERENCES
CHAPTER 12: Data Ethics
INTRODUCTION
DATA ETHICS
IMPORTANCE OF DATA ETHICS
PRINCIPLES OF DATA ETHICS
MODEL DRIFT IN DATA ETHICS
DATA PRIVACY
MANAGING DATA ETHICALLY
KEY TAKEAWAYS
CONCLUSION
REFERENCES
Appendix 1: Abbreviations and Acronyms
Appendix 2: Glossary
Appendix 3: Data Literacy Competencies
About the Author
Index
End User License Agreement
Chapter 10
TABLE 10.1 The Impact of Improving the Algorithm versus the Data on Model P...
Preface
FIGURE P.1 Book Organization
Chapter 1
FIGURE 1.1 Data Management, Data Governance, and Data Quality
FIGURE 1.2 2019 Global Data Transformation Survey of McKinsey
Chapter 2
FIGURE 2.1 Types of Business Data
FIGURE 2.2 Relationships among the Four Types of Data
Chapter 3
FIGURE 3.1 Key Factors to Derive Business Value from Data
FIGURE 3.2 Payment Terms in SAP Vendor Master
FIGURE 3.3 Accuracy and Precision
FIGURE 3.4 Data Quality Dimensions
FIGURE 3.5 Data Quality 1-10-100 Rule
Chapter 4
FIGURE 4.1 Affinity Diagram
FIGURE 4.2 FMEA Diagram
FIGURE 4.3 Fishbone Diagram
FIGURE 4.4 5-Whys Technique
FIGURE 4.5 RCA Techniques
Chapter 5
FIGURE 5.1 DLC Activities
Chapter 6
FIGURE 6.1 Data Quality Issues
FIGURE 6.2 Centrality Measures
FIGURE 6.3 SD versus SE
FIGURE 6.4 IQR Rule
FIGURE 6.5 Data Profile Sample
FIGURE 6.6 Four Process States
Chapter 7
FIGURE 7.1 DataOps
FIGURE 7.2 Data Fabric Features
FIGURE 7.3 Data Mesh Features
FIGURE 7.4 Feature Engineering
FIGURE 7.5 Difference between Union and Join
Chapter 8
FIGURE 8.1 Data Quality in DLC
FIGURE 8.2 Data Literacy Competencies
FIGURE 8.3 Targets, Tolerances, Control Limits, and Specifications
FIGURE 8.4 Value Stream Mapping to Data Elements
FIGURE 8.5 Data Catalog and Semantic Layer in DLC
Chapter 9
FIGURE 9.1 The Registry Style
FIGURE 9.2 The Consolidation Style
FIGURE 9.3 The Coexistence Style
FIGURE 9.4 The Centralized Style
FIGURE 9.5 Key Factors in the Selection of MDM Architectures
FIGURE 9.6 Master Data Architecture Style Maturity Continuum
FIGURE 9.7 Push and Pull Data Integration
FIGURE 9.8 ETL Process
FIGURE 9.9 ESB Process
FIGURE 9.10 Data Quality Management in the DLC
FIGURE 9.11 MAD Framework
Chapter 10
FIGURE 10.1 Barriers to Achieving Data Governance Objectives
FIGURE 10.2 Policy, Process, and Procedure Hierarchy
FIGURE 10.3 Data Governance on Customer Master
Chapter 11
FIGURE 11.1 Classification for Data Protection
Chapter 12
FIGURE 12.1 Components of Data Ethics
“The promise of the ‘data economy’ (i.e., data is the new oil), combined with the naive belief that AI can turn a company's data into gold, is leading many enterprises to experiment with adopting AI. But early adopters are learning that one of the primary causes of AI failure is poor-quality data being fed into AI models (garbage in, garbage out). This highlights that raw data is insufficient and that it must be refined to create value. Consequently, implementing data-centric AI is rapidly becoming a best practice at both start-ups and established technology companies. This book provides a pragmatic approach for enterprises to acquire and manage good-quality data based on proven best practices. If you are a C-level executive or an AI practitioner seeking to deploy AI at scale to drive value creation, I highly recommend this book.”
—Anik Bose
BGV managing general partner and founder, Ethical AI Governance Group (EAIGG) (United States)
“Prashanth Southekal has nailed the ‘why’ and the ‘how’ about data quality's relation to analytics in this highly readable book. To be an economically viable company in today's transparent, global, and competitive world, business leaders must champion the data quality and analytics journey and embed them in decision support systems as an operational core competency. The companies that advocate data quality and analytics integrate them in their DNA to outsmart their competitors in strategic and tactical decision making that yields sustainable success.”
—Gary Cokins
President, Analytics-Based Performance Management LLC;
co-author, Predictive Business Analytics (United States)
“Good data is a source of myriad opportunities, while bad data is a tremendous burden. Data is now exposed at a much more strategic level (e.g., through business intelligence systems), increasing manifold the stakes involved for individuals and corporations, as well as government agencies. There, the lack of knowledge about data accuracy, currency, or completeness can have erroneous and even catastrophic results. In this book, Dr. Southekal provides very detailed and thorough coverage of all aspects of data quality management and best practices to improve data quality that would suit all ranges of expertise from beginner to advanced practitioner.”
—Michael Taylor
AI chief data scientist, Siemens Mobility (Singapore)
“Dr. Southekal has a way of distilling complexity into practical applications for global leaders who span multiple industries and market types. He leverages his relationships to gain more perspective on data management. His book, Data Quality: Empowering Businesses with Analytics and AI, builds on his prior successes and hits the mark once again. ‘Improving data quality should be a top priority for all business leaders.’”
—Victor Ojeleye
Planning & Reporting Manager, FP&A, Cargill Protein North America (United States)
“Data quality is a fundamental building block of success in today's digitally agile world. It creates disruption in long-established industries, but also allows traditional companies to innovate and drive more efficient and effective decision-making practices. Data quality utilization will grow revenues and reduce risk through better-connected intelligence. Dr. Southekal has created a book that explains the why, what, and how of data quality. Written in a structured, logical approach that allows all industries and leaders to fully understand the importance of getting their data quality correct for true value generation, Data Quality: Empowering Businesses with Analytics and AI is a must-read.”
—Matthew Small
Managing director, Data Value Creation Ltd (United Kingdom)
“Prashanth provides an in-depth and scientific perspective on a very critical topic for businesses and organizations today. He addresses the complete lifecycle related to data quality, provides detailed explanation in each chapter, and starts from what data quality is to how to capture DQ issues proactively, govern the data, comply with regulations, and secure and sustain DQ practices. There are key callouts depicted in the insight boxes within each chapter. With the future trends indicating the shift toward right data from big data, data quality is a concept that needs to be ingrained in a company's business fabric. The book is a very practical guide to data quality that will be part of my toolkit.”
—Ramdas Narayanan
Vice president, Bank of America (United States)
“Data-driven organizations understand that useful data is not simply found and organized by itself. In this book, his third, Dr. Prashanth Southekal shows business leaders the foundations needed to create a company that wants data- and analytics-led decisions to be part of their strategy. For data leaders and practitioners, this book will not only guide you but will also trigger new thoughts and ideas.”
—Mark Stern
Vice president of Analytics and BI, BetMGM (United States)
“Data is growing and almost every company is a data company. The majority of organizations want to be in the data-driven space, utilizing and monetizing data through advanced analytics and AI. Although the thought process is great, when it comes to practical implementations most companies are struggling to get value out of their investment. In the consulting space we are seeing repeatedly the need for getting the basics and foundation right. Dr. Southekal's book Data Quality: Empowering Businesses with Analytics and AI is empowering business and data leaders and giving practical guidance on how to build good-quality data to get the most value from analytics and AI projects.”
—Rathi Subbaraj
Senior manager, Dufrain (United Kingdom)
“The most thorough and comprehensive book I've seen on data quality. It covers the entire lifecycle of data management in the current enterprise, AI, and analytics landscape. The book contains a wealth of valuable strategic and tactical elements, as well as best practices for getting the most value from data for the business. A must-read for anyone looking to leverage the value of enterprise data.”
—Tobias Zwingmann
Managing partner, RAPYD.AI (Germany)
“In today's world, where almost every company is dealing with petabyte-scale data, data quality is something that should be ingrained in all phases of the data lifecycle. In this book Dr. Southekal takes you on a journey of data quality and its lifecycle. It provides an in-depth perspective and the right approach to manage DQ. This book provides a detailed explanation of the DARS approach, the DQ lifecycle and its different phases, multiple dimensions of DQ, data decay, best practices, and a lot more. Dr. Southekal hits the mark again, and this book should be part of the toolkit for all levels of DQ and data practitioners.”
—Ujjwal Goel
Director, Data Architecture & Data Engineering, Loblaw (Canada)
“The economy of data has been a trending subject for some time now. But the poor quality of the data affects the decision-making ability and the performance results. Most of the publications in this space refer to the physical flaws in data quality, like data downtime, whereas the author extends the definition to the logical flaws in data, which are much harder to spot and resolve. Dr. Southekal created a playbook for delivering business value from data, with prescriptive recommendations based on best practices for data governance and management practices, all based on the proprietary evaluation framework for data quality.”
—Inna Tokarev Sela
CEO, Illumex AI (Israel)
“Dr. Southekal has done it again: given the data science community a gem of a framework (DARS: Design-Assess-Realize-Sustain) that they can apply to maximize ROI from their data and analytics initiatives. Data Quality: Empowering Businesses with Analytics and AI does a phenomenal job explaining nuanced concepts in a language that can be very easily understood by both technical and business audiences.”
—Swapnil Srivastava
VP and Global Head of Analytics, Evalueserve (United States)
“Like his previous two books, Data Quality: Empowering Businesses with Analytics and AI is yet another great read for enterprise data leaders. In this book, Dr. Southekal first sets up a framework to understand and measure the quality of business data (the Define and Assess phases); he then provides a guidebook to implement data quality programs (the Realize and Sustain phases). In today's AI-driven world, this book will help business leaders build a solid data foundation.”
—Li Kang
Head of Strategy, CelerData (United States)
“With accelerating change, decision-cycle times are narrowing, placing increased pressure on organizations to make faster and effective decisions to drive the biggest impact. In this environment, data quality issues can amplify the impact and costs of incorrect decisions. In this book, Dr. Southekal provides a comprehensive approach, practical frameworks and best practices to defining and addressing data quality issues. This is a must read, full of important and practical information, for all data professionals.”
—Sanjeev Chib, CPA, CA
VP (Product) and managing director, Data Solutions, Moneris (Canada)
“Southekal's Data Quality goes well beyond delivering a thoughtful, useful, and usable text on the virtues and value of quality data; it offers accessible and actionable insights into how serious organizations can get measurable value from their data investments. His ‘Define, Assess, Realize, and Sustain’ framework offers both a guide and a roadmap to making ‘data’ the asset it can and should be for the digital and digitizing enterprise. I am impressed by its clarity. It's comprehensive without being overwhelming. Check it out.”
—Michael Schrage
Research fellow, MIT Sloan School Initiative on the Digital Economy; Author, Recommendation Engines and The Innovator's Hypothesis, MIT Press (United States)
“There are often misnomers when it comes to understanding how valuable data can be to driving an organization forward. Unfortunately, in some cases, data is an afterthought, only because the people managing it either don't invest or they don't know how to go about deploying a great data program. Dr. Southekal's latest book spells it out for you in a way that is simple to understand for all business users looking to improve their data products. If you are a product owner/manager looking to improve your data product, then I recommend adding this to your knowledge bank. I certainly will.”
—Diane Robin
Senior technical product owner, Data & Analytics, Talentnet (Canada)
“If you want to know about empowering business with analytics and AI, this is the book for you – the transformation of business through 3Ds (enabling data-driven decisioning). The aspects highlighted in the book could be familiar (‘been there, done that’), even anecdotal at times; however, this book helps to highlight and join the dots in the successful integration and management of good-quality data within an enterprise using the DARS framework. I am recommending this book because there is no other in the market that captures or attempts to transform business by purposefully empowering it with its data lifecycle, lineage, security, profile, architecture, and governance, thereby helping a business to leapfrog into the next phase of its evolution and resilience. To quote Prashanth, ‘Data is a business asset only when it is consciously captured and deliberately managed such that quality data is available to run and sustain the business.’”
—Tarun Jacob George
CEO, Tata Insights and Quants (India)
“In a ‘data is the new oil’ economy, Dr. Southekal illustrates the requirements, pitfalls, and payoffs of ‘refining’ data for today's business leaders. In the race to create and capture value, many organizations suffer from the ‘capture it all, figure it out later’ mindset. For leaders seeking to develop a sustainable competitive advantage, the Define-Assess-Realize-Sustain process is essential to maximize value and, more importantly, avoid the many potential perils of acting on insights derived from improper data management.”
—Mike Stratta
CEO, Arcalea (United States)
“This book by Dr. Prashanth Southekal is a great take on data quality. The book presents aspects of data quality in very easy-to-understand concepts and language. It talks about various frameworks and practical tools that can be adopted to implement and improve data quality in an organization. The concepts are well supported by various statistics that give a very practical and analytical view of data quality. Thank you, Prashanth, for making data quality so easy to understand.”
—Arihant Garg
Partner, KPMG (India)
“The ability of enterprises to become data driven, achieve data monetization, drive digital transformation, and embrace the power of AI hinges on one thing and one thing only: data quality. For companies to deliver and achieve value from their data assets, data quality coupled with data management and governance is key. As a data practitioner for well over two decades, I have seen how data quality has played a role in achieving and sustaining long-term success from data. This book provides a practical, simple, and insightful guide to manage data quality across its lifecycle (why to how). It is a must-have book for C-level executives, business leaders, and AI practitioners embarking on the journey to deliver value from data.”
—Santosh Raju
Global head Industry & Horizontal Solutions, Microsoft Practice, HCL Tech (United Kingdom)
“A very well-constructed book written from a practical perspective that demystifies business-driven data quality, taking a much-needed broader approach focused on achieving business operational efficiencies and revenue optimization driven by analytics. It goes beyond the early industry focus on data profiling to encompass business process change and drive Master/Reference Data Systems of Record frameworks. Prashanth brings thought leadership in driving data quality as a top priority for organizations and a primary driver of a targeted business approach versus a component of an enterprise data governance or data management approach.”
—Peter Kapur
Head of Data Governance, Data Quality and MDM, Waste Management (United States)
“Would you like to arm yourself with the best source of information to improving the bottom line of your organization by leveraging data analytics? Look no further and take the opportunity to read the book Data Quality: Empowering Businesses with Analytics and AI by Dr. Prashanth Southekal. Dr. Southekal has a way of teaching the value of data to business owners that allows anyone to see how they can easily improve business performance. His expertise is easily adaptable to any industry.”
—Hadia Lugo
Director of Financial Planning & Analysis (FP&A), Duke Energy (United States)
“Data as a strategic asset to drive business performance is being increasingly recognized as a key tenet across many industries. In order to become a data-driven enterprise, quality of data is of utmost importance. I found Dr. Southekal's DARS (Define, Analyze, Realize, and Sustain) approach highly systematic and effective to define and implement a data quality program with both strategic and tactical considerations. His emphasis that data governance works best when implemented early in the data lifecycle is a great insight. Without good-quality data, any attempt to leverage AI/ML is just hype. I highly recommend this book as it provides an excellent, easy-to-practice framework and techniques for practitioners.”
—Dr. Venkatraman Balasubramanian, PhD, MBA
SVP and Global Head, Healthcare and Life Sciences, Orion Innovation (United States)
“There is no dearth of organizations talking about the need for data quality, but very few are actually able to make systemic and process-level changes to address and action the wide landscape of data in most modern enterprises. The recommendations put forth by Dr. Prashanth Southekal in this book, Data Quality: Empowering Businesses with Analytics and AI, provide a practical and incremental way of tackling the behemoth-sized problem.”
—Kamayini Kaul
VP, Global Head Information Insights and Analytics, CSL Behring
PRASHANTH H. SOUTHEKAL
Copyright © 2023 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data is Available:
ISBN 9781394165230 (Hardback)
ISBN 9781394165254 (ePDF)
ISBN 9781394165247 (ePub)
Cover Design: Wiley
Cover Image: © amiak/Shutterstock
Once they built a building in downtown San Francisco. It was right in the middle of downtown. They cleared the space for the building. They started driving steel beams into the ground. It was hard work drilling into the ground for the beams. At some point they said that the beams were in the ground “good enough.”
Then they built a multi-story building on top of the beams. They built very luxurious living quarters in the building. The building took its place among the skyscrapers in San Francisco. They started to sell the units at very expensive prices.
Then one day one of the tenants of the building dropped a marble on the floor and it did something odd; the marble rolled across the floor. The building was tilting. The builders had not placed the foundation on bedrock because the ground was so difficult to penetrate. And the steel girders they put in looked sturdy enough. But they weren't.
Slowly the building was tipping over. And at this point trying to go back and reposition the girders was not an option.
Someday – in the hopefully distant future – the skyscraper is going to fall over.
You don't want to be on Market Street or Chinatown when that day comes. Nothing good is going to happen then. And you certainly don't want to be in the building when it tips over.
The same phenomenon is happening today but in a different arena. Today we have a world of glitzy technologies – AI, ML, business intelligence, blockchain, and a whole host of other venues. These technologies are glitzy and have great appeal. But all of these new technologies have a fatal flaw. All of these new technologies depend on a solid foundation of quality data. Just like the San Francisco building that was not built on bedrock, these new and swank technologies will only work if they operate on reliable data.
At the end of the day, it is back to the old principle: GIGO – garbage in, garbage out. AI and ML simply do not work if they are trying to operate on faulty data.
The problem is that nobody wants to address the issues of data quality. Data quality just does not have the sizzle that the newer technologies have. And that is a tragic mistake, because the new and sexy technologies do not function well or at all if they don't have the proper and correct data to operate from.
The original pioneer of data quality, Larry English, would be proud to see this work from Dr. Prashanth Southekal. Many years ago, Larry sowed the seeds of the notion of data quality. Larry would be amazed to see how those seeds have grown into a lush, green, and verdant field.
One of the things I really like about this book is its completeness that any data, analytics, and AI practitioner would benefit from. Dr. Southekal has covered all the bases, including important data quality best practices in the areas of data management, and data governance – and it is a lot of work to cover all the bases. Some of the highlights of the book include:
Definitions—what they are and why they are important in business
Data lineage—a subject overlooked by many authors
The system of record—another important concept missed by most authors
The acknowledgment that the volume of data plays an important role in shaping what can be done
Data governance—what it is and how to do it
Protection and security—essential to any modern organization
Ethics—another subject missed by most authors
Ownership of data and stewardship
And this short list only scratches the surface of this book.
This book is essential reading for everyone who wants to build technology that relies on data. If you are going to be building massive structures, you need to know how to build solid foundations. Otherwise, you are sowing seeds of disaster.
—Bill Inmon, “Father of the Data Warehouse”
Denver, Colorado, United States
October, 2022
Every company today is a data company as data is redefining business models and enabling new revenue streams, reducing costs, and mitigating business risks. Today, data is often the primary product for nearly every business, and analytics and AI (artificial intelligence) form the core business model element in many companies. IDC predicts that by 2023 more than half of all GDP worldwide will be driven by products and services from digitally transformed enterprises. A McKinsey report says data-driven organizations provide EBITDA (earnings before interest, taxes, depreciation, and amortization) increases of up to 25% (Böringer et al. 2022), and a study conducted by Boston Consulting in 2022 found that the first 9 of the top 10 innovative companies in the world are data firms (Manly et al. 2022). Overall, data today is considered the key enabler of innovation and productivity in business.
To derive business results from data, quality data is essential. But most industries are plagued with poor data quality. A Harvard Business Review study found that just 3% of the data in a business enterprise meets quality standards (Nagle et al. 2017). Research analyst firm Gartner found that 27% of data in the world's top companies is flawed. To provide organizations a competitive advantage from data, this book, Data Quality: Empowering Businesses with Analytics and AI, provides readers with practical guidance and proven solutions to derive quality business data. While there are many books on data quality in the market, the book has three key elements that will make it unique in the marketplace:
The book is written for practitioners by a fellow practitioner. It is based on my data, analytics, and AI experience consulting for over 80 companies, including big brands such as GE, SAP, P&G, Apple, and Shell. In addition, the content has been reviewed by senior data and technology leaders from many leading organizations worldwide.
The book is relevant in today's context. Today, companies operate under stiff competition, expanded business networks, increasing regulatory compliance, and emerging technologies such as cloud computing, big data, machine learning (ML), artificial intelligence (AI), blockchain, IoT (Internet of Things), and more. This book caters to managing quality business data in the current AI and analytics landscape. Every effort has been made to ensure that the contents are well researched, the chapters are logically and coherently organized, the topics are relevant for today's context, and the book is written in a simple, clear, and precise manner.
The book is technology agnostic. Many data quality books available in the market are IT product–centric. This book looks at the technical concepts without any reference to proprietary vendor technologies. The primary objective of this book is to enable improved business performance from data. Any business leader who is keen to derive quality data can use this book, regardless of which IT and data products they utilize.
To ensure that the book is useful to the readers, it is written with four key principles in mind.
Data consumption.
This book is written to improve the chances of utilizing data for better business performance. Improved business performance from data can happen under three key circumstances: (1) when there is quality data, (2) when the focus is on the utilization or consumption of data, and (3) when the purpose of data is to improve and optimize the performance of the business in operations, compliance, and decision making. In short, in this book the focus will be on acquiring and managing quality data to improve operations, compliance, and decision-making capabilities in business.
Root cause analysis and continuous improvement.
Data quality management is not a one-time exercise. It is a continuous improvement initiative to identify and fix the root causes. This is important because if you are not solving the right issue, you will never be able to eliminate the real problem. Hence this book focuses on techniques to identify the root causes of data quality issues. In addition, the book discusses 16 common root causes that degrade data quality in business.
Best practices.
This book focuses on industry best practices to improve data quality. Specifically, it offers 10 prescriptive recommendations or best practices, including the required capabilities, to improve the quality of data in business. In addition, numerous insight nuggets, which are evidence from research and case studies, are provided throughout the book.
Relevance.
This book caters to managing quality data in the current business and AI and analytics landscape. AI can improve business performance with automation based on insights derived from analytics only if there is quality data. Essentially, there is no AI without data and no data without AI.
So, how can a business enterprise acquire and manage good-quality data? What is the methodology to acquire and manage quality data? Against this backdrop, the book looks at a four-phase DARS approach for companies to manage high-quality data. DARS, which stands for Define-Assess-Realize-Sustain, is a combination of strategic and tactical elements to deliver the greatest value to the business from data. It is a playbook that offers prescriptive recommendations based on proven best practices in data quality management and governance.
This book has four parts, which are mapped to the four phases of the DARS framework. The first phase, the define phase, clearly defines data quality, including the characteristics or dimensions of data quality. The objective of this phase is to bring readers to a common understanding of data and data quality. The second phase, the assess phase, determines the data quality levels. This phase also includes root cause analysis, where the root causes of data quality problems are identified. In the third phase, the realize phase, data quality is improved by following industry best practices across the entire data lifecycle. Finally, the data quality that is realized should be sustained so that the benefits continue over time. This is covered in the last phase, the sustain phase.
The process of remediating and improving data quality with the DARS framework is akin to improving a person's health. The first step is defining health, given that health could be physical, spiritual, mental, and so on. Once the specific health category is identified, say physical health, we need to define its characteristics or dimensions. In physical health, the dimensions could be strength, flexibility, endurance, and more. Once we have the physical health parameters and their baseline, the next step is to analyze or understand the problem by going into the root causes, given that problems are often stated in terms of symptoms, or what is seen. For example, one of the symptoms or effects of poor physical health is fatigue. This fatigue has to be analyzed and assessed to determine the root cause(s). A glycated hemoglobin (A1C) test might then indicate that the root cause of the fatigue is Type-2 diabetes. So the treatment is to fix the Type-2 diabetes and not simply to address the fatigue. The next logical step is remediation of the Type-2 diabetes that is causing the fatigue. This could be achieved using a combination of methods such as medication, lifestyle changes like healthy eating (with vegetables, fruits, and whole grains), meditation, and regular exercise. Once these remedial actions are in place, the person needs to put the right controls in place, including regular medical checkups, so that the measures taken are sustained.
Against this backdrop, this book, Data Quality: Empowering Businesses with Analytics and AI, has 12 chapters, written in a logical and sequential manner. The organization of the 12 chapters across the four DARS phases is shown in Figure P.1.
FIGURE P.1 Book Organization
The book will explain the core concepts of data quality management and governance and the methods to realize and sustain good-quality data for improved business performance. It will also provide organizations a step-by-step methodology to realize and sustain quality data. There are no prerequisites needed to read and apply the concepts in this book. It is intended for anyone who has a stake and interest in harnessing the value of business data – business and IT teams. The audience could be the chief financial officer (CFO), chief data officer (CDO), chief information officer (CIO), accountant, geologist, IT developer, procurement director, claims analyst, data scientist, sales manager, data governance analyst, underwriter, HR manager, credit manager, or any other business or IT role. In short, this book is for anyone who wants to achieve and sustain quality business data.
Böringer, J., Dierks, A., Huber, I., and Spillecke, D. (January 18, 2022). Insights to impact: Creating and sustaining data-driven commercial growth. McKinsey & Company. https://www.mckinsey.com/business-functions/growth-marketing-and-sales/our-insights/insights-to-impact-creating-and-sustaining-data-driven-commercial-growth.
Manly, J., et al. (December 2022). Are you ready for green growth? Most innovative companies 2022. Boston Consulting Group. https://www.bcg.com/en-ca/publications/2022/innovation-in-climate-and-sustainability-will-lead-to-green-growth.
Nagle, T., Redman, T., and Sammon, D. (September 2017). Only 3% of companies' data meets basic quality standards. Harvard Business Review. https://bit.ly/2UxaHO4.
Data Quality: Empowering Businesses with Analytics and AI reflects over two decades of my data, analytics and AI consulting, research, and teaching experience. Writing a book is harder than I thought and more rewarding than I could have ever imagined. I could only cross this finish line because of great teamwork. There are many people who have positively impacted this project. Writing this book was a unique learning and collaborative experience, and it has been one of my best “investments” to date. Throughout the project, I had the privilege of having discussions with top data and analytics researchers and industry experts who were instrumental in giving a better shape to this book.
First and foremost, I thank Bill Inmon, the “father of the data warehouse,” for writing the foreword for the book. Bill is an industry veteran and thought leader who is acutely aware of the importance of quality data for a business to thrive in the global marketplace. I have looked up to Bill and his work since my university days, and I am truly honored to have him write the book's foreword.
I'm indebted to the entire Wiley team, including Sheck Cho, Samantha Wu, and Susan Cerra for their editorial help, keen market insights, and support and coaching during the project. Special thanks to Michael Taylor, Tobias Zwingmann, Christophe Bourguignat, Sreenivas Gadhar, and Tony Almeida, for taking the time to review the book and giving valuable feedback. I am also extremely grateful to my consulting clients and my students at IE Business School (Madrid, Spain) for providing me opportunities to learn and understand the nuances of managing data, analytics, and AI initiatives. In addition, I thank the advisors of my firm DBP-Institute (DBP stands for Data for Business Performance), Gary Cokins, Suresh Chakravarthi, and Sana Gabula for offering the right guidance and support while writing this book.
Finally, writing this book required many hours away from family activities over the course of two years. My wife, Shruthi Belle, and my two wonderful kids, Pranathi and Prathik, understood how important this book is to me and to the data, AI, and analytics community, and gave me terrific support, motivation, and inspiration.
Prashanth H. Southekal, PhD, MBA
Calgary, Canada
October 2022
Today, intangible assets – which are not physical in nature and include things like data, brand, and intellectual property – have rapidly risen in importance compared to tangible assets such as land, machinery, inventories, and cash. In 2018, intangible assets in the S&P 500 hit a record value of $21 trillion and made up 84% of all enterprise value. This is a massive increase from just 17% in 1975 (Ali 2020). IDC predicts that by 2023 half of all GDP worldwide will be driven by products and services from digitally transformed enterprises (IDC 2019). Overall, as technology becomes more pervasive with 5G, artificial intelligence, robotics, the internet of things (IoT), quantum computing, analytics, blockchain, and more, organizations are looking at ways to develop, maximize, and protect the value of intangible assets, especially data, as all these digital technologies are underpinned by data.
Against this backdrop, data – an important intangible asset – is considered a critical business resource as it enables organizations to maximize productivity. Today, four of the top five companies in terms of market capitalization are data companies (Investopedia 2022). In 2019, Brian Porter, CEO of Scotiabank, Canada's leading bank, said, “We are in the data and technology business. Our product happens to be banking, but largely that is delivered through data and technology” (Berman 2016). AIG and Hamilton Insurance Group announced a joint venture firm, Attune, a data and technology platform that harnesses data and artificial intelligence (AI) capabilities to simplify business processes, trim the amount of time needed to get insurance, and reduce expenses. Oil field services company Schlumberger captures drilling telemetry data from simulators and sensors to improve drilling performance in oil wells. Moderna's COVID-19 vaccine success story is attributed to data and analytics (Asay 2021). To summarize, many enterprises across various industry sectors have demonstrated that data is a key enabler of improved business performance through enhanced revenues, reduced costs, and lowered risk.
Basically, the data economy – the ecosystem that enables use of data for business performance – is increasingly being embraced worldwide. Data has enabled firms such as Netflix, Facebook, Google, and Uber to acquire a distinct competitive advantage. According to Peter Norvig, Google's research director, “We don't have better algorithms than anyone else, we just have more data” (Cleland 2011). In 2021, the market capitalization of Google was more than the GDP of Mexico or Saudi Arabia. Fundamentally, companies that are data-driven demonstrate improved business performance. A report from MIT says that digitally mature firms are 26% more profitable than their peers (MIT 2013). McKinsey Global Institute indicates that data-driven organizations are 23 times more likely to acquire customers, 6 times as likely to retain customers, and 19 times more profitable (Bokman et al. 2014). The industry analyst firm Forrester found that organizations that use data to derive insights for decision making are almost 3 times more likely to achieve double-digit growth (Eveslon 2020). According to NAIC (National Association of Insurance Commissioners), the implementation of Big Data has resulted in 30% better access to insurance services, 40–70% cost savings, and 60% higher fraud detection rates (NAIC 2021). According to McKinsey & Company, data and analytics, when implemented effectively in an oil and gas company, can yield returns amounting to 30–50 times the investment within a few months (McKinsey 2017).
However, most organizations struggle to convert data into improved business performance. There are many reasons for this, and one of the most important is the lack of high-quality data. According to Experian Data Quality, a boutique data management company, inaccurate data affects the bottom line of 88% of organizations and impacts up to 12% of revenues (Levy 2015). According to McKinsey, an average user spends two hours a day looking for the right data (Probstein 2019). A report by the Harvard Business Review says that just 3% of the data in a business enterprise meets quality standards (Nagle, Redman, and Sammon 2017), and a joint study by IBM and Carnegie Mellon University found that over 90% of the data in a company is unused.
You cannot separate data from AI, and you cannot separate AI from data. The end product of all AI solutions is data and that data will be used again by AI.
Data is the foundation for enabling artificial intelligence (AI) and analytics, and ultimately improved business performance. But what exactly are AI and analytics? Although there is no one universally agreed definition, AI refers to the simulation of human intelligence, including cognitive processes, by machines, especially computer systems. It is based on the principle that human intelligence can be defined in a way that a machine can easily mimic it, make decisions, and execute tasks, both simple and complex. AI is used extensively across a range of applications today, with varying levels of sophistication, from recommendation algorithms in Netflix to the Alexa chatbot, self-driving cars, fraud prevention, personalized shopping, and more.
Analytics is asking questions to gain insights for decision making. No questions means there is no analytics.
AI generally is undertaken in conjunction with analytics where the analytics algorithms take the data and look to discern useful patterns to facilitate decision making. Basically, AI looks at patterns or predictions about future states using data and analytics algorithms. In other words, pattern recognition and decision making from data are the foundation for AI. If the patterns and decisions are to be reliable, the data should be of high quality. AI is important in business because it can give enterprises insights into their operations. In some cases, AI can perform tasks even better than humans, particularly when it comes to repetitive and rule-based tasks. In terms of business performance, AI and analytics support three broad and fundamental business needs: automating business processes, gaining insight on business performance through data, and engaging with stakeholders including customers, employees, vendors and other partners associated with the business. To summarize, successful AI relies on patterns, and patterns that are derived from analytics need quality data.
While data can be a valuable business asset by offering tangible business results, it has some serious limitations and can become a huge liability if not managed well (Southekal 2021). How can an intangible asset like data become a liability for business? There are four common scenarios where data can become a liability for the business:
Collecting data without a defined business purpose will result in huge data volumes, ultimately resulting in increased complexity and cost due to data management. In 2018, according to Deloitte, the average IT spending in a company was 3.3% of the top line and trending upwards at an average of 49% every year. One important reason attributed to these increased IT expenses is the processing of huge data volumes. In addition, if the data is captured without a defined purpose, it will remain unused. Forrester found that up to 73% of data in a company is never used strategically, and research by IBM and Carnegie Mellon University has found that 90% of the data in an organization is unused data or “dark data” (Southekal 2020).
Data takes up vast amounts of energy to store, secure, and process, resulting in an increase in the carbon footprint for the business. This makes it less attractive for investors considering their growing interest in ESG (environmental, social, and governance) commitments these days. In 2018, data centers consumed roughly 1% of total global electricity. By 2025, according to Swedish researcher Anders Andrae, the energy consumption of data centers is set to account for 3.2% of the total worldwide carbon emissions and consume 20% of global electricity (Southekal 2020).
Cybercriminals are drawn to organizations that have large volumes of data. Many cybercrimes and data breaches in the last few years are associated with organizations that have large databases. These cybercriminals do not care whether or not the data is dark data, and they acquire all the data they can get their hands on. Following its 2017 data breach, Equifax spent $1.4 billion on modifying its technology infrastructure.
Managing data also entails privacy compliance. As noted in Fortune, Facebook lost $35 billion in market value following the Cambridge Analytica data scandal. In addition, the scandal resulted in the permanent closure of Cambridge Analytica. While it was data that was responsible for the success and growth of Cambridge Analytica, it was the same data that resulted in its collapse and ultimate closure.
Data is an asset only if it is managed well; if not, data is a liability in business. Just capturing and storing data doesn't make an organization data-driven.