All the answers to your data science questions
Over half of all businesses are using data science to generate insights and value from big data. How are they doing it? Data Science Strategy For Dummies answers all your questions about how to build a data science capability from scratch, starting with the "what" and the "why" of data science and covering what it takes to lead and nurture a top-notch team of data scientists. With this book, you'll learn how to incorporate data science as a strategic function into any business, large or small. Find solutions to your real-life challenges as you uncover the stories and value hidden within data.
* Learn exactly what data science is and why it's important
* Adopt a data-driven mindset as the foundation to success
* Understand the processes and common roadblocks behind data science
* Keep your data science program focused on generating business value
* Nurture a top-quality data science team
In non-technical language, Data Science Strategy For Dummies outlines new perspectives and strategies to effectively lead analytics and data science functions to create real value.
Page count: 569
Publication year: 2019
Data Science Strategy For Dummies®
Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com
Copyright © 2019 by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit www.wiley.com/techsupport.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2019942827
ISBN: 978-1-119-56625-0; 978-1-119-56626-7 (ebk); 978-1-119-56627-4 (ebk)
Cover
Foreword
Introduction
About This Book
Foolish Assumptions
How This Book Is Organized
Icons Used In This Book
Beyond The Book
Where To Go From Here
Part 1: Optimizing Your Data Science Investment
Chapter 1: Framing Data Science Strategy
Establishing the Data Science Narrative
Sorting Out the Concept of a Data-driven Organization
Sorting Out the Concept of Machine Learning
Defining and Scoping a Data Science Strategy
Chapter 2: Considering the Inherent Complexity in Data Science
Diagnosing Complexity in Data Science
Recognizing Complexity as a Potential
Enrolling in Data Science Pitfalls 101
Navigating the Complexity
Chapter 3: Dealing with Difficult Challenges
Getting Data from There to Here
Managing Data Consistency Across the Data Science Environment
Securing Explainability in AI
Dealing with the Difference between Machine Learning and Traditional Software Programming
Managing the Rapid AI Technology Evolution and Lack of Standardization
Chapter 4: Managing Change in Data Science
Understanding Change Management in Data Science
Approaching Change in Data Science
Recognizing what to avoid when driving change in data science
Using Data Science Techniques to Drive Successful Change
Getting Started
Part 2: Making Strategic Choices for Your Data
Chapter 5: Understanding the Past, Present, and Future of Data
Sorting Out the Basics of Data
Exploring Current Trends in Data
Elaborating on Some Future Scenarios
Chapter 6: Knowing Your Data
Selecting Your Data
Describing Data
Exploring Data
Assessing Data Quality
Improving Data Quality
Chapter 7: Considering the Ethical Aspects of Data Science
Explaining AI Ethics
Addressing trustworthy artificial intelligence
Introducing Ethics by Design
Chapter 8: Becoming Data-driven
Understanding Why Data-Driven Is a Must
Transitioning to a Data-Driven Model
Developing a Data Strategy
Establishing a Data-Driven Culture and Mindset
Chapter 9: Evolving from Data-driven to Machine-driven
Digitizing the Data
Applying a Data-driven Approach
Automating Workflows
Introducing AI/ML capabilities
Part 3: Building a Successful Data Science Organization
Chapter 10: Building Successful Data Science Teams
Starting with the Data Science Team Leader
Defining the Prerequisites for a Successful Team
Building the Team
Connecting the Team to the Business Purpose
Chapter 11: Approaching a Data Science Organizational Setup
Finding the Right Organizational Design
Applying a Common Data Science Function
Chapter 12: Positioning the Role of the Chief Data Officer (CDO)
Scoping the Role of the Chief Data Officer (CDO)
Explaining Why a Chief Data Officer Is Needed
Establishing the CDO Role
The Future of the CDO Role
Chapter 13: Acquiring Resources and Competencies
Identifying the Roles in a Data Science Team
Seeing What Makes a Great Data Scientist
Structuring a Data Science Team
Retaining Competence in Data Science
Part 4: Investing in the Right Infrastructure
Chapter 14: Developing a Data Architecture
Defining What Makes Up a Data Architecture
Exploring the Characteristics of a Modern Data Architecture
Explaining Data Architecture Layers
Listing the Essential Technologies for a Modern Data Architecture
Creating a Modern Data Architecture
Chapter 15: Focusing Data Governance on the Right Aspects
Sorting Out Data Governance
Explaining Why Data Governance is Needed
Establishing Data Stewardship to Enforce Data Governance Rules
Implementing a Structured Approach to Data Governance
Chapter 16: Managing Models During Development and Production
Unfolding the Fundamentals of Model Management
Implementing Model Management
Chapter 17: Exploring the Importance of Open Source
Exploring the Role of Open Source
Describing the Context of Data Science Programming Languages
Unfolding Open Source Frameworks for AI/ML Models
Choosing Open Source or Not?
Chapter 18: Realizing the Infrastructure
Approaching Infrastructure Realization
Listing Key Infrastructure Considerations for AI and ML Support
Automating Workflows in Your Data Infrastructure
Enabling an Efficient Workspace for Data Engineers and Data Scientists
Part 5: Data as a Business
Chapter 19: Investing in Data as a Business
Exploring How to Monetize Data
Looking to the Future of the Data Economy
Chapter 20: Using Data for Insights or Commercial Opportunities
Focusing Your Data Science Investment
Determining the Drivers for Internal Business Insights
Using Data for Commercial Opportunities
Balancing Strategic Objectives
Chapter 21: Engaging Differently with Your Customers
Understanding Your Customers
Keeping Your Customers Happy
Serving Customers More Efficiently
Chapter 22: Introducing Data-driven Business Models
Defining Business Models
Exploring Data-driven Business Models
Using a Framework for Data-driven Business Models
Chapter 23: Handling New Delivery Models
Defining Delivery Models for Data Products and Services
Understanding and Adapting to New Delivery Models
Introducing New Ways to Deliver Data Products
Part 6: The Part of Tens
Chapter 24: Ten Reasons to Develop a Data Science Strategy
Expanding Your View on Data Science
Aligning the Company View
Creating a Solid Base for Execution
Realizing Priorities Early
Putting the Objective into Perspective
Creating an Excellent Base for Communication
Understanding Why Choices Matter
Identifying the Risks Early
Thoroughly Considering Your Data Need
Understanding the Change Impact
Chapter 25: Ten Mistakes to Avoid When Investing in Data Science
Don't Tolerate Top Management's Ignorance of Data Science
Don't Believe That AI Is Magic
Don't Approach Data Science as a Race to the Death between Man and Machine
Don't Underestimate the Potential of AI
Don’t Underestimate the Needed Data Science Skill Set
Don't Think That a Dashboard Is the End Objective
Don't Forget about the Ethical Aspects of AI
Don't Forget to Consider the Legal Rights to the Data
Don't Ignore the Scale of Change Needed
Don't Forget the Measurements Needed to Prove Value
Index
About the Author
Connect with Dummies
End User License Agreement
Chapter 16
TABLE 16-1 Examples of Model Risks and Possible Control Mechanisms
Chapter 1
FIGURE 1-1: The different stages of the data science life cycle.
FIGURE 1-2: The difference between reporting and analytics.
FIGURE 1-3: Example of data exploration using a table.
FIGURE 1-4: Visualizing your data.
FIGURE 1-5: The difference between a traditional business and a data-driven busi...
FIGURE 1-6: The difference in how development, training, and deployment are done...
Chapter 3
FIGURE 3-1: The traditional programming approach.
FIGURE 3-2: A machine learning approach.
FIGURE 3-3: The traditional programming flow.
FIGURE 3-4: A machine learning flow.
Chapter 4
FIGURE 4-1: Driving change in data science.
Chapter 5
FIGURE 5-1: Defining big data.
FIGURE 5-2: A model for cloud/edge computing.
FIGURE 5-3: How digital twins produce insights.
FIGURE 5-4: Creating a blockchain transaction.
FIGURE 5-5: An example of how to use a conversational platform.
Chapter 6
FIGURE 6-1: Aspects to consider when selecting data.
FIGURE 6-2: Data collection areas to address.
FIGURE 6-3: Data exploration on school grades in Swedish regions using a box-plo...
FIGURE 6-4: A scatter-plot exploring dependencies in the data.
FIGURE 6-5: A path analysis chart using data to show how users enter, move and l...
FIGURE 6-6: A heat map analyzing potential correlation between product manufactu...
FIGURE 6-7: Profiling the data to get an overview of the data quality.
FIGURE 6-8: Data profiling and validation from a country perspective.
Chapter 9
FIGURE 9-1: On the road to a machine driven approach.
Chapter 11
FIGURE 11-1: The different organizational models for data science teams.
FIGURE 11-2: Example of dividing responsibilities between business units and the...
Chapter 12
FIGURE 12-1: Comparing CDOs and CAOs.
FIGURE 12-2: The mandate of the CDO role, when it includes CAO responsibilities.
FIGURE 12-3: The evolution of the CDO role.
Chapter 13
FIGURE 13-1: Competence areas needed on a data science team.
FIGURE 13-2: A data scientist Venn diagram of skills, traits, and attitude neede...
FIGURE 13-3: A typical data science team structure.
FIGURE 13-4: An example of mapping the importance of skill set to certain roles.
Chapter 14
FIGURE 14-1: Using the data science flow to define your data architecture.
Chapter 15
FIGURE 15-1: The data aspects managed by data governance.
Chapter 18
FIGURE 18-1: An example of a data infrastructure framework.
Chapter 19
FIGURE 19-1: Different technology areas in the data economy.
Chapter 20
FIGURE 20-1: Different categories of data products.
Chapter 21
FIGURE 21-1: The old-versus-new ways of performing customer marketing.
FIGURE 21-2: A word cloud for brand touch points.
Chapter 22
FIGURE 22-1: Different categories of data-driven business models.
FIGURE 22-2: The 2-sided business model (data driven).
FIGURE 22-3: Data-driven business model (DDBM) dimensions.
Chapter 23
FIGURE 23-1: Examples of delivery models for different data-driven business mode...
FIGURE 23-2: Using an analytics tool to explore your customer data and possible ...
FIGURE 23-3: Graph focused on a certain geographical area selected in the map us...
FIGURE 23-4: Tracking your day on the slopes.
FIGURE 23-5: A model showing workflows on an open data marketplace.
We’re living in a make-or-break era; the ability to generate business value from enterprise data will either make or break your organization. We didn’t get here overnight. For years, experts have been professing how vital it is that businesses reframe themselves to become more data-driven.
Some listened, some did not.
Organizations that took their business by its big data helm (like Netflix, Facebook, and Walmart) set the precedent. You better believe they have extremely robust data strategies in place governing those operations. The ones that did not? This book was written for you.
Sadly, over the last decade, some organizations got caught up in the media buzz. They’ve spent a huge amount of time and money working to hire data scientists, but haven’t seen the ROI they’d expected.
Part of the problem is that it’s both expensive and difficult to hire data scientists. In 2018, the median salaries for data scientists in the USA ranged between $95,000 and $165,000 (see the 2018 Burtch Works’ Data Science Strategy Report). Making matters worse, the demand for analytics-savvy workers is twice the supply (see The Quant Crunch, prepared for IBM by Burning Glass Technologies). No surprise that it’s exceedingly difficult to recruit and retain these types of professionals.
But a bigger part of the problem is just this — contrary to what most advocates will tell you, just sourcing and hiring a team of “Data Scientists” isn’t going to get your organization where it needs to be. You’ll also need to secure a robust set of big data skill sets, technologies, and data resources. More importantly, you’ll need a comprehensive big data strategic plan in place, to help you steer your data ship.
It takes a lot more than just implementation folks dealing with all the details of your data initiatives; you also need an expert to manage them. You need someone who can communicate with and manage your data team, can communicate effectively with organizational leaders, can build relationships with business stakeholders, and who can perform exhaustive evaluations of both your business and your data assets in order to form the data strategy your business will need to survive in the digital era. Read this book for details on how to get these elements in place.
All around the world, I’ve been on the frontlines supporting organizations that know their data’s value and are ready to make big changes to start extracting that value. At Data-Mania, we provide results-driven data strategy services to optimize our clients’ data operations. We are also leading the change by training our clients’ staff with the data strategy and data science skills they need to succeed. Through our partnerships with LinkedIn and Wiley, over the last five years we’ve educated about a million technical professionals globally. Across both of these functions and with each project we engage in, one message strongly resounds: The people and organizations who are committed to taking necessary actions to transform enterprise data to business value are the ones that will prevail in the digital era.
I want to be the first to congratulate you! Just by picking up this book and making the effort to educate yourself on the problems and solutions related to data strategy, you’ve already taken the first step. Whether you’re a C-suite executive who’s looking for guidance on next steps for your organization or a data professional looking to move forward in your career, Data Science Strategy For Dummies will provide you a solid framework around which to proceed.
It’s an exciting time to be alive. Never before have businesses had access to such a powerful upper hand. Those of us who recognize this in our business data are the ones who are primed to blaze the trail and build a true legacy with the work we do in our careers. Some of us have been on this path for a while now, while others are new. Welcome aboard!
Lillian Pierson, P.E.
Data Strategist & CEO of Data-Mania
A revolutionary change is taking place in society. Everybody, from small local companies to global enterprises, is starting to realize the potential in digitizing their data assets and becoming data driven. Regardless of industry, companies have embarked on a similar journey to explore how to drive new business value by utilizing analytics, machine learning (ML), and artificial intelligence (AI) techniques and introducing data science as a new discipline.
However, although utilizing these new technologies will help companies simplify their operations and drive down costs, nothing is simple about getting the strategic approach right for your data science investment. And, the later you join the ML/AI game, the more important it will be to get the strategy right from the start for your particular area of business. Hiring a couple of data scientists to play around with your data is easy enough to do — if you can find some of the few that are available — but the real heavy lifting comes when you try to understand how to utilize data science to create value throughout your business and put that understanding into an executable data science strategy. If you can do that, you are on the right path for success.
A recent survey by Deloitte of “aggressive adopters” of cognitive technologies found that 76 percent believe that they will “substantially transform” their companies within the next three years by using data and AI. IDC, a global market intelligence firm, predicts that by 2021, 75 percent of commercial enterprise apps will use AI, over 90 percent of consumers will interact with customer support bots, and over 50 percent of new industrial robots will leverage AI.
However, at the same time, there remains a very large gap between aspiration and reality. Gartner, yet another research and advisory company, claimed in 2017 that 85 percent of all big data projects fail; not only that, there still seems to be confusion about what the true success factors are for data and AI investments. This book argues that a key success factor is a great data science strategy.
The target audience for this book is anyone interested in making well-balanced strategic choices in the field of data science, no matter which aspect you’re focusing on and at what level, from upper management all the way down to the individual members of a data science team. Strategic choices matter! And, this book is based on actual experiences arising from building a data science capability from scratch in a global enterprise, incorporating learnings from successful choices as well as mistakes and miscalculations along the way.
So far, there seems to be little in-depth research or analysis on the topic of data science and AI strategies and little practical guidance as well. In fact, when researching for this book, I couldn’t find a single other book on the topic of data science strategy. However, several interesting articles and reports are available, like TDWI's report, “Seven Steps for Executing a Successful Data Science Strategy” (https://tdwi.org/research/2015/01/checklist-seven-steps-successful-data-science-strategy.aspx?tc=page0&m=1) or The Startup's “How To Create A Successful Artificial Intelligence Strategy” (https://medium.com/swlh/how-to-create-a-successful-artificial-intelligence-strategy-44705c588e62). However, these articles primarily focus on easily consumable tips and tricks, while bringing up a few aspects of the challenges and considerations needed. There is an obvious lack of in-depth guidance, which an article format cannot really provide.
At the same time, the main reason companies fail with their data science or AI investment is that either there was no data science strategy in place or the complexity of executing on the strategy wasn’t understood. Although this enormous transformation is happening right here, right now, all around us, it seems that few people have grasped how data science will impose a fundamental shift in society, and therefore they don’t understand how to approach it. This book is based on more than ten years of experience spent driving different levels of strategic and practical transformation assignments in a global enterprise. As such, it will help you understand what is fundamentally important to consider and what you should avoid. (Trust me: There are many pitfalls and areas to get stuck in.) But if you want to be at the forefront with your business, you have neither the time nor the money to make mistakes. You really want a solid, end-to-end data science strategy that works for you at the level you need in order to bring your organization forward. The time is now! This is the book that everyone in data science should read.
This book will help guide you through the different areas that need to be considered as part of your data science strategy. This includes managing the complexity in data science and avoiding common data challenges, making strategic choices related to the data itself (including how to capture it, transfer it, compute it, and keep it secure and legally compliant), but also how to build up efficient and successful data science teams.
Furthermore, it includes guidance on strategic infrastructure choices to enable a productive and innovative environment for the data science teams as well as how to acquire and balance data science competence and enable productive ways of working. It also includes how you can turn data into enhanced or new business opportunities, including data-driven business models for new data products and services, while also addressing ethical aspects related to data usage and commercialization.
My goal here is to give you relevant and concrete guidance in those areas that require strategic thinking as well as give some advice on what to include when making choices for both your data and AI investment as well as how best to come up with a useful and applicable data science strategy. Based on my own experience in this field, I'll argue for certain techniques or technology choices or even preferred ways of working, but I won’t come down on one side or the other when it comes to any specific products or services. The most I'll do in that regard is point out that certain methods or technology choices are more appropriate for certain types of users rather than others.
Because this book assumes a basic level of understanding of what data science actually is, don’t think of it as an introduction to data science, but rather as a tool for optimizing your analytics and/or ML/AI investment, regardless of whether that investment is for a small company or a global enterprise. It covers everything from practical advice to deep insights into how to define, focus, and make the right strategic choices in data science throughout. So, if you’re looking to find a broad understanding of what data science is, which techniques and ML tools come recommended, and how to get started as a data scientist professional, I instead warmly recommend the book Data Science For Dummies, by Lillian Pierson (Wiley).
This book has six main parts. Part 1 outlines the major challenges that companies (small as well as large) face when investing in data science. Whereas Part 2 aims to create an understanding of the strategic choices in data science that you need to make, Part 3 guides you in successfully setting up and shaping your data science teams. In Part 4, you find out about important infrastructure considerations, managing models in development and production, and how to relate to open source. In Part 5, you learn all about commercializing your data business and monetizing your data. And, as is the case with all For Dummies books, this book ends with The Part of Tens, with some practical tips, including what not to do when building your data science strategy and spelling out why you need to create a data science strategy to begin with.
I'll occasionally use a few special icons to focus attention on important items. Here’s what you’ll find:
This icon with the proverbial string around the finger reminds you about information that’s worth recalling.
Expect to find something useful or helpful by way of suggestions, advice, or observations.
The Warning icon is meant to grab your attention so that you can steer clear of potholes, money pits, and other hazards.
This icon may be taken in one of two ways: Techies will zero in on the juicy and significant details that follow; others will happily skip ahead to the next paragraph.
This book is designed to help you explore different strategic options for your data science investment. It will guide you in your choices for your business, from data-driven business models to data choices and from team setup to infrastructure choices and a lot more. It will help you navigate the most common challenges and steer you toward the success factors.
However, this book is aimed at covering a very broad range of areas in data science strategy development, and is therefore not able to deep-dive into specific theories or techniques to the level you might be looking for after reading parts of this book.
In addition to what you’re reading right now, this product comes with a free access-anywhere Cheat Sheet that offers a number of data-science-related tips, techniques, and resources. To get this Cheat Sheet, visit www.dummies.com and type data science strategy for dummies cheat sheet in the Search box.
You can start reading this book anywhere you like. You don’t have to read in chapter order, but my suggestion is to start by studying how data science is framed in this book, which is outlined in Chapter 1. In that chapter, you can also learn about the complexity and challenges you will encounter, before diving into subsequent chapters, where I explain how to tackle the challenges most enterprises encounter when strategically investing in data science.
Part 1
IN THIS PART …
Defining a data science strategy
Grasping the complexity in data science
Tackling major challenges in the field of data science
Addressing change in a data-driven organization
Chapter 1
IN THIS CHAPTER
Clarifying the concept of data science
Understanding the fundamentals of a data-driven organization
Putting machine learning in the context of data science
Clarifying the components of an effective data science strategy
In this chapter, I aim to sort out the basics of what data science is all about, but I have to warn you that data science is a term that escapes any single complete definition — which, of course, makes data science difficult to understand and apply in an organization. Many articles and publications use the term quite freely, with the assumption that it’s universally understood. Yet, data science — including its methods, goals, and applications — evolves with time and technology and is now far different from what it might have been 25 years ago.
Despite all that, I'm willing to put forward a tentative definition: Data science is the study of where data comes from, what it represents, and how it can be turned into a valuable resource in the creation of business strategies. Data science can be said to be a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract insights from data in various forms, both structured and unstructured. It involves mining large amounts of structured and unstructured data to identify patterns and deviations that can help an organization rein in costs, increase efficiencies, recognize new market opportunities, and increase the organization's competitive advantage.
Data science is a concept that can be used to unify statistics, analytics, machine learning, and their related methods and techniques in order to understand and analyze actual phenomena with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science.
Behind that type of definition, though, lies the definition of how data science is approached and performed. And because the ambition of this part of the book is to frame data science strategy, I need to first frame this multidisciplinary area of data science and its life cycle more properly.
It never hurts to have an image when explaining a complicated process, so do take a look at Figure 1-1, where you can see the main steps or phases in the data science life cycle. Keep in mind, however, that the model visualized in Figure 1-1 assumes that you've already identified a high-level business problem or business opportunity as a starting point. This early ambition is usually derived from a business perspective, but it needs to be analyzed and framed in detail together with the data science team. This dialogue is vital in terms of understanding which data is available and what is possible to do with that data so you can set the focus of the work going forward. It isn’t a good idea to just start capturing any and all data that looks interesting enough to analyze. Therefore, the first stage of the data science life cycle, capture, is to frame the data you need by translating the business need into a concrete and well-defined problem or business opportunity.
FIGURE 1-1: The different stages of the data science life cycle.
The initial business problem and/or opportunity isn’t static and will change over time as your data-driven understanding matures. Staying flexible in terms of which data is captured, as well as which problem and/or opportunity is most important at any given point in time, is therefore vital in order to achieve your business objectives.
The model shown in Figure 1-1 aims to represent a view of the different stages of the data science life cycle, from capturing the business and data need through preparing, exploring, and analyzing the data to reaching insights and acting on them.
Each full cycle produces new data that captures the results of that cycle. You can use this data to optimize your model, but it can also generate new business needs, problems, or even a new understanding of what the business priority should be.
These stages of the data science life cycle can also be seen as not only steps describing the scope of data science but also layers in an architecture. More on that later; let me start by explaining the different stages.
The first stage in the life cycle has two different parts, since capture refers to both the capture of the business need and the extraction and acquisition of data. This stage is vital to the rest of the process. I'll start by explaining what it means to capture the business need.
The starting point for detailing the business need is a high-level business request or business problem expressed by management or similar entities. Capturing the business need should include tasks such as
Translating ambiguous business requests into concrete, well-defined problems or opportunities
Deep-diving into the context of the requests to better understand what a potential solution could look like, including which data will be needed
Outlining (if possible) strategic business priorities set by the company that might impact the data science work
Now that I've made clear the importance of capturing and understanding the business requests and initial scoping of data needed, I want to move on to describing aspects of the data capture process itself. It’s the main interface to the data source that you need to tap into and includes areas such as
Managing data ownership and securing legal rights to data capture and usage
Handling of personal information and securing data privacy through different anonymization techniques
Using hardware and software for acquiring the data through batch uploads or the real-time streaming of data
Determining how frequently data will need to be acquired, because the frequency usually varies between data types and categories
Mandating that the preprocessing of data occurs at the point of collection, or even before collection (at the edge of an IoT device, for example). This includes basic processing, like cleaning and aggregating data, but it can also include more advanced activities, such as anonymizing the data to remove sensitive information. (Anonymizing refers to removing sensitive information such as a person's name, phone number, address and so on from a data set.)
In most cases, data must be anonymized before being transferred from the data source. Usually a procedure is also in place to validate data sets in terms of completeness. If the data isn’t complete, the collection may need to be repeated several times to achieve the desired data scope. Performing this type of validation early on has a positive impact on both process speed and cost.
Managing the data transfer process to the needed storage point (local and/or global). As part of the data transfer, you may have to transform the data — aggregating it to make it smaller, for example. You may need to do this if you’re facing limits on the bandwidth capacity of the transfer links you use.
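To make the preprocessing steps in this list a bit more tangible, here is a minimal Python sketch of anonymizing records and validating their completeness at the point of collection. It follows the simple description of anonymizing given above (removing fields such as name, phone number, and address). The field names, the required-field list, and the sample records are all assumptions made up for illustration; a real data set defines its own schema and may need stronger anonymization techniques.

```python
from typing import Any

# Hypothetical field names; a real data set would define its own schema.
SENSITIVE_FIELDS = {"name", "phone", "address"}
REQUIRED_FIELDS = {"device_id", "timestamp", "reading"}


def anonymize(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the record without the sensitive fields."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}


def is_complete(record: dict[str, Any]) -> bool:
    """A record is complete when every required field is present and not None."""
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)


def preprocess(records: list[dict[str, Any]]) -> tuple[list[dict[str, Any]], int]:
    """Anonymize all records; return the complete ones and the number of incomplete ones."""
    cleaned = [anonymize(r) for r in records]
    complete = [r for r in cleaned if is_complete(r)]
    return complete, len(cleaned) - len(complete)


# Made-up sample records for illustration only.
raw = [
    {"device_id": "d1", "timestamp": 1, "reading": 20.5, "name": "Ann", "phone": "555-0100"},
    {"device_id": "d2", "timestamp": 2, "reading": None, "name": "Bob"},
]
ready, missing = preprocess(raw)
```

A nonzero count of incomplete records is the signal that the collection may need to be repeated before the data is transferred.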
Data maintenance activities include both storing and maintaining the data. Note that data is usually processed in many different steps throughout its life cycle.
The need to protect data integrity during the life cycle of a data element is especially important during data processing activities. It’s easy to accidentally corrupt a dataset through human error when manually processing data, causing the data set to be useless for analysis in the next step. The best way to protect data integrity is to automate as many steps as possible of the data management activities leading up to the point of data analysis.
Keeping business trust in the data foundation is vital in order for business users to trust and make use of the derived insights.
When it comes to maintaining data, two important aspects are
Data storage:
Think of this as everything associated with what's happening in the data lake. Data storage activities include managing the different retention periods for different types of data, as well as cataloging data properly to ensure that data is easy to access and use.
Data preparation:
In the context of maintaining data, data preparation includes basic processing tasks such as second-level data cleansing, data staging, and data aggregation, all of which usually involve applying a filter directly when the data is put into storage. You don't want to put data with poor quality into your data lake.
Data retention periods can be different for the same data type, depending on its level of aggregation. For example, raw data might be interesting to save for only a short time because it’s usually very large in volume and therefore costly to store. Aggregated data on the other hand, is often smaller in size and cheaper and easier to store and can therefore be saved for longer periods, depending on the targeted use cases.
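As a rough illustration of how retention periods can differ by aggregation level, consider the following Python sketch. The level names and the number of days are assumptions chosen for the example, not recommendations; your own retention policy depends on your use cases, storage costs, and legal requirements.

```python
from datetime import date, timedelta

# Hypothetical retention policy: days to keep data, per aggregation level.
RETENTION_DAYS = {"raw": 30, "hourly": 365, "daily": 5 * 365}


def is_expired(level: str, created: date, today: date) -> bool:
    """True when a data set is older than the retention period for its level."""
    return today - created > timedelta(days=RETENTION_DAYS[level])


today = date(2024, 1, 1)
created = date(2023, 10, 1)  # same creation date, different aggregation levels

raw_expired = is_expired("raw", created, today)      # raw data: short retention
daily_expired = is_expired("daily", created, today)  # aggregated data: long retention
```

With these assumed periods, the raw data set has passed its retention period, whereas the daily aggregate of the same data is still kept.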
Data processing is the main layer focused on preparing data for analysis. It relies on more advanced data engineering methodologies, such as
Data classification:
This refers to the process of organizing data into categories for even more effective and efficient use, including activities such as the labeling and tagging of data. A well-planned data classification system makes essential data easy to find and retrieve. This can also be of particular importance for areas such as legal and compliance.
Data modeling:
This helps with the visual representation of data and enforces established business rules regarding data. You would also build data models to enforce policies on how you should correlate different data types in a consistent manner. Data models also ensure consistency in naming conventions, default values, semantics, and security procedures, thus ensuring quality of data.
Data summarization:
Here your aim is to use different ways to summarize data, like using different clustering techniques.
Data mining:
This is the process of analyzing large data sets to identify patterns or deviations as well as to establish relationships in order to enable problems to be solved through data analysis further down the road. Data mining is a sort of data analysis focused on gaining an enhanced understanding of data, which is also referred to as data literacy. Building data literacy in the data science teams is a key component of data science success.
With low data literacy, and without truly understanding the data you’re preparing, analyzing, and deriving insights from, you run a high risk of failing when it comes to your data science investment.
Data analysis is the stage where the data comes to life and you’re finally able to derive insights from the application of different analytical techniques.
Insights can be focused on understanding and explaining what has happened, which means that the analysis is descriptive and more reactive in nature. This is also the case with real-time analysis: It’s still reactive even when it happens in the here-and-now.
Then there are data analysis methods that aim to explain not only what happened but also why it happened. These types of data analysis are usually referred to as diagnostic analyses.
Both descriptive and diagnostic methods are usually grouped into the area of reporting, or business intelligence (BI).
To be able to predict what will happen, you need to use a different set of analytical techniques and methods. Predictions about the future can be done strategically or in real-time settings. For a real-time prediction you need to develop, train and validate a model before deploying it on real-time data. The model could then search for certain data patterns and conditions that you have trained the model to find, to help you predict a problem before it happens.
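The develop, train, validate, and deploy sequence can be sketched in a few lines of Python. This is a deliberately simple stand-in for a real predictive model: it learns an alert threshold from readings taken during normal operation, checks the threshold against labeled held-out readings, and then applies it to new readings as they arrive. The temperature values, the labels, and the choice of three standard deviations are all assumptions made for illustration.

```python
from statistics import mean, stdev


def train(normal_readings: list[float], k: float = 3.0) -> float:
    """Learn an alert threshold: mean plus k standard deviations of normal readings."""
    return mean(normal_readings) + k * stdev(normal_readings)


def predict(threshold: float, reading: float) -> bool:
    """Flag a reading as an early warning when it exceeds the learned threshold."""
    return reading > threshold


def validate(threshold: float, labeled: list[tuple[float, bool]]) -> float:
    """Share of held-out labeled readings that the model classifies correctly."""
    hits = sum(predict(threshold, x) == label for x, label in labeled)
    return hits / len(labeled)


# Hypothetical temperature readings from a machine in normal operation.
history = [70.0, 71.0, 69.0, 70.5, 69.5, 70.0, 71.5, 68.5]
threshold = train(history)

# Hypothetical held-out readings, labeled True when a failure followed.
holdout = [(70.2, False), (69.8, False), (78.0, True), (80.5, True)]
accuracy = validate(threshold, holdout)

# "Deployment": apply the validated model to each new reading as it arrives.
alerts = [predict(threshold, x) for x in [70.1, 70.9, 77.3]]
```

The important point is the order of the steps, not the model itself: the model is validated on data it has not seen before it is trusted with live data.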
Figure 1-2 shows the difference between reporting techniques about what has happened (in black) and analytics techniques about what is likely to happen, using statistical models and predictive models (in white).
FIGURE 1-2: The difference between reporting and analytics.
This list gives you examples of the kinds of questions you can ask using different reporting and BI techniques:
Standard reports:
What was the customer churn rate?
Ad hoc reports:
How did the code fix carried out on a certain date impact product performance?
Query drill-down:
Are similar product-quality issues reported in all geographical locations?
Alerts:
Customer churn has increased. What action is recommended?
And this list gives you examples of the kinds of questions you can ask using different analytics techniques:
Statistical analysis:
Which factors contribute most to the product quality issues?
Forecasting:
What will bandwidth demand be in 6 months?
Predictive modeling:
Which customer segment is most likely to respond to this marketing campaign?
Optimization:
What is the optimal mix of customer, offering, price, and sales channel?
Analytics can also be separated into two categories: basic analytics and advanced analytics. Basic analytics uses rudimentary techniques and statistical methods to derive value from data, usually in a manual manner, whereas in advanced analytics, the objective is to gain deeper insights, make predictions, or generate recommendations by way of an autonomous or semiautonomous examination of data or content using more advanced and sophisticated statistical methods and techniques.
Some examples of the differences are described in this list:
Exploratory data analytics
is a statistical approach to analyzing data sets in order to summarize their main characteristics, often with visual methods. You can choose to use a statistical model or not, but if used, such a model is primarily for visualizing what the data can tell you beyond the formal modeling or hypothesis testing task. This is categorized as basic analytics.
Predictive analytics
is the use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data. This is categorized as advanced analytics.
Regression analysis
is a way of mathematically sorting out which variables have an impact. It answers these questions: Which factors matter most? Which can be ignored? How do those factors interact with each other? And, perhaps most importantly, how certain am I about all these factors? This is categorized as advanced analytics.
Text mining or text analytics
is the process of exploring and analyzing large amounts of unstructured text aided by software that can identify concepts, patterns, topics, keywords, and other attributes in the data. The overarching goal of text mining is to turn text into data for analysis via application of natural language processing (NLP) and various analytical methods. Text mining can be done from a more basic perspective as well as from a more advanced perspective, depending on the use case.
The communication stage of data science is about making sure insights and learnings from the data analysis are understood and communicated by way of different means so that they can be put to efficient use. It includes areas such as
Data reporting:
The process of collecting and submitting data in order to enable an accurate analysis of the facts on the ground. It’s a vital part of communication because inaccurate data reporting can lead to vastly uninformed decision-making based on inaccurate evidence.
Data visualization:
This can also be seen as visual communication because it involves the creation and study of the visual representation of data and insights. To help you communicate the result of the data analysis clearly and efficiently, data visualization uses statistical graphics, plots, information graphics, and other tools. Effective visualization helps users analyze and reason about data and evidence because it makes complex data more accessible, understandable, and usable.
Users may have been assigned particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphical visualization (showing comparisons or showing causality, in this example) follows the task. Tables are generally used where users can look up a specific measurement, and charts of various types are used to show patterns or relationships in the data for one or more variables.
Figure 1-3 below exemplifies how data exploration could work using a table format. In this specific case, the data being explored regards cars, and the question being explored is which car attribute impacts fuel consumption the most. Is it, for example, the car brand, engine size, horsepower, or perhaps the weight of the car? As you can see, exploring the data using tables has its limitations and does not give an immediate overview. It requires you to go through the data in detail to discover relationships and patterns. Compare this with the graph shown in Figure 1-4 below, where the same data is being visualized in a completely different way.
Figure 1-3 is based on a screenshot generated using SAS® Visual Analytics software. Copyright © 2019 SAS Institute Inc., Cary, NC, USA. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. All Rights Reserved. Used with permission.
FIGURE 1-3: Example of data exploration using a table.
Figure 1-4 is based on a screenshot generated using SAS® Visual Analytics software. Copyright © 2019 SAS Institute Inc., Cary, NC, USA. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. All Rights Reserved. Used with permission.
FIGURE 1-4: Visualizing your data.
In Figure 1-4, a visualization in the shape of a linear regression graph has been generated for each car attribute, together with text explaining the strength of each relationship to fuel consumption. (Linear regression involves fitting a straight line to a dataset while trying to minimize the error between the points and the fitted line.) The graph in Figure 1-4 shows a very strong positive relationship between the weight of the car and fuel consumption. By studying the relationship between the other attributes and fuel consumption using the graph generated for each tab, it will be quite easy to find the strongest relationship compared to using the table in Figure 1-3.
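If you want to see what fitting a straight line means in practice, the following Python sketch computes a least-squares line and the correlation coefficient for car weight versus fuel consumption. The numbers are invented for illustration and are not the data behind Figure 1-4; the function name fit_line is likewise just a name chosen for this example.

```python
from math import sqrt


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float, float]:
    """Least-squares fit of y = slope * x + intercept, plus Pearson's r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # minimizes the squared vertical errors
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)          # strength and direction of the relationship
    return slope, intercept, r


# Invented sample: car weight in kilograms, fuel consumption in liters per 100 km.
weights = [1000.0, 1200.0, 1400.0, 1600.0, 1800.0]
consumption = [5.0, 6.1, 6.9, 8.0, 9.0]

slope, intercept, r = fit_line(weights, consumption)
```

A correlation coefficient close to 1 corresponds to what the text calls a very strong positive relationship. Repeating the calculation for each attribute and comparing the coefficients is the numeric counterpart of flipping through the tabs in the graph.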
However, in data exploration the key is to stay flexible in terms of which exploration methods to use. In this case, it was easier and quicker to find the relationship by using linear regression, but in another case a table might be enough, or neither of these approaches might work. If you have geographical data, for example, the best way to explore it might be by using a geo map, where the data is distributed based on geographical location. But more about that later on.
The final stage in the data science life cycle is to actuate the insights derived from all previous stages. This stage has not always been seen as part of the data science life cycle, but the more that society moves toward embracing automation, the more the interest in this area grows.
Decision-making for actuation refers to connecting an insight derived from data analysis to trigger a human- or machine-driven decision-making process of identifying and deciding alternatives for the right action based on the values, policies, preferences, or beliefs related to the business or scope of the task.
What actually occurs is that a human or machine compares the insight with a predefined set of policies for what needs to be done when a certain set of criteria is fulfilled. If the criteria are fulfilled, this triggers a decision or an action. The actuation trigger can be directed toward a human (for example, a manager) for further decisions to be made in a larger context, or toward a machine when the insight falls within the scope of the predefined policies for actuation.
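The comparison of an insight against predefined policies can be sketched as follows in Python. The policy names, thresholds, and actions are hypothetical examples; the point is only to show how fulfilled criteria trigger an action and how that action is routed to either a machine or a human.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    name: str
    condition: Callable[[dict], bool]  # the criteria the insight must fulfill
    action: str                        # what to do when the criteria are fulfilled
    automated: bool                    # True: a machine acts; False: a human decides


# Hypothetical, human-approved policies for illustration only.
POLICIES = [
    Policy("restart_on_overheat", lambda i: i.get("temperature", 0) > 90, "restart_device", True),
    Policy("review_churn_spike", lambda i: i.get("churn_rate", 0) > 0.05, "notify_manager", False),
]


def actuate(insight: dict) -> list[tuple[str, str]]:
    """Return (actor, action) pairs for every policy whose criteria the insight fulfills."""
    decisions = []
    for policy in POLICIES:
        if policy.condition(insight):
            actor = "machine" if policy.automated else "human"
            decisions.append((actor, policy.action))
    return decisions


machine_case = actuate({"temperature": 95, "churn_rate": 0.02})
human_case = actuate({"churn_rate": 0.08})
```

In the first case the insight falls within the scope of an automated policy, so the machine acts directly. In the second, the trigger is directed toward a manager for a decision in a larger context.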
Automation of tasks or decisions increases speed and reduces cost, and if set up properly, also produces continuous and reliable data on the outcome of the implemented action.
The stage where decisions are actuated — by either human hand or a machine — is one of the most important areas of data science. It’s fundamental because it will provide data science professionals (also known as data scientists) with new data based on the results of the action (resolution or prevention of a problem, for example), which tells the data scientists whether their models and algorithms are performing as expected after deployment or whether they need to be corrected or improved. The follow-up regarding model and algorithm performance also supports the concept of continuous improvement.
What is the actual relationship between data science and automation? And can automation accelerate data science production and efficiency? Well, assuming that the technology evolution in society moves more and more toward automation, not only for simple process steps previously performed by humans but also for complex actions identified and decided by intelligent machines powered by machine-learning-developed algorithms, the relationship will be a strong one, and data science production and efficiency will accelerate considerably due to automation.
The decisions will, of course, not really be decided by the machines, but will be based on human-preapproved policies that the machine then acts on. Machine learning doesn’t mean that the machine can learn unfettered, but rather that it always encounters boundaries for the learning set up by the data scientist — boundaries regulated by established policies. However, within these policy boundaries, the machine can learn to optimize the analysis and execution of tasks assigned to it.
Despite the boundaries imposed on it, automation powered by machines will become more and more important in data science, not only as a means to increase speed (from detection to correction or prevention) but also to lower cost and secure quality and consistency of data management, actuation of insights, and data generation based on the outcome.
When applying data science in your business, remember that data science is transformative. For it to fully empower your business, it isn’t a question of just going out and hiring a couple of data scientists (if you can find them), putting them into a traditional software development department, and expecting miracles. For data science to thrive and generate full value, you need to be prepared to first transform your business into a data-driven organization.
Data is the new black! Or the new oil! Or the new gold! Whatever you compare data to, it’s probably true from a conceptual value perspective. As a society, we have now entered a new era of data and intelligent machines. And it isn’t a passing trend or something that you can or should avoid. Instead, you should embrace it and ask yourself whether you understand enough about it to leverage it in your business. Be open-minded and curious! Dare to ask yourself whether you truly understand what being data-driven means.
The concept of being data-driven is a cornerstone that you need to understand in order to correctly carry out any strategic work in data science, and it’s addressed in several parts of this book. In this chapter, I try to give you a big-picture view of how to think and reason around the idea of being data-driven.
If you start by putting the ongoing changes happening in society into a wider context, it’s a common understanding that we humans are now experiencing a fourth industrial revolution, driven by access to data and advanced technology. It’s also referred to as the digital revolution. But be aware! Digitizing or digitalizing your business isn’t the same as being data-driven.
Digitization is a widely used concept that basically refers to transitioning from analog to digital, like the conversion of data to a digital format. In relation to that, digitalization refers to making the digitized information work in your business.
The concept of digitalizing a business is sometimes mixed up with being data-driven. However, it’s vital to remember that digitalizing the data isn’t just a good thing to do — it’s the foundation for enabling a data-driven enterprise. Without digitalization, you simply cannot become data-driven.
In a data-driven organization, the starting point is data. It’s truly the foundation of everything. But what does that actually mean? Well, being data-driven means that you need to be ready to take data seriously. And what does that mean? Well, in practice, it means that data is the starting point and you use data to analyze and understand what type of business you should be doing. You must take the outcome of the analysis seriously enough to be prepared to change your business models accordingly. You must be ready to trust and use the data to drive your business forward. It should be your main concern in the company. You need to become “data-obsessed.”
Before I explain what it means to be data obsessed, consider how you’re doing things today in your company. Is it somewhat data-driven? Or perhaps not at all? Where is the starting point in different business areas?
Figure 1-5 shows a model (with examples) that compares how a traditional business and a data-driven business approach different business aspects.
FIGURE 1-5: The difference between a traditional business and a data-driven business.
Comparing the approaches in a traditional business versus a data-driven organization is worthwhile. Many companies’ leaders actually think that their companies are data-driven just because they collect and analyze data. But it’s all about how data drives (or doesn’t drive) the business priorities, decisions, and execution that tells you how data-driven your business really is. Understanding what the starting point is will help you define your ground zero and identify which areas need more attention in order to change.
So, what does the term data-obsessed actually mean? It’s really quite simple: It means that you should always assume that access to and usage of data can improve your business in all aspects. Use the following list of questions to determine how data-obsessed your organization actually is:
Which data do you need to use as a company, based on your strategic objectives? Do you collect that data already? If not, how do you get it?
Do you own all the data you need? If not, how can you secure legal rights to use it for your needs (internal efficiency or business opportunities)?
Is the data geographically distributed across countries? If yes, what needs to happen to your infrastructure in order to enable you to use it efficiently?
Is the data sensitive? That is, does it contain personal information? If yes, what are the applicable laws and regulations related to the data? (Be sure to note whether those laws and regulations change, depending on which country houses a specific data storage facility.) How do you intend to use sensitive data?
Do you need access to the data in real-time to analyze and realize your use cases? If yes, what type of data architecture do you need?
What data retention periods do you need to establish for the different types of data used by your organization? What will you use the selected data types for? Are you in control when it comes to expected data volumes and data storage costs per data type?
Can you automate most of the data acquisition and data management activities? If yes, what is the best data architectural solution for that?
Do you need to account for an exploratory development environment as well as an efficient and highly automated production environment in the same architecture? If yes, how will you realize that?
Are employees ready to become data-driven? Have the potential, value, and scope of the change been clearly stated and communicated? If so, are employees ready for that change?
Are managers and leaders on board with what it means to become data-driven? Do they fully understand what needs to change fundamentally? If so, are managers and leaders ready to start taking vital decisions based on data?
The questions I pose here don’t comprise an exhaustive list, but they cover some of the main areas to address from a data-driven perspective. Notice that these questions don’t cover anything related to using machine learning or artificial intelligence techniques. The reason is that, in practice, a company can be data-driven based only on data, analytics, and automation. However, companies that also effectively integrate the use of technologies like machine learning and artificial intelligence have a better foundation for responding to the machine-driven evolution in society.
People often ask me to explain the difference between advanced analytics and machine learning and to say when it is advisable to go for one approach or the other. I always start out by defining machine learning. Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to perform the task.
So, here's how advanced analytics and ML have some characteristics in common:
Both advanced analytics and machine learning techniques are used for building and executing advanced mathematical and statistical models as well as building optimized models that can be used to predict events before they happen.
Both methods use data to develop the models, and both require defined model policies.
Automation can be used to run both analytics models and machine learning models after they’re put into production.
What about the differences between advanced analytics and machine learning?
There is a difference in who the actor is when creating your model. In an advanced analytics model, the actor is human; in a machine learning model, the actor is (obviously) a machine.
There is also a difference in the model format. Analytics models are developed and deployed with the human-defined design, whereas ML models are dynamic and change design and approach as they’re being trained by the data, optimizing the design along the way. Machine learning models can also be deployed as dynamic, which means that they continue to train, learn, and optimize the design when exposed to real-life data and its live context.
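The difference between a model that is fixed at deployment and one that is deployed as dynamic can be illustrated with a tiny Python sketch. Both classes below are hypothetical and intentionally minimal: the "model" is just an estimate of a typical value, and the learning rate of 0.5 is an arbitrary assumption. Real machine learning models are far richer, but the contrast in behavior after deployment is the same.

```python
class StaticModel:
    """Trained once; its estimate never changes after deployment."""

    def __init__(self, training_data: list[float]) -> None:
        self.estimate = sum(training_data) / len(training_data)

    def predict(self) -> float:
        return self.estimate


class DynamicModel(StaticModel):
    """Keeps learning after deployment: each live observation nudges the estimate."""

    def __init__(self, training_data: list[float], learning_rate: float = 0.5) -> None:
        super().__init__(training_data)
        self.learning_rate = learning_rate

    def observe(self, value: float) -> None:
        # Move the estimate a fraction of the way toward the new observation.
        self.estimate += self.learning_rate * (value - self.estimate)


training = [10.0, 10.0, 10.0]   # made-up training data
live = [20.0, 20.0, 20.0]       # made-up live data after conditions have changed

static = StaticModel(training)
dynamic = DynamicModel(training)
for value in live:
    dynamic.observe(value)

static_prediction = static.predict()
dynamic_prediction = dynamic.predict()
```

After the live data shifts, the static model still predicts what it learned in training, whereas the dynamic model has moved most of the way toward the new reality.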
