The Handbook of Data Science and AI - Stefan Papp - E-Book

Description

Data Science, Big Data, and Artificial Intelligence are currently some of the most talked-about concepts in industry, government, and society, and yet also the most misunderstood. This book will clarify these concepts and provide you with practical knowledge to apply them. Featuring:

- A comprehensive overview of the various fields of application of data science
- Case studies from practice to make the described concepts tangible
- Practical examples to help you carry out simple data analysis projects
- BONUS in print edition: E-Book inside

The book approaches the topic of data science from several sides. Crucially, it will show you how to build data platforms and apply data science tools and methods. Along the way, it will help you understand - and explain to various stakeholders - how to generate value from these techniques, such as applying data science to help organizations make faster decisions, reduce costs, and open up new markets. Furthermore, it will bring fundamental concepts related to data science to life, including statistics, mathematics, and legal considerations. Finally, the book outlines practical case studies that illustrate how knowledge generated from data is changing various industries over the long term.

Covers these current topics:

- Mathematics basics: Mathematics for Machine Learning to help you understand and utilize various ML algorithms.
- Machine Learning: From statistical to neural and from Transformers and GPT-3 to AutoML, we introduce common frameworks for applying ML in practice
- Natural Language Processing: Tools and techniques for gaining insights from text data and developing language technologies
- Computer vision: How can we gain insights from images and videos with data science?
- Modeling and Simulation: Model the behavior of complex systems, such as the spread of COVID-19, and perform what-if analyses covering different scenarios.
- ML and AI in production: How to turn experimentation into a working data science product?
- Presenting your results: Essential presentation techniques for data scientists




Stefan Papp, Wolfgang Weidinger, Katherine Munro, Bernhard Ortner, Annalisa Cadonna, Georg Langs, Roxane Licandro, Mario Meir-Huber, Danko Nikolić, Zoltan Toth, Barbora Vesela, Rania Wazir, Günther Zauner

The Handbook of Data Science and AI

Generate Value from Data with Machine Learning and Data Analytics

Distributed by:
Carl Hanser Verlag
Postfach 86 04 20, 81631 Munich, Germany
Fax: +49 (89) 98 48 09
www.hanserpublications.com
www.hanser-fachbuch.de

The use of general descriptive names, trademarks, etc., in this publication, even if the former are not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone. While the advice and information in this book are believed to be true and accurate at the date of going to press, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

The final determination of the suitability of any information for the use contemplated for a given application remains the sole responsibility of the user.

All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying or by any information storage and retrieval system, without permission in writing from the publisher.

© Carl Hanser Verlag, Munich 2022
Cover concept: Marc Müller-Bremer, www.rebranding.de, Munich
Cover design: Max Kostopoulos
Cover image: © gettyimages.de/ValeryBrozhinsky

Print ISBN: 978-1-56990-886-0
E-Book ISBN: 978-1-56990-887-7
ePub ISBN: 978-1-56990-888-4

Table of Contents

Title Page

Imprint

Table of Contents

Foreword

Preface

Acknowledgments

1 Introduction

1.1 What are Data Science, Machine Learning and Artificial Intelligence?

1.2 Data Strategy

1.3 From Strategy to Use Cases

1.3.1 Data Teams

1.3.2 Data and Platforms

1.3.3 Modeling and Analysis

1.4 Use Case Implementation

1.4.1 Iterative Exploration of Use Cases

1.4.2 End-to-End Data Processing

1.4.3 Data Products

1.5 Real-Life Use Case Examples

1.5.1 Value Chain Digitization (VCD)

1.5.2 Marketing Segment Analytics

1.5.3 360° View of the Customer

1.5.4 NGO and Sustainability Use Cases

1.6 Delivering Results

1.7 In a Nutshell

2 Infrastructure

Stefan Papp

2.1 Introduction

2.2 Hardware

2.2.1 Distributed Systems

2.2.2 Hardware for AI Applications

2.3 Linux Essentials for Data Professionals

2.4 Terraform

2.5 Cloud

2.5.1 Basic Services

2.5.2 Cloud-native Solutions

2.6 In a Nutshell

3 Data Architecture

Zoltan C. Toth

3.1 Overview

3.1.1 Maslow’s Hierarchy of Needs for Data

3.1.2 Data Architecture Requirements

3.1.3 The Structure of a Typical Data Architecture

3.1.4 ETL (Extract, Transform, Load)

3.1.5 ELT (Extract, Load, Transform)

3.1.6 ETLT

3.2 Data Ingestion and Integration

3.2.1 Data Sources

3.2.2 Traditional File Formats

3.2.3 Modern File Formats

3.2.4 Summary

3.3 Data Warehouses, Data Lakes, and Lakehouses

3.3.1 Data Warehouses

3.3.2 Data Lakes and the Lakehouse

3.3.3 Summary: Comparing Data Warehouses to Lakehouses

3.4 Data Processing and Transformation

3.4.1 Big Data & Apache Spark

3.4.2 Databricks

3.5 Workflow Orchestration

3.6 A Data Architecture Use Case

3.7 In a Nutshell

4 Data Engineering

Stefan Papp, Bernhard Ortner

4.1 Data Integration

4.1.1 Data Pipelines

4.1.2 Designing Data Pipelines

4.1.3 CI/CD

4.1.4 Programming Languages

4.1.5 Kafka as Reference ETL Tool

4.1.6 Design Patterns

4.1.7 Automation of the Stages

4.1.8 Six Building Blocks of the Data Pipeline

4.2 Managing Analytical Models

4.2.1 Model Delivery

4.2.2 Model Update

4.2.3 Model or Parameter Update

4.2.4 Model Scaling

4.2.5 Feedback into the Operational Processes

4.3 In a Nutshell

5 Data Management

Stefan Papp, Bernhard Ortner

5.1 Data Governance

5.1.1 Data Catalog

5.1.2 Data Discovery

5.1.3 Data Quality

5.1.4 Master Data Management

5.1.5 Data Sharing

5.2 Information Security

5.2.1 Data Classification

5.2.2 Privacy Protection

5.2.3 Encryption

5.2.4 Secrets Management

5.2.5 Defense in Depth

5.3 In a Nutshell

6 Mathematics

Annalisa Cadonna

6.1 Linear Algebra

6.1.1 Vectors and Matrices

6.1.2 Operations between Vectors and Matrices

6.1.3 Linear Transformations

6.1.4 Eigenvalues, Eigenvectors, and Eigendecomposition

6.1.5 Other Matrix Decompositions

6.2 Calculus and Optimization

6.2.1 Derivatives

6.2.2 Gradient and Hessian

6.2.3 Gradient Descent

6.2.4 Constrained Optimization

6.3 Probability Theory

6.3.1 Discrete and Continuous Random Variables

6.3.2 Expected Value, Variance, and Covariance

6.3.3 Independence, Conditional Distributions, and Bayes’ Theorem

6.4 In a Nutshell

7 Statistics – Basics

Rania Wazir, Georg Langs, Annalisa Cadonna

7.1 Data

7.2 Simple Linear Regression

7.3 Multiple Linear Regression

7.4 Logistic Regression

7.5 How Good is Our Model?

7.6 In a Nutshell

8 Machine Learning

Georg Langs, Katherine Munro, Rania Wazir

8.1 Introduction

8.2 Basics: Feature Spaces

8.3 Classification Models

8.3.1 K-Nearest-Neighbor-Classifier

8.3.2 Support Vector Machine

8.3.3 Decision Tree

8.4 Ensemble Methods

8.4.1 Bias and Variance

8.4.2 Bagging: Random Forests

8.4.3 Boosting: AdaBoost

8.5 Artificial Neural Networks and the Perceptron

8.6 Learning without Labels – Finding Structure

8.6.1 Clustering

8.6.2 Manifold Learning

8.6.3 Generative Models

8.7 Reinforcement Learning

8.8 Overarching Concepts

8.9 Into the Depth – Deep Learning

8.9.1 Convolutional Neural Networks

8.9.2 Training Convolutional Neural Networks

8.9.3 Recurrent Neural Networks

8.9.4 Long Short-Term Memory

8.9.5 Autoencoders and U-Nets

8.9.6 Adversarial Training Approaches

8.9.7 Generative Adversarial Networks

8.9.8 Cycle GANs and Style GANs

8.9.9 Other Architectures and Learning Strategies

8.10 Validation Strategies for Machine Learning Techniques

8.11 Conclusion

8.12 In a Nutshell

9 Building Great Artificial Intelligence

Danko Nikolić

9.1 How AI Relates to Data Science and Machine Learning

9.2 A Brief History of AI

9.3 Five Recommendations for Designing an AI Solution

9.3.1 Recommendation No. 1: Be pragmatic

9.3.2 Recommendation No. 2: Make it easier for machines to learn – create inductive biases

9.3.3 Recommendation No. 3: Perform analytics

9.3.4 Recommendation No. 4: Beware of the scaling trap

9.3.5 Recommendation No. 5: Beware of the generality trap (there is no such thing as a free lunch)

9.4 Human-level Intelligence

9.5 In a Nutshell

10 Natural Language Processing (NLP)

Katherine Munro

10.1 What is NLP and Why is it so Valuable?

10.2 NLP Data Preparation Techniques

10.2.1 The NLP Pipeline

10.2.2 Converting the Input Format for Machine Learning

10.3 NLP Tasks and Methods

10.3.1 Rule-Based (Symbolic) NLP

10.3.2 Statistical Machine Learning Approaches

10.3.3 Neural NLP

10.3.4 Transfer Learning

10.4 At the Cutting Edge: Current Research Focuses for NLP

10.5 In a Nutshell

11 Computer Vision

Roxane Licandro

11.1 What is Computer Vision?

11.2 A Picture Paints a Thousand Words

11.2.1 The Human Eye

11.2.2 Image Acquisition Principle

11.2.3 Digital File Formats

11.2.4 Image Compression

11.3 I Spy With My Little Eye Something That Is

11.3.1 Computational Photography and Image Manipulation

11.4 Computer Vision Applications & Future Directions

11.4.1 Image Retrieval Systems

11.4.2 Object Detection, Classification and Tracking

11.4.3 Medical Computer Vision

11.5 Making Humans See

11.6 In a Nutshell

12 Modelling and Simulation – Create your own Models

Günther Zauner, Wolfgang Weidinger

12.1 Introduction

12.2 General Aspects

12.3 Modelling to Answer Questions

12.4 Reproducibility and Model Lifecycle

12.4.1 The Lifecycle of a Modelling and Simulation Question

12.4.2 Parameter and Output Definition

12.4.3 Documentation

12.4.4 Verification and Validation

12.5 Methods

12.5.1 Ordinary Differential Equations (ODEs)

12.5.2 System Dynamics (SD)

12.5.3 Discrete Event Simulation

12.5.4 Agent-Based Modelling

12.6 Modelling and Simulation Examples

12.6.1 Dynamic Modelling of Railway Networks for Optimal Pathfinding Using Agent-based Methods and Reinforcement Learning

12.6.2 Agent-Based Covid Modelling Strategies

12.6.3 Deep Reinforcement Learning Approach for Optimal Replenishment Policy in a VMI Setting

12.7 Summary and Lessons Learned

12.8 In a Nutshell

13 Data Visualization

Barbora Vesela

13.1 History

13.2 Which Tools to Use

13.3 Types of Data Visualizations

13.3.1 Scatter Plot

13.3.2 Line Chart

13.3.3 Column and Bar Charts

13.3.4 Histogram

13.3.5 Pie Chart

13.3.6 Box Plot

13.3.7 Heat Map

13.3.8 Tree Diagram

13.3.9 Other Types of Visualizations

13.4 Select the right Data Visualization

13.5 Tips and Tricks

13.6 Presentation of Data Visualization

13.7 In a Nutshell

14 Data Driven Enterprises

Mario Meir-Huber, Stefan Papp

14.1 The three Levels of a Data Driven Enterprise

14.2 Culture

14.2.1 Corporate Strategy for Data

14.2.2 The Current State Analysis

14.2.3 Culture and Organization of a Successful Data Organization

14.2.4 Core Problem: The Skills Gap

14.3 Technology

14.3.1 The Impact of Open Source

14.3.2 Cloud

14.3.3 Vendor Selection

14.3.4 Data Lake from a Business Perspective

14.3.5 The Role of IT

14.3.6 Data Science Labs

14.3.7 Revolution in Architecture: The Data Mesh

14.4 Business

14.4.1 Buy and Share Data

14.4.2 Analytical Use Case Implementation

14.4.3 Self-service Analytics

14.5 In a Nutshell

15 Legal foundation of Data Science

Bernhard Ortner

15.1 Introduction

15.2 Categories of Data

15.3 General Data Protection Regulation

15.3.1 Fundamental Rights of GDPR

15.3.2 Declaration of Consent

15.3.3 Risk-assessment

15.3.4 Anonymization and Pseudonymization

15.3.5 Types of Anonymization

15.3.6 Lawful and Transparent Data Processing

15.3.7 Right to Data Deletion and Correction

15.3.8 Privacy by Design

15.3.9 Privacy by Default

15.4 ePrivacy-Regulation

15.5 Data Protection Officer

15.5.1 International Data Export in Foreign Countries

15.6 Security Measures

15.6.1 Data Encryption

15.7 CCPA compared to GDPR

15.7.1 Territorial Scope

15.7.2 Opt-in vs. Opt-out

15.7.3 Right of Data Export

15.7.4 Right Not to be Discriminated Against

15.8 In a Nutshell

16 AI in Different Industries

Stefan Papp, Mario Meir-Huber, Wolfgang Weidinger, Thomas Treml, Marek Danis

16.1 Automotive

16.1.1 Vision

16.1.2 Data

16.1.3 Use Cases

16.1.4 Challenges

16.2 Aviation

16.2.1 Vision

16.2.2 Data

16.2.3 Use cases

16.2.4 Challenges

16.3 Energy

16.3.1 Vision

16.3.2 Data

16.3.3 Use Cases

16.3.4 Challenges

16.4 Finance

16.4.1 Vision

16.4.2 Data

16.4.3 Use Cases

16.4.4 Challenges

16.5 Health

16.5.1 Vision

16.5.2 Data

16.5.3 Use Cases

16.5.4 Challenges

16.6 Government

16.6.1 Vision

16.6.2 Data

16.6.3 Use Cases

16.6.4 Challenges

16.7 Art

16.7.1 Vision

16.7.2 Data

16.7.3 Use cases

16.7.4 Challenges

16.8 Manufacturing

16.8.1 Vision

16.8.2 Data

16.8.3 Use Cases

16.8.4 Challenges

16.9 Oil and Gas

16.9.1 Vision

16.9.2 Data

16.9.3 Use Cases

16.9.4 Challenges

16.10 Safety at Work

16.10.1 Vision

16.10.2 Data

16.10.3 Use Cases

16.10.4 Challenges

16.11 Retail

16.11.1 Vision

16.11.2 Data

16.11.3 Use Cases

16.11.4 Challenges

16.12 Telecommunications Provider

16.12.1 Vision

16.12.2 Data

16.12.3 Use Cases

16.12.4 Challenges

16.13 Transport

16.13.1 Vision

16.13.2 Data

16.13.3 Use Cases

16.13.4 Challenges

16.14 Teaching and Training

16.14.1 Vision

16.14.2 Data

16.14.3 Use Cases

16.14.4 Challenges

16.15 The Digital Society

16.16 In a Nutshell

17 Mindset and Community

Stefan Papp

17.1 Data-Driven Mindset

17.2 Data Science Culture

17.2.1 Start-up or Consulting Firm?

17.2.2 Labs Instead of Corporate Policy

17.2.3 Keiretsu Instead of Lone Wolf

17.2.4 Agile Software Development

17.2.5 Company and Work Culture

17.3 Antipatterns

17.3.1 Devaluation of Domain Expertise

17.3.2 IT Will Take Care of It

17.3.3 Resistance to Change

17.3.4 Know-it-all Mentality

17.3.5 Doom and Gloom

17.3.6 Penny-pinching

17.3.7 Fear Culture

17.3.8 Control over Resources

17.3.9 Blind Faith in Resources

17.3.10 The Swiss Army Knife

17.3.11 Over-Engineering

17.4 In a Nutshell

18 Trustworthy AI

Rania Wazir

18.1 Legal and Soft-Law Framework

18.1.1 Standards

18.1.2 Regulations

18.2 AI Stakeholders

18.3 Fairness in AI

18.3.1 Bias

18.3.2 Fairness Metrics

18.3.3 Mitigating Unwanted Bias in AI Systems

18.4 Transparency of AI Systems

18.4.1 Documenting the Data

18.4.2 Documenting the Model

18.4.3 Explainability

18.5 Conclusion

18.6 In a Nutshell

19 The authors

Foreword

“Mathematical science shows what is. It is the language of unseen relations between things. But to use and apply that language, we must be able to fully appreciate, to feel, to seize the unseen, the unconscious.” – Ada Lovelace

As Computer Literacy over a generation ago represented a new set of foundational skills to be acquired, Artificial Intelligence (AI) Literacy represents the same for our current generations and beyond. Over the last two decades, Data Science has come to encompass the mathematical architecture and corresponding language with which we build and interact with systems that extend our senses and decision-making abilities. Thus, it's no longer sufficient to be able to send point-and-click commands into computers; rather, it's vitally important to be able to interpret and interact with AI-enabled recommendations coming out of computers. Currently, machines, as in computers coupled with sensors (in the broadest sense), are processing an increasingly wide array of data, including text, images, video, audio, network graphs, and a multitude of information from the web, private industry, and public sector sources. Considering the diversity of data, the authors of this book approach Data Science as a key underlying topic for society and do so with great insight, from multiple key vantage points, and in an enjoyable style that resonates with novices and experts alike.

To gain value from data is arguably the unifying objective of the 21st century knowledge worker. Even professional areas thought of as classically distant from data such as sales and art, now have data-driven sub-areas such as marketing automation and computational design. For the benefit of readers, the authors bring to bear first-hand experiences and diligent research to provide a compelling narrative on how we all have a role to play when attempting to leverage data for better outcomes. Indeed, the breadth conveyed in this work is impressive, spanning that of hardware performance considerations (e.g. CPU, Network, Memory, I/O, GPU) to that of different team member roles when building machines that can find patterns in data. Moreover, the authors provide important coverage on the ways that machines can now see and read, namely, Computer Vision and Natural Language Processing, with implications across nearly every industry area being profound.

As you read this book, I encourage you to be curious and keep in mind a set of questions about how your professional journey, and society as you see it, are currently being impacted by increasingly advanced machines: from the capabilities available on your smartphone to how jobs are being refashioned in the marketplace with automation tools. Here are some questions to help you get started:

       How does the ratio of what tasks you spend your time on shift with the emergence of increasingly advanced machines in your job area?

       What are the implications of having machines that have perceptive abilities analogous to your own, as in to see, hear, smell, taste, touch and beyond?

       How as society do we grapple with bias in and trust around data?

       How do we make the building and the use of machines that learn more inclusive?

       What distinctly human abilities can you accentuate to help organizations that you care about to be more competitive and sustainable?

I’ve been cautious not to use the term thinking machines, or artificial general intelligence, so as to be wary about overstatements. What I would like to focus your attention on is the wide applicability of what we’re seeing coming out of research surrounding machines that have learning capabilities. From my time in laboratories at Columbia and Cornell Universities, to that of the Princeton Plasma Physics Laboratory, the American University of Armenia and NASA-backed TRISH (Translational Research Institute for Space Health) which is collaborating with TrialX, it’s clear that machines can find patterns in data across a tremendously wide range of domains and alert humans in both regular and mission critical contexts. Thus, the impacts to human experience are multi-faceted, and Data Scientists have an important role in supporting the design of systems where human interaction with machine output is positive sum. I can’t underscore enough that a zero-sum approach to automation is sub-optimal. Entrepreneurs, though, tend to find a way toward maximum sum.

With colleagues and through my work at the BAJ Accelerator and Covenant Venture Capital, I support startups to engage in a type of tandem learning: how a rapidly growing company can transform an industry by spotting market gaps to that of how a company’s invention can learn and provide new capabilities for customers. For example, in the powerful technology area of Computer Vision that is a mainstay in Data Science, three companies stand out as trailblazing in three very different industry areas: Embodied, Scylla and cognaize in health-care, security and finance, respectively.

       Embodied’s flagship product, Moxie, is a robot that supports the emotional well-being and social development of children. To do so, Moxie must see and communicate with family members in a compelling way, understanding visually as well as via other cues the emotional state of people it’s interacting with as to engage in meaningful dialogue. Thus, healthcare providers have a new robotic team member to collaborate with. Embodied has been on the cover of TIME Magazine.

       Scylla enables an organization’s security team to be proactive in improving safety. With real-time detection capabilities, camera networks no longer need to be passive and can be transformed into proactive tools. Applications are numerous, from detecting slip-and-falls in hospitals and stadiums as they happen, improving health outcomes, to intruder alerts at manufacturing facilities and office buildings that better protect staff. Scylla has been featured in Forbes.

       cognaize helps financial institutions and insurance organizations process a tremendous amount of unstructured data when making risk determinations. A key insight is considering documents not only as text, but also considering visual information: style, tables, structure. In addition, cognaize has a human-in-the-loop whereby colleagues and the system overall continually learn. cognaize has been featured on the NASDAQ screen in Times Square.

In the above three examples of rising unicorn startups, Data Scientists work in close collaboration with engineers, analysts, designers, content creators, domain specialists and customers to build machines that learn and interact with humans in nuanced ways. The result is a transformation in the nature of work: humans are alerted to the most important documents or moments in time and human experience is learned from to improve quality. This is representative of a new shift requiring AI Literacy, where jobs in nearly every facet of the economy will have aspects requiring machine interaction: humans making corrections, learning new skills, reacting to and interpreting alerts, and having a faster response time in helping other humans leveraging machines in support. In the years ahead, I’m excited about the role of Data Science in interface research, new algorithms and how humans can have a force multiplication on their work.

As I co-wrote the first edition of The Field Guide to Data Science nearly a decade ago, it’s remarkable how much the discipline has advanced both in terms of what has been technically achieved and in an aspirational sense on what is yet to be. The Handbook of Data Science and AI advances the discipline along both of those dimensions and carries the torch forward.

Read on.

Fall 2021

Armen R. Kherlopian, Ph.D.

Preface

“The job of the data scientist is to ask the right questions.” – Hilary Mason

Reading the foreword written for our first publication two years ago, I couldn’t shake the feeling that some trends essentially stayed the same while others emerged all of a sudden and hit society and companies like an avalanche.

Starting with the changes that struck society profoundly, it is obvious that the pandemic is one of them. Setting aside the myriad consequences it had and continues to have on our lives, I want to focus on the facets which relate to the subject of this book: Data Science and AI.

Put simply, the impact there was that entire societies and our whole way of living became data driven in an instant. Key performance indicators like the seven-day incidence rate or forecasts based on pandemic simulations steered our daily life and temporarily even altered basic rights, like the right to leave our homes. This led to discussions and questions, which every Data Scientist with some experience is familiar with and has encountered repeatedly during their working life:

       Can we trust these models and their predictions?

       Is the chosen KPI really the right one for this purpose?

       Is the underlying data quantity and quality good enough?

and so on.

All of these are valid questions and are, just as they were two years ago, fueled by another trend: Digitization. The engine for this is data. On top of that, Data Scientists are still following the same goal:

Giving understandable answers to questions by using data.

Despite all trends, this purpose stays the same and always will be one of the central pillars of doing Data Science.

But this is not the only trend which has remained or become even stronger. The most important, continuing phenomenon is the still massive hype caused by phrases like “Artificial Intelligence” and “Data Science”. While these fields are incredibly valuable and powerful, discussions around them unfortunately often evoke false promises and skewed expectations, which in turn lead to disappointment. Some companies already started large, ambitious initiatives in the past, which led to underwhelming results, because expectations were set too high and timelines too short. For example, fully autonomous driving is one particularly challenging problem to solve.

Nevertheless, Artificial Intelligence remains the hope for many companies. Investors perceive it as a general-purpose technology that can be applied almost anywhere. The situation is comparable with the development during the nineties when all things related to the ‘Internet’ surged. Suddenly, every company needed a web page, and significant investments were made to train web programmers. Nowadays, a similar thing is happening with everything AI related. Again, the investments in AI are enormous, and we have a rush of courses on the topic. In the end, the development concerning the ‘Internet’ led to a vast ecosystem of companies and applications which influence the lives of billions of people in a profound way, and it seems that AI is following a similar path.

This explains, at least partly, another noticeable trend: the further specialization of data science roles with names like “data translator” or “machine learning engineer.” It is a somewhat natural development, as it is a sign that the field is getting more mature, but it also raises the risk of data science responsibilities being scattered across poorly coordinated organizations and thus not reaching their full potential. Chapters 14 and 17 go into this in further detail.

Finally, “Trustworthy AI” is emerging as another highly important movement within Data Science. This is the field of research that aims to tackle some previously unmet needs, like explainability or fairness. It is therefore included as one of the new chapters in this book (Chapter 18).

Given all these trends in Data Science, one of the reasons for founding the Vienna Data Science Group (VDSG) has become even more important over the last two years: to create a neutral place where interdisciplinary exchange of knowledge between all involved experts can take place internationally. We are still very much dedicated to the development of the entire Data Science ecosystem (education, certification, standardization, societal impact study, and so on), both across Europe and beyond.

A product of the exchange in our community can be found in the 2nd edition of this book, which has been vastly expanded to cover topics like AI (Chapter 9), Machine Learning (Chapter 8), Natural Language Processing (Chapter 10), Computer Vision (Chapter 11) or Modelling and Simulation (Chapter 12) in more depth. To follow our goal to educate society about Data Science and its impacts, a very relevant use case was included in Chapter 12: An agent-based COVID-19 model, which aims to give ideas about the potential impact of certain policies and their combination on the spread of the disease.

To provide our readers with a firm foundation, an introduction to the underlying mathematics (Chapter 6) and statistics (Chapter 7) used in Data Science has been included, and finished with a visualization section (Chapter 13).

Although a lot of content has been added, the goal of this book stays the same and has become even more relevant: to give a realistic picture of Data Science.

Because despite all trends, data science remains the same as well: an interdisciplinary science gathering a very heterogeneous crowd of specialists, which is made up of three major streams:

       Computer Science/IT

       Mathematics/Statistics

       Domain expertise in the industry in which Data Science is applied.

Science aims to generate new knowledge, and this is still used to

       improve existing business processes in a given company (Chapter 16)

       enable completely new business models

Data Science is here to stay and its direct and indirect impact on society is growing at a fast pace, as can be seen during the pandemic. In some areas a bit of disillusionment has set in, but this can be seen as a healthy development to counter the hype. Data Science team roles are becoming more differentiated, and more companies are putting Data Science projects into production.

So, Data Science has grown up and is entering a new era.

Fall 2021

Wolfgang Weidinger

Acknowledgments

We, the authors, would like to take this opportunity to express our sincere gratitude to our families and friends, who helped us to express our thoughts and insights in this book. Without their support and patience, this work would not have been possible.

A special thanks from all the authors goes to Katherine Munro, who contributed a chapter to this book and spent a tremendous amount of time and effort editing our manuscripts.

For my parents, who always said I could do anything. We never expected it would be a thing like this. – Katherine Munro

I’d like to thank my wife and the Vienna Data Science Group for their continuous support through my professional journey. – Zoltan C. Toth

When I think of the people who supported me most, I want to thank my parents, who have always believed in me no matter what, and my partner Verena, who was very patient during the last months when I worked on this book. In addition, I’m very grateful for the support and motivation I got from the people I met through the Vienna Data Science Group. – Wolfgang Weidinger

1 Introduction

“Data really powers everything that we do.”

Jeff Weiner

Questions Answered in this Chapter:

       What makes Data Science, ML, AI, and everything else closely connected to generating value out of data so fascinating?

       Why do organizations need a strategy to become data-driven?

       What are some everyday use cases in the B2B or NGO world?

       How are data projects structured?

       What is the composition of a data team?

Data Science and related technologies have been the center of attention since 2010. Various changes in the ecosystem triggered this trend, such as

       significant advancements in processing a vast amount of unstructured data,

       substantial cost reduction of disk storage,

       the emergence of new data sources such as social media and sensor data.

The HBR called the data scientist the sexiest job of the 21st century while quoting Hal Varian from Google.1 Strategy consultants declared data to be the new oil, and there have been occasional “data rushes” where “enthusiasts in data fever” mined new data sources for yet unknown treasures. This book explores data science and incorporates various views on the discipline.

Figure 1.1 Data Science and related technologies on trends.google.com2

1.1 What are Data Science, Machine Learning and Artificial Intelligence?

There are many views on data science, and stakeholders in data science projects may give different answers to what they consider data science to be. Representatives address various aspects and may use different vocabulary since businesses and NGOs, for example, pursue different insights from data science applications. Perhaps the one common denominator is this: Everyone expects data science to deliver some value, which was not there before, with the help of data.

Table 1.1 Various views on Data Science

Definition from Wikipedia: Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data and apply knowledge and actionable insights from data across a broad range of application domains.3

Application-centered view: We collect data and put it into pandas DataFrames or data frames in RStudio. We also use tools such as TensorFlow or Keras. Our goal is to use these tools to explore the data.

Platform-oriented view: We create value from the data that we loaded onto our SaaS platform in the cloud. Then, depending on the provided data and its structures, we store it in different storage containers, such as blob storage and distributed databases.

Evangelist-oriented view: Data science was the next big thing in 2015. Now, you should look at more specific applications. Looking at the Gartner charts, invest your time exploring cutting-edge trends such as neuromorphic hardware or augmented intelligence.

Management-oriented view: These are the ways of working to bring our company into the 21st century as a data-driven enterprise. During and after our transition, we will penetrate new markets and monetize data as a service.

Career-oriented view: As a senior data scientist at a major company, I can earn a six-digit yearly salary and explore interesting fields in corporate labs.

Use case-oriented view: Tell me your business problem, and we will tell you how we solved it for another customer. From fraud detection to customer retention to social network analysis, feel free to check out our catalog of possible analytics applications.

Entrepreneurial/Optimistic view: Data Science is one way to change the world. Using Data Science, we can prevent climate change and fight poverty and hunger on a global scale.

Pessimist view: Data Science is one way to change the world. But, unfortunately, power-hungry people will use it to spy on us and suppress us. So Big Brother will be watching you.

Statistician’s view: Data Science is just a buzzword. It is just another word for statistics. We might call it statistics on steroids, maybe. But in the end, it’s just another marketing hype to create another buzzword to sell services to someone.

The essentials of data science lie in mathematics. Data scientists apply statistics to generate new knowledge from data. Besides using algorithms on data, a data scientist must understand the scientific process of exploring data, such as creating reproducible experiments and interpreting the results.

There are many different terms related to data science. For example, professionals talk about Artificial Intelligence, machine learning, or deep learning. Sometimes experts also talk about related terms such as analytics or business intelligence and simulation. In the following chapters, we will detail and highlight how we distinguish between analytics and data science. We will also highlight various data science applications, such as gaining insights into a text through Natural Language Processing or extracting objects from images via object recognition or modeling railway networks for optimal pathfinding.

Data Science as Part of a Cultural Shift

Suppose you apply for a job as a data scientist in a company. Imagine that the HR department of this company rejects you because an astrology chart, based on the data you provided in your CV, does not match the position (however unlikely such an answer may be).

Humans decide on what they believe is right. But, unfortunately, human judgment is flawed through bias4, and we have mechanisms, such as confirmation bias, which assure us that we cannot err. For example, some people believe in the flat Earth theory or hollow Earth theory, which shows how powerful mechanisms such as confirmation bias can be.

For many of us, it would be disastrous to realize that a comfortable binary view of the world divided into black and white, good and evil, and right and wrong often does not work out. Modern sociological ideas such as constructivism5 are more connected to data science than many think. The idea is that everyone constructs a reality based on their experience. Within the framework of “our reality”, including its rules and conventions, we make decisions. According to studies, it is not uncommon that we are deeply convinced that we are right even if our choices are questionable to others. For example, suppose we have created mental models for ourselves in which we are confident that astrology must be correct. In that case, it is logical to assume that using zodiac signs for personnel decisions will improve the hiring process. At the same time, people with strong religious beliefs might run into conflicts if they ignore what they might call signs or messages from God. Thanks to the biases mentioned above, our belief systems are often set in stone.

Data Science is not just a method to extract value from data; it also has the potential to be a method for making decisions that avoids or reduces human bias in the process. However, as will be shown in Chapter 18 on Trustworthy AI, data alone cannot solve the problem, because historical data and the model building process itself are often imbued with the very same biases. With that, business leaders can integrate data science and transparent, non-discriminatory practices into corporate culture, and this will substantially impact the company’s DNA. For example, a bias-aware company will adjust processes. Hiring a new employee is a good example. Many companies enlarge the hiring teams that decide on the outcome of candidate interviews in order to ensure that the bias of a single interviewer will not affect a hiring decision too much. In modern hiring processes, data science can be used to generate predictions about candidates to assist the decision-making process. If done with care, these model predictions can help to minimise biases in employment decisions.

In the beginning, every judgment is a theory. A theory is neither right nor wrong but inconclusive until it is proven or disproven.

Therefore, the positive effect of hiring personnel using astrological zodiacs would be nothing more than a theory. As long as we cannot prove that an astrological assessment would benefit a hiring process, the statement is inconclusive and, therefore, not recommended for use. Calling astrology inconclusive rather than wrong might also make the discussion with believers in astrology less emotional.

Investigating the possible effects of astrology using data science is a perfect introduction to the environment we face in data science projects. Astrology claims to divine information about human affairs and terrestrial events by studying celestial objects’ movements and relative positions. In a simplified version, astrology reduces everything to the sun sign, depending on birthdays. Using the simplified model, we could collect data on existing data scientists to determine a correlation between astrological signs and professions. In addition, we could collect the birthdate of a large pool of data scientists. As we need only a birth date and no other personal data, it would even be perfectly legal to collect these datasets from LinkedIn or any other data source containing data scientists’ birthdates. Most of the analysis will consist of finding appropriate data sources, collecting the data from the data source, anonymizing it, and preparing it for examination.

Mathematics on the collected data will not leave much room for interpretation of results. Based on the analysis, we could then conclude whether there is a correlation between professions and astrological signs.
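As a rough illustration, here is a minimal sketch of how such an independence check could look in Python, using SciPy’s chi-squared test on a contingency table of professions versus sun signs. The counts below are synthetic and generated under the null hypothesis; a real study would build the table from the collected birthdates.

```python
# Minimal sketch: test whether profession and sun sign are independent.
# The counts are synthetic placeholders, not real survey data.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)

# Rows = professions, columns = the twelve sun signs, drawn under the
# null hypothesis that sign and profession are independent.
professions = ["data scientist", "accountant", "nurse"]
counts = rng.multinomial(1200, [1 / 12] * 12, size=len(professions))

chi2, p_value, dof, _ = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.3f}, dof = {dof}")

# A large p-value means the data give no evidence of an association,
# so the theory remains inconclusive, exactly as discussed above.
if p_value >= 0.05:
    print("No significant association between sun sign and profession.")
```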

There is, however, a more complex form of astrology. Astrological charts include all planets and other celestial objects such as Lilith, the black moon, which does not exist in astronomy. In addition, many constellations are contradictory. An astrologer might call a person impulsive because of Venus or Lilith in Aries or passive because of Mars in Cancer. Finally, an astrologer might claim that readings require intuitive interpretations, which are, of course, not measurable.

Many data science projects might end with the assessment that there is insufficient data for a definite answer, and being unable to prove or disprove a theory might be unsatisfactory for many stakeholders. Yet, exploring data often helps bring clarity to the stakeholders, as at least many learn that achieving objective truth is not as easy as it seems. Therefore, we should be free to differ in personal or subjective beliefs and be cautious about things we cannot verify objectively. Of course, in the end, there is a good chance that we are right with our personal views if we have spent a lot of time exploring a specific field, even if we cannot prove it. Still, as long as we do not have enough data to prove something one way or another, it is a question of academic politeness to highlight inconclusive outcomes because of insufficient data when talking with others.

As early as 2014, the New York Times6 wrote about the 80/20 rule. This rule means that teams spend 80 % of their time finding and preparing data for data science projects and only 20 % on analytics. This number may vary enormously by industry. In addition to data modeling, we will also address the preparation and management of data in the chapters to follow. We aim to provide a compact introduction to data platforms and engineering.

In the second part of this book, we assume all the data is prepared and ready and focus on analytics. We will present several ways to generate value from data and cover essential topics such as neural networks and machine learning. We will also cover basics such as statistics.

The third and last part of the book is about the application of data science. Here we cover business topics and also address the subject of data protection.

Machine Learning and Deep Learning

Starting from Chapter 6, we will detail the differences between these frequently buzzed-about concepts. Still, as using these terms related to data science often creates confusion, we would like to outline them for you here.

In recent years, many companies prioritized processing vast amounts of data. Consequently, scientific processing, such as formulating the working hypothesis, was pushed into the background. Big Data tries to solve problems with a sufficiently large amount of computer power and data. This fact creates a productivity paradox: More data and better algorithms do not make us more productive; instead, the opposite is often true, as it becomes increasingly difficult to distinguish the signal from the noise. The signal is the information relevant to a question and thus contributes to answering it, while the noise is the irrelevant information.

We attempt to make these signals measurable by evaluating how precisely an algorithm detects the signal (precision) and how much of the actual signal it finds (recall). The harmonic mean of both measurements, the F1 score, expresses the algorithm’s detection quality as a percentage. A high F1 score means a precise answer, while values around 50 % represent a near-random result. Similarly, if an algorithm has an accuracy of, say, 90 %, it means that 90 % of all information is processed correctly.

This number may sound like a lot; however, data with a large volume is the norm in Big Data. For example, imagine we want to classify comments to find hate speech in social media. Let’s say that 510,000 comments were posted per minute on Facebook in 2018. Assuming that 10 % were classified incorrectly, we might fail to detect hate speech in 51,000 posts every minute.
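To make these measurements concrete, here is a small, self-contained sketch with invented counts for a hypothetical binary hate-speech classifier; precision, recall, and the F1 score follow their standard definitions.

```python
# Illustration of precision, recall, and F1 for a binary classifier
# (1 = hate speech, 0 = harmless). The counts are invented.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)  # share of flagged posts that are truly hate speech
    recall = tp / (tp + fn)     # share of actual hate speech that was flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of both
    return precision, recall, f1

# Example: the classifier finds 900 of 1,000 hate-speech posts (tp),
# misses 100 (fn), and wrongly flags 50 harmless posts (fp).
p, r, f1 = precision_recall_f1(tp=900, fp=50, fn=100)
print(f"precision = {p:.1%}, recall = {r:.1%}, F1 = {f1:.1%}")

# Even 90% recall leaves 10% of the signal undetected; at social-media
# volumes, this quickly amounts to tens of thousands of missed posts.
```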

To avoid such a situation, deep learning, a group of machine learning algorithms based on neural networks, is currently being applied as an abstract solution to many problems. The advantage of deep learning over classical machine learning is that the former usually scales better with the amount of data and thus provides more accurate results and can be applied to various problems.

The disadvantage of some methods in machine learning is that it can be challenging to interpret a prediction because the solution path is not immediately comprehensible. Furthermore, a statistically generated prediction may or may not be correct, as most models usually have less than 100 % accuracy. Additionally, statistical forecasts cannot be applied, or can be applied only to a limited extent, to data that was not adequately represented in the analysis. This statement may seem trivial, but it is essential since statistical analysis primarily depends on the input data and thus on the modeling skills of the data scientist. It is, therefore, necessary to interpret the result correctly and not to take it as truth.

An excellent example of this is numerical weather forecasting. We know the fundamental physical laws as differential equations, but false predictions repeatedly occur due to non-existent or incorrect data or a simplified model. For example, a result of the solved differential equation can be: “Tomorrow the probability of rain is 10 %”. Statistically, this means that we have created an analytical model based on historical data, and in 10 % of the past cases with matching input data, it rained. 10 % can be a lot or very little; the important thing is to have an appropriate reference set and relate the result to it. In this case, it means that it is quite possible, although not likely, that it will rain tomorrow.
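The frequentist reading of such a forecast can be sketched in a few lines: among historical records whose input conditions match tomorrow’s, we count how often it actually rained. The data and column names below are invented for illustration only.

```python
# Toy sketch of the frequentist reading of "the probability of rain is X %":
# the share of historical days with matching inputs on which it rained.
import pandas as pd

# Invented historical weather records (the reference set).
history = pd.DataFrame({
    "pressure_falling": [True, True, False, True, False, True],
    "humidity_high":    [True, True, True,  True, False, True],
    "rained":           [False, True, False, False, False, False],
})

# Select past days whose conditions match tomorrow's forecast inputs.
matching = history[history["pressure_falling"] & history["humidity_high"]]
rain_probability = matching["rained"].mean()  # 1 rainy day out of 4 -> 25 %
print(f"Empirical rain probability: {rain_probability:.0%}")
```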

Figure 1.2 Differences (https://ai.plainenglish.io/data-science-vs-artificial-intelligence-vs-machine-learning-vs-deep-learning-50d3718d51e5)

Artificial Intelligence

When most people think of AI, they might imagine computers taking over the world, as in Terminator.

Artificial Intelligence is the simulation of human intelligence processes by machines, especially computer systems. There is an overlap between Machine Learning and Data Science, but AI can still be seen as separate from both disciplines.

In Chapter 9, we explore Artificial Intelligence in detail. We explain the relation to Data Science and give a brief overview of the history of AI. We also discuss the problems that one may encounter when using Data Science skills to develop AI. In particular, we provide five pieces of advice: Be pragmatic, Make it easier for machines to learn through inductive biases, Perform analytics before creating AI architecture, Watch for the intelligence scaling trap and Watch for the generality trap. In this chapter you will have a chance to learn how to avoid mistakes and how to effectively use your Data Science tools to create AI solutions. After reading this chapter you will understand well where the limitations of AI technology are today and how to cope with those limitations.

1.2 Data Strategy

Some experts say that only companies with a data strategy have a future. We might agree or disagree with this assessment. However, everyone will admit that not every company feels the pressure to become data-driven. Many departments work mainly with pen and paper, in monopolies with no pressure to evolve or optimize processes. The figure below is just one of many models that can be found with web research to highlight different stages of a transformation from a non-data company to an entirely data-driven one.

As data maturity depends a lot on external pressure, companies often migrate in phases. When the competition gets fiercer, the market forces companies to innovate. However, the luxury of resisting change in the absence of market pressure can also lead to different forms of stress: some monopolies face the problem that no vendor still supports the legacy software they have used for decades.

Introducing data science in organizations, no matter if it is a business, NGO, or governmental institute, starts in most cases with a mission statement. For example, for a global car manufacturer, a strategy could be formulated as follows:

“Our company aims to be the cost leader in the global supply chain by 2025. This measure enables us to bring electric mobility to the mass market with less cost than our competitors. For us to achieve this, we have to cut our supply chain costs by 20 %.”

Other companies simplify the strategy inspired by John F. Kennedy’s speech on landing a man on the moon and returning him safely to the earth within a decade.

“Before this decade is out, all of our manufactured vehicles will be driverless.”

Figure 1.3 Data Maturity Model7

An NGO might have less profit-oriented but no less ambitious goals.

“With the help of our donors, we will use satellite images to explore dry areas in countries to find water points. Using that technology, we hope to be able to decrease the pain of gaining access to water in developing countries.”

The recommended practice for companies is to have an owner for data topics. Commonly, this is the role of a Chief Data Officer (CDO), who needs to ensure that the company can realize its vision with the help of gaining insights from data.

Many companies have established processes to explore the past through business intelligence. For example, in maybe the most classic reference case, retail companies analyze how many products they have sold in the past. As a result, they can learn about which stores did a better or worse job. Based on the insights, leadership can then make changes such as replacing key personnel in poorly performing areas or creating additional incentives for growth in other areas.

Many companies have already reached a high level of optimization through traditional analytics. And it often seems as if conventional methods are at their limits.

Data science often helps to generate new knowledge. In other words, instead of using data science to sell more products, companies often use it to create new products. For example, while traditional analytical methods improve numbers, you get new numbers to work with through data science.

Once a CDO has proposed a strategy to meet the corporate goals, the board will approve the plan and allocate a budget. Using that strategy, the CDO then pools together the various department heads to realize the objective. Then, after a fit/gap analysis of the current situation, they will create hiring plans and plan projects to achieve their goals.

Figure 1.4 The Gartner Analytic Continuum (Source: https://twitter.com/Doug_Laney/status/611172882882916352/photo/1)

This positioning within the business also clarifies the role of IT. The CIO is in charge of providing the necessary platforms to enable the teams of the CDO, but IT does not own the data science topic itself. Therefore, the CIO has to assess whether the current IT infrastructure meets the demands of the data strategy, and if not, they must come up with a plan to create the required platforms.

1.3 From Strategy to Use Cases

Implementing a strategy defines how a business interprets data and the modeling based on it. Based on the strategy, a company can decide which questions the data scientists must answer. Based on these questions, solution architects can design platforms to host the data and data engineers can determine from which data sources they have to extract data.

Most companies have cross-functional teams for data science projects. They work in an agile team to explore new use cases and methods to apply data science.

Without qualified professionals, a company cannot even begin to implement its ambitious plans. Therefore, we want to look first at how data teams could appear from the project’s view. In a corporate world, many of these team members would report to a different department.

1.3.1 Data Teams

We need data professionals to implement a data science strategy or to build up a data-driven start-up. Two groups of experts have evolved in the data world.

The first group, people with a statistical background, usually have academic experience and create models to answer the departments’ questions. The second group consists of people with an engineering background. They are responsible for fully automating the loading of data onto the platform and for continuously running the developed models on that data in the production environment.

In organizations, these two groups have different reporting lines: Business and IT. In most companies, data agendas are a part of the top management board. Therefore, data is associated with the business. Some companies establish the role of a CDO, who directly reports to the CEO and the board. Others create a position, such as Head of Data or Head of Data Science. The authors of this book believe that data should be part of the board. Therefore, we refer to the CDO as the ultimate leader of all data agendas, whereas we refer to the CIO as the position accountable for all IT agendas.

Figure 1.5 shows one of many models describing the different roles per activity and department. Please be aware that we do not cover all roles in detail in this chapter. We cover this topic in more detail in Chapter 14.

Figure 1.5 Role distribution in data programs (Source: https://nix-united.com/blog/data-science-team-structure-roles-and-responsibilities)

1.3.1.1 Subject Matter Expert (Domain Expert)

The SME is an essential person for a data project. Still, this person is often not shown in data teams. A subject matter expert understands, from the inside out, how the company provides its services to its clients. They are often also referred to as a domain expert.

An SME is someone who has been performing a day-to-day job for a long time. For example, in a retail organization, a perfect SME might be the person who has been working in a supermarket in different roles for multiple years. They have seen almost every imaginable scenario and have a good gut feeling about what clients want. They might also spot potential side effects of changes that no one without experience in the field could see.

In some industries, the role of an SME overlaps with an analyst. Finance is a good example. A credit analyst takes all data from a client who applies for credit and calculates the credit risk using a given formula. Unlike data scientists, analysts do not generate new knowledge. However, analysts work with numbers and have a deeper understanding than other types of SMEs.

In an NGO, an SME might be a development worker who fights poverty and plagues in developing countries or works in refugee camps. Therefore, an NGO SME might have a completely different view of what is missing on-site and feasible than those who watch the situation remotely.

SMEs are also often natural authorities in their fields due to their long-term experience. If, for example, a company wants to install a new IT system or new processes on-site, the support of SMEs can be crucial for its successful deployment, as less experienced employees in the field often look up to them.

The actual duties of the SME, therefore, depend on the area of operation but generally include the following activities:

       Provide insights on the existing challenges

       Provide access to possible data sources

       Help formulate goals

       Assist in the release of products and verify their successful outcome

       Guide users towards adopting the new system.

1.3.1.2 Business Analyst

Many projects need a business analyst who acts as a bridge between SMEs and data scientists. The critical skill of a business analyst is to ask the right questions. Their job is to find out which activities make sense from a business perspective.

In start-ups, a business analyst helps to formulate the business plan and the value proposition, showing how the business can make a profit and how success can be measured.

Business analysts, therefore, dedicate their time to the following activities:

       Write business plans

       Analyze business requirements

       Translate business requirements into work packages for the data team

1.3.1.3 Data Scientist

There is a debate about how much statistics a data scientist should understand. Purists claim that you can only be a “real data scientist” if you have a Ph.D. and know scientific methods and statistics inside and out. They sometimes call everyone else “fake data scientists.”

Many modern views differ and see a data scientist as an expert who puts the data into use and creates something new. For example, she can discover a new relationship in the data and build models. It is essential to highlight that good communication and programming skills are helpful to achieve this.

Data scientists should be as versatile as the data they are working with and open to learning about new domains and collaborating with experts from different fields. For example, working with and analyzing imaging data requires specific knowledge in Computer Vision, image processing, and machine learning, as well as specific domain knowledge of differential geometry or medicine. It is important to understand how data are acquired, which false interpretations are possible, and whether an expert is required to create a baseline or to evaluate the designed models (for example, annotations of specific tumor tissue in a computed tomography scan by a medical doctor). In Chapter 11 you will get a deeper insight into the field of Computer Vision and how to work with imaging data as a data scientist.

All in all, every data scientist will have some understanding of science and statistics. But, as with the many autodidactic programmers who never studied software engineering, much can be self-taught. A data science team often consists of people with diverse skill sets: while some members are top-notch mathematicians, others complement them with stronger communication or programming skills and contribute just as much to the outcome.

Mathematics and Statistics

Mathematics and statistics are still the basis of everything we do. Therefore, we dedicate Chapters 5 and 6 to these topics: recapping the basics of probability theory, explaining confidence intervals, and showing how to determine mathematically whether an idea holds.
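
To make this concrete, here is a minimal sketch of computing a 95% confidence interval for a sample mean with NumPy and SciPy; the sample itself is randomly generated for illustration.

import numpy as np
from scipy import stats

# Generate an invented sample of 50 observations for illustration
rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=100.0, scale=15.0, size=50)

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

# 95% confidence interval for the mean, based on the t-distribution
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"Mean: {mean:.2f}, 95% CI: [{ci_low:.2f}, {ci_high:.2f}]")

If such an interval excludes a hypothesized value, that is evidence against the corresponding hypothesis; Chapters 5 and 6 treat this reasoning rigorously.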

The main tasks of data scientists are exciting, sometimes challenging, and highly diverse.

       First, we must prepare our data, often liaising with other departments, such as information systems, and harmonizing various data sources. In many organizations, this is the job of the Data Engineer, especially if these steps need to be automated and have strong SLA requirements.

       Then we engage in exploratory statistical analyses, interpret the results, and use these to gain domain knowledge and conduct further preliminary data investigations.

       Based on these findings, we curate a data set and feed this to a machine learning algorithm, such as those mentioned above, to build a model for a specific task.

       The trained model is tested and fine-tuned to the point where we can use it productively: its outputs, usually predictions for unseen cases, will be acted upon by the data science team and other stakeholders in the company. (A minimal code sketch of this workflow follows the list.)
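
The following is a minimal, illustrative sketch of this workflow in Python with pandas and scikit-learn. The file name, the feature columns, and the “churned” label are invented, and we assume purely numeric features.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Prepare the data (file and column names are hypothetical)
df = pd.read_csv("customer_data.csv").dropna()

# 2. Explore: summary statistics guide further investigation
print(df.describe())

# 3. Curate a data set: numeric features and the label to predict
X = df.drop(columns=["churned"])
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 4. Train a model and test it on unseen cases
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

In practice, each step is far more involved; the sketch only shows how the stages connect.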

Of course, this process is not a one-time effort. Data and models must be continuously monitored (and often, continuously retrained) to ensure performance remains at an acceptable level. New research projects must be undertaken based on the company’s innovation roadmaps, triggering this process to begin again. As we answer business questions through data, progress and results must be communicated to various departments, often in sophisticated visualizations and presentations (see Chapter 13, ‘Visualisation’).

We will say much more about the job of data scientists throughout this book. Data scientists play an essential role in the development of AI solutions (see Chapter 9), but also in the domain of modeling and simulation (see Chapter 12).

1.3.1.4 Data Engineer

Data engineers build and optimize data platforms so that data scientists and analysts have access to the appropriate data. In addition, they load data into the data platform according to the policy set by the architect.

Data engineers implement this activity using data pipelines: they load data from third-party systems, transform the data, and then store it on the platform. A data pipeline must scale with increasing data volumes and be robust, which means it needs corresponding fault tolerance. It thus forms the foundation that data scientists and analysts can use to generate knowledge.

Unlike other team members, data engineers must have solid programming skills. Most importantly, a data engineer needs to understand the principles of distributed computation and how to write code that can scale. Thus, the data engineer has a fundamental role in every data science team.
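
To illustrate such a pipeline, here is a minimal extract-transform-load sketch in Python with pandas. The source URL, table name, and SQLite database are invented; a production pipeline would add scheduling, scaling, and monitoring on top of the simple retry shown here.

import sqlite3
import time

import pandas as pd

SOURCE_URL = "https://example.com/orders.csv"  # hypothetical third-party source

def extract(retries: int = 3) -> pd.DataFrame:
    # Fault tolerance: retry transient network failures with backoff
    for attempt in range(retries):
        try:
            return pd.read_csv(SOURCE_URL)
        except OSError:
            time.sleep(2 ** attempt)
    raise RuntimeError("Source unavailable after retries")

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Harmonize: drop duplicates and normalize column names
    df = raw.drop_duplicates()
    df.columns = [c.strip().lower() for c in df.columns]
    return df

def load(df: pd.DataFrame) -> None:
    # Store the analytical data set on the platform (here: local SQLite)
    with sqlite3.connect("platform.db") as conn:
        df.to_sql("orders", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    load(transform(extract()))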

Core activities include:

       Building various interfaces to enable the reading and writing of data

       Integrating internal or external data into existing pipelines

       Applying data transformations to create analytical datasets

       Monitoring and optimization to ensure the continuous quality of the system (and to improve it if necessary)

       Developing a loading framework to load data efficiently

1.3.1.5 DevOps

DevOps is a role that requires a mixture of development and operations skills. The task of DevOps engineers is to operate the data platform upon which the data engineers and data scientists work.

DevOps engineers implement the architectural design for a project or system and address the change requests made by the data engineers. With the emergence of cloud systems, DevOps engineers have gained popularity and become a scarce resource in many projects.

Their activities include:

       Scaling data platforms

       Identifying performance problems in the software

       Automating redeployments

       Monitoring and logging applications (a minimal monitoring sketch follows this list)

       Identifying resource bottlenecks and problems

       Remediating issues that occur during system operations
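
As a small illustration of the monitoring and bottleneck-detection duties, the following sketch polls system metrics with the third-party psutil library and logs a warning when usage crosses a threshold. The thresholds and polling interval are invented.

import logging
import time

import psutil  # third-party library for reading system metrics

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

CPU_LIMIT = 90.0     # percent; hypothetical thresholds
MEMORY_LIMIT = 85.0  # percent

def check_resources() -> None:
    cpu = psutil.cpu_percent(interval=1)
    mem = psutil.virtual_memory().percent
    if cpu > CPU_LIMIT or mem > MEMORY_LIMIT:
        logging.warning("Possible bottleneck: CPU %.1f%%, memory %.1f%%", cpu, mem)
    else:
        logging.info("OK: CPU %.1f%%, memory %.1f%%", cpu, mem)

if __name__ == "__main__":
    while True:
        check_resources()
        time.sleep(60)  # poll once per minute

In practice, dedicated tools cover such monitoring; the point is only to show the kind of automation a DevOps engineer builds.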

1.3.1.6 Solution Architect

In the end, someone has to be accountable for everything running smoothly. Only then can the data scientists do their job, and the users can create business value by using the applications developed during the data strategy implementation. In large organizations, this is the solution architect.

Someone must ensure that the proper hardware infrastructure is in place, that appropriate data management and processing software is selected, that data is protected against misuse and theft, and finally, that the data scientists and end-users of a system can do their work.

Many organizations split this responsibility across multiple roles:

       A data architect focuses on data and how data is stored. In addition, she takes care of metadata management and the definition of processes to load data into data management software such as databases or object stores.

       A systems or infrastructure architect focuses on servers and hardware and ensures the hardware is available. If the company hosts the solution in the cloud, they refer to this role as a ‘cloud architect.’

       A data steward or data manager is responsible for ensuring that the project follows the appropriate corporate policies.

       A security architect protects the system against hackers and other intrusion attempts.

In reality, it is hard to isolate these various engineering roles. A data platform must serve multiple purposes and meet multiple functional and non-functional requirements. Without knowing the software, one cannot make a hardware decision, and numerous data platforms have specific hardware requirements. Therefore, there needs to be a generalist who understands the whole picture and can lead the other architects to build cost-effective, scalable, robust, and fast solutions.

In large companies, a CIO leads all streams to create standards that apply to every project, and such companies have their own frameworks or business units to provide platforms to other departments. A solution architect must therefore often consider corporate politics as another factor in building the best platform for their project. In small companies, there are usually fewer restrictions but more chances to fail with a wrong strategy. Chapter 17, ‘Mindset and Community’, also explores a risk known as the ‘Swiss Army knife’, which might apply to a solution architect in a small company: many small companies end up with one person being the single expert for multiple engineering domains.

In many organizations, the result is that one person with a diverse skill set and broad knowledge is fully accountable for realizing the solution. Depending on the size of the project or company, they might be able to delegate responsibilities, but they often still have to cover multiple roles and thus become a bottleneck.

Typical tasks of a solution architect are:

       As the person accountable for the solution, decide on all parameters or lead the decision-making process. These parameters include, among other things, hardware, operating systems, data management software, data processing, user experience, scalability, and cost-effectiveness.

       Ensure that the project meets all requirements and that the project team has everything it needs to build the solution for the ultimate end-users.

       Lead other architects and engineers to implement the solution.

       Ensure that all solutions meet corporate standards for all projects, such as data protection standards.

1.3.1.7 Other Roles

We have not covered BI engineers and business data owners here. In agile teams, we often also add a Scrum Master to the team.

We will outline in Chapter 16 that data teams may face quite different requirements in different industries. Also, small companies and start-ups have different needs than large enterprises. This diversity means there is no single definition of how a data team has to be structured; various roles will exist in one team but not in others.

Data teams in large organizations, especially with regulatory requirements, will incorporate roles such as data managers, security experts, data stewards and more.

1.3.1.8 Team Building

The structure of the team and the operating model depend largely on the company’s data maturity level. In many cases, some team members first have to clear out old legacy systems before creating something new. In some companies, leaders assign individuals to multiple teams.

The success of teams also depends a lot on the corporate culture. We will go into more detail on this in Chapter 17, ‘Mindset and Community.’ Setting up a data-driven organization is the focus of Chapter 14.

1.3.2 Data and Platforms

In most companies, data is currently fragmented: it exists horizontally across different departments and vertically coupled to various functions and silos. In addition, the proportion of critical information generated outside the usual processes is growing. Part of a data strategy, then, is to create a process that can handle various data formats and convert them into a structured and processable format. In this process, we can explore four different properties:

       Volume: Describes the amount of data that organizations collect through daily business processes. Volume is expressed in orders of magnitude, such as gigabytes, terabytes, or petabytes.

       Velocity: