Data Science for Decision Makers
Enhance your leadership skills with data science and AI expertise
Jon Howells
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Ali Abidi
Publishing Product Manager: Tejashwini R
Book Project Manager: Hemangi Lotlikar
Content Development Editor: Joseph Sunil
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Proofreader: Joseph Sunil
Indexer: Rekha Nair
Production Designer: Ponraj Dhandapani
DevRel Marketing Coordinator: Vinishka Kalra
First published: June 2024
Production reference: 1190624
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK
ISBN 978-1-83763-729-4
www.packtpub.com
To my mother and father, Caroline and Robert, for instilling in me the values of education and constant curiosity. To my partner, Yeshica, for your unwavering support, and to my sister, Felicity, for your keen eye in reviewing and shaping this book.
– Jon Howells
Contributors
About the author
Jon Howells, director of AI consultancy QualifAI, is a data science and machine learning professional with over a decade of experience in the consumer goods, market research, and public sectors. He has worked within consultancies including KPMG and Capgemini and with multinational clients such as Unilever and Permira, as well as public sector bodies such as the UK Home Office and the US Food and Drug Administration (FDA).
With an MSc in computational statistics and machine learning from UCL, Jon specializes in applying large language models (LLMs) to consumer-focused businesses, leveraging them for consumer research, personalized content generation, and enhanced customer support. His expertise helps businesses better understand and engage with their customers, driving innovation and unlocking the potential of data-driven decision-making.
About the reviewer
As a principal architect at T-Mobile, Tanmaya Gaur has more than 10 years of web development experience and a passion for delivering technical and architectural leadership for key technology initiatives and business capabilities. Most recently, he has been instrumental in shaping the architecture of T-Mobile’s primary CRM solution, which is built using a modular micro-frontend architecture and enhances the digital experience for its care representatives and customers.
His expertise in web, infrastructure, and microservices enables him to design and deliver scalable solutions that are performant, secure, and resilient. He works closely with other business and IT partner teams in a highly collaborative environment and is committed to driving the best customer experience across mobile, desktop, point-of-sale, and other emerging devices.
Table of Contents
Preface
Part 1: Understanding Data Science and Its Foundations
1
Introducing Data Science
Data science, AI, and ML – what’s the difference?
The mathematical and statistical underpinnings of data science
Statistics and data science
What is statistics?
Descriptive and inferential statistics
Sampling strategies
Probability
Probability distribution
Conditional probability
Describing our samples
Measures of central tendency
Measures of dispersion
Degrees of freedom
Correlation, causation, and covariance
The shape of data
Probability distributions
Discrete probability distributions
Continuous probability distributions
Summary
2
Characterizing and Collecting Data
What are the key criteria to consider when evaluating datasets?
Data quantity
Data velocity
Data variety
Data quality
First-, second-, and third-party data
First-party data – the treasure trove within
Second-party data – building bridges through collaboration
Third-party data – broadening horizons with external expertise
Structured, unstructured, and semi-structured data
Structured data
Unstructured data
Semi-structured data
Methods for collecting data
Storing and processing data
Cloud, on-premises, and hybrid solutions – navigating the data storage and analysis landscape
Cloud computing – scalable services in the cloud
On-premises – maintaining control within your walls
Hybrid – the best of both worlds?
Data processing
Summary
3
Exploratory Data Analysis
Getting started with Google Colab
What is Google Colab?
A step-by-step guide to setting up Google Colab
Understanding the data you have
EDA techniques and tools
Descriptive statistics
Data visualization
Histograms
Density curves
Boxplots
Heatmaps
Dimensionality reduction
Correlation analysis
Outlier detection
Summary
4
The Significance of Significance
The idea of testing hypotheses
What is a hypothesis?
How does hypothesis testing work?
Formulating null and alternative hypotheses
Determining the significance level
Understanding errors
Getting to grips with p-values
Significance tests for a population proportion – making informed decisions about proportions
The z-test – comparing a sample proportion to a population proportion
Z-test example made easy
Significance tests for a population average (mean)
Writing hypotheses for a significance test about a mean
Conditions for a t-test about a mean
When to use z or t statistics in significance tests
Example – calculating the t-statistic for a test about a mean
Using a table to estimate the p-value from the t-statistic
Comparing the p-value from the t-statistic to the significance level
One-tailed and two-tailed tests
Walking through a case study
Summary
5
Understanding Regression
How can I benefit from understanding regression?
Introduction to trend lines
Fitting a trend line to data
Estimating the line of best fit
Calculating the equations of the lines of best fit
Interpreting the slope of a regression line
Interpreting the intercept of a regression line
Understanding residuals
Evaluating the goodness of fit in least-squares regression
Summary
Part 2: Machine Learning – Concepts, Applications, and Pitfalls
6
Introducing Machine Learning
From statistics to machine learning
What is machine learning?
How does machine learning relate to statistics?
Why is machine learning important?
Customer personalization and segmentation
Fraud detection and security
Supply chain and inventory optimization
Predictive maintenance
Healthcare diagnostics and treatment
The different types of machine learning
Supervised learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning
Transfer learning
Popular machine learning algorithms
Linear regression
Logistic regression
Decision trees
Random forests
Support vector machines
k-nearest neighbors
Neural networks
The machine learning process
Training a supervised machine learning model
Validation of a supervised machine learning model
Testing a supervised machine learning model
Evaluating machine learning models
Risks and limitations of machine learning
Overfitting and underfitting
Bias and variance
Balanced dataset
Models are approximations of reality
Machine learning on unstructured data
Natural language processing (NLP)
Computer vision
Deep learning and artificial intelligence
Artificial intelligence
Deep learning
Summary
7
Supervised Machine Learning
Defining supervised learning
Applications of supervised learning
The two types of supervised learning
Key factors in supervised learning
Steps within supervised learning
Data preparation – laying the foundation
Algorithm selection – choosing the right tool
Model training – learning from data
Model evaluation – assessing performance
Prediction and deployment – putting the model to work
Characteristics of regression and classification algorithms
Regression algorithms
Classification algorithms
Key considerations in supervised learning
Evaluation metrics
Applications of supervised learning
Consumer goods
Retail
Manufacturing
Summary
8
Unsupervised Machine Learning
Defining UL
Practical examples of UL
Steps in UL
Step 1 – Data collection
Step 2 – Data preprocessing
Step 3 – Choosing the right model
Step 4 – Training the model
Step 5 – Interpretation and evaluation
In summary
Clustering – unveiling hidden patterns in your data
What is clustering?
How does clustering work?
k-means clustering
Practical applications of clustering
Evaluation metrics for clustering
In summary
Association rule learning
What is association rule learning?
The Apriori algorithm – a practical example
Evaluation metrics
In summary
Applications of UL
Market segmentation
Anomaly detection
Feature extraction
Summary
9
Interpreting and Evaluating Machine Learning Models
How do I know whether this model will be accurate?
Evaluating on test (holdout) data
Understanding evaluation metrics
Evaluating regression models
R-squared
Root mean squared error
Mean absolute error
When and how to use each metric
Practical evaluation strategies
Summarizing the evaluation of regression models
Evaluating classification models
Classification model evaluation metrics
Precision, recall, and F1-score
Recall
F1-score
Methods for explaining machine learning models
Making sense of regression models – the power of coefficients
Decoding classification models – unveiling feature importance
Beyond specific models – universal insights using SHAP values
Summary
10
Common Pitfalls in Machine Learning
Understanding the complexity
Dirty data, damaged models – how data quantity and quality impact ML
The importance of adequate training data
Dealing with poor data quality
Conclusion
Overcoming overfitting and underfitting
Navigating training-serving skew and model drift
Ensuring fairness
Mastering overfitting and underfitting for optimal model performance
Overfitting – when your model is too specific
Underfitting – when your model is too simplistic
Spotting the problem
Conclusion
Training-serving skew and model drift
Training-serving skew
Model drift
Key takeaways
Bias and fairness
Understanding bias
Understanding fairness
Mitigating bias and ensuring fairness
Key takeaways
Summary
Part 3: Leading Successful Data Science Projects and Teams
11
The Structure of a Data Science Project
The various types of data science projects
Data products
Reports and analytics
Research and methodology
The stages of a data product
Identifying use cases
Evaluating use cases
Planning the data product
Developing a data product
Data preparation and exploratory analysis
Model design and development
Evaluation and testing
Deploying and monitoring a data product
General best practices for data product development
Evaluating impact
Predictive maintenance in manufacturing
Fraud detection in banking
Customer churn prediction in telecom
Demand forecasting in retail
Personalized recommendations in e-commerce
Predictive maintenance in energy
Workforce optimization in quick service restaurants
Chatbot-assisted customer support
Summary
12
The Data Science Team
Assembling your data science team – key roles and considerations
Data scientists
Machine learning engineers
Data engineers
MLOps engineers
Analytics engineers
Software engineers (full stack, frontend, backend)
Product managers
Business analysts
Data storytellers/visualization experts
Considerations when assembling your team
Data science teams within larger organizations
The hub and spoke model
What is the hub and spoke model?
Practical applications of the hub and spoke model
Building a hub and spoke model
The art of recruitment
Where to find technical talent
How high-performing data science teams operate
Cross-functional collaboration is essential
Diversity of perspectives drives innovation
Start with the right problem to solve
Invest in tooling, infrastructure, and workflow
Continuous adaption and learning are a must
Focus ruthlessly on outcomes over activity
Summary
13
Managing the Data Science Team
Day-to-day management of a data science team
Enabling rapid experimentation and innovation
Managing inherent uncertainty
Balancing research and application
Communicating effectively in data science and artificial intelligence
Fostering a culture of curiosity and continuous learning
Embracing peer review and collaboration
Common challenges in managing a data science team
Challenge 1 – recruiting and retaining top talent
Challenge 2 – aligning projects with business goals
Challenge 3 – managing inherent uncertainty
Challenge 4 – scaling and operationalizing models
Challenge 5 – deploying robust, reliable, fair models ethically
Empowering and motivating your data science team
Working with other teams and external stakeholders and empowering them to use data
Summary
14
Continuing Your Journey as a Data Science Leader
Navigating the landscape of emerging technologies
Specializing in an industry
Specializing in a field
Embracing continuous learning
Online courses
Cloud certifications
Technical tutorials and documentation
Learning plan framework
Staying up to date with current DS/ML/AI news and trends
Promoting data-driven thinking within your organization
Host internal learning sessions
Collaborate on cross-functional projects
Share success stories and lessons learned
Mentor and upskill colleagues
Establish a data science community of practice
Networking beyond your organization
Attend industry conferences and events
Join online communities and forums
Engage with local meetups and user groups
Collaborate on side projects or research
Offer mentorship or seek mentors
Summary
Index
Other Books You May Enjoy
Part 1: Understanding Data Science and Its Foundations
This part covers the foundations of data science, including key statistical concepts, data types, collection methods, exploratory data analysis, statistical significance, and regression. It has the following chapters:
Chapter 1, Introducing Data Science
Chapter 2, Characterizing and Collecting Data
Chapter 3, Exploratory Data Analysis
Chapter 4, The Significance of Significance
Chapter 5, Understanding Regression