Principles of Data Science

Sinan Ozdemir

Description
Principles of Data Science bridges mathematics, programming, and business analysis, empowering you to confidently pose and address complex data questions and construct effective machine learning pipelines. This book will equip you with the tools to transform abstract concepts and raw statistics into actionable insights.
Starting with cleaning and preparation, you’ll explore effective data mining strategies and techniques before moving on to building a holistic picture of how every piece of the data science puzzle fits together. Throughout the book, you’ll discover statistical models with which you can control and navigate even the densest or the sparsest of datasets and learn how to create powerful visualizations that communicate the stories hidden in your data.
With a focus on application, this edition covers advanced transfer learning and pre-trained models for NLP and vision tasks. You’ll get to grips with advanced techniques for mitigating algorithmic bias in both data and models, as well as for addressing model and data drift. Finally, you’ll explore medium-level data governance, including data provenance, privacy, and deletion request handling.
By the end of this data science book, you'll have learned the fundamentals of computational mathematics and statistics, all while navigating the intricacies of modern ML and large pre-trained models like GPT and BERT.

The e-book can be read in Legimi apps or in any app that supports the following format:

EPUB

Page count: 458

Publication year: 2024




Principles of Data Science

A beginner’s guide to essential math and coding skills for data fluency and machine learning

Sinan Ozdemir

Principles of Data Science

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Ali Abidi

Publishing Product Manager: Tejashwini R

Book Project Manager: Farheen Fathima

Content Development Editor: Priyanka Soam

Technical Editor: Kavyashree K S

Copy Editor: Safis Editing

Proofreader: Safis Editing

Indexer: Manju Arasan

Production Designer: Alishon Mendonca

DevRel Marketing Coordinator: Vinishka Kalra

First published: December 2016

Second edition: December 2018

Third edition: January 2024

Production reference: 1120124

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK

ISBN 978-1-83763-630-3

www.packtpub.com

I have dedicated many books to many loved ones in the past, and for this edition, I want to dedicate this work to the people of Packt Publishing, who not only gave me my first chance at writing a book when I was early in my career but have stuck by me and continued to release editions with me since.

Thank you to everyone at Packt Publishing for all of your hard work, patience, and dedication to my work!

– Sinan Ozdemir

Contributor

About the author

Sinan Ozdemir is an active lecturer on large language models and a former lecturer of data science at Johns Hopkins University. He is the author of multiple textbooks on data science and machine learning, including Quick Start Guide to LLMs. Sinan is currently the founder of LoopGenius, which uses AI to help people and businesses boost their sales, and was previously the founder of the acquired Kylie.ai, an enterprise-grade conversational AI platform with RPA capabilities. He holds a master’s degree in pure mathematics from Johns Hopkins University and is based in San Francisco.

About the reviewer

Jigyasa Grover, a 10-time award winner in AI and open source and the co-author of the book Sculpting Data for ML, is a powerhouse brimming with passion to make a dent in this world of technology and bridge the gaps. With years of machine learning engineering and data science experience in deploying large‐scale systems for monetization on social networking and e‐commerce platforms, she primarily focuses on ad prediction, sponsored content ranking, and recommendation. She is an avid proponent of open source and credits her access to opportunities and career growth to this sphere of community development. In her spirit to build a powerful community with a strong belief in the axiom, “We rise by lifting others,” she actively mentors developers and machine learning enthusiasts.

Table of Contents

Preface

1

Data Science Terminology

What is data science?

Understanding basic data science terminology

Why data science?

Example – predicting COVID-19 with machine learning

The data science Venn diagram

The math

Computer programming

Example – parsing a single tweet

Domain knowledge

Some more terminology

Data science case studies

Case study – automating government paper pushing

Case study – what’s in a job description?

Summary

2

Types of Data

Structured versus unstructured data

Quantitative versus qualitative data

Digging deeper

The four levels of data

The nominal level

Measures of center

The ordinal level

The interval level

The ratio level

Data is in the eye of the beholder

Summary

Questions and answers

3

The Five Steps of Data Science

Introduction to data science

Overview of the five steps

Exploring the data

Guiding questions for data exploration

DataFrames

Series

Exploration tips for qualitative data

Summary

4

Basic Mathematics

Basic symbols and terminology

Vectors and matrices

Arithmetic symbols

Summation

Logarithms/exponents

Set theory

Linear algebra

Matrix multiplication

How to multiply matrices together

Summary

5

Impossible or Improbable – A Gentle Introduction to Probability

Basic definitions

What do we mean by “probability”?

Bayesian versus frequentist

Frequentist approach

The law of large numbers

Compound events

Conditional probability

How to utilize the rules of probability

The addition rule

Mutual exclusivity

The multiplication rule

Independence

Complementary events

Introduction to binary classifiers

Summary

6

Advanced Probability

Bayesian ideas revisited

Bayes’ theorem

More applications of Bayes’ theorem

Random variables

Discrete random variables

Continuous random variables

Summary

7

What Are the Chances? An Introduction to Statistics

What are statistics?

How do we obtain and sample data?

Obtaining data

Observational

Experimental

Sampling data

How do we measure statistics?

Measures of center

Measures of variation

The coefficient of variation

Measures of relative standing

The insightful part – correlations in data

The empirical rule

Example – exam scores

Summary

8

Advanced Statistics

Understanding point estimates

Sampling distributions

Confidence intervals

Hypothesis tests

Conducting a hypothesis test

One-sample t-tests

Type I and Type II errors

Hypothesis testing for categorical variables

Chi-square goodness of fit test

Chi-square test for association/independence

Summary

9

Communicating Data

Why does communication matter?

Identifying effective visualizations

Scatter plots

Line graphs

Bar charts

Histograms

Box plots

When graphs and statistics lie

Correlation versus causation

Simpson’s paradox

If correlation doesn’t imply causation, then what does?

Verbal communication

It’s about telling a story

On the more formal side of things

The why/how/what strategy for presenting

Summary

10

How to Tell if Your Toaster is Learning – Machine Learning Essentials

Introducing ML

Example – facial recognition

ML isn’t perfect

How does ML work?

Types of ML

SL

UL

RL

Overview of the types of ML

ML paradigms – pros and cons

Predicting continuous variables with linear regression

Correlation versus causation

Causation

Adding more predictors

Regression metrics

Summary

11

Predictions Don’t Grow on Trees, or Do They?

Performing naïve Bayes classification

Classification metrics

Understanding decision trees

Measuring purity

Exploring the Titanic dataset

Dummy variables

Diving deep into UL

When to use UL

k-means clustering

The Silhouette Coefficient

Feature extraction and PCA

Summary

12

Introduction to Transfer Learning and Pre-Trained Models

Understanding pre-trained models

Benefits of using pre-trained models

Commonly used pre-trained models

Decoding BERT’s pre-training

TL

Different types of TL

Inductive TL

Transductive TL

Unsupervised TL – feature extraction

TL with BERT and GPT

Examples of TL

Example – Fine-tuning a pre-trained model for text classification

Summary

13

Mitigating Algorithmic Bias and Tackling Model and Data Drift

Understanding algorithmic bias

Types of bias

Sources of algorithmic bias

Measuring bias

Consequences of unaddressed bias and the importance of fairness

Mitigating algorithmic bias

Mitigation during data preprocessing

Mitigation during model in-processing

Mitigation during model postprocessing

Bias in LLMs

Uncovering bias in GPT-2

Emerging techniques in bias and fairness in ML

Understanding model drift and decay

Model drift

Data drift

Mitigating drift

Understanding the context

Continuous monitoring

Regular model retraining

Implementing feedback systems

Model adaptation techniques

Summary

14

AI Governance

Mastering data governance

Current hurdles in data governance

Data management: crafting the bedrock

Data ingestion – the gateway to information

Data integration – from collection to delivery

Data warehouses and entity resolution

The quest for data quality

Documentation and cataloging – the unsung heroes of governance

Understanding the path of data

Regulatory compliance and audit preparedness

Change management and impact analysis

Upholding data quality

Troubleshooting and analysis

Navigating the intricacy and the anatomy of ML governance

ML governance pillars

Model interpretability

The many facets of ML development

Beyond training – model deployment and monitoring

A guide to architectural governance

The five pillars of architectural governance

Transformative architectural principles

Zooming in on architectural dimensions

Summary

15

Navigating Real-World Data Science Case Studies in Action

Introduction to the COMPAS dataset case study

Understanding the task/outlining success

Preliminary data exploration

Preparing the data for modeling

Final thoughts

Text embeddings using pre-trained models and OpenAI

Setting up and importing necessary libraries

Data collection – fetching the textbook data

Converting text to embeddings

Querying – searching for relevant information

Concluding thoughts – the power of modern pre-trained models

Summary

Index

Other Books You May Enjoy

2

Types of Data

For our first step into the world of data science, let’s take a look at the various ways in which data can be formed. In this chapter, we will explore three critical categorizations of data:

Structured versus unstructured data

Quantitative versus qualitative data

The four levels of data

We will dive further into each of these topics by showing examples of how data scientists look at and work with data. This chapter aims to familiarize us with the fundamental types of data so that when we eventually see our first dataset, we will know exactly how to dissect, diagnose, and analyze the contents to maximize our insight value and machine learning performance.

The first thing to note is my use of the word data. In the previous chapter, I defined data as merely a collection of information. This vague definition exists because we may separate data into different categories and need our definition to be loose.

The next thing to remember while we go through this chapter is that for the most part, when I talk about the type of data, I will refer to either a specific characteristic (column/feature) of a dataset or the entire dataset as a whole. I will be very clear about which one I refer to at any given time.

At first thought, it might seem worthless to stop and think about what type of data we have before getting into the fun stuff, such as statistics and machine learning, but this is arguably one of the most important steps you need to take to perform data science.

When given a new dataset to analyze, it is tempting to jump right into exploring, applying statistical models, and researching the applications of machine learning to get results as soon as possible. However, if you don’t understand the type of data that you are working with, then you might waste a lot of time applying models that are known to be ineffective with that specific type of data.

Structured versus unstructured data

The first question we want to ask ourselves about an entire dataset is whether we are working with structured or unstructured data. The answer to this question can mean the difference between needing three days or three weeks to perform a proper analysis.

The basic breakdown is as follows (this is a rehashed definition of organized and unorganized data from Chapter 1):

Structured (that is, organized) data: This is data that can be thought of as observations and characteristics. It is usually organized using a table method (rows and columns) and can be stored in a spreadsheet format or a relational database.

Unstructured (that is, unorganized) data: This data exists as a free entity and does not follow any standard organizational hierarchy; examples include images, text, and videos.
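To make this distinction concrete, here is a minimal sketch in plain Python (the sample records and the is_tabular helper are hypothetical illustrations, not from the book). Structured data shares a consistent set of columns across every observation; unstructured data is a free-form blob with no such schema.

```python
# Structured data: every row (observation) exposes the same
# characteristics (columns), so it fits a spreadsheet or SQL table.
structured_rows = [
    {"name": "Alice", "age": 34, "city": "Austin"},
    {"name": "Bob", "age": 29, "city": "Boston"},
]

# Unstructured data: the same information as free text, with no
# row/column layout we can rely on.
unstructured_blob = "Met Alice (34) downtown; Bob said he'd move to Boston soon."

def is_tabular(rows):
    """Return True if every row exposes the same set of columns."""
    if not rows:
        return True
    columns = set(rows[0])
    return all(set(row) == columns for row in rows)

print(is_tabular(structured_rows))  # True: ready for tabular analysis
```

A quick sanity check like is_tabular is often the first thing to run on incoming records: if rows disagree on their columns, the data is not yet structured in the sense used here.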

Here are a few examples that could help you differentiate between the two:

Most data that exists in text form, including server logs and Facebook posts, is unstructured

Scientific observations, as recorded by scientists, are kept in a very neat and organized (structured) format

A genetic sequence of chemical nucleotides (for example, ACGTATTGCA) is unstructured, even if the order of the nucleotides matters, as we cannot form descriptors of the sequence using a row/column format without taking a further look
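The genetic sequence example can be made concrete with a short standard-library sketch of what "taking a further look" might mean: deriving structured descriptors (here, nucleotide counts and length, chosen purely for illustration) from the raw, unstructured string.

```python
from collections import Counter

# The raw sequence is unstructured: a single free-form string.
sequence = "ACGTATTGCA"  # the example sequence from the text

# Derive structured descriptors that DO fit a row/column format:
# one count per nucleotide, plus the sequence length.
descriptors = dict(Counter(sequence))
descriptors["length"] = len(sequence)

print(descriptors)  # {'A': 3, 'C': 2, 'G': 2, 'T': 3, 'length': 10}
```

Each sequence now maps to one row of fixed columns (A, C, G, T, length), which is exactly the kind of transformation that turns unstructured data into something statistical and machine learning models can consume. Note that these counts discard the ordering of the nucleotides, which is why richer descriptors are often needed.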

Structured data is generally thought of as being much easier to work with and analyze. Most statistical and machine learning models were built with structured data in mind and cannot work on the loose interpretation of unstructured data. The natural row and column structure is easy to digest for human and machine eyes. So, why even talk about