Python Data Cleaning and Preparation Best Practices

Maria Zervou

Description

Professionals face several challenges in effectively leveraging data in today's data-driven world. One of the main challenges is the low quality of data products, often caused by inaccurate, incomplete, or inconsistent data. Another is that many data professionals lack the skills to analyze unstructured data, so valuable insights that are difficult or impossible to obtain from structured data alone go unnoticed.
To help you tackle these challenges, this book will take you on a journey through the upstream data pipeline, which includes the ingestion of data from various sources, the validation and profiling of data for high-quality end tables, and writing data to different sinks. You’ll focus on structured data by performing essential tasks, such as cleaning and encoding datasets and handling missing values and outliers, before learning how to manipulate unstructured data with simple techniques. You’ll also be introduced to a variety of natural language processing techniques, from tokenization to vector models, as well as techniques to structure images, videos, and audio.
By the end of this book, you’ll be proficient in data cleaning and preparation techniques for both structured and unstructured data.
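
As a brief illustration of one task covered in the book, detecting and imputing missing values, here is a minimal pandas sketch (the DataFrame and its column names are invented for this example and do not come from the book):

import pandas as pd

# A tiny, made-up dataset with gaps in a numeric and a text column
df = pd.DataFrame({
    "age": [25, None, 41, 33],
    "city": ["Athens", "Berlin", None, "Lisbon"],
})

# Count missing values per column before choosing a strategy
print(df.isna().sum())

# One common approach: impute numeric columns with the median
# and flag missing categorical values explicitly
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna("unknown")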

You can read this e-book in Legimi apps or any app that supports the following formats:

EPUB
MOBI

Page count: 582

Year of publication: 2024




Python Data Cleaning and Preparation Best Practices

A practical guide to organizing and handling data from various sources and formats using Python

Maria Zervou

Python Data Cleaning and Preparation Best Practices

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Apeksha Shetty

Publishing Product Managers: Deepesh Patel and Chayan Majumdar

Book Project Manager: Hemangi Lotlikar

Senior Content Development Editor: Manikandan Kurup

Technical Editor: Kavyashree K S

Copy Editor: Safis Editing

Proofreader: Manikandan Kurup

Indexer: Hemangini Bari

Production Designer: Joshua Misquitta

Senior DevRel Marketing Executive: Nivedita Singh

First published: September 2024

Production reference: 1190924

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-83763-474-3

www.packtpub.com

I want to extend my deepest thanks to those who have been by my side throughout the journey of writing this book while managing work in parallel. I am immensely grateful to everyone who has cheered me on, offered feedback, and inspired me to keep going. A special thanks to my family, for their unwavering support and for teaching me the power of determination. To my mentors, friends, and partner, who have guided me over the years and helped me see the bigger picture, and from whom I have learned so much! This accomplishment is as much yours as it is mine. Thank you for being part of this journey!

– Maria Zervou

Contributors

About the author

Maria Zervou is a Generative AI and machine learning expert, dedicated to making advanced technologies accessible. With over a decade of experience, she has led impactful AI projects across industries and mentored teams on cutting-edge advancements. As a machine learning specialist at Databricks, Maria drives innovative AI solutions and industry adoption. Beyond her role, she democratizes knowledge through her YouTube channel, featuring experts on AI topics. A recognized thought leader and finalist in the Women in Tech Excellence Awards, Maria advocates for responsible AI use and contributes to open source projects, fostering collaboration and empowering future AI leaders.

About the reviewers

Mohammed Kamil Khan is currently a scientific programmer at UTHealth Houston’s McWilliams School of Biomedical Informatics, where he works on data preprocessing, GWAS, and post-GWAS analysis of imaging data. He holds a master’s degree in data analytics from the University of Houston – Downtown (UHD). With an unwavering passion for democratizing knowledge, Kamil strives to make complex concepts accessible to all; this commitment led him to publish articles on platforms such as DigitalOcean, Open Source For You magazine, and Red Hat’s opensource.com, covering a diverse range of topics, including pandas DataFrames, API data extraction, SQL queries, and much more.

Ashish Shukla is a seasoned professional with 12 years of experience, specializing in Azure technologies, particularly Azure Databricks, for the past 9 years. Formerly associated with Microsoft, Ashish has been instrumental in leading numerous successful projects leveraging Azure Databricks. Currently serving as an associate manager of data operations at PepsiCo India, he brings extensive expertise in cloud-based data solutions, ensuring robust and innovative data operations strategies.

Beyond his professional roles, Ashish is an active contributor to the Azure community through his technical blogs and engagements as a speaker on Azure technologies, where he shares valuable insights and best practices in data management and cloud computing.

Krishnan Raghavan is an IT professional with over 20 years of experience in software development and delivery excellence across multiple domains and technologies, including C++, Java, Python, Angular, Golang, and data warehouses.

When not working, Krishnan likes to spend time with his wife and daughter, read fiction, nonfiction, and technical books, and participate in hackathons. Krishnan tries to give back to the community by being part of the GDG – Pune volunteer group.

You can connect with Krishnan at [email protected] or via LinkedIn.

I’d like to thank my wife, Anita, and daughter, Ananya, for giving me the time and space to review this book.

Table of Contents

Preface

Part 1: Upstream Data Ingestion and Cleaning

1

Data Ingestion Techniques

Technical requirements

Ingesting data in batch mode

Advantages and disadvantages

Common use cases for batch ingestion

Batch ingestion use cases

Batch ingestion with an example

Ingesting data in streaming mode

Advantages and disadvantages

Common use cases for streaming ingestion

Streaming ingestion in an e-commerce platform

Streaming ingestion with an example

Real-time versus semi-real-time ingestion

Common use cases for near-real-time ingestion

Semi-real-time mode with an example

Data source solutions

Event data processing solution

Ingesting event data with Apache Kafka

Ingesting data from databases

Performing data ingestion from cloud-based file systems

APIs

Summary

2

Importance of Data Quality

Technical requirements

Why data quality is important

Dimensions of data quality

Completeness

Accuracy

Timeliness

Consistency

Uniqueness

Duplication

Data usage

Data compliance

Implementing quality controls throughout the data life cycle

Data silos and the impact on data quality

Summary

3

Data Profiling – Understanding Data Structure, Quality, and Distribution

Technical requirements

Understanding data profiling

Identifying goals of data profiling

Exploratory data analysis options – profiler versus manual

Profiling data with pandas’ ydata_profiling

Overview

Interactions

Correlations

Missing values

Duplicate rows

Sample dataset

Profiling high volumes of data with the pandas data profiler

Data validation with the Great Expectations library

Configuring Great Expectations for your project

Creating your first Great Expectations data source

Creating your first Great Expectations suite

Great Expectations Suite report

Manually editing Great Expectations

Checkpoints

Using pandas profiler to build your Great Expectations Suite

Comparing Great Expectations and pandas profiler – when to use what

Great Expectations and big data

Summary

4

Cleaning Messy Data and Data Manipulation

Technical requirements

Renaming columns

Renaming a single column

Renaming all columns

Removing irrelevant or redundant columns

Dealing with inconsistent and incorrect data types

Inspecting columns

Columnar type transformations

Converting to numeric types

Converting to string types

Converting to categorical types

Converting to Boolean types

Working with dates and times

Importing and parsing date and time data

Extracting components from dates and times

Calculating time differences and durations

Handling time zones and daylight saving time

Summary

5

Data Transformation – Merging and Concatenating

Technical requirements

Joining datasets

Choosing the correct merge strategy

Handling duplicates when merging datasets

Why handle duplication in rows and columns?

Dropping duplicate rows

Validating data before merging

Aggregation

Concatenation

Handling duplication in columns

Performance tricks for merging

Set indexes

Sorting indexes

Merge versus join

Concatenating DataFrames

Row-wise concatenation

Column-wise concatenation

Summary

References

6

Data Grouping, Aggregation, Filtering, and Applying Functions

Technical requirements

Grouping data using one or multiple keys

Grouping data using one key

Grouping data using multiple keys

Best practices for grouping

Applying aggregate functions on grouped data

Basic aggregate functions

Advanced aggregation with multiple columns

Applying custom aggregate functions

Best practices for aggregate functions

Using the apply function on grouped data

Data filtering

Multiple criteria for filtering

Best practices for filtering

Performance considerations as data grows

Summary

7

Data Sinks

Technical requirements

Choosing the right data sink for your use case

Relational databases

NoSQL databases

Data warehouses

Data lakes

Streaming data sinks

Which sink is the best for my use case?

Decoding file types for optimal usage

Navigating partitioning

Horizontal versus vertical partitioning

Time-based partitioning

Geographic partitioning

Hybrid partitioning

Considerations for choosing partitioning strategies

Designing an online retail data platform

Summary

Part 2: Downstream Data Cleaning – Consuming Structured Data

8

Detecting and Handling Missing Values and Outliers

Technical requirements

Detecting missing data

Handling missing data

Deletion of missing data

Imputation of missing data

Mean imputation

Median imputation

Creating indicator variables

Comparison between imputation methods

Detecting and handling outliers

Impact of outliers

Identifying univariate outliers

Handling univariate outliers

Identifying multivariate outliers

Handling multivariate outliers

Summary

9

Normalization and Standardization

Technical requirements

Scaling features to a range

Min-max scaling

Z-score scaling

When to use Z-score scaling

Robust scaling

Comparison between methods

Summary

10

Handling Categorical Features

Technical requirements

Label encoding

Use case – employee performance analysis

Considerations for label encoding

One-hot encoding

When to use one-hot encoding

Use case – customer churn prediction

Considerations for one-hot encoding

Target encoding (mean encoding)

When to use target encoding

Use case – sales prediction for retail stores

Considerations for target encoding

Frequency encoding

When to use frequency encoding

Use case – customer product preference analysis

Considerations for frequency encoding

Binary encoding

When to use binary encoding

Use case – customer subscription prediction

Considerations for binary encoding

Summary

11

Consuming Time Series Data

Technical requirements

Understanding the components of time series data

Trend

Seasonality

Noise

Types of time series data

Univariate time series data

Multivariate time series data

Identifying missing values in time series data

Checking for NaNs or null values

Visual inspection

Handling missing values in time series data

Removing missing data

Forward and backward fill

Interpolation

Comparing the different methods for missing values

Analyzing time series data

Autocorrelation and partial autocorrelation

ACF and PACF in the stock market use case

Dealing with outliers

Identifying outliers with seasonal decomposition

Handling outliers – model-based approaches – ARIMA

Moving window techniques

Feature engineering for time series data

Lag features and their importance

Differencing time series

Applying time series techniques in different industries

Summary

Part 3: Downstream Data Cleaning – Consuming Unstructured Data

12

Text Preprocessing in the Era of LLMs

Technical requirements

Relearning text preprocessing in the era of LLMs

Text cleaning

Removing HTML tags and special characters

Handling capitalization and letter case

Dealing with numerical values and symbols

Addressing whitespace and formatting issues

Removing personally identifiable information

Handling rare words and spelling variations

Dealing with rare words

Addressing spelling variations and typos

Chunking

Tokenization

Word tokenization

Subword tokenization

Domain-specific data

Turning tokens into embeddings

BERT – contextualized embedding models

BGE

GTE

Selecting the right embedding model

Solving real problems with embeddings

Summary

13

Image and Audio Preprocessing with LLMs

Technical requirements

The current era of image preprocessing

Loading the images

Resizing and cropping

Normalizing and standardizing the dataset

Data augmentation

Noise reduction

Extracting text from images

PaddleOCR

Using LLMs with OCR

Creating image captions

Handling audio data

Using Whisper for audio-to-text conversion

Extracting text from audio

Future research in audio preprocessing

Summary

This concludes the book! You did it!

Index

Other Books You May Enjoy

Part 1: Upstream Data Ingestion and Cleaning

This part focuses on the foundational stages of data processing, from data ingestion through to ensuring the quality and structure needed for downstream tasks. It guides readers through the essential steps of importing, cleaning, and transforming data, which lay the groundwork for effective data analysis. The chapters explore various methods for ingesting data, maintaining high-quality datasets, profiling data for better insights, and cleaning messy data to make it ready for analysis. They also cover advanced techniques such as merging, concatenating, grouping, and filtering data, along with choosing appropriate data destinations, or sinks, to optimize processing pipelines. Each chapter in this part equips readers with the knowledge to handle raw data and turn it into a clean, structured, and usable form.

This part has the following chapters:

Chapter 1, Data Ingestion Techniques
Chapter 2, Importance of Data Quality
Chapter 3, Data Profiling – Understanding Data Structure, Quality, and Distribution
Chapter 4, Cleaning Messy Data and Data Manipulation
Chapter 5, Data Transformation – Merging and Concatenating
Chapter 6, Data Grouping, Aggregation, Filtering, and Applying Functions
Chapter 7, Data Sinks
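
As a small taste of the workflow this part builds toward, here is a minimal pandas sketch of the merge, group, and aggregate steps that Chapters 5 and 6 cover in depth (the tables and column names are invented for illustration, not taken from the book):

import pandas as pd

# Two made-up tables sharing a key, as in a typical retail pipeline
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [10.0, 25.0, 5.0, 40.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "south", "north"],
})

# Merge on the shared key, then aggregate order value per region
merged = orders.merge(customers, on="customer_id", how="left")
per_region = merged.groupby("region")["amount"].agg(["sum", "mean"])
print(per_region)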