Professionals face several challenges in effectively leveraging data in today's data-driven world. One of the main challenges is the low quality of data products, often caused by inaccurate, incomplete, or inconsistent data. Another significant challenge is that many data professionals lack the skills to analyze unstructured data, so they miss valuable insights that are difficult or impossible to obtain from structured data alone.
To help you tackle these challenges, this book will take you on a journey through the upstream data pipeline, which includes the ingestion of data from various sources, the validation and profiling of data for high-quality end tables, and writing data to different sinks. You’ll focus on structured data by performing essential tasks, such as cleaning and encoding datasets and handling missing values and outliers, before learning how to manipulate unstructured data with simple techniques. You’ll also be introduced to a variety of natural language processing techniques, from tokenization to vector models, as well as techniques to structure images, videos, and audio.
By the end of this book, you’ll be proficient in data cleaning and preparation techniques for both structured and unstructured data.
Number of pages: 582
Year of publication: 2024
Python Data Cleaning and Preparation Best Practices
A practical guide to organizing and handling data from various sources and formats using Python
Maria Zervou
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Apeksha Shetty
Publishing Product Managers: Deepesh Patel and Chayan Majumdar
Book Project Manager: Hemangi Lotlikar
Senior Content Development Editor: Manikandan Kurup
Technical Editor: Kavyashree K S
Copy Editor: Safis Editing
Proofreader: Manikandan Kurup
Indexer: Hemangini Bari
Production Designer: Joshua Misquitta
Senior DevRel Marketing Executive: Nivedita Singh
First published: September 2024
Production reference: 1190924
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-83763-474-3
www.packtpub.com
I want to extend my deepest thanks to those who have been by my side throughout the journey of writing this book while managing work in parallel. I am immensely grateful to everyone who has cheered me on, offered feedback, and inspired me to keep going. A special thanks to my family, for their unwavering support and for teaching me the power of determination, and to my mentors, friends, and partner, who have guided me over the years and helped me see the bigger picture, and from whom I have learned so much! This accomplishment is as much yours as it is mine. Thank you for being part of this journey!
– Maria Zervou
Maria Zervou is a Generative AI and machine learning expert, dedicated to making advanced technologies accessible. With over a decade of experience, she has led impactful AI projects across industries and mentored teams on cutting-edge advancements. As a machine learning specialist at Databricks, Maria drives innovative AI solutions and industry adoption. Beyond her role, she democratizes knowledge through her YouTube channel, featuring experts on AI topics. A recognized thought leader and finalist in the Women in Tech Excellence Awards, Maria advocates for responsible AI use and contributes to open source projects, fostering collaboration and empowering future AI leaders.
Mohammed Kamil Khan is currently a scientific programmer at UTHealth Houston’s McWilliams School of Biomedical Informatics, where he works on data preprocessing, GWAS, and post-GWAS analysis of imaging data. He has a master’s degree in data analytics from the University of Houston – Downtown (UHD). With an unwavering passion for democratizing knowledge, Kamil strives to make complex concepts accessible to all. His commitment to sharing his expertise has led him to publish articles on platforms such as DigitalOcean, Open Source For You magazine, and Red Hat’s opensource.com. These articles explore a diverse range of topics, including pandas DataFrames, API data extraction, SQL queries, and much more.
Ashish Shukla is a seasoned professional with 12 years of experience, specializing in Azure technologies, particularly Azure Databricks, for the past 9 years. Formerly associated with Microsoft, Ashish has been instrumental in leading numerous successful projects leveraging Azure Databricks. Currently serving as an associate manager of data operations at PepsiCo India, he brings extensive expertise in cloud-based data solutions, ensuring robust and innovative data operations strategies.
Beyond his professional roles, Ashish is an active contributor to the Azure community through his technical blogs and engagements as a speaker on Azure technologies, where he shares valuable insights and best practices in data management and cloud computing.
Krishnan Raghavan is an IT professional with over 20 years of experience in software development and delivery excellence across multiple domains and technologies, including C++, Java, Python, Angular, Golang, and data warehouses.
When not working, Krishnan likes to spend time with his wife and daughter, read fiction, nonfiction, and technical books, and participate in hackathons. Krishnan tries to give back to the community by being part of the GDG – Pune volunteer group.
You can connect with Krishnan at [email protected] or via LinkedIn.
I’d like to thank my wife, Anita, and daughter, Ananya, for giving me the time and space to review this book.
This part focuses on the foundational stages of data processing, from ingesting data to ensuring its quality and structure for downstream tasks. It guides readers through the essential steps of importing, cleaning, and transforming data, which lay the groundwork for effective data analysis. The chapters explore various methods for ingesting data, maintaining high-quality datasets, profiling data for better insights, and cleaning messy data to make it ready for analysis. It also covers advanced techniques such as merging, concatenating, grouping, and filtering data, along with choosing appropriate data destinations, or sinks, to optimize processing pipelines. Each chapter in this part equips readers with the knowledge to handle raw data and turn it into a clean, structured, and usable form.
This part has the following chapters:
Chapter 1, Data Ingestion Techniques
Chapter 2, Importance of Data Quality
Chapter 3, Data Profiling – Understanding Data Structure, Quality, and Distribution
Chapter 4, Cleaning Messy Data and Data Manipulation
Chapter 5, Data Transformation – Merging and Concatenating
Chapter 6, Data Grouping, Aggregation, Filtering, and Applying Functions
Chapter 7, Data Sinks