29,99 €
Many individuals who know how to run machine learning algorithms do not have a good sense of the statistical assumptions they make and how to match the properties of the data to the algorithm for the best results.
As you start with this book, models are carefully chosen to help you grasp the underlying data, including in-feature importance and correlation, and the distribution of features and targets. The first two parts of the book introduce you to techniques for preparing data for ML algorithms, without being bashful about using some ML techniques for data cleaning, including anomaly detection and feature selection. The book then helps you apply that knowledge to a wide variety of ML tasks. You’ll gain an understanding of popular supervised and unsupervised algorithms, how to prepare data for them, and how to evaluate them. Next, you’ll build models and understand the relationships in your data, as well as perform cleaning and exploration tasks with that data. You’ll make quick progress in studying the distribution of variables, identifying anomalies, and examining bivariate relationships, as you focus more on the accuracy of predictions in this book.
By the end of this book, you’ll be able to deal with complex data problems using unsupervised ML algorithms like principal component analysis and k-means clustering.
Das E-Book können Sie in Legimi-Apps oder einer beliebigen App lesen, die das folgende Format unterstützen:
Seitenzahl: 485
Veröffentlichungsjahr: 2022
Get to grips with machine learning techniques to achieve sparkling-clean data quickly
Michael Walker
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Ali Abidi
Senior Editor: David Sugarman
Content Development Editor: Manikandan Kurup
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Hemangini Bari
Production Designer: Alishon Mendonca
Marketing Coordinators: Shifa Ansari and Abeer Riyaz Dawe
First published: August 2022
Production reference: 1290722
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80324-167-8
www.packt.com
Michael Walker has worked as a data analyst for over 30 years at a variety of educational institutions. He has also taught data science, research methods, statistics, and computer programming to undergraduates since 2006. He is currently the Chief Information Officer at College Unbound in Providence, Rhode Island.
Kalyana Bedhu is an engineering leader for data science at Microsoft. Kalyana has over 20 years of industry experience in data analytics across various companies such as Ericsson, Sony, Bosch Fidelity, and Oracle, among others. Kalyana was an early practitioner of data science at Ericsson, setting up a data science lab and building up competence in solving some practical data science problems. He played a pivotal role in transforming a central IT organization that dealt with most of the enterprise business intelligence, data, and analytical systems, into an AI and data science engine. Kalyana is a recipient of patents, a speaker, and has authored award-winning papers and data science courses.
Thanks to Packt and the author for the opportunity to review this book.
Divya Sardana serves as the lead AI/ ML engineer at Nike. Previously, she was a senior data scientist at Teradata Corp. She holds a Ph.D. in computer science from the University of Cincinnati, OH. She has experience working on end-to-end machine learning and deep learning problems involving techniques such as regression and classification. She has further experience in moving developed models to production and ensuring scalability. Her interests include solving complex big data and machine learning/deep learning problems in real-world domains. She is actively involved in the peer review of journals and books in the area of machine learning. She has served as a session chair at machine learning conferences such as ICMLA 2021 and BDA 2021.
I try to avoid thinking about different parts of the model building process sequentially, to see myself as cleaning data, then preprocessing, and so on until I have done model validation. I do not want to think about that process as involving phases that ever end. We start with data cleaning in this section, but I hope the chapters in this section convey that we are always looking ahead, anticipating modeling challenges as we clean data; and that we also typically reflect back on the data cleaning we have done when we evaluate our models.
To some extent, the clean and dirty metaphor hides the nuance in preparing data for subsequent analysis. The real concern is how representative our instances and attributes (observations and variables) are of phenomena of interest. This can always be improved, and easily made worse without care. One thing is for certain though. There is nothing we can do in any other part of the model building process that will make right something important we have gotten wrong during data cleaning.
The first three chapters of this book are about getting our data as right as we can. To do that we have to have a good sense of how all variables, features and targets, are distributed. There are three questions we should ask ourselves before we do any formal analysis: 1) Are we confident that we know the full range of values, and the shape of the distribution, of every variable of interest? 2) Do we have a good idea of the bivariate relationship between variables, how each moves with others? 3) How successful are our attempts to fix potential problems, such as outliers and missing values? The chapters in this section provide the tools you need to answer these questions.
This section comprises the following chapters:
Chapter 1, Examining the Distribution of Features and TargetsChapter 2, Examining Bivariate and Multivariate Relationships between Features and TargetsChapter 3, Identifying and Fixing Missing Values