Data Cleaning and Exploration with Machine Learning - Michael Walker - E-Book

Description

Many individuals who know how to run machine learning algorithms do not have a good sense of the statistical assumptions they make and how to match the properties of the data to the algorithm for the best results.
As you start with this book, models are carefully chosen to help you grasp the underlying data, including feature importance and correlation, and the distribution of features and targets. The first two parts of the book introduce you to techniques for preparing data for ML algorithms, without being bashful about using some ML techniques for data cleaning, including anomaly detection and feature selection. The book then helps you apply that knowledge to a wide variety of ML tasks. You'll gain an understanding of popular supervised and unsupervised algorithms, how to prepare data for them, and how to evaluate them. Next, you'll build models and understand the relationships in your data, as well as perform cleaning and exploration tasks with that data. You'll make quick progress in studying the distribution of variables, identifying anomalies, and examining bivariate relationships, while keeping your focus on the accuracy of predictions.
By the end of this book, you’ll be able to deal with complex data problems using unsupervised ML algorithms like principal component analysis and k-means clustering.

The e-book can be read in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 485

Publication year: 2022




Data Cleaning and Exploration with Machine Learning

Get to grips with machine learning techniques to achieve sparkling-clean data quickly

Michael Walker

BIRMINGHAM—MUMBAI

Data Cleaning and Exploration with Machine Learning

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Ali Abidi

Senior Editor: David Sugarman

Content Development Editor: Manikandan Kurup

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Hemangini Bari

Production Designer: Alishon Mendonca

Marketing Coordinators: Shifa Ansari and Abeer Riyaz Dawe

First published: August 2022

Production reference: 1290722

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80324-167-8

www.packt.com

Contributors

About the author

Michael Walker has worked as a data analyst for over 30 years at a variety of educational institutions. He has also taught data science, research methods, statistics, and computer programming to undergraduates since 2006. He is currently the Chief Information Officer at College Unbound in Providence, Rhode Island.

About the reviewers

Kalyana Bedhu is an engineering leader for data science at Microsoft. Kalyana has over 20 years of industry experience in data analytics across various companies, such as Ericsson, Sony, Bosch, Fidelity, and Oracle, among others. Kalyana was an early practitioner of data science at Ericsson, setting up a data science lab and building up competence in solving practical data science problems. He played a pivotal role in transforming a central IT organization that dealt with most of the enterprise business intelligence, data, and analytical systems into an AI and data science engine. Kalyana holds several patents, is a speaker, and has authored award-winning papers and data science courses.

Thanks to Packt and the author for the opportunity to review this book.

Divya Sardana serves as the lead AI/ML engineer at Nike. Previously, she was a senior data scientist at Teradata Corp. She holds a Ph.D. in computer science from the University of Cincinnati, OH. She has experience working on end-to-end machine learning and deep learning problems involving techniques such as regression and classification. She also has experience in moving developed models to production and ensuring scalability. Her interests include solving complex big data and machine learning/deep learning problems in real-world domains. She is actively involved in the peer review of journals and books in the area of machine learning. She has served as a session chair at machine learning conferences such as ICMLA 2021 and BDA 2021.

Table of Contents

Preface

Section 1 – Data Cleaning and Machine Learning Algorithms

Chapter 1: Examining the Distribution of Features and Targets

Technical requirements

Subsetting data

Generating frequencies for categorical features

Generating summary statistics for continuous and discrete features

Identifying extreme values and outliers in univariate analysis

Using histograms, boxplots, and violin plots to examine the distribution of features

Using histograms

Using boxplots

Using violin plots

Summary

Chapter 2: Examining Bivariate and Multivariate Relationships between Features and Targets

Technical requirements

Identifying outliers and extreme values in bivariate relationships

Using scatter plots to view bivariate relationships between continuous features

Using grouped boxplots to view bivariate relationships between continuous and categorical features

Using linear regression to identify data points with significant influence

Using K-nearest neighbors to find outliers

Using Isolation Forest to find outliers

Summary

Chapter 3: Identifying and Fixing Missing Values

Technical requirements

Identifying missing values

Cleaning missing values

Imputing values with regression

Using KNN imputation

Using random forest for imputation

Summary

Section 2 – Preprocessing, Feature Selection, and Sampling

Chapter 4: Encoding, Transforming, and Scaling Features

Technical requirements

Creating training datasets and avoiding data leakage

Removing redundant or unhelpful features

Encoding categorical features

One-hot encoding

Ordinal encoding

Encoding categorical features with medium or high cardinality

Feature hashing

Using mathematical transformations

Feature binning

Equal-width and equal-frequency binning

K-means binning

Feature scaling

Summary

Chapter 5: Feature Selection

Technical requirements

Selecting features for classification models

Mutual information classification for feature selection with a categorical target

ANOVA F-value for feature selection with a categorical target

Selecting features for regression models

F-tests for feature selection with a continuous target

Mutual information for feature selection with a continuous target

Using forward and backward feature selection

Using forward feature selection

Using backward feature selection

Using exhaustive feature selection

Eliminating features recursively in a regression model

Eliminating features recursively in a classification model

Using Boruta for feature selection

Using regularization and other embedded methods

Using L1 regularization

Using a random forest classifier

Using principal component analysis

Summary

Chapter 6: Preparing for Model Evaluation

Technical requirements

Measuring accuracy, sensitivity, specificity, and precision for binary classification

Examining CAP, ROC, and precision-sensitivity curves for binary classification

Constructing CAP curves

Plotting a receiver operating characteristic (ROC) curve

Plotting precision-sensitivity curves

Evaluating multiclass models

Evaluating regression models

Using K-fold cross-validation

Preprocessing data with pipelines

Summary

Section 3 – Modeling Continuous Targets with Supervised Learning

Chapter 7: Linear Regression Models

Technical requirements

Key concepts

Key assumptions of linear regression models

Linear regression and ordinary least squares

Linear regression and gradient descent

Using classical linear regression

Pre-processing the data for our regression model

Running and evaluating our linear model

Improving our model evaluation

Using lasso regression

Tuning hyperparameters with grid searches

Using non-linear regression

Regression with gradient descent

Summary

Chapter 8: Support Vector Regression

Technical requirements

Key concepts of SVR

Nonlinear SVR and the kernel trick

SVR with a linear model

Using kernels for nonlinear SVR

Summary

Chapter 9: K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression

Technical requirements

Key concepts for K-nearest neighbors regression

K-nearest neighbors regression

Key concepts for decision tree and random forest regression

Using random forest regression

Decision tree and random forest regression

A decision tree example with interpretation

Building and interpreting our actual model

Random forest regression

Using gradient boosted regression

Summary

Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning

Chapter 10: Logistic Regression

Technical requirements

Key concepts of logistic regression

Logistic regression extensions

Binary classification with logistic regression

Evaluating a logistic regression model

Regularization with logistic regression

Multinomial logistic regression

Summary

Chapter 11: Decision Trees and Random Forest Classification

Technical requirements

Key concepts

Using random forest for classification

Using gradient-boosted decision trees

Decision tree models

Implementing random forest

Implementing gradient boosting

Summary

Chapter 12: K-Nearest Neighbors for Classification

Technical requirements

Key concepts of KNN

KNN for binary classification

KNN for multiclass classification

KNN for letter recognition

Summary

Chapter 13: Support Vector Machine Classification

Technical requirements

Key concepts for SVC

Nonlinear SVM and the kernel trick

Multiclass classification with SVC

Linear SVC models

Nonlinear SVM classification models

SVMs for multiclass classification

Summary

Chapter 14: Naïve Bayes Classification

Technical requirements

Key concepts

Naïve Bayes classification models

Naïve Bayes for text classification

Summary

Section 5 – Clustering and Dimensionality Reduction with Unsupervised Learning

Chapter 15: Principal Component Analysis

Technical requirements

Key concepts of PCA

Feature extraction with PCA

Using kernels with PCA

Summary

Chapter 16: K-Means and DBSCAN Clustering

Technical requirements

The key concepts of k-means and DBSCAN clustering

Implementing k-means clustering

Implementing DBSCAN clustering

Summary

Why subscribe?

Other Books You May Enjoy

Section 1 – Data Cleaning and Machine Learning Algorithms

I try to avoid thinking about the parts of the model-building process sequentially, as if I clean data, then preprocess, and so on, until I have done model validation. I do not want to think of those phases as ever ending. We start with data cleaning in this section, but I hope the chapters here convey that we are always looking ahead, anticipating modeling challenges as we clean data, and that we typically reflect back on the data cleaning we have done when we evaluate our models.

To some extent, the clean and dirty metaphor hides the nuance in preparing data for subsequent analysis. The real concern is how representative our instances and attributes (observations and variables) are of the phenomena of interest. This can always be improved, and easily made worse without care. One thing is certain, though: there is nothing we can do in any other part of the model-building process that will make right something important we have gotten wrong during data cleaning.

The first three chapters of this book are about getting our data as right as we can. To do that, we need a good sense of how all variables, both features and targets, are distributed. There are three questions we should ask ourselves before we do any formal analysis:

1) Are we confident that we know the full range of values, and the shape of the distribution, of every variable of interest?
2) Do we have a good idea of the bivariate relationships between variables, that is, how each moves with the others?
3) How successful are our attempts to fix potential problems, such as outliers and missing values?

The chapters in this section provide the tools you need to answer these questions.
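As a first pass, each of these questions can be approached with a few lines of pandas. The following is a minimal sketch, not an example from the book; the DataFrame and its column names are hypothetical, and the 1.5 * IQR rule is just one common convention for flagging extreme values:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a couple of continuous features with a missing
# value and an extreme value deliberately included
df = pd.DataFrame({
    "age": [23, 35, 45, np.nan, 67, 29, 54],
    "income": [40_000, 52_000, 61_000, 58_000, np.nan, 47_000, 1_000_000],
})

# Question 1: range of values and shape of each variable's distribution
print(df.describe())

# Question 2: bivariate relationships -- here, pairwise correlations
print(df.corr())

# Question 3: potential problems -- missing values and extreme values
print(df.isna().sum())
q1, q3 = df["income"].quantile([0.25, 0.75])
outliers = df[df["income"] > q3 + 1.5 * (q3 - q1)]
print(outliers)
```

Each of these one-liners has a visual counterpart (histograms, scatter plots, boxplots) that the next two chapters cover in detail.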

This section comprises the following chapters:

Chapter 1, Examining the Distribution of Features and Targets
Chapter 2, Examining Bivariate and Multivariate Relationships between Features and Targets
Chapter 3, Identifying and Fixing Missing Values