Python Feature Engineering Cookbook - Soledad Galli - E-Book

Description

Streamline data preprocessing and feature engineering in your machine learning projects with this third edition of the Python Feature Engineering Cookbook.
This guide addresses common challenges, such as imputing missing values and encoding categorical variables, using practical solutions and open source Python libraries.
You’ll learn advanced techniques for transforming numerical variables, discretizing variables, and dealing with outliers. Each chapter offers step-by-step instructions and real-world examples, helping you understand when and how to apply various transformations for well-prepared data.
The book explores feature extraction from complex data types such as dates, times, and text. You’ll see how to create new features through mathematical operations and decision trees and use advanced tools like Featuretools and tsfresh to extract features from relational data and time series.
By the end, you’ll be ready to build reproducible feature engineering pipelines that can be easily deployed into production, optimizing data preprocessing workflows and enhancing machine learning model performance.

Formats supported: EPUB, MOBI

Page count: 435

Publication year: 2024




Python Feature Engineering Cookbook

A complete guide to crafting powerful features for your machine learning models

Soledad Galli

Python Feature Engineering Cookbook

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

The author acknowledges the use of cutting-edge AI, such as ChatGPT, with the sole aim of enhancing the language and clarity within the book, thereby ensuring a smooth reading experience for readers. It’s important to note that the content itself has been crafted by the author and edited by a professional publishing team.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Associate Group Product Manager: Niranjan Naikwadi

Publishing Product Manager: Nitin Nainani

Book Project Manager: Hemangi Lotlikar

Senior Editor: Tiksha Abhimanyu Lad

Technical Editor: Sweety Pagaria

Copy Editor: Safis Editing

Proofreader: Tiksha Abhimanyu Lad

Indexer: Manju Arasan

Production Designers: Joshua Misquitta and Alishon Mendonca

Senior DevRel Marketing Executive: Vinishka Kalra

First published: January 2020

Second edition: October 2022

Third edition: August 2024

Production reference: 1260724

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-83588-358-7

www.packtpub.com

This book would not have been possible without the dedicated efforts of those who contribute to the Python open source ecosystem for data science and machine learning. We often overlook the fact that these contributors are real people with families, jobs, and hobbies, who generously allocate their time to develop these essential tools. I am deeply grateful to the developers of scikit-learn and pandas, pivotal libraries for data analysis and processing, as well as the maintainers of tsfresh and category encoders. A special acknowledgment goes to Nathan Parsons, current maintainer of Featuretools, for his invaluable support in crafting Chapter 8 of this book.

I am grateful to my editor, Tiksha Abhimanyu Lad, and her team for their invaluable support in bringing this book to fruition. Special thanks to our technical reviewer, Hector Patiño, for meticulously reviewing the code and recipes, ensuring smooth execution, and providing valuable resources to our readers.

A heartfelt thank you to my friend Chris Samiullah for his invaluable support in my growth as a software developer.

I am also grateful to the users and contributors of Feature-engine for their unwavering support, feedback, and engagement, which have been instrumental in shaping the functionality of the library. Lastly, I owe a debt of gratitude to my students, whose feedback and encouragement have helped me become a better instructor and writer.

Thank you all for your invaluable contributions to this endeavor.

– Soledad Galli

Foreword

From convolutional neural networks to XGBoost, when it comes to machine learning, it’s easy to focus too much on the algorithms. But as the saying goes, “Garbage in, garbage out.” The quality of the features can be more important than the machine learning algorithm itself. Despite advances in feature learning, such as embeddings in neural networks, feature engineering remains as important as ever. Particularly when dealing with categorical, numerical, and time-series features, feature engineering is a critical skill. With the right features, you can greatly improve model performance and ensure that models are more interpretable and robust.

Sole is a remarkable data science and machine learning educator. She has taught tens of thousands of students through her online courses on topics ranging from machine learning interpretability to hyperparameter optimization. It’s fantastic that she has taken on this timeless topic of feature engineering. Her approach is direct, pragmatic, and practical. As the author of the popular Feature-engine, a Python library for feature engineering, and a respected machine learning educator, Sole is uniquely qualified to cover this topic.

The third edition of this book, which you have in your hands now, provides updated guidelines for selecting methods based on the data and the model. It also covers the integration of scikit-learn with pandas through the recently released set_output API. Finally, it covers automating feature creation using decision trees.

Whether you are a beginner or an experienced practitioner, this book will provide you with practical insights, lots of code examples, and various techniques to improve your machine learning models through effective feature engineering.

Christoph Molnar

Author of Interpretable Machine Learning and Modeling Mindsets

Contributors

About the author

Soledad Galli is a bestselling data science instructor, book author, and open source Python developer. As the leading instructor at Train in Data, Sole teaches intermediate and advanced courses in machine learning that have enrolled 64k+ students worldwide and continue to receive positive reviews. Sole is also the developer and maintainer of the Python open source library Feature-engine, which offers an extensive array of methods for feature engineering and selection.

Sole worked as a data scientist in finance and insurance companies, where she developed and put into production machine learning models to assess insurance claims and credit risk and prevent fraud.

Sole has been selected multiple times as a LinkedIn voice in data science. She is passionate about sharing her knowledge and experience, which is why you’ll often hear her speaking at meetups, appearing on podcasts, or authoring articles online.

Sole is constantly looking for people like you, who can support her in enhancing the functionality of Feature-engine or delivering more and better courses, so if you are interested, contact her over social media or at her Train in Data website.

About the reviewer

Hector Patiño Rivera has been involved with machine learning for geosciences since 2015, especially on subjects related to satellite imagery. He has strong knowledge of Python and SQL and is proficient with PostgreSQL, ArcGIS, QGIS, and other GIS-related software. He is an experienced Django developer. When Hector is not programming, he loves playing tennis and hanging out with his friends.

Table of Contents

Preface

1

Imputing Missing Data

Technical requirements

Removing observations with missing data

How to do it...

How it works...

See also

Performing mean or median imputation

How to do it...

How it works...

Imputing categorical variables

How to do it...

How it works...

Replacing missing values with an arbitrary number

How to do it...

How it works...

Finding extreme values for imputation

How to do it...

How it works...

Marking imputed values

How to do it...

How it works...

There’s more…

Implementing forward and backward fill

How to do it...

How it works...

Carrying out interpolation

How to do it...

How it works...

See also

Performing multivariate imputation by chained equations

How to do it...

How it works...

See also

Estimating missing data with nearest neighbors

How to do it...

How it works...

2

Encoding Categorical Variables

Technical requirements

Creating binary variables through one-hot encoding

How to do it...

How it works...

There’s more...

Performing one-hot encoding of frequent categories

How to do it...

How it works...

There’s more...

Replacing categories with counts or the frequency of observations

How to do it...

How it works...

See also

Replacing categories with ordinal numbers

How to do it...

How it works...

There’s more...

Performing ordinal encoding based on the target value

How to do it...

How it works...

See also

Implementing target mean encoding

How to do it...

How it works…

There’s more…

Encoding with Weight of Evidence

How to do it...

How it works...

See also

Grouping rare or infrequent categories

How to do it...

How it works...

Performing binary encoding

How to do it...

How it works...

3

Transforming Numerical Variables

Transforming variables with the logarithm function

Getting ready

How to do it...

How it works...

There’s more…

Transforming variables with the reciprocal function

How to do it...

How it works...

Using the square root to transform variables

How to do it...

How it works…

Using power transformations

How to do it...

How it works...

Performing Box-Cox transformations

How to do it...

How it works...

There’s more…

Performing Yeo-Johnson transformations

How to do it...

How it works...

There’s more…

4

Performing Variable Discretization

Technical requirements

Performing equal-width discretization

How to do it...

How it works…

See also

Implementing equal-frequency discretization

How to do it...

How it works…

Discretizing the variable into arbitrary intervals

How to do it...

How it works...

Performing discretization with k-means clustering

How to do it...

How it works...

See also

Implementing feature binarization

Getting ready

How to do it...

How it works…

Using decision trees for discretization

How to do it...

How it works...

There’s more...

5

Working with Outliers

Technical requirements

Visualizing outliers with boxplots and the inter-quartile proximity rule

How to do it...

How it works…

Finding outliers using the mean and standard deviation

How to do it...

How it works…

Using the median absolute deviation to find outliers

How to do it...

How it works…

Removing outliers

How to do it...

How it works...

See also

Bringing outliers back within acceptable limits

How to do it...

How it works...

See also

Applying winsorization

How to do it...

How it works...

See also

6

Extracting Features from Date and Time Variables

Technical requirements

Extracting features from dates with pandas

Getting ready

How to do it...

How it works...

There’s more…

See also

Extracting features from time with pandas

Getting ready

How to do it...

How it works...

There’s more…

Capturing the elapsed time between datetime variables

How to do it...

How it works...

There's more...

See also

Working with time in different time zones

How to do it...

How it works...

See also

Automating the datetime feature extraction with Feature-engine

How to do it...

How it works...

7

Performing Feature Scaling

Technical requirements

Standardizing the features

Getting ready

How to do it...

How it works...

Scaling to the maximum and minimum values

Getting ready

How to do it...

How it works...

Scaling with the median and quantiles

How to do it...

How it works...

Performing mean normalization

How to do it...

How it works…

There’s more...

Implementing maximum absolute scaling

Getting ready

How to do it...

There’s more...

Scaling to vector unit length

How to do it...

How it works...

8

Creating New Features

Technical requirements

Combining features with mathematical functions

Getting ready

How to do it...

How it works...

See also

Comparing features to reference variables

How to do it…

How it works...

See also

Performing polynomial expansion

Getting ready

How to do it...

How it works...

There’s more...

Combining features with decision trees

How to do it...

How it works...

See also

Creating periodic features from cyclical variables

Getting ready

How to do it…

How it works…

Creating spline features

Getting ready

How to do it…

How it works…

See also

9

Extracting Features from Relational Data with Featuretools

Technical requirements

Setting up an entity set and creating features automatically

Getting ready

How to do it...

How it works...

See also

Creating features with general and cumulative operations

Getting ready

How to do it...

How it works...

Combining numerical features

How to do it...

How it works...

Extracting features from date and time

How to do it...

How it works...

Extracting features from text

Getting ready

How to do it...

How it works...

Creating features with aggregation primitives

Getting ready

How to do it...

How it works...

10

Creating Features from a Time Series with tsfresh

Technical requirements

Extracting hundreds of features automatically from a time series

Getting ready

How to do it...

How it works...

See also

Automatically creating and selecting predictive features from time-series data

How to do it...

How it works...

See also

Extracting different features from different time series

How to do it...

How it works...

Creating a subset of features identified through feature selection

How to do it...

How it works...

Embedding feature creation into a scikit-learn pipeline

How to do it...

How it works...

See also

11

Extracting Features from Text Variables

Technical requirements

Counting characters, words, and vocabulary

Getting ready

How to do it...

How it works...

There’s more...

See also

Estimating text complexity by counting sentences

Getting ready

How to do it...

How it works...

There’s more...

Creating features with bag-of-words and n-grams

Getting ready

How to do it...

How it works...

See also

Implementing term frequency-inverse document frequency

Getting ready

How to do it...

How it works...

See also

Cleaning and stemming text variables

Getting ready

How to do it...

How it works...

Index

Other Books You May Enjoy