Data Augmentation with Python - Duc Haba - E-Book

Duc Haba

Description

Data is paramount in AI projects, especially for deep learning and generative AI, as forecasting accuracy relies on input datasets being robust. Acquiring additional data through traditional methods can be challenging, expensive, and impractical, and data augmentation offers an economical option to extend the dataset.
The book teaches you over 20 geometric, photometric, and random erasing augmentation methods using seven real-world datasets for image classification and segmentation. You’ll also review eight open source image augmentation libraries, write object-oriented programming (OOP) wrapper functions in Python Notebooks, view color image augmentation effects, analyze safe levels and biases, explore fun facts, and take on fun challenges. As you advance, you’ll discover over 20 character and word techniques for text augmentation using two real-world datasets and excerpts from four classic books. The chapter on advanced text augmentation uses machine learning models, such as Transformer, Word2vec, BERT, and GPT-2, to extend the text dataset. The chapters on audio and tabular data include real-world data, open source libraries, amazing custom plots, and Python Notebooks, along with fun facts and challenges.
By the end of this book, you will be proficient in image, text, audio, and tabular data augmentation techniques.

You can read this e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 361

Year of publication: 2023




Data Augmentation with Python

Enhance deep learning accuracy with data augmentation methods for image, text, audio, and tabular data

Duc Haba

BIRMINGHAM—MUMBAI

Data Augmentation with Python

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Ali Abidi

Publishing Product Manager: Dinesh Chaudhary

Senior Editor: Sushma Reddy

Technical Editor: Devanshi Ayare

Copy Editor: Safis Editing

Project Manager: Kirti Pisat

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Hemangini Bari

Production Designer: Joshua Misquitta

Marketing Coordinator: Shifa Ansari & Vinishka Kalra

First published: April 2023

Production reference: 1270423

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80324-645-1

www.packtpub.com

They say love is immeasurable, but I say love is measured by two eggs sunny side up, ham, toast, and a cup of tea. My dad has made this breakfast meal for me in the morning, afternoon, evening, and late nights. From my college days till today, he made them with love. Thanks, Dad. :-)

– Duc Haba

Foreword

I recently had the distinct pleasure of interviewing Duc (pronounced “Duke”) regarding his lifelong passion for Artificial Intelligence, as part of an effort to promote an AI Hackathon he was leading. I’d known Duc for several years at this point, as we’re both engineering leaders at a premier agency in Silicon Valley, but I had no idea that he was an early pioneer in the AI industry. I was often surprised, even astonished, by the depth and breadth of his experience as we spoke about the history and future of AI. Equally impressive was the caliber of friends he’s made along the way, including global AI leaders.

What impressed me most about Duc, however, was the man himself. Yes, he’s in a rarified stratum of talent and capability, and everyone who knows him is aware of this. But despite his prodigious talent and world-class AI pedigree, Duc is a warm-hearted person, entirely down-to-earth, approachable, and even charming. He treats everyone as a peer, extending a cheerful hand of friendship. This great warmth of character permeates everything he does.

Not only is he blazing trails into unconquered technical territories, but he’s also carving those trails wide and clean so that others can follow with ease. He holds two sacred goals in mind. The first is charting a positive course into the potential of AI to unlock world-changing solutions. The second is empowering as many people as possible to join him on his journey, both to share in the wonder of it all and to provide together a fabric of conscience that wraps around AI as a living safeguard against abuse.

I believe you’ll find Duc’s sacred goals, coupled with his peculiar strengths, on full display in this book. First, he tackles the signature pain point of AI, which is a dearth of data. You see, the ultimate accuracy of every AI model depends entirely upon the quality and quantity of data that the model is derived from, but data can be prohibitively expensive, or impossible to gather at scale. Duc solves this dearth of data for every major data type: image, text, audio, and tabular. This is the first of Duc’s goals, to unlock the peerless power of AI.

But Duc doesn’t stop there. He meticulously charts and documents his techniques to make them readily available to you, dear reader, to share his most important discoveries. He makes it easy to adopt his groundbreaking work because that’s the second of his goals, to democratize AI for the betterment of us all.

The techniques in this book will exponentially expand your data sets and thus drastically improve the accuracy of your AI models. They’re a ready-made bridge to your own AI dreams.

If you look ahead, just up that wide, clean path into the magical world of AI, you can see Duc standing there with a big warm smile, waving you on.

Jonmar

Engineering and Outreach at YML, Founder of Varlio, TEDx speaker, Featured by Apple, Rolling Stone, and Guitar World magazine.

Contributors

About the author

Duc Haba is a lifelong technologist and researcher. He has been a programmer, Enterprise Mobility Solution Architect, AI Solution Architect, Principal, VP, CTO, and CEO. The companies range from startups and IPOs to enterprise companies.

Duc’s career started at Xerox PARC, researching and building rule-based expert systems for copier diagnostics, and doing skunkworks projects for the US DoD. Afterward, he joined Oracle, followed by Viant Consulting as a founding member. He dove deep into the entrepreneurial culture in Silicon Valley. There were slightly more failures than successes, but the highlights are Viant and RRKidz. Currently, he is happy working at YML.co as the AI Solution Architect.

The book is only possible with the support of my family, fellow researchers, and a small gang of professionals at Packt Publishing. Still, above all else, I hope you enjoy reading the book and hacking the Python Notebook as much as I enjoyed writing it.

About the reviewers

Krishnan Raghavan is an IT professional with over 20 years of experience in software development and delivery excellence across multiple domains and technologies, ranging from C++ to Java, Python, data warehousing, and big data tools and technologies.

When not working, Krishnan likes to spend time with his wife and daughter, as well as reading fiction, nonfiction, and technical books. Krishnan tries to give back to the community by being part of the GDG – Pune Volunteer Group, helping the team organize events. Currently, he is unsuccessfully trying to learn how to play the guitar. :)

You can connect with Krishnan via LinkedIn: www.linkedin.com/in/krishnan-raghavan.

I would like to thank my wife Anita and daughter Ananya for giving me the time and space to review this book.

Rajvardhan Oak is an Applied Scientist at Microsoft and a Ph.D. candidate in Computer Science at UC Davis, advised by the esteemed Professor Zubair Shafiq. Rajvardhan graduated from UC Berkeley with a Master’s in Information Management and Systems, where he gained valuable experience in applying ML to security issues, such as fake news and hate speech detection, adversarial machine learning, and phishing and spam detection. With an impressive resume, Rajvardhan has worked with industry giants such as Facebook and IBM. He has also been involved in Sec-ML research at UC Berkeley, IIT Kharagpur, and Princeton University.

Vitor Bianchi Lanzetta (@vitorlanzetta) has a master’s degree in Applied Economics (University of São Paulo, USP) and works as a data scientist in a tech start-up named RedFox Digital Solutions. He has also authored a book called R Data Visualization Recipes. The things he enjoys the most are statistics, economics, and sports of all kinds (electronics included). His blog, made in partnership with Ricardo Anjoleto Farias (@R_A_Farias), can be found at ArcadeData dot org; they kindly call it R-Cade Data.

Bhavan Jasani works as an Applied Scientist at Amazon Web Services AI in San Francisco. His work focuses on multi-modal learning and vision-language. Before that, he earned his Master’s in Robotics by Research from the Robotics Institute, Carnegie Mellon University, Pittsburgh, working on multi-modal emotion recognition and visual question answering. He was also a member of the research staff at Nanyang Technological University, Singapore, working on embedded computer vision. He has reviewed for and published his work in leading computer vision conferences, including ICCV and ECCV, as well as IEEE journals.

Table of Contents

Preface

Part 1: Data Augmentation

Chapter 1: Data Augmentation Made Easy

Data augmentation role

Data input types

Image definition

Text definition

Audio definition

Tabular data definition

Python Notebook

Google Colab

Additional Python Notebook options

Installing Python Notebook

Programming styles

Source control

The PacktDataAug class

Naming convention

Extend base class

Referencing a library

Exporting Python code

Pluto

Summary

Chapter 2: Biases in Data Augmentation

Computational biases

Human biases

Systemic biases

Python Notebook

Python Notebook

GitHub

Pluto

Verifying Pluto

Kaggle ID

Image biases

State Farm distracted drivers detection

Nike shoes

Grapevine leaves

Text biases

Netflix

Amazon reviews

Summary

Part 2: Image Augmentation

Chapter 3: Image Augmentation for Classification

Geometric transformations

Flipping

Cropping

Resizing

Padding

Rotating

Translation

Noise injection

Photometric transformations

Basic and classic

Advanced and exotic

Random erasing

Combining

Reinforcing your learning through Python code

Pluto and the Python Notebook

Real-world image datasets

Image augmentation library

Geometric transformation filters

Photographic transformations

Random erasing

Combining

Summary

Chapter 4: Image Augmentation for Segmentation

Geometric and photometric transformations

Real-world segmentation datasets

Python Notebook and Pluto

Real-world data

Pandas

Viewing data images

Reinforcing your learning

Horizontal flip

Vertical flip

Rotating

Resizing and cropping

Transpose

Lighting

FancyPCA

Combining

Summary

Part 3: Text Augmentation

Chapter 5: Text Augmentation

Character augmenting

Word augmenting

Sentence augmentation

Text augmentation libraries

Real-world text datasets

The Python Notebook and Pluto

Real-world NLP datasets

Pandas

Visualizing NLP data

Reinforcing learning through Python Notebook

Character augmentation

Word augmenting

Summary

Chapter 6: Text Augmentation with Machine Learning

Machine learning models

Word augmenting

Sentence augmenting

Real-world NLP datasets

Python Notebook and Pluto

Verify

Real-world NLP data

Pandas

Viewing the text

Reinforcing your learning through the Python Notebook

Word2Vec word augmenting

BERT

RoBERTa

Back translation

Sentence augmentation

Summary

Part 4: Audio Data Augmentation

Chapter 7: Audio Data Augmentation

Standard audio augmentation techniques

Time stretching

Time shifting

Pitch shifting

Polarity inversion

Noise injection

Filters

Low-pass filter

High-pass filter

Band-pass filter

Low-shelf filter

High-shelf filter

Band-stop filter

Peak filter

Audio augmentation libraries

Real-world audio datasets

Python Notebook and Pluto

Real-world data and pandas

Listening and viewing

Reinforcing your learning

Time shifting

Time stretching

Pitch scaling

Noise injection

Polarity inversion

Low-pass filter

Band-pass filter

High-pass and other filters

Summary

Chapter 8: Audio Data Augmentation with Spectrogram

Initializing and downloading

Audio Spectrogram

Various Spectrogram formats

Mel-spectrogram and Chroma STFT plots

Spectrogram augmentation

Spectrogram images

Summary

Part 5: Tabular Data Augmentation

Chapter 9: Tabular Data Augmentation

Tabular augmentation libraries

Augmentation categories

Real-world tabular datasets

Exploring and visualizing tabular data

Data structure

First graph view

Checksum

Specialized plots

Exploring the World Series data

Transforming augmentation

Robust scaler

Standard scaler

Capping

Interaction augmentation

Regression augmentation

Operator augmentation

Mapping augmentation

Extraction augmentation

Summary

Index

Other Books You May Enjoy

Part 1: Data Augmentation

This part includes the following chapters:

Chapter 1, Data Augmentation Made Easy

Chapter 2, Biases in Data Augmentation