Data Labeling in Machine Learning with Python - Vijaya Kumar Suda - E-Book


Description

Data labeling is the invisible hand that guides the power of artificial intelligence and machine learning. In today’s data-driven world, mastering data labeling is not just an advantage, it’s a necessity. Data Labeling in Machine Learning with Python empowers you to unearth value from raw data, create intelligent systems, and influence the course of technological evolution.
With this book, you'll discover the art of employing summary statistics, weak supervision, programmatic rules, and heuristics to assign labels to unlabeled training data programmatically. As you progress, you'll be able to enhance your datasets by mastering the intricacies of semi-supervised learning and data augmentation. Venturing further into the data landscape, you'll immerse yourself in the annotation of image, video, and audio data, harnessing the power of Python libraries such as seaborn, matplotlib, cv2, librosa, openai, and langchain. With hands-on guidance and practical examples, you'll gain proficiency in annotating diverse data types effectively.
By the end of this book, you’ll have the practical expertise to programmatically label diverse data types and enhance datasets, unlocking the full potential of your data.

You can read this e-book in the Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Number of pages: 454

Year of publication: 2024




Data Labeling in Machine Learning with Python

Explore modern ways to prepare labeled data for training and fine-tuning ML and generative AI models

Vijaya Kumar Suda


Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Niranjan Naikwadi

Publishing Product Manager: Sanjana Gupta

Book Project Manager: Hemangi Lotlikar

Content Development Editor: Shreya Moharir

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Proofreader: Safis Editing

Indexer: Tejal Soni

Production Designer: Joshua Misquitta

DevRel Marketing Coordinator: Vinishka Kalra

First published: January 2024

Production reference: 1300124

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-80461-054-1

www.packtpub.com

Acknowledgments

I extend my heartfelt gratitude to my mother, Rajya Lakshmi Suda, and dedicate this work to the cherished memory of my father, Koteswara Rao Suda. Their sacrifices and unwavering determination have been a profound source of inspiration.

Special thanks to my wife, Radhika, for her enduring support and patience throughout the writing of this book.

To my son, Chandra Suda (Rise Global Winner 2023), and daughter, Akshaya, your talents and creativity have shown me the beautiful evolution of skill.

I am deeply appreciative of my siblings, Rama Devi, Swarna Kumar, and Dr. Sri Kumar, for their continuous support.

A sincere acknowledgment to my mentors and managers, Kevin Fleck and Des Quinta, for their invaluable support and motivation throughout the writing process of this book.

Finally, I want to thank the Packt Publishing team, especially Shreya and Hemangi, for their fantastic support, which made the writing process an absolute pleasure.

Contributors

About the author

Vijaya Kumar Suda is a seasoned data and AI professional, boasting over two decades of expertise collaborating with global clients. Having resided and worked in diverse locations such as Switzerland, Belgium, Mexico, Bahrain, India, Canada, and the USA, Vijaya has successfully assisted customers spanning various industries. Currently serving as a senior data and AI consultant at Microsoft, he is instrumental in guiding industry partners through their digital transformation endeavors using cutting-edge cloud technologies and AI capabilities. His proficiency encompasses architecture, data engineering, machine learning, generative AI, and cloud solutions. Vijaya also shares his insights through engaging videos on the cloud, data, and AI on his YouTube channel, Cloud & Data Science (https://youtu.be/piVqFcuBV2c).

About the reviewers

Pritesh Kanani is a full stack developer with experience in data wrangling and supervised machine learning. He helped a major oil and gas company build a tool to monitor drilling operations and handle thousands of high-frequency data streams. He completed a postgraduate course in applied AI and is currently utilizing his full stack data science and cloud computing skills at a leading nuclear and renewable energy organization in Ontario, Canada.

Sourav Roy is a passionate data enthusiast, an experienced machine learning practitioner, and an expert book reviewer with a focus on literature linked to data. He possesses a diverse skill set in data engineering and data analytics, which allows him to combine technical proficiency with a deep passion for his work on data-centric books. Sourav obtained a master’s degree in data science and analytics from Queen’s University. He is presently employed as a data engineer in the banking sector.

Mitesh Mangaonkar is an engineering leader pioneering generative AI to transform data platforms. As a tech lead at Airbnb, he builds cutting-edge data pipelines leveraging big data technologies and modern data stacks to power trust and safety products. Previously, at AWS, Mitesh helped Fortune 500 companies migrate their data warehouses to the cloud and engineered highly scalable, resilient systems. An innovator at heart, he combines deep data engineering expertise with a passion for AI to create the next generation of data products. Mitesh is an influential voice shaping the future of data engineering and governance.

Table of Contents

Preface

Part 1: Labeling Tabular Data

1

Exploring Data for Machine Learning

Technical requirements

EDA and data labeling

Understanding the ML project life cycle

Defining the business problem

Data discovery and data collection

Data exploration

Data labeling

Model training

Model evaluation

Model deployment

Introducing Pandas DataFrames

Summary statistics and data aggregates

Summary statistics

Data aggregates of the feature for each target class

Creating visualizations using Seaborn for univariate and bivariate analysis

Univariate analysis

Bivariate analysis

Profiling data using the ydata-profiling library

Variables section

Interactions section

Correlations

Missing values

Sample data

Unlocking insights from data with OpenAI and LangChain

Summary

2

Labeling Data for Classification

Technical requirements

Predicting labels with LLMs for tabular data

Data labeling using Snorkel

What is Snorkel?

Why is Snorkel popular?

Loading unlabeled data

Creating the labeling functions

Labeling rules

Constants

Labeling functions

Creating a label model

Predicting labels

Labeling data using the Compose library

Labeling data using semi-supervised learning

What is semi-supervised learning?

What is pseudo-labeling?

Labeling data using K-means clustering

What is unsupervised learning?

K-means clustering

Inertia

Dunn's index

Summary

3

Labeling Data for Regression

Technical requirements

Using summary statistics to generate housing price labels

Finding the closest labeled observation to match the label

Using semi-supervised learning to label regression data

Pseudo-labeling

Using data augmentation to label regression data

Using k-means clustering to label regression data

Summary

Part 2: Labeling Image Data

4

Exploring Image Data

Technical requirements

Visualizing image data using Matplotlib in Python

Loading the data

Checking the dimensions

Visualizing the data

Checking for outliers

Performing data preprocessing

Checking for class imbalance

Identifying patterns and relationships

Evaluating the impact of preprocessing

Practice example of visualizing data

Practice example for adding annotations to an image

Practice example of image segmentation

Practice example for feature extraction

Analyzing image size and aspect ratio

Impact of aspect ratios on model performance

Image resizing

Image normalization

Performing transformations on images – image augmentation

Summary

5

Labeling Image Data Using Rules

Technical requirements

Labeling rules based on image visualization

Image labeling using rules with Snorkel

Weak supervision

Rules based on the manual visualization of an image’s object color

Real-world applications

A practical example of plant disease detection

Labeling images using rules based on properties

Bounding boxes

Example 1 – image classification – a bicycle with and without a person

Example 2 – image classification – dog and cat images

Labeling images using transfer learning

Example – digit classification using a pre-trained classifier

Example – person image detection using the YOLO V3 pre-trained classifier

Example – bicycle image detection using the YOLO V3 pre-trained classifier

Labeling images using transformations

Summary

6

Labeling Image Data Using Data Augmentation

Technical requirements

Training support vector machines with augmented image data

Kernel trick

Data augmentation

Image data augmentation

Implementing an SVM with data augmentation in Python

Introducing the CIFAR-10 dataset

Loading the CIFAR-10 dataset in Python

Preprocessing the data for SVM training

Implementing an SVM with the default hyperparameters

Evaluating SVM on the original dataset

Implementing an SVM with an augmented dataset

Training the SVM on augmented data

Evaluating the SVM’s performance on the augmented dataset

Image classification using the SVM with data augmentation on the MNIST dataset

Convolutional neural networks using augmented image data

How CNNs work

Practical example of a CNN using data augmentation

CNN using image data augmentation with the CIFAR-10 dataset

Summary

Part 3: Labeling Text, Audio, and Video Data

7

Labeling Text Data

Technical requirements

Real-world applications of text data labeling

Tools and frameworks for text data labeling

Exploratory data analysis of text

Loading the data

Understanding the data

Cleaning and preprocessing the data

Exploring the text’s content

Analyzing relationships between text and other variables

Visualizing the results

Exploratory data analysis of sample text data set

Exploring Generative AI and OpenAI for labeling text data

GPT models by OpenAI

Zero-shot learning capabilities

Text classification with OpenAI models

Data labeling assistance

OpenAI API overview

Use case 1 – summarizing the text

Use case 2 – topic generation for news articles

Use case 3 – classification of customer queries using the user-defined categories and sub-categories

Use case 4 – information retrieval using entity extraction

Use case 5 – aspect-based sentiment analysis

Hands-on labeling of text data using the Snorkel API

Hands-on text labeling using Logistic Regression

Hands-on label prediction using K-means clustering

Generating labels for customer reviews (sentiment analysis)

Summary

8

Exploring Video Data

Technical requirements

Loading video data using cv2

Extracting frames from video data for analysis

Extracting features from video frames

Color histogram

Optical flow features

Motion vectors

Deep learning features

Appearance and shape descriptors

Visualizing video data using Matplotlib

Frame visualization

Temporal visualization

Motion visualization

Labeling video data using k-means clustering

Overview of data labeling using k-means clustering

Example of video data labeling using k-means clustering with a color histogram

Advanced concepts in video data analysis

Motion analysis in videos

Object tracking in videos

Facial recognition in videos

Video compression techniques

Real-time video processing

Video data formats and quality in machine learning

Common issues in handling video data for ML models

Troubleshooting steps

Summary

9

Labeling Video Data

Technical requirements

Capturing real-time video

Key components and features

A hands-on example to capture real-time video using a webcam

Building a CNN model for labeling video data

Using autoencoders for video data labeling

A hands-on example to label video data using autoencoders

Transfer learning

Using the Watershed algorithm for video data labeling

A hands-on example to label video data segmentation using the Watershed algorithm

Computational complexity

Performance metrics

Real-world examples for video data labeling

Advances in video data labeling and classification

Summary

10

Exploring Audio Data

Technical requirements

Real-life applications for labeling audio data

Audio data fundamentals

Hands-on with analyzing audio data

Example code for loading and analyzing sample audio file

Best practices for audio format conversion

Example code for audio data cleaning

Extracting properties from audio data

Tempo

Chroma features

Mel-frequency cepstral coefficients (MFCCs)

Zero-crossing rate

Spectral contrast

Considerations for extracting properties

Visualizing audio data with matplotlib and Librosa

Waveform visualization

Loudness visualization

Spectrogram visualization

Mel spectrogram visualization

Considerations for visualizations

Ethical implications of audio data

Recent advances in audio data analysis

Troubleshooting common issues during data analysis

Troubleshooting common installation issues for audio libraries

Summary

11

Labeling Audio Data

Technical requirements

Downloading FFmpeg

Azure Machine Learning

Real-time voice classification with Random Forest

Transcribing audio using the OpenAI Whisper model

Step 1 – importing the Whisper model

Step 2 – loading the base Whisper model

Step 3 – setting up FFmpeg

Step 4 – transcribing the YouTube audio using the Whisper model

Classifying a transcription using Hugging Face transformers

Hands-on – labeling audio data using a CNN

Exploring audio data augmentation

Introducing Azure Cognitive Services – the speech service

Creating an Azure Speech service

Speech to text

Speech translation

Summary

12

Hands-On Exploring Data Labeling Tools

Technical requirements

Azure Machine Learning data labeling

Label Studio

pyOpenAnnotate

Data labeling using Azure Machine Learning

Benefits of data labeling with Azure Machine Learning

Data labeling steps using Azure Machine Learning

Image data labeling with Azure Machine Learning

Text data labeling with Azure Machine Learning

Audio data labeling using Azure Machine Learning

Integration of the Azure Machine Learning pipeline with the labeled dataset

Exploring Label Studio

Labeling the image data

Labeling the text data

Labeling the video data

pyOpenAnnotate

Computer Vision Annotation Tool

Comparison of data labeling tools

Advanced methods in data labeling

Active learning

Semi-automated labeling

Summary

Index

Other Books You May Enjoy

Part 1: Labeling Tabular Data

This part of the book will guide you in exploring tabular data and labeling it programmatically with Python libraries such as Snorkel, using its labeling functions. You will be able to achieve this without requiring any prior data science knowledge. Additionally, it covers data labeling using K-means clustering.

This part comprises the following chapters:

Chapter 1, Exploring Data for Machine Learning
Chapter 2, Labeling Data for Classification
Chapter 3, Labeling Data for Regression