Building accurate machine learning models requires quality data—lots of it. However, for most teams, assembling massive datasets is time-consuming, expensive, or downright impossible. Led by Margaux Masson-Forsythe, a seasoned ML engineer and advocate for surgical data science and climate AI advancements, this hands-on guide to active machine learning demonstrates how to train robust models with just a fraction of the data using Python's powerful active learning tools.
You’ll master the fundamental techniques of active learning, such as membership query synthesis, stream-based sampling, and pool-based sampling, and gain insights into designing and implementing active learning algorithms with query strategy and human-in-the-loop frameworks. Exploring various active machine learning techniques, you’ll learn how to enhance the performance of computer vision models for tasks such as image classification, object detection, and semantic segmentation, and delve into an active ML method for selecting the most informative frames for labeling in large videos while addressing duplicated data. You’ll also assess the effectiveness and efficiency of active machine learning systems through performance evaluation.
By the end of the book, you’ll be able to enhance your active learning projects by leveraging Python libraries, frameworks, and commonly used tools.
You can read this e-book in Legimi apps or in any app that supports the following format:
Page count: 198
Year of publication: 2024
Active Machine Learning with Python
Refine and elevate data quality over quantity with active learning
Margaux Masson-Forsythe
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Niranjan Naikwadi
Publishing Product Manager: Tejashwini R
Book Project Manager: Kirti Pisat
Senior Editor: Vandita Grover
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Proofreader: Safis Editing
Indexer: Manju Arasan
Production Designer: Vijay Kamble
DevRel Marketing Coordinator: Vinishka Kalra
First published: March 2024
Production reference: 1270324
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-83546-494-6
www.packtpub.com
To my beloved wife, Heather Masson-Forsythe, whose unwavering kindness and support are my pillars of strength with every new intense project I undertake each week.
Margaux Masson-Forsythe is a skilled machine learning engineer and advocate for advancements in surgical data science and climate AI. As the director of machine learning at Surgical Data Science Collective, she builds computer vision models to detect surgical tools in videos and track procedural motions. Masson-Forsythe manages a multidisciplinary team and oversees model implementation, data pipelines, infrastructure, and product delivery. With a background in computer science and expertise in machine learning, computer vision, and geospatial analytics, she has worked on projects related to reforestation, deforestation monitoring, and crop yield prediction.
Mourya Boggarapa is a deep learning software engineer specializing in the end-to-end integration of large language models for custom AI accelerators. He holds a master’s degree in software engineering from Carnegie Mellon University. Prior to his current role, Mourya honed his skills through diverse experiences: developing backend systems for a major bank, building development infrastructure for a tech giant, and some mobile app development. He cultivated a comprehensive understanding of software development across various domains. His primary passion lies in deep learning. Additionally, he maintains a keen interest in human-computer interaction, aiming to bridge the gap between tech and human experience.
In the rapidly evolving landscape of machine learning (ML), active ML has emerged as a transformative approach that optimizes the learning process by selectively querying the most informative data points from unlabeled datasets. This part of the book lays out the foundational principles, the core strategies (uncertainty sampling, query-by-committee, expected model change, expected error reduction, and density-weighted methods), and the considerations essential for understanding and implementing active ML effectively. Through a structured exploration, we aim to equip you with a solid grounding in best practices for managing the human in the loop, covering labeling interface design, effective workflows, strategies for handling model-label disagreements, and finding and efficiently managing adequate labelers.
This part includes the following chapters:
Chapter 1, Introducing Active Machine Learning
Chapter 2, Designing Query Strategy Frameworks
Chapter 3, Managing the Human in the Loop

Machine learning models require large, labeled datasets, which can be expensive and time-consuming to obtain. Active machine learning (active ML) minimizes the labeling effort needed by intelligently choosing which data points a human should label. In this book, you will gain the necessary knowledge to understand active learning, including its mechanisms and applications. With these fundamentals, the subsequent chapters will equip you with concrete skills to implement active learning techniques on your own.
By the end of this book, you will have practical experience with state-of-the-art strategies to minimize labeling costs and maximize model performance. You will be able to apply active learning to enhance the efficiency and adaptability of your models across different application areas, such as vision and language.
To begin with, this chapter provides an introduction to active ML and explains how it can improve model accuracy using fewer labeled examples. By the end of the chapter, you will have covered the following:
Understanding active machine learning systems
Exploring query strategy scenarios
Comparing active and passive learning

Active machine learning (active ML) is a powerful approach that seeks to create predictive models with remarkable accuracy while minimizing the number of labeled training examples required. This is achieved by selectively choosing the most informative data points to be labeled by a knowledgeable oracle, such as a human annotator. By doing so, active learning enables models to extract the knowledge they need from a relatively small amount of data.
Now, let’s explore some definitions and the fundamental concepts that form the foundation of active ML.
Active learning can be defined as a dynamic and iterative approach to machine learning, where the algorithm intelligently engages with an oracle to label new data points. An oracle is a source that provides labels for data points queried by the active learner. The oracle acts as a teacher, guiding the model by providing labels for its most informative queries. Typically, oracles are human annotators or experts who can manually assign labels to new data points. However, oracles can also be simulation engines, crowdsourcing services, or other systems capable of labeling.
The key objective of active ML is to select and prioritize the most informative data points for the model, achieving higher accuracy while minimizing the need for extensive training labels. This contrasts with traditional supervised learning methods, which rely on large datasets of pre-labeled examples to train models to predict outcomes, and with unsupervised learning methods, which work with unlabeled data, seeking patterns or structures without explicit instruction on the outcomes. Active learning bridges these approaches with a semi-supervised learning strategy. By actively engaging with the data and carefully choosing which samples to label, the model learns and adapts over time, continuously improving its predictive capabilities while reducing the need for extensive labeling efforts. As a result, active ML not only saves time and resources but also enables machine learning models to achieve higher accuracy and better generalization, opening the door to more advanced and intelligent machine learning systems.
Active learning is a highly versatile technique that can significantly enhance efficiency and model performance across a wide range of applications. It does so by directing human labeling efforts to areas where they can have the most impact.
This approach has proven to be particularly effective in computer vision applications, such as image classification, object detection, and image segmentation. By selectively acquiring labels for ambiguous images that traditional sampling methods often miss, active learning can reduce costs and improve accuracy. It does this by identifying the most informative edge cases to query, allowing for accurate results with fewer labeled samples. For example, if we consider a self-driving car object-detection model that needs to identify various objects such as people, trees, and other cars, we can utilize active learning to prioritize the classes that it may struggle to learn.
In natural language tasks, such as document classification and translation, active learners play a crucial role in filling gaps in linguistic coverage. By querying sentences that cover rare vocabulary and structures, active learning improves adaptation and overall performance. The labeling process is focused only on the most useful examples, minimizing the need for extensive labeling efforts.
Anomaly detection is another domain where active learning proves to be highly effective. By targeting rare outliers and anomalies, which are critical for identifying issues such as fraud, active learning improves the detection of these important but uncommon examples. By focusing human reviews on unusual cases, active learning enhances the overall accuracy of anomaly detection systems.
Recommendation systems heavily rely on user feedback, and active learning provides a framework for acquiring this feedback intelligently. By querying users on their preferences for certain content, active learning gathers focused signals that can be used to fine-tune recommendations. For example, streaming services can use active learning techniques to improve the accuracy and relevance of their video suggestions.
In the field of medical diagnosis, active learning techniques play a vital role in minimizing physician time spent on common diagnoses. By identifying challenging cases that require expert input, active learning ensures that effort is focused on ambiguous examples that can significantly improve diagnostic model performance.
Active learning provides both the algorithms and mechanisms necessary to efficiently focus human effort on useful areas across various applications. By selectively acquiring labels, it overcomes the inherent costs and challenges associated with supervised machine learning, making it an invaluable tool in the field of artificial intelligence. Across science, engineering, and technology, the ability to intelligently guide data collection and labeling can accelerate progress with minimal human effort.
Now, let’s move ahead to discuss the key components of an active learning system and how they apply to all the applications we have just mentioned.
Active ML systems comprise four key elements:
Unlabeled dataset: This pool of unlabeled data points is what the active learner can query from. It may contain tens, hundreds, or even millions of examples.
Query strategy: This is the core mechanism of active learning. It guides how the system selects which data points to query labels for. Different criteria can be used, which we will explore later.
Machine learning model: The underlying predictive model being trained, such as a neural network, random forest, or SVM.
Oracle: The source that provides labels. This is typically a human annotator who can manually label queried data points.

How do the key components just mentioned interact with each other? Figure 1.1 depicts the interaction between various components of an active ML loop:
Figure 1.1 – Active ML loop
Models engage in an iterative loop, such as the following:
1. The query strategy identifies the most useful data points to label.
2. These are labeled by the oracle (human annotator).
3. The newly labeled data is used to train the machine learning model.
4. The updated model is then used to inform the next round of querying and labeling.

This loop allows active learning models to intelligently explore datasets, acquiring new training labels that maximize information gain.
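The loop above can be sketched in a few lines of Python. The following is a minimal, illustrative example, not the book's reference implementation: it assumes a pool-based setup with a simple uncertainty-based query strategy, and it uses hidden ground-truth labels as a stand-in for the human oracle. The dataset, model, and labeling budget are all arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative unlabeled pool; y_oracle stands in for the
# human annotator who would normally provide labels on demand.
X_pool, y_oracle = make_classification(n_samples=500, random_state=0)

# Seed the training set with a handful of labeled points.
labeled = list(range(10))
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):  # labeling budget: 20 oracle queries
    # Step 3-4: (re)train the model on all labels acquired so far.
    model.fit(X_pool[labeled], y_oracle[labeled])

    # Step 1: query strategy -- pick the unlabeled point the model
    # is least certain about (predicted probability closest to 0.5).
    probs = model.predict_proba(X_pool[unlabeled])[:, 1]
    query = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]

    # Step 2: the "oracle" labels the queried point.
    labeled.append(query)
    unlabeled.remove(query)

print(f"Labeled {len(labeled)} of {len(X_pool)} points")
```

Each pass through the loop retrains the model, asks it where it is least confident, and spends one unit of the labeling budget there, which is exactly the interaction Figure 1.1 describes.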
In the next section, we will dig deeper into the query strategy step by first examining the various scenarios that one can choose from.
Active learning can be implemented in different ways, depending on the nature of the unlabeled data and how the queries are performed. There are three main scenarios to consider when implementing active learning:
Membership query synthesis
Stream-based selective sampling
Pool-based sampling

These scenarios offer different ways to optimize and improve the active learning process. Understanding these scenarios can help you make informed decisions and choose the most suitable approach for your specific needs. In this section, we will explore each of these scenarios.
In membership query synthesis, the active learner has the ability to create its own unlabeled data points in order to improve its training. This is done by generating new data points from scratch and then requesting labels from the oracle, as depicted in Figure 1.2. By incorporating these newly labeled data points into its training set, the model becomes more robust and accurate:
Figure 1.2 – Membership query synthesis workflow
Let’s consider an image classifier as an example. With the power of synthesis, the active learner can create new images by combining various shapes, textures, and colors in different compositions. This allows the model to explore a wide range of possibilities and learn to recognize patterns and features that may not have been present in the original labeled data.
Similarly, a text classifier can also benefit from membership query synthesis. By generating new sentences and paragraphs with specific words or structures, the model can expand its understanding of different language patterns and improve its ability to classify text accurately.
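To make the idea concrete, here is a toy sketch of membership query synthesis. Both the synthesis rule (simple interpolation between two existing examples) and the `oracle` function are hypothetical stand-ins: a real system would use a generative model to synthesize data and a human annotator to label it.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_query(x_a, x_b):
    """Create a new unlabeled point by interpolating between two
    existing examples -- a toy stand-in for a generative model."""
    alpha = rng.uniform(0.3, 0.7)
    return alpha * x_a + (1 - alpha) * x_b

def oracle(x):
    """Hypothetical oracle: labels a point by the sign of its first
    feature, standing in for a human annotator."""
    return int(x[0] > 0)

# Two existing examples from opposite classes.
x_a = np.array([2.0, 1.0])
x_b = np.array([-2.0, -1.0])

# The learner synthesizes a point between them -- likely near the
# decision boundary -- and asks the oracle to label it.
x_new = synthesize_query(x_a, x_b)
y_new = oracle(x_new)
print(x_new, y_new)
```

The key point is that `x_new` never existed in any dataset: the learner manufactured it specifically to probe a region it is unsure about, then paid one oracle query to learn its label.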
There are several advantages of membership query synthesis:
The model has complete control over the data points it queries, allowing it to focus on corner cases and unusual examples that normal sampling might overlook. This helps to reduce overfitting and improve the model’s generalization by increasing the diversity of the data.
By synthesizing data, the model can actively explore its weaknesses rather than rely on what is in the training data.
This is useful for problems where data synthesis is straightforward, such as simple tabular data and sequences.

However, there are also several disadvantages to using this scenario:
It requires the ability to synthesize new useful data points accurately. This can be extremely difficult for complex real-world data such as images, audio, and video.
Data synthesis does not work well for high-dimensional, nuanced data; the generated points are often unnatural.
It is less practical for real-world applications today compared to pool-based sampling, though advances in generative modeling may improve synthesis over time.