We can no longer imagine our world without the products and services of the machine learning (ML) revolution. However, training ML models requires vast datasets, and collecting and annotating real data is a process plagued by high costs, errors, and privacy concerns. Synthetic data emerges as a promising solution to all these challenges.
This book is designed to bridge the theory and practice of using synthetic data, offering invaluable support for your ML journey. Synthetic Data for Machine Learning empowers you to tackle real data issues, enhance your ML models’ performance, and gain a deep understanding of synthetic data generation. You’ll explore the strengths and weaknesses of various approaches, gaining practical knowledge through hands-on examples of modern methods, including Generative Adversarial Networks (GANs) and diffusion models. Additionally, you’ll uncover the secrets and best practices for harnessing the full potential of synthetic data.
By the end of this book, you’ll have mastered synthetic data and positioned yourself as a market leader, ready for more advanced, cost-effective, and higher-quality data sources, setting you ahead of your peers in the next generation of ML.
Synthetic Data for Machine Learning
Revolutionize your approach to machine learning with this comprehensive conceptual guide
Abdulrahman Kerim
BIRMINGHAM—MUMBAI
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Ali Abidi
Publishing Product Managers: Dhruv J. Kataria and Anant Jain
Senior Editor: David Sugarman
Content Development Editor: Shreya Moharir
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Subalakshmi Govindhan
Production Designer: Jyoti Kadam
DevRel Marketing Coordinator: Nivedita Singh
First published: October 2023
Production reference: 1280923
Published by Packt Publishing Pvt. Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB
ISBN 978-1-80324-540-9
www.packtpub.com
To my wife, Emeni, and daughter, Taj.
Abdulrahman Kerim is a full-time lecturer at the University for the Creative Arts (UCA) and an active researcher at the School of Computing and Communications at Lancaster University, UK. He received his MSc in computer engineering, for which he developed a simulator for computer vision problems. In 2020, he started his PhD to investigate the advantages and potential of synthetic data. His research on developing novel synthetic-aware computer vision models has been recognized internationally. He has published many papers on the usability of synthetic data at top-tier venues such as the BMVC conference and the IMAVIS journal. He currently works with researchers from Google and Microsoft to overcome real-data issues, specifically for video stabilization and semantic segmentation tasks.
I would like to extend my sincere gratitude to two extraordinary individuals who have been my pillars of strength and support throughout this journey – my loving wife, Emeni, and our precious daughter, Taj.
Emeni, your boundless patience, encouragement, and understanding have been my guiding light. Your belief in my passion for writing has given me the courage to embark on this special journey. Your sacrifices and tireless efforts to create a nurturing home environment for our family have always allowed me to pursue my dreams. You are my muse and my anchor, and I am grateful for your presence in my life.
Taj, even though you may be too young to fully comprehend the significance of this endeavor, your smiles, laughter, and innocent wonder have provided me with a constant source of inspiration. Your presence has infused joy and vitality into my creative process, reminding me of the beauty in life’s simplest moments.
To both of you, thank you for your love and belief in me. Your sacrifices, encouragement, and support have been the driving force behind the creation of this book. This achievement is as much yours as it is mine.
Oishi Deb is a PhD researcher at the University of Oxford, and her research interests include AI, ML, and AI ethics. Prior to starting her PhD, Oishi gained industrial experience at Rolls-Royce, working in software engineering, data science, and ML. Oishi was named 2017 Student of the Year by the president and vice-chancellor of the University of Leicester, based on her academic performance in her undergraduate degree in software and electronics engineering. Oishi was selected for the DeepMind scholarship program, as part of which DeepMind funded her MSc in AI. Oishi is a chair for an ELLIS reading group in deep learning, and she has also served on the program committee as a reviewer for the NeurIPS and ICML conference workshops.
Leandro Soriano Marcolino is a lecturer (assistant professor) at Lancaster University. He obtained his doctorate at the University of Southern California (USC), advised by Milind Tambe. He has published papers on AI, robotics, computer vision, and ML at several key conferences, such as AAAI, AAMAS, IJCAI, CVPR, NeurIPS, ICRA, and IROS. He develops new learning and decision-making techniques for autonomous agents, usually applied at execution time. Leandro has explored several interesting domains, such as multi-agent teamwork, swarm robotics, computer Go, video games, and more recently, computer vision, including the use of simulators and synthetic data to create or improve state-of-the-art ML models.
Machine learning (ML) has made our lives far easier. We cannot imagine our world without ML-based products and services. ML models need to be trained on large-scale datasets to perform well. However, collecting and annotating real data is extremely expensive, error-prone, and subject to privacy issues, to name just a few of its disadvantages. Synthetic data is a promising solution to these real-data problems.
Synthetic Data for Machine Learning is a unique book designed to help you master synthetic data and make your learning journey enjoyable. In this book, theory and good practice complement each other to provide leading-edge support!
The book helps you to overcome real data issues and improve your ML models’ performance. It provides an overview of the fundamentals of synthetic data generation and discusses the pros and cons of each approach. It reveals the secrets of synthetic data and the best practices to leverage it better.
By the end of this book, you will have mastered synthetic data and increased your chances of becoming a market leader. It will enable you to springboard into a more advanced, cheaper, and higher-quality data source, making you well prepared and ahead of your peers for the next generation of ML!
If you are an ML practitioner or researcher who wants to overcome data problems in ML, this book is written especially for you! It assumes you have basic knowledge of ML and Python programming (no more than that!). The book was carefully designed to give you the best guidance on mastering synthetic data for ML. It builds your knowledge gradually, from synthetic data concepts and algorithms to applications, case studies, and best practices. The book is one of the pioneering works on the subject, providing leading-edge support for ML engineers, researchers, companies, and decision-makers.
Chapter 1, Machine Learning and the Need for Data, introduces you to ML. You will understand the main difference between non-learning- and learning-based solutions. Then, the chapter explains why deep learning models often achieve state-of-the-art results. Following this, it gives you a brief idea of how the training process is done and why large-scale training data is needed in ML.
Chapter 2, Annotating Real Data, explains why ML models need annotated data. You will understand why the annotation process is expensive, error-prone, and biased. You will also be introduced to the annotation process for a number of ML tasks, such as image classification, semantic segmentation, and instance segmentation, and explore the main annotation problems. Finally, you will understand why ideal ground truth generation is impossible or extremely difficult for some tasks, such as optical flow estimation and depth estimation.
Chapter 3, Privacy Issues in Real Data, highlights the main privacy issues with real data. It explains why privacy is preventing us from using large-scale real data for ML in certain fields such as healthcare and finance. It demonstrates the current approaches for mitigating these privacy issues in practice. Furthermore, you will have a brief introduction to privacy-preserving ML.
Chapter 4, An Introduction to Synthetic Data, defines synthetic data. It gives a brief history of the evolution of synthetic data. Then, it introduces you to the main types of synthetic data and the basic data augmentation approaches and techniques.
Chapter 5, Synthetic Data as a Solution, highlights the main advantages of synthetic data. In this chapter, you will learn why synthetic data is a promising solution for privacy issues. At the same time, you will understand how synthetic data generation approaches can be configured to cover rare scenarios that are extremely difficult and expensive to capture in the real world.
Chapter 6, Leveraging Simulators and Rendering Engines to Generate Synthetic Data, introduces a well-known method for synthetic data generation using simulators and rendering engines. It describes the main pipeline for creating a simulator and generating automatically annotated synthetic data. Following this, it highlights the challenges and the state-of-the-art research in this field, and briefly discusses two simulators for synthetic data generation.
Chapter 7, Exploring Generative Adversarial Networks, introduces Generative Adversarial Networks (GANs) and discusses the evolution of this method. It explains the typical architecture of a GAN. After this, the chapter illustrates the training process. It highlights some great applications of GANs including generating images and text-to-image translation. It also describes a few variations of GANs: conditional GAN, CycleGAN, CTGAN, WGAN, WGAN-GP, and f-GAN. Furthermore, the chapter is supported by a real-life case study and a discussion of the state-of-the-art research in this field.
Chapter 8, Video Games as a Source of Synthetic Data, explains why video games are used for synthetic data generation. It highlights the great advancement in this sector and discusses the current research in this direction. It also presents the challenges and promise of utilizing this approach for synthetic data generation.
Chapter 9, Exploring Diffusion Models for Synthetic Data, introduces you to diffusion models and highlights the pros and cons of this synthetic data generation approach. It sheds light on opportunities and challenges. The chapter is enriched by a discussion of the ethical issues and concerns around utilizing this synthetic data approach in practice, as well as a review of the state-of-the-art research on this topic.
Chapter 10, Case Study 1 – Computer Vision, introduces you to a multitude of industrial applications of computer vision. You will discover some of the key problems that were successfully solved using computer vision. In parallel to this, you will grasp the major issues with traditional computer vision solutions. Additionally, you will explore and comprehend thought-provoking examples of using synthetic data to improve computer vision solutions in practice.
Chapter 11, Case Study 2 – Natural Language Processing, introduces you to a different field where synthetic data is a key player. It highlights why Natural Language Processing (NLP) models require large-scale training data to converge. It shows examples of utilizing synthetic data in the field of NLP. It explains the pros and cons of real-data-based approaches. At the same time, it shows why synthetic data is the future of NLP. It supports this discussion by bringing up examples from research and industry fields.
Chapter 12, Case Study 3 – Predictive Analytics, introduces predictive analytics as another area where synthetic data has been used recently. It highlights the disadvantages of real-data-based solutions. It supports the discussion by providing examples from the industry. Following this, it sheds light on the benefits of employing synthetic data in the predictive analytics domain.
Chapter 13, Best Practices for Applying Synthetic Data, explains some fundamental domain-specific issues limiting the usability of synthetic data. It gives general comments on issues that can be seen frequently when generating and utilizing synthetic data. Then, it introduces a set of good practices that improve the usability of synthetic data in practice.
Chapter 14, Synthetic-to-Real Domain Adaptation, introduces you to a well-known issue limiting the usability of synthetic data, called the domain gap problem. It presents various approaches to bridging this gap and shows the current state-of-the-art research on synthetic-to-real domain adaptation. Then, it presents the challenges and issues in this context.
Chapter 15, Diversity Issues in Synthetic Data, introduces you to another well-known issue in the field of synthetic data, which is generating diverse synthetic datasets. It discusses different approaches to ensure high diversity even with large-scale datasets. Then, it highlights some issues and challenges in achieving diversity for synthetic data.
Chapter 16, Photorealism in Computer Vision, explains the need for photo-realistic synthetic data in computer vision. It highlights the main approaches toward photorealism, its main challenges, and its limitations. Although the chapter focuses on computer vision, the discussion can be generalized to other domains such as healthcare, robotics, and NLP.
Chapter 17, Conclusion, summarizes the book from a high-level view. It reminds you about the problems with real-data-based ML solutions. Then, it recaps the benefits of synthetic data-based solutions, challenges, and future perspectives.
You will need a version of PyCharm installed on your computer – the latest version, if possible. All code examples have been tested using Python 3.8 and PyCharm 2023.1 (Professional Edition) on Ubuntu 20.04.2 LTS. However, they should work with future versions, too.
Software/hardware covered in the book | Operating system requirements
Python 3.8+ | Windows, macOS, or Linux
PyCharm 2023.1 | Windows, macOS, or Linux
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Synthetic-Data-for-Machine-Learning. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter (now X) handles. Here is an example: “Please note that the seed parameter will help us to get diverse images in this example.”
A block of code is set as follows:
//Example of Non-learning AI (My AI Doctor!)
Patient.age          //get the patient's age
Patient.temperature  //get the patient's temperature
Patient.night_sweats //get if the patient has night sweats
Patient.cough        //get if the patient coughs

Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Synthetic Data for Machine Learning, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there – you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:

https://packt.link/free-ebook/9781803245409
Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly.

In this part, you will embark on a comprehensive journey into Machine Learning (ML). You will learn why ML is so powerful. The training process and the need for large-scale annotated data will be explored. You will investigate the main issues with annotating real data and learn why the annotation process is expensive, error-prone, and biased. Following this, you will delve into privacy issues in ML and privacy-preserving ML solutions.
This part has the following chapters:
Chapter 1, Machine Learning and the Need for Data
Chapter 2, Annotating Real Data
Chapter 3, Privacy Issues in Real Data

Machine learning (ML) is the crown jewel of artificial intelligence (AI) and has changed our lives forever. We cannot imagine our daily lives without ML tools and services such as Siri, Tesla, and others.
In this chapter, you will be introduced to ML. You will understand the main differences between non-learning and learning-based solutions. Then, you will see why deep learning (DL) models often achieve state-of-the-art results. Following this, you will get a brief introduction to how the training process is done and why large-scale training data is needed in ML.
In this chapter, we’re going to cover the following main topics:
AI, ML, and DL
Why are ML and DL so powerful?
Training ML models

Any code used in this chapter will be available in the corresponding chapter folder in this book’s GitHub repository: https://github.com/PacktPublishing/Synthetic-Data-for-Machine-Learning.
We will be using PyTorch, which is a powerful ML framework developed by Meta AI.
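As a quick sanity check (a minimal sketch, assuming PyTorch is already installed per the technical requirements above), you can verify your environment before running the examples:

import torch

print(torch.__version__)           # the installed PyTorch version
print(torch.cuda.is_available())   # True if a CUDA-capable GPU can be used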
In this section, we will learn what exactly ML is and how to differentiate between learning and non-learning AI. Before that, however, we’ll introduce ourselves to AI, ML, and DL.
There are different definitions of AI. However, one of the best is John McCarthy’s. McCarthy was the first to coin the term artificial intelligence, in one of his proposals for the 1956 Dartmouth Conference, and he shaped the outlines of the field through many major contributions, such as the Lisp programming language, utility computing, and timesharing. According to the father of AI in What is Artificial Intelligence? (https://www-formal.stanford.edu/jmc/whatisai.pdf):
It is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable.
AI is about making computers, programs, machines, or others mimic or imitate human intelligence. As humans, we perceive the world, which is a very complex task, and we reason, generalize, plan, and interact with our surroundings. Although it is fascinating to master these tasks within just a few years of our childhood, the most interesting aspect of our intelligence is the ability to improve the learning process and optimize performance through experience!
Unfortunately, we have still barely scratched the surface of understanding our own brains, intelligence, and other associated faculties such as vision and reasoning. Thus, the journey of creating “intelligent” machines started only relatively recently in the span of civilization and written history. One of the most flourishing directions of AI has been learning-based AI.
AI can be seen as an umbrella that covers two types of intelligence: learning and non-learning AI. It is important to distinguish between AI that improves with experience and one that does not!
For example, let’s say you want to use AI to improve the accuracy of a physician identifying a certain disease, given a set of symptoms. You can create a simple recommendation system based on some generic cases by asking domain experts (senior physicians). The pseudocode for such a system is shown in the following code block:
//Example of Non-learning AI (My AI Doctor!)
Patient.age          //get the patient's age
Patient.temperature  //get the patient's temperature
Patient.night_sweats //get if the patient has night sweats
Patient.cough        //get if the patient coughs

// AI program starts
if Patient.age > 70:
    if Patient.temperature > 39 and Patient.cough:
        print("Recommend Disease A")
        return
elif Patient.age < 10:
    if Patient.temperature > 37 and not Patient.cough:
        if Patient.night_sweats:
            print("Recommend Disease B")
            return
else:
    print("I cannot resolve this case!")
    return

This program mimics how a physician might reason about a similar scenario. Using simple if-else statements and a few lines of code, we can bring “intelligence” to our program.
Important note
This is an example of non-learning-based AI. As you may expect, the program will not evolve with experience. In other words, the logic will not improve with more patients, though the program still represents a clear form of AI.
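If you would like to run the idea rather than read pseudocode, here is a minimal runnable Python sketch of the same rule-based doctor (the Patient class and the sample values are illustrative assumptions, not the book’s code):

from dataclasses import dataclass

@dataclass
class Patient:
    age: int
    temperature: float
    night_sweats: bool
    cough: bool

def recommend(p: Patient) -> str:
    # Hand-written rules elicited from domain experts; nothing is learned
    if p.age > 70:
        if p.temperature > 39 and p.cough:
            return "Recommend Disease A"
    elif p.age < 10:
        if p.temperature > 37 and not p.cough and p.night_sweats:
            return "Recommend Disease B"
    return "I cannot resolve this case!"

print(recommend(Patient(age=75, temperature=39.5, night_sweats=False, cough=True)))

No matter how many patients this program sees, its rules never change – that is precisely what separates it from ML.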
In this section, we learned about AI and explored how to distinguish between learning and non-learning-based AI. In the next section, we will look at ML.
ML is a subset of AI. The key idea of ML is to enable computer programs to learn from experience. The aim is to allow programs to learn without humans needing to dictate the rules. In the example of the AI doctor from the previous section, the main issue is creating the rules. This process is extremely difficult, time-consuming, and error-prone. For the program to work properly, you would need to ask experienced/senior physicians to express the logic they usually use to handle similar patients. In other scenarios, we do not know exactly what the rules are or what mechanisms are involved in the process, as with object recognition and object tracking.
ML comes as a solution to learning the rules that control the process by exploring special training data collected for this task (see Figure 1.1):
Figure 1.1 – ML learns implicit rules from data
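To make the contrast with the hand-coded doctor concrete, here is a minimal sketch (the toy data values and the model are illustrative assumptions, not the book’s code) in which a tiny PyTorch model learns a decision rule from labeled examples instead of having it dictated:

import torch
import torch.nn as nn

# Toy training data (made-up, normalized): [age/100, temperature/40]
X = torch.tensor([[0.75, 0.99], [0.80, 1.00], [0.30, 0.92], [0.25, 0.93]])
y = torch.tensor([[1.], [1.], [0.], [0.]])   # 1 = disease, 0 = healthy

model = nn.Linear(2, 1)                      # the "rules" are learnable weights
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(500):                         # learning from experience
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)              # penalize wrong predictions
    loss.backward()                          # compute gradients
    optimizer.step()                         # update the implicit "rules"

Here, no physician wrote any if-else logic; the weights were adjusted from data.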
ML has three major types: supervised, unsupervised, and reinforcement learning. The main difference between them comes from the nature of the training data used and the learning process itself. This is usually related to the problem and the available training data.
DL is a subset of ML, and it can be seen as the heart of ML (see Figure 1.2). Most of the amazing applications of ML are possible because of DL. DL learns and discovers complex patterns and structures in the training data that are usually hard to capture using other ML approaches, such as decision trees. DL learns by using artificial neural networks (ANNs) composed of many layers (on the order of 10 or more), inspired by the human brain; hence the neural in the name. An ANN has three types of layers: input, output, and hidden. The input layer receives the input, while the output layer gives the prediction of the ANN. The hidden layers are responsible for discovering the hidden patterns in the training data. Generally, each layer (from the input to the output) learns a more abstract representation of the data, given the output of the previous layer. The more hidden layers your ANN has, the more complex and non-linear it will be. Thus, the ANN will have more freedom to better approximate the relationship between the input and output, or to learn your training data. For example, AlexNet is composed of 8 layers, VGGNet is composed of 16 to 19 layers, and ResNet-50 is composed of 50 layers:
Figure 1.2 – How DL, ML, and AI are related
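As a minimal sketch of the layered ANN structure just described (the layer sizes here are arbitrary assumptions for illustration, not from the book):

import torch.nn as nn

# A small ANN: input layer -> two hidden layers -> output layer
model = nn.Sequential(
    nn.Linear(784, 128),   # input layer: e.g., a flattened 28x28 image
    nn.ReLU(),
    nn.Linear(128, 64),    # hidden layer: learns more abstract representations
    nn.ReLU(),
    nn.Linear(64, 10),     # output layer: one score (logit) per class
)
print(sum(p.numel() for p in model.parameters()))  # number of trainable weights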
The main issue with DL is that it requires a large-scale training dataset to converge because we usually have a tremendous number of parameters (weights) to tweak to minimize the loss. In ML, loss is a way to penalize wrong predictions. At the same time, it is an indication of how well the model is learning the training data. Collecting and annotating such large datasets is extremely hard and expensive.
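For example, a common classification loss in PyTorch can be computed as follows (the tensor values are illustrative):

import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 0.5, 0.3]])    # the model's raw scores for 3 classes
print(loss_fn(logits, torch.tensor([0])))   # low loss: prediction matches the label
print(loss_fn(logits, torch.tensor([2])))   # higher loss: prediction is wrong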
Nowadays, using synthetic data as an alternative or complement to real data is a hot topic in both research and industry. Many companies, such as Google (whose Waymo utilizes synthetic data to train autonomous cars) and Microsoft (which uses synthetic data to handle privacy issues with sensitive data), have recently started investing in synthetic data to train next-generation ML models.
Although most AI fields have been flourishing and gaining more attention recently, ML and DL have been the most influential. This is because of several factors that make them a distinctly better solution in terms of accuracy, performance, and applicability. In this section, we are going to look at some of these essential factors.
In traditional AI, it is compulsory to design the features for the task manually. This process is extremely difficult, time-consuming, and task/problem-dependent. If you want to write a program, say, to recognize car wheels, you probably need to use some filters to extract edges and corners and then utilize these extracted features to identify the target object. As you may anticipate, it is not always easy to know which features to select or ignore. Imagine developing an AI-based solution to predict whether a patient has COVID-19 based on a set of symptoms at the very beginning of the pandemic. At that time, human experts did not know how to answer such questions. ML and DL can solve such problems.
DL models learn to automatically extract useful features by learning hidden patterns, structures, and associations in the training data. A loss is used to guide the learning process and help the model achieve the objectives of the training process. However, for the model to converge, it needs to be exposed to sufficiently diverse training data.
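The difference can be sketched in a few lines of PyTorch (a hedged illustration: the Sobel kernel is a classic hand-designed edge filter, while the convolutional layer’s filters are learned during training):

import torch
import torch.nn as nn

image = torch.randn(1, 1, 28, 28)   # a dummy grayscale image batch

# Traditional approach: a hand-designed Sobel filter for vertical edges
sobel = torch.tensor([[[[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]]])
edges = torch.conv2d(image, sobel, padding=1)

# DL approach: a convolutional layer whose 8 filters are learned from data
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
features = conv(image)
print(edges.shape, features.shape)  # [1, 1, 28, 28] and [1, 8, 28, 28]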
One strong advantage of DL