This book, written by an HPC expert with over 25 years of experience, guides you through enhancing model training performance using PyTorch. Here you’ll learn how model complexity impacts training time and discover performance tuning levels to expedite the process, as well as utilize PyTorch features, specialized libraries, and efficient data pipelines to optimize training on CPUs and accelerators. You’ll also reduce model complexity, adopt mixed precision, and harness the power of multicore systems and multi-GPU environments for distributed training. By the end, you'll be equipped with techniques and strategies to speed up training and focus on building stunning models.
Accelerate Model Training with PyTorch 2.X
Build more accurate models by boosting the model training process
Maicon Melo Alves
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Niranjan Naikwadi
Publishing Product Manager: Sanjana Gupta
Book Project Manager: Kirti Pisat
Content Development Editor: Manikandan Kurup
Technical Editor: Seemanjay Ameriya
Copy Editor: Safis Editing
Proofreader: Safis Editing and Manikandan Kurup
Indexer: Hemangini Bari
Production Designer: Aparna Bhagat
Senior DevRel Marketing Coordinator: Vinishka Kalra
First published: April 2024
Production reference: 1050424
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul's Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80512-010-0
www.packtpub.com
To my wife and best friend, Cristiane, for being my loving partner throughout our joint life journey. To my daughters, Giovana and Camila, for being my real treasure; I’m so proud of you. To my mom, Fatima, and brothers, Johny and Karoline, for being my safe harbor. Despite everything, I also dedicate this book to my (late) father Jorge.
– Maicon Melo Alves
Accelerating model training is critical in the area of machine learning for several reasons. As datasets grow larger and models become more complex, training times can become prohibitively long, hindering research and development progress. This is where machine learning frameworks such as PyTorch come into play, providing tools and techniques to accelerate the training process.
PyTorch, with its flexibility, GPU acceleration, optimization techniques, and distributed training capabilities, plays a crucial role in this endeavor by enabling researchers and developers to iterate quickly, train complex models efficiently, and deploy solutions faster. By leveraging PyTorch’s capabilities, practitioners can push the boundaries of what is possible in artificial intelligence and drive innovation across various domains.
Since learning all of these capabilities is not a straightforward task, this book is a great resource for all students, researchers, and professionals who intend to learn how to accelerate model training with the latest release of PyTorch in a smooth way.
This very didactic book starts by introducing how the training process works and what kinds of modifications can be made at the application and environment layers to accelerate it.
Only after that do the following chapters describe methods to accelerate model training, such as the Compile API, a novel capability launched in PyTorch 2.0 for compiling a model, and the use of specialized libraries such as OpenMP and IPEX to speed up the training process even further.
It also describes how to build an efficient data pipeline to keep the GPU working at its peak throughout the training process, how to simplify a model by reducing its number of parameters, and how to lower the numerical precision adopted by the neural network to accelerate training and decrease the memory needed to store the model.
Finally, this book also explains how to spread out the distributed training process to run on multiple CPUs and GPUs.
This book not only provides current and highly relevant content for any professional working in the field of computing who wants to learn or stay up to date, but also impresses with its extremely didactic presentation of the subject. You will certainly appreciate the quiz at the end of each chapter and the way each chapter's summary connects it to the next.
All chapters present code and usage examples. For all these reasons, I believe the book could also be successfully adopted as a supporting bibliography in undergraduate and graduate courses.
Prof. Lúcia Maria de Assumpção Drummond
Titular professor at Fluminense Federal University, Brazil
Dr. Maicon Melo Alves is a senior system analyst and academic professor who specializes in High-Performance Computing (HPC) systems. In the last five years, he has become interested in understanding how HPC systems have been used in AI applications. To better understand this topic, he completed an MBA in data science in 2021 at Pontifícia Universidade Católica of Rio de Janeiro (PUC-RIO). He has over 25 years of experience in IT infrastructure, and since 2006, he has worked with HPC systems at Petrobras, the Brazilian state energy company. He obtained his DSc degree in computer science from the Fluminense Federal University (UFF) in 2018 and has published three books as well as papers in international HPC journals.
Dimitra Charalampopoulou is a machine learning engineer with a background in technology consulting and a strong interest in AI and machine learning. She has led numerous large-scale digital transformation engineering projects for clients across the US and EMEA and has received various awards, including recognition for her start-up at the MIT Startup Competition. Additionally, she has been a speaker at two conferences in Europe on the topic of GenAI. As an advocate for women in tech, she is the founder and managing director of an NGO that promotes gender equality in tech and has taught programming classes to female students internationally.
In this part, you will learn about performance optimization, before delving into the techniques, approaches, and strategies described throughout the book. First, you will learn about the aspects of the training process that make it so computationally heavy. After that, you will learn about the possible approaches to reduce the training time.
This part has the following chapters:
Chapter 1, Deconstructing the Training Process
Chapter 2, Training Models Faster
We already know that training neural network models takes a long time to finish. Otherwise, we would not be here discussing ways to run this process faster. But which characteristics make the building process of these models so computationally heavy? Why does the training step take so long? To answer these questions, we need to understand the computational burden of the training phase.
In this chapter, we will first remember how the training phase works under the hood and then understand what makes the training process so computationally heavy.
Here is what you will learn as part of this first chapter:
Remembering the training process
Understanding the computational burden of the training phase
Understanding the factors that influence training time
You can find the complete code of the examples mentioned in this chapter in the book's GitHub repository at https://github.com/PacktPublishing/Accelerate-Model-Training-with-PyTorch-2.X/blob/main.
You can execute this notebook in your favorite environment, such as Google Colab or Kaggle.
Before describing the computational burden imposed by neural network training, we must remember how this process works.
Important note
This section gives a very brief introduction to the training process. If you are totally unfamiliar with this topic, you should invest some time to understand this theme before moving to the following chapters. An excellent resource for learning this topic is the book entitled Machine Learning with PyTorch and Scikit-Learn, published by Packt and written by Sebastian Raschka, Yuxi (Hayden) Liu, and Vahid Mirjalili.
Basically speaking, neural networks learn from examples, similar to a child observing an adult. The learning process relies on feeding the neural network with pairs of input and output values so that the network captures the intrinsic relationship between the input and output data. Such relationships can be interpreted as the knowledge obtained by the model. So, where a human sees a bunch of data, the neural network sees veiled knowledge.
This learning process depends on the dataset used to train a model.
A dataset comprises a set of data instances related to some problem, scenario, event, or phenomenon. Each instance has features and target information corresponding to the input and output data. The concept of a dataset instance is similar to a record in a table of a relational database.
The dataset is usually split into two parts: training and testing sets. The training set is used to train the network, whereas the testing set is used to test the model against unseen data. Occasionally, we can also use another part to validate the model after each training iteration.
Let’s look at Fashion-MNIST, a famous dataset that is commonly used to test and teach neural networks. This dataset comprises 70,000 labeled images of clothes and accessories such as dresses, shirts, and sandals belonging to 10 distinct classes or categories. The dataset is split into 60,000 instances for training and 10,000 instances for testing.
As shown in Figure 1.1, a single instance of this dataset comprises a 28 x 28 grayscale image and a label identifying its class. In the case of Fashion-MNIST, we have 70,000 instances, which is often referred to as the length of the dataset.
Figure 1.1 – Concept of a dataset instance
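As a quick, hands-on illustration (this snippet is a minimal sketch and is not taken from the book's repository; the variable names are illustrative), we can load Fashion-MNIST with torchvision and inspect a single instance:

```python
import torchvision
import torchvision.transforms as transforms

# Download the training and testing splits of Fashion-MNIST
transform = transforms.ToTensor()
train_set = torchvision.datasets.FashionMNIST(
    root="./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.FashionMNIST(
    root="./data", train=False, download=True, transform=transform)

print(len(train_set), len(test_set))  # 60000 10000

# A single instance is a pair of features (the image) and target (the label)
image, label = train_set[0]
print(image.shape)  # torch.Size([1, 28, 28]): one grayscale channel, 28 x 28 pixels
print(label)        # an integer between 0 and 9 identifying the class
```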
Besides the concept of dataset instance, we also have the concept of a dataset sample. A sample is defined as a group of instances, as shown in Figure 1.2. Usually, the training process executes on samples and not just on a single dataset instance. The reason why the training process takes samples instead of single instances is related to the way the training algorithm works. Don’t worry about this topic; we will cover it in the following sections:
Figure 1.2 – Concept of a dataset sample
The number of instances in a sample is called the batch size. For example, if we divide the Fashion-MNIST training set into samples of a batch size equal to 32, we get 1,875 samples since this set has 60,000 instances.
The higher the batch size, the lower the number of samples in a training set, as pictorially described in Figure 1.3:
Figure 1.3 – Concept of batch size
With a batch size equal to eight, the dataset in the example is divided into two samples, each with eight dataset instances. On the other hand, with a lower batch size (in this case, four), the training set is divided into a higher number of samples (four samples).
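To see this relationship in practice, the following sketch (illustrative code, not from the book's repository) wraps the Fashion-MNIST training set loaded earlier in a DataLoader and shows how the batch size determines the number of samples per epoch:

```python
from torch.utils.data import DataLoader

# With a batch size of 32, the 60,000 training instances yield 1,875 samples
loader_32 = DataLoader(train_set, batch_size=32, shuffle=True)
print(len(loader_32))   # 1875

# A larger batch size produces fewer (but bigger) samples
loader_128 = DataLoader(train_set, batch_size=128, shuffle=True)
print(len(loader_128))  # 469 (the last sample holds fewer than 128 instances)
```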
The neural network receives input samples and outputs a set of results, each corresponding to an instance of the input sample. In the case of a model to tackle the image classification problem of Fashion-MNIST, the neural network receives a set of images and outputs a set of labels, as you can see in Figure 1.4. Each of these labels indicates the corresponding class of the input image:
Figure 1.4 – Neural networks work on input samples
To extract the intrinsic knowledge present in the dataset, we need to submit the neural network to a training algorithm so it can learn the pattern present in the data. Let’s jump to the next section to understand how this algorithm works.
The training algorithm is an iterative process that takes each dataset sample and adjusts the neural network parameters according to the degree of error, that is, the difference between the correct result and the predicted one.
A single training iteration is called the training step. So, the number of training steps executed in the learning process equals the number of samples used to train the model. As we stated before, the batch size defines the number of samples, which also determines the number of training steps.
After executing all the training steps, we say the training algorithm has completed a training epoch. The developer must define the number of epochs before starting the model-building process. Usually, the developer determines the number of epochs by varying it and evaluating the accuracy of the resultant model.
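To make these terms concrete, here is a skeleton of the double loop over epochs and training steps (a sketch with illustrative names, reusing the loader_32 object defined above):

```python
num_epochs = 10  # defined by the developer before starting the training process

for epoch in range(num_epochs):
    # One epoch visits every sample of the training set exactly once
    for step, (images, labels) in enumerate(loader_32):
        # Each iteration of this inner loop is one training step;
        # with a batch size of 32, there are 1,875 steps per epoch
        pass
```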
A single training step executes the four phases sequentially, as illustrated in Figure 1.5:
Figure 1.5 – The four phases of the training process
Let’s go through each one of these steps to understand their role in the entire training process.
In the forward phase, the neural network receives the input data, performs calculations, and outputs a result. This output is also known as the value predicted by the neural network. In the case of Fashion-MNIST, the input data is the grayscale image and the predicted value is the class to which the item belongs.
Among the tasks executed in the training step, the forward phase has the highest computational cost. This happens because it executes all the heavy computations involved in the neural network. Such computations, commonly known as operations, will be explained in the next section.
It is interesting to note that the forward phase is exactly the same as the inference process. When using the model in practice, we continuously execute the forward phase to infer a value or result.
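As a minimal sketch (the network architecture below is purely illustrative and is not the model used in the book), the forward phase is simply a call to the model on an input sample, and inference reuses exactly the same call:

```python
import torch
import torch.nn as nn

# A deliberately simple classifier for 28 x 28 grayscale images and 10 classes
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

images, labels = next(iter(loader_32))  # one sample of 32 instances
outputs = model(images)                 # forward phase
print(outputs.shape)                    # torch.Size([32, 10]): one prediction per instance

# Inference executes the same forward phase, just without tracking gradients
with torch.no_grad():
    predicted_classes = model(images).argmax(dim=1)
```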
After the forward phase, the neural network will output a predicted value. Then, the training algorithm needs to compare the predicted value with the expected one to see how good the prediction made by the model is.
If the predicted value is close or equal to the real value, the model is performing as expected and the training process is going in the right direction. Otherwise, the training step needs to quantify the error achieved by the model to adjust the parameters proportionally to the error degree.
Important note
In the terminology of neural networks, this error is usually referred to as loss or cost. So, it is common to see names such as loss or cost function in the literature when addressing this topic.
There are different kinds of loss functions, each one suitable for a specific sort of problem. The cross-entropy (CE) loss function is used in multiclass image classification problems, where we need to classify an image within a group of classes. For example, this loss function can be used in the Fashion-MNIST problem. Suppose we have just two classes or categories. In that case, we face a binary classification problem, so using the binary cross-entropy (BCE) function rather than the original cross-entropy loss function is recommended.
For regression problems, the loss function is completely different from the ones used in classification problems. We can use functions such as the mean squared error (MSE), which measures the average squared difference between the original value and the value predicted by the neural network.
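In PyTorch, these loss functions live in the torch.nn module; the snippet below is a sketch with random, illustrative tensors showing the three functions mentioned above:

```python
import torch
import torch.nn as nn

# Multiclass classification: cross-entropy over raw scores (logits)
ce_loss = nn.CrossEntropyLoss()
logits = torch.randn(32, 10)           # predicted scores for 10 classes
targets = torch.randint(0, 10, (32,))  # expected class indices
print(ce_loss(logits, targets))

# Binary classification: binary cross-entropy (here, the variant that accepts logits)
bce_loss = nn.BCEWithLogitsLoss()
binary_logits = torch.randn(32, 1)
binary_targets = torch.randint(0, 2, (32, 1)).float()
print(bce_loss(binary_logits, binary_targets))

# Regression: mean squared error between predicted and original values
mse_loss = nn.MSELoss()
print(mse_loss(torch.randn(32, 1), torch.randn(32, 1)))
```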
After obtaining the loss, the training algorithm calculates the partial derivatives of the loss function with respect to the current parameters of the network. This operation results in the so-called gradient, which the training process uses to adjust the network parameters.
Leaving the mathematical foundations aside, we can think of the gradient as the change we need to apply to network parameters to minimize the error or loss.
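In PyTorch code, this phase boils down to calling backward() on the loss, which populates the grad attribute of every trainable parameter; a brief sketch, reusing the illustrative model, ce_loss, images, and labels from the previous snippets:

```python
outputs = model(images)          # forward phase
loss = ce_loss(outputs, labels)  # loss calculation
loss.backward()                  # computes the gradient of the loss w.r.t. each parameter

# Each trainable parameter now holds its own gradient
for name, param in model.named_parameters():
    print(name, param.grad.shape)
```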
Important note
You can find more information about the math used in deep learning by reading the book Hands-On Mathematics for Deep Learning, published by Packt and written by Jay Dawani.
Similar to loss functions, we also have distinct implementations of optimizers. Stochastic gradient descent (SGD) and Adam are the most commonly used ones.
To finish the training step, the algorithm updates the network parameters according to the gradient obtained in the optimization phase.
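A sketch of this last phase with the two optimizers mentioned above (the learning rates are illustrative values, not recommendations from the book):

```python
import torch.optim as optim

# The optimizer receives the model parameters it is responsible for updating
optimizer = optim.SGD(model.parameters(), lr=0.01)
# optimizer = optim.Adam(model.parameters(), lr=0.001)  # drop-in alternative

optimizer.step()       # updates the parameters using the gradients computed by backward()
optimizer.zero_grad()  # clears the gradients before the next training step
```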
Important note
This section provides a theoretical explanation of the training algorithm. So, be aware that depending on the machine learning framework, the training process can have a set of phases that is different from the ones in the preceding list.
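In PyTorch, putting the four phases together, a single training step typically looks like the following sketch (again reusing the illustrative model, loss function, optimizer, and data loader defined above; the code in the book's repository may be organized differently):

```python
for epoch in range(num_epochs):
    for images, labels in loader_32:
        optimizer.zero_grad()            # clear gradients left over from the previous step
        outputs = model(images)          # 1. forward phase
        loss = ce_loss(outputs, labels)  # 2. loss calculation
        loss.backward()                  # 3. optimization phase (gradient computation)
        optimizer.step()                 # 4. parameter update
```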
Essentially, these phases constitute the computational burden of the training process. Follow me to the next section to understand how this computational burden is impacted by different factors.
Now that we've brushed up on how the training process works, let's understand the computational cost required to train a model. By the terms computational cost or burden, we mean the computing power needed to execute the training process. The higher the computational cost, the longer it takes to train the model. Likewise, the higher the computational burden, the more computing resources are required to train the model.
Essentially, we can say the computational burden of training a model is determined by three factors, as illustrated in Figure 1.6:
Figure 1.6 – Factors that influence the training computational burden
Each one of these factors contributes (to some degree) to the computational complexity imposed by the training process. Let’s talk about each one of them.
Hyperparameters define two aspects of neural networks: the neural network configuration and how the training algorithm works.
Concerning neural network configuration, the hyperparameters determine the number and type of layers and the number of neurons in each layer. Simple networks have a few layers and neurons, whereas complex networks have thousands of neurons spread across hundreds of layers. The number of layers and neurons determines the number of parameters of the network, which directly impacts the computational burden. Because of the significant influence of the number of parameters on the computational cost of the training step, we will discuss this topic later in this chapter as a separate performance factor.
Regarding how the training algorithm executes the training process, hyperparameters control the number of epochs and steps and determine the optimizer and loss function used during the training phase, among other things. Some of these hyperparameters have a tiny influence on the computational cost of the training process. For example, if we change the optimizer from SGD to Adam, we will not face any relevant impact on the computational cost of the training process.
Other hyperparameters, though, can definitely increase the time of the training phase. One of the most emblematic examples is the batch size. The higher the batch size, the fewer training steps are needed to train a model, so the training phase executes fewer steps per epoch, which can speed up the building process. On the other hand, with big batch sizes, each training step can take longer to execute because the forward phase must process higher-dimensional input data. In other words, we have a trade-off here.
For example, consider the case of a batch size equal to 32 for the Fashion-MNIST dataset. In this case, the input data dimension is 32 x 1 x 28 x 28, where 32, 1, and 28 represent the batch size, the number of channels (colors, in this scenario), and the image size, respectively. Therefore, the input data comprises 25,088 numbers, which is the amount of data the forward phase must process. However, if we increase the batch size to 128, the input data grows to 100,352 numbers, which can result in a longer time to execute a single forward phase iteration.
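These figures are easy to verify with a quick, illustrative check (not from the book's code):

```python
import torch

sample_32 = torch.randn(32, 1, 28, 28)    # one sample with a batch size of 32
sample_128 = torch.randn(128, 1, 28, 28)  # one sample with a batch size of 128
print(sample_32.numel())   # 25088
print(sample_128.numel())  # 100352
```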
In addition, a