This book, written by an HPC expert with over 25 years of experience, guides you through enhancing model training performance using PyTorch. Here you’ll learn how model complexity impacts training time and discover performance tuning levels to expedite the process, as well as utilize PyTorch features, specialized libraries, and efficient data pipelines to optimize training on CPUs and accelerators. You’ll also reduce model complexity, adopt mixed precision, and harness the power of multicore systems and multi-GPU environments for distributed training. By the end, you'll be equipped with techniques and strategies to speed up training and focus on building stunning models.
Accelerate Model Training with PyTorch 2.X
Build more accurate models by boosting the model training process
Maicon Melo Alves
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Niranjan Naikwadi
Publishing Product Manager: Sanjana Gupta
Book Project Manager: Kirti Pisat
Content Development Editor: Manikandan Kurup
Technical Editor: Seemanjay Ameriya
Copy Editor: Safis Editing
Proofreader: Safis Editing and Manikandan Kurup
Indexer: Hemangini Bari
Production Designer: Aparna Bhagat
Senior DevRel Marketing Coordinator: Vinishka Kalra
First published: April 2024
Production reference: 1050424
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul's Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80512-010-0
www.packtpub.com
To my wife and best friend, Cristiane, for being my loving partner throughout our joint life journey. To my daughters, Giovana and Camila, for being my real treasure; I’m so proud of you. To my mom, Fatima, and brothers, Johny and Karoline, for being my safe harbor. Despite everything, I also dedicate this book to my (late) father Jorge.
– Maicon Melo Alves
Accelerating model training is critical in the area of machine learning for several reasons. As datasets grow larger and models become more complex, training times can become prohibitively long, hindering research and development progress. This is where machine learning frameworks such as PyTorch come into play, providing tools and techniques to accelerate the training process.
PyTorch, with its flexibility, GPU acceleration, optimization techniques, and distributed training capabilities, plays a crucial role in this endeavor by enabling researchers and developers to iterate quickly, train complex models efficiently, and deploy solutions faster. By leveraging PyTorch’s capabilities, practitioners can push the boundaries of what is possible in artificial intelligence and drive innovation across various domains.
Since learning all of these capabilities is not a straightforward task, this book is a great resource for all students, researchers, and professionals who intend to learn how to accelerate model training with the latest release of PyTorch in a smooth way.
This very didactic book starts by introducing how the training process works and what kinds of modifications can be made at the application and environment layers to accelerate it.
Only after that do the following chapters describe methods to accelerate model training, such as the Compile API, a novel capability launched in PyTorch 2.0 for compiling a model, and the use of specialized libraries such as OpenMP and IPEX to speed up the training process even further.
It also describes how to build an efficient data pipeline to keep the GPU working at its peak throughout the training process, how to simplify a model by reducing its number of parameters, and how to lower the numerical precision adopted by the neural network to accelerate training and decrease the memory needed to store the model.
Finally, this book also explains how to spread out the distributed training process to run on multiple CPUs and GPUs.
This book not only provides current and highly relevant content for any professional working in the field of computing who wants to learn or stay up to date, but also impresses with its extremely didactic presentation of the subject. You will certainly appreciate the quiz at the end of each chapter and the way each chapter's summary connects it to the next.
All chapters present code and usage examples. For all these reasons, I believe the book could also be successfully adopted as a supporting bibliography in undergraduate and graduate courses.
Prof. Lúcia Maria de Assumpção Drummond
Titular professor at Fluminense Federal University, Brazil
Dr. Maicon Melo Alves is a senior system analyst and academic professor who specializes in High-Performance Computing (HPC) systems. In the last five years, he has become interested in understanding how HPC systems have been used in AI applications. To better understand this topic, he completed an MBA in data science in 2021 at Pontifícia Universidade Católica of Rio de Janeiro (PUC-RIO). He has over 25 years of experience in IT infrastructure, and since 2006, he has worked with HPC systems at Petrobras, the Brazilian state energy company. He obtained his DSc degree in computer science from the Fluminense Federal University (UFF) in 2018 and has published three books as well as papers in international HPC journals.
Dimitra Charalampopoulou is a machine learning engineer with a background in technology consulting and a strong interest in AI and machine learning. She has led numerous large-scale digital transformation engineering projects for clients across the US and EMEA and has received various awards, including recognition for her start-up at the MIT Startup Competition. Additionally, she has been a speaker at two conferences in Europe on the topic of GenAI. As an advocate for women in tech, she is the founder and managing director of an NGO that promotes gender equality in tech and has taught programming classes to female students internationally.
In this part, you will learn about performance optimization, before delving into the techniques, approaches, and strategies described throughout the book. First, you will learn about the aspects of the training process that make it so computationally heavy. After that, you will learn about the possible approaches to reduce the training time.
This part has the following chapters:
Chapter 1, Deconstructing the Training Process
Chapter 2, Training Models Faster
We already know that training neural network models takes a long time to finish. Otherwise, we would not be here discussing ways to run this process faster. But which characteristics make the building process of these models so computationally heavy? Why does the training step take so long? To answer these questions, we need to understand the computational burden of the training phase.
In this chapter, we will first remember how the training phase works under the hood and then understand what makes the training process so computationally heavy.
Here is what you will learn as part of this first chapter:
Remembering the training process
Understanding the computational burden of the training phase
Understanding the factors that influence training time
You can find the complete code of the examples mentioned in this chapter in the book's GitHub repository at https://github.com/PacktPublishing/Accelerate-Model-Training-with-PyTorch-2.X/blob/main.
You can execute this notebook in your favorite environment, such as Google Colab or Kaggle.
Before describing the computational burden imposed by neural network training, we must remember how this process works.
Important note
This section gives a very brief introduction to the training process. If you are totally unfamiliar with this topic, you should invest some time to understand this theme before moving to the following chapters. An excellent resource for learning this topic is the book entitled Machine Learning with PyTorch and Scikit-Learn, published by Packt and written by Sebastian Raschka, Yuxi (Hayden) Liu, and Vahid Mirjalili.
Basically speaking, neural networks learn from examples, similar to a child observing an adult. The learning process relies on feeding the neural network with pairs of input and output values so that the network captures the intrinsic relationship between the input and output data. Such relationships can be interpreted as the knowledge obtained by the model. So, where a human sees a bunch of data, the neural network sees veiled knowledge.
This learning process depends on the dataset used to train a model.
A dataset comprises a set of data instances related to some problem, scenario, event, or phenomenon. Each instance has features and target information corresponding to the input and output data. The concept of a dataset instance is similar to a record in a table of a relational database.
The dataset is usually split into two parts: training and testing sets. The training set is used to train the network, whereas the testing set is used to test the model against unseen data. Occasionally, we can also use another part to validate the model after each training iteration.
Let’s look at Fashion-MNIST, a famous dataset that is commonly used to test and teach neural networks. This dataset comprises 70,000 labeled images of clothes and accessories such as dresses, shirts, and sandals belonging to 10 distinct classes or categories. The dataset is split into 60,000 instances for training and 10,000 instances for testing.
As shown in Figure 1.1, a single instance of this dataset comprises a 28 x 28 grayscale image and a label identifying its class. In the case of Fashion-MNIST, we have 70,000 instances, which is often referred to as the length of the dataset.
Figure 1.1 – Concept of a dataset instance
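As a quick, hands-on illustration (this snippet is a minimal sketch and is not taken from the book's repository; the variable names are illustrative), we can load Fashion-MNIST with torchvision and inspect a single instance:

```python
import torchvision
import torchvision.transforms as transforms

# Download the training and testing splits of Fashion-MNIST
transform = transforms.ToTensor()
train_set = torchvision.datasets.FashionMNIST(
    root="./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.FashionMNIST(
    root="./data", train=False, download=True, transform=transform)

print(len(train_set), len(test_set))  # 60000 10000

# A single instance is a pair of features (the image) and target (the label)
image, label = train_set[0]
print(image.shape)  # torch.Size([1, 28, 28]): one grayscale channel, 28 x 28 pixels
print(label)        # an integer between 0 and 9 identifying the class
```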
Besides the concept of dataset instance, we also have the concept of a dataset sample. A sample is defined as a group of instances, as shown in Figure 1.2. Usually, the training process executes on samples and not just on a single dataset instance. The reason why the training process takes samples instead of single instances is related to the way the training algorithm works. Don’t worry about this topic; we will cover it in the following sections:
Figure 1.2 – Concept of a dataset sample
The number of instances in a sample is called the batch size. For example, if we divide the Fashion-MNIST training set into samples of a batch size equal to 32, we get 1,875 samples since this set has 60,000 instances.
The higher the batch size, the lower the number of samples in a training set, as pictorially described in Figure 1.3:
Figure 1.3 – Concept of batch size
With a batch size equal to eight, the dataset in the example is divided into two samples, each with eight dataset instances. On the other hand, with a lower batch size (in this case, four), the training set is divided into a higher number of samples (four samples).
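To see this relationship in practice, the following sketch (illustrative code, not from the book's repository) wraps the Fashion-MNIST training set loaded earlier in a DataLoader and shows how the batch size determines the number of samples per epoch:

```python
from torch.utils.data import DataLoader

# With a batch size of 32, the 60,000 training instances yield 1,875 samples
loader_32 = DataLoader(train_set, batch_size=32, shuffle=True)
print(len(loader_32))   # 1875

# A larger batch size produces fewer (but bigger) samples
loader_128 = DataLoader(train_set, batch_size=128, shuffle=True)
print(len(loader_128))  # 469 (the last sample holds fewer than 128 instances)
```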
The neural network receives input samples and outputs a set of results, each corresponding to an instance of the input sample. In the case of a model to tackle the image classification problem of Fashion-MNIST, the neural network receives a set of images and outputs a set of labels, as you can see in Figure 1.4. Each of these labels indicates the corresponding class of the input image:
Figure 1.4 – Neural networks work on input samples
To extract the intrinsic knowledge present in the dataset, we need to submit the neural network to a training algorithm so it can learn the pattern present in the data. Let’s jump to the next section to understand how this algorithm works.
The training algorithm is an iterative process that takes each dataset sample and adjusts the neural network parameters according to the degree of error, that is, the difference between the correct result and the predicted one.
A single training iteration is called the training step. So, the number of training steps executed in the learning process equals the number of samples used to train the model. As we stated before, the batch size defines the number of samples, which also determines the number of training steps.
After executing all the training steps, we say the training algorithm has completed a training epoch. The developer must define the number of epochs before starting the model-building process. Usually, the developer determines the number of epochs by varying it and evaluating the accuracy of the resultant model.
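To make these terms concrete, here is a skeleton of the double loop over epochs and training steps (a sketch with illustrative names, reusing the loader_32 object defined above):

```python
num_epochs = 10  # defined by the developer before starting the training process

for epoch in range(num_epochs):
    # One epoch visits every sample of the training set exactly once
    for step, (images, labels) in enumerate(loader_32):
        # Each iteration of this inner loop is one training step;
        # with a batch size of 32, there are 1,875 steps per epoch
        pass
```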
A single training step executes the four phases sequentially, as illustrated in Figure 1.5:
Figure 1.5 – The four phases of the training process
Let’s go through each one of these steps to understand their role in the entire training process.
In the forward phase, the neural network receives the input data, performs calculations, and outputs a result. This output is also known as the value predicted by the neural network. In the case of Fashion-MNIST, the input data is the grayscale image and the predicted value is the class to which the item belongs.
Among the tasks executed in the training step, the forward phase has the highest computational cost. This happens because it executes all the heavy computations involved in the neural network. Such computations, commonly known as operations, will be explained in the next section.
It is interesting to note that the forward phase is exactly the same as the inference process. When using the model in practice, we continuously execute the forward phase to infer a value or result.
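As a minimal sketch (the network architecture below is purely illustrative and is not the model used in the book), the forward phase is simply a call to the model on an input sample, and inference reuses exactly the same call:

```python
import torch
import torch.nn as nn

# A deliberately simple classifier for 28 x 28 grayscale images and 10 classes
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

images, labels = next(iter(loader_32))  # one sample of 32 instances
outputs = model(images)                 # forward phase
print(outputs.shape)                    # torch.Size([32, 10]): one prediction per instance

# Inference executes the same forward phase, just without tracking gradients
with torch.no_grad():
    predicted_classes = model(images).argmax(dim=1)
```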
After the forward phase, the neural network will output a predicted value. Then, the training algorithm needs to compare the predicted value with the expected one to see how good the prediction made by the model is.
If the predicted value is close or equal to the real value, the model is performing as expected and the training process is going in the right direction. Otherwise, the training step needs to quantify the error achieved by the model to adjust the parameters proportionally to the error degree.
Important note
In the terminology of neural networks, this error is usually referred to as loss or cost. So, it is common to see names such as loss or cost function in the literature when addressing this topic.
There are different kinds of loss functions, each one suitable for a specific sort of problem. The cross-entropy (CE) loss function is used in multiclass image classification problems, where we need to classify an image within a group of classes. For example, this loss function can be used in the Fashion-MNIST problem. Suppose we have just two classes or categories. In that case, we face a binary classification problem, so using the binary cross-entropy (BCE) function rather than the original cross-entropy loss function is recommended.
For regression problems, the loss function is completely different from the ones used in classification problems. We can use functions such as the mean squared error (MSE), which measures the average squared difference between the original value and the value predicted by the neural network.
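In PyTorch, these loss functions live in the torch.nn module; the snippet below is a sketch with random, illustrative tensors showing the three functions mentioned above:

```python
import torch
import torch.nn as nn

# Multiclass classification: cross-entropy over raw scores (logits)
ce_loss = nn.CrossEntropyLoss()
logits = torch.randn(32, 10)           # predicted scores for 10 classes
targets = torch.randint(0, 10, (32,))  # expected class indices
print(ce_loss(logits, targets))

# Binary classification: binary cross-entropy (here, the variant that accepts logits)
bce_loss = nn.BCEWithLogitsLoss()
binary_logits = torch.randn(32, 1)
binary_targets = torch.randint(0, 2, (32, 1)).float()
print(bce_loss(binary_logits, binary_targets))

# Regression: mean squared error between predicted and original values
mse_loss = nn.MSELoss()
print(mse_loss(torch.randn(32, 1), torch.randn(32, 1)))
```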
After obtaining the loss, the training algorithm calculates the partial derivatives of the loss function with respect to the current parameters of the network. This operation results in the so-called gradient, which the training process uses to adjust the network parameters.
Leaving the mathematical foundations aside, we can think of the gradient as the change we need to apply to network parameters to minimize the error or loss.
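In PyTorch code, this phase boils down to calling backward() on the loss, which populates the grad attribute of every trainable parameter; a brief sketch, reusing the illustrative model, ce_loss, images, and labels from the previous snippets:

```python
outputs = model(images)          # forward phase
loss = ce_loss(outputs, labels)  # loss calculation
loss.backward()                  # computes the gradient of the loss w.r.t. each parameter

# Each trainable parameter now holds its own gradient
for name, param in model.named_parameters():
    print(name, param.grad.shape)
```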
Important note
You can find more information about the math used in deep learning by reading the book Hands-On Mathematics for Deep Learning, published by Packt and written by Jay Dawani.
Similar to loss functions, we also have distinct implementations of optimizers. Stochastic gradient descent (SGD) and Adam are the most commonly used ones.
To finish the training step, the algorithm updates the network parameters according to the gradient obtained in the optimization phase.
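A sketch of this last phase with the two optimizers mentioned above (the learning rates are illustrative values, not recommendations from the book):

```python
import torch.optim as optim

# The optimizer receives the model parameters it is responsible for updating
optimizer = optim.SGD(model.parameters(), lr=0.01)
# optimizer = optim.Adam(model.parameters(), lr=0.001)  # drop-in alternative

optimizer.step()       # updates the parameters using the gradients computed by backward()
optimizer.zero_grad()  # clears the gradients before the next training step
```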
Important note
This section provides a theoretical explanation of the training algorithm. So, be aware that depending on the machine learning framework, the training process can have a set of phases that is different from the ones in the preceding list.
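In PyTorch, putting the four phases together, a single training step typically looks like the following sketch (again reusing the illustrative model, loss function, optimizer, and data loader defined above; the code in the book's repository may be organized differently):

```python
for epoch in range(num_epochs):
    for images, labels in loader_32:
        optimizer.zero_grad()            # clear gradients left over from the previous step
        outputs = model(images)          # 1. forward phase
        loss = ce_loss(outputs, labels)  # 2. loss calculation
        loss.backward()                  # 3. optimization phase (gradient computation)
        optimizer.step()                 # 4. parameter update
```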
Essentially, these phases constitute the computational burden of the training process. Follow me to the next section to understand how this computational burden is impacted by different factors.
Now that we've brushed up on how the training process works, let's understand the computational cost required to train a model. By the terms computational cost or burden, we mean the computing power needed to execute the training process. The higher the computational cost, the longer it takes to train the model. Likewise, the higher the computational burden, the more computing resources are required to train the model.
Essentially, we can say the computational burden of training a model is determined by three factors, as illustrated in Figure 1.6:
Figure 1.6 – Factors that influence the training computational burden
Each one of these factors contributes (to some degree) to the computational complexity imposed by the training process. Let’s talk about each one of them.
Hyperparameters define two aspects of neural networks: the neural network configuration and how the training algorithm works.
Concerning neural network configuration, the hyperparameters determine the number and type of layers and the number of neurons in each layer. Simple networks have a few layers and neurons, whereas complex networks have thousands of neurons spread across hundreds of layers. The number of layers and neurons determines the number of parameters of the network, which directly impacts the computational burden. Because of the significant influence of the number of parameters on the computational cost of the training step, we will discuss this topic later in this chapter as a separate performance factor.
Regarding how the training algorithm executes the training process, hyperparameters control the number of epochs and steps and determine the optimizer and loss function used during the training phase, among other things. Some of these hyperparameters have a tiny influence on the computational cost of the training process. For example, if we change the optimizer from SGD to Adam, we will not face any relevant impact on the computational cost of the training process.
Other hyperparameters, though, can definitely increase the time of the training phase. One of the most emblematic examples is the batch size. The higher the batch size, the fewer training steps are needed to train a model, so the training phase executes fewer steps per epoch, which can speed up the building process. On the other hand, with big batch sizes, each training step can take longer to execute because the forward phase must process higher-dimensional input data. In other words, we have a trade-off here.
For example, consider the case of a batch size equal to 32 for the Fashion-MNIST dataset. In this case, the input data dimension is 32 x 1 x 28 x 28, where 32, 1, and 28 represent the batch size, the number of channels (colors, in this scenario), and the image size, respectively. Therefore, the input data comprises 25,088 numbers, which is the amount of data the forward phase must process. However, if we increase the batch size to 128, the input data grows to 100,352 numbers, which can result in a longer time to execute a single forward phase iteration.
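These figures are easy to verify with a quick, illustrative check (not from the book's code):

```python
import torch

sample_32 = torch.randn(32, 1, 28, 28)    # one sample with a batch size of 32
sample_128 = torch.randn(128, 1, 28, 28)  # one sample with a batch size of 128
print(sample_32.numel())   # 25088
print(sample_128.numel())  # 100352
```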
In addition, a