GPU Programming with C++ and CUDA

Paulo Motta
Description

Written by Paulo Motta, a senior researcher with decades of experience, this comprehensive GPU programming book is an essential guide for leveraging the power of parallelism to accelerate your computations. The first section introduces the concept of parallelism and provides practical advice on how to think about and utilize it effectively. Starting with a basic GPU program, you then gain hands-on experience in managing the device. This foundational knowledge is then expanded by parallelizing the program to illustrate how GPUs enhance performance.
The second section explores GPU architecture and implementation strategies for parallel algorithms, and offers practical insights into optimizing resource usage for efficient execution.
In the final section, you will explore advanced topics such as utilizing CUDA streams. You will also learn how to package and distribute GPU-accelerated libraries for the Python ecosystem, extending the reach and impact of your work.
Combining expert insight with real-world problem solving, this book is a valuable resource for developers and researchers aiming to harness the full potential of GPU computing. The blend of theoretical foundations, practical programming techniques, and advanced optimization strategies it offers is sure to help you succeed in the fast-evolving field of GPU programming.





GPU Programming with C++ and CUDA

Uncover effective techniques for writing efficient GPU-parallel C++ applications

Paulo Motta

GPU Programming with C++ and CUDA

Copyright © 2025 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Portfolio Director: Kunal Chaudhari

Relationship Lead: Samriddhi Murarka

Project Manager: Ashwin Dinesh Kharwa

Content Engineer: Alexander Powell

Technical Editor: Rohit Singh

Copy Editor: Alexander Powell

Indexer: Tejal Soni

Proofreader: Alexander Powell

Production Designer: Prashant Ghare

Growth Lead: Vinishka Kalra

First published: August 2025

Production reference: 1280825

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-80512-454-2

www.packtpub.com

To my wife, Katia, and my daughter, Helena, for their love and support, and to our families, for their encouragement.

– Paulo Motta

Contributors

About the author

Paulo Motta is a computer scientist and software engineer with more than 20 years’ experience in software development. Paulo has worked in both academia and industry, including at Microsoft. He is a senior member of the IEEE and a member of the ACM.

He holds a PhD in Parallel Systems from PUC-Rio, where he developed a platform to abstract low-level hardware for parallel applications on top of the Cell Broadband Engine processor. His post-doctoral research at LNCC (Laboratório Nacional de Computação Científica, Brazil) involved redesigning a quantum walks simulator, leading to GPU-enabled middleware presented at IEEE Quantum Week. He also supervised GPU-focused research projects at LaSalle College that have been published at SBAC-PAD and SPLASH.

He enjoys teaching, writing, and sharing knowledge.

I want to thank everyone who has supported me in this journey, especially my wife, Katia, and my daughter, Helena, who both encouraged me. I am also deeply thankful to the mentors I’ve had over the years, and to the editorial team at Packt for their great support in bringing this book to life.

About the reviewer

Aditya Agrawal is a postgraduate student in Computer Science at IIT Madras. He is working on multithreading to extend OpenMP (a popular parallel programming library for shared-memory parallelism) with synchronization primitives. He has worked in a wide range of domains, including data science and web development, and is currently exploring compilers and high-performance computing. His experience with CUDA programming stems from a graduate course in GPU programming, where he was introduced to the world of parallel programming using the CUDA framework for NVIDIA GPUs. On completion of his postgraduate studies, he aspires to work in the compiler field.

Dedicated to the memory of my mother, whose love and support have always been such a help over the years, as I found out who I was and what I wanted to do. Thank you, Mummy, for always supporting me. I wish you to always stay happy and keep that sweet smiling face, wherever you are.

Part 1

Understanding Where We Are Heading

In this part, we take the first steps towards understanding parallel programming. We learn how to set up our environment, and we execute our first CUDA programs. These first programs will be very simple; the priority is to make sure that our environment is working properly and that we understand how execution on the GPU takes place.

This part of the book includes the following chapters:

Chapter 1, Introduction to Parallel Programming

Chapter 2, Getting Started

Chapter 3, Hello CUDA

Chapter 4, Hello Again, but in Parallel

1

Introduction to Parallel Programming

Welcome to the world of graphics processing unit (GPU) programming!

Before we talk about programming GPUs, we must understand what parallel programming is and how it can benefit our applications. As with everything in life, it has its challenges. In this chapter, we’ll explore both the benefits and drawbacks of parallel programming, laying the groundwork for our deep dive into GPU programming. So in this first chapter, we’ll be discussing a variety of topics without developing any code. In doing so, we’ll establish the foundations on which to build throughout our journey.

Apart from being useful, the information provided in this chapter is fundamental to understanding what happens inside a GPU, as we’ll discuss shortly. By the end of the chapter, you’ll understand why parallelism is important and when it makes sense to use it in your applications.

In this chapter, we’re going to cover the following main topics:

What parallelism is in software, and why it’s important

Different types of parallelism

An overview of GPU architecture

Comparing central processing units (CPUs) and GPUs

Advantages and challenges of GPU programming

Getting the most out of this book – get to know your free benefits

Unlock exclusive free benefits that come with your purchase, thoughtfully crafted to supercharge your learning journey and help you learn without limits.

Here’s a quick overview of what you get with this book:

Next-gen reader

Figure 1.1: Illustration of the next-gen Packt Reader’s features

Our web-based reader, designed to help you learn effectively, comes with the following features:

Multi-device progress sync: Learn from any device with seamless progress sync.

Highlighting and notetaking: Turn your reading into lasting knowledge.

Bookmarking: Revisit your most important learnings anytime.

Dark mode: Focus with minimal eye strain by switching to dark or sepia mode.

Interactive AI assistant (beta)

Figure 1.2: Illustration of Packt’s AI assistant

Our interactive AI assistant has been trained on the content of this book, to maximize your learning experience. It comes with the following features:

Summarize it: Summarize key sections or an entire chapter.

AI code explainers: In the next-gen Packt Reader, click the Explain button above each code block for AI-powered code explanations.

Note: The AI assistant is part of next-gen Packt Reader and is still in beta.

DRM-free PDF or ePub version

Figure 1.3: Free PDF and ePub

Learn without limits with the following perks included with your purchase:

Learn from anywhere with a DRM-free PDF copy of this book.

Use your favorite e-reader to learn using a DRM-free ePub version of this book.

Unlock this book’s exclusive benefits now

Scan this QR code or go to packtpub.com/unlock, then search for this book by name. Ensure it’s the correct edition.

Note: Keep your purchase invoice ready before you start.

Technical requirements

For this chapter, the only technical requirement that we have is the goodwill to keep reading!

What is parallelism in software?

Parallel programming is a way of making a computer do many things at once. But wait – isn’t this what already happens daily? Yes and no. Most common processors today are capable of executing more than one task at the same time – and we mean at the same time. However, this is only the first requirement for parallel software. The second is to make at least some of the processor cores work on the same problem in a coordinated way. Let’s consider an example.

Imagine that you’re taking on a big task, such as sorting a huge pile of books. Instead of doing it alone, you ask a group of friends to help. Each friend takes a small part of the pile and sorts it. You all work at the same time, and the job gets done much faster. This is similar to how parallel programming works: it breaks a big problem into smaller pieces and solves them at the same time using multiple cores.

Of course, this example was chosen because it has a special characteristic: it’s easily parallelizable in the sense that we can perceive how to break the big tasks into smaller ones. Not all problems can be easily broken down for parallel processing. One of our first challenges is finding ways to decompose problems into smaller tasks that can be executed simultaneously. Sometimes, there are parts of our algorithm that need to be executed on a single core while all others sit idle before we can separate the parallel tasks. This is usually called a sequential part. It’s time for a different example.

Let’s suppose you’re having a movie and games night with your friends. You all decide to prepare some food and for that, you go to the supermarket. To make things faster, your friends come along so that, once there, everyone can select multiple ingredients at the same time – this is the parallel part. However, since you’re all going in a single car, only one person can drive at a given time, no matter how many licensed drivers there are in the vehicle. You can always argue that they could take turns driving a part of the way, but in this scenario, it would only take longer to get to the supermarket.

Upon arriving, each person heads to a different aisle to gather the pre-defined ingredients. Once everything is collected, another crucial decision arises: should each person go to a separate checkout line to pay with their credit card, or should they all queue together if they only have one card? Opting for the parallel payment method reveals another interesting aspect of parallel processing.

Even when tasks are processed in parallel (each person is on a different checkout line), the execution times can vary unpredictably. This means that at any given moment, different lines move at different speeds, and those who have already paid for their ingredients may end up waiting for their friends (processors) to finish their payments.

Once all the payments are complete, a new sequential part follows: driving back home. This time, a different driver might be executing this task while the other people – I mean, processors – sit idle waiting for the next task to execute. Some algorithms have sequential parts to synchronize data or to share intermediate results, and that’s why only one processor is working. Here, we’re collecting the data that each processor – I mean, friend – got from the supermarket, and we have to move it from one location to another. There’s no use for parallelism in this small part.

Why is parallelism important?

There are many situations in which the size of the problems we want to solve increases dramatically. And this is the moment when we have to start talking about more ‘serious’ real-world applications, such as weather forecasting, scientific research, and artificial intelligence.

Remember when we were driving to the supermarket and we mentioned that we could switch drivers for each part of the way? Wouldn’t this only end up taking us more time? This was due to context switching – we would have to find a place to park, then switch drivers, then drive the car until the next stop. But why are we talking about this again? Because most of the time, we need a ‘serious’ real-world application to make it worthwhile working through all the details of parallel programming.

One exception could be using parallel programming to accelerate graphics and physics processing in video games; although these applications may not be critical for human life, they’re pretty serious. We could always classify video games within the ‘serious’ simulation category. Let’s understand some of the benefits we get by using parallelism in our software.

Speeding up tasks

Splitting tasks into smaller parts that can be done simultaneously dramatically speeds up the overall process. We now have multiple processors working on different parts of the problem at the same time.

Efficiency in resource use

Let’s revisit our supermarket example. If all our friends sit idle inside the car waiting for one of us to get all the ingredients, we aren’t using our resources efficiently. However, when each person is gathering a part of the list, we’re better utilizing the resources that were already available. This is the same with our computer processor and its multiple cores – they’re already there. Of course, most modern processors are capable of decreasing their clock speeds to reduce energy consumption, but that doesn’t eliminate the waste of having an idle component.

Handling large datasets

In fields such as data science, video processing, and scientific research, it’s very common to find problems that grow in size very fast. Many of these problems use matrices to represent data; when the size of the matrices increases, the time required to process them is impacted. By using parallelism, we can ensure that we’re executing as much as possible at any given point in time, which makes the time to process large amounts of data acceptable.

Enhancing performance

When we reduce the time needed to process our dataset, this is perceived by the user as a performance gain. However, there’s another aspect to this. Even when the time isn’t reduced, the user may perceive a performance enhancement by having a more detailed simulation – one that uses more parameters (and thus a larger amount of data, as we discussed previously). Either way, this results in better performance and more responsive applications.

Enabling advanced technologies

Parallelism increases our capacity to run simulations, allowing scientists to test more hypotheses than was possible before. Only when simulations seem to point in a viable and promising direction do we try them with concrete materials. This dramatically decreases the costs of discovery and accelerates the discovery process.

A quick start guide to the different types of parallelism

So far, we’ve only been talking about parallelism. In this section, we’ll quickly discuss the different types of parallelism before we dive into GPUs – which is what we’re all waiting for!

Data parallelism

The process of performing the same operation on different pieces of data, so that the data is processed in the same way at the same time by different processor cores – for example, processing an image to apply some change to each of its pixels – is called data parallelism. If one of the ingredients that we bought at the supermarket was a huge box of carrots that needed to be peeled, we could do that on our own, or we could distribute a peeler to each of our friends and all perform the same operation together, with each person working on an individual carrot.
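
To make the image example concrete, here is a minimal CUDA sketch – the kernel name, the brightness offset, and the launch configuration are illustrative, not taken from later chapters – in which each thread applies the same operation to one pixel:

#include <cuda_runtime.h>

// Each thread brightens exactly one pixel: the same operation, applied to
// different pieces of data at the same time.
__global__ void brighten(unsigned char *pixels, int numPixels, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's pixel
    if (i < numPixels)                               // guard the tail
    {
        int value = pixels[i] + offset;
        pixels[i] = value > 255 ? 255 : value;       // clamp to the valid range
    }
}

// Host side (assuming d_pixels already lives on the device):
// brighten<<<(numPixels + 255) / 256, 256>>>(d_pixels, numPixels, 40);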

Task parallelism

Sometimes, we have multiple steps that aren’t dependent on each other and that could be executed by different processor cores. Often, one larger task is divided into smaller sub-tasks, each with its own very clear objective, which contributes to the larger task but has nothing to do with its sibling tasks. If we revisit our friends who have just returned from the supermarket, we can break the larger task of preparing dinner into sub-tasks, such as preparing some appetizers, preparing the main dish, and preparing a dessert. The data are the ingredients that were brought home; the processors will perform different operations.

However, to be technically accurate, we need to consider that the same data is used as input for the parallel sub-tasks, so for the sake of fairness, let’s consider that we have two menus tonight, both using the same set of ingredients. That way, we could input the same data and have it processed by different functions (cooking methods).
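
As a rough host-side sketch of this idea (the function names and the ingredient data are invented for illustration), two C++ threads can run different functions over the same input at the same time, each on its own core:

#include <functional>
#include <thread>
#include <vector>

// Two independent sub-tasks ("cooking methods") applied to the same
// ingredient data, each running on its own CPU core.
void prepare_main_dish(const std::vector<int> &ingredients) { /* ... */ }
void prepare_dessert(const std::vector<int> &ingredients)   { /* ... */ }

int main()
{
    std::vector<int> ingredients = {1, 2, 3, 4};

    std::thread cook1(prepare_main_dish, std::cref(ingredients));
    std::thread cook2(prepare_dessert,  std::cref(ingredients));

    cook1.join();   // wait for both sub-tasks to finish
    cook2.join();
    return 0;
}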

Pipeline parallelism

There’s another arrangement for parallelism we can utilize. Sometimes, we have some data processing that depends on the results of some previous processing. If we have multiple processor cores to work with, we could assign each processing step to a core and create a pipeline that leverages the benefits of parallelism as new data becomes available. In a fruit processing factory, we can witness this kind of parallelism as new fruits, let’s say apples, arrive for processing and packaging. First, the apple truck delivers a large quantity of fruit that’s dumped in an artificial water stream. The fruits get washed while being pushed through the water stream; once cleaned, a series of workers inspect the fruits to remove bad apples. After selection, the apples are separated by size and moved to different conveyor belts, after which they’re packed and sealed, and then boxed and distributed.

This is analogous to what happens in video processing, where each video frame is equivalent to a small set of apples: each frame passes through a processing step and then moves to the next processing component.
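
On a GPU, CUDA streams (which we cover much later in the book) are one way to build such a pipeline. The sketch below is only a picture of the shape of the idea – the process_frame kernel, the frame size, and the double-buffering scheme are all invented for illustration – where one frame can be copied to the device while the previous one is still being processed:

#include <cuda_runtime.h>

#define FRAME_SIZE 1024          // floats per frame (illustrative)
#define NUM_FRAMES 8

// One pipeline stage expressed as a kernel (a placeholder computation).
__global__ void process_frame(float *frame, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        frame[i] = frame[i] * 0.5f + 1.0f;
}

int main()
{
    float *h_frames[NUM_FRAMES];
    float *d_buf[2];
    cudaStream_t stream[2];

    for (int s = 0; s < 2; ++s)
    {
        cudaMalloc((void **)&d_buf[s], FRAME_SIZE * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }
    for (int i = 0; i < NUM_FRAMES; ++i)
        cudaMallocHost((void **)&h_frames[i], FRAME_SIZE * sizeof(float)); // pinned host memory

    for (int i = 0; i < NUM_FRAMES; ++i)
    {
        int s = i % 2;   // alternate between the two streams and buffers
        cudaMemcpyAsync(d_buf[s], h_frames[i], FRAME_SIZE * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process_frame<<<(FRAME_SIZE + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], FRAME_SIZE);
        // While stream[s] handles frame i, the other stream can still be
        // copying or processing frame i - 1: the stages overlap like the belts.
    }
    cudaDeviceSynchronize();   // wait for the whole pipeline to drain

    for (int s = 0; s < 2; ++s) { cudaFree(d_buf[s]); cudaStreamDestroy(stream[s]); }
    for (int i = 0; i < NUM_FRAMES; ++i) cudaFreeHost(h_frames[i]);
    return 0;
}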

An overview of GPU architecture

After all that cooking, it’s time for a change. Let’s talk about GPUs.

First, let me say that I’ve decided to explain GPUs first before comparing them with CPUs. I’m doing this on the assumption that you’re already somewhat familiar with the (basic) architecture of a modern CPU.

GPUs were originally used to accelerate the output of graphics processing, since modern computer usage takes place almost exclusively in graphical environments. This differs from computing in the past, where the character-based interfaces that were used weren’t graphically demanding. However, a shift occurred when it was noticed that a processing unit that was capable of dealing with the computations necessary for computer graphics could also be used for anything that could be expressed in terms of matrix computations, which is what linear algebra is all about.

In the next chapter, we’re going to focus specifically on NVIDIA GPUs, but, for now, we’ll consider a general device. In this scenario, a GPU has several components that enable parallel tasks to be executed. The first important thing to note is that a GPU core performs arithmetic and logic operations. A GPU core can’t be used for program control; instead, many cores are organized within a streaming multiprocessor (SM), which is responsible for controlling instruction execution among the many cores.

As shown in Figure 1.4 (a), a streaming multiprocessor has one control unit that works with the CUDA cores, whereas in Figure 1.4 (b), we can see that a single GPU can have many streaming multiprocessors, which promotes its high throughput:

Figure 1.4: (a) a streaming multiprocessor with CUDA cores and (b) a GPU device with many streaming multiprocessors

An interesting difference between a streaming multiprocessor and a CPU core is that the streaming multiprocessor has a single control unit for many processing cores, while the CPU core has one control unit for each processing core. This is shown in Figure 1.5:

Figure 1.5: (a) streaming multiprocessor and (b) CPU

We also rely on a memory hierarchy with different amounts of space available at the different levels. As shown in Figure 1.6, there is a large amount of global memory, which is relatively slow when compared to the other memory components like L1/L2 cache and shared memory. This is usually the number that you see listed on your GPU box. Then, we have the cache levels. Close to the cores, we have two special types that are faster to work with: shared memory, which is shared among multiple cores, and registers, which are specific to each core. We’ll learn how to handle this memory later in the book; for now, we just need to know that these are the components that will make up our solutions:

Figure 1.6: Memory hierarchy of a GPU

There is also a scheduler that organizes and manages the execution of our software (and its configuration) on the actual hardware.

Finally, to make sure that everything is capable of communicating efficiently, we have a high-speed interconnection between the memory and the streaming multiprocessors.
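
If you’d like to see these components on your own device, the CUDA runtime can report them. The following minimal sketch queries device 0 and prints a small selection of its properties:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Device name:               %s\n", prop.name);
    printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Global memory:             %zu MB\n", prop.totalGlobalMem >> 20);
    printf("Shared memory per block:   %zu KB\n", prop.sharedMemPerBlock >> 10);
    printf("Registers per block:       %d\n", prop.regsPerBlock);
    return 0;
}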

Single instruction, multiple data (SIMD)

To understand why GPUs are designed the way they are, we need to think about the problem they were created to solve: graphics processing. Vector operations and matrix multiplication are the protagonists here, and one way to handle these efficiently is to have multiple pieces of data being processed with the same instruction by different cores at the same time. Thus, we have a single instruction and multiple pieces of data (SIMD) – this is one of the common patterns for parallelism. Of course, most of the time, we don’t have enough cores to execute all the data points we need to process at once, so the scheduler will allocate a set of logical units of data to be processed first, and then the next lot, and so on, until all the data has been processed. But it’s up to us to figure out whether our data is larger than the amount that can be held in global memory. This is a problem we’ll cover later in this book.
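
One common way to express this “first set, then the next lot” pattern in code is a grid-stride loop. The scale kernel below is invented for illustration: each thread handles one element and then jumps ahead by the total number of threads in the launch until the data is exhausted.

// Grid-stride loop: the same instruction (a multiply) is applied to many
// data elements, and each thread keeps striding until all n elements are
// done, even when n is far larger than the number of threads launched.
__global__ void scale(float *data, int n, float factor)
{
    int stride = gridDim.x * blockDim.x;     // total threads in the launch
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= factor;
}

// Host side: a modest, fixed launch still covers an arbitrarily large n.
// scale<<<64, 256>>>(d_data, n, 2.0f);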

Another interesting aspect of this design is that when we have multiple cores executing code and the program reaches a conditional statement, the overall performance can be drastically affected. This is because the group will execute the first branch and then the second branch, yet not all cores need both paths, so some cores sit idle while the others are processing. We can overcome this by rethinking our solution. Let’s consider an example (not a cooking one).

Let’s suppose we have a huge quantity of integers that we need to separate into evens and odds so that they can be computed differently. We’ll require a branch to figure out whether an integer is odd or even, which we know will have a performance impact. However, if we rethink the way we provide the numbers, we could use a for loop that increments by two at a time: one group of cores starts at the first odd integer and another group starts at the first even integer. This small change allows processing to occur without interruption, thereby delivering better performance.
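
Here is a hedged sketch of the two approaches; the kernels and the two processing functions are invented for illustration. In the first kernel, neighbouring threads take different branches; in the second, we assume the data has been arranged so that even values sit at even indices and odd values at odd indices, and each launch strides through only one of the two groups.

// Placeholder work for the two kinds of value.
__device__ int process_even(int v) { return v / 2; }
__device__ int process_odd(int v)  { return 3 * v + 1; }

// Divergent version: threads in the same group follow different branches,
// so the hardware runs both paths one after the other.
__global__ void classify_divergent(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        if (in[i] % 2 == 0)
            out[i] = process_even(in[i]);   // some threads take this path...
        else
            out[i] = process_odd(in[i]);    // ...while their neighbours take this one
    }
}

// Restructured version: 'first' is 0 for the launch that handles the even
// values and 1 for the launch that handles the odd values, so every thread
// in a given launch follows the same path.
__global__ void classify_strided(const int *in, int *out, int n, int first)
{
    int i = first + 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i < n)
        out[i] = (first == 0) ? process_even(in[i]) : process_odd(in[i]);
}

The check on first in the second kernel is still a conditional, but it evaluates the same way for every thread in the launch, so no core sits idle waiting for the other branch.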

Memory management and access

As we mentioned previously, GPUs have their own internal memory, the consequence of which is that we have to be mindful of how we access data. Something we also mentioned was that global memory is the slowest, so if we perform operations that load and store directly to global memory, this will impact performance. So, let’s talk about how we can manage memory access to unleash our device’s full potential.

The first thing to know is that the device’s memory is separate from the system’s main memory, so we need to copy memory from our computer’s memory to the device’s global memory, after which it can be used by our programs. Figure 1.7 shows this memory separation and its connection via the system bus:

Figure 1.7: System main memory and GPU global memory
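
In code, this copy is explicit. The following minimal sketch (the array size and variable names are illustrative) allocates space in the device’s global memory, copies the data over the bus, and copies the results back when the work is done:

#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;                       // one million floats
    size_t bytes = n * sizeof(float);

    float *h_data = new float[n];                // lives in system main memory
    float *d_data = nullptr;
    cudaMalloc((void **)&d_data, bytes);         // lives in GPU global memory

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // over the bus
    // ... launch kernels that work on d_data ...
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // results back

    cudaFree(d_data);
    delete[] h_data;
    return 0;
}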

Global memory is the easiest to access because it’s used in the same way that we access main system memory – by simply referring to the variable that was allocated to global memory by our main program.

On the other hand, when you declare local variables with a limited scope, they are typically allocated to registers (except for large arrays, which wouldn’t fit anyway).

In between, we have shared memory, a faster form of memory with limited capacity that’s shared by multiple execution threads. Usually, we use it to compute intermediate steps as much as possible, postponing the final access to global memory, which holds the final computation result.
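
To make the three kinds of storage concrete, here is an illustrative kernel fragment (the names and the block size of 256 are assumptions): the pointer parameters refer to global memory, the scalar locals typically live in registers, and the __shared__ array is visible to every thread in the block.

// Assumes the kernel is launched with at most 256 threads per block.
__global__ void partial_sums(const float *in, float *out, int n)
{
    __shared__ float scratch[256];           // shared memory: one slot per thread

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float value = (i < n) ? in[i] : 0.0f;    // 'i' and 'value' usually live in registers

    scratch[threadIdx.x] = value;            // stage intermediate work in shared memory
    __syncthreads();

    if (threadIdx.x == 0)                    // one thread writes the block's result
    {
        float sum = 0.0f;
        for (int k = 0; k < blockDim.x; ++k)
            sum += scratch[k];
        out[blockIdx.x] = sum;               // final result goes back to global memory
    }
}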

Before we leave this section, let’s consider a memory access example. When multiplying matrices, we process a row of one matrix with a column of the other matrix. However, the matrices are allocated to global memory contiguously. This means that the column accesses will incur a cache miss on every access:

Figure 1.8: A matrix in contiguous allocation

Considering that the matrix is stored in memory in row orientation – the way C and C++ store this type of data – the first element of the second row will be immediately after the last element of the previous row. Let’s assume that the size of a row matches the size of our cache block. When we need to access the elements of a row, they may not be in the cache at first, but after the first load, we’ll be able to access the other elements directly from there. This isn’t the case when we want to access a column, since each of its elements is in a separate location in main memory, a whole cache block away from its column neighbours.

A simple approach would be to load the column (or at least a part of it) to shared memory so that we can perform the operations for all the rows from the other matrix against this preloaded chunk of data.
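
The book develops this technique properly later on; what follows is only a rough sketch of the idea, assuming square row-major matrices whose size n is a multiple of the tile width:

#define TILE 16

// C = A * B for n x n row-major matrices, with n assumed to be a multiple of TILE.
// Each block stages one tile of A and one tile of B in shared memory, so the
// column-wise accesses to B are served from fast shared memory instead of
// repeatedly missing in cache.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t)
    {
        tileA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                         // wait until both tiles are loaded

        for (int k = 0; k < TILE; ++k)
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();                         // wait before overwriting the tiles
    }
    C[row * n + col] = acc;
}

Each block loads every tile from global memory only once, so the costly column-wise traffic is paid once per tile rather than once per multiplication.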

The upshot is that by carefully managing memory access patterns we can avoid bottlenecks, which is crucial for maximizing memory bandwidth and minimizing latency.

Comparing CPUs and GPUs

Now that we know a little more about how GPUs are organized, we can compare them with CPUs to understand the impact of using these devices.

Modern CPUs are typically composed of many cores, so they’re capable of executing parallel applications by using threads. But while they’re capable of handling tens of threads, a single GPU can handle thousands of threads.

As mentioned previously, the fact that GPU cores execute the same instruction on many pieces of data simultaneously is an interesting difference from CPU cores. On a CPU, each core is a complete processor that can execute either different applications or different threads of the same parallel application. This means that branch execution on a CPU core doesn’t affect the performance of other CPU core executions.

Another important distinction is that CPU cores can switch between tasks quickly, while GPU cores can’t do so independently, since they’re controlled by their streaming multiprocessor.

Regarding memory management, on a CPU we don’t usually need to change the way we access our variables. Although access patterns may affect program performance, the impact is proportionally much smaller than it is in the case of GPUs.

Advantages and challenges of GPU programming

So far, we’ve learned why parallelism matters, considered the various GPU device components, and compared GPUs with CPUs. Now it’s time to understand how GPUs can enhance the performance of our solutions and how to overcome the challenges that come with these benefits.

Since we’ve already talked about some of the benefits, let’s start with the challenges that come with GPU programming.

GPU challenges

The most obvious challenge is that we can’t change the device’s components, so we can’t upgrade its memory, for example. Hardware limitations will directly restrict what we can do and how we’ll need to break down our data for processing.

We also talked about memory transfers, something that can easily become an overhead if we have to move data to and from the device constantly. Typically, we try to move data to the device and compute as much as possible before having to move more data.