Explore different GPU programming methods using libraries and directives such as OpenACC, with extensions to languages such as C, C++, and Python
Compute Unified Device Architecture (CUDA) is NVIDIA's GPU computing platform and application programming interface. It's designed to work with programming languages such as C, C++, and Python. With CUDA, you can leverage a GPU's parallel computing power for a range of high-performance computing applications in the fields of science, healthcare, and deep learning.
Learn CUDA Programming will help you learn GPU parallel programming and understand its modern applications. In this book, you'll discover CUDA programming approaches for modern GPU architectures. You'll not only be guided through GPU features, tools, and APIs, you'll also learn how to analyze performance with sample parallel programming algorithms. This book will help you optimize the performance of your apps by giving insights into CUDA programming platforms with various libraries, compiler directives (OpenACC), and other languages. As you progress, you'll learn how additional computing power can be generated using multiple GPUs in a box or in multiple boxes. Finally, you'll explore how CUDA accelerates deep learning algorithms, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
By the end of this CUDA book, you'll be equipped with the skills you need to integrate the power of GPU computing in your applications.
This beginner-level book is for programmers who want to delve into parallel computing, become part of the high-performance computing community and build modern applications. Basic C and C++ programming experience is assumed. For deep learning enthusiasts, this book covers Python InterOps, DL libraries, and practical examples on performance estimation.
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Richa Tripathi
Acquisition Editor: Alok Dhuri
Content Development Editor: Digvijay Bagul
Senior Editor: Afshaan Khan
Technical Editor: Romy Dias
Copy Editor: Safis Editing
Project Coordinator: Prajakta Naik
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Production Designer: Deepika Naik
First published: September 2019
Production reference: 1270919
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78899-624-2
www.packt.com
Packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Jaegeun Han is currently working as a solutions architect at NVIDIA, Korea. He has around 9 years' experience and he supports consumer internet companies in deep learning. Before NVIDIA, he worked in system software and parallel computing developments, and application development in medical and surgical robotics fields. He obtained a master's degree in CSE from Seoul National University.
Bharatkumar Sharma obtained a master's degree in information technology from the Indian Institute of Information Technology, Bangalore. He has around 10 years of development and research experience in the domains of software architecture and distributed and parallel computing. He is currently working with NVIDIA as a senior solutions architect, South Asia.
Christian Stehno studied computer science, receiving his diploma from Oldenburg University, Germany, in 2000. Since then, he's worked in different fields of computer science, first as a researcher in theoretical computer science at an academic institution, before subsequently switching to embedded system design at a research institute. In 2010, he started his own company, CoSynth, which develops embedded systems and intelligent cameras for industrial automation. In addition, he is a long-time member of the Irrlicht 3D engine developer team.
Minseok Lee is a developer technology engineer at NVIDIA. He works on parallelizing and optimizing scientific and AI applications for CPU-GPU heterogeneous systems. Before joining NVIDIA, he was a C++-based library designer/developer, and he received an M.Eng. in computer science for his work on utilizing GPU architectures more efficiently.
Aidan Temple is a software engineer and lead developer at Nanotek. He has recently graduated with honors from Glasgow Caledonian University, where he received a BSc in computer games software development.
While at university, Aidan also undertook a research degree outlining the benefits of implementing GUI-based game frameworks by means of parallel processing through the utilization of NVIDIA's CUDA architecture. He also received an IGDA scholarship.
Prior to his time at university, he studied computer games development at James Watt College of Further and Higher Education. Due to his excellent understanding and demonstration of game development and design methodologies, Aidan graduated from James Watt College with a distinction in his field.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Learn CUDA Programming
Dedication
About Packt
Why subscribe?
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to CUDA Programming
The history of high-performance computing
Heterogeneous computing
Programming paradigm
Low latency versus higher throughput
Programming approaches to GPU
Technical requirements 
Hello World from CUDA
Thread hierarchy
GPU architecture
Vector addition using CUDA
Experiment 1 – creating multiple blocks
Experiment 2 – creating multiple threads
Experiment 3 – combining blocks and threads
Why bother with threads and blocks?
Launching kernels in multiple dimensions
Error reporting in CUDA
Data type support in CUDA
Summary
CUDA Memory Management
Technical requirements 
NVIDIA Visual Profiler
Global memory/device memory
Vector addition on global memory
Coalesced versus uncoalesced global memory access
Memory throughput analysis
Shared memory
Matrix transpose on shared memory
Bank conflicts and their effect on shared memory
Read-only data/cache
Computer vision – image scaling using texture memory
Registers in GPU
Pinned memory
Bandwidth test – pinned versus pageable
Unified memory
Understanding unified memory page allocation and transfer
Optimizing unified memory with warp per page
Optimizing unified memory using data prefetching
GPU memory evolution
Why do GPUs have caches?
Summary
CUDA Thread Programming
Technical requirements
CUDA threads, blocks, and the GPU
Exploiting a CUDA block and warp
Understanding CUDA occupancy
Setting NVCC to report GPU resource usages
The settings for Linux
Settings for Windows
Analyzing the optimal occupancy using the Occupancy Calculator
Occupancy tuning – bounding register usage
Getting the achieved occupancy from the profiler
Understanding parallel reduction
Naive parallel reduction using global memory
Reducing kernels using shared memory
Writing performance measurement code
Performance comparison for the two reductions – global and shared memory
Identifying the application's performance limiter
Finding the performance limiter and optimization
Minimizing the CUDA warp divergence effect
Determining divergence as a performance bottleneck
Interleaved addressing
Sequential addressing
Performance modeling and balancing the limiter
The Roofline model
Maximizing memory bandwidth with grid-strided loops
Balancing the I/O throughput
Warp-level primitive programming
Parallel reduction with warp primitives
Cooperative Groups for flexible thread handling
Cooperative Groups in a CUDA thread block
Benefits of Cooperative Groups
Modularity
Explicit grouped threads' operation and race condition avoidance
Dynamic active thread selection
Applying to the parallel reduction
Cooperative Groups to avoid deadlock
Loop unrolling in the CUDA kernel
Atomic operations
Low/mixed precision operations
Half-precision operation
Dot product operations and accumulation for 8-bit integers and 16-bit data (DP4A and DP2A)
Measuring the performance
Summary
Kernel Execution Model and Optimization Strategies
Technical requirements
Kernel execution with CUDA streams
The usage of CUDA streams
Stream-level synchronization
Working with the default stream
Pipelining the GPU execution
Concept of GPU pipelining
Building a pipelining execution
The CUDA callback function
CUDA streams with priority
Priorities in CUDA
Stream execution with priorities
Kernel execution time estimation using CUDA events
Using CUDA events
Multiple stream estimation
CUDA dynamic parallelism
Understanding dynamic parallelism
Usage of dynamic parallelism
Recursion
Grid-level cooperative groups
Understanding grid-level cooperative groups
Usage of grid_group
CUDA kernel calls with OpenMP
OpenMP and CUDA calls
CUDA kernel calls with OpenMP
Multi-Process Service
Introduction to Message Passing Interface
Implementing an MPI-enabled application
Enabling MPS
Profiling an MPI application and understanding MPS operation
Kernel execution overhead comparison
Implementing three types of kernel executions
Comparison of three executions
Summary 
CUDA Application Profiling and Debugging
Technical requirements
Profiling focused target ranges in GPU applications
Limiting the profiling target in code
Limiting the profiling target with time or GPU
Profiling with NVTX
Visual profiling against the remote machine
Debugging a CUDA application with CUDA error
Asserting local GPU values using CUDA assert
Debugging a CUDA application with Nsight Visual Studio Edition
Debugging a CUDA application with Nsight Eclipse Edition
Debugging a CUDA application with CUDA-GDB
Breakpoints of CUDA-GDB
Inspecting variables with CUDA-GDB
Listing kernel functions
Variables investigation
Runtime validation with CUDA-memcheck
Detecting memory out of bounds
Detecting other memory errors
Profiling GPU applications with Nsight Systems
Profiling a kernel with Nsight Compute
Profiling with the CLI
Profiling with the GUI
Performance analysis report
Baseline compare
Source view
Summary
Scalable Multi-GPU Programming
Technical requirements 
Solving a linear equation using Gaussian elimination
Single GPU hotspot analysis of Gaussian elimination
GPUDirect peer to peer
Single node – multi-GPU Gaussian elimination
Brief introduction to MPI
GPUDirect RDMA
CUDA-aware MPI
Multinode – multi-GPU Gaussian elimination
CUDA streams
Application 1 – using multiple streams to overlap data transfers with kernel execution
Application 2 – using multiple streams to run kernels on multiple devices
Additional tricks
Benchmarking an existing system with an InfiniBand network card
NVIDIA Collective Communication Library (NCCL)
Collective communication acceleration using NCCL
Summary
Parallel Programming Patterns in CUDA
Technical requirements
Matrix multiplication optimization
Implementation of the tiling approach
Performance analysis of the tiling approach
Convolution
Convolution operation in CUDA
Optimization strategy
Filtering coefficients optimization using constant memory
Tiling input data using shared memory
Getting more performance
Prefix sum (scan)
Blelloch scan implementation
Building a global size scan
The pursuit of better performance
Other applications for the parallel prefix-sum operation
Compact and split
Implementing compact
Implementing split
N-body
Implementing an N-body simulation on GPU
Overview of an N-body simulation implementation
Histogram calculation
Compile and execution steps
Understanding a parallel histogram 
Calculating a histogram with CUDA atomic functions
Quicksort in CUDA using dynamic parallelism
Quicksort and CUDA dynamic parallelism 
Quicksort with CUDA
Dynamic parallelism guidelines and constraints
Radix sort
Two approaches
Approach 1 – warp-level primitives
Approach 2 – Thrust-based radix sort
Summary
Programming with Libraries and Other Languages
Linear algebra operation using cuBLAS
cuBLAS SGEMM operation
Multi-GPU operation
Mixed-precision operation using cuBLAS
GEMM with mixed precision
GEMM with TensorCore
cuRAND for parallel random number generation
cuRAND host API
cuRAND device API
cuRAND with mixed precision cuBLAS GEMM
cuFFT for Fast Fourier Transformation in GPU
Basic usage of cuFFT
cuFFT with mixed precision
cuFFT for multi-GPU
NPP for image and signal processing with GPU
Image processing with NPP
Signal processing with NPP
Applications of NPP
Writing GPU accelerated code in OpenCV
CUDA-enabled OpenCV installation
Implementing a CUDA-enabled blur filter
Enabling multi-stream processing
Writing Python code that works with CUDA
Numba – a high-performance Python compiler
Installing Numba
Using Numba with the @vectorize decorator
Using Numba with the @cuda.jit decorator
CuPy – GPU accelerated Python matrix library 
Installing CuPy
Basic usage of CuPy
Implementing custom kernel functions
PyCUDA – Pythonic access to CUDA API
Installing PyCUDA
Matrix multiplication using PyCUDA
NVBLAS for zero coding acceleration in Octave and R
Configuration
Accelerating Octave's computation
Accelerating R's computation
CUDA acceleration in MATLAB
Summary
GPU Programming Using OpenACC
Technical requirements
Image merging on a GPU using OpenACC
OpenACC directives
Parallel and loop directives
Data directive
Applying the parallel, loop, and data directive to merge image code
Asynchronous programming in OpenACC
Structured data directive
Unstructured data directive
Asynchronous programming in OpenACC
Applying the unstructured data and async directives to merge image code
Additional important directives and clauses
Gang/vector/worker
Managed memory
Kernel directive
Collapse clause
Tile clause
CUDA interoperability
DevicePtr clause
Routine directive
Summary
Deep Learning Acceleration with CUDA
Technical requirements
Fully connected layer acceleration with cuBLAS 
Neural network operations
Design of a neural network layer
Tensor and parameter containers
Implementing a fully connected layer
Implementing forward propagation
Implementing backward propagation
Layer termination
Activation layer with cuDNN
Layer configuration and initialization
Implementing layer operation
Implementing forward propagation
Implementing backward propagation
Softmax and loss functions in cuDNN/CUDA
Implementing the softmax layer
Implementing forward propagation
Implementing backward propagation
Implementing the loss function
MNIST dataloader
Managing and creating a model
Network training with the MNIST dataset
Convolutional neural networks with cuDNN
The convolution layer
Implementing forward propagation
Implementing backward propagation
Pooling layer with cuDNN
Implementing forward propagation
Implementing backward propagation
Network configuration
Mixed precision operations
Recurrent neural network optimization
Using the CUDNN LSTM operation
Implementing a virtual LSTM operation
Comparing the performance between CUDNN and SGEMM LSTM
Profiling deep learning frameworks
Profiling the PyTorch model
Profiling a TensorFlow model
Summary
Appendix
Useful nvidia-smi commands
Getting the GPU's information 
Getting formatted information
Power management mode settings
Setting the GPU's clock speed
GPU device monitoring
Monitoring GPU utilization along with multiple processes
Getting GPU topology information
WDDM/TCC mode in Windows
Setting TCC/WDDM mode
Performance modeling 
The Roofline model
Analyzing the Jacobi method
Exploring container-based development
NGC configuration for a host machine
Basic usage of the NGC container
Creating and saving a new container from the NGC container
Setting the default runtime as NVIDIA Docker
Another Book You May Enjoy
Leave a review - let other readers know what you think
Traditionally, computing requirements were associated with Central Processing Units (CPUs), which have grown from having a single core to now having multiple cores. Every new generation of CPU has provided more performance, but the scientific and High Performance Computing (HPC) community has demanded more performance year on year, creating a compute gap between what applications demand and what the hardware/software stack can provide. At the same time, an architecture that was traditionally used for video graphics found its way into the scientific domain. Graphics Processing Units (GPUs)—essentially parallel computing processors used to accelerate computer graphics—made their mark on the HPC domain in 2007, when Compute Unified Device Architecture (CUDA) was launched. CUDA grew to become the de facto standard when it comes to using GPUs for general-purpose computation, that is, non-graphics applications.
There have been many releases of CUDA since its inception, and now CUDA stands at release 10.x. Each release provides new features that support the new hardware architecture. This book is designed to help you learn GPU parallel programming and guide you in its modern-day applications. With its help, you'll be able to discover CUDA programming approaches for modern GPU architectures. The book will not only guide you through GPU features, tools, and APIs, but also help you understand how to analyze performance with sample parallel programming algorithms. This book will ensure that you gain plenty of optimization experience and insights into CUDA programming platforms with various libraries, open accelerators (OpenACC), and other languages. As you progress, you'll discover how to generate additional computing power with multiple GPUs in a box, or multiple boxes. Finally, you'll explore how CUDA accelerates deep learning algorithms, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
This book is designed to be an entry point for any newcomer or novice developer. But by the end of it, you will be able to write optimized CUDA code for different domains, including artificial intelligence.
This book will be a useful resource if any of the following apply to you:
You are new to HPC or parallel computing
You have code and want to improve its performance by applying parallel computing to the GPU
You are a deep learning expert and want to make use of the GPU to accelerate performance for deep learning algorithms such as CNNs and RNNs
You want to learn tips and tricks to optimize code and analyze GPU application performance and discover optimization strategies
You want to learn about the latest GPU features, along with efficient, distributed multi-GPU programming
If you feel you fall into any of those categories, please join us on this journey.
This beginner-level book is for programmers who want to delve into parallel computing, become part of the high-performance computing community and build modern applications. Basic C and C++ programming experience is assumed. For deep learning enthusiasts, this book covers Python InterOps, DL libraries, and practical examples on performance estimation.
Chapter 1, Introduction to CUDA Programming, demystifies some of the myths around GPU and CUDA, and introduces the CUDA programming model with a Hello World CUDA program.
Chapter 2, CUDA Memory Management, introduces the GPU memory hierarchy and how to optimally utilize it with the CUDA APIs.
Chapter 3, CUDA Thread Programming, introduces how threads operate in the GPU, highlighting key metrics on which basis optimizations are performed.
Chapter 4, Kernel Execution Model and Optimization Strategies, describes optimization strategies for CUDA kernels.
Chapter 5, CUDA Application Profiling and Debugging, covers the basic usage of tools that help with profiling and debugging CUDA applications.
Chapter 6, Scalable Multi-GPU Programming, covers how to scale CUDA algorithms across multiple GPUs within and across different nodes.
Chapter 7, Parallel Programming Patterns in CUDA, covers parallel programming algorithms that are widely used in many applications.
Chapter 8, Programming with Libraries and Other Languages, introduces the pre-existing libraries in the CUDA ecosystem, with sample code demonstrating their usage.
Chapter 9, GPU Programming Using OpenACC, introduces directive-based programming with a focus on more science and less programming.
Chapter 10, Deep Learning Acceleration with CUDA, briefly reviews neural network operations and discusses how these can be accelerated on GPUs.
The Appendix includes some supplementary reference information to help engineers use GPUs.
This book is designed for complete beginners and people who have just started to learn parallel computing. It does not require any specific knowledge besides the basics of computer architecture, and experience with C/C++ programming is assumed. For deep learning enthusiasts, in Chapter 10, Deep Learning Acceleration with CUDA, Python-based sample code is also provided, hence some Python knowledge is expected for that chapter specifically.
The code for this book is primarily developed and tested in a Linux environment, so familiarity with Linux is helpful. Any recent Linux distribution, such as CentOS or Ubuntu, is fine. The code can be compiled either with a makefile or from the command line. The book primarily uses a free software stack, so there is no need to buy any software licenses. The two key pieces of software that will be used throughout are the CUDA Toolkit and the PGI Community Edition.
Since the book primarily covers the latest GPU features using CUDA 10.x, a recent GPU architecture (Pascal onward) will help you fully exploit all of the training material. While not every chapter requires the latest GPU, having one will help you reproduce the results reported in the book. Each chapter lists the preferred or required GPU architecture in its Technical requirements section.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the Support tab.
3. Click on Code Downloads.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learn-CUDA-Programming. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781788996242_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Note that there is an asynchronous alternative to cudaMemcpy."
A block of code is set as follows:
#include <stdio.h>
#include <stdlib.h>

__global__ void print_from_gpu(void) {
    printf("Hello World! from thread [%d,%d] \
From device\n", threadIdx.x, blockIdx.x);
}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
int main(void) {
    printf("Hello World from host!\n");
    print_from_gpu<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
Any command-line input or output is written as follows:
$ nvcc -o hello_world hello_world.cu
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "For Windows users, in the VS project properties dialog, you can specify your GPU's compute capability at CUDA C/C++ | Device | Code Generation."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Since its first release in 2007, Compute Unified Device Architecture (CUDA) has grown to become the de facto standard for using Graphics Processing Units (GPUs) for general-purpose computation, that is, non-graphics applications. So, what exactly is CUDA? Someone might ask the following:
Is it a programming language?
Is it a compiler?
Is it a new computing paradigm?
In this chapter, we will demystify some of the myths around GPU and CUDA. This chapter lays the foundation for heterogeneous computing by providing a simplified view of High-Performance Computing (HPC) history and substantiating it with laws such as Moore's Law and Dennard Scaling, which were—and still are—driving the semiconductor industry and hence the processor architecture itself. You will also be introduced to the CUDA programming model and get to know the fundamental difference between CPU and GPU architecture. By the end of this chapter, you will be able to write and understand Hello World! programs using CUDA programming constructs in the C language.
The following topics will be covered in this chapter:
The history of high-performance computing
Hello World from CUDA
Vector addition using CUDA
Error reporting in CUDA
Data type support in CUDA
A common misconception about GPUs is that they are an alternative to CPUs. GPUs are used to accelerate the parts of the code that are parallel in nature. Accelerator is a common term used for GPUs because they accelerate an application by running the parallel part of the code faster, while the CPU runs the other, latency-bound part of the code. Hence, a highly efficient CPU coupled with a high-throughput GPU results in improved performance for the application.
The following diagram represents an application running on multiple processor types:
This concept is well captured by Amdahl's law, which defines the maximum speedup that can be achieved when only a fraction of the application is parallelized. To demonstrate this, the preceding diagram shows two parts of the code: one part is latency bound, while the other is throughput bound. We will cover what these two terms mean in the next section, which differentiates between the CPU and GPU architectures.
The key point is that the CPU is good at the latency-bound fraction of the code, while the GPU is good at running the Single Instruction Multiple Data (SIMD) part of the code in parallel. If only one of them, that is, the CPU code or the GPU code, runs faster after optimization, this won't necessarily result in a good speedup for the overall application; both processors must be used optimally to get the maximum performance benefit. This approach of offloading certain types of operations from the CPU onto the GPU is called heterogeneous computing.
The following diagram depicts the two types of sections that all applications have, that is, latency bound and throughput bound:
Here, the importance of improving both sections is demonstrated using Amdahl's law.
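To make this concrete, here is a quick worked example of Amdahl's law (the numbers are illustrative and not taken from a specific application). If p is the fraction of the runtime that can be parallelized and s is the speedup achieved on that fraction, the overall speedup is bounded by the following:

Overall speedup = 1 / ((1 - p) + p / s)

For instance, if 90% of an application's runtime can be offloaded to the GPU (p = 0.9) and that portion runs 10 times faster (s = 10), the overall speedup is 1 / (0.1 + 0.09) ≈ 5.3x. Even an infinitely fast GPU could never push this application beyond 10x, which is why the latency-bound CPU portion matters just as much as the throughput-bound GPU portion.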
As we mentioned in the previous section, the CPU architecture is optimized for low-latency access, while the GPU architecture is optimized for data-parallel throughput computation. As shown in the following screenshot, the CPU has a large amount of cache compared to the GPU, organized into several levels. The higher we go in the hierarchy, that is, from L3 toward L1, the smaller the cache, but the lower its latency. The CPU architecture is designed for low-latency access to cached datasets. A large number of transistors are used to implement speculative execution and out-of-order execution. Since CPUs run at a very high clock speed, it becomes necessary to hide the latency of fetching data by keeping frequently used data in caches and predicting the next instruction to execute. Applications that can exploit this temporal locality make optimal use of the CPU cache. Also, applications in which it is easy to keep the instruction pipeline filled, for example, code with few if-else branches, benefit by hiding the latency of fetching instructions. Hence, the CPU architecture is a latency-reducing architecture.
The following screenshot shows how the CPU and GPU architectures dedicate their chip die area to different memory and compute units. While the GPU spends most of its transistors on compute ALUs, the CPU uses them to reduce latency:
The GPU architecture, on the other hand, is a latency-hiding, high-throughput architecture. The GPU hides latency with computation from other threads: when one thread is waiting for its data to become available, other threads can start executing, so no clock cycles are wasted. If you are familiar with CUDA, then you might know about the concept of warps. We will cover the concept of warps in the upcoming chapters. (In CUDA, the execution unit is a warp and not a thread. Due to this, context switching happens between warps and not threads.)
Some of you might already be wondering why we can't create these threads on the CPU and do the same thing to hide latency. The reason is that GPUs have lots of registers, and all of the thread context switching information is already present in them; registers are the fastest memory available. On a CPU, however, there is a limited set of registers, and hence thread-related information is usually stored in a lower level of the memory hierarchy, such as a cache. For example, Volta contains 20 MB of register storage. Due to this, the context switching time between threads on a CPU is much higher than on a GPU.
Now, let's take a look at the different approaches when it comes to programming on GPU.
Let's go back to our original question: what is CUDA? CUDA is a parallel computing platform and programming model developed by NVIDIA that exposes general-purpose computation on GPUs as a first-class capability. Like any other processor, the GPU can be programmed using various methods. The easiest method, which provides drop-in acceleration, is to make use of existing libraries. Alternatively, developers can choose OpenACC directives for quick acceleration results and portability. Another option is to dive into CUDA by making use of language constructs in C, C++, Fortran, Python, and more, for the highest performance and flexibility. We will be covering all of these methods in detail in the subsequent chapters.
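As a rough sketch of how these approaches differ in practice (illustrative code, not a listing from this book), the same y = a*x + y operation could be expressed in all three ways; the library and directive versions are shown as comments, and the CUDA C kernel is written out:

// 1. Library approach (drop-in acceleration), for example, cuBLAS SAXPY;
//    handle creation and device allocations are omitted here:
//        cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);
//
// 2. Directive approach with OpenACC -- the compiler generates the GPU code:
//        #pragma acc parallel loop
//        for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
//
// 3. CUDA C kernel -- the most control and flexibility:
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // unique global thread index
    if (i < n)
        y[i] = a * x[i] + y[i];
}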
The following screenshot represents the various ways we can perform GPU programming:
In this section, we provided you with a perspective of how processors and high-performance computing have evolved over time. We provided you with an overview of why the heterogeneous programming model is key to getting the best performance from an application, followed by approaches to GPU programming. In the next section, we will start writing a Hello World program on a GPU.
A Linux/Windows PC with a modern NVIDIA GPU (Pascal architecture onwards) is required for this chapter, along with all of the necessary GPU drivers and the CUDA Toolkit (10.0 onward) installed. If you're unsure of your GPU's architecture, please visit NVIDIA's GPU site (https://developer.nvidia.com/cuda-gpus) and confirm your GPU's architecture. This chapter's code is also available on GitHub at https://github.com/PacktPublishing/Learn-CUDA-Programming.
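If you are not sure which GPU model is installed in your machine, running the nvidia-smi utility (installed along with the GPU driver) prints the GPU's name and the driver version, which you can then look up on the NVIDIA site mentioned previously; the Appendix covers nvidia-smi in more detail:

$ nvidia-smi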
The code examples in this chapter have been developed and tested with version 10.1 of CUDA Toolkit, but it is recommended to use the latest CUDA version, if possible.
CUDA is a heterogeneous programming model that includes provisions for both CPU and GPU. The CUDA C/C++ programming interface consists of C language extensions so that you can target portions of source code for parallel execution on the device (GPU). It is based on industry-standard C/C++ and provides a library of C functions that can be executed on the host (CPU) so that it can interact with the device.
In CUDA, there are two processors that work with each other. The host is usually referred to as the CPU, while the device is usually referred to as the GPU. The host is responsible for calling the device functions. As we've already mentioned, part of the code that runs on the GPU is called device code, while the serial code that runs on the CPU is called host code.
Let's start by writing our first CUDA code in C. The intention is to take a systematic step-wise approach, start with some sequential code, and convert it into CUDA-aware code by adding some additional keywords. As we mentioned earlier, there is no necessity to learn a new language—all we need to do is add some keywords to the existing language so that we can run it in a heterogeneous environment with CPU and GPU.
Let's take a look at our first piece of code. All this code does is print Hello World! from both the host and device:
#include <stdio.h>
#include <stdlib.h>

__global__ void print_from_gpu(void) {
    printf("Hello World! from thread [%d,%d] \
From device\n", threadIdx.x, blockIdx.x);
}

int main(void) {
    printf("Hello World from host!\n");
    print_from_gpu<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
Let's try to compile and run the preceding snippet:
1. Compile the code: Place the preceding code into a file called hello_world.cu and compile it using the NVIDIA C Compiler (nvcc). Note that the extension of the file is .cu, which tells the compiler that this file has GPU code inside it:

$ nvcc -o hello_world hello_world.cu

2. Execute the GPU code: We should receive the following output after executing the GPU code:
By now, you might have observed that CUDA C code isn't very different from ordinary C code; it only requires us to learn some additional constructs to tell the compiler which functions are GPU code and how to call GPU functions. It isn't as if we need to learn a new language altogether.
In the preceding code, we added a few constructs and keywords, as follows:
__global__: This keyword, when added before a function, tells the compiler that the function will run on the device and not on the host. However, note that it is called by the host. Another important thing to note here is that the return type of a device function is always void. Data-parallel portions of an algorithm are executed on the device as kernels.

<<<,>>>: This syntax tells the compiler that this is a call to a device function and not a host function. Additionally, the 1,1 parameters dictate the number of blocks and threads to launch in the kernel. We will cover the parameters inside the angle brackets later. For now, 1,1 means we are launching the kernel with just one block containing a single thread, that is, essentially sequential code, since we are not doing anything in the kernel apart from printing.

threadIdx.x, blockIdx.x: These give each thread a unique ID, namely its index within its block and the index of its block. We will cover this topic in more detail in the next section.

cudaDeviceSynchronize(): All of the kernel calls in CUDA are asynchronous in nature. The host becomes free after calling the kernel and starts executing the next instruction afterward. This should come as no big surprise since this is a heterogeneous environment, and hence both the host and device can run in parallel to make use of the types of processors that are available. In case the host needs to wait for the device to finish, APIs are provided as part of CUDA programming that make the host code wait for the device function to finish. One such API is cudaDeviceSynchronize, which waits until all of the previous calls to the device have finished.
Now, let's start playing around with the two parameters, that is, threadIdx.x and blockIdx.x.
Experiment 1: First, change the parameters from <<<1,1>>> to <<<2,1>>> and view the output. The output of running the multiple-block, single-thread version of the Hello World code should be as follows:
As we can see, instead of one thread, we now have two threads printing the value. Note that their unique IDs are different.
Experiment 2: Now, instead of changing the first parameter, let's change the second one, that is, change <<<1,1>>> to <<<1,2>>>, and observe the output of running the single-block, multiple-thread version of the Hello World code, as follows:
As you can see, the total number of threads that were launched into the kernel is two, just like before—the only difference is that their IDs are different. So, what are these thread and block concepts? To answer this, let's dive into the GPU architecture some more.
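For reference, with the kernel shown earlier, the two experiments should print device lines similar to the following, in addition to the host's message (the order of the lines may vary between runs); recall that the first number is threadIdx.x and the second is blockIdx.x:

Experiment 1, <<<2,1>>> (two blocks of one thread each):
Hello World! from thread [0,0] From device
Hello World! from thread [0,1] From device

Experiment 2, <<<1,2>>> (one block of two threads):
Hello World! from thread [0,0] From device
Hello World! from thread [1,0] From device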
One of the key reasons why CUDA became so popular is because the hardware and software have been designed and tightly bound to get the best performance out of the application. Due to this, it becomes necessary to show the relationship between the software CUDA programming concepts and the hardware design itself.
The following screenshot shows the two sides of CUDA:
We can see that the CUDA software has been mapped to the GPU hardware.
The following table, in accordance with the preceding screenshot, explains software and hardware mapping in terms of the CUDA programming model:
Software        Hardware it executes on/as
CUDA thread     CUDA core / SIMD code
CUDA block      Streaming multiprocessor
GRID/kernel     GPU device
Let's take a look at the preceding table's components in detail:
CUDA threads: CUDA threads execute on a CUDA core. CUDA threads are different from CPU threads: they are extremely lightweight and provide fast context switching. The reason for the fast context switching is the availability of a large register file in the GPU and a hardware-based scheduler. The thread context is kept in registers, whereas on a CPU the thread handle resides in a lower level of the memory hierarchy, such as a cache. Hence, when one thread is idle or waiting, another thread that is ready can start executing with almost no delay. Each CUDA thread must execute the same kernel and works independently on different data (SIMT).

CUDA blocks: CUDA threads are grouped together into a logical entity called a CUDA block. CUDA blocks execute on a single Streaming Multiprocessor (SM). One block runs on a single SM, that is, all of the threads within one block can only execute on the cores of one SM and do not execute on the cores of other SMs. Each GPU may have one or more SMs, and hence, to effectively make use of the whole GPU, the user needs to divide the parallel computation into blocks and threads.

GRID/kernel: CUDA blocks are grouped together into a logical entity called a CUDA GRID. A CUDA GRID is then executed on the device.
This may sound somewhat complicated at first glance. In the next section, we'll take a look at an example of vector addition to explain this. Hopefully, things will become much clearer.
So far, we've looked at parallel vector addition through the use of several blocks with one thread in the Experiment 1 – creating multiple blocks section and one block with several threads in the Experiment 2 – creating multiple threads section. In this experiment, we'll use multiple blocks as well as separate blocks containing multiple threads. This becomes more challenging in terms of how to find the index because we need to combine both threadIdx and blockIdx to generate a unique ID.
Let's take a look at two scenarios that depict different combinations that the developer can choose from:
Scenario 1: Let's consider that the total number of vector elements is 32. Each block contains eight threads, and there are a total of four blocks.

Scenario 2: Let's consider that the total number of vector elements is 32. Each block contains four threads, and there are a total of eight blocks.
In both scenarios, the number of parallel executions is 32, and all 32 elements get computed in parallel. The developer chooses the number of threads per block and the number of blocks based on the problem size and the restrictions of the hardware. We will be covering the details of the right choice of sizing, based on the architecture, in a later chapter.
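A minimal sketch of such a kernel (illustrative code, not the exact listing shown in the screenshot below) combines blockIdx.x, blockDim.x, and threadIdx.x to derive the unique element index that each thread works on; d_a, d_b, and d_c are assumed to be device pointers that have already been allocated and populated:

__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // unique index across all blocks
    if (i < n)                                     // guard against any extra threads
        c[i] = a[i] + b[i];
}

// Scenario 1: four blocks of eight threads each cover all 32 elements.
//     vector_add<<<4, 8>>>(d_a, d_b, d_c, 32);
// Scenario 2: eight blocks of four threads each cover the same 32 elements.
//     vector_add<<<8, 4>>>(d_a, d_b, d_c, 32);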
The following screenshot shows the vector addition GPU indexing code for different block and thread configurations:
