Explore different GPU programming methods using libraries and directives such as OpenACC, with extensions to languages such as C, C++, and Python
Compute Unified Device Architecture (CUDA) is NVIDIA's GPU computing platform and application programming interface. It's designed to work with programming languages such as C, C++, and Python. With CUDA, you can leverage a GPU's parallel computing power for a range of high-performance computing applications in the fields of science, healthcare, and deep learning.
Learn CUDA Programming will help you learn GPU parallel programming and understand its modern applications. In this book, you'll discover CUDA programming approaches for modern GPU architectures. You'll not only be guided through GPU features, tools, and APIs, you'll also learn how to analyze performance with sample parallel programming algorithms. This book will help you optimize the performance of your apps by giving insights into CUDA programming platforms with various libraries, compiler directives (OpenACC), and other languages. As you progress, you'll learn how additional computing power can be generated using multiple GPUs in a box or in multiple boxes. Finally, you'll explore how CUDA accelerates deep learning algorithms, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
By the end of this CUDA book, you'll be equipped with the skills you need to integrate the power of GPU computing in your applications.
This beginner-level book is for programmers who want to delve into parallel computing, become part of the high-performance computing community and build modern applications. Basic C and C++ programming experience is assumed. For deep learning enthusiasts, this book covers Python InterOps, DL libraries, and practical examples on performance estimation.
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Richa Tripathi
Acquisition Editor: Alok Dhuri
Content Development Editor: Digvijay Bagul
Senior Editor: Afshaan Khan
Technical Editor: Romy Dias
Copy Editor: Safis Editing
Project Coordinator: Prajakta Naik
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Production Designer: Deepika Naik
First published: September 2019
Production reference: 1270919
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78899-624-2
www.packt.com
Packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Jaegeun Han is currently working as a solutions architect at NVIDIA, Korea. He has around 9 years' experience and he supports consumer internet companies in deep learning. Before NVIDIA, he worked in system software and parallel computing developments, and application development in medical and surgical robotics fields. He obtained a master's degree in CSE from Seoul National University.
Bharatkumar Sharma obtained a master's degree in information technology from the Indian Institute of Information Technology, Bangalore. He has around 10 years of development and research experience in the domains of software architecture and distributed and parallel computing. He is currently working with NVIDIA as a senior solutions architect, South Asia.
Christian Stehno studied computer science, receiving his diploma from Oldenburg University, Germany, in 2000. Since then, he's worked in different fields of computer science, first as a researcher in theoretical computer science at an academic institution, before subsequently switching to embedded system design at a research institute. In 2010, he started his own company, CoSynth, which develops embedded systems and intelligent cameras for industrial automation. In addition, he is a long-time member of the Irrlicht 3D engine developer team.
Minseok Lee is a developer technology engineer at NVIDIA. He works on parallelizing and optimizing scientific and AI applications for CPU-GPU heterogeneous systems. Before joining NVIDIA, he was a C++-based library designer/developer, and he received an M.Eng. in computer science for his work on utilizing GPU architectures more efficiently.
Aidan Temple is a software engineer and lead developer at Nanotek. He has recently graduated with honors from Glasgow Caledonian University, where he received a BSc in computer games software development.
While at university, Aidan also undertook a research degree outlining the benefits of implementing GUI-based game frameworks by means of parallel processing through the utilization of NVIDIA's CUDA architecture. He also received an IGDA scholarship.
Prior to his time at university, he studied computer games development at James Watt College of Further and Higher Education. Due to his excellent understanding and demonstration of game development and design methodologies, Aidan graduated from James Watt College with a distinction in his field.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Learn CUDA Programming
Dedication
About Packt
Why subscribe?
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to CUDA Programming
The history of high-performance computing
Heterogeneous computing
Programming paradigm
Low latency versus higher throughput
Programming approaches to GPU
Technical requirements 
Hello World from CUDA
Thread hierarchy
GPU architecture
Vector addition using CUDA
Experiment 1 – creating multiple blocks
Experiment 2 – creating multiple threads
Experiment 3 – combining blocks and threads
Why bother with threads and blocks?
Launching kernels in multiple dimensions
Error reporting in CUDA
Data type support in CUDA
Summary
CUDA Memory Management
Technical requirements 
NVIDIA Visual Profiler
Global memory/device memory
Vector addition on global memory
Coalesced versus uncoalesced global memory access
Memory throughput analysis
Shared memory
Matrix transpose on shared memory
Bank conflicts and their effect on shared memory
Read-only data/cache
Computer vision – image scaling using texture memory
Registers in GPU
Pinned memory
Bandwidth test – pinned versus pageable
Unified memory
Understanding unified memory page allocation and transfer
Optimizing unified memory with warp per page
Optimizing unified memory using data prefetching
GPU memory evolution
Why do GPUs have caches?
Summary
CUDA Thread Programming
Technical requirements
CUDA threads, blocks, and the GPU
Exploiting a CUDA block and warp
Understanding CUDA occupancy
Setting NVCC to report GPU resource usages
The settings for Linux
Settings for Windows
Analyzing the optimal occupancy using the Occupancy Calculator
Occupancy tuning – bounding register usage
Getting the achieved occupancy from the profiler
Understanding parallel reduction
Naive parallel reduction using global memory
Reducing kernels using shared memory
Writing performance measurement code
Performance comparison for the two reductions – global and shared memory
Identifying the application's performance limiter
Finding the performance limiter and optimization
Minimizing the CUDA warp divergence effect
Determining divergence as a performance bottleneck
Interleaved addressing
Sequential addressing
Performance modeling and balancing the limiter
The Roofline model
Maximizing memory bandwidth with grid-strided loops
Balancing the I/O throughput
Warp-level primitive programming
Parallel reduction with warp primitives
Cooperative Groups for flexible thread handling
Cooperative Groups in a CUDA thread block
Benefits of Cooperative Groups
Modularity
Explicit grouped threads' operation and race condition avoidance
Dynamic active thread selection
Applying to the parallel reduction
Cooperative Groups to avoid deadlock
Loop unrolling in the CUDA kernel
Atomic operations
Low/mixed precision operations
Half-precision operation
Dot product operations and accumulation for 8-bit integers and 16-bit data (DP4A and DP2A)
Measuring the performance
Summary
Kernel Execution Model and Optimization Strategies
Technical requirements
Kernel execution with CUDA streams
The usage of CUDA streams
Stream-level synchronization
Working with the default stream
Pipelining the GPU execution
Concept of GPU pipelining
Building a pipelining execution
The CUDA callback function
CUDA streams with priority
Priorities in CUDA
Stream execution with priorities
Kernel execution time estimation using CUDA events
Using CUDA events
Multiple stream estimation
CUDA dynamic parallelism
Understanding dynamic parallelism
Usage of dynamic parallelism
Recursion
Grid-level cooperative groups
Understanding grid-level cooperative groups
Usage of grid_group
CUDA kernel calls with OpenMP
OpenMP and CUDA calls
CUDA kernel calls with OpenMP
Multi-Process Service
Introduction to Message Passing Interface
Implementing an MPI-enabled application
Enabling MPS
Profiling an MPI application and understanding MPS operation
Kernel execution overhead comparison
Implementing three types of kernel executions
Comparison of three executions
Summary 
CUDA Application Profiling and Debugging
Technical requirements
Profiling focused target ranges in GPU applications
Limiting the profiling target in code
Limiting the profiling target with time or GPU
Profiling with NVTX
Visual profiling against the remote machine
Debugging a CUDA application with CUDA error
Asserting local GPU values using CUDA assert
Debugging a CUDA application with Nsight Visual Studio Edition
Debugging a CUDA application with Nsight Eclipse Edition
Debugging a CUDA application with CUDA-GDB
Breakpoints of CUDA-GDB
Inspecting variables with CUDA-GDB
Listing kernel functions
Variables investigation
Runtime validation with CUDA-memcheck
Detecting memory out of bounds
Detecting other memory errors
Profiling GPU applications with Nsight Systems
Profiling a kernel with Nsight Compute
Profiling with the CLI
Profiling with the GUI
Performance analysis report
Baseline compare
Source view
Summary
Scalable Multi-GPU Programming
Technical requirements 
Solving a linear equation using Gaussian elimination
Single GPU hotspot analysis of Gaussian elimination
GPUDirect peer to peer
Single node – multi-GPU Gaussian elimination
Brief introduction to MPI
GPUDirect RDMA
CUDA-aware MPI
Multinode – multi-GPU Gaussian elimination
CUDA streams
Application 1 – using multiple streams to overlap data transfers with kernel execution
Application 2 – using multiple streams to run kernels on multiple devices
Additional tricks
Benchmarking an existing system with an InfiniBand network card
NVIDIA Collective Communication Library (NCCL)
Collective communication acceleration using NCCL
Summary
Parallel Programming Patterns in CUDA
Technical requirements
Matrix multiplication optimization
Implementation of the tiling approach
Performance analysis of the tiling approach
Convolution
Convolution operation in CUDA
Optimization strategy
Filtering coefficients optimization using constant memory
Tiling input data using shared memory
Getting more performance
Prefix sum (scan)
Blelloch scan implementation
Building a global size scan
The pursuit of better performance
Other applications for the parallel prefix-sum operation
Compact and split
Implementing compact
Implementing split
N-body
Implementing an N-body simulation on GPU
Overview of an N-body simulation implementation
Histogram calculation
Compile and execution steps
Understanding a parallel histogram 
Calculating a histogram with CUDA atomic functions
Quicksort in CUDA using dynamic parallelism
Quicksort and CUDA dynamic parallelism 
Quicksort with CUDA
Dynamic parallelism guidelines and constraints
Radix sort
Two approaches
Approach 1 – warp-level primitives
Approach 2 – Thrust-based radix sort
Summary
Programming with Libraries and Other Languages
Linear algebra operation using cuBLAS
cuBLAS SGEMM operation
Multi-GPU operation
Mixed-precision operation using cuBLAS
GEMM with mixed precision
GEMM with TensorCore
cuRAND for parallel random number generation
cuRAND host API
cuRAND device API
cuRAND with mixed precision cuBLAS GEMM
cuFFT for Fast Fourier Transformation in GPU
Basic usage of cuFFT
cuFFT with mixed precision
cuFFT for multi-GPU
NPP for image and signal processing with GPU
Image processing with NPP
Signal processing with NPP
Applications of NPP
Writing GPU accelerated code in OpenCV
CUDA-enabled OpenCV installation
Implementing a CUDA-enabled blur filter
Enabling multi-stream processing
Writing Python code that works with CUDA
Numba – a high-performance Python compiler
Installing Numba
Using Numba with the @vectorize decorator
Using Numba with the @cuda.jit decorator
CuPy – GPU accelerated Python matrix library 
Installing CuPy
Basic usage of CuPy
Implementing custom kernel functions
PyCUDA – Pythonic access to CUDA API
Installing PyCUDA
Matrix multiplication using PyCUDA
NVBLAS for zero coding acceleration in Octave and R
Configuration
Accelerating Octave's computation
Accelerating R's computation
CUDA acceleration in MATLAB
Summary
GPU Programming Using OpenACC
Technical requirements
Image merging on a GPU using OpenACC
OpenACC directives
Parallel and loop directives
Data directive
Applying the parallel, loop, and data directive to merge image code
Asynchronous programming in OpenACC
Structured data directive
Unstructured data directive
Asynchronous programming in OpenACC
Applying the unstructured data and async directives to merge image code
Additional important directives and clauses
Gang/vector/worker
Managed memory
Kernel directive
Collapse clause
Tile clause
CUDA interoperability
DevicePtr clause
Routine directive
Summary
Deep Learning Acceleration with CUDA
Technical requirements
Fully connected layer acceleration with cuBLAS 
Neural network operations
Design of a neural network layer
Tensor and parameter containers
Implementing a fully connected layer
Implementing forward propagation
Implementing backward propagation
Layer termination
Activation layer with cuDNN
Layer configuration and initialization
Implementing layer operation
Implementing forward propagation
Implementing backward propagation
Softmax and loss functions in cuDNN/CUDA
Implementing the softmax layer
Implementing forward propagation
Implementing backward propagation
Implementing the loss function
MNIST dataloader
Managing and creating a model
Network training with the MNIST dataset
Convolutional neural networks with cuDNN
The convolution layer
Implementing forward propagation
Implementing backward propagation
Pooling layer with cuDNN
Implementing forward propagation
Implementing backward propagation
Network configuration
Mixed precision operations
Recurrent neural network optimization
Using the CUDNN LSTM operation
Implementing a virtual LSTM operation
Comparing the performance between CUDNN and SGEMM LSTM
Profiling deep learning frameworks
Profiling the PyTorch model
Profiling a TensorFlow model
Summary
Appendix
Useful nvidia-smi commands
Getting the GPU's information 
Getting formatted information
Power management mode settings
Setting the GPU's clock speed
GPU device monitoring
Monitoring GPU utilization along with multiple processes
Getting GPU topology information
WDDM/TCC mode in Windows
Setting TCC/WDDM mode
Performance modeling 
The Roofline model
Analyzing the Jacobi method
Exploring container-based development
NGC configuration for a host machine
Basic usage of the NGC container
Creating and saving a new container from the NGC container
Setting the default runtime as NVIDIA Docker
Another Book You May Enjoy
Leave a review - let other readers know what you think
Traditionally, computing requirements were associated with Central Processing Units (CPUs), which have grown from having a single core to now having multiple cores. Every new generation of CPU has provided more performance, but the scientific and High Performance Computing (HPC) community has demanded more performance year on year, creating a compute gap between what applications demand and what the hardware/software stack can provide. At the same time, an architecture that was traditionally used for video graphics found its way into the scientific domain. Graphics Processing Units (GPUs)—essentially parallel computing processors used to accelerate computer graphics—made their mark on the HPC domain in 2007, when Compute Unified Device Architecture (CUDA) was launched. CUDA grew to become the de facto standard when it comes to using GPUs for general-purpose computation, that is, non-graphics applications.
There have been many releases of CUDA since its inception, and now CUDA stands at release 10.x. Each release provides new features that support the new hardware architecture. This book is designed to help you learn GPU parallel programming and guide you in its modern-day applications. With its help, you'll be able to discover CUDA programming approaches for modern GPU architectures. The book will not only guide you through GPU features, tools, and APIs, but also help you understand how to analyze performance with sample parallel programming algorithms. This book will ensure that you gain plenty of optimization experience and insights into CUDA programming platforms with various libraries, open accelerators (OpenACC), and other languages. As you progress, you'll discover how to generate additional computing power with multiple GPUs in a box, or multiple boxes. Finally, you'll explore how CUDA accelerates deep learning algorithms, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
This book is designed to be an entry point for any newcomer or novice developer. But by the end of it, you will be able to write optimized CUDA code for different domains, including artificial intelligence.
This book will be a useful resource if any of the following apply to you:
You are new to HPC or parallel computing
You have code and want to improve its performance by applying parallel computing to the GPU
You are a deep learning expert and want to make use of the GPU to accelerate performance for deep learning algorithms such as CNNs and RNNs
You want to learn tips and tricks to optimize code and analyze GPU application performance and discover optimization strategies
You want to learn about the latest GPU features, along with efficient, distributed multi-GPU programming
If you feel you fall into any of those categories, please join us on this journey.
This beginner-level book is for programmers who want to delve into parallel computing, become part of the high-performance computing community and build modern applications. Basic C and C++ programming experience is assumed. For deep learning enthusiasts, this book covers Python InterOps, DL libraries, and practical examples on performance estimation.
Chapter 1, Introduction to CUDA Programming, demystifies some of the myths around GPU and CUDA, and introduces the CUDA programming model with a Hello World CUDA program.
Chapter 2, CUDA Memory Management, introduces the GPU memory hierarchy and how to optimally utilize it with the CUDA APIs.
Chapter 3, CUDA Thread Programming, introduces how threads operate in the GPU, highlighting key metrics on which basis optimizations are performed.
Chapter 4, Kernel Execution Model and Optimization Strategies, describes optimization strategies for CUDA kernels.
Chapter 5, CUDA Application Profiling and Debugging, covers the basic usage of tools that help with profiling and debugging CUDA applications.
Chapter 6, Scalable Multi-GPU Programming, covers how to scale CUDA algorithms across multiple GPUs within and across different nodes.
Chapter 7, Parallel Programming Patterns in CUDA, covers parallel programming algorithms that are widely used in many applications.
Chapter 8, Programming with Libraries and Other Languages, introduces the pre-existing libraries in the CUDA ecosystem, with sample code demonstrating their usage.
Chapter 9, GPU Programming Using OpenACC, introduces directive-based programming with a focus on more science and less programming.
Chapter 10, Deep Learning Acceleration with CUDA, briefly reviews neural network operations and discusses how these can be accelerated on GPUs.
The Appendix includes some supplementary reference information to help engineers use GPUs.
This book is designed for complete beginners and people who have just started to learn parallel computing. It does not require any specific knowledge besides the basics of computer architecture, and experience with C/C++ programming is assumed. For deep learning enthusiasts, in Chapter 10, Deep Learning Acceleration with CUDA, Python-based sample code is also provided, hence some Python knowledge is expected for that chapter specifically.
The code for this book is primarily developed and tested in a Linux environment, so familiarity with Linux is helpful. Any recent Linux distribution, such as CentOS or Ubuntu, is fine. The code can be compiled either with a makefile or from the command line. The book primarily uses a free software stack, so there is no need to buy any software licenses. The two key pieces of software that will be used throughout are the CUDA Toolkit and the PGI Community Edition.
Since the book primarily covers the latest GPU features using CUDA 10.x, a recent GPU architecture (Pascal onward) will help you fully exploit all of the training material. While not every chapter requires the latest GPU, having one will help you reproduce the results reported in the book. Each chapter lists the preferred or required GPU architecture in its Technical requirements section.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the Support tab.
3. Click on Code Downloads.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learn-CUDA-Programming. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781788996242_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Note that there is an asynchronous alternative to cudaMemcpy."
A block of code is set as follows:
#include <stdio.h>
#include <stdlib.h>

__global__ void print_from_gpu(void) {
    printf("Hello World! from thread [%d,%d] \
From device\n", threadIdx.x, blockIdx.x);
}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
int main(void) {
    printf("Hello World from host!\n");
    print_from_gpu<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
Any command-line input or output is written as follows:
$ nvcc -o hello_world hello_world.cu
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "For Windows users, in the VS project properties dialog, you can specify your GPU's compute capability at CUDA C/C++ | Device | Code Generation."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Since its first release in 2007, Compute Unified Device Architecture (CUDA) has grown to become the de facto standard for using Graphics Processing Units (GPUs) for general-purpose computation, that is, non-graphics applications. So, what exactly is CUDA? Someone might ask the following:
Is it a programming language?
Is it a compiler?
Is it a new computing paradigm?
In this chapter, we will demystify some of the myths around GPU and CUDA. This chapter lays the foundation for heterogeneous computing by providing a simplified view of High-Performance Computing (HPC) history and substantiating it with laws such as Moore's Law and Dennard Scaling, which were—and still are—driving the semiconductor industry and hence the processor architecture itself. You will also be introduced to the CUDA programming model and get to know the fundamental difference between CPU and GPU architecture. By the end of this chapter, you will be able to write and understand Hello World! programs using CUDA programming constructs in the C language.
The following topics will be covered in this chapter:
The history of high-performance computing
Hello World from CUDA
Vector addition using CUDA
Error reporting in CUDA
Data type support in CUDA
A common misconception about GPUs is that they are an alternative to CPUs. GPUs are used to accelerate the parts of the code that are parallel in nature. Accelerator is a common term used for GPUs because they accelerate an application by running the parallel part of the code faster, while the CPU runs the other, latency-bound part of the code. Hence, a highly efficient CPU coupled with a high-throughput GPU results in improved performance for the application.
The following diagram represents an application running on multiple processor types:
This concept is well captured by Amdahl's law, which defines the maximum speedup that can be achieved when only a fraction of the application is parallelized. To demonstrate this, the preceding diagram shows two parts of the code: one part is latency bound, while the other is throughput bound. We will cover what these two terms mean in the next section, which differentiates between the CPU and GPU architectures.
The key point is that the CPU is good at the latency-bound fraction of the code, while the GPU is good at running the Single Instruction Multiple Data (SIMD) part of the code in parallel. If only one of them, that is, the CPU code or the GPU code, runs faster after optimization, this won't necessarily result in a good speedup for the overall application; both processors must be used optimally to get the maximum performance benefit. This approach of offloading certain types of operations from the CPU onto the GPU is called heterogeneous computing.
The following diagram depicts the two types of sections that all applications have, that is, latency bound and throughput bound:
Here, the importance of improving both sections is demonstrated using Amdahl's law.
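To make this concrete, here is a quick worked example of Amdahl's law (the numbers are illustrative and not taken from a specific application). If p is the fraction of the runtime that can be parallelized and s is the speedup achieved on that fraction, the overall speedup is bounded by the following:

Overall speedup = 1 / ((1 - p) + p / s)

For instance, if 90% of an application's runtime can be offloaded to the GPU (p = 0.9) and that portion runs 10 times faster (s = 10), the overall speedup is 1 / (0.1 + 0.09) ≈ 5.3x. Even an infinitely fast GPU could never push this application beyond 10x, which is why the latency-bound CPU portion matters just as much as the throughput-bound GPU portion.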
As we mentioned in the previous section, the CPU architecture is optimized for low-latency access, while the GPU architecture is optimized for data-parallel throughput computation. As shown in the following screenshot, the CPU has a large amount of cache compared to the GPU, organized into several levels. The higher we go in the hierarchy, that is, from L3 toward L1, the smaller the cache, but the lower its latency. The CPU architecture is designed for low-latency access to cached datasets. A large number of transistors are used to implement speculative execution and out-of-order execution. Since CPUs run at a very high clock speed, it becomes necessary to hide the latency of fetching data by keeping frequently used data in caches and predicting the next instruction to execute. Applications that can exploit this temporal locality make optimal use of the CPU cache. Also, applications in which it is easy to keep the instruction pipeline filled, for example, code with few if-else branches, benefit by hiding the latency of fetching instructions. Hence, the CPU architecture is a latency-reducing architecture.
The following screenshot shows how the CPU and GPU architectures dedicate their chip die area to different memory and compute units. While the GPU spends most of its transistors on compute ALUs, the CPU uses them to reduce latency:
The GPU architecture, on the other hand, is a latency-hiding, high-throughput architecture. The GPU hides latency with computation from other threads: when one thread is waiting for its data to become available, other threads can start executing, so no clock cycles are wasted. If you are familiar with CUDA, then you might know about the concept of warps. We will cover the concept of warps in the upcoming chapters. (In CUDA, the execution unit is a warp and not a thread. Due to this, context switching happens between warps and not threads.)
Some of you might already be wondering why we can't create these threads on the CPU and do the same thing to hide latency. The reason is that GPUs have lots of registers, and all of the thread context switching information is already present in them; registers are the fastest memory available. On a CPU, however, there is a limited set of registers, and hence thread-related information is usually stored in a lower level of the memory hierarchy, such as a cache. For example, Volta contains 20 MB of register storage. Due to this, the context switching time between threads on a CPU is much higher than on a GPU.
Now, let's take a look at the different approaches when it comes to programming on GPU.
Let's go back to our original question: what is CUDA? CUDA is a parallel computing platform and programming model developed by NVIDIA that exposes general-purpose computation on GPUs as a first-class capability. Like any other processor, the GPU can be programmed using various methods. The easiest method, which provides drop-in acceleration, is to make use of existing libraries. Alternatively, developers can choose OpenACC directives for quick acceleration results and portability. Another option is to dive into CUDA by making use of language constructs in C, C++, Fortran, Python, and more, for the highest performance and flexibility. We will be covering all of these methods in detail in the subsequent chapters.
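As a rough sketch of how these approaches differ in practice (illustrative code, not a listing from this book), the same y = a*x + y operation could be expressed in all three ways; the library and directive versions are shown as comments, and the CUDA C kernel is written out:

// 1. Library approach (drop-in acceleration), for example, cuBLAS SAXPY;
//    handle creation and device allocations are omitted here:
//        cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);
//
// 2. Directive approach with OpenACC -- the compiler generates the GPU code:
//        #pragma acc parallel loop
//        for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
//
// 3. CUDA C kernel -- the most control and flexibility:
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // unique global thread index
    if (i < n)
        y[i] = a * x[i] + y[i];
}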
The following screenshot represents the various ways we can perform GPU programming:
In this section, we provided you with a perspective of how processors and high-performance computing have evolved over time. We provided you with an overview of why the heterogeneous programming model is key to getting the best performance from an application, followed by approaches to GPU programming. In the next section, we will start writing a Hello World program on a GPU.
A Linux/Windows PC with a modern NVIDIA GPU (Pascal architecture onwards) is required for this chapter, along with all of the necessary GPU drivers and the CUDA Toolkit (10.0 onward) installed. If you're unsure of your GPU's architecture, please visit NVIDIA's GPU site (https://developer.nvidia.com/cuda-gpus) and confirm your GPU's architecture. This chapter's code is also available on GitHub at https://github.com/PacktPublishing/Learn-CUDA-Programming.
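If you are not sure which GPU model is installed in your machine, running the nvidia-smi utility (installed along with the GPU driver) prints the GPU's name and the driver version, which you can then look up on the NVIDIA site mentioned previously; the Appendix covers nvidia-smi in more detail:

$ nvidia-smi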
The code examples in this chapter have been developed and tested with version 10.1 of CUDA Toolkit, but it is recommended to use the latest CUDA version, if possible.
CUDA is a heterogeneous programming model that includes provisions for both CPU and GPU. The CUDA C/C++ programming interface consists of C language extensions so that you can target portions of source code for parallel execution on the device (GPU). It is based on industry-standard C/C++ and provides a library of C functions that can be executed on the host (CPU) so that it can interact with the device.
In CUDA, there are two processors that work with each other. The host is usually referred to as the CPU, while the device is usually referred to as the GPU. The host is responsible for calling the device functions. As we've already mentioned, part of the code that runs on the GPU is called device code, while the serial code that runs on the CPU is called host code.
Let's start by writing our first CUDA code in C. The intention is to take a systematic step-wise approach, start with some sequential code, and convert it into CUDA-aware code by adding some additional keywords. As we mentioned earlier, there is no necessity to learn a new language—all we need to do is add some keywords to the existing language so that we can run it in a heterogeneous environment with CPU and GPU.
Let's take a look at our first piece of code. All this code does is print Hello World! from both the host and device:
#include <stdio.h>
#include <stdlib.h>

__global__ void print_from_gpu(void) {
    printf("Hello World! from thread [%d,%d] \
From device\n", threadIdx.x, blockIdx.x);
}

int main(void) {
    printf("Hello World from host!\n");
    print_from_gpu<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
Let's try to compile and run the preceding snippet:
1. Compile the code: Place the preceding code into a file called hello_world.cu and compile it using the NVIDIA C Compiler (nvcc). Note that the extension of the file is .cu, which tells the compiler that this file has GPU code inside it:

$ nvcc -o hello_world hello_world.cu

2. Execute the GPU code: We should receive the following output after executing the GPU code:
By now, you might have observed that CUDA C code isn't very different from ordinary C code; it only requires us to learn some additional constructs to tell the compiler which functions are GPU code and how to call GPU functions. It isn't as if we need to learn a new language altogether.
In the preceding code, we added a few constructs and keywords, as follows:
__global__: This keyword, when added before a function, tells the compiler that the function will run on the device and not on the host. However, note that it is called by the host. Another important thing to note here is that the return type of a device function is always void. Data-parallel portions of an algorithm are executed on the device as kernels.

<<<,>>>: This syntax tells the compiler that this is a call to a device function and not a host function. Additionally, the 1,1 parameters dictate the number of blocks and threads to launch in the kernel. We will cover the parameters inside the angle brackets later. For now, 1,1 means we are launching the kernel with just one block containing a single thread, that is, essentially sequential code, since we are not doing anything in the kernel apart from printing.

threadIdx.x, blockIdx.x: These give each thread a unique ID, namely its index within its block and the index of its block. We will cover this topic in more detail in the next section.

cudaDeviceSynchronize(): All of the kernel calls in CUDA are asynchronous in nature. The host becomes free after calling the kernel and starts executing the next instruction afterward. This should come as no big surprise since this is a heterogeneous environment, and hence both the host and device can run in parallel to make use of the types of processors that are available. In case the host needs to wait for the device to finish, APIs are provided as part of CUDA programming that make the host code wait for the device function to finish. One such API is cudaDeviceSynchronize, which waits until all of the previous calls to the device have finished.
Now, let's start playing around with the two parameters, that is, threadIdx.x and blockIdx.x.
Experiment 1: First, change the parameters from <<<1,1>>> to <<<2,1>>> and view the output. The output of running the multiple-block, single-thread version of the Hello World code should be as follows:
As we can see, instead of one thread, we now have two threads printing the value. Note that their unique IDs are different.
Experiment 2: Now, instead of changing the first parameter, let's change the second one, that is, change <<<1,1>>> to <<<1,2>>>, and observe the output of running the single-block, multiple-thread version of the Hello World code, as follows:
As you can see, the total number of threads that were launched into the kernel is two, just like before—the only difference is that their IDs are different. So, what are these thread and block concepts? To answer this, let's dive into the GPU architecture some more.
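For reference, with the kernel shown earlier, the two experiments should print device lines similar to the following, in addition to the host's message (the order of the lines may vary between runs); recall that the first number is threadIdx.x and the second is blockIdx.x:

Experiment 1, <<<2,1>>> (two blocks of one thread each):
Hello World! from thread [0,0] From device
Hello World! from thread [0,1] From device

Experiment 2, <<<1,2>>> (one block of two threads):
Hello World! from thread [0,0] From device
Hello World! from thread [1,0] From device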
One of the key reasons why CUDA became so popular is because the hardware and software have been designed and tightly bound to get the best performance out of the application. Due to this, it becomes necessary to show the relationship between the software CUDA programming concepts and the hardware design itself.
The following screenshot shows the two sides of CUDA:
We can see that the CUDA software has been mapped to the GPU hardware.
The following table, in accordance with the preceding screenshot, explains software and hardware mapping in terms of the CUDA programming model:
Software        Hardware it executes on/as
CUDA thread     CUDA core / SIMD code
CUDA block      Streaming multiprocessor
GRID/kernel     GPU device
Let's take a look at the preceding table's components in detail:
CUDA threads: CUDA threads execute on a CUDA core. CUDA threads are different from CPU threads: they are extremely lightweight and provide fast context switching. The reason for the fast context switching is the availability of a large register file in the GPU and a hardware-based scheduler. The thread context is kept in registers, whereas on a CPU the thread handle resides in a lower level of the memory hierarchy, such as a cache. Hence, when one thread is idle or waiting, another thread that is ready can start executing with almost no delay. Each CUDA thread must execute the same kernel and works independently on different data (SIMT).

CUDA blocks: CUDA threads are grouped together into a logical entity called a CUDA block. CUDA blocks execute on a single Streaming Multiprocessor (SM). One block runs on a single SM, that is, all of the threads within one block can only execute on the cores of one SM and do not execute on the cores of other SMs. Each GPU may have one or more SMs, and hence, to effectively make use of the whole GPU, the user needs to divide the parallel computation into blocks and threads.

GRID/kernel: CUDA blocks are grouped together into a logical entity called a CUDA GRID. A CUDA GRID is then executed on the device.
This may sound somewhat complicated at first glance. In the next section, we'll take a look at an example of vector addition to explain this. Hopefully, things will become much clearer.
So far, we've looked at parallel vector addition through the use of several blocks with one thread in the Experiment 1 – creating multiple blocks section and one block with several threads in the Experiment 2 – creating multiple threads section. In this experiment, we'll use multiple blocks as well as separate blocks containing multiple threads. This becomes more challenging in terms of how to find the index because we need to combine both threadIdx and blockIdx to generate a unique ID.
Let's take a look at two scenarios that depict different combinations that the developer can choose from:
Scenario 1: Let's consider that the total number of vector elements is 32. Each block contains eight threads, and there are a total of four blocks.

Scenario 2: Let's consider that the total number of vector elements is 32. Each block contains four threads, and there are a total of eight blocks.
In both scenarios, the number of parallel executions is 32, and all 32 elements get computed in parallel. The developer chooses the number of threads per block and the number of blocks based on the problem size and the restrictions of the hardware. We will be covering the details of the right choice of sizing, based on the architecture, in a later chapter.
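A minimal sketch of such a kernel (illustrative code, not the exact listing shown in the screenshot below) combines blockIdx.x, blockDim.x, and threadIdx.x to derive the unique element index that each thread works on; d_a, d_b, and d_c are assumed to be device pointers that have already been allocated and populated:

__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // unique index across all blocks
    if (i < n)                                     // guard against any extra threads
        c[i] = a[i] + b[i];
}

// Scenario 1: four blocks of eight threads each cover all 32 elements.
//     vector_add<<<4, 8>>>(d_a, d_b, d_c, 32);
// Scenario 2: eight blocks of four threads each cover the same 32 elements.
//     vector_add<<<8, 4>>>(d_a, d_b, d_c, 32);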
The following screenshot shows the vector addition GPU indexing code for different block and thread configurations:
