Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA

Bhaumik Vaidya

Description

Discover how CUDA allows OpenCV to handle complex and rapidly growing image data processing in computer and machine vision by accessing the power of GPU




Key Features



  • Explore examples to leverage the GPU processing power with OpenCV and CUDA


  • Enhance the performance of algorithms on embedded hardware platforms


  • Discover C++ and Python libraries for GPU acceleration





Book Description



Computer vision has been revolutionizing a wide range of industries, and OpenCV is the most widely chosen tool for computer vision, with its ability to work in multiple programming languages. Nowadays, computer vision needs to process large images in real time, which is difficult for OpenCV to handle on its own. This is where CUDA comes into the picture, allowing OpenCV to leverage powerful NVIDIA GPUs. This book provides a detailed overview of integrating OpenCV with CUDA for practical applications.






To start with, you'll understand GPU programming with CUDA, an essential aspect for computer vision developers who have never worked with GPUs. You'll then move on to exploring OpenCV acceleration with GPUs and CUDA by walking through some practical examples.






Once you have got to grips with the core concepts, you'll familiarize yourself with deploying OpenCV applications on NVIDIA Jetson TX1, which is popular for computer vision and deep learning applications. The last chapters of the book explain PyCUDA, a Python library that leverages the power of CUDA and GPUs for accelerations and can be used by computer vision developers who use OpenCV with Python.






By the end of this book, you'll be able to enhance your computer vision applications through the book's hands-on approach.





What you will learn



  • Understand how to access GPU device properties and capabilities from CUDA programs


  • Learn how to accelerate searching and sorting algorithms


  • Detect shapes such as lines and circles in images


  • Explore algorithms for object detection and tracking


  • Process videos using different video analysis techniques on Jetson TX1


  • Access GPU device properties from the PyCUDA program


  • Understand how kernel execution works





Who this book is for



This book is a go-to guide for you if you are a developer working with OpenCV and want to learn how to process more complex image data by exploiting GPU processing. A thorough understanding of computer vision concepts and programming languages such as C++ or Python is expected.




Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA


Effective techniques for processing complex image data in real time using GPUs


Bhaumik Vaidya


BIRMINGHAM - MUMBAI

Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Acquisition Editor: Alok Dhuri
Content Development Editor: Pooja Parvatkar
Technical Editor: Divya Vadhyar
Copy Editor: Safis Editing
Project Coordinator: Ulhas Kambali
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Tom Scaria
Production Coordinator: Deepika Naik

First published: September 2018

Production reference: 1240918

Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.

ISBN 978-1-78934-829-3

www.packtpub.com

 
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

Contributors

About the author

Bhaumik Vaidya is an experienced computer vision engineer and mentor. He has worked extensively with the OpenCV library in solving computer vision problems. He is a university gold medalist in his master's degree and is now doing a PhD in the acceleration of computer vision algorithms built using OpenCV and deep learning libraries on GPUs. He has a background in teaching and has guided many projects in computer vision and VLSI (very large-scale integration). He previously worked in the VLSI domain as an ASIC verification engineer, so he also has very good knowledge of hardware architectures. He has published many research papers in reputable journals. He, along with his PhD mentor, has also received an NVIDIA Jetson TX1 embedded development platform as a research grant from NVIDIA.

 

 

"I would like to thank my parents and family for their immense support. I would especially like to thank Parth Vaghasiya, who has stood like a pillar with me, for his continuous love and support. I really appreciate and thank Umang Shah and Ayush Vyas for their help in the development of content for this book. I would like to thank my friends Nihit, Arpit, Chirag, Vyom, Anupam, Bhavin, and Karan for their constant motivation and encouragement. I would like to thank Jay, Rangat, Manan, Rutvik, Smit, Ankit, Yash, Prayag, Jenish, Darshan, Parantap, Saif, Sarth, Shrenik, Sanjeet, and Jeevraj, who have been very special to me for their love, motivation, and support. I would like to thank Dr. Chirag Paunwala, Prof. Vandana Shah, Ms. Jagruti Desai and Prof. Mustafa Surti for their continuous guidance and support.
I gratefully acknowledge the support of NVIDIA Corporation and their donation of the Jetson TX1 GPU used for this book. I am incredibly grateful to Pooja Parvatkar, Alok Dhuri, and all the amazing people of Packt Publishing for taking their valuable time out to review this book in so much detail and for helping me during the development of the book.
To the memory of my grandmother, Urmilaben Vaidya, and to my family members, Vidyut Vaidya, Sandhya Vaidya, Bhartiben Joshi, Hardik Vaidya, and Parth Vaghasiya for their sacrifices, love, support, and inspiration."

About the reviewer

Vandana Shah gained her bachelor's degree in electronics in 2001. She also holds an MBA in human resource management and a master's in electronics engineering, specifically in the VLSI domain. She has submitted her thesis for a PhD in electronics, specifically concerning image processing and deep learning for brain tumor detection, and is awaiting her award. Her area of interest is image processing with deep learning and embedded systems. She has more than 13 years of experience in research, as well as in teaching and guiding undergraduate and postgraduate students of electronics and communications. She has published many papers in renowned journals from publishers such as IEEE, Springer, and Inderscience. She is also receiving a government grant for her upcoming research in the MRI image-processing domain. She has dedicated her life to mentoring students and researchers, and she also trains students and faculty members in soft-skill development. Besides her prowess in technical fields, she also has a strong command of Kathak, an Indian classical dance.

 

"I thank my family members for their full support."

 

 

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA

Packt Upsell

Why subscribe?

Packt.com

Contributors

About the author

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Code in Action

Conventions used

Get in touch

Reviews

Introducing CUDA and Getting Started with CUDA

Technical requirements

Introducing CUDA

Parallel processing 

Introducing GPU architecture and CUDA

CUDA architecture

CUDA applications

CUDA development environment

CUDA-supported GPU

NVIDIA graphics card driver

Standard C compiler

CUDA development kit

Installing the CUDA toolkit on all operating systems

Windows

Linux 

Mac

A basic program in CUDA C

Steps for creating a CUDA C program on Windows 

 Steps for creating a CUDA C program on Ubuntu

Summary

Questions

Parallel Programming using CUDA C

Technical requirements

CUDA program structure

Two-variable addition program in CUDA C

A kernel call

Configuring kernel parameters

CUDA API functions

Passing parameters to CUDA functions

Passing parameters by value

Passing parameters by reference

Executing threads on a device

Accessing GPU device properties from CUDA programs

General device properties

Memory-related properties

Thread-related properties

Vector operations in CUDA 

Two-vector addition program

Comparing latency between the CPU and the GPU code 

Elementwise squaring of vectors in CUDA

Parallel communication patterns

Map

Gather

Scatter

Stencil

Transpose 

Summary

Questions

Threads, Synchronization, and Memory

Technical requirements

Threads

Memory architecture

Global memory

Local memory and registers

Cache memory

Thread synchronization

Shared memory

Atomic operations

Constant memory

Texture memory

Dot product and matrix multiplication example

Dot product

Matrix multiplication

Summary

Questions

Advanced Concepts in CUDA

Technical requirements

Performance measurement of CUDA programs

CUDA Events

The Nvidia Visual Profiler

Error handling in CUDA 

Error handling from within the code

Debugging tools

Performance improvement of CUDA programs

Using an optimum number of blocks and threads

Maximizing arithmetic efficiency

Using coalesced or strided memory access

Avoiding thread divergence

Using page-locked host memory

CUDA streams

Using multiple CUDA streams

Acceleration of sorting algorithms using CUDA

Enumeration or rank sort algorithms

Image processing using CUDA

Histogram calculation on the GPU using CUDA

Summary

Questions

Getting Started with OpenCV with CUDA Support

Technical requirements

Introduction to image processing and computer vision

Introduction to OpenCV

Installation of OpenCV with CUDA support

Installation of OpenCV on Windows

Using pre-built binaries

Building libraries from source

Installation of OpenCV with CUDA support on Linux

Working with images in OpenCV

Image representation inside OpenCV

Reading and displaying an image

Reading and displaying a color image

Creating images using OpenCV

Drawing shapes on the blank image

Drawing a line

Drawing a rectangle

Drawing a circle

Drawing an ellipse

Writing text on an image

Saving an image to a file

Working with videos in OpenCV

Working with video stored on a computer

Working with videos from a webcam

Saving video to a disk

Basic computer vision applications using the OpenCV CUDA module

Introduction to the OpenCV CUDA module

Arithmetic and logical operations on images

Addition of two images

Subtracting two images

Image blending

Image inversion

Changing the color space of an image

Image thresholding

Performance comparison of OpenCV applications with and without CUDA support

Summary

Questions

Basic Computer Vision Operations Using OpenCV and CUDA

Technical requirements

Accessing the individual pixel intensities of an image

Histogram calculation and equalization in OpenCV

Histogram equalization

Grayscale images

Color image

Geometric transformation on images

Image resizing

Image translation and rotation

Filtering operations on images

Convolution operations on an image

Low pass filtering on an image

Averaging filters

Gaussian filters

Median filtering

High-pass filtering on an image

Sobel filters

Scharr filters

Laplacian filters

Morphological operations on images

Summary

Questions

Object Detection and Tracking Using OpenCV and CUDA

Technical requirements

Introduction to object detection and tracking

Applications of object detection and tracking

Challenges in object detection

Object detection and tracking based on color

Blue object detection and tracking

Object detection and tracking based on shape

Canny edge detection

Straight line detection using Hough transform

Circle detection 

Key-point detectors and descriptors

Features from Accelerated Segment Test (FAST) feature detector

Oriented FAST and Rotated BRIEF (ORB) feature detection

Speeded up robust feature detection and matching

Object detection using Haar cascades

Face detection using Haar cascades

From video

Eye detection using Haar cascades

Object tracking using background subtraction

Mixture of Gaussian (MoG) method

GMG for background subtraction

Summary

Questions

Introduction to the Jetson TX1 Development Board and Installing OpenCV on Jetson TX1

Technical requirements

Introduction to Jetson TX1

Important features of the Jetson TX1

Applications of Jetson TX1

Installation of JetPack on Jetson TX1

Basic requirements for installation

Steps for installation

Summary

Questions

Deploying Computer Vision Applications on Jetson TX1

Technical requirements

Device properties of Jetson TX1 GPU

Basic CUDA program on Jetson TX1

Image processing on Jetson TX1 

Compiling OpenCV with CUDA support (if necessary)

Reading and displaying images

Image addition

Image thresholding

Image filtering on Jetson TX1

Interfacing cameras with Jetson TX1

Reading and displaying video from onboard camera

Advanced applications on Jetson TX1

Face detection using Haar cascades

Eye detection using Haar cascades

Background subtraction using Mixture of Gaussian (MoG)

Computer vision using Python and OpenCV on Jetson TX1

Summary

Questions

Getting Started with PyCUDA

Technical requirements

Introduction to Python programming language

Introduction to the PyCUDA module

Installing PyCUDA on Windows

Steps to check PyCUDA installation

Installing PyCUDA on Ubuntu

Steps to check the PyCUDA installation

Summary

Questions

Working with PyCUDA

Technical requirements

Writing the first program in PyCUDA 

A kernel call

Accessing GPU device properties from PyCUDA program

Thread and block execution in PyCUDA

Basic programming concepts in PyCUDA 

Adding two numbers in PyCUDA 

Simplifying the addition program using driver class

Measuring performance of PyCUDA programs using CUDA events

CUDA events

Measuring performance of PyCUDA using large array addition   

Complex programs in PyCUDA

Element-wise squaring of a matrix in PyCUDA

Simple kernel invocation with multidimensional threads

Using inout with the kernel invocation

Using gpuarray class

Dot product using GPU array

Matrix multiplication

Advanced kernel functions in PyCUDA

Element-wise kernel in PyCUDA

Reduction kernel 

Scan kernel 

Summary

Questions

Basic Computer Vision Applications Using PyCUDA

Technical requirements

Histogram calculation in PyCUDA

Using atomic operations

Using shared memory

Basic computer vision operations using PyCUDA

Color space conversion in PyCUDA

BGR to gray conversion on an image

BGR to gray conversion on a webcam video

Image addition in PyCUDA

Image inversion in PyCUDA using gpuarray

Summary

Questions

Assessments

Chapter 1

Chapter 2

Chapter 3

Chapter 4

Chapter 5

Chapter 6

Chapter 7

Chapter 8

Chapter 9

Chapter 10

Chapter 11

Chapter 12

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

Computer vision is revolutionizing a wide range of industries, and OpenCV is the most widely chosen tool for computer vision, with the ability to work in multiple programming languages. Nowadays, computer vision needs to process large images in real time, which is difficult for OpenCV to handle on its own. This is where the Graphics Processing Unit (GPU) and CUDA can help, so this book provides a detailed overview of integrating OpenCV with CUDA for practical applications. It starts by explaining GPU programming with CUDA, which is essential for computer vision developers who have never worked with GPUs. It then explains OpenCV acceleration with GPUs and CUDA through some practical examples. When computer vision applications are to be used in real-life scenarios, they need to be deployed on embedded development boards. This book covers the deployment of OpenCV applications on the NVIDIA Jetson TX1, which is very popular for computer vision and deep learning applications. The last part of the book covers PyCUDA, a Python library that leverages the power of CUDA and GPUs for acceleration and can be used by computer vision developers who use OpenCV with Python. This book provides a complete guide for developers using OpenCV in C++ or Python to accelerate their computer vision applications by taking a hands-on approach.

Who this book is for

This book is a go-to guide for developers working with OpenCV who now want to learn how to process more complex image data by taking advantage of GPU processing. Most computer vision engineers or developers face problems when they try to process complex image data in real time. That is where the acceleration of computer vision algorithms using GPUs will help them develop algorithms that work on complex image data in real time. Most people think that hardware acceleration can only be done using FPGA and ASIC design, and that it requires knowledge of hardware description languages such as Verilog or VHDL. However, that was only true before the invention of CUDA, which leverages the power of NVIDIA GPUs and can be used to accelerate algorithms from programming languages such as C++ and Python. This book will help developers learn these concepts by building practical applications, and it will help them deploy computer vision applications on embedded platforms such as the NVIDIA Jetson TX1.

What this book covers

Chapter 1, Introducing CUDA and Getting Started with CUDA, introduces the CUDA architecture and how it has redefined the parallel processing capabilities of GPUs. The application of the CUDA architecture in real-life scenarios is discussed. Readers are introduced to the development environment used for CUDA and how it can be installed on all operating systems.

Chapter 2, Parallel Programming Using CUDA C, teaches the reader to write programs using CUDA for GPUs. It starts with a simple Hello World program and then incrementally builds toward complex examples in CUDA C. It also covers how the kernel works and how to use device properties, and discusses terminologies associated with CUDA programming.

Chapter 3, Threads, Synchronization, and Memory, teaches the reader how threads are called from CUDA programs and how multiple threads communicate with each other. It describes how multiple threads are synchronized when they work in parallel. It also describes constant memory and texture memory in detail.

Chapter 4, Advanced Concepts in CUDA, covers advanced concepts such as CUDA streams and CUDA events. It describes how sorting algorithms can be accelerated using CUDA, and looks at the acceleration of simple image-processing functions using CUDA.

Chapter 5, Getting Started with OpenCV with CUDA Support, describes the installation of the OpenCV library with CUDA support on all operating systems. It explains how to test this installation using a simple program. The chapter examines a performance comparison between image-processing programs that execute with and without CUDA support.

Chapter 6, Basic Computer Vision Operations Using OpenCV and CUDA, teaches the reader how to write basic computer vision operations such as pixel-level operations on images, filtering, and morphological operations using OpenCV.

Chapter 7, Object Detection and Tracking Using OpenCV and CUDA, looks at the steps for accelerating some real-life computer vision applications using OpenCV and CUDA. It describes the feature detection and description algorithms that are used for object detection. The chapter also covers the acceleration of face detection using Haar cascades and video analysis techniques, such as background subtraction for object tracking.

Chapter 8, Introduction to the Jetson TX1 Development Board and Installing OpenCV on Jetson TX1, introduces the Jetson TX1 embedded platform and how it can be used to accelerate and deploy computer vision applications. It describes the installation of OpenCV for Tegra on Jetson TX1 with JetPack.

Chapter 9, Deploying Computer Vision Applications on Jetson TX1, covers the deployment of computer vision applications on Jetson TX1. It teaches the reader how to build different computer vision applications and how to interface the camera with Jetson TX1 for video-processing applications.

Chapter 10, Getting Started with PyCUDA, introduces PyCUDA, which is a Python library for GPU acceleration. It describes the installation procedure on all operating systems.

Chapter 11, Working with PyCUDA, teaches the reader how to write programs using PyCUDA. It describes the concepts of data transfer from host to device and kernel execution in detail. It covers how to work with arrays in PyCUDA and develop complex algorithms.

Chapter 12, Basic Computer Vision Applications Using PyCUDA, looks at the development and acceleration of basic computer vision applications using PyCUDA. It describes color space conversion operations, histogram calculation, and different arithmetic operations as examples of computer vision applications.

To get the most out of this book

The examples covered in this book can be run on Windows, Linux, and macOS. All the installation instructions are covered in the book. A thorough understanding of computer vision concepts and programming languages such as C++ and Python is expected. It is preferable that the reader has NVIDIA GPU hardware to execute the examples covered in the book.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packt.com.

2. Select the SUPPORT tab.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-GPU-Accelerated-Computer-Vision-with-OpenCV-and-CUDA. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/978-1-78934-829-3_ColorImages.pdf.

Code in Action

Visit the following link to check out videos of the code being run: http://bit.ly/2PZOYcH

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

html, body, #map { height: 100%; margin: 0; padding: 0}

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)

exten => s,102,Voicemail(b100)

exten => i,1,Voicemail(s0)

Any command-line input or output is written as follows:

$ mkdir css

$ cd css

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Introducing CUDA and Getting Started with CUDA

This chapter gives you a brief introduction to CUDA architecture and how it has redefined the parallel processing capabilities of GPUs. The application of CUDA architecture in real-life scenarios will be demonstrated. This chapter will serve as a starting guide for software developers who want to accelerate their applications by using general-purpose GPUs and CUDA. The chapter describes development environments used for CUDA application development and how the CUDA toolkit can be installed on all operating systems. It covers how basic code can be developed using CUDA C and executed on Windows and Ubuntu operating systems.  

The following topics will be covered in this chapter:

Introducing CUDA 

Applications of CUDA

CUDA development environments

Installing CUDA toolkit on Windows, Linux, and macOS

Developing simple code, using CUDA C 

Technical requirements

This chapter requires familiarity with the basic C or C++ programming language. All the code used in this chapter can be downloaded from the following GitHub link: https://github.com/bhaumik2450/Hands-On-GPU-Accelerated-Computer-Vision-with-OpenCV-and-CUDA/Chapter1. The code can be executed on any operating system, though it has only been tested on Windows 10 and Ubuntu 16.04.

Check out the following video to see the Code in Action: http://bit.ly/2PTQMUk

Introducing CUDA

Compute Unified Device Architecture (CUDA) is a very popular parallel computing platform and programming model developed by NVIDIA. It is only supported on NVIDIA GPUs; OpenCL is used to write parallel code for other types of GPUs, such as those from AMD and Intel, but it is more complex than CUDA. CUDA allows creating massively parallel applications that run on graphics processing units (GPUs) with simple programming APIs. Software developers using C and C++ can accelerate their software applications and leverage the power of GPUs by using CUDA C or C++. Programs written in CUDA are similar to programs written in simple C or C++, with the addition of keywords needed to exploit the parallelism of GPUs. CUDA allows a programmer to specify which parts of the code will execute on the CPU and which parts will execute on the GPU.

The next section describes the need for parallel computing and how CUDA architecture can leverage the power of the GPU, in detail.

Parallel processing 

In recent years, consumers have been demanding more and more functionality from a single handheld device, so there is a need to pack more and more transistors into a small area while keeping them fast and power-efficient. We need a fast processor that can carry out multiple tasks with a high clock speed, a small area, and minimum power consumption. Over many decades, transistor sizes have gradually decreased, making it possible to pack more and more transistors onto a single chip, which in turn drove a constant rise in clock speed. However, this situation has changed in the last few years, with clock speeds remaining more or less constant. So, what is the reason for this? Have transistors stopped getting smaller? The answer is no. The main reason clock speed has stagnated is high power dissipation at high clock rates. Small transistors packed into a small area and working at high speed dissipate a large amount of power, and hence it is very difficult to keep the processor cool. As clock speed has saturated, we need a new computing paradigm to increase the performance of processors. Let's understand this concept with a small real-life example.

Suppose you are told to dig a very big hole in a small amount of time. You will have the following three options to complete this work in time:

You can dig faster.

You can buy a better shovel.

You can hire more diggers, who can help you complete the work.

If we can draw a parallel between this example and a computing paradigm, then the first option is similar to having a faster clock. The second option is similar to having more transistors that can do more work per clock cycle. But, as we have discussed in the previous paragraph, power constraints have put limitations on these two steps. The third option is similar to having many smaller and simpler processors that can carry out tasks in parallel.  A GPU follows this computing paradigm. Instead of having one big powerful processor that can perform complex tasks, it has many small and simple processors that can get work done in parallel. The details of GPU architecture are explained in the next section.

Introducing GPU architecture and CUDA

GeForce 256 was the first GPU developed by NVIDIA in 1999. Initially, GPUs were only used for rendering high-end graphics on monitors. They were only used for pixel computations. Later on, people realized that if GPUs can do pixel computations, then they would also be able to do other mathematical calculations. Nowadays, GPUs are used in many applications other than rendering graphics. These kinds of GPUs are called General-Purpose GPUs (GPGPUs).

The next question that may come to your mind is: what is the difference between the hardware architectures of a CPU and a GPU that allows the GPU to carry out parallel computation? A CPU has complex control hardware and less data computation hardware. Complex control hardware gives a CPU flexibility in performance and a simple programming interface, but it is expensive in terms of power. On the other hand, a GPU has simple control hardware and more hardware for data computation, which gives it its capability for parallel computation. This structure makes it more power-efficient; the disadvantage is that it has a more restrictive programming model. In the early days of GPU computing, graphics APIs such as OpenGL and DirectX were the only way to interact with GPUs. This was a complex task for programmers who were not familiar with OpenGL or DirectX. This led to the development of the CUDA programming architecture, which provides an easy and efficient way of interacting with GPUs. More details about the CUDA architecture are given in the next section.

Normally, the performance of any hardware architecture is measured in terms of latency and throughput. Latency is the time taken to complete a given task, while throughput is the amount of work completed in a given time. These are not contradictory concepts; more often than not, improving one improves the other. In a way, most hardware architectures are designed to improve either latency or throughput. For example, suppose you are standing in a queue at the post office. Your goal is to complete your work in a small amount of time, so you want to improve latency, while an employee sitting at a post office window wants to see more and more customers in a day, so the employee's goal is to increase throughput. Improving one will lead to an improvement in the other in this case, but the way both sides look at this improvement is different.

In the same way, normal sequential CPUs are designed to optimize latency, while GPUs are designed to optimize throughput. CPUs are designed to execute all instructions in the minimum time, while GPUs are designed to execute more instructions in a given time. This design concept of GPUs makes them very useful in image processing and computer vision applications, which we are targeting in this book, because we don't mind a delay in the processing of a single pixel. What we want is that more pixels should be processed in a given time, which can be done on a GPU. 

So, to summarize, parallel computing is what we need if we want to increase computational performance at the same clock speed and power requirement. GPUs provide this capability by having lots of simple computational units working in parallel. Now, to interact with the GPU and to take advantage of its parallel computing capabilities, we need a simple parallel programming architecture, which is provided by CUDA. 

CUDA architecture

This section covers basic hardware modifications done in GPU architecture and the general structure of software programs developed using CUDA. We will not discuss the syntax of the CUDA program just yet, but we will cover the steps to develop the code. The section will also cover some basic terminology that will be followed throughout this book.

The CUDA architecture includes several new components specifically designed for general-purpose computation on GPUs, which were not present in earlier architectures. It includes the unified shader pipeline, which allows all arithmetic logic units (ALUs) present on a GPU chip to be marshaled by a single CUDA program. The ALUs are also designed to comply with IEEE floating-point single- and double-precision standards so that they can be used in general-purpose applications. The instruction set is also tailored to general-purpose computation rather than being specific to pixel computations. Arbitrary read and write access to memory is also allowed. These features make the CUDA GPU architecture very useful in general-purpose applications.

All GPUs have many parallel processing units called cores. On the hardware side, these cores are grouped into streaming processors and streaming multiprocessors (SMs); the GPU can be seen as a grid of these streaming multiprocessors. On the software side, a CUDA program is executed as a series of multiple threads running in parallel, with each thread executed on a different core. The GPU can be viewed as a combination of many blocks, and each block can execute many threads. Each block is bound to a different SM on the GPU. How the mapping between blocks and SMs is done is not known to the CUDA programmer; it is handled by a scheduler. Threads from the same block can communicate with one another. The GPU has a hierarchical memory structure that deals with communication between threads inside one block and across multiple blocks. This will be dealt with in detail in the upcoming chapters.

As a programmer, you will be curious to know what the programming model in CUDA is and how the code understands whether it should be executed on the CPU or the GPU. For this book, we will assume that we have a computing platform comprising a CPU and a GPU. We will call the CPU and its memory the host, and the GPU and its memory the device. CUDA code contains the code for both the host and the device. The host code is compiled for the CPU by a normal C or C++ compiler, and the device code is compiled for the GPU by the GPU compiler. The host code calls the device code by something called a kernel call, which launches many threads in parallel on the device. The number of threads to be launched on the device is provided by the programmer.

Now, you might ask how this device code is different from normal C code. The answer is that it is similar to normal sequential C code; it is just that this code is executed by a greater number of cores in parallel. However, for this code to work, it needs data in the device's memory. So, before launching threads, the host copies data from the host memory to the device memory. Each thread works on data from the device's memory and stores its result in the device's memory. Finally, this data is copied back to the host memory for further processing. To summarize, the steps to develop a CUDA C program are as follows (a minimal code sketch of these steps is shown after the list):

Allocate memory for data in the host and device memory.

Copy data from the host memory to the device memory.

Launch a kernel by specifying the degree of parallelism. 

After all the threads are finished, copy the data back from the device memory to the host memory.

Free up all memory used on the host and the device. 
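
As a rough illustration of these five steps, here is a minimal sketch that adds two small arrays on the device. The kernel name, the array size, and the omission of error checking are illustrative choices for this sketch; later chapters develop each step properly:

#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative kernel: each thread adds one pair of elements
__global__ void addArrays(const int *a, const int *b, int *c) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main(void) {
    const int N = 8;
    int h_a[N], h_b[N], h_c[N];
    for (int i = 0; i < N; i++) { h_a[i] = i; h_b[i] = 2 * i; }

    // Step 1: allocate memory on the device
    int *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, N * sizeof(int));
    cudaMalloc((void**)&d_b, N * sizeof(int));
    cudaMalloc((void**)&d_c, N * sizeof(int));

    // Step 2: copy input data from the host memory to the device memory
    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice);

    // Step 3: launch the kernel with one block of N threads
    addArrays <<<1, N>>> (d_a, d_b, d_c);

    // Step 4: copy the result back from the device memory to the host memory
    cudaMemcpy(h_c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    // Step 5: free all device memory (the host arrays here live on the stack)
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", h_a[i], h_b[i], h_c[i]);
    return 0;
}

Saved as, say, add_arrays.cu, this compiles with nvcc add_arrays.cu -o add_arrays. Note how the host arrays (h_ prefix) and device arrays (d_ prefix) live in separate memories, and data moves between them only through explicit cudaMemcpy calls.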

CUDA applications

CUDA has seen an unprecedented growth in the last decade. It is being used in a wide variety of applications in various domains. It has transformed research in multiple fields. In this section, we will look at some of these domains and how CUDA is accelerating growth in each domain:

  • Computer vision applications: Computer vision and image processing algorithms are computationally intensive. With more and more cameras capturing images at high definition, there is a need to process these large images in real time. With the CUDA acceleration of these algorithms, applications such as image segmentation, object detection, and classification can achieve a real-time frame rate performance of more than 30 frames per second. CUDA and the GPU allow the faster training of deep neural networks and other deep learning algorithms; this has transformed research in computer vision. NVIDIA is developing several hardware platforms, such as Jetson TX1, Jetson TX2, and Jetson TK1, which can accelerate computer vision applications. The NVIDIA DRIVE platform is made for autonomous driving applications.

  • Medical imaging: The medical imaging field is seeing widespread use of GPUs and CUDA in the reconstruction and processing of MRI and computed tomography (CT) images. This has drastically reduced the processing time for these images. Nowadays, several devices are shipped with GPUs, and several libraries are available to process these images with CUDA acceleration.

  • Financial computing: There is a need for better data analytics at a lower cost in all financial firms, which helps in informed decision-making. This includes complex risk calculations and initial and lifetime margin calculations, which have to be done in real time. GPUs help financial firms do these kinds of analytics in real time without adding too much overhead cost.

  • Life science, bioinformatics, and computational chemistry: Simulating DNA genes, sequencing, and protein docking are computationally intensive tasks that need high computation resources. GPUs help in this kind of analysis and simulation. GPUs can run common molecular dynamics, quantum chemistry, and protein docking applications more than five times faster than normal CPUs.

  • Weather research and forecasting: Several weather prediction applications, ocean modeling techniques, and tsunami prediction techniques utilize GPUs and CUDA for faster computation and simulation compared to CPUs.

  • Electronic Design Automation (EDA): Due to the increasing complexity of VLSI technology and the semiconductor fabrication process, the performance of EDA tools is lagging behind this technological progress, leading to incomplete simulations and missed functional bugs. Therefore, the EDA industry has been seeking faster simulation solutions. GPU and CUDA acceleration is helping this industry speed up computationally intensive EDA simulations, including functional simulation, placement and routing, signal integrity and electromagnetics, SPICE circuit simulation, and so on.

  • Government and defense: GPU and CUDA acceleration is also widely used by governments and militaries. The aerospace, defense, and intelligence industries are taking advantage of CUDA acceleration in converting large amounts of data into actionable information.

CUDA development environment

To start developing an application using CUDA, you will need to set up the development environment for it. There are some prerequisites for setting up a development environment for CUDA. These include the following:

A CUDA-supported GPU

An NVIDIA graphics card driver

A standard C compiler

A CUDA development kit

How to check for these prerequisites and install them is discussed in the following subsections.

CUDA-supported GPU

As discussed earlier, CUDA architecture is only supported on NVIDIA GPUs. It is not supported on other GPUs such as AMD and Intel. Almost all GPUs developed by NVIDIA in the last decade support CUDA architecture and can be used to develop and execute CUDA applications. A detailed list of CUDA-supported GPUs can be found on the NVIDIA website: https://developer.nvidia.com/cuda-gpus. If you can find your GPU in this list, you will be able to run CUDA applications on your PC.

If you don't know which GPU is on your PC, then you can find it by following these steps:

On Windows:

1. In the Start menu, type device manager and press Enter.

2. In Device Manager, expand Display adapters. There, you will find the name of your NVIDIA GPU.

On Linux:

1. Open Terminal.

2. Run sudo lshw -C video. This will list information regarding your graphics card, usually including its make and model.

On macOS:

1. Go to the Apple menu | About This Mac | More Info.

2. Select Graphics/Displays under the Contents list. There, you will find the name of your NVIDIA GPU.

If you have a CUDA-enabled GPU, then you are good to proceed to the next step.

NVIDIA graphics card driver

If you want to communicate with NVIDIA GPU hardware, then you will need system software for it. NVIDIA provides a device driver to communicate with the GPU hardware. If the NVIDIA graphics card is properly installed, then these drivers are installed automatically with it on your PC. Still, it is good practice to check for driver updates periodically from the NVIDIA website: http://www.nvidia.in/Download/index.aspx?lang=en-in. You can select your graphics card and operating system for the driver download from this link.

Standard C compiler

Whenever you are running a CUDA application, it will need two compilers: one for the GPU code and one for the CPU code. The compiler for the GPU code comes with the installation of the CUDA toolkit, which will be discussed in the next section. You also need to install a standard C compiler for compiling the CPU code. There are different C compilers depending on the operating system:

On Windows: For all Microsoft Windows editions, it is recommended to use the Microsoft Visual Studio C compiler. It comes with Microsoft Visual Studio and can be downloaded from its official website: https://www.visualstudio.com/downloads/. Editions for commercial applications need to be purchased, but the Community edition can be used for free in non-commercial applications. For running CUDA applications, install Microsoft Visual Studio with the Microsoft Visual Studio C compiler selected. Different CUDA versions support different Visual Studio editions, so you can refer to the NVIDIA CUDA website for Visual Studio version support.

On Linux: Most Linux distributions come with the standard GNU C Compiler (GCC), which can be used to compile the CPU code of CUDA applications.

On Mac: On macOS, you can install the GCC compiler by downloading and installing Xcode. It is freely available and can be downloaded from Apple's website: https://developer.apple.com/xcode/

CUDA development kit

CUDA needs a GPU compiler for compiling GPU code. This compiler comes with a CUDA development toolkit. If you have an NVIDIA GPU with the latest driver update and have installed a standard C compiler for your operating system, you are good to proceed to the final step of installing the CUDA development toolkit. A step-by-step guide for installing the CUDA toolkit is discussed in the next section.

Installing the CUDA toolkit on all operating systems

This section covers instructions on how to install CUDA on all supported platforms. It also describes the steps to verify the installation. While installing CUDA, you can choose between a network installer and an offline local installer. A network installer has a lower initial download size but needs an internet connection during installation; a local offline installer has a higher initial download size. The steps discussed in this book are for local installation. The CUDA toolkit can be downloaded for Windows, Linux, and macOS, for both 32-bit and 64-bit architectures, from the following link: https://developer.nvidia.com/cuda-downloads.

After downloading the installer, refer to the following steps for your particular operating system. CUDA x.x is used as a notation in the steps, where x.x indicates the version of CUDA that you have downloaded.

Windows

This section covers the steps to install CUDA on Windows, which are as follows:

1. Double-click on the installer. It will ask you to select the folder where temporary installation files will be extracted. Select the folder of your choice; it is recommended to keep the default.

2. The installer will then check for system compatibility. If your system is compatible, follow the onscreen prompts to install CUDA. You can choose between an express installation (the default) and a custom installation. A custom installation allows you to choose which features of CUDA to install; it is recommended to select the express installation.

3. The installer will also install the CUDA sample programs and the CUDA Visual Studio integration.

Please make sure you have Visual Studio installed before running this installer.

To confirm that the installation was successful, check the following:

1. All the CUDA samples will be located at C:\ProgramData\NVIDIA Corporation\CUDA Samples\vx.x if you have chosen the default installation path.

2. To check the installation, you can build and run any sample project. We use the device query project located at C:\ProgramData\NVIDIA Corporation\CUDA Samples\vx.x\1_Utilities\deviceQuery.

3. Double-click on the *.sln file for your edition of Visual Studio. It will open the project in Visual Studio.

4. Click on Local Windows Debugger in Visual Studio. If the build succeeds and the program displays your GPU's device properties, the installation is complete.

Linux 

This section covers the steps to install CUDA on Linux distributions. In this section, the installation of CUDA in Ubuntu, which is a popular Linux distribution, is discussed using distribution-specific packages or using the apt-get command (which is specific to Ubuntu).

The steps to install CUDA using the  *.deb installer downloaded from the CUDA website are as follows:

1. Open Terminal and run the dpkg command, which is used to install packages in Debian-based systems:

sudo dpkg -i cuda-repo-<distro>_<version>_<architecture>.deb

2. Install the CUDA public GPG key using the following command:

sudo apt-key add /var/cuda-repo-<version>/7fa2af80.pub

3. Update the apt repository cache using the following command:

sudo apt-get update

4. Install CUDA using the following command:

sudo apt-get install cuda

5. Include the CUDA installation path in the PATH environment variable using the following command. If you have not installed CUDA at the default location, you need to change the path to point at your installation location:

export PATH=/usr/local/cuda-x.x/bin${PATH:+:${PATH}}

6. Set the LD_LIBRARY_PATH environment variable:

export LD_LIBRARY_PATH=/usr/local/cuda-x.x/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

You can also install the CUDA toolkit using the apt-get package manager available with Ubuntu, by running the following command in Terminal:

sudo apt-get install nvidia-cuda-toolkit

To check whether the CUDA GPU compiler has been installed, run the nvcc -V command from Terminal. nvcc calls the GCC compiler for C code and the NVIDIA PTX compiler for CUDA code.

You can install the NVIDIA Nsight Eclipse plugin, which provides a GUI integrated development environment for executing CUDA programs, using the following command:

sudo apt install nvidia-nsight

After installation, you can run the deviceQuery project located at ~/NVIDIA_CUDA-x.x_Samples. If the CUDA toolkit is installed and configured correctly, deviceQuery will display your GPU's device properties.

Mac

This section covers steps to install CUDA on macOS. It needs the *.dmg installer downloaded from the CUDA website. The steps to install after downloading the installer are as follows:

1. Launch the installer and follow the onscreen prompts to complete the installation. It will install all prerequisites, the CUDA toolkit, and the CUDA samples.

2. Set the environment variables to point at the CUDA installation using the following commands. If you have not installed CUDA at the default location, you need to change the path to point at your installation location:

export PATH=/Developer/NVIDIA/CUDA-x.x/bin${PATH:+:${PATH}}

export DYLD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-x.x/lib${DYLD_LIBRARY_PATH:+:${DYLD_LIBRARY_PATH}}

3. Run the cuda-install-samples-x.x.sh script. It will install the CUDA samples with write permissions.

4. After it has completed, go to bin/x86_64/darwin/release and run the deviceQuery project. If the CUDA toolkit is installed and configured correctly, it will display your GPU's device properties.
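
On any platform, if you prefer to verify the setup with a program of your own rather than with the bundled samples, a small sketch along the following lines (the file name and output format are our own choices) uses the CUDA runtime API to list the detected devices, much as deviceQuery does:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    // Ask the CUDA runtime how many CUDA-capable devices are present
    cudaGetDeviceCount(&count);
    printf("CUDA-capable devices found: %d\n", count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        // Fill prop with the properties of device i
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}

Compile it with nvcc check_device.cu -o check_device; if the toolkit and driver are installed correctly, it prints the name and compute capability of each GPU.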

A basic program in CUDA C


In this section, we will start learning CUDA programming by writing a very basic program using CUDA C: a Hello, CUDA! program that we will write and then execute. Before going into the details of the code, recall that host code is compiled by the standard C compiler and that device code is compiled by the NVIDIA GPU compiler. An NVIDIA tool feeds the host code to a standard C compiler: Visual Studio on Windows, and GCC on Ubuntu and macOS. It is also important to note that the NVIDIA compiler does not mind if a program contains no device code at all. All CUDA code must be saved with a *.cu extension.

The following is the code for Hello, CUDA!:

#include <stdio.h>

// An empty kernel that runs on the device
__global__ void myfirstkernel(void) {
}

int main(void) {
    // Launch the kernel on one block with one thread
    myfirstkernel <<<1, 1>>> ();
    printf("Hello, CUDA!\n");
    return 0;
}

If you look closely at the code, it will look very similar to the simple Hello, CUDA! program written in C for CPU execution, and its function is also similar: it just prints Hello, CUDA! on the terminal or command line. So, two questions that should come to your mind are: how is this code different, and what is the role of CUDA C in this code? The answer can be found by looking closely at the code. It has two main differences compared to code written in simple C:

An empty function called myfirstkernel with a __global__ prefix

A call to the myfirstkernel function with <<<1, 1>>>

__global__ is a qualifier added by CUDA C to standard C. It tells the compiler that the function definition that follows this qualifier should be compiled to run on a device, rather than on the host. So, in the previous code, myfirstkernel will run on a device instead of the host, though, in this code, it is empty.

Now, where will the main function run? The NVCC compiler will feed this function to the host C compiler, as it is not decorated with the __global__ qualifier, and hence the main function will run on the host.

The second difference in the code is the call to the empty myfirstkernel function with some angular brackets and numeric values. This is a CUDA C trick to call device code from the host code. It is called a kernel call. The details of a kernel call will be explained in later chapters. The values inside the angular brackets indicate arguments we want to pass from the host to the device at runtime. Basically, they indicate the number of blocks and the number of threads per block that will run in parallel on the device. So, in this code, <<<1, 1>>> indicates that myfirstkernel will run on one block containing a single thread on the device. Though this is not an optimal use of device resources, it is a good starting point for understanding the difference between code executed on the host and code executed on the device.
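
To make the role of these two values concrete, here is a small variation of the kernel (the choice of two blocks of four threads is arbitrary) that uses the device-side printf available on GPUs with compute capability 2.0 or higher to report which block and thread each copy runs as:

#include <stdio.h>

// Each launched copy of the kernel prints its block and thread index
__global__ void whoami(void) {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    // Launch 2 blocks of 4 threads each: 8 copies of the kernel in total
    whoami <<<2, 4>>> ();
    // Wait for the device to finish so its printed output is flushed
    cudaDeviceSynchronize();
    return 0;
}

Running this prints eight lines, one per thread, in no guaranteed order, which is a first hint that the threads really do execute in parallel.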

Again, to revisit and revise the Hello, CUDA! code: the myfirstkernel function will run on the device with one block and one thread, and it is launched from the host code inside the main function by a method called a kernel call.