E-Book
44,39 €

OpenCL Programming by Example E-Book

Ravishekhar Banger

0,0

44,39 €

Sammeln Sie Punkte in unserem Gutscheinprogramm und kaufen Sie E-Books und Hörbücher mit bis zu 100% Rabatt.

Mehr erfahren.

Herausgeber: Packt Publishing
Kategorie: Fachliteratur
Sprache: Englisch

Beschreibung

Research in parallel programming has been a mainstream topic for a decade, and will continue to be so for many decades to come. Many parallel programming standards and frameworks exist, but only take into account one type of hardware architecture. Today computing platforms come with many heterogeneous devices. OpenCL provides royalty free standard to program heterogeneous hardware.
This guide offers you a compact coverage of all the major topics of OpenCL programming. It explains optimization techniques and strategies in-depth, using illustrative examples and also provides case studies from diverse fields. Beginners and advanced application developers will find this book very useful.
Beginning with the discussion of the OpenCL models, this book explores their architectural view, programming interfaces and primitives. It slowly demystifies the process of identifying the data and task parallelism in diverse algorithms.
It presents examples from different domains to show how the problems within different domains can be solved more efficiently using OpenCL. You will learn about parallel sorting, histogram generation, JPEG compression, linear and parabolic regression and k-nearest neighborhood, a clustering algorithm in pattern recognition. Following on from this, optimization strategies are explained with matrix multiplication examples. You will also learn how to do an interoperation of OpenGL and OpenCL.
"OpenCL Programming by Example" explains OpenCL in the simplest possible language, which beginners will find it easy to understand. Developers and programmers from different domains who want to achieve acceleration for their applications will find this book very useful.

Details

Das E-Book können Sie in Legimi-Apps oder einer beliebigen App lesen, die das folgende Format unterstützen:

EPUB

MOBI

Seitenzahl: 369

Veröffentlichungsjahr: 2013

Bewertungen

0,0

Rezensionen(0 Rezensionen)

Ähnliche

Der Weg zum erfolgreichen Unternehmer

Stefan Merath

Der Weg zum erfolgreichen Unternehmer

Stefan Merath

Denke (nach) und werde reich

Napoleon Hill

30 Minuten Resilienz

Ulrich Siegrist

Krebszellen mögen keine Himbeeren - Der große Bestseller - Vollständig überarbeitet und aktualisiert

Richard Béliveau

Die Hormonrevolution

Michael E Platt

Der Crash ist die Lösung

Matthias Weik

Günter, der innere Schweinehund, lernt verkaufen

Stefan Frädrich

Mission erfüllt

Owen Mark

Die Leber wächst mit ihren Aufgaben

Dr. med. Eckart von Hirschhausen

Macht, was ihr liebt!

Anja Förster

Der größte Raubzug der Geschichte

Matthias Weik

Unsere Hunde - gesund durch Homöopathie

Hans Günter Wolff

Die Jahrhundertlüge, die nur Insider kennen

Heiko Schrang

Organisation für Komplexität

Niels Pfläging

Radikal führen

Reinhard K. Sprenger

BLACKOUT - Morgen ist es zu spät

Marc Elsberg

30 Minuten Sympathisch und souverän: So geht Vortragen!

Thomas Lorenz

The Truth About Employee Engagement

OpenCL Programming by Example

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers and more

Why Subscribe?

Free Access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Hello OpenCL

Advances in computer architecture

Different parallel programming techniques

OpenMP

MPI

OpenACC

CUDA

CUDA or OpenCL?

Renderscripts

Hybrid parallel computing model

Introduction to OpenCL

Hardware and software vendors

Advanced Micro Devices, Inc. (AMD)

NVIDIA®

Intel®

ARM Mali GPUs

OpenCL components

An example of OpenCL program

Basic software requirements

Windows

Linux

Installing and setting up an OpenCL compliant computer

Installation steps

Installing OpenCL on a Linux system with an AMD graphics card

Installing OpenCL on a Linux system with an NVIDIA graphics card

Installing OpenCL on a Windows system with an AMD graphics card

Installing OpenCL on a Windows system with an NVIDIA graphics card

Apple OSX

Multiple installations

Implement the SAXPY routine in OpenCL

OpenCL code

OpenCL program flow

Run on a different device

Summary

References

2. OpenCL Architecture

Platform model

AMD A10 5800K APUs

AMD Radeon HD 7870 Graphics Processor

NVIDIA® GeForce® GTC 680 GPU

Intel® IVY bridge

Platform versions

Query platforms

Query devices

Execution model

NDRange

OpenCL context

OpenCL command queue

Memory model

Global memory

Constant memory

Local memory

Private memory

OpenCL ICD

What is an OpenCL ICD?

Application scaling

Summary

3. OpenCL Buffer Objects

Memory objects

Creating subbuffer objects

Histogram calculation

Algorithm

OpenCL Kernel Code

The Host Code

Reading and writing buffers

Blocking_read and Blocking_write

Rectangular or cuboidal reads

Copying buffers

Mapping buffer objects

Querying buffer objects

Undefined behavior of the cl_mem objects

Summary

4. OpenCL Images

Creating images

Image format descriptor cl_image_format

Image details descriptor cl_image_desc

Passing image buffers to kernels

Samplers

Reading and writing buffers

Copying and filling images

Mapping image objects

Querying image objects

Image histogram computation

Summary

5. OpenCL Program and Kernel Objects

Creating program objects

Creating and building program objects

OpenCL program building options

Querying program objects

Creating binary files

Offline and online compilation

SAXPY using the binary file

SPIR – Standard Portable Intermediate Representation

Creating kernel objects

Setting kernel arguments

Executing the kernels

Querying kernel objects

Querying kernel argument

Releasing program and kernel objects

Built-in kernels

Summary

6. Events and Synchronization

OpenCL events and monitoring these events

OpenCL event synchronization models

No synchronization needed

Single device in-order usage

Synchronization needed

Single device and out-of-order queue

Multiple devices and different OpenCL contexts

Multiple devices and single OpenCL context

Coarse-grained synchronization

Event-based or fine-grained synchronization

Getting information about cl_event

User-created events

Event profiling

Memory fences

Summary

7. OpenCL C Programming

Built-in data types

Basic data types and vector types

The half data type

Other data types

Reserved data types

Alignment of data types

Vector data types

Vector components

Aliasing rules

Conversions and type casts

Implicit conversion

Explicit conversion

Reinterpreting data types

Operators

Operation on half data type

Address space qualifiers

__global/global address space

__local/local address space

__constant/constant address space

__private/private address space

Restrictions

Image access qualifiers

Function attributes

Data type attributes

Variable attribute

Storage class specifiers

Built-in functions

Work item function

Synchronization and memory fence functions

Other built-ins

Summary

8. Basic Optimization Techniques with Case Studies

Finding the performance of your program?

Explaining the code

Tools for profiling and finding performance bottlenecks

Case study – matrix multiplication

Sequential implementation

OpenCL implementation

Simple kernel

Kernel optimization techniques

Case study – Histogram calculation

Finding the scope of the use of OpenCL

General tips

Summary

9. Image Processing and OpenCL

Image representation

Implementing image filters

Mean filter

Median filter

Gaussian filter

Sobel filter

OpenCL implementation of filters

Mean and Gaussian filter

Median filter

Sobel filter

JPEG compression

Encoding JPEG

OpenCL implementation

Summary

References

10. OpenCL-OpenGL Interoperation

Introduction to OpenGL

Defining Interoperation

Implementing Interoperation

Detecting if OpenCL-OpenGL Interoperation is supported

Initializing OpenCL context for OpenGL Interoperation

Mapping of a buffer

Listing Interoperation steps

Synchronization

Creating a buffer from GL texture

Renderbuffer object

Summary

11. Case studies – Regressions, Sort, and KNN

Regression with least square curve fitting

Linear approximations

Parabolic approximations

Implementation

Bitonic sort

k-Nearest Neighborhood (k-NN) algorithm

Summary

Index

OpenCL Programming by Example

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2013

Production Reference: 1161213

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-84969-234-2

www.packtpub.com

Cover Image by Asher Wishkerman (<[email protected]>)

Credits

Authors

Ravishekhar Banger

Koushik Bhattacharyya

Reviewers

Thomas Gall

Erik Rainey

Erik Smistad

Acquisition Editors

Wilson D'souza

Kartikey Pandey

Kevin Colaco

Lead Technical Editor

Arun Nadar

Technical Editors

Gauri Dasgupta

Dipika Gaonkar

Faisal Siddiqui

Project Coordinators

Wendell Palmer

Amey Sawant

Proofreader

Mario Cecere

Indexers

Rekha Nair

Priya Subramani

Graphics

Sheetal Aute

Ronak Dhruv

Yuvraj Mannari

Abhinash Sahu

Production Coordinator

Conidon Miranda

Cover Work

Conidon Miranda

About the Authors

Ravishekhar Banger calls himself a "Parallel Programming Dogsbody". Currently he is a specialist in OpenCL programming and works for library optimization using OpenCL. After graduation from SDMCET, Dharwad, in Electrical Engineering, he completed his Masters in Computer Technology from Indian Institute of Technology, Delhi. With more than eight years of industry experience, his present interest lies in General Purpose GPU programming models, parallel programming, and performance optimization for the GPU. Having worked for Samsung and Motorola, he is now a Member of Technical Staff at Advanced Micro Devices, Inc. One of his dreams is to cover most of the Himalayas by foot in various expeditions. You can reach him at <[email protected]>.

Koushik Bhattacharyya is working with Advanced Micro Devices, Inc. as Member Technical Staff and also worked as a software developer in NVIDIA®. He did his M.Tech in Computer Science (Gold Medalist) from Indian Statistical Institute, Kolkata, and M.Sc in pure mathematics from Burdwan University. With more than ten years of experience in software development using a number of languages and platforms, Koushik's present area of interest includes parallel programming and machine learning.

We would like to take this opportunity to thank "PACKT publishing" for giving us an opportunity to write this book.  

Also a special thanks to all our family members, friends and colleagues, who have helped us directly or indirectly in writing  this book.

About the Reviewers

Thomas Gall had his first experience with accelerated coprocessors on the Amiga back in 1986. After working with IBM for twenty years, now he is working as a Principle Engineer and serves as Linaro.org's technical lead for the Graphics Working Group. He manages the Graphics and GPGPU teams. The GPGPU team is dedicated to optimize existing open source software to take advantage of GPGPU technologies such as OpenCL, as well as the implementation of GPGPU drivers for ARM based SoC systems.

Erik Rainey works at Texas Instruments, Inc. as a Senior Software Engineer on Computer Vision software frameworks in embedded platforms in the automotive, safety, industrial, and robotics markets. He has a young son, who he loves playing with when not working, and enjoys other pursuits such as music, drawing, crocheting, painting, and occasionally a video game. He is currently involved in creating the Khronos Group's OpenVX, the specification for computer vision acceleration.

Erik Smistad is a PhD candidate at the Norwegian University of Science and Technology, where he uses OpenCL and GPUs to quickly locate organs and other anatomical structures in medical images for the purpose of helping surgeons navigate inside the body during surgery. He writes about OpenCL and his projects on his blog, thebigblob.com, and shares his code at github.com/smistad.

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?

Fully searchable across every book published by PacktCopy and paste, print and bookmark contentOn demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Preface

This book is designed as a concise introduction to OpenCL programming for developers working on diverse domains. It covers all the major topics of OpenCL programming and illustrates them with code examples and explanations from different fields such as common algorithm, image processing, statistical computation, and machine learning. It also dedicates one chapter to Optimization techniques, where it discusses different optimization strategies on a single simple problem.

Parallel programming is a fast developing field today. As it is becoming increasingly difficult to increase the performance of a single core machine, hardware vendors see advantage in packing multiple cores in a single SOC. The GPU (Graphics Processor Unit) was initially meant for rendering better graphics which ultimately means fast floating point operation for computing pixel values. GPGPU (General purpose Graphics Processor Unit) is the technique of utilization of GPU for a general purpose computation. Since the GPU provides very high performance of floating point operations and data parallel computation, it is very well suited to be used as a co-processor in a computing system for doing data parallel tasks with high arithmetic intensity.

Before NVIDIA® came up with CUDA (Compute Unified Device Architecture) in February 2007, the typical GPGPU approach was to convert general problems' data parallel computation into some form of a graphics problem which is expressible by graphics programming APIs for the GPU. CUDA first gave a user friendly small extension of C language to write code for the GPU. But it was a proprietary framework from NVIDIA and was supposed to work on NVIDIA's GPU only.

With the growing popularity of such a framework, the requirement for an open standard architecture that would be able to support different kinds of devices from various vendors was becoming strongly perceivable. In June 2008, the Khronos compute working group was formed and they published OpenCL1.0 specification in December 2008. Multiple vendors gradually provided a tool-chain for OpenCL programming including NVIDIA OpenCL Drivers and Tools, AMD APP SDK, Intel® SDK for OpenCL application, IBM Server with OpenCL development Kit, and so on. Today OpenCL supports multi-core programming, GPU programming, cell and DSP processor programming, and so on.

In this book we discuss OpenCL with a few examples.

What this book covers

Chapter 1, Hello OpenCL, starts with a brief introduction to OpenCL and provides hardware architecture details of the various OpenCL devices from different vendors.

Chapter 2, OpenCL Architecture, discusses the various OpenCL architecture models.

Chapter 3, OpenCL Buffer Objects, discusses the common functions used to create an OpenCL memory object.

Chapter 4, OpenCL Images, gives an overview of functions for creating different types of OpenCL images.

Chapter 5, OpenCL Program and Kernel Objects, concentrates on the sequential steps required to execute a kernel.

Chapter 6, Events and Synchronization, discusses coarse grained and fine-grained events and their synchronization mechanisms.

Chapter 7, OpenCL C Programming, discusses the specifications and restrictions for writing an OpenCL compliant C kernel code.

Chapter 8, Basic Optimization Techniques with Case Studies, discusses various optimization techniques using a simple example of matrix multiplication.

Chapter 9, Image Processing and OpenCL, discusses Image Processing case studies. OpenCL implementations of Image filters and JPEG image decoding are provided in this chapter.

Chapter 10, OpenCL-OpenGL Interoperation, discusses OpenCL and OpenGL interoperation, which in its simple form means sharing of data between OpenGL and OpenCL in a program that uses both.

Chapter 11, Case studies – Regressions, Sort, and KNN, discusses general algorithm-like sorting. Besides this, case studies from Statistics (Linear and Parabolic Regression) and Machine Learning (K Nearest Neighbourhood) are discussed with their OpenCL implementations.

What you need for this book

The prerequisite is proficiency in C language. Having a background of parallel programming would undoubtedly be advantageous, but it is not a requirement. Readers should find this book compact yet a complete guide for OpenCL programming covering most of the advanced topics. Emphasis is given to illustrate the key concept and problem-solution with small independent examples rather than a single large example. There are detailed explanations of the most of the APIs discussed and kernels for the case studies are presented.

Who this book is for

Application developers from different domains intending to use OpenCL to accelerate their application can use this book to jump start. This book is also good for beginners in OpenCL and parallel programming.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the erratasubmissionform link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

Different parallel programming techniques

There are several different forms of parallel computing such as bit-level, instruction level, data, and task parallelism. This book will largely focus on data and task parallelism using heterogeneous devices. We just now coined a term, heterogeneous devices. How do we tackle complex tasks "in parallel" using different types of computer architecture? Why do we need OpenCL when there are many (already defined) open standards for Parallel Computing?

To answer this question, let us discuss the pros and cons of different Parallel computing Framework.

OpenMP

OpenMP is an API that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran. It is prevalent only on a multi-core computer platform with a shared memory subsystem.

A basic OpenMP example implementation of the OpenMP Parallel directive is as follows:

#pragma omp parallel { body; }

When you build the preceding code using the OpenMP shared library, libgomp would expand to something similar to the following code:

void subfunction (void *data) { use data; body; } setup data; GOMP_parallel_start (subfunction, &data, num_threads); subfunction (&data); GOMP_parallel_end (); void GOMP_parallel_start (void (*fn)(void *), void *data, unsigned num_threads)

The OpenMP directives make things easy for the developer to modify the existing code to exploit the multicore architecture. OpenMP, though being a great parallel programming tool, does not support parallel execution on heterogeneous devices, and the use of a multicore architecture with shared memory subsystem does not make it cost effective.

MPI

Message Passing Interface (MPI) has an advantage over OpenMP, that it can run on either the shared or distributed memory architecture. Distributed memory computers are less expensive than large shared memory computers. But it has its own drawback with inherent programming and debugging challenges. One major disadvantage of MPI parallel framework is that the performance is limited by the communication network between the nodes.

Supercomputers have a massive number of processors which are interconnected using a high speed network connection or are in computer clusters, where computer processors are in close proximity to each other. In clusters, there is an expensive and dedicated data bus for data transfers across the computers. MPI is extensively used in most of these compute monsters called supercomputers.

OpenACC

The OpenACCApplication Program Interface (API) describes a collection of compiler directives to specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs, and accelerators. OpenACC is similar to OpenMP in terms of program annotation, but unlike OpenMP which can only be accelerated on CPUs, OpenACC programs can be accelerated on a GPU or on other accelerators also. OpenACC aims to overcome the drawbacks of OpenMP by making parallel programming possible across heterogeneous devices. OpenACC standard describes directives and APIs to accelerate the applications. The ease of programming and the ability to scale the existing codes to use the heterogeneous processor, warrantees a great future for OpenACC programming.

CUDA

Compute Unified Device Architecture (CUDA) is a parallel computing architecture developed by NVIDIA for graphics processing and GPU (General Purpose GPU) programming. There is a fairly good developer community following for the CUDA software framework. Unlike OpenCL, which is supported on GPUs by many vendors and even on many other devices such as IBM's Cell B.E. processor or TI's DSP processor and so on, CUDA is supported only for NVIDIA GPUs. Due to this lack of generalization, and focus on a very specific hardware platform from a single vendor, OpenCL is gaining traction.

CUDA or OpenCL?

CUDA is more proprietary and vendor specific but has its own advantages. It is easier to learn and start writing code in CUDA than in OpenCL, due to its simplicity. Optimization of CUDA is more deterministic across a platform, since less number of platforms are supported from a single vendor only. It has simplified few programming constructs and mechanisms. So for a quick start and if you are sure that you can stick to one device (GPU) from a single vendor that is NVIDIA, CUDA can be a good choice.

OpenCL on the other hand is supported for many hardware from several vendors and those hardware vary extensively even in their basic architecture, which created the requirement of understanding a little complicated concepts before starting OpenCL programming. Also, due to the support of a huge range of hardware, although an OpenCL program is portable, it may lose optimization when ported from one platform to another.

The kernel development where most of the effort goes, is practically identical between the two languages. So, one should not worry about which one to choose. Choose the language which is convenient. But remember your OpenCL application will be vendor agnostic. This book aims at attracting more developers to OpenCL.

There are many libraries which use OpenCL programming for acceleration. Some of them are MAGMA, clAMDBLAS, clAMDFFT, BOLT C++ Template library, and JACKET which accelerate MATLAB on GPUs. Besides this, there are C++ and Java bindings available for OpenCL also.

Once you've figured out how to write your important "kernels" it's trivial to port to either OpenCL or CUDA. A kernel is a computation code which is executed by an array of threads. CUDA also has a vast set of CUDA accelerated libraries, that is, CUBLAS, CUFFT, CUSPARSE, Thrust and so on. But it may not take a long time to port these libraries to OpenCL.

Renderscripts

Renderscripts is also an API specification which is targeted for 3D rendering and general purpose compute operations in an Android platform. Android apps can accelerate the performance by using these APIs. It is also a cross-platform solution. When an app is run, the scripts are compiled into a machine code of the device. This device can be a CPU, a GPU, or a DSP. The choice of which device to run it on is made at runtime. If a platform does not have a GPU, the code may fall back to the CPU. Only Android supports this API specification as of now. The execution model in Renderscripts is similar to that of OpenCL.

Hybrid parallel computing model

Parallel programming models have their own advantages and disadvantages. With the advent of many different types of computer architectures, there is a need to use multiple programming models to achieve high performance. For example, one may want to use MPI as the message passing framework, and then at each node level one might want to use, OpenCL, CUDA, OpenMP, or OpenACC.

Besides all the above programming models many compilers such as Intel ICC, GCC, and Open64 provide auto parallelization options, which makes the programmers job easy and exploit the underlying hardware architecture without the need of knowing any parallel computing framework. Compilers are known to be good at providing instruction-level parallelism. But tackling data level or task level auto parallelism has its own limitations and complexities.

Introduction to OpenCL

OpenCL standard was first introduced by Apple, and later on became part of the open standards organization "Khronos Group". It is a non-profit industry consortium, creating open standards for the authoring, and acceleration of parallel computing, graphics, dynamic media, computer vision and sensor processing on a wide variety of platforms and devices.

The goal of OpenCL is to make certain types of parallel programming easier, and to provide vendor agnostic hardware-accelerated parallel execution of code. OpenCL (Open Computing Language) is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. It provides a uniform programming environment for software developers to write efficient, portable code for high-performance compute servers, desktop computer systems, and handheld devices using a diverse mix of multi-core CPUs, GPUs, and DSPs.

OpenCL gives developers a common set of easy-to-use tools to take advantage of any device with an OpenCL driver (processors, graphics cards, and so on) for the processing of parallel code. By creating an efficient, close-to-the-metal programming interface, OpenCL will form the foundation layer of a parallel computing ecosystem of platform-independent tools, middleware, and applications.

We mentioned vendor agnostic, yes that is what OpenCL is about. The different vendors here can be AMD, Intel, NVIDIA, ARM, TI, and so on. The following diagram shows the different vendors and hardware architectures which use the OpenCL specification to leverage the hardware capabilities:

The heterogeneous system

The OpenCL framework defines a language to write "kernels". These kernels are functions which are capable of running on different compute devices. OpenCL defines an extended C language for writing compute kernels, and a set of APIs for creating and managing these kernels. The compute kernels are compiled with a runtime compiler, which compiles them on-the-fly during host application execution for the targeted device. This enables the host application to take advantage of all the compute devices in the system with a single set of portable compute kernels.

Based on your interest and hardware availability, you might want to do OpenCL programming with a "host and device" combination of "CPU and CPU" or "CPU and GPU". Both have their own programming strategy. In CPUs you can run very large kernels as the CPU architecture supports out-of-order instruction level parallelism and have large caches. For the GPU you will be better off writing small kernels for better performance. Performance optimization is a huge topic in itself. We will try to discuss this with a case study in Chapter 8, Basic Optimization Techniques with Case Study

Hardware and software vendors

There are various hardware vendors who support OpenCL. Every OpenCL vendor provides OpenCL runtime libraries. These runtimes are capable of running only on their specific hardware architectures. Not only across different vendors, but within a vendor there may be different types of architectures which might need a different approach towards OpenCL programming. Now let's discuss the various hardware vendors who provide an implementation of OpenCL, to exploit their underlying hardware.

Advanced Micro Devices, Inc. (AMD)

With the launch of AMD A Series APU, one of industry's first Accelerated Processing Unit (APU), AMD is leading the efforts of integrating both the x86_64 CPU and GPU dies in one chip. It has four cores of CPU processing power, and also a four or five graphics SIMD engine, depending on the silicon part which you wish to buy. The following figure shows the block diagram of AMD APU architecture:

An AMD GPU consist of a number of Compute Engines (CU) and each CU has 16 ALUs. Further, each ALU is a VLIW4 SIMD processor and it could execute a bundle of four or five independent instructions. Each CU could be issued a group of 64 work items which form the work group (wavefront). AMD Radeon ™ HD 6XXX graphics processors uses this design. The following figure shows the HD 6XXX series Compute unit, which has 16 SIMD engines, each of which has four processing elements:

Starting with the AMD Radeon HD 7XXX series of graphics processors from AMD, there were significant architectural changes. AMD introduced the new Graphics Core Next (GCN) architecture. The following figure shows an GCN compute unit which has 4 SIMD engines and each engine is 16 lanes wide:

A group of these Compute Units forms an AMD HD 7xxx Graphics Processor. In GCN, each CU includes four separate SIMD units for vector processing. Each of these SIMD units simultaneously execute a single operation across 16 work items, but each can be working on a separate wavefront.

Apart from the APUs, AMD also provides discrete graphics cards. The latest family of graphics card, HD 7XXX, and beyond uses the GCN architecture. We will discuss one of the discrete GPU architectures in the following chapter, where we will discuss the OpenCL Platform model. AMD also provides the OpenCL runtimes for their CPU devices.

NVIDIA®

One of NVIDIA GPU architectures is codenamed "Kepler". GeForce® GTX 680 is one Kepler architectural silicon part. Each Kepler GPU consists of different configurations of Graphics Processing Clusters (GPC) and streaming multiprocessors. The GTX 680 consists of four GPCs and eight SMXs as shown in the following figure:

Kepler architecture is part of the GTX 6XX and GTX 7XX family of NVIDIA discrete cards. Prior to Kepler, NVIDIA had Fermi architecture which was part of the GTX 5XX family of discrete and mobile graphic processing units.

Intel®

Intel's OpenCL implementation is supported in the Sandy Bridge and Ivy Bridge processor families. Sandy Bridge family architecture is also synonymous with the AMD's APU. These processor architectures also integrated a GPU into the same silicon as the CPU by Intel. Intel changed the design of the L3 cache, and allowed the graphic cores to get access to the L3, which is also called as the last level cache. It is because of this L3 sharing that the graphics performance is good in Intel. Each of the CPUs including the graphics execution unit is connected via Ring Bus. Also each execution unit is a true parallel scalar processor. Sandy Bridge provides the graphics engine HD 2000, with six Execution Units (EU), and HD 3000 (12 EU), and Ivy Bridge provides HD 2500(six EU) and HD 4000 (16 EU). The following figure shows the Sandy bridge architecture with a ring bus, which acts as an interconnect between the cores and the HD graphics:

ARM Mali™ GPUs

ARM also provides GPUs by the name of Mali Graphics processors. The Mali T6XX series of processors come with two, four, or eight graphics cores. These graphic engines deliver graphics compute capability to entry level smartphones, tablets, and Smart TVs. The below diagram shows the Mali T628 graphics processor.

Mali T628 has eight shader cores or graphic cores. These cores also support Renderscripts APIs besides supporting OpenCL.

Besides the four key competitors, companies such as TI (DSP), Altera (FPGA), and Oracle are providing OpenCL implementations for their respective hardware. We suggest you to get hold of the benchmark performance numbers of the different processor architectures we discussed, and try to compare the performance numbers of each of them. This is an important first step towards comparing different architectures, and in the future you might want to select a particular OpenCL platform based on your application workload.

OpenCL components

Before delving into the programming aspects in OpenCL, we will take a look at the different components in an OpenCL framework. The first thing is the OpenCL specification. The OpenCL specification describes the OpenCL programming architecture details, and a set of APIs to perform specific tasks, which are all required by an application developer. This specification is provided by the Khronos OpenCL consortium. Besides this, Khronos also provides OpenCL header files. They are cl.h, cl_gl.h, cl_platform.h, and so on.

An application programmer uses these header files to develop his application and the host compiler links with the OpenCL.lib library on Windows. This library contains the entry points for the runtime DLL OpenCL.dll. On Linux the application program is linked dynamically with the libOpenCL.so shared library. The source code for the OpenCL.lib file is also provided by Khronos. The different OpenCL vendors shall redistribute this OpenCL.lib file and package it along with their OpenCL development SDK. Now the application is ready to be deployed on different platforms.

The different components in OpenCL are shown in the following figure:

Different components in OpenCL

On Windows, at runtime the application first loads the OpenCL.dll dynamic link library which in turn, based on the platform selected, loads the appropriate OpenCL runtime driver by reading the Windows registry entry for the selected platform (either of amdocl.dll or any other vendor OpenCL runtimes). On Linux, at runtime the application loads the libOpenCL.so shared library, which in turn reads the file /etc/OpenCL/vendors/*.icd and loads the library for the selected platform. There may be multiple runtime drivers installed, but it is the responsibility of the application developers to choose one of them, or if there are multiple devices in the platforms, he may want to choose all the available platforms. During runtime calls to OpenCL, functions queue parallel tasks on OpenCL capable devices. We will discuss more on OpenCL Runtimes in Chapter 5, OpenCL Program and Kernel Objects.

An example of OpenCL program

In this section we will discuss all the necessary steps to run an OpenCL application.

Basic software requirements

A person involved in OpenCL programming should be very proficient in C programming, and having prior experience in any parallel programming tool will be an added advantage. He or she should be able to break a large problem and find out the data and task parallel regions of the code which he or she is trying to accelerate using OpenCL. An OpenCL programmer should know the underlying architecture for which he/she is trying to program. If you are porting an existing parallel code into OpenCL, then you just need to start learning the OpenCL programming architecture.

Besides this a programmer should also have the basic system software details, such as compiling the code and linking it to an appropriate 32 bit or 64 bit library. He should also have knowledge of setting the system path on Windows to the correct DLLs or set the LD_LIBRARY_PATH environment variable in Linux to the correct shared libraries.

The common system requirements for Windows and Linux operating systems are as follows:

Windows

You should have administrative privileges on the systemMicrosoft Windows XP, Vista, or 7Microsoft Visual Studio 2005, 2008, or 2010Display Drivers for AMD and NVIDIA GPUs. For NVIDIA GPUs you will need display drivers R295 or R300 and above

Linux

You should have root permissions to install the SDKWith the vast number of flavors of Linux, practically any supported version which has the corresponding graphic device driver installed for the GPU

The GCC compiler tool chain

Installing and setting up an OpenCL compliant computer

To install OpenCL you need to download an implementation of OpenCL. We discussed about the various hardware and software vendors in a previous section. The major graphic vendors, NVIDIA and AMD have both released implementations of OpenCL for their GPUs. Similarly AMD and Intel provide a CPU-only runtime for OpenCL. OpenCL implementations are available in so-called Software Development Kits (SDK), and often include some useful tools such as debuggers and profilers. The next step is to download and install the SDK for the GPU you have on your computer. Note that not all graphic cards are supported. A list of which graphics cards are supported can be found in the respective vendor specific websites. Also you can take a look at the Khronos OpenCL conformance products list. If you don't have a graphics card, don't worry, you can use your existing processor to run OpenCL samples on CPU as a device.

If you are still confused about which device to choose, then take a look at the list of supported devices provided with each release of an OpenCL SDK from different vendors.

Installation steps

For NVIDIA installation steps, we suggest you to take a look at the latest installation steps for the CUDA software. First install the GPU computing SDK provided for the OS. The following link provides the installation steps for NVIDIA platforms:

http://developer.download.nvidia.com/compute/cuda/3_2_prod/sdk/docs/OpenCL_Release_Notes.txt

For AMD Accelerated Parallel Processing (APP) SDK installation take a look at the AMD APP SDK latest version installation guide. The AMD APP SDK comes with a huge set of sample programs which can be used for running. The following link is where you will find the latest APP SDK installation notes:

http://developer.amd.com/download/AMD_APP_SDK_Installation_Notes.pdf

For INTEL SDK for OpenCL applications 2013, use the steps provided in the following link:

http://software.intel.com/en-us/articles/intel-sdk-for-opencl-applications-2013-release-notes

Note these links are subject to change over a period of time.

AMD's OpenCL implementation is OpenCL 1.2 conformant. Also download the latest AMD APP SDK version 2.8 or above.

For NVIDIA GPU computing, make sure you have a CUDA enabled GPU. Download the latest CUDA release 4.2 or above, and the GPU computing SDK release 4.2 or above.

For Intel, download the Intel SDK for OpenCL Applications 2013.

Tausende von E-Books und Hörbücher

Ihre Zahl wächst ständig und Sie haben eine Fixpreisgarantie.

Sie haben über uns geschrieben:

OpenCL Programming by Example E-Book

Ravishekhar Banger

Table of Contents

OpenCL Programming by Example

OpenCL Programming by Example

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers and more

Why Subscribe?

Free Access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Different parallel programming techniques

OpenMP

MPI

OpenACC

CUDA

CUDA or OpenCL?

Renderscripts

Hybrid parallel computing model

Introduction to OpenCL

Hardware and software vendors

Advanced Micro Devices, Inc. (AMD)

NVIDIA®

Intel®

ARM Mali™ GPUs

OpenCL components

An example of OpenCL program

Basic software requirements

Windows

Linux

Installing and setting up an OpenCL compliant computer

Installation steps