Electronic Structure Calculations on Graphics Processing Units
Description

Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics provides an overview of computing on graphics processing units (GPUs), a brief introduction to GPU programming, and the latest examples of code developments and applications for the most widely used electronic structure methods.

The book covers all commonly used basis sets, including localized Gaussian- and Slater-type basis functions, plane waves, wavelets, and real-space grid-based approaches. The chapters expose details on the calculation of two-electron integrals, exchange-correlation quadrature, Fock matrix formation, solution of the self-consistent field equations, calculation of nuclear gradients to obtain forces, and methods to treat excited states within DFT. Other chapters focus on semiempirical and correlated wave function methods, including density-fitted second-order Møller–Plesset perturbation theory and both iterative and perturbative single- and multireference coupled-cluster methods.

Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics presents an accessible overview of the field for graduate students and senior researchers of theoretical and computational chemistry, condensed matter physics and materials science, as well as software developers looking for an entry point into the realm of GPU and hybrid GPU/CPU programming for electronic structure calculations.




Table of Contents

Cover

Title Page

Copyright

List of Contributors

Preface

Acknowledgments

Glossary

Abbreviations - Scientific

Abbreviations - Technical

Chapter 1: Why Graphics Processing Units

1.1 A Historical Perspective of Parallel Computing

1.2 The Rise of the GPU

1.3 Parallel Computing on Central Processing Units

1.4 Parallel Computing on Graphics Processing Units

1.5 GPU-Accelerated Applications

References

Chapter 2: GPUs: Hardware to Software

2.1 Basic GPU Terminology

2.2 Architecture of GPUs

2.3 CUDA Programming Model

2.4 Programming and Optimization Concepts

2.5 Software Libraries for GPUs

2.6 Special Features of CUDA-Enabled GPUs

References

Chapter 3: Overview of Electronic Structure Methods

3.1 Introduction

3.2 Hartree–Fock Theory

3.3 Density Functional Theory

3.4 Basis Sets

3.5 Semiempirical Methods

3.6 Density Functional Tight Binding

3.7 Wave Function-Based Electron Correlation Methods

Acknowledgments

References

Chapter 4: Gaussian Basis Set Hartree–Fock, Density Functional Theory, and Beyond on GPUs

4.1 Quantum Chemistry Review

4.2 Hardware and CUDA Overview

4.3 GPU ERI Evaluation

4.4 Integral-Direct Fock Construction on GPUs

4.5 Precision Considerations

4.6 Post-SCF Methods

4.7 Example Calculations

4.8 Conclusions and Outlook

References

Chapter 5: GPU Acceleration for Density Functional Theory with Slater-Type Orbitals

5.1 Background

5.2 Theory and CPU Implementation

5.3 GPU Implementation

5.4 Conclusion

References

Chapter 6: Wavelet-Based Density Functional Theory on Massively Parallel Hybrid Architectures

6.1 Introductory Remarks on Wavelet Basis Sets for Density Functional Theory Implementations

6.2 Operators in Wavelet Basis Sets

6.3 Parallelization

6.4 GPU Architecture

6.5 Conclusions and Outlook

References

Chapter 7: Plane-Wave Density Functional Theory

7.1 Introduction

7.2 Theoretical Background

7.3 Implementation

7.4 Optimizations

7.5 Performance Examples

7.6 Exact Exchange with Plane Waves

7.7 Summary and Outlook

7.8 Acknowledgments

References

Appendix A: Definitions and Conventions

Appendix B: Example Kernels

Chapter 8: GPU-Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory

8.1 Introduction

8.2 Software Architecture for GPU-Acceleration

8.3 Maximizing Asynchronous Progress

8.4 Libcusmm: GPU Accelerated Small Matrix Multiplications

8.5 Benchmarks and Conclusions

Acknowledgments

References

Chapter 9: Grid-Based Projector-Augmented Wave Method

9.1 Introduction

9.2 General Overview

9.3 Using GPUs in Ground-State Calculations

9.4 Time-Dependent Density Functional Theory

9.5 Random Phase Approximation for the Correlation Energy

9.6 Summary and Outlook

Acknowledgments

References

Chapter 10: Application of Graphics Processing Units to Accelerate Real-Space Density Functional Theory and Time-Dependent Density Functional Theory Calculations

10.1 Introduction

10.2 The Real-Space Representation

10.3 Numerical Aspects of the Real-Space Approach

10.4 General GPU Optimization Strategy

10.5 Kohn–Sham Hamiltonian

10.6 Orthogonalization and Subspace Diagonalization

10.7 Exponentiation

10.8 The Hartree Potential

10.9 Other Operations

10.10 Numerical Performance

10.11 Conclusions

10.12 Computational Methods

Acknowledgments

References

Chapter 11: Semiempirical Quantum Chemistry

11.1 Introduction

11.2 Overview of Semiempirical Methods

11.3 Computational Bottlenecks

11.4 Profile-Guided Optimization for the Hybrid Platform

11.5 Performance

11.6 Applications

11.7 Conclusion

Acknowledgement

References

Chapter 12: GPU Acceleration of Second-Order Møller–Plesset Perturbation Theory with Resolution of Identity

12.1 Møller–Plesset Perturbation Theory with Resolution of Identity Approximation (RI-MP2)

12.2 A Mixed-Precision Matrix Multiplication Library

12.3 Performance of Accelerated RI-MP2

12.4 Example Applications

12.5 Conclusions

References

Chapter 13: Iterative Coupled-Cluster Methods on Graphics Processing Units

13.1 Introduction

13.2 Related Work

13.3 Theory

13.4 Algorithm Details

13.5 Computational Details

13.6 Results

13.7 Conclusions

Acknowledgments

References

Chapter 14: Perturbative Coupled-Cluster Methods on Graphics Processing Units: Single- and Multi-Reference Formulations

14.1 Introduction

14.2 Overview of Electronic Structure Methods

14.3 NWChem Software Architecture

14.4 GPU Implementation

14.5 Performance

14.6 Outlook

Acknowledgments

References

Scientific Index

Technical Index

End User License Agreement



List of Illustrations

Chapter 1: Why Graphics Processing Units

Figure 1.1 Photograph taken in 1957 at NASA featuring an IBM 704 computer, the first commercially available general-purpose computer with floating-point arithmetic hardware [4]

Figure 1.2 Photograph of Burroughs Corporation's D825 parallel computer [6]

Figure 1.3 Microprocessor transistor counts 1971–2011. Until recently, the number of transistors on integrated circuits has been following Moore's Law [16], doubling approximately every 2 years

Figure 1.4 Illustration of the ever-increasing power density within silicon chips, with decreasing gate length.

Figure 1.5 Peak floating-point operations per second (a) and memory bandwidth (b) for Intel CPUs and Nvidia GPUs.

Figure 1.6 Directed acyclic graph demonstrating the decomposition of four tasks across two processors

Figure 1.7 Illustration of a GPU working in conjunction with a CPU as work is offloaded and computed on a GPU concurrently with CPU execution [24]

Figure 1.8 Cartoon representation of the dihydrofolate reductase enzyme in water (23,508 atoms)

Figure 1.9 Computational performance of classical molecular dynamics simulations with the CPU and GPU versions of Amber on varying hardware configurations measured in nanosecond per day. The results are for a standard benchmark (FactorIX enzyme in explicit solvent, 90,906 atoms, NVE ensemble, 2 fs time step) [30]

Figure 1.10 Performance acceleration for Adobe Mercury Playback Engine on GPUs. System configuration: Adobe Premier Pro CC, Windows 7 – 64-bit, Dual Intel Xeon E5 2687 W 3.10 GHz CPUs (16 total cores). Test consists of HD video workflow with complex Mercury Playback Engine effects at 720p resolution. Results based on final output render time comparing noted GPU to CPU [31]

Figure 1.11 Performance data for Adobe After Effect CC Engine on GPUs. System configuration: Adobe After Effects CC, Windows 7 – 64-bit, Dual Intel Xeon E5 2687 W 3.10 GHz CPUs (16 total cores). Test consists of live After Effect CC scenes with 3D layer, comparing time to render ray-traced 3D scene on noted GPU versus CPU [32]

Chapter 2: GPUs: Hardware to Software

Figure 2.1 General overview of CPU–GPU hardware configuration.

Figure 2.2 Illustration highlighting the movement of data between memory locations on the host and the device.

Figure 2.3 Nvidia Kepler architecture memory model. Global memory and L2 cache can be accessed by all SMs. L1 cache and shared memory (SMEM) are accessible to threads running on the same SM.

Figure 2.4 Example of C for CUDA kernel code and kernel launch code. The declaration specifier __global__ marks the kernel for execution on the GPU, while the execution configuration (grid and block size) is contained in triple angle brackets

Figure 2.5 Thread hierarchy. Blocks of threads are configured into a grid of thread blocks that logically map to the underlying GPU hardware with blocks being executed on individual SMs.

Figure 2.6 Half-warp of threads accessing 16 shared memory banks

Figure 2.7 Visual representation of unified memory. Host and device memory is presented as a single address space. This makes programming easier, but since memory copies still happen behind the scenes, performance can be lower compared to explicitly managed memory.

Chapter 4: Gaussian Basis Set Hartree–Fock, Density Functional Theory, and Beyond on GPUs

Figure 4.1 Schematic of one-block one-contracted Integral (1B1CI) mapping. Cyan squares on left represent contracted ERIs each mapped to the labeled CUDA block of 64 threads. Orange squares show mapping of primitive ERIs to CUDA threads (green and blue boxes, colored according to CUDA warp) for two representative integrals, the first a contraction over a single primitive ERI and the second involving 3⁴ = 81 primitive contributions.

Figure 4.2 Schematic of one-thread one-contracted Integral (1T1CI) mapping. Cyan squares represent contracted ERIs and CUDA threads. Thread indices are shown in parentheses. Each CUDA block (red outlines) computes 16 ERIs, with each thread accumulating the primitives of an independent contraction, in a local register.

Figure 4.3 Schematic of one-thread one-primitive integral (1T1PI) mapping. Cyan squares represent two-dimensional tiles of 16 × 16 primitive ERIs, each of which is assigned to a 16 × 16 CUDA block as labeled. Red lines indicate divisions between contracted ERIs. The orange box shows assignment of primitive ERIs to threads (gray squares) within a block that contains contributions to multiple contractions.

Figure 4.4 ERI grids colored by angular momentum class for a system containing four s-shells and one p-shell. Each square represents all ERIs for a shell quartet. (a) Grid when bra and ket pairs are ordered by simple loops over shells. (b) ERI grid for same system with bra and ket pairs sorted by angular momentum, ss, then sp, then pp. Each integral class now handles a contiguous chunk of the total ERI grid.

Figure 4.5 Organization of ERIs for Coulomb formation. Rows and columns correspond to primitive bra and ket pairs, respectively. Each ERI is colored according to the magnitude of its Schwarz bound. Data are derived from calculation on ethane molecule. Figure (a) obtained by arbitrary ordering of pairs within each angular momentum class and suffers from load imbalance because large and small integrals are computed in neighboring cells, and (b) that sorts bra and ket primitives by Schwarz contribution within each momentum class, providing an efficient structure for parallel evaluation

Figure 4.6 Schematic representation of a J-Engine kernel for one angular momentum class, for example, (ss|ss). Cyan squares represent significant ERI contributions. Sorted bra and ket vectors are represented by triangles to the left and above the grid. The path of a 2 × 2 block as it sweeps across the grid is shown in orange. The final reduction across rows of the block is illustrated within the inset to the right.

Figure 4.7 Schematic of a K-Engine kernel. Bra and ket PQ arrays are represented by triangles to the left and above the grid. The pairs are grouped by and index and then sorted by bound. The paths of four blocks are shown in orange, with the zigzag pattern illustrated by arrows in the top right. The final reduction of an exchange element within a 2 × 2 block is shown to the right.

Figure 4.8 Organization of double- and single-precision workloads within Coulomb ERI grids. As in Figure 4.5, rows and columns correspond to primitive bra and ket pairs. (a) Each ERI is colored according to the magnitude of its Schwarz bound. (b) ERIs are colored by required precision. Yellow ERIs require double precision, while those in green may be evaluated in single precision. Blue ERIs are neglected entirely.

Figure 4.9 Relative error in final energies versus precision threshold for various basis sets. Test molecules are shown on the right. Each point is averaged over the five test systems. Error bars represent 2 standard deviations above the mean. The black line shows the empirical error bound of Eq. (4.64)

Figure 4.10 One-dimensional alkene and three-dimensional water-cube test systems. Alkene lengths vary from 24 to 706 carbon atoms and water cubes range from 10 to nearly 850 water molecules. A uniform density is used for all water boxes

Figure 4.11 First SCF iteration timings in seconds for (a) linear alkenes and (b) cubic water clusters. Total times are further broken down into J-Engine, K-Engine, distance-masked K-Engine, linear algebra (LA), and DFT exchange–correlation contributions. For water clusters, total SCF times are shown for both the naïve and distance-masked (mask) K-Engine. All calculations were performed using a single Tesla M2090 GPU and the 6-31G basis set. Power fits show scaling with increasing system size, and the exponent for each fit is provided in the legend.

Figure 4.12 Multi-GPU parallel efficiency for J-Engine, K-Engine, and exchange–correlation Fock formation based on first iteration time for water clusters, run on 2 M2090 GPUs

Figure 4.13 Total SCF time of TeraChem on eight CPUs and four GPUs, relative to GAMESS on eight CPUs for water clusters

Figure 4.14 Absorption spectrum of the nanostar dendrimer with peripheral chromophore absorption peaks calculated (A, B) in vacuum and measured experimentally (A_exp, B_exp) at 300 K in hexane

Chapter 5: GPU Acceleration for Density Functional Theory with Slater-Type Orbitals

Figure 5.1 Schematic representation of the CUDA kernel and wrapper routine for the numerical integration of Fock matrix elements

Figure 5.2 Comparison of the time required by the GPU and CPU code for computing contributions to the Fock matrix through numerical integration

Figure 5.3 Schematic representation of the hybrid CPU/GPU algorithm for the MPI parallel Fock matrix calculation

Figure 5.4 Timings and speedup for the calculation of Hessian matrix elements for an analytical frequency calculation on the budesonide molecule (C25H34O6) with BLYP/TZ2P and frozen core setting "small," using a Becke grid quality of "good" in ADF

Figure 5.5 Timings and speedup for the calculation of matrix elements of the Fock matrix derivatives (routine f1u1_ai_gga) required for analytical frequency calculation. Test system is the budesonide molecule (C25H34O6) with BLYP/TZ2P and frozen core setting "small," using a Becke grid quality of "good" in ADF

Chapter 6: Wavelet-Based Density Functional Theory on Massively Parallel Hybrid Architectures

Figure 6.1 Least asymmetric Daubechies wavelet family of order . Note that both the scaling function and the wavelet are different from zero only within the interval

Figure 6.2 Schematic representation of the application of the Hamiltonian in the BigDFT formalism

Figure 6.3 Simulation grid points

Figure 6.4 Orbital (a) and coefficient (b) distribution schemes

Figure 6.5 Comparison of the performance of BigDFT on different platforms. Runs on the CCRT machine are worse in scalability but better in absolute performance per compute core than runs on the CSCS machine (1.6–2.3 times faster)

Figure 6.6 Upper panel: Data distribution for 1D convolution+transposition on the GPU. See Section 6.4.2 for details. Lower panel: Reproduction of the portion of the input data highlighted in gray in the upper panel

Figure 6.7 Left panel: Speedup for the GPU version of the fundamental operations on the wave functions. Right panel: Double-precision speedup for the GPU version of the 3D operators used in the BigDFT code as a function of the single wave function size

Figure 6.8 (a) Speedup of the BigDFT code for a four-carbon-atom supercell (graphene) with 164 k-points. The calculation is performed with eight MPI processes on the CEA-DAM INTI machine, based on Westmere processors and Nvidia Fermi. For each run, the number of equivalent MPI processes is indicated, given that the parallel efficiency of this run is 98%. Also, the efficiency of the GPU acceleration is presented. (b) Speedup of the same run on different hybrid architectures in combination with MPI runs

Figure 6.9 Relative speedup of the hybrid DFT code with respect to the equivalent pure CPU run. Different runs for simulations of increasing system size have been performed on an Intel X5472 3 GHz (Harpertown) machine, with a Fermi GPU card

Figure 6.10 Massively parallel runs for a cobalt metalloporphyrin on a graphene sheet (265-atom system) with surface boundary conditions. In the bottom panel, the simulations have been accelerated with Kepler GPU cards. Interesting speedups can be achieved

Chapter 7: Plane-Wave Density Functional Theory

Figure 7.1 Schematic representation of three approaches for porting loops on GPUs: (a) asynchronous kernel launches; (b) streaming kernel launches; and (c) batched kernel launches

Figure 7.2 Using multiple processes accelerated by GPUs, communicating via MPI.

Figure 7.3 supercell of crystalline Si

Figure 7.4 Gold MD snapshot at 1600 K

Figure 7.5 Boron in the hR105 (β-rhombohedral) structure

Chapter 8: GPU-Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory

Figure 8.1 Workflow for a self-consistent electronic structure calculation, illustrating the use of both traditional diagonalization-based and linear scaling methods

Figure 8.2 Direct comparison of the time needed for calculations on bulk liquid water using linear scaling and diagonalization-based SCF procedures. For matrices larger than 20,000 basis functions, a speedup is observed (filtering threshold 10⁻⁵). Note that for linear scaling approaches, the time needed depends not only on the matrix size but also on the sparsity pattern, and hence better quality basis sets typically have a larger relative cost

Figure 8.3 A schematic representation of the software architecture employed in the GPU accelerated DBCSR library. The various layers correspond to key steps in the matrix multiplication algorithm. While the Cannon layer is essential for the parallelism between processes or on the cluster level, the lower layers deal with parallelism and acceleration on the node level

Figure 8.4 Enabling concurrency and enforcing dependencies in DBCSR. Multiple streams are used to transfer data from the host to the device, and to process independent stacks. Dependencies between the streams, for example, a panel upload and stack calculations, and between host and device, for example, device buffer reuse, are enforced using events

Figure 8.5 Schematic representation of the double buffered Cannon algorithm, which illustrates how the use of two host and two device buffers for the matrix panels enables overlapping of message passing, host-to-device memory copies, and computation. The ratio of the time needed for the important steps of the algorithm depends on the hardware and on the science problem at hand

Figure 8.6 Minimum (dotted line) and maximum (solid line) arithmetic intensity for different matrix sizes commonly employed in CP2K simulations, and the corresponding maximum possible flop rate. The performance as obtained from individual kernel launches in a mini-app is shown as bars

Figure 8.7 Inner-product (a) and outer-product (b) form of matrix multiplication. The yellow areas in the result matrix indicate elements that can be computed independently by accessing the highlighted areas of the input matrices.

Figure 8.8 (a) Parameterization of the -matrix product . Each thread computes an tile () of the result matrix . In order to accommodate matrix sizes larger than the available shared memory, matrices are processed in slabs (, ), with an input slab width . In order to optimize the data output, the matrices () are written back using the output slab width . (b) Close to the SM, registers are used to store the matrix tile, while slabs of , , and are stored in shared memory. (c) GPU memory stores all panel data, including the various blocks of , , , and the stack buffers .

Figure 8.9 Performance comparison of the multi-threaded DBCSR library based on matrix blocks, without using the MPI capabilities. The benchmark was run on a dual Sandy Bridge (E5-2620, 2.0 GHz, 6 cores) machine, equipped with one Nvidia Tesla K20 card

Figure 8.10 Aggregated nanoparticles in explicit solution (77,538 atoms) can be run on the Piz Daint computer (5272 hybrid compute nodes) at approximately 122 seconds per SCF step

Chapter 9: Grid-Based Projector-Augmented Wave Method

Figure 9.1 Multigrid V-cycle with three levels

Figure 9.2 Flowchart of the SCF loop

Figure 9.3 (a) Weak scaling performance of the CPU and GPU versions of the program using bulk Si systems. (b) The achieved speedups with GPU acceleration. The GPU runs used one CPU core per GPU.

Figure 9.4 Output of the Nvidia nvvp profiling tool for a portion of an RPA O calculation

Figure 9.5 Speedup (eight GPUs vs. eight CPU cores) as a function of the number of the () for some representative functions such as get wave functions (“get_wfs”), mapping wave functions between the 3D fft grid and reduced planewave grid (“mapG”), PAW corrections (“paw_P_ai”), batched cufft (“fft”), and cublas (“zherk” routine). For a full list of the GPU ported functions, refer to Ref. [49]. The test system is a /Ru(0001) surface, modeled with four layers of Ru in a unit cell. The speedup (timing) information comes from a summation of 1 -point (per core), 5 occupied, and 1486 unoccupied bands with an energy cutoff 150 eV. The total speedups (“Total”) in the optical limit () and other are also shown.

Chapter 10: Application of Graphics Processing Units to Accelerate Real-Space Density Functional Theory and Time-Dependent Density Functional Theory Calculations

Figure 10.1 Example of real-space grids adapted to the shape of a cis retinal molecule. The cubes mark the position of the grid points. For visualization purposes, we represent smaller and coarser grids than the ones used for actual calculations

Figure 10.2 Scheme illustrating the blocks of orbitals strategy for DFT on GPUs. (a) Operating on a single orbital might not provide enough parallelism for the GPU to perform efficiently. (b) By operating simultaneously over several orbitals there is a larger degree of data parallelism and there is less divergence among GPU threads.

Figure 10.3 Examples of different grid orders in 2D: (a) standard order, (b) grid ordered by small parallelepipedic subgrids or bricks, and (c) order given by a Hilbert space-filling curve

Figure 10.4 Effect of the optimization of the grid mapping for data locality on the numerical throughput of the Laplacian operator as a function of the size of the orbitals block. Spherical grid with 500k points. Computations with an AMD Radeon 7970 GPU

Figure 10.5 Division of the atoms of a C60 molecule in groups (represented by different colors) whose pseudo-potential spheres do not overlap.

Figure 10.6 Numerical throughput of the application of the pseudo-potentials nonlocal part as a function of the size of the block of orbitals (block-size). Calculation for -cyclodextrin with 256 orbitals and 260k grid points for one CPU and two GPUs.

Figure 10.7 Numerical throughput of the application of the Kohn–Sham Hamiltonian as a function of the size of the block of orbitals (block-size). Calculation for -cyclodextrin with 256 orbitals and 260k grid points for one CPU and two GPUs

Figure 10.8 Numerical throughput of (a) the orthogonalization procedure and (b) the subspace diagonalization as a function of the size of the block of orbitals (block-size) for different processors. Calculation for -cyclodextrin with 256 orbitals and 260k grid points

Figure 10.9 Numerical throughput of the fourth-order Taylor approximation to the exponential operator as a function of the size of the block of orbitals (block-size). Calculation for -cyclodextrin with 256 orbitals and 260k grid points for one CPU and two GPUs

Figure 10.10 Comparison of (a) the throughput and (b) calculation time achieved by our FFT Poisson solver as a function of the number of grid points for one CPU and two GPUs. The data is originally on main memory, so the time required to copy the input data to the GPU and copy back the result is included. The number of points corresponds to the spherical grid used by octopus, the FFT grid has a larger number of points. Following Ref. [123], the operation count for the FFTs is assumed to be

Figure 10.11 Numerical throughput of our CPU and GPU implementations as a function of the size of the block of orbitals (block-size). (a) Self-consistency cycle in a ground-state DFT calculation. (b) Real-time TDDFT propagation. -cyclodextrin molecule with 256 orbitals and 260k grid points

Figure 10.12 Performance of our CPU and GPU implementations for a set of 40 molecules of different sizes. (a) Numerical throughput of the self-consistency cycle. (b) Total execution time for a single-point energy calculation.

Figure 10.13 Performance of our CPU and GPU real-time TDDFT implementations for a set of 40 molecules of different sizes. (a) Numerical throughput of the real-time propagation. (b) Computational time required to propagate 1 attosecond.

Figure 10.14 Speed-up of the GPU calculation with respect to the CPU for different molecules as a function of the number of valence electrons. (a) Speed-up for the time spent in the SCF-cycle of a ground-state DFT calculation (without considering initializations). (b) Speed-up for real-time TDDFT. Intel Core i7 3820 using 8 threads

Figure 10.15 Numerical performance comparison between our GPU implementation (octopus) and the terachem code. (a) Comparison of the total calculation time as a function of the number of valence electrons. (b) Speed-up of our implementation with respect to terachem (run time of terachem divided by the run time of octopus). The calculations are single-point energy evaluations performed on a set of 40 molecules, running on a Nvidia Tesla K20 GPU.

Chapter 11: Semiempirical Quantum Chemistry

Figure 11.1 Profiles of the OM3 calculations for the test proteins for the C[1C] computing setup

Figure 11.2 Speedups of the FDIAG subroutine in the OM3 calculations on the multi-CPU C[6C], C[12C], and hybrid CPU–GPU C[12C–1G] computing setups over the serial configuration

Figure 11.3 Speedups of the BORDER subroutine in the OM3 calculations on the multi-CPU C[6C], C[12C], and GPU-only C[1G] computing setups over the serial configuration

Figure 11.4 Speedups of the DIIS subroutine in the OM3 calculations on the multi-CPU C[6C], C[12C], and GPU-only C[1G] and C[2G] computing setups over the serial configuration

Figure 11.5 Speedups of the PDIAG subroutine in the OM3 calculations on the multi-CPU C[6C], C[12C], and GPU-only C[1G] and C[2G] computing setups over the serial configuration

Figure 11.6 Speedups of the ORTCOR subroutine in the OM3 calculations on the multi-CPU C[6C], C[12C], and GPU-only C[1G] and C[2G] computing setups over the serial configuration

Figure 11.7 Overall speedups of the OM3 calculations of test proteins on the multi-CPU C[6C], C[12C], and hybrid CPU–GPU C[12C–1G] and C[12C–2G] computing setups over the serial configuration

Figure 11.8 Profiles of the OM3 calculations for the test proteins on the C[12C–2G] computing setup

Figure 11.9 Experimental structures of (a) (PDB ID: 2AP7, 80% α-helix), (b) (PDB ID: 2EVQ, 50% β-strands), and (c) (PDB ID: 1LVR, 100% random coil). Only the backbone atoms are shown, with the atoms represented by black balls. Four dihedral angles in a residue serve as stereochemical metrics; see the schematic sketch in (d)

Chapter 12: GPU Acceleration of Second-Order Møller–Plesset Perturbation Theory with Resolution of Identity

Figure 12.1 Pictorial representation of the heterogeneous MGEMM algorithm. (a, b) The input matrices and their separation into matrices with large and small elements, given a cutoff parameter. (c) Components of the final matrix, where the light gray (green in ebook) component involves a dense matrix multiplication computed with CUBLAS-SGEMM, while the dark gray (blue in ebook) components involve sparse matrix multiplications computed with a DAXPY-like algorithm (explained in the text). The blocks follow the nomenclature shown in Eq. (12.20)

Figure 12.2 Speedup for various *GEMM calls as a function of matrix size. Most elements were in the range , with the “salt” values in the range . Times are scaled relative to running DGEMM on the CPU

Figure 12.3 RMS error in a single matrix element for various GEMM calls as a function of matrix size compared to CPU DGEMM. Background elements were in the range , with the “salt” values in the range or . MGEMM gives identical results for all parameters

Figure 12.4 Fraction of “large” elements as a function of the cutoff parameter for the taxol RI-MP2 matrices in steps 3 and 4 of the algorithm outlined in Section 12.1

Figure 12.5 Biochemical reactions studied. (a) Isomerization of dihydroxyacetone phosphate (DHAP) to glyceraldehyde-3-phosphate (G3P). (b) Fructose-1,6-bisphosphate (FBP) to DHAP and G3P

Chapter 13: Iterative Coupled-Cluster Methods on Graphics Processing Units

Figure 13.1 A simple interleaved DGEMM algorithm. For a matrix multiplication C = AB, the row and column dimensions of the output are blocked to parallelize the algorithm over multiple GPUs and to guarantee that all blocks will fit in GPU global memory. The summation dimension is blocked in order to interleave computation and communication.

Figure 13.2 (a) Performance (in GF) of the blocked DGEMM algorithm outlined in Figure 13.1. The performance of Fermi and Kepler GPUs is illustrated relative to that of the Intel Core i7 3930K CPU in (b). The performance boost achieved by interleaving communication and computation is shown in (c)

Figure 13.3 Pseudocode for the DF/CD-CCSD procedure. The ladder diagram, , is evaluated on the GPU using the symmetric and antisymmetric tensors defined in Ref. [38] (here denoted by ()).

Figure 13.4 Performance in GF (10⁹ floating-point operations per second) for different implementations of spin-free CCD. Results are given for both CPU and GPU hardware, and CPU BLAS routines utilize eight threads

Figure 13.5 Speedup in the average DF-CCSD iteration time for clusters of water molecules (represented by a cc-pVDZ basis set, using a cc-pVDZ-RI auxiliary basis set) using one Fermi C2070 GPU, two Fermi C2070 GPUs, or one Kepler K20c GPU. The speedup is defined relative to the average iteration time for DF-CCSD using all six cores of a Core i7-3930K processor

Chapter 14: Perturbative Coupled-Cluster Methods on Graphics Processing Units: Single- and Multi-Reference Formulations

Figure 14.1 Dimension-flattening optimization. The solid lines correspond to different two-dimensional regions. The dotted lines correspond to the mapping of the data blocks to the thread blocks. Tensors are flattened into two-dimensional arrays, increasing the utilization of thread blocks. Each thread block does not necessarily work only on values from a single dimension of the original tensor

Figure 14.2 CPU–GPU hybrid implementation: execution steps involved in the noniterative triples correction. The T1, T2, and T3 tensors correspond to the singles, doubles, and triples amplitudes (SRCC) or to the corresponding reference-specific perturbative amplitudes (MRCC). The F1 and V2 tensors correspond to one- and two-electron integrals, respectively. The steps are as follows: step 1, copy input blocks from Global Arrays to the CPU or GPU local memory; step 2, contract input blocks into an intermediate tensor block; step 3, reduce the intermediate tensor to compute the energy correction contribution; step 4, reduce the final energy correction across all CPUs and GPUs. One of the CPU cores (Core 0 in the image) manages communication and load balancing toward the GPU, and also performs some basic sequential operations that would be more expensive on the GPU.

Figure 14.3 Schematic representation of the GPU-enhanced reference-level parallelism. Separate processor groups are delegated to calculate the reference-specific parts of the iterative MRCCSD equations and reference-specific noniterative corrections due to triples.

Figure 14.4 Speedup of the noniterative MRCCSD(T) calculations for various configurations for block sizes varying from 12 through 22 with respect to the execution times using two CPUs and no GPUs

List of Tables

Chapter 2: GPUs: Hardware to Software

Table 2.1 Memory hierarchy in the CUDA programming model detailing the location, lifetime, and privacy of the five memory spaces

Chapter 4: Gaussian Basis Set Hartree–Fock, Density Functional Theory, and Beyond on GPUs

Table 4.1 Runtime comparison for evaluating ERIs of 64 H atom lattice using 1B1CI, 1T1CI, and 1T1PI methods

Table 4.2 Dynamic precision performance for RHF calculations on various molecules using the 6-31G basis set

Chapter 6: Wavelet-Based Density Functional Theory on Massively Parallel Hybrid Architectures

Table 6.1 Parallel run of BigDFT on a hybrid heterogeneous architecture

Table 6.2 Effect of MPI and OpenMP (OMP) parallelization and GPU acceleration on the complete molecular dynamics of 32 O molecules

Chapter 7: Plane-Wave Density Functional Theory

Table 7.1 Self-consistent electron density calculation of bulk Si with a supercell of 128 atoms using block Davidson diagonalization

Table 7.3 Self-consistent electron density calculation of bulk Si with a supercell of 512 atoms using block Davidson diagonalization

Table 7.4 Self-consistent electron density calculation of bulk Si with a supercell of 256 atoms using RMM-DIIS diagonalization

Table 7.5 Non-self-consistent bulk Si band structure calculation

Table 7.6 Ab-initio molecular dynamics of 64 gold atoms at 1600 K, seen in Figure 7.4

Table 7.7 Structural relaxation of hR105 boron

Table 7.8 Structural relaxation of hR105 boron with exact exchange

Chapter 8: GPU-Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory

Table 8.1 Key quantities of three linear scaling benchmarks that are distributed with CP2K

Chapter 9: Grid-Based Projector-Augmented Wave Method

Table 9.1 Performance for ground-state calculations of carbon nanotubes of different lengths (times in seconds, speedup denoted as S-up)

Table 9.2 Bulk Au time-propagation example

Table 9.3 Speedup (eight GPUs vs. eight CPU cores) as well as the time required for an entire RPA calculation with a response function cutoff of 150 eV for different simulations

Chapter 11: Semiempirical Quantum Chemistry

Table 11.1 Proteins in the test set for the OM3 calculations

Table 11.2 Percentages of computation time in the PDIAG subroutine consumed by FMO, JACOBI, and other tasks in the OM3 calculations on a single CPU core

Table 11.3 Speedups of the FMO and JACOBI steps in the PDIAG subroutine on the multi-CPU C[6C], C[12C], and GPU-only C[1G] and C[2G] computing setups over the serial setup

Table 11.4 Statistics (%) in the most favored and additionally allowed regions of the Ramachandran plot and standard deviations of the backbone dihedral angles

Chapter 12: GPU Acceleration of Second-Order Møller–Plesset Perturbation Theory with Resolution of Identity

Table 12.1 Speedups using CUBLAS DGEMM relative to CPU DGEMM for biomolecules in a double-ζ basis

Table 12.2 Speedups using CUBLAS SGEMM and DGEMM and total energy errors relative to CPU DGEMM for various molecules in a double-ζ basis

Table 12.3 MGEMM speedups and total energy errors with respect to CPU DGEMM for various molecules in double-ζ and triple-ζ basis sets

Table 12.4 Comparison of Gibbs reaction energies (in kcal/mol) using RI-MP2 with respect to DFT

Chapter 13: Iterative Coupled-Cluster Methods on Graphics Processing Units

Table 13.1 Comparison of CPU and GPU implementations of CCD

Table 13.2 Comparison of CPU and GPU implementations of CCSD

Table 13.3 Comparison of CPU and GPU implementations of CCSD in large basis sets

Table 13.4 Average iteration time (in seconds) for DF/CD-CCSD computations of adenine-thymine and a benzene trimer represented by the aug-cc-pVDZ basis and the uracil dimer represented by the aug-cc-pVTZ basis

Chapter 14: Perturbative Coupled-Cluster Methods on Graphics Processing Units: Single- and Multi-Reference Formulations

Table 14.1 Time comparison for CPU and CPU+GPU runs of the noniterative part of the CCSD(T) approach for the pentacene molecule in cc-pVDZ basis set

Electronic Structure Calculations on Graphics Processing Units

From Quantum Chemistry to Condensed Matter Physics

 

Editors

Ross C. Walker

San Diego Supercomputer Center and Department of Chemistry and Biochemistry, University of California, San Diego, USA

 

and

Andreas W. Götz

San Diego Supercomputer Center, University of California, San Diego, USA

 

 

This edition first published 2016

© 2016 John Wiley & Sons, Ltd

Registered office

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

The advice and strategies contained herein may not be suitable for every situation. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.

Library of Congress Cataloging-in-Publication Data applied for.

ISBN: 9781118661789

A catalogue record for this book is available from the British Library.

Cover Image: Courtesy of the Editors

List of Contributors

Ani Anciaux-Sedrakian,

Mechatronics, Computer Sciences and Applied Mathematics Division, IFP Energies nouvelles, Rueil-Malmaison Cedex, France

Xavier Andrade,

Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA

Edoardo Aprà,

William R. Wiley Environmental Molecular Sciences Laboratory, Battelle, Pacific Northwest National Laboratory, Richland, WA, USA

Alán Aspuru-Guzik,

Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA

Jeroen Bédorf,

Centrum Wiskunde & Informatica, Amsterdam, The Netherlands

Kiran Bhaskaran-Nair,

William R. Wiley Environmental Molecular Sciences Laboratory, Battelle, Pacific Northwest National Laboratory, Richland, WA, USA

Damien Caliste,

Université Grenoble Alpes, INAC, Grenoble, France, and CEA, INAC, Grenoble, France

A. Eugene DePrince III,

Department of Chemistry and Biochemistry, Florida State University, Tallahassee, FL, USA

Thierry Deutsch,

Université Grenoble Alpes, INAC, Grenoble, France, and CEA, INAC, Grenoble, France

Jussi Enkovaara,

Department of Applied Physics, Aalto University, Espoo, Finland; CSC – IT Center for Science Ltd, Espoo, Finland

Paul Fleurat-Lessard,

Laboratoire de Chimie, Université de Lyon, ENS Lyon, Lyon, France; ICMUB, Université de Bourgogne Franche-Comté, Dijon, France

Luigi Genovese,

Université Grenoble Alpes, INAC, Grenoble, France, and CEA, INAC, Grenoble, France

Stefan Goedecker,

Institut für Physik, Universität Basel, Basel, Switzerland

Andreas W. Götz,

San Diego Supercomputer Center, UCSD, La Jolla, CA, USA

Samuli Hakala,

Department of Applied Physics, Aalto University, Espoo, Finland

Jeff R. Hammond,

Leadership Computing Facility, Argonne National Laboratory, Argonne, IL, USA

Ville Havu,

Department of Applied Physics, Aalto University, Espoo, Finland

Maxwell Hutchinson,

Department of Physics, University of Chicago, Chicago, IL, USA

Jürg Hutter,

Institute of Physical Chemistry, University of Zürich, Zürich, Switzerland

Adrian Jinich,

Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA

Axel Koslowski,

Max-Planck-Institut für Kohlenforschung, Mülheim an der Ruhr, Germany

Karol Kowalski,

William R. Wiley Environmental Molecular Sciences Laboratory, Battelle, Pacific Northwest National Laboratory, Richland, WA, USA

Sriram Krishnamoorthy,

Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory, Richland, WA, USA

Lin Li,

SUNCAT Center for Interface Science and Catalysis, SLAC National Accelerator Laboratory, Menlo Park, CA, USA

Nathan Luehr,

Department of Chemistry and the PULSE Institute, Stanford, CA, USA; SLAC National Accelerator Laboratory, Menlo Park, CA, USA

Wenjing Ma,

Institute of Software, Chinese Academy of Sciences, Beijing, China

Todd J. Martínez,

Department of Chemistry and the PULSE Institute, Stanford, CA, USA; SLAC National Accelerator Laboratory, Menlo Park, CA, USA

Jean-François Méhaut,

Université Joseph Fourier – Laboratoire d'Informatique de Grenoble – INRIA, Grenoble, France

Peter Messmer,

NVIDIA, Zürich, Switzerland; NVIDIA Co-Design Lab for Hybrid Multicore Computing, Zürich, Switzerland

Perri Needham,

San Diego Supercomputer Center, UCSD, La Jolla, CA, USA

Risto M. Nieminen,

Department of Applied Physics, Aalto University, Espoo, Finland

Chris O'Grady,

SUNCAT Center for Interface Science and Catalysis, SLAC National Accelerator Laboratory, Menlo Park, CA, USA

Roberto Olivares-Amaya,

Department of Chemistry, Princeton University, Princeton, NJ, USA

Hans van Schoot,

Scientific Computing & Modeling NV, Theoretical Chemistry, Vrije Universiteit, Amsterdam, The Netherlands

Ole Schütt,

Department of Materials, ETH Zürich, Zürich, Switzerland

C. David Sherrill,

Center for Computational Molecular Science and Technology, Georgia Institute of Technology, Atlanta, GA, USA; School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, GA, USA; School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA

Aaron Sisto,

Department of Chemistry and the PULSE Institute, Stanford, CA, USA; SLAC National Accelerator Laboratory, Menlo Park, CA, USA

Dusan Stosic,

Department of Computer Science, Federal University of Pernambuco, Recife, Brazil

Sarah Tariq,

NVIDIA Corporation, Santa Clara, CA, USA

Walter Thiel,

Max-Planck-Institut für Kohlenforschung, Mülheim an der Ruhr, Germany

Antonino Tumeo,

Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory, Richland, WA, USA

Joost VandeVondele,

Department of Materials, ETH Zürich, Zürich, Switzerland

Brice Videau,

Université Joseph Fourier – Laboratoire d'Informatique de Grenoble – INRIA, Grenoble, France

Oreste Villa,

Nvidia, Santa Clara, CA, USA

Lucas Visscher,

Amsterdam Center for Multiscale Modeling (ACMM), Theoretical Chemistry, VU University Amsterdam, Amsterdam, The Netherlands

Ross C. Walker,

San Diego Supercomputer Center, UCSD, La Jolla, CA, USA; Department of Chemistry and Biochemistry, UCSD, CA, USA

Mark A. Watson,

Department of Chemistry, Princeton University, Princeton, NJ, USA

Xin Wu,

Max-Planck-Institut für Kohlenforschung, Mülheim an der Ruhr, Germany

Jun Yan,

SUNCAT Center for Interface Science and Catalysis, SLAC National Accelerator Laboratory, Menlo Park, CA, USA

Preface

The last decade has seen tremendous growth in the use of graphics processing units (GPUs) for numerical simulations, spanning all fields of science. Originally designed for use as coprocessors in graphics applications and visualizations, GPUs have made their way into both mainstream computing platforms and many of the most powerful supercomputers. While substantial effort has gone into the hardware design of GPUs, their potential for scientific computation has only been realized due to the concurrent development of specialized programming approaches and the redesign of the underlying numerical algorithms for massively parallel processors.

Electronic structure calculations are computationally intensive, and the field has a long history of pushing the envelope in high-performance computing. This tradition has continued with the rise of GPUs. Many researchers have invested significant effort in developing software implementations that exploit the computational power of GPUs. This book pays tribute to these developments by collating these efforts into a single reference text.

We have designed this book to provide an introduction to the fast-growing field of electronic structure calculations on massively parallel GPUs. The target audience is graduate students and senior researchers in the fields of theoretical and computational chemistry, condensed matter physics, and materials science, who are looking for an accessible overview of the field, as well as software developers looking for an entry point into GPU and hybrid GPU/CPU programming for electronic structure calculations. To this end, the book provides an overview of GPU computing, a brief introduction to GPU programming, the essential background in electronic structure theory, and the latest examples of code developments and applications for the most widely used electronic structure methods.

We have tried to include all widely used electronic structure methods for which GPU implementations have been developed. The text covers all commonly used basis sets including localized Gaussian- and Slater-type basis functions, plane waves, wavelets, and real-space grid-based approaches. Several chapters expose details on strategies for the calculation of two-electron integrals, exchange-correlation quadrature, Fock matrix formation, solution of the self-consistent field equations, calculation of nuclear gradients to obtain forces, and methods to treat excited states within density functional theory. Other chapters focus on semiempirical methods and correlated wave function methods including density-fitted second-order Møller–Plesset perturbation theory and both iterative and perturbative single- and multireference coupled-cluster methods.

We have enjoyed the steep learning curve that has accompanied the editing of this book, and we trust that you, the reader, will find it an engaging and useful reference.

Ross C. Walker and Andreas W. Götz
August 2015
La Jolla, USA

Acknowledgments

We would like to thank everybody who has made a contribution to this book, either directly or indirectly. This includes our friends and families for their continuous support. Our special thanks go to the dedicated people at Wiley who guided us in our role as editors and worked hard for this book to see the light of day. In particular, we thank our primary contacts Sarah Keegan, Sarah Higginbotham, and Rebecca Ralf. We are also grateful to Dorathy Steve and her team at SPi for copy-editing. This book would never have been possible without the excellent contributions of the many individual authors. We are grateful that they agreed to write chapters for this book, and we thank them all for their patience during the editing period.

Glossary

The following provides a detailed glossary of GPU- and GPU-programming-related terms. It is biased towards NVIDIA GPUs and the CUDA programming model; however, AMD GPUs use similar concepts and hardware implementations. For instance, the equivalent of a CUDA warp on NVIDIA hardware is called a wavefront on AMD hardware. This is not meant to be an exhaustive list of terms but rather provides a brief description of the various GPU-related technical terms that appear in this book.

Bandwidth

The rate at which data can be transferred, equal to the inverse of the time required to transfer one byte of data. Usually measured in GB/s.

Block

A set of threads that can share data and communicate during execution on the GPU. Threads in different blocks cannot be synchronized with each other during kernel execution. A block executes on a single SM. To optimize performance, the block size needs to be adjusted to the problem and the hardware (e.g., available shared memory). The number of threads in a block is limited to 1024 on the Kepler architecture. Thread blocks are created and executed in units of a warp; thus, the number of threads should be a multiple of the warp size (currently 32). Each thread block has a unique block ID within its grid.
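As a minimal sketch (a hypothetical helper, not code from this book) of the sizing considerations above, the following CUDA host code assumes a block size that is a multiple of the warp size and rounds the grid size up so that all n elements are covered; the kernel name myKernel is a placeholder:

    #include <cuda_runtime.h>

    // Hypothetical helper: derive the grid size for a 1D problem of n elements,
    // given a block size that is a multiple of the warp size (32).
    dim3 gridFor(int n, int threadsPerBlock) {
        return dim3((n + threadsPerBlock - 1) / threadsPerBlock);  // round up
    }

    // Usage (kernel defined elsewhere, with an internal i < n bounds check):
    // myKernel<<<gridFor(n, 256), 256>>>(d_data, n);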

Cache

Fast memory that is used to reduce latency for global memory access. On NVIDIA GPUs with Kepler architecture, the SMs share a cache of 1.5 MB size. This can be considered L2 cache since each SM has an L1 cache that is called shared memory.

Constant memory

Fast read-only memory that can be written by the host and accessed by all SMs. Only a limited amount of constant memory is available.

Device

The GPU including its processor and memory. The device cannot operate on data that is located on the host.

Global memory

The memory that is available on the GPU. Comparable to main memory on the host. Data access to global memory is cached but is slow compared to other memory classes on the GPU due to higher latency. Compared to host memory, the global device memory supports high data bandwidth. On current NVIDIA hardware with Kepler architecture, the data path is 512 bits wide; thus, 16 consecutive 32-bit words can be fetched in a single cycle. As a consequence, there is considerable bandwidth degradation for strided memory access. For instance, a stride-two access will fetch 512 bits but use only half of them. There is less device memory than host memory, at present up to 12 GB on NVIDIA Tesla K40. Global memory can be accessed by the host for data transfers between the host and the device. Global memory is persistent between kernel launches.
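To illustrate the strided-access penalty described above, here is a hedged, hypothetical kernel pair (not from this book): in the first, consecutive threads of a warp read consecutive 32-bit words, so each fetched memory segment is fully used; in the second, a stride-two access pattern uses only half of every fetched segment:

    #include <cuda_runtime.h>

    // Coalesced access: thread i reads element i of the input array.
    __global__ void copyCoalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Stride-two access: thread i reads element 2*i, wasting half of each fetch.
    __global__ void copyStride2(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (2 * i < n) out[i] = in[2 * i];
    }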

Grid

A set of blocks that is mapped to the streaming multiprocessors on the GPU and executes a kernel. The order of execution of the blocks on a GPU is not deterministic. In the Kepler architecture, 16 blocks can be active at the same time on a single multiprocessor.

Host

The CPU and its main memory. The host cannot operate on data that is located on the device. A program running on the host can transfer data to/from the device and launch kernels on the device.

Kernel

A function that executes in parallel on the GPU. NVIDIA GPUs are programmed as a sequence of kernels that are launched by the host program. By default, a kernel completes execution before the start of the next kernel with an implicit synchronization barrier. Usually, kernels execute a sufficiently large number of thread blocks to occupy all SMs of a GPU. However, the Kepler architecture supports simultaneous execution of multiple independent kernels at the same time. Kernel launches execute multiple threads that are arranged in a grid of blocks.
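As a hedged, self-contained sketch of the kernel concept (a generic vector-add example, not taken from this book), the code below defines a kernel with the __global__ specifier, launches it in a grid of blocks, and waits for completion:

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Hypothetical kernel: each thread handles one element.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) c[i] = a[i] + b[i];                  // guard the padded tail
    }

    int main(void) {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *a, *b, *c;
        cudaMalloc(&a, bytes);
        cudaMalloc(&b, bytes);
        cudaMalloc(&c, bytes);

        int threads = 256;                              // threads per block
        int blocks = (n + threads - 1) / threads;       // blocks per grid
        vecAdd<<<blocks, threads>>>(a, b, c, n);        // kernel launch by the host

        cudaDeviceSynchronize();                        // wait for the kernel to finish
        printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }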

Latency

The time it takes from the issue of a memory operation to the arrival of the first bit. Usually measured in µs.

Latency hiding

Techniques to deal with high latency of data transfer between the host and device or access to global device memory. For example, a device memory operation issued by threads in a warp will take very long due to latency on the order of hundreds of clock cycles. CPU architectures make use of a cache memory hierarchy to reduce latency; however, this is not effective on GPUs, which are designed for throughput computing. GPUs instead deal with this latency by using a high degree of multithreading. At a given point in time, up to 64 warps can be active on each multiprocessor in the Kepler architecture. While one warp is waiting for a memory operation to complete, the control unit switches to another warp. Thus, all cores can continue computing if the parallelism on each SM is sufficiently large.

Local memory

Slow memory that is located off-chip and has the same latency as global memory. It is used to hold automatic variables for cases in which there is not sufficient register memory available. Variables stored in local memory are private to each thread.

Register memory

Very fast on-chip memory, faster than shared memory. Used to store local variables that are private to each thread. On the Kepler architecture, a thread can access up to 255 32-bit registers; however, there is a total of only 65,536 32-bit registers on an SM. Ideally, all local variables used by a thread reside in registers on chip. The limited number of registers thus limits the number of concurrent threads. Memory-intensive kernels can spill data to local memory (this is termed register spillage) with an adverse effect on performance due to the high latency of local memory.

Register spillage

Term used when memory-intensive kernels require more storage than is available in registers, forcing data into local memory, which usually has a detrimental effect on performance.

Shared memory

Shared memory on NVIDIA GPUs is a fast on-chip memory, essentially a programmable L1 cache attached to each SM. It has low latency and high bandwidth with speed that is close to that of registers. On NVIDIA Kepler architecture, the shared memory is 64 KB for each SM and can be configured as 25%, 50% or 75% software managed cache with the remainder as hardware data cache. Data stored in shared memory can be accessed by all threads in the same thread block and persists only for the lifetime of the execution of a block. Since it is a limited resource per SM, its use limits the number of blocks that can be concurrently executed. The host cannot access shared memory.
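The scratchpad role of shared memory can be sketched with a generic block-level reduction (a hypothetical example, not from this book); the __shared__ tile is visible to all threads of the block, filled from global memory, and consumed after a __syncthreads() barrier:

    #include <cuda_runtime.h>

    // Hypothetical block-wise sum reduction; launch with 256 threads per block.
    __global__ void blockSum(const float *in, float *blockSums, int n) {
        __shared__ float tile[256];                    // one slot per thread of the block
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        tile[tid] = (i < n) ? in[i] : 0.0f;            // stage data from global memory
        __syncthreads();                               // make the full tile visible

        for (int s = blockDim.x / 2; s > 0; s >>= 1) { // tree reduction within the block
            if (tid < s) tile[tid] += tile[tid + s];
            __syncthreads();
        }
        if (tid == 0) blockSums[blockIdx.x] = tile[0]; // one partial sum per block
    }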

SIMD

Single instruction multiple data. A SIMD processing unit executes single instructions on multiple data. Branching is not possible.

SIMT

Single instruction multiple threads. Model for data parallel computing on GPUs. Each core in an SM can execute a sequential thread, but all cores in a group called a warp execute the same instruction at the same time, similar to classical SIMD processors. Branching is possible, but for conditional operations some of the cores in a warp are disabled, resulting in no-ops.

Stream

Sequence of commands that execute in order. Multiple streams can be used to execute kernels simultaneously or to overlap kernel execution with memory copies between host and device.
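A minimal, hypothetical sketch (not from this book) of the overlap pattern described above: chunks of work alternate between two streams, so the asynchronous host-to-device copy of one chunk can overlap with the kernel processing another. The kernel and buffer names are placeholders, and the host buffer is assumed to be page-locked (pinned) so that cudaMemcpyAsync can actually overlap:

    #include <cuda_runtime.h>

    __global__ void process(float *x, int n) {           // placeholder kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    // d_buf[0] and d_buf[1] are preallocated device buffers of 'chunk' floats each;
    // h_in is pinned host memory holding nChunks * chunk floats.
    void runChunks(const float *h_in, float *d_buf[2], int chunk, int nChunks) {
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);
        for (int c = 0; c < nChunks; ++c) {
            int k = c % 2;                                // alternate streams and buffers
            cudaMemcpyAsync(d_buf[k], h_in + (size_t)c * chunk,
                            chunk * sizeof(float), cudaMemcpyHostToDevice, s[k]);
            process<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d_buf[k], chunk);
        }
        cudaStreamSynchronize(s[0]);                      // wait for both streams to drain
        cudaStreamSynchronize(s[1]);
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
    }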

Streaming Multiprocessor (SM)

Set of processing cores (ALUs) on a GPU that have access to a common shared memory space. SMs on the NVIDIA GK110 chip, for example, contain 192 single-precision cores and 64 double-precision cores. Groups of 16 cores execute operations of a group of threads in a warp in lockstep. A maximum of 64 warps (2048 threads) can be active at the same time on an SM (see also Latency Hiding). These warps can belong to a maximum of 16 different thread blocks. SMs operate at approximately 1 GHz clock speed, thus at a lower speed than typical CPUs.

Texture memory

Read-only memory that can be written by the host. Texture memory resides in device global memory but is cache-optimized for certain read operations, for example, two-dimensional arrays.

Thread (software)

In the context of GPU programming, a thread is a sequence of instructions to be executed by a GPU processing element. On a GPU, threads are grouped into blocks, and threads within a block are executed in lockstep in units of a warp. Each thread thus executes an instance of a kernel. Each thread has an ID within its thread block and grid, a program counter, registers, and per-thread private memory.

Warp

Lock-step unit on a GPU. Threads within a warp execute in lock-step, that is in SIMD fashion. However, branching is allowed. Each warp should access a single cache line. A warp always consists of a subset of threads of a block. A warp consists of 32 threads (this number has remained constant so far but is subject to change). On a Kepler SM, a warp takes two cycles to execute one integer or single-precision floating point instruction on each group of 16 cores. At most 4 of the 12 groups of cores in a Kepler SM can execute double-precision instructions concurrently. At most 2 of the 12 groups of cores can concurrently execute intrinsic and transcendental functions.
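Warp divergence can be sketched with a hypothetical kernel pair (not from this book): in the first kernel, even and odd lanes of the same warp take different branches, so both paths are executed with half the lanes disabled; in the second, the branch condition is constant within each warp, so no lanes are masked off:

    #include <cuda_runtime.h>

    // Divergent within a warp: neighboring threads take different branches.
    __global__ void divergent(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (threadIdx.x % 2 == 0) x[i] += 1.0f;
        else                      x[i] -= 1.0f;
    }

    // Uniform per warp: all 32 threads of a warp evaluate the condition identically.
    __global__ void warpUniform(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if ((threadIdx.x / 32) % 2 == 0) x[i] += 1.0f;
        else                             x[i] -= 1.0f;
    }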