Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics provides an overview of computing on graphics processing units (GPUs), a brief introduction to GPU programming, and the latest examples of code developments and applications for the most widely used electronic structure methods.
The book covers all commonly used basis sets including localized Gaussian- and Slater-type basis functions, plane waves, wavelets, and real-space grid-based approaches.
The chapters present details on the calculation of two-electron integrals, exchange-correlation quadrature, Fock matrix formation, solution of the self-consistent field equations, calculation of nuclear gradients to obtain forces, and methods to treat excited states within DFT. Other chapters focus on semiempirical and correlated wave function methods, including density-fitted second-order Møller–Plesset perturbation theory and both iterative and perturbative single- and multireference coupled-cluster methods.
Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics presents an accessible overview of the field for graduate students and senior researchers of theoretical and computational chemistry, condensed matter physics and materials science, as well as software developers looking for an entry point into the realm of GPU and hybrid GPU/CPU programming for electronic structure calculations.
Page count: 770
Year of publication: 2016
Cover
Title Page
Copyright
List of Contributors
Preface
Acknowledgments
Glossary
Abbreviations - Scientific
Abbreviations - Technical
Chapter 1: Why Graphics Processing Units
1.1 A Historical Perspective of Parallel Computing
1.2 The Rise of the GPU
1.3 Parallel Computing on Central Processing Units
1.4 Parallel Computing on Graphics Processing Units
1.5 GPU-Accelerated Applications
References
Chapter 2: GPUs: Hardware to Software
2.1 Basic GPU Terminology
2.2 Architecture of GPUs
2.3 CUDA Programming Model
2.4 Programming and Optimization Concepts
2.5 Software Libraries for GPUs
2.6 Special Features of CUDA-Enabled GPUs
References
Chapter 3: Overview of Electronic Structure Methods
3.1 Introduction
3.2 Hartree–Fock Theory
3.3 Density Functional Theory
3.4 Basis Sets
3.5 Semiempirical Methods
3.6 Density Functional Tight Binding
3.7 Wave Function-Based Electron Correlation Methods
Acknowledgments
References
Chapter 4: Gaussian Basis Set Hartree–Fock, Density Functional Theory, and Beyond on GPUs
4.1 Quantum Chemistry Review
4.2 Hardware and CUDA Overview
4.3 GPU ERI Evaluation
4.4 Integral-Direct Fock Construction on GPUs
4.5 Precision Considerations
4.6 Post-SCF Methods
4.7 Example Calculations
4.8 Conclusions and Outlook
References
Chapter 5: GPU Acceleration for Density Functional Theory with Slater-Type Orbitals
5.1 Background
5.2 Theory and CPU Implementation
5.3 GPU Implementation
5.4 Conclusion
References
Chapter 6: Wavelet-Based Density Functional Theory on Massively Parallel Hybrid Architectures
6.1 Introductory Remarks on Wavelet Basis Sets for Density Functional Theory Implementations
6.2 Operators in Wavelet Basis Sets
6.3 Parallelization
6.4 GPU Architecture
6.5 Conclusions and Outlook
References
Chapter 7: Plane-Wave Density Functional Theory
7.1 Introduction
7.2 Theoretical Background
7.3 Implementation
7.4 Optimizations
7.5 Performance Examples
7.6 Exact Exchange with Plane Waves
7.7 Summary and Outlook
7.8 Acknowledgments
References
Appendix A: Definitions and Conventions
Appendix B: Example Kernels
Chapter 8: GPU-Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory
8.1 Introduction
8.2 Software Architecture for GPU-Acceleration
8.3 Maximizing Asynchronous Progress
8.4 Libcusmm: GPU Accelerated Small Matrix Multiplications
8.5 Benchmarks and Conclusions
Acknowledgments
References
Chapter 9: Grid-Based Projector-Augmented Wave Method
9.1 Introduction
9.2 General Overview
9.3 Using GPUs in Ground-State Calculations
9.4 Time-Dependent Density Functional Theory
9.5 Random Phase Approximation for the Correlation Energy
9.6 Summary and Outlook
Acknowledgments
References
Chapter 10: Application of Graphics Processing Units to Accelerate Real-Space Density Functional Theory and Time-Dependent Density Functional Theory Calculations
10.1 Introduction
10.2 The Real-Space Representation
10.3 Numerical Aspects of the Real-Space Approach
10.4 General GPU Optimization Strategy
10.5 Kohn–Sham Hamiltonian
10.6 Orthogonalization and Subspace Diagonalization
10.7 Exponentiation
10.8 The Hartree Potential
10.9 Other Operations
10.10 Numerical Performance
10.11 Conclusions
10.12 Computational Methods
Acknowledgments
References
Chapter 11: Semiempirical Quantum Chemistry
11.1 Introduction
11.2 Overview of Semiempirical Methods
11.3 Computational Bottlenecks
11.4 Profile-Guided Optimization for the Hybrid Platform
11.5 Performance
11.6 Applications
11.7 Conclusion
Acknowledgement
References
Chapter 12: GPU Acceleration of Second-Order Møller–Plesset Perturbation Theory with Resolution of Identity
12.1 Møller–Plesset Perturbation Theory with Resolution of Identity Approximation (RI-MP2)
12.2 A Mixed-Precision Matrix Multiplication Library
12.3 Performance of Accelerated RI-MP2
12.4 Example Applications
12.5 Conclusions
References
Chapter 13: Iterative Coupled-Cluster Methods on Graphics Processing Units
13.1 Introduction
13.2 Related Work
13.3 Theory
13.4 Algorithm Details
13.5 Computational Details
13.6 Results
13.7 Conclusions
Acknowledgments
References
Chapter 14: Perturbative Coupled-Cluster Methods on Graphics Processing Units: Single- and Multi-Reference Formulations
14.1 Introduction
14.2 Overview of Electronic Structure Methods
14.3 NWChem Software Architecture
14.4 GPU Implementation
14.5 Performance
14.6 Outlook
Acknowledgments
References
Scientific Index
Technical Index
End User License Agreement
Chapter 1: Why Graphics Processing Units
Figure 1.1 Photograph taken in 1957 at NASA featuring an IBM 704 computer, the first commercially available general-purpose computer with floating-point arithmetic hardware [4]
Figure 1.2 Photograph of Burroughs Corporation's D825 parallel computer [6]
Figure 1.3 Microprocessor transistor counts 1971–2011. Until recently, the number of transistors on integrated circuits has been following Moore's Law [16], doubling approximately every 2 years
Figure 1.4 Illustration of the ever-increasing power density within silicon chips, with decreasing gate length.
Figure 1.5 Peak floating-point operations per second (a) and memory bandwidth (b) for Intel CPUs and Nvidia GPUs.
Figure 1.6 Directed acyclic graph demonstrating the decomposition of four tasks across two processors
Figure 1.7 Illustration of a GPU working in conjunction with a CPU as work is offloaded and computed on a GPU concurrently with CPU execution [24]
Figure 1.8 Cartoon representation of the dihydrofolate reductase enzyme in water (23,508 atoms)
Figure 1.9 Computational performance of classical molecular dynamics simulations with the CPU and GPU versions of Amber on varying hardware configurations measured in nanosecond per day. The results are for a standard benchmark (FactorIX enzyme in explicit solvent, 90,906 atoms, NVE ensemble, 2 fs time step) [30]
Figure 1.10 Performance acceleration for Adobe Mercury Playback Engine on GPUs. System configuration: Adobe Premier Pro CC, Windows 7 – 64-bit, Dual Intel Xeon E5 2687 W 3.10 GHz CPUs (16 total cores). Test consists of HD video workflow with complex Mercury Playback Engine effects at 720p resolution. Results based on final output render time comparing noted GPU to CPU [31]
Figure 1.11 Performance data for Adobe After Effect CC Engine on GPUs. System configuration: Adobe After Effects CC, Windows 7 – 64-bit, Dual Intel Xeon E5 2687 W 3.10 GHz CPUs (16 total cores). Test consists of live After Effect CC scenes with 3D layer, comparing time to render ray-traced 3D scene on noted GPU versus CPU [32]
Chapter 2: GPUs: Hardware to Software
Figure 2.1 General overview of CPU–GPU hardware configuration.
Figure 2.2 Illustration highlighting the movement of data between memory locations on the host and the device.
Figure 2.3 Nvidia Kepler architecture memory model. Global memory and L2 cache can be accessed by all SMs. L1 cache and shared memory (SMEM) are accessible to threads running on the same SM.
Figure 2.4 Example of C for CUDA kernel code and kernel launch code. The declaration specifier __global__ marks the kernel for execution on the GPU, while the execution configuration (grid and block size) is contained in triple angle brackets
Figure 2.5 Thread hierarchy. Blocks of threads are configured into a grid of thread blocks that logically map to the underlying GPU hardware with blocks being executed on individual SMs.
Figure 2.6 Half-warp of threads accessing 16 shared memory banks
Figure 2.7 Visual representation of unified memory. Host and device memory is presented as a single address space. This makes programming easier, but since memory copies still happen behind the scenes, performance can be lower compared to explicitly managed memory.
Chapter 4: Gaussian Basis Set Hartree–Fock, Density Functional Theory, and Beyond on GPUs
Figure 4.1 Schematic of one-block one-contracted Integral (1B1CI) mapping. Cyan squares on left represent contracted ERIs each mapped to the labeled CUDA block of 64 threads. Orange squares show mapping of primitive ERIs to CUDA threads (green and blue boxes, colored according to CUDA warp) for two representative integrals, the first a contraction over a single primitive ERI and the second involving 3⁴ = 81 primitive contributions.
Figure 4.2 Schematic of one-thread one-contracted Integral (1T1CI) mapping. Cyan squares represent contracted ERIs and CUDA threads. Thread indices are shown in parentheses. Each CUDA block (red outlines) computes 16 ERIs, with each thread accumulating the primitives of an independent contraction, in a local register.
Figure 4.3 Schematic of one-thread one-primitive integral (1T1PI) mapping. Cyan squares represent two-dimensional tiles of 16 × 16 primitive ERIs, each of which is assigned to a 16 × 16 CUDA block as labeled. Red lines indicate divisions between contracted ERIs. The orange box shows assignment of primitive ERIs to threads (gray squares) within a block that contains contributions to multiple contractions.
Figure 4.4 ERI grids colored by angular momentum class for a system containing four s-shells and one p-shell. Each square represents all ERIs for a shell quartet. (a) Grid when bra and ket pairs are ordered by simple loops over shells. (b) ERI grid for same system with bra and ket pairs sorted by angular momentum, ss, then sp, then pp. Each integral class now handles a contiguous chunk of the total ERI grid.
Figure 4.5 Organization of ERIs for Coulomb formation. Rows and columns correspond to primitive bra and ket pairs, respectively. Each ERI is colored according to the magnitude of its Schwarz bound. Data are derived from calculation on ethane molecule. Figure (a) obtained by arbitrary ordering of pairs within each angular momentum class and suffers from load imbalance because large and small integrals are computed in neighboring cells, and (b) that sorts bra and ket primitives by Schwarz contribution within each momentum class, providing an efficient structure for parallel evaluation
Figure 4.6 Schematic representation of a J-Engine kernel for one angular momentum class, for example, (ss|ss). Cyan squares represent significant ERI contributions. Sorted bra and ket vectors are represented by triangles to the left and above the grid. The path of a 2 × 2 block as it sweeps across the grid is shown in orange. The final reduction across rows of the block is illustrated within the inset to the right.
Figure 4.7 Schematic of a K-Engine kernel. Bra and ket PQ arrays are represented by triangles to the left and above the grid. The pairs are grouped by and index and then sorted by bound. The paths of four blocks are shown in orange, with the zigzag pattern illustrated by arrows in the top right. The final reduction of an exchange element within a 2 × 2 block is shown to the right.
Figure 4.8 Organization of double- and single-precision workloads within Coulomb ERI grids. As in Figure 4.5, rows and columns correspond to primitive bra and ket pairs. (a) Each ERI is colored according to the magnitude of its Schwarz bound. (b) ERIs are colored by required precision. Yellow ERIs require double precision, while those in green may be evaluated in single precision. Blue ERIs are neglected entirely.
Figure 4.9 Relative error in final energies versus precision threshold for various basis sets. Test molecules are shown on the right. Each point is averaged over the five test systems. Error bars represent 2 standard deviations above the mean. The black line shows the empirical error bound of Eq. (4.64)
Figure 4.10 One-dimensional alkene and three-dimensional water-cube test systems. Alkene lengths vary from 24 to 706 carbon atoms and water cubes range from 10 to nearly 850 water molecules. A uniform density is used for all water boxes
Figure 4.11 First SCF iteration timings in seconds for (a) linear alkenes and (b) cubic water clusters. Total times are further broken down into J-Engine, K-Engine, distance-masked K-Engine, linear algebra (LA), and DFT exchange–correlation contributions. For water clusters, total SCF times are shown for both the naïve and distance-masked (mask) K-Engine. All calculations were performed using a single Tesla M2090 GPU and the 6-31G basis set. Power fits show scaling with increasing system size, and the exponent for each fit is provided in the legend.
Figure 4.12 Multi-GPU parallel efficiency for J-Engine, K-Engine, and exchange–correlation Fock formation based on first iteration time for water clusters, run on 2 M2090 GPUs
Figure 4.13 Total SCF time of TeraChem on eight CPUs and four GPUs, relative to GAMESS on eight CPUs for water clusters
Figure 4.14 Absorption spectrum of the nanostar dendrimer with peripheral chromophore absorption peaks calculated (A, B) in vacuum and measured experimentally (A_exp, B_exp) at 300 K in hexane
Chapter 5: GPU Acceleration for Density Functional Theory with Slater-Type Orbitals
Figure 5.1 Schematic representation of the CUDA kernel and wrapper routine for the numerical integration of Fock matrix elements
Figure 5.2 Comparison of the time required by the GPU and CPU code for computing contributions to the Fock matrix through numerical integration
Figure 5.3 Schematic representation of the hybrid CPU/GPU algorithm for the MPI parallel Fock matrix calculation
Figure 5.4 Timings and speedup for the calculation of Hessian matrix elements for an analytical frequency calculation on the budesonide molecule (C25H34O6) with BLYP/TZ2P and frozen core setting “small,” using a Becke grid quality of “good” in ADF
Figure 5.5 Timings and speedup for the calculation of matrix elements of the Fock matrix derivatives (routine f1u1_ai_gga) required for analytical frequency calculation. Test system is the budesonide molecule (C25H34O6) with BLYP/TZ2P and frozen core setting “small,” using a Becke grid quality of “good” in ADF
Chapter 6: Wavelet-Based Density Functional Theory on Massively Parallel Hybrid Architectures
Figure 6.1 Least asymmetric Daubechies wavelet family of order . Note that both the scaling function and the wavelet are different from zero only within the interval
Figure 6.2 Schematic representation of the application of the Hamiltonian in the BigDFT formalism
Figure 6.3 Simulation grid points
Figure 6.4 Orbital (a) and coefficient (b) distribution schemes
Figure 6.5 Comparison of the performance of BigDFT on different platforms. Runs on the CCRT machine are worse in scalability but better in absolute performance per compute core than runs on the CSCS machine (1.6–2.3 times faster)
Figure 6.6 Upper panel: Data distribution for 1D convolution+transposition on the GPU. See Section 6.4.2 for details. Lower panel: Reproduction of the portion of the input data highlighted in gray in the upper panel
Figure 6.7 Left panel: Speedup for the GPU version of the fundamental operations on the wave functions. Right panel: Double-precision speedup for the GPU version of the 3D operators used in the BigDFT code as a function of the single wave function size
Figure 6.8 (a) Speedup of the BigDFT code for a four-carbon-atom supercell (graphene) with 164 k-points. The calculation is performed with eight MPI processes on the CEA-DAM INTI machine, based on Westmere processors and Nvidia Fermi GPUs. For each run, the number of equivalent MPI processes is indicated, given that the parallel efficiency of this run is 98%. Also, the efficiency of the GPU acceleration is presented. (b) Speedup of the same run on different hybrid architectures in combination with MPI runs
Figure 6.9 Relative speedup of the hybrid DFT code with respect to the equivalent pure CPU run. Different runs for simulations of increasing system size have been performed on an Intel X5472 3 GHz (Harpertown) machine, with a Fermi GPU card
Figure 6.10 Massively parallel runs for a cobalt metalloporphyrin on graphene sheet (265 atoms system) with surfaces BC. In the bottom panel, the simulations have been accelerated with Kepler GPU cards. Interesting speedups can be achieved
Chapter 7: Plane-Wave Density Functional Theory
Figure 7.1 Schematic representation of three approaches for porting loops on GPUs: (a) asynchronous kernel launches; (b) streaming kernel launches; and (c) batched kernel launches
Figure 7.2 Using multiple process accelerated by GPUs communicating with MPI.
Figure 7.3 supercell of crystalline Si
Figure 7.4 Gold MD snapshot at 1600 K
Figure 7.5 Boron in hR105 (β-rhombohedral) structure
Chapter 8: GPU-Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory
Figure 8.1 Workflow for a self-consistent electronic structure calculation, illustrating both the use of traditional as well as methods
Figure 8.2 Direct comparison of the time needed for calculations on bulk liquid water using linear scaling and diagonalization-based SCF procedures. For matrices larger than 20,000 basis functions, a speedup is observed (filtering threshold 10⁻⁵). Note that for linear scaling approaches, the time needed depends not only on the matrix size but also on the sparsity pattern, and hence better quality basis sets typically have a larger relative cost
Figure 8.3 A schematic representation of the software architecture employed in the GPU accelerated DBCSR library. The various layers correspond to key steps in the matrix multiplication algorithm. While the Cannon layer is essential for the parallelism between processes or on the cluster level, the lower layers deal with parallelism and acceleration on the node level
Figure 8.4 Enabling concurrency and enforcing dependencies in DBCSR. Multiple streams are used to transfer data from the host to the device, and to process independent stacks. Dependencies between the streams, for example, a panel upload and stack calculations, and between host and device, for example, device buffer reuse, are enforced using events
Figure 8.5 Schematic representation of the double buffered Cannon algorithm, which illustrates how the use of two host and two device buffers for the and panels enables overlapping of message passing, host to device memory copies and computations. The ratio of the time needed for the important steps of the algorithm, depends on the hardware and on the science problem at hand
Figure 8.6 Minimum (dotted line) and maximum (solid line) arithmetic intensity for different matrix sizes commonly employed in CP2K simulations, and the corresponding maximum possible flop rate. The performance as obtained from individual kernel launches in a mini-app is shown as bars
Figure 8.7 Inner-product (a) and outer-product (b) form of matrix multiplication. The yellow areas in indicate elements that can be computed independently by accessing the highlighted areas of and .
Figure 8.8 (a) Parameterization of the -matrix product . Each thread computes an tile () of the result matrix . In order to accommodate matrix sizes larger than the available shared memory, matrices are processed in slabs (, ), with an input slab width . In order to optimize the data output, the matrices () are written back using the output slab width . (b) Close to the SM, registers are used to store the matrix tile, while slabs of , , and are stored in shared memory. (c) GPU memory stores all panel data, including the various blocks of , , , and the stack buffers .
Figure 8.9 Performance comparison of the multi-threaded DBCSR library based on matrix blocks, and was not using the MPI capabilities. The benchmark was run on a dual Sandy Bridge (E5-2620, 2.0 GHz, 6 cores) machine, equipped with one Nvidia Tesla K20 card
Figure 8.10 Aggregated nanoparticles in explicit solution (77,538 atoms) can be run on the Piz Daint computer (5272 hybrid compute nodes) at approximately 122 seconds per SCF step
Chapter 9: Grid-Based Projector-Augmented Wave Method
Figure 9.1 Multigrid V-cycle with three levels
Figure 9.2 Flowchart of the SCF loop
Figure 9.3 (a) Weak scaling performance of the CPU and GPU versions of the program using bulk Si systems. (b) The achieved speedups with GPU acceleration. The GPU runs used one CPU core per GPU.
Figure 9.4 Output of the Nvidia nvvp profiling tool for a portion of an RPA O calculation
Figure 9.5 Speedup (eight GPUs vs. eight CPU cores) as a function of the number of the () for some representative functions such as get wave functions (“get_wfs”), mapping wave functions between the 3D fft grid and reduced planewave grid (“mapG”), PAW corrections (“paw_P_ai”), batched cufft (“fft”), and cublas (“zherk” routine). For a full list of the GPU ported functions, refer to Ref. [49]. The test system is a /Ru(0001) surface, modeled with four layers of Ru in a unit cell. The speedup (timing) information comes from a summation of 1 -point (per core), 5 occupied, and 1486 unoccupied bands with an energy cutoff 150 eV. The total speedups (“Total”) in the optical limit () and other are also shown.
Chapter 10: Application of Graphics Processing Units to Accelerate Real-Space Density Functional Theory and Time-Dependent Density Functional Theory Calculations
Figure 10.1 Example of real-space grids adapted to the shape of a cis retinal molecule. The cubes mark the position of the grid points. For visualization purposes, we represent smaller and coarser grids than the ones used for actual calculations
Figure 10.2 Scheme illustrating the blocks of orbitals strategy for DFT on GPUs. (a) Operating on a single orbital might not provide enough parallelism for the GPU to perform efficiently. (b) By operating simultaneously over several orbitals there is a larger degree of data parallelism and there is less divergence among GPU threads.
Figure 10.3 Examples of different grid orders in 2D: (a) standard order, (b) grid ordered by small parallelepipedic subgrids or bricks, and (c) order given by a Hilbert space-filling curve
Figure 10.4 Effect of the optimization of the grid mapping for data locality on the numerical throughput of the Laplacian operator as a function of the size of the orbitals block. Spherical grid with 500k points. Computations with an AMD Radeon 7970 GPU
Figure 10.5 Division of the atoms of a C60 molecule in groups (represented by different colors) whose pseudo-potential spheres do not overlap.
Figure 10.6 Numerical throughput of the application of the pseudo-potentials nonlocal part as a function of the size of the block of orbitals (block-size). Calculation for -cyclodextrin with 256 orbitals and 260k grid points for one CPU and two GPUs.
Figure 10.7 Numerical throughput of the application of the Kohn–Sham Hamiltonian as a function of the size of the block of orbitals (block-size). Calculation for -cyclodextrin with 256 orbitals and 260k grid points for one CPU and two GPUs
Figure 10.8 Numerical throughput of (a) the orthogonalization procedure and (b) the subspace diagonalization as a function of the size of the block of orbitals (block-size) for different processors. Calculation for -cyclodextrin with 256 orbitals and 260k grid points
Figure 10.9 Numerical throughput of the fourth-order Taylor approximation to the exponential operator as a function of the size of the block of orbitals (block-size). Calculation for -cyclodextrin with 256 orbitals and 260k grid points for one CPU and two GPUs
Figure 10.10 Comparison of (a) the throughput and (b) calculation time achieved by our FFT Poisson solver as a function of the number of grid points for one CPU and two GPUs. The data is originally on main memory, so the time required to copy the input data to the GPU and copy back the result is included. The number of points corresponds to the spherical grid used by octopus, the FFT grid has a larger number of points. Following Ref. [123], the operation count for the FFTs is assumed to be
Figure 10.11 Numerical throughput of our CPU and GPU implementations as a function of the size of the block of orbitals (block-size). (a) Self-consistency cycle in a ground-state DFT calculation. (b) Real-time TDDFT propagation. -cyclodextrin molecule with 256 orbitals and 260k grid points
Figure 10.12 Performance of our CPU and GPU implementations for a set of 40 molecules of different sizes. (a) Numerical throughput of the self-consistency cycle. (b) Total execution time for a single-point energy calculation.
Figure 10.13 Performance of our CPU and GPU real-time TDDFT implementations for a set of 40 molecules of different sizes. (a) Numerical throughput of the real-time propagation. (b) Computational time required to propagate 1 attosecond.
Figure 10.14 Speed-up of the GPU calculation with respect to the CPU for different molecules as a function of the number of valence electrons. (a) Speed-up for the time spent in the SCF cycle of a ground-state DFT calculation (without considering initializations). (b) Speed-up for real-time TDDFT. Intel Core i7 3820 using 8 threads
Figure 10.15 Numerical performance comparison between our GPU implementation (octopus) and the terachem code. (a) Comparison of the total calculation time as a function of the number of valence electrons. (b) Speed-up of our implementation with respect to terachem (run time of terachem divided by the run time of octopus). The calculations are single-point energy evaluations performed on a set of 40 molecules, running on a Nvidia Tesla K20 GPU.
Chapter 11: Semiempirical Quantum Chemistry
Figure 11.1 Profiles of the OM3 calculations for the test proteins for the C[1C] computing setup
Figure 11.2 Speedups of the FDIAG subroutine in the OM3 calculations on the multi-CPU C[6C], C[12C], and hybrid CPU–GPU C[12C–1G] computing setups over the serial configuration
Figure 11.3 Speedups of the BORDER subroutine in the OM3 calculations on the multi-CPU C[6C], C[12C], and GPU-only C[1G] computing setups over the serial configuration
Figure 11.4 Speedups of the DIIS subroutine in the OM3 calculations on the multi-CPU C[6C], C[12C], and GPU-only C[1G] and C[2G] computing setups over the serial configuration
Figure 11.5 Speedups of the PDIAG subroutine in the OM3 calculations on the multi-CPU C[6C], C[12C], and GPU-only C[1G] and C[2G] computing setups over the serial configuration
Figure 11.6 Speedups of the ORTCOR subroutine in the OM3 calculations on the multi-CPU C[6C], C[12C], and GPU-only C[1G] and C[2G] computing setups over the serial configuration
Figure 11.7 Overall speedups of the OM3 calculations of test proteins on the multi-CPU C[6C], C[12C], and hybrid CPU–GPU C[12C–1G] and C[12C–2G] computing setups over the serial configuration
Figure 11.8 Profiles of the OM3 calculations for the test proteins on the C[12C–2G] computing setup
Figure 11.9 Experimental structures of (a) (PDB ID: 2AP7, 80% α-helix), (b) (PDB ID: 2EVQ, 50% β-strands), and (c) (PDB ID: 1LVR, 100% random coil). Only the backbone atoms are shown, with the atoms represented by black balls. Four dihedral angles (, , , and ) in a residue serve as stereochemical metrics, see the schematic sketch in (d)
Chapter 12: GPU Acceleration of Second-Order Møller–Plesset Perturbation Theory with Resolution of Identity
Figure 12.1 Pictorial representation of the heterogeneous MGEMM algorithm. (a, b) and matrices and their separation into matrices with large and small elements, given a cutoff parameter . (c) Components of the final matrix, where the light gray (green in ebook) component involves a dense matrix multiplication computed with CUBLAS-SGEMM, while the dark grey (blue in ebook) components involve sparse matrix multiplications computed with a DAXPY-like algorithm (explained in the text). The blocks follow the nomenclature shown in Eq. (12.20)
Figure 12.2 Speedup for various *GEMM calls as a function of matrix size. Most elements were in the range , with the “salt” values in the range . Times are scaled relative to running DGEMM on the CPU
Figure 12.3 RMS error in a single matrix element for various GEMM calls as a function of matrix size compared to CPU DGEMM. Background elements were in the range , with the “salt” values in the range or . MGEMM gives identical results for all parameters
Figure 12.4 Fraction of “large” elements as a function of the cutoff parameter for the taxol RI-MP2 matrices in steps 3 and 4 of the algorithm outlined in Section 12.1
Figure 12.5 Biochemical reactions studied. (a) Isomerization of dihydroxyacetone phosphate (DHAP) to glyceraldehyde-3-phosphate (G3P). (b) Fructose-1,6-biphosphate (FBP) to DHAP and G3P
Chapter 13: Iterative Coupled-Cluster Methods on Graphics Processing Units
Figure 13.1 A simple interleaved DGEMM algorithm. For a matrix multiplication, , the and dimensions are blocked to parallelize the algorithm over multiple GPUs and to guarantee that all blocks will fit in GPU global memory. The sum dimension, , is blocked in order to interleave computation and communication.
Figure 13.2 The (a) performance (in GF) of the blocked DGEMM algorithm outlined in Figure 13.1. The performance of Fermi and Kepler GPUs are illustrated relative to that of the Intel Core i7 3930K CPU in (b). The performance boost achieved by interleaving communication and computation is shown in (c)
Figure 13.3 Pseudocode for the DF/CD-CCSD procedure. The ladder diagram, , is evaluated on the GPU using the symmetric and antisymmetric tensors defined in Ref. [38] (here denoted by ()).
Figure 13.4 Performance in GF (10⁹ floating point operations per second) for different implementations of spin-free CCD. Results are given for both CPU and GPU hardware, and CPU BLAS routines utilize eight threads
Figure 13.5 Speedup in the average DF-CCSD iteration time for clusters of water molecules (represented by a cc-pVDZ basis set, using a cc-pVDZ-RI auxiliary basis set) using one Fermi C2070 GPU, two Fermi C2070 GPUs, or one Kepler K20c GPU. The speedup is defined relative to the average iteration time for DF-CCSD using all six cores of a Core i7-3930K processor
Chapter 14: Perturbative Coupled-Cluster Methods on Graphics Processing Units: Single- and Multi-Reference Formulations
Figure 14.1 Dimension-flattening optimization. The solid lines correspond to different two-dimensional regions. The dotted lines correspond to the mapping of the data blocks to the thread blocks. Tensors and are flattened into two-dimensional arrays, increasing utilization of thread blocks. Each thread block not necessarily works only on values from a single dimension of the original tensor
Figure 14.2 CPU–GPU hybrid implementation: execution steps involved in the noniterative triples correction. The T1, T2, and T3 tensors correspond to the singles, doubles, and triples amplitudes (SRCC) or to the reference-specific singles, doubles, and triples perturbative amplitudes (MRCC). The F1 and V2 tensors correspond to the one- and two-electron integrals, respectively. The steps are as follows: step 1, copy input blocks from Global Arrays to the CPU or GPU local memory; step 2, contract input blocks into an intermediate tensor block; step 3, reduce the intermediate tensor to compute the energy correction contribution; step 4, reduce the final energy correction across all CPUs and GPUs. One of the CPU cores (Core 0 in the image) manages communication and load balancing toward the GPU, and also performs some basic sequential operations that would be more expensive on the GPU.
Figure 14.3 Schematic representation of the GPU-enhanced reference-level parallelism. Separate processor groups are delegated to calculate the reference-specific parts of the iterative MRCCSD equations and reference-specific noniterative corrections due to triples.
Figure 14.4 Speedup of the noniterative MRCCSD(T) calculations for various configurations for block sizes varying from 12 through 22 with respect to the execution times using two CPUs and no GPUs
Chapter 2: GPUs: Hardware to Software
Table 2.1 Memory hierarchy in the CUDA programming model detailing the location, lifetime, and privacy of the five memory spaces
Chapter 4: Gaussian Basis Set Hartree–Fock, Density Functional Theory, and Beyond on GPUs
Table 4.1 Runtime comparison for evaluating ERIs of 64 H atom lattice using 1B1CI, 1T1CI, and 1T1PI methods
Table 4.2 Dynamic precision performance for RHF calculations on various molecules using the 6-31G basis set
Chapter 6: Wavelet-Based Density Functional Theory on Massively Parallel Hybrid Architectures
Table 6.1 Parallel run of BigDFT on a hybrid heterogeneous architecture
Table 6.2 Effect of MPI and OpenMP (OMP) parallelization and GPU acceleration on the complete molecular dynamics of 32 O molecules
Chapter 7: Plane-Wave Density Functional Theory
Table 7.1 Self-consistent electron density calculation of bulk Si with a supercell of 128 atoms using block Davidson diagonalization (, , )
Table 7.3 Self-consistent electron density calculation of bulk Si with a supercell of 512 atoms using block Davidson diagonalization (, , )
Table 7.4 Self-consistent electron density calculation of bulk Si with a supercell of 256 atoms using RMM-DIIS diagonalization (, , )
Table 7.5 Non-self-consistent bulk Si band structure calculation (, , )
Table 7.6 Ab-initio molecular dynamics of 64 gold atoms at 1600 K, seen in Figure 7.4 (, , )
Table 7.7 Structural relaxation of hR105 boron (, , )
Table 7.8 Structural relaxation of hR105 boron with exact exchange (, , )
Chapter 8: GPU-Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory
Table 8.1 Key quantities of three linear scaling benchmarks that are distributed with CP2K
Chapter 9: Grid-Based Projector-Augmented Wave Method
Table 9.1 Performance for ground-state calculations of carbon nanotubes of different lengths (times in seconds, speedup denoted as S-up)
Table 9.2 Bulk Au time-propagation example
Table 9.3 Speedup (eight GPUs/eight CPU cores) for as well as time required for an entire RPA calculation with a response function cutoff of 150 eV for different simulations
Chapter 11: Semiempirical Quantum Chemistry
Table 11.1 Proteins in the test set for the OM3 calculations
Table 11.2 Percentages () of computation time in the PDIAG subroutine consumed by FMO, JACOBI, and other tasks in the OM3 calculations on a single CPU core
Table 11.3 Speedups of the FMO and JACOBI steps in the PDIAG subroutine on the multi-CPU C[6C], C[12C], and GPU-only C[1G] and C[2G] computing setups over the serial setup
Table 11.4 Statistics (%) for in the most favored () and additionally allowed () regions of the Ramachandran plot and standard deviations () of and
Chapter 12: GPU Acceleration of Second-Order Møller–Plesset Perturbation Theory with Resolution of Identity
Table 12.1 Speedups using CUBLAS DGEMM relative to CPU DGEMM for biomolecules in a double-ζ basis
Table 12.2 Speedups using CUBLAS SGEMM and DGEMM and total energy errors relative to CPU DGEMM for various molecules in a double-ζ basis
Table 12.3 MGEMM speedups and total energy errors with respect to CPU DGEMM for various molecules in a double-ζ and triple-ζ basis
Table 12.4 Comparison of Gibbs reaction energies (in kcal/mol) using RI-MP2 with respect to DFT
Chapter 13: Iterative Coupled-Cluster Methods on Graphics Processing Units
Table 13.1 Comparison of CPU and GPU implementations of CCD
Table 13.2 Comparison of CPUᵃ and GPUᵇ implementations of CCSD
Table 13.3 Comparison of CPUᵃ and GPUᵇ implementations of CCSD in large basis sets
Table 13.4 Average iteration time (in seconds) for DF/CD-CCSD computations of adenine-thymine and a benzene trimer represented by the aug-cc-pVDZ basis and the uracil dimer represented by the aug-cc-pVTZ basisᵃ
Chapter 14: Perturbative Coupled-Cluster Methods on Graphics Processing Units: Single- and Multi-Reference Formulations
Table 14.1 Time comparison for CPU and CPU+GPU runs of the noniterative part of the CCSD(T) approach for the pentacene molecule in cc-pVDZ basis set
Editors
Ross C. Walker
San Diego Supercomputer Center and Department of Chemistry and Biochemistry, University of California, San Diego, USA
and
Andreas W. Götz
San Diego Supercomputer Center, University of California, San Diego, USA
This edition first published 2016
© 2016 John Wiley & Sons, Ltd
Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought
The advice and strategies contained herein may not be suitable for every situation. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.
Library of Congress Cataloging-in-Publication Data applied for.
ISBN: 9781118661789
A catalogue record for this book is available from the British Library.
Cover Image: Courtesy of the Editors
Ani Anciaux-Sedrakian,
Mechatronics, Computer Sciences and Applied Mathematics Division, IFP Energies nouvelles, Rueil-Malmaison Cedex, France
Xavier Andrade,
Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
Edoardo Aprà,
William R. Wiley Environmental Molecular Sciences Laboratory, Battelle, Pacific Northwest National Laboratory, Richland, WA, USA
Alán Aspuru-Guzik,
Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
Jeroen Bédorf,
Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
Kiran Bhaskaran-Nair,
William R. Wiley Environmental Molecular Sciences Laboratory, Battelle, Pacific Northwest National Laboratory, Richland, WA, USA
Damien Caliste,
Université Grenoble Alpes, INAC, Grenoble, France, and CEA, INAC, Grenoble, France
A. Eugene DePrince III,
Department of Chemistry and Biochemistry, Florida State University, Tallahassee, FL, USA
Thierry Deutsch,
Université Grenoble Alpes, INAC, Grenoble, France, and CEA, INAC, Grenoble, France
Jussi Enkovaara,
Department of Applied Physics, Aalto University, Espoo, Finland; CSC – IT Center for Science Ltd, Espoo, Finland
Paul Fleurat-Lessard,
Laboratoire de Chimie, Université de Lyon, ENS Lyon, Lyon, France; ICMUB, Université de Bourgogne Franche-Comté, Dijon, France
Luigi Genovese,
Université Grenoble Alpes, INAC, Grenoble, France, and CEA, INAC, Grenoble, France
Stefan Goedecker,
Institut für Physik, Universität Basel, Basel, Switzerland
Andreas W. Götz,
San Diego Supercomputer Center, UCSD, La Jolla, CA, USA
Samuli Hakala,
Department of Applied Physics, Aalto University, Espoo, Finland
Jeff R. Hammond,
Leadership Computing Facility, Argonne National Laboratory, Argonne, IL, USA
Ville Havu,
Department of Applied Physics, Aalto University, Espoo, Finland
Maxwell Hutchinson,
Department of Physics, University of Chicago, Chicago, IL, USA
Jürg Hutter,
Institute of Physical Chemistry, University of Zürich, Zürich, Switzerland
Adrian Jinich,
Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
Axel Koslowski,
Max-Planck-Institut für Kohlenforschung, Mülheim an der Ruhr, Germany
Karol Kowalski,
William R. Wiley Environmental Molecular Sciences Laboratory, Battelle, Pacific Northwest National Laboratory, Richland, WA, USA
Sriram Krishnamoorthy,
Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory, Richland, WA, USA
Lin Li,
SUNCAT Center for Interface Science and Catalysis, SLAC National Accelerator Laboratory, Menlo Park, CA, USA
Nathan Luehr,
Department of Chemistry and the PULSE Institute, Stanford, CA, USA; SLAC National Accelerator Laboratory, Menlo Park, CA, USA
Wenjing Ma,
Institute of Software, Chinese Academy of Sciences, Beijing, China
Todd J. Martínez,
Department of Chemistry and the PULSE Institute, Stanford, CA, USA; SLAC National Accelerator Laboratory, Menlo Park, CA, USA
Jean-François Méhaut,
Université Joseph Fourier – Laboratoire d'Informatique de Grenoble – INRIA, Grenoble, France
Peter Messmer,
NVIDIA, Zürich, Switzerland; NVIDIA Co-Design Lab for Hybrid Multicore Computing, Zürich, Switzerland
Perri Needham,
San Diego Supercomputer Center, UCSD, La Jolla, CA, USA
Risto M. Nieminen,
Department of Applied Physics, Aalto University, Espoo, Finland
Chris O'Grady,
SUNCAT Center for Interface Science and Catalysis, SLAC National Accelerator Laboratory, Menlo Park, CA, USA
Roberto Olivares-Amaya,
Department of Chemistry, Princeton University, Princeton, NJ, USA
Hans van Schoot,
Scientific Computing & Modeling NV, Theoretical Chemistry, Vrije Universiteit, Amsterdam, The Netherlands
Ole Schütt,
Department of Materials, ETH Zürich, Zürich, Switzerland
C. David Sherrill,
Center for Computational Molecular Science and Technology, Georgia Institute of Technology, Atlanta, GA, USA; School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, GA, USA; School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
Aaron Sisto,
Department of Chemistry and the PULSE Institute, Stanford, CA, USA; SLAC National Accelerator Laboratory, Menlo Park, CA, USA
Dusan Stosic,
Department of Computer Science, Federal University of Pernambuco, Recife, Brazil
Sarah Tariq,
NVIDIA Corporation, Santa Clara, CA, USA
Walter Thiel,
Max-Planck-Institut für Kohlenforschung, Mülheim an der Ruhr, Germany
Antonino Tumeo,
Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory, Richland, WA, USA
Joost VandeVondele,
Department of Materials, ETH Zürich, Zürich, Switzerland
Brice Videau,
Université Joseph Fourier – Laboratoire d'Informatique de Grenoble – INRIA, Grenoble, France
Oreste Villa,
Nvidia, Santa Clara, CA, USA
Lucas Visscher,
Amsterdam Center for Multiscale Modeling (ACMM), Theoretical Chemistry, VU University Amsterdam, Amsterdam, The Netherlands
Ross C. Walker,
San Diego Supercomputer Center, UCSD, La Jolla, CA, USA; Department of Chemistry and Biochemistry, UCSD, CA, USA
Mark A. Watson,
Department of Chemistry, Princeton University, Princeton, NJ, USA
Xin Wu,
Max-Planck-Institut für Kohlenforschung, Mülheim an der Ruhr, Germany
Jun Yan,
SUNCAT Center for Interface Science and Catalysis, SLAC National Accelerator Laboratory, Menlo Park, CA, USA
The last decade has seen tremendous growth in the use of graphics processing units (GPUs) for numerical simulations, spanning all fields of science. Originally designed for use as coprocessors in graphics applications and visualizations, GPUs have made their way into both mainstream computing platforms and many of the most powerful supercomputers. While substantial effort has gone into the hardware design of GPUs, their potential for scientific computation has only been realized due to the concurrent development of specialized programming approaches and the redesign of the underlying numerical algorithms for massively parallel processors.
Electronic structure calculations are computationally intensive, and the field has a long history of pushing the envelope in high-performance computing. This tradition has continued with the rise of GPUs. Many researchers have invested significant effort in developing software implementations that exploit the computational power of GPUs. This book pays tribute to these developments by collating these efforts into a single reference text.
We have designed this book to provide an introduction to the fast-growing field of electronic structure calculations on massively parallel GPUs. The target audience is graduate students and senior researchers in the fields of theoretical and computational chemistry, condensed matter physics, and materials science, who are looking for an accessible overview of the field, as well as software developers looking for an entry point into GPU and hybrid GPU/CPU programming for electronic structure calculations. To this end, the book provides an overview of GPU computing, a brief introduction to GPU programming, the essential background in electronic structure theory, and the latest examples of code developments and applications for the most widely used electronic structure methods.
We have tried to include all widely used electronic structure methods for which GPU implementations have been developed. The text covers all commonly used basis sets including localized Gaussian- and Slater-type basis functions, plane waves, wavelets, and real-space grid-based approaches. Several chapters expose details on strategies for the calculation of two-electron integrals, exchange-correlation quadrature, Fock matrix formation, solution of the self-consistent field equations, calculation of nuclear gradients to obtain forces, and methods to treat excited states within density functional theory. Other chapters focus on semiempirical methods and correlated wave function methods including density-fitted second-order Møller–Plesset perturbation theory and both iterative and perturbative single- and multireference coupled-cluster methods.
We have enjoyed the steep learning curve that has accompanied the editing of this book, and we trust that you, the reader, will find it an engaging and useful reference.
Ross C. Walker and Andreas W. Götz
August 2015
La Jolla, USA
We would like to thank everybody who has made a contribution to this book, either directly or indirectly. This includes our friends and families for their continuous support. Our special thanks go to the dedicated people at Wiley who guided us in our role as editors and worked hard for this book to see the light of day. In particular, we thank our primary contacts Sarah Keegan, Sarah Higginbotham, and Rebecca Ralf. We are also grateful to Dorathy Steve and her team at SPi for copy-editing. This book would never have been possible without the excellent work of the many individual authors who contributed to its content. We are grateful that they agreed to write chapters for this book, and we thank them all for their patience during the editing period.
Bandwidth
The rate at which data can be transferred, that is, the amount of data moved per unit time. Usually measured in GB/s.
Block
A set of threads that can share data and communicate during execution on the GPU. Data across blocks cannot be synchronized. A block executes on a single SM. To optimize the performance, the block-size needs to be adjusted to the problem and the hardware (e.g., available shared memory). The number of threads in a block is limited to 1024 on the Kepler architecture. Thread blocks are created and executed in units of a warp; thus, the number of threads should be a multiple of the warp size (32 currently). A thread block has its block ID within its grid.
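As an illustration of how block and grid sizes enter a program, the following minimal CUDA C sketch (a hypothetical vector-addition kernel, not taken from any chapter of this book) shows each thread deriving a global index from its block ID, block size, and thread ID; the block size of 256 is an arbitrary but typical choice, a multiple of the warp size.
__global__ void vector_add(const float *a, const float *b, float *c, int n)
{
    // Global index of this thread: block offset plus position within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // guard: the last block may extend past the array
        c[i] = a[i] + b[i];
}

/* Host-side launch: enough blocks of 256 threads to cover all n elements.
   int threads = 256;
   int blocks  = (n + threads - 1) / threads;
   vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);                       */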
Cache
Fast memory that is used to reduce latency for global memory access. On NVIDIA GPUs with Kepler architecture, the SMs share a cache of 1.5 MB size. This can be considered L2 cache since each SM has an L1 cache that is called shared memory.
Constant memory
Fast read-only memory that can be written by the host and accessed by all SMs. Only a limited amount of constant memory is available.
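A minimal sketch of how constant memory is typically used, assuming a small, hypothetical coefficient table that is written once from the host with cudaMemcpyToSymbol and then read by every thread:
__constant__ float d_coeff[16];     // small read-only table in constant memory

__global__ void apply_coeff(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = d_coeff[i % 16] * in[i];   // all threads read the same cached table
}

/* Host side: copy the table into constant memory before launching the kernel.
   float h_coeff[16] = { ... };
   cudaMemcpyToSymbol(d_coeff, h_coeff, sizeof(h_coeff));                   */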
Device
The GPU including its processor and memory. The device cannot operate on data that is located on the host.
Global memory
The memory that is available on the GPU. Comparable to main memory on the host. Data access to global memory is cached but is slow compared to other memory classes on the GPU due to higher latency. Compared to host memory, the global device memory supports high data bandwidth. On current NVIDIA hardware with Kepler architecture, the data path is 512 bits wide; thus, 16 consecutive 32-bit words can be fetched in a single cycle. As a consequence, there is considerable bandwidth degradation for strided memory access. For instance, a stride-two access will fetch 512 bits but use only half of them. There is less device memory than host memory, at present up to 12 GB on NVIDIA Tesla K40. Global memory can be accessed by the host for data transfers between the host and the device. Global memory is persistent between kernel launches.
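The typical life cycle of global memory is sketched below, assuming the hypothetical vector_add kernel from the Block entry: allocate on the device, copy the inputs from the host, launch the kernel, copy the result back, and free the device buffers.
size_t bytes = n * sizeof(float);
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, bytes);                                  // allocate global memory
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);      // host -> device
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);   // kernel operates on global memory
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);      // device -> host (waits for the kernel)
cudaFree(d_a);  cudaFree(d_b);  cudaFree(d_c);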
Grid
A set of blocks that maps to the streaming multiprocessors on the GPU and execute a kernel. The order of execution of the blocks on a GPU is not deterministic. In the Kepler architecture, 16 blocks can be active at the same time in a single multiprocessor.
Host
The CPU and its main memory.
The host cannot operate on data that is located on the device. A program running on the host can transfer data to/from the device and launch kernels on the device.
Kernel
A function that executes in parallel on the GPU. NVIDIA GPUs are programmed as a sequence of kernels that are launched by the host program. By default, a kernel completes execution before the start of the next kernel with an implicit synchronization barrier. Usually, kernels execute a sufficiently large number of thread blocks to occupy all SMs of a GPU. However, the Kepler architecture supports simultaneous execution of multiple independent kernels at the same time. Kernel launches execute multiple threads that are arranged in a grid of blocks.
Latency
The time it takes from the issue of a memory operation to the arrival of the first bit. Usually measured in µs.
Latency hiding
Techniques to deal with high latency of data transfer between the host and device or access to global device memory. For example, a device memory operation issued by threads in a warp will take very long due to latency on the order of hundreds of clock cycles. CPU architectures make use of a cache memory hierarchy to reduce latency; however, this is not effective on GPUs, which are designed for throughput computing. GPUs instead deal with this latency by using a high degree of multithreading. At a given point in time, up to 64 warps can be active on each multiprocessor in the Kepler architecture. While one warp is waiting for a memory operation to complete, the control unit switches to another warp. Thus, all cores can continue computing if the parallelism on each SM is sufficiently large.
Local memory
Slow memory that is located off-chip and has the same latency as global memory. It is used to hold automatic variables for cases in which there is not sufficient register memory available. Variables stored in local memory are private to each thread.
Register memory
Very fast on-chip memory, even faster than shared memory. Used to store local variables that are private to each thread. On the Kepler architecture, a thread can access up to 255 32-bit registers; however, there is only a total of 65,536 32-bit registers on an SM. Ideally, all local variables used by a thread reside in registers on chip. The limited number of registers thus limits the number of concurrent threads. Memory-intensive kernels may spill data to local memory (this is termed register spillage), with an adverse effect on performance due to the high latency of local memory.
Register spillage
Term used when memory-intensive kernels require more storage than is available in registers, so that data is moved to local memory, which usually has a detrimental effect on performance.
Shared memory
Shared memory on NVIDIA GPUs is a fast on-chip memory, essentially a programmable L1 cache attached to each SM. It has low latency and high bandwidth with speed that is close to that of registers. On NVIDIA Kepler architecture, the shared memory is 64 KB for each SM and can be configured as 25%, 50% or 75% software managed cache with the remainder as hardware data cache. Data stored in shared memory can be accessed by all threads in the same thread block and persists only for the lifetime of the execution of a block. Since it is a limited resource per SM, its use limits the number of blocks that can be concurrently executed. The host cannot access shared memory.
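A common use of shared memory is as a block-local scratchpad. The sketch below (a standard tree reduction, assuming the kernel is launched with 256 threads per block) stages data in shared memory, synchronizes, and produces one partial sum per block.
__global__ void block_sum(const float *in, float *partial, int n)
{
    __shared__ float tile[256];                 // one element per thread in the block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // all loads finished before any reads
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            tile[tid] += tile[tid + s];         // pairwise reduction within the block
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = tile[0];          // one partial sum per block
}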
SIMD
Single instruction multiple data. A SIMD processing unit executes single instructions on multiple data. Branching is not possible.
SIMT
Single instruction multiple threads. Model for data parallel computing on GPUs. Each core in an SM can execute a sequential thread but all cores in a group called warp execute the same instruction at the same time similar to classical SIMD processors. Branching is possible, but for conditional operations some of the cores in a warp are disabled resulting in no-ops.
Stream
Sequence of commands that execute in order. Multiple streams can be used to execute kernels simultaneously or to overlap kernel execution with memory copies between host and device.
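A sketch of two streams used to overlap transfers with computation; the kernel name process, the chunk size, and the buffers are hypothetical, and the host arrays are assumed to be pinned (allocated with cudaMallocHost) so that the asynchronous copies can actually overlap with kernel execution.
cudaStream_t s[2];
for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);
for (int k = 0; k < 2; ++k) {
    size_t off = (size_t)k * chunk;
    cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s[k]);             // copy in, stream k
    process<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d_in + off, d_out + off, chunk);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, s[k]);             // copy back, stream k
}
cudaDeviceSynchronize();                                       // wait for both streams
for (int k = 0; k < 2; ++k) cudaStreamDestroy(s[k]);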
Streaming Multiprocessor (SM)
Set of processing cores (ALUs) on a GPU that have access to a common shared memory space. SMs on the NVIDIA GK110 chip, for example, contain 192 single-precision cores and 64 double-precision cores. Groups of 16 cores execute operations of a group of threads in a warp in lockstep. A maximum of 64 warps (2048 threads) can be active at the same time on an SM (see also Latency Hiding). These warps can belong to a maximum of 16 different thread blocks. SMs operate at approximately 1 GHz clock speed, thus at a lower speed than typical CPUs.
Texture memory
Read-only memory that can be written by the host. Texture memory resides in device global memory but is cache-optimized for certain read operations, for example, two-dimensional arrays.
Thread (software)
In the context of GPU programming, a thread is a sequence of instructions to be executed by a GPU processing element. On a GPU, threads are grouped into blocks, and threads within a block are executed in lockstep in groups of a warp. Each thread thus executes an instance of a kernel. Each thread has an ID within its thread block and grid, a program counter, registers, and per-thread private memory available.
Warp
Lock-step unit on a GPU. Threads within a warp execute in lock-step, that is in SIMD fashion. However, branching is allowed. Each warp should access a single cache line. A warp always consists of a subset of threads of a block. A warp consists of 32 threads (this number has remained constant so far but is subject to change). On a Kepler SM, a warp takes two cycles to execute one integer or single-precision floating point instruction on each group of 16 cores. At most 4 of the 12 groups of cores in a Kepler SM can execute double-precision instructions concurrently. At most 2 of the 12 groups of cores can concurrently execute intrinsic and transcendental functions.
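The effect of branching within a warp can be seen in the following small sketch (a hypothetical kernel): the condition splits each warp into even and odd lanes, so the two branches are executed one after the other, with half of the lanes idle in each.
__global__ void branchy(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0)
            x[i] = 2.0f * x[i];     // even lanes active, odd lanes idle
        else
            x[i] = x[i] + 1.0f;     // odd lanes active, even lanes idle
    }
}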
