Programming multi-core and many-core computing systems
Sabri Pllana, Linnaeus University, Sweden
Fatos Xhafa, Technical University of Catalonia, Spain
Provides state-of-the-art methods for programming multi-core and many-core systems
The book comprises a selection of twenty-two chapters covering: fundamental techniques and algorithms; programming approaches; methodologies and frameworks; scheduling and management; testing and evaluation methodologies; and case studies for programming multi-core and many-core systems.
Program development for multi-core processors, especially for heterogeneous multi-core processors, is significantly more complex than for single-core processors. However, programmers have traditionally been trained for the development of sequential programs, and only a small percentage of them have experience with parallel programming. In the past, only a relatively small group of programmers interested in High Performance Computing (HPC) was concerned with parallel programming issues, but the situation has changed dramatically with the appearance of multi-core processors in commonly used computing systems. It is expected that with the pervasiveness of multi-core processors, parallel programming will become mainstream.
The pervasiveness of multi-core processors affects a large spectrum of systems, from embedded and general-purpose to high-end computing systems. This book assists programmers in mastering the efficient programming of multi-core systems, which is of paramount importance for the software-intensive industry as it moves towards a more effective product-development cycle.
Key features:
The book serves as a reference for a larger audience of practitioners, young researchers and graduate level students. A basic level of programming knowledge is required to use this book.
Page count: 949
Publication year: 2017
COVER
TITLE PAGE
COPYRIGHT
LIST OF CONTRIBUTORS
PREFACE
ACKNOWLEDGEMENTS
ACRONYMS
PART I: FOUNDATIONS
CHAPTER 1: MULTI- AND MANY-CORES, ARCHITECTURAL OVERVIEW FOR PROGRAMMERS
1.1 INTRODUCTION
1.2 WHY MULTICORES?
1.3 HOMOGENEOUS MULTICORES
1.4 HETEROGENEOUS MULTICORES
1.5 CONCLUDING REMARKS
REFERENCES
CHAPTER 2: PROGRAMMING MODELS FOR MULTICORE AND MANY-CORE COMPUTING SYSTEMS
2.1 INTRODUCTION
2.2 A COMPARATIVE ANALYSIS OF MANY-CORES
2.3 PROGRAMMING MODELS FEATURES
2.4 PROGRAMMING MODELS FOR MANY-CORES
2.5 AN OVERVIEW OF MANY-CORE PROGRAMMING MODELS
2.6 CONCLUDING REMARKS
REFERENCES
CHAPTER 3: LOCK-FREE CONCURRENT DATA STRUCTURES
3.1 INTRODUCTION
3.2 SYNCHRONIZATION PRIMITIVES
3.3 LOCK-FREE DATA STRUCTURES
3.4 MEMORY MANAGEMENT FOR CONCURRENT DATA STRUCTURES
3.5 GRAPHICS PROCESSORS
REFERENCES
CHAPTER 4: SOFTWARE TRANSACTIONAL MEMORY
4.1 INTRODUCTION
4.2 STM: A PROGRAMMER'S PERSPECTIVE
4.3 TRANSACTIONAL SEMANTICS
4.4 STM DESIGN SPACE
4.5 STM: A HISTORICAL PERSPECTIVE
4.6 APPLICATION PERFORMANCE ON STM
4.7 CONCLUDING REMARKS
REFERENCES
PART II: PROGRAMMING APPROACHES
CHAPTER 5: HYBRID/HETEROGENEOUS PROGRAMMING WITH OMPSS AND ITS SOFTWARE/HARDWARE IMPLICATIONS
5.1 INTRODUCTION
5.2 THE OMPSS PROPOSAL
5.3 IMPLEMENTATION
5.4 TASK GRANULARITY
5.5 RELATED WORK
5.6 FUTURE WORK
5.7 CONCLUDING REMARKS
ACKNOWLEDGMENTS
REFERENCES
CHAPTER 6: SKELETON PROGRAMMING FOR PORTABLE MANY-CORE COMPUTING
6.1 INTRODUCTION
6.2 BACKGROUND: SKELETON PROGRAMMING
6.3 SKEPU: A TUNABLE SKELETON PROGRAMMING LIBRARY
6.4 SKELCL: A LIBRARY FOR HIGH-LEVEL MULTI-GPU PROGRAMMING
6.5 RELATED WORK
6.6 CONCLUDING REMARKS AND FUTURE WORK
REFERENCES
CHAPTER 7: DSL STREAM PROGRAMMING ON MULTICORE ARCHITECTURES
7.1 INTRODUCTION
7.2 A HIGH-LEVEL DSL: SLICES
7.3 INTERMEDIARY REPRESENTATION SJD
7.4 OPTIMIZING THE INTERMEDIATE REPRESENTATION
7.5 REDUCING INTERCORE COMMUNICATION COST
7.6 EVALUATION
7.7 CONCLUSION
REFERENCES
CHAPTER 8: PROGRAMMING WITH TRANSACTIONAL MEMORY
8.1 INTRODUCTION
8.2 CONCURRENCY MADE SIMPLE
8.3 TM LANGUAGE CONSTRUCTS
8.4 IMPLEMENTING A TM
8.5 PERFORMANCE LIMITATIONS
8.6 RECENT SOLUTIONS
8.7 CONCLUDING REMARKS
ACKNOWLEDGMENTS
REFERENCES
CHAPTER 9: OBJECT-ORIENTED STREAM PROGRAMMING
9.1 STREAM PROGRAMMING
9.2 OBJECT-ORIENTED STREAM PROGRAMMING
9.3 XJAVA
9.4 PERFORMANCE
9.5 EXPERIENCES
9.6 RELATED WORK
9.7 FUTURE WORK
9.8 SUMMARY
REFERENCES
CHAPTER 10: SOFTWARE-BASED SPECULATIVE PARALLELIZATION
10.1 INTRODUCTION
10.2 SPECULATIVE EXECUTION IN CORD
10.3 ADVANCED FEATURES OF CORD
10.4 RELATED WORK
10.5 FUTURE WORK
10.6 CONCLUDING REMARKS
REFERENCES
CHAPTER 11: AUTONOMIC DISTRIBUTION AND ADAPTATION
11.1 INTRODUCTION
11.2 PARALLEL PROGRAMMING MODELS
11.3 CONCURRENT CODE
11.4 CONCLUSIONS
REFERENCES
PART III: PROGRAMMING FRAMEWORKS
CHAPTER 12: PEPPHER: PERFORMANCE PORTABILITY AND PROGRAMMABILITY FOR HETEROGENEOUS MANY-CORE ARCHITECTURES
12.1 INTRODUCTION AND BACKGROUND
12.2 THE PEPPHER FRAMEWORK
12.3 THE PEPPHER METHODOLOGY
12.4 PERFORMANCE GUIDELINES AND PORTABILITY
12.5 FURTHER TECHNICAL ASPECTS
12.6 GUIDING APPLICATIONS AND BENCHMARKS
12.7 CONCLUSION
REFERENCES
CHAPTER 13: FASTFLOW: HIGH-LEVEL AND EFFICIENT STREAMING ON MULTICORE
13.1 FASTFLOW PRINCIPLES
13.2 FASTFLOW μ-TUTORIAL
13.3 PERFORMANCE
13.4 RELATED WORK
13.5 FUTURE WORK AND CONCLUSIONS
REFERENCES
CHAPTER 14: PARALLEL PROGRAMMING FRAMEWORK FOR H.264/AVC VIDEO ENCODING IN MULTICORE SYSTEMS
14.1 INTRODUCTION
14.2 PARALLEL PROGRAMMING FRAMEWORK FOR H.264/AVC VIDEO ENCODING
14.3 PROGRAMMING PARALLEL H.264 VIDEO ENCODERS
14.4 EVALUATION OF PARALLEL H.264 VIDEO ENCODERS
14.5 CONCLUDING REMARKS
ACKNOWLEDGMENTS
REFERENCES
CHAPTER 15: PARALLELIZING EVOLUTIONARY ALGORITHMS ON GPGPU CARDS WITH THE EASEA PLATFORM
15.1 INTRODUCTION
15.2 EASEA PARALLELIZATION OF EA ON GPGPU
15.3 EXPERIMENTS AND APPLICATIONS
15.4 CONCLUSION
REFERENCES
PART IV: TESTING, EVALUATION AND OPTIMIZATION
CHAPTER 16: SMART INTERLEAVINGS FOR TESTING PARALLEL PROGRAMS
16.1 INTRODUCTION
16.2 REVIEWS OF PARALLEL PROGRAMS
16.3 TESTING OF PARALLEL PROGRAMS
16.4 RELATED WORK
16.5 FUTURE WORK
16.6 CONCLUDING REMARKS
REFERENCES
PROBLEMS
CHAPTER 17: PARALLEL PERFORMANCE EVALUATION AND OPTIMIZATION
17.1 SEQUENTIAL VERSUS PARALLEL PERFORMANCE
17.2 THREAD OVERHEADS
17.3 CACHE COHERENCE OVERHEADS
17.4 SYNCHRONIZATION OVERHEADS
17.5 NONUNIFORM MEMORY ACCESS
17.6 OVERLAPPING LATENCY
17.7 DIAGNOSTIC TOOLS AND TECHNIQUES
17.8 SUMMARY
REFERENCES
CHAPTER 18: A METHODOLOGY FOR OPTIMIZING MULTITHREADED SYSTEM SCALABILITY ON MULTICORES
18.1 INTRODUCTION
18.2 MULTITHREADING AND SCALABILITY
18.3 CONTROLLED PERFORMANCE MEASUREMENTS
18.4 WORKLOAD DESIGN AND IMPLEMENTATION
18.5 QUANTIFYING SCALABILITY
18.6 CASE STUDY: MEMCACHED SCALABILITY
18.7 OTHER MULTITHREADED APPLICATIONS
18.8 CASE STUDY: DATA VALIDATION
18.9 SCALABILITY ON MANY-CORE ARCHITECTURES
18.10 FUTURE WORK
18.11 CONCLUDING REMARKS
REFERENCES
CHAPTER 19: IMPROVING MULTICORE SYSTEM PERFORMANCE THROUGH DATA COMPRESSION
19.1 INTRODUCTION
19.2 OUR APPROACH
19.3 EXPERIMENTAL EVALUATION
19.4 RELATED WORK
19.5 FUTURE WORK
19.6 CONCLUDING REMARKS
REFERENCES
PART V: SCHEDULING AND MANAGEMENT
CHAPTER 20: PROGRAMMING AND MANAGING RESOURCES ON ACCELERATOR-ENABLED CLUSTERS
20.1 INTRODUCTION
20.2 PROGRAMMING ACCELERATORS ON LARGE-SCALE CLUSTERS
20.3 MAPREDUCE FOR HETEROGENEOUS CLUSTERS
20.4 RESOURCE CONFIGURATION
20.5 RESOURCE MANAGEMENT ON CLUSTERS WITH ACCELERATORS
20.6 EVALUATION
20.7 CONCLUSION
REFERENCES
CHAPTER 21: AN APPROACH FOR EFFICIENT EXECUTION OF SPMD APPLICATIONS ON MULTICORE CLUSTERS
21.1 INTRODUCTION
21.2 SPMD APPLICATIONS ON MULTICORE CLUSTERS
21.3 METHODOLOGY FOR EFFICIENT EXECUTION
21.4 SCALABILITY AND EFFICIENCY OF SPMD APPLICATIONS
21.5 SPMD APPLICATIONS AND PERFORMANCE EVALUATION
21.6 RELATED WORKS
21.7 CONCLUSION AND FUTURE WORKS
REFERENCES
CHAPTER 22: OPERATING SYSTEM AND SCHEDULING FOR FUTURE MULTICORE AND MANY-CORE PLATFORMS
22.1 INTRODUCTION
22.2 OPERATING SYSTEM KERNEL MODELS
22.3 SCHEDULING
REFERENCES
GLOSSARY
INDEX
END USER LICENSE AGREEMENT
CHAPTER 1: MULTI- AND MANY-CORES, ARCHITECTURAL OVERVIEW FOR PROGRAMMERS
Figure 1.1 Flynn's taxonomy.
Figure 1.2 Multiprocessor memory architectures and programming models.
Figure 1.3 Technological trends for microprocessors. Simplified version of Figure 1 in [18].
Figure 1.4 The processor–memory gap (a) and a typical memory hierarchy (b).
Figure 1.5 Multicore processor generations: first (a), second (b), third (c).
Figure 1.6 Power 7 multicore, simplified block diagram.
Figure 1.7 ARM Cortex A15, simplified block diagram.
Figure 1.8 Sun UltraSPARC T2 architecture, simplified block diagram.
Figure 1.9 AMD Opteron Istanbul processor, simplified block diagram.
Figure 1.10 Intel Nehalem architecture – 4 cores, simplified block diagram.
Figure 1.11 Simplified block diagram of Cell/BE.
Figure 1.12 NVIDIA Fermi, simplified block diagram.
Figure 1.13 ARM Mali T604, simplified block diagram.
CHAPTER 3: LOCK-FREE CONCURRENT DATA STRUCTURES
Figure 3.1 Synchronization primitives.
CHAPTER 5: HYBRID/HETEROGENEOUS PROGRAMMING WITH OMPSS AND ITS SOFTWARE/HARDWARE IMPLICATIONS
Figure 5.1 FFT example with OmpSs task dependencies.
Figure 5.2 Block distribution for the arrays in the FFT example ( blocks) with task instantiation numbers.
Figure 5.3 Dynamically generated task graph for the FFT example ( blocks).
Figure 5.4 Task execution timeline for the FFT ( blocks) example with 8 threads.
Figure 5.5 Example of OmpSs task and loop worksharing dependencies.
Figure 5.6 FFT example with specialized task using CUFFT library for Nvidia GPUs.
Figure 5.7 Hybrid MPI/OmpSs example.
Figure 5.8 Nanos++ flow overview.
Figure 5.9 Scalability projections of hardware and software task decoders for various task granularities. The grey region in the middle of the Figure indicates the scalability limitations of software-based task decoders, based on the highly tuned software decoder of OmpSs running on a modern architecture.
CHAPTER 6: SKELETON PROGRAMMING FOR PORTABLE MANY-CORE COMPUTING
Figure 6.1 User function, macro expansion.
Figure 6.2 Average (100 runs) time of blurring quadratic grayscale images of different sizes. (a) Gaussian kernel applied once to the image. (b) Gaussian kernel applied nine times to the image.
Listing 6.1 A MapReduce example that computes the dot product.
Listing 6.2 User function used by MapOverlap when blurring an image.
Listing 6.3 Creation of a simple Map skeleton in SkelCL.
Listing 6.4 Implementation of a scalar multiplication using a Map skeleton and additional arguments.
Figure 6.3 Times for running different LibSolve solvers for and 1000 with the BRUSS2D-MIX problem.
Figure 6.4 Defining an execution plan and applying it to a Reduce skeleton.
Figure 6.5 Vector sum with Reduce on target architecture 1 (a) and architecture 2 (b). TUNE uses an empirically determined execution plan for each architecture.
Listing 6.5 Computing a Mandelbrot fractal using SkelCL.
Listing 6.6 Implementation of LM OSEM using SkelCL.
Figure 6.6 Mandelbrot example: comparison of runtime (a) and lines of code (b) of implementations using OpenCL, CUDA and SkelCL on a platform with one GPU.
Figure 6.7 LM OSEM example: comparison of runtime (a) and lines of code (b) of implementations using OpenCL, CUDA and SkelCL on a platform with four GPUs.
CHAPTER 7: DSL STREAM PROGRAMMING ON MULTICORE ARCHITECTURES
Figure 7.1 Excerpt of a StreamIt program for matrix multiplication.
Figure 7.2 Matrix multiplication in Array-OL as viewed in the IDE Gaspard2 [8].
Figure 7.3 SLICES program that captures a matrix multiplication communication pattern. First the raw stream inputs are cast to shapes to view them as matrices. Then the rows of A are paired with the columns of B in the iterator loop and pushed to the output. In the graphical examples we have chosen x0 = 5 and y0 = 4.
Figure 7.4 Example of data reorganizations enabled by SJD. (The split and join consumptions and productions are always 1 in this graph.)
Figure 7.5 Multidimensional grids and blocks extraction.
Figure 7.6 Compiling 1D blocks that partially overlap.
Figure 7.7 Intermediate SJD representation equivalent to the SLICES matrix multiplication program (Figure 7.3). The intermediate representation was automatically generated by our SLICES to SJD compiler.
Figure 7.8 Set of transformations considered. Each transformation is defined by a graph rewriting rule. Node N is a wild card for any arity-compatible node.
Figure 7.9 Reducing the intercore communication through an iterative process.
Figure 7.10 Optimizations applied to MM-COARSE. In the original program the intercore communication cost is high since the program presents many synchronization points: for example, at runtime J8 and S15 quickly become communication bottlenecks. After optimization, J8 and S15 have been split; the transposition of matrix B has been divided into blocks and distributed among the four cores; duplicate nodes (nodes D205, D208, etc.) have also been distributed among the cores. Since the consumers of the duplicate streams are all in the same core, duplications can be compiled to multiple reads of the same buffer and are not added to the intercore traffic. (a) Partitioning of original MM-COARSE on 4 cores. (b) Partitioning of optimized MM-COARSE on 4 cores.
Figure 7.11 Impact of the communication reductions on the performance: speedups of the original program and the optimized program on an SMP Nehalem quadcore. The executions times are normalized to the reference (StreamIt original program execution time on a single core).
CHAPTER 8: PROGRAMMING WITH TRANSACTIONAL MEMORY
Figure 8.1 The five chopsticks shared by the five dining philosophers to eat rice.
Figure 8.2 Sorted (lexicographically) linked list example in which two operations with same accesses have distinct semantics: contains(z) is a parse operation whereas size() is a snapshot operation.
CHAPTER 9: OBJECT-ORIENTED STREAM PROGRAMMING
Figure 9.1 A stream graph.
Figure 9.2 Search-based auto-tuning of a program P for three different machines M1, M2, M3.
Figure 9.3 Inferring tuning parameters and context information to enable parameter prediction on different machines.
Figure 9.4 Inferring tuning parameters and context information from XJava code.
Figure 9.5 Fusing replicable stages of a pipeline reduces the overhead for splitting and joining data streams. (a) Stage replication without fusion. (b) Fusing replicable stages.
Figure 9.6 Tuning two master/worker configurations depending on workload w.
CHAPTER 10: SOFTWARE-BASED SPECULATIVE PARALLELIZATION
Figure 10.1 Maintaining memory state.
Figure 10.2 Misspeculation detection.
Figure 10.3 Partitioning a loop into prologue, speculative body and epilogue.
Figure 10.4 Kernel of benchmark telecomm-CRC32.
Figure 10.5 Execution speedups for SPEC and MiBench programs on a dual quadcore 3.0 GHz Xeon server.
Figure 10.6 A speculation example.
Figure 10.7 Thread execution model.
Figure 10.8 Discard-all recovery.
Figure 10.9 Incremental recovery.
Figure 10.10 Decoupling space allocation from thread creation.
Figure 10.11 Heap prefix format.
CHAPTER 11: AUTONOMIC DISTRIBUTION AND ADAPTATION
Figure 11.1 A dependency graph derived from code behavior analysis. Edges on the left denote dataflow and on the right workflow (simplified).
Figure 11.2 Potential segmentation of the reduced graph (simplified).
Figure 11.3 Code to infrastructure mapping principles.
Figure 11.4 Rearrangement of concurrent logical segments for maximum speedup.
CHAPTER 12: PEPPHER: PERFORMANCE PORTABILITY AND PROGRAMMABILITY FOR HETEROGENEOUS MANY-CORE ARCHITECTURES
Figure 12.1 The PEPPHER framework for creating performance portable software across a variety of heterogeneous many-core systems. The application is described as a set of (generic) components with annotations specifying performance and other nonfunctional properties (left). A set of component variants is generated (e.g. by source-to-source transformation) or chosen (from expert-written libraries) for different architectures and configurations (middle). For each concrete, heterogeneous architecture, appropriate component variants are selected and compiled into codelets that are scheduled efficiently at run-time (right) using static and dynamic performance information. Performance models are derived from abstract hardware descriptions and component annotations.
Figure 12.2 Generic PEPPHER component interface for a sorting component.
Figure 12.3 Implementation of a PEPPHER sorting component by a merge sort algorithm. The pseudocode notation with C++ language extensions used in this example is one of several possibilities considered for the PEPPHER annotation framework. Alternatives keep the annotations separate from the actual code.
Figure 12.4 A specialized GPU sorting component.
Figure 12.5 Impact of different StarPU scheduling policies (greedy, heft-tm, heft-tm-pr and heft-tmdp-pr) on the LU decomposition example.
CHAPTER 13: FASTFLOW: HIGH-LEVEL AND EFFICIENT STREAMING ON MULTICORE
Figure 13.1 FastFlow-layered architecture with abstraction examples at the different layers of the stack.
Figure 13.2 Hello world pipeline.
Figure 13.3 Hello world farm.
Figure 13.4 Execution time of the farm pattern with different allocators: libc versus TBB versus FastFlow allocator on 8-core Intel (a) and 48-core AMD (b) using a computational grain of 1 µs.
Figure 13.5 Speedup of the farm (a) and throughput in thousands of messages per second (Kmsg/s) of the pipeline (b) paradigms for several computational grains.
Figure 13.6 SWPS3 versus SWPS3-FF performance for 102k gap penalties evaluated on release 57.5 of UniProtKB/Swiss-Prot.
Figure 13.7 YaDT-FF speedup using several standard training sets. Superlinear speedup is due to the fact that the farm emitter performs a minimal amount of work contributing to the final result, but the parallelism degree given is related to the worker threads only.
Figure 13.8 (a) Scalability of StockKit-FF(n) against StockKit(1), where n is the number of worker threads. (b) Execution time (T) and speedup (Sp) over bzip2 in the case of a stream of 1078 files: 86% small (0–1 MBytes), 9% medium (1–10 MBytes), 4% large (10–50 MBytes) and 1% very large (50–100 MBytes). pbzip2 uses 16 threads; pbzip2_ff uses 16 threads for each accelerator. Comp stands for compression and Dec for decompression.
CHAPTER 14: PARALLEL PROGRAMMING FRAMEWORK FOR H.264/AVC VIDEO ENCODING IN MULTICORE SYSTEMS
Figure 14.1 H.264/AVC encoding loop.
Figure 14.2 Parallel video coding architectures. (a) Homogeneous. (b) Heterogeneous, with embedded accelerators. (c) Heterogeneous, with GPU accelerators.
Figure 14.3 Exploited data-level parallelism models. (a) Frame level. (b) Slice level. (c) Macroblock level.
Figure 14.4 Parallel topologies to implement the encoding modules depicted in Figure 14.1 by using functional-level parallelism. (a) Pipeline. (b) Straight parallel. (c) Mixed.
Figure 14.5 Simultaneous exploitation of slice and macroblock parallelism levels. (a) Thread allocation. (b) Thread synchronization.
Figure 14.6 Possible slice-scattering distribution in a multi-core architecture.
Figure 14.7 Considered set of test video sequences. (a) Soccer. (b) Harbour. (c) Crew. (d) City.
Figure 14.8 Speedup provided on the Intel platform using slice-level parallelism.
CHAPTER 15: PARALLELIZING EVOLUTIONARY ALGORITHMS ON GPGPU CARDS WITH THE EASEA PLATFORM
Figure 15.1 Basic evolutionary algorithm.
Figure 15.2 ASREA flowchart.
Figure 15.3 Impact of population size, number of iterations and dimensions on execution time and overhead. (a) Impact of population size and evaluation complexity on execution time. (b) Impact of individual size on transfer time overhead.
Figure 15.4 Speedup on an average of 10 runs for several GPGPU cards versus an Intel i7-950 on a 10-dimension 120-iteration Weierstrass benchmark.
Figure 15.5 Speedup obtained with the island model on Weierstrass, 1000 dimensions, h = 0.35, for 81200 individuals.
Figure 15.6 Speedup factor for symbolic regression wrt different parameters. (a) Speedup wrt tree depth and learning set size, for the evaluation function only. (b) Speedup wrt number of fitness cases.
Figure 15.7 Influence of function set computational intensity.
Figure 15.8 Attainment surface plots of ASREA, NSGA-II and SPEA2 for ZDT functions. (a) ZDT3 at 1000 EVAL. (b) ZDT3 at 10,000 EVAL.
Figure 15.9 Speedup for ZDT3 function on population sizes 100, 1000 and 10000.
Figure 15.10 Speedup obtained on the airplane problem with the EASEA tree-based GP implementation.
Figure 15.11 Difference between original and reconstructed trajectory.
CHAPTER 17: PARALLEL PERFORMANCE EVALUATION AND OPTIMIZATION
Figure 17.1 CPU Utilization view in the Concurrency Visualizer showing poor concurrency in execution.
Figure 17.2 Thread synchronization analysis using the Concurrency Visualizer.
CHAPTER 18: A METHODOLOGY FOR OPTIMIZING MULTITHREADED SYSTEM SCALABILITY ON MULTICORES
Figure 18.1 Schematic of scalability load measurement configuration.
Figure 18.2 Steady-state measurement of instantaneous throughput, for a particular load value, represented as a time series to show the ramp-up and ramp-down phases.
Figure 18.3 Physical components of scalability.
Figure 18.4 Throughput of three MCD releases as the number of threads is scaled up.
Figure 18.5 USL regression analysis of memcached 1.2.8 data (dots) in Figure 18.4.
Figure 18.6 Comparison of scalability data for standard MCD (lower curve) and patched MCD (upper curve).
Figure 18.7 USL model of patched MCD data in Figure 18.6 extended out to larger thread counts.
Figure 18.8 Initial J2EE throughput data (left) and subsequent data (right), measured in requests per second (RPS).
Figure 18.9 USL models of J2EE scalability for initial data (upper curve) and subsequent data (lower curve).
Figure 18.10 Web application scalability data.
Figure 18.11 USL fit to NVIDIA Tesla GPU data.
CHAPTER 19: IMPROVING MULTICORE SYSTEM PERFORMANCE THROUGH DATA COMPRESSION
Figure 19.1 The multicore architecture considered in this chapter and the off-chip memory space.
Figure 19.2 (a) Different scenarios for data when its current use is over. (b) Comparison of on-demand decompression and predecompression. Arrows indicate the execution timeline of the program.
Figure 19.3 (a) Normalized execution cycles with the static strategy using different values. (b) Comparison of the best static strategy (for each benchmark) and the dynamic strategy.
Figure 19.4 Processor usage for the dynamic strategy over the execution period.
Figure 19.5 Breakdown of overheads into three different components for the static and dynamic schemes.
Figure 19.6 Sensitivity of the dynamic strategy to the starting value for the swim benchmark.
Figure 19.7 Sensitivity of the dynamic strategy to values. (a) apsi. (b) mgrid.
Figure 19.8 Sensitivity of the dynamic strategy to values. (a) apsi. (b) mgrid.
Figure 19.9 Sensitivity of the dynamic strategy to the sampling period.
Figure 19.10 Sensitivity of the dynamic strategy to the block size.
Figure 19.11 Sensitivity of the static and dynamic strategies to the processor counts.
Figure 19.12 Results with a hardware-based cache memory.
CHAPTER 20: PROGRAMMING AND MANAGING RESOURCES ON ACCELERATOR-ENABLED CLUSTERS
Figure 20.1 Resource configurations for enabling asymmetric clusters. (a) Conf I: Self-managed well-provisioned accelerators. (b) Conf II: Resource-constrained well-provisioned accelerators. (c) Conf III: Resource-constrained shared-driver accelerators. (d) Conf IV: Mixed accelerators.
Figure 20.2 High-level overview of an accelerator-based asymmetric-distributed system.
Figure 20.3 Interactions between manager and compute node components.
Figure 20.4 State machine for scheduler learning and execution process.
Figure 20.5 Execution time of linear regression with increasing input size.
Figure 20.6 Execution time of word count while simultaneously executing histogram.
Figure 20.7 Execution time of histogram while simultaneously executing word count.
Figure 20.8 Framework scalability.
CHAPTER 21: AN APPROACH FOR EFFICIENT EXECUTION OF SPMD APPLICATIONS ON MULTICORE CLUSTERS
Figure 21.1 Communication and computation of SPMD tile on multicore.
Figure 21.2 Examples of SPMD applications and communications patterns.
Figure 21.3 SPMD applications on multicore cluster.
Figure 21.4 Supertile creation and methodology objective.
Figure 21.5 Methodology for efficient execution of SPMD applications.
Figure 21.6 Mapping and tile distribution.
Figure 21.7 Scheduling phase.
Figure 21.8 Performance evaluation theoretical example.
Figure 21.9 Characterization analysis for a double quadcore cluster.
Figure 21.10 Tile computation and communication ratio characterization.
Figure 21.11 Efficiency evaluation of the heat transfer application.
Figure 21.12 Speedup evaluation of the heat transfer application.
Figure 21.13 Efficiency evaluation of the LL-2D-STD application.
Figure 21.14 Speedup evaluation of the LL-2D-STD application.
CHAPTER 2: PROGRAMMING MODELS FOR MULTICORE AND MANY-CORE COMPUTING SYSTEMS
Table 2.1 Computing properties of many-core hardware.
Table 2.2 Memory properties of many-core hardware.
Table 2.3 An overview of the usability-related features of the programming models.
Table 2.4 An overview of the design-support features that the models under survey have (see Section 2.3.3 for more explanations).
Table 2.5 An overview of the implementation-support features that the models under survey offer (see Section 2.3.3 for more explanations).
Table 2.6 A qualitative comparison of the surveyed programming models. For each feature, we use one of five qualifiers, ++ (very good) to −− (very bad).
CHAPTER 3: LOCK-FREE CONCURRENT DATA STRUCTURES
Table 3.1 Properties of different approaches to nonblocking memory reclamation.
CHAPTER 7: DSL STREAM PROGRAMMING ON MULTICORE ARCHITECTURES
Table 7.1 Intercore communication cost reduction. 'Original' and 'optimized' give the intercommunication volume per steady state, in bytes, for the original program and the optimized program; 'C reduction' is the reduction percentage of intercore communications.
CHAPTER 9: OBJECT-ORIENTED STREAM PROGRAMMING
Table 9.1 Splitting and joining streams
CHAPTER 12: PEPPHER: PERFORMANCE PORTABILITY AND PROGRAMMABILITY FOR HETEROGENEOUS MANY-CORE ARCHITECTURES
Table 12.1 Example of acceleration on a double precision LU decomposition with partial pivoting (DGETRF) on a 3.2 GB matrix
CHAPTER 13: FASTFLOW: HIGH-LEVEL AND EFFICIENT STREAMING ON MULTICORE
Table 13.1 Average latency time and standard deviation (in nanoseconds) of a push/pop operation on an SPSC queue with 1024 slots for 1M insert/extractions, on the 8-core, 16-context Intel and on the 48-core AMD
Table 13.2 Patterns and streams used in real-world applications
CHAPTER 14: PARALLEL PROGRAMMING FRAMEWORK FOR H.264/AVC VIDEO ENCODING IN MULTICORE SYSTEMS
Table 14.1 Example configuration using a pipelined architecture to implement the eight interprediction modes in a variable number of cores
Table 14.2 gprof profiling results of the H.264/AVC reference SW
Table 14.3 Specifications of the considered parallel computational platform
Table 14.4 Considered H.264/AVC encoding parameters
Table 14.5 Memory allocation for the reference and optimized SW versions
Table 14.6 Provided speedup using slice and macroblock parallelism levels, by considering a maximum of 4 slices (to avoid bit-rate degradation [10])
CHAPTER 17: PARALLEL PERFORMANCE EVALUATION AND OPTIMIZATION
Table 17.1 Example of cache coherence problem
CHAPTER 18: A METHODOLOGY FOR OPTIMIZING MULTITHREADED SYSTEM SCALABILITY ON MULTICORES
Table 18.1 Interpretation of queueing metrics
Table 18.2 Memcached scalability parameters
Table 18.3 J2EE application scalability parameters
Table 18.4 Preparation for USL analysis of the data in Figure 18.10
CHAPTER 19: IMPROVING MULTICORE SYSTEM PERFORMANCE THROUGH DATA COMPRESSION
Table 19.1 The base simulation parameters used in our experiments
Table 19.2 The benchmark codes used in this study and important statistics. In obtaining these statistics, the reference input sets are used
CHAPTER 21: AN APPROACH FOR EFFICIENT EXECUTION OF SPMD APPLICATIONS ON MULTICORE CLUSTERS
Table 21.1 Efficiency, speedup and scalability analysis for an SPMD app
Table 21.2 Characterization values
Table 21.3 Analytical values (time expressed in seconds)
Wiley Series on Parallel and Distributed Computing
Series Editor: Albert Y. Zomaya
A complete list of the titles in this series appears at the end of this volume.
Edited by
Sabri Pllana
Fatos Xhafa
Copyright © 2017 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloguing-in-Publication Data
Names: Pllana, Sabri, editor. | Xhafa, Fatos, editor.
Title: Programming multicore and many-core computing systems / edited by Sabri Pllana, Fatos Xhafa.
Description: First edition. | Hoboken, New Jersey : John Wiley & Sons, [2017] | Includes bibliographical references and index.
Identifiers: LCCN 2016038244| ISBN 9780470936900 (cloth) | ISBN 9781119332008 (epub) | ISBN 9781119331995 (Adobe PDF)
Subjects: LCSH: Parallel programming (Computer science) | Coprocessors-Programming.
Classification: LCC QA76.642 .P767 2017 | DDC 005.2/75-dc23 LC record available at https://lccn.loc.gov/2016038244
Cover Image: Creative-idea/Gettyimages
Cover design by Wiley
Marco Aldinucci, Computer Science Department, University of Torino, Corso Svizzera 185, 10149 Torino, Italy.
Jørn Amundsen, Norwegian University of Science and Technology (NTNU), Sem Sælands vei 7-9, NO-7491 Trondheim, Norway.
Eduard Ayguade, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Rosa M. Badia, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Henri E. Bal, Department of Computer Science, Vrije Universiteit, De Boelelaan 1081A, 1081 HV, Amsterdam, The Netherlands.
Denis Barthou, INRIA Bordeaux Sud-Ouest, 200 avenue de la Vieille Tour, 33405 Talence Cedex, France.
Pieter Bellens, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Siegfried Benkner, Faculty of Computer Science, University of Vienna, Währingerstrasse 29, A-1090 Vienna, Austria.
Daniel Rubio Bonilla, Department for Intelligent Service Infrastructures, HLRS, University of Stuttgart, Nobelstr. 19, 70569 Stuttgart, Germany.
Javier Bueno, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Ali R. Butt, Virginia Tech, 2202 Kraft Drive (0106), Blacksburg, VA 24061, USA.
Daniel Cederman, Department of Computer Science and Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden.
Pierre Collet, Strasbourg University, Pôle API, Bd Sébastien Brant, BP 10413, 67412 Illkirch CEDEX, France.
Herbert Cornelius, Intel GmbH, Dornacher Strasse 1, D-85622 Feldkirchen, Germany.
Tommaso Cucinotta, ReTiS Lab, Scuola Superiore Sant'Anna CEIICP, Att.ne Francesca Gattai, Via Moruzzi 1, 56124 Pisa, Italy.
Marco Danelutto, Computer Science Department, University of Pisa, Largo Pontecorvo 3, 56127 Pisa, Italy.
Usman Dastgeer, IDA, Linköping University, S-58183 Linköping, Sweden.
Pablo De Oliveira Castro, CEA, LIST, Université de Versailles, 45 Avenue des États-Unis, 78035 Versailles, France.
Alejandro Duran, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Mujahed Eleyat, Norwegian University of Science and Technology (NTNU), Sem Sælands vei 7-9, NO-7491 Trondheim, Norway.
Johan Enmyren, IDA, Linköping University, S-58183 Linköping, Sweden.
Yoav Etsion, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Eitan Farchi, IBM, 49 Hashmonaim Street, Pardes Hanna 37052, Israel.
Montse Farreras, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Min Feng, Department of Computer Science, The University of California at Riverside, Engineering Bldg. Unit 2, Rm. 463, Riverside, CA 92521, USA.
Roger Ferrer, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Anders Gidenstam, School of Business and Informatics, Högskolan i Borås, S-50190 Borås, Sweden.
Sergei Gorlatch, FB10, Universität Münster, D-48149 Münster, Germany.
Vincent Gramoli, LPD, EPFL IC, Station 14, CH-1015 Lausanne, Switzerland.
Rachid Guerraoui, LPD, EPFL IC, Station 14, CH-1015 Lausanne, Switzerland.
Neil Gunther, Performance Dynamics Company, 4061 East Castro Valley Blvd., Suite 110, Castro Valley, CA 95954, USA.
Rajiv Gupta, Department of Computer Science, The University of California at Riverside, Engineering Bldg. Unit 2, Rm. 408, Riverside, CA 92521, USA.
Phuong Ha, Department of Computer Science, Faculty of Science, University of Tromsø, NO-9037 Tromsø, Norway.
Pieter Hijma, Department of Computer Science, Vrije Universiteit, De Boelelaan 1081A, 1081 HV, Amsterdam, The Netherlands.
Alexandru Iordan, Norwegian University of Science and Technology (NTNU), Sem Sælands vei 7-9, NO-7491 Trondheim, Norway.
Magnus Jahre, Norwegian University of Science and Technology (NTNU), Sem Sælands vei 7-9, NO-7491 Trondheim, Norway.
Mahmut Kandemir, The Pennsylvania State University, 111 IST Building, University Park, PA 16802, USA.
Philipp Kegel, FB10, Universität Münster, D-48149 Münster, Germany.
Christoph Kessler, IDA, Linköping University, S-58183 Linköping, Sweden.
Peter Kilpatrick, School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast BT7 1NN, UK.
Jesus Labarta, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Nicolas Lachiche, Strasbourg University, Pôle API, Bd Sébastien Brant, BP 10413, 67412 Illkirch CEDEX, France.
Giuseppe Lipari, Real-Time Systems Laboratory, Scuola Superiore Sant'Anna, Pisa, Italy.
Stéphane Louise, CEA, LIST, Gif-sur-Yvette, 91191 France.
Emilio Luque, Computer Architecture and Operating System Department, Universitat Autonoma de Barcelona, 08193 Barcelona, Spain.
Ogier Maitre, Strasbourg University, Pôle API, Bd Sébastien Brant, BP 10413, 67412 Illkirch CEDEX, France.
Sandya Mannarswamy, Xerox Research Centre India, 225 1st C Cross, 2nd Main, Kasturi Nagar, Bangalore 560043, India.
Vladimir Marjanovic, Barcelona Supercomputing Center, Nexus-2 Building, Jordi Girona 29, 08034 Barcelona, Spain.
Lluis Martinell, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Xavier Martorell, Barcelona Supercomputing Center, Nexus-2 Building, Jordi Girona 29, 08034 Barcelona, Spain.
David Moloney, Movidius Ltd., Mountjoy Square East 19, D1 Dublin, Ireland.
Ronal Muresano, Computer Architecture and Operating System Department, Universitat Autonoma de Barcelona, 08193 Barcelona, Spain.
M. Mustafa Rafique, Virginia Tech, 2202 Kraft Drive (0106), Blacksburg, Virginia 24061, USA.
Raymond Namyst, Institut National de Recherche en Informatique et en Automatique (INRIA), Bordeaux Sud-Ouest, Cours de la Liberation 351, F-33405 Talence Cedex, France.
Lasse Natvig, Norwegian University of Science and Technology (NTNU), Sem Sælands vei 7-9, NO-7491 Trondheim, Norway.
Dimitrios S. Nikolopoulos, FORTH-ICS, N. Plastira 100, Vassilika Vouton, Heraklion, Crete, Greece.
Frank Otto, Institute for Program Structures and Data Organization, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany.
Ozcan Ozturk, Department of Computer Engineering, Engineering Building, EA 407B, Bilkent University, Bilkent, 06800 Ankara, Turkey.
Marina Papatriantafilou, Department of Computer Science and Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden.
Stefan Parvu, Nokia Group, Karakaari 15, Espoo, Finland.
Josep M. Perez, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Judit Planas, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Sabri Pllana, Department of Computer Science, Linnaeus University, SE-351 95 Vaxjo, Sweden.
Stephane Querry, Pôle API, Bd Sébastien Brant, BP 10413, 67412 Illkirch CEDEX, France.
Alex Ramirez, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Dolores Rexachs, Computer Architecture and Operating System Department, Universitat Autonoma de Barcelona, 08193 Barcelona, Spain.
Andrew Richards, Codeplay Software Limited, York Place 45, EH1 3HP Edinburgh, United Kingdom.
António Rodrigues, TU Lisbon/IST/INESC-ID, Rua Alves Redol 9, 1000-029 Lisboa, Portugal.
Nuno Roma, TU Lisbon/IST/INESC-ID, Rua Alves Redol 9, 1000-029 Lisboa, Portugal.
Peter Sanders, Karlsruher Institut für Technologie, Am Fasanengarten 5, D-76128 Karlsruhe, Germany.
Lutz Schubert, Department for Intelligent Service Infrastructures, HLRS, University of Stuttgart, Nobelstr. 19, 70569 Stuttgart, Germany.
Hazim Shafi, One Microsoft Way, Redmond, WA 98052, USA.
Deepak Sharma, Pôle API, Bd Sébastien Brant, BP 10413, 67412 Illkirch CEDEX, France.
Leonel Sousa, TU Lisbon/IST/INESC-ID, Rua Alves Redol 9, 1000-029 Lisboa, Portugal.
Michel Steuwer, FB10, Universität Münster, D-48149 Münster, Germany.
Shanti Subramanyam, Yahoo! Inc., 701 First Ave, Sunnyvale, CA 94089, USA.
Håkan Sundell, School of Business and Informatics, Högskolan i Borås, S-50190 Borås, Sweden.
Xavier Teruel, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Chen Tian, Department of Computer Science, The University of California at Riverside, Engineering Bldg. Unit 2, Rm. 463, Riverside, CA 92521, USA.
Walter F. Tichy, Institute for Program Structures and Data Organization, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany.
Massimo Torquati, Computer Science Department, University of Pisa, Largo Pontecorvo 3, 56127 Pisa, Italy.
Jesper Larsson Träff, Research Group Parallel Computing, Vienna University of Technology, Favoritenstrasse 16/184-5, A-1040 Vienna, Austria.
Ioanna Tsalouchidou, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Philippas Tsigas, Department of Computer Science and Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden.
Mateo Valero, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.
Rob V. Van Nieuwpoort, Department of Computer Science, Vrije Universiteit, De Boelelaan 1081A, 1081 HV, Amsterdam, The Netherlands.
Ana Lucia Varbanescu, Department of Software Technologies, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands.
Stefan Wesner, HLRS, Department of Applications and Visualization, University of Stuttgart, Nobelstr. 19, 70569 Stuttgart, Germany.
Multicore and many-core computing systems have emerged as an important paradigm in high-performance computing (HPC) and have significantly propelled the development of advanced parallel and distributed applications as well as of embedded systems. Multicore processors are now ubiquitous: from processors with 2 or 4 cores in the 2000s, the trend of increasing the number of cores continues, and processors with hundreds or even thousands of (lightweight) cores are becoming commonplace, optimizing not only performance but also energy. However, this disruptive technology (also referred to as the 'continuum computing paradigm') presents several major challenges, such as the increased effort and system-specific skills needed for porting and optimizing application codes, the management and exploitation of massive parallelism and system heterogeneity to achieve increased performance, and innovative modeling strategies for low-power simulation. Among these, we would single out the challenge of mastering multicore, many-core and heterogeneous systems – this is precisely the focus of this book!
The emergence of multicore processors has helped address several problems of single-core processors – known as the memory wall, the power wall and the instruction-level parallelism wall – but it poses several other 'walls', such as the programmability wall or the coherency wall. Among these, the programmability wall is a long-standing challenge. Indeed, on the one hand, program development for multicore processors, especially for heterogeneous multicore processors, is significantly more complex than for single-core processors. On the other hand, programmers have traditionally been trained for the development of sequential programs, and only a small percentage of them have experience with parallel programming.
In fact, in the past only a relatively small group of programmers interested in HPC was concerned with parallel programming issues; the situation has changed dramatically with the appearance of multicore processors in commonly used computing systems. Traditionally, parallel programs in the HPC community have been developed by heroic programmers using a simple text editor as the programming environment, programming at a low level of abstraction and doing manual performance optimization. It is expected that with the pervasiveness of multicore processors parallel programming will become mainstream, but it cannot be expected that mainstream programmers will prefer the traditional HPC methods and tools.
The main objective of this book is to present a comprehensive view of state-of-the-art parallel programming methods, techniques and tools, to aid programmers in mastering the efficient programming of multicore and many-core systems. The book comprises a selection of twenty-two chapters contributed by experts in the field of multicore and many-core systems that cover fundamental techniques and algorithms, programming approaches, methodologies and frameworks, task/application scheduling and management, testing and evaluation methodologies, and case studies for programming multicore and many-core systems. Lessons learned, open challenges and the road map ahead are also given and discussed throughout the chapters.
The content of the book is arranged into five parts:
Part I: Foundations
The first part of the book covers fundamental issues in programming multicore and many-core computing systems. In four chapters the authors discuss the state of the art on multi- and many-core architectures, programming models, concurrent data structures and memory allocation, scheduling and management.
Natvig et al. in the first chapter, ‘Multi- and many-cores, architectural overview for programmers’, provide a broad overview of the fundamental parallel techniques, parallel taxonomies and various ‘walls’ in programming multicore/many-core computing systems such as ‘power wall’, ‘memory wall’ and ‘ILP (instruction level parallelism) wall’. The authors also discuss the challenges of the heterogeneity in multicore/many-core computing systems. They conclude by stressing the need for more research in parallel programming models to meet the five P's of parallel processing – performance, predictability, power efficiency, programmability and portability – when building and programming multicore and many-core computing systems.
Varbanescu et al. in the second chapter, ‘Programming models for multicore and many-core computing systems’, survey a comprehensive set of programming models for most popular families of many-core systems, including both specific and classical parallel models for multicore and many-core platforms. The authors have introduced four classes of reference features for model evaluation: usability, design support, implementation support and programmability. Based on these features, a multidimensional comparison of the surveyed models is provided aiming to identify the essential characteristics that separate or cluster these models. The authors conclude by emphasizing the influence that the choice of a programming model can have on the application design and implementation and give a few guidelines for finding a programming model that matches the application characteristics.
The third chapter by Cederman et al., ‘Lock-free concurrent data structures’, deals with the use of concurrent data structures in parallel programming of multicore and many-core systems. Several issues such as maintaining consistency in the presence of many simultaneous updates are discussed and lock-free implementations of data structures that support concurrent access are given. Lock-free concurrent data structures are shown to support the design of lock-free algorithms that scale much better when the number of processes increases. A set of fundamental synchronization primitives is also described together with challenges in managing dynamically allocated memory in a concurrent environment.
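To give a concrete flavor of such structures, the following is a minimal Treiber-style lock-free stack in C++ built on compare-and-swap; it is an illustrative sketch, not code from the chapter, and it deliberately never frees popped nodes, since safe memory reclamation is precisely the problem the chapter's memory-management discussion addresses.

#include <atomic>

// Illustrative Treiber-style lock-free stack (sketch; reclamation omitted).
struct Node {
    int value;
    Node* next;
};

class LockFreeStack {
    std::atomic<Node*> head{nullptr};
public:
    void push(int v) {
        Node* n = new Node{v, head.load()};
        // On CAS failure, n->next is refreshed to the current head; retry.
        while (!head.compare_exchange_weak(n->next, n)) {}
    }
    bool pop(int& out) {
        Node* h = head.load();
        // On CAS failure, h is refreshed to the current head; retry.
        while (h && !head.compare_exchange_weak(h, h->next)) {}
        if (!h) return false;
        out = h->value;
        // h is intentionally leaked: freeing it safely requires hazard
        // pointers or similar schemes, as discussed in the chapter.
        return true;
    }
};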
Mannarswamy in the fourth chapter, 'Software transactional memory', addresses the main challenges in writing concurrent code for multicore and many-core computing systems. In particular, the author focuses on coordinating access to data shared by multiple threads running concurrently. The software transactional memory (STM) programming paradigm for shared-memory multithreaded programs is then introduced. STM is intended to facilitate the development of complex concurrent software as an alternative to conventional lock-based synchronization primitives, reducing the programming complexity involved in writing concurrent code. The need to address performance bottlenecks and improve application performance on STM is also discussed as a major research issue in the field.
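As an illustration of the programming style (a sketch using GCC's experimental transactional-memory language extension, which is only one possible TM interface and not necessarily the one used in the chapter), an atomic block replaces explicit locking:

// Compile with: g++ -fgnu-tm
// The block executes atomically; conflicting transactions abort and retry.
static int balance_a = 100;
static int balance_b = 0;

void transfer(int amount) {
    __transaction_atomic {
        balance_a -= amount;  // no locks taken by the programmer
        balance_b += amount;  // all-or-nothing visibility to other threads
    }
}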
Part II: Programming Approaches
The second part of the book is devoted to programming approaches for multicore and many-core computing systems. This part comprises seven chapters that cover a variety of programming approaches including heterogeneous programming, skeleton programming, DSL and object-oriented stream programming and programming with transactional memory.
The fifth chapter, 'Heterogeneous programming with OMPSs and its implications', by Ayguadé et al. discusses programming models for heterogeneous architectures that aim to exploit asynchrony and to increase the parallelization, modularity and portability of applications. The authors present the OmpSs model, which extends the OpenMP 3.0 programming model, and show how it leverages MPI and OpenCL/CUDA for mastering the efficient programming of clustered heterogeneous multi-/many-core systems. The implementation of OmpSs is described, together with a discussion of the intelligence that needs to be embedded in the runtime system to effectively lower the programmability wall and of the opportunities to implement new mechanisms and policies; some overheads related to task management in OmpSs are pointed out for further investigation.
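The following is a hedged sketch of the OmpSs style (the dependence-clause syntax, in particular for array sections, varies across OmpSs versions, and this is not an example from the chapter): functions annotated as tasks declare data directionality, and the runtime builds and schedules the task graph from the in/inout clauses.

// Sketch of OmpSs-style task annotations (compiled with the Mercurium
// source-to-source compiler in a real setup; syntax details vary).
#pragma omp task in(a) inout(b)
void add_block(const float* a, float* b, int n) {
    for (int i = 0; i < n; ++i) b[i] += a[i];
}

int main() {
    static float x[1024] = {0}, y[1024] = {0}, z[1024] = {0};
    add_block(x, y, 1024);  // task 1: reads x, updates y
    add_block(y, z, 1024);  // task 2: ordered after task 1 by the y dependence
    #pragma omp taskwait    // drain the dynamically generated task graph
    return 0;
}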
Kessler et al. in the sixth chapter, 'Skeleton programming for portable many-core computing', consider skeleton programming ('data-parallel skeletons') as a model for solving the portability problem that arises in multi- and many-core programming and for raising the level of abstraction in such programming environments. After reviewing the concept of algorithmic skeletons, the authors give a detailed description of two recent approaches for programming emerging heterogeneous many-core systems, namely SkePU and SkelCL. Some other skeleton programming frameworks, which share ideas with SkePU and SkelCL but address a narrower range of architectures or are used in industrial application development, are also discussed. Adding support for portable task parallelism, such as farm skeletons, is pointed out as an important direction for future research.
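The core idea can be conveyed with a plain C++ sketch that is deliberately not the SkePU or SkelCL API: the programmer supplies only the per-element user function, while the skeleton owns the iteration and could equally dispatch to OpenMP, CUDA or OpenCL back ends.

#include <cstddef>

// Illustrative map skeleton (sketch, not the SkePU/SkelCL interface).
template <typename T, typename F>
void map_skeleton(const T* in, T* out, std::size_t n, F f) {
    // A tuned back end would parallelize or offload this loop.
    for (std::size_t i = 0; i < n; ++i)
        out[i] = f(in[i]);
}

// Usage: map_skeleton(a, b, n, [](float x) { return 2.0f * x; });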
In the seventh chapter, 'DSL stream programming on multicore architectures', de Oliveira Castro et al. present a novel approach to stream programming, a powerful alternative for programming multicore processors: it offers deterministic execution based on a sound mathematical formalism and the ability to express parallelism implicitly through the stream structure, which leverages compiler optimizations that can harness multicore performance without tuning the application by hand. Two families of stream programming languages are analyzed: languages in which the data access patterns are explicitly described by the programmer through a set of reorganization primitives, and languages in which the data access patterns are implicitly declared through a set of dependencies between tasks. The authors then expose the principle of a two-level approach combining the advantages and expressivity of both types of languages, aiming to achieve both the expressivity of high-level languages such as Array-OL and Block Parallel and a rich optimization framework similar to those of StreamIt and Brook.
The eighth chapter, ‘Programming with transactional memory’, by Gramoli and Guerraoui addresses similar issues as in Chapter 4, namely, the use of transactional memory to remedy numerous concurrency problems arising in multicore and many-core programming. The chapter analyzes the state-of-the-art concurrent programming advances based on transactional memory. Several programming languages that support TM are considered along with some TM implementations and a running example for software support. The causes for performance limitations that TMs may suffer from and some recent solutions to cope with such limitations are also discussed.
Otto and Tichy in the ninth chapter, 'Object-oriented stream programming', present an approach unifying the concepts of object orientation (OO) and stream programming, aiming to take advantage of the features of both paradigms. Targeting better programmability and performance gains, object-oriented stream programming (OOSP) is introduced as a solution. The benefits of OO and stream programming are exemplified with XJava, a prototype OOSP language extending Java. Other issues, such as potential conflicts between tasks and run-time performance tuning and correctness, allowing for interprocess application optimization and faster parameter adjustments, are also discussed.
The tenth chapter, 'Software-based speculative parallelization', by Tian et al. studies the thread-level speculative parallelization (SP) approach for parallelizing sequential programs by exploiting the dynamic parallelism that may be present in a sequential program. As SP is usually applied to loops and performed at compile time, it requires minimal help from the programmer, who may only be asked to identify loops to which speculative parallelization is to be applied. The authors discuss several issues in SP, such as handling misspeculation, recovery capabilities and techniques for identifying parallelizable regions. Some ongoing projects that focus on SP techniques are also briefly discussed, along with directions for future research, including energy efficiency in SP and the use of SP for heterogeneous processors and 3D multicore processors.
Schubert et al. in the eleventh chapter, 'Autonomic distribution and adaptation', describe an approach for increasing the scalability of applications by exploiting inherent concurrency in order to parallelize and distribute the code. The authors focus specifically on concurrency, a crucial part of any parallelization approach, in the sense of reducing dependencies between logical parts of an application. To that end, they employ graph analysis methods to assess dependencies at the code level, identify concurrent segments and relate them to the specific characteristics of the (heterogeneous, large-scale) environment. Issues posed to programming multicore and many-core computers by the high degree of scalability, and especially by the large variance of processor architectures, are also discussed.
Part III: Programming Frameworks
The third part of the book deals with methodologies, frameworks and high-level programming tools for constructing and testing software that can be ported between different, possibly in themselves heterogeneous, many-core systems while preserving specific quantitative and qualitative performance aspects.
The twelfth chapter, 'PEPPHER: Performance portability and programmability for heterogeneous many-core architectures', by Benkner et al. presents the PEPPHER framework, which introduces a flexible and extensible compositional metalanguage for expressing functional and nonfunctional properties of software components, their resource requirements and possible compilation targets, as well as providing abstract specifications of properties of the underlying hardware. Handles are also provided for the run-time system to schedule the components on the available hardware resources. Performance predictions can be derived automatically by combining the supplied performance models. Performance portability is aided by guidelines and requirements ensuring that the PEPPHER framework, at all levels, chooses the best implementation of a given component or library routine among the available variants, including settings for tunable parameters, prescheduling decisions and data movement operations.
Aldinucci et al. in the thirteenth chapter, 'FastFlow: high-level and efficient streaming on multicore', consider, as do other chapters, the difficulties of programming multicore and many-core systems, but from the perspective of two interrelated needs: efficient mechanisms supporting correct concurrent access to shared-memory data structures, and higher-level programming environments capable of hiding the difficulties of correct and efficient use of shared-memory objects by raising the level of abstraction provided to application programmers. To address these needs the authors introduce and discuss FastFlow, a programming framework specifically targeting cache-coherent shared-memory multicores. They show the suitability of the programming abstractions provided by the top layer of the FastFlow programming model for application programmers. Performance and efficiency considerations are also given, along with some real-world applications.
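A minimal sketch in the spirit of FastFlow's 'hello world' pipeline (compare Figure 13.2) is shown below; it assumes the classic ff_node/ff_pipeline interface, and header paths or constant names may differ slightly across FastFlow releases.

#include <ff/pipeline.hpp>
#include <iostream>
using namespace ff;

// First stage: emits a short stream of heap-allocated items.
struct Source : ff_node {
    void* svc(void*) {
        for (long i = 1; i <= 5; ++i) ff_send_out(new long(i));
        return EOS;    // signal end-of-stream
    }
};

// Second stage: consumes and frees the items.
struct Sink : ff_node {
    void* svc(void* task) {
        long* t = static_cast<long*>(task);
        std::cout << "received " << *t << '\n';
        delete t;
        return GO_ON;  // keep accepting stream items
    }
};

int main() {
    Source src; Sink snk;
    ff_pipeline pipe;      // stages communicate over lock-free queues
    pipe.add_stage(&src);
    pipe.add_stage(&snk);
    return pipe.run_and_wait_end() < 0 ? 1 : 0;
}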
In the fourteenth chapter, 'Programming framework for H.264/AVC video encoding in multicore systems', Roma et al. illustrate the usefulness of multicore and many-core computing for video encoding, a component of many multimedia applications. As video encoding is highly computationally demanding, parallel approaches are envisaged to meet real-time encoding performance requirements. The authors present a new parallel programming framework that allows high-performance H.264/AVC video encoders to be implemented easily and efficiently. Its modularity and flexibility make this framework particularly suited for efficient implementations on either homogeneous or heterogeneous parallel platforms, providing a suitable set of fine-tuning configurations and parameterizations that allow fast prototyping and implementation, thus significantly reducing the development time of the whole video encoding system.
The fifteenth chapter, 'Parallelizing evolutionary algorithms on GPGPU cards with the EASEA platform', by Maitre et al. presents the EASEA (EAsy Specification of Evolutionary Algorithm) software platform, dedicated to evolutionary algorithms, which makes it possible to exploit parallel architectures ranging from a single GPGPU-equipped machine to multi-GPGPU machines, to a cluster or even several clusters of GPGPU machines. Parallel algorithms implemented by the EASEA platform are proposed for evolutionary algorithms and evolution strategies, genetic programming and multiobjective optimization. Finally, a set of artificial and real-world problems is presented, for which performance evaluation results are given. EASEA is shown to be suitable for efficiently parallelizing generic evolutionary optimization problems to run on current petaflop machines and future exaflop ones.
Part IV: Testing, Evaluation and Optimization
The fourth part of the book covers testing, evaluation and optimization of parallel programs, with special emphasis on multicore and many-core systems. Techniques, methodologies and approaches are presented in four chapters.
Farchi in the sixteenth chapter, 'Smart interleavings for testing parallel programs', discusses the challenges of testing parallel programs, which execute several parallel tasks, might be distributed over different machines under possible node or network failures, and might use different synchronization primitives. The main challenge of parallel program testing thus resides in defining and covering the rather huge space of possible orders of tasks and environment events. The author presents state-of-the-art testing techniques, including parallel bug-pattern-based reviews and distributed reviews. The latter techniques enable the design of a test plan for the parallel program that is then implemented in unit testing. Coping with scale is envisaged as a main challenge for future research.
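A toy C++ illustration of why interleaving coverage matters (not the chapter's technique): the non-atomic read-modify-write below is correct under most schedules, and only repeated runs, here with an injected yield to widen the race window, expose the lost-update interleaving.

#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    int lost = 0;
    for (int run = 0; run < 10000; ++run) {
        std::atomic<int> counter{0};
        auto inc = [&counter] {
            int v = counter.load();      // read
            std::this_thread::yield();   // widen the window between read and write
            counter.store(v + 1);        // write: not atomic with the read
        };
        std::thread t1(inc), t2(inc);
        t1.join(); t2.join();
        if (counter.load() != 2) ++lost; // a buggy interleaving occurred
    }
    std::printf("buggy interleavings observed: %d of 10000 runs\n", lost);
    return 0;
}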
The seventeenth chapter by Shafi, 'Parallel performance evaluation and optimization', covers important aspects of shared-memory parallel programming that impact performance. Guidance and mitigation techniques for diagnosing performance issues, applicable to a large spectrum of shared-memory multicore programs, are given to assist in performance tuning. Various overheads in parallel programs, including thread overheads, cache overheads and synchronization overheads, are discussed and mitigation techniques analyzed. Optimization-related issues such as nonuniform memory access (NUMA) and latency are also described. The chapter overviews diagnostic tools as a critical means of achieving good performance in parallel applications.
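As one concrete example of the cache-coherence overheads discussed (a sketch assuming the common 64-byte cache line, not code from the chapter), padding per-thread counters so each occupies its own cache line eliminates false sharing between writer threads.

#include <thread>
#include <vector>

// Each counter is aligned to its own cache line, so concurrent writers no
// longer invalidate each other's lines (requires C++17 for aligned new).
struct alignas(64) PaddedCounter {
    long value = 0;
};

int main() {
    const int nthreads = 4;
    std::vector<PaddedCounter> counters(nthreads);
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([&counters, t] {
            for (int i = 0; i < 1000000; ++i) ++counters[t].value;
        });
    for (auto& th : ts) th.join();
    return 0;
}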
The eighteenth chapter, 'A methodology for optimizing multithreaded system scalability on multicores', by Gunther et al. presents a methodology that combines controlled measurements of the multithreaded platform with a scalability modeling framework within which to evaluate performance measurements of multithreaded programs. The authors show how to quantify scalability using the Universal Scalability Law (USL) by applying it to controlled performance measurements of memcached, J2EE and WebLogic. They advocate that system performance analysis should be incorporated into a comprehensive methodology rather than being done as an afterthought. Their methodology, based on the USL, emphasizes the importance of validating scalability data through controlled measurements that use appropriately designed test workloads. Some results from quantifying GPU and many-core scalability using the USL methodology are also reported.
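For orientation, the USL's published form models relative capacity C(N) at concurrency N using a contention parameter σ and a coherency-delay parameter κ:

C(N) = N / (1 + σ(N − 1) + κN(N − 1))

With the illustrative values σ = 0.05 and κ = 0.001 (not the chapter's fitted parameters), C(16) = 16 / (1 + 0.05·15 + 0.001·16·15) ≈ 8.0: sixteen threads deliver roughly eightfold throughput, and the κ term eventually makes throughput retrograde as N grows.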
Ozturk and Kandemir in the nineteenth chapter, ‘Improving multicore system performance through data compression
