
Programming multi-core and many-core computing systems

Sabri Pllana, Linnaeus University, Sweden

Fatos Xhafa, Technical University of Catalonia, Spain

Provides state-of-the-art methods for programming multi-core and many-core systems

The book comprises a selection of twenty-two chapters covering: fundamental techniques and algorithms; programming approaches; methodologies and frameworks; scheduling and management; testing and evaluation methodologies; and case studies for programming multi-core and many-core systems.

Program development for multi-core processors, especially for heterogeneous multi-core processors, is significantly more complex than for single-core processors. However, programmers have traditionally been trained to develop sequential programs, and only a small percentage of them have experience with parallel programming. In the past, only a relatively small group of programmers interested in High Performance Computing (HPC) was concerned with parallel programming issues, but the situation has changed dramatically with the appearance of multi-core processors in commonly used computing systems. It is expected that with the pervasiveness of multi-core processors, parallel programming will become mainstream.

The pervasiveness of multi-core processors affects a large spectrum of systems, from embedded and general-purpose to high-end computing systems. This book assists programmers in mastering the efficient programming of multi-core systems, which is of paramount importance for the software-intensive industry in achieving a more effective product-development cycle.

Key features:

  • Lessons learned, challenges and the road map ahead.
  • Contains real-world examples and case studies.
  • Helps programmers master the efficient programming of multi-core and many-core systems.

The book serves as a reference for a broad audience of practitioners, young researchers and graduate-level students. A basic level of programming knowledge is required to use this book.




Table of Contents

COVER

TITLE PAGE

COPYRIGHT

LIST OF CONTRIBUTORS

PREFACE

ACKNOWLEDGEMENTS

ACRONYMS

PART I: FOUNDATIONS

CHAPTER 1: MULTI- AND MANY-CORES, ARCHITECTURAL OVERVIEW FOR PROGRAMMERS

1.1 INTRODUCTION

1.2 WHY MULTICORES?

1.3 HOMOGENEOUS MULTICORES

1.4 HETEROGENEOUS MULTICORES

1.5 CONCLUDING REMARKS

REFERENCES

CHAPTER 2: PROGRAMMING MODELS FOR MULTICORE AND MANY-CORE COMPUTING SYSTEMS

2.1 INTRODUCTION

2.2 A COMPARATIVE ANALYSIS OF MANY-CORES

2.3 PROGRAMMING MODELS FEATURES

2.4 PROGRAMMING MODELS FOR MANY-CORES

2.5 AN OVERVIEW OF MANY-CORE PROGRAMMING MODELS

2.6 CONCLUDING REMARKS

REFERENCES

CHAPTER 3: LOCK-FREE CONCURRENT DATA STRUCTURES

3.1 INTRODUCTION

3.2 SYNCHRONIZATION PRIMITIVES

3.3 LOCK-FREE DATA STRUCTURES

3.4 MEMORY MANAGEMENT FOR CONCURRENT DATA STRUCTURES

3.5 GRAPHICS PROCESSORS

REFERENCES

CHAPTER 4: SOFTWARE TRANSACTIONAL MEMORY

4.1 INTRODUCTION

4.2 STM: A PROGRAMMER'S PERSPECTIVE

4.3 TRANSACTIONAL SEMANTICS

4.4 STM DESIGN SPACE

4.5 STM: A HISTORICAL PERSPECTIVE

4.6 APPLICATION PERFORMANCE ON STM

4.7 CONCLUDING REMARKS

REFERENCES

PART II: PROGRAMMING APPROACHES

CHAPTER 5: HYBRID/HETEROGENEOUS PROGRAMMING WITH OMPSS AND ITS SOFTWARE/HARDWARE IMPLICATIONS

5.1 INTRODUCTION

5.2 THE OMPSS PROPOSAL

5.3 IMPLEMENTATION

5.4 TASK GRANULARITY

5.5 RELATED WORK

5.6 FUTURE WORK

5.7 CONCLUDING REMARKS

ACKNOWLEDGMENTS

REFERENCES

CHAPTER 6: SKELETON PROGRAMMING FOR PORTABLE MANY-CORE COMPUTING

6.1 INTRODUCTION

6.2 BACKGROUND: SKELETON PROGRAMMING

6.3 SKEPU: A TUNABLE SKELETON PROGRAMMING LIBRARY

6.4 SKELCL: A LIBRARY FOR HIGH-LEVEL MULTI-GPU PROGRAMMING

6.5 RELATED WORK

6.6 CONCLUDING REMARKS AND FUTURE WORK

REFERENCES

CHAPTER 7: DSL STREAM PROGRAMMING ON MULTICORE ARCHITECTURES

7.1 INTRODUCTION

7.2 A HIGH-LEVEL DSL: SLICES

7.3 INTERMEDIARY REPRESENTATION SJD

7.4 OPTIMIZING THE INTERMEDIATE REPRESENTATION

7.5 REDUCING INTERCORE COMMUNICATION COST

7.6 EVALUATION

7.7 CONCLUSION

REFERENCES

CHAPTER 8: PROGRAMMING WITH TRANSACTIONAL MEMORY

8.1 INTRODUCTION

8.2 CONCURRENCY MADE SIMPLE

8.3 TM LANGUAGE CONSTRUCTS

8.4 IMPLEMENTING A TM

8.5 PERFORMANCE LIMITATIONS

8.6 RECENT SOLUTIONS

8.7 CONCLUDING REMARKS

ACKNOWLEDGMENTS

REFERENCES

CHAPTER 9: OBJECT-ORIENTED STREAM PROGRAMMING

9.1 STREAM PROGRAMMING

9.2 OBJECT-ORIENTED STREAM PROGRAMMING

9.3 XJAVA

9.4 PERFORMANCE

9.5 EXPERIENCES

9.6 RELATED WORK

9.7 FUTURE WORK

9.8 SUMMARY

REFERENCES

CHAPTER 10: SOFTWARE-BASED SPECULATIVE PARALLELIZATION

10.1 INTRODUCTION

10.2 SPECULATIVE EXECUTION IN CORD

10.3 ADVANCED FEATURES OF CORD

10.4 RELATED WORK

10.5 FUTURE WORK

10.6 CONCLUDING REMARKS

REFERENCES

CHAPTER 11: AUTONOMIC DISTRIBUTION AND ADAPTATION

11.1 INTRODUCTION

11.2 PARALLEL PROGRAMMING MODELS

11.3 CONCURRENT CODE

11.4 CONCLUSIONS

REFERENCES

PART III: PROGRAMMING FRAMEWORKS

CHAPTER 12: PEPPHER: PERFORMANCE PORTABILITY AND PROGRAMMABILITY FOR HETEROGENEOUS MANY-CORE ARCHITECTURES

12.1 INTRODUCTION AND BACKGROUND

12.2 THE PEPPHER FRAMEWORK

12.3 THE PEPPHER METHODOLOGY

12.4 PERFORMANCE GUIDELINES AND PORTABILITY

12.5 FURTHER TECHNICAL ASPECTS

12.6 GUIDING APPLICATIONS AND BENCHMARKS

12.7 CONCLUSION

REFERENCES

CHAPTER 13: FASTFLOW: HIGH-LEVEL AND EFFICIENT STREAMING ON MULTICORE

13.1 FASTFLOW PRINCIPLES

13.2 FASTFLOW μ-TUTORIAL

13.3 PERFORMANCE

13.4 RELATED WORK

13.5 FUTURE WORK AND CONCLUSIONS

REFERENCES

CHAPTER 14: PARALLEL PROGRAMMING FRAMEWORK FOR H.264/AVC VIDEO ENCODING IN MULTICORE SYSTEMS

14.1 INTRODUCTION

14.2 PARALLEL PROGRAMMING FRAMEWORK FOR H.264/AVC VIDEO ENCODING

14.3 PROGRAMMING PARALLEL H.264 VIDEO ENCODERS

14.4 EVALUATION OF PARALLEL H.264 VIDEO ENCODERS

14.5 CONCLUDING REMARKS

ACKNOWLEDGMENTS

REFERENCES

CHAPTER 15: PARALLELIZING EVOLUTIONARY ALGORITHMS ON GPGPU CARDS WITH THE EASEA PLATFORM

15.1 INTRODUCTION

15.2 EASEA PARALLELIZATION OF EA ON GPGPU

15.3 EXPERIMENTS AND APPLICATIONS

15.4 CONCLUSION

REFERENCES

PART IV: TESTING, EVALUATION AND OPTIMIZATION

CHAPTER 16: SMART INTERLEAVINGS FOR TESTING PARALLEL PROGRAMS

16.1 INTRODUCTION

16.2 REVIEWS OF PARALLEL PROGRAMS

16.3 TESTING OF PARALLEL PROGRAMS

16.4 RELATED WORK

16.5 FUTURE WORK

16.6 CONCLUDING REMARKS

REFERENCES

PROBLEMS

CHAPTER 17: PARALLEL PERFORMANCE EVALUATION AND OPTIMIZATION

17.1 SEQUENTIAL VERSUS PARALLEL PERFORMANCE

17.2 THREAD OVERHEADS

17.3 CACHE COHERENCE OVERHEADS

17.4 SYNCHRONIZATION OVERHEADS

17.5 NONUNIFORM MEMORY ACCESS

17.6 OVERLAPPING LATENCY

17.7 DIAGNOSTIC TOOLS AND TECHNIQUES

17.8 SUMMARY

REFERENCES

CHAPTER 18: A METHODOLOGY FOR OPTIMIZING MULTITHREADED SYSTEM SCALABILITY ON MULTICORES

18.1 INTRODUCTION

18.2 MULTITHREADING AND SCALABILITY

18.3 CONTROLLED PERFORMANCE MEASUREMENTS

18.4 WORKLOAD DESIGN AND IMPLEMENTATION

18.5 QUANTIFYING SCALABILITY

18.6 CASE STUDY: MEMCACHED SCALABILITY

18.7 OTHER MULTITHREADED APPLICATIONS

18.8 CASE STUDY: DATA VALIDATION

18.9 SCALABILITY ON MANY-CORE ARCHITECTURES

18.10 FUTURE WORK

18.11 CONCLUDING REMARKS

REFERENCES

CHAPTER 19: IMPROVING MULTICORE SYSTEM PERFORMANCE THROUGH DATA COMPRESSION

19.1 INTRODUCTION

19.2 OUR APPROACH

19.3 EXPERIMENTAL EVALUATION

19.4 RELATED WORK

19.5 FUTURE WORK

19.6 CONCLUDING REMARKS

REFERENCES

PART V: SCHEDULING AND MANAGEMENT

CHAPTER 20: PROGRAMMING AND MANAGING RESOURCES ON ACCELERATOR-ENABLED CLUSTERS

20.1 INTRODUCTION

20.2 PROGRAMMING ACCELERATORS ON LARGE-SCALE CLUSTERS

20.3 MAPREDUCE FOR HETEROGENEOUS CLUSTERS

20.4 RESOURCE CONFIGURATION

20.5 RESOURCE MANAGEMENT ON CLUSTERS WITH ACCELERATORS

20.6 EVALUATION

20.7 CONCLUSION

REFERENCES

CHAPTER 21: AN APPROACH FOR EFFICIENT EXECUTION OF SPMD APPLICATIONS ON MULTICORE CLUSTERS

21.1 INTRODUCTION

21.2 SPMD APPLICATIONS ON MULTICORE CLUSTERS

21.3 METHODOLOGY FOR EFFICIENT EXECUTION

21.4 SCALABILITY AND EFFICIENCY OF SPMD APPLICATIONS

21.5 SPMD APPLICATIONS AND PERFORMANCE EVALUATION

21.6 RELATED WORKS

21.7 CONCLUSION AND FUTURE WORKS

REFERENCES

CHAPTER 22: OPERATING SYSTEM AND SCHEDULING FOR FUTURE MULTICORE AND MANY-CORE PLATFORMS

22.1 INTRODUCTION

22.2 OPERATING SYSTEM KERNEL MODELS

22.3 SCHEDULING

REFERENCES

GLOSSARY

INDEX

END USER LICENSE AGREEMENT


List of Illustrations

CHAPTER 1: MULTI- AND MANY-CORES, ARCHITECTURAL OVERVIEW FOR PROGRAMMERS

Figure 1.1 Flynn's taxonomy.

Figure 1.2 Multiprocessor memory architectures and programming models.

Figure 1.3 Technological trends for microprocessors. Simplified version of Figure 1 in [18].

Figure 1.4 The processor–memory gap (a) and a typical memory hierarchy (b).

Figure 1.5 Multicore processor generations: first (a), second (b), third (c).

Figure 1.6 Power 7 multicore, simplified block diagram.

Figure 1.7 ARM Cortex A15, simplified block diagram.

Figure 1.8 Sun UltraSPARC T2 architecture, simplified block diagram.

Figure 1.9 AMD Opteron Istanbul processor, simplified block diagram.

Figure 1.10 Intel Nehalem architecture – 4 cores, simplified block diagram.

Figure 1.11 Simplified block diagram of Cell/BE.

Figure 1.12 NVIDIA Fermi, simplified block diagram.

Figure 1.13 ARM Mali T604, simplified block diagram.

CHAPTER 3: LOCK-FREE CONCURRENT DATA STRUCTURES

Figure 3.1 Synchronization primitives.

CHAPTER 5: HYBRID/HETEROGENEOUS PROGRAMMING WITH OMPSS AND ITS SOFTWARE/HARDWARE IMPLICATIONS

Figure 5.1 FFT example with OmpSs task dependencies.

Figure 5.2 Block distribution for the arrays in the FFT example ( blocks) with task instantiation numbers.

Figure 5.3 Dynamically generated task graph for the FFT example ( blocks).

Figure 5.4 Task execution timeline for the FFT ( blocks) example with 8 threads.

Figure 5.5 Example of OmpSs task and loop worksharing dependencies.

Figure 5.6 FFT example with specialized task using CUFFT library for Nvidia GPUs.

Figure 5.7 Hybrid MPI/OmpSs example.

Figure 5.8 Nanos++ flow overview.

Figure 5.9 Scalability projections of hardware and software task decoders for various task granularities. The grey region in the middle of the Figure indicates the scalability limitations of software-based task decoders, based on the highly tuned software decoder of OmpSs running on a modern architecture.

CHAPTER 6: SKELETON PROGRAMMING FOR PORTABLE MANY-CORE COMPUTING

Figure 6.1 User function, macro expansion.

Figure 6.2 Average (100 runs) time of blurring quadratic grayscale images of different sizes. (a) Gaussian kernel applied once to the image. (b) Gaussian kernel applied nine times to the image.

Listing 6.1 A MapReduce example that computes the dot product.

Listing 6.2 User function used by MapOverlap when blurring an image.

Listing 6.3 Creation of a simple Map skeleton in SkelCL.

Listing 6.4 Implementation of a scalar multiplication using a Map skeleton and additional arguments.

Figure 6.3 Times for running different LibSolve solvers for and 1000 with the BRUSS2D-MIX problem.

Figure 6.4 Defining an execution plan and applying it to a Reduce skeleton.

Figure 6.5 Vector sum with Reduce on target architecture 1 (a) and architecture 2 (b). TUNE uses an empirically determined execution plan for each architecture.

Listing 6.5 Computing a Mandelbrot fractal using SkelCL.

Listing 6.6 Implementation of LM OSEM using SkelCL.

Figure 6.6 Mandelbrot example: comparison of runtime (a) and lines of code (b) of implementations using OpenCL, CUDA and SkelCL on a platform with one GPU.

Figure 6.7 LM OSEM example: comparison of runtime (a) and lines of code (b) of implementations using OpenCL, CUDA and SkelCL on a platform with four GPUs.

CHAPTER 7: DSL STREAM PROGRAMMING ON MULTICORE ARCHITECTURES

Figure 7.1 Excerpt of a StreamIt program for matrix multiplication.

Figure 7.2 Matrix multiplication in Array-OL as viewed in the IDE Gaspard2 [8].

Figure 7.3 SLICES program that captures a matrix multiplication communication pattern. First the raw stream inputs are cast to shape to view them as matrices. Then the rows of A are paired with the columns of B in the iterator loop and pushed to the output. In the graphical examples we have chosen x0 = 5 and y0 = 4.

Figure 7.4 Example of data reorganizations enabled by SJD. (The split and join consumptions and productions are always 1 in this graph.)

Figure 7.5 Multidimensional grids and blocks extraction.

Figure 7.6 Compiling 1D blocks that partially overlap.

Figure 7.7 Intermediate SJD representation equivalent to the SLICES matrix multiplication program (Figure 7.3). The intermediate representation was automatically generated by our SLICES to SJD compiler.

Figure 7.8 Set of transformations considered. Each transformation is defined by a graph rewriting rule. Node N is a wild card for any arity-compatible node.

Figure 7.9 Reducing the intercore communication through an iterative process.

Figure 7.10 Optimizations applied to MM-COARSE. In the original program the inter-core communication cost is high since the program presents many synchronization points: for example, at runtime J8 and S15 quickly become communication bottlenecks. After optimization, J8 and S15 have been split; the transposition of matrix B has been divided in blocks and distributed among the four cores; duplicate nodes (nodes D205, D208, etc.) have also been distributed among the cores; since the consumers of the duplicated streams are all in the same core, duplications can be compiled to multiple reads to the same buffer and are not added to the intercore traffic. (a) Partitioning of original MM-COARSE on 4 cores. (b) Partitioning of optimized MM-COARSE on 4 cores.

Figure 7.11 Impact of the communication reductions on performance: speedups of the original program and the optimized program on an SMP Nehalem quadcore. The execution times are normalized to the reference (StreamIt original program execution time on a single core).

CHAPTER 8: PROGRAMMING WITH TRANSACTIONAL MEMORY

Figure 8.1 The five chopsticks shared by the five dining philosophers to eat rice.

Figure 8.2 Sorted (lexicographically) linked list example in which two operations with same accesses have distinct semantics: contains(z) is a parse operation whereas size() is a snapshot operation.

CHAPTER 9: OBJECT-ORIENTED STREAM PROGRAMMING

Figure 9.1 A stream graph.

Figure 9.2 Search-based auto-tuning of a program P for three different machines M1, M2, M3.

Figure 9.3 Inferring tuning parameters and context information to enable parameter prediction on different machines.

Figure 9.4 Inferring tuning parameters and context information from XJava code.

Figure 9.5 Fusing replicable stages of a pipeline reduces the overhead for splitting and joining data streams. (a) Stage replication without fusion. (b) Fusing replicable stages.

Figure 9.6 Tuning two master/worker configurations depending on workload w.

CHAPTER 10: SOFTWARE-BASED SPECULATIVE PARALLELIZATION

Figure 10.1 Maintaining memory state.

Figure 10.2 Misspeculation detection.

Figure 10.3 Partitioning a loop into prologue, speculative body and epilogue.

Figure 10.4 Kernel of benchmark telecomm-CRC32.

Figure 10.5 Execution speedups for SPEC and MiBench programs on a dual quadcore 3.0 GHz Xeon server.

Figure 10.6 A speculation example.

Figure 10.7 Thread execution model.

Figure 10.8 Discard-all recovery.

Figure 10.9 Incremental recovery.

Figure 10.10 Decoupling space allocation from thread creation.

Figure 10.11 Heap prefix format.

CHAPTER 11: AUTONOMIC DISTRIBUTION AND ADAPTATION

Figure 11.1 A dependency graph derived from code behavior analysis. Edges on the left denote dataflow and on the right workflow (simplified).

Figure 11.2 Potential segmentation of the reduced graph (simplified).

Figure 11.3 Code to infrastructure mapping principles.

Figure 11.4 Rearrangement of concurrent logical segments for maximum speedup.

CHAPTER 12: PEPPHER: PERFORMANCE PORTABILITY AND PROGRAMMABILITY FOR HETEROGENEOUS MANY-CORE ARCHITECTURES

Figure 12.1 The PEPPHER framework for creating performance portable software across a variety of heterogeneous many-core systems. The application is described as a set of (generic) components with annotations specifying performance and other nonfunctional properties (left). A set of component variants is generated (e.g. by source-to-source transformation) or chosen (from expert-written libraries) for different architectures and configurations (middle). For each concrete, heterogeneous architecture, appropriate component variants are selected and compiled into codelets that are scheduled efficiently at run-time (right) using static and dynamic performance information. Performance models are derived from abstract hardware descriptions and component annotations.

Figure 12.2 Generic PEPPHER component interface for a sorting component.

Figure 12.3 Implementation of a PEPPHER sorting component by a merge sort algorithm. The pseudocode notation with C++ language extensions used in this example is one of several possibilities considered for the PEPPHER annotation framework. Alternatives keep the annotations separate from the actual code.

Figure 12.4 A specialized GPU sorting component.

Figure 12.5 Impact of different StarPU scheduling policies (greedy, heft-tm, heft-tm-pr and heft-tmdp-pr) on the LU decomposition example.

CHAPTER 13: FASTFLOW: HIGH-LEVEL AND EFFICIENT STREAMING ON MULTICORE

Figure 13.1 FastFlow-layered architecture with abstraction examples at the different layers of the stack.

Figure 13.2 Hello world pipeline.

Figure 13.3 Hello world farm.

Figure 13.4 Execution time of the farm pattern with different allocators: libc versus TBB versus FastFlow allocator on 8-core Intel (a) and 48-core AMD (b) using a computational grain of 1 µs.

Figure 13.5 Speedup of the farm (a) and throughput in thousands of messages per second (Kmsg/s) of the pipeline (b) paradigms for several computational grains.

Figure 13.6 SWPS3 versus SWPS3-FF performance for 102k gap penalties evaluated on release 57.5 of UniProtKB/Swiss-Prot.

Figure 13.7 YaDT-FF speedup using several standard training sets. Superlinear speedup is due to the fact that the farm emitter performs a minimal amount of work contributing to the final result, but the parallelism degree given is related to the worker threads only.

Figure 13.8 (a) Scalability of StockKit-FF(n) against StockKit(1), where n is the number of worker threads. (b) Execution time (T) and speedup (Sp) over bzip2 in the case of a stream of 1078 files: 86% small (0–1 MBytes), 9% medium (1–10 MBytes), 4% large (10–50 MBytes) and 1% very large (50–100 MBytes). pbzip2 uses 16 threads. pbzip2_ff uses 16 threads for each accelerator. Comp stands for compression and Dec for decompression.

CHAPTER 14: PARALLEL PROGRAMMING FRAMEWORK FOR H.264/AVC VIDEO ENCODING IN MULTICORE SYSTEMS

Figure 14.1 H.264/AVC encoding loop.

Figure 14.2 Parallel video coding architectures. (a) Homogeneous. (b) Heterogeneous, with embedded accelerators. (c) Heterogeneous, with GPU accelerators.

Figure 14.3 Exploited data-level parallelism models. (a) Frame level. (b) Slice level. (c) Macroblock level.

Figure 14.4 Parallel topologies to implement the encoding modules depicted in Figure 14.1 by using functional-level parallelism. (a) Pipeline. (b) Straight parallel. (c) Mixed.

Figure 14.5 Simultaneous exploitation of the slice and macroblock parallelism levels. (a) Thread allocation. (b) Thread synchronization.

Figure 14.6 Possible slice-scattering distribution in a multi-core architecture.

Figure 14.7 Considered set of test video sequences. (a) Soccer. (b) Harbour. (c) Crew. (d) City.

Figure 14.8 Provided speedup on the Intel platform using slice-level parallelism.

CHAPTER 15: PARALLELIZING EVOLUTIONARY ALGORITHMS ON GPGPU CARDS WITH THE EASEA PLATFORM

Figure 15.1 Basic evolutionary algorithm.

Figure 15.2 ASREA flowchart.

Figure 15.3 Impact of population size, number of iterations and dimensions on execution time and overhead. (a) Impact of population size and evaluation complexity on execution time. (b) Impact of individual size on transfer time overhead.

Figure 15.4 Speedup on an average of 10 runs for several GPGPU cards versus an Intel i7-950 on a 10-dimension 120-iteration Weierstrass benchmark.

Figure 15.5 Speedup obtained with the island model on Weierstrass with 1000 dimensions, h = 0.35, for 81200 individuals.

Figure 15.6 Speedup factor for symbolic regression wrt different parameters. (a) Speedup wrt tree depth and learning set size, for the evaluation function only. (b) Speedup wrt number of fitness cases.

Figure 15.7 Influence of function set computational intensity.

Figure 15.8 Attainment surface plots of ASREA, NSGA-II and SPEA2 for ZDT functions. (a) ZDT3 at 1000 EVAL. (b) ZDT3 at 10,000 EVAL.

Figure 15.9 Speedup for ZDT3 function on population sizes 100, 1000 and 10000.

Figure 15.10 Speedup obtained on the airplane problem with the EASEA tree-based GP implementation.

Figure 15.11 Difference between original and reconstructed trajectory.

CHAPTER 17: PARALLEL PERFORMANCE EVALUATION AND OPTIMIZATION

Figure 17.1 CPU Utilization view in CV showing poor concurrency in execution.

Figure 17.2 Thread synchronization analysis using the Concurrency Visualizer.

CHAPTER 18: A METHODOLOGY FOR OPTIMIZING MULTITHREADED SYSTEM SCALABILITY ON MULTICORES

Figure 18.1 Schematic of scalability load measurement configuration.

Figure 18.2 Steady-state measurement of instantaneous throughput, for a particular load value, represented as a time series to show the ramp-up and ramp-down phases.

Figure 18.3 Physical components of scalability.

Figure 18.4 Throughput of three MCD releases up threads.

Figure 18.5 USL regression analysis of memcached 1.2.8 data (dots) in Figure 18.4.

Figure 18.6 Comparison of scalability data for standard MCD (lower curve) and patched MCD (upper curve).

Figure 18.7 USL model of patched MCD data in Figure 18.6 extended out to threads.

Figure 18.8 Initial J2EE throughput data (left) and subsequent data (right) measured in requests per second (RPS).

Figure 18.9 USL models of J2EE scalability for initial data (upper curve) and subsequent data (lower curve).

Figure 18.10 Web application scalability data.

Figure 18.11 USL fit to NVIDIA Tesla GPU data.

CHAPTER 19: IMPROVING MULTICORE SYSTEM PERFORMANCE THROUGH DATA COMPRESSION

Figure 19.1 The multicore architecture considered in this chapter and the off-chip memory space.

Figure 19.2 (a) Different scenarios for data when its current use is over. (b) Comparison of on-demand decompression and predecompression. Arrows indicate the execution timeline of the program.

Figure 19.3 (a) Normalized execution cycles with the static strategy using different values. (b) Comparison of the best static strategy (for each benchmark) and the dynamic strategy.

Figure 19.4 Processor usage for the dynamic strategy over the execution period.

Figure 19.5 Breakdown of overheads into three different components for the static and dynamic schemes.

Figure 19.6 Sensitivity of the dynamic strategy to the starting value for the swim benchmark.

Figure 19.7 Sensitivity of the dynamic strategy to values. (a) apsi. (b) mgrid.

Figure 19.8 Sensitivity of the dynamic strategy to values. (a) apsi. (b) mgrid.

Figure 19.9 Sensitivity of the dynamic strategy to the sampling period ().

Figure 19.10 Sensitivity of the dynamic strategy to the block size.

Figure 19.11 Sensitivity of the static and dynamic strategies to the processor counts.

Figure 19.12 Results with a hardware-based cache memory.

CHAPTER 20: PROGRAMMING AND MANAGING RESOURCES ON ACCELERATOR-ENABLED CLUSTERS

Figure 20.1 Resource configurations for enabling asymmetric clusters. (a) Conf I: Self-managed well-provisioned accelerators. (b) Conf II: Resource-constrained well-provisioned accelerators. (c) Conf III: Resource-constrained shared-driver accelerators. (d) Conf IV: Mixed accelerators.

Figure 20.2 High-level overview of an accelerator-based asymmetric-distributed system.

Figure 20.3 Interactions between manager and compute node components.

Figure 20.4 State machine for scheduler learning and execution process.

Figure 20.5 Execution time of linear regression with increasing input size.

Figure 20.6 Execution time of word count while simultaneously executing histogram.

Figure 20.7 Execution time of histogram while simultaneously executing word count.

Figure 20.8 Framework scalability.

CHAPTER 21: AN APPROACH FOR EFFICIENT EXECUTION OF SPMD APPLICATIONS ON MULTICORE CLUSTERS

Figure 21.1 Communication and computation of SPMD tile on multicore.

Figure 21.2 Examples of SPMD applications and communications patterns.

Figure 21.3 SPMD applications on multicore cluster.

Figure 21.4 Supertile creation and methodology objective.

Figure 21.5 Methodology for efficient execution of SPMD applications.

Figure 21.6 Mapping and tile distribution.

Figure 21.7 Scheduling phase.

Figure 21.8 Performance evaluation theoretical example.

Figure 21.9 Characterization analysis for a double quadcore cluster.

Figure 21.10 Tile computation and communication ratio characterization.

Figure 21.11 Efficiency evaluation heat transfer app.

Figure 21.12 Speedup evaluation heat transfer application.

Figure 21.13 Efficiency evaluation LL-2D-STD application.

Figure 21.14 Speedup evaluation LL-2D-STD App.

List of Tables

CHAPTER 2: PROGRAMMING MODELS FOR MULTICORE AND MANY-CORE COMPUTING SYSTEMS

Table 2.1 Computing properties of many-core hardware.

Table 2.2 Memory properties of many-core hardware.

Table 2.3 An overview of the usability-related features of the programming models.

Table 2.4 An overview of the design-support features that the models under survey have (see Section 2.3.3 for more explanations).

Table 2.5 An overview of the implementation-support features that the models under survey offer (see Section 2.3.3 for more explanations).

Table 2.6 A qualitative comparison of the surveyed programming models. For each feature, we use one of five qualifiers, ++ (very good) to −− (very bad).

CHAPTER 3: LOCK-FREE CONCURRENT DATA STRUCTURES

Table 3.1 Properties of different approaches to nonblocking memory reclamation.

CHAPTER 7: DSL STREAM PROGRAMMING ON MULTICORE ARCHITECTURES

Table 7.1 Intercore communication cost reduction. ‘original’ and ‘optimized’ represent the intercommunication volume per steady state for the original program and the optimized program, in bytes. ‘C reduction’ is the reduction percentage of intercore communications, computed as (original − optimized)/original × 100.

CHAPTER 9: OBJECT-ORIENTED STREAM PROGRAMMING

Table 9.1 Splitting and joining streams

CHAPTER 12: PEPPHER: PERFORMANCE PORTABILITY AND PROGRAMMABILITY FOR HETEROGENEOUS MANY-CORE ARCHITECTURES

Table 12.1 Example of acceleration on a double precision LU decomposition with partial pivoting (DGETRF) on a 3.2 GB matrix

CHAPTER 13: FASTFLOW: HIGH-LEVEL AND EFFICIENT STREAMING ON MULTICORE

Table 13.1 Average latency time and standard deviation (in nanoseconds) of a push/pop operation on an SPSC queue with 1024 slots for 1M insert/extractions, on the 8-core (16-context) Intel and on the 48-core AMD

Table 13.2 Patterns and streams used in real-world applications

CHAPTER 14: PARALLEL PROGRAMMING FRAMEWORK FOR H.264/AVC VIDEO ENCODING IN MULTICORE SYSTEMS

Table 14.1 Example configuration using a pipelined architecture to implement the eight interprediction modes in a variable number of cores

Table 14.2 gprof profiling results of the H.264/AVC reference SW

Table 14.3 Specifications of the considered parallel computational platform

Table 14.4 Considered H.264/AVC encoding parameters

Table 14.5 Memory allocation for the reference and optimized SW versions

Table 14.6 Provided speedup using

slice

and

macroblock

parallelism levels, by considering a maximum of 4 slices (to avoid bit-rate degradation [10])

CHAPTER 17: PARALLEL PERFORMANCE EVALUATION AND OPTIMIZATION

Table 17.1 Example of cache coherence problem

CHAPTER 18: A METHODOLOGY FOR OPTIMIZING MULTITHREADED SYSTEM SCALABILITY ON MULTICORES

Table 18.1 Interpretation of queueing metrics

Table 18.2 Memcached scalability parameters

Table 18.3 J2EE application scalability parameters

Table 18.4 Preparation for USL analysis of the data in Figure 18.10

CHAPTER 19: IMPROVING MULTICORE SYSTEM PERFORMANCE THROUGH DATA COMPRESSION

Table 19.1 The base simulation parameters used in our experiments

Table 19.2 The benchmark codes used in this study and important statistics. In obtaining these statistics, the reference input sets are used

CHAPTER 21: AN APPROACH FOR EFFICIENT EXECUTION OF SPMD APPLICATIONS ON MULTICORE CLUSTERS

Table 21.1 Efficiency, speedup and scalability analysis for an SPMD app

Table 21.2 Characterization values

Table 21.3 Analytical values (time expressed in seconds)

Wiley Series on Parallel and Distributed Computing

 

Series Editor: Albert Y. Zomaya

 

A complete list of the titles in this series appears at the end of this volume.

PROGRAMMING MULTICORE AND MANY-CORE COMPUTING SYSTEMS

 

Edited by

 

Sabri Pllana

Fatos Xhafa

 

 

 

Copyright © 2017 by John Wiley & Sons, Inc. All rights reserved

Published by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data

Names: Pllana, Sabri, editor. | Xhafa, Fatos, editor.

Title: Programming multicore and many-core computing systems / edited by Sabri Pllana, Fatos Xhafa.

Description: First edition. | Hoboken, New Jersey : John Wiley & Sons, [2017] | Includes bibliographical references and index.

Identifiers: LCCN 2016038244 | ISBN 9780470936900 (cloth) | ISBN 9781119332008 (epub) | ISBN 9781119331995 (Adobe PDF)

Subjects: LCSH: Parallel programming (Computer science) | Coprocessors-Programming.

Classification: LCC QA76.642 .P767 2017 | DDC 005.2/75-dc23 LC record available at https://lccn.loc.gov/2016038244

Cover Image: Creative-idea/Gettyimages

Cover design by Wiley

LIST OF CONTRIBUTORS

Marco Aldinucci, Computer Science Department, University of Torino, Corso Svizzera 185, 10149 Torino, Italy.

Jørn Amundsen, Norwegian University of Science and Technology, Sem Sælands vei 7-9, NO-7491 Trondheim, Norway.

Eduard Ayguade, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Rosa M. Badia, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Henri E. Bal, Department of Computer Science, Vrije Universiteit, De Boelelaan 1081A, 1081 HV, Amsterdam, The Netherlands.

Denis Barthou, INRIA Bordeaux Sud-Ouest, 200 Avenue de la Vieille Tour, 33405 Talence Cedex, France.

Pieter Bellens, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Siegfried Benkner, Faculty of Computer Science, University of Vienna, Währingerstrasse 29, A-1090 Vienna, Austria.

Daniel Rubio Bonilla, Department for Intelligent Service Infrastructures, HLRS, University of Stuttgart, Nobelstr. 19, 70569 Stuttgart, Germany.

Javier Bueno, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Ali R. Butt, Virginia Tech, 2202 Kraft Drive (0106), Blacksburg, VA 24061, USA.

Daniel Cederman, Department of Computer Science and Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden.

Pierre Collet, Strasbourg University, Pôle API, Bd Sébastien Brant, BP 10413, 67412 Illkirch CEDEX, France.

Herbert Cornelius, Intel GmbH, Dornacher Strasse 1, D-85622 Feldkirchen, Germany.

Tommaso Cucinotta, ReTiS Lab, Scuola Superiore Sant'Anna, CEIICP, Att.ne Francesca Gattai, Via Moruzzi 1, 56124 Pisa, Italy.

Marco Danelutto, Computer Science Department, University of Pisa, Largo Pontecorvo 3, 56127 Pisa, Italy.

Usman Dastgeer, IDA, Linköping University, S-58183 Linköping, Sweden.

Pablo de Oliveira Castro, CEA, LIST, Université de Versailles, 45 Avenue des États-Unis, Versailles, 78035, France.

Alejandro Duran, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Mujahed Eleyat, Norwegian University of Science and Technology (NTNU), Sem Sælands vei 7-9, NO-7491 Trondheim, Norway.

Johan Enmyren, IDA, Linköping University, S-58183 Linköping, Sweden.

Yoav Etsion, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Eitan Farchi, IBM, 49 Hashmonaim Street, Pardes Hanna 37052, Israel.

Montse Farreras, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Min Feng, Department of Computer Science, The University of California at Riverside, Engineering Bldg. Unit 2, Rm. 463, Riverside, CA 92521, USA.

Roger Ferrer, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Anders Gidenstam, School of Business and Informatics, Högskolan i Borås, S-50190 Borås, Sweden.

Sergei Gorlatch, FB10, Universität Münster, D-48149 Münster, Germany.

Vincent Gramoli, LPD, EPFL IC, Station 14, CH-1015 Lausanne, Switzerland.

Rachid Guerraoui, LPD, EPFL IC, Station 14, CH-1015 Lausanne, Switzerland.

Neil Gunther, Performance Dynamics Company, 4061 East Castro Valley Blvd., Suite 110, Castro Valley, CA 95954, USA.

Rajiv Gupta, Department of Computer Science, The University of California at Riverside, Engineering Bldg. Unit 2, Rm. 408, Riverside, CA 92521, USA.

Phuong Ha, Department of Computer Science, Faculty of Science, University of Tromsø, NO-9037 Tromsø, Norway.

Pieter Hijma, Department of Computer Science, Vrije Universiteit, De Boelelaan 1081A, 1081 HV, Amsterdam, The Netherlands.

Alexandru Iordan, Norwegian University of Science and Technology (NTNU), Sem Sælands vei 7-9, NO-7491 Trondheim, Norway.

Magnus Jahre, Norwegian University of Science and Technology (NTNU), Sem Sælands vei 7-9, NO-7491 Trondheim, Norway.

Mahmut Kandemir, The Pennsylvania State University, 111 IST Building, University Park, PA 16802, USA.

Philipp Kegel, FB10, Universität Münster, D-48149 Münster, Germany.

Christoph Kessler, IDA, Linköping University, S-58183 Linköping, Sweden.

Peter Kilpatrick, School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast BT7 1NN, UK.

Jesus Labarta, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Nicolas Lachiche, Strasbourg University, Pôle API, Bd Sébastien Brant, BP 10413, 67412 Illkirch CEDEX, France.

Giuseppe Lipari, Real-Time Systems Laboratory, Scuola Superiore Sant'Anna, Pisa, Italy.

Stéphane Louise, CEA, LIST, Gif-sur-Yvette, 91191, France.

Emilio Luque, Computer Architecture and Operating System Department, Universitat Autonoma de Barcelona, 08193 Barcelona, Spain.

Ogier Maitre, Strasbourg University, Pôle API, Bd Sébastien Brant, BP 10413, 67412 Illkirch CEDEX, France.

Sandya Mannarswamy, Xerox Research Centre India, 225 1st C Cross, 2nd Main, Kasturi Nagar, Bangalore 560043, India.

Vladimir Marjanovic, Barcelona Supercomputing Center, Nexus-2 Building, Jordi Girona 29, 08034 Barcelona, Spain.

Lluis Martinell, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Xavier Martorell, Barcelona Supercomputing Center, Nexus-2 Building, Jordi Girona 29, 08034 Barcelona, Spain.

David Moloney, Movidius Ltd., Mountjoy Square East 19, D1 Dublin, Ireland.

Ronal Muresano, Computer Architecture and Operating System Department, Universitat Autonoma de Barcelona, 08193 Barcelona, Spain.

M. Mustafa Rafique, Virginia Tech, 2202 Kraft Drive (0106), Blacksburg, Virginia 24061, USA.

Raymond Namyst, Institut National de Recherche en Informatique et en Automatique (INRIA), Bordeaux Sud-Ouest, Cours de la Liberation 351, F-33405 Talence Cedex, France.

Lasse Natvig, Norwegian University of Science and Technology (NTNU), Sem Sælands vei 7-9, NO-7491 Trondheim, Norway.

Dimitrios S. Nikolopoulos, FORTH-ICS, N. Plastira 100, Vassilika Vouton, Heraklion, Crete, Greece.

Frank Otto, Institute for Program Structures and Data Organization, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany.

Ozcan Ozturk, Department of Computer Engineering, Engineering Building, EA 407B, Bilkent University, Bilkent, 06800 Ankara, Turkey.

Marina Papatriantafilou, Department of Computer Science and Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden.

Stefan Parvu, Nokia Group, Karakaari 15, Espoo, Finland.

Josep M. Perez, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Judit Planas, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Sabri Pllana, Department of Computer Science, Linnaeus University, SE-351 95 Växjö, Sweden.

Stephane Querry, Pôle API, Bd Sébastien Brant, BP 10413, 67412 Illkirch CEDEX, France.

Alex Ramirez, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Dolores Rexachs, Computer Architecture and Operating System Department, Universitat Autonoma de Barcelona, 08193 Barcelona, Spain.

Andrew Richards, Codeplay Software Limited, York Place 45, EH1 3HP Edinburgh, United Kingdom.

António Rodrigues, TU Lisbon/IST/INESC-ID, Rua Alves Redol 9, 1000-029 Lisboa, Portugal.

Nuno Roma, TU Lisbon/IST/INESC-ID, Rua Alves Redol 9, 1000-029 Lisboa, Portugal.

Peter Sanders, Karlsruher Institut für Technologie, Am Fasanengarten 5, D-76128 Karlsruhe, Germany.

Lutz Schubert, Department for Intelligent Service Infrastructures, HLRS, University of Stuttgart, Nobelstr. 19, 70569 Stuttgart, Germany.

Hazim Shafi, One Microsoft Way, Redmond, WA 98052, USA.

Deepak Sharma, Pôle API, Bd Sébastien Brant, BP 10413, 67412 Illkirch CEDEX, France.

Leonel Sousa, TU Lisbon/IST/INESC-ID, Rua Alves Redol 9, 1000-029 Lisboa, Portugal.

Michel Steuwer, FB10, Universität Münster, D-48149 Münster, Germany.

Shanti Subramanyam, Yahoo! Inc., 701 First Ave, Sunnyvale, CA 94089, USA.

Håkan Sundell, School of Business and Informatics, Högskolan i Borås, S-50190 Borås, Sweden.

Xavier Teruel, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Chen Tian, Department of Computer Science, The University of California at Riverside, Engineering Bldg. Unit 2, Rm. 463, Riverside, CA 92521, USA.

Walter F. Tichy, Institute for Program Structures and Data Organization, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany.

Massimo Torquati, Computer Science Department, University of Pisa, Largo Pontecorvo 3, 56127 Pisa, Italy.

Jesper Larsson Träff, Research Group Parallel Computing, Vienna University of Technology, Favoritenstrasse 16/184-5, A-1040 Vienna, Austria.

Ioanna Tsalouchidou, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Philippas Tsigas, Department of Computer Science and Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden.

Mateo Valero, Barcelona Supercomputing Center, Nexus-2 Building, 3rd Floor, Jordi Girona 29, 08034 Barcelona, Spain.

Rob V. Van Nieuwpoort, Department of Computer Science, Vrije Universiteit, De Boelelaan 1081A, 1081 HV, Amsterdam, The Netherlands.

Ana Lucia Varbanescu, Department of Software Technologies, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands.

Stefan Wesner, HLRS, Department of Applications and Visualization, University of Stuttgart, Nobelstr. 19, 70569 Stuttgart, Germany.

PREFACE

Multicore and many-core computing systems have emerged as an important paradigm in high-performance computing (HPC) and have significantly propelled the development of advanced parallel and distributed applications as well as of embedded systems. Multicore processors are now ubiquitous: from processors with 2 or 4 cores in the 2000s, the trend of increasing the number of cores continues, and processors with hundreds or even thousands of (lightweight) cores are becoming commonplace, optimizing not only performance but also energy. However, this disruptive technology (also referred to as the ‘continuum computing paradigm’) presents several major challenges, such as increased effort and system-specific skills for porting and optimizing application codes, managing and exploiting massive parallelism and system heterogeneity to achieve increased performance, innovative modeling strategies for low-power simulation, etc. Among these, we would single out the challenge of mastering multicore, many-core and heterogeneous systems – this is precisely the focus of this book!

The emergence of multicore processors has helped in addressing several problems related to single-core processors – known as the memory wall, power wall and instruction-level parallelism wall – but it poses several other ‘walls’, such as the programmability wall or the coherency wall. Among these, the programmability wall is a long-standing challenge. Indeed, on the one hand, program development for multicore processors, especially for heterogeneous multicore processors, is significantly more complex than for single-core processors. On the other hand, programmers have traditionally been trained for the development of sequential programs, and only a small percentage of them have experience with parallel programming.

In fact, in the past only a relatively small group of programmers interested in HPC was concerned with parallel programming issues; the situation has changed dramatically with the appearance of multicore processors in commonly used computing systems. Traditionally, parallel programs in the HPC community have been developed by heroic programmers using a simple text editor as the programming environment, programming at a low level of abstraction and doing manual performance optimization. It is expected that with the pervasiveness of multicore processors parallel programming will become mainstream, but it cannot be expected that a mainstream programmer will prefer to use the traditional HPC methods and tools.

The main objective of this book is to present a comprehensive view of the state-of-the-art parallel programming methods, techniques and tools, to aid programmers in mastering the efficient programming of multicore and many-core systems. The book comprises a selection of twenty-two chapter contributions by experts in the field of multicore and many-core systems that cover fundamental techniques and algorithms, programming approaches, methodologies and frameworks, task/application scheduling and management, testing and evaluation methodologies and case studies for programming multicore and many-core systems. Lessons learned, challenges and the road map ahead are also given and discussed throughout the chapters.

The content of the book is arranged into five parts:

Part I: Foundations

The first part of the book covers fundamental issues in programming multicore and many-core computing systems. Across four chapters the authors discuss the state of the art in multi- and many-core architectures, programming models, concurrent data structures and memory allocation, scheduling and management.

Natvig et al. in the first chapter, ‘Multi- and many-cores, architectural overview for programmers’, provide a broad overview of the fundamental parallel techniques, parallel taxonomies and various ‘walls’ in programming multicore/many-core computing systems such as ‘power wall’, ‘memory wall’ and ‘ILP (instruction level parallelism) wall’. The authors also discuss the challenges of the heterogeneity in multicore/many-core computing systems. They conclude by stressing the need for more research in parallel programming models to meet the five P's of parallel processing – performance, predictability, power efficiency, programmability and portability – when building and programming multicore and many-core computing systems.

Varbanescu et al. in the second chapter, ‘Programming models for multicore and many-core computing systems’, survey a comprehensive set of programming models for most popular families of many-core systems, including both specific and classical parallel models for multicore and many-core platforms. The authors have introduced four classes of reference features for model evaluation: usability, design support, implementation support and programmability. Based on these features, a multidimensional comparison of the surveyed models is provided aiming to identify the essential characteristics that separate or cluster these models. The authors conclude by emphasizing the influence that the choice of a programming model can have on the application design and implementation and give a few guidelines for finding a programming model that matches the application characteristics.

The third chapter by Cederman et al., ‘Lock-free concurrent data structures’, deals with the use of concurrent data structures in parallel programming of multicore and many-core systems. Several issues such as maintaining consistency in the presence of many simultaneous updates are discussed and lock-free implementations of data structures that support concurrent access are given. Lock-free concurrent data structures are shown to support the design of lock-free algorithms that scale much better when the number of processes increases. A set of fundamental synchronization primitives is also described together with challenges in managing dynamically allocated memory in a concurrent environment.
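
As a concrete illustration of what the chapter covers, the following is a minimal sketch in C++ of a lock-free stack built on the compare-and-swap (CAS) primitive; it is not taken from the chapter. Popped nodes are deliberately leaked here, because reclaiming them safely is precisely the hard problem that the chapter's treatment of memory management for concurrent data structures addresses.

// Minimal lock-free (Treiber-style) stack sketch using CAS retry loops.
#include <atomic>

template <typename T>
class LockFreeStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head{nullptr};
public:
    void push(T v) {
        Node* n = new Node{std::move(v), head.load(std::memory_order_relaxed)};
        // Retry until no other thread changed head between our read and the CAS.
        while (!head.compare_exchange_weak(n->next, n,
                   std::memory_order_release, std::memory_order_relaxed)) {}
    }
    bool pop(T& out) {
        Node* n = head.load(std::memory_order_acquire);
        while (n && !head.compare_exchange_weak(n, n->next,
                   std::memory_order_acquire, std::memory_order_relaxed)) {}
        if (!n) return false;
        out = std::move(n->value);
        return true;  // node intentionally leaked: safe reclamation needs
                      // hazard pointers or similar, as the chapter discusses
    }
};

Because no thread ever blocks holding a lock, some thread always makes progress, which is why such structures scale better under contention than lock-based ones.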

Mannarswamy in the fourth chapter, ‘Software transactional memory’, addresses the main challenges in writing concurrent code for multicore and many-core computing systems. In particular, the author focuses on coordinating access to data shared by multiple concurrent threads. The software transactional memory (STM) programming paradigm for shared-memory multithreaded programs is then introduced. STM is intended to facilitate the development of complex concurrent software, as an alternative to conventional lock-based synchronization primitives, by reducing the burden of programming complexity involved in writing concurrent code. The need for addressing performance bottlenecks and improving application performance on STM is also discussed as a major research issue in the field.
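
The programmer-facing idea can be sketched as follows in C++. The atomically helper below is hypothetical and, for brevity, is trivially implemented with one global lock; a real STM instead runs transactions optimistically, tracks read and write sets, and rolls back and retries on conflict, which is where the performance questions mentioned above arise.

// Toy sketch of a transactional interface; a global mutex stands in for
// the real STM machinery (conflict detection, rollback, retry).
#include <mutex>

std::mutex stm_global;  // placeholder for the STM runtime

template <typename F>
auto atomically(F&& txn) {
    std::lock_guard<std::mutex> g(stm_global);  // a real STM is optimistic
    return txn();
}

// Usage: transfer between two accounts with no intermediate state visible.
void transfer(long& from, long& to, long amount) {
    atomically([&] { from -= amount; to += amount; });
}

The appeal for the programmer is compositionality: the transaction body simply declares what must appear indivisible, with no lock-ordering discipline to get wrong.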

Part II: Programming Approaches

The second part of the book is devoted to programming approaches for multicore and many-core computing systems. This part comprises seven chapters that cover a variety of programming approaches including heterogeneous programming, skeleton programming, DSL and object-oriented stream programming and programming with transactional memory.

The fifth chapter, ‘Heterogeneous programming with OMPSs and its implications’, by Ayguadé et al. discusses programming models for heterogeneous architectures that aim to ease asynchrony and to increase the parallelization, modularity and portability of applications. The authors present the OmpSs model, which extends the OpenMP 3.0 programming model, and show how it leverages MPI and OpenCL/CUDA for mastering the efficient programming of clustered heterogeneous multi-/many-core systems. The implementation of OmpSs is presented, together with a discussion of the intelligence that needs to be embedded in the runtime system to effectively lower the programmability wall and of the opportunities to implement new mechanisms and policies; some overheads related to task management in OmpSs are pointed out for further investigation.

Kessler et al. in the sixth chapter, ‘Skeleton programming for portable many-core computing’, consider skeleton programming (‘data-parallel skeletons’) as a model to solve the portability problem that arises in multi- and many-core programming and to increase the level of abstraction in such programming environments. After overviewing the concept of algorithmic skeletons, the authors give a detailed description of two recent approaches for programming emerging heterogeneous many-core systems, namely, SkePU and SkelCL. Some other skeleton programming frameworks, which share ideas with SkePU and SkelCL but address a narrower range of architectures or are used in industrial application development, are also discussed. Adding support for portable task parallelism, such as farm skeletons, is pointed out as an important issue for future research.
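
The following sketch conveys the skeleton idea in plain C++; the map_skeleton function is illustrative and is not the actual SkePU or SkelCL API. The skeleton owns the (parallel) iteration pattern, the user supplies only the per-element function, and a real skeleton library would additionally select among CPU, OpenMP, CUDA and OpenCL backends behind this one call.

// Illustrative data-parallel "map" skeleton: users never write the loop.
#include <vector>
#include <algorithm>
#include <execution>

template <typename T, typename F>
std::vector<T> map_skeleton(const std::vector<T>& in, F f) {
    std::vector<T> out(in.size());
    // The parallel backend is chosen inside the skeleton, not by the caller.
    std::transform(std::execution::par, in.begin(), in.end(), out.begin(), f);
    return out;
}

// Usage, in the spirit of the chapter's scalar-multiplication example:
// std::vector<float> y = map_skeleton(x, [](float v) { return 2.0f * v; });

Because the parallelization strategy is hidden behind the skeleton call, retargeting the program to a different architecture amounts to swapping the backend rather than rewriting the application.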

In the seventh chapter, ‘DSL stream programming on multicore architectures’, de Oliveira Castro et al. present a novel approach for stream programming, a powerful alternative for programming multi-core processors: it offers deterministic execution based on a sound mathematical formalism, and the parallelism expressed implicitly by the stream structure leverages compiler optimizations that can harness multicore performance without having to tune the application by hand. Two families of stream programming languages are analyzed: languages in which the data access patterns are explicitly described by the programmer through a set of reorganization primitives and those in which the data access patterns are implicitly declared through a set of dependencies between tasks. The authors then expose the principle of a two-level approach combining the advantages and expressivity of both types of languages, aiming to achieve both the expressivity of high-level languages such as Array-OL and Block Parallel and a rich optimization framework, similar to StreamIt and Brook.
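
As a rough flavor of the stream style, the sketch below composes a computation as a pipeline of stages through which data flows, leaving iteration and scheduling to the underlying machinery. C++20 ranges merely stand in for a real stream language such as StreamIt or the SLICES DSL discussed in the chapter; the stage functions are made-up examples.

// Declaratively composed stream stages: no explicit loop or schedule.
#include <ranges>
#include <vector>
#include <iostream>

int main() {
    std::vector<int> source{1, 2, 3, 4, 5, 6, 7, 8};
    auto pipeline = source
        | std::views::filter([](int x) { return x % 2 == 0; })   // stage 1
        | std::views::transform([](int x) { return x * x; });    // stage 2
    for (int x : pipeline) std::cout << x << ' ';  // prints: 4 16 36 64
}

Since the stage graph fully determines the data dependencies, a stream compiler is free to fuse, split or place stages across cores, which is exactly the optimization space the chapter explores.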

The eighth chapter, ‘Programming with transactional memory’, by Gramoli and Guerraoui addresses issues similar to those of Chapter 4, namely the use of transactional memory to remedy the numerous concurrency problems arising in multicore and many-core programming. The chapter analyzes state-of-the-art concurrent programming advances based on transactional memory. Several programming languages that support TM are considered, along with some TM implementations and a running example for software support. The causes of the performance limitations from which TMs may suffer, and some recent solutions for coping with them, are also discussed.

Otto and Tichy in the ninth chapter, ‘Object-oriented stream programming’, present an approach that unifies the concepts of object orientation (OO) and stream programming in order to take advantage of both paradigms. Object-oriented stream programming (OOSP) is introduced as a solution targeting both better programmability and performance gains. The benefits of OO and stream programming are exemplified with XJava, a prototype OOSP language extending Java. Other issues are also discussed, including potential conflicts between tasks as well as run-time performance tuning under correctness guarantees, which allow for interprocess application optimization and faster parameter adjustment.

The tenth chapter, ‘Software-based speculative parallelization’, by Tian et al. studies the thread-level speculative parallelization (SP) approach for parallelizing sequential programs by exploiting the dynamic parallelism that may be present in them. As SP is usually applied to loops and performed at compile time, it requires minimal help from the programmer, who may only need to identify the loops to which speculative parallelization is to be applied. The authors discuss several issues in SP, such as handling misspeculation, recovery capabilities and techniques for identifying parallelizable regions. Some ongoing projects that focus on SP techniques are also briefly covered, along with directions for future research, including energy efficiency in SP and the use of SP on heterogeneous processors and 3D multicore processors.
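The heavily simplified sketch below illustrates the core SP cycle of speculate, detect, and recover on a loop whose iterations may or may not be independent. Real SP systems, including those the chapter surveys, are far more sophisticated; this is only a conceptual illustration with made-up data.

```cpp
// Simplified software speculative parallelization: the loop
// "a[idx[i]] += b[i]" runs in parallel under the optimistic assumption
// that idx holds no duplicates (iterations independent); writes are
// tracked so a violated assumption is detected, then the loop is rolled
// back to a checkpoint and re-executed sequentially.
#include <atomic>
#include <thread>
#include <vector>

int main() {
    const int N = 1 << 16;
    std::vector<int> idx(N), b(N, 1), a(N, 0);
    for (int i = 0; i < N; ++i) idx[i] = (i * 7) % N;  // conflict-free here

    std::vector<int> checkpoint = a;            // state saved for rollback
    std::vector<std::atomic<char>> written(N);  // per-element write marks
                                                // (zeroed; guaranteed C++20)
    std::atomic<bool> misspeculated{false};

    auto worker = [&](int lo, int hi) {
        for (int i = lo; i < hi && !misspeculated; ++i) {
            // Mark-and-write: a second write to the same element means the
            // independence speculation has failed.
            if (written[idx[i]].exchange(1, std::memory_order_relaxed))
                misspeculated = true;
            else
                a[idx[i]] += b[i];
        }
    };

    std::thread t1(worker, 0, N / 2), t2(worker, N / 2, N);
    t1.join(); t2.join();

    if (misspeculated) {                        // recovery: restore and redo
        a = checkpoint;
        for (int i = 0; i < N; ++i) a[idx[i]] += b[i];
    }
    return 0;
}
```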

Schubert et al. in the eleventh chapter, ‘Autonomic distribution and adaptation’, describe an approach for increasing the scalability of applications by exploiting their inherent concurrency in order to parallelize and distribute the code. The authors focus specifically on concurrency, a crucial ingredient of any parallelization approach, in the sense of reducing dependencies between logical parts of an application. To that end, they employ graph analysis methods to assess dependencies at the code level, identify concurrent segments and relate them to the specific characteristics of the (heterogeneous, large-scale) environment. The issues that the high degree of scalability, and especially the large variance of processor architectures, pose to programming multicore and many-core computers are also discussed.

Part III: Programming Frameworks

The third part of the book deals with methodologies, frameworks and high-level programming tools for constructing and testing software that can be ported between different, possibly themselves heterogeneous, many-core systems while preserving specific quantitative and qualitative performance aspects.

The twelfth chapter, ‘PEPPHER: Performance portability and programmability for heterogeneous many-core architectures’, by Benkner et al. presents the PEPPHER framework, which introduces a flexible and extensible compositional metalanguage for expressing functional and nonfunctional properties of software components, their resource requirements and possible compilation targets, as well as abstract specifications of the properties of the underlying hardware. Handles are also provided for the run-time system to schedule the components on the available hardware resources, and performance predictions can be derived automatically by combining the supplied performance models. Performance portability is aided by guidelines and requirements that ensure the PEPPHER framework at all levels chooses the best implementation of a given component or library routine among the available variants, including settings for tunable parameters, prescheduling decisions and data movement operations.

Aldinucci et al. in the thirteenth chapter, ‘Fastflow: high level and efficient streaming on multicore’, consider, as other chapters do, the difficulties of programming multicore and many-core systems, but from the perspective of two interrelated needs: efficient mechanisms supporting correct concurrent access to shared-memory data structures, and higher-level programming environments that hide the difficulties of using shared-memory objects correctly and efficiently by raising the level of abstraction offered to application programmers. To address these needs, the authors introduce and discuss FastFlow, a programming framework specifically targeting cache-coherent shared-memory multicores. They show that the programming abstractions provided by the top layer of the FastFlow programming model are well suited to application programmers. Performance and efficiency considerations are also given, along with some real-world applications.
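The sketch below shows a three-stage streaming pipeline written in the style of FastFlow's documented pipeline API (header names and API details may differ across FastFlow versions): a source emits items, a worker transforms them, and a sink consumes them, with the stages running on separate threads connected by the framework's lock-free queues.

```cpp
// Three-stage pipeline in the style of FastFlow's documented API
// (details may vary across versions).
#include <ff/ff.hpp>
#include <ff/pipeline.hpp>
using namespace ff;

struct Source : ff_node_t<long> {
    long* svc(long*) {
        for (long i = 1; i <= 100; ++i)
            ff_send_out(new long(i));   // push a stream item downstream
        return EOS;                     // close the stream
    }
};

struct Square : ff_node_t<long> {
    long* svc(long* t) { *t = (*t) * (*t); return t; }  // transform in place
};

struct Sink : ff_node_t<long> {
    long sum = 0;
    long* svc(long* t) { sum += *t; delete t; return GO_ON; }
};

int main() {
    Source s1; Square s2; Sink s3;
    ff_Pipe<> pipe(s1, s2, s3);          // stages run on separate threads,
    return pipe.run_and_wait_end() < 0;  // connected by lock-free queues
}
```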

In the fourteenth chapter, ‘Programming framework for H.264/AVC video encoding in multicore systems’, Roma et al. illustrate the usefulness of multicore and many-core computing for video encoding, a component of many multimedia applications. Since video encoding is highly computationally demanding, parallel approaches are envisaged to accelerate encoding and meet real-time performance requirements. The authors present a new parallel programming framework that allows high-performance H.264/AVC video encoders to be implemented easily and efficiently. Its modularity and flexibility make the framework particularly suited to efficient implementations on either homogeneous or heterogeneous parallel platforms, providing a suitable set of fine-tuning configurations and parameterizations that allow fast prototyping and implementation, thus significantly reducing the development time of the whole video encoding system.

The fifteenth chapter, ‘Parallelizing evolutionary algorithms on GPGPU cards with the EASEA platform’, by Maitre et al. presents EASEA (EAsy Specification of Evolutionary Algorithms), a software platform dedicated to evolutionary algorithms that makes it possible to exploit parallel architectures ranging from a single GPGPU-equipped machine to multi-GPGPU machines, and from a cluster to several clusters of GPGPU machines. Parallel algorithms implemented by the EASEA platform are proposed for evolutionary algorithms and evolution strategies, genetic programming and multiobjective optimization. Finally, a set of artificial and real-world problems is presented, for which performance evaluation results are given. EASEA is shown to be suitable for efficiently parallelizing generic evolutionary optimization problems on current petaflop machines and future exaflop ones.
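The generic sketch below (not EASEA's actual generated code) illustrates the step such platforms parallelize most aggressively: fitness evaluation, which is embarrassingly parallel across the population and maps naturally onto GPGPUs or, as here, OpenMP threads. The fitness function and population shapes are made up for illustration.

```cpp
// Generic parallel fitness evaluation (not EASEA's code; compile with:
// g++ -fopenmp ea.cpp). Each individual is evaluated independently.
#include <vector>

// Hypothetical fitness: negated sphere function on a real-valued genome.
static double fitness(const std::vector<double>& genome) {
    double s = 0.0;
    for (double g : genome) s += g * g;
    return -s;  // higher is better
}

int main() {
    const int pop_size = 1024, genes = 64;
    std::vector<std::vector<double>> pop(pop_size,
                                         std::vector<double>(genes, 0.5));
    std::vector<double> fit(pop_size);

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < pop_size; ++i)
        fit[i] = fitness(pop[i]);  // no cross-individual dependencies

    return fit[0] < 0.0 ? 0 : 1;   // trivial sanity check
}
```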

Part IV: Testing, Evaluation and Optimization

The fourth part of the book covers testing, evaluation and optimization of parallel programs, with special emphasis on multicore and many-core systems. Techniques, methodologies and approaches are presented across four chapters.

Farchi in the sixteenth chapter, ‘Smart interleavings for testing parallel programs’, discusses the challenges of testing parallel programs, which execute several parallel tasks, may be distributed across different machines subject to node or network failures, and may use different synchronization primitives. The main challenge of parallel program testing therefore resides in defining and covering the huge space of possible orderings of tasks and environment events. The author presents state-of-the-art testing techniques, including parallel bug pattern-based reviews and distributed reviews. The latter techniques enable the design of a test plan for the parallel program that is then implemented in unit testing. Coping with scale is envisaged as a main challenge for future research.
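The tiny program below exhibits the kind of schedule-dependent defect these techniques aim to expose: it contains a plain data race, so whether a given test run observes the expected result depends entirely on the interleaving chosen at run time, and a single passing run certifies nothing.

```cpp
// Classic interleaving-dependent bug: the final value of `counter`
// depends on how the two threads' load-increment-store sequences
// interleave, so lost updates are possible.
#include <iostream>
#include <thread>

int counter = 0;  // shared and intentionally unprotected: a data race

void work() {
    for (int i = 0; i < 100000; ++i)
        ++counter;  // load/add/store may interleave with the other thread
}

int main() {
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    // Expected 200000, but many schedules print less: exactly the space
    // of orderings that smart interleaving techniques try to cover.
    std::cout << counter << '\n';
}
```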

The seventeenth chapter, ‘Parallel performance evaluation and optimization’, by Shafi covers important aspects of shared-memory parallel programming that impact performance. Guidance and mitigation techniques for diagnosing performance issues, applicable to a large spectrum of shared-memory multicore programs, are given to assist in performance tuning. Various overheads in parallel programs, including thread, cache and synchronization overheads, are discussed and mitigation techniques analyzed. Optimization-related issues such as nonuniform memory access (NUMA) and latency are also described. The chapter surveys diagnostic tools as critical means of achieving good performance in parallel applications.
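As one concrete example of the cache overheads such chapters discuss, the sketch below shows false sharing and its standard mitigation: two per-thread counters that would otherwise occupy the same cache line are padded apart (assuming 64-byte lines), so the threads stop invalidating each other's cached copy.

```cpp
// False-sharing mitigation sketch: without alignas(64), the two counters
// would likely share a cache line and each increment would invalidate the
// other core's copy of it; padding gives each counter its own line.
#include <thread>

struct PaddedCounter {
    alignas(64) long value = 0;  // one counter per (assumed 64-byte) line
};

PaddedCounter counters[2];

void bump(int id) {
    for (long i = 0; i < 100000000; ++i)
        ++counters[id].value;    // each thread touches only its own line
}

int main() {
    std::thread t0(bump, 0), t1(bump, 1);
    t0.join(); t1.join();
    return 0;
}
```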

The eighteenth chapter, ‘A methodology for optimizing multithreaded system scalability on multicores’, by Gunther et al. presents a methodology that combines controlled measurements of the multithreaded platform with a scalability modeling framework within which those measurements can be evaluated. The authors show how to quantify scalability using the Universal Scalability Law (USL) by applying it to controlled performance measurements of memcached, J2EE and WebLogic. They advocate that system performance analysis be incorporated into a comprehensive methodology rather than done as an afterthought. Their methodology, based on the USL, emphasizes the importance of validating scalability data through controlled measurements that use appropriately designed test workloads. Some results from quantifying GPU and many-core scalability using the USL methodology are also reported.
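For reference, the standard form of the USL expresses relative capacity C(N) at concurrency level N in terms of a contention (serialization) coefficient α and a coherency (crosstalk) coefficient β:

```latex
C(N) = \frac{N}{1 + \alpha\,(N - 1) + \beta\,N\,(N - 1)}
```

With α = β = 0 the model reduces to linear scaling, while a nonzero β makes throughput peak at a finite concurrency level; locating that peak from fitted measurements is one of the practical outcomes of the methodology.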

Ozturk and Kandemir in the nineteenth chapter, ‘Improving multicore system performance through data compression