With recent changes in multicore and general-purpose computing on graphics processing units, the way parallel computers are used and programmed has changed drastically, making a comprehensive study of how to use such machines, written by specialists in the domain, all the more important. The book provides recent research results in high-performance computing on complex environments, guidance on how to efficiently exploit heterogeneous and hierarchical architectures and distributed systems, detailed studies of the impact of applying heterogeneous computing practices to real problems, and applications ranging from remote sensing to tomography. The content spans topics such as Numerical Analysis for Heterogeneous and Multicore Systems; Optimization of Communication for High-Performance Heterogeneous and Hierarchical Platforms; Efficient Exploitation of Heterogeneous Architectures, Hybrid CPU + GPU, and Distributed Systems; Energy Awareness in High-Performance Computing; and Applications of Heterogeneous High-Performance Computing.
• Covers cutting-edge research in HPC on complex environments, following an international collaboration of members of the ComplexHPC COST Action
• Explains how to efficiently exploit heterogeneous and hierarchical architectures and distributed systems
• Twenty-three chapters and over 100 illustrations cover domains such as numerical analysis, communication and storage, applications, GPUs and accelerators, and energy efficiency
Page count: 857
Publication year: 2014
Cover
Series
Title Page
Copyright
Dedication
Contributors
Preface
Part I: Introduction
Chapter 1: Summary of the Open European Network for High-Performance Computing in Complex Environments
1.1 Introduction and Vision
1.2 Scientific Organization
1.3 Activities of the Project
1.4 Main Outcomes of the Action
1.5 Contents of the Book
Acknowledgment
Part II: Numerical Analysis for Heterogeneous and Multicore Systems
Chapter 2: On the Impact of the Heterogeneous Multicore and Many-Core Platforms on Iterative Solution Methods and Preconditioning Techniques
2.1 Introduction
2.2 General Description of Iterative Methods and Preconditioning
2.3 Preconditioning Techniques
2.4 Defect-Correction Technique
2.5 Multigrid Method
2.6 Parallelization of Iterative Methods
2.7 Heterogeneous Systems
2.8 Maintenance and Portability
2.9 Conclusion
Acknowledgments
References
Chapter 3: Efficient Numerical Solution of 2D Diffusion Equation on Multicore Computers
3.1 Introduction
3.2 Test Case
3.3 Parallel Implementation
3.4 Results
3.5 Discussion
3.6 Conclusion
Acknowledgment
References
Chapter 4: Parallel Algorithms for Parabolic Problems on Graphs in Neuroscience
4.1 Introduction
4.2 Formulation of the Discrete Model
4.3 Parallel Algorithms
4.4 Computational Results
4.5 Conclusions
Acknowledgments
References
Part III: Communication and Storage Considerations in High-Performance Computing
Chapter 5: An Overview of Topology Mapping Algorithms and Techniques in High-Performance Computing
5.1 Introduction
5.2 General Overview
5.3 Formalization of the Problem
5.4 Algorithmic Strategies for Topology Mapping
5.5 Mapping Enforcement Techniques
5.6 Survey of Solutions
5.7 Conclusion and Open Problems
Acknowledgment
References
Chapter 6: Optimization of Collective Communication for Heterogeneous HPC Platforms
6.1 Introduction
6.2 Overview of Optimized Collectives and Topology-Aware Collectives
6.3 Optimizations of Collectives on Homogeneous Clusters
6.4 Heterogeneous Networks
6.5 Topology- and Performance-Aware Collectives
6.6 Topology as Input
6.7 Performance as Input
6.8 Non-MPI Collective Algorithms for Heterogeneous Networks
6.9 Conclusion
Acknowledgments
References
Chapter 7: Effective Data Access Patterns on Massively Parallel Processors
7.1 Introduction
7.2 Architectural Details
7.3 K-Model
7.4 Parallel Prefix Sum
7.5 Bitonic Sorting Networks
7.6 Final Remarks
Acknowledgments
References
Chapter 8: Scalable Storage I/O Software for Blue Gene Architectures
8.1 Introduction
8.2 Blue Gene System Overview
8.3 Design and Implementation
8.4 Conclusions and Future Work
Acknowledgments
References
Part IV: Efficient Exploitation of Heterogeneous Architectures
Chapter 9: Fair Resource Sharing for Dynamic Scheduling of Workflows on Heterogeneous Systems
9.1 Introduction
9.2 Concurrent Workflow Scheduling
9.3 Experimental Results and Discussion
9.4 Conclusions
Acknowledgments
References
Chapter 10: Systematic Mapping of Reed–Solomon Erasure Codes on Heterogeneous Multicore Architectures
10.1 Introduction
10.2 Related Works
10.3 Reed–Solomon Codes and Linear Algebra Algorithms
10.4 Mapping Reed–Solomon Codes on Cell/B.E. Architecture
10.5 Mapping Reed–Solomon Codes on Multicore GPU Architectures
10.6 Methods of Increasing the Algorithm Performance on GPUs
10.7 GPU Performance Evaluation
10.8 Conclusions and Future Works
Acknowledgments
References
Chapter 11: Heterogeneous Parallel Computing Platforms and Tools for Compute-Intensive Algorithms: A Case Study
11.1 Introduction
11.2 A Low-Cost Heterogeneous Computing Environment
11.3 First Case Study: The N-Body Problem
11.4 Second Case Study: The Convolution Algorithm
11.5 Conclusions
Acknowledgments
References
Chapter 12: Efficient Application of Hybrid Parallelism in Electromagnetism Problems
12.1 Introduction
12.2 Computation of Green's Functions in Hybrid Systems
12.3 Parallelization in NUMA Systems of a Volume Integral Equation Technique
12.4 Autotuning Parallel Codes
12.5 Conclusions and Future Research
Acknowledgments
References
Part V: CPU + GPU Coprocessing
Chapter 13: Design and Optimization of Scientific Applications for Highly Heterogeneous and Hierarchical HPC Platforms Using Functional Computation Performance Models
13.1 Introduction
13.2 Related Work
13.3 Data Partitioning Based on Functional Performance Model
13.4 Example Application: Heterogeneous Parallel Matrix Multiplication
13.5 Performance Measurement on CPUs/GPUs System
13.6 Functional Performance Models of Multiple Cores and GPUs
13.7 FPM-Based Data Partitioning on CPUs/GPUs System
13.8 Efficient Building of Functional Performance Models
13.9 FPM-Based Data Partitioning on Hierarchical Platforms
13.10 Conclusion
Acknowledgments
References
Chapter 14: Efficient Multilevel Load Balancing on Heterogeneous CPU + GPU Systems
14.1 Introduction: Heterogeneous CPU + GPU Systems
14.2 Background and Related Work
14.3 Load Balancing Algorithms for Heterogeneous CPU + GPU Systems
14.4 Experimental Results
14.5 Conclusions
Acknowledgments
References
Chapter 15: The All-Pair Shortest-Path Problem in Shared-Memory Heterogeneous Systems
15.1 Introduction
15.2 Algorithmic Overview
15.3 CUDA Overview
15.4 Heterogeneous Systems and Load Balancing
15.5 Parallel Solutions to the APSP
15.6 Experimental Setup
15.7 Experimental Results
15.8 Conclusions
Acknowledgments
References
Part VI: Efficient Exploitation of Distributed Systems
Chapter 16: Resource Management for HPC on the Cloud
16.1 Introduction
16.2 On the Type of Applications for HPC and HPC2
16.3 HPC on the Cloud
16.4 Scheduling Algorithms for HPC2
16.5 Toward an Autonomous Scheduling Framework
16.6 Conclusions
Acknowledgment
References
Chapter 17: Resource Discovery in Large-Scale Grid Systems
17.1 Introduction and Background
17.2 The Semantic Communities Approach
17.3 The P2P Approach
17.4 The Grid-Routing Transferring Approach
17.5 Conclusions
Acknowledgment
References
Part VII: Energy Awareness in High-Performance Computing
Chapter 18: Energy-Aware Approaches for HPC Systems
18.1 Introduction
18.2 Power Consumption of Servers
18.3 Classification and Energy Profiles of HPC Applications
18.4 Policies and Leverages
18.5 Conclusion
Acknowledgements
References
Chapter 19: Strategies for Increased Energy Awareness in Cloud Federations
19.1 Introduction
19.2 Related Work
19.3 Scenarios
19.4 Energy-Aware Cloud Federations
19.5 Conclusions
Acknowledgments
References
Chapter 20: Enabling Network Security in HPC Systems Using Heterogeneous CMPs
20.1 Introduction
20.2 Related Work
20.3 Overview of Our Approach
20.4 Heterogeneous CMP Design for Network Security Processors
20.5 Experimental Evaluation
20.6 Concluding Remarks
Acknowledgments
References
Part VIII: Applications of Heterogeneous High-Performance Computing
Chapter 21: Toward a High-Performance Distributed CBIR System for Hyperspectral Remote Sensing Data: A Case Study in Jungle Computing
21.1 Introduction
21.2 CBIR for Hyperspectral Imaging Data
21.3 Jungle Computing
21.4 IBIS and Constellation
21.5 System Design and Implementation
21.6 Evaluation
21.7 Conclusions
Acknowledgments
References
Chapter 22: Taking Advantage of Heterogeneous Platforms in Image and Video Processing
22.1 Introduction
22.2 Related Work
22.3 Parallel Image Processing on GPU
22.4 Image Processing on Heterogeneous Architectures
22.5 Video Processing on GPU
22.6 Experimental Results
22.7 Conclusion
Acknowledgment
References
Chapter 23: Real-Time Tomographic Reconstruction Through CPU + GPU Coprocessing
23.1 Introduction
23.2 Tomographic Reconstruction
23.3 Optimization of Tomographic Reconstruction for CPUs and for GPUs
23.4 Hybrid CPU + GPU Tomographic Reconstruction
23.5 Results
23.6 Discussion and Conclusion
Acknowledgments
References
Index
Series
End User License Agreement
Emmanuel Jeannot
Julius Zilinskas
Copyright © 2014 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic format.
Library of Congress Cataloging in Publication Data:
Jeannot, Emmanuel.
High performance computing on complex environments / Emmanuel Jeannot, Julius Zilinskas.
pages cm
Includes bibliographical references and index.
ISBN 978-1-118-71205-4 (cloth)
1. High performance computing. I. Žilinskas, J. (Julius), 1973– II. Title.
QA76.88.J43 2014
004.1′1–dc23
2013048363
To our colleague Mark Baker
Alejandro Álvarez-Melcón, Technical University of Cartagena, Cartagena, Spain
Hamid Arabnejad, Universidade do Porto, Porto, Portugal
Henri E. Bal, VU University, Amsterdam, The Netherlands
Ranieri Baraglia, National Research Council of Italy, Pisa, Italy
Jorge G. Barbosa, Universidade do Porto, Porto, Portugal
Robert Basmadjian, Passau University, Passau, Germany
Gabriele Capannini, D&IT Chalmers, Göteborg, Sweden
Jesús Carretero, Universidad Carlos III of Madrid, Madrid, Spain
Raimondas Čiegis, Vilnius Gediminas Technical University, Vilnius, Lithuania
David Clarke, University College Dublin, Dublin, Ireland
Andrea Clematis, IMATI CNR, Genoa, Italy
Georges Da Costa, Toulouse University, Toulouse, France
Daniele D'Agostino, IMATI CNR, Genoa, Italy
Emanuele Danovaro, IMATI CNR, Genoa, Italy
Matjaž Depolli, Jožef Stefan Institute, Ljubljana, Slovenia
Kiril Dichev, University College Dublin, Dublin, Ireland
Niels Drost, Netherlands eScience Center, Amsterdam, The Netherlands
Jose J. Fernandez, National Centre for Biotechnology, National Research Council (CNB-CSIC), Madrid, Spain
Marc E. Frincu, West University of Timisoara, Timisoara, Romania
Javier Garcia, Universidad Carlos III of Madrid, Madrid, Spain
Ester M. Garzon, University of Almería, Almería, Spain
Domingo Giménez, University of Murcia, Murcia, Spain
Arturo Gonzalez-Escribano, Universidad de Valladolid, Valladolid, Spain
Torsten Hoefler, ETH Zürich, Zürich, Switzerland
José Ignacio Agulleiro, University of Almería, Almería, Spain
Aleksandar Ilic, Technical University of Lisbon, Lisbon, Portugal
Florin Isaila, Universidad Carlos III of Madrid, Madrid, Spain
Emmanuel Jeannot, Inria Bordeaux Sud-Ouest, Talence, France
Konstantinos Karaoglanoglou, Aristotle University of Thessaloniki, Thessaloniki, Greece
Helen Karatza, Aristotle University of Thessaloniki, Thessaloniki, Greece
Gabor Kecskemeti, University of Innsbruck, Innsbruck, Austria
Attila Kertesz, MTA SZTAKI Computer and Automation Research Institute, Budapest, Hungary
Timo van Kessel, VU University, Amsterdam, The Netherlands
Gregor Kosec, Jožef Stefan Institute, Ljubljana, Slovenia
Lukasz Kuczynski, Czestochowa University of Technology, Czestochowa, Poland
Alexey Lastovetsky, University College Dublin, Dublin, Ireland
Laurent Lefevre, INRIA, LIP Laboratory, Ecole Normale Superieure of Lyon, Lyon, France
Diego R. Llanos, Universidad de Valladolid, Valladolid, Spain
Dimitar Lukarski, Uppsala University, Uppsala, Sweden
Jason Maassen, Netherlands eScience Center, Amsterdam, The Netherlands
Sidi A. Mahmoudi, University of Mons, Mons, Belgium
Pierre Manneback, University of Mons, Mons, Belgium
Attila Cs. Marosi, MTA SZTAKI Computer and Automation Research Institute, Budapest, Hungary
Guillaume Mercier, Bordeaux Polytechnic Institute, Talence, France; Inria Bordeaux Sud-Ouest, Talence, France
Franco Maria Nardini, National Research Council of Italy, Pisa, Italy
Zsolt Nemeth, MTA SZTAKI Computer and Automation Research Institute, Budapest, Hungary
Maya Neytcheva, Uppsala University, Uppsala, Sweden
Ariel Oleksiak, Poznan Supercomputing and Networking Center, Poznan, Poland
Hector Ortega-Arranz, Universidad de Valladolid, Valladolid, Spain
Erencan Ozkan, Ankara University, Ankara, Turkey
Ozcan Ozturk, Bilkent University, Ankara, Turkey
Carlos Pérez-Alcaraz, University of Murcia, Murcia, Spain
Dana Petcu, West University of Timisoara, Timisoara, Romania
José-Ginés Picón, University of Murcia, Murcia, Spain
Jean-Marc Pierson, Toulouse University, Toulouse, France
Fernando D. Quesada, Technical University of Cartagena, Cartagena, Spain
Antonio J. Plaza, University of Extremadura, Caceres, Spain
Tomás Ramírez, University of Murcia, Murcia, Spain
Vladimir Rychkov, University College Dublin, Dublin, Ireland
Frank J. Seinstra, Netherlands eScience Center, Amsterdam, The Netherlands
Fabrizio Silvestri, National Research Council of Italy, Pisa, Italy
Leonel Sousa, Technical University of Lisbon, Lisbon, Portugal
Frédéric Suter, IN2P3 Computing Center, CNRS, IN2P3, Lyon-Villeurbanne, France
Yuri Torres, Universidad de Valladolid, Valladolid, Spain
Suleyman Tosun, Ankara University, Ankara, Turkey
Roman Trobec, Jožef Stefan Institute, Ljubljana, Slovenia
Ghislain Landry Tsafack Chetsa, INRIA, LIP Laboratory, Ecole Normale Superieure of Lyon, Lyon, France
Natalija Tumanova, Vilnius Gediminas Technical University, Vilnius, Lithuania
Francisco Vazquez, University of Almería, Almería, Spain
Marcin Wozniak, Czestochowa University of Technology, Czestochowa, Poland
Roman Wyrzykowski, Czestochowa University of Technology, Czestochowa, Poland
Ziming Zhong, University College Dublin, Dublin, Ireland
Julius Žilinskas, Vilnius University, Vilnius, Lithuania
High-performance computing (HPC) is an important domain of the computer science field. For more than 30 years, it has allowed finding solutions to problems and enhanced progress in many scientific and industrial areas, such as climatology, biology, geology, and drug design, as well as automobile and aerospace engineering. However, new technologies such as multicore chips and accelerators have forced researchers in the field to rethink most of the advances in the domain, such as algorithms, runtime systems, languages, software, and applications.
It is expected that a high-end supercomputer will be able to deliver several hundreds of petaflops (1 petaflop is 10^15 floating-point operations per second) within 5 years from now. However, this will require mastering several challenges, such as energy efficiency, scalability, and heterogeneity.
Better and more efficient parallel computers will enable solving problems at a scale and within a timeframe that has not been reached so far. These modern hierarchical and heterogeneous computing infrastructures are hard to program and use efficiently, particularly for extreme-scale computing, and none of the state-of-the-art solutions is able to use such environments efficiently. Providing tools for the whole software stack will allow programmers and scientists to efficiently write new programs that use most of the available power of such future complex machines.
COST Action IC0805, "Open European Network for High-Performance Computing on Complex Environments" (ComplexHPC), was devoted to heterogeneous and hierarchical systems for HPC and aimed at tackling the problem at every level (from cores to large-scale environments) and providing new integrated solutions for large-scale computing for future platforms. The Action ran from May 2009 to June 2013. Its goal was to establish a European research network focused on high-performance heterogeneous computing in order to address the whole range of challenges posed by these new platforms, including models, algorithms, programming tools, and applications. Indeed, some of the most active research groups in this area are in Europe. The network has contributed to exchanging information, identifying synergies, and pursuing common research activities, thereby reinforcing the strength of these groups and the leadership of Europe in this field. This book presents the results of the Action. The chapters are written by expert participants of the Action.
This book is intended for scientists and researchers working in the field of HPC. It provides advanced information for readers already familiar with the basics of parallel and distributed computing, and it may also serve PhD students and early stage researchers in computer science and engineering as a thorough introduction to the related fields.
This book would not have been possible without the efforts of the contributors in preparing the respective chapters, and we would like to thank them for timely submissions and corrections. We would also like to thank Prof. Albert Zomaya for giving us the opportunity to publish this book in the “Wiley Series on Parallel and Distributed Computing.” We would also like to thank Simone Taylor, Director, Editorial Development, John Wiley & Sons, Inc., and the editorial team for their patience and guiding us through the publication of this book. We would also like to thank COST for the support that enabled the publication.
E. Jeannot and J. Žilinskas
Delft, Netherlands
May, 2013
ESF provides the COST Office through an EC contract
COST is supported by the EU RTD Framework programme
COST–the acronym for European Cooperation in Science and Technology–is the oldest and widest European intergovernmental network for cooperation in research. Established by the Ministerial Conference in November 1971, COST is presently used by the scientific communities of 36 European countries to cooperate in common research projects supported by national funds.
The funds provided by COST–less than 1% of the total value of the projects–support the COST cooperation networks (COST Actions) through which, with EUR 30 million per year, more than 30 000 European scientists are involved in research having a total value which exceeds EUR 2 billion per year. This is the financial worth of the European added value which COST achieves.
A “bottom up approach” (the initiative of launching a COST Action comes from the European scientists themselves), “à la carte participation” (only countries interested in the Action participate), “equality of access” (participation is open also to the scientific communities of countries not belonging to the European Union) and “flexible structure” (easy implementation and light management of the research initiatives) are the main characteristics of COST.
As a precursor of advanced multidisciplinary research, COST has a very important role in the realisation of the European Research Area (ERA), anticipating and complementing the activities of the Framework Programmes, constituting a "bridge" towards the scientific communities of emerging countries, increasing the mobility of researchers across Europe and fostering the establishment of "Networks of Excellence" in many key scientific domains such as: Biomedicine and Molecular Biosciences; Food and Agriculture; Forests, their Products and Services; Materials, Physical and Nanosciences; Chemistry and Molecular Sciences and Technologies; Earth System Science and Environmental Management; Information and Communication Technologies; Transport and Urban Development; Individuals, Societies, Cultures and Health. It covers basic and more applied research and also addresses issues of a pre-normative nature or of societal importance.
Web: http://www.cost.eu
Neither the COST Office nor any person acting on its behalf is responsible for the use which might be made of the information contained in this publication. The COST Office is not responsible for the external websites referred to in this publication.
PART I
Introduction
Emmanuel Jeannot
Inria Bordeaux Sud-Ouest, Talence, France
Julius Žilinskas
Vilnius University, Vilnius, Lithuania
In this chapter, we describe the COST Action IC0805 entitled “Open European Network for High-Performance Computing on Complex Environments.” This Action had representation from more than 20 countries and lasted from 2009 to 2013. We outline the scientific focus of this Action, its organization, and its main outcomes. The chapter concludes by presenting the structure of the book and its different chapters.
In recent years, the evolution and growth of the techniques and platforms commonly used for high-performance computing (HPC) in the context of different application domains have been truly astonishing. While parallel computing systems have now achieved a certain maturity thanks to high-level libraries (such as ScaLAPACK) or runtime libraries (such as MPI), recent advances in these technologies pose several challenging research issues. Indeed, current HPC-oriented environments are extremely complex and very difficult to manage, particularly for extreme-scale application problems.
At the very low level, the latest generation of CPUs is made of multicore processors that can be general-purpose or highly specialized in nature. In addition, several processors can be assembled into a so-called symmetric multiprocessor (SMP), which can also have access to powerful specialized processors, such as graphics processing units (GPUs). These are now increasingly being used for programmable computing thanks to their advent in the video-game industry, which has significantly reduced their cost and increased their availability. Modern HPC-oriented parallel computers are typically composed of several SMP nodes interconnected by a network. This kind of infrastructure is hierarchical and represents a first class of heterogeneous system, in which the communication time between two processing units differs depending on whether the units are on the same chip, on the same node, or neither. Moreover, current hardware trends anticipate a further increase in the number of cores (in a hierarchical way) inside the chip, thus increasing the overall heterogeneity even more toward building extreme-scale systems.
At a higher level, the emergence of heterogeneous computing now allows groups of users to benefit from networks of processors that are already available in their research laboratories. This is a second type of infrastructure, where both the network and the processing units are heterogeneous in nature. Specifically, the goal here is to deal with networks that interconnect an (often large) number of heterogeneous computers that can significantly differ from one another in terms of their hardware and software architecture, including different types of CPUs operating at different clock speeds and under different design paradigms, as well as different memory sizes, caching strategies, and operating systems.
At the highest level, computers are increasingly interconnected through wide-area networks to form large-scale distributed systems with high computing capacity, and computers located in different laboratories can collaborate in the solution of a common problem. Therefore, the current trends of HPC are clearly oriented toward extreme-scale, complex infrastructures with a great deal of intrinsic heterogeneity and many different hierarchical levels.
It is important to note that all the heterogeneity levels mentioned above are tightly linked. First of all, some of the nodes in distributed computational environments may be multicore SMP clusters. Second, multicore chips will soon be fully heterogeneous, with special-purpose cores (e.g., multimedia, recognition, networking), and not only GPUs, mixed with general-purpose ones. Third, these different levels share many common problems, such as efficient programming, scalability, and latency management. Hence, it is very important to conduct research targeting the heterogeneity at all the hardware levels presented. Moreover, it is also important to take special care of scalability issues, which form a key dimension in the complexity of today's environments. The extreme scale of these environments comes from every level:
Low Level: number of CPUs, number of cores per processor;
Medium Level: number of nodes (e.g., with memory);
High Level: distributed/large-scale systems (geographic dispersion, latency, etc.);
Application: extreme-scale problem size (e.g., calculation-intensive or data-intensive).
In 2008, the knowledge on how to efficiently use, program, or scale applications on such infrastructures was still vague. This was one of the main challenges that researchers wanted to take on. Therefore, at that time, we decided to launch the COST Action for high-performance and extreme-scale computing in such complex environments, entitled "Open European Network for High-Performance Computing in Complex Environments." The main reasons were as follows:
There was a huge demand in terms of computational power for scientific and data-intensive applications;
The architectural advances offered the potential to meet the application requirements;
None of the state-of-the-art solutions in HPC at that time allowed exploiting this potential;
Most of the research carried out in this area was fragmented and scattered across different research teams without any coordination.
COST1 was indeed an appropriate framework for the proposed Action. The main goal of this Action was to overcome the actual research fragmentation on this very hot topic by gathering the most relevant European research teams involved in all the scientific areas described above (from the CPU core to the scientific applications) and coordinate their research.
Summarizing, this project within the COST framework allowed us to expect some potential benefits, such as high-level scientific results in the very important domain of high-performance and extreme-scale computing in complex environments; strong coordination between different research teams with significant expertise on this subject; better visibility of European research in this area; and a strong impact on other scientists and high-performance applications.
The expected scientific impacts of the project were to encourage the specific community to focus research on hot topics and applications of interest for the EU, to propagate the collaboration of research groups with the industry, to stimulate the formation of new groups in new EU countries, and to facilitate the solution of highly computationally demanding scientific problems as mentioned above. For this, the groups involved in this Action collaborated with several scientific and industrial groups that could benefit from the advances made by this Action, and prompted the incorporation of new groups to the network.
To achieve the research tasks, different leading European research teams participated in the concrete activities detailed in Section 1.3.
Four working groups were set up to coordinate the scientific research:
numerical analysis for hierarchical, heterogeneous, and multicore systems;
libraries for the efficient use of complex systems with emphasis on computational library and communication library;
algorithms and tools for mapping and executing applications onto distributed and heterogeneous systems;
applications of hierarchical-heterogeneous systems.
It is important to note that these working groups targeted vertical aspects of the architectural structure outlined in the previous section. For instance, the Action's goal was to carry out work on numerical analysis at the multicore level, at the heterogeneous system level, as well as at the large-scale level. The last working group (Applications) was expected to benefit from research of the other three groups.
To achieve the goal of this Action, the following concrete activities were proposed. The main goal was to promote collaboration through science meetings, workshops, schools, and internships. This allowed interchange of ideas and mobility of researchers.
The goal was to provide young researchers with a good opportunity to share information and knowledge and to present their current research. These schools contributed to the expansion of the computing community and spread of EU knowledge.
The goal of these meetings was to take the opportunity during international conferences to meet the attendees and other researchers by co-locating workshops.
The scientific work plan was divided among different working groups. Each working group had substantial autonomy in terms of research projects. A leader nominated by the Management Committee led each working group. Members of a given working group met once or twice a year to discuss and exchange specific scientific issues and problems.
These meetings were devoted to the organization of the network and ensured the scientific quality of the network.
The goal of short-term scientific missions (STSMs) was to enable visits by early stage researchers to foreign laboratories and departments. This was mainly targeted at young researchers to receive cross-disciplinary training and to take advantage of the existing resources. The goal was to increase the competitiveness and career development of those scientists in this rapidly developing field through cutting-edge collaborative research on the topic.
We believe that this COST Action was a great success. It gathered 26 European countries and 2 non-COST countries (Russia and South Africa). We have held 12 meetings and 2 spring schools. Fifty-two STSMs have been carried out. We have a new FP7 project coming from this Action (HOST). We have edited a book, and more than 100 papers have been published thanks to this Action.
We have set up an application catalog that gathers applications from the Action members. Its goal is to gather a set of HPC applications that can be used as test cases or benchmarks for researchers in the HPC field. The applications catalog is available at https://complexhpc-catalogue.bordeaux.inria.fr.
In total, the Action gathered more than 250 participants over the four years of the project.
We have sent a survey to the Action members. From this survey, it clearly appears that one of the greatest successes of the Action is the continuous strengthening of the network for many of its members, both in terms of research teams and research domains. Many STSMs have been carried out through new network connections. The spring schools are seen as a major success, as they helped many young researchers share and exchange knowledge and gain new connections. Many PhD theses have been defended during the course of the Action, and some of the management committee members have been invited to the defense boards of some of these PhDs. Moreover, many presentations given during the meetings are considered very useful and have opened new research directions for other attendees.
We had four goals in this Action:
to train new generations of scientists in high-performance and heterogeneous computing;
to overcome research fragmentation, and foster HPC efforts to increase Europe's competitiveness;
to tackle the problem at every level (from cores to large-scale environment);
to provide, through vertical integration, new integrated solutions for large-scale computing for future platforms.
Goal 1 has exceeded our expectations. The spring schools have been a great success. We had many STSMs, and the number of early stage researchers attending the meetings was always very high. We had a great response from young researchers.
Goal 2 has also been achieved satisfactorily. Thanks to the Action, many joint research activities have been carried out, and we have created a strong network of researchers within our Action. Moreover, many top-level publications have been produced thanks to the Action.
Goal 3 has also been achieved. We have scientific results that cover the core level and the distributed infrastructure, as well as results that cover the intermediate layers. This is due to the fact that the consortium was made of researchers from different areas. This was very fruitful.
Goal 4 has not been achieved. The main reason is the fact that providing integrated solutions requires more research and development than a COST Action can provide. It goes far beyond the networking activities of COST Action.
This book presents some of the main research results of the COST Action presented in this chapter. We are very proud to share this with the interested reader. We have structured the book into the following parts, in order to have a good balance among them:
Numerical Analysis for Heterogeneous and Multicore Systems (Chapters 2, 3, and 4);
Communication and Storage Considerations in High-Performance Computing (Chapters 5, 6, 7, and 8);
Efficient Exploitation of Heterogeneous Architectures (Chapters 9, 10, 11, and 12);
CPU + GPU coprocessing (Chapters 13, 14, and 15);
Efficient Exploitation of Distributed Systems (Chapters 16 and 17);
Energy Awareness in High-Performance Computing (Chapters 18, 19, and 20);
Applications of Heterogeneous High-Performance Computing (Chapters 21, 22, and 23).
Chapter 2 discusses the redesign of iterative solution algorithms so that they execute efficiently on heterogeneous architectures. Chapter 3 studies the performance of a meshless numerical partial differential equation (PDE) solver, parallelized with OpenMP. The results depend on the way the computations are distributed and the way the cache is used. Chapter 4 presents the development of three parallel numerical algorithms for the solution of parabolic problems on graphs, with a theoretical and experimental study of their scalability.
Chapter 5 surveys different techniques for mapping processes to computing units in order to optimize communication cost and reduce execution time. Chapter 6 offers a comprehensive overview of how to implement topology- and performance-aware collective communications. Chapter 7 analyzes the many-core architecture using a new model (K-model) in order to estimate the complexity of a given algorithm designed for such an architecture. Chapter 8 presents a scalable I/O storage system for the hierarchical architecture of Blue Gene computers featuring buffering and asynchronous I/O.
Chapter 9 describes algorithmic techniques for offline scheduling of independent workflows in order to satisfy users' quality-of-service requirements. Chapter 10 investigates the advantage of using modern heterogeneous architectures for the efficient implementation of the Reed–Solomon erasure code. Chapter 11 analyzes the factors that enable the development of efficient parallel programs on modern many-core parallel architectures. Chapter 12 studies efficient solutions for electromagnetism applications on clusters of CPU + GPU nodes.
Chapter 13 describes how the functional performance model can be used to optimize the performance of scientific applications for heterogeneous and hierarchical platforms. Chapter 14 presents algorithms for multilevel load balancing in multicore and multi-GPU environments. Chapter 15 addresses the all-pairs shortest-path problem for sparse graphs; different scheduling strategies are studied to efficiently solve such problems on heterogeneous systems.
Chapter 16 surveys different resource management systems and scheduling algorithms for HPC for clouds. Chapter 17 discusses different approaches for performing resource discovery in large-scale distributed systems.
Chapter 18 focuses on how to optimize and adapt software solutions to improve energy efficiency in the context of HPC applications. Chapter 19 studies energy-aware scheduling policies for three scenarios of federated clouds dealing with energy awareness. Chapter 20 explores the use of heterogeneous chip multiprocessors for network security and strategies to reduce energy consumption in such contexts.
Chapter 21 describes the "jungle computing paradigm," which consists of gathering a complex hierarchical collection of heterogeneous computing hardware, with an application to hyperspectral remote sensing. Chapter 22 presents a new model for image and video processing based on parallel and heterogeneous platforms in order to improve application performance when dealing with high-definition images. Chapter 23 applies load-balancing techniques to efficiently execute tomographic reconstruction on hybrid GPU + CPU systems.
As you can see, this covers a large spectrum of results and topics on HPC and heterogeneous systems.
We wish you a fruitful and enjoyable time with this book.
This publication is supported by COST.
1 European Cooperation in Science and Technology: http://www.cost.eu.
PART II
Numerical Analysis for Heterogeneous and Multicore Systems
Dimitar Lukarski and Maya Neytcheva
Uppsala University, Uppsala, Sweden
Computer simulations are now broadly recognized as a third branch of research, complementing theory and experimental work. The significant increase of available computing power has enabled tackling very large-scale, challenging, real-life problems, opening new possibilities for revolutionary breakthrough results in science and engineering. At the same time, the complexity of computer architectures has risen to levels where their full computing power can be achieved only after carefully redesigning existing algorithms and developing novel computational and communication strategies. In this chapter, we discuss this issue for a class of methods broadly used in scientific computations—the iterative solution methods.
For many years, the potential of available, serial as well as parallel, computer resources has been growing hand in hand with the need to numerically solve increasingly larger models of real-life problems. During the past decades, it has been recognized that, together with theoretical development and laboratory or field experiments, computation has become a third branch of research. Scientific computing is today's driving force behind the progress in the most challenging and demanding problems we attempt to solve. As examples, we mention turbulent combustion with complex chemistry, atmospheric dynamics, laser fusion, medical imaging, detailed modeling of the human heart, and artificial brain simulation with over a million neurons, to name a few.
In recent years, we have been witnessing a change in the means used to increase computational power, which has a strong impact on how that power should be utilized via the algorithms used in scientific computations. Therefore, we briefly describe the phases in performing numerical simulations. Consider a complex physical phenomenon, described as a set of, usually coupled, processes that develop in space and time, which we want to study, analyze, and predict. It is assumed that the simulation requires a large amount of computer resources in terms of memory and computation.
The process of performing the numerical simulations can be split into the following steps:
I. Mathematical model: Describes the phenomenon continuously in time and space in terms of mathematical relations, most often as coupled ordinary or partial differential equations. These equations depend on various problem parameters, such as thermal conductivity, capacitance, material properties, and so on.
II. Discrete model: Because of the high complexity of the continuous model, analytical solutions are in general not available. Therefore, we pose the task to compute the solution in a number of discrete points in time and space, thus discretizing the mathematical model. This can be accomplished using various techniques. Space discretization can be done using finite differences, finite elements, finite volumes, boundary elements, and so on. Similarly, in time, various explicit or implicit time-stepping procedures can be utilized. In addition to the model parameters, additional discretization parameters are introduced here, usually denoted as h in space and τ in time. The discrete model is expressed in terms of linear or nonlinear algebraic systems of equations which have to be solved (a small sketch after this list illustrates such a system). Depending on the problem, but also on the discretization techniques, the matrices associated with the algebraic systems can be dense or sparse, symmetric or nonsymmetric, and so on. As nonlinear systems are most often solved via linearization, we consider from now on only linear systems that are also large and sparse.
III. Solution method: The linear systems arising in Step II have to be solved by a proper solution method—direct or iterative. Because of the targeted large problem sizes and the lesser demands on computer resources, we consider iterative solvers only. The iterative methods may introduce yet other method parameters, which further increase the dimension of the parameter space.
IV. Computer implementation: To enable computer simulations, the numerical methods have to be implemented on some computer platform.
V. Visualization and verification: This step is also of importance, but it is not considered here any further.
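As a concrete illustration of Step II, the sketch below builds the sparse linear system produced by a standard five-point finite-difference discretization of the 2D Poisson equation; the grid size and right-hand side are illustrative assumptions, not taken from the chapter.

```python
# Five-point finite-difference discretization of the 2D Poisson equation on an
# m x m interior grid with mesh size h = 1/(m+1): yields a large, sparse,
# symmetric linear system A u = f of dimension N = m^2.
import numpy as np
import scipy.sparse as sp

m = 64                                        # grid points per direction (assumed value)
h = 1.0 / (m + 1)                             # space discretization parameter h
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
I = sp.identity(m)
A = (sp.kron(I, T) + sp.kron(T, I)) / h**2    # 2D Laplacian via Kronecker sums
f = np.ones(m * m)                            # an illustrative right-hand side

print(A.shape, A.nnz)                         # N = 4096 unknowns, nnz(A) = O(N)
```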
When performing numerical simulations, we deal with two major concerns. The first concern is robustness with respect to model, discretization, and method parameters. Robustness is understood in the sense that the numerical efficiency of the iterative method chosen in Step III should not depend on changes in the parameters. For example, the number of iterations should not increase uncontrollably when h decreases. The numerical efficiency, related to a fast convergence rate, can also be seen as an element of the robustness of the method. The second concern is the efficiency of the implementation in Step IV. It is based on a programming model (such as a shared or distributed memory model), a programming language, and a particular computer platform.
It has been recognized that, in order to achieve fast, accurate, and reliable results from computer simulations, sufficient knowledge is required for all the above steps, in particular Steps II, III, and IV, and awareness of the interplay among them. By choosing one or another discretization method, we may influence the structure and the properties of the arising matrices; by choosing a particular solution method we may ensure robustness with respect to the various problem, discretization, and method parameters; but we may sacrifice the amount of internal parallelism. Knowledge about the computer architecture on which the simulations are to be run may influence the choice of the solution method, which in turn has to be combined with the requirements of accuracy and robustness.
With the ongoing radical shift in the computer architecture toward multicore and many-core computational units, the importance of the above arguments becomes even stronger. The new technology based on multicore and many-core devices provides higher performance capabilities both in terms of computational power (GFlop/s) and memory bandwidth (GB/s). The available and easily accessible power enables scientists to tackle larger problems with higher resolution, providing in this way a better understanding of the world.
The radical shift is clearly seen when we compare the supercomputers available 10 years ago with the personal computers of today. All supercomputers in 2003 contained fewer than 10,000 cores. By the end of 2004, the situation was still the same, with two exceptions. At present, the computer landscape is very different: the Top500 leaders alone have over 500,000 cores. Currently, NVIDIA delivers GPU (graphical processing unit) cards with more than 2500 cores per device (see the NVIDIA K20X GPU). With an improved power supply, four of these cards can be installed in a standard personal computer, and thus one can obtain a system with more than 10,000 cores as a commodity on our desktop.
In order to achieve fast and reliable performance of the iterative methods, it becomes crucial to reconsider and redesign the implementations of well-known algorithms, as well as to gain a deeper insight into the expected performance of the most common solution techniques on multicore heterogeneous computer platforms.
The focus of this chapter is to show how iterative methods can be performed efficiently on highly parallel, heterogeneous platforms. We present various methods and examples to show how this can be done, which includes mathematical description, as well as hardware-specific aspects.
We briefly discuss basic iterative techniques, the two most often used projection-based methods—the conjugate gradient (CG) method [1] and the generalized minimal residual (GMRES) method [2]—and the multigrid (MG) method [3]. We also describe the defect-correction technique [4] as an illustration of an approach particularly suitable for solving linear systems on heterogeneous computers.
Consider the solution of the linear system

  Ax = b,   (2.1)

where A is a nonsingular matrix, so Equation (2.1) has a unique solution. The matrix A is large and sparse, and therefore the number of nonzero elements, nnz(A), is proportional to the size of the matrix, N; that is, nnz(A) = O(N).

Finding the solution to Equation (2.1) is equivalent to finding the root of the equation

  b − Ax = 0.   (2.2)

One straightforward way to introduce simple iterative methods is to rewrite (2.2) as a fixed-point iteration, namely

  x_{k+1} = x_k + (b − Ax_k) = x_k + r_k,  k = 0, 1, ….   (2.3)

The computational procedure (2.3) defines a basic stationary iterative method. The computational cost per iteration involves one matrix–vector multiplication and two vector updates, and is clearly O(N). Such a method, however, usually exhibits too slow a convergence, which manifests itself in unacceptably many iterations. In some cases, convergence may not even be achieved.
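A minimal sketch of the basic stationary iteration (2.3) in Python with a sparse matrix; the tridiagonal test matrix, right-hand side, and tolerance are illustrative assumptions chosen so that the iteration actually converges (the spectral radius of I − A is 0.5 here), not values from the chapter.

```python
# Basic stationary iteration (2.3): x_{k+1} = x_k + (b - A x_k).
# Hypothetical test matrix with diagonal 1 and off-diagonals 0.25, so that
# rho(I - A) = 0.5 < 1 and the fixed-point iteration converges.
import numpy as np
import scipy.sparse as sp

n = 1000
A = sp.diags([0.25, 1.0, 0.25], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x = np.zeros(n)                          # initial guess x_0
for k in range(500):
    r = b - A @ x                        # residual: one sparse matvec, O(nnz(A)) work
    x = x + r                            # one vector update
    if np.linalg.norm(r) < 1e-8 * np.linalg.norm(b):
        print(f"converged in {k + 1} iterations")
        break
```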
Aiming at accelerating the convergence of the iterative process has led to the idea of involving some method parameter, replacing the simple iteration (2.3) by

  x_{k+1} = x_k + τ r_k   or   x_{k+1} = x_k + τ_k r_k,   (2.4)

where r_k = b − Ax_k is the residual at the kth iteration and τ or τ_k are some properly chosen method parameters. In Equation (2.4), the method parameters to tune are scalars. Of course, nothing prevents us from replacing them with a properly chosen matrix, referred to in the sequel as M^{-1}; thus we consider

  x_{k+1} = x_k + M^{-1} r_k.   (2.5)

As will be discussed later, M can also vary during the iterative process. For simplicity, we consider for now that it is some explicitly given nonsingular matrix.

It is easy to see that Equation (2.5) is obtained by replacing the original system by the transformed system

  M^{-1}Ax = M^{-1}b

and applying the fixed-point scheme to it. In this case, the iterative scheme becomes

  x_{k+1} = x_k + M^{-1}(b − Ax_k).   (2.6)

The scheme (2.6) has a higher computational complexity than that of Equation (2.4), since a solution of a system with the matrix M is required at each iteration. Clearly, the achieved decrease in the number of iterations must be significant enough to compensate for the extra cost per iteration.

We see from Equation (2.6) that the choice M = A would lead to a procedure that converges within one iteration. However, the computational cost would be unacceptably high, similar to that of a direct solution method. Clearly, M should satisfy some conditions so that we can achieve faster convergence, keeping at the same time the overall computational costs of the whole iterative method as low as possible.

The matrix M is referred to as a preconditioner to A. We consider next some well-known choices of M, leading to a family of classical iterative methods which are based on the so-called matrix splitting technique.
Intuitively, M has to be related to A. Consider the following splitting of A,

  A = M − E,

where M is nonsingular and E can be seen as an error matrix. Then,

  M^{-1}A = I − M^{-1}E,

where I is the identity matrix of proper order.

The matrix B = M^{-1}E is referred to as the iteration matrix and is used in theoretical derivations to show the convergence of the corresponding iterative method, as well as to estimate its rate of convergence (see [5] for details). Using the splitting, we rewrite Equation (2.5) as follows:

  M x_{k+1} = E x_k + b,

or

  x_{k+1} = M^{-1}(E x_k + b).

Let A be represented as A = D + L + U, where D, L, and U are the diagonal, the strictly lower triangular, and the strictly upper triangular parts of A, respectively. Table 2.1 shows some classical iterative schemes, based on the latter splitting of A.
Table 2.1 Classical Iterative Schemes Based on Matrix Splitting (A = D + L + U)

Method                  Scheme (choice of M)
Jacobi iteration        M = D
Gauss–Seidel backward   M = D + U
Gauss–Seidel forward    M = D + L
SOR                     M = (1/ω)D + L, 0 < ω < 2
For more details on the convergence of these methods, refer to [5].
The common characteristic of these methods is the simplicity of their implementation. Here, M is a diagonal or a triangular matrix, and the degree of parallelism is related to the sparsity structure of the underlying matrices.
The bottleneck of these methods is their slow convergence. Their importance has not been lost, however. Today, they are mostly used as subsolvers in more advanced iterative techniques described in Sections 2.2.2 and 2.5. Because of their low arithmetic cost and ease of implementation, they are important ingredients in the so-called projection methods.
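To make the preconditioning step concrete, here is a sketch of the preconditioned iteration (2.6) with the Jacobi and forward Gauss–Seidel choices of M from Table 2.1; the 1D Poisson-type test matrix and tolerances are illustrative assumptions. Note how many iterations even this small problem takes, which illustrates the slow convergence discussed above, and that the diagonal solve for Jacobi is fully parallel while the triangular solve for Gauss–Seidel is inherently more sequential.

```python
# Preconditioned iteration (2.6), x_{k+1} = x_k + M^{-1}(b - A x_k), with the
# Jacobi (M = D) and forward Gauss-Seidel (M = D + L) splittings of Table 2.1.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve_triangular

n = 50
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")  # 1D Poisson
b = np.ones(n)

def stationary_solve(A, b, apply_Minv, tol=1e-8, maxit=50000):
    x = np.zeros(A.shape[0])
    for k in range(maxit):
        r = b - A @ x
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k
        x = x + apply_Minv(r)             # preconditioning step: solve M z = r
    return x, maxit

d = A.diagonal()                           # Jacobi: pointwise, fully parallel solve
x_j, it_j = stationary_solve(A, b, lambda r: r / d)

M_gs = sp.tril(A, format="csr")            # Gauss-Seidel forward: triangular solve
x_gs, it_gs = stationary_solve(A, b, lambda r: spsolve_triangular(M_gs, r, lower=True))

print(f"Jacobi: {it_j} iterations; Gauss-Seidel: {it_gs} iterations")
```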
The idea behind the projection methods is that the original problem of huge dimension (easily of tens or even hundreds of millions of degrees of freedom) is projected over a subspace of much smaller dimension. An approximate solution is sought in that smaller subspace and then projected back to the original large space. When the projection utilizes some particular subspaces, referred to as Krylov subspaces, the corresponding methods are known as the Krylov subspace methods. For detailed derivations and properties of the projection methods, refer, for example, to [1].
We present the most popular representatives of the Krylov subspace methods, namely, the CG method, used for solving linear systems with symmetric and positive-definite matrices, and the GMRES method, which is suitable for general nonsymmetric matrices. The presentation of the algorithms follows [1].
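Since the notation below refers to the preconditioning step inside these algorithms, a compact Python rendering of preconditioned CG in its standard form (cf. [1]) may help; the interface and stopping criterion are illustrative choices, not the chapter's.

```python
# Preconditioned conjugate gradient (standard form, cf. [1]) for A x = b with
# symmetric positive-definite A; apply_Minv(r) performs the preconditioning
# step, i.e., returns z = M^{-1} r by solving M z = r.
import numpy as np

def pcg(A_mul, b, apply_Minv, tol=1e-8, maxit=1000):
    x = np.zeros_like(b)
    r = b.copy()                      # r_0 = b - A x_0, with x_0 = 0
    z = apply_Minv(r)
    p = z.copy()
    rz = r @ z
    for k in range(maxit):
        Ap = A_mul(p)                 # one matrix-vector multiplication
        alpha = rz / (p @ Ap)         # first of the two scalar products
        x += alpha * p                # three vector updates: x, r, p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k + 1
        z = apply_Minv(r)             # solution of a system with M
        rz_new = r @ z                # second scalar product
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x, maxit
```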
The notation z = M^{-1}r should be interpreted as the solution of a system with the preconditioner M.
The arithmetic complexity per CG iteration includes one matrix–vector multiplication, three vector updates, two scalar products, and the solution of a system with M. Provided that A is sparse, the arithmetic cost per iteration, excluding that to solve a system with M, is linear in the number of degrees of freedom.
In contrast to CG, the arithmetic cost per GMRES iteration increases with the number of iterations. We need to keep an increasing number of basis vectors v_1, …, v_k, as well as the corresponding preconditioned vectors z_j = M^{-1}v_j. We have to do more scalar products at each iteration, and also solve a small least-squares problem with the Hessenberg matrix H_k, which is also of growing dimension.
If we assume that the preconditioner M is such that the number of CG or GMRES iterations is kept small and bounded independently of the problem size, we see that the major ingredients of the iterative methods are matrix–vector multiplications, vector updates, and scalar products. Parallelization of these operations is well studied, and efficient computational libraries are available (e.g., [6, 7]).
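In practice such solvers are rarely hand-coded; a usage sketch with SciPy's built-in CG and GMRES follows, passing the Jacobi preconditioner of Table 2.1 as a LinearOperator. The well-conditioned test matrix and tolerance are assumptions, and older SciPy versions spell the tolerance argument tol instead of rtol.

```python
# Using library CG/GMRES with a preconditioner supplied as a LinearOperator.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, gmres, LinearOperator

n = 1024
A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")  # SPD test matrix
b = np.ones(n)

d = A.diagonal()
M = LinearOperator(A.shape, matvec=lambda r: r / d)       # Jacobi preconditioner, M = D

x_cg, info_cg = cg(A, b, M=M, rtol=1e-8)
x_gm, info_gm = gmres(A, b, M=M, restart=30, rtol=1e-8)   # restarting bounds memory growth
print(info_cg, info_gm)                                   # 0 means converged
```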
The crucial element when applying iterative solution methods on parallel computer architectures is the preconditioning step. Experience shows that the most numerically efficient preconditioners are among the most difficult to parallelize efficiently. As we will discuss in the next sections, a compromise is usually made in order to optimize the overall performance of the iterative solver.
Among the first who coined the term preconditioning in its modern sense were D.J. Evans and O. Axelsson in the late 1960s and early 1970s [5, 8]. Preconditioning is usually understood via an auxiliary matrix, M, that improves the condition number of the product M^{-1}A (the preconditioned matrix) and thus enhances the convergence.
We note that the preconditioner may be available in an explicit matrix form, but it can also be implicitly defined by some procedure, denoted as M^{-1}(·), that implements the action of a preconditioner on a vector or the solution of a system with it. We mention further that the best-known preconditioners so far are implicitly defined. These include MG methods [3], algebraic multigrid (AMG) methods [9], algebraic multilevel iteration (AMLI) methods [10, 11], and some domain decomposition (DD) preconditioners [12].
The preconditioner M may be an approximation of A, and during the preconditioning step we solve a system with it. However, M may also be some (sparse) approximation of A^{-1}, and then the action of the preconditioner on a vector reduces to a matrix–vector multiplication.
