There is a software gap between the hardware potential and the performance that can be attained using today's software parallel program development tools. The tools need manual intervention by the programmer to parallelize the code. Programming a parallel computer requires closely studying the target algorithm or application, more so than in the traditional sequential programming we have all learned. The programmer must be aware of the communication and data dependencies of the algorithm or application. This book provides the techniques to explore the possible ways to program a parallel computer for a given application.
Table of Contents
Cover
Table of Contents
Title page
Copyright page
Dedication
Preface
ABOUT THIS BOOK
CHAPTER ORGANIZATION AND OVERVIEW
ACKNOWLEDGMENTS
COMMENTS AND SUGGESTIONS
List of Acronyms
Chapter 1 Introduction
1.1 INTRODUCTION
1.2 TOWARD AUTOMATING PARALLEL PROGRAMMING
1.3 ALGORITHMS
1.4 PARALLEL COMPUTING DESIGN CONSIDERATIONS
1.5 PARALLEL ALGORITHMS AND PARALLEL ARCHITECTURES
1.6 RELATING PARALLEL ALGORITHM AND PARALLEL ARCHITECTURE
1.7 IMPLEMENTATION OF ALGORITHMS: A TWO-SIDED PROBLEM
1.8 MEASURING BENEFITS OF PARALLEL COMPUTING
1.9 AMDAHL’S LAW FOR MULTIPROCESSOR SYSTEMS
1.10 GUSTAFSON–BARSIS’S LAW
1.11 APPLICATIONS OF PARALLEL COMPUTING
Chapter 2 Enhancing Uniprocessor Performance
2.1 INTRODUCTION
2.2 INCREASING PROCESSOR CLOCK FREQUENCY
2.3 PARALLELIZING ALU STRUCTURE
2.4 USING MEMORY HIERARCHY
2.5 PIPELINING
2.6 VERY LONG INSTRUCTION WORD (VLIW) PROCESSORS
2.7 INSTRUCTION-LEVEL PARALLELISM (ILP) AND SUPERSCALAR PROCESSORS
2.8 MULTITHREADED PROCESSOR
Chapter 3 Parallel Computers
3.1 INTRODUCTION
3.2 PARALLEL COMPUTING
3.3 SHARED-MEMORY MULTIPROCESSORS (UNIFORM MEMORY ACCESS [UMA])
3.4 DISTRIBUTED-MEMORY MULTIPROCESSOR (NONUNIFORM MEMORY ACCESS [NUMA])
3.5 SIMD PROCESSORS
3.6 SYSTOLIC PROCESSORS
3.7 CLUSTER COMPUTING
3.8 GRID (CLOUD) COMPUTING
3.9 MULTICORE SYSTEMS
3.10 SM
3.11 COMMUNICATION BETWEEN PARALLEL PROCESSORS
3.12 SUMMARY OF PARALLEL ARCHITECTURES
Chapter 4 Shared-Memory Multiprocessors
4.1 INTRODUCTION
4.2 CACHE COHERENCE AND MEMORY CONSISTENCY
4.3 SYNCHRONIZATION AND MUTUAL EXCLUSION
Chapter 5 Interconnection Networks
5.1 INTRODUCTION
5.2 CLASSIFICATION OF INTERCONNECTION NETWORKS BY LOGICAL TOPOLOGIES
5.3 INTERCONNECTION NETWORK SWITCH ARCHITECTURE
Chapter 6 Concurrency Platforms
6.1 INTRODUCTION
6.2 CONCURRENCY PLATFORMS
6.3 CILK++
6.4 OpenMP
6.5 COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)
Chapter 7 Ad Hoc Techniques for Parallel Algorithms
7.1 INTRODUCTION
7.2 DEFINING ALGORITHM VARIABLES
7.3 INDEPENDENT LOOP SCHEDULING
7.4 DEPENDENT LOOPS
7.5 LOOP SPREADING FOR SIMPLE DEPENDENT LOOPS
7.6 LOOP UNROLLING
7.7 PROBLEM PARTITIONING
7.8 DIVIDE-AND-CONQUER (RECURSIVE PARTITIONING) STRATEGIES
7.9 PIPELINING
Chapter 8 Nonserial–Parallel Algorithms
8.1 INTRODUCTION
8.2 COMPARING DAG AND DCG ALGORITHMS
8.3 PARALLELIZING NSPA ALGORITHMS REPRESENTED BY A DAG
8.4 FORMAL TECHNIQUE FOR ANALYZING NSPAs
8.5 DETECTING CYCLES IN THE ALGORITHM
8.6 EXTRACTING SERIAL AND PARALLEL ALGORITHM PERFORMANCE PARAMETERS
8.7 USEFUL THEOREMS
8.8 PERFORMANCE OF SERIAL AND PARALLEL ALGORITHMS ON PARALLEL COMPUTERS
Chapter 9 z-Transform Analysis
9.1 INTRODUCTION
9.2 DEFINITION OF z-TRANSFORM
9.3 THE 1-D FIR DIGITAL FILTER ALGORITHM
9.4 SOFTWARE AND HARDWARE IMPLEMENTATIONS OF THE z-TRANSFORM
9.5 DESIGN 1: USING HORNER’S RULE FOR BROADCAST INPUT AND PIPELINED OUTPUT
9.6 DESIGN 2: PIPELINED INPUT AND BROADCAST OUTPUT
9.7 DESIGN 3: PIPELINED INPUT AND OUTPUT
Chapter 10 Dependence Graph Analysis
10.1 INTRODUCTION
10.2 THE 1-D FIR DIGITAL FILTER ALGORITHM
10.3 THE DEPENDENCE GRAPH OF AN ALGORITHM
10.4 DERIVING THE DEPENDENCE GRAPH FOR AN ALGORITHM
10.5 THE SCHEDULING FUNCTION FOR THE 1-D FIR FILTER
10.6 NODE PROJECTION OPERATION
10.7 NONLINEAR PROJECTION OPERATION
10.8 SOFTWARE AND HARDWARE IMPLEMENTATIONS OF THE DAG TECHNIQUE
Chapter 11 Computational Geometry Analysis
11.1 INTRODUCTION
11.2 MATRIX MULTIPLICATION ALGORITHM
11.3 THE 3-D DEPENDENCE GRAPH AND COMPUTATION DOMAIN
11.4 THE FACETS AND VERTICES OF THE COMPUTATION DOMAIN
11.5 THE DEPENDENCE MATRICES OF THE ALGORITHM VARIABLES
11.6 NULLSPACE OF DEPENDENCE MATRIX: THE BROADCAST SUBDOMAIN B
11.7 DESIGN SPACE EXPLORATION: CHOICE OF BROADCASTING VERSUS PIPELINING VARIABLES
11.8 DATA SCHEDULING
11.9 PROJECTION OPERATION USING THE LINEAR PROJECTION OPERATOR
11.10 EFFECT OF PROJECTION OPERATION ON DATA
11.11 THE RESULTING MULTITHREADED/MULTIPROCESSOR ARCHITECTURE
11.12 SUMMARY OF WORK DONE IN THIS CHAPTER
Chapter 12 Case Study: One-Dimensional IIR Digital Filters
12.1 INTRODUCTION
12.2 THE 1-D IIR DIGITAL FILTER ALGORITHM
12.3 THE IIR FILTER DEPENDENCE GRAPH
12.4 z-DOMAIN ANALYSIS OF 1-D IIR DIGITAL FILTER ALGORITHM
Chapter 13 Case Study: Two- and Three-Dimensional Digital Filters
13.1 INTRODUCTION
13.2 LINE AND FRAME WRAPAROUND PROBLEMS
13.3 2-D RECURSIVE FILTERS
13.4 3-D DIGITAL FILTERS
Chapter 14 Case Study: Multirate Decimators and Interpolators
14.1 INTRODUCTION
14.2 DECIMATOR STRUCTURES
14.3 DECIMATOR DEPENDENCE GRAPH
14.4 DECIMATOR SCHEDULING
14.5 DECIMATOR DAG FOR s1 = [1 0]
14.6 DECIMATOR DAG FOR s2 = [1 −1]
14.7 DECIMATOR DAG FOR s3 = [1 1]
14.8 POLYPHASE DECIMATOR IMPLEMENTATIONS
14.9 INTERPOLATOR STRUCTURES
14.10 INTERPOLATOR DEPENDENCE GRAPH
14.11 INTERPOLATOR SCHEDULING
14.12 INTERPOLATOR DAG FOR s1 = [1 0]
14.13 INTERPOLATOR DAG FOR s2 = [1 −1]
14.14 INTERPOLATOR DAG FOR s3 = [1 1]
14.15 POLYPHASE INTERPOLATOR IMPLEMENTATIONS
Chapter 15 Case Study: Pattern Matching
15.1 INTRODUCTION
15.2 EXPRESSING THE ALGORITHM AS A REGULAR ITERATIVE ALGORITHM (RIA)
15.3 OBTAINING THE ALGORITHM DEPENDENCE GRAPH
15.4 DATA SCHEDULING
15.5 DAG NODE PROJECTION
15.6 DESIGN 1: DESIGN SPACE EXPLORATION WHEN s = [1 1]^t
15.7 DESIGN 2: DESIGN SPACE EXPLORATION WHEN s = [1 −1]^t
15.8 DESIGN 3: DESIGN SPACE EXPLORATION WHEN s = [1 0]^t
Chapter 16 Case Study: Motion Estimation for Video Compression
16.1 INTRODUCTION
16.2 FBMAS
16.3 DATA BUFFERING REQUIREMENTS
16.4 FORMULATION OF THE FBMA
16.5 HIERARCHICAL FORMULATION OF MOTION ESTIMATION
16.6 HARDWARE DESIGN OF THE HIERARCHY BLOCKS
Chapter 17 Case Study: Multiplication over GF(2^m)
17.1 INTRODUCTION
17.2 THE MULTIPLICATION ALGORITHM IN GF(2^m)
17.3 EXPRESSING FIELD MULTIPLICATION AS AN RIA
17.4 FIELD MULTIPLICATION DEPENDENCE GRAPH
17.5 DATA SCHEDULING
17.6 DAG NODE PROJECTION
17.7 DESIGN 1: USING d1 = [1 0]^t
17.8 DESIGN 2: USING d2 = [1 1]^t
17.9 DESIGN 3: USING d3 = [1 −1]^t
17.10 APPLICATIONS OF FINITE FIELD MULTIPLIERS
Chapter 18 Case Study: Polynomial Division over GF(2)
18.1 INTRODUCTION
18.2 THE POLYNOMIAL DIVISION ALGORITHM
18.3 THE LFSR DEPENDENCE GRAPH
18.4 DATA SCHEDULING
18.5 DAG NODE PROJECTION
18.6 DESIGN 1: DESIGN SPACE EXPLORATION WHEN s1 = [1 −1]
18.7 DESIGN 2: DESIGN SPACE EXPLORATION WHEN s2 = [1 0]
18.8 DESIGN 3: DESIGN SPACE EXPLORATION WHEN s3 = [1 −0.5]
18.9 COMPARING THE THREE DESIGNS
Chapter 19 The Fast Fourier Transform
19.1 INTRODUCTION
19.2 DECIMATION-IN-TIME FFT
19.3 PIPELINE RADIX-2 DECIMATION-IN-TIME FFT PROCESSOR
19.4 DECIMATION-IN-FREQUENCY FFT
19.5 PIPELINE RADIX-2 DECIMATION-IN-FREQUENCY FFT PROCESSOR
Chapter 20 Solving Systems of Linear Equations
20.1 INTRODUCTION
20.2 SPECIAL MATRIX STRUCTURES
20.3 FORWARD SUBSTITUTION (DIRECT TECHNIQUE)
20.4 BACK SUBSTITUTION
20.5 MATRIX TRIANGULARIZATION ALGORITHM
20.6 SUCCESSIVE OVER RELAXATION (SOR) (ITERATIVE TECHNIQUE)
Chapter 21 Solving Partial Differential Equations Using Finite Difference Method
21.1 INTRODUCTION
21.2 FDM FOR 1-D SYSTEMS
References
Index
Copyright © 2011 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data
Gebali, Fayez.
Algorithms and parallel computing/Fayez Gebali.
p. cm.—(Wiley series on parallel and distributed computing ; 82)
Includes bibliographical references and index.
ISBN 978-0-470-90210-3 (hardback)
ISBN 978-0-470-93463-0 (ebk)
1. Parallel processing (Electronic computers) 2. Computer algorithms. I. Title.
QA76.58.G43 2011
004′.35—dc22
2010043659
To my children: Michael Monir, Tarek Joseph, Aleya Lee, and Manel Alia
Preface
ABOUT THIS BOOK
There is a software gap between hardware potential and the performance that can be attained using today's software parallel program development tools. The tools need manual intervention by the programmer to parallelize the code. This book is intended to give the programmer the techniques necessary to explore parallelism in algorithms, serial as well as iterative. Parallel computing is now moving from the realm of specialized, expensive systems available to a select few groups into almost every computing system in use today. We can find parallel computers in our laptops and desktops, and embedded in our smart phones. The applications and algorithms targeted to parallel computers were traditionally confined to weather prediction, wind tunnel simulations, computational biology, and signal processing. Nowadays, just about any application that runs on a computer will encounter the parallel processors available in almost every system.
Parallel algorithms can now be designed to run on special-purpose parallel processors, or on general-purpose parallel processors using several multilevel techniques such as parallel program development, parallelizing compilers, multithreaded operating systems, and superscalar processors. This book covers the first option: the design of special-purpose parallel processor architectures to implement a given class of algorithms. We call such systems accelerator cores. This book forms the basis for a course on the design and analysis of parallel algorithms. The course would cover Chapters 1–4 and then select several of the case study chapters that constitute the remainder of the book.
Although very large-scale integration (VLSI) technology allows us to integrate more processors on the same chip, parallel programming is not advancing to match these technological advances. An obvious application of parallel hardware is to design special-purpose parallel processors primarily intended for use as accelerator cores in multicore systems. This is motivated by two practicalities: the prevalence of multicore systems in current computing platforms and the abundance of simple parallel algorithms that are needed in many systems, such as in data encryption/decryption, graphics processing, digital signal processing and filtering, and many more.
It is simpler to start by stating what this book is not about. This book does not attempt to give detailed coverage of computer architecture, parallel computers, or algorithms in general. Each of these three topics deserves a large textbook of its own to cover it well. Furthermore, there are standard and excellent textbooks for each, such as Computer Organization and Design by D.A. Patterson and J.L. Hennessy; Parallel Computer Architecture by D.E. Culler, J.P. Singh, and A. Gupta; and, finally, Introduction to Algorithms by T.H. Cormen, C.E. Leiserson, and R.L. Rivest. I hope many readers were fortunate enough to study these topics in courses that adopted the above textbooks. My apologies if I did not include a comprehensive list of equally good textbooks on the above subjects.
This book, on the other hand, shows how to systematically design special-purpose parallel processing structures to implement algorithms. The techniques presented here are general and can be applied to many algorithms, parallel or otherwise.
This book is intended for researchers and graduate students in computer engineering, electrical engineering, and computer science. The prerequisites for this book are basic knowledge of linear algebra and digital signal processing. The objectives of this book are (1) to explain several techniques for expressing a parallel algorithm as a dependence graph or as a set of dependence matrices; (2) to explore scheduling schemes for the processing tasks while conforming to input and output data timing, and to be able to pipeline some data and broadcast other data to all processors; and (3) to explore allocation schemes for the processing tasks to processing elements.
CHAPTER ORGANIZATION AND OVERVIEW
Chapter 1 defines the main classes of algorithms dealt with in this book: serial algorithms, parallel algorithms, and regular iterative algorithms. Design considerations for parallel computers are discussed, as well as their close tie to parallel algorithms. The benefits of using parallel computers are quantified in terms of the speedup factor and the effect of communication overhead between the processors. The chapter concludes by discussing two applications of parallel computers.
Chapter 2 discusses the techniques used to enhance the performance of a single computer, such as increasing the clock frequency, parallelizing the arithmetic and logic unit (ALU) structure, pipelining, very long instruction word (VLIW), superscalar computing, and multithreading.
Chapter 3 reviews the main types of parallel computers discussed here, which include shared memory, distributed memory, single instruction multiple data stream (SIMD), systolic processors, and multicore systems.
Chapter 4 reviews shared-memory multiprocessor systems and discusses two main issues intimately related to them: cache coherence and process synchronization.
Chapter 5 reviews the types of interconnection networks used in parallel processors. We discuss simple networks such as buses and move on to star, ring, and mesh topologies. More efficient networks such as crossbar and multistage interconnection networks are discussed.
Chapter 6 reviews the concurrency platform software tools developed to help the programmer parallelize the application. Tools reviewed include Cilk++, OpenMP, and compute unified device architecture (CUDA). It is stressed, however, that these tools deal with simple data dependencies. It is the responsibility of the programmer to ensure data integrity and correct timing of task execution. The techniques developed in this book help the programmer toward this goal for serial algorithms and for regular iterative algorithms.
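To make the last point concrete, here is a minimal C/OpenMP sketch (an illustration of my own, not taken from the book). The pragma distributes the loop iterations over threads, but nothing verifies that the iterations really are independent; that responsibility stays with the programmer.

```c
/* Minimal OpenMP sketch (illustrative only; compile with, e.g., cc -fopenmp).
 * Each iteration below is independent, so the pragma is safe.  If y[i]
 * instead depended on y[i - 1], the same pragma would silently produce
 * wrong results: the tool does not check data dependencies for us.      */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];

    for (int i = 0; i < N; i++)
        x[i] = (double)i;

    #pragma omp parallel for            /* programmer asserts independence */
    for (int i = 0; i < N; i++)
        y[i] = 2.0 * x[i] + 1.0;        /* no cross-iteration dependence   */

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}
```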
Chapter 7 reviews the ad hoc techniques used to implement algorithms on parallel computers. These techniques include independent loop scheduling, dependent loop spreading, dependent loop unrolling, problem partitioning, and divide-and-conquer strategies. Pipelining at the algorithm task level is discussed, and the technique is illustrated using the coordinate rotation digital computer (CORDIC) algorithm.
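As a small illustration of one of these ad hoc techniques, the following C sketch (my own example, not the book's) unrolls an accumulation loop by a factor of four; the four partial sums are independent of one another, so a superscalar or VLIW processor, or separate threads, can work on them concurrently.

```c
/* Illustrative loop-unrolling sketch (not the book's code).  The rolled
 * loop performs one dependent accumulation per iteration; the unrolled
 * version keeps four independent partial sums, exposing parallelism.   */
#include <stdio.h>

#define N 1024                        /* assumed to be a multiple of 4 */

double dot(const double *a, const double *b)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < N; i += 4) {  /* unrolled by a factor of 4 */
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];    /* four independent chains */
    }
    return (s0 + s1) + (s2 + s3);     /* combine the partial sums */
}

int main(void)
{
    static double a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }
    printf("dot = %f\n", dot(a, b));  /* expect 2048.0 */
    return 0;
}
```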
Chapter 8 deals with nonserial–parallel algorithms (NSPAs), which cannot be described as serial, parallel, or serial–parallel algorithms. NSPAs constitute the majority of general algorithms; their parallelism is not apparent, or their task dependence pattern is confusing. The chapter discusses a formal, very powerful, and simple technique for extracting parallelism from an algorithm. The main advantage of the formal technique is that it gives us the best schedule for evaluating the algorithm on a parallel machine. The technique also tells us how many parallel processors are required to achieve maximum execution speedup. The technique enables us to extract important NSPA performance parameters such as work (W), parallelism (P), and depth (D).
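A minimal sketch of these parameters, under assumptions of my own (unit-time tasks and a small hand-coded DAG, not an example from the book): work W is the number of tasks, depth D is the number of tasks on the longest dependence chain, and W/D bounds the useful speedup.

```c
/* Illustrative sketch: for a DAG of unit-time tasks, W = node count and
 * D = length (in nodes) of the longest path.  Example DAG (made up):
 * 0->2, 1->2, 2->3, 2->4, 3->5, 4->5.                                   */
#include <stdio.h>

#define N 6

int main(void)
{
    int edges[][2] = { {0,2}, {1,2}, {2,3}, {2,4}, {3,5}, {4,5} };
    int nedges = (int)(sizeof edges / sizeof edges[0]);
    int level[N];                     /* longest chain ending at each node */

    for (int v = 0; v < N; v++)
        level[v] = 1;

    /* Nodes are numbered in topological order and edges are listed by
     * source, so one forward sweep propagates longest-path lengths.     */
    for (int e = 0; e < nedges; e++) {
        int u = edges[e][0], v = edges[e][1];
        if (level[u] + 1 > level[v])
            level[v] = level[u] + 1;
    }

    int W = N, D = 0;
    for (int v = 0; v < N; v++)
        if (level[v] > D)
            D = level[v];

    printf("W = %d, D = %d, W/D = %.2f\n", W, D, (double)W / D);
    return 0;   /* here W = 6, D = 4: at most about 1.5x useful speedup */
}
```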
Chapter 9 introduces the z-transform technique. This technique is used for studying the implementation of digital filters and multirate systems on different parallel processing machines. These types of applications are normally described in the z-domain, so it is natural to study their software and hardware implementation in that domain as well.
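For orientation, the 1-D FIR filter that serves as the running example of this technique is the standard convolution below, together with its z-domain form; these are textbook DSP relations, and the notation may differ slightly from the book's.

```latex
% 1-D FIR filter with N coefficients a(k), in the time domain and z-domain:
y(n) = \sum_{k=0}^{N-1} a(k)\, x(n-k)
\qquad\Longleftrightarrow\qquad
Y(z) = \Bigl(\sum_{k=0}^{N-1} a(k)\, z^{-k}\Bigr) X(z)
```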
Chapter 10 discusses how to construct the dependence graph associated with an iterative algorithm. This technique applies, however, only to iterative algorithms that have one, two, or three indices at the most. The dependence graph will help us schedule tasks and automatically allocate them to software threads or hardware processors.
Chapter 11 discusses an iterative algorithm analysis technique that is based on computational geometry and linear algebra concepts. The technique is general in the sense that it can handle iterative algorithms with more than three indices. An example is two-dimensional (2-D) or three-dimensional (3-D) digital filters. For such algorithms, we represent the algorithm as a convex hull in a multidimensional space and associate a dependence matrix with each variable of the algorithm. The null space of these matrices will help us derive the different parallel software threads and hardware processing elements and their proper timing.
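As a concrete anchor for this style of analysis, the triple-loop matrix multiplication below is the kind of 3-index regular iterative algorithm the technique targets (an illustrative C sketch, not the book's code). The comments note which iteration indices each variable depends on, which is roughly what the dependence matrices and their null spaces formalize.

```c
/* Matrix multiplication C = A * B written as a 3-index iterative algorithm
 * (illustrative sketch).  Each point (i, j, k) of the computation domain
 * performs one multiply-accumulate.                                      */
#include <stdio.h>

#define N 3

int main(void)
{
    double A[N][N], B[N][N], C[N][N] = {{0}};

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + 1;                /* made-up data      */
            B[i][j] = (i == j) ? 1.0 : 0.0; /* identity matrix   */
        }

    for (int i = 0; i < N; i++)             /* computation domain: */
        for (int j = 0; j < N; j++)         /* 0 <= i, j, k < N    */
            for (int k = 0; k < N; k++)
                /* A depends on (i, k), B on (k, j), C on (i, j): each
                 * variable ignores one index, and that missing index is
                 * the direction along which it can be broadcast to, or
                 * pipelined through, the processing elements.           */
                C[i][j] += A[i][k] * B[k][j];

    printf("C[0][0] = %g (expect 1)\n", C[0][0]);
    return 0;
}
```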
Chapter 12 explores different parallel processing structures for one-dimensional (1-D) infinite impulse response (IIR) digital filters. We start by deriving possible hardware structures using the geometric technique of Chapter 11. Then, we explore possible parallel processing structures using the z-transform technique of Chapter 9.
Chapter 13 explores different parallel processing structures for 2-D and 3-D infinite impulse response (IIR) digital filters. We use the z-transform technique for this type of filter.
Chapter 14 explores different parallel processing structures for multirate decimators and interpolators. These algorithms are very useful in many applications, especially telecommunications. We use the dependence graph technique of Chapter 10 to derive different parallel processing structures.
Chapter 15 explores different parallel processing structures for the pattern matching problem. We use the dependence graph technique of Chapter 10 to study this problem.
Chapter 16 explores different parallel processing structures for the motion estimation algorithm used in video data compression. In order to deal with this complex algorithm, we use a hierarchical technique to simplify the problem and then use the dependence graph technique of Chapter 10 to study it.
Chapter 17 explores different parallel processing structures for finite-field multiplication over GF(2^m). The multiplication algorithm is studied using the dependence graph technique of Chapter 10.
Chapter 18 explores different parallel processing structures for finite-field polynomial division over GF(2). The division algorithm is studied using the dependence graph technique of Chapter 10.
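For concreteness, the serial operation underneath this case study is ordinary long division of one binary polynomial by another, the operation an LFSR performs bit by bit. The following C sketch is a minimal serial version of my own, with made-up example polynomials; it is not the book's parallel design.

```c
/* Bit-serial polynomial division over GF(2) (illustrative sketch).
 * Polynomials are stored in unsigned ints with bit i representing x^i.
 * The dividend is shifted in MSB first; whenever the remainder register
 * reaches degree deg_g it is reduced by XORing with the divisor.        */
#include <stdio.h>

unsigned gf2_mod(unsigned a, unsigned g, int deg_a, int deg_g)
{
    unsigned r = 0;                        /* remainder register          */
    for (int i = deg_a; i >= 0; i--) {
        r = (r << 1) | ((a >> i) & 1u);    /* shift in next dividend bit  */
        if (r & (1u << deg_g))             /* degree reached deg_g?       */
            r ^= g;                        /* subtract (XOR) the divisor  */
    }
    return r;
}

int main(void)
{
    /* Example: a(x) = x^7 + x^5 + x + 1, g(x) = x^3 + x + 1.             */
    unsigned a = 0xA3, g = 0xB;
    printf("remainder = 0x%X\n", gf2_mod(a, g, 7, 3)); /* expect 0x5 = x^2 + 1 */
    return 0;
}
```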
Chapter 19 explores different parallel processing structures for the fast Fourier transform algorithm. Pipeline techniques for implementing the algorithm are reviewed.
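For orientation, the radix-2 decimation-in-time idea reviewed here rests on the standard even/odd split of the discrete Fourier transform; this is textbook material, and the notation below may differ slightly from the book's (with W_N = e^{-j2π/N}).

```latex
% Radix-2 decimation-in-time split of an N-point DFT (N even):
X(k) = \underbrace{\sum_{m=0}^{N/2-1} x(2m)\, W_{N/2}^{mk}}_{E(k)}
     + W_N^{k}\underbrace{\sum_{m=0}^{N/2-1} x(2m+1)\, W_{N/2}^{mk}}_{O(k)},
\qquad
X\!\left(k + \tfrac{N}{2}\right) = E(k) - W_N^{k}\, O(k),
\qquad 0 \le k < \tfrac{N}{2}
```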
Chapter 20 discusses solving systems of linear equations. These systems could be solved using direct and indirect techniques. The chapter discusses how to parallelize the forward substitution direct technique. An algorithm to convert a dense matrix to an equivalent triangular form using Givens rotations is also studied. The chapter also discusses how to parallelize the successive over-relaxation (SOR) indirect technique.
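For reference, a plain serial forward substitution for a lower triangular system Lx = b is shown below; this is a minimal C sketch of the textbook recurrence with a made-up example, not the book's parallel design. The multiply-accumulate terms inside each step are mutually independent, which is what a parallel implementation exploits.

```c
/* Serial forward substitution for L x = b, L lower triangular
 * (illustrative sketch; example data chosen so that x = (1, 1, 1)). */
#include <stdio.h>

#define N 3

int main(void)
{
    double L[N][N] = { {2, 0, 0},
                       {1, 3, 0},
                       {4, 5, 6} };
    double b[N] = { 2, 4, 15 };
    double x[N];

    for (int i = 0; i < N; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++)    /* independent multiply-accumulates */
            s -= L[i][j] * x[j];
        x[i] = s / L[i][i];
    }

    for (int i = 0; i < N; i++)
        printf("x[%d] = %g\n", i, x[i]);   /* expect 1, 1, 1 */
    return 0;
}
```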
Chapter 21 discusses solving partial differential equations using the finite difference method (FDM). Such equations are very important in many engineering and scientific applications and demand massive computation resources.
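As a reminder of the standard 1-D building block of the FDM (textbook material, not specific to this book's notation), the second derivative at a grid point x_i = ih is replaced by a central difference, turning the differential equation into a large set of coupled algebraic equations over the grid points:

```latex
% Central-difference approximation of the second derivative on a uniform grid:
\left.\frac{d^{2}u}{dx^{2}}\right|_{x_i} \approx \frac{u_{i+1} - 2u_i + u_{i-1}}{h^{2}},
\qquad x_i = i\,h
```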
ACKNOWLEDGMENTS
I wish to express my deep gratitude and thank Dr. M.W. El-Kharashi of Ain Shams University in Egypt for his excellent suggestions and encouragement during the preparation of this book. I also wish to express my personal appreciation of each of the following colleagues whose collaboration contributed to the topics covered in this book:
Dr. Esam Abdel-Raheem, University of Windsor, Canada
Dr. Turki Al-Somani, Al-Baha University, Saudi Arabia
Dr. Atef Ibrahim, Electronics Research Institute, Egypt
Dr. Mohamed Fayed, Al-Azhar University, Egypt
Mr. Brian McKinney, ICEsoft, Canada
Dr. Newaz Rafiq, ParetoLogic, Inc., Canada
Dr. Mohamed Rehan, British University, Egypt
Dr. Ayman Tawfik, Ajman University, United Arab Emirates
COMMENTS AND SUGGESTIONS
This book covers a wide range of techniques and topics related to parallel computing. It is highly probable that it contains errors and omissions. Other researchers and/or practicing engineers might have other ideas about the content and organization of a book of this nature. We welcome comments and suggestions for consideration. If you find any errors, we would appreciate hearing from you. We also welcome ideas for examples and problems (along with their solutions if possible) to include with proper citation.
Please send your comments and bug reports electronically to [email protected], or you can fax or mail the information to
Dr. FAYEZ GEBALI
Electrical and Computer Engineering Department
University of Victoria, Victoria, B.C., Canada V8W 3P6
Tel: 250-721-6509
Fax: 250-721-6052
List of Acronyms
Chapter 1
Introduction
1.1 INTRODUCTION
The idea of a single-processor computer is fast becoming archaic and quaint. We now have to adjust our strategies when it comes to computing:
• It is impossible to keep improving computer performance using a single processor; such a processor would consume unacceptable power. It is more practical to use many simple processors to attain the desired performance, perhaps using thousands of such simple computers [1].
• As a result of the above observation, if an application is not running fast on a single-processor machine, it will run even slower on new machines unless it takes advantage of parallel processing.
• Programming tools that can detect parallelism in a given algorithm have to be developed. An algorithm can show regular dependence among its variables, or that dependence could be irregular. In either case, there is room for speeding up the algorithm execution, provided that some subtasks can run concurrently while the correctness of execution is maintained.
• Optimizing future computer performance will hinge on good parallel programming at all levels: algorithms, program development, operating system, compiler, and hardware.
• The benefits of parallel computing need to take into consideration the number of processors being deployed as well as the communication overhead of processor-to-processor and processor-to-memory transfers. Compute-bound problems are ones wherein potential speedup depends on the speed of execution of the algorithm by the processors. Communication-bound problems are ones wherein potential speedup depends on the speed of supplying the data to and extracting the data from the processors. A small numerical sketch of this last point follows the list.
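As a rough numerical illustration of the last point (and of Amdahl's law from Section 1.9), the following C sketch compares the ideal speedup for a serial fraction f with a variant that charges a communication overhead growing linearly in the number of processors N. The values of f and of the per-processor overhead are made up, and the overhead model is a simplification of my own, not the book's.

```c
/* Back-of-the-envelope speedup sketch (illustrative only).  Amdahl's law
 * gives S(N) = 1 / (f + (1 - f)/N); the second column adds a crude
 * communication cost c*N to the normalized parallel execution time.     */
#include <stdio.h>

int main(void)
{
    double f = 0.05;    /* assumed serial (non-parallelizable) fraction   */
    double c = 0.002;   /* assumed per-processor communication overhead   */

    printf("    N   ideal S(N)   with comm. overhead\n");
    for (int N = 1; N <= 1024; N *= 2) {
        double ideal = 1.0 / (f + (1.0 - f) / N);
        double comm  = 1.0 / (f + (1.0 - f) / N + c * N);
        printf("%5d   %10.2f   %10.2f\n", N, ideal, comm);
    }
    return 0;
}
```

With these assumed numbers, the ideal curve saturates near 1/f = 20 as N grows, while the overhead-adjusted curve peaks at a modest processor count and then falls, which is the compute-bound versus communication-bound distinction in miniature.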
