Distributed Computing with Python

Francesco Pierfederici
Description

Harness the power of multiple computers using Python through this fast-paced, informative guide

About This Book

  • You'll learn to write data processing programs in Python that are highly available, reliable, and fault tolerant
  • Make use of Amazon Web Services along with Python to establish a powerful remote computation system
  • Train Python to handle data-intensive and resource-hungry applications

Who This Book Is For

This book is for Python developers who have developed Python programs for data processing and now want to learn how to write fast, efficient programs that perform CPU-intensive data processing tasks.

What You Will Learn

  • Get an introduction to parallel and distributed computing
  • Work with synchronous and asynchronous programming
  • Explore parallelism in Python
  • Build distributed applications with Celery
  • Run Python in the cloud
  • Run Python on an HPC cluster
  • Test and debug distributed applications

In Detail

CPU-intensive data processing tasks have become crucial considering the complexity of the various big data applications that are used today. Reducing the CPU utilization per process, by spreading the work across several processors, is very important to improve the overall speed of applications.

This book will teach you how to perform parallel execution of computations by distributing them across multiple processors in a single machine, thus improving the overall performance of a big data processing task. We will cover synchronous and asynchronous models, shared memory and file systems, communication between various processes, synchronization, and more.

Style and Approach

This example-based, step-by-step guide will show you how to make the best of your hardware configuration, using Python to distribute applications.




Table of Contents

Distributed Computing with Python
Credits
About the Author
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. An Introduction to Parallel and Distributed Computing
Parallel computing
Distributed computing
Shared memory versus distributed memory
Amdahl's law
The mixed paradigm
Summary
2. Asynchronous Programming
Coroutines
An asynchronous example
Summary
3. Parallelism in Python
Multiple threads
Multiple processes
Multiprocess queues
Closing thoughts
Summary
4. Distributed Applications – with Celery
Establishing a multimachine environment
Installing Celery
Testing the installation
A tour of Celery
More complex Celery applications
Celery in production
Celery alternatives – Python-RQ
Celery alternatives – Pyro
Summary
5. Python in the Cloud
Cloud computing and AWS
Creating an AWS account
Creating an EC2 instance
Storing data in Amazon S3
Amazon Elastic Beanstalk
Creating a private cloud
Summary
6. Python on an HPC Cluster
Your typical HPC cluster
Job schedulers
Running a Python job using HTCondor
Running a Python job using PBS
Debugging
Summary
7. Testing and Debugging Distributed Applications
The big picture
Common problems – clocks and time
Common problems – software environments
Common problems – permissions and environments
Common problems – the availability of hardware resources
Challenges – the development environment
A useful strategy – logging everything
A useful strategy – simulating components
Summary
8. The Road Ahead
The first two chapters
The tools
The cloud and the HPC world
Debugging and monitoring
Where to go next
Index

Distributed Computing with Python

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: April 2016

Production reference: 1060416

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78588-969-1

www.packtpub.com

Credits

Author

Francesco Pierfederici

Reviewer

James King

Commissioning Editor

Veena Pagare

Acquisition Editor

Aaron Lazar

Content Development Editor

Parshva Sheth

Technical Editor

Abhishek R. Kotian

Copy Editor

Neha Vyas

Project Coordinator

Nikhil Nair

Proofreader

Safis Editing

Indexer

Rekha Nair

Graphics

Disha Haria

Production Coordinator

Melwyn Dsa

Cover Work

Melwyn Dsa

About the Author

Francesco Pierfederici is a software engineer who loves Python. He has been working in the fields of astronomy, biology, and numerical weather forecasting for the last 20 years.

He has built large distributed systems that make use of tens of thousands of cores at a time and run on some of the fastest supercomputers in the world. He has also written a lot of applications of dubious usefulness but that are great fun. Mostly, he just likes to build things.

I would like to thank my wife, Alicia, for her unreasonable patience during the gestation of this book. I would also like to thank Parshva Sheth and Aaron Lazar at Packt Publishing and the technical reviewer, James King, who were all instrumental in making this a better book.

About the Reviewer

James King is a software developer with a broad range of experience in distributed systems. He is a contributor to many open source projects including OpenStack and Mozilla Firefox. He enjoys mathematics, horsing around with his kids, games, and art.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Preface

Parallel and distributed computing is a fascinating subject that, until a few years ago, only developers in a very few large companies and national labs were privy to. Things have changed dramatically in the last decade or so, and now everybody can build small- and medium-scale distributed applications in a variety of programming languages including, of course, our favorite one: Python.

This book is a very practical guide for Python programmers who are starting to build their own distributed systems. It starts off by illustrating the bare minimum theoretical concepts needed to understand parallel and distributed computing in order to lay the basic foundations required for the rest of the (more practical) chapters.

It then looks at some first examples of parallelism using nothing more than modules from the Python standard library. The next step is to move beyond the confines of a single computer and start using more and more nodes. This is accomplished using a number of third-party libraries, including Celery and Pyro.

The remaining chapters investigate a few deployment options for our distributed applications. The cloud and classic High Performance Computing (HPC) clusters, together with their strengths and challenges, take center stage.

Finally, the thorny issues of monitoring, logging, profiling, and debugging are touched upon.

All in all, this is very much a hands-on book, teaching you how to use some of the most common frameworks and methodologies to build parallel and distributed systems in Python.

What this book covers

Chapter 1, An Introduction to Parallel and Distributed Computing, takes you through the basic theoretical foundations of parallel and distributed computing.

Chapter 2, Asynchronous Programming, describes the two main programming styles used in distributed applications: synchronous and asynchronous programming.

Chapter 3, Parallelism in Python, shows you how to do more than one thing at the same time in your Python code, using nothing more than the Python standard library.

Chapter 4, Distributed Applications – with Celery, teaches you how to build simple distributed applications using Celery and some of its competitors: Python-RQ and Pyro.

Chapter 5, Python in the Cloud, shows how you can deploy your Python applications on the cloud using Amazon Web Services.

Chapter 6, Python on an HPC Cluster, shows how to deploy your Python applications on a classic HPC cluster, typical of many universities and national labs.

Chapter 7, Testing and Debugging Distributed Applications, talks about the challenges of testing, profiling, and debugging distributed applications in Python.

Chapter 8, The Road Ahead, looks at what you have learned so far and which directions interested readers could take to push their development of distributed systems further.

What you need for this book

The following software and hardware are recommended:

  • Python 3.5 or later
  • A laptop or desktop computer running Linux or Mac OS X
  • Ideally, some extra computers or some extra virtual machines to test your distributed applications

All software mentioned in this book is free of charge and can be downloaded from the Internet, with the exception of PBS Pro, which is commercial. Most of the PBS Pro functionality, however, is available in its close sibling Torque, which is open source.

Who this book is for

This book is for developers who already know Python and want to learn how to parallelize their code and/or write distributed systems. While a Unix environment is assumed, most, if not all, of the examples should also work on Windows systems.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.
  2. Hover the mouse pointer on the SUPPORT tab at the top.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box.
  5. Select the book for which you're looking to download the code files.
  6. Choose from the drop-down menu where you purchased this book from.
  7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. An Introduction to Parallel and Distributed Computing

The first modern digital computer was invented in the late 30s and early 40s (that is, arguably, the Z1 from Konrad Zuse in 1936), probably before most of the readers of this book (let alone the author) were born. These last seventy-odd years have seen computers become faster and cheaper at an amazing rate, one that is unique across industries. Just think that today's smartphones (for example, the latest iPhones or Android phones) are faster than the fastest computer in the world from just 20 years ago. Not to mention the amazing feat of miniaturization: those supercomputers used to take up entire rooms; now they fit in our pockets.

These years have also seen, among others, two key inventions relevant to the topic at hand. One is the ability to cram more than one processor on a single motherboard (and even multiple CPU cores on a single processor). This development was crucial in allowing computations to be performed truly concurrently. As we know, processors are able to perform only one task at a time; however, as we will see later on in the chapter, they are fast enough to give the illusion of being able to run multiple tasks at the same time. To be able to perform more than one action exactly at the same time, you need access to more than one processor.

The other critical invention is high-speed computer networking. This allowed, for the first time, a potentially enormous number of computers to communicate with each other. These networked machines can either be located in the same office or building (a so-called Local Area Network (LAN)) or be spread out across different buildings, cities, or even the planet (a Wide Area Network (WAN)).

By now, most of us are familiar with multiprocessor/multicore computers; indeed, the chances are pretty high that the phone in our pocket, the tablet in our hands, or the laptop we take on the road has a handful of cores already. The graphics card, also called the Graphics Processing Unit (GPU), in these devices is more often than not massively parallel, with hundreds or even thousands of processing units. Computer networks, too, are all around us, from the most famous of them all, the Internet, to the Wi-Fi in our homes and coffee shops and the 4G mobile networks our phones use.
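
As a quick check of the hardware we already have at our disposal, Python's standard library can report the number of cores the operating system sees. The following is a minimal sketch (note that os.cpu_count() can return None on platforms where the count cannot be determined):

    import os
    import multiprocessing

    # Number of CPUs visible to the operating system (None if undetermined).
    print(os.cpu_count())

    # The multiprocessing module offers an equivalent query.
    print(multiprocessing.cpu_count())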

In the rest of this chapter, we will lay some working definitions of the topics that we will explore in the rest of the book. We will be introducing the concepts of parallel and distributed computing. We will give some examples of each that are taken from familiar topics and technologies. Some general advantages and disadvantages of each architecture and programming paradigm will be discussed as well.

Before proceeding with our definitions and a little bit of theory, let's clarify a possible source of confusion. In this and the following chapters, we will use the term processor and the term CPU core (or even simply core) interchangeably, unless otherwise specified. This is, of course, technically incorrect: a processor has one or more cores, and a computer has one or more processors, as cores do not exist in isolation. Depending on the algorithm and its performance requirements, running on multiple processors or on a single processor using multiple cores can make a bit of a difference in speed, assuming, of course, that the algorithm can be parallelized in the first place. For our intents and purposes, however, we will ignore these differences and refer to more advanced texts for further exploration of this topic.

Parallel computing

Definitions of parallel computing abound. However, for the purpose of this book, a simple definition will suffice, which is as follows:

Parallel computing is the simultaneous use of more than one processor to solve a problem.

Typically, this definition is further specialized by requiring that the processors reside on the same motherboard. This is mostly to distinguish parallel computing from distributed computing (which is discussed in the next section).

The idea of splitting work among many workers is as old as human civilization, is not restricted to the digital world, and finds an immediate and obvious application in modern computers equipped with higher and higher numbers of compute units.

There are, of course, many reasons why parallel computing might be useful and even necessary. The simplest one is performance; if we can indeed break up a long-running computation into smaller chunks and parcel them out to different processors, then we can do more work in the same amount of time.

Other times, and just as often, parallel computing techniques are used to present users with responsive interfaces while the system is busy with some other task. Remember that one processor executes just one task at a time. Applications with GUIs need to offload work to a separate thread of execution running on another processor so that one processor is free to update the GUI and respond to user inputs.

The following figure illustrates this common architecture, where the main thread is processing user and system inputs using what is called an event loop. Tasks that require a long time to execute and those that would otherwise block the GUI are offloaded to a background or worker thread:

A simple real-world example of this parallel architecture could be a photo organization application. When we connect a digital camera or a smartphone to our computers, the photo application needs to perform a number of actions; all the while its user interface needs to stay interactive. For instance, our application needs to copy images from the device to the internal disk, create thumbnails, extract metadata (for example, date and time of the shot), index the images, and finally update the image gallery. While all of this happens, we are still able to browse images that are already imported, open them, edit them, and so on.

Of course, all these actions could very well be performed sequentially on a single processor—the same processor that is handling the GUI. The drawback would be a sluggish interface and an extremely slow overall application. Performing these steps in parallel keeps the application snappy and its users happy.
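
A stripped-down sketch of this pattern, using nothing but the standard library, might look like the following. Here, import_photos is just a stand-in for the real import work, and a simple loop stands in for the GUI event loop:

    import threading
    import time

    def import_photos():
        # Stand-in for the slow work: copying images, creating
        # thumbnails, extracting metadata, indexing, and so on.
        time.sleep(3)
        print("import finished")

    # Offload the long-running task to a background (worker) thread...
    worker = threading.Thread(target=import_photos)
    worker.start()

    # ...while the main thread stays free to service the event loop.
    while worker.is_alive():
        print("main thread: still responsive")
        time.sleep(1)
    worker.join()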

The astute reader might jump up at this point and rightfully point out that older computers, with a single processor and a single core, could already perform multiple things at the same time (by way of multitasking). What happened back then (and even today, when we launch more tasks than there are processors and cores on our computers) was that the one running task gave up the CPU (either voluntarily or forcibly by the OS, for example, in response to an IO event) so that another task could run in its place. These interrupts would happen over and over again, with various tasks acquiring and giving up the CPU many times over the course of the application's life. In those cases, users had the impression of multiple tasks running concurrently, as the switches were extremely fast. In reality, however, only one task was running at any given time.

The typical tools used in parallel applications are threads. On systems such as Python (as we will see in Chapter 3, Parallelism in Python), where threads have significant limitations, programmers resort to launching (oftentimes by means of forking) subprocesses instead. These subprocesses replace (or complement) threads and run alongside the main application process.

The first technique is called multithreaded programming. The second is called multiprocessing. It is worth noting that multiprocessing should not be seen as inferior or as a workaround with respect to using multiple threads.

There are many situations where multiprocessing is preferable to multiple threads. Interestingly, even though they both run on a single computer, a multithreaded application is an example of shared-memory architecture, whereas a multiprocess application is an example of distributed memory architecture (refer to the following section to know more).
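
This distinction is easy to demonstrate with a toy sketch: a thread and a child process each increment a module-level counter. The thread's update is visible to the parent, since threads share memory; the process works on its own copy of the data, leaving the parent's value untouched:

    import threading
    import multiprocessing

    counter = 0

    def increment():
        global counter
        counter += 1

    if __name__ == '__main__':
        # Threads share the parent's address space: the update is visible.
        t = threading.Thread(target=increment)
        t.start()
        t.join()
        print(counter)  # prints 1

        # A child process gets its own memory: the parent's counter
        # is left untouched.
        p = multiprocessing.Process(target=increment)
        p.start()
        p.join()
        print(counter)  # still prints 1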

Distributed computing

For the remainder of this book, we will adopt the following working definition of distributed computing:

Distributed computing is the simultaneous use of more than one computer to solve a problem.

As in the case of parallel computing, this definition is oftentimes further restricted. The restriction usually is the requirement that these computers appear to their users as a single machine, therefore hiding the distributed nature of the application. In this book, we will be happy with the more general definition.

Distributing computation across multiple computers is again a pretty obvious strategy when using systems that are able to speak to each other over the (local or otherwise) network. In many respects, in fact, this is just a generalization of the concepts of parallel computing that we saw in the previous section.

Reasons to build distributed systems abound. Oftentimes, the reason is the ability to tackle a problem so big that no individual computer could handle it at all, or at least, not in a reasonable amount of time. An interesting example from a field that is probably familiar to most of us is the rendering of 3D animation movies, such as those from Pixar and DreamWorks.

Given the sheer number of frames to render for a full-length feature (at 30 frames per second, a two-hour movie amounts to 30 × 7,200 = 216,000 frames!), movie studios need to spread the full rendering job across large numbers of computers (computer farms, as they are called).

Other times, the very nature of the application being developed requires a distributed system. This is the case, for instance, for instant messaging and video conferencing applications. For these pieces of software, performance is not the main driver. It is just that the problem that the application solves is itself distributed.

In the following figure, we see a very common web application architecture (another example of a distributed application), where multiple users connect to the website over the network. At the same time, the application itself communicates with systems (such as a database server) running on different machines in its LAN:

Another interesting example of distributed systems, which might be a bit counterintuitive, is the CPU-GPU combination. These days, graphics cards are very sophisticated computers in their own right. They are highly parallel and offer compelling performance for a large number of compute-intensive problems, not just for displaying images on screen. Tools and libraries exist to allow programmers to make use of GPUs for general-purpose computing (for example, CUDA from NVIDIA, OpenCL, and OpenACC, among others).

However, the system composed of the CPU and GPU is really an example of a distributed system, where the network is replaced by the PCI bus. Any application exploiting both the CPU and the GPU needs to take care of data movement between the two subsystems, just like a more traditional application running across the network!

It is worth noting that, in general, adapting the existing code to run across computers on a network (or on the GPU) is far from a simple exercise. In these cases, I find it quite helpful to go through the intermediate step of using multiple processes on a single computer first (refer to the discussion in the previous section). Python, as we will see in Chapter 3, Parallelism in Python, has powerful facilities for doing just that (refer to the concurrent.futures module).
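
As a small taste of what is to come, the following minimal sketch uses concurrent.futures to farm work out to a pool of processes; slow_square here is just a placeholder for a real CPU-bound computation:

    from concurrent.futures import ProcessPoolExecutor

    def slow_square(n):
        # Placeholder for a CPU-intensive computation.
        return n * n

    if __name__ == '__main__':
        # The executor transparently distributes calls across a pool
        # of worker processes (one per core by default).
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(slow_square, range(10)))
        print(results)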

Once I evolve my application so that it uses multiple processes to perform operations in parallel, I start thinking about how to turn these processes into separate applications altogether, which are no longer part of my monolithic core.

Special attention must be given to the data—where to store it and how to access it. In simple cases, a shared filesystem (for example, NFS on Unix systems) is enough; other times, a database and/or a message bus is needed. We will see some concrete examples from Chapter 4, Distributed Applications – with Celery, onwards. It is important to remember that, more often than not, data, rather than CPU, is the real bottleneck.

Shared memory versus distributed memory

Conceptually, parallel computing and distributed computing look very similar—after all, they both are about