Master the robust features of R parallel programming to accelerate your data science computations
This book is for R programmers who want to step beyond R's inherent single-threaded and memory-bound limitations and learn how to implement the highly accelerated, scalable algorithms that performant processing of Big Data demands. No previous knowledge of parallelism is required. The book also caters to the more advanced technical programmer seeking to go beyond high-level parallel frameworks.
R is one of the most popular programming languages used in data science. Applying R to big data and complex analytic tasks requires the harnessing of scalable compute resources.
Mastering Parallel Programming with R presents a comprehensive and practical treatise on how to build highly scalable and efficient algorithms in R. It will teach you a variety of parallelization techniques, from simple use of the parallelized versions of lapply() in R's built-in parallel package to high-level AWS cloud-based Hadoop and Apache Spark frameworks. It will also teach you low-level, scalable parallel programming with Rmpi and pbdMPI for message passing, applicable to clusters and supercomputers, and how to exploit the thousands of simple processing elements in a GPU through ROpenCL. By the end of the book, you will understand the factors that influence parallel efficiency, including assessing code performance and implementing load balancing; pitfalls to avoid, including deadlock and numerical instability issues; how to structure your code and data for the most appropriate type of parallelism for your problem domain; and how to extract the maximum performance from your R code running on a variety of computer systems.
This book leads you chapter by chapter from the easier to the more complex forms of parallelism. The author's insights are presented through clear practical examples applied to a range of different problems, with comprehensive reference information for each of the R packages employed. The book can be read from start to finish, or dipped into chapter by chapter: each chapter describes a specific parallel approach and technology, so it can be read as a standalone.
Page count: 324
Year of publication: 2016
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: May 2016
Production reference: 1240516
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-400-4
www.packtpub.com
Authors
Simon R. Chapple
Eilidh Troup
Thorsten Forster
Terence Sloan
Reviewers
Steven Paul Sanderson II
Joseph McKavanagh
Willem Ligtenberg
Commissioning Editor
Kunal Parikh
Acquisition Editor
Subho Gupta
Content Development Editor
Siddhesh Salvi
Technical Editor
Kunal Chaudhari
Copy Editor
Shruti Iyer
Project Coordinator
Nidhi Joshi
Proofreader
Safis Editing
Indexer
Mariammal Chettiyar
Graphics
Abhinash Sahu
Production Coordinator
Melwyn Dsa
Cover Work
Melwyn Dsa
Simon R. Chapple is a highly experienced solution architect and lead software engineer with more than 25 years of experience developing innovative solutions and applications in data analysis and healthcare informatics. He is also an expert in supercomputing (HPC) and big data processing.
Simon is the chief technology officer and a managing partner of Datalytics Technology Ltd, where he leads a team building the next generation of a large-scale data analysis platform. Based on a customizable set of high-performance tools, frameworks, and systems, the platform enables the entire life cycle of data processing for real-time analytics, from capture through analysis to presentation, to be encapsulated for easy deployment into any existing operational IT environment.
Previously, he was director of Product Innovation at Aridhia Informatics, where he built a number of novel systems for healthcare providers in Scotland, including a unified patient pathway tracking system that utilized ten separate data system integrations for both 18-week Referral to Treatment and cancer patient management (enabling the provider to deliver the best performance on patient waiting times in Scotland). He also built a unique real-time, mobile-based, public cloud-hosted chemotherapy patient monitoring system undergoing clinical trial in Australia, which has been highly praised by nurses and patients: "it's like having a nurse in your living room… hopefully all chemo patients will one day know the security and comfort of having an around-the-clock angel of their own."
Simon is also a coauthor of the ROpenCL open source package—enabling statistics programs written in R to exploit the parallel computation within graphics accelerator chips.
I would particularly like to thank my fellow authors at Edinburgh Parallel Computing Centre for the SPRINT chapter, and the book reviewers, Willem Ligtenberg, Joe McKavanagh, and Steven Sanderson, for their diligent feedback in the preparation of this book. I would also like to thank the editorial team at Packt for their unending patience in getting this book over the finish line, and my wife and son for their understanding in allowing me to steal precious time away from them to be an author – it is to my loved ones, Heather and Adam, that I dedicate this book.
Eilidh Troup is an Applications Consultant employed by EPCC at the University of Edinburgh. She has a degree in Genetics from the University of Glasgow and she now focuses on making high-performance computing accessible to a wider range of users, in particular biologists. Eilidh works on a variety of software projects, including the Simple Parallel R INTerface (SPRINT) and the SEEK for Science web-based data repository.
Thorsten Forster is a data science researcher at the University of Edinburgh. With a background in statistics and computer science, he obtained a PhD in biomedical sciences and has over 10 years of experience in this interdisciplinary research.
Conducting research on data analysis approaches to biomedical big data (such as microarrays and next-generation sequencing) rooted in statistics and machine learning, Thorsten has been a project manager on the SPRINT project, which is targeted at allowing lay users to make use of parallelized analysis solutions for large biological datasets within the R statistical programming language. He is also a co-founder of Fios Genomics Ltd, a university spin-out company providing data-analytical services for biomedical big data research.
Thorsten's current work includes devising a gene transcription classifier for the diagnosis of bacterial infections in newborn babies, transcriptional profiling of interferon gamma activation of macrophages, investigating the role of cholesterol in immune responses to infections, and investigating the genomic factors that cause childhood wheezing to progress to asthma.
Thorsten's complete profile is available at http://tinyurl.com/ThorstenForster-UEDIN.
Terence Sloan is a software development group manager at EPCC, the High Performance Computing Centre at the University of Edinburgh. He has more than 25 years of experience in managing and participating in data science and HPC projects with Scottish SMEs, UK corporations, and European and global collaborations.
Terry was the co-principal investigator on the Wellcome Trust (Award no. 086696/Z/08/Z), BBSRC (Award no. BB/J019283/1), and three EPSRC distributed computational science awards that have helped develop the SPRINT package for R. He has also held awards from the ESRC (Award nos. RES-189-25-0066 and RES-149-25-0005) that investigated the use of operational big data for customer behavior analysis.
Terry is a coordinator for the Data Analytics with HPC, Project Preparation, and Dissertation courses on the University of Edinburgh's MSc programme in HPC with Data Science.
He also plays the drums.
I would like to thank Dr. Alan Simpson, EPCC's technical director and the computational science and engineering director for the ARCHER supercomputer, for supporting the development of SPRINT and its use on UK's national supercomputers.
Steven Paul Sanderson II is currently in the last year of his MPH (Master of Public Health) program at Stony Brook University School of Medicine's Graduate Program in Public Health. He has a decade of experience working in an acute care hospital setting. Steven is an active user of the StackExchange sites, and his aim is to teach himself several topics, including SQL, R, VB, and Python.
He is currently employed as a decision support analyst III, supporting both financial and clinical programs.
He has had the privilege of working on other titles from Packt Publishing, including Gephi Cookbook by Devangana Khokhar, and Network Graph Analysis and Visualization with Gephi and Mastering Gephi Network Visualization, both by Ken Cherven. He also coauthored a book with former professor Phillip Baldwin, called The Pleistocene Re-Wilding of Johnny Paycheck, which is available as a self-published book at http://www.lulu.com/shop/phillip-baldwin/the-pleistocene-re-wilding-of-johnny-paycheck/paperback/product-21204148.html.
I would like to thank my parents for always pushing me to try new things and continue learning. I'd like to thank my wife for being my support system. I would also like to thank Nidhi Joshi at Packt Publishing for continuing to keep me involved in the learning process by keeping me in the review process of new and interesting books.
Willem Ligtenberg first started using R at Eindhoven University of Technology for his master's thesis in biomedical engineering. At this time, he used R from Python through Rpy. Although not a true computer scientist, Willem found himself attracted to distributed computing (the bioinformatics field often requires this) by first using a computer cluster of the Computational Biology group. Reading interesting articles on GPGPU computing, he convinced his professor to buy a high-end graphics card for initial experimentation.
Willem currently works as a bioinformatics/statistics consultant at Open Analytics and has a passion for speed enhancement through either Rcpp or OpenCL. He developed the ROpenCL package, which he first presented at UseR! 2011 and which is used later in this book. Willem also teaches parallel computing in R (using both the GPU and CPU). Another interest of his is how to optimally use databases in workflows, and from this followed another R package (Rango), which he presented at UseR! 2015. Rango allows R users to interact with databases using S4 objects and abstracts away the differences between various database backends, allowing users to focus on what they want to achieve.
Joseph McKavanagh is a divisional CTO in Kainos and is responsible for technology strategy and leadership. He works with customers in the public and private sectors to deliver and support high-impact digital transformation and managed cloud and big data solutions. Joseph has delivered Digital Transformation projects for central and regional UK governments and spent 18 months as a transformation architect in Government Digital Service, helping to deliver the GDS Exemplar programme. He has an LLB degree in law and accountancy and a master's degree in computer science and applications, both from Queen's University, Belfast.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <customercare@packtpub.com> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
We are in the midst of an information explosion. Everything in our lives is becoming instrumented and connected in real-time with the Internet of Things, from our own biology to the world's environment. By some measures, it is projected that by 2020, world data will have grown by more than a factor of 10 from today to a staggering 44 Zettabytes—just one Zettabyte is the equivalent of 250 billion DVDs. In order to process this volume and velocity of big data, we need to harness a vast amount of compute, memory, and disk resources, and to do this, we need parallelism.
Despite its age, R, the open source statistical programming language, continues to grow in popularity as one of the key cornerstone technologies for analyzing data, and it is used by an ever-expanding community of, dare I use the currently in-vogue designation, "data scientists".
There are of course many other tools that a data scientist may deploy in taming the beast of big data. You may also be a Python, SAS, SPSS, or MATLAB guru. However, R, with its long open source heritage since 1997, remains pervasive, and with the extraordinarily wide variety of additional CRAN-hosted plug-in library packages that were developed over the intervening 20 years, it is highly capable of almost all forms of data analysis, from small numeric matrices to very large symbolic datasets, such as bio-molecular DNA. Indeed, I am tempted to go as far as to suggest that R is becoming the de facto data science scripting language, which is capable of orchestrating highly complex analytics pipelines that involve many different types of data.
R, in itself, has always been a single-threaded implementation, and it is not designed to exploit parallelism within its own language primitives. Instead, it relies on specifically implemented external package libraries to achieve this for certain accelerated functions and to enable the use of parallel processing frameworks. We will focus on a select number of these that represent the best implementations that are available today to develop parallel algorithms across a range of technologies.
In this book, we will cover many different aspects of parallelism, from Single Program Multiple Data (SPMD) to Single Instruction Multiple Data (SIMD) vector processing, including utilizing R's built-in multicore capabilities with its parallel package, message passing using the Message Passing Interface (MPI) standard, and General Purpose GPU (GPGPU)-based parallelism with OpenCL. We will also explore different framework approaches to parallelism, from load balancing through task farming to spatial processing with grids. We will touch on more general purpose batch-data processing in the cloud with Hadoop and (as a bonus) the hot new tech in cluster computing, Apache Spark, which is much better suited to real-time data processing at scale.
We will even explore how to use a real, bona fide, multi-million-pound supercomputer. Yes, I know that you may not own one of these, but in this book, we'll show you what it's like to use one and how much performance parallelism can achieve. Who knows, with your newfound knowledge, maybe you can rock up at your local supercomputer center and convince them to let you spin up some massively parallel computing!
All of the coding examples presented in this book are original work and have been chosen partly so as not to duplicate the kind of example you might otherwise encounter in other books of this nature. They have also been chosen to, hopefully, engage you, dear reader, with something a little different from the run-of-the-mill. We, the authors, very much hope you enjoy the journey that you are about to undertake through Mastering Parallel Programming with R.
Chapter 1, Simple Parallelism with R, starts our journey by quickly showing you how to exploit the multicore processing capability of your own laptop using core R's parallelized versions of lapply(). We also briefly reach out and touch the immense computing capacity of the cloud through Amazon Web Services.
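To give a flavor of what follows, here is a minimal, illustrative sketch of the core idea in Chapter 1: swapping a serial lapply() for the multicore mclapply() from R's built-in parallel package. It is not taken from the book's own example code; the toy workload and core count are assumptions purely for demonstration.

# A minimal sketch (not from the book's examples): serial lapply() versus
# the multicore mclapply() from the built-in parallel package.
library(parallel)

slow_square <- function(x) {
  Sys.sleep(0.5)   # simulate an expensive computation
  x * x
}

inputs <- 1:8

# Serial: each call runs one after another on a single core
serial_result <- lapply(inputs, slow_square)

# Parallel: the work is forked across the available cores
# (mclapply() relies on fork(), so on Windows use mc.cores = 1
#  or the cluster-based parLapply() instead)
parallel_result <- mclapply(inputs, slow_square, mc.cores = detectCores())

identical(serial_result, parallel_result)   # TRUE: same results, less wall time

Run interactively, the parallel version completes in roughly the serial time divided by the number of cores, which is the essence of the speed-ups explored throughout the book.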
Chapter 2, Introduction to Message Passing, covers the standard Message Passing Interface (MPI), which is a key technology that implements advanced parallel algorithms. In this chapter, you will learn how to use two different R MPI packages, Rmpi and pbdMPI, together with the OpenMPI implementation of the underlying communications subsystem.
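As a small taste of the SPMD style used in Chapter 2, here is a minimal, illustrative pbdMPI "hello world". It is not one of the book's worked examples, and it assumes pbdMPI has been installed on top of a working MPI implementation such as OpenMPI.

# hello_mpi.R -- a minimal illustrative pbdMPI sketch (SPMD style).
# Assumes pbdMPI is installed against a working MPI, e.g. OpenMPI.
# Launch with:  mpiexec -np 4 Rscript hello_mpi.R
library(pbdMPI)

init()                                # start the MPI environment
me    <- comm.rank()                  # this process's rank (0-based)
world <- comm.size()                  # total number of MPI processes

comm.cat("Hello from rank", me, "of", world, "\n", all.rank = TRUE)

finalize()                            # shut MPI down cleanly

Every process runs the same script (Single Program), each operating on its own rank-dependent state (Multiple Data), which is the pattern Chapters 2 and 3 build on.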
Chapter 3, Advanced Message Passing, will complete our tour of MPI by developing a detailed Rmpi worked example, illustrating the use of nonblocking communications and localized patterns of interprocess message exchange, which are required to implement spatial grid parallelism.
Chapter 4, Developing SPRINT, an MPI-based R Package for Supercomputers, introduces you to the experience of running parallel code on a real supercomputer. This chapter also provides a detailed exposition of developing SPRINT, an R package written in C for parallel computation that can run on laptops, as well as supercomputers. We'll also show you how you can extend this package with your own natively-coded high performance parallel algorithms and make them accessible to R.
Chapter 5, The Supercomputer in Your Laptop, will show how to unlock the massive parallel and vector processing capability of the Graphics Processing Unit (GPU) inside your very own laptop direct from R using the ROpenCL package, an R wrapper for the Open Computing Language (OpenCL).
Chapter 6, The Art of Parallel Programming, concludes this book by providing the basic science behind parallel programming and its performance, the art of best practice by highlighting a number of potential pitfalls you'll want to avoid, and taking a glimpse into the future of parallel computing systems.
Online Chapter, Apache Spa-R-k, is an introduction to Apache Spark, which now succeeds Hadoop as the most popular distributed-memory big data parallel computing environment. You will learn how to set up and install a Spark cluster and how to utilize Spark's own DataFrame abstraction directly from R. This chapter can be downloaded from Packt's website at https://www.packtpub.com/sites/default/files/downloads/B03974_BonusChapter.pdf.
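For the curious, the following is a minimal, illustrative sketch of driving Spark's DataFrame abstraction from R using the SparkR package bundled with Apache Spark 2.x; the online chapter may use a different Spark version or initialization API, and the local master URL and built-in faithful dataset here are assumptions for demonstration only.

# A minimal illustrative SparkR sketch (assumes Spark 2.x; the bundled
# SparkR package typically lives under $SPARK_HOME/R/lib).
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

# Start a local Spark session using all available cores
sparkR.session(master = "local[*]", appName = "spark-from-r-demo")

# Convert an ordinary R data frame into a distributed Spark DataFrame
df <- as.DataFrame(faithful)

# Filter on the Spark side, then pull the (small) result back into R
long_eruptions <- filter(df, df$eruptions > 3)
count(long_eruptions)            # row count, computed by Spark
head(collect(long_eruptions))    # collect() returns a regular R data frame

sparkR.session.stop()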
You don't need to read this book in order from beginning to end, although you will find this easiest with respect to the introduction of concepts and the increasing technical depth of the programming knowledge applied. For the most part, each chapter has been written to be understandable when read on its own.
To run the code in this book, you will require a multicore modern specification laptop or desktop computer. You will also require a decent bandwidth Internet connection to download R and the various R code libraries from CRAN, the main online repository for R packages.
The examples in this book have largely been developed using RStudio version 0.98.1062, with the 64-bit R version 3.1.0 (CRAN distribution), running on a mid-2014 Apple MacBook Pro with OS X 10.9.4, a 2.6 GHz Intel Core i5 processor, and 16 GB of memory. However, all of these examples should also work with the latest version of R.
Some of the examples in this book will not run on Microsoft Windows, but they should run without problem on variants of Linux. Each chapter will detail any additional external libraries or runtime system requirements and provide you with information on how to access and install them. This book's errata section will highlight any issues discovered after publication.
This book is for the intermediate to advanced-level R developer who wants to understand how to harness the power of parallel computing to perform long running computations and analyze large quantities of data. You will require a reasonable knowledge and understanding of R programming. You should be a sufficiently capable programmer so that you can read and understand lower-level languages, such as C/C++, and be familiar with the process of code compilation. You may consider yourself to be the new breed of data scientist—a skilled programmer as well as a mathematician.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "You'll note the use of mpi.cart.create(), which constructs a Cartesian rank/grid mapping from a group of existing MPI processes."
A block of code is set as follows:
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
Any command-line input or output is written as follows:
New terms and important words are shown in bold.
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of WinRAR / 7-Zip for Windows, Zipeg / iZip / UnRarX for Mac, or 7-Zip / PeaZip for Linux.
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/repository-name. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/MasteringParallelProgrammingwithR_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
