An essential guide to healthcare data problems, sources, and solutions
Strategies in Biomedical Data Science provides medical professionals with much-needed guidance toward managing the increasing deluge of healthcare data. Beginning with a look at our current top-down methodologies, this book demonstrates the ways in which both technological development and more effective use of current resources can better serve both patient and payer. The discussion explores the aggregation of disparate data sources, current analytics and toolsets, the growing necessity of smart bioinformatics, and more as data science and biomedical science grow increasingly intertwined. You'll dig into the unknown challenges that come along with every advance, and explore the ways in which healthcare data management and technology will inform medicine, politics, and research in the not-so-distant future. Real-world use cases and clear examples are featured throughout, and coverage of data sources, problems, and potential mitigations provides necessary insight for forward-looking healthcare professionals.
Big Data has been a topic of discussion for some time, with much attention focused on problems and management issues surrounding truly staggering amounts of data. This book offers a lifeline through the tsunami of healthcare data, to help the medical community turn their data management problem into a solution.
The sheer amount of healthcare data being generated will only increase as both biomedical research and clinical practice trend toward individualized, patient-specific care. Strategies in Biomedical Data Science provides expert insight into the kind of robust data management that is becoming increasingly critical as healthcare evolves.
Page count: 700
Publication year: 2016
“The allure of data analytics is in knowing what is currently unknowable by identifying patterns in apparent chaos. If these insights could be applied in the healthcare field to individualized patient care, the improvement in outcomes could be profound indeed. This type of research and innovation right here in Tempe (ASU) demonstrates why ASU is ranked as the number one most innovative university in the nation.
“Industry analysts expect there will be three to four connected Internet of Things (IoT) devices for every person on the planet by 2020. Healthcare can and is leading the way in IoT adoption. To prepare for the coming deluge of IoT data, healthcare IT organizations should be investing in data analytics capability to convert that raw data flood into actionable information that delivers better healthcare outcomes.”
—Steve Phillips
Senior Vice President and Chief Information Officer, Avnet, Inc.
Twitter: @Avnet
“I think it is really great that Jay Etchings is working on this; the dearth of information for dealing with large, complex biomedical data sets makes building systems capable of supporting precision medicine very challenging. I would say that we are not yet at the “blueprint” stage, but we certainly can use help in getting the right people thinking about this, so we can build the recipes going forward. While true clinical application at scale is still not here, we are rapidly approaching that event horizon, and as we have learned in biomedical research, the infrastructure challenges alone require careful planning and very deliberate applications of the proper technologies to deal with the vast amount of data that is generated. The algorithms to automate things such as true clinical decision guidance have yet to be written, and although some approaches such as neuro-linguistic programming or machine learning look promising, actually creating a “doc in a box” is probably many years off. This does not mean we should not be striving to move forward as rapidly as possible, because the impact that can be had on a patient’s life is truly inspirational and that should always be remembered. This is not building systems to showcase technology or how smart we are, it is to help propel a truly world changing methodology of how medicine is practiced.”
—James Lowey, CIO
TGen, The Translational Genomics Research Institute
Twitter: @loweyj, @Tgen
“The journey to precision medicine will require the confluence and analysis of enormous amounts of data from genomics, clinical and fundamental research, clinical care, and environmental and lifestyle data, including connected health data from the “Internet of Medical Things.” The entire healthcare ecosystem needs to work together, along with the information and communications technology ecosystems, to collect, transport, analyze, and leverage the vast amount of data that can be honed to develop insights and recommendations for precision medicine. The opportunity to improve healthcare is compelling, the data is vast and will continue to grow, and we need to work together to realize improved outcomes. We need to build the technology and process-enabled capabilities to protect the data and the people. The need for increased TIPPSS—trust, identity, privacy, protection, safety, and security—mechanisms is critical to the success and safety in our ongoing healthcare journey.”
—Florence D. Hudson
Senior Vice President and Chief Innovation Officer, Internet2
Twitter: @FloInternet2
“In the last decade, the wave of data coming off modern sequencing instruments is transforming bioscience into a digital science. Not only are the data sets enormous, the need to work through them quickly to have a real-time impact on therapy is crucial, requiring all of the elements of high-performance computing: fast compute, storage and networking, sophisticated data management, and highly parallel application codes.
The ability to quickly crunch massive amounts of disease and patient data is at the heart of precision medicine. While much of the promise of precision medicine is still on the horizon, advances have already led to life-saving treatments for children and adults with lethal cancers and genetic diseases. At the Center for Pediatric Genomic Medicine (CPGM) at Children’s Mercy Hospital in Kansas City, MO, researchers used 25 hours of supercomputer time to decode the genetic variants of an infant suffering from liver failure. Thanks to the fast genomic diagnosis, doctors were able to proceed with the most effective treatment and the baby is alive and well.”
—Tiffany Trader
Managing Editor, HPCwire
The Wiley & SAS Business Series presents books that help senior-level managers with their critical management decisions.
Titles in the Wiley & SAS Business Series include:
Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications
by Bart Baesens
Bank Fraud: Using Technology to Combat Losses
by Revathi Subramanian
Big Data Analytics: Turning Big Data into Big Money
by Frank Ohlhorst
Big Data, Big Innovation: Enabling Competitive Differentiation through Business Analytics
by Evan Stubbs
Business Analytics for Customer Intelligence
by Gert Laursen
Business Intelligence Applied: Implementing an Effective Information and Communications Technology Infrastructure
by Michael Gendron
Business Intelligence and the Cloud: Strategic Implementation Guide
by Michael S. Gendron
Business Transformation: A Roadmap for Maximizing Organizational Insights
by Aiman Zeid
Connecting Organizational Silos: Taking Knowledge Flow Management to the Next Level with Social Media
by Frank Leistner
Data-Driven Healthcare: How Analytics and BI Are Transforming the Industry
by Laura Madsen
Delivering Business Analytics: Practical Guidelines for Best Practice
by Evan Stubbs
Demand-Driven Forecasting: A Structured Approach to Forecasting, Second Edition
by Charles Chase
Demand-Driven Inventory Optimization and Replenishment: Creating a More Efficient Supply Chain
by Robert A. Davis
Developing Human Capital: Using Analytics to Plan and Optimize Your Learning and Development Investments
by Gene Pease, Barbara Beresford, and Lew Walker
Economic and Business Forecasting: Analyzing and Interpreting Econometric Results
by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah Watt, and Sam Bullard
Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fundamental Concepts and Practical Applications
by Robert Rowan
Harness Oil and Gas Big Data with Analytics: Optimize Exploration and Production with Data Driven Models
by Keith Holdaway
Health Analytics: Gaining the Insights to Transform Health Care
by Jason Burke
Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World
by Carlos Andre Reis Pinheiro and Fiona McNeill
Human Capital Analytics: How to Harness the Potential of Your Organization’s Greatest Asset
by Gene Pease, Boyce Byerly, and Jac Fitz-enz
Implement, Improve and Expand Your Statewide Longitudinal Data System: Creating a Culture of Data in Education
by Jamie McQuiggan and Armistead Sapp
Killer Analytics: Top 20 Metrics Missing from Your Balance Sheet
by Mark Brown
Predictive Analytics for Human Resources
by Jac Fitz-enz and John Mattox II
Predictive Business Analytics: Forward-Looking Capabilities to Improve Business Performance
by Lawrence Maisel and Gary Cokins
Retail Analytics: The Secret Weapon
by Emmett Cox
Social Network Analysis in Telecommunications
by Carlos Andre Reis Pinheiro
Statistical Thinking: Improving Business Performance, Second Edition
by Roger W. Hoerl and Ronald D. Snee
Style and Statistics: The Art of Retail Analytics
by Brittany Bullard
Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics
by Bill Franks
The Analytic Hospitality Executive: Implementing Data Analytics in Hotels and Casinos
by Kelly A. McGuire
The Executive’s Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business
by David Thomas and Mike Barlow
The Value of Business Analytics: Identifying the Path to Profitability
by Evan Stubbs
The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions
by Phil Simon
Too Big to Ignore: The Business Case for Big Data
by Phil Simon
Using Big Data Analytics: Turning Big Data into Big Money
by Jared Dean
Win with Advanced Business Analytics: Creating Business Value from Your Data
by Jean Paul Isson and Jesse Harriott
For more information on any of the above titles, please visit www.wiley.com.
Jay Etchings
Cover image: DNA strand © Don Bishop/Getty Images, Inc. Cover design: Wiley
Copyright © 2017 by SAS Institute, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Names: Etchings, Jay, 1966– author. | SAS Institute, issuing body.
Title: Strategies in biomedical data science : driving force for innovation / Jay Etchings.
Other titles: Wiley and SAS business series.
Description: Hoboken, New Jersey : John Wiley & Sons, Inc., [2017] | Series:
Wiley & SAS business series | Includes bibliographical references and index.
Identifiers: LCCN 2016036794 (print) | LCCN 2016037346 (ebook) | ISBN 978-1-119-23219-3 (hardcover) | ISBN 978-1-119-25597-0 (ePub) | ISBN 978-1-119-25618-2 (ePDF)
Subjects: | MeSH: Medical Informatics | Computational Biology—methods | Cybernetics—methods
Classification: LCC R859.7.A78 (print) | LCC R859.7.A78 (ebook) | NLM W 26.5 | DDC 610.285—dc23
LC record available at https://lccn.loc.gov/2016036794
Foreword
Acknowledgments
Introduction
WHO SHOULD READ THIS BOOK?
WHAT’S IN THIS BOOK?
HOW TO CONTACT US
Chapter 1 Healthcare, History, and Heartbreak
TOP ISSUES IN HEALTHCARE
DATA MANAGEMENT
BIOSIMILARS, DRUG PRICING, AND PHARMACEUTICAL COMPOUNDING
PROMISING AREAS OF INNOVATION
CONCLUSION
NOTES
Chapter 2 Genome Sequencing
CHALLENGES OF GENOMIC ANALYSIS
THE LANGUAGE OF LIFE
A BRIEF HISTORY OF DNA SEQUENCING
DNA SEQUENCING AND THE HUMAN GENOME PROJECT
SELECT TOOLS FOR GENOMIC ANALYSIS
The R Project
Genome Analysis Toolkit
Molecular Evolutionary Genetics Analysis
Bowtie
CONCLUSION
Notes
Note
Chapter 3 Data Management
BITS ABOUT DATA
DATA TYPES
DATA SECURITY AND COMPLIANCE
DATA STORAGE
SWIFTSTACK
CONCLUSION
NOTES
Note
Chapter 4 Designing a Data-Ready Network Infrastructure
RESEARCH NETWORKS: A PRIMER
ESNET AT 30: EVOLVING TOWARD EXASCALE AND RAISING EXPECTATIONS
INTERNET2 INNOVATION PLATFORM
ADVANCES IN NETWORKING
INFINIBAND AND MICROSECOND LATENCY
THE FUTURE OF HIGH-PERFORMANCE FABRICS
NETWORK FUNCTION VIRTUALIZATION
SOFTWARE-DEFINED NETWORKING
OPENDAYLIGHT
CONCLUSION
NOTES
Chapter 5 Data-Intensive Compute Infrastructures
BIG DATA APPLICATIONS IN HEALTH INFORMATICS
SOURCES OF BIG DATA IN HEALTH INFORMATICS
INFRASTRUCTURE FOR BIG DATA ANALYTICS
FUNDAMENTAL SYSTEM PROPERTIES
GPU-ACCELERATED COMPUTING AND BIOMEDICAL INFORMATICS
CONCLUSION
NOTES
INTRODUCTION
EVIS
SCIENTIFIC COMPUTING
VALIDATION
MEDICAL DEVICE DEVELOPMENT
CONCLUSION
Note
Chapter 6 Cloud Computing and Emerging Architectures
CLOUD BASICS
CHALLENGES FACING CLOUD COMPUTING APPLICATIONS IN BIOMEDICINE
HYBRID CAMPUS CLOUDS
RESEARCH AS A SERVICE
FEDERATED ACCESS WEB PORTALS
CLUSTER HOMOGENEITY
EMERGING ARCHITECTURES (ZETA ARCHITECTURE)
CONCLUSION
NOTES
Chapter 7 Data Science
NOSQL APPROACHES TO BIOMEDICAL DATA SCIENCE
USING SPLUNK FOR DATA ANALYTICS
STATISTICAL ANALYSIS OF GENOMIC DATA WITH HADOOP
EXTRACTING AND TRANSFORMING GENOMIC DATA
PROCESSING EQTL DATA
GENERATING MASTER SNP FILES FOR CASES AND CONTROLS
GENERATING GENE EXPRESSION FILES FOR CASES AND CONTROLS
CLEANING RAW DATA USING MAPREDUCE
TRANSPOSE DATA USING PYTHON
STATISTICAL ANALYSIS USING SPARK
HIVE TABLES WITH PARTITIONS
CONCLUSION
NOTES
Appendix: A Brief Statistics Primer
Content Contributed by Daniel Peñaherrera, July 13, 2016
FOUNDATIONS
POPULATION AND SAMPLE
RANDOM VARIABLES
EXPECTED VALUE AND VARIANCE
REGRESSION ANALYSIS
MULTIVARIATE LINEAR REGRESSION
LOGISTIC REGRESSION
Chapter 8 Next-Generation Cyberinfrastructures
Next-Generation Cyber Capability
NGCC DESIGN AND INFRASTRUCTURE
Conclusion
NOTE
Conclusion
Appendix A The Research Data Management Survey
Appendix B Central IT and Research Support
INSTITUTIONAL DEMOGRAPHICS (BACKGROUND)
OVERVIEW OF CENTRAL IT ORGANIZATIONS
CENTRAL IT INFRASTRUCTURE
CENTRAL IT RESEARCH SUPPORT SERVICES
CENTRAL IT OFFERED SERVICES
FUNDING MECHANISMS
SUMMARY AND CONCLUSIONS
REFERENCES
Appendix C HPC Working Example
Appendix D HPC and Hadoop
Appendix E Bioinformatics + Docker
Glossary
About the Author
About the Contributors
Index
End User License Agreement
The emergence of data science is radically transforming the biomedical knowledge generation paradigm. While modern biomedicine has been a pioneer in evidence-based science, its approach for decades has largely followed a well-worn path of experimental design, data collection, analysis, and interpretation. Data science introduces an alternative pathway—one that starts with the vast collections of diverse digital data increasingly accessible to the community.
While the data science evidence generation concept has many birth parents, Jim Gray of Microsoft best described the unique opportunity afforded by this new paradigm. In a 2007 address to the U.S. National Research Council, Gray argued: “With an exaflood of unexamined data and teraflops of cheap computing power, we should be able to make many valuable discoveries simply by searching all that information for unexpected patterns” [1]. Gray coined the phrase “data-intensive scientific discovery.” Notably, he broke with the high-performance computing “high priests” and advocated the adoption of new models of computing. Following Gray’s untimely death shortly after his address, his colleagues captured this concept in a collection of essays ultimately published as The Fourth Paradigm: Data-Intensive Scientific Discovery [2]. It was within these essays that the term “big data” was introduced.
“Data science” and “big data” are now overburdened terms with many meanings. The most useful definitions are operational in nature. One of the most colorful comes from John Myles White of Facebook, who indicates that big data is any problem “so large that traditional approaches to data analysis are doomed to failure” [3]. I find the definition of the chief architect of Data.gov, Philip Ashlock, most elucidating: “Analysis that can help you find patterns, anomalies, or new structures amidst otherwise chaotic or complex data points” [3].
Data science remains controversial in biomedicine. Jeff Drazen, the editor in chief of the New England Journal of Medicine, has described data science practitioners as “research parasites” [4]. More subtly, Robert Weinberg openly questions whether such approaches have any potential to generate real insight in his article describing an emerging crisis in understanding cancer, “Coming Full Circle—From Endless Complexity to Simplicity and Back Again” [5].
I have been an eyewitness and co-conspirator in the data science transformation occurring in biomedicine. I grew up with the Human Genome Project and the rapid accumulation of large volumes of big data it generates. I have made contributions through the “Discovery Science” paradigm that the Genome Project made acceptable in biomedicine. For example, with my colleagues at the Cooperative Human Linkage Center, we were early adopters of computational science and the Internet (then NSFnet) in our efforts to construct the map of human inheritance [6]. For us at the time, big data topped out at a gigabyte! While serving as the founding director of the National Institutes of Health’s National Cancer Institute’s Center for Biomedical Informatics and Information Technology, I was tasked with helping bring data science to the cancer community. The charge was broad—including basic science, clinical research, and health encounter data. It was technologically challenging—predating many technology paradigms now taken for granted as standard in data science. Through these pioneering efforts, I experienced the aforementioned controversial nature of data science and the second of Arthur C. Clarke’s laws: “The only way of discovering the limits of the possible is to venture a little way past them into the impossible” [7].
Strategies in Biomedical Data Science is an ambitious attempt to look at “the limits of the possible” for data science in biomedicine. Unique in its scope, it takes a comprehensive look at all aspects of data science. Work in the sciences is routinely compartmentalized and segregated among specialists. This segregation is particularly true in biomedicine as it wrestles with the integration of data science and its underpinning in information technology. While such specialization is essential for progress within disciplines, the failure to have cross-cutting discussions results in lost opportunities. This book is significant in that it purposely embraces the “transdisciplinary” nature of biomedical data science. Transdisciplinary research (a foundational aspect of Arizona State University’s “New American University”) brings together different disciplines to create innovations that are beyond the capacity of any single specialty. Data science is definitionally transdisciplinary and, somewhat ironically, discipline-agnostic.
Strategies in Biomedical Data Science unapologetically mixes biology, analytics, and information technology. Its transdisciplinary topics cover diverse data types—genomic, clinical encounter, personal monitoring devices—and the data science opportunities (and challenges) in each. Within each of these topics, it provides insights into the software capabilities that are used to wrangle Gray’s “exaflood” of data and to find his “unexpected patterns.” It provides insightful discussions of the underpinning computational and network infrastructure necessary to realize the potential of data science. More specifically, it provides practical blueprints that translate Gray’s suggested alternative to traditional high-performance computing paradigms into reality. Within each of these, it provides case studies written by experts that transition the topics from concept to real-world examples. Importantly, these case studies are provided by both academics and industry sources, demonstrating the importance of both to the progress of biomedical data science, as well as the need to blend these often-adversarial communities.
I have had the opportunity to know the author, Jay Etchings, for over three years. Jay is a true computational renaissance man, as reflected in the breadth of topics facilely presented in Strategies in Biomedical Data Science. I was first introduced to Jay when he was an architect for Dell. Jay translated ASU’s vision for a first-generation, purpose-built data science research platform into the operational Next Generation Cyber Capability (NGCC) described in the book. The NGCC is a physical instantiation of what Gray envisioned. Now at ASU as the director of Research Computing Operations, Jay and his team deliver biomedical data science to a diverse collection of international scientists.
Jay brings a fresh perspective and a diverse pedigree of work experiences to biomedical data science. He has been at the forefront of developing and deploying big data capabilities throughout his career. For example, Jay was on the leading edge in bringing big data infrastructure to the gaming industry—a community that is always an early adopter of breakthrough technology. Jay has hands-on experience in the complexities of biomedical data from his efforts to provide support for the Centers for Medicare and Medicaid Services. Jay’s commercial background brings with it a can-do approach to problems and a low tolerance for the arcane consternation that often paralyzes academics. This fresh perspective and his enthusiasm for biomedicine pervade his writing. Strategies in Biomedical Data Science is a one-stop shop of data science essentials and is likely to serve as the go-to resource for years to come.
Ken Buetow, Ph.D.,
Professor, Arizona State University
Director, Computational Science and Informatics Core Program
Director, Complex Adaptive Systems Initiative
1. David Snyder. 2016. “The Big Picture of Big Data—IEEE—The Institute.” http://theinstitute.ieee.org/ieee-roundup/members/achievements/the-big-picture-of-big-data.
2. Anthony J. G. Hey, ed. 2009. The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research.
3. Jennifer Dutcher. 2014. “What Is Big Data?” September 3. https://datascience.berkeley.edu/what-is-big-data/.
4. Dan L. Longo and Jeffrey M. Drazen. 2016. “Data Sharing.” New England Journal of Medicine 374, no. 3: 276–277. doi:10.1056/NEJMe1516564.
5. Robert A. Weinberg. 2014. “Coming Full Circle—From Endless Complexity to Simplicity and Back Again.” Cell 157, no. 1: 267–271. doi:10.1016/j.cell.2014.03.004.
6. J. C. Murray, K. H. Buetow, J. L. Weber, S. Ludwigsen, T. Scherpbier-Heddema, F. Manion, et al. 1994. “A Comprehensive Human Linkage Map with Centimorgan Density. Cooperative Human Linkage Center (CHLC).” Science 265, no. 5181: 2049–2054.
7. Arthur C. Clarke. 1962. “Hazards of Prophecy: The Failure of Imagination.” In Profiles of the Future: An Inquiry into the Limits of the Possible. New York: Harper & Row.
Most broadly, this book has been inspired by the need for a collaborative and multidisciplinary approach to solving the intricate puzzle that is cancer. Cancer poses a complex adaptive challenge that reaches across all domains: medicine, biology, technology, and the social sciences. Transdisciplinary collaboration is the only true path to the future. Ubiquitous research computing in support of “open science” and open big data has an essential role to play in this collaborative process.
More specifically, this book is dedicated to Sue Stigler and the family she leaves behind. Her three-and-a-half-year battle with cancer came to a close on December 7, 2015. Sue’s kindness and devotion, and her endless support for others even while ill, were remarkable; her selflessness will always be remembered. If you would like to donate to the Stigler family college fund, please visit their GoFundMe page, https://www.gofundme.com/bpebavas.
Author proceeds support childhood brain cancer research through an ASU Foundation account supporting Dr. Joshua LaBaer’s work in the Biodesign Institute. Dr. LaBaer is conducting cutting-edge research on pediatric low-grade astrocytomas (PLGAs), which are the most common cancers of the brain in children.
In the research and discovery leading to this book, I have worked with more amazing and committed individuals than I could have ever imagined. My mentor and friend Ken Buetow is fond of saying, “If you’re the smartest person in the room, you are in the wrong room.” Time and again I have been in the right room. I am able to count some of the smartest people on the planet as colleagues and friends. Publication of this book was made a reality by their support and example.
A very special thanks to my good friend Phil Simon for convincing me to put thoughts, concepts, and theory on paper and share it with the world.
At Arizona State University I would like to thank Gordon Wishon, Dr. Elizabeth Cantwell, and Dr. Sethuraman Panchanathan (“Panch”) for giving me the opportunity to drive innovation at the university.
I would also like to recognize the dedication of our Research Computing team at Arizona State University for the continued commitment to our “commander’s intent” and to Christopher Myhill for sharing the commander’s intent with me while at Dell Enterprise.
Tremendous thanks to the teamwork of Jon McNally, Johnathan “Jr.” Lee, Lee Reynolds, Ram Polur, Daniel Penaherrera, Sheetal Shetty, James Napier, Tiffany Marin, Deborah Whitten, Curtis Thompson, Srinivasa Mathkur, Marisa Brazil, and of course Carol Schumacher, arguably the best administrative assistant alive. Special thanks also to Wendy “DigDug” Cegielski for her editing hours and continued motivation; next year you will be Dr. Wendy.
In no specific order I also would like to thank this list of super-smart and generous folks as well as our many terrific and invaluable partners: NimbleStorage, Brocade, Internet2, ESNET, Penguin Computing, TGEN, SwiftStack, MarkLogic, the Open Daylight Foundation, the Linux Foundation, Open Networking Foundation, IT Partners, friends at University of Arizona, Northern Arizona University, Dell Enterprise, University of Massachusetts-Lowell, Baylor University, Washington State University, Georgia Tech, Broad Institute of MIT and Harvard, University of Nevada Las Vegas, and the College of Southern Nevada (formerly CCSN). Thanks as well for the support and mentorship from domain professionals, both public and private, like Mark “Pup” Roberts, Brandon Mikkelsen, Sean Dudley, Joel Dudley, James Lowey, Todd Decker, Jeff Creighton, Jim Scott, Gregory Palmer, Neela Jacques, Al Ritacco, and of course my engineer stepbrother Pedro Victor Gomes.
Last but certainly not least, I would like to recognize my awesome team of Jacob, Dixon, and Annika for their enduring patience throughout the never-ending collecting of the data and experience that comprises this text.
Heather, though you have departed from my arms, there is always a place for you in my heart.
Never let the future disturb you.
You will meet it, if you have to, with the same weapons of reason which today arm you against the present.
—Marcus Aurelius
Some time ago, while I was engaged as a consultant, it became painfully obvious that the approaches to healthcare data management and overall infrastructure architecture were stuck in the Stone Age. While data and information technology (IT) professionals sprinted to remain on the cutting edge of top tech trends, much of the healthcare system remained a technical backwater. The many explanations for this include compliance controls, challenges associated with the rapid proliferation of data, and reliance on old systems with proprietary code where porting was more painful than the day-to-day operations. This state of affairs has been frustrating for all involved. But beyond the very real frustrations, there are far more important negative impacts. Technical inefficiencies increase costs, lead to a loss of research productivity, and hurt clinical outcomes. In other words, everyone suffers. When I talk to people about data management and IT support within the healthcare field, a recurring theme is that much is “lost in translation” between the various stakeholders: IT professionals, researchers, doctors, clinicians, and administrators.
Over the past 20 years, much of my time has been spent in medical and technical fields. I have held positions with two large insurance payer providers and have worked with the Centers for Medicare & Medicaid Services (CMS) as a recovery audit contractor. I have even worked clinically as an emergency medical technician with a strong background in exercise physiology. Seeking greater challenges led me to Las Vegas, Nevada, where I was fortunate to work on the first cloud-enabled centrally deterministic (Class 2) gaming systems for the state lottery. This was well before the term “cloud” had entered common usage. At the close of the project, I returned to the medical field, joining a Fortune 50 payer provider, where I worked on ingesting targeted acquisitions.
My wide-ranging work experiences have shown me that medical and research professionals are usually not technology experts, and most do not desire to be. At the same time, computer scientists and infrastructure experts are not biologists, doctors, or researchers. This longtime disconnect paves the way for high-paid consultants to be brought in as intermediaries between IT and biomedical staff.
Not surprisingly, this does not work terribly well, nor does it best serve the medical and research communities. Consultants typically demand high compensation and often are not able to perform the sort of knowledge transfer necessary to make a meaningful and sustainable impact. There are many different permutations and possible explanations for this. But, in the end, I think it is at heart a failure to adequately translate or bridge biomedicine and IT.
The primary motivation for this book is to begin to create a sustainable and readily accessible bridge between IT and data technologists, on one hand, and the community of clinicians, researchers, and academics who deliver and advance healthcare, on the other hand. This book is thus a translational text that will hopefully work both ways. It can help IT staff learn more about clinical and research needs within biomedicine. It also can help doctors and researchers learn more about data and other technical tools that are potentially at their disposal.
My experience in healthcare has shown me that both IT professionals and biologists tend to become isolated or siloed in their professional worlds. This isolation hurts us all: IT staff, biologists, doctors, and patients alike. This is not to suggest that IT staff and data managers should get master’s degrees in biology or epidemiology. Rather, I am suggesting that as IT staff and data managers learn more about the biomedical context of their work, they will be able to work better and more efficiently. Furthermore, as biomedicine becomes ever more dependent on computing and big data, there is more and more domain-specific technical knowledge to assimilate.
As IT and biomedicine innovate with increasing rapidity, I predict that we will see more and more hybrid job titles, such as health technologist and bioinformatician. In order to stay current, both IT professionals and biomedical professionals will need to become less isolated. This book begins to bring together these two fields that are so dependent on each other and have so much to offer each other. It is my sincere hope that this work will narrow the gap between those engaged in use-inspired research and those supporting that research from an infrastructure delivery perspective.
In the interest of creating as accessible a bridge text as possible between IT staff and biomedical personnel, this book is relatively nontechnical. For the most part, the aim is to offer a conceptual introduction to key topics in data management for the biomedical sciences. While a certain familiarity with IT, networking, and applications is assumed, you will find very little in the way of code examples. The goal is to equip you with some foundational concepts that will leave you prepared to seek out whatever additional information you and your institution might need.
I have worked in IT for over 20 years, but I am most inspired by how computing technologies can be used to solve human problems. I certainly appreciate elegant code and innovative technical solutions. But at the end of the day, it is the prospect of improving patient outcomes that keeps me engaged and driven to learn and continually extend the boundaries of the possible. One area of biomedical research that I find particularly inspiring is the potential to use targeted therapies to more effectively treat pediatric low-grade astrocytomas (PLGAs). PLGAs are by far the most common cancer of the brain among children. They are often fatal, and current chemotherapies frequently have lifelong side effects, including neurocognitive impairment. Dr. Joshua LaBaer, interim director of the Biodesign Institute at Arizona State University, is working to develop effective targeted therapies that reduce harmful effects on normal cells. Proceeds from this book support the ASU Research Foundation and Dr. LaBaer's work in personalized diagnostics; he holds the Virginia G. Piper Chair in Personalized Medicine.
In reflecting on the important roles to be played by humans and by computing, I am reminded of a frequently cited quote by Leo M. Cherne, an American economist and public servant, that is often inaccurately attributed to Albert Einstein: “The computer is incredibly fast, accurate, and stupid. Man is unbelievably slow, inaccurate, and brilliant. The marriage of the two is a force beyond calculation.” As our capabilities to gather, analyze, and archive data dramatically improve, computing is likely to be increasingly valuable to biomedical research and clinical medicine. Yet let us always remember the need for humans, slow and inaccurate as we usually are.
Strategies in Biomedical Data Science is designed to help anyone who works with biomedical data. This certainly includes IT staff and systems administrators. These readers will hopefully gain a deeper understanding of particular challenges and solutions for biomedical data management. The target audience also includes bioscience researchers and clinical staff. While persons in these roles are not typically directly responsible for data management, they are most certainly concerned with and affected by how data is created, used, and archived. I hope these readers will gain a deeper understanding of how IT staff tend to approach systems architecture and data management. Quite frequently we focus on academic and other public research institutions. Such institutions are tremendously important for cutting-edge research and collaboration. Most of the best practices and scenarios presented in the book are, however, equally applicable to private-sector use cases.
All readers are welcome to work through this book in whatever order best suits their particular interests and needs.
Strategies in Biomedical Data Science offers a relatively high-level introduction to the cutting-edge and rapidly changing field of biomedical data. It provides biomedical IT professionals with much-needed guidance toward managing the increasing deluge of healthcare data. This book demonstrates ways in which both technological development and more effective use of current resources can better serve both patient and payer. The discussion explores the aggregation of disparate data sources, current analytics and tool sets, the growing necessity of smart bioinformatics, and more as data science and biomedical science grow increasingly intertwined. Real-world use cases and clear examples are featured throughout, and coverage of data sources, problems, and potential mitigation provides necessary insight for forward-looking healthcare professionals.
The book begins with an overview of current technical challenges in healthcare and then moves into topics in biomedical data management, including network infrastructure, compute infrastructure, cloud architecture, and finally next-generation cyberinfrastructures.
Many of the chapters include use cases and/or case studies. Use cases examine a general scenario and typically focus on one application or technology. Case studies are more particular examinations of how a company or institution has used an application or technology to meet an operational need. One of our objectives is to shine a light into the black box that is the emerging realm of precision medicine. Much of the case study data has been compiled over the past few years and has been updated to include as much current data as available. Please be aware that some case study materials have been anonymized at the request of the institution providing the information. Case studies appear after chapters, while use cases are presented within the chapters.
Strategies in Biomedical Data Science has benefited tremendously from the many wonderful experts who have generously contributed content. Contributors are acknowledged throughout the book, alongside their contributions, and you can find their biographies in the “About the Contributors” section.
Chapter 1, “Healthcare, History, and Heartbreak,” examines some of the current top issues in healthcare that pertain to data and IT. There are great challenges but also tremendous opportunities for innovation in IT and data science. Chapter 1 also presents some promising areas for innovation, including the Internet of Things, cloud computing, and dramatic advances in data storage. Chapter 2, “Genome Sequencing,” recaps the remarkable history of how scientists deciphered the central dogma, the deceptively simple model that explains the molecular basis of biological life. We then review the history of genomic sequencing from its origins to next-generation sequencing (NGS) and recount its startling price drop. Perhaps most important, we survey some common genomics tools and resources for analyzing and working with genomics data in silico. Following this chapter you will find a case study presenting a dramatic example of exome sequencing leading to clinical diagnosis.
Chapter 3, “Data Management,” explores challenges and solutions for managing large quantities of biomedical data. The chapter begins with an overview of different types of data and moves on to issues of security and compliance in biomedical research. We offer a general research data life cycle to help you plan and anticipate potential problems. Particular storage technologies covered include iRODS, OpenStack Swift, SwiftStack, and NimbleStorage, a performance storage array. Following this chapter you will find three case studies. The first considers the data demands of genetic sequencing. The second offers specification for HudsonAlpha’s SwiftStack storage cluster. The third focuses on the use of NimbleStorage’s predictive flash storage at ASU.
Chapter 4, “Designing a Data-Ready Network Infrastructure,” offers a brief history of computer networking before examining research networks and some advances in networking. We also share a model that can be used to deliver secure and regulated data storage and services so that institutions can comply with security standards. Networking advances discussed include InfiniBand, a computer-networking communications standard used in high-performance computing, which features very high throughput and very low latency; network function virtualization (NFV); and software-defined networking (SDN). The bulk of this chapter is a detailed guide to OpenDaylight, an open source SDN platform.
Chapter 5, “Data-Intensive Compute Infrastructures,” is all about big data. It starts with a brief survey of the current state of big data efforts in healthcare and biomedicine. We consider big data applications as well as data sources. From there we dive into infrastructure for big data analytics, first examining service-oriented architecture and cloud computing. We then focus on hierarchical system structures and discuss the following layers: sensing, data storage and management, data computing, and application services. We end by presenting graphics processing unit (GPU) accelerated computing. Following the chapter you will find two case studies. The first reports on how computational modeling and scientific computing can model treatment options for vascular disease. The second presents how GPU was used to model the molecular dynamics of antibiotic resistance.
Chapter 6, “Cloud Computing and Emerging Architectures,” begins with an overview of cloud computing, including service and deployment models as well as challenges. After this we examine Research as a Service (RaaS) and cluster homogeneity, key components of some versions of cloud computing, and we also consider federated access. The second half of the chapter dives into Zeta Architecture, an emergent architecture that is used by Google and that offers better hardware utilization, fewer moving parts, and greater responsiveness and flexibility. Zeta and other emerging architectures are catalyzed by limitations on one-size-fits-all enterprise architectures. Following this chapter is a case study on using on-demand computing for biomedical research on ventricular tachycardia.
Chapter 7, “Data Science,” focuses on the tools and techniques demanded by this exciting and rapidly growing field. First we examine some basic statistical concepts as these are the foundation of much data science. From there we explore some NoSQL database offerings and Splunk, and offer a detailed example of genomic analysis (eQTL), which entails Apache Spark and Hive tables. Following this chapter you will find two case studies: one on UC Irvine Health’s Hortonworks Data Platform and the second on subclonal variations and the computing and data science strategies used to study these.
Chapter 8, “Next-Generation Cyberinfrastructures,” brings together many of the central strands of this book. It reports on the Next-Generation Cyber Capability (NGCC), which is Arizona State University’s approach to meeting compute and data needs for its research community and key collaborators. Following this chapter is a case study on one of the first NGCC projects, the National Biomarker Development Alliance.
A brief conclusion reviews the book’s goals and invites feedback and suggestions.
In addition to the case studies, Strategies in Biomedical Data Science contains five appendixes and a glossary.
Appendix A reports on a survey about research management. Appendix B reports on a survey about the current state and desired capabilities for IT resources at research universities. Appendix C offers some high-performance computing working examples. Appendix D details how to bridge high-performance computing to Hadoop. Finally, Appendix E discusses using Docker for bioinformatics.
Thanks for reading!
Should this book inspire the reader to dig deeper into research computing or the research itself, we will consider it a win. If you find this book to be of little value, please leave it on your next flight, bus ride, or at a homeless shelter for some other reader to find and take to their next job interview.
As you use this book and work with biomedical data, we welcome your comments and feedback. In the hybrid and rapidly evolving field of biomedical data, collaboration and exchange are truly essential. We hope there will be a second edition of this book, and I would value comments and feedback to help improve this material.
You can reach Jay at [email protected] or [email protected].
Over the past decade, we have unlocked many of the mysteries about DNA and RNA. This knowledge isn’t just sitting in books on the shelf nor is it confined to the workbenches of laboratories. We have used these research findings to pinpoint the causes of many diseases. Moreover, scientists have translated this genetic knowledge into several treatments and therapies prompting a bridge between the laboratory bench and the patient’s bedside.
—Barack Obama on the Genomics and Personalized Medicine Act (S. 976), March 23, 2007
While we are surely poised to continue to make tremendous medical advances—notably in personalized medicine, pharmacogenomics, and precision medicine—we are also facing substantial challenges. The challenges facing healthcare today are many, and if we do not adequately address them we risk missing opportunities, pushing the cost of care up, and slowing the pace of biomedical innovation. In briefly surveying the state of healthcare, it is not my intention to offer a political diagnosis or solution. Rather, it is my intention to use our current technical knowledge to point the way to practical solutions. For example, a long-theorized solution to health records management would be a single cloud-based system where healthcare information sharing exists universally. But if I were to present this as the best technical solution, it would not be my intention to also advocate for a shift to a single-payer healthcare system. As much as possible this book and the discussions in this chapter aim to avoid politics.
After decades of technological lag, biomedicine has started to embrace new technologies with increasing rapidity. Next-generation sequencing, mobile technologies, wearable sensors, three-dimensional medical imaging, and advances in analytic software now make it possible to capture vast amounts of information. Yet we still struggle with the collection, management, security, and thoughtful interpretation of all this information. At the same time, healthcare is changing quickly as the field grapples with new technologies and is transformed by mergers and new partnerships. As a complex adaptive system, healthcare is more than the sum of its parts, and it is always difficult to predict the future. But we do know that as the post–Affordable Care Act healthcare landscape takes shape, the industry is shifting toward digitally enabled, consumer-focused care models. Given these trends, technology will be granted many opportunities to improve patient care.
At the outset of this book it is worth surveying some of the top issues in healthcare. For many of you, these will be quite familiar. Whether you’re an expert or not, you should feel free to skip ahead if you like. But it is my sincere hope that the background material will be of real value in bridging the gap between healthcare and biomedicine, on the one hand, and information technology (IT) and data management, on the other. Just as doctors in an age of increasing specialization can benefit from attending to the whole patient, it is very valuable for IT staff to have a more holistic and systemic understanding of healthcare.
There are many, many sources that comment on the state of healthcare and biomedicine more broadly. Although I worked as a contractor for two of the country’s largest Medicare/Medicaid contract holders, I am not a policy expert. But I have come to appreciate the importance of taking in the bigger picture. My admittedly incomplete survey of top healthcare issues is drawn from PwC’s Top Health Industry Issues of 2016 and PwC’s Top Health Industry Issues of 2015 [1]. These two brief reports offer compelling syntheses and analyses of current trends. In rereading these reports and reflecting on my own experiences in the field, I was struck by the number of top issues that are substantially or in part data or IT issues. Many of the top healthcare issues are centrally concerned with the storage, security, sharing, and analysis of data. In other words, IT and data management will be called on to make major contributions to advancing the dynamic healthcare field. Next I explore nine key issues impacting healthcare.
As the health sector continues to change in response to the Affordable Care Act (2010), we are seeing many mergers and partnerships. “The ACA’s emphasis on value and outcomes has sent ripples through the $3.2 trillion health sector, spreading and shifting risk in its wake. At the same time, capital is inexpensive, thanks to sustained low interest rates. Industry’s response? Go big” [2]. Mergers between large insurance providers are consolidating the insurance market. In 2015, the second largest U.S. insurer, Anthem, made a $48.4 billion offer for health and life insurance provider Cigna. Mergers have also been common in the pharmaceutical field, including Pfizer’s whopping $160 billion deal for specialty pharmaceutical star Allergan. While these deals are still awaiting regulatory approval, 2016 and 2017 will likely see more mergers and acquisitions. Many new partnerships are also being formed between pharmaceutical, life sciences, software, pharmacy, healthcare providers, and engineering companies, among others.
Mergers, acquisitions, and partnerships are driven by a number of larger market forces. Sometimes predicted lower IT or data costs drive consolidation. More often it is simply that IT and data will need to be able to respond nimbly to these changes. One of the largest challenges is postacquisition data management.
Many providers in the healthcare space have grown through organic means and have survived on shoestring budgets. When compliance moved to the forefront, many chief information officers were granted grace periods to meet compliance and conducted internal audits, patching together existing components to meet the objectives. This expenditure had the systemic impact of preventing the distribution of funds toward infrastructure improvements. The result was the maintenance of many legacy systems, leaving organizations with out-of-date, proprietary, inflexible systems that were simply not designed to interoperate at larger scale. Now when that smaller provider, which potentially maintains a large collection of Medicare/Medicaid accounts, is acquired by a larger entity, the most significant challenge is the integration of those legacy systems without impacting operational activities. Migrating years of patient records out of an out-of-date platform, one encumbered by complex and tangled spaghetti code written by a resource long since departed, is a substantial undertaking. The need to do so while maintaining business continuity drives many a large entity to maintain the down-level system for years following the acquisition.
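Although this book is deliberately light on code, the reconciliation problem at the heart of postacquisition data migration can be made concrete with a short sketch. The snippet below is purely illustrative and assumes nothing about any particular vendor system: it shows one common technique, comparing per-record checksums between a legacy export and the migrated store, so that missing or silently altered records are flagged before the legacy platform is retired. All record fields and identifiers here are hypothetical.

```python
import hashlib

def record_fingerprint(record):
    """Hash a record's canonical form so any content drift is detectable."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def reconcile(legacy_records, migrated_records):
    """Return record IDs missing from the target and IDs whose content changed."""
    legacy = {r["patient_id"]: record_fingerprint(r) for r in legacy_records}
    migrated = {r["patient_id"]: record_fingerprint(r) for r in migrated_records}
    missing = sorted(set(legacy) - set(migrated))
    altered = sorted(pid for pid in legacy
                     if pid in migrated and legacy[pid] != migrated[pid])
    return missing, altered

# Illustrative data only: a legacy export and a partially migrated store.
legacy_export = [
    {"patient_id": "A001", "dob": "1970-01-01", "plan": "medicare"},
    {"patient_id": "A002", "dob": "1985-06-15", "plan": "commercial"},
]
migrated_store = [
    {"patient_id": "A001", "dob": "1970-01-01", "plan": "medicare"},
]

missing, altered = reconcile(legacy_export, migrated_store)
print(missing)  # any IDs listed here must be remediated before cutover
```

In practice a migration team would run checks of this kind continuously during the parallel-run period, which is one reason acquirers so often keep the down-level system alive long after the deal closes.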
As more and more patient data is stored and shared, security is an increasing concern. Patient data typically contains individualized information. If that data is stolen, the risks of identity theft are substantial, and there exists a thriving black market for stolen health records. Data security breaches are relatively common. “During the summer of 2014, more than 5 million patients had their personal data compromised” [1]. These breaches are often costly for companies. Medical devices themselves can also be hacked. For example, in 2015 the government warned that “an infusion pump . . . could be modified to deliver a fatal dose of medication” [2].
