Complete guidance for mastering the tools and techniques of the digital revolution
With the digital revolution opening up tremendous opportunities in many fields, there is a growing need for skilled professionals who can develop data-intensive systems and extract information and knowledge from them. This book frames for the first time a new systematic approach for tackling the challenges of data-intensive computing, providing decision makers and technical experts alike with practical tools for dealing with our exploding data collections.
Emphasizing data-intensive thinking and interdisciplinary collaboration, The Data Bonanza: Improving Knowledge Discovery in Science, Engineering, and Business examines the essential components of knowledge discovery, surveys many of the current research efforts worldwide, and points to new areas for innovation. Complete with a wealth of examples and DISPEL-based methods demonstrating how to gain more from data in real-world systems, the book:
The Data Bonanza is a must-have guide for information strategists, data analysts, and engineers in business, research, and government, and for anyone wishing to be on the cutting edge of data mining, machine learning, databases, distributed systems, or large-scale computing.
Year of publication: 2013
Copyright © 2013 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic format.
Library of Congress Cataloging-in-Publication Data:
Atkinson, Malcolm.
The data bonanza : improving knowledge discovery for science, engineering and business / Malcolm Atkinson, Rob Baxter, Michelle Galea, Mark Parsons, Peter Brezany, Oscar Corcho, Jano van Hemert, David Snelling.
pages cm
ISBN 978-1-118-39864-7 (pbk.)
1. Information technology. 2. Information retrieval. 3. Databases. I. Title.
T58.5.A93 2013
001–dc23
2012035310
Printed in the United States of America.
10 9 8 7 6 5 4 3 2 1
To data-to-knowledge highway engineers, everywhere.
Table of Contents
Title
Copyright Page
Dedication Page
CONTRIBUTORS
FOREWORD
PREFACE
THE EDITORS
PART I: Strategies For Success In The Digital-Data Revolution
Chapter 1: The Digital-Data Challenge
1.1 The Digital Revolution
1.2 Changing How We Think and Behave
1.3 Moving Adroitly in this Fast-Changing Field
1.4 Digital-Data Challenges Exist Everywhere
1.5 Changing How We Work
1.6 Divide and Conquer Offers the Solution
1.7 Engineering Data-to-Knowledge Highways
References
Chapter 2: The Digital-Data Revolution
2.1 Data, Information, and Knowledge
2.2 Increasing Volumes and Diversity of Data
2.3 Changing the Ways We Work with Data
References
Chapter 3: The Data-Intensive Survival Guide
3.1 Introduction: Challenges and Strategy
3.2 Three Categories of Expert
3.3 The Data-Intensive Architecture
3.4 An Operational Data-Intensive System
3.5 Introducing DISPEL
3.6 A Simple DISPEL Example
3.7 Supporting Data-Intensive Experts
3.8 DISPEL in the Context of Contemporary Systems
3.9 Datascopes
3.10 Ramps for Incremental Engagement
3.11 Readers’ Guide to the Rest of This Book
References
Chapter 4: Data-Intensive Thinking with DISPEL
4.1 Processing Elements
4.2 Connections
4.3 Data Streams and Structure
4.4 Functions
4.5 The Three-Level Type System
4.6 Registry, Libraries, and Descriptions
4.7 Achieving Data-Intensive Performance
4.8 Reliability and Control
4.9 The Data-to-Knowledge Highway
References
PART II: Data-Intensive Knowledge Discovery
Chapter 5: Data-Intensive Analysis
5.1 Knowledge Discovery in Telco Inc.
5.2 Understanding Customers to Prevent Churn
5.3 Preventing Churn Across Multiple Companies
5.4 Understanding Customers by Combining Heterogeneous Public and Private Data
5.5 Conclusions
References
Chapter 6: Problem Solving in Data-Intensive Knowledge Discovery
6.1 The Conventional Life Cycle of Knowledge Discovery
6.2 Knowledge Discovery Over Heterogeneous Data Sources
6.3 Knowledge Discovery from Private and Public, Structured and Nonstructured Data
6.4 Conclusions
References
Chapter 7: Data-Intensive Components and Usage Patterns
7.1 Data Source Access and Transformation Components
7.2 Data Integration Components
7.3 Data Preparation and Processing Components
7.4 Data-Mining Components
7.5 Visualization and Knowledge Delivery Components
References
Chapter 8: Sharing and Reuse in Knowledge Discovery
8.1 Strategies for Sharing and Reuse
8.2 Data Analysis Ontologies for Data Analysis Experts
8.3 Generic Ontologies for Metadata Generation
8.4 Domain Ontologies for Domain Experts
8.5 Conclusions
References
PART III: Data-Intensive Engineering
Chapter 9: Platforms for Data-Intensive Analysis
9.1 The Hourglass Reprise
9.2 The Motivation for a Platform
9.3 Realization
References
Chapter 10: Definition of the DISPEL Language
10.1 A Simple Example
10.2 Processing Elements
10.3 Data Streams
10.4 Type System
10.5 Registration
10.6 Packaging
10.7 Workflow Submission
10.8 Examples of DISPEL
10.9 Summary
References
Chapter 11: DISPEL Development
11.1 The Development Landscape
11.2 Data-Intensive Workbenches
11.3 Data-Intensive Component Libraries
11.4 Summary
References
Chapter 12: DISPEL Enactment
12.1 Overview of DISPEL Enactment
12.2 DISPEL Language Processing
12.3 DISPEL Optimization
12.4 DISPEL Deployment
12.5 DISPEL Execution and Control
References
PART IV: Data-Intensive Application Experience
Chapter 13: The Application Foundations of DISPEL
13.1 Characteristics of Data-Intensive Applications
13.2 Evaluating Application Performance
13.3 Reviewing the Data-Intensive Strategy
Chapter 14: Analytical Platform for Customer Relationship Management
14.1 Data Analysis in the Telecoms Business
14.2 Analytical Customer Relationship Management
14.3 Scenario 1: Churn Prediction
14.4 Scenario 2: Cross Selling
14.5 Exploiting the Models and Rules
14.6 Summary: Lessons Learned
References
Chapter 15: Environmental Risk Management
15.1 Environmental Modeling
15.2 Cascading Simulation Models
15.3 Environmental Data Sources and Their Management
15.4 Scenario 1: ORAVA
15.5 Scenario 2: RADAR
15.6 Scenario 3: SVP
15.7 New Technologies for Environmental Data Mining
15.8 Summary: Lessons Learned
References
Chapter 16: Analyzing Gene Expression Imaging Data in Developmental Biology
16.1 Understanding Biological Function
16.2 Gene Image Annotation
16.3 Automated Annotation of Gene Expression Images
16.4 Exploitation and Future Work
16.5 Summary
References
Chapter 17: Data-Intensive Seismology: Research Horizons
17.1 Introduction
17.2 Seismic Ambient Noise Processing
17.3 Solution Implementation
17.4 Evaluation
17.5 Further Work
17.6 Conclusions
References
PART V: Data-Intensive Beacons of Success
Chapter 18: Data-Intensive Methods in Astronomy
18.1 Introduction
18.2 The Virtual Observatory
18.3 Data-Intensive Photometric Classification of Quasars
18.4 Probing the Dark Universe with Weak Gravitational Lensing
18.5 Future Research Issues
18.6 Conclusions
References
Chapter 19: The World at One’s Fingertips: Interactive Interpretation of Environmental Data
19.1 Introduction
19.2 The Current State of the Art
19.3 The Technical Landscape
19.4 Interactive Visualization
19.5 From Visualization to Intercomparison
19.6 Future Development: The Environmental Cloud
19.7 Conclusions
References
Chapter 20: Data-Driven Research in the Humanities—the DARIAH Research Infrastructure
20.1 Introduction
20.2 The Tradition of Digital Humanities
20.3 Humanities Research Data
20.4 Use Case
20.5 Conclusion and Future Development
References
Chapter 21: Analysis of Large and Complex Engineering and Transport Data
21.1 Introduction
21.2 Applications and Challenges
21.3 The Methods Used
21.4 Future Developments
21.5 Conclusions
References
Chapter 22: Estimating Species Distributions—Across Space, Through Time, and with Features of the Environment
22.1 Introduction
22.2 Data Discovery, Access, and Synthesis
22.3 Model Development
22.4 Managing Computational Requirements
22.5 Exploring and Visualizing Model Results
22.6 Analysis Results
22.7 Conclusion
References
PART VI: The Data-Intensive Future
Chapter 23: Data-Intensive Trends
23.1 Reprise
23.2 Data-Intensive Applications
References
Chapter 24: Data-Rich Futures
24.1 Future Data Infrastructure
24.2 Future Data Economy
24.3 Future Data Society and Professionalism
References
Appendix A: Glossary
Michelle Galea and Malcolm Atkinson
Appendix B: DISPEL Reference Manual
Paul Martin
Appendix C: Component Definitions
Malcolm Atkinson and Chee Sun Liew
INDEX
Contributors
M. ATKINSON, School of Informatics, University of Edinburgh, Edinburgh, UK
A. ASCHENBRENNER, State and University Library Göttingen, Göttingen, Germany
J. AUSTIN, Department of Computer Science, University of York, York, UK
R. BALDOCK, Medical Research Council, Human Genetics Unit, Edinburgh, UK
R. BAXTER, EPCC, University of Edinburgh, Edinburgh, UK
P. BESANA, School of Informatics, University of Edinburgh, Edinburgh, UK
T. BLANKE, Digital Research Infrastructure in the Arts and Humanities, King’s College London, London, UK
J. BLOWER, Reading e-Science Centre, University of Reading, Reading, UK
R. COOK, Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
O. CORCHO, Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Madrid, Spain
T. DAMOULAS, Department of Computer Science, Cornell University, Ithaca, New York, USA
D. FINK, Cornell Lab of Ornithology, Cornell University, Ithaca, New York, USA
C. FRITZE, State and University Library Göttingen, Göttingen, Germany
M. GALEA, School of Informatics, University of Edinburgh, Edinburgh, UK
A. GEMMELL, Reading e-Science Centre, University of Reading, Reading, UK
O. HABALA, Oddelenie paralelného a distribuovaného spracovania informácií, Ústav informatiky SAV, Bratislava, Slovakia
K. HAINES, Reading e-Science Centre, University of Reading, Reading, UK
L. HAN, School of Computing, Mathematics & Digital Technology, Manchester Metropolitan University, Manchester, UK
J. VAN HEMERT, Optos plc, Dunfermline, UK
L. HLUCHÝ, Oddelenie paralelného a distribuovaného spracovania informácií, Ústav informatiky SAV, Bratislava, Slovakia
W. HOCHACHKA, Cornell Lab of Ornithology, Cornell University, Ithaca, New York, USA
M. HOLLIMAN, Institute of Astronomy, University of Edinburgh, Edinburgh, UK
A. HUME, EPCC, University of Edinburgh, Edinburgh, UK
M. JARKA, Comarch SA, Warsaw, Poland
S. KELLING, Cornell Lab of Ornithology, Cornell University, Ithaca, New York, USA
T. KITCHING, Institute of Astronomy, University of Edinburgh, Edinburgh, UK and Mullard Space Science Laboratory, University College London, Dorking, UK
A. KRAUSE, EPCC, University of Edinburgh, Edinburgh, UK
R. MANN, Institute of Astronomy, University of Edinburgh, Edinburgh, UK
P. MARTIN, School of Informatics, University of Edinburgh, Edinburgh, UK
W. MICHENER, DataONE, University of New Mexico, Albuquerque, New Mexico, USA
A. MOUAT, EPCC, University of Edinburgh, Edinburgh, UK
K. NODDLE, Institute of Astronomy, University of Edinburgh, Edinburgh, UK
I. OVERTON, Medical Research Council, Human Genetics Unit, Edinburgh, UK
M. PARSONS, EPCC, University of Edinburgh, Edinburgh, UK
W. PEMPE, State and University Library Göttingen, Göttingen, Germany
A. RIETBROCK, School of Environmental Sciences, University of Liverpool, Liverpool, UK
K. ROSENBERG, Cornell Lab of Ornithology, Cornell University, Ithaca, New York, USA
C. SILVA, Department of Computer Science, Polytechnic Institute of New York, Brooklyn, New York, USA
A. SPINUSO, School of Informatics, University of Edinburgh, Edinburgh, UK; Royal Netherlands Meteorological Institute, Information and Observation Services and Technology-R&D, Utrecht, The Netherlands
D. SNELLING, Research Transformation and Innovation, Fujitsu Laboratories of Europe Limited, Hayes, UK
C. SUN LIEW, School of Informatics, University of Edinburgh, Edinburgh, UK; Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
B. SIMO, Oddelenie paralelného a distribuovaného spracovania informácií, Ústav informatiky SAV, Bratislava, Slovakia
V. TRAN, Oddelenie paralelného a distribuovaného spracovania informácií, Ústav informatiky SAV, Bratislava, Slovakia
L. TRANI, School of Informatics, University of Edinburgh, Edinburgh, UK; Royal Netherlands Meteorological Institute, Information and Observation Services and Technology–R&D, Utrecht, The Netherlands
L. VALKONEN, School of Informatics, University of Edinburgh, Edinburgh, UK
G. YAIKHOM, School of Informatics, University of Edinburgh, Edinburgh, UK; School of Computer Science and Informatics, Cardiff University, Cardiff, UK
Foreword
This book is a systematic approach to help face the challenge of data-intensive computing and a response to the articulation of that challenge set out in The Fourth Paradigm, a collection of essays by scientists and computer scientists that I co-edited. It also recognizes the fact that we face that challenge in the context of a digital revolution that is transforming communities worldwide at Internet speed.
This book proposes a strategy for partitioning the challenge, both in the ways in which we organize and in which we build systems. This partitioning builds on natural foci of interest and the examples show that this approach works well in the context of groups and organizations. The technological strategy reflects the evolving pattern of provision driven by business models and the flourishing diversity of tools and applications that enable human innovation.
We all face the need to separate concerns every time we face a data-intensive problem. This is key to making data-intensive methods routinely available and to their easy application. This leads to the recognition of effective working practices that need supporting with better ‘datascopes’ that are easily steered and focused to extract the relevant information from immense and diverse collections of data. The book calls for the introduction of ‘intellectual on-ramps’ that match the new tools to well-understood interfaces, so that practitioners can incrementally master the new data-intensive methods.
This book calls for recognition that this notion of intellectual on-ramps is worthy of study. Data-intensive computing warrants an appropriate engineering discipline that identifies effective ways of building appropriate highways from data to required knowledge. It is a call to arms for a serious attempt at initiating the professionalization of this discipline.
We are very much at the start of the digital revolution. The growth in digital data will not abate, and in certain areas it will probably accelerate. There is much to be gained by exploiting the opportunities this bonanza of data brings but the extraction of insights and knowledge from ‘Big Data’ will also certainly transform our organizations and society. Responding effectively to these changes requires the availability of ready-made tools and reusable processes together with practitioners with the skills to deploy them precisely and safely. This book frames an approach as to how these tools, processes, and skills may be developed. Such a systematic approach is now urgently needed as the opportunities are rapidly outgrowing our capabilities to assemble and run data-intensive campaigns.
This book provides a vocabulary to facilitate data-intensive engineering by introducing key concepts and notations. It presents nine in-depth case studies that show how practitioners have tackled data-intensive challenges in a wide range of disciplines. It also provides an up-to-date analysis of this rapidly changing field and a survey of many of the current research hot-spots that are driving it. For all these reasons, I believe that this book is a welcome addition to the literature on data-intensive computing.
TONY HEY
Redmond, Washington
July 2012
Preface
The world is undergoing a digital-data revolution. More and more data are born digital. Almost every business, governmental, and organizational activity is driven by data and produces data. Science, engineering, medicine, design and innovation are powered by data. This prevalence of data in all that we do is changing society, organizations, and individual behavior. To thrive in this new environment requires new strategies, new skills, and new technology. This book is the first to expound the strategies that will make you adept at exploiting the expanding opportunities of this new world.
This book identifies the driving forces that are provoking change and proposes a strategy for building the skills, methods, technologies, and businesses that will be well adapted in the emerging data-wealthy world. This strategy will change the way in which you spend your organization’s (country’s, company’s, institution’s, and profession’s) resources. You will invest more in exploiting data, even if that means spending less on creating, capturing or archiving it, as today most data are underused or never used, even though they frequently contain the latent evidence that should be leading to innovation, knowledge and wisdom. After reading this book, you will expect an understandable path from data, via analysis, evidence and visualization, to influential outputs that change behavior. When this is not happening, your organization is underperforming and is at risk. You will come to expect that all of those with whom you deal should be competent at getting good value from data.
This book will change your skills by developing your ingenuity in discovering, understanding, exploiting, and presenting data. You will acquire a compendium of tools for addressing every stage in the data life cycle. It will change education—everyone needs survival skills for the data-wealthy world. All professionals and experts need experience and judgement for their part of the path from data to discovery, innovation, and outcomes.
This book will initiate the development of professionals who will engineer tomorrow’s data highways. These highways will be designed to meet carefully analyzed and anticipated needs, to interwork with existing data infrastructure, to accelerate the journeys of millions from data to knowledge. These knowledge discovery highways will make it easy for people to get data from wherever they are stored and promptly deliver understandable information to wherever it is needed.
This book will be valuable to a wide range of information strategists, decision makers, researchers, students, and practitioners, from domains such as computer science (data mining, machine learning, statistics, databases, knowledge-based systems, large-scale computing), e-Science, and to workers in any discipline or industry where large-scale data handling and analysis is important.
The ideas presented are relevant to and draw on: data mining, knowledge discovery in databases, machine learning, artificial intelligence, databases and data management, data warehousing, information systems, distributed computing, grid computing, cloud computing, ubiquitous computing, e-Science (including a wide range of scientific and engineering fields dealing with large data), modeling and simulation.
This book consists of 24 chapters grouped into six parts; they are introduced here.
Part I: Strategies for success in the digital-data revolution Part I provides an executive summary of the whole book to convince strategists, politicians, managers, and educators that our future data-intensive society requires new thinking, new behavior, new culture, and new distribution of investment and effort. This part will introduce the major concepts so that readers are equipped to discuss and steer their organization’s response to the opportunities and obligations brought by the growing wealth of data. It will help readers understand the changing context brought about by advances in digital devices, digital communication, and ubiquitous computing.
Chapter 1: The digital-data challenge This chapter will help readers to understand the challenges ahead in making good use of the data and introduce ideas that will lead to helpful strategies. A global digital-data revolution is catalyzing change in the ways in which we live, work, relax, govern, and organize. This is a significant change in society, as important as the invention of printing or the industrial revolution, but more challenging because it is happening globally at Internet speed. Becoming agile in adapting to this new world is essential.
Chapter 2: The digital-data revolution This chapter reviews the relationships between data, information, knowledge, and wisdom. It analyzes and quantifies the changes in technology and society that are delivering the data bonanza, and then reviews the consequential changes via representative examples in biology, Earth sciences, social sciences, leisure activity, and business. It exposes quantitative details and shows the complexity and diversity of the growing wealth of data, introducing some of its potential benefits and examples of the impediments to successfully realizing those benefits.
Chapter 3: The data-intensive survival guide This chapter presents an overview of all of the elements of the proposed data-intensive strategy. Sufficient detail is presented for readers to understand the principles and practice that we recommend. It should also provide a good preparation for readers who choose to sample later chapters. It introduces three professional viewpoints: domain experts, data-analysis experts, and data-intensive engineers. Success depends on a balanced approach that develops the capacity of all three groups. A data-intensive architecture provides a flexible framework for that balanced approach. This enables the three groups to build and exploit data-intensive processes that incrementally step from data to results. A language is introduced to describe these incremental data processes from all three points of view. The chapter introduces ‘datascopes’ as the productized data handling environments and ‘intellectual ramps’ as the ‘on ramps’ for the highways from data to knowledge.
Chapter 4: Data-intensive thinking with DISPEL This chapter engages the reader with technical issues and solutions, by working through a sequence of examples, building up from a sketch of a solution to a large-scale data challenge. It uses the DISPEL language extensively, introducing its concepts and constructs. It shows how DISPEL may help designers, data-analysts, and engineers develop solutions to the requirements emerging in any data-intensive application domain. The reader is taken through simple steps initially; these then build to the conceptually complex steps that are necessary to cope with the realities of real data providers, real data, real distributed systems, and long-running processes.
Part II: Data-intensive knowledge discovery Part II focuses on the needs of data-analysis experts. It illustrates the problem-solving strategies appropriate for a data-rich world, without delving into the details of underlying technologies. It should engage and inform data-analysis specialists, such as statisticians, data miners, image analysts, bio-informaticians or chemo-informaticians, and generate ideas pertinent to their application areas.
Chapter 5: Data-intensive analysis This chapter introduces a set of common problems that data-analysis experts often encounter, by means of a set of scenarios of increasing levels of complexity. The scenarios typify knowledge discovery challenges, and the presented solutions provide practical methods that serve as a starting point for readers addressing their own data challenges.
Chapter 6: Problem solving in data-intensive knowledge discovery On the basis of the previous scenarios, this chapter provides an overview of effective strategies in knowledge discovery, highlighting common problem-solving methods that apply in conventional contexts, and focusing on the similarities and differences of these methods.
Chapter 7: Data-intensive components and usage patterns This chapter provides a systematic review of the components that are commonly used in knowledge discovery tasks as well as common patterns of component composition. That is, it introduces the processing elements from which knowledge discovery solutions are built and common composition patterns for delivering trustworthy information. It reflects on how these components and patterns are evolving in a data-intensive context.
Chapter 8: Sharing and reuse in knowledge discovery This chapter introduces more advanced knowledge discovery problems, and shows how improved component and pattern descriptions facilitate reuse. This supports the assembly of libraries of high-level components well adapted to classes of knowledge discovery methods or application domains. The descriptions are made more powerful by introducing notations from the Semantic Web.
Part III: Data-intensive engineering Part III is targeted at technical experts who will develop complex applications, new components, or data-intensive platforms. The techniques introduced may be applied very widely; for example, to any data-intensive distributed application, such as index generation, image processing, sequence comparison, text analysis, and sensor-stream monitoring. The challenges, methods, and implementation requirements are illustrated by making extensive use of DISPEL.
Chapter 9: Platforms for data-intensive analysis This chapter gives a reprise of data-intensive architectures, examines the business case for investing in them, and introduces the stages of data-intensive workflow enactment.
Chapter 10: Definition of the DISPEL language This chapter describes the novel aspects of the DISPEL language: its constructs, capabilities, and anticipated programming style.
Chapter 11: DISPEL development This chapter describes the tools and libraries that a DISPEL developer might expect to use. The tools include those needed during process definition, those required to organize enactment, and diagnostic aids for developers of applications and platforms.
Chapter 12: DISPEL enactment This chapter describes the four stages of DISPEL enactment. It is targeted at the data-intensive engineers who implement enactment services.
Part IV: Data-intensive application experience This part of the book is about applications that shaped the ideas behind the data-intensive architecture and methods. It provides a wealth of examples drawn from experience, describing in each case the aspects of data-intensive systems tested by the application, the DISPEL-based methods developed to meet the challenge, and the conclusions drawn from the prototype experiments.
Chapter 13: The application foundations of DISPEL The early development of DISPEL was influenced and assisted by research challenges from four very different data-intensive application domains. This chapter reviews these four domains in terms of their particular needs and requirements and how, as a suite, they provide an effective test of all key dimensions of a data-intensive system. It reviews the data-intensive strategy in terms of these applications and finds support for the approach.
Chapter 14: Analytical platform for customer relationship management This chapter demonstrates that the data-intensive methods and technology work well for traditional commercial knowledge discovery applications. Readers are introduced to the application domain through a scene-setting discussion, which assumes no prior knowledge, and are then taken through the process of analyzing customer data to predict behavior or preferences.
Chapter 15: Environmental risk management This chapter presents applications in the context of environmental risk management. The scenarios involve significant data-integration challenges as they take an increasing number of factors into account when managing the outflow from a reservoir to limit the effects downstream.
Chapter 16: Analyzing gene expression imaging data in developmental biology This chapter describes the application of data-intensive methods to the automatic identification and annotation of gene expression patterns in the mouse embryo. It shows how image processing and machine learning can be combined to annotate images and identify networks of gene functions.
Chapter 17: Data-intensive seismology: research horizons Seismology has moved from focusing on events to analyzing continuous streams of data obtained from thousands of seismometers. This is fundamental to understanding the inner structure and processes of the Earth and this chapter investigates the data-intensive architecture necessary to enable the analysis of large-scale distributed seismic waveforms.
Part V: Data-intensive beacons of success This part introduces a group of challenging, sophisticated data-intensive applications, which are starting to shape and promote a new generation of knowledge discovery technology. The chapters show that science, engineering, and society are fertile lands for data-intensive research. This part is targeted at novel application developers who like to include visionary aspects in their research.
Chapter 18: Data-intensive methods in astronomy Astronomy has been at the forefront of the digital revolution as it pioneered faster and more sensitive digital cameras, and established a new modus operandi for sharing and integrating data globally. These are yielding floods of data, and opening up new approaches to exploring the cosmos and testing the physical models that underpin it. This chapter describes two examples that exemplify the data-intensive science now underway.
Chapter 19: Interactive interpretation of environmental data A crucial step in any science is to explore the data available; this will often stimulate new insights and hypotheses. As the volumes of data grow and diverse formats are encountered, the effort of handling data inhibits exploration. This chapter shows how these inhibitory difficulties can be overcome, so that oceanographers and atmospheric scientists can easily select, visualize, and explore compositions of their data.
Chapter 20: Data-driven research in the humanities Researchers in the arts and humanities are using digitization to see new aspects of the many artifacts and phenomena they study. Digital resources allow statistical methods and computational matching to be employed, as well as the full panoply of text processing and collaborative annotation. In this chapter, researchers show their plans for a Europe-wide data infrastructure to facilitate this new research.
Chapter 21: Analysis of engineering and transport data Analysis of vibration data from aero-engines, turbines, and locomotives, ‘listening to the engine’, can reveal incipient problems and trigger appropriate remedial responses. To do this safely for large numbers of operational engines is a major data-intensive challenge. This chapter reports on ten years’ progress and its spin-offs into analyzing medical time series.
Chapter 22: Determining the patterns of bird species occurrence In this chapter, the ornithologists describe the challenge of estimating the populations of birds as they migrate and of inferring the factors affecting species numbers. It takes a great deal of sophisticated data analysis to extract and visualize the relevant information and much ingenuity to discover and use the data required from other disciplines.
Part VI: The data-intensive future This part presents a summary of the state of the industry and research, the observed trends and the current ‘hot-spots’ of data-intensive innovation. It provides a framework for reviewing the current activity and the changes it is anticipated to bring about. It offers a rich set of pointers to the literature and Web sites, built over the 15 months of the data-intensive research theme at the e-Science Institute. This should help readers select and find highly relevant further reading.
Chapter 23: Data-intensive trends This chapter summarizes the learning about data-intensive methods and their potential power. It then analyzes some of the categories of data-intensive challenge and assesses how they will develop.
Chapter 24: Data-rich futures This chapter dangerously attempts some ‘crystal gazing’. It looks first at technological factors and current research that should be observed by those who wish to further develop data-intensive strategies. It explores some of the economic factors that will shape a data-rich future and concludes with a view on the social issues that will emerge. We call on those with influence to strive for professional standards in handling data, from their collection to the actions based on their evidence.
There are two pieces of information that we wish to offer readers in order to help them get the most benefit from this book: the choice of routes through the book and conventions used in the book. We consider each of these in turn.
1. The primary story line of the book develops in the conventional way by reading the parts and the chapters in order, up to the end of Part IV. It begins with scene setting to establish a conceptual framework. It then addresses the methods and engineering needed to put the ideas into practice before successful applications in a wide range of domains. Many readers will want a selected-reading route that matches their needs; the following map and suggested routes will encourage them to plan their reading.
2. Research strategists, scientists and managers who may be less interested in technical detail should follow Chapters 1 to 3→Part IV→Part V→Part VI.
3. Data-analysis experts should try Part I→Part II→Chapter 10→Chapter 11→Part IV→Part VI.
4. Those building data-intensive systems should read Part I→Part III→Part IV→Part VI.
5. Domain experts who are looking for ideas that may pay off with their specialist data could read the applications first: Part V→Part IV→Chapters 1 to 3. We hope there are enough signposts that you can adjust your route easily to follow new interests.
The conventions concern the structure and the representation of certain items. As the map shows, there are six parts; each one begins with a preamble intended to orient readers who start in that part, and to relate that part to the other parts of the book. Appendix A provides a glossary, where we hope you will find useful definitions of terms used frequently in the book. Terms used infrequently may be traced via the index. Each chapter concludes with the references cited in that chapter. References to Web sites are shown as URLs in line with the text if they are short and in footnotes if they are long. The http:// prefix is omitted.
Readers’ map of the book.
There are many programming examples, mostly represented in DISPEL. A consistent highlighting convention has been used for these, which we believe helps legibility. These are also available on the Web site (see following text), so that you can view them in your favorite editor. There are often corresponding diagrams, showing the data-flow graph that the program would generate. These pick up the same color conventions. The language DISPEL is introduced in Chapters 3, 4 and 10; the language definition is in Appendix B. The components that are used in the examples are provided from standard libraries of components; these are described in Appendix C.
A Web site at www.dispel.lang.org holds material intended to help readers of this book; it includes pages covering the following topics.
An overview and table of contents of the book.
A collection of success stories.
Teaching material including presentations to be used in conjunction with chapters in the book.
Copies of the program examples from the book, so that they may be used by readers.
The libraries of components used in the book, with corresponding descriptions in a local registry.
The DISPEL reference manual.
Links to other sites including many of those referenced from the book, the associated open-source project and new developments.
This Web site will be updated and contributions from others will be welcome.
Many people have contributed to this book through discussions and visits over the past 6 years; we thank them all and limit the explicit list to those who worked directly on the book.
Funders The primary funder of the editorial process was the European Commission’s Framework Program 7’s support for the ADMIRE project (www.admireproject.eu), grant number 215024. The tour of the USA by Atkinson and De Roure, to study their use of data (bit.ly/c0G2rn), and Atkinson’s work on the book were funded by the UK’s Engineering and Physical Sciences Research Council (EPSRC) Senior Research Fellowship to undertake the UK e-Science Envoy role, grant EP/D079829/1. The initial workshop on data-intensive research (bit.ly/bQpu5h), where many of the commitments for contributions to the book were made, and the one-year data-intensive research theme (bit.ly/cygimA), were funded by the e-Science Institute, EPSRC grant EP/D056314/1. The work to further develop this book, the technology and the methodology described, is partially funded in the University of Edinburgh by the EPSRC NeSC Research Platform (research.nesc.ac.uk) grant EP/F057695/1, and in Universidad Politécnica de Madrid by the Spanish Ministry of Science and Technology under grant TIN2010-17060. Baxter would also like to acknowledge the direct support of EPCC at the University of Edinburgh, and support from the UK’s Software Sustainability Institute under EPSRC grant number EP/H043160/1.
Helpers A big thank you to Jo Newman, who was a continuous support to the editors of the book, impeccably arranging many meetings and tele-communications, as well as thorough proof reading, checking copyright information, and communicating with all of the contributing authors. Kathy Humphry read nearly every chapter and gave good advice on each one of them.
We must also thank several people who, in addition to being contributing authors, helped to produce the book: Ivan Janciak, in the University of Vienna, who made life much easier for all of us by setting up convenient macros for building the book; Paul Martin and Gagarine Yaikhom, of the Universities of Edinburgh and Cardiff respectively, for setting up the system for typesetting DISPEL in a way that highlights its structure; and Amrey Krause and Chee Sun Liew of the University of Edinburgh, who set up the system for validating DISPEL text used in the book. Chee Sun Liew also did a great deal of LaTeX wrangling to shape the book into its final form. Ivan Janciak, Alexander Wöhrer and Marek Lenart of the University of Vienna reviewed several book chapters and helped improve the quality of figures and the formatting of the text.
We would also like to thank our colleagues Martin Šeleng and Peter Krammer from the Institute of Informatics of the Slovak Academy of Sciences for their excellent work on the data mining scenarios of the environmental risk management application, and our dear friends at the Slovak Hydro-Meteorological Institute and the Slovak Water Enterprise for their invaluable help with designing the pilot scenarios and providing real input data for them. A big thank you must go to the EPCC software engineering team for making large swathes of the prototype data-intensive platform work: Ally Hume, Malcolm Illingworth, Amrey Krause, Adrian Mouat and David Scott, with a special mention for our integration, test and build-meister Radek Ostrowski.
We heartily thank all of the open-source developers on whose work we built and all those who helped, and we confirm that all the remaining errors are the responsibility of the editors, led by myself.
MALCOLM ATKINSON
Edinburgh, UK
April 2012
The Editors
Malcolm Atkinson PhD is Professor of e-Science in the School of Informatics at the University of Edinburgh in Scotland. He is also Data-Intensive Research group leader, Director of the e-Science Institute, IT architect for the ADMIRE and VERCE EU projects, and UK e-Science Envoy. Atkinson has been leading research projects for several decades and served on many advisory bodies.
Rob Baxter PhD is EPCC’s Software Development Group Manager. He has over fifteen years’ experience of distributed software project management on the bleeding edge of technology. He managed the European ADMIRE project, rated “excellent” in its final review, and plays a prominent role in the EUDAT, iCORDI and PERICLES projects, all of which are large-scale scientific data infrastructures.
Peter Brezany PhD is Professor of Computer Science in the University of Vienna Faculty of Computer Science. He is known internationally for his work in high performance programming languages and their applications. He has led several projects addressing large-scale data analytics.
Oscar Corcho PhD is an Associate Professor at the Facultad de Informática, Universidad Politécnica de Madrid, and he belongs to the Ontology Engineering Group. His research activities are focused on Semantic e-Science and Real World Internet.
Michelle Galea PhD has over 15 years’ experience in the public sector, banking and academia, addressing the challenges of managing data from strategic and research perspectives. She is a research associate in the School of Informatics, University of Edinburgh.
Jano van Hemert PhD is the Imaging Research Manager and Academic Liaison at Optos, which is a global company providing retinal diagnostics. He is an Honorary Fellow of the University of Edinburgh and a member of The Young Academy of Scotland of the Royal Society of Edinburgh.
Mark Parsons PhD is EPCC’s Executive Director and Associate Dean for e-Research at The University of Edinburgh. He has wide-ranging interests in distributed and high performance computing and is a leading contributor to the European PRACE research infrastructure.
David Snelling PhD is a Senior Research Fellow and manager of the Research Transformation and Innovation team at Fujitsu Laboratories of Europe. He is a primary architect of the Unicore Grid and a member of the European Commission’s Expert Groups on Next-generation Grids and Cloud Computing, and of the W3C, OGF, DMTF and OASIS standards organizations.
Part I
Strategies for Success in the Digital-Data Revolution
We provide an overview and introduction to each of the six parts of the book. This is an introduction to Part I, which itself gives a complete introduction to the current data-rich environment in which we find ourselves today. It is intended to be a synopsis of the whole book as well as being its introduction. It is helpful as an overview for technology leaders and research strategists who wish to better understand what data-intensive methods can do for them or their organization. It should also help those who are supporting users of data-intensive methods, for example, those who provide Cloud infrastructure for these applications. All readers who intend to dive more deeply into the book will find it a valuable orientation, setting the scene for later parts and steering readers with specific interests to relevant chapters.
The book has been written at a time when there is a great deal of contemporary interest in data and in the best methods of obtaining insight from data. It is based on a decade of experience of data-intensive research. In the last decade there has been a flurry of papers pointing out the advent of the growing wealth of data and of the substantial challenge of exploiting that wealth successfully. An early example, The Data Deluge: An e-Science Perspective by Tony Hey and Anne Trefethen [1], focused on the challenge of data volumes. Although those data volumes are no longer challenging, the growth of data has outrun the increase in capacity and power of data handling technology.
By the end of the last decade, there was growing recognition that the abundance of data was a widespread phenomenon, spanning most domains of science, business and government. Jim Gray recognized the new way of thinking this enables and named it The Fourth Paradigm. The book of that name, in honor of Jim Gray’s memory, by Tony Hey et al. [2], provides a compelling collection of essays showing the potential power of data in more than a dozen disciplines.
In the same year, the US government adopted a report, Harnessing the power of digital data for science and society, by an Interagency Working Group on Digital Data that spanned nearly every branch of government [3]. This recognized the potential of data and initiated a programme to make it as widely used as possible for the benefit of research and society. A corresponding report, Riding the Wave, set the data-intensive research agenda for Europe [4]. In February 2011, Science devoted a special issue to data (volume 331, issue 6018, pages 639–806) showing that though there are many demonstrable successes in scientific and medical research, the use of data is still fraught with challenges.
This book addresses those challenges by deliberately setting out a vision of a future in which a well-polished ecosystem of data services, data-analysis tools, and professionally adopted data-intensive methods makes it far easier to exploit the growing wealth of data. Exploiting data effectively is now recognized as a key issue for many industries, for commerce, for government, for healthcare and for research. There are many other publications addressing related issues. They report how individuals or large teams with the required skills, and often with much effort, extract gems of knowledge from the new wealth of data. This book recognizes that the ever-growing number of cases where such knowledge discovery is needed cannot be met by throwing that level of skill and effort at every case. The skill base cannot be grown at a rate that matches growth in demand, and the effort has to be reduced to deliver timely results economically. To address this issue, we raise the level of discourse, partition the intellectual challenge, and propose both sharing and automation.
Published in 2009, Beautiful Data, edited by Segaran and Hammerbacher [5], provides 20 examples of how data can be used effectively. Elegant solutions with a wide variety of data yield information that is then presented carefully to achieve intended effects on its beholders. It is a reference that should be consulted for inspiration; the following books give more help with understanding applicable principles.
The book, Scientific Data Management: Challenges, Technology and Deployment, edited by Shoshani and Rotem [6], provides a collection of strategies from experienced practitioners on how to build technology, to handle very large volumes of data, to organize computations that analyze and visualize those data, and to specify and manage the processes involved. It takes a file-oriented and high-performance computing viewpoint for the most part, looks predominantly at applications in the physical sciences and is replete with good solutions in that context. These are revisited in Parts IV and V of this book.
In the same year, World Wide Research, edited by Dutton and Jeffreys [7], examined the digital-data revolution from the viewpoint of the Arts and Humanities. Their primary concern was the transformative impact of the Internet on their communities of researchers. A key ingredient is the new ability to create, curate, and share data, with a significant impact on research and on the mores guiding researchers.
The contemporary work, Beautiful Visualization, edited by Steele and Iliinsky [8], focuses on going the last mile with data, presenting the information in forms so well adapted to the recipients and intended purpose that it is natural to interpret it correctly.
All four books set the scene for this book. They contribute compelling examples of the high value and power of using data well, and they present detailed practical techniques that can be harnessed to take data and to convert it into reliable knowledge that can be safely acted on. They also show the ingenuity and detailed work currently necessary to achieve this; the insights, creativity, and perseverance of professionals will always be critical to success but we hope that our vision will make their work far easier. We envisage the automation of many of the technical details that currently limit the number of successes.
Two books announce a rich environment for implementing this vision: Data Mining, by Witten et al. [9], and Data Analysis with Open Source Tools, by Janert [10]. They offer two extensive collections of readily available elements that will be used in exploring and exploiting data. We have made extensive use of some of these elements to validate our vision. They reappear in Parts II and IV of this book. Today, tools for larger-scale data, such as Massive Online Analysis (MOA) (moa.cs.waikato.ac.nz), are emerging. These would be elements in future systems.
As always in computing, the digital environment is changing rapidly; indeed, we argue that the current intertwined set of changes constitutes a significant digital revolution. An aspect of this is that the choices of technology and the dominant business models are changing; for example, Distributed and Cloud Computing: From Parallel Processing to the Internet of Things [11] shows how data-intensive computing can now be accomplished in the Cloud. No book can be wholly insulated from such changes, but we have tried to do two things: to recognize the changes, explain them and indicate why they are important drivers in our story, and to deliver principles as well as practical details with the belief that these principles have long-term value.
Chapter 2 provides examples of the current data, showing its scale and complexity, as well as the global efforts to collaborate in making the best use of the data. It shows how these early days of the digital revolution are reshaping our world, both social and business behavior, an idea also explored by Dutton and Jeffreys [7] and manifest in the applications shown in Hey et al. [2], Segaran and Hammerbacher [5] and Steele and Iliinsky [8]. Steering these changes to the benefit of science and society by having governmental decisions lead the way is an explicit goal of the Interagency Working Group on Digital Data [3].
Chapter 3 rehearses a strategy for rapidly increasing our capabilities and agility in the exploitation of data, based on the recognition of how to partition the challenges, both in the human and technical dimensions. This provides a foundation for understanding the rest of the book and concludes with a guide for those who wish to then focus on a particular aspect of this strategy.
This part’s final chapter, Chapter 4, introduces data-intensive thinking. It begins with the elementary stages of first addressing a knowledge discovery challenge and introduces a language and diagrammatic notation to facilitate thinking about these issues. It uses that notation and a running example to initiate consideration of the technological challenges of full-scale knowledge discovery processes, showing the variety of issues encountered and the basic tactics for overcoming them.
Taken together, the four chapters in this part will give their readers an appreciation as to why they should exploit the burgeoning data bonanza, an awareness of the evolving context of the digital revolution, an introduction to a strategy and vision as to how to proceed, and a tutorial on how to think about data-intensive challenges.
1. A. J. G. Hey and A. E. Trefethen, The Data Deluge: An e-Science Perspective, in F. Berman, G. C. Fox, and A. J. G. Hey (eds.), Grid Computing - Making the Global Infrastructure a Reality, ch. 36, pp. 809–824. John Wiley & Sons, Ltd, 2003.
2. A. J. G. Hey, S. Tansley, and K. Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.
3. Interagency Working Group on Digital Data, “Harnessing the power of digital data for science and society: report to the Committee on Science of the National Science and Technology Council,” tech. rep., Executive office of the President, Office of Science and Technology, Washington D.C., USA, 2009.
4. High-Level Expert Group on Scientific Data, “Riding the wave: how Europe can gain from the rising tide of scientific data,” tech. rep., European Commission, 2010.
5. T. Segaran and J. Hammerbacher, Beautiful Data: The Stories Behind Elegant Data Solutions. O’Reilly, 2009.
6. A. Shoshani and D. Rotem, Scientific Data Management: Challenges, Technology and Deployment, Computational Science Series. Chapman and Hall/CRC, 2010.
7. W. H. Dutton and P. W. Jeffreys, World Wide Research: Reshaping the Sciences and Humanities. MIT Press, 2010.
8. J. Steele and N. Iliinsky, Beautiful Visualization: Looking at Data Through the Eyes of Experts. O’Reilly, 2010.
9. I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques (Third Edition). Morgan Kaufmann, 2011.
10. P. K. Janert, Data Analysis with Open Source Tools. O’Reilly, 2011.
11. K. Hwang, J. Dongarra, and G. C. Fox, Distributed and Cloud Computing: From Parallel Processing to the Internet of Things. Morgan Kaufmann, 2011.