Big Data Analytics in Earth, Atmospheric, and Ocean Sciences

Description
Applying tools for data analysis to the rapidly increasing volume of data about the Earth

An ever-increasing volume of Earth data is being gathered. These data are "big" not only in size but also in their complexity, their differing formats, and the varied scientific disciplines they span. As such, big data are disrupting traditional research. New methods and platforms, such as the cloud, are tackling these new challenges. Big Data Analytics in Earth, Atmospheric, and Ocean Sciences explores new tools for the analysis and display of the rapidly increasing volume of data about the Earth.

Volume highlights include:

* An introduction to the breadth of big Earth data analytics
* Architectures developed to support big Earth data analytics
* Different analysis and statistical methods for big Earth data
* Current applications of analytics to Earth science data
* Challenges to fully implementing big data analytics

The American Geophysical Union promotes discovery in Earth and space science for the benefit of humanity. Its publications disseminate scientific knowledge and provide resources for researchers, students, and professionals. Find out more in this Q&A with the editors.

Page count: 506

Year of publication: 2022

Table of Contents

Cover

Title Page

Copyright

List of Contributors

Preface

1 AN INTRODUCTION TO BIG DATA ANALYTICS

1.1 Overview

1.2 Definitions

1.3 Example Problems

1.4 Big Data Analysis Concepts

1.5 Technology and Tools

1.6 Challenges

1.7 Summary

References

Part I: Big Data Analytics Architecture

2 INTRODUCTION TO BIG DATA ANALYTICS ARCHITECTURE

References

3 Scaling Big Earth Science Data Systems Via Cloud Computing

3.1 Introduction

3.2 Key Concepts of Science Data Systems (SDSes)

3.3 Increasing Data Processing, Volumes, and Rates

3.4 Cloud Concepts for SDSes

3.5 Architecture Components of Cloud‐Based SDS

3.6 Considerations for Multi‐cloud and Hybrid SDS

3.7 Cloud Economics

3.8 Large‐Scaling Considerations

3.9 Example of Cloud SDSes

3.10 Conclusion

References

4 NOAA Open Data Dissemination (Formerly NOAA Big Data Project/Program)

4.1 Obstacles to the Public's Use of NOAA Environmental Data

4.2 Public Access of NOAA Data Creates Challenges for the Agency

4.3 The Vision for NOAA's “Oddball” Approach to Big Data

4.4 A NOAA Cooperative Institute Data Broker provides Research and Operational Agility

4.5 Public‐Private Partnerships Provide the Pipeline

4.6 BDP Exceeds Expectations and Evolves into Enterprise Operations

4.7 Engaging Users in the Cloud

4.8 Challenges and Opportunities

4.9 Vision for the Future

Acknowledgments

References

5 A Data Cube Architecture for Cloud‐Based Earth Observation Analytics

5.1 Introduction

5.2 Open Data Cube for the Cloud Design

5.3 S3 Array I/O Performance

5.4 Discussion and Conclusion

References

6 Open Source Exploratory Analysis of Big Earth Data With NEXUS

6.1 Introduction

6.2 Architecture

6.3 Deployment Architecture

6.4 Benchmarking and Studies

6.5 Analytics Collaborative Framework

6.6 Federated Analytics Collaborative Systems

6.7 Conclusion

References

7 Benchmark Comparison of Cloud Analytics Methods Applied to Earth Observations

7.1 Introduction

7.2 Experimental Setup

7.3 AODS Candidates

7.4 Experimental Results

7.5 Conclusions

References

Part II: Analysis Methods for Big Earth Data

8 Introduction to Analysis Methods for Big Earth Data

References

9 Spatial Statistics for Big Data Analytics in the Ocean and Atmosphere: Perspectives, Challenges, and Opportunities

9.1 Spatial Data and Spatial Statistics

9.2 What Constitutes Big Spatial Data?

9.3 Statistical Implications of the Four Vs of Big Spatial Data

9.4 Challenges to the Statistical Analysis of Big Spatial Data

9.5 Opportunities in Spatial Analysis of Big Data

9.6 Conclusion

References

10 Giving Scientists Back Their Flow: Analyzing Big Geoscience Data Sets in the Cloud

10.1 Introduction

10.2 Where's the Opportunity?

10.3 The Future

Reference

11 The Distributed Oceanographic Match‐Up Service

11.1 Introduction

11.2 DOMS Capabilities

11.3 System Architecture

11.4 Workflow

11.5 Future Development

11.6 Acknowledgments

Availability Statement

References

Part III: Big Earth Data Applications

12 Introduction to Big Earth Data Applications

References

13 Topological Methods for Pattern Detection in Climate Data

13.1 Introduction

13.2 Topological Methods for Pattern Detection

13.3 Case Study: Atmospheric Rivers Detection

13.4 Conclusions and Recommendations

Acknowledgments

References

14 Exploring Large Scale Data Analysis and Visualization for Atmospheric Radiation Measurement Data Discovery Using NoSQL Technologies

14.1 Introduction

14.2 Software and Workflow

14.3 Hardware Architecture

14.4 Applications

14.5 Conclusions

Acknowledgments

References

15 Demonstrating Condensed Massive Satellite Data Sets for Rapid Data Exploration: The MODIS Land Surface Temperatures of Antarctica

15.1 Introduction

15.2 Data

15.3 Methods

15.4 Results

15.5 Conclusions

Acknowledgments

Availability Statement

References

16 Developing Big Data Infrastructure for Analyzing AIS Vessel Tracking Data on a Global Scale

16.1 Introduction

16.2 Background

16.3 Use Case: Producing Heat Maps of Vessel Traffic using AIS Data

16.4 Data Processing Overview

16.5 Future Work

16.6 Conclusions

References

17 Future of Big Earth Data Analytics

17.1 Introduction

17.2 How Data Get Bigger

17.3 The Evolution of Analytics Algorithms

17.4 Analytics Architectures

17.5 Conclusions

References

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1 Terms for understanding general concepts

Chapter 5

Table 5.1 Different operating characteristics between HPC and cloud deploym...

Table 5.2 EC2 instance types

Table 5.3 Storage types

Chapter 7

Table 7.1 Hardware systems supporting the AODS architectures

Chapter 9

Table 9.1 Examples of spatial algorithms and their associated time complexi...

Chapter 13

Table 13.1 List of data sources used in the experiments

Table 13.2 Classification accuracy score of the SVM classifier for 3 hourly...

Table 13.3 Precision and sensitivity scores calculated for all data sets li...

List of Illustrations

Chapter 1

Figure 1.1 Leveraging feature binning technology to see geographic trends be...

Figure 1.2 Ridesharing pick up locations in midtown Manhattan. In the southe...

Figure 1.3 Tornado hotspots (+) and reported start points across the United ...

Figure 1.4 Spatiotemporal clustering (DBSCAN...

Figure 1.5 Cloud service models (IaaS, PaaS, and SaaS) (Chou, 2018).

Chapter 3

Figure 3.1 Fundamental systems within the end‐to‐end Earth Observing Systems...

Figure 3.2 Key concepts of a typical SDS of data ingest from GDS, processing...

Figure 3.3 On‐premise SDS components.

Figure 3.4 NISAR processing per month of mission (forward processing + bulk ...

Figure 3.5 The relevant service layers of an SDS, their key area of focus (l...

Figure 3.6 A high‐level SDS functional architecture.

Figure 3.7 The trade‐off spectrum of on‐premise to all‐in cloud‐native, and ...

Figure 3.8 Software layer stack of the hybrid SDS.

Figure 3.9 Cloud storage tiers of AWS from hot (fast and expensive) to cold ...

Figure 3.10 SMAP SDS architecture.

Figure 3.11 NISAR GDS/SDS/DAAC data processing workflow.

Chapter 4

Figure 4.1 The actual total archive volume (active and secure copies) of dat...

Figure 4.2 BDP's data distribution scheme is led by a one‐way transfer of a ...

Chapter 5

Figure 5.1 Open data cube notional architecture.

Figure 5.2 Execution flow for a data request on S3.

Figure 5.3 The ODC Cloud execution engine.

Figure 5.4 Ingest timings (time in log scale).

Figure 5.5 Time taken for data retrieval (time in log scale).

Chapter 6

Figure 6.2 Using an RDD framework to generate 30‐year time series.

Figure 6.3 NEXUS data tiling architecture. Source: Nexus.

Figure 6.4 Apache Extensible Data Gateway Environment (EDGE) architecture.

Figure 6.5 NEXUS system architecture.

Figure 6.6 NEXUS Serverless data ingestion and processing workflow subsystem...

Figure 6.7 NEXUS Automated deployment on AWS.

Figure 6.8 NEXUS performance compared with GIOVANNI and AWS EMR.

Figure 6.9 On‐the‐fly multivariate analysis of Hurricane Katrina

Figure 6.10 The Apache Science Data Analytics Platform (SDAP), an Analytics ...

Figure 6.11 Federated Analytics Centers architecture. Source: Nexus.

Chapter 7

Figure 7.1 Area‐averaged time series of MODIS/Terra aerosol optical depth fr...

Figure 7.2 Elapsed time for computations of area‐averaged time series of MOD...

Figure 7.3 External spatiotemporal index of data chunks across the HDFS used...

Figure 7.4 Depiction of overall ClimateSpark architecture, which leverages t...

Figure 7.5 High‐level architecture of NEXUS. Data ingest is handled by an Ex...

Figure 7.6 Elapsed time results for computation of area‐averaged time series...

Chapter 8

Figure 8.1 Model‐based analysis.

Chapter 9

Figure 9.1 Number of active Argo floats (1997–2021).

Figure 9.2 Time complexity of algorithms.

Figure 9.3 Storage requirements for high dimensional data.

Figure 9.4 Three‐dimensional visualization of Ecological Marine Units (EMU)....

Figure 9.5 Mann‐Kendall statistic for trends in maximum daily temperature (1...

Chapter 10

Figure 10.1 The typical workflow of a scientist. They highlight a research q...

Figure 10.2 The progression of compute capabilities. Cloud computing is set ...

Chapter 11

Figure 11.1 DOMS supports distributed access to subsets of in situ data host...

Figure 11.2 An example of satellite to in situ data collocation in three dim...

Figure 11.3 Overall architecture of first DOMS prototype. The primary servic...

Figure 11.4 NEXUS data tiling architecture. Source: Nexus.

Figure 11.5 Giovanni versus NEXUS performance (time in seconds) on area aver...

Figure 11.6 DOMS's workflow in the prototype GUI guides the user from select...

Figure 11.7 Sample analytical graphics showing nearest matched pairs between...

Chapter 13

Figure 13.1 Sample images of two weather patterns having distinguishable str...

Figure 13.2 The block diagram illustrating the extreme weather pattern detec...

Figure 13.3 A toy example of three connected components (C0, C1, C2) in the ...

Figure 13.4 A toy example of evolution plot describing the changes of the co...

Figure 13.5 An illustration of two‐class data that are separable in a high‐d...

Figure 13.6 Sample images illustrating AR detection problem. The upper row s...

Figure 13.7 Examples of normalized evolution plots of averaged (red curves) ...

Chapter 14

Figure 14.1 Cassandra cluster ring with replication factor (RF) of 3.

Figure 14.2 Apache Spark cluster and workflow.

Figure 14.3 Overview of the big data software architecture workflow.

Figure 14.4 Screenshot of LASSO Bundle Browser found at https://www.adc.arm....

Figure 14.5 Screenshot of the web application created to simultaneously plot...

Figure 14.6 Interactive time series plot of different variables from ARMBE d...

Figure 14.7 Interactive parallel coordinates plot of different variables fro...

Figure 14.8 Apache Spark application result of ARMBE conditional querying fo...

Chapter 15

Figure 15.1 A flowchart of the data set condensation process.

Figure 15.2 The 5th (a) and 95th (b) percentiles for Antarctic land surface ...

Figure 15.3 Example statistical baseline images of Antarctic land surface te...

Figure 15.4 An example anomaly database schema. The Anomalies table stores a...

Figure 15.5 Baseline sample populations at each grid cell in the Antarctic l...

Figure 15.6 The most extreme cold temperatures in Antarctica occur along the...

Figure 15.7 A profile of minimum temperature (solid gray line) and elevation...

Figure 15.8 A systematic data error in the MOD11A1 data set. The image shows...

Chapter 16

Figure 16.1 Technologies used to process AIS data.

Figure 16.2 AIS vessel traffic data processing pipeline.

Figure 16.3 Voyages are created by connecting straight lines between pings: ...

Figure 16.4 Algorithm for generating vessel voyages. The exact numbers used,...

Figure 16.5 These example images show filtered vessel traffic. Image (a) sho...

Figure 16.6 Heat maps are generated by overlaying a grid and then counting t...

Figure 16.7 (a) Total vessel traffic and (b) unique vessel traffic over one ...

Figure 16.8 Sample results for processing U.S. Coast Guard terrestrial data ...

Figure 16.9 Sample results for processing Marine Exchange of Alaska data for...

Chapter 17

Figure 17.1 Growth of Cumulative EOSDIS Archives since 2000, shown on a semi...

Guide

Cover Page

Table of Contents

Title Page

Copyright

List of Contributors

Preface

Begin Reading

Index

Wiley End User License Agreement


Special Publications 77

BIG DATA ANALYTICS IN EARTH, ATMOSPHERIC, AND OCEAN SCIENCES

Thomas Huang

Tiffany C. Vance

Christopher Lynnes

Editors

This Work is a co‐publication of the American Geophysical Union and John Wiley and Sons, Inc.

 


This edition first published 2023

© 2023 American Geophysical Union

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

Published under the aegis of the AGU Publications Committee

Matthew Giampoala, Vice President, Publications

Carol Frost, Chair, Publications Committee

For details about the American Geophysical Union visit us at www.agu.org.

The rights of Thomas Huang, Tiffany C. Vance, and Christopher Lynnes to be identified as the editors of this work have been asserted in accordance with law.

Registered Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office

111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data

Names: Huang, Thomas (Technologist), editor. | Vance, C. Tiffany, editor.

  | Lynnes, Christopher, editor.

Title: Big data analytics in earth, atmospheric, and ocean sciences /

  Thomas Huang, Tiffany C. Vance, Christopher Lynnes, editors.

Description: Hoboken, NJ : Wiley-American Geophysical Union, 2023. |

  Includes bibliographical references and index.

Identifiers: LCCN 2022020168 (print) | LCCN 2022020169 (ebook) | ISBN

  9781119467571 (cloth) | ISBN 9781119467564 (adobe pdf) | ISBN 9781119467533 (epub)

Subjects: LCSH: Earth sciences–Data processing. | Atmospheric

  science–Data processing. | Marine sciences–Data processing. | Big

  data.

Classification: LCC QE48.8 .B54 2022 (print) | LCC QE48.8 (ebook) | DDC

550.0285/57–dc23/eng20220722

LC record available at https://lccn.loc.gov/2022020168

LC ebook record available at https://lccn.loc.gov/2022020169

Cover Design: Wiley

Cover Images: Courtesy of Kate Culpepper with design elements provided by Esri, HERE, Garmin, FAO, NOAA, USGS, EPA | Source: Esri, DigitalGlobe, GeoEye, Earthstar Geographics, CNES/Airbus DS, USDA, USGS, AeroGRID, IGN, and the GIS User Community

List of Contributors

Edward M. Armstrong

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

Alberto Arribas

Microsoft

Reading, UK; and

Department of Meteorology

University of Reading

Reading, UK

Jessica Austin

Axiom Data Science, LLC

Anchorage, Alaska, USA

Rob Bochenek

Axiom Data Science, LLC

Anchorage, Alaska, USA

Mark A. Bourassa

Center for Ocean‐Atmospheric Prediction Studies, and

Department of Earth, Ocean, and Atmospheric Science

Florida State University

Tallahassee, Florida, USA

Jonathan Brannock

North Carolina Institute for Climate Studies

NOAA Cooperative Institute for Satellite Earth System Studies

North Carolina State University

Asheville, North Carolina, USA

Otis Brown

North Carolina Institute for Climate Studies

NOAA Cooperative Institute for Satellite Earth System Studies

North Carolina State University

Asheville, North Carolina, USA

Kevin A. Butler

Environmental Systems Research Institute

Redlands, California, USA

Nga T. Chung

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

Thomas Cram

National Center for Atmospheric Research

Boulder, Colorado, USA

Jenny Dissen

North Carolina Institute for Climate Studies

NOAA Cooperative Institute for Satellite Earth System Studies

North Carolina State University

Asheville, North Carolina, USA

Kyle Dumas

ARM Research Facility

Oak Ridge National Laboratory

Oak Ridge, Tennessee, USA

John‐Marc Dunaway

Axiom Data Science, LLC

Anchorage, Alaska, USA

Jocelyn Elya

Center for Ocean‐Atmospheric Prediction Studies

Florida State University

Tallahassee, Florida, USA

Eamon Ford

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

David W. Gallaher

National Snow and Ice Data Center

Boulder, Colorado, USA

Kevin Michael Gill

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

Glenn E. Grant

National Snow and Ice Data Center

Boulder, Colorado, USA

Frank R. Greguska III

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

Mahabaleshwara Hegde

NASA Goddard Space Flight Center

Greenbelt, Maryland, USA

Alex Held

CSIRO Centre for Earth Observation

Canberra, ACT, Australia

Erik Hoel

Environmental Systems Research Institute

Redlands, California, USA

Benjamin Holt

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

Hook Hua

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

Thomas Huang

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

Joseph C. Jacob

NASA Jet Propulsion Laboratory

Pasadena, California, USA

Zaihua Ji

National Center for Atmospheric Research

Boulder, Colorado, USA

Karthik Kashinath

Lawrence Berkeley National Laboratory

Berkeley, California, USA

Edward J. Kearns

First Street Foundation

Brooklyn, New York, USA

Bhargavi Krishna

ARM Research Facility

Oak Ridge National Laboratory

Oak Ridge, Tennessee, USA

Vitaliy Kurlin

Department of Computer Science

University of Liverpool

Liverpool, UK

Michael M. Little

NASA Goddard Space Flight Center

Greenbelt, Maryland, USA

Qin Lv

Department of Computer Science

University of Colorado

Boulder, Colorado, USA

Christopher Lynnes

NASA Goddard Space Flight Center (retd.)

Greenbelt, Maryland, USA

Gerald Manipon

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

Theo McCaie

Met Office

Exeter, UK

Dmitriy Morozov

Lawrence Berkeley National Laboratory

Berkeley, California, USA

Grzegorz Muszynski

Lawrence Berkeley National Laboratory

Berkeley, California, USA; and

Department of Computer Science

University of Liverpool

Liverpool, UK

Matt Paget

CSIRO Centre for Earth Observation

Canberra, ACT, Australia

Tom Powell

Met Office

Exeter, UK

Giri Prakash

ARM Research Facility

Oak Ridge National Laboratory

Oak Ridge, Tennessee, USA

Prabhat Ram

Lawrence Berkeley National Laboratory

Berkeley, California, USA

Niall Robinson

Met Office

Exeter, UK; and

University of Exeter

Exeter, UK

Sujen Shah

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

Adrienne Simonson

Office of the Chief Information Officer

National Oceanic and Atmospheric Administration

Asheville, North Carolina, USA

Shawn R. Smith

Center for Ocean‐Atmospheric Prediction Studies

Florida State University

Tallahassee, Florida, USA

Kate Szura

Interactions LLC

Franklin, Massachusetts, USA

Ronnie Taib

CSIRO Data61

Sydney, NSW, Australia

Jacob Tomlinson

NVIDIA

Reading, UK

Vardis Tsontos

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

Tiffany C. Vance

U.S. Integrated Ocean Observing System

National Oceanic and Atmospheric Administration

Silver Spring, Maryland, USA

Peter Wang

CSIRO Data61

Sydney, NSW, Australia

Michael Wehner

Lawrence Berkeley National Laboratory

Berkeley, California, USA

Brian D. Wilson

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

Robert Woodcock

CSIRO Centre for Earth Observation

Canberra, ACT, Australia

Elizabeth Yam

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

Chaowei Phil Yang

George Mason University

Fairfax, Virginia, USA

Alice Yepremyan

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

Hailiang Zhang

NASA Goddard Space Flight Center

Greenbelt, Maryland, USA

Preface

The seeds for this book were sown in sessions on Big Data Analytics, held at the 2016 Fall Meeting of the American Geophysical Union. At the time, Earth Science data were projected to rise by orders of magnitude in the coming decade, and the community was investigating a variety of emergent technologies and techniques to make the best use of the coming deluge. The chapters of this book are a representative, but by no means exhaustive, collection of those and similar investigations.

Big Earth Data Analytics can be defined as the application of increasingly sophisticated tools for data analysis and display to the rapidly increasing volume of Earth science data, to obtain information and, eventually, insight. This combines two concepts: Big Earth Data and Data Analytics. Big Earth Data refers both to the volume of data sets and to the combination of data from a variety of sources, formats, and disciplines. To get a sense of the volume: NOAA generates tens of terabytes of data a day from satellites, radars, ships, weather models, and other sources; the National Aeronautics and Space Administration (NASA) Earth Observation archives were growing by more than 30 TB per day in 2020, with daily growth expected to reach 130 TB/day by 2024 as new satellites launch; and the European Centre for Medium‐Range Weather Forecasts (ECMWF) meteorological data archive adds 200 terabytes of new data daily. However, the data are "big" not only in their volume but also in their varied formats, disciplines, and structures. As such, they disrupt traditional analysis methods and the kinds of questions that researchers can ask. Data analytics are increasingly driven by the availability of high‐volume, heterogeneous data sets. Data size and complexity affect all aspects of data management and usage, requiring new approaches and tools. Despite the challenges of acquiring, using, and analyzing Big Earth Data, they are already used extensively in climate, oceanographic, and biological research. Readily available data make it possible to analyze longer records and patterns over large spatial domains.
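For a rough sense of scale, the daily ingest rates quoted above can be converted into annual archive growth. The sketch below is purely illustrative and uses the decimal convention (1 PB = 1,000 TB):

```python
# Daily ingest rates cited in the text (TB/day).
TB_PER_DAY_2020 = 30    # NASA Earth Observation archives, 2020
TB_PER_DAY_2024 = 130   # projected rate as new satellites launch

def annual_growth_pb(tb_per_day: float) -> float:
    """Convert a daily ingest rate in TB to annual archive growth in PB
    (decimal convention: 1 PB = 1,000 TB)."""
    return tb_per_day * 365 / 1000

print(annual_growth_pb(TB_PER_DAY_2020))  # 10.95 PB of new data per year
print(annual_growth_pb(TB_PER_DAY_2024))  # 47.45 PB of new data per year
```

At these rates, a single year of ingest already dwarfs what typical local storage can hold, which is the core argument for cloud-based archives and analysis.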

Analyses of these data borrow both from traditional scientific analyses and from tools developed for business applications. These types of data analytics, developed by university and other research teams, are increasingly an area of interest to cloud providers and analytics companies. From Google's Earth Engine for analyzing Earth science data at scale to the National Oceanic and Atmospheric Administration's (NOAA's) Big Data Program, big data about the Earth and their analysis are increasingly common. Amazon's Elastic MapReduce and SageMaker are common building blocks for cloud‐based analysis, and Galileo (a.k.a. Service Workbench) is Amazon's latest Web application for interactive analysis. Microsoft Azure ML Studio is another popular cloud‐based data analysis solution. Big Earth Data analyses increasingly rely on cloud‐based storage and processing capabilities as the volume of the data and the computing resources needed go beyond what local resources can provide.

This book is organized into three parts. It starts with the big picture, covering Big Data Analytics Architecture. This part begins with a chapter addressing the geospatial aspect of Big Earth Data from a variety of perspectives. This is followed by a chapter discussing the data management challenges posed by data at scale, particularly in the context of making them available for analysis, complemented by a chapter on the challenges of scaling up the analysis itself. The following chapters cover large‐scale projects such as NASA's Earth Exchange, which enables large‐scale data analysis in a supercomputing environment, and the NOAA Big Data Project, which makes data sets available to end users via several cloud providers. Part I also includes chapters on architectures and fully realized systems, such as Data Cube, NEXUS, and the Apache Science Data Analytics Platform, and a NoSQL‐based platform for exploring and analyzing in situ data.

The second part of the book, Analysis Methods for Big Earth Data, addresses some specific techniques to derive information and/or insight from big data, emphasizing the unique aspects of Earth Observations. Part II begins with two chapters on the use of geospatial statistics for analysis, followed by a chapter melding machine learning with geophysical constraints, and finally a chapter benchmarking different analytical methods for spatiotemporal analysis.

The third part of the book, Big Earth Data Applications, describes a few specific applications of big data analysis techniques and platforms: weather and climate model analysis, atmospheric river patterns, Antarctic land surface temperature extremes, satellite‐to‐in situ match‐ups of oceanographic data, and vessel tracking. This is clearly a small sample of existing applications, but it shows how some very different analysis methods can find diverse applications in the Earth sciences.

While big Earth data analytics spans a wide range of applications, a number of common themes run through the chapters of this book: (1) the role of the cloud, especially with ever-increasing data sizes; (2) the limitations and costs of using the cloud, including the unpredictability of costs and the high cost of data egress; (3) techniques to maintain data integrity during file transfers; (4) efficiencies gained via partial reads from Web object storage; (5) the use of data/object stores; (6) serverless and other intrinsic functions to standardize computations; (7) data pipelines and the use of Docker to encapsulate analyses; (8) the development of application programming interfaces; (9) GeoTIFF, Zarr, and Parquet as cloud file formats for satellite and in situ data; and (10) hard limits on data sizes in the cloud, which are especially important with satellite data.
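Partial reads from Web object storage, one of the themes above, deserve a brief illustration. Rather than download a whole file, a client requests only the byte range holding the values it needs; the offset arithmetic is the same whether the request is an HTTP `Range` header or a local slice. The sketch below is a toy model with an assumed flat float64 layout and chunk size, not the layout of any particular format such as Zarr:

```python
import struct

# A flat array of 100 float64 values stands in for an object in cloud storage.
obj = struct.pack("<100d", *(float(i) for i in range(100)))

CHUNK = 25     # values per chunk (illustrative choice)
ITEMSIZE = 8   # bytes per float64

def chunk_byte_range(chunk_index):
    """Byte offsets (start, stop) of one chunk within the object; a client
    would send these as 'Range: bytes=start-(stop-1)' in an HTTP request."""
    start = chunk_index * CHUNK * ITEMSIZE
    return start, start + CHUNK * ITEMSIZE

def read_chunk(obj, chunk_index):
    """Fetch and decode a single chunk without touching the rest of the object."""
    start, stop = chunk_byte_range(chunk_index)
    return list(struct.unpack(f"<{CHUNK}d", obj[start:stop]))

third_chunk = read_chunk(obj, 2)   # values 50.0 .. 74.0; only 200 bytes read
```

The payoff grows with object size: reading one 200-byte chunk of a multi-gigabyte object costs the same single range request.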

While the chapters in this book provide a broad introduction to the subject, many opportunities remain to address the challenges posed by big data analytics, such as incorporating new data sources, implementing data standards, optimizing the use of cloud and supercomputing resources, and incorporating artificial intelligence and machine learning. As these challenges are surmounted, the computing power and agile infrastructure of the cloud will support the emergence of important new analyses and insights, in turn supporting new policy making. At the same time, the solutions raise new policy challenges of their own. The use of cloud resources for data storage and analysis can both enable and complicate access to the data and the analysis methods for the wider community, particularly as that community broadens to new application, education, and citizen scientist users. Data egress fees or cloud provider‐specific tools, for example, may impair long‐term data preservation, scientific reproducibility, and basic equity.

Thomas Huang

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, California, USA

Tiffany C. Vance

U.S. Integrated Ocean Observing System

National Oceanic and Atmospheric Administration

Silver Spring, Maryland, USA

Christopher Lynnes

NASA Goddard Space Flight Center

Greenbelt, Maryland, USA (retd.)

1AN INTRODUCTION TO BIG DATA ANALYTICS

Erik Hoel

Environmental Systems Research Institute, Redlands, California, USA

Big data analytics, in the context of geospatial data, employs distributed computing using advanced tools that support spatiotemporal analysis, spatial statistics, and machine learning algorithms and techniques (e.g., classification, clustering, and prediction) on very large spatiotemporal data sets to visualize data, detect patterns, gain deeper understanding, and answer questions. In this chapter, the key definitions, domain‐specific problems, analysis concepts, current technologies and tools, and remaining challenges are discussed.

1.1 Overview

Big data analytics involves analyzing large volumes of varied data, or big data, to identify and understand patterns, correlations, and trends that ordinarily are invisible because of the volumes involved, allowing users and organizations to make better decisions. These analytics, in the context of geospatial data, commonly involve spatial processing, sophisticated spatial statistical algorithms, and predictive modeling. Big data can be obtained from a wide variety of sources, including sensors (both stationary and moving), aerial and satellite imagery, lidar, videos, social networks, website activity, sales transaction records, and real‐time stock trading transactions. Users and data scientists apply big data analytics to evaluate these large collections of data, data with volumes that traditional analytical systems are unable to accommodate (Miller & Goodchild, 2014). This is particularly the case with unstructured or semistructured data (such data types are problematic for data warehouses, which often utilize relational database concepts and work with structured data).

To address these complex demands, many new analytic environments and technologies have been developed. This includes distributed processing infrastructures such as Spark and MapReduce (Dean & Ghemawat, 2008; Garillot & Maas, 2018; Zaharia et al., 2010), distributed file stores, and NoSQL databases (Alexander & Copeland, 1988; DeWitt & Gray, 1992; Klein et al., 2016; NoSQL, 2022; Pavlo & Aslett, 2016). Many of these technologies are available in open‐source software frameworks, such as Apache Hadoop (2018), that can be used to process huge data sets with clustered systems.

When working with big data, there is a collection of objectives that users have when performing big data analytics (Marz & Warren, 2013; Mysore et al., 2013). These include

Discovering value from big data

. Visualize and analyze big data in a way that reveals patterns, trends, and relationships that traditional reports and spatial processing do not. Data may exist in many disparate places, streams, or web logs.

Exploiting streaming data

. Filter and convert raw streaming data from various sources, which contain geographical elements, into geographic layers of information. The geographical layers can then be used to create new, more useful maps and dashboards for decision making.

Exposing geographic patterns

. Use maps and visualization to see the story behind the data. Examples of identifying geographical patterns include retailers seeing where promotions are most effective and where the competition is, banks understanding why loans are defaulting and where there is an underserved market, climate‐change scientists determining the impact of shifting weather patterns.

Finding spatial relationships

. Seeing spatially enabled big data on a map allows you to answer questions and ask new ones. Where are disease outbreaks occurring? Where is insurance risk greatest given recently updated population shifts? Geographic thinking adds a new dimension to big data problem solving and helps you make sense of big data.

Performing predictive modeling

. Predictive modeling using spatially enabled big data helps you develop strategies from if/then scenarios. Governments can use it to design disaster response plans. Natural resource managers can analyze recovery of wetlands after a disaster. Health service organizations can identify the spread of disease and ways to contain it.

1.1.1 What Differentiates Spatial Big Data

Spatial big data are differentiated from standard (nonspatial) big data by the presence of spatial relationships, geostatistical correlations, and spatial semantic relations (this can be generalized to include the temporal domain; Hägerstrand, 1970). Spatial big data offer additional challenges beyond what is encountered with more traditional big data. Spatial big data are characterized by the following (Barwick, 2011):

Volume

. The quantity of data. Spatial big data also include global satellite imagery, mobile sensors (smart phones, GPS trackers, and fitness monitors), and georeferenced digital camera imagery.

Variety

. Spatial data are composed of 2D or 3D vector data or raster imagery. Spatial data are more complex and subsume the types found with conventional big data.

Velocity

. Velocity of spatial data is significant given the rapid collection of satellite imagery in addition to mobile sensors.

Veracity

. For vector data (points, lines, and polygons), the quality and accuracy vary. Quality is dependent upon whether the points have been GPS determined, determined by unknown origins, or determined manually. Resolution and projection issues can also alter veracity. For geocoded points, there may be errors in the address tables and in the point location algorithms associated with addresses. For raster data, veracity depends on accuracy of recording instruments in satellites or aerial devices, and on timeliness.

Value

. For real‐time spatial big data, decisions can be enhanced through visualization of dynamic change in such spatial phenomena as climate, traffic, social‐media‐based attitudes, and massive inventory locations. Exploration of data trends can include spatial proximities and relationships.

Once spatial big data are structured, formal spatial analytics can be applied, such as spatial autocorrelation, overlays, buffering, spatial cluster techniques, and location quotients.

1.2 Definitions

The terms in Table 1.1 are referenced in this chapter and are included here to facilitate a more rapid understanding of the general concepts discussed later.

Table 1.1 Terms for understanding general concepts

Amazon Web Services

(AWS) A secure, on‐demand, cloud computing platform where users pay for the computing resources that they consume (e.g., computing, database storage, and content delivery).

Artificial Intelligence

Computer systems or machines that are able to perform tasks and mimic behavior that normally requires human intelligence, such as visual perception, speech recognition, and language translation.

Big Data as a Service (BDaaS)

Cloud‐based hardware and software services that support the analysis of large or complex data sets. These services can provide data, analytical tools, event‐driven processing, visualization, and management capabilities.

Cloudera

A software company that provides a software platform that can run either in the cloud or on‐prem, supporting data warehousing, machine learning, and big data analytics. The company is a major contributor to the Apache Hadoop platform (e.g., Avro, HBase, Hive, and Spark).

Computer Vision

A scientific discipline that focuses on the acquisition, extraction, analysis, and understanding of information obtained from either single or multidimensional image or video data.

Data as a Service (DaaS)

Built on top of software as a service, data are provided to users on demand for further processing and analysis. The centralization of the data enables higher quality curated data at a lower cost to the client.

Databricks

A company that provides a cloud‐based platform for working with Apache Spark. Databricks traces its origins to the AMPLab project at UC Berkeley that evolved into an open‐source distributed computing framework for working with big data.

Data Mining

The process of discovering and extracting hidden patterns and knowledge found in big data using methods and techniques that are commonly associated with database management, machine learning, and statistics.

Deep Learning

A subfield of machine learning that focuses on algorithms and computational architectures that mimic the structure of the brain (commonly termed artificial neural networks). Recent advances in large‐scale distributed processing have enabled the development and use of very large neural networks.

Elastic Compute Cloud (EC2)

Infrastructure within Amazon Web Services (AWS) that provides scalable computing capacity; clients can develop, deploy, and run their own applications. EC2 is elastic and allows clients to scale their compute and storage up or down as necessary.

Hadoop

An open‐source framework and set of software modules that enable users to solve problems on big data sets using a distributed cluster of hardware resources. This includes distributed data storage and computation using the MapReduce programming model. Apache Hadoop was originally inspired by Google's work in the distributed processing domain.

HDFS

A distributed and scalable file system and data store that is part of Apache Hadoop. HDFS stores big data files across a cluster of machines and supports high reliability by replication of the data across different nodes in the cluster.

Hive

Data warehouse software module in Apache Hadoop that facilitates querying and analyzing big data stored in HDFS in a distributed and replicated manner using a SQL‐like language termed HiveQL.

IBM Cloud

A set of cloud computing capabilities and services that provides capabilities including Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).

Infrastructure as a Service (IaaS)

A type of cloud computing infrastructure that virtualizes computing resources, storage, data partitioning, scaling, and networking. Unlike Software as a Service (SaaS) or Platform as a Service (PaaS), IaaS clients must maintain the applications, data, middleware, and operating system.

Machine Learning

A subset of artificial intelligence where software systems can automatically learn and improve without any explicit programming, relying upon statistical methods for pattern detection and inference. Machine learning software creates statistical models using sample data in order to make decisions or predictions.

MapReduce

A programming model, originally developed at Google, that is often used when processing big data sets in a distributed manner. MapReduce programs contain a map procedure where data can be sorted and filtered, and a reduce procedure where summary operations are performed. MapReduce systems, such as Apache Hadoop, are responsible for managing communications and data transfer among the collection of distributed processing nodes.

Microsoft Azure

A cloud computing service from Microsoft for creating, deploying, and managing applications using data centers managed by Microsoft. Hundreds of services are available that provide functionality related to compute, data management, messaging, mobile, and storage capabilities.

Natural Language Processing (NLP)

A portion of artificial intelligence that focuses on enabling computers to understand and communicate (including language translation) through human language, both written and spoken.

NoSQL data stores

A non‐SQL or non‐relational database that provides a mechanism for storage and retrieval of data. NoSQL data stores often trade consistency in favor of availability, speed, horizontal scalability, and partitionability.

Oracle Cloud

A collection of cloud computing services from Oracle providing servers, storage, network, applications, and services using Oracle‐managed data centers. Oracle Cloud provides Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS), and Data as a Service (DaaS).

Pig

An Apache platform to develop programs for analyzing big data sets that run on Apache Hadoop using a high‐level language (Pig Latin). Pig can be used to develop functionality that runs as MapReduce, Tez, or Spark jobs.

Platform as a Service (PaaS)

A category of cloud computing service that allows clients to develop, deploy, run, and manage applications without needing to build or maintain the cloud computing infrastructure. Unlike software as a service (SaaS), the client is responsible for maintaining the applications and data.

Predictive Analytics

A group of statistical and machine learning algorithms that are used to predict the likelihood of future or other unknown events based upon existing historical data.

Real‐time Data Processing

A collection of software and hardware that processes data on‐the‐fly and is subject to a constraint where responses must be provided within a short interval of time (e.g., fractions of a second), independent of system or event data load.

Redshift

A column‐oriented, fully managed, data warehouse for big data. Redshift is similar to other columnar data stores in that it is intended to scale out across distributed clusters of low‐cost hardware.

Simple Storage Service (S3)

An object storage service offered by Amazon Web Services (AWS); it is intended to store any type of data (objects) that can later be used for big data analytic processing.

Software as a Service (SaaS)

A category of cloud computing service that allows clients to license applications, web‐based software, on‐demand software, and hosted software. The delivery model is on a subscription basis and is centrally hosted. Differing from Platform as a Service (PaaS), SaaS does not require the client to manage either data or software.

Spark

An analytic engine and cluster‐computing framework in the Apache Hadoop ecosystem that supports applications running across a distributed cluster. Originally developed at UC Berkeley in 2009, it provides a framework for programming clusters of machines with data parallelism.

Speech Recognition

A collection of methodologies and techniques that enables the recognition and transformation of spoken language into text for further computational processing.

Storm

A real‐time, distributed, high‐volume, stream‐processing framework for big data. It is an open‐source Apache project commonly used alongside the Hadoop ecosystem.

Stream Processing

A computer programming paradigm (similar to dataflow programming), where given a sequence of data (a stream), a series of pipelined operations (or kernel functions) is applied to each element in the stream.
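As a concrete illustration of the MapReduce model defined above, the classic word‐count example can be sketched in plain Python. This is an in‐memory sketch for intuition only; real MapReduce systems such as Apache Hadoop distribute the map, shuffle, and reduce steps across a cluster:

```python
from collections import defaultdict
from itertools import chain

# The MapReduce model: a map step emits (key, value) pairs, a shuffle
# groups them by key, and a reduce step summarizes each group.

def map_step(record):
    # Emit (word, 1) for each word in a line of text.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    return (key, sum(values))

def map_reduce(records):
    pairs = chain.from_iterable(map_step(r) for r in records)
    return dict(reduce_step(k, v) for k, v in shuffle(pairs).items())

counts = map_reduce(["big data", "big earth data"])
```

The framework's value lies in running the map and reduce procedures on many nodes in parallel while managing the shuffle and any node failures transparently.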

1.3 Example Problems

There are a significant number of industries and application domains that benefit from spatiotemporal big data analytics (Hey et al., 2009). As the number of processes and technologies collecting spatial data has grown, so have the ubiquity and significance of the data. Spatial big data analytics has wide applicability and value across numerous domains; a few of these are the following.

1.3.1 Agriculture

Farmers can use spatial big data analytics to detect and analyze patterns in weather data, correlated with historical crop yields, surface topography, and soil characteristics. This helps farmers determine the best seed varieties to use and times and places to plant crops in order to maximize yields. In addition, the distribution of fertilizer can be optimized based upon historical information. Tractor and heavy equipment movement can also be tracked via GPS and incorporated into the logistic optimization analytics, and the areas of usable and productive land within a field can be identified.

1.3.2 Commerce

Commercial retailers have always used local shopping patterns and demographics to drive marketing strategies and site selection. However, retailers can now use spatial big data analytics to analyze the locations and characteristics of customers along with social media conversations and browsing behavior in order to better understand customers' needs. Retailers can essentially build a richer and more useful understanding and relationship with their customer base. New store site selection on regional or national levels can be optimized based on the locations of customers, competitors, and other nontraditional data.

1.3.3 Connected Cars

Developers of systems for connected cars and autonomous vehicles can use spatial big data analytics to provide accurate situational awareness to drivers and vehicles about their surrounding environment. Systems can apply analytics capabilities such as road snapping, predictive road snapping, change detection of objects sensed by the vehicle but not on the map, and accident prediction. All of this serves the goal of improved vehicle reliability and passenger safety.

1.3.4 Environment

Environmental organizations can employ spatial big data analytics to answer a number of important questions including whether there are spatiotemporal correlations between species observations (this can be by geographic area or species).

1.3.5 Financial Services

In the financial services/insurance industry, spatial big data analytics are used to overlay weather data with claim data to assist companies in detecting possible instances of fraud. In other contexts, non‐traditional data sources like satellite imagery are combined with traditional topographic data sources to identify the potential risk of offering flood insurance. Insurers can also assess spatial relationships between their insurance portfolios and past hazards to balance risk exposure. Finally, banks can use spatiotemporal historical transaction data to help them detect evidence of fraud.

1.3.6 Government Agencies

National and regional government agencies would like to use spatial big data analytics to process and overlay nationwide data sets containing land use, parcels, planning information, geological information, and environmental data in order to create information products that can be used by analysts, scientists, and policy makers to make better policy decisions.

1.3.7 Health Care

Public health agencies can use spatial big data analytics to see how far patients are from health facilities, helping them evaluate access to care. Hospital networks can determine the density of hospitals in certain areas to identify gaps and opportunities. They can also measure the prevalence of certain habits and illnesses in the community using demographic data. Public health agencies can also utilize tracking data to perform contact tracing of infected individuals to identify whom they have been in contact with in the past. The contact information can then be utilized to help reduce infections in the general population. Proximity tracing is a variant in which contact is specified using proximity‐based filtering criteria (e.g., a spatial and temporal range) in order to identify potential contact events.
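The proximity‐based filtering just described can be sketched in a few lines of Python. This is a hypothetical, minimal example (the record layout, function name, and thresholds are illustrative assumptions, not a production contact‐tracing method):

```python
import math

# Hypothetical proximity-tracing filter: two track points are a contact
# event if they fall within both a spatial range (meters) and a temporal
# range (seconds). Track points are (id, t_seconds, x_m, y_m).

def contact_events(tracks, case_id, dist_m, dt_s):
    case = [p for p in tracks if p[0] == case_id]
    events = set()
    for _, ct, cx, cy in case:
        for pid, t, x, y in tracks:
            if (pid != case_id and abs(t - ct) <= dt_s
                    and math.hypot(x - cx, y - cy) <= dist_m):
                events.add(pid)
    return events

tracks = [("case", 0, 0.0, 0.0), ("b", 5, 1.0, 1.0), ("c", 1000, 0.0, 0.0)]
contacts = contact_events(tracks, "case", dist_m=5.0, dt_s=60)
```

Here individual "b" is flagged because it was both near the case and near it in time, while "c" shared a location but not a time window.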

1.3.8 Marketing

Geospatial big data analytics is frequently used in corporate marketing for prospect and customer segmentation. Data from body sensors (e.g., smart phones, smart watches, fitness monitors) can be used to segment the customer base according to physical activity or behavioral patterns and deliver advertising in a targeted manner. Companies also want to be able to identify where their customers are in relation to their competitors' customers. This allows them to identify areas where they are losing the market and help determine where they need to focus their marketing efforts.

1.3.9 Mining

Mining companies can apply spatial big data analytics to perform complex vehicle tracking analysis to find ways to better manage equipment moves. For example, they can analyze patterns of equipment locations when braking, and they can review shock absorption, RPM changes, and other telematics information. They can also analyze geochemical sample results.

1.3.10 Petroleum

Spatial big data analytics enable petroleum companies to identify suitable areas for exploration based upon historical production, geographic composition, and competitor activity (including leasing activity). Spatial big data analytics can also be used to review historical production data to assess reservoir production over time. Vehicle tracking data can be analyzed to determine time spent on both commercial and noncommercial roads. They can also review vessel tracks over offshore blocks using AIS vessel tracking information.

1.3.11 Retail

Retailers can use spatial big data analytics to model retail networks and help them select the best sites to optimize their store network. Analytic results can be used to create customer profile maps, allowing retailers to better understand customer behavior and the factors that influence their behavior. Retailers also want to spatially analyze the types of products that consumers are buying based upon seasonal and weather‐related stimuli. This often incorporates promotions and sale activity. The spatiotemporal analysis can extend to a very fine‐grained level, for example, hourly sales activity on Black Friday.

1.3.12 Telecommunications

Telecommunications companies can use spatial big data analytics to review spatial trends in bandwidth usage over time to help plan new network deployments. They can analyze spatial patterns in consumer habits, spending patterns, demographics, and service purchases to improve marketing, define new products, and help plan network expansions. Customer service departments can correlate network problems and trouble tickets with customer complaints or cancellations to determine where and when service issues have led to customer dissatisfaction. Call detail records can be used to identify areas where cellular service is problematic (quality, speed, coverage), both temporally and spatially.

1.3.13 Transportation

With spatial big data analytics, commercial delivery companies can reconstruct vehicle routes from millions of individual position reports to check for routing inefficiencies and identify incidents of unsafe speeding and braking. This level of visibility into past trips helps them develop strategies to improve efficiency and safety. Transportation planners can also use spatial big data analytics to aggregate, visualize, and analyze historical crash data for metropolitan areas, helping them identify unsafe road conditions. State and regional transportation agencies can analyze and model traffic slowdowns and congestion in order to optimize future road construction and rapid transit planning activities. City mobility planning (encompassing buses, ride sharing, and public bike systems) makes heavy use of spatiotemporal big data analytics in optimizing route planning and resource deployments in order to maximize throughputs and minimize congestion delays.

1.3.14 Utilities

Geospatial big data analytics is used by utility companies to summarize and analyze customer usage patterns across a service area. They can assess customer usage through time and correlate usage to weather patterns, helping them anticipate future demand. Utilities can also use spatial big data analytics to analyze Supervisory Control and Data Acquisition (SCADA), smart meter, and other sensor data to detect and quantify potential problems in the distribution network, such as when and where outages occur, whether they correlate with weather events, and how many customers are affected. They can use this information to prioritize maintenance activities and prevent or mitigate future problems. Public utility commissions consume raw energy data from utilities and prepare future forecasts of energy consumption. Energy efficiency can also be studied to determine what the seasonal impacts are and what can be done to guide consumers toward smarter energy usage (Fig. 1.1).

Figure 1.1 Leveraging feature binning technology to see geographic trends between industrial emission activity in 2014 (small hexes) as reported in the EPA Toxic Release Inventory and total U.S. electrical generation by load (large hexes) in 2018 as published by the Homeland Infrastructure Foundation‐Level Data.

1.4 Big Data Analysis Concepts

The type of analysis that may be performed against spatial big data often parallels that which is typically done with traditional spatial data (Longley et al., 2015). However, when working with big data, it is oftentimes necessary to identify the key or most significant subsets of data in the larger collection. Once the interesting data are identified, further detailed analysis using the full breadth of spatiotemporal analysis tools and techniques can then be applied. This is particularly common when working with spatial big data that are obtained from sensors.

1.4.1 Summarizing Data

Summarizing data encompasses operations that calculate total counts, lengths, areas, and basic descriptive statistics of features and their attributes within areas or near other features (Fig. 1.2). Common operations that summarize data include the following.

Figure 1.2 Ridesharing pick up locations in midtown Manhattan. In the southern portion of the figure, the raw data are shown. The northern region shows the data aggregated into 250 m height hexagon cells.

Aggregations

aggregate points into polygon features or bins. At all locations where points exist, a polygon is returned with a count of points as well as optional statistics.

Joins

match two data sets based upon their spatial, temporal, or attribute relationships (Abel et al., 1995). Spatial joins match features based upon their spatial relationships (e.g., overlapping, intersecting, within distance); temporal joins match features based upon their temporal relationships; and attribute joins match features based upon their attribute values.

Track reconstruction

creates line tracks from temporally enabled, moving point features (e.g., positions of cars, aircraft, ships, or animals).

Summarization

overlays one data set on another and calculates summary statistics representing these relationships. For example, one set of polygons may be overlaid on another data set in order to summarize the number of polygons, their area, or attribute statistics.
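A minimal sketch of the aggregation operation described above, assuming square bins rather than the hexagons of Figure 1.2 and planar coordinates in meters (the function name and data are illustrative):

```python
from collections import Counter

# Aggregate point features into grid cells and count points per cell,
# as in binning ridesharing pickups. Each cell is identified by the
# coordinates of its lower-left corner.

def bin_points(points, cell_size):
    counts = Counter()
    for x, y in points:
        cell = (int(x // cell_size) * cell_size,
                int(y // cell_size) * cell_size)
        counts[cell] += 1
    return counts

pickups = [(10, 20), (30, 40), (260, 10), (270, 30)]
cells = bin_points(pickups, cell_size=250)
```

Additional statistics (means, sums of attribute values) can be accumulated per cell in the same pass, which is what makes aggregation attractive for very large point sets.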

1.4.2 Identify Locations

Location identification involves identifying areas that meet a number of different specified criteria. The criteria can be based on attribute queries (for example, parcels that are vacant) and spatial queries (for example, within 1 km of a river). The areas that are found can be selected from existing features (such as existing land parcels), or new features can be created where all the requirements are met. Common operations that are used to identify locations include (1) incident detection, which detects all features that meet specified criteria (e.g., lightning strikes exceeding a given intensity), and (2) similarity, which identifies the features that are either the most similar or least similar to another set of features based upon attribution.
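Incident detection of this kind combines an attribute criterion with a spatial one. The sketch below, with illustrative names and planar coordinates in meters, selects lightning strikes that exceed a given intensity and fall within 1 km of a station:

```python
import math

# Incident detection: keep features meeting both an attribute criterion
# (minimum intensity) and a spatial criterion (within radius_m of a
# reference location). Each strike is (x_m, y_m, intensity).

def within(p, q, radius_m):
    return math.hypot(p[0] - q[0], p[1] - q[1]) <= radius_m

def detect_incidents(strikes, station, min_intensity, radius_m=1000):
    return [s for s in strikes
            if s[2] >= min_intensity and within(s[:2], station, radius_m)]

strikes = [(100, 200, 35.0), (5000, 0, 90.0), (300, 400, 10.0)]
hits = detect_incidents(strikes, station=(0, 0), min_intensity=30.0)
```
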

1.4.3 Pattern Analysis

Pattern analysis involves identifying, quantifying, and visualizing spatial patterns in spatial data (Bonham‐Carter, 1994; Golledge & Stimson, 1997). Identifying geographic patterns is important for understanding how geographic phenomena behave.

Although it is possible to understand the overall pattern of features and their associated values through traditional mapping, calculating a statistic quantifies the pattern (Vapnik, 2000). Statistical quantification facilitates the comparison of patterns with different distributions or across different time periods. Pattern analysis tools are often used as a starting point for more in‐depth analyses. For example, spatial autocorrelation can be used to identify distances where the processes promoting spatial clustering are most pronounced. This might help the user to select an appropriate distance (scale of analysis) to use for investigating hot spots (hot spot analysis using the Getis‐Ord Gi* statistic) (Fig. 1.3).

Figure 1.3 Tornado hotspots (+) and reported start points across the United States from 1950 to 2018. Hotspots are calculated using the Getis‐Ord Gi* statistic on tornado geographic frequency and weighted by severity (Fujita Scale 0–5) to determine locations with a higher risk of damage based upon reported historical events (p‐value < 0.05; z‐score > 3). Tornado data from the NOAA Storm Prediction Center for Severe Weather.

Pattern analysis tools are used for inferential statistics; they start with the null hypothesis that features, or the values associated with the features, exhibit a spatially random pattern. They then compute a p‐value representing the probability that the null hypothesis is correct (that the observed pattern is simply one of many possible versions of complete spatial randomness). Calculating a probability may be important if you need to have a high level of confidence in decision making. If there are public safety or legal implications associated with your decision, for example, you may need to justify your decision using statistical evidence.
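For intuition, a highly simplified version of the Getis‐Ord Gi* z‐score can be written directly from its textbook definition. This sketch assumes binary neighborhood weights and omits the p‐value calculation and corrections that production tools provide:

```python
import math

# Simplified Getis-Ord Gi* z-score for one focal feature. `w` is the
# row of spatial weights for that feature (binary neighborhood weights,
# including the feature itself, per the Gi* convention). A large
# positive z suggests a hot spot; a large negative z, a cold spot.

def gi_star(values, w):
    n = len(values)
    xbar = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - xbar ** 2)
    wx = sum(wi * xi for wi, xi in zip(w, values))
    sw, sw2 = sum(w), sum(wi * wi for wi in w)
    return (wx - xbar * sw) / (s * math.sqrt((n * sw2 - sw ** 2) / (n - 1)))

severity = [1, 1, 1, 10, 10]
w = [0, 0, 0, 1, 1]       # neighborhood of feature 3: itself and feature 4
z = gi_star(severity, w)  # positive z: candidate hot spot
```
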

1.4.4 Cluster Analysis

Cluster analysis is used to identify the locations of statistically significant hot spots, spatial outliers, and similar features (Ester et al., 1996) (Fig. 1.4). Cluster analysis is particularly useful when action is needed based on the location of one or more clusters. An example would be the assignment of additional police officers to deal with a cluster of burglaries. Pinpointing the location of spatial clusters is also important when looking for potential causes of clustering; where a disease outbreak occurs can often provide clues about what might be causing it. Unlike pattern analysis (which is used to answer questions such as, "Is there spatial clustering?"), cluster analysis supports the visualization of cluster locations and extents. Cluster analysis can be used to answer questions such as, "Where are the clusters (hot spots and cold spots)?", "Where are incidents most dense?", "Where are the spatial outliers?", and "Which features are most alike?"

Figure 1.4 Spatiotemporal clustering (DBSCAN, Density‐Based Spatial Clustering of Applications with Noise) of ridesharing drop‐off locations in midtown Manhattan. The analysis identifies clusters (darker points in the figure) where many drop‐offs occurred at a similar place and time; the minimum cluster size is 15 events.
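A minimal sketch of the DBSCAN algorithm used in Figure 1.4 is shown below. It uses a naive O(n²) neighbor search and planar distances; real implementations rely on spatial indexes and, for big data, distributed variants:

```python
import math

# Minimal DBSCAN: eps is the neighborhood radius, min_pts the minimum
# neighborhood size for a core point. Points not reachable from any
# core point are labeled noise (-1).

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # noise (may become a border point later)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point; do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            js = neighbors(j)
            if len(js) >= min_pts:   # core point; expand the cluster
                queue.extend(js)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=2)
```

Here the first three points form one cluster, the next two another, and the isolated point is labeled noise. Adding time as a third coordinate (with an appropriate scaling) yields the spatiotemporal variant used in the figure.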

1.4.5 Proximity Analysis

Proximity analysis allows people to answer one of the most common questions posed in spatial analysis: “What is near what?” This type of analysis supports the determination of proximal features within one or more data sets; for example, identify features that are closest to one another or calculate the distances between or around them. Common analysis methods include the following:

Distance calculation: The Euclidean distance from a single source or set of sources.

Travel cost calculation: The least accumulative cost distance from or to the least‐cost source, while accounting for surface distance along with horizontal and vertical cost factors.

Optimal travel cost calculation: The optimum cost network from a set of input regions. One example application of this tool is finding the best network for emergency vehicles.
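The first of these, Euclidean distance to the nearest feature, can be sketched as follows (planar, projected coordinates are assumed; the names and data are illustrative):

```python
import math

# Basic proximity question ("what is near what?"): for each incident,
# find the nearest facility by straight-line (Euclidean) distance.

def nearest(point, facilities):
    return min(facilities, key=lambda f: math.dist(point, f))

incidents = [(2, 3), (40, 1)]
facilities = [(0, 0), (38, 0)]
pairs = [(p, nearest(p, facilities)) for p in incidents]
```

Travel-cost calculations replace `math.dist` with accumulated costs over a surface or network, which is what distinguishes them from simple straight-line proximity.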

1.4.6 Predictive Modeling

Predictive analytics builds models to forecast behavior and other future developments. It encompasses techniques from spatial statistics, data mining, machine learning, and artificial intelligence (Minsky, 1986; Newell et al., 1959; Pedregosa et al., 2011). Patterns are identified in historical data and are used when creating models for future events.

Machine learning uses algorithms and statistical models to analyze large data sets without using explicit sequences of instructions. Machine learning algorithms create a model of training data that is used to make optimized predictions and decisions. Machine learning is considered to be a subset of artificial intelligence.
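As a minimal illustration of building a model from historical data, the sketch below fits a least‐squares trend line to historical observations and forecasts the next period. Real predictive analytics would use far richer models and spatial features; the data and names here are illustrative:

```python
# Ordinary least-squares fit of y = m*x + b to historical observations,
# then a one-step-ahead forecast.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = ((sum(x * y for x, y in zip(xs, ys)) - n * mx * my)
             / (sum(x * x for x in xs) - n * mx * mx))
    return slope, my - slope * mx

years = [0, 1, 2, 3]
values = [2.0, 2.5, 3.0, 3.5]   # e.g., historical crop yields
m, b = fit_line(years, values)
forecast = m * 4 + b            # predict the next period
```
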

Deep learning is a subset of machine learning in which models resembling biological nervous systems are arranged in multiple layers, where each layer uses the output of the preceding layer as input to create a more abstract and composite representation of the data (LeCun et al., 2015). Deep learning architectures include deep neural networks, belief networks, and recurrent neural networks. Deep learning is commonly used in the domains of natural language processing, computer vision, and speech recognition.

1.5 Technology and Tools

There are several key technologies that are commonly employed to process large volumes of spatial