Applying tools for data analysis to the rapidly increasing volume of data about the Earth

An ever-increasing volume of Earth data is being gathered. These data are "big" not only in size but also in their complexity, their diverse formats, and the varied scientific disciplines they span. As such, big data are disrupting traditional research. New methods and platforms, such as the cloud, are tackling these new challenges. Big Data Analytics in Earth, Atmospheric, and Ocean Sciences explores new tools for the analysis and display of the rapidly increasing volume of data about the Earth. Volume highlights include:

* An introduction to the breadth of big Earth data analytics
* Architectures developed to support big Earth data analytics
* Different analysis and statistical methods for big Earth data
* Current applications of analytics to Earth science data
* Challenges to fully implementing big data analytics

The American Geophysical Union promotes discovery in Earth and space science for the benefit of humanity. Its publications disseminate scientific knowledge and provide resources for researchers, students, and professionals.
Page count: 506
Year of publication: 2022
Cover
Title Page
Copyright
List of Contributors
Preface
1 AN INTRODUCTION TO BIG DATA ANALYTICS
1.1 Overview
1.2 Definitions
1.3 Example Problems
1.4 Big Data Analysis Concepts
1.5 Technology and Tools
1.6 Challenges
1.7 Summary
References
Part I: Big Data Analytics Architecture
2 INTRODUCTION TO BIG DATA ANALYTICS ARCHITECTURE
References
3 Scaling Big Earth Science Data Systems Via Cloud Computing
3.1 Introduction
3.2 Key Concepts of Science Data Systems (SDSes)
3.3 Increasing Data Processing, Volumes, and Rates
3.4 Cloud Concepts for SDSes
3.5 Architecture Components of Cloud‐Based SDS
3.6 Considerations for Multi‐cloud and Hybrid SDS
3.7 Cloud Economics
3.8 Large‐Scaling Considerations
3.9 Example of Cloud SDSes
3.10 Conclusion
References
4 NOAA Open Data Dissemination (Formerly NOAA Big Data Project/Program)
4.1 Obstacles to the Public's Use of NOAA Environmental Data
4.2 Public Access of NOAA Data Creates Challenges for the Agency
4.3 The Vision for NOAA's “Oddball” Approach to Big Data
4.4 A NOAA Cooperative Institute Data Broker Provides Research and Operational Agility
4.5 Public‐Private Partnerships Provide the Pipeline
4.6 BDP Exceeds Expectations and Evolves into Enterprise Operations
4.7 Engaging Users in the Cloud
4.8 Challenges and Opportunities
4.9 Vision for the Future
Acknowledgments
References
5 A Data Cube Architecture for Cloud‐Based Earth Observation Analytics
5.1 Introduction
5.2 Open Data Cube for the Cloud Design
5.3 S3 Array I/O Performance
5.4 Discussion and Conclusion
References
6 Open Source Exploratory Analysis of Big Earth Data With NEXUS
6.1 Introduction
6.2 Architecture
6.3 Deployment Architecture
6.4 Benchmarking and Studies
6.5 Analytics Collaborative Framework
6.6 Federated Analytics Collaborative Systems
6.7 Conclusion
References
7 Benchmark Comparison of Cloud Analytics Methods Applied to Earth Observations
7.1 Introduction
7.2 Experimental Setup
7.3 AODS Candidates
7.4 Experimental Results
7.5 Conclusions
References
Part II: Analysis Methods for Big Earth Data
8 Introduction to Analysis Methods for Big Earth Data
References
9 Spatial Statistics for Big Data Analytics in the Ocean and Atmosphere: Perspectives, Challenges, and Opportunities
9.1 Spatial Data and Spatial Statistics
9.2 What Constitutes Big Spatial Data?
9.3 Statistical Implications of the Four Vs of Big Spatial Data
9.4 Challenges to the Statistical Analysis of Big Spatial Data
9.5 Opportunities in Spatial Analysis of Big Data
9.6 Conclusion
References
10 Giving Scientists Back Their Flow: Analyzing Big Geoscience Data Sets in the Cloud
10.1 Introduction
10.2 Where's the Opportunity?
10.3 The Future
Reference
11 The Distributed Oceanographic Match‐Up Service
11.1 Introduction
11.2 DOMS Capabilities
11.3 System Architecture
11.4 Workflow
11.5 Future Development
11.6 Acknowledgments
Availability Statement
References
Part III: Big Earth Data Applications
12 Introduction to Big Earth Data Applications
References
13 Topological Methods for Pattern Detection in Climate Data
13.1 Introduction
13.2 Topological Methods for Pattern Detection
13.3 Case Study: Atmospheric Rivers Detection
13.4 Conclusions and Recommendations
Acknowledgments
References
14 Exploring Large Scale Data Analysis and Visualization for Atmospheric Radiation Measurement Data Discovery Using NoSQL Technologies
14.1 Introduction
14.2 Software and Workflow
14.3 Hardware Architecture
14.4 Applications
14.5 Conclusions
Acknowledgments
References
15 Demonstrating Condensed Massive Satellite Data Sets for Rapid Data Exploration: The MODIS Land Surface Temperatures of Antarctica
15.1 Introduction
15.2 Data
15.3 Methods
15.4 Results
15.5 Conclusions
Acknowledgments
Availability Statement
References
16 Developing Big Data Infrastructure for Analyzing AIS Vessel Tracking Data on a Global Scale
16.1 Introduction
16.2 Background
16.3 Use Case: Producing Heat Maps of Vessel Traffic Using AIS Data
16.4 Data Processing Overview
16.5 Future Work
16.6 Conclusions
References
17 Future of Big Earth Data Analytics
17.1 Introduction
17.2 How Data Get Bigger
17.3 The Evolution of Analytics Algorithms
17.4 Analytics Architectures
17.5 Conclusions
References
Index
End User License Agreement
Chapter 1
Table 1.1 Terms for understanding general concepts
Chapter 5
Table 5.1 Different operating characteristics between HPC and cloud deploym...
Table 5.2 EC2 instance types
Table 5.3 Storage types
Chapter 7
Table 7.1 Hardware systems supporting the AODS architectures
Chapter 9
Table 9.1 Examples of spatial algorithms and their associated time complexi...
Chapter 13
Table 13.1 List of data sources used in the experiments
Table 13.2 Classification accuracy score of the SVM classifier for 3 hourly...
Table 13.3 Precision and sensitivity scores calculated for all data sets li...
Chapter 1
Figure 1.1 Leveraging feature binning technology to see geographic trends be...
Figure 1.2 Ridesharing pick up locations in midtown Manhattan. In the southe...
Figure 1.3 Tornado hotspots (+) and reported start points across the United ...
Figure 1.4 Spatiotemporal clustering (DBSCAN...
Figure 1.5 Cloud service models (IaaS, PaaS, and SaaS) (Chou, 2018).
Chapter 3
Figure 3.1 Fundamental systems within the end‐to‐end Earth Observing Systems...
Figure 3.2 Key concepts of a typical SDS of data ingest from GDS, processing...
Figure 3.3 On‐premise SDS components.
Figure 3.4 NISAR processing per month of mission (forward processing + bulk ...
Figure 3.5 The relevant service layers of an SDS, their key area of focus (l...
Figure 3.6 A high‐level SDS functional architecture.
Figure 3.7 The trade‐off spectrum of on‐premise to all‐in cloud‐native, and ...
Figure 3.8 Software layer stack of the hybrid SDS.
Figure 3.9 Cloud storage tiers of AWS from hot (fast and expensive) to cold ...
Figure 3.10 SMAP SDS architecture.
Figure 3.11 NISAR GDS/SDS/DAAC data processing workflow.
Chapter 4
Figure 4.1 The actual total archive volume (active and secure copies) of dat...
Figure 4.2 BDP's data distribution scheme is led by a one‐way transfer of a ...
Chapter 5
Figure 5.1 Open data cube notional architecture.
Figure 5.2 Execution flow for a data request on S3.
Figure 5.3 The ODC Cloud execution engine.
Figure 5.4 Ingest timings (time in log scale).
Figure 5.5 Time taken for data retrieval (time in log scale).
Chapter 6
Figure 6.2 Using an RDD framework to generate 30‐year time series.
Figure 6.3 NEXUS data tiling architecture. Source: Nexus.
Figure 6.4 Apache Extensible Data Gateway Environment (EDGE) architecture.
Figure 6.5 NEXUS system architecture.
Figure 6.6 NEXUS Serverless data ingestion and processing workflow subsystem...
Figure 6.7 NEXUS Automated deployment on AWS.
Figure 6.8 NEXUS performance compared with GIOVANNI and AWS EMR.
Figure 6.9 On‐the‐fly multivariate analysis of Hurricane Katrina
Figure 6.10 The Apache Science Data Analytics Platform (SDAP), an Analytics ...
Figure 6.11 Federated Analytics Centers architecture. Source: Nexus.
Chapter 7
Figure 7.1 Area‐averaged time series of MODIS/Terra aerosol optical depth fr...
Figure 7.2 Elapsed time for computations of area‐averaged time series of MOD...
Figure 7.3 External spatiotemporal index of data chunks across the HDFS used...
Figure 7.4 Depiction of overall ClimateSpark architecture, which leverages t...
Figure 7.5 High‐level architecture of NEXUS. Data ingest is handled by an Ex...
Figure 7.6 Elapsed time results for computation of area‐averaged time series...
Chapter 8
Figure 8.1 Model‐based analysis.
Chapter 9
Figure 9.1 Number of active Argo floats (1997–2021).
Figure 9.2 Time complexity of algorithms.
Figure 9.3 Storage requirements for high dimensional data.
Figure 9.4 Three‐dimensional visualization of Ecological Marine Units (EMU)....
Figure 9.5 Mann‐Kendall statistic for trends in maximum daily temperature (1...
Chapter 10
Figure 10.1 The typical workflow of a scientist. They highlight a research q...
Figure 10.2 The progression of compute capabilities. Cloud computing is set ...
Chapter 11
Figure 11.1 DOMS supports distributed access to subsets of in situ data host...
Figure 11.2 An example of satellite to in situ data collocation in three dim...
Figure 11.3 Overall architecture of first DOMS prototype. The primary servic...
Figure 11.4 NEXUS data tiling architecture. Source: Nexus.
Figure 11.5 Giovanni versus NEXUS performance (time in seconds) on area aver...
Figure 11.6 DOMS's workflow in the prototype GUI guides the user from select...
Figure 11.7 Sample analytical graphics showing nearest matched pairs between...
Chapter 13
Figure 13.1 Sample images of two weather patterns having distinguishable str...
Figure 13.2 The block diagram illustrating the extreme weather pattern detec...
Figure 13.3 A toy example of three connected components (C₀, C₁, C₂) in the ...
Figure 13.4 A toy example of evolution plot describing the changes of the co...
Figure 13.5 An illustration of two‐class data that are separable in a high‐d...
Figure 13.6 Sample images illustrating AR detection problem. The upper row s...
Figure 13.7 Examples of normalized evolution plots of averaged (red curves) ...
Chapter 14
Figure 14.1 Cassandra cluster ring with replication factor (RF) of 3.
Figure 14.2 Apache Spark cluster and workflow.
Figure 14.3 Overview of the big data software architecture workflow.
Figure 14.4 Screenshot of LASSO Bundle Browser found at https://www.adc.arm....
Figure 14.5 Screenshot of the web application created to simultaneously plot...
Figure 14.6 Interactive time series plot of different variables from ARMBE d...
Figure 14.7 Interactive parallel coordinates plot of different variables fro...
Figure 14.8 Apache Spark application result of ARMBE conditional querying fo...
Chapter 15
Figure 15.1 A flowchart of the data set condensation process.
Figure 15.2 The 5th (a) and 95th (b) percentiles for Antarctic land surface ...
Figure 15.3 Example statistical baseline images of Antarctic land surface te...
Figure 15.4 An example anomaly database schema. The Anomalies table stores a...
Figure 15.5 Baseline sample populations at each grid cell in the Antarctic l...
Figure 15.6 The most extreme cold temperatures in Antarctica occur along the...
Figure 15.7 A profile of minimum temperature (solid gray line) and elevation...
Figure 15.8 A systematic data error in the MOD11A1 data set. The image shows...
Chapter 16
Figure 16.1 Technologies used to process AIS data.
Figure 16.2 AIS vessel traffic data processing pipeline.
Figure 16.3 Voyages are created by connecting straight lines between pings: ...
Figure 16.4 Algorithm for generating vessel voyages. The exact numbers used,...
Figure 16.5 These example images show filtered vessel traffic. Image (a) sho...
Figure 16.6 Heat maps are generated by overlaying a grid and then counting t...
Figure 16.7 (a) Total vessel traffic and (b) unique vessel traffic over one ...
Figure 16.8 Sample results for processing U.S. Coast Guard terrestrial data ...
Figure 16.9 Sample results for processing Marine Exchange of Alaska data for...
Chapter 17
Figure 17.1 Growth of Cumulative EOSDIS Archives since 2000, shown on a semi...
Special Publications 77
Thomas Huang, Tiffany C. Vance, and Christopher Lynnes
Editors
This Work is a co‐publication of the American Geophysical Union and John Wiley and Sons, Inc.
This edition first published 2023
© 2023 American Geophysical Union
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
Published under the aegis of the AGU Publications Committee
Matthew Giampoala, Vice President, Publications
Carol Frost, Chair, Publications Committee
For details about the American Geophysical Union visit us at www.agu.org.
The rights of Thomas Huang, Tiffany C. Vance, and Christopher Lynnes to be identified as the editors of this work have been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data
Names: Huang, Thomas (Technologist), editor. | Vance, C. Tiffany, editor.
| Lynnes, Christopher, editor.
Title: Big data analytics in earth, atmospheric, and ocean sciences /
Thomas Huang, Vance, C. Tiffany, Christopher Lynnes, editors.
Description: Hoboken, NJ : Wiley-American Geophysical Union, 2023. |
Includes bibliographical references and index.
Identifiers: LCCN 2022020168 (print) | LCCN 2022020169 (ebook) | ISBN
9781119467571 (cloth) | ISBN 9781119467564 (adobe pdf) | ISBN 9781119467533 (epub)
Subjects: LCSH: Earth sciences–Data processing. | Atmospheric
science–Data processing. | Marine sciences–Data processing. | Big
data.
Classification: LCC QE48.8 .B54 2022 (print) | LCC QE48.8 (ebook) | DDC
550.0285/57–dc23/eng20220722
LC record available at https://lccn.loc.gov/2022020168
LC ebook record available at https://lccn.loc.gov/2022020169
Cover Design: Wiley
Cover Images: Courtesy of Kate Culpepper with design elements provided by Esri, HERE, Garmin, FAO, NOAA, USGS, EPA | Source: Esri, DigitalGlobe, GeoEye, Earthstar Geographics, CNES/Airbus DS, USDA, USGS, AeroGRID, IGN, and the GIS User Community
Edward M. Armstrong
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
Alberto Arribas
Microsoft
Reading, UK; and
Department of Meteorology
University of Reading
Reading, UK
Jessica Austin
Axiom Data Science, LLC
Anchorage, Alaska, USA
Rob Bochenek
Axiom Data Science, LLC
Anchorage, Alaska, USA
Mark A. Bourassa
Center for Ocean‐Atmospheric Prediction Studies, and
Department of Earth, Ocean, and Atmospheric Science
Florida State University
Tallahassee, Florida, USA
Jonathan Brannock
North Carolina Institute for Climate Studies
NOAA Cooperative Institute for Satellite Earth System Studies
North Carolina State University
Asheville, North Carolina, USA
Otis Brown
North Carolina Institute for Climate Studies
NOAA Cooperative Institute for Satellite Earth System Studies
North Carolina State University
Asheville, North Carolina, USA
Kevin A. Butler
Environmental Systems Research Institute
Redlands, California, USA
Nga T. Chung
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
Thomas Cram
National Center for Atmospheric Research
Boulder, Colorado, USA
Jenny Dissen
North Carolina Institute for Climate Studies
NOAA Cooperative Institute for Satellite Earth System Studies
North Carolina State University
Asheville, North Carolina, USA
Kyle Dumas
ARM Research Facility
Oak Ridge National Laboratory
Oak Ridge, Tennessee, USA
John‐Marc Dunaway
Axiom Data Science, LLC
Anchorage, Alaska, USA
Jocelyn Elya
Center for Ocean‐Atmospheric Prediction Studies
Florida State University
Tallahassee, Florida, USA
Eamon Ford
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
David W. Gallaher
National Snow and Ice Data Center
Boulder, Colorado, USA
Kevin Michael Gill
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
Glenn E. Grant
National Snow and Ice Data Center
Boulder, Colorado, USA
Frank R. Greguska III
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
Mahabaleshwara Hegde
NASA Goddard Space Flight Center
Greenbelt, Maryland, USA
Alex Held
CSIRO Centre for Earth Observation
Canberra, ACT, Australia
Erik Hoel
Environmental Systems Research Institute
Redlands, California, USA
Benjamin Holt
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
Hook Hua
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
Thomas Huang
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
Joseph C. Jacob
NASA Jet Propulsion Laboratory
Pasadena, California, USA
Zaihua Ji
National Center for Atmospheric Research
Boulder, Colorado, USA
Karthik Kashinath
Lawrence Berkeley National Laboratory
Berkeley, California, USA
Edward J. Kearns
First Street Foundation
Brooklyn, New York, USA
Bhargavi Krishna
ARM Research Facility
Oak Ridge National Laboratory
Oak Ridge, Tennessee, USA
Vitaliy Kurlin
Department of Computer Science
University of Liverpool
Liverpool, UK
Michael M. Little
NASA Goddard Space Flight Center
Greenbelt, Maryland, USA
Qin Lv
Department of Computer Science
University of Colorado
Boulder, Colorado, USA
Christopher Lynnes
NASA Goddard Space Flight Center (retd.)
Greenbelt, Maryland, USA
Gerald Manipon
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
Theo McCaie
Met Office
Exeter, UK
Dmitriy Morozov
Lawrence Berkeley National Laboratory
Berkeley, California, USA
Grzegorz Muszynski
Lawrence Berkeley National Laboratory
Berkeley, California, USA; and
Department of Computer Science
University of Liverpool
Liverpool, UK
Matt Paget
CSIRO Centre for Earth Observation
Canberra, ACT, Australia
Tom Powell
Met Office
Exeter, UK
Giri Prakash
ARM Research Facility
Oak Ridge National Laboratory
Oak Ridge, Tennessee, USA
Prabhat Ram
Lawrence Berkeley National Laboratory
Berkeley, California, USA
Niall Robinson
Met Office
Exeter, UK; and
University of Exeter
Exeter, UK
Sujen Shah
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
Adrienne Simonson
Office of the Chief Information Officer
National Oceanic and Atmospheric Administration
Asheville, North Carolina, USA
Shawn R. Smith
Center for Ocean‐Atmospheric Prediction Studies
Florida State University
Tallahassee, Florida, USA
Kate Szura
Interactions LLC
Franklin, Massachusetts, USA
Ronnie Taib
CSIRO Data61
Sydney, NSW, Australia
Jacob Tomlinson
NVIDIA
Reading, UK
Vardis Tsontos
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
Tiffany C. Vance
U.S. Integrated Ocean Observing System
National Oceanic and Atmospheric Administration
Silver Spring, Maryland, USA
Peter Wang
CSIRO Data61
Sydney, NSW, Australia
Michael Wehner
Lawrence Berkeley National Laboratory
Berkeley, California, USA
Brian D. Wilson
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
Robert Woodcock
CSIRO Centre for Earth Observation
Canberra, ACT, Australia
Elizabeth Yam
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
Chaowei Phil Yang
George Mason University
Fairfax, Virginia, USA
Alice Yepremyan
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
Hailiang Zhang
NASA Goddard Space Flight Center
Greenbelt, Maryland, USA
The seeds for this book were sown in sessions on Big Data Analytics, held at the 2016 Fall Meeting of the American Geophysical Union. At the time, Earth Science data were projected to rise by orders of magnitude in the coming decade, and the community was investigating a variety of emergent technologies and techniques to make the best use of the coming deluge. The chapters of this book are a representative, but by no means exhaustive, collection of those and similar investigations.
Big Earth Data Analytics can be defined as the application of increasingly sophisticated tools for data analysis and display to the rapidly increasing volume of Earth science data to obtain information, and eventually insight. This combines two concepts: Big Earth Data and Data Analytics. Big Earth Data refers both to the volume of data sets and the combination of data from a variety of sources, in a variety of formats, and from a variety of disciplines. To get a sense of the volume, NOAA generates tens of terabytes of data a day from satellites, radars, ships, weather models, and other sources. The National Aeronautics and Space Administration (NASA) Earth Observation archives were growing by more than 30 TB per day in 2020, with daily growth expected to increase to 130 TB/day by 2024 as new satellites launch; and the European Centre for Medium‐Range Weather Forecasts (ECMWF) meteorological data archive adds 200 terabytes of new data daily. However, the data are "big" not only in their volume but also in their varied formats, structures, and disciplines. As such, they are disruptors to traditional analysis methods, and to the kinds of questions that can be asked by researchers. Data analytics are increasingly driven by the availability of high‐volume and heterogeneous data sets. Data size and complexity affect all aspects of data management and usage, requiring new approaches and tools. Despite the challenges to acquire, use, and analyze Big Earth Data, they are already being utilized extensively in climate, oceanographic, and biology‐related work. Easily available data make it possible to analyze longer records and patterns over large spatial domains.
Analyses of these data borrow both from traditional scientific analyses and from tools developed for business applications. These types of data analytics are developed by university and other research teams. They are increasingly becoming an area of interest to cloud providers and analytics companies. From Google's Earth Engine for analyzing Earth science data at scale, to the National Oceanic and Atmospheric Administration's (NOAA's) Big Data Program, big data about the Earth and their analysis are increasingly common. Amazon's Elastic MapReduce and SageMaker are common building blocks for cloud‐based analysis and Galileo (a.k.a. Service Workbench) is Amazon's latest Web application for interactive analysis. Microsoft Azure ML Studio is another popular cloud‐based data analysis solution. Big Earth Data analyses increasingly rely on cloud‐based storage and processing capabilities as the volume of the data and the computing resources needed go beyond local resources.
This book is organized into three parts. It starts with the big picture, covering Big Data Analytics Architecture. This part begins with a chapter addressing the geospatial aspect of Big Earth Data from a variety of perspectives. This is followed by a chapter discussing the data management challenges posed by data at scale, particularly in the context of making them available for analysis. This is complemented by a chapter discussing the challenges of scaling up the analysis itself. The following chapters cover large‐scale projects such as NASA's Earth Exchange, which enables large‐scale data analysis in a supercomputing environment, and the NOAA Big Data Project, which makes data sets available to end users via several cloud providers. Part I also includes chapters on architectures and fully realized systems, such as Data Cube, NEXUS and the Apache Science Data Analytics Platform, and a NoSQL‐based platform for exploring and analyzing in situ data.
The second part of the book, Analysis Methods for Big Earth Data, addresses some specific techniques to derive information and/or insight from big data, emphasizing the unique aspects of Earth Observations. Part II begins with two chapters on the use of geospatial statistics for analysis, followed by a chapter melding machine learning with geophysical constraints, and finally a chapter benchmarking different analytical methods for spatiotemporal analysis.
The third part of the book, Big Earth Data Applications, describes a few specific applications of big data analysis techniques and platforms: weather and climate model analysis, atmospheric river patterns, Antarctic land surface temperature extremes, satellite in situ match‐ups of oceanographic data, and vessel tracking. This is clearly a small sample of existing applications, but it shows how some very different analysis methods can find diverse applications in the Earth sciences.
While big Earth data analytics covers a range of applications, a number of common themes recur in the chapters of this book, including (1) the role of the cloud, especially with ever increasing data sizes; (2) limitations and costs of using the cloud, including the unpredictability of costs and the high cost of data egress from the cloud; (3) techniques to maintain data integrity during file transfers; (4) efficiencies via partial reads from Web object storage; (5) the use of data/object stores; (6) serverless and other intrinsic functions to standardize computations; (7) data pipelines and the use of Docker to encapsulate analyses; (8) development of application programming interfaces; (9) GeoTIFFs, Zarr, and Parquet as cloud file formats for satellite and in situ data; and (10) hard limits on data sizes in the cloud, which are especially important with satellite data.
While the chapters in this book provide a broad introduction to the subject, there are still many opportunities to address challenges posed by big data analytics, such as incorporating new data sources, implementing data standards, optimizing the use of cloud and supercomputing resources, and incorporating artificial intelligence and machine learning. As these challenges are surmounted, the computing power and agile infrastructure of the cloud will support the emergence of important new analyses and insights, in turn supporting new policy making. At the same time, new policy challenges are raised by the solutions. The use of cloud resources for data storage and analysis has the potential to both enable and complicate the accessibility of both the data and the analysis methods by the wider community, particularly as the community broadens to new application, education, and citizen scientist users. On the other hand, data egress fees or cloud provider‐specific tools may impair long‐term data preservation, scientific reproducibility, and basic equity.
Thomas Huang
NASA Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California, USA
Tiffany C. Vance
U.S. Integrated Ocean Observing System
National Oceanic and Atmospheric Administration
Silver Spring, Maryland, USA
Christopher Lynnes
NASA Goddard Space Flight Center
Greenbelt, Maryland, USA (retd.)
Erik Hoel
Environmental Systems Research Institute, Redlands, California, USA
Big data analytics, in the context of geospatial data, employs distributed computing and advanced tools that support spatiotemporal analysis, spatial statistics, and machine learning algorithms and techniques (e.g., classification, clustering, and prediction) on very large spatiotemporal data sets in order to visualize data, detect patterns, gain deeper understanding, and answer questions. In this chapter, the key definitions, domain‐specific problems, analysis concepts, current technologies and tools, and remaining challenges are discussed.
Big data analytics involves analyzing large volumes of varied data, or big data, to identify and understand patterns, correlations, and trends that are ordinarily invisible because of the volumes involved, allowing users and organizations to make better decisions. These analytics, in the context of geospatial data, commonly involve spatial processing, sophisticated spatial statistical algorithms, and predictive modeling. Big data can be obtained from a wide variety of sources, including sensors (both stationary and moving), aerial and satellite imagery, lidar, videos, social networks, website activity, sales transaction records, and real‐time stock trading transactions. Users and data scientists apply big data analytics to evaluate these large collections of data, data with volumes that traditional analytical systems are unable to accommodate (Miller & Goodchild, 2014). This is particularly the case with unstructured or semistructured data (such data types are problematic with data warehouses, which often utilize relational database concepts and work with structured data).
To address these complex demands, many new analytic environments and technologies have been developed. These include distributed processing infrastructures such as Spark and MapReduce (Dean & Ghemawat, 2008; Garillot & Maas, 2018; Zaharia et al., 2010), distributed file stores, and NoSQL databases (Alexander & Copeland, 1988; DeWitt & Gray, 1992; Klein et al., 2016; NoSQL, 2022; Pavlo & Aslett, 2016). Many of these technologies are available in open‐source software frameworks, such as Apache Hadoop (2018), that can be used to process huge data sets on clustered systems.
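To make the distributed processing model concrete, the following is a minimal sketch using PySpark, the Python API for Apache Spark. The S3 path and the lat/lon column names are hypothetical, and the sketch assumes an already configured Spark cluster; it bins a large collection of GPS pings into 0.1-degree grid cells and counts points per cell.

```python
# A minimal PySpark sketch: distributed binning and counting of point data.
# The input path and column names (lat, lon) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("grid-aggregation").getOrCreate()

# Read a large collection of GPS pings with lat/lon columns.
pings = spark.read.csv("s3://example-bucket/pings.csv",
                       header=True, inferSchema=True)

# Snap each point to a 0.1-degree grid cell and count pings per cell;
# the shuffle and aggregation run in parallel across the cluster.
counts = (
    pings.withColumn("cell_lat", F.floor(F.col("lat") / 0.1) * 0.1)
         .withColumn("cell_lon", F.floor(F.col("lon") / 0.1) * 0.1)
         .groupBy("cell_lat", "cell_lon")
         .count()
)

counts.write.mode("overwrite").parquet("s3://example-bucket/ping_density/")
```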
When working with big data, there is a collection of objectives that users have when performing big data analytics (Marz & Warren, 2013; Mysore et al., 2013). These include:

Discovering value from big data. Visualize and analyze big data in a way that reveals patterns, trends, and relationships that traditional reports and spatial processing do not. Data may exist in many disparate places, streams, or web logs.

Exploiting streaming data. Filter and convert raw streaming data from various sources, which contain geographical elements, into geographic layers of information. The geographic layers can then be used to create new, more useful maps and dashboards for decision making.

Exposing geographic patterns. Use maps and visualization to see the story behind the data. Examples of identifying geographical patterns include retailers seeing where promotions are most effective and where the competition is, banks understanding why loans are defaulting and where there is an underserved market, and climate‐change scientists determining the impact of shifting weather patterns.

Finding spatial relationships. Seeing spatially enabled big data on a map allows you to answer questions and ask new ones. Where are disease outbreaks occurring? Where is insurance risk greatest given recently updated population shifts? Geographic thinking adds a new dimension to big data problem solving and helps you make sense of big data.

Performing predictive modeling. Predictive modeling using spatially enabled big data helps you develop strategies from if/then scenarios. Governments can use it to design disaster response plans. Natural resource managers can analyze recovery of wetlands after a disaster. Health service organizations can identify the spread of disease and ways to contain it.
Spatial big data are differentiated from standard (nonspatial) big data by the presence of spatial relationships, geostatistical correlations, and spatial semantic relations (this can be generalized to include the temporal domain; Hägerstrand, 1970). Spatial big data offer additional challenges beyond what is encountered with more traditional big data. Spatial big data are characterized by the following (Barwick, 2011):

Volume. The quantity of data. Spatial big data also include global satellite imagery, mobile sensors (smart phones, GPS trackers, and fitness monitors), and georeferenced digital camera imagery.

Variety. Spatial data are composed of 2D or 3D vector features or raster imagery. Spatial data are more complex and subsume the types found with conventional big data.

Velocity. The velocity of spatial data is significant given the rapid collection of satellite imagery in addition to mobile sensors.

Veracity. For vector data (points, lines, and polygons), the quality and accuracy vary. Quality depends on whether the points were GPS‐determined, determined manually, or of unknown origin. Resolution and projection issues can also alter veracity. For geocoded points, there may be errors in the address tables and in the point location algorithms associated with addresses. For raster data, veracity depends on the accuracy of recording instruments in satellites or aerial devices, and on timeliness.

Value. For real‐time spatial big data, decisions can be enhanced through visualization of dynamic change in such spatial phenomena as climate, traffic, social‐media‐based attitudes, and massive inventory locations. Exploration of data trends can include spatial proximities and relationships.
Once spatial big data are structured, formal spatial analytics can be applied, such as spatial autocorrelation, overlays, buffering, spatial cluster techniques, and location quotients.
The terms in Table 1.1 are referenced in this chapter and are included here to facilitate a more rapid understanding of the general concepts discussed later.
Table 1.1 Terms for understanding general concepts
Amazon Web Services (AWS)
A secure, on‐demand cloud computing platform where users pay for the computing resources that they consume (e.g., computing, database storage, and content delivery).
Artificial Intelligence
Computer systems or machines that are able to perform tasks and mimic behavior that normally requires human intelligence, such as visual perception, speech recognition, and language translation.
Big Data as a Service (BDaaS)
Cloud‐based hardware and software services that support the analysis of large or complex data sets. These services can provide data, analytical tools, event‐driven processing, visualization, and management capabilities.
Cloudera
A software company that provides a software platform that can run either in the cloud or on‐prem, supporting data warehousing, machine learning, and big data analytics. The company is a major contributor to the Apache Hadoop platform (e.g., Avro, HBase, Hive, and Spark).
Computer Vision
A scientific discipline that focuses on the acquisition, extraction, analysis, and understanding of information obtained from either single or multidimensional image or video data.
Data as a Service (DaaS)
Built on top of software as a service, data are provided to users on demand for further processing and analysis. The centralization of the data enables higher quality curated data at a lower cost to the client.
Databricks
A company that provides a cloud‐based platform for working with Apache Spark. Databricks traces its origins to the AMPLab project at Berkeley that evolved into an open‐source distributed computing framework for working with big data.
Data Mining
The process of discovering and extracting hidden patterns and knowledge found in big data using methods and techniques that are commonly associated with database management, machine learning, and statistics.
Deep Learning
A subfield of machine learning that focuses on algorithms and computational architectures that mimic the structure of the brain (commonly termed artificial neural networks). Recent advances in large‐scale distributed processing have enabled the development and use of very large neural networks.
Elastic Compute Cloud (EC2)
Infrastructure within Amazon Web Services (AWS) that provides scalable computing capacity; clients can develop, deploy, and run their own applications. EC2 is elastic and allows clients to scale their compute and storage up or down as necessary.
Hadoop
An open‐source framework and set of software modules that enable users to solve problems on big data sets using a distributed cluster of hardware resources. This includes distributed data storage and computation using the MapReduce programming model. Apache Hadoop was originally inspired by Google's work in the distributed processing domain.
HDFS
A distributed and scalable file system and data store that is part of Apache Hadoop. HDFS stores big data files across a cluster of machines and supports high reliability by replication of the data across different nodes in the cluster.
Hive
Data warehouse software module in Apache Hadoop that facilitates querying and analyzing big data stored in HDFS in a distributed and replicated manner using a SQL‐like language termed HiveQL.
IBM Cloud
A set of cloud computing capabilities and services that provides capabilities including Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).
Infrastructure as a Service (IaaS)
A type of cloud computing infrastructure that virtualizes computing resources, storage, data partitioning, scaling, and networking. Unlike Software as a Service (SaaS) or Platform as a Service (PaaS), IaaS clients must maintain the applications, data, middleware, and operating system.
Machine Learning
A subset of artificial intelligence where software systems can automatically learn and improve without any explicit programming, relying upon statistical methods for pattern detection and inference. Machine learning software creates statistical models using sample data in order to make decisions or predictions.
MapReduce
A programming model, originally developed at Google, that is often used when processing big data sets in a distributed manner. MapReduce programs contain a map procedure where data can be sorted and filtered, and a reduce procedure where summary operations are performed. MapReduce systems, such as Apache Hadoop, are responsible for managing communications and data transfer among the collection of distributed processing nodes.
Microsoft Azure
A cloud computing service from Microsoft for creating, deploying, and managing applications using data centers managed by Microsoft. Hundreds of services are available that provide functionality related to compute, data management, messaging, mobile, and storage capabilities.
Natural Language Processing (NLP)
A portion of artificial intelligence that focuses on enabling computers to understand and communicate (including language translation) through human language, both written and spoken.
NoSQL data stores
A non‐SQL or non‐relational database that provides a mechanism for storage and retrieval of data. NoSQL data stores often trade consistency in favor of availability, speed, horizontal scalability, and partitionability.
Oracle Cloud
A collection of cloud computing services from Oracle providing servers, storage, network, applications, and services using Oracle‐managed data centers. The Oracle Cloud provides Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS), and Data as a Service (DaaS).
Pig
An Apache platform to develop programs for analyzing big data sets that run on Apache Hadoop using a high‐level language (Pig Latin). Pig can be used to develop functionality that runs as MapReduce, Tez, or Spark jobs.
Platform as a Service (PaaS)
A category of cloud computing service that allows clients to develop, deploy, run, and manage applications without needing to build or maintain the cloud computing infrastructure. Unlike software as a service (SaaS), the client is responsible for maintaining the applications and data.
Predictive Analytics
A group of statistical and machine learning algorithms that are used to predict the likelihood of future or other unknown events based upon existing historical data.
Real‐time Data Processing
A collection of software and hardware that processes data on the fly and is subject to a constraint where responses must be provided within a short interval of time (e.g., fractions of a second), independent of system or event data load.
Redshift
A column‐oriented, fully managed data warehouse for big data. Redshift is similar to other columnar data stores in that it is intended to scale out with distributed clusters of low‐cost hardware.
Simple Storage Service (S3)
An object storage service offered by Amazon Web Services (AWS); it is intended to store any type of data (objects) that can later be used for big data analytic processing.
Software as a Service (SaaS)
A category of cloud computing service that allows clients to license applications, web‐based software, on‐demand software, and hosted software. The delivery model is on a subscription basis and is centrally hosted. Differing from Platform as a Service (PaaS), SaaS does not require the client to manage either data or software.
Spark
An analytic engine and cluster‐computing framework, part of the Apache Hadoop ecosystem, that supports applications that run across a distributed cluster. Originally developed at Berkeley in 2009, it provides a framework for programming clusters of machines with data parallelism.
Speech Recognition
A collection of methodologies and techniques that enables the recognition and transformation of spoken language into text for further computational processing.
Storm
A real‐time, distributed, high‐volume, stream‐processing framework for big data. It is part of the Apache Hadoop open‐source ecosystem.
Stream Processing
A computer programming paradigm (similar to dataflow programming), where given a sequence of data (a stream), a series of pipelined operations (or kernel functions) is applied to each element in the stream.
There are a significant number of industries and application domains that benefit from spatiotemporal big data analytics (Hey et al., 2009). As the number of processes and technologies collecting spatial data has grown, so have the ubiquity and significance of the data. Spatial big data analytics has wide applicability and value across numerous domains; a few of these are the following.
Farmers can use spatial big data analytics to detect and analyze patterns in weather data, correlated with historical crop yields, surface topography, and soil characteristics. This helps farmers determine the best seed varieties to use and times and places to plant crops in order to maximize yields. In addition, the distribution of fertilizer can be optimized based upon historical information. Tractor and heavy equipment movement can also be tracked via GPS and incorporated into the logistic optimization analytics, and the areas of usable and productive land within a field can be identified.
Commercial retailers have always used local shopping patterns and demographics to drive marketing strategies and site selection. However, retailers can now use spatial big data analytics to analyze the locations and characteristics of customers along with social media conversations and browsing behavior in order to better understand customers' needs. Retailers can essentially build a richer and more useful understanding and relationship with their customer base. New store site selection on regional or national levels can be optimized based on the locations of customers, competitors, and other nontraditional data.
Developers of systems for connected cars and autonomous vehicles can use spatial big data analytics to provide accurate situational awareness to drivers and vehicles about their surrounding environment. Systems can apply analytics capabilities such as road snapping, predictive road snapping, change detection of objects sensed by the vehicle but not on the map, and accident prediction. This is all under the topic of improved vehicle reliability and passenger safety.
Environmental organizations can employ spatial big data analytics to answer a number of important questions including whether there are spatiotemporal correlations between species observations (this can be by geographic area or species).
In the financial services/insurance industry, spatial big data analytics are used to overlay weather data with claim data to assist companies in detecting possible instances of fraud. In other contexts, non‐traditional data sources like satellite imagery are combined with traditional topographic data sources to identify the potential risk of offering flood insurance. Insurers can also assess spatial relationships between their insurance portfolios and past hazards to balance risk exposure. Finally, banks can use spatiotemporal historical transaction data to help them detect evidence of fraud.
National and regional government agencies would like to use spatial big data analytics to process and overlay nationwide data sets containing land use, parcels, planning information, geological information, and environmental data in order to create information products that can be used by analysts, scientists, and policy makers to make better policy decisions.
Public health agencies can use spatial big data analytics to see how far patients are from health facilities, helping them evaluate access to care. Hospital networks can determine the density of hospitals in certain areas to identify gaps and opportunities. They can also measure the prevalence of certain habits and illnesses in the community using demographic data. Public health agencies can also utilize tracking data to perform contact tracing of infected individuals to identify whom they have been in contact with in the past. The contact information can then be utilized to help reduce infections in the general population. Proximity tracing is a variant in which contact is specified using a proximity‐based filtering criterion (e.g., spatial and temporal range) in order to identify potential contact events.
Geospatial big data analytics is frequently used in corporate marketing for prospect and customer segmentation. Data from body sensors (e.g., smart phones, smart watches, fitness monitors) can be used to segment the customer base according to physical activity or behavioral patterns and deliver advertising in a targeted manner. Companies also want to be able to identify where their customers are in relation to their competitors' customers. This allows them to identify areas where they are losing the market and help determine where they need to focus their marketing efforts.
Mining companies can apply spatial big data analytics to perform complex vehicle tracking analysis to find ways to better manage equipment moves. For example, they can analyze patterns of equipment locations when braking, and they can review shock absorption, RPM changes, and other telematics information. They can also analyze geochemical sample results.
Spatial big data analytics enable petroleum companies to identify suitable areas for exploration based upon historical production, geographic composition, and competitor activity (including leasing activity). Spatial big data analytics can also be used to review historical production data to assess reservoir production over time. Vehicle tracking data can be analyzed to determine time spent on both commercial and noncommercial roads. They can also review vessel tracks over offshore blocks using AIS vessel tracking information.
Retailers can use spatial big data analytics to model retail networks and help them select the best sites to optimize their store network. Analytic results can be used to create customer profile maps, allowing retailers to better understand customer behavior and the factors that influence their behavior. Retailers also want to spatially analyze the types of products that consumers are buying based upon seasonal and weather‐related stimuli. This often incorporates promotions and sale activity. The spatiotemporal analysis can extend to a very fine‐grained level, for example, hourly sales activity on Black Friday.
Telecommunications companies can use spatial big data analytics to review spatial trends in bandwidth usage over time to help plan new network deployments. They can analyze spatial patterns in consumer habits, spending patterns, demographics, and service purchases to improve marketing, define new products, and help plan network expansions. Customer service departments can correlate network problems and trouble tickets with customer complaints or cancellations to determine where and when service issues have led to customer dissatisfaction. Call detail records can be used to identify areas where cellular service is problematic (quality, speed, coverage), both temporally and spatially.
With spatial big data analytics, commercial delivery companies can reconstruct vehicle routes from millions of individual position reports to check for routing inefficiencies and identify incidents of unsafe speeding and braking. This level of visibility into past trips helps them develop strategies to improve efficiency and safety. Transportation planners can also use spatial big data analytics to aggregate, visualize, and analyze historical crash data for metropolitan areas, helping them identify unsafe road conditions. State and regional transportation agencies can analyze and model traffic slowdowns and congestion in order to optimize future road construction and rapid transit planning activities. City mobility planning (encompassing buses, ride sharing, and public bike systems) makes heavy use of spatiotemporal big data analytics in optimizing route planning and resource deployments in order to maximize throughputs and minimize congestion delays.
Geospatial big data analytics is used by utility companies to summarize and analyze customer usage patterns across a service area. They can assess customer usage through time and correlate usage to weather patterns, helping them anticipate future demand. Utilities can also use spatial big data analytics to analyze Supervisory Control and Data Acquisition (SCADA), smart meter, and other sensor data to detect and quantify potential problems in the distribution network, such as when and where outages occur, whether they correlate with weather events, and how many customers are affected. They can use this information to prioritize maintenance activities and prevent or mitigate future problems. Public utility commissions consume raw energy data from utilities and prepare future forecasts of energy consumption. Energy efficiency can also be studied to determine what the seasonal impacts are and what can be done to guide consumers toward smarter energy usage (Fig. 1.1).
Figure 1.1 Leveraging feature binning technology to see geographic trends between industrial emission activity in 2014 (small hexes) as reported in the EPA Toxic Release Inventory and total U.S. electrical generation by load (large hexes) in 2018 as published by the Homeland Infrastructure Foundation‐Level Data.
The type of analysis that may be performed against spatial big data often parallels that which is typically done with traditional spatial data (Longley et al., 2015). However, when working with big data, it is oftentimes necessary to identify the key or most significant subsets of data in the larger collection. Once the interesting data are identified, further detailed analysis using the full breadth of spatiotemporal analysis tools and techniques can then be applied. This is particularly common when working with spatial big data that are obtained from sensors.
Summarizing data encompasses operations that calculate total counts, lengths, areas, and basic descriptive statistics of features and their attributes within areas or near other features (Fig. 1.2). Common operations that summarize data include the following.
Figure 1.2 Ridesharing pick‐up locations in midtown Manhattan. In the southern portion of the figure, the raw data are shown. The northern region shows the data aggregated into hexagonal cells 250 m in height.
Aggregations aggregate points into polygon features or bins. At all locations where points exist, a polygon is returned with a count of points as well as optional statistics.
Joins match two data sets based upon their spatial, temporal, or attribute relationships (Abel et al., 1995). Spatial joins match features based upon their spatial relationships (e.g., overlapping, intersecting, within distance, etc.); temporal joins match features based upon their temporal relationships; and attribute joins match features based upon their attribute values.
Track reconstruction creates line tracks from temporally enabled, moving point features (e.g., positions of cars, aircraft, ships, or animals).
Summarization overlays one data set on another and calculates summary statistics representing these relationships. For example, one set of polygons may be overlaid on another data set in order to summarize the number of polygons, their area, or attribute statistics. (A short sketch of a join followed by per‐area summarization appears after this list.)
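As a concrete illustration of the join and summarization operations above, here is a minimal sketch using GeoPandas, one possible tooling choice among many. The file names and the zone_id and fare columns are hypothetical.

```python
# A minimal GeoPandas sketch: spatial join of points into polygons,
# then per-polygon summary statistics. File and column names are
# hypothetical.
import geopandas as gpd

points = gpd.read_file("pickups.geojson")  # point features with a "fare" column
zones = gpd.read_file("zones.geojson")     # polygon features with a "zone_id" column

# Spatial join: attach the containing zone to each point.
joined = gpd.sjoin(points, zones, how="inner", predicate="within")

# Summarization: count points and average an attribute per zone.
summary = joined.groupby("zone_id").agg(
    pickups=("zone_id", "size"),
    mean_fare=("fare", "mean"),
)
print(summary.head())
```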
Location identification involves identifying areas that meet a number of different specified criteria. The criteria can be based on attribute queries (for example, parcels that are vacant) and spatial queries (for example, within 1 km of a river). The areas that are found can be selected from existing features (such as existing land parcels), or new features can be created where all the requirements are met. Common operations that are used to identify locations include (1) incident detection, which detects all features that meet specified criteria (e.g., lightning strikes exceeding a given intensity), and (2) similarity, which identifies the features that are either the most similar or least similar to another set of features based upon attribution.
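A minimal sketch of such a location identification query, again using GeoPandas as one possible tool; the file names, the status column, and the projected CRS are hypothetical.

```python
# A minimal sketch of location identification: select vacant parcels
# within 1 km of a river. File names, columns, and CRS are hypothetical.
import geopandas as gpd

parcels = gpd.read_file("parcels.geojson").to_crs(epsg=32618)  # metric CRS
rivers = gpd.read_file("rivers.geojson").to_crs(epsg=32618)

# Attribute query: vacant parcels only.
vacant = parcels[parcels["status"] == "vacant"]

# Spatial query: within 1 km of any river (buffer the river network
# and test for intersection).
river_zone = rivers.geometry.buffer(1000).unary_union
candidates = vacant[vacant.intersects(river_zone)]
print(len(candidates), "candidate parcels")
```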
Pattern analysis involves identifying, quantifying, and visualizing spatial patterns in spatial data (Bonham‐Carter, 1994; Golledge & Stimson, 1997). Identifying geographic patterns is important for understanding how geographic phenomena behave.
Although it is possible to understand the overall pattern of features and their associated values through traditional mapping, calculating a statistic quantifies the pattern (Vapnik, 2000). Statistical quantification facilitates the comparison of patterns with different distributions or across different time periods. Pattern analysis tools are often used as a starting point for more in‐depth analyses. For example, spatial autocorrelation can be used to identify distances where the processes promoting spatial clustering are most pronounced. This might help the user to select an appropriate distance (scale of analysis) to use for investigating hot spots (hot spot analysis using the Getis‐Ord Gi* statistic) (Fig. 1.3).
Figure 1.3 Tornado hotspots (+) and reported start points across the United States from 1950 to 2018. Hotspots are calculated using the Getis‐Ord Gi* statistic on tornado geographic frequency and weighted by severity (Fujita Scale 0–5) to determine locations with a higher risk of damage based upon reported historical events (p‐value < 0.05; z‐score > 3). Tornado data from the NOAA Storm Prediction Center for Severe Weather.
Pattern analysis tools are used for inferential statistics; they start with the null hypothesis that features, or the values associated with the features, exhibit a spatially random pattern. They then compute a p‐value representing the probability that the null hypothesis is correct (that the observed pattern is simply one of many possible versions of complete spatial randomness). Calculating a probability may be important if you need to have a high level of confidence in decision making. If there are public safety or legal implications associated with your decision, for example, you may need to justify your decision using statistical evidence.
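As one way to run such a hot spot analysis in practice, the following is a minimal sketch using the PySAL packages libpysal and esda. The input GeoDataFrame of grid-cell centroid points and its tornado_count column are hypothetical; the significance thresholds mirror those quoted for Figure 1.3.

```python
# A minimal sketch of Getis-Ord Gi* hot spot analysis with PySAL.
# The point GeoDataFrame and its "tornado_count" column are hypothetical.
import geopandas as gpd
from libpysal.weights import DistanceBand
from esda.getisord import G_Local

cells = gpd.read_file("tornado_grid_centroids.geojson")  # point features

# Neighbors are cells within 100 km of each other (projected CRS assumed).
w = DistanceBand.from_dataframe(cells, threshold=100_000, binary=True)

# Gi* statistic with 999 random permutations for pseudo p-values.
gi_star = G_Local(cells["tornado_count"], w, star=True, permutations=999)

# Flag significant hot spots (z-score > 3, p-value < 0.05, as in Fig. 1.3).
cells["hot_spot"] = (gi_star.Zs > 3) & (gi_star.p_sim < 0.05)
```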
Cluster analysis is used to identify the locations of statistically significant hot spots, spatial outliers, and similar features (Ester et al., 1996) (Fig. 1.4). Cluster analysis is particularly useful when action is needed based on the location of one or more clusters. An example would be the assignment of additional police officers to deal with a cluster of burglaries. Pinpointing the location of spatial clusters is also important when looking for potential causes of clustering; where a disease outbreak occurs can often provide clues about what might be causing it. Unlike pattern analysis (which is used to answer questions such as, "Is there spatial clustering?"), cluster analysis supports the visualization of cluster locations and extent. Cluster analysis can be used to answer questions such as, "Where are the clusters (hot spots and cold spots)?", "Where are incidents most dense?", "Where are the spatial outliers?", and "Which features are most alike?"
Figure 1.4 Spatiotemporal clustering (DBSCAN – Density‐Based Spatial Clustering of Applications with Noise) of ridesharing drop‐off locations in midtown Manhattan. Clusters (darker points in the figure) identify where many drop‐offs occurred in a similar place and time; the minimum cluster size is 15 events.
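The clustering in Figure 1.4 can be approximated with scikit-learn's DBSCAN implementation. A minimal sketch follows; the input file and its projected x/y coordinate columns are hypothetical, and eps is an assumed neighborhood radius.

```python
# A minimal DBSCAN sketch with scikit-learn, using a minimum cluster
# size of 15 as in Figure 1.4. File and column names are hypothetical.
import pandas as pd
from sklearn.cluster import DBSCAN

dropoffs = pd.read_csv("dropoffs.csv")  # projected x, y columns in meters

# eps: neighborhood radius in meters; min_samples: minimum cluster size.
labels = DBSCAN(eps=100, min_samples=15).fit_predict(dropoffs[["x", "y"]])

dropoffs["cluster"] = labels  # label -1 marks noise (unclustered) points
print(dropoffs["cluster"].value_counts().head())
```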
Proximity analysis allows people to answer one of the most common questions posed in spatial analysis: “What is near what?” This type of analysis supports the determination of proximal features within one or more data sets; for example, identify features that are closest to one another or calculate the distances between or around them. Common analysis methods include the following:
Distance calculation: The Euclidean distance from a single source or set of sources (a minimal nearest‐distance sketch follows this list).
Travel cost calculation: The least accumulative cost distance from or to the least‐cost source, while accounting for surface distance along with horizontal and vertical cost factors.
Optimal travel cost calculation: The optimum cost network from a set of input regions. One example application of this tool is finding the best network for emergency vehicles.
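For the plain Euclidean case, nearest-source distances can be computed efficiently with a k-d tree. The following is a minimal sketch using SciPy; the coordinate arrays are hypothetical and assumed to be in a projected (metric) CRS.

```python
# A minimal proximity sketch: Euclidean distance from each feature to
# its nearest source using SciPy's k-d tree. Arrays are hypothetical.
import numpy as np
from scipy.spatial import cKDTree

sources = np.array([[0.0, 0.0], [5000.0, 2000.0]])          # e.g., fire stations
features = np.random.uniform(0, 10_000, size=(100_000, 2))  # e.g., incident points

tree = cKDTree(sources)
dist, idx = tree.query(features, k=1)  # nearest source index and distance

print("mean distance to nearest source (m):", dist.mean())
```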
Predictive analytics builds models to forecast behavior and other future developments. It encompasses techniques from spatial statistics, data mining, machine learning, and artificial intelligence (Minsky, 1986; Newell et al., 1959; Pedregosa et al., 2011). Patterns are identified in historical data and are used when creating models for future events.
Machine learning uses algorithms and statistical models to analyze large data sets without using explicit sequences of instructions. Machine learning algorithms create a model of training data that is used to make optimized predictions and decisions. Machine learning is considered to be a subset of artificial intelligence.
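As a minimal illustration of the train-then-predict workflow just described, here is a sketch using scikit-learn; the feature columns and label are hypothetical, and a random forest stands in for whichever model a given application would use.

```python
# A minimal predictive modeling sketch with scikit-learn: fit a model to
# historical records, then evaluate on held-out data. Feature columns
# (elevation, distance_to_river, rainfall) and the label (flooded) are
# hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.read_csv("historical_events.csv")
X = data[["elevation", "distance_to_river", "rainfall"]]
y = data["flooded"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```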
Deep learning is a subset of artificial intelligence where models resembling biological nervous systems are arrayed in multiple layers where each layer uses the output of the preceding as input to create a more abstract and composite representation of the data (LeCun et al., 2015). Deep learning architectures include deep neural networks, belief networks, and recurrent neural networks. Deep learning is commonly used in the domains of natural language processing, computer vision, and speech recognition.
There are several key technologies that are commonly employed to process large volumes of spatial