Geospatial Data Analytics on AWS - Scott Bateman - E-Book

Geospatial Data Analytics on AWS E-Book

Scott Bateman

Description

Managing geospatial data and building location-based applications in the cloud can be a daunting task. This comprehensive guide helps you overcome this challenge by presenting the concept of working with geospatial data in the cloud in an easy-to-understand way, along with teaching you how to design and build data lake architecture in AWS for geospatial data.
You’ll begin by exploring the use of AWS databases like Redshift and Aurora PostgreSQL for storing and analyzing geospatial data. Next, you’ll leverage services such as DynamoDB and Athena, which offer powerful built-in geospatial functions for indexing and querying geospatial data. The book is filled with practical examples to illustrate the benefits of managing geospatial data in the cloud. As you advance, you’ll discover how to analyze and visualize data using Python and R, and utilize QuickSight to share derived insights. The concluding chapters explore the integration of commonly used platforms like Open Data on AWS, OpenStreetMap, and ArcGIS with AWS to enable you to optimize efficiency and provide a supportive community for continuous learning.
By the end of this book, you’ll have the necessary tools and expertise to build and manage your own geospatial data lake on AWS, along with the knowledge needed to tackle geospatial data management challenges and make the most of AWS services.

Page count: 308

Year of publication: 2023

Geospatial Data Analytics on AWS

Discover how to manage and analyze geospatial data in the cloud

Scott Bateman

Janahan Gnanachandran

Jeff DeMuth

BIRMINGHAM—MUMBAI

Geospatial Data Analytics on AWS

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Reshma Raman

Book Project Manager: Kirti Pisat

Content Development Editor: Joseph Sunil

Technical Editor: Sweety Pagaria

Copy Editor: Safis Editing

Proofreader: Safis Editing

Indexer: Subalakshmi Govindhan

Production Designer: Ponraj Dhandapani

DevRel Marketing Coordinator: Nivedita Singh

First published: June 2023

Production reference: 1280623

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80461-382-5

www.packtpub.com

This book is dedicated to my father, Orren, for instilling my passion for both computers and maps. To the memory of my mother, Jolene, for teaching me the persistence and patience needed to get my thoughts on the page. To my wife, Angel, and wonderful children Jackson and Emily for the support and encouragement to bring this book to life.

Scott Bateman

This book is dedicated to my parents, Gnanachandran and Rajaletchumy, for guiding me through life’s challenges and celebrating my successes. To my amazing wife, Dilakshana, and lovely children, Jashwin and Dhruvish, for your unwavering support and belief in me. Finally, to the team at Amazon Web Services (AWS), whose relentless pursuit of excellence has transformed the way we harness technology. Thank you for inspiring this work and for your commitment to empowering businesses worldwide.

Janahan (Jana) Gnanachandran

Contributors

About the authors

Scott Bateman is a Principal Solutions Architect at AWS focused on customers in the energy industry. Prior to joining the AWS Houston office in 2019, he was Director of business applications at bpx energy and has worked for over a quarter century innovating with technology to solve the toughest energy business problems. As part of the Geospatial Technical Field Community (TFC) group within AWS, Scott is able to speak with customers about common challenges gathering geospatial data, tracking assets, optimizing driving routes, and better understanding facilities and property through remote sensing. When not working or writing, Scott enjoys snowboarding, flying drones, traveling to unknown destinations, and learning something new every day.

Janahan (Jana) Gnanachandran is a Principal Solutions Architect at AWS. He partners with customers across industries to increase speed and agility and to drive innovation using the cloud for digital transformation. Jana holds a BE in Electronics and Communication Engineering from Anna University, India, and an MS in Computer Engineering from the University of Louisiana at Lafayette. Alongside designing scalable, secure, and cost-effective cloud solutions, Jana conducts workshops, training sessions, and public speaking engagements on cloud best practices, architectural best practices, and data and AI/ML strategies. When not immersed in cloud computing, he enjoys playing tennis or golf, photography, and traveling with his wife and their two kids.

Jeff DeMuth is a solutions architect who joined Amazon Web Services (AWS) in 2016. He focuses on the geospatial community and is passionate about geographic information systems (GIS) and technology. Outside of work, Jeff enjoys travelling, building Internet of Things (IoT) applications, and tinkering with the latest gadgets.

About the reviewers

Faizan Tayyab is a GIS professional with over 16 years of experience in the oil and gas industry. He holds a master’s degree and has several certifications covering various technologies, including web and cloud technologies. He is also a software trainer, teaching technology courses to a worldwide audience through Udemy, a popular online training platform. He is well versed in various technologies and actively contributes to the open source geospatial community through the development and distribution of software components that are free for other geographers and developers to use.

Angela Orthmeyer is currently the Lead Geospatial Data Analyst at RapidSOS, the world’s leading intelligent safety platform. Angela is a creative problem solver with experience in GIS, data science and project management. Prior to her position at RapidSOS, she was a Data Scientist at CKM Analytix, a Natural Resources Social Scientist at the National Oceanic and Atmospheric Administration, and a Peace Corps Volunteer in Panama. Her education spans the social and natural sciences. Angela received a Master of Environmental Management from Yale University, a B.S. in Biology from the University of Richmond, and a certificate in Data Analytics from Principal Analytics Prep.

Rohit Mendadhala is a Geospatial Data Scientist and an FAA Certified Drone Pilot with over 7 years of professional experience in developing, managing, and implementing geospatial solutions at scale for organizations across a wide range of industries such as government, environmental, transportation, software, telecom, and real estate research. His core areas of expertise include Geospatial Data Analytics, Data Visualization, Spatial Analysis and Mapping, Spatial Data Science, GIS Development, Web and Enterprise GIS, and Market Research and Analysis using ArcGIS and open source geospatial platforms. He enjoys discovering underlying patterns in large datasets with a spatial context and curating answers to critical thought-provoking questions.

Shital Dhakal is a seasoned GIS professional with over seven years’ experience in the field of GIS and remote sensing. He has acquired industry and research experience in North America, Europe, and Asia. Currently, he works at a San Francisco Bay Area-based start-up and helps local government to implement enterprise GIS strategies. He is a certified GIS Professional (GISP) and has an MSc from Boise State University, Idaho. When he is not playing with spatial data, writing blogs, or making maps, he can be found hiking in the Sierra Nevada and, occasionally, in the Himalayas.

Table of Contents

Preface

Part 1: Introduction to the Geospatial Data Ecosystem

1

Introduction to Geospatial Data in the Cloud

Introduction to cloud computing and AWS

Storing geospatial data in the cloud

Building your geospatial data strategy

Preventing unauthorized access

The last mile in data consumption

Leveraging your AWS account team

Geospatial data management best practices

Data – it’s about both quantity and quality

People, processes, and technology are equally important

Cost management in the cloud

Right-sizing, simplified

The elephant in the server room

Bird’s-eye view on savings

Can’t we just add another server?

Additional savings at every desk

Summary

References

2

Quality and Temporal Geospatial Data Concepts

Quality impact on geospatial data

Transmission methods

Streaming data

Understanding file formats

Normalizing data

Considering temporal dimensions

Summary

References

Part 2: Geospatial Data Lakes using Modern Data Architecture

3

Geospatial Data Lake Architecture

Modern data architecture overview

The AWS modern data architecture pillars

Geospatial Data Lake

Designing a geospatial data lake using modern data architecture

Data collection and ingestion layer

Data storage layer

Data processing and transformation

Data analytics and insights

Data visualization and mapping

Summary

References

4

Using Geospatial Data with Amazon Redshift

What is Redshift?

Understanding Redshift partitioning

Redshift Spectrum

Redshift geohashing support

Redshift AQUA

Redshift geospatial support

Launching a Redshift cluster and running a geospatial query

Summary

References

5

Using Geospatial Data with Amazon Aurora PostgreSQL

Lab prerequisites

Setting up the database

Connecting to the database

Installing the PostGIS extension

Geospatial data loading

Queries and transformations

Architectural considerations

Summary

References

6

Serverless Options for Geospatial

What is serverless?

Serverless services

Object storage and serverless websites with S3

Geospatial applications and S3 web hosting

Serverless hosting security and performance considerations

Python with Lambda and API Gateway

Deploying your first serverless geospatial application

Summary

References

7

Querying Geospatial Data with Amazon Athena

Setting up and configuring Athena

Geospatial data formats

WKT

JSON-encoded geospatial data

Spatial query structure

Spatial functions

AWS service integration

Architectural considerations

Summary

References

Part 3: Analyzing and Visualizing Geospatial Data in AWS

8

Geospatial Containers on AWS

Understanding containers

Scaling containers

Container portability

GDAL

GeoServer

Updating containers

AWS services

Deployment options

Deploying containers

Summary

References

9

Using Geospatial Data with Amazon EMR

Introducing Hadoop

Introduction to EMR

Common Hadoop frameworks

EMRFS

Geospatial with EMR

Launching EMR

Summary

References

10

Geospatial Data Analysis Using R on AWS

Introduction to the R geospatial data analysis ecosystem

Setting up R and RStudio on EC2

RStudio on Amazon SageMaker

Analyzing and visualizing geospatial data using RStudio

Summary

References

11

Geospatial Machine Learning with SageMaker

AWS ML background

AWS service integration

Common libraries and algorithms

Introducing Geospatial ML with SageMaker

Deploying a SageMaker Geospatial example

First-time use steps

Geospatial data processing

Geospatial data visualization

Architectural considerations

Summary

References

12

Using Amazon QuickSight to Visualize Geospatial Data

Geospatial visualization background

Amazon QuickSight overview

Connecting to your data source

Configuring Athena

Configuring QuickSight

Visualization layout

Features and controls

Point maps

Filled maps

Putting it all together

Reports and collaboration

Summary

References

Part 4: Accessing Open Source and Commercial Platforms and Services

13

Open Data on AWS

What is open data?

Bird’s-eye view

Modern applications

The Registry of Open Data on AWS

Requester Pays model

Analyzing open data

Using your AWS account

Analyzing multiple data classes

Federated queries with Athena

Open Data on AWS benefits

Summary

References

14

Leveraging OpenStreetMap on AWS

What is OpenStreetMap?

OSM’s data structure

OSM benefits

Accessing OSM from AWS

Application – ski lift scout

The OSM community

Architectural considerations

Summary

References

15

Feature Servers and Map Servers on AWS

Types of servers and deployment options

Capabilities and cloud integrations

Deploying a container on AWS with ECR and EC2

Summary

Further reading

16

Satellite and Aerial Imagery on AWS

Imagery options

Sentinel

Landsat

NAIP

Architectural considerations

Demonstrating satellite imagery using AWS

Summary

References

Index

Other Books You May Enjoy

Part 1: Introduction to the Geospatial Data Ecosystem

In this part, we will learn how to work with geospatial data in the cloud and examine the economics of storing and analyzing data there.

This part has the following chapters:

Chapter 1, Introduction to Geospatial Data in the Cloud, shows us how to work with geospatial data in the cloud and covers the economics of storing and analyzing that data there.
Chapter 2, Quality and Temporal Geospatial Data Concepts, explores the different quality characteristics of geospatial data. It also presents concepts that show how the time-specific (temporal) aspects of data can be captured and designated in the data structure.

1

Introduction to Geospatial Data in the Cloud

This book is divided into four parts that will walk you through key concepts, tools, and techniques for dealing with geospatial data. Part 1 sets the foundation for the entire book, establishing key ideas that provide synergy with subsequent parts. Each chapter is further subdivided into topics that dive deep into a specific subject. This introductory chapter of Part 1 will cover the following topics:

Introduction to cloud computing and AWS
Storing geospatial data in the cloud
Building your geospatial data strategy
Geospatial data management best practices
Cost management in the cloud

Introduction to cloud computing and AWS

You are most likely familiar with the benefits that geospatial analysis can provide. Governmental entities, corporations, and other organizations routinely solve complex, location-based problems with the help of geospatial computing. While paper maps are still around, most use cases for geospatial data have evolved to live in the digital world. We can now create maps faster and draw more geographical insights from data than at any point in history. This phenomenon has been made possible by blending the expertise of geospatial practitioners with the power of Geographical Information Systems (GIS). Critical thinking and higher-order analysis can be done by humans while computers handle the monotonous data processing and rendering tasks. As the geospatial community continues to refine the balance of which jobs require manual effort and which can be handled by computers, we are collectively improving our ability to understand our world.

Geospatial computing has been around for decades, but the last 10 years have seen a dramatic shift in the capabilities and computing power available to practitioners. The emergence of the cloud as a fundamental building block of technical systems has opened up significant opportunities in compute, storage, and analytical capabilities. In addition to a revolution in the infrastructure behind GIS, the cloud has expanded the options available in every layer of the technical stack. Common problems such as running out of disk space, long-running geospatial processing jobs, limited data availability, and difficult collaboration across teams can be things of the past. AWS provides solutions to these problems and more, and in this book, we will describe, dissect, and provide examples of how you can apply them in your organization.

Cloud computing provides the ability to rapidly experiment with new tools and processing techniques in ways that would not be practical with a fixed set of compute resources. Not only are new capabilities available and continually improving, but your team will also have more time to learn and use them, thanks to the time saved in creating, configuring, and maintaining the environment. The undifferentiated heavy lifting of managing geospatial storage devices, application servers, geodatabases, and data flows can be replaced with time spent analyzing, understanding, and visualizing the data. Traditional “this or that” technical trade-offs are no longer binary decisions. Your organization can use the right tool for each job, and blend as many tools and features into your environment as is appropriate for your requirements. By paying only for the resources you use in AWS, it is possible to break free from restrictive, punitive, and time-limiting licensing situations. In some cases, the amount of an AWS compute resource you use is measured and charged down to the millisecond, so you literally don’t pay for a second of unused time. If a team only occasionally needs a capability, such as a monthly data processing job, this can result in substantial cost savings by eliminating idle virtual machines and the technical resources that support them. If cost savings are not your top concern, the same budget can be dedicated to more capable hardware that completes jobs dramatically faster than a limited compute environment.
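To make the pay-per-use arithmetic concrete, the following sketch compares a monthly processing job that is billed per millisecond with a virtual machine that is left running all month. The rates, the two-hour job duration, and the function names are hypothetical placeholders chosen for illustration; they are not AWS prices, so substitute current published pricing for a real estimate.

```python
# Hypothetical figures for illustration only; these are not AWS prices.
HOURS_PER_MONTH = 730          # rough average number of hours in a month
MS_PER_HOUR = 3_600_000        # milliseconds in one hour


def always_on_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of a machine that runs (and is billed) for the whole month."""
    return hourly_rate * hours


def metered_cost(rate_per_ms: float, runtime_ms: int, runs_per_month: int = 1) -> float:
    """Cost of a job that is billed only for the milliseconds it actually runs."""
    return rate_per_ms * runtime_ms * runs_per_month


# Assume the same hourly price in both cases so only the usage pattern differs.
assumed_hourly_rate = 0.20
idle_heavy = always_on_cost(assumed_hourly_rate)
on_demand = metered_cost(assumed_hourly_rate / MS_PER_HOUR, runtime_ms=2 * MS_PER_HOUR)

print(f"Always-on machine: {idle_heavy:.2f} per month")
print(f"Two-hour monthly job, metered: {on_demand:.2f} per month")
```

The comparison deliberately holds the unit price constant; in practice, metered and always-on services are priced differently, so the point is the shape of the calculation rather than the specific numbers.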

The global infrastructure of AWS allows you to position data in the best location to minimize latency, providing the best possible performance. Powerful replication and caching technologies can be used to minimize wait time and allow robust cataloging and characterization of your geospatial assets. The global flexibility of your GIS environment is further enabled with the use of innovative end user compute options. Virtual desktop services in AWS allow organizations to keep the geospatial processing close to the data for maximum performance, even if the user is geographically distanced from both. AWS and the cloud have continued to evolve and provide never-before-seen capabilities in geospatial power and flexibility. Over the course of this book, we will examine what these concepts are, how they work, and how you can put them to work in your environment.

Now that we have covered the story of cloud computing on AWS, let’s look at how geospatial data can be stored there.

Storing geospatial data in the cloud

As you learn about the possibilities for storing geospatial data in the cloud, it may seem daunting due to the number of options available. Many AWS customers experiment with Amazon Simple Storage Service (S3) for geospatial data storage as their first project. Relational databases, NoSQL databases, and caching options commonly follow in the evolution of geospatial technical architectures. General GIS data storage best practices still apply to the cloud, so much of the knowledge that practitioners have gained over the years directly applies to geospatial data management on AWS. Familiar GIS file formats that work well in S3 include the following:

Shapefiles (.shp, .shx, .dbf, .prj, and others)
File geodatabases (.gdb)
Keyhole Markup Language (.kml)
Comma-Separated Values (.csv)
Geospatial JavaScript Object Notation (.geojson)
Georeferenced Tagged Image File Format, or GeoTIFF (.tif, .tiff)
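Several of these formats are plain text, which is part of why they move to object storage with little friction. As a minimal sketch, the snippet below builds a GeoJSON FeatureCollection using only Python's standard library. The site name, coordinates, and object key are hypothetical, and the upload to S3 itself is intentionally left out.

```python
import json


def point_feature(lon: float, lat: float, properties: dict) -> dict:
    """Build a GeoJSON Feature with Point geometry.

    GeoJSON positions are ordered longitude first, then latitude.
    """
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }


def feature_collection(features) -> dict:
    """Wrap features in a GeoJSON FeatureCollection."""
    return {"type": "FeatureCollection", "features": list(features)}


# Hypothetical example data.
collection = feature_collection(
    [point_feature(-95.3698, 29.7604, {"name": "example-site"})]
)

# The serialized text is what would be stored as an object in a bucket.
geojson_text = json.dumps(collection, indent=2)
object_key = "vector/sites/example-site.geojson"  # hypothetical key layout
print(object_key)
print(geojson_text)
```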

The physical location of data is still important for latency-sensitive workloads. Formats and organization of data can usually remain unchanged when moving to S3 to limit the impact of migrations. Spatial indexes and use-based access patterns will dramatically improve the performance and ability of your system to deliver the desired capabilities to your users.
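One common way to give object storage a spatial access pattern is to encode each location as a geohash and use it as a key prefix, so nearby features share a prefix. The sketch below is one possible approach and not a method prescribed by this book; the `by-geohash/` key layout and the function names are assumptions made for illustration.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def s3_key_for(lat: float, lon: float, filename: str, precision: int = 5) -> str:
    """Build a hypothetical object key whose prefix groups nearby features."""
    return f"by-geohash/{geohash_encode(lat, lon, precision)}/{filename}"


print(s3_key_for(57.64911, 10.40744, "site.geojson"))
```

A shorter geohash covers a larger cell, so the chosen precision controls how many objects end up sharing a prefix.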

Relational databases have long been the cornerstone of most enterprise GIS environments. This is especially true for vector datasets. AWS offers the most comprehensive set of relational database options with flexible sizing and architecture to meet your specific requirements. For customers looking to migrate geodatabases to the cloud with the least amount of environmental change, Amazon Elastic Compute Cloud (EC2) virtual machine instances provide a similar capability to what is commonly used in on-premises data centers. Each database server can be instantiated on the specific operating system that is used by the source server. Using EC2 with Amazon Elastic Block Store (EBS) network-attached storage provides the highest level of control and flexibility. Each server is created by specifying the amount of CPU, memory, and network throughput desired. Relational database management system (RDBMS) software can be manually installed on the EC2 instance, or an Amazon Machine Image (AMI) for the particular use case can be selected from the AWS catalog to remove manual steps from the process. While this option provides the highest degree of flexibility, it also requires the most database configuration and administration knowledge.

Many customers find it useful to leverage Amazon Relational Database Service (RDS) to establish database clusters and instances for their GIS environments. RDS can create full-featured Microsoft SQL Server, Oracle, PostgreSQL, MySQL, or MariaDB databases. AWS allows the selection of specific instance types to focus on memory or compute optimization in a variety of configurations. Databases can be deployed across multiple Availability Zones (AZs) to establish fault tolerance or improve performance. Using RDS dramatically simplifies database administration and decreases the time required to select, provision, and configure your geospatial database with the technical parameters that meet your business requirements.

Amazon Aurora provides a path to highly capable and performant relational databases that are compatible with open source engines. PostgreSQL-compatible or MySQL-compatible environments can be created with specific settings for the desired capabilities. Although this may mean converting data from a source format, such as Microsoft SQL Server or Oracle, the overall cost savings and simplified management make this an attractive option to modernize and right-size any geospatial database.

In addition to standard relational database options, AWS provides other services to manage and use geospatial data. Amazon Redshift is a widely used cloud data warehouse that supports geospatial data through the geometry data type. Users can query spatial data with Redshift’s built-in SQL functions to find the distance between two points, interrogate polygon relationships, and derive other location insights from their data. Amazon DynamoDB is a fully managed, key-value NoSQL database with an SLA of up to 99.999% availability. For organizations leveraging MongoDB, Amazon DocumentDB provides a fully managed, MongoDB-compatible option for simplified instantiation and management. Finally, AWS offers the Amazon OpenSearch Service for petabyte-scale data storage, search, and visualization.
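To illustrate what a "distance between two points" query computes, here is a small Python sketch of the haversine great-circle formula. It is not Redshift's implementation; in the warehouse you would call a SQL spatial function (Redshift documents functions such as ST_DistanceSphere), and the spherical Earth radius used here is a simplifying assumption.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumes a perfect sphere


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# One degree of longitude along the equator.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 3))
```

A spherical model is usually adequate for coarse distance filters; workloads that need survey-grade accuracy typically rely on an ellipsoidal model instead.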

The best part is that you don’t have to choose a single option for your geospatial environment. Often, companies find that different workloads benefit from having the ability to choose the most appropriate data landscape. Combining Infrastructure as a Service (IaaS) workloads with fully managed databases and modern databases is not only possible but a signature of a well-architected geospatial environment. Transactional systems may benefit from relational geodatabases, while mobile applications may be more aligned with NoSQL data stores. When you operate in a world of consumption-based resources, there is no downside to using the most appropriate data store for each workload. Having familiarity with the cloud options for storing geospatial data is crucial in strategic planning, which we will cover in the next topic.

Building your geospatial data strategy

One of the most important concepts to consider in your geospatial data strategy is the amount of change you are willing to accept in your technical infrastructure. This does not apply to new systems, but most organizations will have a treasure trove of geospatial data already. While lifting and shifting on-premises workloads to the cloud is advantageous, adapting your architecture to the cloud will amplify benefits in agility, resiliency, and cost optimization. For example, 95% of AWS customers elect to use open source geospatial databases as part of their cloud migration. This data conversion process, from vendor relational databases such as Oracle and Microsoft SQL Server to open source options such as PostgreSQL, enjoys a high degree of compatibility. This is an example of a simple change that can be made to eliminate significant license usage costs when migrating to the cloud. Simple changes such as these provide immediate and tangible benefits to geospatial practitioners in cloud architectures. Often, the same capabilities can be provided in AWS for a significantly reduced cost profile when comparing the cloud to on-premises GIS architectures.

All the same concepts and technologies you and your team are used to when operating an on-premises environment exist on AWS. Stemming from the consumption-based pricing model and broad set of EC2 instances available, AWS can offer a much more flexible model for the configuration and consumption of compute resources. Application servers used in geospatial environments can be migrated directly by selecting the platform, operating system, version, and dependencies appropriate for the given workload. Additional consideration should be given in this space to containerization where feasible. Leveraging containers in your server architecture can speed up environment migrations and provide additional scaling options.

Preventing unauthorized access

A key part of building your geospatial data strategy is determining the structure and security of your data. AWS Identity and Access Management (IAM) serves as the foundation for defining authorization and authentication mechanisms in your environment. Single Sign-On (SSO) is commonly used to integrate with existing directories to leverage pre-existing hierarchies and permission methodologies. The flexibility of AWS allows you to bring the existing security constructs while expanding the ability to monitor, audit, and rectify security concerns in your GIS environment. It is highly recommended to encrypt most data; however, the value of encrypting unaltered public data can be debated. Keys should be regularly rotated and securely handled in accordance with any existing policies or guidelines from your organization.
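As a rough illustration of how authorization can be scoped, the sketch below assembles an IAM policy document that grants read-only access to a single prefix of a geospatial bucket. The bucket name, prefix, and function name are hypothetical, and a real policy should be reviewed against your organization's security guidelines before use.

```python
import json


def read_only_prefix_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document allowing list and read access under one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys under the given prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and prefix names.
policy = read_only_prefix_policy("example-geospatial-bucket", "imagery/public")
print(json.dumps(policy, indent=2))
```

Keeping the policy as data makes it easy to generate one narrowly scoped document per team or dataset rather than sharing a single broad grant.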

As changes take place within your architecture, alerts and notifications provide critical insight to stewards of the environment. Amazon Simple Notification Service (SNS) integrates with many AWS services to send emails or text messages to the appropriate teams or individuals when performance or security needs attention. Budgets and cost management alerts are native to AWS, making it easy to manage multiple accounts and environments based on your organization’s key performance indicators. Part of developing a cloud geospatial data strategy should be to ask internally where data issues are going unnoticed or unaddressed. By creating business rules, thresholds, and alerts, you can ensure that administrators are notified when data anomalies arise and specific areas within your data environment need attention.
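A business rule with a threshold can be as simple as the following sketch, which flags a batch of records when too many have missing or out-of-range coordinates. The record layout (`lon` and `lat` keys), the 1% default threshold, and the names used are assumptions for illustration; delivering the message (for example, through an SNS topic) is left as a comment rather than implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QualityAlert:
    invalid_count: int
    total_count: int
    message: str


def is_valid_coordinate(lon, lat) -> bool:
    """Business rule: both values must be present and within geographic bounds."""
    return (
        lon is not None
        and lat is not None
        and -180 <= lon <= 180
        and -90 <= lat <= 90
    )


def check_records(records, max_invalid_ratio: float = 0.01) -> Optional[QualityAlert]:
    """Return an alert when the share of invalid records exceeds the threshold."""
    total = len(records)
    invalid = sum(
        1 for r in records if not is_valid_coordinate(r.get("lon"), r.get("lat"))
    )
    if total == 0 or invalid / total <= max_invalid_ratio:
        return None
    return QualityAlert(
        invalid_count=invalid,
        total_count=total,
        message=f"{invalid} of {total} records have missing or out-of-range coordinates",
    )


# Hypothetical batch: the second record has an impossible longitude.
alert = check_records([{"lon": -95.37, "lat": 29.76}, {"lon": 200.0, "lat": 10.0}])
if alert is not None:
    # In a real pipeline, this message could be published to a notification topic.
    print(alert.message)
```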

The last mile in data consumption

A commonly overlooked aspect of a geospatial data management strategy is the set of desktop end-user tools needed to manage and use the environment. Many GIS environments are dependent on high-powered desktop machines used by specialists. The graphics requirements for rendering spatial data into a consumable image can be high, and the data throughput must support fluid panning and zooming through the data. Complications can arise when the user has a high-latency connection to the data. Many companies learned this the hard way during the COVID-19 pandemic, when remote workers tried to continue business as usual from home. Traditional geospatial landscapes were designed for power users to be in the office. Gigabit connectivity was a baseline requirement, and network outages meant that highly paid specialists were unable to do their work.

Virtual desktops have evolved, and continue to evolve, to provide best-in-class experiences for power users who are not co-located with their data. Part of a well-architected geospatial data management strategy is to store once, use many times. This principle takes a back seat when performance at the point of use is unacceptable. A short-term fix is to cache the data locally, but that brings a host of other cost and concurrency problems. Virtual desktops or Desktop-as-a-Service (DaaS) address this problem by keeping the compute close to the data. The user can be thousands of miles away and still enjoy a fluid graphical experience. Amazon WorkSpaces and Amazon AppStream provide this capability in the cloud. WorkSpaces provides a complete desktop environment for Windows or Linux that can be configured exactly as your specialists’ desktops are today. AppStream adds desktop shortcuts to a specialist’s local desktop and streams the application visuals so that it behaves like a native application. Having access to the native geospatial data management tools as part of a cloud-based architecture results in a more robust and cohesive overall strategy.

Leveraging your AWS account team

AWS provides corporations and organizational customers