


Serverless Analytics with Amazon Athena

Query structured, unstructured, or semi-structured data in seconds without setting up any infrastructure

Anthony Virtuoso

Mert Turkay Hocanin

Aaron Wishnick

BIRMINGHAM—MUMBAI

Serverless Analytics with Amazon Athena

Copyright © 2021 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kunal Parikh

Publishing Product Manager: Devika Battike

Senior Editor: David Sugarman

Content Development Editor: Joseph Sunil

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Project Coordinator: Aparna Nair

Proofreader: Safis Editing

Indexer: Tejal Soni

Production Designer: Shankar Kalbhor

First published: November 2021

Production reference: 1131021

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80056-234-9

www.packt.com

To my wife, Cristina, thank you for the support and understanding as I spent late nights and early mornings working on this book. I also appreciate all the laughs we had over my terrible spelling. For my sons, Luca and Massimo, who worked on their own pop-up books alongside me; I'll be first in line for an advanced copy of your books.

– Anthony Virtuoso

I dedicate this book to my wife, Subrina, who has been incredibly supportive, and our son, Tristan, who was born while writing this book. Without the both of you and the encouragement and love you gave me, this book would not have been possible. I also want to thank my parents, siblings, and everyone else who helped make this possible.

– Mert Turkay Hocanin

Foreword

Creating a data strategy is a top priority for leading organizations. That's because with any major initiative, from creating new experiences to building new revenue streams, leaders must be able to quickly gather insights and get to the truth. Data-driven organizations seek the truth by treating data like an organizational asset, no longer the property of individual departments. They set up processes to collect and store valuable data. Their data is democratized, meaning it's available to the right people and systems that need it. And their data is used to build new and innovative products that use data and machine learning (ML) to deliver new customer experiences.

AWS offers the broadest and deepest set of services for analytics and ML, and Amazon Athena is a key pillar of our offerings. Amazon Athena is a serverless analytics service that enables customers to use standard SQL to analyze all the data in their Amazon S3 data lakes, their data warehouses, and their transactional databases, as well as data that lives on-premises, in SaaS applications, and in other clouds. In other words, with Athena, you can query all your data from a single place using a language familiar to most analysts, using any business intelligence or ML tools you'd like. It's really all about having all your data at your fingertips.

I am incredibly lucky to have worked on creating and launching virtually all of the analytics offerings from AWS over the past decade. I was part of the team that created the original vision for Athena and launched the service in 2016. We created Athena because customers wanted a way to query all their data, both the structured data from databases as well as the semi-structured and unstructured data in their data lakes and other data sources, without having to manage infrastructure or give up SQL or the standard tools they were already using. We launched Athena at re:Invent 2016 and have been iterating on and improving the service ever since.

Mert, Aaron, and Anthony were founding members of the Amazon Athena team and have played pivotal roles in defining, building, and evolving the service. They are deeply passionate engineers who love helping customers succeed with Athena and with analytics overall. At AWS, the vast majority of our roadmap is driven by working closely with our customers, understanding their requests and priorities and bringing them into our services. Mert, Aaron, and Anthony are customer-obsessed, always looking for ways to help customers get more from Athena, and they have an innate ability to teach and bring people along. I'm so grateful they chose to write this book to share their expertise with all of us.

This book, like Amazon Athena, is designed to get you up and running with queries with minimal upfront setup and work. You'll progress from running simple queries to building sophisticated, automated pipelines to work with near-real-time event data, queries to external data sources, custom functions, and more, all while learning from Mert, Aaron, and Anthony's experience working with real-world customer scenarios.

I highly recommend this book to any new or existing customers looking to transform their business with data and with Amazon Athena.

Rahul Pathak, VP, AWS Analytics

Contributors

About the authors

Anthony Virtuoso works as a principal engineer at Amazon and holds multiple patents in distributed systems, software-defined networks, and security. In his 8 years at Amazon, he has helped launch several Amazon web services, the most recent of which was Amazon Managed Blockchain. As one of the original authors of Athena Query Federation, you'll often find him lurking on the Athena Federation GitHub repository answering questions and shipping bug fixes. When not at work, Anthony obsesses over a different set of customers, namely his wife and two little boys, aged 2 and 5. His kids enjoy doing science experiments with their dad, such as 3D printing toys, building with LEGO, or searching the local pond for tardigrades.

Mert Turkay Hocanin is a principal big data architect at Amazon Web Services within the AWS Glue and AWS Lake Formation services and has previously worked for several other services, including Amazon Athena, Amazon EMR, and Amazon Managed Blockchain. During his time at AWS, he has worked with several Fortune 500 companies on some of the largest data lakes in the world and was involved with the launching of three Amazon web services. Prior to being a big data architect, he was a senior software developer within Amazon's retail systems organization, building one of the earliest data lakes in the company in 2013. When he is not helping customers build data lakes, he enjoys spending time with his wife, Subrina, and son, Tristan, and exploring New York City.

Aaron Wishnick works as a senior software engineer at Amazon, where he has been for 7 years. During that time, he has worked on Amazon's payment systems and financial intelligence systems, as well as working for AWS on Athena and AWS Proton. When not at work, Aaron and his fiancée, Alyssa, are on a quest to determine just how much dog fur is too much, with their husky and malamute, Mina and Wally.

About the reviewers

Seth Denney is a software engineer who has spent most of his career in big data analytics, building infrastructure and query engines to support a wide variety of use cases at companies including Amazon and Google. While on the AWS Athena team, he was intimately involved with the Lake Formation and Query Federation projects, to name a few.

Janak Agarwal has been the product manager for Amazon Athena since he joined AWS in December 2018. Prior to joining AWS, Janak was at Microsoft for 9+ years, where he led a team of engineers for Microsoft Office 365. He also co-founded CourseKart, an e-learning platform in India, and TaskUnite, a medical technology company in the US. Janak holds a master's in electrical engineering from USC and an MBA from the Wharton School.

Table of Contents

Preface

Section 1: Fundamentals Of Amazon Athena

Chapter 1: Your First Query

Technical requirements

What is Amazon Athena?

Use cases

Separation of storage and compute

Obtaining and preparing sample data

Running your first query

Creating your first table

Running your first analytics queries

Summary

Chapter 2: Introduction to Amazon Athena

Technical requirements

Getting to know Amazon Athena

Understanding the "serverless" trend

Beyond "serverless" with 'fully managed' offerings

Key features

What is Presto?

Understanding scale and latency

TableScan performance

Memory-bound operations

Writing results

Metering and billing

Additional costs

File formats affect cost and performance

Cost controls

Connecting and securing

Determining when to use Amazon Athena

Ad hoc analytics

Adding analytics features to your application

Serverless ETL pipeline

Other use cases

Summary

Further reading

Chapter 3: Key Features, Query Types, and Functions

Technical requirements

Running ETL queries

Using CREATE-TABLE-AS-SELECT

Using INSERT-INTO

Running approximate queries

Organizing workloads with WorkGroups and saved queries

Using Athena's APIs

Summary

Section 2: Building and Connecting to Your Data Lake

Chapter 4: Metastores, Data Sources, and Data Lakes

Technical requirements

What is a metastore?

Data sources, connectors, and catalogs

Databases and schemas

Tables/datasets

What is a data source?

S3 data sources

Other data sources

Registering S3 datasets in your metastore

Using Athena CREATE TABLE statements

Using Athena's Create Table wizard

Using the AWS Glue console

Using AWS Glue Crawlers

Discovering your datasets on S3 using AWS Glue Crawlers

How do AWS Glue Crawlers work?

AWS Glue Crawler best practices for Athena

Designing a data lake architecture

Stages of data

Transforming data using Athena

Summary

Further reading

Chapter 5: Securing Your Data

Technical requirements

General best practices to protect your data on AWS

Separating permissions based on IAM users, roles, or even accounts

Least privilege for IAM users, roles, and accounts

Rotating IAM user credentials frequently

Blocking public access on S3 buckets

Enabling data and metadata encryption and enforcing it

Ensuring that auditing is enabled

Good intentions cannot replace good mechanisms

Encrypting your data and metadata in Glue Data Catalog

Encrypting your data

Encrypting your metadata in Glue Data Catalog

Enabling coarse-grained access controls with IAM resource policies for data on S3

Enabling FGACs with Lake Formation for data on S3

Auditing with CloudTrail and S3 access logs

Auditing with AWS CloudTrail

Auditing with S3 server access logs

Summary

Further reading

Chapter 6: AWS Glue and AWS Lake Formation

Technical requirements

What AWS Glue and AWS Lake Formation can do for you

Securing your data lake with Lake Formation

What AWS Lake Formation governed tables can do for you

Summary

Further reading

Section 3: Using Amazon Athena

Chapter 7: Ad Hoc Analytics

Technical requirements

Understanding the ad hoc analytics hype

Building an ad hoc analytics strategy

Choosing your storage

Sharing data

Selecting query engines

Deploying to customers

Using QuickSight with Athena

Getting sample data

Setting up QuickSight

Using Jupyter Notebooks with Athena

pandas

Matplotlib and Seaborn

SciPy and NumPy

Using our notebook to explore

Summary

Chapter 8: Querying Unstructured and Semi-Structured Data

Technical requirements

Why isn't all data structured to begin with?

Querying JSON data

Reading our customer's dataset

Parsing JSON fields

Other considerations when reading JSON

Querying comma-separated value and tab-separated value data

Querying arbitrary log data

Doing full log scans on S3

Reading application log data

Summary

Further reading

Chapter 9: Serverless ETL Pipelines

Technical requirements

Understanding the uses of ETL

ETL for integration

ETL for aggregation

ETL for modularization

ETL for performance

Deciding whether to ETL or query in place

Designing ETL queries for Athena

Don't forget about performance

Begin with integration points

Use an orchestrator

Using Lambda as an orchestrator

Creating an ETL function

Coding the ETL function

Testing your ETL function

Triggering ETL queries with S3 notifications

Summary

Chapter 10: Building Applications with Amazon Athena

Technical requirements

Connecting to Athena

JDBC and ODBC

Which one should I use?

Best practices for connecting to Athena

Idempotency tokens

Query tracking

Securing your application

Credential management

Network safety

Optimizing for performance and cost

Workload isolation

Application monitoring

CTAS for large result sets

Summary

Chapter 11: Operational Excellence – Monitoring, Optimization, and Troubleshooting

Technical requirements

Monitoring Athena to ensure queries run smoothly

Optimizing for cost and performance

Troubleshooting failing queries

Summary

Further reading

Section 4: Advanced Topics

Chapter 12: Athena Query Federation

Technical requirements

What is Query Federation?

Athena Query Federation features

How Athena Connectors work

Using Lambda for big data

Federating queries across VPCs

Using pre-built Connectors

Building a custom connector

Setting up your development environment

Writing your connector code

Summary

Chapter 13: Athena UDFs and ML

Technical requirements

What are UDFs?

Writing a new UDF

Setting up your development environment

Writing your UDF code

Building your UDF code

Deploying your UDF code

Using your UDF

Using built-in ML UDFs

Pre-setup requirements

Setting up your SageMaker notebook

Using our notebook to train a model

Using our trained model in an Athena UDF

Summary

Chapter 14: Lake Formation – Advanced Topics

Reinforcing your data perimeter with Lake Formation

Establishing a data perimeter

Shared responsibility security model

How Lake Formation can help

Understanding the benefits of governed tables

ACID transactions on S3-backed tables

Summary

Further reading

Other Books You May Enjoy

Preface

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL, without needing to manage any infrastructure.

This book begins with an overview of the serverless analytics experience offered by Athena and teaches you how to build and tune an S3 data lake using Athena, including how to structure your tables using open source file formats such as Parquet. You'll learn how to build, secure, and connect to a data lake with Athena and Lake Formation. Next, you'll cover key tasks such as ad hoc data analysis, working with ETL pipelines, monitoring and alerting KPI breaches using CloudWatch Metrics, running customizable connectors with AWS Lambda, and more. Moving ahead, you'll work through easy integrations, troubleshooting and tuning common Athena issues, and the most common reasons for query failure, as well as reviewing tips for diagnosing and correcting failing queries in your pursuit of operational excellence. Finally, you'll explore advanced concepts such as Athena Query Federation and Athena ML to generate powerful insights without needing to touch a single server.

By the end of this book, you'll be able to build and use a data lake with Amazon Athena to add data-driven features to your app and perform the kind of ad hoc data analysis that often precedes many of today's ML modeling exercises.

Who this book is for

BI analysts, application developers, and system administrators who are looking to generate insights from an ever-growing sea of data while controlling costs and limiting operational burdens will find this book helpful. Basic SQL knowledge is expected to make the most out of this book.

What this book covers

Chapter 1, Your First Query, is all about orienting you to the serverless analytics experience offered by Amazon Athena. For now, we will simplify things in order to run your first queries and demonstrate why so many people choose Amazon Athena for their workloads. This will help establish your mental model for the deeper discussions, features, and examples of later sections.

Chapter 2, Introduction to Amazon Athena, continues your introduction to Athena by discussing the service's capabilities, scalability, and pricing. You'll learn when to use Amazon Athena and how to estimate the performance and costs of your workloads before building them on Athena. We'll also take a look behind the scenes to see how Athena uses PrestoDB, an open source SQL engine from Facebook, to process your queries.

Chapter 3, Key Features, Query Types, and Functions, concludes our introduction to Amazon Athena by exploring built-in features you can use to make your reports or application more powerful. This includes approximate query techniques to speed up analysis of large datasets and Create Table As Select (CTAS) statements for running queries that generate significant amounts of result data.

Chapter 4, Metastores, Data Sources, and Data Lakes, teaches you what a metastore is and what it contains. We will introduce the Apache Hive and AWS Glue Data Catalog implementations of a metastore. We'll then learn how to create tables through Athena or discover datasets in S3 using AWS Glue crawlers. We then focus on a typical data lake architecture, which contains three different stages for data.

Chapter 5, Securing Your Data, covers the various methods that can be employed to secure your data and ensure it can only be viewed by those that have permission to do so.

Chapter 6, AWS Glue and AWS Lake Formation, demonstrates step by step how to build a secure data lake in Lake Formation and how Athena interacts with Lake Formation to keep data safe.

Chapter 7, Ad Hoc Analytics, focuses on how you can use Athena to quickly get to know your data, look for patterns, find outliers, and generally surface insights that will help you get the most from your data.

Chapter 8, Querying Unstructured and Semi-Structured Data, shows how Amazon Athena combines a traditional query engine, and its requirement for an upfront schema, with extensions that allow it to handle data that contains varying or no schema.

Chapter 9, Serverless ETL Pipelines, continues with the theme of controlling chaos by using automation to normalize newly arrived data through a process known as extract, transform, load (ETL).

Chapter 10, Building Applications with Amazon Athena, tells you what to do when integrating Amazon Athena into your applications. How will the application make Athena calls? How should credentials be stored? Should you use JDBC, ODBC, or Athena's SDK? What are the best practices for setting up connectivity between your application and Athena, and what are the security considerations? Lastly, what is the best way to store your data on S3 to optimize speed and cost? This chapter will answer all these questions and give examples – including working code – to get you started integrating with Athena quickly, easily, and securely.

Chapter 11, Operational Excellence – Monitoring, Optimization, and Troubleshooting, focuses on operational excellence by looking at what could go wrong when using Athena in a production environment. We'll learn how to monitor and alert on KPI breaches – such as queue dwell times – using CloudWatch metrics so you can avoid surprises. You'll also see how to optimize your data and queries to avoid problems before they happen. We'll then look at how the layout of data stored in S3 can have a significant impact on both cost and performance. Lastly, we will look at the most common reasons for query failure and review tips to help diagnose and correct failing queries.

Chapter 12, Athena Query Federation, is all about getting the most out of Amazon Athena by using Athena's Query Federation capabilities to expand beyond queries over data in S3. We will illustrate how Query Federation allows you to combine data from multiple sources (for example, S3 and Elasticsearch) to provide a single source of truth for your queries. Then we will peel back the hood and explain how Amazon Athena uses AWS Lambda to run customizable connectors. We will even write our own connector in order to show you how easy it is to customize Athena with your own code.

Chapter 13, Athena UDFs and ML, continues the theme of enhancing Amazon Athena with our own functionality by adding our own user-defined functions and machine learning models. These capabilities allow us to do everything from applying ML inference to identify suspicious records in our dataset to converting port numbers in a VPC flow log to the common name for that port (for example, HTTP). In all of these examples, we add our own logic to Athena's row-level processing without the need to run any servers of our own.

Chapter 14, Lake Formation – Advanced Topics, covers some of the advanced features that Lake Formation brings to the table, and explores various use cases that are enabled by these features.

To get the most out of this book

To work on the technologies in this book, you will need a computer with a Chrome, Safari, or Microsoft Edge browser and AWS CLI version 2 installed.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Please ensure that you close any outstanding AWS instances after you are done working on them so that you don't incur unnecessary expenses.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Serverless-Analytics-with-Amazon-Athena. If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781800562349_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "We simply specify a SYSTEM_TIME that Athena will use to set the read point in the transaction log."

A block of code is set as follows:

try:
    sink.writeFrame(new_and_updated_impressions_dataframe)
    glueContext.commit_transaction(txid1)
except:
    glueContext.abort_transaction(txid1)

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

"inventory_id","item_name","available_count"

"1","A simple widget","5"

"2","A more advanced widget","10"

"3","The most advanced widget","1"

"4","A premium widget","0"

"5","A gold plated widget","9"

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Administrators can set a workgroup to encrypt query results. In the workgroup settings, set query results to be encrypted using SSE-KMS, CSE-KMS, or SSE-S3 and check the Override client-side settings."

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you've read Serverless Analytics with Amazon Athena, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.

Section 1: Fundamentals Of Amazon Athena

In this section, you will run your first Athena queries and establish an understanding of key Athena concepts that will be put into practice in later sections.

This section consists of the following chapters:

Chapter 1, Your First Query
Chapter 2, Introduction to Amazon Athena
Chapter 3, Key Features, Query Types, and Functions

Chapter 1: Your First Query

This chapter is all about introducing you to the serverless analytics experience offered by Amazon Athena. Data is one of the most valuable assets you and your company generate. In recent years, we have seen a revolution in data retention, where companies are capturing all manner of data that was once ignored. Everything from logs to clickstream data to support tickets is now routinely kept for years. Interestingly, the data itself is not what is valuable. Instead, the insights that are buried in that mountain of data are what we are after. Certainly, increased awareness and retention have made the information we need to power our businesses, applications, and decisions more available, but the explosion in data sizes has made the insights we seek less accessible. What could once fit nicely in a traditional RDBMS, such as Oracle, now requires a distributed filesystem such as HDFS and an accompanying Massively Parallel Processing (MPP) engine such as Spark to run even the most basic of queries in a timely fashion.

Enter Amazon Athena. Unlike traditional analytics engines, Amazon Athena is a fully managed offering. You will never have to set up any servers or tune cryptic settings to get your queries running. This allows you to focus on what is most important: using data to generate insights that drive your business. This ease of use is precisely why this first chapter is all about getting hands-on and running your first query. Whether you are a seasoned analytics veteran or a newcomer to the space, this chapter will give you the knowledge you need to be running your first Athena query in less than 30 minutes. For now, we will simplify things to demonstrate why so many people choose Amazon Athena for their workloads. This will help establish your mental model for the deeper discussions, features, and examples of later sections.

In this chapter, we will cover the following topics:

What is Amazon Athena?
Obtaining and preparing sample data
Running your first query

Technical requirements

Wherever possible, we will provide samples or instructions to guide you through the setup. However, to complete the activities in this chapter, you will need to ensure you have the following prerequisites available. Our command-line examples will be executed using Ubuntu, but most flavors of Linux should also work without modification.

You will need internet access to GitHub, S3, and the AWS Console.

You will also require a computer with the following installed:

Chrome, Safari, or Microsoft Edge
The AWS CLI

In addition, this chapter requires you to have an AWS account and accompanying IAM user (or role) with sufficient privileges to complete the activities in this chapter. Throughout this book, we will provide detailed IAM policies that attempt to honor the age-old best practice of "least privilege." For simplicity, you can always run through these exercises with a user that has full access, but we recommend that you use scoped-down IAM policies to avoid making costly mistakes and to learn more about how to best use IAM to secure your applications and data. You can find the suggested IAM policy for this chapter in this book's accompanying GitHub repository, listed as chapter_1/iam_policy_chapter_1.json:

https://github.com/PacktPublishing/Serverless-Analytics-with-Amazon-Athena/tree/main/chapter_1

This policy includes the following:

Read and Write access to one S3 bucket, using the following actions:
  s3:PutObject: Used to upload data and also for Athena to write query results.
  s3:GetObject: Used by Athena to read data.
  s3:ListBucketMultipartUploads: Used by Athena to write query results.
  s3:AbortMultipartUpload: Used by Athena to write query results.
  s3:ListBucketVersions
  s3:CreateBucket: Used by you if you don't already have a bucket you can use.
  s3:ListBucket: Used by Athena to read data.
  s3:DeleteObject: Used to clean up if you made a mistake or would like to reattempt an exercise from scratch.
  s3:ListMultipartUploadParts: Used by Athena to write a result.
  s3:ListAllMyBuckets: Used by Athena to ensure you own the results bucket.
  s3:ListJobs: Used by Athena to write results.
Read and Write access to one Glue Data Catalog database, using the following actions:
  glue:DeleteDatabase: Used to clean up if you made a mistake or would like to reattempt an exercise from scratch.
  glue:GetPartitions: Used by Athena to query your data in S3.
  glue:UpdateTable: Used when we import our sample data.
  glue:DeleteTable: Used to clean up if you made a mistake or would like to reattempt an exercise from scratch.
  glue:CreatePartition: Used when we import our sample data.
  glue:UpdatePartition: Used when we import our sample data.
  glue:UpdateDatabase: Used when we import our sample data.
  glue:CreateTable: Used when we import our sample data.
  glue:GetTables: Used by Athena to query your data in S3.
  glue:BatchGetPartition: Used by Athena to query your data in S3.
  glue:GetDatabases: Used by Athena to query your data in S3.
  glue:GetTable: Used by Athena to query your data in S3.
  glue:GetDatabase: Used by Athena to query your data in S3.
  glue:GetPartition: Used by Athena to query your data in S3.
  glue:CreateDatabase: Used to create a database if you don't already have one you can use.
  glue:DeletePartition: Used to clean up if you made a mistake or would like to reattempt an exercise from scratch.
Access to run Athena queries.
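To make the shape of such a policy concrete, here is an abridged, hypothetical sketch showing how a few of the S3 and Glue actions listed above might be granted. The bucket name is a placeholder, and the authoritative, complete policy is the chapter_1/iam_policy_chapter_1.json file in the repository:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3ReadWriteForAthena",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME",
        "arn:aws:s3:::YOUR_BUCKET_NAME/*"
      ]
    },
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": ["glue:GetTable", "glue:GetDatabase", "glue:GetPartitions"],
      "Resource": "*"
    }
  ]
}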

Important Note

We recommend against using Firefox with the Amazon Athena console as we have found, and reported, a bug associated with switching between certain elements in the UX.

What is Amazon Athena?

Amazon Athena is a query service that allows you to run standard SQL over data stored in a variety of sources and formats. As you will see later in this chapter, Athena is serverless, so there is no infrastructure to set up or manage. You simply pay $5 per TB scanned for the queries you run without needing to worry about idle resources or scaling.
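As a rough worked example at that $5 per TB rate, a query that scans 200 GB of data costs about 200/1,024 TB x $5, or roughly $0.98, while a query over a few megabytes costs only a fraction of a cent (Athena bills a minimum of 10 MB scanned per query).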

Note

AWS has a habit of reducing prices over time. For the latest Athena pricing, please consult the Amazon Athena product page at https://aws.amazon.com/athena/pricing/?nc=sn&loc=3.

Athena is based on Presto (https://prestodb.io/), a distributed SQL engine that's open sourced by Facebook. It supports ANSI SQL, as well as Presto SQL features ranging from geospatial functions to approximate query extensions, which allow you to run approximate queries, with statistically bounded errors, over large datasets in only a fraction of the time. Athena's commitment to open source also provides an interesting avenue to avoid lock-in concerns because you always have the option to download and manage your own Presto deployment from GitHub. Of course, you will lose many of Athena's enhancements and must manage the infrastructure yourself, but you can take comfort in knowing you are not beholden to potentially punitive licensing agreements as you might be with other vendors.
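As a quick taste of what those approximate extensions look like, the following is a minimal sketch that assumes the nyc_taxi table we create later in this chapter; approx_distinct trades a small, statistically bounded error for a much cheaper aggregation than an exact COUNT(DISTINCT ...):

-- Approximate the number of distinct pickup locations in the taxi data
SELECT approx_distinct(pulocationid) AS approx_pickup_locations
FROM packt_serverless_analytics.nyc_taxi;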

While Athena's roots are open source, the team at AWS have added several enterprise features to the service, including the following:

Federated identity via SAML and Active Directory support
Table, column, and even row-level access control via Lake Formation
Workload classification and grouping for cost control via WorkGroups
Automated regression testing to take the pain out of upgrades

Later chapters will cover these topics in greater detail. If you feel compelled to do so, you can use the table of contents to skip directly to those chapters and learn more.

Let's look at some use cases for Athena.

Use cases

Amazon Athena supports a wide range of use cases and we have personally used it for several different patterns. Thanks to Athena's ease of use, it is extremely common to leverage Athena for ad hoc analysis and data exploration.

Later in this book, you will use Athena from within a Jupyter notebook for machine learning. Similarly, many analysts enjoy using Athena directly from BI tools such as Looker and Tableau, courtesy of Athena's JDBC driver. Athena's robust SQL dialect and asynchronous API model also enable application developers to build analytics right into their applications, enabling features that would not previously have been practical due to scale or operational burden. In many cases, you can replace RDBMS-driven features with Athena at a fraction of the cost and lower operational burden.
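To illustrate what that asynchronous API model looks like from application code, here is a minimal sketch using boto3. The database, table, and result bucket are placeholders taken from this chapter's examples, and a production application would add error handling and backoff:

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start the query; Athena returns immediately with a query execution ID.
response = athena.start_query_execution(
    QueryString="SELECT count(*) FROM packt_serverless_analytics.nyc_taxi",
    QueryExecutionContext={"Database": "packt_serverless_analytics"},
    ResultConfiguration={"OutputLocation": "s3://packt-serverless-analytics/results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes; a real application might use exponential backoff.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    # Fetch the first page of results; larger result sets are paginated.
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)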

Another emerging use case for Athena is in the ETL space. While Athena advertises itself as being an engine that avoids the need for ETL by being able to query the data in place, as it is, we have seen the benefits of replacing existing or building new ETL pipelines using Athena where cost and capacity management are key factors. Athena will not necessarily achieve the same scale or performance as Spark, for example, but if your ETL jobs do not require multi-TB joins, you might find Athena to be an interesting option.

Separation of storage and compute

If you are new to serverless analytics, you may be wondering where your data is stored. Amazon Athena builds on the concept of Separation of Storage and Compute to decouple the computational resources (for example, CPU, memory, network) that do the heavy lifting of executing your SQL queries from the responsibility of keeping your data safe and available. In short, this means Athena itself does not store your data. Instead, you are free to choose from several data stores, with customers increasingly pairing DynamoDB, for rapidly mutating data, with S3 for their bulk data. With Athena, you can easily write a query that spans both data stores.

Amazon's Simple Storage Service, or S3 for short, is easily the most recommended data store to use with Athena. When Athena launched in 2016, S3 was the first data store it supported. Unsurprisingly, Athena has been optimized to take advantage of S3's unique ability to deliver exabyte scale and throughput while still providing eleven nines (99.999999999%) of durability. In addition to effortless scaling from a few gigabytes of data up to many petabytes, S3 offers some of the lowest prices for performance that you can find. Depending on your replication requirements, storing 1 GB of data for a month will cost you between $0.01 and $0.023. Even the most cost-efficient enterprise hard drives cost around $0.21 per GB before you add on redundancy, the power to run them, or a server and data center to house them. As with most AWS services, you should consult S3's pricing page (https://aws.amazon.com/s3/pricing/) for the latest details since AWS has cut their prices more than 70 times in the last decade.

Metastore

In addition to accessing the raw 1s and 0s that represent your data, Athena also requires metadata that helps its SQL engine understand how to interpret the data you have stored in S3 or elsewhere. This supplemental information helps Athena map collections of files, or objects in the case of S3, to SQL constructs such as tables, columns, and rows. The repository for this data, about your data, is often called a metastore. Athena works with Hive-compliant metastores, including AWS's Glue Data Catalog service. In later chapters, we will look at AWS Glue Data Catalog in more detail, as well as how you can attach Athena to your own metastore, even a homegrown one. For now, all you need to know is that Athena requires the use of a metastore to discover key attributes of the data you wish to query. The most common pieces of information that are kept in the Metastore include the following:

A list of tables that exist
The storage location of each table (for example, the S3 path or DynamoDB table name)
The format of the files or objects that comprise the table (for example, CSV, Parquet, JSON)
The column names and data types in each table (for example, the inventory column is an integer, while revenue is a decimal(10,2))
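To see exactly this kind of metadata for yourself, you can ask Glue Data Catalog to describe a table once one exists. The following sketch assumes the packt_serverless_analytics database and nyc_taxi table that we create later in this chapter; the returned JSON includes the table's StorageDescriptor with its S3 Location, serialization format, and Columns:

aws glue get-table \
    --database-name packt_serverless_analytics \
    --name nyc_taxi \
    --region us-east-1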

Now that we have a good overview of Amazon Athena, let's look at how to use it in practice.

Obtaining and preparing sample data

Before we can start running our first query, we will need some data that we would like to analyze. Throughout this book, we will try to make use of open datasets that you can freely access but that also contain interesting information that may mirror your real-world datasets. In this chapter, we will be making use of the NYC Taxi & Limousine Commission's (TLC's) Trip Record Data for New York City's iconic yellow taxis. Yellow taxis have been recording and providing ride data to TLC since 2009. Yellow taxis are traditionally hailed by signaling to a driver who is on duty and seeking a passenger (also known as a street hail). In recent years, yellow taxis have also started to use their own ride-hailing apps such as Curb and Arro to keep pace with emerging ride-hailing technologies from Uber and Lyft. However, yellow taxis remain the only vehicles permitted to respond to street hails from passengers in NYC. For that reason, the dataset often has interesting patterns that can be correlated with other events in the city, such as a concert or inclement weather.

Our exercise will focus on just one of the many datasets offered by the TLC. The yellow taxis data includes the following fields:

VendorID: A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
tpep_pickup_datetime: The date and time when the meter was engaged.
tpep_dropoff_datetime: The date and time when the meter was disengaged.
Passenger_count: The number of passengers in the vehicle.
Trip_distance: The elapsed trip distance in miles reported by the taximeter.
RateCodeID: The final rate code in effect at the end of the trip. 1= Standard rate, 2= JFK, 3= Newark, 4= Nassau or Westchester, 5= Negotiated fare, 6= Group ride.
Store_and_fwd_flag: This flag indicates whether the trip record was held in the vehicle's memory before being sent to the vendor, also known as "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip, while N= not a store and forward trip.
pulocationid: Location where the meter was engaged.
dolocationid: Location where the meter was disengaged.
Payment_type: A numeric code signifying how the passenger paid for the trip. 1= Credit card, 2= Cash, 3= No charge, 4= Dispute, 5= Unknown, 6= Voided trip.
Fare_amount: The time-and-distance fare calculated by the meter.
Extra: Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
MTA_tax: $0.50 MTA tax that is automatically triggered based on the metered rate in use.
Improvement_surcharge: $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.
Tip_amount: This field is automatically populated for credit card tips. Cash tips are not included.
Tolls_amount: Total amount of all tolls paid in a trip.
Total_amount: The total amount charged to passengers. Does not include cash tips.
congestion_surcharge: Amount of surcharges associated with time/traffic fees imposed by the city.

This dataset is easy to obtain and is relatively interesting to run analytics against. The inconsistency in field naming, a mixture of camel case and underscore conventions, is difficult to overlook, but we will normalize it later:

Our first step is to download the Trip Record Data for June 2020. You can obtain this directly from the NYC TLC's website (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) or our GitHub repository using the following command:

wget https://github.com/PacktPublishing/Serverless-Analytics-with-Amazon-Athena/raw/main/chapter_1/yellow_tripdata_2020-06.csv.gz

If you choose to download it from the NYC TLC directly, please gzip the file before proceeding to the next step.

Now that we have some data, we can add it to our data lake by uploading it to Amazon S3. To do this, we must create an S3 bucket. If you already have an S3 bucket that you plan to use, you can skip creating a new bucket. However, we do encourage you to avoid completing these exercises in accounts that house production workloads. As a best practice, all experimentation and learning should be done in isolation. Once you have picked a bucket name and the region that you would like to use for these exercises, you can run the following command:

aws s3api create-bucket \
    --bucket packt-serverless-analytics \
    --region us-east-1

Important Note

Be sure to substitute your bucket name and region. You can also create buckets directly from the AWS Console by logging in and navigating to S3 from the service list. Later in this chapter, we will use the AWS Console to edit and run our Athena queries. For simple operations, using the AWS CLI can be faster and easier to see what is happening since the AWS Console can hide multi-step operations behind a single button.

Now that our bucket is ready, we can upload the data we would like to query. In addition to the bucket, we will want to put our data into a subfolder to keep things organized as we proceed through later exercises. We have an entire chapter dedicated to organizing and optimizing the layout of your data in S3. For now, let's just upload the data to a subfolder called tables/nyc_taxi using the following AWS CLI command. Be sure to replace the bucket name, packt-serverless-analytics, in the following example command with the name of your bucket:

aws s3 cp ./yellow_tripdata_2020-06.csv.gz \
    s3://packt-serverless-analytics/tables/nyc_taxi/yellow_tripdata_2020-06.csv.gz

This command may take a few moments to complete since it needs to upload our roughly 10 MB file over the internet to Amazon S3. If you get a permission error or message about access being denied, double-check you used the right bucket name.

If the command seems to have finished running without issue, you can use the following command to confirm the file is where we expect. Be sure to replace the example bucket with your actual bucket name:

aws s3 ls s3://packt-serverless-analytics/tables/nyc_taxi/

Now that we have confirmed our sample data is where we expect, we need to add this data to our Metastore, as described in the What is Amazon Athena? section. To do this, we will use AWS Glue Data Catalog as our Metastore by creating a database to house our table. Remember that Data Catalog will not store our data, just details about where engines such as Athena can find it (for example, S3) and what format was used to store the data (for example, CSV). Unlike Amazon S3, multiple accounts can have databases and tables with the same name so that you can use the following commands as-is, without the need to rename anything. If you already have a database that you would like to use, you can skip creating a new database, but be sure to substitute your database name into subsequent commands; otherwise, they will fail:

aws glue create-database \
    --database-input "{\"Name\":\"packt_serverless_analytics\"}" \
    --region us-east-1

Now that both our data and Metastore are ready, we can define our table right from Athena itself by running our first query.

Running your first query

Athena supports both Data Definition Language (DDL) and Data Manipulation Language (DML) queries. Queries where you SELECT data from a table are a common example of DML queries. Our first meaningful Athena query will be a DDL query that creates, or defines, our NYC Taxis data table:

Let's begin by ensuring our AWS account and IAM user/role are ready to use Athena. To do that, navigate to the Athena query editor in the AWS Console: https://console.aws.amazon.com/athena/home.

Be sure to use the same region that you uploaded your data and created your database in.

If this is your first time using Athena, you will likely be met by a screen like the following. Luckily, Athena is telling us that "Before you run your first query, you need to set up a query result location in Amazon S3…". Since Athena writes the results of all queries to S3, even DDL queries, we will need to configure this setting before we can proceed. To do so, click on the highlighted text in the AWS Console that's shown in the following screenshot:

Figure 1.1 – The prompt for setting the query result's location upon your first visit to Athena

After clicking on the modal's link, you will see the following prompt so that you can set your query result's location. You can use the same S3 bucket we used to upload our sample data, with results being used as the name of the folder that Athena will write query results to within that bucket. Be sure your location ends with a "/" to avoid errors:

Figure 1.2 – Athena's settings prompt for the query result's location

Next, let's learn how to create a table.

Creating your first table

It is now time to run our first Athena query. The following DDL query asks Athena to create a new table called nyc_taxi in the packt_serverless_analytics database, which is stored in the AWS Glue Data Catalog. The query also specifies the schema (columns), file format, and storage location of the table. For now, the other nuances of this create query are unimportant. You may find it easier to copy the CREATE TABLE statement from the create_nyc_taxi.sql (http://bit.ly/3mXj3K0) file in the chapter_1 folder of this book's GitHub repository. Paste it into Athena's query editor, change LOCATION so that it matches your bucket name, and click Run query. It should complete in a few seconds:

CREATE EXTERNAL TABLE `packt_serverless_analytics`.`nyc_taxi`(
  `vendorid` bigint,
  `tpep_pickup_datetime` string,
  `tpep_dropoff_datetime` string,
  `passenger_count` bigint,
  `trip_distance` double,
  `ratecodeid` bigint,
  `store_and_fwd_flag` string,
  `pulocationid` bigint,
  `dolocationid` bigint,
  `payment_type` bigint,
  `fare_amount` double,
  `extra` double,
  `mta_tax` double,
  `tip_amount` double,
  `tolls_amount` double,
  `improvement_surcharge` double,
  `total_amount` double,
  `congestion_surcharge` double)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<YOUR_BUCKET_NAME>/tables/nyc_taxi/'
TBLPROPERTIES (
  'areColumnsQuoted'='false',
  'columnsOrdered'='true',
  'compressionType'='gzip',
  'delimiter'=',',
  'skip.header.line.count'='1',
  'typeOfData'='file')

Once your table creation DDL query completes, the left navigation pane of the Athena console will refresh with the definition of your new table. If you have other databases and tables, you may need to choose your database from the dropdown before your new table will appear.

Figure 1.3 – Athena's Database navigator will show the schema of your newly created table
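If you prefer to confirm the new table from the query editor itself, a couple of simple statements will do the same job; this sketch assumes the database and table names used above:

SHOW TABLES IN packt_serverless_analytics;
DESCRIBE packt_serverless_analytics.nyc_taxi;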

At this point, the significance of the query we just ran may not be entirely apparent, but rest assured we will go deeper into why serverless DDL queries are a powerful thing. Oh, and did we mention that Athena does not charge for DDL queries?

Running your first analytics queries

When working with a new or unfamiliar set of data, it can be helpful to view a sample of the rows before exploring the dataset in more meaningful ways. This allows you to understand the schema of your dataset, including verifying that the schema (for example, column names) match the values and types. There are a few ways to do this, including the following limit query:

SELECT * from packt_serverless_analytics.nyc_taxi limit 100

This works fine in most cases, but we can do better. Many query engines, Athena included, will end up returning all 100 rows requested in the preceding query from the same S3 object. If your dataset contains many objects or files, you are getting an extremely narrow view of the table. For that reason, I prefer using the following query to view data from a broader portion of the dataset:

SELECT *
FROM packt_serverless_analytics.nyc_taxi TABLESAMPLE BERNOULLI (1)
limit 100

This query is like the earlier limit query but uses Athena's TABLESAMPLE feature to obtain our 100 requested rows using BERNOULLI sampling. When a table is sampled using the Bernoulli method, all the objects of the table may be scanned, as opposed to likely stopping after the first object. This is because the probability of a row being included in the result is independent of any other row, reducing the significance of the object scan order. In the following screenshot, we can see some of the rows that were returned using TABLESAMPLE with the BERNOULLI method:

Figure 1.4 – Results of executing TABLESAMPLE against our nyc_taxi table

While that query allowed us to confirm that Athena can indeed access our data and that the schema appears to match the data itself, we have not extracted any real insights from the data. For this, we will run our first real analytics query by generating a histogram of ride durations and distances. Our goal here is to learn how much time people are typically spending in taxis, but we'll also be able to gain insights into the quality of our data. The following query uses Athena's numeric_histogram function to approximate the distribution with 10 buckets according to the difference between tpep_pickup_datetime and tpep_dropoff_datetime. Since the dataset stores datetimes as strings, we are using the date_parse function to convert the values into actual timestamps that we can then use with Athena's date_diff function to generate the ride durations as minutes. Lastly, the query uses a CROSS JOIN with UNNEST to turn the histogram into rows and columns. Normally, the numeric_histogram function returns a map containing the histogram, but this can be difficult to read. UNNEST helps us turn it into a more intuitive tabular format. Do not worry about remembering all these functions and SQL techniques right now. Athena frequently adds new capabilities, and you can always consult a reference.

You can copy the following code from GitHub at http://bit.ly/2Jm6o5v:

SELECT ride_minutes, number_rides
FROM (SELECT numeric_histogram(10,
        date_diff('minute',
          date_parse(tpep_pickup_datetime, '%Y-%m-%d %H:%i:%s'),
          date_parse(tpep_dropoff_datetime, '%Y-%m-%d %H:%i:%s')
        ))
      FROM packt_serverless_analytics.nyc_taxi) AS x (ride_histogram)
CROSS JOIN
    UNNEST(ride_histogram) AS t (ride_minutes, number_rides);

Once you run the query, the results will look as follows. You can experiment with the number of buckets that are generated by adjusting the parameters of the numeric_histogram function. Generating 100 or even 1,000 buckets can uncover patterns that were hidden with fewer buckets. Even with just 10 buckets, we can already see a strong correlation between the ride duration and the number of rides. I was surprised to see that such a large portion of the yellow cab rides lasted less than 7 minutes. From this query, we can also see some likely data quality issues in the dataset. Unless one of the June 2020 rides happened in a time-traveling DeLorean, we likely have an erroneous record. Less obvious is the fact that several hundred rides claim to have lasted longer than 24 hours:

Figure 1.5 – Ride duration histogram results
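If you would like to chase down those suspicious records yourself, a simple follow-up query, sketched here using the same date_parse and date_diff pattern as above, counts the rides that claim to have lasted longer than 24 hours:

SELECT count(*) AS rides_over_24_hours
FROM packt_serverless_analytics.nyc_taxi
WHERE date_diff('minute',
    date_parse(tpep_pickup_datetime, '%Y-%m-%d %H:%i:%s'),
    date_parse(tpep_dropoff_datetime, '%Y-%m-%d %H:%i:%s')
  ) > 24 * 60;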

Let's try one more histogram query, but this time, we will target the trip distance of the rides that took less than 7 minutes. The following code block contains the modified histogram query you can run to understand that bucket of rides. You can download it from GitHub at http://bit.ly/3hkggJl:

SELECT trip_distance, number_rides
FROM
    (SELECT numeric_histogram(5, trip_distance)
       FROM packt_serverless_analytics.nyc_taxi
       WHERE date_diff('minute',
         date_parse(tpep_pickup_datetime, '%Y-%m-%d %H:%i:%s'),
         date_parse(tpep_dropoff_datetime, '%Y-%m-%d %H:%i:%s')
         ) <= 6.328061
    ) AS x (ride_histogram)
CROSS JOIN UNNEST(ride_histogram) AS t (trip_distance, number_rides);

Considering that the average person can walk a mile in 15 minutes, New Yorkers must be in a serious hurry to opt for taxi rides instead of a 15-minute walk!

Figure 1.6 – Ride distance histogram results

With that, we've been through the basics of Amazon Athena. Let's conclude by providing a recap of what we've learned.

Summary

In this chapter, you saw just how easy it is to get started running queries with Athena. We obtained sample data from the NYC TLC, used it to create a table in our S3-based data lake, and ran some analytics queries to understand the insights contained in that data. Since Athena is serverless, we spent absolutely no time setting up any infrastructure or software. Incredibly, all the operations we ran in this chapter cost less than $0.00135. Without the serverless aspect of Athena, we would have found ourselves purchasing many thousands of dollars of hardware or hundreds of dollars in cloud resources to run these basic exercises.

While the main goals of this chapter were to orient you to the uniquely serverless experience of using Amazon Athena, there are a few concepts worth remembering as you continue reading. The first is the role of the Metastore. We saw that uploading our data to S3 was not enough for Athena to query the data. We also needed to register the location, schema, and file format as a table in AWS Glue Data Catalog. Once our table was defined, it became queryable from Athena. Chapter 4, Metastores, Data Sources, and Data Lakes, will cover this topic in greater depth.

The next important thing we saw was the feature-rich SQL dialect we used in our basic analytics queries. Since Athena utilizes a customized variant of Presto, you can refer to Presto's documentation (https://prestodb.io/docs/current/) as a supplement for Athena's documentation.

Chapter 2, Introduction to Amazon Athena, will go deeper into Athena's capabilities and open source roots so that you can understand when to use Athena, as well as how you can gain deeper insight into specific behaviors of the service.

Chapter 2: Introduction to Amazon Athena

The previous chapter walked you through your first, hands-on experience with serverless analytics using Amazon Athena. This chapter will continue that introduction by discussing Athena's capabilities, scalability, and pricing in more detail. In the past, vendors such as Oracle and Microsoft produced mostly one-size-fits-all analytics engines and RDBMSes. Bucking the historical norms, AWS has championed a fit for purpose database and analytics strategy. By optimizing for specific use cases, the analytics engines' very architecture could exploit nuances of the workload for which they were intended, thereby delivering an all-around better product. For example, Redshift, EMR, Glue, Athena, and Timestream all offer related but differentiated capabilities with their own unique advantages and trade-offs. The knowledge you will gain in this chapter provides a broad-based understanding of what functionality Athena offers as well as a set of criteria to help you determine whether Athena is the best service for your project. We will also spend some time peeling back the curtain and discussing how Athena builds upon Presto, an open source SQL engine initially developed at Facebook.

Most of the chapters in this book stand on their own and allow you to skip around as you follow your curiosity. However, we do not recommend skipping this chapter unless you already know Athena well and are using this book to dive deep into specific topics.

In the subsequent sections of this chapter, we will cover the following topics:

Getting to know Amazon Athena
What is Presto?
Understanding scale and latency
Metering and billing
Connecting and securing
Determining when to use Amazon Athena

Technical requirements

This chapter is one of the few, perhaps even the only chapter in this book, that will not have many hands-on activities. As such, there are not any specific technical requirements for this chapter beyond those already covered in Chapter 1, Your First Query, namely:

Basic knowledge of SQL is recommended but not required.
A computer with internet access to GitHub, S3, and the AWS Console; a Chrome, Safari, or Microsoft Edge browser; and the AWS CLI installed.
An AWS account and IAM user that can run Athena queries.

As always, any code references or samples for this chapter can be found in the book's companion GitHub repository located at https://github.com/PacktPublishing/Serverless-Analytics-with-Amazon-Athena.

Getting to know Amazon Athena

In Chapter 1, Your First Query, we learned that Amazon Athena is a query service that allows you to run standard SQL over data stored in various sources and formats. We also saw that Athena's pricing model is unique in that we are charged by how much data our query reads and not by how many servers or how much time our queries require. In this section, we will go beyond that cursory introduction and discuss the broader set of capabilities that together make Athena a product worth considering for your next analytics project. We do not go into full detail on every item we are preparing to discuss, but later chapters will allow you to get hands-on with the most notable features. For now, our goal is to increase your awareness of what is possible with Athena, so you can perform technical product selection exercises (aka bakeoffs) or steer toward areas of interest.

Understanding the "serverless" trend

The word serverless appears dozens, possibly hundreds of times, in this book. At the end of the book, we will run an Athena query over the complete text to find the exact number of times we used the word serverless. So, what is the big deal? Why is serverless such a seemingly important concept? Or is it just the latest buzzword to catch on? Like most things, the truth lies somewhere between the two extremes, and that's why we will spend some time understanding what it means to be serverless.

In the simplest terms, a serverless offering is one where you do not have to manage any servers. AWS Lambda is often thought of as the gold standard for serverless technologies since it was the first large-scale offering of this type. With AWS Lambda, you have virtually no boilerplate to slow you down; you literally jump straight into writing your business logic or function as follows:

def lambda_handler(event, context):
    return {
        "response": "Hello World!"
    }

AWS Lambda will handle executing this code in response to several invocation triggers, ranging from SQS messages to HTTP calls. As an AWS Lambda customer, you do not have to worry about setting up Java, a WebService stack, or anything. Right from the beginning, you are writing business logic and not spending time on undifferentiated infrastructure work.

This model has some obvious advantages that customers love. The first of which is that, without servers, your capacity planning responsibilities shrink both in size and complexity. Instead of determining how many servers you need to run that monthly finance report or how much memory your SQL engine will need to handle all the advertising campaigns on Black Friday, you only need to worry about your account limits. To the uninitiated, this might seem easy. You might even say to yourself, I have great metrics about my peak loads and can do my own capacity planning just fine! It is true. You will likely have more context about your future needs than a service like Athena can infer. But what happens to all that hardware after the peak has passed? I am not just referring to that seasonal peak that comes once a year but also the peak of each week and each hour. That hardware, which you or your company paid for, will be sitting idle, taking up space in your data center, and consuming capital that could have been deployed elsewhere. But what about the cloud? I do not need to buy any servers; I can just turn them on and off as needed. Yes! That is true.