Organizations today have gravitated toward services such as AWS Glue that undertake the undifferentiated heavy lifting and provide serverless Spark, enabling you to create and manage data lakes in a serverless fashion. This guide shows you how AWS Glue can be used to solve real-world problems, while also helping you learn about data processing, data integration, and building data lakes.
Beginning with AWS Glue basics, this book teaches you how to perform various aspects of data analysis such as ad hoc queries, data visualization, and real-time analysis using this service. It also provides a walk-through of CI/CD for AWS Glue and how to shift left on quality using automated regression tests. You’ll find out how data security aspects such as access control, encryption, auditing, and networking are implemented, and get to grips with useful techniques such as picking the right file format, compression, partitioning, and bucketing. As you advance, you’ll discover AWS Glue features such as crawlers, Lake Formation, governed tables, lineage, DataBrew, Glue Studio, and custom connectors. The concluding chapters help you to understand various performance tuning, troubleshooting, and monitoring options.
By the end of this AWS book, you’ll be able to create, manage, troubleshoot, and deploy ETL pipelines using AWS Glue.
Your comprehensive reference guide to learning about AWS Glue and its features
Vishal Pathak
Subramanya Vajiraya
Noritaka Sekiyama
Tomohiro Tanaka
Albert Quiroga
Ishan Gaur
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Reshma Raman
Senior Editor: Tazeen Shaikh
Content Development Editor: Sean Lobo
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Jyoti Chauhan
Marketing Coordinator: Nivedita Singh
First published: August 2022
Production reference: 1220722
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80056-498-5
www.packt.com
Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey in AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.
Subramanya Vajiraya is a Big Data Cloud Engineer at AWS Sydney specializing in AWS Glue. He obtained his Bachelor of Engineering degree specializing in Information Science & Engineering from NMAM Institute of Technology, Nitte, KA, India (Visvesvaraya Technological University, Belgaum) in 2015 and obtained his Master of Information Technology degree specializing in Internetworking from the University of New South Wales, Sydney, Australia in 2017. He is passionate about helping customers solve challenging technical issues related to their ETL workload and implementing scalable data integration and analytics pipelines on AWS.
Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. He has 11 years of experience working in the software industry. Based in Tokyo, Japan, he is responsible for implementing software artifacts, building libraries, troubleshooting complex issues, and helping guide customer architectures.
Tomohiro Tanaka is a senior cloud support engineer at AWS. He works to help customers solve their issues and build data lakes across AWS Glue, AWS IoT, and big data technologies such as Apache Spark, Hadoop, and Iceberg.
Albert Quiroga works as a senior solutions architect at Amazon, where he is helping to design and architect one of the largest data lakes in the world. Prior to that, he spent four years working at AWS, where he specialized in big data technologies such as EMR and Athena, and where he became an expert on AWS Glue. Albert has worked with several Fortune 500 companies on some of the largest data lakes in the world and has helped to launch and develop features for several AWS services.
Ishan Gaur has more than 13 years of IT experience in software development and data engineering, building distributed systems and highly scalable ETL pipelines using Apache Spark, Scala, and various ETL tools such as Ab Initio and DataStage. He currently works at AWS as a senior big data cloud engineer and is an SME for AWS Glue. He is responsible for helping customers to build out large, scalable distributed systems and implement them in AWS cloud environments using various big data services, including EMR, Glue, and Athena, as well as other technologies, such as Apache Spark, Hadoop, and Hive.
Akira Ajisaka is an open source developer who has over 10 years of engineering experience in big data. He contributes to the open source community and is an Apache Software Foundation member and Apache Hadoop PMC member. He has worked for the AWS Glue ETL team since 2022 and is learning a lot about Apache Spark.
Keerthi Chadalavada is a senior software engineer with AWS Glue. She is passionate about building cloud-based, data-intensive applications at scale. Her recent work includes enabling data engineers to build event-driven ETL pipelines that respond in near real time to data events and provide the latest insights to business users. In addition, her work on Glue Blueprints enabled data engineers to build templates for repeatable ETL pipelines and enabled non-data engineers without technical expertise to use these templates to gain faster insights from their data. Keerthi holds a master’s degree in computer science from Ohio State University and a bachelor’s degree in computer science from BITS Pilani, India.
In this section, you will learn about the basics of AWS Glue and the general trends in data management. You will be introduced to the important AWS Glue features and ways to ingest data using AWS Glue from heterogeneous sources.
This section includes the following chapters:
Chapter 1, Data Management – Introduction and Concepts
Chapter 2, Introduction to Important AWS Glue Features
Chapter 3, Data Ingestion

In the previous chapter, we talked about the evolution of different data management strategies, such as data warehousing, data lakes, the data lakehouse, and data meshes, and the key differences between each. We introduced the Apache Spark framework, briefly discussed the Spark workload execution mechanism, learned how Spark workloads can be fulfilled on the AWS cloud, and introduced AWS Glue and its components.
In this chapter, we will discuss the different components of AWS Glue so that we know how AWS Glue can be used to perform different data integration tasks.
Upon completing this chapter, you will be able to define data integration and explain how AWS Glue can be used for this. You will also be able to explain the fundamental concepts related to different features of AWS Glue, such as AWS Glue Data Catalog, AWS Glue connections, AWS Glue crawlers, AWS Glue Schema Registry, AWS Glue jobs, AWS Glue development endpoints, AWS Glue interactive sessions, and AWS Glue triggers.
In this chapter, we will cover the following topics:
Data integration
Integrating data with AWS Glue
Features of AWS Glue

Now, let’s dive into the concepts of data integration and AWS Glue. We will discuss the key components and features of AWS Glue that make it a powerful data integration tool.
Data integration is a complex operation that involves several tasks – data discovery, ingestion, preparation, transformation, and replication. Data integration is the very first step in deriving insights from data so that data can be shared across the organization for collaboration and faster decision-making.
The data integration process is often iterative. Upon completing a particular iteration, we can query and visualize the data and make data-driven business decisions. For this purpose, we can use AWS services such as Amazon Athena, Amazon Redshift, and Amazon QuickSight, as well as some other third-party services. The process is often repeated until data of the right quality is obtained. We can set up a job as part of our data integration workflow to profile the data obtained against a specific set of rules to ensure that it meets our requirements. For instance, AWS Glue DataBrew offers built-in capabilities to define data quality rules and allows us to profile data based on our requirements. We will be discussing AWS Glue DataBrew Profile jobs in detail in Chapter 4, Data Preparation. Once data of the right quality is obtained, it can be used for analysis, machine learning (ML), or building data applications.
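As a quick illustration, a DataBrew profile job can be created and run with a few boto3 calls. The following is a minimal sketch only; the dataset name, IAM role ARN, and S3 output bucket are hypothetical placeholders, and the DataBrew dataset is assumed to exist already:

import boto3

# A minimal sketch of profiling a dataset with DataBrew via boto3.
# The dataset name, role ARN, and output bucket are placeholders.
databrew = boto3.client("databrew", region_name="us-east-1")

databrew.create_profile_job(
    Name="sales-profile-job",
    DatasetName="sales-dataset",  # an existing DataBrew dataset
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    OutputLocation={"Bucket": "my-databrew-results"},
)

# Start the job; the profile results are written to the S3 bucket above
databrew.start_job_run(Name="sales-profile-job")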
Since data integration helps drive the business forward, it is a critical business process. This also means there is less room for error as this directly impacts the quality of the data that’s obtained, which, in turn, impacts the decision-making process.
Now, let’s briefly explore how data integration can be simplified using AWS Glue.
AWS Glue was initially introduced as a serverless ETL service that allows users to crawl, catalog, transform, and ingest data into AWS for analytics. However, over the years, it has evolved into a fully managed serverless data integration service.
AWS Glue simplifies the process of data integration, which, as discussed earlier, usually involves discovering, preparing, extracting, and combining data for analysis from different data stores. These tasks are often handled by multiple individuals/teams with a diverse set of skills in an organization.
As mentioned in the previous section, data integration is an iterative process that involves several steps. Let’s take a look at how AWS Glue can be used to perform some of these tasks.
AWS Glue Data Catalog can be used to discover and search data across all our datasets. Data Catalog enables us to store table metadata for our datasets and makes it easy to query these datasets from several applications and services. AWS Glue Data Catalog can not only be used by AWS services such as AWS Glue, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum, but also by on-premises or third-party product implementations that support the Hive metastore, using the open source AWS Glue Data Catalog Client for Apache Hive Metastore (https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore).
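The catalog can also be browsed programmatically. The following boto3 sketch lists the databases in the catalog and then the tables of a hypothetical sales database (the database name is a placeholder):

import boto3

# Browse the Data Catalog: list databases, then the tables in one of them
glue = boto3.client("glue", region_name="us-east-1")

for database in glue.get_databases()["DatabaseList"]:
    print(database["Name"])

# Print the name and storage location of each table in the "sales" database
for table in glue.get_tables(DatabaseName="sales")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])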
AWS Glue Crawlers enable us to populate the Data Catalog with metadata for our datasets by crawling the data stores based on the user-defined configuration.
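As a sketch of how this looks in practice, the following boto3 calls define a crawler over an S3 prefix and start it; the crawler name, IAM role, database, and path are all hypothetical:

import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 prefix and writes table metadata
# into the "sales" database of the Data Catalog
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/sales/"}]},
)

glue.start_crawler(Name="sales-crawler")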
AWS Glue Schema Registry allows us to manage and enforce schemas for data streams. This helps us enhance data quality and safeguard against unexpected schema drifts that can impact the quality of our data significantly.
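For instance, a schema can be registered and a compatibility mode enforced with boto3; the registry name, schema name, and Avro definition below are illustrative placeholders:

import boto3

glue = boto3.client("glue")

# Register an Avro schema; BACKWARD compatibility causes incompatible
# schema updates to be rejected
glue.create_schema(
    RegistryId={"RegistryName": "streaming-registry"},
    SchemaName="orders-value",
    DataFormat="AVRO",
    Compatibility="BACKWARD",
    SchemaDefinition=(
        '{"type": "record", "name": "Order", '
        '"fields": [{"name": "order_id", "type": "string"}]}'
    ),
)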
AWS Glue makes it easy to ingest data from several standard data stores, such as HDFS, Amazon S3, and JDBC databases. It also allows data to be ingested from SaaS and custom data stores via custom and marketplace connectors.
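Inside a Glue job, ingestion typically goes through DynamicFrames. The following sketch reads JSON files from an S3 path and a table defined in the Data Catalog; the S3 path, database, and table names are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read JSON files directly from an S3 prefix
s3_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-data-lake/raw/orders/"]},
    format="json",
)

# Read a dataset through its Data Catalog table definition
catalog_frame = glueContext.create_dynamic_frame.from_catalog(
    database="sales", table_name="orders"
)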
AWS Glue enables us to de-duplicate and cleanse data with built-in ML capabilities using its FindMatches feature. With FindMatches, we can label sets of records as either matching or not matching and the system will learn the criteria and build an ETL job that we can use to find duplicate records. We will discuss FindMatches in detail in Chapter 14, Machine Learning Integration.
AWS Glue also enables us to interactively develop, test, and debug our ETL code using AWS Glue development endpoints, AWS Glue interactive sessions, and AWS Glue Jupyter Notebooks. Apart from notebook environments, we can also use our favorite IDE to develop and test ETL code using AWS Glue development endpoints or AWS Glue local development libraries.
AWS Glue DataBrew provides an interactive visual interface for cleaning and normalizing data without writing code. This is especially beneficial to novice users who do not have Apache Spark and Python/Scala programming skills. AWS Glue DataBrew comes pre-packaged with over 250 transformations that can be used to transform data as per our requirements.
Using AWS Glue Studio, we can develop highly scalable Apache Spark ETL jobs using the visual interface without having in-depth knowledge of Apache Spark.
The Elastic Views feature of AWS Glue enables us to create views of data stored in different AWS data stores and materialize them in a target data store of our choice. We can create materialized views by using PartiQL to write queries.
At the time of writing, AWS Glue Elastic Views supports Amazon DynamoDB as a source. We can materialize these views in several target data stores, such as Amazon Redshift, Amazon OpenSearch Service, and Amazon S3.
Once materialized views have been created, they can be shared with other users for use in their applications. AWS Glue Elastic Views continuously monitors changes in our dataset and updates the target data stores automatically.
In this section, we mentioned several AWS Glue features and how they aid in different data integration tasks. In the next section, we will explore the different features of AWS Glue and understand how they can help implement our data integration workload.
AWS Glue has different features that appear disjointed, but in reality, they are interdependent. Often, users have to use a combination of these features to achieve their goals.
The following are the key features of AWS Glue:
AWS Glue Data Catalog
AWS Glue Connections
AWS Glue Crawlers and Classifiers
AWS Glue Schema Registry
AWS Glue Jobs
AWS Glue Notebooks and interactive sessions
AWS Glue Triggers
AWS Glue Workflows
AWS Glue Blueprints
AWS Glue ML
AWS Glue Studio
AWS Glue DataBrew
AWS Glue Elastic Views

Now that we know the different features and services involved in executing an AWS Glue workload, let’s discuss the fundamental concepts related to some of these features.
A Data Catalog can be defined as an inventory of data assets in an organization that helps data professionals find and understand relevant datasets to extract business value. A Data Catalog acts as metadata storage (or a metastore) that contains metadata stored by disparate systems. This can be used to keep track of data in data silos. Typically, the user is expected to provide information about data formats, locations, and serialization/deserialization mechanisms, along with the query. Metastores make it easy for us to capture these pieces of information during table creation so that they can be reused every time the table is accessed. Metastores also enable us to discover and explore relevant data in the data repository using metastore service APIs. The most widely used metastore product in the industry is Apache Hive Metastore.
AWS Glue Data Catalog is a persistent metastore for data assets. The dataset can be stored anywhere – in AWS, on premises, or with a third-party provider – and Data Catalog can still be used. AWS Glue Data Catalog allows users to store, annotate, and share metadata in AWS. The concept is similar to Apache Hive Metastore; however, the key difference is that AWS Glue Data Catalog is serverless, so there is no additional administrative overhead in managing the infrastructure.
Traditional Hive metastores use relational database management systems (RDBMSs) for metadata storage – for example, MySQL, PostgreSQL, Derby, Oracle, and MSSQL. The problem with using RDBMS for Hive metastores is that relational database servers need to be deployed and managed. If the metastore is to be used for production workloads, then we need to factor high availability (HA) and redundancy into the design. This will increase the complexity of the solution architecture and the cost associated with the infrastructure and how it’s managed. AWS Glue Data Catalog, on the other hand, is fully managed and doesn’t have any administrative overhead (deployment and infrastructure management).
Each AWS account has one Glue Data Catalog per AWS region and is identified by a combination of catalog_id and aws_region. The value of catalog_id is the 12-digit AWS account number. The value of catalog_id remains the same for each catalog in every AWS region. For instance, to access the Data Catalog in the North Virginia AWS region, aws_region must be set to 'us-east-1' and the value of the catalog_id parameter must be the 12-digit AWS account number – for example, 123456789012.
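In boto3 terms, the region is selected through the client endpoint and the catalog through the CatalogId parameter, as in this sketch (the account number is a placeholder):

import boto3

# The region comes from the client endpoint; CatalogId is the 12-digit
# AWS account number that owns the catalog
glue = boto3.client("glue", region_name="us-east-1")

response = glue.get_databases(CatalogId="123456789012")
for database in response["DatabaseList"]:
    print(database["Name"])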
AWS Glue Data Catalog comprises the following components:
Databases
Tables
Partitions

Now, let’s dive into each of these catalog item types in more detail.
A database is a logical collection of metadata tables in AWS Glue. When a table is created, it must be created under a specific database. A table cannot be present in more than one database.
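Creating a database is a single API call; the name and description below are placeholders:

import boto3

glue = boto3.client("glue")

# Create a database; tables created later must reference this database name
glue.create_database(
    DatabaseInput={"Name": "sales", "Description": "Sales datasets"}
)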
A table in a Glue Data Catalog is a resource that holds the metadata for any given dataset. The following diagram shows the metadata of a table stored in the Data Catalog:
Figure 2.1 – Metadata of a table stored in a Data Catalog
All tables contain information such as the name, input format, output format, location, and schema of the dataset, as well as table properties (stored as key-value pairs – primarily used to store table statistics, the compression format, and the data format) and Serializer-Deserializer (SerDe) information such as SerDe name, the serialization library, and SerDe class parameters.
The SerDe library information in the table’s metadata informs the query processing engine of which class to use to translate data between the table view and the low-level input/output format. Similarly, InputFormat and OutputFormat specify the classes that describe the original data structure so that the query processing engine can map the data to its table view. At a high level, the process would look something like this:
Read operation: Input data | InputFormat | Deserializer | Rows
Write operation: Rows | Serializer | OutputFormat | Output data
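To make this concrete, the following boto3 sketch creates a CSV table whose metadata carries the InputFormat, OutputFormat, and SerDe classes discussed above; the database, table, columns, and S3 location are placeholders:

import boto3

glue = boto3.client("glue")

# Create a CSV table whose metadata names the input/output format
# classes and the SerDe library used to translate rows
glue.create_table(
    DatabaseName="sales",
    TableInput={
        "Name": "orders_csv",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "amount", "Type": "double"},
            ],
            "Location": "s3://my-data-lake/orders/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
        "Parameters": {"classification": "csv"},
    },
)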
Table Versions

It is important to note that AWS Glue supports versioning of catalog tables. By default, a new version of the table is created whenever the table is updated. However, we can use the skipArchive option in the AWS Glue UpdateTable API to prevent AWS Glue from creating an archived version of the table. When a table is deleted, all of its versions are removed as well.
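As a sketch, an update that skips archiving looks like this (the database and table names reuse the earlier placeholders):

import boto3

glue = boto3.client("glue")

# Fetch the current definition and rebuild a valid TableInput from it;
# the raw Table response contains fields UpdateTable does not accept
table = glue.get_table(DatabaseName="sales", TableName="orders_csv")["Table"]
table_input = {
    "Name": table["Name"],
    "StorageDescriptor": table["StorageDescriptor"],
    "Parameters": {**table.get("Parameters", {}), "comment": "updated"},
}

# SkipArchive=True updates the table without archiving the old version
glue.update_table(DatabaseName="sales", TableInput=table_input, SkipArchive=True)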
Tables are organized into partitions. Partitioning is an optimization technique by which a table is further divided into related parts based on the values of one or more columns. A table can have multiple partition keys, and the combination of their values identifies a particular partition (also known as the partition_spec).
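For example, with a Hive-style layout on Amazon S3, each combination of partition key values maps to one prefix, and the Data Catalog can filter partitions server-side with an expression on the partition keys. The following sketch assumes a hypothetical table partitioned by year and month, reusing the earlier placeholder names:

import boto3

# Hive-style partition layout on S3; each year/month pair is one partition:
#   s3://my-data-lake/orders/year=2022/month=07/...
#   s3://my-data-lake/orders/year=2022/month=08/...
glue = boto3.client("glue")

# Retrieve only the partitions matching a filter on the partition keys
partitions = glue.get_partitions(
    DatabaseName="sales",
    TableName="orders_csv",
    Expression="year = '2022' AND month = '07'",
)
for partition in partitions["Partitions"]:
    print(partition["Values"], partition["StorageDescriptor"]["Location"])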