Performing data engineering with Amazon Web Services (AWS) combines AWS's scalable infrastructure with robust data processing tools, enabling efficient data pipelines and analytics workflows. This comprehensive guide to AWS data engineering will teach you all you need to know about data lake management, pipeline orchestration, and serving layer construction.
Through clear explanations and hands-on exercises, you’ll master essential AWS services such as Glue, EMR, Redshift, QuickSight, and Athena. Additionally, you’ll explore various data platform topics such as data governance, data quality, DevOps, CI/CD, planning and performing data migration, and creating Infrastructure as Code. As you progress, you will gain insights into how to enrich your platform and use AWS services such as Amazon EventBridge, Amazon DataZone, AWS SCT, and AWS DMS to solve data platform challenges.
Each recipe in this book is tailored to a daily challenge that a data engineer team faces while building a cloud platform. By the end of this book, you will be well-versed in AWS data engineering and have gained proficiency in key AWS services and data processing techniques. You will develop the necessary skills to tackle large-scale data challenges with confidence.
You can read this e-book in Legimi apps or in any app that supports the following format:
Page count: 582
Publication year: 2024
Data Engineering with AWS Cookbook
A recipe-based approach to help you tackle data engineering problems with AWS services
Trâm Ngọc Phạm
Gonzalo Herreros González
Viquar Khan
Huda Nofal
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
The authors acknowledge the use of cutting-edge AI, such as ChatGPT, with the sole aim of enhancing the language and clarity within the book, thereby ensuring a smooth reading experience for readers. It’s important to note that the content itself has been crafted by the authors and edited by a professional publishing team.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors nor Packt Publishing or its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Apeksha Shetty
Publishing Product Manager: Nilesh Kowadkar
Book Project Manager: Urvi Sharma
Senior Editor: Rohit Singh
Technical Editor: Kavyashree K S
Copy Editor: Safis Editing
Proofreader: Rohit Singh
Indexer: Manju Arasan
Production Designer: Shankar Kalbhor
Senior DevRel Marketing Executive: Nivedita Singh
First published: November 2024
Production reference: 1301024
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80512-728-4
www.packtpub.com
To my mother, Ngoc Truong, for her love and sacrifices, and for exemplifying the power of determination. To my family members and friends, who always offer support and kindness throughout my life journey.
– Trâm Ngọc Phạm
Trâm Ngọc Phạm is a senior data architect with over a decade of hands-on experience in the big data and AI field, including a lead role in tailoring cloud data platforms for BI and analytics use cases for enterprises in Vietnam. While working as a senior data and analytics consultant on the AWS Professional Services team, she specialized in guiding finance and telco companies across Southeast Asia in building enterprise-scale data platforms and driving analytics use cases with AWS services and big data tools.
Gonzalo Herreros González is a principal data architect. He holds a bachelor’s degree in computer science and a master’s degree in data analytics. He has over a decade of experience in big data and two decades in software development, both on AWS and on-premises.
Previously, he worked at MasterCard, where he delivered the first PCI-DSS-compliant Hadoop cluster in the world. More recently, he worked at AWS for over 6 years, first building data pipelines for internal network data and later, as an architect on the AWS Glue service team, building transforms for AWS Glue Studio and helping large customers succeed with AWS data services.
Viquar Khan is a senior data architect at AWS Professional Services and brings over 20 years of expertise in finance and data analytics, empowering global financial institutions to harness the full potential of AWS technologies. He designs cutting-edge, customized data solutions tailored to complex industry needs. A polyglot developer skilled in Java, Scala, Python, and other languages, Viquar has excelled in various technical roles. As an expert group member of JSR 368 (Java Message Service 2.1), he has shaped industry standards and actively contributes to open source projects such as Apache Spark and Terraform. His technical insights have reached and benefited over 6.7 million users on Stack Overflow.
Huda Nofal is a seasoned data engineer with over 7 years of experience at Amazon, where she has played a key role in helping internal business teams achieve their data goals. With deep expertise in AWS services, she has successfully designed and implemented data pipelines that power critical decision-making processes across various organizations. Huda’s work primarily focuses on leveraging Redshift, Glue, data lakes, and Lambda to create scalable, efficient data solutions.
Saransh Arora is a seasoned data engineer with more than 6 years of experience in the field. He has developed proficiency in Python, Java, Spark, SQL, and various data engineering tools, enabling him to address a wide range of data challenges. He has expertise in data orchestration, management, and analysis, with a strong emphasis on leveraging big data technologies to generate actionable insights. Saransh also possesses significant experience in machine learning and predictive analytics. Currently serving as a data engineer at AWS, he is dedicated to driving innovation and delivering business value. As an expert in data engineering, Saransh has also been working on the integration of generative AI into data engineering practices.
Haymang Ahuja specializes in ETL development, cloud computing, big data technologies, and cutting-edge AI. He is adept at creating robust data pipelines and delivering high-performance data solutions, backed by strong software development skills and proficiency in programming languages such as Python and SQL. His expertise includes big data technologies such as Spark, Apache Hudi, Airflow, Kylin, HDFS, and HBase. With a combination of technical knowledge, problem-solving skills, and a commitment to leveraging emerging technologies, he helps organizations achieve their strategic objectives and stay competitive in the dynamic digital landscape.
Hello and welcome! In today’s rapidly evolving data landscape, managing, migrating, and governing large-scale data systems are among the top priorities for data engineers. This book serves as a comprehensive guide to help you navigate these essential tasks, with a focus on three key pillars of modern data engineering:
Hadoop and data warehouse migration: Organizations are increasingly moving from traditional Hadoop clusters and on-premises data warehouses to more scalable, cloud-based data platforms. This book walks you through the best practices, methodologies, and tools for migrating large-scale data systems, ensuring data consistency, minimal downtime, and scalable performance.
Data lake operations: Building and maintaining a data lake in today’s multi-cloud, big data environment is complex and demands a strong operational strategy. This book covers how to ingest, transform, and manage data at scale using AWS services such as S3, Glue, and Athena. You will learn how to structure and maintain a robust data lake architecture that supports the varied needs of data analysts, data scientists, and business users alike.
Data lake governance: Managing and governing your data lake involves more than just operational efficiency; it requires stringent security protocols, data quality controls, and compliance measures. With the explosion of data, it’s more important than ever to have clear governance frameworks in place. This book delves into the best practices for implementing governance strategies using services such as AWS Lake Formation, Glue, and other AWS security frameworks. You’ll also learn about setting up policies that ensure your data lake is compliant with industry regulations while maintaining scalability and flexibility.
This cookbook is tailored to data engineers who are looking to implement best practices and take their cloud data platforms to the next level. Throughout this book, you’ll find practical examples, detailed recipes, and real-world scenarios from the authors’ experience of working with complex data environments across different industries.
By the end of this journey, you will have a thorough understanding of how to migrate, operate, and govern your data platforms at scale, all while aligning with industry best practices and modern technological advancements.
So, let’s dive in and build the future of data engineering together!
This book is designed for data engineers, data platform engineers, and cloud practitioners who are actively involved in building and managing data infrastructure in the cloud. If you’re involved in designing, building, or overseeing data solutions on AWS, this book will be ideal as it provides proven strategies for addressing challenges in large-scale data environments. Data engineers and big data professionals aiming to enhance their understanding of AWS features for optimizing their workflow, even if they’re new to the platform, will find value. Basic familiarity with AWS security (users and roles) and command shell is recommended. This book will provide you with practical guidance, hands-on recipes, and advanced techniques for tackling real-world challenges.
Chapter 1, Managing Data Lake Storage, covers the fundamentals of managing S3 buckets. We’ll focus on implementing robust security measures through data encryption and access control, managing costs by optimizing storage tiers and applying retention policies, and utilizing monitoring techniques to ensure timely issue resolution. Additionally, we’ll cover other essential aspects of S3 bucket management.
Chapter 2, Sharing Your Data Across Environments and Accounts, presents methods for securely and efficiently sharing data across different environments and accounts. We will explore strategies for load distribution and collaborative analysis using Redshift data sharing and RDS replicas. We will implement fine-grained access control with Lake Formation and manage Glue data sharing through both Lake Formation and Resource Access Manager (RAM). Additionally, we will discuss real-time sharing via event-driven services, temporary data sharing with S3, and sharing operational data from CloudWatch.
Chapter 3, Ingesting and Transforming Your Data with AWS Glue, explores different features of AWS Glue when building data pipelines and data lakes. It covers the multiple tools and engines provided for the different kinds of users, from visual jobs with little or no code to managed notebooks and jobs using the different data handling APIs provided.
Chapter 4, A Deep Dive into AWS Orchestration Frameworks, explores the essential services and techniques for managing data workflows and pipelines on AWS. You’ll learn how to define a simple workflow using AWS Glue Workflows, set up event-driven orchestration with Amazon EventBridge, and create data workflows with AWS Step Functions. We also cover managing data pipelines using Amazon MWAA, monitoring their health, and setting up a data ingestion pipeline with AWS Glue to bring data from a JDBC database into a catalog table.
Chapter 5, Running Big Data Workloads with Amazon EMR, teaches how to make the most of your AWS EMR clusters and explore the service features that enable them to be customizable, efficient, scalable, and robust.
Chapter 6, Governing Your Platform, presents the key aspects of data governance within AWS. This includes data protection techniques such as data masking in Redshift and classifying sensitive information using Macie. We will also cover ensuring data quality with Glue data quality checks. Additionally, we will discuss resource governance to enforce best practices and maintain a secure, compliant infrastructure using AWS Config and resource tagging.
Chapter 7, Data Quality Management, covers how to use AWS Glue Data Quality (built on Deequ) and AWS Glue DataBrew to automate data quality checks and maintain high standards across your datasets. You will learn how to define and enforce data quality rules and monitor data quality metrics. This chapter also provides practical examples and recipes for integrating these tools into your data workflows, ensuring that your data is accurate, complete, and reliable for analysis.
Chapter 8, DevOps – Defining IaC and Building CI/CD Pipelines, explores multiple ways to automate AWS services and CI/CD deployment pipelines, the pros and cons of each tool, and examples of common data product deployments to illustrate DevOps best practices.
Chapter 9, Monitoring Data Lake Cloud Infrastructure, provides a comprehensive guide to the day-to-day operations of a cloud-based data platform. It covers key topics such as monitoring, logging, and alerting using AWS services such as CloudWatch, CloudTrail, and X-Ray. You will learn how to set up dashboards to monitor the health and performance of your data platform, troubleshoot issues, and ensure high availability and reliability. This chapter also discusses best practices for cost management and scaling operations to meet changing demands, making it an essential resource for anyone responsible for the ongoing maintenance and optimization of a data platform.
Chapter 10, Building a Serving Layer with AWS Analytics Services, guides you through the process of building an efficient serving layer using AWS Redshift, Athena, and QuickSight. The serving layer is where your data becomes accessible to end-users for analysis and reporting. In this chapter, you will learn how to load data from your data lake into Redshift, query it using Redshift Spectrum and Athena, and visualize it using QuickSight. This chapter also covers best practices for managing different QuickSight environments and migrating assets between them. By the end of this chapter, you will have the knowledge to create a powerful and user-friendly analytics layer that meets the needs of your organization.
Chapter 11, Migrating to AWS – Steps, Strategies, and Best Practices for Modernizing Your Analytics and Big Data Workloads, presents a theoretical framework for migrating data and workloads to AWS. It explores key concepts, strategies, and best practices for planning and executing a successful migration. You’ll learn about various migration approaches—rehosting, replatforming, and refactoring—and how to choose the best option for your organization’s needs. The chapter also addresses critical challenges and considerations, such as data security, compliance, and minimizing downtime, preparing you to navigate the complexities of cloud migration with confidence.
Chapter 12, Harnessing the Power of AWS for Seamless Data Warehouse Migration, explores the key strategies for efficiently migrating data warehouses to AWS. You’ll learn how to generate a migration assessment report using the AWS Schema Conversion Tool (SCT), extract and transfer data with AWS Database Migration Service (DMS), and handle large-scale migrations with the AWS Snow Family. You’ll also learn how to streamline your data migration, ensuring minimal disruption and maximum efficiency while transitioning to the cloud.
Chapter 13, Strategizing Hadoop Migrations – Cost, Data, and Workflow Modernization with AWS, guides you through essential recipes for migrating your on-premises Hadoop ecosystem to AWS, covering a range of critical tasks. You’ll learn about cost analysis using the AWS Total Cost of Ownership (TCO) calculators and the Hadoop Migration Assessment tool. You’ll also learn how to choose the right storage solution, migrate HDFS data using AWS DataSync, and transition key components such as the Hive Metastore and Apache Oozie workflows to AWS EMR. We also cover setting up a secure network connection to your EMR cluster, seamless HBase migration to AWS, and transitioning HBase to DynamoDB.
To follow the recipes in this book, you will need the following:
Software/hardware covered in the book: the AWS CLI, access to AWS services such as EMR, Glue, Redshift, QuickSight, and Lambda, and Python (for scripting and SDK usage)
OS requirements: Windows, macOS, or Linux (any)
In addition to these requirements, you will also need a basic knowledge of data engineering terminology.
If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Engineering-with-AWS-Cookbook. In case there’s an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Make sure you replace <your_bucket_name> with the actual name of your S3 bucket.”
A block of code is set as follows:
{
    "Sid": "DenyListBucketFolder",
    "Action": [
        "s3:*"
    ],
    "Effect": "Deny",
    "Resource": [
        "arn:aws:s3:::<bucket-name>/<folder-name>/*"
    ]
}
Any command-line input or output is written as follows:
CREATE DATASHARE datashare_name;
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: “Choose Policies from the navigation pane on the left and choose Create policy.”
Tips or important notes
Appear like this.
In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There’s more..., and See also).
To give clear instructions on how to complete a recipe, use these sections as follows:
This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.
This section contains the steps required to follow the recipe.
This section usually consists of a detailed explanation of what happened in the previous section.
This section consists of additional information about the recipe in order to make you more knowledgeable about the recipe.
This section provides helpful links to other useful information for the recipe.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Data Engineering with AWS Cookbook, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:
https://packt.link/free-ebook/9781805127284
Submit your proof of purchase.
That’s it! We’ll send your free PDF and other benefits to your email directly.
Amazon Simple Storage Service (Amazon S3) is a highly scalable and secure cloud storage service. It allows you to store and retrieve any amount of data at any time from anywhere in the world. S3 buckets aim to help enterprises and individuals achieve their data backup and delivery needs and serve a variety of use cases, including but not limited to web and mobile applications, big data analytics, data lakes, and data backup and archiving.
In this chapter, we will learn how to keep data secure in S3 buckets and configure your buckets in a way that best serves your use case from performance and cost perspectives.
The following recipes will be covered in this chapter:
Controlling access to S3 buckets
Storage types in S3 for optimized storage costs
Enforcing encryption of S3 buckets
Setting up retention policies for your objects
Versioning your data
Replicating your data
Monitoring your S3 buckets
The recipes in this chapter assume you have an S3 bucket with admin permission. If you don’t have admin permission to the bucket, you will need to configure the permission for each recipe as needed.
You can find the code files for this chapter in this book’s GitHub repository: https://github.com/PacktPublishing/Data-Engineering-with-AWS-Cookbook/tree/main/Chapter01.
Controlling access to S3 buckets through policies and IAM roles is crucial for maintaining the security and integrity of your objects and data stored in Amazon S3. By defining granular permissions and access controls, you can ensure that only authorized users or services have the necessary privileges to interact with your S3 resources. You can restrict permissions according to your requirements by precisely defining who can access your data, what actions they can take, and under what conditions. This fine-grained access control helps protect sensitive data, prevent unauthorized modifications, and mitigate the risk of accidental or malicious actions.
AWS Identity and Access Management (IAM) allows you to create an entity referred to as an IAM identity, which is granted specific actions on your AWS account. This entity can be a person or an application. You can create this identity as an IAM role, which is designed to be attached to any entity that needs it. Alternatively, you can create IAM users, which represent individual people and are usually used for granting long-term access to specific users. IAM users can be grouped into an IAM group, allowing permissions to be assigned at the group level and inherited by all member users. IAM policies are sets of permissions that can be attached to the IAM identity to grant specific access rights.
In this recipe, we will learn how to create a policy so that we can view all the buckets in the account, give read access to one specific bucket content, and then give write access to one of its folders.
For this recipe, you need to have an IAM user, role, or group to which you want to grant access. You also need to have an S3 bucket with a folder to grant access to.
To learn how to create IAM identities, go to https://docs.aws.amazon.com/IAM/latest/UserGuide/id.html.
Now, you can attach this policy to an IAM role, user, or group. However, exercise caution and ensure access is granted only as necessary; avoid providing admin access policies to regular users.
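As a reference, a minimal sketch of what such a policy could look like is shown below; the bucket and folder names are placeholders, and your policy may need additional actions depending on your use case:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowListBuckets",
            "Effect": "Allow",
            "Action": ["s3:ListAllMyBuckets"],
            "Resource": "*"
        },
        {
            "Sid": "AllowBucketListing",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::<bucket-name>"
        },
        {
            "Sid": "AllowFolderAccess",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::<bucket-name>/<folder-name>/*"
        }
    ]
}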
An IAM policy comprises three key elements:
Effect: This specifies whether the policy allows or denies access
Action: This details the specific actions being allowed or denied
Resource: This identifies the resources to which the actions apply
A single statement can apply multiple actions to multiple resources. In this recipe, we’ve defined three statements:
The AllowListBuckets statement gives access to list all buckets in the AWS account
The AllowBucketListing statement gives access to list the content of a specific S3 bucket
The AllowFolderAccess statement gives access to upload, download, and delete objects from a specific folder
If you want to make sure that no access is given to a specific bucket or object in your bucket, you can use a deny statement, as shown here:
{ "Sid":"DenyListBucketFolder", "Action":[ "s3:*" ], "Effect":"Deny", "Resource":[ "arn:aws:s3:::<bucket-name>/<folder-name>/*" }Instead of using an IAM policy to set up permissions to your bucket, you can use S3 bucket policies. These can be located in the Permission tab of the bucket. Bucket policies can be used when you’re trying to set up access at the bucket level, regardless of the IAM role or user.
Amazon S3 offers different tiers or classes of storage that allow you to optimize for cost and performance based on your access pattern and data requirements. The default storage class for S3 buckets is S3 Standard, which offers high availability and low latency. For less frequently accessed data, S3 Standard-IA and S3 One Zone-IA can be used. For rare access, Amazon S3 offers archiving classes called Glacier, which are the lowest-cost classes. If you’re not sure how frequently your data will be accessed, S3 Intelligent-Tiering would be optimal for you as it will automatically move objects between the classes based on the access patterns. However, be aware that additional costs may be incurred when you’re moving objects to a higher-cost storage class.
These storage classes provide users with the flexibility to choose the right trade-off between storage costs and access performance based on their specific data storage and retrieval requirements. You can choose the storage class based on your access patterns, durability requirements, and budget considerations. Configuring storage classes at the object level allows for a mix of storage classes within the same bucket. Objects from diverse storage classes, including S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, and S3 One Zone-IA, can coexist in a single bucket.
In this recipe, we will learn how to enforce the S3 Intelligent-Tiering storage class for an S3 bucket through a bucket policy.
For this recipe, you only need to have an S3 bucket for which you will enforce the storage class.
The policy will ensure that objects are stored via the Intelligent-Tiering class by allowing the PUT operation to be used on the bucket for all users (Principal: *), but only if the storage class is set to INTELLIGENT_TIERING. In the console, you can do this by choosing it from the storage class list in the Object properties section. If you’re using the S3 API, add the x-amz-storage-class: INTELLIGENT_TIERING header, and when using the AWS CLI, pass the --storage-class INTELLIGENT_TIERING parameter.
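As an illustration, a deny-based variant of this idea can be expressed as a bucket policy like the following sketch, where <bucket-name> is a placeholder; it rejects any PutObject request whose storage class is not INTELLIGENT_TIERING:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyNonIntelligentTiering",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::<bucket-name>/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-storage-class": "INTELLIGENT_TIERING"
                }
            }
        }
    ]
}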
Intelligent-Tiering will place newly uploaded objects in the S3 Standard class (Frequent Access class). If the object hasn’t been accessed in 30 consecutive days, it will be moved to the Infrequent Access tier; if it hasn’t been accessed in 90 consecutive days, it will be moved to the Archive Instant Access tier. For further cost savings, you can enable INTELLIGENT_TIERING to move your object to the Archive Access tier and Deep Archive Access tier if they have not been accessed for a longer period. To do this, follow these steps:
Navigate to the Properties tab for the bucket.
Scroll down to Intelligent-Tiering Archive configurations and click on Create configuration.
Name the configuration and specify whether you want to enable it for all objects in the bucket or on a subset based on a filter and/or tags.
Under Status, click on Enable to enable the configuration directly after you create it.
Under Archive rule actions, enable the Archive Access tier and specify the number of days in which the objects should be moved to this class if they’re not being accessed. The value must be between 90 and 730 days. Similarly, enable the Deep Archive Access tier and set the number of days to a minimum of 180 days. It’s also possible to enable only one of these classes:
Figure 1.1 – Intelligent-Tiering Archive rule action
Click on Create to create the configuration.
Amazon S3 encryption increases the level of security and privacy of your data; it helps ensure that only authorized parties can read it. Even if an unauthorized person gains logical or physical access to the data, it remains unreadable without the key to decrypt it.
S3 supports encrypting data both in transit (as it travels to and from S3) and at rest (while it’s stored on disks in S3 data centers).
For protecting data at rest, you have two options. The first is server-side encryption (SSE), in which Amazon S3 handles the heavy encryption operations on the server side in AWS. By default, Amazon S3 encrypts your data using SSE-S3. However, you can change this to SSE-KMS, which uses KMS keys for encryption, or to SSE-C, where you provide and manage your own encryption key. Alternatively, you can encrypt your data using client-side encryption, where Amazon S3 doesn’t play any role in the encryption process; rather, you are responsible for all the encryption operations.
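For example, when uploading an object with the AWS CLI, you can request SSE-KMS for that single object; the file name, bucket name, and key ARN below are placeholders:
# Upload one object and ask S3 to encrypt it server-side with a specific KMS key
aws s3 cp ./report.csv s3://<your-bucket-name>/reports/report.csv \
    --sse aws:kms \
    --sse-kms-key-id <your-kms-key-arn>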
In this recipe, we’ll learn how to enforce SSE-KMS server-side encryption using customer-managed keys.
For this recipe, you need to have a KMS key in the same region as your bucket to use for encryption. KMS provides a managed key for S3 (aws/s3) that can be utilized for encryption. However, if you desire greater control over the key properties, such as modifying its policies or performing key rotation, you can create a customer-managed key. To do so, follow these steps:
Sign in to the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the AWS Key Management Service (AWS KMS) service.
In the navigation pane, choose Customer managed keys and click on Create key.
For Key type, choose Symmetric, while for Key usage, choose Encrypt and decrypt. Click on Next:
Figure 1.2 – KMS configuration
Click on Next.
Type an Alias value for the KMS key. This will be the display name. Optionally, you can provide Description and Tags key-value pairs for the key.
Click on Next. Optionally, you can provide Key administrators to administer the key. Click on Finish to create the key.
Figure 1.3 – Changing the default encryption
By changing the default encryption for your bucket, all newly uploaded objects that don’t have an encryption setting will be encrypted using the KMS key you have provided. Already existing objects in your bucket will not be affected. Enabling the bucket key leads to cost savings in KMS service calls associated with the encryption or decryption of individual objects. This is achieved by KMS generating a key at the bucket level rather than generating a separate KMS key for each encrypted object. S3 uses this bucket-level key to generate distinct data keys for objects within the bucket, thereby eliminating the need for additional KMS requests to complete encryption operations.
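The same default-encryption change can also be made outside the console; a minimal AWS CLI sketch, with placeholder bucket name and key ARN, might look like this:
# Set SSE-KMS as the bucket's default encryption and enable the bucket key
aws s3api put-bucket-encryption \
    --bucket <your-bucket-name> \
    --server-side-encryption-configuration '{
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "<your-kms-key-arn>"
                },
                "BucketKeyEnabled": true
            }
        ]
    }'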
By following this recipe, you can encrypt your objects with SSE-KMS, but only if they don’t have encryption configured. You can enforce that objects include an SSE-KMS encryption setting in the PUT operation by using a bucket policy, as shown here:
Navigate to the bucket’s Permissions tab.
Go to the Bucket Policy section and click on Edit.
Paste the following policy. Make sure you replace <your-bucket-name> with the actual name of your S3 bucket and <your-kms-key-arn> with the Amazon Resource Name (ARN) of your KMS key:
{
    "Version": "2012-10-17",
    "Id": "EnforceSSE-KMS",
    "Statement": [
        {
            "Sid": "DenyNonKmsEncrypted",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::<your-bucket-name>/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "aws:kms"
                }
            }
        },
        {
            "Sid": "AllowKmsEncrypted",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::<your-bucket-name>/*",
            "Condition": {
                "StringEquals": {
                    "s3:x-amz-server-side-encryption": "aws:kms",
                    "s3:x-amz-server-side-encryption-aws-kms-key-id": "<your-kms-key-arn>"
                }
            }
        }
    ]
}
Save your changes.
This policy contains two statements. The first statement (DenyNonKmsEncrypted) denies the s3:PutObject action for any request that does not include SSE-KMS encryption. The second statement (AllowKmsEncrypted) only allows the s3:PutObject action when the request includes SSE-KMS encryption and the specified KMS key.
Amazon S3’s storage lifecycle allows you to manage the lifecycle of objects in an S3 bucket based on predefined rules. The lifecycle management feature consists of two main actions: transitions and expiration. Transitions involve automatically moving objects between different storage classes based on a defined duration. This helps in optimizing costs by storing less frequently accessed data in a cheaper storage class. Expiration, on the other hand, allows users to set rules to automatically delete objects from an S3 bucket. These rules can be based on a specified duration. Additionally, you can apply a combination of transitions and expiration actions to objects. Amazon S3’s storage lifecycle provides flexibility and ease of management for users and it helps organizations optimize storage costs while ensuring that data is stored according to its relevance and access patterns.
In this recipe, we will learn how to set up a lifecycle policy to archive objects in S3 Glacier after a certain period and then expire them.
To complete this recipe, you need to have a Glacier vault, which is a separate storage container that can be used to store archives, independent from S3. You can create one by following these steps:
Open the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the Glacier service.
Click on Create vault to start creating a new Glacier vault.
Provide a unique and descriptive name for your vault in the Vault name field.
Optionally, you can choose to receive notifications for events by clicking Turn on notifications under the Event notifications section.
Click on Create to create the vault.
The following screenshot shows a rule that’s been restricted to a set of objects based on a prefix:
Figure 1.4 – Lifecycle rule configuration
Under Lifecycle rule actions, select the following options:
Move current versions of objects between storage classes. Then, choose one of the Glacier classes and set Days after object creation in which the object will be transitioned (for example, 60 days).
Expire current versions of objects. Then, set Days after object creation in which the object will expire. Choose a value higher than the one you set for transitioning the object to Glacier (for example, 100).
Review the transition and expiration actions you have set and click on Create rule to apply the lifecycle policy to the bucket:
Figure 1.5 – Reviewing the lifecycle rule
Note
It may take some time for the lifecycle rule to be applied to all the selected objects, depending on the size of the bucket and the number of objects. The rule will affect existing files, not just new ones, so ensure that no applications are accessing files that will be archived or deleted as they will no longer be accessible via direct S3 retrieval.
After you save the lifecycle rule, Amazon S3 will periodically evaluate it to find objects that meet the criteria specified in the lifecycle rule. In this recipe, the object will remain in its default storage type for the specified period (for example, 60 days) after which it will automatically be moved to the Glacier storage class. This transition is handled transparently, and the object’s metadata and properties remain unchanged. Once the objects are transitioned to Glacier, they are stored in a Glacier vault and become part of the Glacier storage infrastructure. Objects will then remain in Glacier for the remaining period of expiry (for example, 40 days), after which they will expire and be permanently deleted from your S3 bucket.
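If you want to script the same behavior rather than use the console, a hedged sketch of an equivalent rule with the AWS CLI could look like this; the prefix, the 60-day transition, and the 100-day expiration are example values:
# Note: this call replaces the bucket's entire lifecycle configuration,
# so include any existing rules in the same document.
aws s3api put-bucket-lifecycle-configuration \
    --bucket <your-bucket-name> \
    --lifecycle-configuration '{
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": { "Prefix": "logs/" },
                "Transitions": [
                    { "Days": 60, "StorageClass": "GLACIER" }
                ],
                "Expiration": { "Days": 100 }
            }
        ]
    }'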
Please note that once the objects have expired, they will be queued for deletion, so it might take a few days after the object reaches the end of its lifetime for it to be deleted.
Lifecycle configuration can be specified as XML when using the S3 API, which can be helpful if you are planning to apply the same lifecycle rules to multiple buckets. You can read more on setting this up at https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html.
Amazon S3 versioning refers to maintaining multiple variants of an object at the same time in the same bucket. Versioning provides you with an additional layer of protection by giving you a way to recover from unintended overwrites and accidental deletions as well as application failures.
S3 Object Versioning is not enabled by default and has to be explicitly enabled for each bucket. Once enabled, versioning cannot be disabled and can only be suspended. When versioning is enabled, you will be able to preserve, retrieve, and restore any version of an object stored in the bucket using the version ID. Every version of an object is the whole object, not the delta from the previous version, and you can set permissions at the version level. So, you can set different permissions for different versions of the same object.
In this recipe, we’ll learn how to delete the current version of an object to make the previous one the current version.
For this recipe, you need to have a version-enabled bucket with an object that has at least two versions.
You can enable versioning for your bucket by going to the bucket’s Properties tab, editing the Bucket Versioning area, and setting it to Enable:
Figure 1.6 – Enabling bucket versioning
You can create a new version of an object by simply uploading a file with the same name to the versioning-enabled bucket.
It’s important to note that enabling versioning for a bucket is irreversible. Once versioning is enabled, it will be applied to all existing and future objects in that bucket. So, before enabling versioning, make sure that your application or workflow is compatible with object versioning.
Enabling versioning for the first time will take time to take effect, so we recommend waiting 15 minutes before performing any write operation on objects in the bucket.
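If you prefer to script this step, versioning can also be enabled and verified with the AWS CLI; this is a minimal sketch with a placeholder bucket name:
# Turn on versioning for the bucket
aws s3api put-bucket-versioning \
    --bucket <your-bucket-name> \
    --versioning-configuration Status=Enabled

# Confirm the change; the output should report "Status": "Enabled"
aws s3api get-bucket-versioning --bucket <your-bucket-name>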
Figure 1.7 – Object versions
Select the current version of the object that you want to delete. It’s the top-most version with the latest modified date.
Click on the Delete button and type permanently delete when prompted on the next screen.
After deleting the current version, the previous version will automatically become the latest version:
Figure 1.8 – Object versions after version deletion
Verify that the previous version is now the latest version by checking the Last modified timestamp or by confirming this through object listing, metadata, or download.
Once you enable bucket versioning, each object in the bucket will have a version ID that uniquely identifies it, whereas objects in non-version-enabled buckets have their version ID set to null. The older versions of an object become non-current but continue to exist and remain accessible. When you delete the current version of an object by specifying its version ID, it is permanently removed, and the S3 versioning mechanism automatically promotes the previous version to be the current one. If you delete an object without specifying the version ID, Amazon S3 doesn’t delete it permanently; instead, it inserts a delete marker, which becomes the current object version. However, you can still restore its previous versions:
Figure 1.9 – Object with a delete marker
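If you would rather work from the AWS CLI, a minimal sketch of the same inspection and version-specific deletion looks like this; the bucket name, object key, and version ID are placeholders, and deleting by version ID permanently removes that version:
# List all versions (and delete markers) of a single object
aws s3api list-object-versions \
    --bucket <your-bucket-name> \
    --prefix <object-key>

# Permanently delete one specific version; the previous version becomes current
aws s3api delete-object \
    --bucket <your-bucket-name> \
    --key <object-key> \
    --version-id <version-id>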
S3 rates apply to every version of an object that’s stored and requested, so keeping non-current versions of objects can increase your storage cost. You can use lifecycle rules to archive the non-current versions or permanently delete them after a certain period and keep the bucket clean from unnecessary object versions.
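Before the console steps that follow, here is a hedged sketch of such a rule expressed as a lifecycle configuration for the AWS CLI; the 30-day window and the two retained versions are example values:
# Note: this call replaces the bucket's entire lifecycle configuration,
# so include any existing rules in the same document.
aws s3api put-bucket-lifecycle-configuration \
    --bucket <your-bucket-name> \
    --lifecycle-configuration '{
        "Rules": [
            {
                "ID": "expire-noncurrent-versions",
                "Status": "Enabled",
                "Filter": {},
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": 30,
                    "NewerNoncurrentVersions": 2
                }
            }
        ]
    }'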
Follow these steps to add a lifecycle rule to delete non-current versions after a certain period:
Go to the bucket’s Management tab and click on Lifecycle configuration.
Click on the Add lifecycle rule button to create a new rule.
Provide a unique name for the rule.
Under Apply rule to, select the appropriate resources (for example, the entire bucket or specific prefixes).
Set the action to Permanently delete non-current versions.
Specify Days after objects become noncurrent in which the delete will be executed. Optionally, you can specify Number of newer versions to retain, which means it will keep the said number of versions for the object and all others will be deleted when they are eligible for deletion based on the specified period.
Click on Save to save the lifecycle rule.
AWS S3 replication is an automatic, asynchronous process that involves copying objects to one or multiple destination buckets. Replication can be configured across buckets in the same AWS region with Same-Region Replication, which can be useful for scenarios such as isolating different workloads, segregating data for different teams, or achieving compliance requirements. Replication can also be configured for buckets across different AWS regions with Cross-Region Replication (CRR), which helps in reducing latency for accessing data, especially for enterprises with a large number of locations, by maintaining multiple copies of the objects in different geographies or regions. It provides compliance and data redundancy for improved performance, availability, and disaster recovery capabilities.
In this recipe, we’ll learn how to set up replication between two buckets in different AWS regions and the same AWS account.
You need to have an S3 bucket in the destination AWS region to act as a target for the replication. Also, S3 versioning must be enabled for both the source and destination buckets.
Figure 1.10 – Replication rule configuration
If this is the first replication rule for the bucket, Priority will be set to 0. Subsequent rules that are added will be assigned higher priorities. When multiple rules share the same destination, the rule with the highest priority takes precedence during execution, typically the one created last. If you wish to control the priority for each rule, you can achieve this by setting the rule using XML. For guidance on how to configure this, refer to the See also section.
In the Source bucket section, you have the option to replicate all objects in the bucket by selecting Apply to all objects in the bucket, or you can narrow it down to specific objects by selecting Limit the scope of this rule using one or more filters and specifying a Prefix value (for example, logs_ or logs/) to filter objects. Additionally, you have the option to replicate objects based on their tags. Simply choose Add tag and input key-value pairs. This process can be repeated so that you can include multiple tags:
Figure 1.11 – Source bucket configuration
Under Destination, select Choose a bucket in this account and enter or browse for the destination bucket name.
Under IAM role, select Choose from existing IAM roles, then choose Create new role from the drop-down list.
Under Destination storage class, you can select Change the storage class for the replicated objects and choose one of the storage classes to be set for the replicated objects in the destination bucket.
Click on Save to save your changes.
By adding this replication rule, you grant the source bucket permission to replicate objects to the destination bucket in the said region. Once the replication process is complete, the destination bucket will contain a copy of the objects from the source bucket. The objects in the destination bucket will have the same ownership, permissions, and metadata as the source objects. When you enable replication for your bucket, several background processes occur to facilitate this process. S3 continuously monitors changes to objects in your source bucket. Once a change is detected, S3 generates a replication request for the corresponding objects and initiates the process of transferring the data from the source to the destination bucket.
There are additional options that you can enable while setting the replication rule under Additional replication options. The Replication metrics option enables you to monitor the replication progress with S3 Replication metrics. It does this by tracking bytes pending, operations pending, and replication latency. The Replication Time Control (RTC) option can be beneficial if you have a strict service-level agreement (SLA) for data replication as it will ensure that approximately 99% of your objects will be replicated within a 15-minute timeframe. It also enables replication metrics to notify you of any instances of delayed object replication. The Delete marker replication option will replicate object versions with a delete marker. Finally, the Replica modification sync option will replicate the metadata changes of objects.
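For reference, a comparable replication configuration can be applied from the AWS CLI; this is a sketch only, with placeholder role, account, bucket names, and prefix, and it assumes versioning is already enabled on both buckets:
# Apply a single replication rule to the source bucket
aws s3api put-bucket-replication \
    --bucket <source-bucket-name> \
    --replication-configuration '{
        "Role": "arn:aws:iam::<account-id>:role/<replication-role>",
        "Rules": [
            {
                "ID": "replicate-logs",
                "Priority": 0,
                "Status": "Enabled",
                "Filter": { "Prefix": "logs/" },
                "DeleteMarkerReplication": { "Status": "Disabled" },
                "Destination": {
                    "Bucket": "arn:aws:s3:::<destination-bucket-name>",
                    "StorageClass": "STANDARD_IA"
                }
            }
        ]
    }'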
Enabling and