50,99 €
Your complete guide to preparing for the AWS® Certified Data Engineer – Associate exam
The AWS® Certified Data Engineer Study Guide is your one-stop resource for complete coverage of the challenging DEA-C01 Associate exam. This Sybex Study Guide covers 100% of the DEA-C01 objectives. Prepare for the exam faster and smarter with Sybex thanks to accurate content, including an assessment test that validates and measures exam readiness, real-world examples and scenarios, practical exercises, and challenging chapter review questions. Reinforce and retain what you’ve learned with the Sybex online learning environment and test bank, accessible across multiple devices. Get ready for the AWS Certified Data Engineer exam – quickly and efficiently – with Sybex.
Coverage of 100% of all exam objectives in this Study Guide means you’ll be ready for: Data Ingestion and Transformation; Data Store Management; Data Operations and Support; and Data Security and Governance.
ABOUT THE AWS DATA ENGINEER – ASSOCIATE CERTIFICATION
The AWS Data Engineer – Associate certification validates skills and knowledge in core data-related Amazon Web Services. It recognizes your ability to implement data pipelines and to monitor, troubleshoot, and optimize cost and performance issues in accordance with best practices.
Interactive learning environment
Take your exam prep to the next level with Sybex’s superior interactive online study tools. To access our learning environment, simply visit www.wiley.com/go/sybextestprep, register your book to receive your unique PIN, and instantly gain one year of FREE access after activation to:
• Interactive test bank with 5 practice exams to help you identify areas where further review is needed. Get more than 90% of the answers correct, and you’re ready to take the certification exam.
• 100 electronic flashcards to reinforce learning and last-minute prep before the exam
• Comprehensive glossary in PDF format gives you instant access to the key terms so you are fully prepared
Syed Humair
Chenjerai Gumbo
Adam Gatt
Asif Abbasi
Lakshmi Nair
Copyright © 2025 by John Wiley & Sons, Inc. All rights, including for text and data mining, AI training, and similar technologies, are reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada and the United Kingdom.
ISBNs: 9781394286584 (paperback), 9781394286607 (ePDF), 9781394286591 (ePub)
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at www.wiley.com/go/permission.
The manufacturer's authorized representative according to the EU General Product Safety Regulation is Wiley-VCH GmbH, Boschstr. 12, 69469 Weinheim, Germany, e-mail: [email protected].
Trademarks: WILEY, the Wiley logo, and Sybex are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. AWS is a trademark or registered trademark of Amazon Technologies, Inc. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and authors have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. For product technical support, you can find answers to frequently asked questions or reach us via live chat at https://sybexsupport.wiley.com.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Control Number: 2025930429
Cover image: © Jeremy Woodhouse/Getty Images
Cover design: Wiley
Collaborating on this Study Guide has been an incredible experience. We are immensely grateful to Amazon Web Services (AWS) for their certification program and, in particular, for developing the AWS Certified Data Engineer – Associate exam. This is an extremely important certification that will strengthen and empower professionals across the industry.
We want to thank the Wiley team, including Ken Brown, Satish Gowrishankar, Kelly Talbot, Saravanan Dakshinamurthy, and everyone at Wiley, for their assistance in creating, editing, and publishing this Study Guide. Your patience with us during the editing process was amazing; thank you for your help and support throughout. We especially want to acknowledge Kelly's effort to help us raise the bar for this Study Guide for potential test takers.
We also appreciate the input, guidance, and insights of all our colleagues at Amazon and AWS in our professional lives and in our efforts to create this book.
We are especially thankful for the support and understanding of our families, friends, and colleagues as we devoted an inordinate amount of time to writing this book.
Last but not least, we would like to thank you, our readers, for pursuing the AWS Certified Data Engineer – Associate exam certification and for your devotion to helping the industry, clients, and people across the world maximize the power and potential of AWS.
— Syed Humair, Chenjerai Gumbo, Adam Gatt, Asif Abbasi, and Lakshmi Nair
Syed Humair is a Senior Analytics Specialist Solutions Architect at Amazon Web Services (AWS), renowned for his expertise in data engineering, machine learning, and enterprise architecture. With nearly two decades of experience, his skill set encompasses data strategy, data warehousing, business intelligence, and data analytics, with a particular emphasis on cloud‐based solutions. Humair's impact spans diverse industries, including retail, travel, telecommunications, healthcare, and financial services. In his role at AWS, Humair excels in guiding customers through the complexities of data analytics and AI, solidifying his status as a knowledgeable author and an invaluable asset in the field.
Chenjerai Gumbo is an AWS Solutions Architect Leader of Analytics and an Institute of Directors (IOD) Member. He is a technology leader with a keen interest in leadership, business models, data, and process improvement, with exposure to telecommunications (fixed and mobile), utility (electricity), and AWS cloud environments. As a leader, he is constantly pursuing an understanding of the synergy between business, people, and technology in order to help organizations transform their businesses. He holds an MBA from the University of Stellenbosch Business School.
Adam Gatt is a seasoned data architect with over 20 years of experience. He specializes in data warehousing, business intelligence, and big data. His career includes positions at Amazon Web Services (AWS) and Hewlett‐Packard, as well as contributions to various industries such as mass media, cybersecurity, insurance, and telecommunications. Adam has built a reputation for translating complex technical ideas for non‐technical audiences and delivering innovative solutions for business challenges. He is currently a Senior Redshift Solution Architect at AWS, helping customers build robust, scalable, and high‐performance analytics solutions in the cloud.
Asif Abbasi is a Solutions Engineering and Architecture Leader and a Principal Solutions Architect at AWS, driving innovation across the EMEA region. His passion lies in leading technology teams and facilitating customer adoption of data, analytics, and AI/ML. As a published author in AWS data analytics, Generative AI, and Apache Spark, Asif empowers organizations to overcome complex business challenges through strategic implementation of analytics and AI ecosystems. With two decades of experience, Asif is dedicated to bridging the gap between business problems and technological solutions. Asif specializes in simplifying technology for CXOs, business and IT directors, and data scientists, ensuring seamless understanding and implementation.
Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries and diverse geographies. She focuses on crafting cloud‐based data platforms, enabling real‐time streaming, big data processing, and robust data governance.
In the rapidly evolving landscape of cloud computing and data engineering, the AWS Certified Data Engineer – Associate certification has emerged as a must-have credential for data professionals seeking to demonstrate their expertise in designing, building, and managing data solutions on Amazon Web Services. This comprehensive exam preparation guide, crafted by a team of seasoned industry experts, delivers deep technical knowledge and practical insights to empower aspiring data engineers to excel in their certification journey and professional careers.
As someone who has witnessed the transformative power of data engineering firsthand, I understand the challenges practitioners face in navigating the comprehensive depth and breadth of AWS data services. The authors—Syed Humair, Chenjerai Gumbo, Adam Gatt, Asif Abbasi, and Lakshmi Nair—bring a wealth of real-world experience that transcends traditional textbook learning. Their approach combines rigorous technical depth with practical, hands-on guidance that reflects the complex realities of modern data engineering.
What sets this book apart is its holistic approach to exam preparation. Beyond simply teaching to the test, it provides a robust framework for understanding AWS data services, architectural patterns, and best practices. From data ingestion and storage to transformation, governance, security, and analytics, each chapter is designed both to prepare you for the exam and to increase your technical competency.
The methodical breakdown of complex concepts, coupled with practical examples, sample exam questions, and strategic exam-taking tips, makes this guide an indispensable resource. Whether you are a professional looking to validate your skills, a student entering the field, or a technologist seeking to expand your cloud data engineering capabilities, this book offers a clear and guided pathway to success.
I am confident this guide will not only help you pass the AWS Certified Data Engineer – Associate exam, but will also serve as a foundational reference for your continued professional development.
Imtiaz Sayed
Worldwide Tech Leader - AWS Data Analytics
December 9, 2024
In today's data-driven world, the demand for skilled data engineers is at an all-time high. AWS offers some of the most powerful and widely used cloud solutions, making AWS data engineering skills essential for anyone looking to thrive in this field. By mastering data engineering on AWS, you can work with cutting-edge technologies for processing, storing, and analyzing vast amounts of data, whether in batch or real-time scenarios. This knowledge will not only enhance your career prospects but also prepare you to design efficient, scalable, and secure data solutions that meet the needs of modern organizations.
Even if you are already familiar with other data engineering platforms, AWS expertise will set you apart, as it is the leading provider of cloud services worldwide. Gaining a solid understanding of AWS data engineering practices will help you make informed decisions about how to implement robust data solutions and handle real-world challenges.
This certification validates your ability to work with data-related services on AWS, including data ingestion, transformation, storage, and visualization. The exam is designed to test your skills in building automated pipelines, managing data catalogs, ensuring data security, and implementing governance policies. As a certified associate, you demonstrate the knowledge needed to design and maintain reliable, efficient, and secure data workflows on the AWS platform.
This book is designed to help you pass the AWS Certified Data Engineer – Associate exam. Covering all key exam topics, such as streaming and batch data ingestion, building automated data pipelines, transformation, and storage, this book provides detailed explanations, practical examples, and valuable insights. Each chapter aligns with core AWS services and concepts, ensuring you gain a comprehensive understanding of data engineering in the AWS ecosystem.
Beyond exam preparation, this book serves as a lasting reference guide. From setting up secure data pipelines to monitoring and troubleshooting operations, you will find the knowledge and skills needed for real-world scenarios. Whether you're an aspiring data engineer or a seasoned professional, this book will equip you with the tools to advance your career and excel in the ever-growing field of data engineering.
Don't just study the questions and answers! The questions on the actual exam will be different from the practice questions included in this book. The exam is designed to test your knowledge of a concept or objective, so use this book to learn the objectives behind the questions.
The AWS Certified Data Engineer – Associate exam is designed to validate the skills and expertise required to build, maintain, and optimize data processing systems on the AWS platform. It focuses on assessing the candidate's ability to manage data throughout its lifecycle, including ingestion, transformation, storage, and analysis. The exam is particularly relevant for professionals involved in creating robust, scalable, and secure data infrastructure, which is critical for data-driven decision-making and AI-powered solutions.
This certification covers a range of topics, including streaming and batch data ingestion, automated data pipeline construction, data transformation techniques, storage services, and database management. Security and governance, such as encryption, masking, and access controls, are also critical components. Additionally, the exam tests knowledge of advanced topics like data cataloging, monitoring, auditing, and troubleshooting data operations, ensuring candidates can maintain efficient systems. By earning this certification, professionals demonstrate their ability to work effectively in AWS's dynamic cloud environment, positioning themselves as key players in the data engineering landscape.
The field of data engineering is rapidly growing as organizations increasingly rely on data to drive decision-making and innovation. As we transition into a world dominated by AI and Generative AI technologies, the role of data engineers is becoming more critical than ever. These professionals are the backbone of systems that collect, transform, and manage data—tasks that are essential for deploying advanced AI models and driving innovation in industries ranging from healthcare to finance and beyond.
The AWS Certified Data Engineer – Associate exam is designed to validate the skills required to build and maintain data infrastructure on the AWS cloud platform, one of the leading platforms for scalable and secure data operations. The certification equips individuals with the expertise to design data pipelines, manage data storage, implement security protocols, and optimize data systems. This certification serves not only as a stepping stone for those entering the field but also as a benchmark for professionals looking to advance their careers in cloud-based data engineering.
According to industry analyses, the demand for data engineers is on a steep rise, with salaries reflecting the growing importance of this role. In 2024, the average salary for data engineers in the United States is estimated at $153,000 annually, with experienced professionals earning even higher. Data engineering is also increasingly intertwined with machine learning (ML) and AI. Nearly 30 percent of job postings now cite ML-related skills as critical for data engineers, highlighting the evolving role of these professionals in shaping AI systems. Furthermore, the expansion of remote work opportunities in this field is making data engineering an attractive and competitive career path globally.
As businesses race to integrate Generative AI and other emerging technologies, having a solid foundation in data engineering is essential. Building automated pipelines, ensuring data privacy, and maintaining robust infrastructure are key tasks that enable organizations to effectively implement and scale their AI strategies. This makes certified data engineers indispensable in today's job market and positions them as vital contributors to the technological advancements of the future.
The purpose of this book is to prepare you for the AWS Certified Data Engineer – Associate exam by covering all the essential topics, including:
Chapter 1: Streaming and Batch Data Ingestion
Chapter 2: Building Automated Data Pipelines
Chapter 3: Data Transformation
Chapter 4: Storage Services
Chapter 5: Databases and Data Warehouses on AWS
Chapter 6: Data Catalogs
Chapter 7: Visualizing Your Data
Chapter 8: Monitoring and Auditing Data
Chapter 9: Maintaining and Troubleshooting Data Operations
Chapter 10: Authentication and Authorization
Chapter 11: Data Encryption and Masking
Chapter 12: Data Privacy and Governance
Chapter 13: How to Take the Exam
Additionally, the book comes with a test bank that includes practice exams and flashcards to reinforce key concepts. These resources will not only help you pass the certification exam but will also serve as a comprehensive reference for practical data engineering tasks.
By mastering the material in this guide, you'll not only enhance your expertise in AWS's data services but also position yourself as a key player in the evolving landscape of data-driven innovation. As the demand for skilled data engineers continues to grow, this certification will provide you with the knowledge and credibility needed to excel in this exciting and dynamic field.
The AWS Certified Data Engineer – Associate exam is available to anyone interested in validating their skills in building, maintaining, and optimizing data infrastructure on the AWS platform. This certification is accessible to everyone—you don't need to work for a particular company or meet specific prerequisites to qualify.
The exam is administered by AWS through authorized testing partners, including Pearson VUE and PSI. Candidates can choose to take the exam either online from the comfort of their homes or at a testing center near them. Upon passing, you'll receive an official certificate from AWS that verifies your expertise, as well as a digital badge to showcase your achievement on professional platforms like LinkedIn.
To register for the exam, visit the official AWS Certification portal at https://aws.amazon.com/certification. You'll need to create an AWS Certification account if you don't already have one. From there, select the AWS Certified Data Engineer – Associate exam, choose your preferred testing method (online or in-person), and follow the prompts to schedule your exam. During registration, you will be asked to provide details such as your name, contact information, and payment method to complete the process.
Exam policies can change from time to time. We highly recommend that you check both the AWS and Pearson VUE sites for the most up-to-date information when you begin your preparation, when you register, and again a few days before your scheduled exam date.
This study guide uses several common elements to help you prepare. These include the following:
Summaries
The summary section of each chapter briefly recaps the chapter's content so you can quickly review what it covers.
Exam Essentials
The exam essentials highlight the major exam topics and critical knowledge you should take into the test. They are aligned with the exam objectives provided by AWS.
Chapter Review Questions
A set of questions at the end of each chapter will help you assess your knowledge of that chapter's topics and determine whether you are ready to take the exam.
The review questions, assessment test, and other testing elements included in this book are not derived from the actual exam questions, so don't memorize the answers to these questions and assume that doing so will enable you to pass the exam. You should learn the underlying topic, as described in the text of the book. This will let you answer the questions provided in this book and pass the exam. Learning the underlying topic is also the approach that will serve you best in the workplace—the ultimate goal of a certification.
Studying the material in AWS Certified Data Engineer Study Guide: Associate (DEA-C01) Exam is an important part of preparing for the AWS Certified Data Engineer – Associate certification exam, but we provide additional tools to help you prepare. The online TestBank will help you understand the types of questions that will appear on the certification exam.
The Practice Tests in the TestBank include all the questions in each chapter as well as the questions from the Assessment test. In addition, there are five practice exams with 50 questions each. You can use these tests to evaluate your understanding and identify areas that may require additional study.
The Flashcards in the TestBank will push the limits of what you should know for the certification exam. There are 100 flashcards, provided in digital format; each flashcard has one question and one correct answer.
To start using these to study for the AWS Certified Data Engineer – Associate exam, go to www.wiley.com/go/sybextestprep, and register your book to receive your unique PIN. Once you have the PIN, return to www.wiley.com/go/sybextestprep, find your book, and click Register or Log In and follow the link to register a new account or add this book to an existing account.
Like all exams, the AWS Certified Data Engineer – Associate certification from AWS is updated periodically and may eventually be retired or replaced. At some point after AWS is no longer offering this exam, the old editions of our books and online tools will be retired. If you have purchased this book after the exam was retired, or are attempting to register in the Sybex online learning environment after the exam was retired, please know that we make no guarantees that this exam's online Sybex tools will be available once the exam is no longer available.
AWS Certified Data Engineer Study Guide: Associate (DEA-C01) Exam has been written to cover every AWS exam objective at a level appropriate to its exam weighting. The following table provides a breakdown of this book's exam coverage, showing you the weight of each section and the chapter where each objective or subobjective is covered:
Subject Area (% of Exam)
Domain 1: Data Ingestion and Transformation (34%)
Domain 2: Data Store Management (26%)
Domain 3: Data Operations and Support (22%)
Domain 4: Data Security and Governance (18%)
Total: 100%
Exam Objectives (Chapter Coverage)
Domain 1: Data Ingestion and Transformation
Task 1.1: Perform data ingestion (Chapter 1)
Task 1.2: Transform and process data (Chapter 3)
Task 1.3: Orchestrate data pipelines with AWS (Chapter 2)
Task 1.4: Apply programming concepts with AWS (Chapter 2)
Domain 2: Data Store Management
Task 2.1: Choose a data store (Chapters 4, 5)
Task 2.2: Understand data cataloging systems (Chapter 6)
Task 2.3: Manage the lifecycle of data (Chapters 4, 5)
Task 2.4: Design data models and schema evolution (Chapter 5)
Domain 3: Data Operations and Support
Task 3.1: Automate data processing by using AWS services (Chapter 9)
Task 3.2: Analyze data by using AWS services (Chapter 7)
Task 3.3: Maintain and monitor data pipelines with AWS (Chapters 2, 8)
Task 3.4: Ensure data quality (Chapters 3, 6)
Domain 4: Data Security and Governance
Task 4.1: Apply authentication mechanisms (Chapter 10)
Task 4.2: Apply authorization mechanisms (Chapter 10)
Task 4.3: Ensure data encryption and masking (Chapter 11)
Task 4.4: Prepare logs for audit (Chapter 8)
Task 4.5: Understand data privacy and governance (Chapter 12)
If you believe you have found a mistake in this book, please bring it to our attention. At John Wiley & Sons, we understand how important it is to provide our customers with accurate content, but even with our best efforts, an error may occur.
In order to submit your possible errata, please email it to our Customer Service Team at [email protected] with the subject line “Possible Book Errata Submission.”
For compliance reasons, a company needs to delete records from an Amazon S3 data lake once they reach their retention limit. Which solution will meet the requirement with the least amount of operational overhead?
Copy the data from Amazon S3 to Amazon Redshift to delete the records.
Configure the tables using Apache Hudi format and use Amazon Athena to delete the records.
Configure the tables using Apache Iceberg format and use Amazon Redshift Spectrum to delete the records.
Configure the tables using Apache Iceberg format and use Amazon Athena to delete the records.
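As background for the table-format options above, Amazon Athena (engine version 3) can run row-level DELETE statements against Apache Iceberg tables registered in the AWS Glue Data Catalog. The boto3 sketch below shows how such a statement might be issued; the table, database, and S3 output location names are hypothetical:

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Delete records that have passed their retention limit from an Iceberg table.
    response = athena.start_query_execution(
        QueryString=(
            "DELETE FROM call_records "  # hypothetical Iceberg table
            "WHERE record_date < date_add('year', -7, current_date)"
        ),
        QueryExecutionContext={"Database": "compliance_db"},  # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    print(response["QueryExecutionId"])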
You've been asked to recommend the right database solution for a customer with relational data who requires strong durability, high availability, and disaster recovery. Which overall solution will best suit their requirements?
Amazon DocumentDB with additional replicas for enhanced read performance
Amazon DynamoDB with its automatic replication across multiple availability zones
Amazon RDS with a Multi-AZ DB instance deployment
Amazon Neptune to optimize instances of highly connected data with additional read replicas for high availability
Which of the following services supports real-time processing through streaming capabilities?
Only Amazon EMR
Only AWS Glue
Only Amazon Redshift
All of the above
A financial services company stores its data in Amazon Redshift. A data engineer wants to run real-time queries on financial data to support a web-based trading application. The engineer would like to run the queries from within the trading application. Which solution offers the least operational overhead?
Unload the data to Amazon S3 and use S3 Select to run the queries.
Configure and set up a JDBC connection to Amazon Redshift.
Establish WebSocket connections to Amazon Redshift.
Use the Amazon Redshift Data API.
Unload the data to Amazon S3 and use Amazon Athena to run the queries.
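For context on the options above, the Amazon Redshift Data API exposes SQL execution over a simple HTTPS interface, so an application can query Redshift without managing JDBC drivers or persistent connections. A minimal boto3 sketch follows; the cluster, database, secret, and table names are hypothetical:

    import boto3

    client = boto3.client("redshift-data", region_name="us-east-1")

    # Submit a statement asynchronously; no JDBC/ODBC connection is required.
    resp = client.execute_statement(
        ClusterIdentifier="trading-cluster",  # hypothetical cluster
        Database="finance",                   # hypothetical database
        SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-creds",  # hypothetical secret
        Sql="SELECT symbol, price FROM quotes ORDER BY quote_time DESC LIMIT 10",
    )

    # Poll for completion, then fetch the result set.
    status = client.describe_statement(Id=resp["Id"])
    if status["Status"] == "FINISHED":
        rows = client.get_statement_result(Id=resp["Id"])["Records"]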
A telecommunications company needs to implement a solution that prevents the accidental sharing of customer call records containing PII across AWS accounts while maintaining an audit trail of all access attempts. Which combination of services would best meet this requirement?
AWS IAM policies with CloudWatch Logs
Lake Formation with AWS CloudTrail and Macie
S3 bucket policies with access logging
AWS Backup with cross-account controls
An online retailer is developing a recommendation engine for its web store. The engine will provide personalized product recommendations to customers based on their past purchases and browsing history. The engine must efficiently navigate a highly connected dataset with low latency. This involves complex queries to identify products similar to those purchased by the customer and find products bought by other customers who have purchased the same item. Which solution best addresses this use case?
Use Amazon RDS for PostgreSQL with the Apache AGE (graph database) extension.
Use Amazon DocumentDB (with MongoDB compatibility) to efficiently handle semi-structured data.
Use Amazon Neptune to efficiently traverse graph datasets.
Use Amazon MemoryDB for Redis for ultra-fast query performance.
A company has moved its data transformation job to an Amazon EMR cluster with Apache Pig. The cluster uses on-demand instances to process large datasets. The output is critical to operations. It usually takes 1 hour to complete the job, and the company must ensure that the entire process adheres to the SLA of 2 hours. The company is looking for a solution that will provide cost reduction and negligibly impact availability. Which combination of solutions can be implemented to meet the requirements? (Choose two.)
Add a task node that runs on a Spot Instance.
Configure an EMR cluster that uses instance groups.
Use Spot Instances for all node types.
Configure an EMR cluster that uses instance fleets.
Assign Spot capacity for all node types and enable the switch to the on-demand instances option.
In Amazon Athena, what is the purpose of using partitions?
To increase the maximum query complexity
To reduce the amount of data scanned and lower query cost
To enable real-time data analysis
To create visualization directly in Athena
A customer is experiencing poor performance with a query on Redshift. The query joins two large tables on a single column key and aggregates the result. Each table has hundreds of millions of rows. What would be the best distribution style to use on each table?
Use the ALL distribution style to create a copy of each table on every compute node.
Use the EVEN distribution style to distribute rows round-robin to each compute node.
Use the KEY distribution style and set the columns used in the join as distribution keys.
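For readers who have not tuned Redshift distribution styles before, the distribution style is declared in the table DDL; KEY distribution on the join column co-locates matching rows from both tables on the same compute node, which avoids redistributing data during the join. A sketch using the Redshift Data API, with hypothetical cluster, table, and column names:

    import boto3

    client = boto3.client("redshift-data")

    # Distribute the table on the join column (KEY distribution) so that rows
    # with the same customer_id land on the same compute node.
    ddl = """
    CREATE TABLE sales (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (customer_id);
    """
    client.execute_statement(
        ClusterIdentifier="analytics-cluster",  # hypothetical cluster
        Database="dev",
        DbUser="admin",                         # hypothetical database user
        Sql=ddl,
    )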
A data scientist is developing a REST API for internal applications. Currently, every API call made to the application is logged in JSON format to an HTTP endpoint. What is the recommended low-code approach to stream the JSON data into S3 in Parquet format?
Set up a Kinesis Data Stream to ingest the data and use Amazon Data Firehose as a delivery stream. Once the data is ingested into S3, use an AWS Glue job to convert the JSON data to Parquet format.
Use Amazon Data Firehose as a delivery stream. Enable a record transformation that references a table stored in an Apache Hive metastore in EMR.
Use Amazon Data Firehose as a delivery stream. Enable a record transformation that references a table stored in the AWS Glue Data Catalog that defines the schema for your source records.
Use Amazon EMR to process streaming data. Create a Spark job to convert the JSON to Parquet format using an Apache Hive metastore to determine the schema of the JSON data.
When using AWS Glue DataBrew, which of the following is not a built-in transformation?
A. Handling missing values
B. Standardizing date formats
C. Training machine learning models
D. Removing duplicate records
You are working as a data architect for a company that runs an online web store on Amazon DynamoDB with high throughput volumes. You want to send customers an email when their order status has changed to “Shipped.” What would be the best approach to handle this requirement?
Set up a DynamoDB Stream on the Orders table with a Lambda trigger that sends an email via the Amazon Simple Email Service (Amazon SES) when the order status has changed to “Shipped.”
Export the Orders table into S3 once an hour to maintain a complete history. Run an AWS Glue job to identify orders whose status has changed to “Shipped” and send an email via Amazon SES.
Set up a nightly extract into Amazon Redshift using the COPY command. Run a Lambda function to check for any status changes to “Shipped” and send an email via Amazon SES.
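As background on the streams-based option, a DynamoDB Stream can invoke a Lambda function with batches of item-level change records, and the function can then call Amazon SES. The handler below is a minimal sketch; the attribute names, status value, and sender address are hypothetical assumptions, and the stream would need the NEW_AND_OLD_IMAGES view type:

    import boto3

    ses = boto3.client("ses")

    def handler(event, context):
        # Each record describes one item-level change captured by the DynamoDB Stream.
        for record in event.get("Records", []):
            if record.get("eventName") != "MODIFY":
                continue
            new_image = record["dynamodb"].get("NewImage", {})
            old_image = record["dynamodb"].get("OldImage", {})
            new_status = new_image.get("order_status", {}).get("S")  # hypothetical attribute
            old_status = old_image.get("order_status", {}).get("S")
            if new_status == "Shipped" and old_status != "Shipped":
                ses.send_email(
                    Source="orders@example.com",  # hypothetical verified sender
                    Destination={"ToAddresses": [new_image["customer_email"]["S"]]},
                    Message={
                        "Subject": {"Data": "Your order has shipped"},
                        "Body": {"Text": {"Data": "Order " + new_image["order_id"]["S"] + " is on its way."}},
                    },
                )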
A company using an on-premises Hadoop cluster for various batch processing jobs (Spark, TensorFlow, MXNet, Hive, and Presto) anticipates a data surge. They want to migrate to AWS for scalable and durable storage. However, they want to reuse their existing jobs. Which solution will meet these requirements most cost-effectively?
Migrate to Amazon EMR. Store the data in Amazon S3. Launch transient EMR clusters when jobs need to run.
Migrate to Amazon EMR. Store all the data in HDFS. Add more core nodes on-demand.
Re-architect the solution with the serverless offering of AWS Glue.
Migrate to Amazon Redshift. Use Amazon Redshift RA3 instances for frequently used data and Amazon S3 for infrequent data access to optimize storage cost.
Which AWS service automatically detects and reports sensitive data stored in Amazon S3 and security issues with S3 buckets?
AWS Security Hub
Amazon Macie
Amazon EventBridge
AWS Key Management Service (AWS KMS)
Your company has a website application running on Amazon RDS for MySQL and a data warehouse on Amazon Redshift. The analytics team has a requirement to show real-time information from the application on a dashboard, along with historical data from Redshift. What approach would you advise the team to use that would involve the least amount of effort?
Use the RDS MySQL native export functionality to unload data to S3 and load it in parallel into Redshift with the COPY command.
Implement Athena federated queries to query across the RDS MySQL database and Redshift.
Utilize a Redshift federated query to directly query the MySQL RDS database and combine the results with historical Redshift data in a database view.
Use the RDS MySQL native export functionality to unload data to S3 and query the data using Redshift Spectrum. Combine the data from the Spectrum query and historical data stored in Redshift in a database view.
In an oil and gas company, a data analyst needs to implement anomaly detection on the pressure sensor data stream collected by Kinesis Data Streams. While Lambda can trigger the valve actions, the focus is on cost-effective anomaly detection within the data stream. Which of the following solutions would be recommended?
Launch a Spark Streaming application on an Amazon EMR cluster and connect it to Amazon Kinesis Data Streams. Use an ML algorithm to identify the anomaly. Spark will send an alert to open the valve if an anomaly is discovered.
Use the RANDOM_CUT_FOREST function in the Amazon Managed Service for Apache Flink to detect the anomaly and send an alert to open the valve.
Use Amazon Data Firehose with Amazon S3 as the data lake storage for the sensor data. Create a Lambda function to schedule an Amazon Athena query against the data in S3. The Lambda function will send an alert to open the valve if an anomaly is discovered.
Provision an EC2 fleet with a KCL application to consume the stream and aggregate the data collected by the sensors to detect the anomaly.
Which of the following is not a primary purpose of the listed AWS services?
Amazon EMR: Big data processing and analysis
Amazon Redshift: Data warehousing and analytics
AWS Glue: ETL and data integration
Amazon MWAA: Real-time data processing
A large digital newspaper business is running a data warehouse on a provisioned Amazon Redshift cluster that is not encrypted. Its security department has mandated that the cluster be encrypted to meet regulation requirements with keys that can be rotated and have comprehensive logging. What is the easiest and fastest way to do this?
Unload the data into S3, apply SSE-S3 encryption, and then load the data back into Redshift via the COPY command.
Unload the data into S3, apply SSE-KMS encryption, and then load the data back into Redshift via the COPY command.
Take a manual snapshot of the cluster, and then create a new cluster from the snapshot with encryption enabled.
Enable encryption on the cluster with a KMS customer-managed key.
Which open table format is natively supported by AWS Glue Data Catalog for registering and managing table metadata, allowing for improved query performance and schema evolution?
Apache Hudi
Apache Iceberg
Delta Lake
All of the above
Which of the following is not a best practice for effective data visualization?
Using consistent color schemes across related visualizations
Including as much data as possible in a single visualization
Providing clear titles and labels for axes and data points
Tailoring the complexity of the visualization to the audience
A data engineer has custom Python scripts that perform common formatting logic used by many Lambda functions. Whenever the scripts change, the data engineer must manually update all the Lambda functions. Which solution meets this requirement with the least manual intervention?
Package the common Python script into Lambda layers. Apply the Lambda layers to the Lambda functions.
Assign aliases to the Lambda functions.
Store a pointer to a custom Python script in environment variables in a shared S3 bucket.
Combine multiple Lambda functions into one Lambda function.
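For context, a Lambda layer packages shared code once, and each function references the layer's ARN; when the shared script changes, a new layer version is published instead of editing every function. A minimal boto3 sketch with hypothetical file, layer, and function names:

    import boto3

    lam = boto3.client("lambda")

    # Publish the shared formatting code as a new layer version. The zip archive
    # must place the module under python/ so Lambda adds it to the import path.
    with open("formatting_layer.zip", "rb") as f:
        layer = lam.publish_layer_version(
            LayerName="common-formatting",      # hypothetical layer name
            Content={"ZipFile": f.read()},
            CompatibleRuntimes=["python3.12"],
        )

    # Point an existing function at the new layer version.
    lam.update_function_configuration(
        FunctionName="process-orders",          # hypothetical function name
        Layers=[layer["LayerVersionArn"]],
    )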
A data analyst creates a table from a record set stored in Amazon S3. The data is then partitioned using the year=2024/month=12/day=06/ format. Although the partitioning was successful, no records were returned when the SELECT * query was executed. What could be the possible reason?
The analyst did not run the MSCK REPAIR TABLE command after partitioning the data.
The analyst did not run the MSCK REPAIR TABLE command before partitioning the data.
The newly created table does not have read permissions.
The S3 bucket where the sample data is stored does not have read permissions.
The analyst needs to use the CTAS command (CREATE TABLE AS SELECT) while creating the table in Amazon Athena.
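For reference, when data is stored under Hive-style prefixes such as year=2024/month=12/day=06/, Athena only sees a partition after it has been registered in the metastore; MSCK REPAIR TABLE scans the table's S3 location and adds any missing partitions. A minimal boto3 sketch with hypothetical database, table, and output location names:

    import boto3

    athena = boto3.client("athena")

    # Register the Hive-style partitions so subsequent SELECT queries return data.
    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE web_logs",            # hypothetical table
        QueryExecutionContext={"Database": "analytics_db"},  # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )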
You are working for a global company with multiple AWS accounts. There is a requirement to perform cross-account encryption, where one account uses a KMS key from a different account to access an S3 bucket. What steps are required to configure encryption across these AWS account boundaries securely? (Choose two.)
In the account that owns the key, enable S3 Server-Side Encryption with AWS Key Management Service Keys (SSE-KMS) and assign the KMS customer-managed key to the S3 bucket.
In the account that owns the key, enable S3 Server-Side Encryption with AWS Key Management Service Keys (SSE-KMS) and assign the KMS customer-managed key to the S3 bucket. Grant access to the external account on the key's key policy in AWS KMS.
In the external account, create an IAM policy with the required permissions on the key and attach the policy to users or roles who need access to the key.
In the external account, enforce TLS for S3 access by specifying aws:SecureTransport on bucket policies.
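As background to the cross-account options, access to a KMS customer-managed key is controlled in two places: the key policy in the account that owns the key, and IAM policies in the account that consumes it. The snippet below sketches the kind of key-policy statement the owning account might add; the account ID is hypothetical:

    import json

    # Key policy statement allowing principals in the external account to use the
    # key for SSE-KMS encryption and decryption of S3 objects. It is added alongside
    # the key's existing administrative statements.
    cross_account_statement = {
        "Sid": "AllowExternalAccountUseOfKey",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:root"},  # hypothetical account
        "Action": [
            "kms:Encrypt",
            "kms:Decrypt",
            "kms:ReEncrypt*",
            "kms:GenerateDataKey*",
            "kms:DescribeKey",
        ],
        "Resource": "*",
    }
    print(json.dumps(cross_account_statement, indent=2))

The consuming account then grants its own users or roles the matching KMS permissions on that key through an IAM policy.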
A company is planning to migrate a legacy Hadoop cluster running on premises to AWS. The cluster must use the latest EMR release and include its custom scripts and workflows during the migration. The data engineer must reuse the existing on-premises Java application code in the new EMR cluster. Which of these solutions is the recommended approach?
Submit a PIG step in the EMR cluster and compile the Java program using the version of the cluster.
Submit a STREAMING step in the EMR cluster and compile the Java program using the version of the cluster.