Prepare for the AWS Machine Learning Engineer exam smarter and faster, and get job-ready, with this efficient and authoritative resource.
In AWS Certified Machine Learning Engineer Study Guide: Associate (MLA-C01) Exam, veteran AWS Practice Director at Trace3—a leading IT consultancy offering AI, data, cloud and cybersecurity solutions for clients across industries—Dario Cabianca delivers a practical and up-to-date roadmap to preparing for the MLA-C01 exam. You'll learn the skills you need to succeed on the exam as well as those you need to hit the ground running at your first AI-related tech job.
You'll learn how to prepare data for machine learning models on Amazon Web Services; build, train, and refine models; evaluate model performance; and deploy and secure your machine learning applications against bad actors.
Perfect for everyone preparing for the AWS Certified Machine Learning Engineer – Associate exam, AWS Certified Machine Learning Engineer Study Guide is also an invaluable resource for those preparing for their first role in AI or data science, as well as junior-level practicing professionals seeking to review the fundamentals with a convenient desk reference.
Cover
Table of Contents
Title Page
Copyright
Dedication
Acknowledgments
About the Author
About the Technical Editor
Introduction
Chapter 1: Introduction to Machine Learning
Understanding Artificial Intelligence
Understanding Machine Learning
Understanding Deep Learning
Summary
Exam Essentials
Review Questions
Chapter 2: Data Ingestion and Storage
Introducing Ingestion and Storage
Ingesting and Storing Data
Summary
Exam Essentials
Review Questions
Chapter 3: Data Transformation and Feature Engineering
Introduction
Understanding Feature Engineering
Data Cleaning and Transformation
Feature Engineering Techniques
Data Labeling
Managing Class Imbalance
Data Splitting
Summary
Exam Essentials
Review Questions
Chapter 4: Model Selection
Understanding AWS AI Services
Developing Models with Amazon SageMaker Built-in Algorithms
Criteria for Model Selection
Summary
Exam Essentials
Review Questions
Chapter 5: Model Training and Evaluation
Training
Hyperparameter Tuning
Model Performance Evaluation
Deep-Dive Model Tuning Example
Summary
Exam Essentials
Review Questions
Chapter 6: Model Deployment and Orchestration
AWS Model Deployment Services
Advanced Model Deployment Techniques
Orchestrating ML Workflows
Deep-Dive Model Deployment Example
Summary
Exam Essentials
Review Questions
Chapter 7: Model Monitoring and Cost Optimization
Monitoring Model Inference
Monitoring Infrastructure and Cost
Summary
Exam Essentials
Review Questions
Chapter 8: Model Security
Security Design Principles
Securing AWS Services
Summary
Exam Essentials
Review Questions
Appendix A: Answers to the Review Questions
Chapter 1: Introduction to Machine Learning
Chapter 2: Data Ingestion and Storage
Chapter 3: Data Transformation and Feature Engineering
Chapter 4: Model Selection
Chapter 5: Model Training and Evaluation
Chapter 6: Model Deployment and Orchestration
Chapter 7: Model Monitoring and Cost Optimization
Chapter 8: Model Security
Appendix B: Mathematics Essentials
Linear Algebra
Statistics
Probability Theory
Calculus
Index
End User License Agreement
Chapter 2
TABLE 2.1 Data format support for built-in ML algorithms in Amazon SageMaker.
TABLE 2.2 Example of a data access pattern.
TABLE 2.3 AWS services for structured, semi-structured, and unstructured data.
TABLE 2.4 Amazon S3 storage classes.
Chapter 3
TABLE 3.1 Examples of pre-training bias metrics.
Chapter 5
TABLE 5.1 Regularization methods comparison.
Chapter 6
TABLE 6.1 Inference and Training Comparison.
TABLE 6.2 Inference-Based EC2 Instance Types.
Chapter 7
TABLE 7.1 AWS Pricing Models for Compute Infrastructure.
Chapter 8
TABLE 8.1 ML Predefined Permissions.
Dario Cabianca
Copyright © 2025 by Dario Cabianca. All rights reserved, including rights for text and data mining and training of artificial intelligence technologies or similar technologies.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
The manufacturer’s authorized representative according to the EU General Product Safety Regulation is Wiley-VCH GmbH, Boschstr. 12, 69469 Weinheim, Germany, e-mail: [email protected].
Trademarks: Wiley and the Wiley logo, and the Sybex logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. AWS is a registered trademark of Amazon Technologies, Inc. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Control Number: 2025908052
Paperback ISBN: 9781394319954
ePDF ISBN: 9781394319992
ePub ISBN: 9781394319978
Cover Design: Wiley
Cover Image: © Jeremy Woodhouse/Getty Images
To my family.
Creating the AWS Certified Machine Learning Engineer Study Guide: Associate (MLA-C01) Exam has been an extraordinary journey, and I would like to take this opportunity to express my gratitude to those who made this book possible.
In addition to my family, I would like to thank Kenyon Brown, senior acquisitions editor at Wiley, who helped get the book started. I am also grateful to Christine O’Connor, managing editor at Wiley, and to Dabian Witherspoon, project manager, for his detailed coordination of all tasks as the book progressed through its stages.
Justin Roberts, the technical editor, did a phenomenal job validating the content I authored and testing the code I created. I am also grateful to Kim Wimpsett, the copyeditor, and Dhilip Kumar Rajendran, the content refinement specialist, for their accurate and comprehensive review of the copyedits and the proofs.
Finally, I would like to thank the late Professor Giovanni Degli Antoni, my doctoral thesis advisor, who always inspired and motivated me to pursue my scientific curiosity during my years at the University of Milan, whose Department of Computer Science has been named in his honor.
Dario Cabianca is a computer scientist (PhD, University of Milan), author, and AWS practice director at Trace3, which is a leading IT consultancy offering AI, data, cloud and cybersecurity solutions for clients across industries. At Trace3, Dario oversees the practice nationally, serving customers, building partnerships, and evangelizing Trace3’s portfolio of AWS competencies and services. He has worked with a variety of global consulting firms and enterprises for more than two decades and has earned 10 cloud certifications with AWS, Google Cloud, Microsoft Azure, and ISC2.
Justin Roberts is a solutions architect at Amazon Web Services (AWS), where he advises strategic customers on designing and running complex large-scale systems on AWS. Justin has worked for several enterprises over almost two decades across numerous disciplines. He currently holds multiple industry certifications, including 14 AWS certifications, and is a member of the exclusive AWS “Golden Jacket” club for holding all active AWS certifications at once. Justin has a BS from Eastern Kentucky University and an MBA from Bellarmine University.
The demand for machine learning (ML) engineers has significantly increased, particularly since 2023, when the introduction of ChatGPT revolutionized the artificial intelligence (AI) landscape. The field has seen substantial interest and investment as organizations across various sectors recognize the transformative potential of AI. As ML and AI become progressively more sophisticated, the need for skilled professionals to develop, implement, and maintain these systems has never been greater. To meet this demand, the new AWS Certified Machine Learning Engineer – Associate certification was developed to equip aspiring engineers with the knowledge and skills necessary to excel in this dynamic field.
The AWS Certified Machine Learning Engineer – Associate certification is a testament to the proficiency and expertise required to navigate this ever-evolving field. This certification not only validates an individual’s technical skills but also underscores their ability to leverage AWS’s extensive suite of ML and AI services to drive innovation. As this technology continues to mature, certified professionals are well-positioned to lead the charge in developing cutting-edge AI solutions.
This study guide adopts a methodical approach by walking you step-by-step through all the phases of the ML lifecycle. The exposition of each topic offers a combination of theoretical knowledge, practical exercises with tested code in Python, and necessary diagrams and plots to visually represent ML models and AI in action.
Throughout this study guide, we will delve into the fascinating world of Amazon SageMaker AI (formerly known as Amazon SageMaker) and Amazon Bedrock, exploring their numerous features and functionalities. We will cover the core concepts and practical applications, providing you with the knowledge and tools needed to excel as an AWS Machine Learning Engineer. Whether you are just starting your journey or looking to deepen your expertise, this guide will serve as a comprehensive resource to mastering these platforms and achieving certification.
By obtaining the AWS Certified Machine Learning Engineer – Associate certification, you are not just enhancing your skillset but also contributing to the forefront of technological innovation. Let this study guide be your roadmap to success in this rapidly expanding field.
The AWS Certified Machine Learning Engineer – Associate Exam is intended to validate the technical skills required to design, build, and operationalize well-architected ML workloads on AWS. The exam covers a wide range of topics, including data preparation, feature engineering, model training, model evaluation, and deployment strategies.
The exam consists of 65 questions and has a duration of 130 minutes. It is available in multiple languages, including English, Japanese, Korean, and Simplified Chinese. The exam costs $150 and can be taken at a Pearson VUE testing center or online as a proctored exam. This certification is valid for 3 years.
Your exam results are presented as a scaled score ranging from 100 to 1,000. To pass, a minimum score of 720 is required. This score reflects your overall performance on the exam and indicates whether you have successfully passed.
The official exam guide is available at https://d1.awsstatic.com/training-and-certification/docs-machine-learning-engineer-associate/AWS-Certified-Machine-Learning-Engineer-Associate_Exam-Guide.pdf.
During the writing of this book, “Amazon SageMaker” was renamed “Amazon SageMaker AI.” As a result, the first chapters of this book still use the former name, because at that time this was the correct name in use. In this book, the terms “Amazon SageMaker” and “Amazon SageMaker AI” are used interchangeably to denote the new AWS unified platform for data, analytics, ML, and AI. See https://aws.amazon.com/blogs/aws/introducing-the-next-generation-of-amazon-sagemaker-the-center-for-all-your-data-analytics-and-ai.
The increasing demand for AWS ML and AI engineers—due to the rapid adoption of ML and AI technologies across industries—has made this a perfect time to pursue the AWS Certified Machine Learning Engineer – Associate certification. Companies are looking for skilled professionals who can harness the power of AWS to build, deploy, and manage ML models efficiently. By earning this certification, you can demonstrate your proficiency in using AWS tools and services to drive impactful ML and AI solutions. This certification not only validates your technical skills but also sets you apart in a competitive job market, making you a valuable asset to potential employers.
One of the key reasons to pursue this certification is the comprehensive knowledge you’ll gain about AWS’s cutting-edge ML and AI services. While preparing for the exam, you’ll master the use of Amazon SageMaker AI, a powerful platform for building, training, deploying, and monitoring ML models at scale. You’ll also explore the latest additions to Amazon SageMaker AI, which continuously evolves to bring together a broad set of AWS ML, AI, and data analytics services. As a result, you’ll become proficient in using Amazon Bedrock, a service that simplifies the deployment of foundation models by offering pretrained models from leading AI companies. However, because Amazon Bedrock is relatively new, in-depth material on it is scarce, making this certification even more valuable as it positions you at the forefront of emerging AI technologies.
Amazon SageMaker AI and Amazon Bedrock are designed for seamless integration with numerous AWS services that are required during the phases of the ML lifecycle. Accordingly, this study guide also provides extensive coverage of those services. These include storage services (e.g., Amazon S3, Amazon Elastic File System [EFS], Amazon FSx for Lustre, and others), ingestion services (e.g., Amazon Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka [MSK], and others), deployment services (e.g., Amazon Elastic Compute Cloud [EC2], Amazon Elastic Container Service [ECS], and others), orchestration services (e.g., AWS Step Functions, Amazon Managed Workflows for Apache Airflow [MWAA], and others), monitoring, cost optimization, and security services, just to name a few.
Another significant advantage of becoming AWS Machine Learning Engineer certified is the access to exclusive resources and a supportive community of professionals. By joining the certified AWS community, you’ll have the opportunity to network with other professionals, share knowledge, and stay updated on the latest trends and advancements in the field. This certification not only boosts your career prospects, but also keeps you engaged in a dynamic and constantly evolving industry.
Your journey to becoming an AWS Certified Machine Learning Engineer begins with a structured approach that covers foundational knowledge, hands-on practice, and thorough exam preparation. This study guide is crafted to mirror that journey.
Foundational knowledge
Start by building a robust understanding of ML concepts, how to formulate ML problems, and the underlying algorithms and statistical methods. It’s also important to grasp the basics of linear algebra, calculus, probability, and statistics, as they form the mathematical foundation for ML. Additionally, familiarize yourself with AWS services, particularly Amazon SageMaker AI, which provides tools and features for every phase of the ML lifecycle. Learning Python, the primary programming language used in ML, is also essential.
Hands-on practice
Engage in practical experience through AWS resources like tutorials, labs, and workshops. Focus on using Amazon SageMaker AI for various phases of the ML lifecycle, including:
Data preparation
Use Amazon SageMaker Data Wrangler to simplify data preparation and feature engineering.
Model building
Leverage Amazon SageMaker Studio for an integrated development environment that supports building, training, and debugging ML models.
Model training
Utilize Amazon SageMaker Training to efficiently train models with built-in algorithms or your own custom code.
Model deployment
Use Amazon SageMaker Endpoint to deploy trained models for real-time predictions, and Amazon SageMaker Batch Transform for batch predictions.
Model monitoring
Employ Amazon SageMaker Model Monitor to continuously monitor the performance of deployed models and ensure that they remain accurate over time.
By working on real-world projects that cover the entire ML lifecycle, you’ll gain hands-on experience and deepen your understanding.
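To make these phases concrete, here is a minimal sketch using the Amazon SageMaker Python SDK that trains the built-in XGBoost algorithm and deploys it to a real-time endpoint. The bucket name and S3 prefixes are hypothetical placeholders, not paths from this book; the SDK calls themselves (Estimator, fit, deploy) are standard.

# A minimal sketch, assuming you run inside SageMaker Studio or a notebook
# with a default execution role; "my-ml-bucket" is a hypothetical bucket.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Retrieve the built-in XGBoost container image for the current region
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-ml-bucket/output",  # hypothetical output location
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# Train on CSV data staged in S3 (label in the first column for built-in XGBoost),
# then deploy the trained model as a real-time endpoint
train_input = TrainingInput("s3://my-ml-bucket/train/", content_type="text/csv")
estimator.fit({"train": train_input})
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
# predictor.delete_endpoint()  # clean up afterward to avoid ongoing charges

Calling fit launches a managed training job on the specified instance type, and deploy provisions a hosted endpoint for real-time predictions, mirroring the training and deployment phases listed earlier.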
Exam preparation
Use AWS’s official exam guide to understand key objectives. Utilize practice exams and sample questions to test your readiness. Regular review and practice will ensure that you are well-prepared for the certification exam. On exam day, manage your time effectively and read each question carefully to increase your chances of passing and earning the certification.
This book is intended for a broad audience of software, data, and cloud engineers/architects, ideally with 1 year of hands-on experience with AWS services. Given the engineering focus of the certification, basic knowledge of the Python programming language—which is the de facto ML programming language—is expected.
Moreover, due to the data-centric nature of ML, having a firm grasp of basic mathematics and statistics is essential, but don’t worry—we’ll cover the basics in Appendix B (Mathematics Essentials), and provide guidance as needed.
This book comes with tested code in Python, with which you can experiment using Amazon SageMaker Studio. To get the most out of this book, an AWS account is highly recommended.
This study guide utilizes a number of common elements to help you acquire and reinforce your knowledge. Each chapter includes
Summaries
The summary section briefly explains the key concepts of the chapter, allowing you to easily remember what you learned.
Exam essentials
The exam essentials section highlights the exam topics and the knowledge you need to have for the exam. These exam topics are directly related to the task statements provided by AWS, which are available in the upcoming exam objectives section.
Chapter review questions
A set of questions will help you assess your knowledge and your exam readiness.
An online learning environment accompanies this study guide, designed to simplify your learning experience. Whether you’re preparing at home or on the go, this platform is here to make studying easier and more convenient for you. The following learning resources are included:
Practice tests
This study guide includes a total of 115 questions. All 90 chapter review questions are available in our proprietary digital test engine, along with the 25 questions in the assessment test at the end of this introduction.
Electronic flash cards
One hundred questions in a flash card format (a question followed by a single correct answer) are provided.
Glossary
The key terms you need to know for the exam are available as a searchable glossary in PDF format along with their definitions.
The online learning environment and the test bank are available at https://www.wiley.com/go/sybextestprep.
This study guide uses certain typographic styles to help you quickly identify important information and to avoid confusion over the meaning of words such as on-screen prompts.
In particular, look for the following styles:
Italicized text
indicates key terms that are described at length for the first time in a chapter. These words are likely to appear in the searchable online glossary. (Italics are also used for emphasis.)
Monospaced font
indicates the contents of program or configuration files, messages displayed at a text-mode macOS/Linux shell prompt, filenames, text-mode command names, and Internet URLs.
In addition to these text conventions, which can apply to individual words or entire paragraphs, a few conventions highlight segments of text:
A note indicates information that’s useful or interesting, but that’s somewhat peripheral to the main text. A note might be relevant to a small number of networks, for instance, or it may refer to an outdated feature.
A tip provides information that you should understand for the exam. A tip can save you time or frustration and may not be entirely obvious. A tip might describe how to get around a limitation or how to use a feature to perform an unusual task.
This study guide is designed to comprehensively address each exam objective, reflecting the exam weighting outlined in the official guide, as illustrated in the following table:
Domain 1: Data Preparation for Machine Learning (ML), 28%
Domain 2: ML Model Development, 26%
Domain 3: Deployment and Orchestration of ML Workflows, 22%
Domain 4: ML Solution Monitoring, Maintenance, and Security, 24%
Knowledge of (covering chapters shown in parentheses):
Data formats and ingestion mechanisms (for example, validated and non-validated formats, Apache Parquet, JSON, CSV, Apache ORC, Apache Avro, RecordIO) (Chapters 2, 4, 5, 6)
How to use the core AWS data sources (for example, Amazon S3, Amazon Elastic File System [Amazon EFS], Amazon FSx for NetApp ONTAP) (Chapters 2, 5, 6)
How to use AWS streaming data sources to ingest data (for example, Amazon Kinesis, Apache Flink, Apache Kafka) (Chapter 2)
AWS storage options, including use cases and tradeoffs (Chapter 2)
Knowledge of:
Data cleaning and transformation techniques (for example, detecting and treating outliers, imputing missing data, combining, deduplication) (Chapters 3, 4)
Feature engineering techniques (for example, data scaling and standardization, feature splitting, binning, log transformation, normalization) (Chapters 3, 4)
Encoding techniques (for example, one-hot encoding, binary encoding, label encoding, tokenization) (Chapter 3)
Tools to explore, visualize, or transform data and features (for example, Amazon SageMaker Data Wrangler, AWS Glue, AWS Glue DataBrew) (Chapter 3)
Services that transform streaming data (for example, AWS Lambda, Spark) (Chapter 3)
Data annotation and labeling services that create high-quality labeled datasets (Chapter 3)
Knowledge of:
Pretraining bias metrics for numeric, text, and image data (for example, class imbalance [CI], difference in proportions of labels [DPL]) (Chapter 3)
Strategies to address CI in numeric, text, and image datasets (for example, synthetic data generation, resampling) (Chapters 3, 5)
Techniques to encrypt data (Chapters 3, 8)
Data classification, anonymization, and masking (Chapter 3)
Implications of compliance requirements (for example, personally identifiable information [PII], protected health information [PHI], data residency) (Chapter 8)
Validating data quality (for example, by using AWS Glue DataBrew and AWS Glue Data Quality) (Chapter 3)
Identifying and mitigating sources of bias in data (for example, selection bias, measurement bias) by using AWS tools (for example, Amazon SageMaker Clarify) (Chapter 3)
Preparing data to reduce prediction bias (for example, by using dataset splitting, shuffling, and augmentation) (Chapter 3)
Configuring data to load into the model training resource (for example, Amazon EFS, Amazon FSx) (Chapters 2, 3)
Knowledge of:
Capabilities and appropriate uses of ML algorithms to solve business problems (Chapters 1, 4)
How to use AWS AI (for example, Amazon Translate, Amazon Transcribe, Amazon Rekognition, Amazon Bedrock) to solve specific business problems (Chapters 1, 4, 6)
How to consider interpretability during model selection or algorithm selection (Chapters 4, 6)
Amazon SageMaker built-in algorithms and when to apply them (Chapter 4)
Knowledge of:
Elements in the training process (for example, epoch, steps, batch size) (Chapter 5)
Methods to reduce model training time (for example, early stopping, distributed training) (Chapter 5)
Factors that influence model size (Chapter 5)
Methods to improve model performance (Chapter 5)
Benefits of regularization techniques (for example, dropout, weight decay, L1 and L2) (Chapter 5)
Hyperparameter tuning techniques (for example, random search, Bayesian optimization) (Chapter 5)
Model hyperparameters and their effects on model performance (for example, number of trees in a tree-based model, number of layers in a neural network) (Chapters 1, 4, 5)
Methods to integrate models that were built outside Amazon SageMaker into Amazon SageMaker (Chapters 5, 6)
Knowledge of:
Model evaluation techniques and metrics (for example, confusion matrix, heat maps, F1 score, accuracy, precision, recall, root mean square error [RMSE], receiver operating characteristic [ROC], area under the ROC curve [AUC]) (Chapter 5)
Methods to create performance baselines (Chapter 5)
Methods to identify model overfitting and underfitting (Chapter 5)
Metrics available in Amazon SageMaker Clarify to gain insights into ML training data and models (Chapter 5)
Convergence issues (Chapter 5)
Knowledge of:
Deployment best practices (for example, versioning, rollback strategies) (Chapter 6)
AWS deployment services (for example, Amazon SageMaker AI endpoints) (Chapter 6)
Methods to serve ML models in real time and in batches (Chapter 6)
How to provision compute resources in production environments and test environments (for example, CPU, GPU) (Chapter 6)
Model and endpoint requirements for deployment endpoints (for example, serverless endpoints, real-time endpoints, asynchronous endpoints, batch inference) (Chapter 6)
Knowledge of:
How to choose appropriate containers (for example, provided or customized) (Chapter 6)
Methods to optimize models on edge devices (for example, Amazon SageMaker Neo) (Chapter 6)
Knowledge of:
Difference between on-demand and provisioned resources (Chapter 6)
How to compare scaling policies (Chapter 6)
Tradeoffs and use cases of infrastructure as code (IaC) options (for example, AWS CloudFormation, AWS Cloud Development Kit [AWS CDK]) (Chapter 6)
Containerization concepts and AWS container services (Chapter 6)
How to use Amazon SageMaker endpoint auto scaling policies to meet scalability requirements (for example, based on demand, time) (Chapter 6)
Knowledge of:
Capabilities and quotas for AWS CodePipeline, AWS CodeBuild, and AWS CodeDeploy (Chapter 6)
Automation and integration of data ingestion with orchestration services (Chapter 6)
Version control systems and basic usage (for example, Git) (Chapter 6)
CI/CD principles and how they fit into ML workflows (Chapters 6, 7)
Deployment strategies and rollback actions (for example, blue/green, canary, linear) (Chapter 6)
How code repositories and pipelines work together (Chapter 6)
Knowledge of:
Drift in ML models (Chapter 7)
Techniques to monitor data quality and model performance (Chapter 7)
Design principles for ML lenses relevant to monitoring (Chapter 7)
Knowledge of:
Key performance metrics for ML infrastructure (for example, utilization, throughput, availability, scalability, fault tolerance) (Chapters 6, 7)
Monitoring and observability tools to troubleshoot latency and performance issues (for example, AWS X-Ray, Amazon CloudWatch Lambda Insights, Amazon CloudWatch Logs Insights) (Chapters 7, 8)
How to use AWS CloudTrail to log, monitor, and invoke retraining activities (Chapters 7, 8)
Differences between instance types and how they affect performance (for example, memory optimized, compute optimized, general purpose, inference optimized) (Chapters 6, 7)
Knowledge of:
Capabilities of cost analysis tools (for example, AWS Cost Explorer, AWS Billing and Cost Management, AWS Trusted Advisor) (Chapter 7)
Cost tracking and allocation techniques (for example, resource tagging) (Chapter 7)
Knowledge of:
IAM roles, policies, and groups that control access to AWS services (for example, AWS Identity and Access Management [IAM], bucket policies, Amazon SageMaker Role Manager) (Chapter 8)
Amazon SageMaker security and compliance features (Chapter 8)
Controls for network access to ML resources (Chapter 8)
Security best practices for CI/CD pipelines (Chapter 8)
1. When configuring Amazon S3 for data storage, what best practice should be followed to ensure efficient cost management and data retrieval?
A. Store all data in the S3 Standard storage class.
B. Utilize versioning for all objects.
C. Implement lifecycle policies to transition data to appropriate storage classes.
D. Enable cross-region replication for all buckets.
2. For high-throughput data ingestion into Amazon S3, which feature can be leveraged to manage large-scale file transfers efficiently?
A. Amazon S3 Multipart Upload
B. Amazon S3 Access Points
C. Amazon S3 Transfer Acceleration
D. Amazon S3 Batch Operations
3. What is a key advantage of using Amazon FSx for Lustre over Amazon S3 for machine learning workloads requiring high-speed processing?
A. Better compatibility with Hadoop
B. Lower cost for large datasets
C. Support for distributed file systems with high throughput
D. Seamless integration with Amazon Glacier
4. In the context of feature engineering, what is the primary goal of applying Principal Component Analysis (PCA) to a dataset?
A. Increase the dimensionality of data
B. Extract uncorrelated features for better model performance
C. Normalize the distribution of features
D. Enhance the interpretability of the dataset
5. When dealing with categorical variables in feature engineering, which method can be used to effectively capture the ordinal relationship between categories?
A. One-hot encoding
B. Binary encoding
C. Ordinal encoding
D. Frequency encoding
6. For a problem requiring prediction of time-series data, which machine learning algorithm is most suitable?
A. K-nearest neighbors
B. DeepAR
C. Support vector machines
D. Random forests
7. Which ensemble method combines the predictions of multiple weak learners to improve model performance and robustness?
A. Decision trees
B. Neural networks
C. XGBoost
D. K-means clustering
8. In the context of model development, what is the purpose of using a regularization technique such as L1 or L2 regularization?
A. To improve the accuracy of the training dataset
B. To simplify the model by penalizing large coefficients
C. To enhance data visualization
D. To increase the learning rate of the model
9. Which optimization algorithm is commonly used to minimize the loss function during the training of deep learning models?
A. Gradient descent
B. Newton’s method
C. Genetic algorithm
D. Simulated annealing
10. When evaluating a binary classification model, which metric should be used to determine the balance between precision and recall?
A. Accuracy
B. F1 score
C. ROC-AUC
D. Mean squared error
11. How can cross-validation help in assessing the generalization capability of a machine learning model?
A. By splitting the dataset into train and test sets multiple times
B. By using the entire dataset for training
C. By creating synthetic data points for evaluation
D. By reducing the dimensionality of features
12. What is the key benefit of using Amazon SageMaker AI for model deployment and orchestration?
A. Automatic hyperparameter tuning
B. Real-time model monitoring
C. Seamless integration with Amazon SageMaker Pipelines
D. Built-in data visualization tools
13. In the context of deploying machine learning models, what is the purpose of using AWS Step Functions?
A. To perform ETL operations on data
B. To create and manage complex ML workflows with state transitions
C. To monitor model performance in real time
D. To deploy models on edge devices
14. What is the advantage of using Amazon SageMaker Model Monitor for deployed models?
A. Automatic scaling of model endpoints
B. Continuous monitoring of model quality and data drift
C. Real-time training of models
D. Deployment of models across multiple regions
15. How can the detection of data drift help maintain model performance?
A. By retraining the model on the same dataset
B. By identifying changes in data distribution that affect model predictions
C. By increasing the model’s learning rate
D. By reducing the number of features used in the model
16. Which AWS services can help secure machine learning models by enforcing access controls and encryption?
A. Amazon VPC and Amazon CloudWatch
B. AWS IAM and AWS Key Management Service (KMS)
C. AWS CloudTrail and Amazon GuardDuty
D. AWS IAM and AWS Config
17. When deploying machine learning models, what is a recommended best practice to prevent unauthorized access to sensitive data?
A. Using private S3 buckets for model storage
B. Storing model credentials in AWS Secrets Manager
C. Enabling encryption at rest and in transit
D. Allowing restricted network access to model endpoints
18. In feature engineering, which method helps in transforming skewed data distributions into a more Gaussian-like distribution?
A. Min-max scaling
B. Log transformation
C. Label encoding
D. One-hot encoding
19. What is the purpose of using dropout in training deep learning models?
A. To improve the accuracy of the model
B. To prevent overfitting by randomly dropping neurons
C. To increase the learning rate
D. To simplify the model architecture
20. Which Amazon SageMaker AI service helps detect bias in machine learning models during post-deployment monitoring?
A. Amazon SageMaker Model Monitor
B. Amazon SageMaker Data Wrangler
C. Amazon SageMaker Clarify
D. Amazon SageMaker Neo
21. You are tasked with deploying a machine learning model that requires GPU acceleration for inference to achieve optimal performance. Which AWS compute instance type would you select for this purpose?
A. T2.micro
B. C5.large
C. P3.2xlarge
D. R5.xlarge
22. You need to deploy multiple versions of a machine learning model to evaluate their performance in a live environment. Which AWS service enables you to deploy and manage these multiple model versions effectively?
A. Amazon SageMaker Model Registry
B. AWS Glue Data Catalog
C. Amazon SageMaker Multi-Model Endpoints
D. Amazon SageMaker Ground Truth
23. A retail company needs to process customer images in real time for personalized shopping experiences using a deep learning model. Which combination of AWS services and configurations would you use to achieve high performance and scalable inference?
A. Amazon SageMaker with Multi-Model Endpoints and Elastic Load Balancing
B. Amazon EC2 with GPU instances and AWS Auto Scaling
C. AWS Lambda with Amazon S3 and API Gateway
D. Amazon SageMaker with Endpoint Variants and Auto Scaling
24. A healthcare company needs to ensure that its deployed machine learning models comply with regulatory standards and maintain high accuracy over time. Which feature of Amazon SageMaker Model Monitor would you leverage to track data quality and model performance, and what key metrics would you monitor to ensure compliance?
A. Baseline constraints and statistics; monitor data distribution and prediction accuracy
B. Model registry; track model versions and updates
C. Hyperparameter tuning; optimize model hyperparameters
D. Feature store; manage and store feature data
25. An organization is looking to simplify the process of managing IAM roles for various Amazon SageMaker users and workloads. It needs to ensure that roles are correctly configured with the necessary permissions while maintaining security best practices. Which feature of Amazon SageMaker Role Manager would you use to achieve this?
A. Automatic role creation
B. Role templates
C. Role auditing
D. Role inheritance
1. C. Lifecycle policies in Amazon S3 help manage storage costs by automatically transitioning data to lower-cost storage classes as it becomes less frequently accessed.
2. A. Amazon S3 Multipart Upload allows for efficient, parallel upload of large files by splitting them into smaller parts, which can be uploaded independently and reassembled.
3. C. Amazon FSx for Lustre is designed for high-performance workloads and provides a distributed file system with high throughput and low latency, making it ideal for data-intensive ML tasks.
4. B. PCA reduces the dimensionality of the data by transforming it into a set of uncorrelated principal components, improving model performance by eliminating redundant features.
5. C. Ordinal encoding assigns numerical values to categorical variables while preserving the order of the categories, which is important for algorithms that can leverage this relationship.
6. B. DeepAR is specifically designed to handle sequential data and can capture temporal dependencies, making it suitable for time series prediction.
7. C. XGBoost combines the predictions of multiple weak learners, typically decision trees, to create a strong predictive model by sequentially training each new learner to correct the errors made by the previous ones, effectively “boosting” the overall accuracy through an iterative process of error correction.
8. B. Regularization techniques add a penalty for large coefficients, encouraging simpler models that generalize better to new data and reducing the risk of overfitting.
9. A. Gradient descent is a widely used optimization algorithm in deep learning that iteratively adjusts model parameters to minimize the loss function.
10. B. The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both aspects, particularly useful for imbalanced datasets.
11. A. Cross-validation involves partitioning the data into multiple train and test splits, providing a more reliable estimate of the model’s generalization performance.
12. C. Amazon SageMaker Pipelines allows for the creation and management of end-to-end machine learning workflows, facilitating efficient model deployment and orchestration.
13. B. AWS Step Functions enable the orchestration of complex workflows with state transitions, ensuring that each step of the ML process is executed in the correct order.
14. B. Amazon SageMaker Model Monitor continuously tracks model performance and data drift, alerting users to issues that could impact model accuracy and reliability.
15. B. Detecting data drift involves monitoring shifts in data distribution, which can help identify when the model may need retraining to maintain performance.
16. B. AWS IAM enables access control by securely managing identities and access to AWS services and resources. AWS KMS provides encryption keys and manages their lifecycle, ensuring data and model security through encryption at rest and in transit.
17. C. Encrypting data at rest and in transit ensures that sensitive data remains secure, even if accessed without proper authorization.
18. B. Log transformation reduces skewness in data, making distributions more Gaussian-like, which can improve model performance and interpretation.
19. B. Dropout is a technique that prevents overfitting in neural networks. It works by randomly deactivating a percentage of neurons during training. This forces the remaining neurons to compensate, which prevents any one neuron from becoming too dependent on others.
20. C. Amazon SageMaker Clarify detects and measures bias in machine learning models, both before and after deployment, helping to ensure fair and unbiased predictions.
21. C. P3.2xlarge provides GPU acceleration, which is essential for models requiring high computational power for inference, ensuring optimal performance.
22. C. Amazon SageMaker Multi-Model Endpoints allows you to deploy multiple models on a single endpoint, effectively managing different model versions and improving resource utilization, leading to cost savings and simplified deployment management.
23. D. Amazon SageMaker with Endpoint Variants and Auto Scaling allows you to deploy multiple versions of a model with Auto Scaling to handle different traffic loads, providing high performance and scalable inference for real-time image processing.
24. A. Baseline constraints and statistics help ensure that the input data remains consistent with the training data and the model’s predictions continue to meet the required accuracy and compliance standards.
25. B. Role templates simplify the assignment of appropriate permissions based on common Amazon SageMaker tasks, ensuring that users have the necessary access while adhering to security best practices and reducing the complexity of role management.
Domain 2: ML Model Development
2.1 Choose a modeling approach
2.2 Train and refine models
Machine learning (ML) has become ubiquitous in our digital world. Whether you book a flight, visit your doctor, make an online purchase, pay a bill, or check the weather forecast, behind the scenes each of these actions starts (or is part of) a process that collects large amounts of data, processes it, and performs some ML task.
ML is a branch of artificial intelligence (AI) that enables systems to learn and improve from experience without being explicitly programmed.1 By analyzing large datasets, ML algorithms can identify patterns, make decisions, and predict outcomes.
The integration of ML into various domains has revolutionized industries by enhancing efficiency, accuracy, and decision-making capabilities.
This chapter will provide the ML foundations you need to know for the exam. To better understand ML, we need to set the context where ML originated, which is AI.
AI is a branch of computer science whose main focus is to develop systems capable of performing tasks that typically require human intelligence. These tasks include (but are not limited to) recognizing speech, making decisions, solving problems, identifying patterns, and understanding languages. AI systems leverage techniques such as ML, natural language processing (NLP), deep learning, and computer vision to simulate cognitive functions like learning, reasoning, and self-correction. The field of AI is rapidly evolving, with applications spanning various industries, from healthcare and finance to autonomous vehicles and smart cities. Understanding AI involves exploring its fundamental concepts, its historical development, and the ethical implications of its widespread adoption.
AI systems ingest data, such as human-level knowledge, and emulate natural intelligence. ML is a subset of AI, where data and algorithms continuously improve the training model to help achieve higher-quality output predictions. Deep learning is a subset of ML. It is an approach to realizing ML that relies on a layered architecture, simulating the human brain to identify data patterns and train the model.
Figure 1.1 illustrates this hierarchy. It all starts with data and how data can be used to extract relevant information, which ultimately produces knowledge.
FIGURE 1.1 Deep learning, machine learning, and AI.
Data, information, and knowledge form the foundation of understanding and applying AI. Data is the raw, unprocessed facts and figures collected from various sources. When data is organized and processed, it transforms into information, which provides context and meaning. Knowledge is derived from the synthesis of information, enabling comprehension and informed decision-making.
In AI, data serves as the essential input that fuels ML algorithms, enabling the training of models to recognize patterns and make predictions. Information, structured from the data, helps refine these models by offering insights and context. Ultimately, knowledge empowers AI systems to emulate human-like reasoning, enhancing their ability to perform complex tasks, adapt to new situations, and provide valuable solutions across various domains.
Data is defined as a value or set of values representing a specific concept or concepts. Data is the most critical asset of the digital age, fueling advancements in technology and shaping the way we understand and interact with the world. To harness its full potential, it’s essential to recognize the different classes of data: structured, semi-structured, and unstructured. Each class has unique characteristics and applications, particularly in the context of AI.
Understanding these classes of data and their peculiarities is crucial for leveraging AI’s full potential. Structured data provides the foundation for organized analysis, semi-structured data offers flexibility in handling complex information, and unstructured data unlocks a wealth of untapped insights. Together, they enable AI systems to learn, adapt, and deliver transformative solutions across various domains. Let’s review each class in more detail.
Structured data is highly organized and easily searchable in databases. This type of data adheres to a fixed schema, meaning it follows a predefined format with specific fields and records. For example, a spreadsheet containing customer information, such as names, addresses, and purchase histories, is structured data. It allows for efficient querying and analysis, making it invaluable for business intelligence and data-driven decision-making.
These are examples of structured data:
Relational databases (e.g., SQL databases)
Spreadsheets (e.g., Excel files)
Financial records (e.g., transaction logs)
Semi-structured data lacks the rigid structure of structured data but still contains some organizational elements, such as tags or markers, that provide context and hierarchy. This class of data is often used for storing and transmitting complex information that doesn’t fit neatly into a table format. Semi-structured data is more flexible than structured data, allowing for greater adaptability in handling diverse data types.
These are examples of semi-structured data:
JavaScript Object Notation (JSON) files
Extensible Markup Language (XML) documents
Email messages (headers, body text, attachments)
Unstructured data is the most common and diverse type of data, encompassing information that doesn’t follow a predefined format or schema. This class of data is often rich in content but challenging to analyze and search. Unstructured data requires advanced AI techniques, such as NLP and computer vision, to extract meaningful insights.
These are examples of unstructured data:
Text files (e.g., Word documents)
Multimedia files (e.g., images, videos)
Social media content (e.g., tweets, posts)
Information bridges the gap between raw data and actionable knowledge. Derived from data, information provides a semantic element in the form of context and meaning, resulting in a transformation of disparate facts and figures into coherent, useful insights.
Information can be understood as data that has been processed, organized, or structured in a way that adds context and relevance. Unlike raw data, which is often unorganized and lacks inherent meaning, information is data presented in a format that is understandable and useful to its recipient(s). For example, a list of temperatures recorded at various times of the day is mere data; when these temperatures are organized into a table showing daily weather patterns, they become information.
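As a minimal illustration of this transformation (the pandas library and the sample readings here are ours, not the book’s):

# A minimal sketch: turning raw temperature readings (data) into a daily summary (information)
import pandas as pd

# Raw, unorganized facts: timestamped temperature readings
readings = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-01-01 06:00", "2025-01-01 14:00",
                                 "2025-01-02 06:00", "2025-01-02 14:00"]),
    "temp_c": [3.1, 9.4, 2.7, 8.8],
})

# Organizing the readings by day adds context: a daily weather pattern emerges
daily = readings.set_index("timestamp")["temp_c"].resample("D").agg(["min", "max", "mean"])
print(daily)

The raw readings are data; the per-day minimum, maximum, and mean constitute information, because the same facts are now organized in a way that reveals a pattern.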
The key characteristics of information are accuracy, relevance, completeness, and timeliness:
Accuracy
Information must be precise and free from errors to be valuable. Inaccurate information can lead to misguided decisions and outcomes.
Relevance
Information should be pertinent to the context or problem at hand. Irrelevant information, no matter how accurate, serves little purpose.
Completeness
Information must be comprehensive enough to provide a clear understanding without ambiguity. Incomplete information can result in incorrect conclusions.
Timeliness
Information is most useful when it is available at the right time. Outdated information can be as detrimental as inaccurate information.
In AI, information plays a critical role in training and refining algorithms. AI systems rely on vast amounts of data to learn and make predictions. This data is processed into information that adds context and aids in pattern recognition and decision-making.
For instance, in NLP, raw text data is processed to extract meaningful information, such as sentiment analysis or language translation. Similarly, in computer vision, images are analyzed to identify objects and patterns, turning visual data into actionable information.
Knowledge represents a higher level of understanding that goes beyond mere data and information. It encompasses the insights, experiences, and contextual understanding that enable individuals and systems to make informed decisions, solve problems, and innovate.
Knowledge is derived from the synthesis and application of information. It involves recognizing patterns, understanding relationships, and drawing conclusions based on experience and context. Unlike data, which is raw and unprocessed, or information, which is organized and meaningful, knowledge embodies a deeper comprehension that guides action and thought.
The key characteristics of knowledge are contextuality, applicability, basis in experience, and dynamism:
Contextuality
Knowledge is deeply rooted in context. It involves understanding not just the facts but also the circumstances and nuances that surround them.
Applicability
Knowledge is practical. It involves the ability to apply information to real-world situations, making it actionable and relevant.
Experience-based
Knowledge is often gained through experience. It encompasses lessons learned, insights gained, and the wisdom accumulated over time.
Dynamism
Knowledge is ever-evolving. As new information becomes available and experiences accumulate, knowledge grows and adapts.
In AI, knowledge is crucial for developing systems that can derive inferences, learn, and adapt. AI systems rely on vast amounts of data and information to build knowledge bases that enable them to perform complex tasks and make intelligent decisions.
For example, expert systems in AI were designed to emulate human expertise in specific domains. These systems utilized knowledge bases that contained rules, facts, and relationships, allowing them to provide recommendations and solve problems. Similarly, ML algorithms use data and information to build models that capture knowledge about patterns and trends, enabling predictive analytics and decision-making.
ML is a subset of AI that allows computers to learn from existing information, without being explicitly programmed, and apply that learning to perform other similar tasks.
Without explicit programming, the machine learns from the data and information it is fed. The machine picks up patterns, trends, or essential features from previous data and makes predictions on new data. This aspect of ML, i.e., a machine’s ability to learn without being explicitly programmed, is important and emphasizes the key difference between ML and classical programming.
Let’s say you are tasked by your manager to build a program that translates text from English to Italian. At a very high level, you could pursue a classical programming approach by manually coding rules and exceptions for lexicon, grammar, syntax, and vocabulary. This approach would be extremely complex and not scalable. Instead, with ML, you could leverage models that excel at understanding context, idioms, and nuances in both languages, which are essential for accurate translation. These ML models can continuously improve their translation capabilities with more data, adapting to new expressions, idioms, and evolving language usage.
A real-life ML application is in recommendation systems. For example, Amazon uses ML to recommend books to users based on their selected book categories and purchase history. Likewise, Spotify uses ML to recommend songs to users based on previously played songs and the genres they like. In both cases, Amazon and Spotify use large amounts of data to train their ML models and derive meaningful inferences in the form of recommendations.
In the upcoming sections, the fundamental concepts of ML will be introduced. These concepts will form the basis you need to master each exam objective. Let’s start with the ML lifecycle.
First, you must have a holistic view of the entire ML lifecycle. Because ML is deeply grounded in science, it all starts with questions like “What is the business problem we are trying to solve?” and, most importantly, “How does machine learning address this business need?”
Figure 1.2 illustrates the ML lifecycle. These steps are generally followed in all ML projects, regardless of the cloud provider or the tools in use. Let’s review each step.
FIGURE 1.2 Machine learning lifecycle.
Defining an ML problem is the critical first step in the development of any ML project. It involves understanding and clearly articulating the specific business challenge or question that needs to be addressed using data and ML techniques. This phase includes identifying the problem’s scope, the desired outcomes, and the feasibility of applying ML solutions. Crucially, it requires a detailed analysis of the available data, understanding the business context, and determining the performance metrics that will be used to evaluate success. By establishing a clear and well-defined problem statement, you can ensure that your efforts are focused and aligned with strategic goals, leading to more effective and impactful ML solutions.
What sets an ML problem apart from other problems in computer science/engineering is its reliance on data-driven approaches and statistical models to make predictions or decisions. Traditional programming relies on explicit instructions and algorithms to solve problems, whereas ML leverages patterns and relationships within the data to infer solutions. This shift from deterministic algorithms to probabilistic models means that ML problems often involve uncertainty and require continuous learning and adaptation from new data, making them uniquely dynamic and challenging. This aspect of continuous learning and adaptation from new data is reflected in the cycle represented in Figure 1.2.