AWS Certified Machine Learning Engineer Study Guide - Dario Cabianca - E-Book

Description

Prepare for the AWS Machine Learning Engineer exam smarter and faster and get job-ready with this efficient and authoritative resource

In AWS Certified Machine Learning Engineer Study Guide: Associate (MLA-C01) Exam, veteran AWS Practice Director at Trace3—a leading IT consultancy offering AI, data, cloud and cybersecurity solutions for clients across industries—Dario Cabianca delivers a practical and up-to-date roadmap to preparing for the MLA-C01 exam. You'll learn the skills you need to succeed on the exam as well as those you need to hit the ground running at your first AI-related tech job.

You'll learn how to prepare data for machine learning models on Amazon Web Services; build, train, and refine models; evaluate model performance; and deploy and secure your machine learning applications against bad actors.

Inside the book:

  • Complimentary access to the Sybex online test bank, which includes an assessment test, chapter review questions, practice exam, flashcards, and a searchable key term glossary
  • Strategies for selecting and justifying an appropriate machine learning approach for specific business problems and identifying the most efficient AWS solutions for those problems
  • Practical techniques you can implement immediately in an artificial intelligence and machine learning (AI/ML) development or data science role

Perfect for everyone preparing for the AWS Certified Machine Learning Engineer – Associate exam, AWS Certified Machine Learning Engineer Study Guide is also an invaluable resource for those preparing for their first role in AI or data science, as well as junior-level practicing professionals seeking to review the fundamentals with a convenient desk reference.

Page count: 685

Publication year: 2025




Table of Contents

Cover

Table of Contents

Title Page

Copyright

Dedication

Acknowledgments

About the Author

About the Technical Editor

Introduction

Chapter 1: Introduction to Machine Learning

Understanding Artificial Intelligence

Understanding Machine Learning

Understanding Deep Learning

Summary

Exam Essentials

Review Questions

Chapter 2: Data Ingestion and Storage

Introducing Ingestion and Storage

Ingesting and Storing Data

Summary

Exam Essentials

Review Questions

Chapter 3: Data Transformation and Feature Engineering

Introduction

Understanding Feature Engineering

Data Cleaning and Transformation

Feature Engineering Techniques

Data Labeling

Managing Class Imbalance

Data Splitting

Summary

Exam Essentials

Review Questions

Chapter 4: Model Selection

Understanding AWS AI Services

Developing Models with Amazon SageMaker Built-in Algorithms

Criteria for Model Selection

Summary

Exam Essentials

Review Questions

Chapter 5: Model Training and Evaluation

Training

Hyperparameter Tuning

Model Performance Evaluation

Deep-Dive Model Tuning Example

Summary

Exam Essentials

Review Questions

Chapter 6: Model Deployment and Orchestration

AWS Model Deployment Services

Advanced Model Deployment Techniques

Orchestrating ML Workflows

Deep Dive Model Deployment Example

Summary

Exam Essentials

Review Questions

Chapter 7: Model Monitoring and Cost Optimization

Monitoring Model Inference

Monitoring Infrastructure and Cost

Summary

Exam Essentials

Review Questions

Chapter 8: Model Security

Security Design Principles

Securing AWS Services

Summary

Exam Essentials

Review Questions

Appendix A: Answers to the Review Questions

Chapter 1: Introduction to Machine Learning

Chapter 2: Data Ingestion and Storage

Chapter 3: Data Transformation and Feature Engineering

Chapter 4: Model Selection

Chapter 5: Model Training and Evaluation

Chapter 6: Model Deployment and Orchestration

Chapter 7: Model Monitoring and Cost Optimization

Chapter 8: Model Security

Appendix B: Mathematics Essentials

Linear Algebra

Statistics

Probability Theory

Calculus

Index

End User License Agreement

List of Tables

Chapter 2

TABLE 2.1 Data format support for built-in ML algorithms in Amazon SageMaker.

TABLE 2.2 Example of a data access pattern.

TABLE 2.3 AWS services for structured, semi-structured, and unstructured data.

TABLE 2.4 Amazon S3 storage classes.

Chapter 3

TABLE 3.1 Examples of pre-training bias metrics.

Chapter 5

TABLE 5.1 Regularization methods comparison.

Chapter 6

TABLE 6.1 Inference and Training Comparison.

TABLE 6.2 Inference-Based EC2 Instance Types.

Chapter 7

TABLE 7.1 AWS Pricing Models for Compute Infrastructure.

Chapter 8

TABLE 8.1 ML Predefined Permissions.



AWS® Certified Machine Learning Engineer Study Guide

Associate (MLA-C01) Exam

Dario Cabianca

Copyright © 2025 by Dario Cabianca. All rights reserved, including rights for text and data mining and training of artificial intelligence technologies or similar technologies.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

The manufacturer’s authorized representative according to the EU General Product Safety Regulation is Wiley-VCH GmbH, Boschstr. 12, 69469 Weinheim, Germany, e-mail: [email protected].

Trademarks: Wiley and the Wiley logo, and the Sybex logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. AWS is a registered trademark of Amazon Technologies, Inc. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Control Number: 2025908052

Paperback ISBN: 9781394319954

ePDF ISBN: 9781394319992

ePub ISBN: 9781394319978

Cover Design: Wiley

Cover Image: © Jeremy Woodhouse/Getty Images

To my family.

Acknowledgments

Creating the AWS Certified Machine Learning Engineer Study Guide: Associate (MLA-C01) Exam has been an extraordinary journey, and I would like to take this opportunity to express my gratitude to those who made this book possible.

In addition to my family, I would like to thank Kenyon Brown, senior acquisition editor at Wiley, who helped get the book started. I am also grateful to Christine O’Connor, managing editor at Wiley, and to Dabian Witherspoon, project manager, for his detailed coordination of all tasks as the book progressed through its stages.

Justin Roberts, the technical editor, did a phenomenal job validating the content I authored and testing the code I created. I am also grateful to Kim Wimpsett, the copyeditor, and Dhilip Kumar Rajendran, the content refinement specialist, for their accurate and comprehensive review of the copyedits and the proofs.

Finally, I would like to thank the late Professor Giovanni Degli Antoni, my doctoral thesis advisor, who always inspired and motivated me to pursue my scientific curiosity during my years at the University of Milan, whose Department of Computer Science has been named in his honor.

About the Author

Dario Cabianca is a computer scientist (PhD, University of Milan), author, and AWS practice director at Trace3, which is a leading IT consultancy offering AI, data, cloud and cybersecurity solutions for clients across industries. At Trace3, Dario oversees the practice nationally, serving customers, building partnerships, and evangelizing Trace3’s portfolio of AWS competencies and services. He has worked with a variety of global consulting firms and enterprises for more than two decades and has earned 10 cloud certifications with AWS, Google Cloud, Microsoft Azure, and ISC2.

About the Technical Editor

Justin Roberts is a solutions architect at Amazon Web Services (AWS), where he advises strategic customers on designing and running complex large-scale systems on AWS. Justin has worked for several enterprises over almost two decades across numerous disciplines. He currently holds multiple industry certifications, including 14 AWS certifications, and is a member of the exclusive AWS “Golden Jacket” club for holding all active AWS certifications at once. Justin has a BS from Eastern Kentucky University and an MBA from Bellarmine University.

Introduction

The demand for machine learning (ML) engineers has significantly increased, particularly since 2023, when the introduction of ChatGPT revolutionized the artificial intelligence (AI) landscape. The field has seen substantial interest and investment, as organizations across various sectors recognize the transformative potential of AI. As ML and AI become progressively more sophisticated, the need for skilled professionals to develop, implement, and maintain these systems has never been greater. To meet this demand, the new AWS Certified Machine Learning Engineer – Associate certification was developed to equip aspiring engineers with the knowledge and skills necessary to excel in this dynamic field.

The AWS Certified Machine Learning Engineer – Associate certification is a testament to the proficiency and expertise required to navigate this ever-evolving field. This certification not only validates an individual’s technical skills but also underscores their ability to leverage AWS’s extensive suite of ML and AI services to drive innovation. As this technology continues to mature, certified professionals are well-positioned to lead the charge in developing cutting-edge AI solutions.

This study guide adopts a methodical approach by walking you step-by-step through all the phases of the ML lifecycle. The exposition of each topic offers a combination of theoretical knowledge, practical exercises with tested code in Python, and necessary diagrams and plots to visually represent ML models and AI in action.

Throughout this study guide, we will delve into the fascinating world of Amazon SageMaker AI (formerly known as Amazon SageMaker) and Amazon Bedrock, exploring their numerous features and functionalities. We will cover the core concepts and practical applications, providing you with the knowledge and tools needed to excel as an AWS machine learning engineer. Whether you are just starting your journey or looking to deepen your expertise, this guide will serve as a comprehensive resource for mastering these platforms and achieving certification.

By obtaining the AWS Certified Machine Learning Engineer – Associate certification, you are not just enhancing your skillset but also contributing to the forefront of technological innovation. Let this study guide be your roadmap to success in this rapidly expanding field.

The AWS Certified Machine Learning Engineer – Associate Exam

The AWS Certified Machine Learning Engineer – Associate Exam is intended to validate the technical skills required to design, build, and operationalize well-architected ML workloads on AWS. The exam covers a wide range of topics, including data preparation, feature engineering, model training, model evaluation, and deployment strategies.

The exam consists of 65 questions and has a duration of 130 minutes. It is available in multiple languages, including English, Japanese, Korean, and Simplified Chinese. The exam costs $150 and can be taken at a Pearson VUE testing center or online as a proctored exam. This certification is valid for 3 years.

Your exam results are presented as a scaled score ranging from 100 to 1,000. To pass, a minimum score of 720 is required. This score reflects your overall performance on the exam and indicates whether you have successfully passed.

The official exam guide is available at https://d1.awsstatic.com/training-and-certification/docs-machine-learning-engineer-associate/AWS-Certified-Machine-Learning-Engineer-Associate_Exam-Guide.pdf.

During the writing of this book, “Amazon SageMaker” was renamed “Amazon SageMaker AI.” As a result, the first chapters of this book still use the former name, because at that time this was the correct name in use. In this book, the terms “Amazon SageMaker” and “Amazon SageMaker AI” are used interchangeably to denote the new AWS unified platform for data, analytics, ML, and AI. See https://aws.amazon.com/blogs/aws/introducing-the-next-generation-of-amazon-sagemaker-the-center-for-all-your-data-analytics-and-ai.

Why Become AWS Machine Learning Engineer Certified?

The increasing demand for AWS ML and AI engineers—due to the rapid adoption of ML and AI technologies across industries—has made this a perfect time to pursue the AWS Certified Machine Learning Engineer – Associate certification. Companies are looking for skilled professionals who can harness the power of AWS to build, deploy, and manage ML models efficiently. By earning this certification, you can demonstrate your proficiency in using AWS tools and services to drive impactful ML and AI solutions. This certification not only validates your technical skills but also sets you apart in a competitive job market, making you a valuable asset to potential employers.

One of the key reasons to pursue this certification is the comprehensive knowledge you’ll gain about AWS’s cutting-edge ML and AI services. While preparing for the exam, you’ll master the use of Amazon SageMaker AI, a powerful platform for building, training, deploying, and monitoring ML models at scale. You’ll also explore the latest additions to Amazon SageMaker AI, which continuously evolves to bring together a broad set of AWS ML, AI, and data analytics services. As a result, you’ll become proficient in using Amazon Bedrock, a service that simplifies the deployment of foundation models by offering pretrained models from leading AI companies. However, because Amazon Bedrock is relatively new, in-depth material on it is scarce, making this certification even more valuable as it positions you at the forefront of emerging AI technologies.

Amazon SageMaker AI and Amazon Bedrock are designed for seamless integration with numerous AWS services that are required during the phases of the ML lifecycle. Therefore, the study continues with extensive coverage of such services. These include storage services (e.g., Amazon S3, Amazon Elastic File System [EFS], Amazon FSx for Lustre, and others), ingestion services (e.g., Amazon Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka [MSK], and others), deployment services (e.g., Amazon Elastic Compute Cloud [EC2], Amazon Elastic Container Service [ECS], and others), orchestration services (e.g., AWS Step Functions, Amazon Managed Workflows for Apache Airflow [MWAA], and others), monitoring, cost optimization, and security services, just to name a few.

Another significant advantage of becoming AWS Machine Learning Engineer certified is the access to exclusive resources and a supportive community of professionals. By joining the certified AWS community, you’ll have the opportunity to network with other professionals, share knowledge, and stay updated on the latest trends and advancements in the field. This certification not only boosts your career prospects, but also keeps you engaged in a dynamic and constantly evolving industry.

How to Become AWS Machine Learning Engineer Certified

Your journey to become AWS Machine Learning Engineer Certified begins with a structured approach that covers foundational knowledge, hands-on practice, and thorough exam preparation. This study guide is crafted to mirror that journey.

Foundational knowledge

 Start by building a robust understanding of ML concepts, learning how to formulate ML problems, and studying common algorithms and statistical methods. It’s also important to grasp the basics of linear algebra, calculus, probability, and statistics, as they form the mathematical foundation for ML. Additionally, familiarize yourself with AWS services, particularly Amazon SageMaker AI, which provides tools and features for every phase of the ML lifecycle. Learning Python, the primary programming language used in ML, is also essential.

Hands-on practice

 Engage in practical experience through AWS resources like tutorials, labs, and workshops. Focus on using Amazon SageMaker AI for various phases of the ML lifecycle, including

Data preparation

 Use Amazon SageMaker Data Wrangler to simplify data preparation and feature engineering.

Model building

 Leverage Amazon SageMaker Studio for an integrated development environment that supports building, training, and debugging ML models.

Model training

 Utilize Amazon SageMaker Training to efficiently train models with built-in algorithms or your own custom code.

Model deployment

 Use Amazon SageMaker Endpoint to deploy trained models for real-time predictions, and Amazon SageMaker Batch Transform for batch predictions.

Model monitoring

 Employ Amazon SageMaker Model Monitor to continuously monitor the performance of deployed models and ensure that they remain accurate over time.

By working on real-world projects that cover the entire ML lifecycle, you’ll gain hands-on experience and deepen your understanding.

Exam preparation

 Use AWS’s official exam guide to understand key objectives. Utilize practice exams and sample questions to test your readiness. Regular review and practice will ensure that you are well-prepared for the certification exam. On exam day, manage your time effectively and read each question carefully to increase your chances of passing and earning the certification.

Who Should Buy This Book

This book is intended for a broad audience of software, data, and cloud engineers/architects with ideally 1 year of hands-on experience with AWS services. Given the engineering focus of the certification, basic knowledge of the Python programming language—which is the de facto ML programming language—is expected.

Moreover, due to the data-centric nature of ML, having a firm grasp of basic mathematics and statistics is essential, but don’t worry—we’ll cover the basics in Appendix B (Mathematics Essentials), and provide guidance as needed.

This book comes with tested code in Python, with which you can experiment using Amazon SageMaker Studio. To get the most out of this book, an AWS account is highly recommended.

Study Guide Features

This study guide utilizes a number of common elements to help you acquire and reinforce your knowledge. Each chapter includes

Summaries

 The summary section briefly explains the key concepts of the chapter, allowing you to easily remember what you learned.

Exam essentials

 The exam essentials section highlights the exam topics and the knowledge you need to have for the exam. These exam topics are directly related to the task statements provided by AWS, which are available in the upcoming exam objectives section.

Chapter review questions

 A set of questions will help you assess your knowledge and your exam readiness.

Interactive Online Learning Environment and Test Bank

An online learning environment accompanies this study guide, designed to simplify your learning experience. Whether you’re preparing at home or on the go, this platform is here to make studying easier and more convenient for you. The following learning resources are included:

Practice tests

 This study guide includes a total of 115 questions. All 90 questions in the guide are available in our proprietary digital test engine, along with the 25 questions in the assessment test at the end of this introduction.

Electronic flash cards

 One hundred questions in a flash card format (a question followed by a single correct answer) are provided.

Glossary

 The key terms you need to know for the exam are available as a searchable glossary in PDF format along with their definitions.

The online learning environment and the test bank are available at https://www.wiley.com/go/sybextestprep.

Conventions Used in This Book

This study guide uses certain typographic styles in order to help you quickly identify important information and to avoid confusion over the meaning of words such as on-screen prompts.

In particular, look for the following styles:

Italicized text indicates key terms that are described at length for the first time in a chapter. These words are likely to appear in the searchable online glossary. (Italics are also used for emphasis.)

A monospaced font indicates the contents of program or configuration files, messages displayed at a text-mode macOS/Linux shell prompt, filenames, text-mode command names, and Internet URLs.

In addition to these text conventions, which can apply to individual words or entire paragraphs, a few conventions highlight segments of text:

A note indicates information that’s useful or interesting, but that’s somewhat peripheral to the main text. A note might be relevant to a small number of readers, for instance, or it may refer to an outdated feature.

A tip provides information that you should understand for the exam. A tip can save you time or frustration and may not be entirely obvious. A tip might describe how to get around a limitation or how to use a feature to perform an unusual task.

AWS Certified Machine Learning Engineer Exam Objectives

This study guide is designed to comprehensively address each exam objective, reflecting the exam weighting outlined in the official guide, as illustrated in the following table:

Domain                                                         Weight %
Domain 1: Data Preparation for Machine Learning (ML)              28%
Domain 2: ML Model Development                                    26%
Domain 3: Deployment and Orchestration of ML Workflows            22%
Domain 4: ML Solution Monitoring, Maintenance, and Security       24%
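Since the exam has 65 questions, these weights translate into a rough per-domain question count. The sketch below is my own illustration, not part of the official guide (AWS does not publish a per-domain breakdown, and some questions are unscored), but it shows the arithmetic:

```python
# Rough estimate of questions per domain on a 65-question exam,
# assuming questions are distributed exactly by the published weights.
weights = {
    "Data Preparation for ML": 0.28,
    "ML Model Development": 0.26,
    "Deployment and Orchestration of ML Workflows": 0.22,
    "ML Solution Monitoring, Maintenance, and Security": 0.24,
}
TOTAL_QUESTIONS = 65

estimates = {domain: round(TOTAL_QUESTIONS * w) for domain, w in weights.items()}
for domain, n in estimates.items():
    print(f"{domain}: ~{n} questions")
```

Treat these numbers only as a study-time budgeting aid.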

Domain 1: Data Preparation for Machine Learning (ML)

Task Statement 1.1: Ingest and Store Data

Knowledge of (with covering chapters):

  • Data formats and ingestion mechanisms (for example, validated and non-validated formats, Apache Parquet, JSON, CSV, Apache ORC, Apache Avro, RecordIO): Chapters 2, 4, 5, 6
  • How to use the core AWS data sources (for example, Amazon S3, Amazon Elastic File System [Amazon EFS], Amazon FSx for NetApp ONTAP): Chapters 2, 5, 6
  • How to use AWS streaming data sources to ingest data (for example, Amazon Kinesis, Apache Flink, Apache Kafka): Chapter 2
  • AWS storage options, including use cases and tradeoffs: Chapter 2

Task Statement 1.2: Transform Data and Perform Feature Engineering

Knowledge of (with covering chapters):

  • Data cleaning and transformation techniques (for example, detecting and treating outliers, imputing missing data, combining, deduplication): Chapters 3, 4
  • Feature engineering techniques (for example, data scaling and standardization, feature splitting, binning, log transformation, normalization): Chapters 3, 4
  • Encoding techniques (for example, one-hot encoding, binary encoding, label encoding, tokenization): Chapter 3
  • Tools to explore, visualize, or transform data and features (for example, Amazon SageMaker Data Wrangler, AWS Glue, AWS Glue DataBrew): Chapter 3
  • Services that transform streaming data (for example, AWS Lambda, Spark): Chapter 3
  • Data annotation and labeling services that create high-quality labeled datasets: Chapter 3
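To make the encoding-techniques item concrete, here is a minimal one-hot encoder in plain Python. This is an illustrative sketch of my own (the function name is invented); in practice you would reach for scikit-learn's OneHotEncoder or pandas.get_dummies:

```python
def one_hot_encode(values):
    """Map each categorical value to a binary indicator vector.

    Categories are ordered by first appearance, so the encoding is
    deterministic for a given input sequence.
    """
    categories = list(dict.fromkeys(values))  # unique, order-preserving
    index = {cat: i for i, cat in enumerate(categories)}
    return categories, [
        [1 if index[v] == i else 0 for i in range(len(categories))]
        for v in values
    ]

cats, encoded = one_hot_encode(["red", "green", "red", "blue"])
print(cats)     # ['red', 'green', 'blue']
print(encoded)  # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Each category becomes its own column, which avoids imposing a spurious ordering on nominal data (the main drawback of plain label encoding).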

Task Statement 1.3: Ensure Data Integrity and Prepare Data for Modeling

Knowledge of (with covering chapters):

  • Pretraining bias metrics for numeric, text, and image data (for example, class imbalance [CI], difference in proportions of labels [DPL]): Chapter 3
  • Strategies to address CI in numeric, text, and image datasets (for example, synthetic data generation, resampling): Chapters 3, 5
  • Techniques to encrypt data: Chapters 3, 8
  • Data classification, anonymization, and masking: Chapter 3
  • Implications of compliance requirements (for example, personally identifiable information [PII], protected health information [PHI], data residency): Chapter 8
  • Validating data quality (for example, by using AWS Glue DataBrew and AWS Glue Data Quality): Chapter 3
  • Identifying and mitigating sources of bias in data (for example, selection bias, measurement bias) by using AWS tools (for example, Amazon SageMaker Clarify): Chapter 3
  • Preparing data to reduce prediction bias (for example, by using dataset splitting, shuffling, and augmentation): Chapter 3
  • Configuring data to load into the model training resource (for example, Amazon EFS, Amazon FSx): Chapters 2, 3
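The class imbalance (CI) metric mentioned above measures how skewed a dataset is between a group of size n_a and a group of size n_d. The following sketch is my own, mirroring how SageMaker Clarify defines the pre-training CI metric, CI = (n_a - n_d) / (n_a + n_d):

```python
def class_imbalance(n_a, n_d):
    """Pre-training class imbalance (CI) between two facet groups:
    CI = (n_a - n_d) / (n_a + n_d), ranging from -1 to 1.

    Values near 0 indicate balance; values near +1 or -1 indicate
    strong imbalance toward one group.
    """
    return (n_a - n_d) / (n_a + n_d)

print(class_imbalance(900, 100))  # 0.8: the first group dominates
print(class_imbalance(500, 500))  # 0.0: perfectly balanced
```

Resampling or synthetic data generation (listed above) are typical remedies when CI is far from zero.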

Domain 2: ML Model Development

Task Statement 2.1: Choose a Modeling Approach

Knowledge of (with covering chapters):

  • Capabilities and appropriate uses of ML algorithms to solve business problems: Chapters 1, 4
  • How to use AWS AI services (for example, Amazon Translate, Amazon Transcribe, Amazon Rekognition, Amazon Bedrock) to solve specific business problems: Chapters 1, 4, 6
  • How to consider interpretability during model selection or algorithm selection: Chapters 4, 6
  • Amazon SageMaker built-in algorithms and when to apply them: Chapter 4

Task Statement 2.2: Train and Refine Models

Knowledge of (with covering chapters):

  • Elements in the training process (for example, epoch, steps, batch size): Chapter 5
  • Methods to reduce model training time (for example, early stopping, distributed training): Chapter 5
  • Factors that influence model size: Chapter 5
  • Methods to improve model performance: Chapter 5
  • Benefits of regularization techniques (for example, dropout, weight decay, L1 and L2): Chapter 5
  • Hyperparameter tuning techniques (for example, random search, Bayesian optimization): Chapter 5
  • Model hyperparameters and their effects on model performance (for example, number of trees in a tree-based model, number of layers in a neural network): Chapters 1, 4, 5
  • Methods to integrate models that were built outside Amazon SageMaker into Amazon SageMaker: Chapters 5, 6
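As a taste of the regularization topics above, here is a single SGD update with L2 regularization (weight decay) for one scalar weight. This is a toy sketch of my own, not code from the book:

```python
def sgd_step_l2(weight, grad, lr=0.1, weight_decay=0.01):
    """One SGD update with L2 regularization (weight decay).

    The L2 penalty (weight_decay / 2) * w**2 contributes
    weight_decay * w to the gradient, shrinking weights toward
    zero on every step, which discourages overfitting.
    """
    return weight - lr * (grad + weight_decay * weight)

w = 2.0
w = sgd_step_l2(w, grad=0.5)
print(round(w, 6))  # 1.948 = 2.0 - 0.1 * (0.5 + 0.01 * 2.0)
```

L1 regularization would instead add weight_decay * sign(w) to the gradient, pushing small weights exactly to zero (sparsity).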

Task Statement 2.3: Analyze Model Performance

Knowledge of (with covering chapters):

  • Model evaluation techniques and metrics (for example, confusion matrix, heat maps, F1 score, accuracy, precision, recall, root mean square error [RMSE], receiver operating characteristic [ROC], area under the ROC curve [AUC]): Chapter 5
  • Methods to create performance baselines: Chapter 5
  • Methods to identify model overfitting and underfitting: Chapter 5
  • Metrics available in Amazon SageMaker Clarify to gain insights into ML training data and models: Chapter 5
  • Convergence issues: Chapter 5
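Several of the evaluation metrics listed above derive directly from confusion-matrix counts. A small self-contained sketch (my own illustration; real projects would use sklearn.metrics):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from
    confusion-matrix counts (true/false positives and negatives).

    Assumes tp + fp > 0 and tp + fn > 0, i.e., at least one
    predicted positive and one actual positive.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted positives, how many correct
    recall = tp / (tp + fn)      # of actual positives, how many found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(m)
```

Working a few examples like this by hand is good exam preparation, since questions often give raw counts and ask for a specific metric.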

Domain 3: Deployment and Orchestration of ML Workflows

Task Statement 3.1: Select Deployment Infrastructure Based on Existing Architecture and Requirements

Knowledge of (with covering chapters):

  • Deployment best practices (for example, versioning, rollback strategies): Chapter 6
  • AWS deployment services (for example, Amazon SageMaker AI endpoints): Chapter 6
  • Methods to serve ML models in real time and in batches: Chapter 6
  • How to provision compute resources in production environments and test environments (for example, CPU, GPU): Chapter 6
  • Model and endpoint requirements for deployment endpoints (for example, serverless endpoints, real-time endpoints, asynchronous endpoints, batch inference): Chapter 6
  • How to choose appropriate containers (for example, provided or customized): Chapter 6
  • Methods to optimize models on edge devices (for example, Amazon SageMaker Neo): Chapter 6

Task Statement 3.2: Create and Script Infrastructure Based on Existing Architecture and Requirements

Knowledge of (with covering chapters):

  • Difference between on-demand and provisioned resources: Chapter 6
  • How to compare scaling policies: Chapter 6
  • Tradeoffs and use cases of infrastructure as code (IaC) options (for example, AWS CloudFormation, AWS Cloud Development Kit [AWS CDK]): Chapter 6
  • Containerization concepts and AWS container services: Chapter 6
  • How to use Amazon SageMaker endpoint auto scaling policies to meet scalability requirements (for example, based on demand, time): Chapter 6

Task Statement 3.3: Use Automated Orchestration Tools to Set up Continuous Integration and Continuous Delivery (CI/CD) Pipelines

Knowledge of

Capabilities and quotas for AWS CodePipeline, AWS CodeBuild, and AWS CodeDeploy (Chapter 6)

Automation and integration of data ingestion with orchestration services (Chapter 6)

Version control systems and basic usage (for example, Git) (Chapter 6)

CI/CD principles and how they fit into ML workflows (Chapters 6, 7)

Deployment strategies and rollback actions (for example, blue/green, canary, linear) (Chapter 6)

How code repositories and pipelines work together (Chapter 6)

Domain 4: ML Solution Monitoring, Maintenance, and Security

Task Statement 4.1: Monitor Model Inference

Knowledge of

Drift in ML models (Chapter 7)

Techniques to monitor data quality and model performance (Chapter 7)

Design principles for ML lenses relevant to monitoring (Chapter 7)

Task Statement 4.2: Monitor and Optimize Infrastructure and Costs

Knowledge of

Key performance metrics for ML infrastructure (for example, utilization, throughput, availability, scalability, fault tolerance) (Chapters 6, 7)

Monitoring and observability tools to troubleshoot latency and performance issues (for example, AWS X-Ray, Amazon CloudWatch Lambda Insights, Amazon CloudWatch Logs Insights) (Chapters 7, 8)

How to use AWS CloudTrail to log, monitor, and invoke retraining activities (Chapters 7, 8)

Differences between instance types and how they affect performance (for example, memory optimized, compute optimized, general purpose, inference optimized) (Chapters 6, 7)

Capabilities of cost analysis tools (for example, AWS Cost Explorer, AWS Billing and Cost Management, AWS Trusted Advisor) (Chapter 7)

Cost tracking and allocation techniques (for example, resource tagging) (Chapter 7)

Task Statement 4.3: Secure AWS Resources

Knowledge of

IAM roles, policies, and groups that control access to AWS services (for example, AWS Identity and Access Management [IAM], bucket policies, Amazon SageMaker Role Manager) (Chapter 8)

Amazon SageMaker security and compliance features (Chapter 8)

Controls for network access to ML resources (Chapter 8)

Security best practices for CI/CD pipelines (Chapter 8)

Assessment Test

1. When configuring Amazon S3 for data storage, what best practice should be followed to ensure efficient cost management and data retrieval?

A. Store all data in the S3 Standard storage class.

B. Utilize versioning for all objects.

C. Implement lifecycle policies to transition data to appropriate storage classes.

D. Enable cross-region replication for all buckets.

2. For high-throughput data ingestion into Amazon S3, which feature can be leveraged to manage large-scale file transfers efficiently?

A. Amazon S3 Multipart Upload

B. Amazon S3 Access Points

C. Amazon S3 Transfer Acceleration

D. Amazon S3 Batch Operations

3. What is a key advantage of using Amazon FSx for Lustre over Amazon S3 for machine learning workloads requiring high-speed processing?

A. Better compatibility with Hadoop

B. Lower cost for large datasets

C. Support for distributed file systems with high throughput

D. Seamless integration with Amazon Glacier

4. In the context of feature engineering, what is the primary goal of applying Principal Component Analysis (PCA) to a dataset?

A. Increase the dimensionality of data

B. Extract uncorrelated features for better model performance

C. Normalize the distribution of features

D. Enhance the interpretability of the dataset

5. When dealing with categorical variables in feature engineering, which method can be used to effectively capture the ordinal relationship between categories?

A. One-hot encoding

B. Binary encoding

C. Ordinal encoding

D. Frequency encoding

6. For a problem requiring prediction of time-series data, which machine learning algorithm is most suitable?

A. K-nearest neighbors

B. DeepAR

C. Support vector machines

D. Random forests

7. Which ensemble method combines the predictions of multiple weak learners to improve model performance and robustness?

A. Decision trees

B. Neural networks

C. XGBoost

D. K-means clustering

8. In the context of model development, what is the purpose of using a regularization technique such as L1 or L2 regularization?

A. To improve the accuracy of the training dataset

B. To simplify the model by penalizing large coefficients

C. To enhance data visualization

D. To increase the learning rate of the model

9. Which optimization algorithm is commonly used to minimize the loss function during the training of deep learning models?

A. Gradient descent

B. Newton’s method

C. Genetic algorithm

D. Simulated annealing

10. When evaluating a binary classification model, which metric should be used to determine the balance between precision and recall?

A. Accuracy

B. F1 score

C. ROC-AUC

D. Mean squared error

11. How can cross-validation help in assessing the generalization capability of a machine learning model?

A. By splitting the dataset into train and test sets multiple times

B. By using the entire dataset for training

C. By creating synthetic data points for evaluation

D. By reducing the dimensionality of features

12. What is the key benefit of using Amazon SageMaker AI for model deployment and orchestration?

A. Automatic hyperparameter tuning

B. Real-time model monitoring

C. Seamless integration with Amazon SageMaker Pipelines

D. Built-in data visualization tools

13. In the context of deploying machine learning models, what is the purpose of using AWS Step Functions?

A. To perform ETL operations on data

B. To create and manage complex ML workflows with state transitions

C. To monitor model performance in real time

D. To deploy models on edge devices

14. What is the advantage of using Amazon SageMaker Model Monitor for deployed models?

A. Automatic scaling of model endpoints

B. Continuous monitoring of model quality and data drift

C. Real-time training of models

D. Deployment of models across multiple regions

15. How can the detection of data drift help maintain model performance?

A. By retraining the model on the same dataset

B. By identifying changes in data distribution that affect model predictions

C. By increasing the model’s learning rate

D. By reducing the number of features used in the model

16. Which AWS services can help secure machine learning models by enforcing access controls and encryption?

A. Amazon VPC and Amazon CloudWatch

B. AWS IAM and AWS Key Management Service (KMS)

C. AWS CloudTrail and Amazon GuardDuty

D. AWS IAM and AWS Config

17. When deploying machine learning models, what is a recommended best practice to prevent unauthorized access to sensitive data?

A. Using private S3 buckets for model storage

B. Storing model credentials in AWS Secrets Manager

C. Enabling encryption at rest and in transit

D. Allowing restricted network access to model endpoints

18. In feature engineering, which method helps in transforming skewed data distributions into a more Gaussian-like distribution?

A. Min-max scaling

B. Log transformation

C. Label encoding

D. One-hot encoding

19. What is the purpose of using dropout in training deep learning models?

A. To improve the accuracy of the model

B. To prevent overfitting by randomly dropping neurons

C. To increase the learning rate

D. To simplify the model architecture

20. Which Amazon SageMaker AI service helps detect bias in machine learning models during post-deployment monitoring?

A. Amazon SageMaker Model Monitor

B. Amazon SageMaker Data Wrangler

C. Amazon SageMaker Clarify

D. Amazon SageMaker Neo

21. You are tasked with deploying a machine learning model that requires GPU acceleration for inference to achieve optimal performance. Which AWS compute instance type would you select for this purpose?

A. T2.micro

B. C5.large

C. P3.2xlarge

D. R5.xlarge

22. You need to deploy multiple versions of a machine learning model to evaluate their performance in a live environment. Which AWS service enables you to deploy and manage these multiple model versions effectively?

A. Amazon SageMaker Model Registry

B. AWS Glue Data Catalog

C. Amazon SageMaker Multi-Model Endpoints

D. Amazon SageMaker Ground Truth

23. A retail company needs to process customer images in real time for personalized shopping experiences using a deep learning model. Which combination of AWS services and configurations would you use to achieve high performance and scalable inference?

A. Amazon SageMaker with Multi-Model Endpoints and Elastic Load Balancing

B. Amazon EC2 with GPU instances and AWS Auto Scaling

C. AWS Lambda with Amazon S3 and API Gateway

D. Amazon SageMaker with Endpoint Variants and Auto Scaling

24. A healthcare company needs to ensure that its deployed machine learning models comply with regulatory standards and maintain high accuracy over time. Which feature of Amazon SageMaker Model Monitor would you leverage to track data quality and model performance, and what key metrics would you monitor to ensure compliance?

A. Baseline constraints and statistics; monitor data distribution and prediction accuracy

B. Model registry; track model versions and updates

C. Hyperparameter tuning; optimize model hyperparameters

D. Feature store; manage and store feature data

25. An organization is looking to simplify the process of managing IAM roles for various Amazon SageMaker users and workloads. It needs to ensure that roles are correctly configured with the necessary permissions while maintaining security best practices. Which feature of Amazon SageMaker Role Manager would you use to achieve this?

A. Automatic role creation

B. Role templates

C. Role auditing

D. Role inheritance

Answers to Assessment Test

1. C. Lifecycle policies in Amazon S3 help manage storage costs by automatically transitioning data to lower-cost storage classes as it becomes less frequently accessed.

2. A. Amazon S3 Multipart Upload allows for efficient, parallel upload of large files by splitting them into smaller parts, which can be uploaded independently and reassembled.

3. C. Amazon FSx for Lustre is designed for high-performance workloads and provides a distributed file system with high throughput and low latency, making it ideal for data-intensive ML tasks.

4. B. PCA reduces the dimensionality of the data by transforming it into a set of uncorrelated principal components, improving model performance by eliminating redundant features.

5. C. Ordinal encoding assigns numerical values to categorical variables while preserving the order of the categories, which is important for algorithms that can leverage this relationship.

6. B. DeepAR is specifically designed to handle sequential data and can capture temporal dependencies, making it suitable for time series prediction.

7. C. XGBoost combines the predictions of multiple weak learners, typically decision trees, into a strong predictive model. Each new learner is trained to correct the errors of the previous ones, “boosting” overall accuracy through an iterative process of error correction.

8. B. Regularization techniques add a penalty for large coefficients, encouraging simpler models that generalize better to new data and reducing the risk of overfitting.

9. A. Gradient descent is a widely used optimization algorithm in deep learning that iteratively adjusts model parameters to minimize the loss function.

10. B. The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both aspects, particularly useful for imbalanced datasets.

11. A. Cross-validation involves partitioning the data into multiple train and test splits, providing a more reliable estimate of the model’s generalization performance.

12. C. Amazon SageMaker Pipelines allows for the creation and management of end-to-end machine learning workflows, facilitating efficient model deployment and orchestration.

13. B. AWS Step Functions enable the orchestration of complex workflows with state transitions, ensuring that each step of the ML process is executed in the correct order.

14. B. Amazon SageMaker Model Monitor continuously tracks model performance and data drift, alerting users to issues that could impact model accuracy and reliability.

15. B. Detecting data drift involves monitoring shifts in data distribution, which can help identify when the model may need retraining to maintain performance.

16. B. AWS IAM enables access control by securely managing identities and access to AWS services and resources. AWS KMS provides encryption keys and manages their lifecycle, ensuring data and model security through encryption at rest and in transit.

17. C. Encrypting data at rest and in transit ensures that sensitive data remains secure, even if accessed without proper authorization.

18. B. Log transformation reduces skewness in data, making distributions more Gaussian-like, which can improve model performance and interpretation.

19. B. Dropout is a technique that prevents overfitting in neural networks. It works by randomly deactivating a percentage of neurons during training. This forces the remaining neurons to compensate, which prevents any one neuron from becoming too dependent on others.

20. C. Amazon SageMaker Clarify detects and measures bias in machine learning models, both before and after deployment, helping to ensure fair and unbiased predictions.

21. C. P3.2xlarge provides GPU acceleration, which is essential for models requiring high computational power for inference, ensuring optimal performance.

22. C. Amazon SageMaker Multi-Model Endpoints allows you to deploy multiple models on a single endpoint, effectively managing different model versions and improving resource utilization, leading to cost savings and simplified deployment management.

23. D. Amazon SageMaker with Endpoint Variants and Auto Scaling allows you to deploy multiple versions of a model with Auto Scaling to handle different traffic loads, providing high performance and scalable inference for real-time image processing.

24. A. Baseline constraints and statistics help ensure that the input data remains consistent with the training data and the model’s predictions continue to meet the required accuracy and compliance standards.

25. B. Role templates simplify the assignment of appropriate permissions based on common Amazon SageMaker tasks, ensuring that users have the necessary access while adhering to security best practices and reducing the complexity of role management.

Chapter 1Introduction to Machine Learning

THE AWS CERTIFIED MACHINE LEARNING (ML) ENGINEER ASSOCIATE EXAM OBJECTIVES COVERED IN THIS CHAPTER MAY INCLUDE, BUT ARE NOT LIMITED TO, THE FOLLOWING:

Domain 2: ML Model Development

2.1 Choose a modeling approach

2.2 Train and refine models

Machine learning (ML) has become ubiquitous in our digital world. Whether you book a flight, visit your doctor, make an online purchase, pay a bill, or check the weather forecast, behind the scenes each of these actions starts (or is part of) a process that collects large amounts of data, processes that data, and performs some ML task.

ML is a branch of artificial intelligence (AI) that enables systems to learn and improve from experience without being explicitly programmed.1 By analyzing large datasets, ML algorithms can identify patterns, make decisions, and predict outcomes.

The integration of ML into various domains has revolutionized industries by enhancing efficiency, accuracy, and decision-making capabilities.

This chapter will provide the ML foundations you need to know for the exam. To better understand ML, we need to set the context where ML originated, which is AI.

Understanding Artificial Intelligence

AI is a branch of computer science whose main focus is to develop systems capable of performing tasks that typically require human intelligence. These tasks include (but are not limited to) recognizing speech, making decisions, solving problems, identifying patterns, and understanding languages. AI systems leverage techniques such as ML, natural language processing (NLP), deep learning, and computer vision to simulate cognitive functions like learning, reasoning, and self-correction. The field of AI is rapidly evolving, with applications spanning various industries, from healthcare and finance to autonomous vehicles and smart cities. Understanding AI involves exploring its fundamental concepts, its historical development, and the ethical implications of its widespread adoption.

AI systems ingest data, such as human-level knowledge, and emulate natural intelligence. ML is a subset of AI, where data and algorithms continuously improve the training model to help achieve higher-quality output predictions. Deep learning is a subset of ML. It is an approach to realizing ML that relies on a layered architecture, simulating the human brain to identify data patterns and train the model.

Figure 1.1 illustrates this hierarchy. It all starts with data and how data can be used to extract relevant information, which ultimately produces knowledge.

FIGURE 1.1 Deep learning, machine learning, and AI.

Data, Information, and Knowledge

Data, information, and knowledge form the foundation of understanding and applying AI. Data is the raw, unprocessed facts and figures collected from various sources. When data is organized and processed, it transforms into information, which provides context and meaning. Knowledge is derived from the synthesis of information, enabling comprehension and informed decision-making.

In AI, data serves as the essential input that fuels ML algorithms, enabling the training of models to recognize patterns and make predictions. Information, structured from the data, helps refine these models by offering insights and context. Ultimately, knowledge empowers AI systems to emulate human-like reasoning, enhancing their ability to perform complex tasks, adapt to new situations, and provide valuable solutions across various domains.

Data

Data is defined as a value or set of values representing a specific concept or concepts. Data is the most critical asset of the digital age, fueling advancements in technology and shaping the way we understand and interact with the world. To harness its full potential, it’s essential to recognize the different classes of data: structured, semi-structured, and unstructured. Each class has unique characteristics and applications, particularly in the context of AI.

Understanding these classes of data and their peculiarities is crucial for leveraging AI’s full potential. Structured data provides the foundation for organized analysis, semi-structured data offers flexibility in handling complex information, and unstructured data unlocks a wealth of untapped insights. Together, they enable AI systems to learn, adapt, and deliver transformative solutions across various domains. Let’s review each class in more detail.

Structured Data

Structured data is highly organized and easily searchable in databases. This type of data adheres to a fixed schema, meaning it follows a predefined format with specific fields and records. For example, a spreadsheet containing customer information, such as names, addresses, and purchase histories, is structured data. It allows for efficient querying and analysis, making it invaluable for business intelligence and data-driven decision-making.

These are examples of structured data:

Relational databases (e.g., SQL databases)

Spreadsheets (e.g., Excel files)

Financial records (e.g., transaction logs)

Semi-Structured Data

Semi-structured data lacks the rigid structure of structured data but still contains some organizational elements, such as tags or markers, that provide context and hierarchy. This class of data is often used for storing and transmitting complex information that doesn’t fit neatly into a table format. Semi-structured data is more flexible than structured data, allowing for greater adaptability in handling diverse data types.

These are examples of semi-structured data:

JavaScript Object Notation (JSON) files

Extensible Markup Language (XML) documents

Email messages (headers, body text, attachments)

Unstructured Data

Unstructured data is the most common and diverse type of data, encompassing information that doesn’t follow a predefined format or schema. This class of data is often rich in content but challenging to analyze and search. Unstructured data requires advanced AI techniques, such as NLP and computer vision, to extract meaningful insights.

These are examples of unstructured data:

Text files (e.g., Word documents)

Multimedia files (e.g., images, videos)

Social media content (e.g., tweets, posts)
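A short sketch can make the three classes concrete. Below, the same hypothetical customer fact appears in each form; the record values, field names, and regex are invented for illustration. Structured data is queried by column, semi-structured data is navigated by its tags and hierarchy, and unstructured text must be parsed before any field can be extracted:

```python
import csv
import io
import json
import re

# Structured: fixed schema, every record has the same fields (CSV/SQL-like).
structured = io.StringIO("customer_id,name,city\n42,Ada,Turin\n")
rows = list(csv.DictReader(structured))

# Semi-structured: tagged and hierarchical, but fields may vary per record (JSON).
semi_structured = json.loads(
    '{"customer_id": 42, "name": "Ada", "orders": [{"sku": "B-7", "qty": 2}]}'
)

# Unstructured: free text; recovering a field requires parsing (here, a crude regex —
# real systems would use NLP techniques instead).
unstructured = "Ada from Turin ordered two copies of item B-7 yesterday."
match = re.search(r"item\s+(\S+)", unstructured)

print(rows[0]["name"])                      # structured lookup by column name
print(semi_structured["orders"][0]["sku"])  # navigate the JSON hierarchy
print(match.group(1))                       # field recovered from raw text
```

Note how the effort to extract the same fact grows as the structure decreases, which is why unstructured data generally needs the advanced AI techniques mentioned above.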

Information

Information bridges the gap between raw data and actionable knowledge. Derived from data, information provides a semantic element in the form of context and meaning, resulting in a transformation of disparate facts and figures into coherent, useful insights.

Information can be understood as data that has been processed, organized, or structured in a way that adds context and relevance. Unlike raw data, which is often unorganized and lacks inherent meaning, information is data presented in a format that is understandable and useful to its recipient(s). For example, a list of temperatures recorded at various times of the day is mere data; when these temperatures are organized into a table showing daily weather patterns, they become information.
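The temperature example can be made concrete with a few lines of Python. The readings below are invented for illustration: the raw tuples are mere data, and grouping them into a per-day summary turns them into information about daily weather patterns:

```python
from collections import defaultdict

# Raw, unorganized readings: (day, hour, temperature in °C) — this is data.
readings = [
    ("Mon", 6, 11.0), ("Mon", 12, 19.5), ("Mon", 18, 15.0),
    ("Tue", 6, 9.0),  ("Tue", 12, 17.0), ("Tue", 18, 13.5),
]

# Organizing the readings by day adds context and meaning — this is information.
by_day = defaultdict(list)
for day, hour, temp in readings:
    by_day[day].append(temp)

for day, temps in by_day.items():
    print(f"{day}: min {min(temps):.1f}, max {max(temps):.1f}, "
          f"mean {sum(temps) / len(temps):.1f}")
```

The same values are present before and after; what changed is the organization and context, which is exactly the data-to-information transformation described above.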

The key characteristics of information are accuracy, relevance, completeness, and timeliness:

Accuracy

 Information must be precise and free from errors to be valuable. Inaccurate information can lead to misguided decisions and outcomes.

Relevance

 Information should be pertinent to the context or problem at hand. Irrelevant information, no matter how accurate, serves little purpose.

Completeness

 Information must be comprehensive enough to provide a clear understanding without ambiguity. Incomplete information can result in incorrect conclusions.

Timeliness

 Information is most useful when it is available at the right time. Outdated information can be as detrimental as inaccurate information.

In AI, information plays a critical role in training and refining algorithms. AI systems rely on vast amounts of data to learn and make predictions. This data is processed into information that adds context and aids in pattern recognition and decision-making.

For instance, in NLP, raw text data is processed to extract meaningful information, such as sentiment analysis or language translation. Similarly, in computer vision, images are analyzed to identify objects and patterns, turning visual data into actionable information.

Knowledge

Knowledge represents a higher level of understanding that goes beyond mere data and information. It encompasses the insights, experiences, and contextual understanding that enable individuals and systems to make informed decisions, solve problems, and innovate.

Knowledge is derived from the synthesis and application of information. It involves recognizing patterns, understanding relationships, and drawing conclusions based on experience and context. Unlike data, which is raw and unprocessed, or information, which is organized and meaningful, knowledge embodies a deeper comprehension that guides action and thought.

The key characteristics of knowledge are contextuality, applicability, basis in experience, and dynamism.

Contextuality

 Knowledge is deeply rooted in context. It involves understanding not just the facts but also the circumstances and nuances that surround them.

Applicability

 Knowledge is practical. It involves the ability to apply information to real-world situations, making it actionable and relevant.

Experience-based

 Knowledge is often gained through experience. It encompasses lessons learned, insights gained, and the wisdom accumulated over time.

Dynamism

 Knowledge is ever-evolving. As new information becomes available and experiences accumulate, knowledge grows and adapts.

In AI, knowledge is crucial for developing systems that can derive inferences, learn, and adapt. AI systems rely on vast amounts of data and information to build knowledge bases that enable them to perform complex tasks and make intelligent decisions.

For example, expert systems in AI were designed to emulate human expertise in specific domains. These systems utilized knowledge bases that contained rules, facts, and relationships, allowing them to provide recommendations and solve problems. Similarly, ML algorithms use data and information to build models that capture knowledge about patterns and trends, enabling predictive analytics and decision-making.

Understanding Machine Learning

ML is a subset of AI that allows computers to learn from existing information, without being explicitly programmed, and apply that learning to perform other similar tasks.

Without explicit programming, the machine learns from the data and information it is fed. The machine picks up patterns, trends, or essential features from previous data and makes predictions on new data. This aspect of ML, i.e., a machine’s ability to learn without being explicitly programmed, is important and highlights the key difference between ML and classical programming.

Let’s say you are tasked by your manager to build a program that translates text from English to Italian. At a very high level, you could pursue a classical programming approach by manually coding rules and exceptions for lexicon, grammar, syntax, and vocabulary. This approach would be extremely complex and not scalable. Instead, with ML, you could leverage models that excel at understanding context, idioms, and nuances in both languages, which are essential for accurate translation. These ML models can continuously improve their translation capabilities with more data, adapting to new expressions, idioms, and evolving language usage.
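To see why the rule-based route breaks down, consider a toy word-by-word dictionary “translator” (the tiny lexicon below is invented for illustration). It handles a literal phrase acceptably but mangles an idiom, which is precisely the context-dependence that ML translation models learn from data rather than from hand-written rules:

```python
# A hand-coded English-to-Italian lexicon: the classical-programming approach.
lexicon = {
    "the": "il", "cat": "gatto", "sleeps": "dorme",
    "it": "esso", "is": "è", "raining": "piovendo",
    "cats": "gatti", "and": "e", "dogs": "cani",
}

def rule_based_translate(sentence: str) -> str:
    # Translate word by word; no notion of context, grammar, or idiom.
    return " ".join(lexicon.get(w, f"<{w}?>") for w in sentence.lower().split())

print(rule_based_translate("the cat sleeps"))
# A literal phrase comes out acceptably: "il gatto dorme".

print(rule_based_translate("it is raining cats and dogs"))
# The idiom is rendered literally ("esso è piovendo gatti e cani") instead of
# with its intended meaning ("piove a dirotto"). Every idiom, inflection, and
# word-order rule would need its own hand-written exception — the approach
# does not scale, which is the gap ML models fill by learning from examples.
```

The rule set also fails closed on any word it has never seen, whereas an ML model generalizes from the patterns in its training data.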

A real-life ML application is recommendation systems. For example, Amazon uses ML to recommend books to users based on the book categories they select and their purchase history. Likewise, Spotify uses ML to recommend songs to users based on the songs and genres they have listened to. In both cases, Amazon and Spotify use large amounts of data to train their ML models and derive meaningful inferences in the form of recommendations.

In the upcoming sections, the fundamental concepts of ML will be introduced. These concepts will form the basis you need to master each exam objective. Let’s start with the ML lifecycle.

ML Lifecycle

First, you must have a holistic view of the entire ML lifecycle. Because ML is deeply grounded in science, it all starts with questions like “What is the business problem we are trying to solve?” and, most importantly, “How does machine learning address this business need?”

Figure 1.2 illustrates the ML lifecycle. These steps are generally followed in all ML projects, regardless of the cloud provider or the tools in use. Let’s review each step.

FIGURE 1.2 Machine learning lifecycle.

Define ML Problem

Defining an ML problem is the critical first step in the development of any ML project. It involves understanding and clearly articulating the specific business challenge or question that needs to be addressed using data and ML techniques. This phase includes identifying the problem’s scope, the desired outcomes, and the feasibility of applying ML solutions. Crucially, it requires a detailed analysis of the available data, understanding the business context, and determining the performance metrics that will be used to evaluate success. By establishing a clear and well-defined problem statement, you can ensure that your efforts are focused and aligned with strategic goals, leading to more effective and impactful ML solutions.

What sets an ML problem apart from other problems in computer science/engineering is its reliance on data-driven approaches and statistical models to make predictions or decisions. Traditional programming relies on explicit instructions and algorithms to solve problems, whereas ML leverages patterns and relationships within the data to infer solutions. This shift from deterministic algorithms to probabilistic models means that ML problems often involve uncertainty and require continuous learning and adaptation from new data, making them uniquely dynamic and challenging. This aspect of continuous learning and adaptation from new data is reflected in the cycle represented in Figure 1.2.
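The shift from deterministic rules to probabilistic, data-driven models can be sketched in a few lines. In this toy spam-detection example (the labeled messages and the single-word model are invented for illustration), the hand-written rule returns a hard yes/no, while the data-driven estimate returns a probability computed from previously labeled examples and changes as new labeled data arrives:

```python
# Deterministic: an explicit, hand-written rule — always the same hard answer.
def rule_is_spam(message: str) -> bool:
    return "free" in message.lower()

# Probabilistic: estimate P(spam | word appears) from labeled historical data.
# label 1 = spam, label 0 = not spam.
labeled = [
    ("free money now", 1),
    ("free trial offer", 1),
    ("lunch at noon?", 0),
    ("free lunch friday", 0),
]

def estimated_spam_probability(word: str) -> float:
    # Fraction of spam among the messages that contain the word.
    labels = [label for text, label in labeled if word in text.split()]
    return sum(labels) / len(labels)

print(rule_is_spam("Free lunch friday"))   # the rule is absolute: True
print(estimated_spam_probability("free"))  # 2 of the 3 matching messages are spam
```

Adding one more labeled message changes the estimate but not the rule, illustrating the continuous learning and adaptation described above.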

Collect Data