Description

Building systems that can withstand failure is key to running a successful business on the cloud. Learn from distinguished AWS experts—Ajit, Imaya, and Rodrigue—who bring over four decades of experience in architecting enterprise-scale solutions, speaking at major AWS conferences, and implementing resilience strategies across diverse industries, as they guide you through building highly available and fault-tolerant applications on AWS.
This book explores resiliency, offering steps to design, build, and operate resilient architectures on AWS. You’ll master data security practices, backup strategies, and automation techniques, helping you build strong defenses and reliable recovery plans for resilience against disruptions. You’ll also learn how to apply AWS Well-Architected pillars to design applications with redundancy, loose coupling, graceful degradation, and fault isolation. With architecture examples, you’ll validate your design’s effectiveness through resilient patterns, performance monitoring, and chaos engineering.
By the end of this book, you’ll be equipped with best practices for creating robust cloud infrastructures that ensure business continuity, and you’ll be proficient at creating fault-tolerant systems, optimizing performance, and ensuring reliability across regions.

Formats: EPUB, MOBI

Page count: 549

Publication year: 2024




Building Resilient Architectures on AWS

A practical guide to architecting cost-efficient, resilient solutions in AWS

Ajit Puthiyavettle

Imaya Kumar Jagannathan

Rodrigue Koffi

Building Resilient Architectures on AWS

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

The authors acknowledge the use of cutting-edge AI, such as Large Language Models, with the sole aim of doing research and enhancing the language and clarity within the book, thereby ensuring a smooth reading experience for readers. It’s important to note that the content and ideas have been crafted by the authors and edited by a professional publishing team.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Preet Ahuja

Publishing Product Manager: Suwarna Patil

Book Project Manager: Ashwin Kharwa

Senior Editor: Mohd Hammad

Technical Editor: Irfa Ansari

Copy Editor: Safis Editing

Proofreader: Mohd Hammad

Indexer: Rekha Nair

Production Designer: Gokul Raj S.T

DevRel Marketing Coordinator: Rohan Dobhal

First published: December 2024

Production reference: 1081124

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK

ISBN 978-1-83588-710-3

www.packtpub.com

To my father, Narayanan, and my mother, Chandramathi, for their boundless love, sacrifices, and embodiment of resilience that have forever shaped my life. To my beloved wife, Deepa, my steadfast partner, whose unwavering support and companionship have been the cornerstone of our shared journey. To my treasured children, Ayush and Tejas, the lights of my life, whose presence fills my heart with immeasurable joy and pride.

– Ajit Puthiyavettle

To my family - Manimekala, my dear wife, and our lovely children, Kavin and Kayal, who found a unique way to motivate me by regularly asking with curiosity and pride, “Are you there yet?” My fur baby, DJ, for his licks, walks, and pure love. Appa and Amma, for teaching me how to be tenacious and kind when the going gets tough. My brother, sister-in-law, and their two little boys, who gave me great support by adding to the fun each time they visited. This journey would not have been the same without your collective love and encouragement.

– Imaya Kumar Jagannathan

To my late mother, who left this world too soon, but whose love and invaluable life lessons remain engraved in my heart forever. To my father, for all his sacrifices and for being a true definition of resilience. To my sweet Catherine, my soulmate and life partner, who always believes in me and supports me. To my daughter, Alma, my priceless treasure who reminds me of the true essence of joy. To the Anéyé family, whose sheltering, love, and support have been an anchor through my life path. To my dear “Tantie” Fanda, for being like a mother to me.

– Rodrigue Koffi

Contributors

About the authors

Ajit Puthiyavettle is an accomplished Principal Solutions Architect with a proven track record of over a decade in architecting innovative solutions that drive business success. His expertise lies in collaborating closely with enterprise clients across diverse sectors, including financial services, healthcare, and life sciences, crafting tailored solutions that align with their strategic objectives. His deep understanding of industry trends, coupled with his technical acumen, allows him to architect solutions that not only meet current requirements but also anticipate future needs. He has presented at various public conferences, including AWS re:Invent, AWS re:Inforce, and AWS Summits, and has authored numerous blogs, workshops, and videos.

Imaya Kumar Jagannathan is an expert architect, technical leader, speaker, and author with over 22 years of rich experience in the technology industry, specializing in designing and building complex, internet-scale applications. He has worked in highly impactful roles at various large organizations and is passionate about creating solutions that prioritize customers and make a meaningful difference to businesses. He also enjoys coaching and mentoring others and finds happiness in witnessing their success. He has presented at several top conferences, including AWS re:Invent, and has authored dozens of articles, blogs, videos, and live events. When not busy at work, he spends time learning about aviation.

Rodrigue Koffi is a technical leader, public speaker, and software engineer at heart with over a decade of experience. He is an author of multiple articles and whitepapers, and a speaker at top industry conferences. Currently working at AWS, he helps customers across multiple industries achieve their resilience goals through observability. He has held various roles, from software engineering to consulting and leading site reliability engineering teams. Rodrigue is passionate about high-scale and distributed systems, continuously expanding his knowledge in these domains. Outside of work, he finds joy in swimming and spending quality time with his family.

About the reviewers

Divyajeet Singh is a seasoned professional in the cloud computing space with extensive experience across major tech giants. Currently serving as a senior solutions architect at AWS, he has previously held similar roles at Google and Microsoft, solidifying his expertise in cloud technologies. Divyajeet thrives on challenges, viewing them as opportunities for growth and learning.

Beyond his professional pursuits, Divyajeet nurtures a keen interest in CAD and CAM. He finds joy in bringing ideas to life through CNC machines during his spare time, showcasing his passion for innovation beyond the digital realm. Divyajeet values work-life balance, enjoying travel and quality time with family and friends.

I extend my heartfelt gratitude to my parents, my wife, and my 5-year-old son, for their unwavering support. I also want to express my appreciation to my mentors and colleagues for the invaluable insights and collaboration throughout my professional journey.

Pranit Raje is a Solutions Architect with AWS India, bringing over six years of experience across various AWS teams. He excels in DevOps, automation, containers, and operational excellence through IaC, CI/CD, and DevSecOps practices. Pranit has played a key role in proposing, advising, consulting, and delivering critical technical solutions for diverse customers. He has authored blogs, technical articles, and workshops, both internally and externally. In his spare time, Pranit enjoys networking with like-minded professionals through technical conferences and meetups.

I am grateful to my family, friends, colleagues, and managers for their constant support and the valuable lessons I’ve learned from them. I feel fortunate to work in this field alongside such helpful individuals who challenge me to improve and keep learning. Special thanks to my family for their unwavering support and patience with my busy schedule.

Table of Contents

Preface

Part 1: Setting the Stage – Learning the Basics of Designing Resilient Architectures

1

Understanding Resilience Concepts

Demystifying resilience

Cloud resilience

Shared Responsibility Model

Why resiliency?

Resilient foundations

Facing the cloud’s storms

Software bugs and security threats

Using Software Bill of Materials (SBOM) best practices

Empowering yourself with AWS services

The continuous resilience journey

Resilience Lifecycle Framework

Resilience is a continuous journey

Summary

2

Implementing Resilient Compute and Auto Scaling

Redundancy and fault tolerance in compute

Key principles for addressing factors influencing system stability

Embracing auto scaling for dynamic resource management

What is Auto Scaling in AWS?

Some pitfalls and assumptions in the e-commerce architecture

The key components of Auto Scaling in AWS

Some use cases and benefits of Auto Scaling in AWS

Optimizing cost-efficiency with Spot and Reserved Instances

Using Spot Instances

Using AWS Reserved Instances

Monitoring and maintaining a healthy infrastructure

AWS observability services

AWS-managed open source observability services

Extending resilience to containers and serverless

Summary

3

Securing and Backing Up Critical Data

Data security as resilience foundation

Controlling access to data

Encryption, intrusion detection, and prevention

Resilience advantages of a secure data strategy

Layering backup strategies for reliable resilience

Implementing layered backups in AWS

Designing your AWS backup strategy

Backup validation and disaster recovery testing

Embracing multi-region and geo-replication

The case for geographic redundancy

Understanding replication techniques

Active-active versus active-passive architectures

Dealing with data consistency

AWS services for multi-region and geo-replication

Design considerations and challenges

A simple, resilient, global web application architecture

Continuous monitoring and recovery orchestration

Best practices for automation in resilience

Automating data loss prevention and recovery

Scenario-based automation examples

Further considerations to improve recovery mechanisms

Disaster recovery planning and drills

Crafting your AWS disaster recovery plan

Types of disaster recovery drills

AWS tools to power your DR drills

Execution best practices for your drills

Specific scenarios for DR drills

Summary

4

Orchestrating Graceful Degradation

Understanding graceful degradation through an example

Identifying the fault – diagnosing partial failures and minimizing impact

Log analysis through Amazon CloudWatch

Performance monitoring

Root cause analysis through traces

Predicting issues before they occur

Isolating the wound – containment strategies to prevent cascading outages

Automated troubleshooting

Incident Management

Architectural design patterns for containment

Streamlining recovery with preconfigured actions

Leveraging ML and GenAI to enhance issue detection and response

GenAI for IR

ML for issue identification

Summary

Further reading

5

Exploring the AWS Shared Responsibility Model

The essence of collaboration

The synergy of shared resilience

Adapting shared responsibilities for specific services

Kubernetes control plane and its operations

Shared responsibility and cost

The importance of continuous testing for critical infrastructure resilience in AWS environments

Tools and techniques to perform continuous testing of AWS environments

Adapting your security practices alongside AWS’s ever-evolving landscape

Sharing lessons learned and engaging with the community

Summary

Part 2: Building Resilient Cloud Architectures on AWS

6

Learning AWS Well-Architected Principles for Resiliency

Technical requirements

Gaining Operational Excellence for improved resilience

Performing operations as code

Making frequent, small, reversible changes

Refining operations procedures frequently

Anticipating failure

Using managed services

Implementing observability for actionable insights

Fostering an organizational culture for operational excellence

Building reliable architectures

Automatically recovering from failure

Capacity and quotas management

Scaling applications to meet demand

Architecting cost-effective resilience

Implementing security for improved resilience

Identity and access management

Protection

Incident response

Summary

7

Architecting Fault-Tolerant Applications

Leveraging AWS global infrastructure for redundancy

Hardware or infrastructure redundancy

Leveraging AWS core infrastructure for redundancy

Load balancing workloads across redundant systems

State management – stateless versus stateful approaches

Handling data redundancy

Applying redundancy for file storage

Leveraging managed database services

Backing up data regularly

Implementing loose coupling for isolating faults

Using microservices for decoupling services

Service-to-service communication

Event-driven architecture (EDA)

The Twelve-Factor App methodology

Summary

8

Resiliency Considerations for Serverless Applications

Defining serverless applications

Building resilience into serverless

Idempotent and asynchronous function design

Retries and error handling in AWS Lambda

Handling throttling and service quotas (service limits)

Monitoring and observability for serverless applications

Testing serverless applications

Mock testing

Emulation testing

Testing on AWS

Summary

9

Using Containers to Improve Resiliency

Technical requirements

Immutable infrastructure with containers

Concept of immutable infrastructure

Building and managing container images

Deploying containers on AWS

Scaling and load-balancing containerized applications

Horizontal scaling

Vertical scaling

Inter-service communication with containers

Service discovery

Load balancing

Service mesh

Async communications with message brokers

Security considerations for container resilience

Securing container images and registries

Securing container runtimes

Secrets management and encryption

Summary

10

Resilient Architectures Across Regions

Understanding active-passive architectures

Failover mechanisms

Simplified failover with serverless

Exploring global versus regional services

Using CloudFront

Application performance and availability with AWS Global Accelerator

Delving into active-active regional architectures

Load balancing across regions

Data consistency and synchronization

Introducing cell-based architectures

What is a cell?

Advantages of using cells

Considerations when using cells

Summary

Part 3: Validating Your Architecture for Resiliency

11

Examples of Resilient Architecture

Introducing single-Region architecture

Why customers choose single-Region architectures

Different configurations in single-region architecture

Multi-Region architecture deployment

When to utilize a multi-Region architecture

Different multi-Region configurations

Reliability configurations in a multi-Region setup

An example of multi-Region architecture

The limitations of multi-Region architecture

Multi-site architecture deployments

When to utilize multi-site architecture

Reliability configurations in a multi-site configuration

An example of multi-site architecture

The limitations of multi-site architecture

Designing DDoS/security resilient architecture

An example of DDoS/security resilient architecture

What is DDoS, and what are security threats?

What do we mean by DDoS/security resiliency?

Reliability configurations to prevent DDoS/security threats

Summary

12

Observability, Auditing, and Continuous Improvement

Observability is key to resilience

Designing observability for resilience

Steps in designing observability

Observability of common resources

Alerting

AWS observability tooling

Logging key metrics and events

Auditing environments for resilience

Continuous observability improvement

Steps to set up continuous observability

Using third-party observability tools

Summary

Further reading

13

Performing Chaos Engineering Testing

What is chaos engineering?

What are the benefits of chaos engineering?

How does chaos engineering differ from traditional testing?

Stages in chaos engineering

Defining steady state

Hypothesizing behavior

Introducing faults

Validating the hypothesis

Improving the system

Chaos engineering guidelines

Summary

14

Disaster Recovery Planning and Testing

Disaster recovery and its significance in cloud computing

Overview of AWS’s disaster recovery features

Different disaster recovery strategies in AWS

Backup and restore

Pilot light

Warm standby

Hot standby

Defining and planning your disaster recovery objectives

Testing disaster recovery plans

Functional testing

Data loss testing

Performance testing

Security testing

Avoiding disaster recovery pitfalls and misconceptions

Pitfalls

Misconceptions

Summary

15

Finalize Building Resilient Architecture Using AWS Resilience Services

Backing up using the AWS Backup service

AWS Backup process

Immutable backups with AWS Backup Vault Lock

Following the resilience lifecycle framework

Why do you need the AWS resilience lifecycle framework?

How does the AWS resilience lifecycle framework work?

Utilizing AWS Resilience Hub

How does AWS Resilience Hub work?

Recovery using AWS DRS

AWS DRS architecture

AWS DRS components

AWS DRS best practices

Advantages of AWS resilience services

Summary

Further reading

Index

Other Books You May Enjoy

Preface

Cloud resilience is a critical aspect of modern IT infrastructure, referring to a system’s ability to withstand, adapt to, and rapidly recover from disruptions while maintaining continuous operations. In today’s digital landscape, where businesses rely heavily on cloud-based services, ensuring resilience is paramount to safeguarding against potential losses in revenue, productivity, and reputation.

Amazon Web Services (AWS) has established itself as a leading cloud service provider, offering a highly resilient infrastructure that sets the industry standard. AWS’s approach to resilience is multifaceted, encompassing both the physical infrastructure and the services it provides.

At the core of AWS’s resilient architecture is its global network of data centers, strategically located in multiple geographic regions worldwide. Each region is further divided into Availability Zones (AZs), which are physically separate data centers with independent power, cooling, and networking. This design inherently provides redundancy and fault tolerance, allowing applications to remain operational even if one or more AZs experience issues.

AWS’s infrastructure is built with redundancy at every level, from networking equipment to storage systems. The systems are designed to automatically detect failures and initiate recovery processes, often without any manual intervention. This self-healing capability minimizes downtime and ensures high availability for customer applications.

Beyond the physical infrastructure, AWS offers a comprehensive suite of services and tools specifically designed to enhance resilience. For instance, AWS Resilience Hub helps customers assess and improve their application resilience by providing recommendations based on AWS best practices. AWS Fault Injection Simulator allows organizations to perform controlled chaos engineering experiments, helping them identify and address potential weaknesses in their systems before they manifest in production.

AWS also provides robust data replication and backup services, enabling customers to implement comprehensive disaster recovery strategies. Services such as Amazon S3 offer 99.999999999% durability, ensuring data remains safe and accessible even in the face of multiple simultaneous failures.

Furthermore, AWS’s commitment to continuous improvement and innovation means it is constantly enhancing its resilience capabilities. It regularly publishes detailed post-mortems of any service disruptions, demonstrating transparency and a commitment to learning from incidents.

By leveraging AWS’s resilient infrastructure and services, organizations can build applications that not only withstand failures but also adapt and scale in response to changing conditions. This level of resilience is crucial in today’s fast-paced, always-on digital economy, where even brief outages can have significant consequences.

Who this book is for

This book is for cloud architects, developers, DevOps/SRE engineers, and executive decision makers, and indeed for anyone in a position to influence decisions about building resilient applications on AWS.

Specifically, these are the personas that this book targets:

Cloud architects: Deep technical experts who make important decisions when designing application and infrastructure architectures. This book will help them learn about the rich services and features on AWS that they can leverage to deliver a fault-tolerant system.
Developers and DevOps/SRE engineers: These personas will learn why it is important to focus on writing efficient application code that takes advantage of the reliable infrastructure offered by AWS, and how they can build automation for continuous management and monitoring of their application performance and health.
Executive decision makers: You will learn about the value of leveraging an advanced, reliable infrastructure that offers limitless possibilities to build workloads, allowing your teams to iterate quickly and add customer value in a highly competitive business environment.

What this book covers

Chapter 1, Understanding Resilience Concepts, introduces the concept of resilience by drawing parallels with the aviation industry. It covers achieving resilient architecture using AWS infrastructure, fault-tolerant design best practices, and the shared responsibility model. You’ll learn about potential failure points and understand why maintaining resilience is an ongoing process crucial for a robust infrastructure.

Chapter 2, Implementing Resilient Compute and Auto Scaling, covers resilient compute and auto scaling solutions on AWS, focusing on failure-resistant system design, redundancy, and fault tolerance. It explores AWS Auto Scaling, cost-saving strategies, and the importance of monitoring. Key topics include multi-Availability Zone deployments, stateless architectures, and extending resilience to containers and serverless architectures.

Chapter 3, Securing and Backing Up Critical Data, covers data security and resilience strategies on AWS. It explores access control, layered backup strategies, multi-Region models for improved availability, automated recovery mechanisms, and disaster recovery best practices. You’ll learn how to design highly resilient and available systems using various AWS services.

Chapter 4, Orchestrating Graceful Degradation, explores the design principle of graceful degradation, explaining why it is critical to prevent your systems from facing catastrophic failures. You will also learn about different strategies to contain outages and how you can streamline the recovery process efficiently.

Chapter 5, Exploring the AWS Shared Responsibility Model, provides an introduction to what shared responsibilities between AWS and customers look like and the different roles these parties play in designing and operating a resilient infrastructure.

Chapter 6, Learning AWS Well-Architected Principles for Resiliency, explores the critical pillars of the AWS Well-Architected Framework through the lens of building resilient cloud solutions. This includes the operational excellence, reliability, security, efficiency, and cost optimization pillars. You’ll learn how to use AWS services to reduce heavy lifting, automate deployments, improve operational procedures, and secure your applications.

Chapter 7, Architecting Fault-Tolerant Applications, discusses architectural patterns and best practices for building fault-tolerant, highly available applications on AWS. You will learn about redundancy, loose coupling, graceful degradation, fault isolation concepts, and how important it is to build the right architecture to take the best from what AWS offers.

Chapter 8, Resiliency Considerations for Serverless Applications, helps you understand the advantages and strategies for building serverless-based applications. The chapter covers their impact on improving resilience. You will learn about idempotency, asynchronous transactions, error handling, and testing and deployment strategies.

Chapter 9, Using Containers to Improve Resiliency, focuses on ways container-based applications help with greater resilience. In this chapter, you will learn how to build, deploy, and operate containers using AWS services. You will gain an understanding of immutable deployments, scaling and security for containers, and specific considerations compared to traditional virtual machines.

Chapter 10, Resilient Architectures Across Regions, gives insights into running applications across multiple regions. It covers active/passive, active/active, and cell-based architectures, with a focus on high availability. You will learn about the pros and cons of each deployment model and what considerations to take for multi-Region deployments.

Chapter 11, Examples of Resilient Architecture, delves into architectural patterns for building resilient systems, ensuring reliability and availability through fault-tolerant designs, leveraging practical examples and real-world scenarios.

Chapter 12, Observability, Auditing, and Continuous Improvement, focuses on designing for observability, covering essential monitoring and auditing techniques to proactively identify issues and maintain system health for resilient applications.

Chapter 13, Performing Chaos Engineering Testing, explores chaos engineering principles, introducing controlled fault injection tests to proactively identify vulnerabilities and validate resilience mechanisms, enabling more robust system development.

Chapter 14, Disaster Recovery Planning and Testing, focuses on Disaster Recovery Planning (DRP), outlining procedures for crafting an effective plan, incorporating testing strategies to identify vulnerabilities, and enabling organizations to recover quickly and maintain business continuity during disruptive events.

Chapter 15, Finalize Building Resilient Architecture Using AWS Resilience Services, delves into AWS Cloud Resilience, exploring tools and capabilities for fault injection, chaos engineering, disaster recovery, and backup solutions to build highly reliable and available applications on the AWS cloud.

To get the most out of this book

Some chapters invite you to follow hands-on instructions. These chapters have a Technical requirements section with more details about the prerequisites. The main requirement is having access to a non-production AWS account to follow along.

Software/hardware covered in the book          Operating system requirements

AWS Command Line Interface (AWS CLI)           Windows, macOS, or Linux

Git                                            Windows, macOS, or Linux

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Building-Resilient-Architectures-on-AWS. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “In this example, RedrivePolicy specifies that messages that fail to be processed after three attempts (maxReceiveCount: 3) will be moved to the my-dead-letter-queue SQS queue.”

A block of code is set as follows:

DeadLetterQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: my-dead-letter-queue

Any command-line input or output is written as follows:

$ aws arc-zonal-shift start-zonal-shift \
    --resource-identifier arn:aws:elasticloadbalancing:... \
    --away-from euw1-az2 \
    --comment "possible issue isolated to AZ2" \
    --expires-in 12h

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Under Metrics, select the RDS namespace.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read Building Resilient Architectures on AWS, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/978-1-83588-710-3

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly

Part 1: Setting the Stage – Learning the Basics of Designing Resilient Architectures

This part emphasizes the importance of resilient architecture design and explores different areas for architects to pay attention to. This part also discusses the responsibilities that the user and AWS share in designing and maintaining resilient environments.

This part has the following chapters:

Chapter 1, Understanding Resilience Concepts
Chapter 2, Implementing Resilient Compute and Auto Scaling
Chapter 3, Securing and Backing Up Critical Data
Chapter 4, Orchestrating Graceful Degradation
Chapter 5, Exploring the AWS Shared Responsibility Model

1

Understanding Resilience Concepts

Building a fully resilient application environment is a daunting task for many of us. Even defining what resilience means varies based on several factors. For example, being able to quickly reboot a standalone server when an application runs out of memory might be considered a resilient environment if that is all you are looking for. However, when the term resilient is brought up, the typical expectation is that we are aiming to build an architecture that is resilient to all internal and external influencing factors and will exhibit high availability, fault tolerance, and scalability under all circumstances.

In this book, we will help provide clarity about resiliency, dive deep into what resilient architecture looks like, and explore the measures you can take to build a resilient design that serves the resilience goals you have set for your applications. The book will primarily focus on services and features available on Amazon Web Services (AWS) and will also leverage best-practice guidelines that AWS has put together for customers in this regard.

In this chapter, we will first try to understand what resilience means. We will attempt to explain the concept through the eyes of the commercial aviation industry where resilience is of utmost importance in every single aspect of the design for building aircraft.

In this chapter, we will cover the following topics:

Demystifying resilience
Facing the cloud’s storms
The continuous resilience journey

Demystifying resilience

Resilient infrastructure architecture is a design that can withstand and recover from disruptions quickly. It is built on the principle of redundancy, with multiple layers of protection that can be activated in the event of a failure. This approach helps to ensure that critical systems are always available, even in the face of major disruptions.

One of the most famous quotes from Werner Vogels (CTO, Amazon) is:

“Everything fails all the time.”

With time, failure is inevitable. No matter how well you design your infrastructure, something is going to fail at some point. The idea behind resiliency is to be intentional about understanding this reality and building systems that are able to recover from failures quickly, without catastrophic business disruptions and the huge financial impact that comes with them.

The concept of resiliency is best understood when we examine the commercial aviation industry. Almost all parts and functions of a commercial aircraft have multiple redundant parts and methods of operation with isolation and fault tolerance built in. Whether it is hydraulics, avionics, power generation, fuel control, or the physical strengthening of the fuselage, multiple layers of redundancies are in place to prevent single-point failures from causing catastrophic accidents.

Modern aircraft are very big and they heavily rely on hydraulic systems to operate different mechanical parts of the plane, such as the ailerons, elevator, rudder, slats, and flaps that are required to control the aircraft’s balance, elevation, yaw, lift, and so on. They also have electronic flight instrumentation systems, which are part of the critical components in flying commercial aircraft. These critical systems are mainly powered using the plane’s engines and the expectation is that there will never be a situation where these systems do not receive the required electrical power to function.

Aircraft generally have primary and secondary mechanisms for all functionality, as well as a backup mechanism that is used only in situations where both the primary and secondary mechanisms have failed to provide the required functionality.

One of the most famous incidents in aviation is the Gimli Glider story (https://simpleflying.com/gimli-glider/). On July 23, 1983, a Boeing 767 cruising at an altitude of 35,000 ft over Canada ran out of fuel completely due to a misunderstanding about how to calculate fuel for the aircraft at a time when Canada was converting from the imperial system to the metric system. The pilots were able to glide the plane to an emergency landing at Gimli, a former military base.

The flight crew used the old imperial system and took on far less fuel than was necessary to complete the flight. As the plane ran out of fuel completely, the engines, which are the primary source of power, shut down, and the secondary power source, the Auxiliary Power Unit (APU), also did not function. As this happened, the plane’s electronic flight instrumentation system stopped working due to the lack of power, and the pilots were left with only a few basic battery-powered emergency flight instruments. The 767 was the first twin-engine wide-bodied plane built by Boeing, and due to its large control surfaces, it was set up with hydraulic systems to fly the plane, because muscle power alone is not feasible for operating an aircraft of that size. Without power, the hydraulic systems were not operational either, which meant that the pilots had to resort to the secondary option of using muscle power to control the plane unless they somehow found a way to power the hydraulic systems.

Therefore, the pilots deployed the backup power source called the Ram Air Turbine (RAT), a small wind turbine fan that can generate power from the airstream created by the speed of the aircraft. With the help of the power generated from this backup source, the pilots were able to use the hydraulics and eventually glide the plane to a successful landing on a disused runway at the former Royal Canadian Air Force (RCAF) station at Gimli, Manitoba, without any lives being lost. The aircraft sustained non-critical damage to the front portion of the fuselage because the nose wheel did not lock in position on touchdown.

The aircraft was subsequently repaired and placed back into service until its retirement in 2008. This example illustrates how redundant, fault-tolerant systems can help mitigate catastrophic disasters by providing alternative options when challenging situations arise.

In this section, we will talk about what cloud resilience is and how you can go about building resilient architectures in the cloud. We will explore a variety of technical areas to consider, learn how we can work coherently with the Cloud Service Providers (CSPs), and leverage the underlying infrastructure in building resilient systems.

Cloud resilience

Cloud resilience, a critical facet of contemporary cloud computing architectures, is central to ensuring the reliability and availability of cloud-based applications and services. In cloud computing, resilience denotes the capacity of a system to withstand and recuperate from failures, disruptions, or unforeseen events without compromising its functionality or performance.

Mission-critical systems are required to be resilient to both external and internal factors. Just hosting your applications on a well-established CSP does not automatically provide resiliency. It is essential to plan for different aspects of infrastructure and software design to ensure your application can withstand unexpected turmoil and disruptions.

In traditional infrastructure, such as self-hosted server environments, private data centers, or colocation hosting environments, you were fully responsible for building resilience at all layers of the infrastructure. This includes procuring the right hardware; ensuring an uninterrupted power supply, sufficient cooling, reliable high-speed network connectivity, and strong physical security; and continuously monitoring for hardware failures and replacements.

In public cloud environments, a great level of resilience is built in by default, simply because public CSPs such as AWS, Microsoft Azure, and Google Cloud Platform operate at a very large scale. Their highly efficient multi-tenant environments allow them to provide infrastructure services to customers that operate workloads at internet scale across a variety of use cases, such as financial services, healthcare, media streaming, and artificial intelligence.

These companies employ highly skilled technical personnel, perform regular maintenance of equipment, replenish hardware frequently, procure high-quality materials and parts, and secure their premises with world-class security protocols. This level of investment can be unattainable for companies whose core business is not operating a modern and efficient large-scale data center. Cloud services differ in the mechanisms they provide for handling resiliency, and customers play an active role in establishing resiliency for the workloads they host on the cloud. While each cloud operator has its own methodology for this, AWS uses a shared responsibility model. We will mainly focus on AWS’s approach, in line with the goals of this book.

Shared Responsibility Model

In cloud environments, resilience is a shared responsibility between the service provider and the customer using the environment.

The CSP is responsible for the resilience of the underlying hardware, network equipment, power supply, network connectivity, air conditioning, physical security, and so on. The customer is responsible for designing the application architecture in such a way that the application is resilient to other unforeseen disruptions, such as a cyberattack affecting application performance, database performance and stability issues, API design failures under load-balancing scenarios, sudden increases in traffic, or failures occurring in the underlying infrastructure that were not mitigated by the safeguards put forth by the CSP.

The following diagram shows the different responsibility areas that the customer and the CSP own:

Figure 1.1 – Customer and CSP responsibility matrix

It is important to note that the line between the CSP and customer responsibilities moves depending on the service. For fully managed services, which are also called Software as a Service (SaaS), scaling and performance fall into the responsibility of the CSP, and for self-managed services that are hosted on Infrastructure as a Service (IaaS) environments, scaling and performance fall into the responsibility of the customer. You will learn about the customer versus CSP responsibilities in detail in Chapter 5.

In order to guide customers to follow best practices in using the cloud platform, AWS has devised a robust framework called the AWS Well-Architected Framework, which covers a range of topics that offer prescriptive guidance. The AWS Well-Architected Framework describes key concepts, design principles, and architectural best practices for designing and running workloads in the cloud. It provides a consistent approach to evaluating and improving your cloud architecture across the following six key pillars:

Operational excellence: https://docs.aws.amazon.com/wellarchitected/latest/framework/operational-excellence.html
Security: https://docs.aws.amazon.com/wellarchitected/latest/framework/security.html
Reliability: https://docs.aws.amazon.com/wellarchitected/latest/framework/reliability.html
Performance efficiency: https://docs.aws.amazon.com/wellarchitected/latest/framework/performance-efficiency.html
Cost optimization: https://docs.aws.amazon.com/wellarchitected/latest/framework/cost-optimization.html
Sustainability: https://docs.aws.amazon.com/wellarchitected/latest/framework/sustainability.html

While all six pillars are critical and need to be given detailed consideration when designing applications, the Reliability and Operational excellence pillars cover plenty of ground specifically about resiliency. One of the important takeaways is that while AWS is responsible for the resiliency of the cloud, the customer is responsible for resiliency in the cloud. What this means is that if you are using AWS to run your workload, it is your responsibility to ensure that the design put forth has the necessary redundancy and guardrails in place to be resilient, guided by the Well-Architected Framework.

Why resiliency?

Do we really need to build resilient applications? What exactly are we trying to achieve by building highly resilient infrastructure and applications? These are very important questions that need to be answered well before you sit down and design an application architecture.

The reality is that you may not really need to build an application that is always available and scales infinitely. While we will discuss this topic in detail in the next chapter, I want you to understand that adding redundancy will incur higher costs. There are no two ways about this, as compute, storage, and network contribute to additional costs based on usage. The question is, what trade-offs are you ready to make?

Do you want to reduce your costs in the short term and risk customer satisfaction, or take a long-term view and prioritize customer experience? The answer will vary based on the application’s business use case in question.

In today’s environment, whether you are building an internal-facing business application for a large enterprise or you work at a start-up building a new mobile app for a young audience using cutting-edge technologies, the expectations are the same in terms of application performance, usability, availability, and overall experience. End users are accustomed to the experiences on their mobile devices, which typically provide instant responses, pretty graphics, and overall delight. When the same users log on to their business applications, they bring the same expectations from their personal experience. An unreliable application will simply not sell and will result in poor customer satisfaction rates, business loss, and reduced user productivity, which can impact business revenue.

At this point, it is a good idea to summarize what we have learned so far. We now have an idea of what cloud resilience means, and through the AWS Shared Responsibility Model, we understood which areas users of the AWS cloud platform are responsible for when building resilient workloads. We also got insights into how redundancy plays an important role in improving resiliency for workloads.

In the following section, we will look at other foundational principles that are critical to establishing and maintaining a resilient infrastructure.

Resilient foundations

AWS offers a variety of levers for you to pull in order to design resilient applications. In fact, building a fully resilient application that can withstand all imaginable disaster scenarios is somewhat unattainable in most cases. The emphasis should be on building strong mitigation tactics that will help with recovering from disasters quickly.

The following diagram shows the Swiss cheese model. This is a well-known model that demonstrates that there can be unique, unknown situations that can pass through a series of barriers that were put in place to prevent them from occurring.

Figure 1.2 – Swiss cheese model showing the possibility of a disaster striking despite taking preventive measures

Resilient architectures should consider the following important factors:

Redundancy: This is about provisioning parallel infrastructure in addition to what is minimally required so that a single point of failure is avoided.
Fault tolerance: This is the ability of a system to recover from a system failure. This is commonly achieved through system mirroring, application logic, and configuration.
Isolation: This is about choosing to build the application on physically isolated infrastructure in order to prevent an issue from cascading into affecting the entire application.
Automation: This is the most important piece of the resiliency puzzle. Anything that can be automated should be automated. Whether it is infrastructure provisioning, application deployment, or auto scaling, automation is the key to recovering from inevitable disaster situations quickly and getting back to normalcy.
Monitoring: This is about constantly being in the know of system performance in order to identify anomalies and take necessary actions. To build resilient applications, you need to employ a strong monitoring solution combined with a disciplined process.
Security: This is about designing the infrastructure and applications by following industry-standard security best practices to ensure that cyberattacks and code vulnerabilities do not cause disruptions.

Recovering from disasters quickly is a key element of building resilient architectures. This means capturing key metrics such as the following:

Mean Time to Recovery (MTTR), which is a calculation of the amount of time needed to recover from a failure after it happens.
Recovery Time Objective (RTO), which is the maximum amount of time that a business can afford to be without a critical service or system before it starts to incur unacceptable losses.
Recovery Point Objective (RPO), which is the maximum tolerable amount of data loss that an organization is willing to accept after a disaster.
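To make these definitions concrete, here is a minimal sketch in Python that derives MTTR from a set of incident timestamps and checks it against an RTO; the incident data and the one-hour RTO are purely illustrative:

from datetime import datetime, timedelta

# Hypothetical incident log: (failure detected, service restored) pairs
incidents = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 42)),
    (datetime(2024, 5, 17, 22, 10), datetime(2024, 5, 17, 23, 5)),
]

# MTTR is the average time taken to recover across incidents
durations = [restored - detected for detected, restored in incidents]
mttr = sum(durations, timedelta()) / len(durations)
print(f"MTTR: {mttr}")  # 0:48:30

# A design meets its recovery objective when MTTR stays within the RTO;
# similarly, the interval between backups should stay within the RPO
rto = timedelta(hours=1)
print(f"Within RTO: {mttr <= rto}")  # True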

These metrics are critical to making scientific, informed decisions about building resilient architectures. We will be using the Gimli Glider incident along with aircraft design principles to dive deeper into the preceding factors.

Building redundancy in the infrastructure

As discussed earlier, building redundant architecture does involve higher costs, and the costs can vary depending on the level of resiliency you want to build into the application. For example, most business-critical applications that need higher availability can deploy an architecture with multiple redundant layers in order to provide a higher Service-Level Agreement (SLA), which is measured in percentages. An SLA is a contract between a service provider and a customer that defines the level of service that the customer can expect. It is calculated as the percentage of time a service is available to users: the total amount of time the service was available divided by the total amount of time the service should have been available. An application that claims a 99.99% SLA is expected to have a downtime of only 4 minutes and 23 seconds per month, or 52 minutes and 36 seconds per year. Learn more about SLAs and how they are calculated from the official AWS documentation here: https://aws.amazon.com/what-is/service-level-agreement/.
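As a quick back-of-the-envelope check of those numbers, the downtime budget implied by an availability percentage can be computed directly:

def downtime_budget_minutes(sla_percent: float, period_minutes: float) -> float:
    # The allowed downtime is the fraction of the period not covered by the SLA
    return (1 - sla_percent / 100) * period_minutes

MINUTES_PER_YEAR = 365.25 * 24 * 60
print(downtime_budget_minutes(99.99, MINUTES_PER_YEAR))       # ~52.6 minutes per year
print(downtime_budget_minutes(99.99, MINUTES_PER_YEAR / 12))  # ~4.4 minutes per month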

In order to ensure you maintain the committed SLA, you need to ensure you continuously track the Service-Level Indicators (SLIs) that inform you about the SLA performance. SLIs are essentially metrics that measure the performance of a service. They are used to track the reliability, availability, and performance of a service, and to identify areas where improvements can be made.

SLIs are typically defined in terms of the following characteristics:

Metric: The specific metric that is being measured
Unit: The unit of measurement for the metric
Threshold: The acceptable value for the metric
Frequency: The frequency with which the metric is measured

SLIs are used in conjunction with Service-Level Objectives (SLOs) to define the level of service that a customer can expect. SLOs are specific, measurable objectives that define the expected service level, such as availability, response time, or error rate.

SLOs are typically defined in terms of the following characteristics:

Target: The desired value for the metric
Tolerance: The acceptable range of values for the metric

SLIs and SLOs are used together to create an SLA. Here are some examples of SLIs:

Availability: The percentage of time that a service is available to users
Latency: The average time it takes for a request to be processed by a service
Throughput: The number of requests that a service can process per second
Error rate: The percentage of requests that result in an error

SLIs are an important tool for managing the performance of a service. They can be used to identify areas where improvements can be made, and also to track the progress of those improvements.
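For example, an availability SLI and an error-rate SLI can be derived from raw request counts and compared against an SLO target; the counts and the 99.95% target below are illustrative:

# Illustrative request counts for a measurement window
total_requests = 1_200_000
failed_requests = 240

availability_sli = 100 * (total_requests - failed_requests) / total_requests
error_rate_sli = 100 * failed_requests / total_requests
print(f"Availability: {availability_sli:.3f}%")  # 99.980%
print(f"Error rate: {error_rate_sli:.3f}%")      # 0.020%

# The SLO defines the target that the SLI is measured against
slo_target = 99.95
print(f"SLO met: {availability_sli >= slo_target}")  # True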

Going back to the Gimli Glider example, the only reason a large wide-bodied plane such as the Boeing 767 could land successfully that way was the redundancy built into its systems. It had several alternate sources to power the airplane’s hydraulic systems. The pilots also had access to alternate manual controls to operate the plane even in the absence of the hydraulic systems, though doing so would have been a very tiring process.

Using fault tolerance to build resiliency

To enable your systems to handle failures, you need to include mechanisms for the system to recover from issues without much delay. This can either be through directing operations to a redundant system or simply by rebooting automatically without causing cascading effects.

Deploying application servers behind a load balancer, creating logic in applications to use alternate paths, and using configurations to change functionality are some of the strategies typically deployed to improve fault tolerance.

You can relate this to how the Boeing 767 plane continued flying despite technical components failing unexpectedly. This is one of the key contributing factors to why the Gimli Glider incident did not cost any lives.

Decoupling systems through isolation principles

One of the critical components of building reliable systems is to identify dependencies between different systems and introduce guardrails so that the blast radius is as small as possible. The Twelve-Factor App principles put forth by Heroku (visit this page to learn all about them: https://12factor.net/) are a set of ideas that help design microservice-based architectures, and they encourage isolation as one of the key principles.

Applications that are loosely coupled are not only easy to scale horizontally but also do not cause upstream applications to fail when they are facing downtime. In an ideal world, an application or microservice that fails should not cause any problems to either downstream or upstream systems. Take a look at the following figure. Microservices A and B are deployed in isolated environments where they do not share tight dependencies, so one’s failure does not impact the other.

Figure 1.3 – Reduced blast radius due to loosely coupled architecture design pattern

The failure of microservice B is contained within its own operational zone and its outage doesn’t impact the functions of microservice A. Microservice A continues to function and sends the messages to an alternate destination sensing microservice B’s outage. This exemplifies both isolation and fault tolerance principles in action to provide higher resilience.
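The following Python sketch shows one way microservice A could implement that behavior; the endpoint, queue URL, and the use of Amazon SQS as the alternate destination are illustrative assumptions, not a prescribed implementation:

import json

import boto3
import requests

sqs = boto3.client("sqs")
# Hypothetical buffer queue that holds messages while microservice B is down
FALLBACK_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/service-b-buffer"

def send_to_service_b(payload: dict) -> None:
    """Call microservice B directly; if B is unreachable, route the message
    to an alternate destination so microservice A keeps functioning."""
    try:
        response = requests.post(
            "https://service-b.internal/api/messages",  # hypothetical endpoint
            json=payload,
            timeout=2,
        )
        response.raise_for_status()
    except requests.RequestException:
        # Contain the failure: buffer the message for B to process on recovery
        sqs.send_message(QueueUrl=FALLBACK_QUEUE_URL, MessageBody=json.dumps(payload))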

Connecting this back to the Gimli Glider incident, commercial airplanes are designed in such a way that a failure in one part of the plane does not cause a cascading effect on other parts, resulting in a total failure. In that plane, despite the engines stopping due to a lack of fuel and the APU failing as well, the RAT kicked in to provide the required power to key systems, which the pilots used to navigate the plane to a safe landing.

Using automation to improve disaster recovery time

An important part of having a better RTO is automating the entire application development and deployment process, including infrastructure creation, upgrades, and auto scaling. The shift-left movement is aimed toward automating everything that can possibly be automated. AWS offers a rich set of options to automate your infrastructure and applications regardless of the compute service being used.

One of the simplest forms of automation you can think about is setting up AWS Auto Scaling, which supports Amazon EC2, Amazon ECS, Amazon DynamoDB, and Amazon Aurora, as of early 2024. This easy-to-use feature scales your resources on demand based on specific parameters so that your users are not hit with application performance degradation.
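
For example, the following boto3 sketch registers a hypothetical DynamoDB table’s read capacity with Application Auto Scaling and attaches a target-tracking policy; the table name and capacity bounds are illustrative:

import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the read capacity of a hypothetical table as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/Orders",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=5,
    MaxCapacity=500,
)

# Scale to keep consumed read capacity near 70% of provisioned capacity.
autoscaling.put_scaling_policy(
    PolicyName="orders-read-target-tracking",
    ServiceNamespace="dynamodb",
    ResourceId="table/Orders",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
)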

It is recommended to programmatically configure and deploy infrastructure using AWS Cloud Development Kit (CDK) or AWS CloudFormation. Using AWS CodePipeline can help you create a CI/CD (short for continuous integration and continuous deployment) process for infrastructure using Git-based source control and configurable deployment processes. This process is often referred to as infrastructure as code (IaC).
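
As a minimal IaC sketch, the following CDK (Python) app declares a stack containing a single, hypothetical SQS queue; any AWS resource can be declared the same way:

from aws_cdk import App, Stack, aws_sqs as sqs
from constructs import Construct

class ResilientAppStack(Stack):
    """Infrastructure declared as code: versioned, reviewable, repeatable."""
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # A hypothetical queue; swap in whatever resources your workload needs.
        sqs.Queue(self, "OrdersQueue", queue_name="orders-queue")

app = App()
ResilientAppStack(app, "ResilientAppStack")
app.synth()

Checked into Git and deployed through a pipeline that runs cdk deploy, such a stack gives you a reviewable, repeatable deployment process.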

You can easily relate this to the Gimli Glider incident. In high-stress scenarios, pilots should adhere to the fundamental principles of aviate, navigate, and communicate, in that order. The Boeing 767 aircraft did not wait for manual pilot input to deploy the RAM turbine, which provided power to the aircraft. The designers of the aircraft were fully aware that pilots in command could encounter stressful circumstances. Thus, it was critical to implement automation to help alleviate pilot stress and facilitate a timely resolution of the situation. Similarly, as architects and developers of software systems, we should ensure that automation is employed wherever necessary, such as auto scaling, deployment, and alerting, to make managing large-scale systems easy.

Monitoring systems to track system health continuously

Setting up proper monitoring of the environment is critical to providing resiliency. To address a problem, you first need to understand what normal looks like. Continuously collecting signals such as metrics, logs, and traces from the environment gives you information about the performance and health of your infrastructure and workloads.

Create dashboards for various applications and infrastructure to visualize this information, and set up alerts to notify you when something goes wrong. Amazon CloudWatch offers rich dashboarding and alerting features to satisfy these needs. Using CloudWatch Anomaly Detection, you can configure alarms that trigger only when a metric deviates from a dynamically calculated band, without having to set a hard threshold yourself.
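
For instance, the following boto3 sketch creates such an anomaly detection alarm on a load balancer latency metric; the alarm name, dimension value, and band width are illustrative:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="orders-latency-anomaly",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="band",
    Metrics=[
        {
            "Id": "latency",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": "TargetResponseTime",
                    # Hypothetical load balancer identifier.
                    "Dimensions": [
                        {"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}
                    ],
                },
                "Period": 300,
                "Stat": "Average",
            },
            "ReturnData": True,
        },
        {
            # A band two standard deviations wide around the expected value.
            "Id": "band",
            "Expression": "ANOMALY_DETECTION_BAND(latency, 2)",
            "ReturnData": True,
        },
    ],
)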

Monitoring is often an afterthought, not given the required attention when new workloads are being designed or when workloads are migrated from one environment to another. This leads to several problems during the operational phase, such as lack of visibility, higher costs, increased application downtime, and overall operational inefficiency.

A monitoring strategy should be set alongside the architecture design, with input from business and technical teams. A committed SLA should be one of the main drivers in deciding which signals to collect, what information to visualize on dashboards, how to set up alerts for anomalies, and so on.

Anchoring this back to the Gimli Glider incident, imagine the pilots not having visibility into the key metrics of the plane while trying to fly in this stressful situation. The plane designers designed the systems so that the pilots could get continuous insights into the key details required to aid situational awareness. As soon as power came back on, the cockpit displays were one of the first systems to come to life.

So far, we have explored the concept of resilience and its relevance in infrastructure design. We have delved into the significance of redundancy and fault tolerance in ensuring that systems can withstand and recover from disruptions, and discussed the shared responsibility model within AWS cloud environments, underscoring the joint responsibilities of customers and AWS in achieving resilience. We have also examined the relevance of resilient applications in today’s digital landscape, the critical factors to consider when designing resilient architectures, and key tracking mechanisms such as MTTR, RTO, and RPO, which are instrumental in measuring and enhancing resilience.

In the subsequent section, we will examine several noteworthy design considerations applicable to the construction of highly available applications on AWS. We will explore methods to effectively utilize some of the inherent infrastructure capabilities within the AWS environment in architectural design. Furthermore, we will address several frequently overlooked principles that are crucial for maintaining a dependable environment.

Facing the cloud’s storms

While AWS offers an environment generally regarded as more highly available, secure, and scalable than typical on-premises environments, it is up to you as the builder to design your architecture so that it effectively uses the necessary features to withstand unforeseen incidents affecting your application environment.

AWS offers a secure, scalable, and resilient environment for you to run your workloads. AWS has Regions all over the globe, with each Region containing multiple Availability Zones (AZs). An AWS Region is a distinct geographical area where AWS clusters its data centers. Each Region is completely independent and isolated from other Regions, providing fault tolerance and high availability, and Regions allow you to deploy your applications and store your data closer to your users for reduced latency and improved performance.

Within a Region, an AZ is an isolated location shielded from failures in other AZs, with low-cost, low-latency network connectivity to the other AZs in the same Region. Each AZ contains at least one data center with redundant power, cooling, networking, and connectivity, housed in separate physical facilities. Because each AZ operates independently, a failure in one does not affect the others. These zones are designed to support scalable, fault-tolerant production applications and databases, which would be difficult to achieve with a single data center. Customers can use multiple AZs within one Region to increase redundancy and reliability: when an entire AZ goes down, workloads can be failed over to another AZ in the same Region, a capability known as multi-AZ redundancy.

The following diagram shows how AWS Regions have redundancy and isolation in place to support high resilience and availability. If there happens to be an outage in an AZ, which in itself is very rare, it doesn’t affect other AZs in the Region.

Figure 1.4 – Isolated AZs within an AWS Region

Just like any other data center operation, AWS is not exempt from hardware, network, or power failures. However, AWS overcomes such issues through proper monitoring and automation practices that address problems quickly, so customers do not experience performance degradation and problems do not cascade on a large scale.

At this point, it is worth remembering the Shared Responsibility Model we discussed earlier in the chapter. As you can see here, AWS offers a very robust infrastructure for you to host your workloads. AWS takes ownership of keeping the underlying hardware and infrastructure healthy, performant, and secure. It is the customers’ responsibility to design applications that leverage the infrastructure in a way that allows them to enjoy maximum benefits.

While AWS takes great care to keep the infrastructure healthy, there can be situations where a software bug, unforeseen hardware failure, or natural disaster impacts service availability. Due to the built-in redundancy and isolation models, such an impact is mostly restricted to a small blast radius and rarely affects an entire AZ or Region.

However, as users of AWS, it is always recommended to design applications that do not rely solely on resources available within a single AZ.

For example, consider a scenario where a data center inside an AZ goes down due to a natural disaster, such as a fire, flood, or earthquake. The application making use of the underlying infrastructure will also go down along with it. This is why it is important to design applications that leverage more than one AZ.

While we will learn more about this topic in the upcoming chapters, designing applications that leverage multiple AZs starts with creating Virtual Private Cloud (VPC) networks that have subnets spread across multiple AZs. The higher the number of AZs used, the better the availability is.

Here are some additional details about how to design applications that leverage multiple AZs:

- Use a load balancer to distribute traffic across multiple AZs, so that if one AZ goes down, traffic is still routed to the others.
- Store your data in multiple AZs, so that if one AZ goes down, your data remains available in the others.
- Use a highly available database, that is, one designed to stay up and running even if one or more of its components fail.
- Test your application for availability, verifying that it can withstand the failure of one or more AZs.

By following these tips, you can design applications that are more resilient to failures and that can provide better availability for your users.
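
As a minimal sketch of the first two tips, the following CDK (Python) code creates a VPC whose subnets span three AZs and places an Application Load Balancer across them; the stack and construct names are hypothetical:

from aws_cdk import App, Stack, aws_ec2 as ec2, aws_elasticloadbalancingv2 as elbv2
from constructs import Construct

class MultiAzStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Subnets are created in three AZs; losing one AZ leaves two serving.
        vpc = ec2.Vpc(self, "AppVpc", max_azs=3)
        # The load balancer places a node in each AZ's public subnet.
        elbv2.ApplicationLoadBalancer(
            self, "AppAlb", vpc=vpc, internet_facing=True
        )

app = App()
MultiAzStack(app, "MultiAzStack")
app.synth()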

Software bugs and security threats

One common reason an application is not resilient is performance issues introduced while programming it. The customer using AWS is solely responsible for writing optimal code that delivers a good experience to end users. Following programming best practices that leverage well-known design principles is key to writing good software.

Creating stateless, horizontally scalable microservice applications makes the entire software development lifecycle much easier than managing a large monolithic application. A microservices architecture allows you to achieve better resiliency by providing redundancy, isolation, and easier automation options.

Software developers can write secure code by following best practices and ISO standards. Some of the best practices include the following:

- Using secure coding techniques: This includes using secure coding languages and libraries, avoiding common security vulnerabilities, and implementing security controls such as input validation and output sanitization (see the sketch after this list).
- Testing code for security vulnerabilities: This includes static code analysis, dynamic code analysis, and penetration testing.
- Following ISO standards: ISO 27001, ISO 27002, and ISO 27005 are all international standards that provide guidance on information security management.
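
Here is the sketch referenced above: a minimal Python example of input validation and output sanitization, assuming a hypothetical allow-list rule for usernames:

import html
import re

# Hypothetical validation rule: usernames are 3 to 32 word characters.
USERNAME_PATTERN = re.compile(r"\w{3,32}")

def validate_username(raw: str) -> str:
    """Input validation: reject anything outside the allow-list pattern."""
    if not USERNAME_PATTERN.fullmatch(raw):
        raise ValueError("invalid username")
    return raw

def render_greeting(username: str) -> str:
    """Output sanitization: escape user data before embedding it in HTML."""
    return "<p>Hello, " + html.escape(username) + "!</p>"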

The International Organization for Standardization (ISO) 27000 series of standards comprises ISO 27001, ISO 27002, and ISO 27005. These standards provide a comprehensive framework for Information Security Management Systems (ISMSs) and risk management. Here is a concise overview of each standard:

- ISO 27001: This is an internationally recognized standard that outlines the specifications for an ISMS. It presents a systematic approach to managing sensitive company information, guaranteeing its security. ISO 27001 assists organizations in establishing, implementing, maintaining, and continuously enhancing their ISMS.
- ISO 27002: This standard offers guidance and principles for establishing, implementing, maintaining, and enhancing information security management in an organization. It assists organizations in choosing controls as part of the ISMS implementation process. ISO 27002 is a complementary standard that facilitates the implementation of ISO 27001.
- ISO 27005: This international standard guides organizations in conducting information security risk assessments in alignment with ISO 27001 requirements. Applicable to organizations of any size or industry, it aims to facilitate the effective implementation of comprehensive information security measures grounded in a risk management approach.

In summary, ISO 27001 sets the requirements for an ISMS, ISO 27002 provides guidelines for implementing controls within an ISMS, and ISO 27005 outlines the process for conducting information security risk assessments in line with the requirements of ISO 27001. These standards work together to help organizations establish and maintain effective information security management and risk assessment processes.

By following these best practices and ISO standards, software developers can help ensure that their code is secure and resistant to attacks that would otherwise negatively impact resiliency.

Using Software Bill of Materials (SBOM) best practices

A Software Bill of Materials (SBOM) is a critical component of software security. A good SBOM should include information about the software components, their versions, and their dependencies. This information can be used to identify security vulnerabilities and risks and to track the provenance of software.

Following these best practices for creating SBOMs can reduce the attack surface:

- Use a standardized format for SBOMs. This will make it easier to share and use SBOMs (a minimal example follows this list).
- Include as much information as possible in the SBOM. This includes information about the software components, their versions, their dependencies, and any known security vulnerabilities.
- Keep the SBOM up to date. This is important to ensure that the SBOM reflects the latest information about the software components.
- Use SBOMs to identify security vulnerabilities and risks. This information can be used to prioritize security remediation efforts.
- Track the provenance of software. This information can be used to identify and mitigate software supply chain risks.
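
Here is the minimal example referenced in the list above: a Python sketch that writes an SBOM in the CycloneDX-style JSON format; the component names and versions are hypothetical:

import json

# A minimal, illustrative SBOM in the CycloneDX-style JSON format.
# Component names and versions are hypothetical.
sbom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "components": [
        {"type": "library", "name": "requests", "version": "2.31.0"},
        {"type": "library", "name": "urllib3", "version": "2.2.1"},
    ],
}

with open("sbom.json", "w") as f:
    json.dump(sbom, f, indent=2)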