Becoming a Rockstar SRE - Jeremy Proffitt - E-Book

Description

Site reliability engineering is all about continuous improvement, finding the balance between business and product demands while working within technological limitations to drive higher revenue. But quantifying and understanding reliability, handling resources, and meeting developer requirements can sometimes be overwhelming. With a focus on reliability from an infrastructure and coding perspective, Becoming a Rockstar SRE brings forth the site reliability engineer (SRE) persona using real-world examples.
This book will acquaint you with the role of an SRE, followed by the why and how of site reliability engineering. It walks you through the jobs of an SRE, from the automation of CI/CD pipelines and reducing toil to reliability best practices. You’ll learn what creates bad code and how to circumvent it with reliable design and patterns. The book also guides you through interacting and negotiating with businesses and vendors on various technical matters and exploring observability, outages, and why and how to craft an excellent runbook. Finally, you’ll learn how to elevate your site reliability engineering career, including certifications and interview tips and questions.
By the end of this book, you’ll be able to identify and measure reliability, reduce downtime, troubleshoot outages, and enhance productivity to become a true rockstar SRE!

The e-book can be read in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 630

Year of publication: 2023


Becoming a Rockstar SRE

Electrify your site reliability engineering mindset to build reliable, resilient, and efficient systems

Jeremy Proffitt

Rod Anami

BIRMINGHAM—MUMBAI

Becoming a Rockstar SRE

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Mohd Riyan Khan

Publishing Product Manager: Surbhi Suman

Senior Editor: Romy Dias

Technical Editor: Shruthi Shetty

Copy Editor: Safis Editing

Project Coordinator: Ashwin Kharwa

Proofreader: Safis Editing

Indexer: Tejal Daruwale Soni

Production Designer: Alishon Mendonca

Marketing Coordinator: Agnes D’souza

First published: March 2023

Production reference: 1290323

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80323-922-4

www.packtpub.com

For my wonderful wife, who still likes me after 18 years. I like you too.

– Jeremy Proffitt

To my God, wife Tati, and son Gabe.

– Rod Anami

Contributors

About the authors

Jeremy Proffitt (born January 1977) is obsessed with constantly improving systems and solving problems with an unmatched sense of urgency – the definition of a Site Reliability Engineer (SRE). A master of solutions and technological knowledge, Jeremy is a rockstar SRE with AWS professional certifications in Architecture and DevOps – and has routinely saved millions in potential lost revenue in his career. In his free time, Jeremy enjoys spending time in his rockstar-appropriate technology cave and loves venturing into 3D printing, electronics, and Internet of Things (IoT) projects. By day, Jeremy currently manages a team of top SRE and DevOps talent driving constant improvement and is often cited in the company as a visionary in terms of observability and emergency response.

To the leaders who have helped me see the truth in our work and friends who have stood by and given me the encouragement to follow the wonders of technology, often while in awe of their own work, I say thank you! To my arch-enemies, you have been a wonderful addition that has always challenged me to become better. And finally, to my wife, Jamie, who I still desperately love after 18 years – and mind you, still likes me – I still remember our first date when you took my arm, you stole my heart, and in all our years, I’ve never felt you let go once.

Rod Anami is a seasoned engineer who works with cloud infrastructure and software engineering technologies. As one of the SREs at the Kyndryl CoE, he coaches other SREs on running IT modernization, transformation, and automation projects for clients worldwide. Rod leads the global SRE guild inside Kyndryl, where he helps plant and grow SRE chapters in many countries. Rod is certified as an SRE, technical specialist, and DevOps engineer professional at the ultimate level. He holds AWS, HashiCorp, Azure, and Kubernetes certifications, among many others. He is passionate about contributing to open source software at large with Node.js libraries.

I want to thank my wonderful wife, Tatiana, and my beloved son Gabriel, for giving me the space and support needed to write this book. I also thank my parents, Shizuo and Rita, for raising me with solid character. The Google site reliability engineering organization made this fantastic approach and profession open source. I want to thank Kyndryl for backing me on this journey. I had many bosses and leaders, good, bad, and inspiring ones. I want to mention a few who impacted my career immensely by helping me acquire the skills and knowledge for this book: Marcos Cimmino, Tara Sims, Andy Barnes, and Gene Brown. Nothing great is accomplished alone: it requires effort, endurance, enjoyment, colleagues, and God.

About the reviewers

Chris Smith is a strategic IT leader with a proven track record across the financial service industry. His passion is to lead organization-wide transformational efforts for Fortune 500 institutions within digital and contact center technology and operations. He is skilled at driving agile adoption, building an engineering-first mindset, and facilitating cloud modernization of core banking services at scale.

Itohanoghosa Eregie is the founder of techinanutshellhack, a platform on LinkedIn dedicated to explaining cloud and SRE concepts in their simplest form through short video clips. She worked as a software developer at Cyberspace Limited before finding her passion as a platform engineer, which earned her an opportunity to work with Dell EMC as a resident platform engineer for one of Africa’s largest telecommunications companies, MTN Nigeria. Altoros Americas currently employs her as a VMware Tanzu engineer, involved in customer engagement. Itohan is passionate about building resilient systems in the cloud and ensuring organizations adhere to SRE practices.

Brannen Taylor has almost 30 years of experience in corporate IT from the healthcare, managed services, power, hosted DR, and financial services industries. He has worked with small “mom-and-pop” operations up to ITIL-heavy Fortune 10 companies. He was a network engineer for 20 years and has been a network operations manager for the past 2 years. He has certifications from many vendors such as Nortel, Cisco, and Palo Alto, as well as a few that are vendor-agnostic, many cloud certifications from AWS and Azure, and is now moving into Network DevOps (NetDevOps), focusing on Nautobot, Ansible, and various vendor SDKs. He enjoys scuba diving with his wife and friends and has two grown children.

I would like to thank God for leading me into a career that I love. I want to thank my children for only eye-rolling me a little when I launch into an explanation about binary when they ask me how email works. I want to thank my wife Lara for putting up with me being on call these past 23 years, working unexpectedly long days, nights, and weekends, and non-stop studying. Thank you to my colleagues and the friends I’ve made along the way.

Gene Brown is the Vice President and a Distinguished Engineer at Kyndryl. He leads the SRE profession and certification program and is the global site reliability engineering leader. He is responsible for driving the enablement of SREs across Kyndryl’s countries, practices, and strategic markets through a Center of Excellence with SRE chapter leaders across the services organization globally.

Gene enjoys spending time with clients interested in adopting SRE and likes comparing notes on what has worked well and how to overcome the challenges that come with cultural change. Gene was the co-founder of IBM’s and Kyndryl’s SRE profession with a focus on certifying SREs based on their applied experience in the field of site reliability engineering.

Table of Contents

Preface

Part 1 - Understanding the Basics of Who, What, and Why

1

SRE Job Role – Activities and Responsibilities

Making this journey personal

SRE driving forces

SRE skills

SRE traits

Understanding the mindset and hobbies of an SRE

SRE affinity game

SRE guiding principles

SRE hobbies

DevOps engineers versus SRE versus others

DevOps and site reliability engineers

Software and site reliability engineers

Describing an SRE’s main responsibilities

An overview of the daily activities of an SRE

People that inspire

Jeremy’s recognition – Paul Tyma, former CTO, LendingTree

Rod’s recognition – Ingo Averdunk, Distinguished Engineer, IBM, and Gene Brown, Distinguished Engineer, Kyndryl

Summary

Further reading

2

Fundamental Numbers – Reliability Statistics

SLA commitment – a conversation, not a number

Internal partner SLAs

External partner SLAs

The cost of more 9s in an SLA

A final word on SLAs

Defining and leveraging SLOs and SLIs

SLOs

SLOs and time

Tracking outage frequency with the MTBF

Measuring the downtime with the MTTR

Understanding the customer and revenue impact

Transparency in outages

The rockstar SRE’s SLA

Summary

3

Imperfect Habits – Duct Tape Architecture and Spaghetti Code

The business of software development – let’s start with the dollars

Defining the “value” of software to a business

The value of protecting business

The value of growing a business

The value of saving labor costs

The A/B testing mindset – the art of change in customer interaction

A/B testing in customer flows

Analyzing the results of A/B testing

Leveraging A/B testing to satisfy quarterly numbers

Dedication to the craft of development – and why some are just here for a job

A quick guide to communicating with your colleagues

Reviewing the merge request – it’s about training, oversight, and reliability

Avoiding the typical rubber stamp mentality

A word on production deployments

Why businesses want us to outright ignore best practices

The truth about the ownership of a developer’s time

Understanding the flaws in how we estimate development cost

Fast, good, cheap – pick one

Why is observability the answer to reliability issues?

The cost of highly available architecture

Mixing good and bad – tricks to wrapping bad code and making it resilient

Alerting that fires actions

Adding additional logging to monitor potential issues

Using try catch to encapsulate exceptions

Retries to the rescue…or not

Summary

Part 2 - Implementing Observability for Site Reliability Engineering

4

Essential Observability – Metrics, Events, Logs, and Traces (MELT)

Technical requirements

Accomplishing systems monitoring and telemetry

Monitoring targets for infrastructure

Monitoring types and tools

Monitoring golden signals

Monitoring data

Understanding APM

Getting to know topology self-discovery, the blast radius, predictability, and correlation

Alerting – the art of doing it quietly

The user perspective notification trigger principle

Event-to-incident mapping principle

Mixing everything into observability

Outages versus downtime

Observability architecture

Observability effectiveness

In practice – applying what you have learned

Lab architecture

Lab contents

Lab instructions

Summary

Further reading

5

Resolution Path – Master Troubleshooting

Properly defining the problem – and what to ask and not ask

Source of information

The knowledge base of the reporter

Naming conventions

False urgency

Executive summary

Breaking down and testing systems

Breaking down hardware versus the operating system

Breaking down a web API

Understanding the steps

The problems with this method of troubleshooting

Previous and common events – checking for the simple problems

Prior Root Cause Analysis (RCA) documents

Timeline analysis

Comparison

The best approach

Effective research both online and among peers

The art of the Google search

Skimming the content quickly and refining it

Never forget your internal resources

Breaking down source code efficiently

Code you’ve never seen

When that fails

Logging plus code

In practice – applying what you’ve learned

Summary

6

Operational Framework – Managing Infrastructure and Systems

Technical requirements

Approaching systems administration as a discipline

Design

Installation

Configuration

App deployment

Management

Upgrade

Uninstallation

Understanding IT service management

ITIL

DevOps

Seeing systems administration as multiple layers and multiple towers

Automating systems provisioning and management

Infrastructure as Code

Immutable infrastructure

In practice – applying what you’ve learned

Lab architecture

Lab contents

Lab instructions

Summary

Further reading

7

Data Consumed – Observability Data Science

Technical requirements

Making data-driven decisions

Defining the question and options

Determining which data to use

Identifying which data is already available

Collecting the missing data

Analyzing all datasets together

Presenting the decision as a record

Documenting the lessons learned in the process

Solving problems through a scientific approach

Formulation

Hypothesis

Prediction

Experiment

Analysis

Understanding the most common statistical methods

Percentages

Mean, average, and standard deviation

Quantiles and percentiles

Histograms

Using other mathematical models in observability

Visualizing histograms with Grafana

In practice – applying what you’ve learned

Lab architecture

Lab contents

Lab instructions

Summary

Further reading

Part 3 - Applying Architecture for Reliability

8

Reliable Architecture – Systems Strategy and Design

Technical requirements

Designing for reliability

Architectural aspects

Reliability equations

Design patterns

Modern applications

Splitting and balancing the workload

Splitting

Balancing

Failing over – almost as good

Scaling up and out – horizontal versus vertical

Horizontal

Vertical

Autoscaling

In practice – applying what you’ve learned

Lab architecture

Lab contents

Lab instructions

Summary

Further reading

9

Valued Automation – Toil Discovery and Elimination

Technical requirements

Eliminating toil

Toil redefined

Why toil is bad

Handling toil the right way

Treating automation as a software problem

Document

Algorithm

Code

Automating the (in)famous CI/CD pipeline

Continuous integration

Continuous delivery

Production releases

In practice – applying what you’ve learned

Lab architecture

Lab contents

Lab instructions

Summary

Further reading

10

Exposing Pipelines – GitOps and Testing Essentials

A basic pipeline – building automation to deploy infrastructure as code architecture and code

Pipelines in chronological order

Pipeline templates

Errors or breaks in pipelines

Using containers in pipelines

Pipeline artifacts

Pipeline troubleshooting tips

Automating compliance and security in pipelines

Library age

Application security testing

Dynamic Application Security Testing (DAST)

Static Application Security Testing (SAST)

Secrets scanning

Automated linting for code quality and standards

Compiling with linting feedback

Validating functionality during deployment with automated testing

Why is testing so important to reliability?

Test data

The types of testing

When to test a pipeline

Testing observability

Automated rollbacks

The reduction of developer toil through automated processes

What is the impact of addressing toil?

In practice – applying what you’ve learned

Preparing AWS for the lab

Creating your repository

Adding secrets to your repository

Downloading and committing the lab files

Understanding the pipeline

Adding more steps

Testing but not deploying

Lab final thoughts

Summary

11

Worker Bees – Orchestrations of Serverless, Containers, and Kubernetes

Technical requirements

The multiple definitions of serverless

Serverless Framework

Serverless computing

Serverless functions

Monitoring serverless functions

Errors

Containers and why we love them

Isolation

Immutability

Promotability

Tagging

Rollbacks

Security

Signable

Monitoring containers

Kubernetes and other ways to orchestrate containers

Health checks

Crashing and force-closing containers

HTTP-based load balancing

Server load balancing

Containers as a Service (CaaS)

Simple container orchestration

Kubernetes

Deployment techniques and workers

Traditional replacement deployment

Rolling deployment

A/B or blue/green deployment

Canary deployment

Automation and rolling back failed deployments

Rollback metrics

When to roll back

How to roll back

In practice – applying what you’ve learned

Leveraging Gitpod – a containerized workspace

The emulation source code

Running the emulation

Summary

12

Final Exam – Tests and Capacity Planning

Technical requirements

Understanding types of testing

Development tests

Build tests

Delivery tests

Deployment tests

Production tests

Adopting TDD

Unit testing the hard way

Unit testing with a framework

Using test automation frameworks

Staying ahead with capacity planning

Load test data

The capacity curve

The demand curve

In practice – applying what you’ve learned

Lab architecture

Lab contents

Lab instructions

Summary

Further reading

Part 4 - Mastering the Outage Moments

13

First Thing – Runbooks and Low Noise Outage Notifications

Technical requirements

What makes a good runbook – the basics

Runbooks as living documents

Understanding the runbook audience knowledge level

Runbook audience permissions

What do you put into a runbook anyway?

Beyond the runbook – code and comments

Quickly understanding source code

Searching source code for your needle in a haystack

Commenting for understanding

What’s in a good dashboard?

Types of dashboards

NOC-style red and green

Displaying trends

Aggregates and breakdowns

What dashboards are not

The basics of priority levels

Response effort

Engineer retention

Incident response systems and priority

Incident response systems and phone-based alerts

What is a priority one event?

Defining priority based on...

The priority level of observability failures

Forcing the priority – the rockstar way!

Adjusting alerts

Logs and alerting

Pausing alerts

In practice – applying what you’ve learned

Defining priority levels

Custom hat pricing API runbook

Alerting

Summary

14

Rapid Response – Outage Management Techniques

Where to meet – an effective strategy for communicating good information

Online collaboration

In-person collaboration

The historical data found in outage responses

Participants

Follow-up work

Leveraging the people involved in the response

Tasks

Participants and personalities

Break strategy and stress management

The opportunity to respond at the right time

Training

Runbook and contact list revisions

Team building

Executive messaging bugs in the ear

Opportunities to call out during the RCA

Messaging customers and leadership

Customer versus leadership messaging

Cadence

Email groups

Status sites

Over-messaging

Notes, notes, notes...

In practice – applying what you’ve learned

Outage and alarm

Notification and response

Troubleshooting

The conclusion

Summary

15

Postmortem Candor – Long-Term Resolution

The content of the postmortem in executive summary style

Executive summary style

Overview

Impact

Timeline

Detailed technical description

Response

Resolution

Future actions

Decisions are not blame

Business is business

Resource and time constraints

Monitoring

The cost of more reliability as a business decision

Active:Active

Manual failover

Cost of time to identify

The cost of time to move a load

Hidden development costs

Training and skill sets – they matter

Identifying gaps

Training and certification targets

Creating future action plans

Immediate follow-up

Who to involve

Timelines and priority

Assigning ownership

Tracking the work

In practice – an example of a postmortem

Writing the overview

Rounding out the postmortem

Custom Hat Company postmortem

Impact

Timeline

Technical details and response

Resolution

Future actions

Summary

Part 5 - Looking into Future Trends and Preparing for SRE Interviews

16

Chaos Injector – Advanced Systems Stability

Technical requirements

Comprehending the wheel-of-misfortune game

All ends are new beginnings

Lessons to be learned

Role-playing scenarios

A little bit of gamification

Understanding chaos engineering for reliability

Principles of chaos engineering

Chaos system architecture

Chaos experiments

In practice – employing the wheel-of-misfortune game

Lab architecture

Lab contents

Lab instructions

In practice – injecting chaos into systems

Lab architecture

Lab contents

Lab instructions

Summary

Further reading

17

Interview Advice – Hiring and Being Hired

What we’re looking for in a candidate

Are you qualified?

Entry-level SRE job

Problem-solving

The ability to accept feedback and direction

A broad knowledge base and skill set

Research and learning skill set

The ability to say “No”

Culture fit

The X factor

Passion

Experience

Personal responsibility

Common interview questions and answers

Technical questions

Non-technical questions

Insightfully odd questions

What should you look for in a career?

Define a good boss

Dotted line reporting

Morals

Researching the company

Business model

Profitability for the next decade

Structure

Large versus small

Public versus private

Online reviews

Are you over- or under-certified?

Certifications that matter

How many are too many certifications?

Relevancy

Tips for landing the job with a great salary

Interview tips

Salary negotiations

Summary

Appendix A – The Site Reliability Engineer Manifesto

The manifesto

How to adopt it

How to contribute to it

Appendix B – The 12-Factor App Questionnaire

The questionnaire

Factor I – Code base

Factor II – Dependencies

Factor III – Config (configuration)

Factor IV – Backing (backend) services

Factor V – Build, release, run

Factor VI – Processes

Factor VII – Port binding

Factor VIII – Concurrency

Factor IX – Disposability

Factor X – Development/production (dev/prod) parity

Factor XI – Logs

Factor XII – Admin processes

How to adopt this questionnaire

How to contribute to this questionnaire

Index

Other Books You May Enjoy

Part 1 - Understanding the Basics of Who, What, and Why

In this first part, you will learn about site reliability engineering, its roots, and its current usage outside Google. We emphasize how the site reliability engineer (SRE) persona is the center of gravity of everything orbiting systems reliability. It’s impossible to talk about site reliability engineering without discussing the business of software development, so we tie the statistics used for reliability to what companies are ultimately interested in: customer satisfaction and revenue. Finally, we’ll explore why the lack of reliability persists in organizations and discuss some of the lesser-known truths that make site reliability engineering critical and complex.

The following chapters will be covered in this section:

Chapter 1, SRE Job Role – Activities and Responsibilities

Chapter 2, Fundamental Numbers – Reliability Statistics

Chapter 3, Imperfect Habits – Duct Tape Architecture and Spaghetti Code