This hands-on survival manual will give you the tools to confidently prepare for and respond to a system outage.
Key Features
Book Description
Real-World SRE is the go-to survival guide for the software developer in the middle of catastrophic website failure. Site Reliability Engineering (SRE) has emerged on the frontline as businesses strive to maximize uptime. This book is a step-by-step framework to follow when your website is down and the countdown is on to fix it.
Nat Welch has battle-hardened experience in reliability engineering at some of the biggest outage-sensitive companies on the internet. Arm yourself with his tried-and-tested methods for monitoring modern web services, setting up alerts, and evaluating your incident response.
Real-World SRE goes beyond just reacting to disaster—uncover the tools and strategies needed to safely test and release software, plan for long-term growth, and foresee future bottlenecks. Real-World SRE gives you the capability to set up your own robust plan of action to see you through a company-wide website crisis.
The final chapter of Real-World SRE is dedicated to acing SRE interviews, either in getting a first job or a valued promotion.
What you will learn
Who this book is for
Real-World SRE is aimed at software developers facing a website crisis, or who want to improve the reliability of their company's software. Newcomers to Site Reliability Engineering looking to succeed at interview will also find this invaluable.
Page count: 444
Publication year: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Acquisition Editors: Ben Renow-Clarke, Suresh Jain
Project Editor: Veronica Pais
Technical Editor: Nidhisha Shetty
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Sandip Tadge
Production Coordinator: Sandip Tadge
First published: August 2018
Production reference: 2040918
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78862-888-4
www.packtpub.com
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Nat Welch is a software developer based in the US. Since 2005 he has been building websites and keeping them running. He has always had a deep love of infrastructure and building to support the creative efforts of others. In 2012, Nat became a Site Reliability Engineer at Google and fell in love with the specialty. Since then, he has worked at companies of all sizes trying to promote reliability and help developers build reliable systems.
I would like to thank a lot of people for helping me make this book. First off is everyone at Packt Publishing, especially Radhika Atitkar and Veronica Pais. Without them, there is no way I would have ever thought I could write a book, let alone finish one. Also shout-out to all of the wonderful editors, including Pavlos Ratis who helped me shape this book into something great.
Second is everyone at Hillary for America. I value your work more highly than you know. I would especially like to mention Stephanie Hannon and Rohen Peterson for giving me the opportunity to work with HFA. While there, I learned a ton working with a team of operators that have also become amazing friends: Michael Fisher, timball, Amy Hails, Will McCutcheon, and Dylan Ayrey. Also shoutout to Ben Hagen, Ernest W. Durbin III, and Rob Witoff.
Two communities kept me motivated throughout the writing of this book: the Recurse Center and Simple Casual. Everyone in both communities is incredibly supportive, and without your support I would have given up long ago.
Thanks to the love of my life and light of my days, Melissa Cantrell, who put up with my constant hiding in the corner with headphones on to write. The exchange of a croissant delivery for love and affection and never-ending support is not a fair trade, so thank you.
I would not be here without all of the SREs at Google, especially Bill Thiede, Ben Lazarus, stratus, Sumeet Pannu, mglb, wac, and Chris Jones. Thank you for everything you taught me from code reviews, to wheels of misfortunes, basic CIDR math, IRC etiquette, and international handoffs. To the Punchd team: stay reckless. I never would have joined SRE without your constant hustle.
Thank you to Stephanie Harris for her illustrations and help with graphics in this book.
Finally, much love to my family. You went through so much while I was trying to write this book. Your courage, while losing so much, reminds me that no matter the obstacle, I need to keep pushing forward. Mom, Dad, and Travis, we will rebuild. I love you so much.
Pavlos Ratis is a Site Reliability Engineer at HolidayCheck, where he works on automation software and infrastructure reliability. Over time, he has worked on a wide range of projects, from writing automation software and managing multi-server cloud-based infrastructure to developing web applications.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
At some point, every software developer experiences a catastrophic website failure: customers tweet for hours that they can't access your website; while you are sleeping, no customers can buy the t-shirts you sell; or, on your biggest sales day of the year, all of the servers collapse under the load.
Real-World SRE is aimed at software developers and software operators who want to improve the reliability of their company's software. The book will introduce you to a basic framework for working toward greater reliability and give you an insight into the Site Reliability Engineering (SRE) profession. For those engineers and developers who have already experienced a major outage, this is the book you wish you'd had. For those developers and engineers lucky enough not to have experienced an outage, buy this book now!
Chapter 1, Introduction, explores the relatively new SRE field and outlines the practical framework of the book.
Chapter 2, Monitoring, talks about the tools and methodologies used when monitoring. After this chapter, a good experiment for you would be to set up monitoring on services, even if they are just fake services written for testing, and observe how they change over time.
Chapter 3, Incident Response, explains how to respond to outages and how to prepare your team for the worst. We also focus on setting up on-call rotations, best practices for working together as a team, and building processes to make incidents as low-stress as possible.
Chapter 4, Postmortems, takes you through the act of writing a postmortem and promoting reviews for yourself, your team, and your organization. We talk about data to collect, along with communication and how to track future work.
Chapter 5, Testing and Releasing, reviews common practices around testing and releasing.
Chapter 6, Capacity Planning, goes over some of the basics of finance and talks about how to build a plan for your infrastructure growth over time.
Chapter 7, Building Tools, discusses how to write software in a role focused on responsiveness. We also explore how to find new projects to work on, how to define those projects, and how to plan them. We then talk about execution and the long-term maintenance of software and how to be reflective on the work you have done.
Chapter 8, User Experience, gives an overview of the basics of user experience and user testing. We also talk about security and performance budgets.
Chapter 9, Networking Foundations, helps you to dive into the basics of networking.
Chapter 10, Linux and Cloud Foundations, covers the basics of Linux and common cloud services.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Real-World-SRE. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/RealWorldSRE_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example: "This creates an instance of the StatsD class to talk to a StatsD server that is running on port 9125 on the local machine."
A block of code is set as follows:
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
Any command-line input or output is written as follows:
Bold: indicates a new term, an important word, or words that you see on the screen; words in menus or dialog boxes, for instance, appear in the text like this. For example: "A Service Level Indicator (SLI) is a possible most important metric for the business."
Warnings or important notes appear like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome.
General feedback: email [email protected], and mention the book's title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit http://www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: if there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
As the internet has grown, people have become used to having access to content all of the time, from a variety of devices. This means that the reputation of a brand has slowly become connected with the responsiveness and reliability of its products. People choose Google for searching because it always returns relevant and useful results quickly. People share content on Twitter because their message will be seen in real time by their followers. Netflix's great content selection is useless if it cannot deliver consistently on a variety of network speeds. As this reliability has become more important to businesses, a specialization focused on software reliability has emerged: Site Reliability Engineering (SRE). This chapter will introduce you to the field and also describe what you will learn from this book, helping you to write software to navigate the ever-changing internet landscape.
Before we explain what the field and role of SRE entail, let us start with a thought experiment. Imagine that it's early in the morning and you wake up to a screenshot of a blank web page in a text message from a friend with the caption: "I can't load your website."
If your personal website is indeed down, maybe you will message back with an, "I'll check it after breakfast," or an, "Oh yeah, been meaning to look into that." If it is your company's website, or maybe the page hosting your resume that you just sent to 15 possible employers, then a stream of expletives and indecipherable emojis will probably erupt from your mouth and in your text message back. This is because, for many businesses, websites have become the main source of incoming business. For some companies, like Facebook, Amazon, or iFixit, their entire business is a website. For other businesses, like restaurants or advertising agencies, a website acts as a way for people interested in the organization to learn more. It is often part of the marketing flow that helps companies to grow.
It is probably impossible to completely remove the adrenaline spike that comes from discovering a website is down if you are responsible for fixing it. However, we can work to set up a framework to limit how often things break. We can create a world where responding to outages is easy, and transition from, "Oh god, everything is on fire, what do I do?!" to "Oh hey, a page isn't loading, so let's check out what's having a rough day."
This chapter is our introduction to the book and the field of SRE. We will cover the following topics in the next few pages:
SRE is a relatively new field, but it is a slightly different take on many existing ideas. In 1958, the term IT was coined in the Harvard Business Review, and eventually became the descriptor for the maintenance of technology used for collecting, storing, and distributing data and information. At that time, computers were transitioning toward having integrated circuits, but they were still the size of a room and were maintained and programmed by a team of people. As computers shrank, that team started focusing on multiple computers. Over time, some people started to specialize in programming those computers, and others focused on keeping them running. "Dumb terminals" would connect to a single computer, which was maintained by a team while programmers and users used the terminals.
Eventually, these maintainers started taking care of both the machines that individuals used, as well as large arrays of machines that provided services. Users would use a word processor on their local machine, and then upload files to a remote machine. Those who maintained the remote machines became known as system engineers, system administrators, and system operators.
As computers became smaller and more commodified, programmers began spending more time interacting with infrastructure, and configuring their software and infrastructure to work together well. On the other end, system admins were writing more and more complex code to maintain infrastructure. The closer these teams became, the more they began working together. In smaller teams, people would often start focusing on both infrastructure code and business code. In larger organizations, teams were created that focused on tools for managing infrastructure in reliable ways, so that product teams could quickly and easily manage the infrastructure they needed. These joint teams were often described as SRE or DevOps (development and operations) teams.
Benjamin Treynor Sloss of Google, often referred to as just Treynor, says in Google's Site Reliability Engineering book, "SRE is what happens when you ask a software engineer to design an operations team." He is often credited with the creation of the idea that operations work is now just a specialization of software engineering. Given Google's success with reliability, the idea has caught on at many companies.
SRE is still a burgeoning field and, like DevOps, is often used to describe roles that include a wide diversity of work. Some companies give the title of SRE to a position that is much closer to a traditional system admin role. You can use this book's framework to evaluate a job before you apply for it; however, the goal of this book is to introduce you to the SRE mindset and help you to apply it to an organization, regardless of your past experience in the tech world.
SRE is an exciting field. As mentioned earlier, it has evolved from a long line of roles and, as it is a relatively new field, its definition is steadily changing. SRE is an extension and evolution of many past concepts and, as such, concepts relevant to SRE apply to many roles, including, but not limited to, backend engineering, DevOps, systems engineering, systems administration, and operations. Depending on the company, these roles can involve very similar or very different responsibilities. The point is that, no matter what your job title is, you can apply SRE principles to your role.
In an attempt to define the field, we can learn a lot from its full name, Site Reliability Engineering:
Merging these three definitions, we get something like, "The field focused on working artfully to bring about a website that performs consistently well." While this definition could use some brushing up, it suits our needs for now. If you work, or know people who work, in the web development or software engineering world and you ask them what SRE means, then they may ask you, "Isn't that like X?" To someone from that background, X might be "DevOps," "ops," "platform engineering," "infrastructure engineering," "24/7 engineering," "a sysadmin," and so on.
This variation of answers presents the first problem we will see throughout this book: every organization is different. SRE's primary goal is making a website perform consistently according to our previous definition, which is difficult because it is dependent on the organization, the business around that organization, and the website's (or product's) requirements. One of the primary goals of this book is to present a framework that you can apply even if you do not belong to an organization with any of the aforementioned roles. The framework should be effective if you work for yourself, and it should also work if you are employed by some gigantic international multi-headed Hydra organization, and anything in between.
I worked as an SRE in 2016 for Hillary for America. It was the lead organization (but definitely not the only one) working to help to elect Hillary Clinton as President of the United States of America. We were not successful, and while this example immediately dates this book, I found it to have the most concrete separation of concerns between the parts of a website that I have ever worked on. The organization was hyper-focused on one goal (electing Hillary Clinton as president), so it had a very explicit list of goals that made my job a lot easier.
There were many separate parts of the campaign that the technology team worked on, including a mobile application, different websites, data pipelines, and large databases. To keep this simple though, and to explain what I mean by a separation of concerns, let me use three separate websites that we built and maintained as an example:
Figure 1: Screenshots of different parts of hillaryclinton.com, courtesy of the Hillary for America design team. From left to right: the header on the home page, a page about Nevada, a page about Hillary's policies, Hillary's home page in Spanish, the campaign blog, and the donate page.
The home page was a general landing page. It needed to be available during the hours that people in North America were awake (as our target audience was mostly based in the United States), but very few people visited the home page unless driven there.
The main reason you would go to https://www.hillaryclinton.com/ was if you were sent there, not because it was part of your daily browsing like you would visit Twitter or Reddit. Surrogates speaking at rallies, on the radio, or on television supporting Hillary Clinton would often say things like, "Go to hillaryclinton.com now to sign up," or "hillaryclinton.com has more details on her policies on this topic." Because traffic arrived in these semi-predictable spikes, a five-minute outage here and there was OK, but, as with many media organizations, there were no guarantees of when a large spike of traffic would occur.
The donate page always needed to be up. According to our product team and senior leadership, the donate page's availability was priority number one. If people could not give money, then the campaign might not be able to pay people's salaries or get the candidate to her speaking engagements. The donation site was not the only way that the campaign made money, but it was a significant source of income.
The voter registration page only needed to be fully available when there was an election coming soon. This was because the page let people say they were going to vote for Hillary Clinton and find their nearest polling location. While the donate page needed to be available for the majority of the campaign (May 2015 through to November 2016), the voter registration page only really needed to be available during the lead-up to the general election (September through to November of 2016). If we had built the voter registration page earlier in the election, it also would have been needed in the days leading up to the primaries, but then only for states that were voting on those days. Primary elections are a precursor to the general election and happen from February to June, with different states voting on different days.
The key here is that different websites and features have different requirements and a different definition of being reliable. Nothing will ever be perfect, nor is 100% uptime achievable on the internet, because things are always breaking. So, all we can do is figure out what sort of failures we might have and optimize our product to be resilient in a way that is useful for us. SRE isn't just the analysis of systems; it is also the architecting and building of systems so that they meet the requirements of the product.
Software on the internet can never be fully reliable for two reasons. The first reason is that the internet is a distributed system and, often, parts fail, which will affect your service's availability. The second reason is that humans write software, and that software will often have bugs, which will also cause outages.
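To make the idea of "different definitions of being reliable" a little more concrete, here is a minimal sketch of how an availability target translates into a budget of acceptable downtime. The function name `allowed_downtime`, the 30-day window, and the target values are all assumptions chosen for illustration; they are not figures from the campaign or from any particular tool.

```python
from datetime import timedelta


def allowed_downtime(target: float, window: timedelta) -> timedelta:
    """Return how much downtime an availability target permits in a window.

    target is a fraction between 0 and 1 (for example, 0.999 means the
    service should be up 99.9% of the time).
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    # The unavailable fraction of the window is the downtime budget.
    return window * (1.0 - target)


# Hypothetical targets for illustration only; they are not from the campaign.
HYPOTHETICAL_TARGETS = {
    "home page": 0.99,
    "donate page": 0.999,
}

if __name__ == "__main__":
    window = timedelta(days=30)
    for page, target in HYPOTHETICAL_TARGETS.items():
        budget = allowed_downtime(target, window)
        print(f"{page}: {target:.1%} over {window.days} days allows {budget}")
```

Note that a target of 1.0 leaves a budget of zero, which is another way of saying that 100% uptime leaves no room for the failures that will inevitably happen. Picking a lower, explicit target per page is what lets a team decide where a short outage is tolerable and where it is not.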
Often, the job of someone working in SRE is to take in reliability requirements for software, and its infrastructure, and then figure out how to make the infrastructure meet those requirements. Steps toward this often require figuring out if existing infrastructure is meeting those needs, collaborating with teams (or people writing software that will run on the infrastructure), evaluating external tools, or just designing and writing what you need yourself.
As I mentioned at the beginning of the chapter, an SRE role can be very diverse. The requirements of an SRE position at a Fortune 500 company can be very different to those of a 20-person video game company. The role at a bank in the USA could be different from the role at a bank on the other side of the world. This is because the organizations are different. For smaller organizations, someone working as an SRE may handle everything in the organization related to infrastructure and reliability. On the other hand, larger organizations may have multiple teams of SREs working with many diverse teams of developers. The role at two different banks could differ because of each bank's needs.
A local bank may only need someone to improve the reliability of tools for people who work for the bank, while a much larger bank in London may need someone who can make sure their bank's systems can make trades at very high speeds with the London Stock Exchange or support millions of individual customers. This book will provide a structure for anyone interested in becoming an SRE. The goal is to empower you, no matter your background or current situation. It will not be a panacea but will provide a knowledge base and a framework for making sites more reliable and moving your career forwards.
I worked as an SRE at Google for four years, and that is where I started specializing, moving away from being a full stack engineer, and instead considering myself an SRE. Google had lots of internal education courses, and when I left, I found it difficult to continue my education. I also quickly discovered that SRE at Google is a very different beast than SRE at much smaller organizations. I decided to write this book for people interested in starting with SRE or applying it to organizations that are much smaller than Google.
To do this, the book is broken up into two parts. The first eight chapters walk through the hierarchy of reliability. This hierarchy was originally designed by Mikey Dickerson of the United States Digital Service (and, surprise, surprise, Google). The hierarchy says that as you are trying to add reliability to a system, you need to walk through each level before you get to the next one.
The following diagram shows a slightly modified version of Mikey's original pyramid. I have updated it to include the all-encompassing aspect of communication:
Figure 2: This seven-layer pyramid is encircled with communication. Each layer builds upon and needs the previous layer. It is surrounded by communication because each layer needs communication to succeed.
Let us walk through the layers as a preview of what you can expect in each chapter.
Some engineers will have had bad experiences and will not think monitoring is worth the investment, whereas others will have religious zealotry toward certain tools, and some will just ignore you. This chapter will help you to navigate all of these competing opinions and find and create the implementation that is best for your project and team.
Chapter 3, Incident Response: The next level is incident response. If something is broken, how do you alert people and respond? While tools help with this, as they define the rules by which to alert humans, most of incident response is about defining policy and setting up training so humans know what to do when they get alerts. If team members see an automated message in Slack, what should they do? If they get a phone call, how quickly do they need to respond? Will employees be paid extra if they have to work on a Saturday due to an outage? These are all questions we will address in the What is incident response section. Setting up on-call rotations, best practices for working together as a team, and building infrastructure to make incidents as low-stress as possible will also be covered.
Chapter 4, Postmortems: The third level is postmortems. Once you have had an outage, how do you make sure the problem does not happen again? Should you have a meeting about your incident? Does there need to be documentation? In this chapter, we will consider how to talk about past incidents and make it an enjoyable process for all involved. Postmortems are the act of recording for history how an incident happened, how the team fixed it, and how the team is working to prevent another similar incident in the future. We want to set up a culture of blameless and transparent postmortems, so people can work together.
Individuals should not be afraid of incidents, but rather feel confident that if an incident happens, the team will respond and improve the system for the future, instead of focusing on the shame and anger that can come with failure. Incidents are things to learn from, not things to be afraid and ashamed of!
Chapter 5, Testing and Releasing: The fourth level is testing and releasing your software. In this chapter, we will be talking about the tooling and strategies that can be used to test and release software. This level in the hierarchy is our first level where instead of focusing on things that have happened, we focus on prevention. Prevention is about trying to limit the number of incidents that happen and also making sure that infrastructure and services stay stable when releasing new code. The chapter will talk about how to focus on all of the different types of testing that exist and make them useful for you and your team. It will also explore releasing software, when to use methodologies like continuous deployment, and some tools you can use.
Chapter 6, Capacity Planning: The fifth level is capacity planning. While Chapter 5, Testing and Releasing focused on the current world, this chapter is all about predicting the future and finding the limits of your system. Capacity planning is also about making sure you can grow over time. Once you are monitoring your system, and running a reliable system, you can start thinking about how to grow it over time, and how to find and anticipate bottlenecks and resource limits. In this chapter, we will talk about planning for long-term growth, writing budgets, communicating with outside teams about the future, and things to keep in mind as your service shrinks and grows.
Chapter 7, Building Tools: The sixth level is the development of new tools and services. SRE is not only about operations but also about software development. We hope SREs will spend around half of their time developing new tools and services. Some of these tools will exist to automate tasks that an employee has been doing by hand, while others will exist to improve another part of the hierarchy, such as automated load testing, or services to improve performance.
In this chapter, we will talk about finding these projects, defining them, planning them, and building them. We will also talk about communicating their usefulness to your fellow engineers.
Chapter 8, User Experience: The final tier is user experience, which is about making sure the user has a good experience. We'll talk about measuring performance, working with user researchers, and defining what a good experience means to your team. We will also discuss how the experience of a tool and processes can cause outages. The goal is to make sure that, no matter the tool, or the user, people enjoy using it, understand how to use it, and cannot easily hurt themselves with it.
Nori Heikkinen, an SRE at Google with many years of experience, adds that "the hierarchy does not include prevention, partly because 100% uptime is impossible, and partly because the bottom three needs in the hierarchy must be addressed within an organization before prevention can be examined." (https://www.infoq.com/news/2015/06/too-big-to-fail)
The last two chapters of this book are a cheat section and an introduction to commonly useful topics.
Chapter 9, Networking Foundations: This is a selection of tools and definitions of important ideas in networking. We discuss network packets, DNS, UDP and TCP, and lots of other things. After this chapter, you should feel like you know the basics of networking and have the ability to research more advanced topics.
Chapter 10, Linux and Cloud Foundations: This is a selection of tools and important concepts involved in Linux and modern cloud products. We cover what the Linux kernel is, common parts of public clouds, and other topics. After this chapter, you should feel like you know the basics of Linux and most public cloud products, and you should feel comfortable researching specific clouds and more advanced Linux topics.
One way to use this book is as a framework for working on a new project. As each chapter is about a different level of the hierarchy, you can work through the book to figure out where in the hierarchy your project sits. If it is a new project, then often it will be right at the bottom of the hierarchy, with no, or very little, monitoring implemented.
At each level, if there are others on the team, then you should begin a conversation to figure out what exists, and if it meets the team's needs. Each chapter will provide a rough rubric for that discussion, but remember that every team and project is unique. If you are the only person who is thinking about reliability and infrastructure, then you may end up spending a significant amount of time proposing solutions and pushing the project in a certain direction. Just remember that the point is to improve the reliability of the service, help the business, and improve the user's experience of the service.
You may find yourself distracted by each thing that you could fix. It is highly recommended to document the problems that you see first before diving in. Documenting first can be helpful in a few ways. Diving in is very satisfying, but it also may lead you to skip over requirements or spend too much time on a solution that doesn't work for your business (for example, integrating your system with a monitoring service you can't afford, or building a distributed job scheduler when you could have just used a piece of open source software).
So, when joining a new project, or evaluating a new service, here is a set of steps to follow:
Figure 3: An example system architecture diagram. This is a very simple diagram that someone might draw on a whiteboard. Most companies will have something much more complex or detailed than this, but this is often the level of detail you need. Boxes with names and arrows show what talks to what.
Figure 4: Second example of an architecture diagram. This system is a classic static site generator model. The admin service creates or modifies things and writes update notifications into a queue. A worker reads data from the queue, does work on the data, and uploads it to a static object store, in this case vendor 2. Then, we put in some sort of CDN or serving system, in this case vendor 1 in front of vendor 2.
| Name | Role | Manager | Things they know/specializations |
|---|---|---|---|
| Akil | Junior Full Stack Dev | Jeff | Seems pretty new and jumps around a lot. |
| Catherine | Senior Frontend Dev | Jeff | Does a lot of initial design prototyping and built most of the frontend originally. |
| Kareem | Senior Mobile Dev | Melissa | Wrote both mobile apps. |
| Steph | Senior Backend Dev | Melissa | TO DO: Set up a one-on-one to understand mobile backend. |
| Suzy | Full Stack Dev | Jeff | Animation wizard who knows the database for CMS better than anyone. |
| Tom | Full Stack Dev | Jeff | Frontend architecture, made initial protocol buffers and knows sync queue best. |
Table 1: An example table with notes on people in the project. With this, we have a reference on team structure. If we need to know who to talk to about mobile apps, we can look at our handy chart and see that we need to talk to Kareem or the manager, Melissa.
Now that you have context for the project, or service, start working through each chapter of the book and ask:
Does the service have monitoring?
Does the team have plans for incident response?
Does the team create postmortems? Are they stored anywhere?
How is the service tested? Does the project have a release plan?
Has anyone done any capacity planning?
What tools could we build to improve the service?
Is the current level of reliability providing a positive user experience?
The trick to note here is that these questions could be asked about a piece of software that has been running for years, as well as one that is just being created.
The service you are investigating could be a large project with many pieces of software (a service-oriented architecture (SOA) for example) or a single monolithic application. If you are working on a project with many services, then work through each service one at a time. The downside of this can be that if you want to build a framework that will fit all of the services you are interacting with, you will not know how best to solve the problems and needs of them until after you have done a bunch of research and work. The upside is that you will not be pulled immediately in many directions and will be able to focus on one specific service's problems.
Your time and energy are limited resources and, because of this, you will always need to work with more people than you have time for, so make sure to take it slow. Going slow will mean that things do not fall through the cracks. You also do not want to burn out before each service has the bottom few levels of its hierarchy filled in.
Alright! We made it through the introduction. We learned what SRE is at a high level, and we talked about the sorts of problems people in the role tend to focus on. We discussed the structure of the book, and also how to apply that structure to a software project.
In the next chapter, we will be diving into the world of monitoring! Monitoring is the foundation of learning about a system. It is how you record historical data about a system and learn about what is actually going on by analyzing the data you collect. By the end of the chapter, you'll know the basics of instrumenting an application, aggregating that data, storing that data, and displaying it.
Monitoring is defined by Oxford Dictionaries (https://en.oxforddictionaries.com/definition/monitor) as to "observe and check the progress or quality of (something) over a period; keep under systematic review." This definition points out two crucial details—firstly, you need to define what quality is and make sure that your system is making progress toward, or staying within, a limit of quality. Secondly, you need to be systematic about this work—you should not be randomly looking at your system. Instead, your approach should be consistent. The need for systematic measurements is one reason that your dentist asks you to come in every six months, or a reason why some insurance companies ask you to get a dedicated primary care doctor.
In this chapter, we will be focusing on the tools and methodology of monitoring modern web services. The chapter will include thoughts on what data to collect, how to collect that data, how to store that data, and how to display that data for developers and those who will find it useful. We will also talk about communication about monitoring, why monitoring is essential, and how to get everyone in a company invested in monitoring.
Everyone tells you to go to the doctor regularly, but why should you? What are the benefits? My parents would say that you should go to the doctor to catch signs of things that you do not necessarily pay attention to, or notice, by yourself. This could be things like cholesterol levels, blood pressure, and skin cancer. I also like to use doctor visits as a time to think about and talk about changes that I have noticed in my body. For example, I might mention that I have frequently had an upset stomach.
These examples work as good comparisons of the two separate types of monitoring that software often needs. The first type is metrics and the second is logs. Metrics, in this case, are numerical measurements. Traditionally, metrics focused on performance numbers, like the percentage of disk space used, or the number of packets received, or the CPU load. These days, they can be used to represent just about anything that can be defined as a numerical value. They can come from any piece of software in your system. Logs are events, which can have numbers and other data attached, but are often less structured. Some logs are complete JSON blobs of data, while others are just human-formatted strings of text. They can also be anything in between.
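To make the difference between the two types concrete, here is a minimal sketch in plain Ruby. The `Metric` struct, the helper names, and the sample values (`disk.used_percent`, the upload failure) are all assumptions made up for illustration; they do not come from any particular monitoring library.

```ruby
require "json"
require "time"

# A metric: a named numerical measurement recorded at a point in time.
# (Hypothetical struct, for illustration only.)
Metric = Struct.new(:name, :value, :timestamp)

# A structured log: an event serialized as a JSON blob.
def structured_log(time, message, fields = {})
  JSON.generate({ "timestamp" => time.iso8601, "message" => message }.merge(fields))
end

# An unstructured log: a human-formatted string for a similar event.
def plain_log(time, message)
  "#{time.iso8601} #{message}"
end

now = Time.utc(2018, 1, 1, 12, 0, 0)
puts Metric.new("disk.used_percent", 72.5, now).value
puts structured_log(now, "upload failed", { "level" => "error", "user_id" => 42 })
puts plain_log(now, "ERROR upload failed for user 42")
```

The metric is easy to graph and compare over time because it is only a number with a name. The two log lines describe one event in more detail, and only the JSON version can be reliably parsed back into fields.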
Let's say that I go to see the doctor and my blood pressure is recorded as 120/70 mmHg. My cholesterol is 190 mg/dL. There is also a new mole on my back and my stomach has been feeling upset. The metrics, in this case, are my blood pressure and cholesterol. My doctor collects them every time I visit and there is a documented history of them. They also have simple ranges for the human adult that are considered safe. This fact is not too relevant right now but will be useful later, when we think about alerting in the next chapter. The mole and stomach issues are closer to events.
We are stretching the metaphor a bit thin here, but the mole has data around it and the doctor is making a gut decision based on its size, location, and time present to decide if the hospital should biopsy it or not. My doctor does not have regular data, but he has a one-off measurement. For the stomach issue, the doctor has a few statistics that are partially remembered by me. These are things that I have eaten and approximately how long, how intense, and how frequent the pain is.
Image of a human with monitoring labels. Each label is an example metric you might collect about a person's body.
For the metrics, the doctor records them and if they are abnormal, or out of bounds, the doctor may recommend some changes to my lifestyle. For the log data, the doctor will probably start by either collecting more data, by sending the mole to a lab, or by asking me to keep a record of what I am eating.
As much as we might want them to take care of themselves, applications are not humans. So, we measure applications slightly differently. For a web application, the most common metrics are error counts, request counts, and request duration. The most common logs are error stack traces.
If you want help remembering these three metrics, they can be remembered as ERD or RED. Some people also call them REL (requests, errors, latency).
So, why are these metrics often the starting point? We can increment a counter every time an error occurs and write that error out to a log, with a timestamp. A counter is one of the most fundamental forms of monitoring. We just count the number of times something has happened. Some services let you store metadata with your counter increment, tying the log to the counter, but often you just write the logs and the counter increment with the same timestamp. This counter is useful because you want to know when you are serving an error to a user and to look at your logs to evaluate what the errors are. You increment the counter so that you can calculate what percentage of the requests that you serve are errors, and so that you can quickly view long-term error progress.
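As a rough sketch of the idea above, the following plain Ruby keeps counters in memory, writes a log line with the same timestamp whenever an error is counted, and derives an error percentage from the two counters. The `Counters` class and `handle_request` helper are hypothetical names invented for this example, and a real system would send the increments to a monitoring service instead of a hash.

```ruby
require "time"

# A minimal in-memory counter store (hypothetical, for illustration only).
class Counters
  def initialize
    @counts = Hash.new(0)
  end

  # Add to the named counter and return its new value.
  def increment(name, by = 1)
    @counts[name] += by
  end

  def value(name)
    @counts[name]
  end

  # Percentage of requests that were errors; 0.0 when nothing was served.
  def error_percentage
    total = @counts["requests"]
    return 0.0 if total.zero?
    100.0 * @counts["errors"] / total
  end
end

# Count one request. On failure, count the error and log it using the
# same timestamp, so the log line can be matched to the counter increment.
def handle_request(counters, log, now = Time.now.utc)
  counters.increment("requests")
  yield
rescue StandardError => e
  counters.increment("errors")
  log << "#{now.iso8601} ERROR #{e.class}: #{e.message}"
  nil
end

counters = Counters.new
log = []
3.times { handle_request(counters, log) { :ok } }
handle_request(counters, log) { raise ArgumentError, "bad input" }

puts counters.error_percentage
puts log
```

The counter tells you how often errors happen, and the log tells you what those errors were. Keeping both is what lets you move from "the error rate went up" to the specific failure behind it.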
A total request count is useful because we know how often our application is being used. I am using the example of a basic HTTP 1.1 web application, so the total request count is an accurate view of how much work a server is doing. If the server is a streaming server, then often a team counts bytes or packets instead of requests, so they have a view of how usage changes over time, because in their case a request can represent more than a single unit of work.
See Chapter 9, Networking Foundations for more on how HTTP 1.1 works and how it differs from other versions of HTTP. We also cover what a packet is in that chapter.
In this example, request duration is just the length of a single HTTP request. We are measuring how long it takes the server to process, from when it receives the full request, until it has sent out the full response. Request duration is used for a bunch of things. Firstly, you can use it to figure out whether certain types of requests are taking longer than others. If you tag each duration recording not just with the time it took, but also with the URL hit, the method (GET, POST, HEAD, and so on), and the status code you returned, then you could dig into the metrics.
You could see that, on average, all requests that returned code 404 took one second longer than requests that returned code 200. Secondly, you could use this method to see how similar requests change over time. For instance, you could compare how requests to https://example.com/ performed in November with how they performed in December.
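The tagging approach could be sketched as follows, assuming durations are kept in memory as simple structs. `Sample`, `time_block`, and `average_by` are hypothetical helpers invented here, and the durations are made-up numbers, not measurements from a real service.

```ruby
# One duration sample, tagged with details of the request it measured.
# (Hypothetical struct, for illustration only.)
Sample = Struct.new(:seconds, :path, :http_method, :status)

# Time a block with the monotonic clock and return the elapsed seconds.
def time_block
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Average duration per value of a tag, for example per status code.
def average_by(samples, tag)
  samples.group_by { |s| s[tag] }.transform_values do |group|
    group.sum(&:seconds) / group.size
  end
end

samples = [
  Sample.new(0.25, "/", "GET", 200),
  Sample.new(0.75, "/", "GET", 200),
  Sample.new(1.5, "/missing", "GET", 404),
]
# A measured sample: the sleep stands in for real request handling.
samples << Sample.new(time_block { sleep 0.01 }, "/slow", "POST", 200)

puts average_by(samples, :status)
puts average_by(samples, :path)
```

Because every sample carries its tags, the same data can be grouped by status code, path, or method after the fact, without deciding up front which comparison you will need.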
A graph of a service's total request count for the months of November and December. The December line shows traffic was slightly higher than November's traffic for most days. The days where this is not true are a large spike in the beginning of both months, and a slight slump in traffic near the end of the month in December.
We have mainly talked about how monitoring is useful and not necessarily why it is important. I propose some questions for you—how do you know a service is working? How do you know it is not working? How do you define what working is? This is what we are trying to solve with monitoring. Monitoring is important because it provides us with a data-driven view of our application and proves to us it is working, without us having to sit there and constantly check the application every minute of our lives. With that in mind, I believe the best approach is always the practical approach. Let's try creating a simple application and instrumenting it with monitoring.
First, a caveat—there are a lot of programming languages and monitoring systems. We will be talking about various monitoring systems later in the chapter, and there are libraries for all sorts of languages and systems. So, just because I am providing examples here with specific languages and libraries, it does not mean that you cannot do something very similar with your language and monitoring system of choice.
For the first example, we will use Ruby and StatsD. Ruby is a popular scripting language and tends to be what I use when I want to build something quickly. Also, some very large websites use Ruby, including GitHub, Spotify, and Hulu. StatsD is a monitoring system from Etsy. It is open source and used by many companies including Kickstarter and Hillary for America.
I have commented this simple application as much as possible. However, if you need more documentation than my comments, see the references section.
Sinatra is a simple web framework. It creates a domain-specific language inside of Ruby for responding to web requests:
One of the most popular Ruby StatsD libraries is installed by running
