This book explains why applications running on cloud might not deliver the same service reliability, availability, latency and overall quality to end users as they do when the applications are running on traditional (non-virtualized, non-cloud) configurations, and explains what can be done to mitigate that risk.
Table of Contents
IEEE Press
Title page
Copyright page
Figures
Tables and Equations
Tables
Equations
1: Introduction
1.1 Approach
1.2 Target Audience
1.3 Organization
Acknowledgments
I: Context
2: Application Service Quality
2.1 Simple Application Model
2.2 Service Boundaries
2.3 Key Quality and Performance Indicators
2.4 Key Application Characteristics
2.5 Application Service Quality Metrics
2.6 Technical Service versus Support Service
2.7 Security Considerations
3: Cloud Model
3.1 Roles in Cloud Computing
3.2 Cloud Service Models
3.3 Cloud Essential Characteristics
3.4 Simplified Cloud Architecture
3.5 Elasticity Measurements
3.6 Regions and Zones
3.7 Cloud Awareness
4: Virtualized Infrastructure Impairments
4.1 Service Latency, Virtualization, and the Cloud
4.2 VM Failure
4.3 Nondelivery of Configured VM Capacity
4.4 Delivery of Degraded VM Capacity
4.5 Tail Latency
4.6 Clock Event Jitter
4.7 Clock Drift
4.8 Failed or Slow Allocation and Startup of VM Instance
4.9 Outlook for Virtualized Infrastructure Impairments
II: Analysis
5: Application Redundancy and Cloud Computing
5.1 Failures, Availability, and Simplex Architectures
5.2 Improving Software Repair Times via Virtualization
5.3 Improving Infrastructure Repair Times via Virtualization
5.4 Redundancy and Recoverability
5.5 Sequential Redundancy and Concurrent Redundancy
5.6 Application Service Impact of Virtualization Impairments
5.7 Data Redundancy
5.8 Discussion
6: Load Distribution and Balancing
6.1 Load Distribution Mechanisms
6.2 Load Distribution Strategies
6.3 Proxy Load Balancers
6.4 Nonproxy Load Distribution
6.5 Hierarchy of Load Distribution
6.6 Cloud-Based Load Balancing Challenges
6.7 The Role of Load Balancing in Support of Redundancy
6.8 Load Balancing and Availability Zones
6.9 Workload Service Measurements
6.10 Operational Considerations
6.11 Load Balancing and Application Service Quality
7: Failure Containment
7.1 Failure Containment
7.2 Points of Failure
7.3 Extreme Solution Coresidency
7.4 Multitenancy and Solution Containers
8: Capacity Management
8.1 Workload Variations
8.2 Traditional Capacity Management
8.3 Traditional Overload Control
8.4 Capacity Management and Virtualization
8.5 Capacity Management in Cloud
8.6 Storage Elasticity Considerations
8.7 Elasticity and Overload
8.8 Operational Considerations
8.9 Workload Whipsaw
8.10 General Elasticity Risks
8.11 Elasticity Failure Scenarios
9: Release Management
9.1 Terminology
9.2 Traditional Software Upgrade Strategies
9.3 Cloud-Enabled Software Upgrade Strategies
9.4 Data Management
9.5 Role of Service Orchestration in Software Upgrade
9.6 Conclusion
10: End-to-End Considerations
10.1 End-to-End Service Context
10.2 Three-Layer End-to-End Service Model
10.3 Distributed and Centralized Cloud Data Centers
10.4 Multitiered Solution Architectures
10.5 Disaster Recovery and Geographic Redundancy
III: Recommendations
11: Accountabilities for Service Quality
11.1 Traditional Accountability
11.2 The Cloud Service Delivery Path
11.3 Cloud Accountability
11.4 Accountability Case Studies
11.5 Service Quality Gap Model
11.6 Service Level Agreements
12: Service Availability Measurement
12.1 Parsimonious Service Measurements
12.2 Traditional Service Availability Measurement
12.3 Evolving Service Availability Measurements
12.4 Evolving Hardware Reliability Measurement
12.5 Evolving Elasticity Service Availability Measurements
12.6 Evolving Release Management Service Availability Measurement
12.7 Service Measurement Outlook
13: Application Service Quality Requirements
13.1 Service Availability Requirements
13.2 Service Latency Requirements
13.3 Service Reliability Requirements
13.4 Service Accessibility Requirements
13.5 Service Retainability Requirements
13.6 Service Throughput Requirements
13.7 Timestamp Accuracy Requirements
13.8 Elasticity Requirements
13.9 Release Management Requirements
13.10 Disaster Recovery Requirements
14: Virtualized Infrastructure Measurement and Management
14.1 Business Context for Infrastructure Service Quality Measurements
14.2 Cloud Consumer Measurement Options
14.3 Impairment Measurement Strategies
14.4 Managing Virtualized Infrastructure Impairments
15: Analysis of Cloud-Based Applications
15.1 Reliability Block Diagrams and Side-by-Side Analysis
15.2 IaaS Impairment Effects Analysis
15.3 PaaS Failure Effects Analysis
15.4 Workload Distribution Analysis
15.5 Anti-Affinity Analysis
15.6 Elasticity Analysis
15.7 Release Management Impact Effects Analysis
15.8 Recovery Point Objective Analysis
15.9 Recovery Time Objective Analysis
16: Testing Considerations
16.1 Context for Testing
16.2 Test Strategy
16.3 Simulating Infrastructure Impairments
16.4 Test Planning
17: Connecting the Dots
17.1 The Application Service Quality Challenge
17.2 Redundancy and Robustness
17.3 Design for Scalability
17.4 Design for Extensibility
17.5 Design for Failure
17.6 Planning Considerations
17.7 Evolving Traditional Applications
17.8 Concluding Remarks
Abbreviations
References
About the Authors
Index
Copyright © 2014 by The Institute of Electrical and Electronics Engineers, Inc.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Bauer, Eric.
Service quality of cloud-based applications / Eric Bauer, Randee Adams.
pages cm
ISBN 978-1-118-76329-2 (cloth)
1. Cloud computing. 2. Application software–Reliability. 3. Quality of service (Computer networks) I. Adams, Randee. II. Title.
QA76.585.B3944 2013
004.67'82–dc23
2013026569
1
Introduction
Customers expect that applications and services deployed on cloud computing infrastructure will deliver comparable service quality, reliability, availability, and latency as when deployed on traditional, native hardware configurations. Cloud computing infrastructure introduces a new family of service impairment risks based on the virtualized compute, memory, storage, and networking resources that an Infrastructure-as-a-Service (IaaS) provider delivers to hosted application instances. As a result, application developers and cloud consumers must mitigate these impairments to assure that application service delivered to end users is not unacceptably impacted. This book methodically analyzes the impacts of cloud infrastructure impairments on application service delivered to end users, as well as the opportunities for improvement afforded by cloud. The book also recommends architectures, policies, and other techniques to maximize the likelihood of delivering comparable or better service to end users when applications are deployed to cloud.
Cloud-based application software executes within a set of virtual machine instances, and each individual virtual machine instance relies on virtualized compute, memory, storage, and networking service delivered by the underlying cloud infrastructure. As shown in Figure 1.1, the application presents customer facing service toward end users across the dotted service boundary, and consumes virtualized resources offered by the Infrastructure-as-a-Service provider across the dashed resource facing service boundary. The application's service quality experienced by the end users is primarily a function of the application's architecture and software quality, as well as the service quality of the virtualized infrastructure offered by the IaaS across the resource facing service boundary, and the access and wide area networking that connects the end user to the application instance. This book considers both the new impairments and opportunities of virtualized resources offered to applications deployed on cloud and how user service quality experienced by end users can be maximized. By ignoring service impairments of the end user's device, and access and wide area network, one can narrowly consider how application service quality differs when a particular application is hosted on cloud infrastructure compared with when it is natively deployed on traditional hardware.
Figure 1.1. Sample Cloud-Based Application.
The key technical difference for application software between native deployment and cloud deployment is that native deployments offer the application's (guest) operating system direct access to the physical compute, memory, storage, and network resources, while cloud deployment inserts a layer of hypervisor or virtual machine management software between the guest operating system and the physical hardware. This layer of hypervisor or virtual machine management software enables sophisticated resource sharing, technical features, and operational policies. However, the hypervisor or virtual machine management layer does not deliver perfect hardware emulation to the guest operating system and application software, and these imperfections can adversely impact application service delivered to end users. While Figure 1.1 illustrates application deployment to a single data center, real world applications are often deployed to multiple data centers to improve user service quality by shortening transport latency to end users, to support business continuity and disaster recovery, and for other business reasons. Application service quality for deployment across multiple data centers is also considered in this book.
This book considers how application architectures, configurations, validation, and operational policies should evolve so that acceptable application service quality can be delivered to end users even when application software is deployed on cloud infrastructure. This book approaches application service quality from the end user's perspective while considering standards and recommendations from NIST, TM Forum, QuEST Forum, ODCA, ISO, ITIL, and so on.
This book provides application architects, developers, and testers with guidance on architecting and engineering applications that meet their customers' and end users' service reliability, availability, quality, and latency expectations. Product managers, program managers, and project managers will also gain deeper insights into the service quality risks and mitigations that must be addressed to assure that an application deployed onto cloud infrastructure consistently meets or exceeds customers' expectations for user service quality.
The work is organized into three parts: context, analysis, and recommendations. Part I: Context frames the context of service quality of cloud-based applications via the following:
“Application Service Quality” (Chapter 2). Defines the application service metrics that will be used throughout this work: service availability, service latency, service reliability, service accessibility, service retainability, service throughput, and timestamp accuracy.
“Cloud Model” (Chapter 3). Explains how application deployment on cloud infrastructure differs from traditional application deployment from both a technical and an operational point of view, as well as what new opportunities are presented by rapid elasticity and massive resource pools.
“Virtualized Infrastructure Impairments” (Chapter 4). Explains the infrastructure service impairments that applications running in virtual machines on cloud infrastructure must mitigate to assure acceptable quality of service to end users. The application service impacts of the impairments defined in this chapter will be rigorously considered in Part II: Analysis.
Part II: Analysis methodically considers how application service defined in Chapter 2, “Application Service Quality,” is impacted by the infrastructure impairments enumerated in Chapter 4, “Virtualized Infrastructure Impairments,” across the following topics:
“Application Redundancy and Cloud Computing” (Chapter 5). Reviews fundamental redundancy architectures (simplex, sequential redundancy, concurrent redundancy, and hybrid concurrent redundancy) and considers their ability to mitigate application service quality impact when confronted with virtualized infrastructure impairments.
“Load Distribution and Balancing” (Chapter 6). Methodically analyzes workload distribution and balancing for applications.
“Failure Containment” (Chapter 7). Considers how virtualization and cloud help shape failure containment strategies for applications.
“Capacity Management” (Chapter 8). Methodically analyzes application service risks related to rapid elasticity and online capacity growth and degrowth.
“Release Management” (Chapter 9). Considers how virtualization and cloud can be leveraged to support release management actions.
“End-to-End Considerations” (Chapter 10). Explains how application service quality impairments accumulate across the end-to-end service delivery path. The chapter also considers service quality implications of deploying applications to smaller cloud data centers that are closer to end users versus deploying to larger, regional cloud data centers that are farther from end users. Disaster recovery and georedundancy are also discussed.
Part III: Recommendations covers the following:
“Accountabilities for Service Quality” (Chapter 11). Explains how cloud deployment profoundly changes traditional accountabilities for service quality and offers guidance for framing accountabilities across the cloud service delivery chain. The chapter also uses the service gap model to review how to connect specification, architecture, implementation, validation, deployment, and monitoring of applications to assure that expectations are met. Service level agreements are also considered.
“Service Availability Measurement” (Chapter 12). Explains how traditional application service availability measurements can be applied to cloud-based application deployments, thereby enabling efficient side-by-side comparisons of service availability performance.
“Application Service Quality Requirements” (Chapter 13). Reviews high-level service quality requirements for applications deployed to cloud.
“Virtualized Infrastructure Measurement and Management” (Chapter 14). Reviews strategies for quantitatively measuring virtualized infrastructure impairments on production systems, along with strategies to mitigate the application service quality risks of unacceptable infrastructure performance.
“Analysis of Cloud-Based Applications” (Chapter 15). Presents a suite of analysis techniques to rigorously assess the service quality risks and mitigations of a target application architecture.
“Testing Considerations” (Chapter 16). Considers testing of cloud-based applications to assure that service quality expectations are likely to be met consistently despite inevitable virtualized infrastructure impairments.
“Connecting the Dots” (Chapter 17). Discusses how to apply the recommendations of Part III to both existing and new applications to mitigate the service quality risks introduced in Part I: Context and analyzed in Part II: Analysis.
As many readers are likely to study sections based on the technical needs of their business and their professional interest rather than strictly following this work's running order, cross-references are included throughout the work so readers can, say, dive into detailed Part II analysis sections, follow cross-references back into Part I for basic definitions, and follow references forward to Part III for recommendations. A detailed index is included to help readers quickly locate material.
The authors acknowledge the consistent support of Dan Johnson, Annie Lequesne, Sam Samuel, and Lawrence Cowsar that enabled us to complete this work. Expert technical feedback was provided by Mark Clougherty, Roger Maitland, Rich Sohn, John Haller, Dan Eustace, Geeta Chauhan, Karsten Oberle, Kristof Boeynaems, Tony Imperato, and Chuck Salisbury. Data and practical insights were shared by Karen Woest, Srujal Shah, Pete Fales, and many others. Bob Brownlie offered keen insights into service measurements and accountabilities. Expert review and insight on release management for virtualized applications was provided by Bruce Collier. The work benefited greatly from insightful review feedback from Mark Cameron. Iraj Saniee, Katherine Guo, Indra Widjaja, Davide Cherubini, and Karsten Oberle offered keen and substantial insights. The authors gratefully acknowledge the external reviewers who took time to provide thorough review and thoughtful feedback that materially improved this book: Tim Coote, Steve Woodward, Herbert Ristock, Kim Tracy, and Xuemei Zhang.
The authors welcome feedback on this book; readers may e-mail us at [email protected] and [email protected].
I
Context
Figure 2.0 frames the context of this book: cloud-based applications rely on virtualized compute, memory, storage, and networking resources to provide information services to end users via access and wide area networks. The application's primary quality focus is on the user service delivered across the application's customer facing service boundary (dotted line in Figure 2.0).
Chapter 2, “Application Service Quality,” focuses on application service delivered across that boundary. The application itself relies on virtualized compute, memory, storage, and networking delivered by the cloud service provider to execute application software.
Chapter 3, “Cloud Model,” frames the context of the cloud service that supports this virtualized infrastructure.
Chapter 4, “Virtualized Infrastructure Impairments,” focuses on the service impairments presented to application components across the application's resource facing service boundary.
Figure 2.0. Organization of Part I: Context.
2
Application Service Quality
This section considers the service offered by applications to end users and the metrics used to characterize the quality of that service. A handful of common service quality metrics that characterize application service quality are detailed. These user service key quality indicators (KQIs) are considered in depth in Part II: Analysis.
Figure 2.1 illustrates a simple cloud-based application with a pool of frontend components distributing work across a pool of backend components. The suite of frontend and backend components is managed by a pair of control components that provide management visibility and control for the entire application instance. Each of the application's components, along with its supporting guest operating system, executes in a distinct virtual machine instance served by the cloud service provider. The Distributed Management Task Force (DMTF) defines virtual machine as:
the complete environment that supports the execution of guest software. A virtual machine is a full encapsulation of the virtual hardware, virtual disks, and the metadata associated with it. Virtual machines allow multiplexing of the underlying physical machine through a software layer called a hypervisor. [DSP0243]
Figure 2.1. Simple Cloud-Based Application.
For simplicity, this simple model ignores systems that directly support the application, such as security appliances that protect the application from external attack, domain name servers, and so on.
Figure 2.2 shows a single application component deployed in a virtual machine on cloud infrastructure. The application software and its underlying operating system—referred to as a guest OS—run within a virtual machine instance that emulates a dedicated physical server. The cloud service provider's infrastructure delivers the following resource services to the application's guest OS instance:
Networking. Application software is networked to other application components, application clients, and other systems.
Compute. Application programs ultimately execute on a physical processor.
(Volatile) Memory. Applications execute programs out of memory, using heap memory, stack storage, shared memory, and main memory to maintain dynamic data, such as application state.
(Persistent) Storage. Applications maintain program executables, configuration, and application data on persistent storage in files and file systems.
Figure 2.2. Simple Virtual Machine Service Model.
It is useful to define boundaries that demark applications and service offerings to better understand the dependencies, interactions, roles, and responsibilities of each element in overall user service delivery. This work will focus on the two high-level application service boundaries shown in Figure 2.3:
Application's customer facing service (CFS) boundary (dotted line in Figure 2.3), which demarks the edge of the application instance that faces users. User service reliability, such as call completion rate, and service latency, such as call setup, are well-known service quality measurements of telecommunications customer facing service.
Application's resource facing service (RFS) boundary (dashed line in Figure 2.3), which demarks the boundary between the application's guest OS instances executing in virtual machine instances and the virtual compute, memory, storage, and networking provided by the cloud service provider. Latency to retrieve desired data from persistent storage (e.g., hard disk drive) is a well-known service quality measurement of resource facing service.
Figure 2.3. Application Service Boundaries.
Note that customer facing service and resource facing service boundaries are relative to a particular entity in the service delivery chain. Figure 2.3, and this book, consider these concepts from the perspective of a cloud-based application, but the same service boundary notions can be applied to an element of the cloud Infrastructure-as-a-Service or to a technology component offered “as-a-Service,” such as Database-as-a-Service.
Qualities such as latency and reliability of service delivered across a service boundary can be quantitatively measured. Technically useful service measurements are generally referred to as key performance indicators (KPIs). As shown in Figure 2.4, a subset of KPIs across the customer facing service boundary characterize key aspects of the customer's experience and perception of quality, and these are often referred to as key quality indicators (KQIs) [TMF_TR197]. Enterprises routinely track and manage these KQIs to assure that customers are delighted. Well-run enterprises will often tie staff bonus payments to achieving quantitative KQI targets to better align the financial interests of enterprise staff to the business need of delivering excellent service to customers.
Figure 2.4. KQIs and KPIs.
In the context of applications, KQIs often cover high-level business considerations, including service qualities that impact user satisfaction and churn, such as:
Service Availability (Section 2.5.1). The service is online and available to users.
Service Latency (Section 2.5.2). The service promptly responds to user requests.
Service Reliability (Section 2.5.3). The service correctly responds to user requests.
Service Accessibility (Section 2.5.4). The probability that an individual user can promptly access the service or resource that they desire.
Service Retainability (Section 2.5.5). The probability that a service session, such as a streaming movie, game, or call, will continuously be rendered with good service quality until normal (e.g., user requested) termination of that session.
Service Throughput (Section 2.5.6). Meeting service throughput commitments to customers.
Service Timestamp Accuracy (Section 2.5.7). Meeting billing or regulatory compliance accuracy requirements.
Different applications with different business models will define KPIs somewhat differently and will select different KQIs from their suite of application KPIs.
A primary resource facing service risk experienced by cloud-based applications is the quality of virtualized compute, memory, storage, and networking delivered by the cloud service provider to application components executing in virtual machine (VM) instances. Chapter 4, “Virtualized Infrastructure Impairments,” considers the following:
Virtual Machine Failure (Section 4.2). Like traditional hardware, VM instances can fail.
Nondelivery of Configured VM Capacity (Section 4.3). For instance, a VM instance can briefly cease to operate (aka “stall”).
Degraded Delivery of Configured VM Capacity (Section 4.4). For instance, a particular virtual machine server may be congested, so some application IP packets are discarded by the host OS or hypervisor.
Excess Tail Latency on Resource Delivery (Section 4.5). For instance, some application components may occasionally experience unusually long resource access latency.
Clock Event Jitter (Section 4.6). For instance, regular clock event interrupts (e.g., every 1 ms) may be tardy or coalesced.
Clock Drift (Section 4.7). Guest OS instances' real-time clocks may drift away from true (UTC) time.
Failed or Slow Allocation and Startup of VM Instances (Section 4.8). For instance, newly allocated cloud resources may be nonfunctional (aka dead on arrival [DOA]).
Figure 2.5 overlays common customer facing service KQIs with typical resource facing service KPIs on the simple application of Section 2.1.
Figure 2.5. Application Consumer and Resource Facing Service Indicators.
As shown in Figure 2.6, the robustness of an application's architecture characterizes how effectively the application can maintain quality across the application's customer facing service boundary despite impairments experienced across the resource facing service boundary and failures within the application itself.
Figure 2.6. Application Robustness.
Figure 2.7 illustrates a concrete robustness example: if the cloud infrastructure stalls a VM that is hosting one of the application backend instances for hundreds of milliseconds (see Section 4.3, “Nondelivery of Configured VM Capacity”), then is the application's customer facing service impacted? Do some or all user operations take hundreds of milliseconds longer to complete, or do some (or all) operations fail due to timeout expiration? A robust application will mask the customer facing service impact of this service impairment so end users do not experience unacceptable service quality.
Figure 2.7. Sample Application Robustness Scenario.
Customer facing service quality expectations are fundamentally driven by application characteristics, such as:
Service criticality (Section 2.4.1)
Application interactivity (Section 2.4.2)
Tolerance to network traffic impairments (Section 2.4.3)
These characteristics influence both the quantitative targets for an application's service quality (e.g., critical applications have higher service availability expectations) and the specifics of those service quality measurements (e.g., maximum tolerable service downtime influences the minimum chargeable outage downtime threshold).
Readers will recognize that different information services entail different levels of criticality to users and the enterprise. While these ratings will vary somewhat based on organizational needs and customer expectations, the criticality classification definitions from the U.S. Federal Aviation Administration's National Airspace System's reliability handbook are fairly typical:
ROUTINE (Service Availability Rating of 99%). “Loss of this capability would have a minor impact on the risk associated with providing safe and efficient operations” [FAA-HDBK-006A].
ESSENTIAL (Service Availability Rating of 99.9%). “Loss of this capability would significantly raise the risk associated with providing safe and efficient operations” [FAA-HDBK-006A].
CRITICAL (Service Availability Rating of 99.999%). “Loss of this capability would raise to an unacceptable level, the risk associated with providing safe and efficient operations” [FAA-HDBK-006A].
There is also a “Safety Critical” category, with a service availability rating of seven 9s, for life-threatening risks and services where “loss would present an unacceptable safety hazard during the transition to reduced capacity operations” [FAA-HDBK-006A]. Few commercial enterprises offer services or applications that are safety critical, so seven 9s expectations are rare.
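To put these ratings in perspective, here is a minimal Python sketch (an illustration of the arithmetic, not part of the FAA handbook) that converts a service availability rating into the annual downtime budget it implies:

```python
# Convert a service availability rating into the implied maximum annual
# downtime. 525,960 minutes per year averages in leap years.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def annual_downtime_minutes(availability: float) -> float:
    """Return the downtime budget per year implied by an availability rating."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for label, rating in [("ROUTINE", 0.99), ("ESSENTIAL", 0.999),
                      ("CRITICAL", 0.99999), ("SAFETY CRITICAL", 0.9999999)]:
    print(f"{label:15s} {rating:.7f} -> "
          f"{annual_downtime_minutes(rating):10.2f} min/year")
```

Running this shows why the ratings differ so sharply: 99% allows roughly 5260 minutes (about 3.7 days) of downtime per year, 99.999% allows about 5.3 minutes, and seven 9s allows only about 3 seconds.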
The higher the service criticality, the more the enterprise is willing to invest in architectures, policies, and procedures to assure that acceptable service quality is continuously available to users.
As shown in Figure 2.8, there are three broad classifications of application service interactivity:
Batch or Noninteractive Type, for nominally “offline” applications, such as payroll processing, offline billing, and offline analytics, which often run for minutes or hours. Aggregate throughput (e.g., time to complete an entire batch job) is usually more important to users of an offline application than the time to complete a single transaction. While a batch job may consist of hundreds, thousands, or more individual transactions that may each succeed or fail individually, each failed transaction will likely require manual action to correct, resulting in an increase in the customer's OPEX to perform the repairs. While interactivity expectations for batch operations may be low, service reliability expectations (e.g., low transaction fallout rate to minimize the cost of rework) are often high.
Normal Interactive Type, for nominally online applications with ordinary interactivity expectations, such as routine web traffic (e.g., eCommerce) and communications signaling. There is a broad range of interactivity expectations based on application types, service providers, and other factors. For example, most users will wait no more than a few seconds for ring back after placing a telephone call or for the video on their IP TV to change after selecting a different channel, but may wait longer for web-based applications, such as completing an eCommerce transaction. Interactive transaction response times are nominally measured in hundreds or thousands of milliseconds.
Real-Time Interactive Type, for applications that are extremely interactive, with strict response time or service latency expectations. Interactive media content (e.g., audio or video conferencing), gaming (e.g., first-person shooter games), and data or bearer plane applications (e.g., firewalls and gateways) all have strict real-time service expectations. Transaction response times for real-time applications are often measured in milliseconds or tens of milliseconds.
Figure 2.8. Interactivity Timeline.
Data networks are subject to three fundamental types of service impairments:
Packet Loss. Individual data packets can be discarded by intermediate systems due to network congestion, corrupted in transit, or otherwise lost between the sender and receiver.
Packet Delay. Electrical and optical signals propagate at a finite velocity, and their flow through intermediate systems, such as routers and switches, takes finite time. Thus, there is always some latency between the instant one party transmits a packet and the moment that the other party receives the packet.
Packet Jitter. Variation in packet latency from packet to packet in a single data stream is called jitter. Jitter is particularly problematic for isochronous data streams, such as conversational audio or video, where the receiving device must continuously render streaming media to an end user. If a packet has not arrived in time to be smoothly rendered to the end user, then the end user's device must engage some lost packet compensation mechanism, which is likely to somewhat compromise the fidelity of the service rendered and thus degrade the end user's quality of experience.
[RFC4594] characterizes tolerance to packet loss, delay, and jitter for common classes of applications.
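To make the jitter impairment concrete, the following Python sketch implements a smoothed interarrival jitter estimator in the style of RTP [RFC 3550]; the packet timestamps are hypothetical and the sketch is illustrative rather than a reference implementation:

```python
def interarrival_jitter(send_times_ms, recv_times_ms):
    """Smoothed interarrival jitter in the style of RTP (RFC 3550):
    J += (|D| - J) / 16, where D is the change in one-way transit
    time between consecutive packets."""
    jitter = 0.0
    prev_transit = None
    for sent, received in zip(send_times_ms, recv_times_ms):
        transit = received - sent
        if prev_transit is not None:
            jitter += (abs(transit - prev_transit) - jitter) / 16.0
        prev_transit = transit
    return jitter

# Hypothetical 20 ms voice packets; the third packet is delayed 5 ms extra.
send = [0, 20, 40, 60, 80]
recv = [10, 30, 55, 70, 90]
print(f"jitter ~= {interarrival_jitter(send, recv):.3f} ms")
```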
While different applications offer different functionality to end users, the primary service KQIs across the application's customer facing service boundary for end users of applications generally include one or more of the following:
Service availability (Section 2.5.1)
Service latency (Section 2.5.2)
Service reliability (Section 2.5.3)
Service accessibility (Section 2.5.4)
Service retainability (Section 2.5.5)
Service throughput (Section 2.5.6)
Service timestamp accuracy (Section 2.5.7)
Application-specific service quality measurements (Section 2.5.8)
Note that consistency of service quality is also important to users; measured service quality performance should be consistent and repeatable from hour to hour and day to day. Service consistency of branded information and communication services is likely to be as important to end users as consistency of any other branded product.
Availability is defined as the “ability of an IT service or other configuration item to perform its agreed function when required” [ITIL-Availability]. Availability is mathematically expressed in Equation 2.1, the availability formula:
Availability = (Agreed Service Time − Outage Downtime) / Agreed Service Time   (2.1)
Agreed Service Time is the period during the measurement window that the system should be up. For so-called 24 × 7 × Forever systems (sometimes awkwardly called “24 × 7 × 365”), Agreed Service Time is every minute of every day; for systems that are permitted planned downtime, the planned and scheduled downtime can be excluded from Agreed Service Time. Outage Downtime is defined as: “the sum, over a given period, of the weighted minutes a given population of systems, network elements, or service entities was unavailable, divided by the average in-service population of systems, network elements, or service entities” [TL_9000]. Note that modern applications often provide several different functions to different users simultaneously, so partial capacity and partial functionality outages are often more common than total outages; partial capacity or functionality outages are often prorated by the portion of capacity or primary functionality impacted.
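As a minimal sketch, Equation 2.1 with prorated partial outages might be computed as follows; the record fields are illustrative, not taken from TL 9000 or any other standard:

```python
from dataclasses import dataclass

@dataclass
class Outage:
    duration_min: float     # outage duration in minutes
    impact_fraction: float  # portion of capacity/functionality lost (0.0-1.0)

def availability(agreed_service_min: float, outages: list[Outage]) -> float:
    """Equation 2.1: (Agreed Service Time - Outage Downtime) / Agreed Service
    Time, with partial outages prorated by the fraction of capacity impacted."""
    downtime = sum(o.duration_min * o.impact_fraction for o in outages)
    return (agreed_service_min - downtime) / agreed_service_min

# Example: a 24 x 7 month (43,200 minutes of agreed service time) with one
# 30-minute total outage and one 60-minute outage affecting half the users.
outages = [Outage(30, 1.0), Outage(60, 0.5)]
print(f"{availability(43_200, outages):.5%}")  # -> 99.86111%
```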
Service availability measurements and targets generally reflect the service criticality of the affected applications (see Section 2.4.1, “Service Criticality”). For example, consider the availability-related definitions used by a popular IaaS supplier targeting enterprise applications of nominally “essential” and “routine” criticality, in which the minimum chargeable downtime is at least 5 minutes: “Unavailable” means that all of your running instances have no external connectivity during a five minute period and you are unable to launch replacement instances.* Critical services will generally have much stricter service measurements and performance targets. For example, the telecom industry's quality standard TL 9000 uses the following outage definition: “all outages shall be counted that result in a complete loss of primary functionality … for all or part of the system for a duration greater than 15 seconds during the operational window, whether the outage was unscheduled or scheduled” [TL_9000]. Obviously, a minimum chargeable outage duration of 15 seconds is far more stringent than a minimum chargeable outage duration of 5 minutes. In addition to stricter service performance targets, critical services will often include more precise measurements, such as prorating of partial capacity or functionality impairments, rather than all-or-nothing measurements (e.g., “no connectivity during a five minute period”). Outage events tend to be rare acute events, with weeks, months, or years of outage-free operation punctuated by an event lasting tens of minutes or even hours. Thus, availability or outage downtime is often tracked on a 6-month rolling average to set outage events into an appropriate context.
Requirements for this measurement are discussed in Section 13.1, “Service Availability Requirements.”
As shown in Figure 2.9, service latency is the elapsed time between a request and the corresponding response. Most network-based services execute some sort of transactions on behalf of client users: web applications return web pages in response to HTTP GET requests (and update pages in response to HTTP PUT requests); telecommunications networks establish calls in response to user requests; gaming servers respond to user inputs; media servers stream content based on user requests; and so on.
Figure 2.9. Service Latency.
In addition to detailed service latency measurements, such as time to load a web page, some applications have service latency measurement expectations for higher level operations that include many discrete transactions, such as how many seconds or minutes it takes to activate a new smartphone, or how long it takes to provision application service for a new user. Well-engineered solutions will cascade the high-level latency application expectations down to lower-level expectations to enable methodical management of overall service latency.
Requirements for this measurement are discussed in Section 13.2, “Service Latency Requirements.”
The latency between the time a client sends a request and the time the client receives the response will inevitably vary for reasons including:
Request Queuing. Rather than immediately rejecting requests that arrive the instant when a resource is busy, queuing those requests increases the probability that those requests will be served successfully, albeit with slightly greater service latency. Assuming that the system is engineered properly, request queuing enables the offered load to be served promptly (although not instantaneously) without having to deploy sufficient system hardware to serve the busiest traffic instant (e.g., the busiest millisecond or microsecond). In essence, request queuing enables one to trade a bit of system hardware capacity for occasionally increased service latency (a quantitative illustration follows this list).
Caching. Responses served from cached memory are typically much faster than requests that require one or more disk reads or network transactions.
Disk Geometry. Unlike random access memory (RAM), in which it takes the same amount of time to access any memory location, disk storage inherently has nonuniform data access times because of the need to move the disk head to a physical disk location to access stored data. Disk heads move in two independent directions: rotationally, as the disk storage platters spin, and track-to-track, as the disk heads seek between concentric data storage rings or tracks. The physical layout of file systems and databases is often optimized to minimize latency for rotational and track-to-track access to sequential data, but inevitably some data operations will require more time than others due to the physical layout of data on the disk.
Disk Fragmentation. Disk fragmentation causes data to be stored in noncontiguous disk blocks. As reading noncontiguous disk blocks requires time-consuming disk seeks between disk reads or writes, additional latency is introduced when operating on fragmented portions of files.
Variations in Request Arrival Rates. There is inevitably some randomness in the arrival rates of service requests, and this moment-to-moment variation is superimposed over daily, weekly, and seasonal usage patterns. When offered load is higher, request queues will be deeper, and hence queuing delays will be greater.
Garbage Collection. Some software technologies require periodic garbage collection to salvage resources that are no longer required. When garbage collection mechanisms are active, resources may be unavailable to serve application user requests.
Network Congestion or Latency. Bursts or spikes in network activity can cause the latency for IP packets traversing a network to increase.
Unanticipated Usage and Traffic Patterns. Database and software architectures are configured and optimized for certain usage scenarios and traffic mixes. As usage and traffic patterns vary significantly from nominal expectations, the configured settings may no longer be optimal, and thus performance may degrade.
Packet Loss and Corruption. Occasionally IP packets are lost or damaged when traveling between the client device and application instance, or between components within the solution. It takes time to detect lost packets and then to retransmit them, thus introducing latency.
Resource Placement. Resources that are held locally offer better performance than resources held in a nearby data center, and resources held in a nearby data center are generally accessible with lower latency than resources held in data centers on distant continents.
Network Bandwidth. As all web users know, web pages load slower over lower bandwidth network connections; DSL is better than dialup, and fiber to the home is better than DSL. Likewise, insufficient network bandwidth between resources in the cloud—as well as insufficient access bandwidth to users—causes service latency to increase.
Application architectures can impact an application's vulnerability to these latency impairments. For example, applications that factor functionality so that more networked transactions or disk operations are required are often more vulnerable to latency impairments than applications with fewer of those operations.
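To illustrate why deeper request queues mean greater delay, the sketch below applies the textbook M/M/1 mean queuing delay formula, Wq = ρ/(μ − λ). Real applications are rarely M/M/1, so treat the numbers as intuition about the nonlinear relationship between utilization and latency, not as a model of any particular system:

```python
def mm1_mean_wait(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request spends queued (excluding service) in an M/M/1
    queue: Wq = rho / (mu - lambda). Valid only for utilization rho < 1."""
    rho = arrival_rate / service_rate
    assert rho < 1, "queue is unstable at or above 100% utilization"
    return rho / (service_rate - arrival_rate)

# A hypothetical server handling up to 1000 requests/s: queuing delay
# grows slowly at moderate load, then explodes as utilization nears 100%.
for load in (500, 800, 900, 950, 990):
    print(f"{load:4d} req/s -> mean queuing delay "
          f"{mm1_mean_wait(load, 1000) * 1000:6.2f} ms")
```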
Figure 2.10 shows the service latency distribution of 30,000 transactions of one sample application. While the median (50th percentile) service latency is 130 ms, there is a broad range of responses; the slowest response in this data set (1430 ms) is more than 10 times slower than the 50th percentile. As one can see from this cumulative distribution, the latency “tail” includes a few outliers (sometimes called “elephants”) that are significantly slower than the bulk of the population. As these tail values can be far slower than typical (e.g., 50th or 90th percentile) latency, it is useful to methodically characterize the latency statistics of the tail across millions of transactions, rather than the thousands of samples in the data set of Figure 2.10.
Figure 2.10. Small Sample Service Latency Distribution.
Individually recording the service latency of each transaction and then directly analyzing hundreds of thousands, millions, or more data points is often infeasible, and thus it is common for service latency measurements to be recorded in measurement buckets or bins (e.g., less than 30 ms, 30–49 ms, and 50–69 ms). Figure 2.11 shows service latency based on binned measurements for a real-time Session Initiation Protocol (SIP) application running on virtualized infrastructure. Figure 2.11 gives service latency at three different workload densities—“X,” 1.4 times “X,” and 1.7 times “X”—and one can see that typical (e.g., 50th percentile and 90th percentile) latencies are consistent, while the best case latency (e.g., fastest 25%) degrades slightly as workload density increases.
Figure 2.11. Sample Typical Latency Variation by Workload Density.
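When latencies are recorded in bins rather than individually, percentiles can only be approximated from the bin boundaries. A minimal sketch, with hypothetical bin edges and counts:

```python
def binned_percentile(bin_upper_edges_ms, counts, pct):
    """Estimate a latency percentile from binned counts by returning the
    upper edge of the bin in which the requested percentile falls."""
    threshold = sum(counts) * pct / 100.0
    cumulative = 0
    for upper_edge, count in zip(bin_upper_edges_ms, counts):
        cumulative += count
        if cumulative >= threshold:
            return upper_edge
    return bin_upper_edges_ms[-1]

# Hypothetical bins: <30 ms, 30-49 ms, 50-69 ms, 70-99 ms, 100+ ms.
edges = [30, 50, 70, 100, float("inf")]
counts = [12_000, 9_500, 5_000, 3_000, 500]
print(binned_percentile(edges, counts, 50))  # -> 50 (falls in 30-49 ms bin)
print(binned_percentile(edges, counts, 90))  # -> 100 (falls in 70-99 ms bin)
```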
As the linear cumulative distribution function (CDF) of Figure 2.11 obscures the latency tail along the “100%” line, a logarithmic complementary cumulative distribution function (CCDF) is the best way to visualize the latency tail. Note that while the CDF uses a linear scale for distribution on the y-axis, the CCDF uses a logarithmic scale on the y-axis to better visualize the extreme end of the tail. Figure 2.12 gives a CCDF of the same application's latency data set of Figure 2.11, and the tail behaviors for nominally the slowest 1 in 50,000 operations are radically different, with the slowest 1 in 100,000 operations of the 1.7 times “X” density being several times greater than at density 1.4 times “X.” Thus, if the quality of service criteria considered only typical (e.g., 50th percentile and 90th percentile) service latency, then the density of 1.7 times X workload—or perhaps even higher—might be acceptable. However, if the QoS criteria considered tail (e.g., 99.999th percentile or 10⁻⁵ on the CCDF) service latency, then the 1.4 times X workload might determine the maximum acceptable density.
Figure 2.12. Sample Tail Latency Variation by Workload Density.
While actual measured latency data often produces rather messy CCDFs, the results can be analyzed by considering the statistical distribution of the data. Figure 2.13 overlays three classes of statistical distributions onto a CCDF:
Concave (e.g., normal) distributions fall off very quickly on semi-log CCDF plots. For example, the slowest one in 10⁵ operations might be only three times slower than the slowest one in 10 operations.
Exponential distributions plot as straight lines on semi-log CCDFs, so the slowest one in 10⁵ operations might be five times slower than the slowest one in 10 operations.
Convex (e.g., power law) distributions fall off slower than exponential distributions, so the slowest one in 10⁵ operations might be several tens of times slower than the slowest one in 10 operations.
Figure 2.13. Understanding Complementary Cumulative Distribution Plots.
As one can see from Figure 2.12, real distributions might blend several classes of theoretical distributions, such as having a normal distribution to the slowest one in 10⁴ operations, and becoming power law farther out in the tail (perhaps starting around the slowest one in 50,000 operations).
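Computing a CCDF from raw latency samples is straightforward: sort the samples and, for each value, record the fraction of samples that are slower. The sketch below builds a CCDF and spot-checks it against synthetic exponential data, which should plot as a straight line on a semi-log CCDF as described above; the workload is simulated, not measured:

```python
import random

def ccdf(samples):
    """Return (value, fraction-of-samples-exceeding) pairs suitable for
    plotting a complementary cumulative distribution on a log y-axis."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(x, 1.0 - (i + 1) / n) for i, x in enumerate(ordered)]

# Synthetic exponential latencies with a 130 ms mean, mimicking the median
# of the sample application above (illustrative only).
random.seed(1)
samples = [random.expovariate(1 / 130.0) for _ in range(100_000)]
for value, fraction in ccdf(samples)[::20_000]:
    print(f"{value:8.1f} ms   P(latency > x) = {fraction:.5f}")
```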
There are two broad service latency related characteristics that one can attempt to optimize (visualized in Figure 2.14):
Minimizing “typical” latency, to shave milliseconds (or microseconds) off the typical or 50th percentile latency to improve median performance.
Minimizing “tail” latency, to reduce the number of operations that experience service latencies far beyond “typical,” thereby shrinking the latency “tail” and reducing distribution variance by eliminating elephants.
Figure 2.14. Service Latency Optimization Options.
As the root causes of typical and tail latency are often different, it is important to agree on exactly what characteristic to optimize so the applicable root causes can be identified and proper corrective actions deployed.
Reliability is defined by [TL_9000] as “the ability of an item to perform a required function under stated conditions for a stated time period.” Service reliability is the ability of an application to correctly process service requests within a maximum acceptable time. Service reliability impairments are sometimes called defective, failed, or fallout operations. While service reliability can be measured as a probability of success (e.g., 99.999% probability of success), probabilistic representations are not easy for many people to understand and are mathematically difficult to work with. Instead, sophisticated customers and suppliers often measure service reliability as defective (or failed) operations per million attempts (DPM). For example, seven defective operations per million attempts is much easier for most people to grasp than 99.9993% service reliability. In addition, DPM can often be combined by simply summing the DPM values along the critical service delivery path. Requirements for this measurement are discussed in Section 13.3, “Service Reliability Requirements.”
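The DPM arithmetic described above takes only a few lines; the component values below are hypothetical:

```python
def to_dpm(success_probability: float) -> float:
    """Convert a service reliability probability into defective operations
    per million attempts (DPM)."""
    return (1.0 - success_probability) * 1_000_000

def path_dpm(component_dpms):
    """Approximate end-to-end DPM by summing component DPM along the critical
    service delivery path (a good approximation while defect rates are small)."""
    return sum(component_dpms)

print(f"{to_dpm(0.999993):.1f} DPM")            # 99.9993% reliable -> 7.0 DPM
print(f"{path_dpm([7.0, 12.0, 3.5]):.1f} DPM")  # three serial components -> 22.5 DPM
```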
Application service accessibility is the probability of a user successfully establishing a new application service session or connection, such as to begin streaming video content or to begin an audio call or start an interactive game. Applications often have specific service accessibility metrics, such as telephony service accessibility impairments, which are sometimes called “failed call attempts.” Service accessibility is sometimes used as a proxy for service availability, such as in “ ‘Availability’ or ‘Available’ means that Customer is able to log on to the Application… .” Note that this work does not consider accessibility of application service for users with physical disabilities who may require modified service input, rendering of output, or operation. Requirements for this measurement are discussed in Section 13.4, “Service Accessibility Requirements.”
It is important to users of session-oriented services—like streaming video—that their session continue to operate uninterrupted with acceptable service quality until the session terminates normally (e.g., the streaming video completes). Service retainability is the probability that an existing service session will remain fully operational until the end user requests the session be terminated. Applications often have application-specific service retainability metrics, such as “dropped calls” or “premature releases” for telephony service retainability impairments. As the risk of a service retention failure increases with the duration of the service session, retainability is often either explicitly normalized by time (e.g., risk per minute of service session) or implicitly (e.g., retention risk for a 90-minute movie or a 10-minute online game or a 3-minute telephone call). For example, the risk of abnormal service disconnection during a 30-minute video call is nominally 10 times higher than the risk of disconnection for a 3-minute video call. Thus, retainability is the probability that an unacceptable service impacting event will affect a single user's active service session during the normalization window (e.g., per minute of user session). Requirements for this measurement are discussed in Section 13.5, “Service Retainability Requirements.”
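A minimal sketch of time-normalized retainability, under the simplifying assumption of a constant, independent drop probability per minute of session time:

```python
def session_retainability(drop_probability_per_min: float,
                          session_minutes: float) -> float:
    """Probability that a session survives to normal termination, assuming
    a constant, independent drop probability per session-minute."""
    return (1.0 - drop_probability_per_min) ** session_minutes

p = 0.0001  # hypothetical rate: one drop per 10,000 session-minutes
print(f"3-minute call:   {session_retainability(p, 3):.5%}")   # ~99.97000%
print(f"90-minute movie: {session_retainability(p, 90):.5%}")  # ~99.10401%
# While p is small, drop risk scales almost linearly with session duration:
# ~0.03% for 3 minutes versus ~0.90% for 90 minutes (roughly 30 times higher).
```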
Service throughput is the sustained rate of successful transaction processing, such as number of transactions processed per hour. Service throughput is generally considered a service capacity indicator, but failure to meet service throughput expectations for a (nominally) properly engineered configuration is often seen as a service quality problem. Service throughput is coupled to service reliability, since customers care most about successfully processed operations—sometimes called “goodput”—rather than counting unsuccessful or failed operations. For example, an application in overload may successfully return properly formed TOO BUSY responses to many user service requests to prevent application collapse, but few users would consider those TOO BUSY responses as successful throughput or goodput. Thus, sophisticated customers may specify throughput with a maximum acceptable transaction or fallout rate. Requirements for this measurement are discussed in Section 13.6, “Service Throughput Requirements.”
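A minimal sketch of the goodput distinction, assuming each transaction outcome is labeled; the outcome labels are hypothetical:

```python
def goodput_per_hour(outcomes, window_hours: float) -> float:
    """Sustained rate of successful transactions. Overload rejections such
    as TOO BUSY responses are well-formed but do not count as goodput."""
    return sum(1 for o in outcomes if o == "SUCCESS") / window_hours

# One hypothetical hour of traffic during mild overload.
outcomes = ["SUCCESS"] * 9_500 + ["TOO_BUSY"] * 400 + ["FAILED"] * 100
print(f"raw attempts/hour: {len(outcomes)}")                        # 10000
print(f"goodput/hour:      {goodput_per_hour(outcomes, 1.0):.0f}")  # 9500
print(f"fallout rate:      {100 / len(outcomes):.2%}")              # 1.00% failed
```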
Many applications must carefully record timestamps for billing, regulatory compliance, and operational reasons, such as fault correlation. Some applications and management systems use timestamps to record—and later reconstruct—the sequence and chronology of operations, so erroneous timestamps may produce a faulty chronology of the sequence of operations/events. While regulatory compliance and operational considerations may not be concerns of end users, operations and compliance staff are special users of many applications, and they may rely on accurate timestamps to do their jobs. As will be discussed in Section 4.7, “Clock Drift,” virtualization can impact the accuracy of real time perceived by application and guest OS software executing in a virtual machine instance relative to Coordinated Universal Time (UTC) compared with execution on native hardware. Requirements for this measurement are discussed in Section 13.7, “Timestamp Accuracy Requirements.”
Classes of applications often have application-specific service quality measurements that are tailored to the specific application, such as:
Mean Opinion Score characterizes the overall quality of experience as perceived by end users, especially for streaming services, such as voice calling, interactive video conferencing, and streaming video playback. Mean opinion scores (MOS) [P.800] are typically expressed via the five-point scale in Table 2.1. Service quality metrics of streaming applications are primarily impacted by the coding and decoding (aka codec) algorithm and implementation, packet loss, packet latency, and packet jitter. Sophisticated client applications can mask service quality impairments from the user by implementing dejitter buffers to mitigate minor packet delivery variations and implementing lost packet compensation algorithms when individual data packets are not available in time. Service quality impairments result in a worse overall quality of experience for the end user. High service quality for many applications requires low latency, low jitter, and minimal packet loss, although the degree of tolerance for jitter and packet loss is application and end user dependent. Service quality is primarily considered at the end user's physical rendering interface, such as the audio played to the user's ear or the video rendered before the user's eyes. Rendering of audio, video, and other service to users inherently integrates service impairments of the application itself along with packet latency, loss, and jitter across the access and wide area networking, as well as the quality and performance of the devices that both encoded and decoded the content. For example, the voice quality of wireless calls is limited by the voice coder/decoder (aka codec) used; the latency, jitter, and packet loss of the wireless access network; and the presence or absence of audio transcoding. The overall service quality impact of any individual component in the service delivery path (e.g., a cloud-based application) is generally difficult to quantitatively characterize. End-to-end service quality is considered in Chapter 10, “End-to-End Considerations.”
Audio/Video Synchronization (aka “lip sync”). Synchronization of audio and video is a key service quality for streaming video because if speech is shifted by more than about 50 ms relative to video images of the speaker's lips moving, then the viewers' quality of experience is degraded.
TABLE 2.1. Mean Opinion Scores [P.800]
MOS   Quality     Impairment
5     Excellent   Imperceptible
4     Good        Perceptible but not annoying
3     Fair        Slightly annoying
2     Poor        Annoying
1     Bad         Very annoying
The term “service quality” associated with applications is often used in two rather different contexts: technical service quality (Section 2.6.1) of an application instance or support service quality (Section 2.6.2) offered by a supplier or service provider to their customers. Throughout this book, the term “service quality” shall refer to technical service quality, not support service quality.
Technical service quality characterizes the application service delivered to users across the customer facing service boundary, such as service availability (Section 2.5.1), service latency (Section 2.5.2), and service reliability (Section 2.5.3).
Both suppliers and service providers routinely offer technical support services to their customers. Many readers will be familiar with traditional helpdesks or customer support service arrangements. As with technical service KQIs, support service KQIs vary based on the type of application or service being supported. Support service KQIs generally include three metrics:
Respond.