A holistic approach to service reliability and availability of cloud computing
Reliability and Availability of Cloud Computing provides IS/IT system and solution architects, developers, and engineers with the knowledge needed to assess the impact of virtualization and cloud computing on service reliability and availability. It reveals how to select the most appropriate design for reliability diligence to assure that user expectations are met. Organized in three parts (basics, risk analysis, and recommendations), this resource is accessible to readers of diverse backgrounds and experience levels. Numerous examples and more than 100 figures throughout the book help readers visualize problems to better understand the topic, and the authors present risks and options in bulleted lists that can be applied directly to specific applications and problems.
Special features of this book include:
* Rigorous analysis of the reliability and availability risks that are inherent in cloud computing
* Simple formulas that explain the quantitative aspects of reliability and availability
* Enlightening discussions of the ways in which virtualized applications and cloud deployments differ from traditional system implementations and deployments
* Specific recommendations for developing reliable virtualized applications and cloud-based solutions
Reliability and Availability of Cloud Computing is the guide for IS/IT staff in business, government, academia, and non-governmental organizations who are moving their applications to the cloud. It is also an important reference for professionals in technical sales, product management, and quality management, as well as software and quality engineers looking to broaden their expertise.
Table of Contents
COVER
IEEE PRESS
TITLE PAGE
COPYRIGHT PAGE
DEDICATION
FIGURES
TABLES
EQUATIONS
INTRODUCTION
AUDIENCE
ORGANIZATION
ACKNOWLEDGMENTS
I: BASICS
1 CLOUD COMPUTING
1.1 ESSENTIAL CLOUD CHARACTERISTICS
1.2 COMMON CLOUD CHARACTERISTICS
1.3 BUT WHAT, EXACTLY, IS CLOUD COMPUTING?
1.4 SERVICE MODELS
1.5 CLOUD DEPLOYMENT MODELS
1.6 ROLES IN CLOUD COMPUTING
1.7 BENEFITS OF CLOUD COMPUTING
1.8 RISKS OF CLOUD COMPUTING
2 VIRTUALIZATION
2.1 BACKGROUND
2.2 WHAT IS VIRTUALIZATION?
2.3 SERVER VIRTUALIZATION
2.4 VM LIFECYCLE
2.5 RELIABILITY AND AVAILABILITY RISKS OF VIRTUALIZATION
3 SERVICE RELIABILITY AND SERVICE AVAILABILITY
3.1 ERRORS AND FAILURES
3.2 EIGHT-INGREDIENT FRAMEWORK
3.3 SERVICE AVAILABILITY
3.4 SERVICE RELIABILITY
3.5 SERVICE LATENCY
3.6 REDUNDANCY AND HIGH AVAILABILITY
3.7 HIGH AVAILABILITY AND DISASTER RECOVERY
3.8 STREAMING SERVICES
3.9 RELIABILITY AND AVAILABILITY RISKS OF CLOUD COMPUTING
II: ANALYSIS
4 ANALYZING CLOUD RELIABILITY AND AVAILABILITY
4.1 EXPECTATIONS FOR SERVICE RELIABILITY AND AVAILABILITY
4.2 RISKS OF ESSENTIAL CLOUD CHARACTERISTICS
4.3 IMPACTS OF COMMON CLOUD CHARACTERISTICS
4.4 RISKS OF SERVICE MODELS
4.5 IT SERVICE MANAGEMENT AND AVAILABILITY RISKS
4.6 OUTAGE RISKS BY PROCESS AREA
4.7 FAILURE DETECTION CONSIDERATIONS
4.8 RISKS OF DEPLOYMENT MODELS
4.9 EXPECTATIONS OF IAAS DATA CENTERS
5 RELIABILITY ANALYSIS OF VIRTUALIZATION
5.1 RELIABILITY ANALYSIS TECHNIQUES
5.2 RELIABILITY ANALYSIS OF VIRTUALIZATION TECHNIQUES
5.3 SOFTWARE FAILURE RATE ANALYSIS
5.4 RECOVERY MODELS
5.5 APPLICATION ARCHITECTURE STRATEGIES
5.6 AVAILABILITY MODELING OF VIRTUALIZED RECOVERY OPTIONS
6 HARDWARE RELIABILITY, VIRTUALIZATION, AND SERVICE AVAILABILITY
6.1 HARDWARE DOWNTIME EXPECTATIONS
6.2 HARDWARE FAILURES
6.3 HARDWARE FAILURE RATE
6.4 HARDWARE FAILURE DETECTION
6.5 HARDWARE FAILURE CONTAINMENT
6.6 HARDWARE FAILURE MITIGATION
6.7 MITIGATING HARDWARE FAILURES VIA VIRTUALIZATION
6.8 VIRTUALIZED NETWORKS
6.9 MTTR OF VIRTUALIZED HARDWARE
6.10 DISCUSSION
7 CAPACITY AND ELASTICITY
7.1 SYSTEM LOAD BASICS
7.2 OVERLOAD, SERVICE RELIABILITY, AND SERVICE AVAILABILITY
7.3 TRADITIONAL CAPACITY PLANNING
7.4 CLOUD AND CAPACITY
7.5 MANAGING ONLINE CAPACITY
7.6 CAPACITY-RELATED SERVICE RISKS
7.7 CAPACITY MANAGEMENT RISKS
7.8 SECURITY AND SERVICE AVAILABILITY
7.9 ARCHITECTING FOR ELASTIC GROWTH AND DEGROWTH
8 SERVICE ORCHESTRATION ANALYSIS
8.1 SERVICE ORCHESTRATION DEFINITION
8.2 POLICY-BASED MANAGEMENT
8.3 CLOUD MANAGEMENT
8.4 SERVICE ORCHESTRATION’S ROLE IN RISK MITIGATION
8.5 SUMMARY
9 GEOGRAPHIC DISTRIBUTION, GEOREDUNDANCY, AND DISASTER RECOVERY
9.1 GEOGRAPHIC DISTRIBUTION VERSUS GEOREDUNDANCY
9.2 TRADITIONAL DISASTER RECOVERY
9.3 VIRTUALIZATION AND DISASTER RECOVERY
9.4 CLOUD COMPUTING AND DISASTER RECOVERY
9.5 GEOREDUNDANCY RECOVERY MODELS
9.6 CLOUD AND TRADITIONAL COLLATERAL BENEFITS OF GEOREDUNDANCY
9.7 DISCUSSION
III: RECOMMENDATIONS
10 APPLICATIONS, SOLUTIONS, AND ACCOUNTABILITY
10.1 APPLICATION CONFIGURATION SCENARIOS
10.2 APPLICATION DEPLOYMENT SCENARIO
10.3 SYSTEM DOWNTIME BUDGETS
10.4 END-TO-END SOLUTIONS CONSIDERATIONS
10.5 ATTRIBUTABILITY FOR SERVICE IMPAIRMENTS
10.6 SOLUTION SERVICE MEASUREMENT
10.7 MANAGING RELIABILITY AND SERVICE OF CLOUD COMPUTING
11 RECOMMENDATIONS FOR ARCHITECTING A RELIABLE SYSTEM
11.1 ARCHITECTING FOR VIRTUALIZATION AND CLOUD
11.2 DISASTER RECOVERY
11.3 IT SERVICE MANAGEMENT CONSIDERATIONS
11.4 MANY DISTRIBUTED CLOUDS VERSUS FEWER HUGE CLOUDS
11.5 MINIMIZING HARDWARE-ATTRIBUTED DOWNTIME
11.6 ARCHITECTURAL OPTIMIZATIONS
12 DESIGN FOR RELIABILITY OF VIRTUALIZED APPLICATIONS
12.1 DESIGN FOR RELIABILITY
12.2 TAILORING DFR FOR VIRTUALIZED APPLICATIONS
12.3 RELIABILITY REQUIREMENTS
12.4 QUALITATIVE RELIABILITY ANALYSIS
12.5 QUANTITATIVE RELIABILITY BUDGETING AND MODELING
12.6 ROBUSTNESS TESTING
12.7 STABILITY TESTING
12.8 FIELD PERFORMANCE ANALYSIS
12.9 RELIABILITY ROADMAP
12.10 HARDWARE RELIABILITY
13 DESIGN FOR RELIABILITY OF CLOUD SOLUTIONS
13.1 SOLUTION DESIGN FOR RELIABILITY
13.2 SOLUTION SCOPE AND EXPECTATIONS
13.3 RELIABILITY REQUIREMENTS
13.4 SOLUTION MODELING AND ANALYSIS
13.5 ELEMENT RELIABILITY DILIGENCE
13.6 SOLUTION TESTING AND VALIDATION
13.7 TRACK AND ANALYZE FIELD PERFORMANCE
13.8 OTHER SOLUTION RELIABILITY DILIGENCE TOPICS
14 SUMMARY
14.1 SERVICE RELIABILITY AND SERVICE AVAILABILITY
14.2 FAILURE ACCOUNTABILITY AND CLOUD COMPUTING
14.3 FACTORING SERVICE DOWNTIME
14.4 SERVICE AVAILABILITY MEASUREMENT POINTS
14.5 CLOUD CAPACITY AND ELASTICITY CONSIDERATIONS
14.6 MAXIMIZING SERVICE AVAILABILITY
14.7 RELIABILITY DILIGENCE
14.8 CONCLUDING REMARKS
ABBREVIATIONS
REFERENCES
ABOUT THE AUTHORS
INDEX
IEEE Press
445 Hoes Lane
Piscataway, NJ 08854
IEEE Press Editorial Board 2012
John Anderson, Editor in Chief
Ramesh Abhari
Bernhard M. Haemmerli
Saeid Nahavandi
George W. Arnold
David Jacobson
Tariq Samad
Flavio Canavero
Mary Lanzerotti
George Zobrist
Dmitry Goldgof
Om P. Malik
Kenneth Moore, Director of IEEE Book and Information Services (BIS)
Technical Reviewers
Xuemei Zhang
Principal Member of Technical Staff
Network Design and Performance Analysis
AT&T Labs
Rocky Heckman, CISSP
Architect Advisor
Microsoft
cover image: © iStockphoto
cover design: Michael Rutkowski
ITIL® is a Registered Trademark of the Cabinet Office in the United Kingdom and other countries.
Copyright © 2012 by the Institute of Electrical and Electronics Engineers. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Bauer, Eric.
Reliability and availability of cloud computing / Eric Bauer, Randee Adams.
p. cm.
ISBN 978-1-118-17701-3 (hardback)
1. Cloud computing. 2. Computer software–Reliability. 3. Computer software–Quality control. 4. Computer security. I. Adams, Randee. II. Title.
QA76.585.B394 2012
004.6782–dc23
2011052839
To our families and friends for their continued encouragement and support.
FIGURES
Figure 1.1
Service Models
Figure 1.2
OpenCrowd’s Cloud Taxonomy
Figure 1.3
Roles in Cloud Computing
Figure 2.1
Virtualizing Resources
Figure 2.2
Type 1 and Type 2 Hypervisors
Figure 2.3
Full Virtualization
Figure 2.4
Paravirtualization
Figure 2.5
Operating System Virtualization
Figure 2.6
Virtualized Machine Lifecycle State Transitions
Figure 3.1
Fault Activation and Failures
Figure 3.2
Minimum Chargeable Service Disruption
Figure 3.3
Eight-Ingredient (“8i”) Framework
Figure 3.4
Eight-Ingredient plus Data plus Disaster (8i + 2d) Model
Figure 3.5
MTBF and MTTR
Figure 3.6
Service and Network Element Impact Outages of Redundant Systems
Figure 3.7
Sample DSL Solution
Figure 3.8
Transaction Latency Distribution for Sample Service
Figure 3.9
Requirements Overlaid on Service Latency Distribution for Sample Solution
Figure 3.10
Maximum Acceptable Service Latency
Figure 3.11
Downtime of Simplex Systems
Figure 3.12
Downtime of Redundant Systems
Figure 3.13
Simplified View of High Availability
Figure 3.14
High Availability Example
Figure 3.15
Disaster Recovery Objectives
Figure 3.16
ITU-T G.114 Bearer Delay Guideline
Figure 4.1
TL 9000 Outage Attributability Overlaid on Augmented 8i + 2d Framework
Figure 4.2
Outage Responsibilities Overlaid on Cloud 8i + 2d Framework
Figure 4.3
ITIL Service Management Visualization
Figure 4.4
IT Service Management Activities to Minimize Service Availability Risk
Figure 4.5
8i + 2d Attributability by Process or Best Practice Areas
Figure 4.6
Traditional Error Vectors
Figure 4.7
IaaS Provider Responsibilities for Traditional Error Vectors
Figure 4.8
Software Supplier (and SaaS) Responsibilities for Traditional Error Vectors
Figure 5.1
Sample Reliability Block Diagram
Figure 5.2
Traversal of Sample Reliability Block Diagram
Figure 5.3
Nominal System Reliability Block Diagram
Figure 5.4
Reliability Block Diagram of Full Virtualization
Figure 5.5
Reliability Block Diagram of OS Virtualization
Figure 5.6
Reliability Block Diagram of Paravirtualization
Figure 5.7
Reliability Block Diagram of Coresident Application Deployment
Figure 5.8
Canonical Virtualization RBD
Figure 5.9
Latency of Traditional Recovery Options
Figure 5.10
Traditional Active-Standby Redundancy via Active VM Virtualization
Figure 5.11
Reboot of a Virtual Machine
Figure 5.12
Reset of a Virtual Machine
Figure 5.13
Redundancy via Paused VM Virtualization
Figure 5.14
Redundancy via Suspended VM Virtualization
Figure 5.15
Nominal Recovery Latency of Virtualized and Traditional Options
Figure 5.16
Server Consolidation Using Virtualization
Figure 5.17
Simplified Simplex State Diagram
Figure 5.18
Downtime Drivers for Redundancy Pairs
Figure 6.1
Hardware Failure Rate Questions
Figure 6.2
Application Reliability Block Diagram with Virtual Devices
Figure 6.3
Virtual CPU
Figure 6.4
Virtual NIC
Figure 7.1
Sample Application Resource Utilization by Time of Day
Figure 7.2
Example of Extraordinary Event Traffic Spike
Figure 7.3
The Slashdot Effect: Traffic Load Over Time (in Hours)
Figure 7.4
Offered Load, Service Reliability, and Service Availability of a Traditional System
Figure 7.5
Visualizing VM Growth Scenarios
Figure 7.6
Nominal Capacity Model
Figure 7.7
Implementation Architecture of Compute Capacity Model
Figure 7.8
Orderly Reconfiguration of the Capacity Model
Figure 7.9
Slew Rate of Square Wave Amplification
Figure 7.10
Slew Rate of Rapid Elasticity
Figure 7.11
Elasticity Timeline by ODCA SLA Level
Figure 7.12
Capacity Management Process
Figure 7.13
Successful Cloud Elasticity
Figure 7.14
Elasticity Failure Model
Figure 7.15
Virtualized Application Instance Failure Model
Figure 7.16
Canonical Capacity Management Failure Scenarios
Figure 7.17
ITU X.805 Security Dimensions, Planes, and Layers
Figure 7.18
Leveraging Security and Network Infrastructure to Mitigate Overload Risk
Figure 8.1
Service Orchestration
Figure 8.2
Example of Cloud Bursting
Figure 10.1
Canonical Single Data Center Application Deployment Architecture
Figure 10.2
RBD of Sample Application on Blade-Based Server Hardware
Figure 10.3
RBD of Sample Application on IaaS Platform
Figure 10.4
Sample End-to-End Solution
Figure 10.5
Sample Distributed Cloud Architecture
Figure 10.6
Sample Recovery Scenario in Distributed Cloud Architecture
Figure 10.7
Simplified Responsibilities for a Canonical Cloud Application
Figure 10.8
Recommended Cloud-Related Service Availability Measurement Points
Figure 10.9
Canonical Example of MP 1 and MP 2
Figure 10.10
End-to-End Service Availability Key Quality Indicators
Figure 11.1
Virtual Machine Live Migration
Figure 11.2
Active–Standby Markov Model
Figure 11.3
Pie Chart of Canonical Hardware Downtime Prediction
Figure 11.4
RBD for the Hypothetical Web Server Application
Figure 11.5
Horizontal Growth of Hypothetical Application
Figure 11.6
Outgrowth of Hypothetical Application
Figure 11.7
Aggressive Protocol Retry Strategy
Figure 11.8
Data Replication of Hypothetical Application
Figure 11.9
Disaster Recovery of Hypothetical Application
Figure 11.10
Optimal Availability Architecture of Hypothetical Application
Figure 12.1
Traditional Design for Reliability Process
Figure 12.2
Mapping Virtual Machines across Hypervisors
Figure 12.3
A Virtualized Server Failure Scenario
Figure 12.4
Robustness Testing Vectors for Virtualized Applications
Figure 12.5
System Design for Reliability as a Deming Cycle
Figure 13.1
Solution Design for Reliability
Figure 13.2
Sample Solution Scope and KQI Expectations
Figure 13.3
Sample Cloud Data Center RBD
Figure 13.4
Estimating MP 2
Figure 13.5
Modeling Cloud-Based Solution with Client-Initiated Recovery Model
Figure 13.6
Client-Initiated Recovery Model
Figure 14.1
Failure Impact Duration and High Availability Goals
Figure 14.2
Eight-Ingredient Plus Data Plus Disaster (8i + 2d) Model
Figure 14.3
Traditional Outage Attributability
Figure 14.4
Sample Outage Accountability Model for Cloud Computing
Figure 14.5
Outage Responsibilities of Cloud by Process
Figure 14.6
Measurement Points (MPs) 1, 2, 3, and 4
Figure 14.7
Design for Reliability of Cloud-Based Solutions
TABLES
Table 2.1
Comparison of Server Virtualization Technologies
Table 2.2
Virtual Machine Lifecycle Transitions
Table 3.1
Service Availability and Downtime Ratings
Table 3.2
Mean Opinion Scores
Table 4.1
ODCA’s Data Center Classification
Table 4.2
ODCA’s Data Center Service Availability Expectations by Classification
Table 5.1
Example Failure Mode Effects Analysis
Table 5.2
Failure Mode Effect Analysis Figure for Coresident Applications
Table 5.3
Comparison of Nominal Software Availability Parameters
Table 6.1
Example of Hardware Availability as a Function of MTTR/MTTRS
Table 7.1
ODCA IaaS Elasticity Objectives
Table 9.1
ODCA IaaS Recoverability Objectives
Table 10.1
Sample Traditional Five 9’s Downtime Budget
Table 10.2
Sample Basic Virtualized Five 9’s Downtime Budget
Table 10.3
Canonical Application-Attributable Cloud-Based Five 9’s Downtime Budget
Table 10.4
Evolution of Sample Downtime Budgets
Table 11.1
Example Service Transition Activity Failure Mode Effect Analysis
Table 11.2
Canonical Hardware Downtime Prediction
Table 11.3
Summary of Hardware Downtime Mitigation Techniques for Cloud Computing
Table 12.1
Sample Service Latency and Reliability Requirements at MP 2
Table 13.1
Sample Solution Latency and Reliability Requirements
Table 13.2
Modeling Input Parameters
Table 14.1
Evolution of Sample Downtime Budgets
EQUATIONS
Equation 3.1
Basic Availability Formula
Equation 3.2
Practical System Availability Formula
Equation 3.3
Standard Availability Formula
Equation 3.4
Estimation of System Availability from MTBF and MTTR
Equation 3.5
Recommended Service Availability Formula
Equation 3.6
Sample Partial Outage Calculation
Equation 3.7
Service Reliability Formula
Equation 3.8
DPM Formula
Equation 3.9
Converting DPM to Service Reliability
Equation 3.10
Converting Service Reliability to DPM
Equation 3.11
Sample DPM Calculation
Equation 6.1
Availability as a Function of MTBF/MTTR
Equation 11.1
Maximum Theoretical Availability across Redundant Elements
Equation 11.2
Maximum Theoretical Service Availability
INTRODUCTION
Cloud computing is a new paradigm for delivering information services to end users, offering distinct advantages over traditional IS/IT deployment models, including lower cost and shorter time to market. Cloud computing is defined by a handful of essential characteristics: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. Cloud providers offer a variety of service models, including infrastructure as a service, platform as a service, and software as a service; cloud deployment options include private, community, public, and hybrid clouds. End users naturally expect services offered via cloud computing to deliver at least the same service reliability and service availability as traditional service implementation models. This book analyzes the risks that cloud-based application deployments will not achieve the same service reliability and availability as traditional deployments, as well as the opportunities to improve service reliability and availability via cloud deployment. We consider the service reliability and service availability risks that follow from the fundamental definition of cloud computing (the essential characteristics), rather than focusing on any particular virtualization hypervisor software or cloud service offering; thus, the insights and recommendations of this higher-level analysis should apply to all cloud service offerings and application deployments. This book also offers recommendations on architecture, testing, and engineering diligence to assure that cloud-deployed applications meet users’ expectations for service reliability and service availability.
Virtualization technology enables enterprises to move their existing applications from traditional deployment scenarios in which applications are installed directly on native hardware to more evolved scenarios that include hardware independence and server consolidation. Use of virtualization technology is a common characteristic of cloud computing that enables cloud service providers to better manage usage of their resource pools by multiple cloud consumers. This book also considers the reliability and availability risks along this evolutionary path to guide enterprises planning the evolution of their application to virtualization and on to full cloud computing enablement over several releases.
The book is intended for IS/IT system and solution architects, developers, and engineers, as well as technical sales, product management, and quality management professionals.
The book is organized into three parts: Part I, “Basics”; Part II, “Analysis”; and Part III, “Recommendations.” Part I, “Basics,” defines key terms and concepts of cloud computing, virtualization, service reliability, and service availability. Part I contains three chapters:
Chapter 1, “Cloud Computing.”
This book uses the cloud terminology and taxonomy defined by the U.S. National Institute of Standards and Technology. This chapter defines cloud computing and reviews the essential and common characteristics of cloud computing. Standard service and deployment models of cloud computing are reviewed, as well as roles of key cloud-related actors. Key benefits and risks of cloud computing are summarized.
Chapter 2, “Virtualization.”
Virtualization is a common characteristic of cloud computing. This chapter reviews virtualization technology, offers architectural models for virtualization that will be analyzed, and compares and contrasts “virtualized” applications to “native” applications.
Chapter 3, “Service Reliability and Service Availability.”
This chapter defines service reliability and availability concepts, reviews how those metrics are measured in traditional deployments, and explains how they apply to virtualized and cloud-based deployments. Because the telecommunications industry has very precise standards for quantifying service availability and service reliability, concepts and terminology from the telecom industry are presented in this chapter and used throughout Part II, “Analysis,” and Part III, “Recommendations.”
Part II, “Analysis,” methodically analyzes the service reliability and availability risks inherent in application deployments on cloud computing and virtualization technology based on the essential and common characteristics given in Part I.
Chapter 4, “Analyzing Cloud Reliability and Availability.”
This chapter considers the service reliability and service availability risks inherent in the essential and common characteristics, service models, and deployment models of cloud computing, including the implications of service transition activities, elasticity, and service orchestration. The risks identified here are analyzed in detail in the subsequent chapters of Part II.
Chapter 5, “Reliability Analysis of Virtualization.”
This chapter analyzes full virtualization, OS virtualization, paravirtualization, and server virtualization and coresidency using standard reliability analysis methodologies; it also analyzes the software reliability risks of virtualization and cloud computing.
Chapter 6, “Hardware Reliability, Virtualization, and Service Availability.”
This chapter considers how hardware reliability risks and responsibilities shift as applications migrate to virtualized and cloud-based hardware platforms, and how hardware attributed service downtime is determined.
Chapter 7, “Capacity and Elasticity.”
The essential cloud characteristic of rapid elasticity enables cloud consumers to dispense with the business risk of locking in resources weeks or months ahead of demand. Rapid elasticity does, however, introduce new risks to service quality, reliability, and availability that must be carefully managed.
Chapter 8, “Service Orchestration Analysis.”
Service orchestration automates various aspects of IT service management, especially activities associated with capacity management. This chapter reviews policy-based management in the context of cloud computing and considers the associated risks to service reliability and service availability.
Chapter 9, “Geographic Distribution, Georedundancy, and Disaster Recovery.”
Geographic distribution of application instances is a common characteristic of cloud computing and a best practice for disaster recovery. This chapter considers the service availability implications of georedundancy on applications deployed in clouds.
Part III, “Recommendations,” considers techniques to maximize service reliability and service availability of applications deployed on clouds, as well as the design for reliability diligence to assure that virtualized applications and cloud based solutions meet or exceed the service reliability and availability of traditional deployments.
Chapter 10, “Applications, Solutions, and Accountability.”
This chapter considers how virtualized applications fit into service solutions, and explains how application service downtime budgets change as applications move to the cloud. This chapter also proposes four measurement points for service availability, and discusses how accountability for impairments in each of those measurement points is attributed.
Chapter 11, “Recommendations for Architecting a Reliable System.”
This chapter covers architectures and techniques to maximize service availability and service reliability via virtualization and cloud deployment. A simple case study is given to illustrate key architectural points.
Chapter 12, “Design for Reliability of Virtualized Applications.”
This chapter reviews how design for reliability diligence for virtualized applications differs from reliability diligence for traditional applications.
Chapter 13, “Design for Reliability of Cloud Solutions.”
This chapter reviews how design for reliability diligence for cloud deployments differs from reliability diligence for traditional solutions.
Chapter 14, “Summary.”
This chapter gives an executive summary of the analysis, insights, and recommendations for assuring that the reliability and availability of cloud-based solutions meet or exceed the performance of traditional deployments.
The authors were greatly assisted by many deeply knowledgeable and insightful engineers at Alcatel-Lucent, especially: Mark Clougherty, Herbert Ristock, Shawa Tam, Rich Sohn, Bernard Bretherton, John Haller, Dan Johnson, Srujal Shah, Alan McBride, Lyle Kipp, and Ted East. Joe Tieu, Bill Baker, and Thomas Voith carefully reviewed the early manuscript and provided keen review feedback. Abhaya Asthana, Kasper Reinink, Roger Maitland, and Mark Cameron provided valuable input. Gary McElvany raised the initial architectural questions that ultimately led to this work. This work would not have been possible without the strong management support of Tina Hinch, Werner Heissenhuber, Annie Lequesne, Vickie Owens-Rinn, and Dor Skuler.
Cloud computing is an exciting, evolving technology with many avenues to explore. Readers with comments or corrections on topics covered in this book, or topics for a future edition of this book, are invited to send email to the authors ([email protected], [email protected], or [email protected]).
Eric Bauer
Randee Adams
I
BASICS
1
CLOUD COMPUTING
The U.S. National Institute of Standards and Technology (NIST) defines cloud computing as follows:
Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction
[NIST-800-145].
This definition frames cloud computing as a “utility” (or a “pay as you go”) consumption model for computing services, similar to the utility model deployed for electricity, water, and telecommunication service. Once a user is connected to the computing (or telecommunications, electricity, or water utility) cloud, they can consume as much service as they would like whenever they would like (within reasonable limits), and are billed for the resources consumed. Because the resources delivering the service can be shared (and hence amortized) across a broad pool of users, resource utilization and operational efficiency can be higher than they would be for dedicated resources for each individual user, and thus the price of the service to the consumer may well be lower from a cloud/utility provider compared with the alternative of deploying and operating private resources to provide the same service. Overall, these characteristics facilitate outsourcing production and delivery of these crucial “utility” services. For example, how many individuals or enterprises prefer to generate all of their own electricity rather than purchasing it from a commercial electric power supplier?
This chapter reviews the essential characteristics of cloud computing, as well as several common characteristics of cloud computing, considers how cloud data centers differ from traditional data centers, and discusses the cloud service and cloud deployment models. The terminologies for the various roles in cloud computing that will be used throughout the book are defined. The chapter concludes by reviewing the benefits and risks of cloud computing.
Per [NIST-800-145], there are five essential characteristics of cloud computing: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. Each of these is considered individually below.
Per [NIST-800-145], the essential cloud characteristic of “on-demand self-service” means “a consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service’s provider.” Modern telecommunications networks offer on-demand self-service: one has direct dialing access to any other telephone whenever one wants. This contrasts with decades ago, when callers had to ask a human operator to place a long-distance or international call on their behalf. In a traditional data center, users might have to order server resources to host applications weeks or months in advance. In the cloud computing context, on-demand self-service means that resources are “instantly” available to service user requests, such as via a service/resource provisioning website or via API calls.
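To make programmatic self-service concrete, here is a minimal sketch of what an on-demand provisioning request might look like against a hypothetical IaaS REST endpoint; the URL, token, payload fields, and response shape are illustrative assumptions rather than any particular provider's API.

```python
# Hypothetical example of on-demand self-service: a cloud consumer provisions
# a server programmatically, with no human interaction with the provider.
import requests

PROVISIONING_URL = "https://api.example-cloud.invalid/v1/servers"  # hypothetical endpoint
API_TOKEN = "example-token"                                        # hypothetical credential

def provision_server(cpu_cores: int, memory_gb: int, storage_gb: int) -> dict:
    """Request a new virtual server and return the provider's response."""
    payload = {
        "cpu_cores": cpu_cores,
        "memory_gb": memory_gb,
        "storage_gb": storage_gb,
    }
    response = requests.post(
        PROVISIONING_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # e.g., {"server_id": "...", "state": "building"}

if __name__ == "__main__":
    server = provision_server(cpu_cores=2, memory_gb=4, storage_gb=100)
    print("Provisioned:", server)
```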
Per [NIST-800-145] “broad network access” means “capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).” Users expect to access cloud-based services anywhere there is adequate IP networking, rather than requiring the user to be in a particular physical location. With modern wireless networks, users expect good quality wireless service anywhere they go. In the context of cloud computing, this means users want to access the cloud-based service via whatever wireline or wireless network device they wish to use over whatever IP access network is most convenient.
Per [NIST-800-145], the essential characteristic of “resource pooling” is defined as: “the provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand.” Service providers deploy a pool of servers, storage devices, and other data center resources that are shared across many users to reduce costs to the service provider, as well as to the cloud consumers that pay for cloud services. Ideally, the cloud service provider will intelligently select which resources from the pool to assign to each cloud consumer’s workload to optimize the quality of service experienced by each user. For example, resources located on servers physically close to the end user (and which thus introduce less transport latency) may be selected, and alternate resources can be automatically engaged to mitigate the impact of a resource failure event. This is essentially the utility model applied to computing. For example, electricity consumers don’t expect that a specific electrical generator has been dedicated to them personally (or perhaps to their town); they just want to know that their electricity supplier has pooled the generator resources so that the utility will reliably deliver electricity despite inevitable failures, variations in load, and glitches.
Computing resources are generally used on a very bursty basis (e.g., when a key is pressed or a button is clicked). Timeshared operating systems were developed decades ago to enable a pool of users or applications with bursty demands to efficiently share a powerful computing resource. Today’s personal computer operating systems routinely support many simultaneous applications on a PC or laptop, such as simultaneously viewing multiple browser windows, doing e-mail, and instant messaging, and having virus and malware scanners running in the background, as well as all the infrastructure software that controls the keyboard, mouse, display, networking, real-time clock, and so on. Just as intelligent resource sharing on your PC enables more useful work to be done cost effectively than would be possible if each application had a dedicated computing resource, intelligent resource sharing in a computing cloud environment enables more applications to be served on less total computing hardware than would be required with dedicated computing resources. This resource sharing lowers costs for the data center hosting the computing resources for each application, and this enables lower prices to be charged to cloud consumers than would be possible for dedicated computing resources.
[NIST-800-145] describes “rapid elasticity” as “capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out, and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.”
Forecasting future demand is always hard, and there is always the risk that unforeseen events will change plans and thereby increase or decrease the demand for service. For example, electricity demand spikes on hot summer afternoons when customers crank up their air conditioners, and business applications have peak usage during business hours, while entertainment applications peak in evenings and on weekends. In addition, most application services have time of day, day of week, and seasonal variations in traffic volumes. Elastically increasing service capacity during busy periods and releasing capacity during off-peak periods enables cloud consumers to minimize costs while meeting service quality expectations. For example, retailers might experience heavy workloads during the holiday shopping season and light workloads the rest of the year; elasticity enables them to pay only for the computing resources they need in each season, thereby enabling computing expenses to track more closely with revenue. Likewise, an unexpectedly popular service or particularly effective marketing campaign can cause demand for a service to spike beyond planned service capacity. End users expect available resources to “magically” expand to accommodate the offered service load with acceptable service quality. For cloud computing, this means all users are served with acceptable service quality rather than receiving “busy” or “try again later” messages, or experiencing unacceptable service latency or quality.
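As a rough illustration of how elastic capacity can track demand, the following sketch implements a simple threshold-based scaling decision of the kind an autoscaling policy might apply; the utilization thresholds and instance limits are illustrative assumptions, not values from this book.

```python
# Minimal sketch of a threshold-based elasticity policy: grow capacity when
# utilization runs hot, shrink it when utilization runs cold.

def scaling_decision(current_instances: int,
                     avg_utilization: float,
                     scale_out_threshold: float = 0.70,
                     scale_in_threshold: float = 0.30,
                     min_instances: int = 2,
                     max_instances: int = 20) -> int:
    """Return the desired number of application instances."""
    desired = current_instances
    if avg_utilization > scale_out_threshold:
        desired = current_instances + 1          # grow during busy periods
    elif avg_utilization < scale_in_threshold:
        desired = current_instances - 1          # release capacity off-peak
    return max(min_instances, min(max_instances, desired))

# Example: a holiday-season traffic spike pushes utilization to 85%.
print(scaling_decision(current_instances=4, avg_utilization=0.85))  # -> 5
print(scaling_decision(current_instances=4, avg_utilization=0.20))  # -> 3
```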
Just as electricity utilities can usually source additional electric power from neighboring electricity suppliers when their users’ demand outstrips the utility’s generating capacity, arrangements can be made to overflow applications from one cloud that is operating at capacity to other clouds that have available capacity. This notion of gracefully overflowing application load from one cloud to other clouds is called “cloud bursting.”
[NIST-800-145] describes the essential cloud computing characteristic of “measured service” as “cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and the consumer of the utilized service.” Cloud consumers want the option of usage-based (or pay-as-you-go) pricing in which their price is based on the resources actually consumed, rather than being locked into a fixed pricing arrangement. Measuring resource consumption and charging cloud consumers for what they actually consume encourages them not to squander resources and to release unneeded resources so that those resources can be used by other cloud consumers.
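The arithmetic behind usage-based pricing is simple: meter each resource and multiply by a rate. The sketch below illustrates this pay-as-you-go model; the resource categories and rates are purely illustrative assumptions.

```python
# Minimal sketch of measured service: meter resource usage and bill only
# for what was actually consumed (illustrative rates, not real prices).

RATES = {
    "vm_hours": 0.10,           # currency units per VM-hour
    "storage_gb_months": 0.05,  # per GB-month stored
    "egress_gb": 0.08,          # per GB of outbound traffic
}

def monthly_charge(usage: dict) -> float:
    """Compute a pay-as-you-go charge from metered usage."""
    return sum(RATES[resource] * quantity for resource, quantity in usage.items())

# Example: three VMs running for a 720-hour month, plus storage and egress.
metered_usage = {"vm_hours": 3 * 720, "storage_gb_months": 200, "egress_gb": 50}
print(f"Charge: {monthly_charge(metered_usage):.2f}")
```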
NIST originally included eight common characteristics of cloud computing in their definition [NIST-B], but as these characteristics were not essential, they were omitted from the formal definition of cloud computing. Nevertheless, six of these eight common characteristics do impact service reliability and service availability, and thus will be considered later in this book.
Virtualization. By untethering application software from specific dedicated hardware, virtualization technology (discussed in Chapter 2, “Virtualization”) gives cloud service providers control to manage workloads across massive pools of compute servers.
Geographic Distribution. Having multiple geographically distributed data center sites enables cloud providers flexibility to assign a workload to resources close to the end user. For example, for real-time gaming, users are more likely to have an excellent quality of experience via low service latency if they are served by resources geographically close to them than if they are served by resources on another continent. In addition, geographic distribution in the form of georedundancy is essential for disaster recovery and business continuity planning. Operationally, this means engineering for sufficient capacity and network access across several geographically distributed sites so that a single disaster will not adversely impact more than that single site, and the impacted workload can be promptly redeployed to nonaffected sites.
Resilient Computing. Hardware devices, like hard disk drives, wear out and fail for well-understood physical reasons. As the pool of hardware resources increases, the probability that some hardware device will fail in any week, day, or hour increases as well. Likewise, as the number of online servers increases, so does the risk that software running on one of those online server instances will fail. Thus, cloud computing applications and infrastructure must be designed to routinely detect, diagnose, and recover service following inevitable failures without causing unacceptable impairments to user service.
Advanced Security. Computing clouds are big targets for cybercriminals and others intent on disrupting service, and the homogeneity and massive scale of clouds make them particularly appealing. Advanced security techniques, tools, and policies are essential to assure that malevolent individuals or organizations don’t penetrate the cloud and compromise application service or data.
Massive Scale. To maximize operational efficiencies that drive down costs, successful cloud deployments will be of massive scale.
Homogeneity. To maximize operational efficiencies, successful cloud deployments will limit the range of different hardware, infrastructure, software platforms, policies, and procedures they support.
Fundamentally, cloud computing is a new business model for operating data centers. Thus, one can consider cloud computing in two steps: first, what a traditional data center is and does; and second, how the cloud business model changes the way data center resources are offered to and consumed by cloud consumers.
A data center is an environmentally controlled physical space with clean electrical power and network connectivity that is optimized for hosting servers. The temperature and humidity of the data center are controlled to enable proper operation of the equipment, and the facility is physically secured to prevent deliberate or accidental damage to the equipment. The facility will have one or more connections to the public Internet, often via redundant and physically separated cables into redundant routers. Behind the routers are security appliances, such as firewalls or deep packet inspection elements, that enforce a security perimeter protecting the servers in the data center. Behind the security appliances are often load balancers, which distribute traffic across front-end servers such as web servers. There are often one or two further tiers of servers behind the application front end, such as a second tier of servers implementing application or business logic and a third tier of database servers. Establishing and operating a traditional data center facility, including IP routers and infrastructure, security appliances, load balancers, servers, storage, and supporting systems, requires a large capital outlay and substantial operating expenses, all to support application software that often has widely varying load, so that much of the resource capacity is frequently underutilized.
The Uptime Institute [Uptime and TIA942] defines four tiers of data centers that characterize the risk of service impact (i.e., downtime) due to both service management activities and unplanned failures:
Tier I. Basic
Tier II. Redundant components
Tier III. Concurrently maintainable
Tier IV. Fault tolerant
Tier I “basic” data centers must be completely shut down to execute planned and preventive maintenance, and are fully exposed to unplanned failures. [UptimeTiers] offers “Tier 1 sites typically experience 2 separate 12-hour, site-wide shutdowns per year for maintenance or repair work. In addition, across multiple sites and over a number of years, Tier I sites experience 1.2 equipment or distribution failures on an average year.” This translates to a data center availability rating of 99.67% with nominally 28.8 hours of downtime per year.
Tier II “redundant component” data centers include some redundancy and so are less exposed to service downtime. [UptimeTiers] offers “the redundant components of Tier II topology provide some maintenance opportunity leading to just 1 site-wide shutdown each year and reduce the number of equipment failures that affect the IT operations environment.” This translates to a data center availability rating of 99.75% with nominally 22 hours of downtime per year.
Tier III “concurrently maintainable” data centers are designed with sufficient redundancy that all service transition activities can be completed without disrupting service. [UptimeTiers] offers “experience in actual data centers shows that operating better maintained systems reduces unplanned failures to a 4-hour event every 2.5 years. … ” This translates to a data center availability rating of 99.98%, with nominally 1.6 hours of downtime per year.
Tier IV “fault tolerant” data centers are designed to withstand any single failure and to permit service transition activities, such as software upgrades, to complete with no service impact. [UptimeTiers] offers “Tier IV provides robust, Fault Tolerant site infrastructure, so that facility events affecting the computer room are empirically reduced to (1) 4-hour event in a 5 year operating period. … ” This translates to a data center availability rating of 99.99%, with nominally 0.8 hours of downtime per year.
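The nominal downtime figures quoted for the four tiers follow directly from their availability ratings: annual downtime is simply (1 - availability) times the hours in a year. The short sketch below performs this conversion for the rounded availability percentages quoted above (assuming an 8760-hour year); the small differences from the nominal hours in the text reflect rounding in the quoted ratings.

```python
# Convert a data center availability rating into nominal annual downtime.
HOURS_PER_YEAR = 365 * 24  # 8760 hours

def annual_downtime_hours(availability_percent: float) -> float:
    return (1 - availability_percent / 100.0) * HOURS_PER_YEAR

for tier, availability in [("Tier I", 99.67), ("Tier II", 99.75),
                           ("Tier III", 99.98), ("Tier IV", 99.99)]:
    print(f"{tier}: {availability}% -> {annual_downtime_hours(availability):.1f} hours/year")
# Tier I:   99.67% -> 28.9 hours/year (text quotes ~28.8)
# Tier II:  99.75% -> 21.9 hours/year (text quotes ~22)
# Tier III: 99.98% ->  1.8 hours/year (text quotes ~1.6)
# Tier IV:  99.99% ->  0.9 hours/year (text quotes ~0.8)
```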
Not only are data centers expensive to build and maintain, but deploying an application into a data center may mean purchasing and installing the computing resources to host that application. Purchasing computing resources implies a need to do careful capacity planning to decide exactly how much computing resource to invest in; purchase too little, and users will experience poor service; purchase too much and excess resources will be unused and stranded. Just as electrical power utilities pool electric power-generating capacity to offer electric power as a service, cloud computing pools computing resources, offers those resources to cloud consumers on-demand, and bills cloud consumers for resources actually used. Virtualization technology makes operation and management of pooled computing resources much easier. Just as electric power utilities gracefully increase and decrease the flow of electrical power to customers to meet their individual demand, clouds elastically grow and shrink the computing resources available for individual cloud consumer’s workloads to match changes in demand. Geographic distribution of cloud data centers can enable computing services to be offered physically closer to each user, thereby assuring low transmission latency, as well as supporting disaster recovery to other data centers. Because multiple applications and data sets share the same physical resources, advanced security is essential to protect each cloud consumer. Massive scale and homogeneity enable cloud service providers to maximize efficiency and thus offer lower costs to cloud consumers than traditional or hosted data center options. Resilient computing architectures become important because hardware failures are inevitable, and massive data centers with lots of hardware means lots of failures; resilient computing architectures assure that those hardware failures cause minimal service disruption. Thus, the difference between a traditional data center and a cloud computing data center is primarily the business model along with the policies and software that support that business model.
NIST defines three service models for cloud computing: infrastructure as a service, platform as a service, and software as a service. These cloud computing service models logically sit above the IP networking infrastructure, which connects end users to the applications hosted on cloud services. Figure 1.1 visualizes the relationship between these service models.
Figure 1.1. Service Models.
The cloud computing service models are formally defined as follows.
Infrastructure as a Service (IaaS). “[T]he capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls)” [NIST-800-145]. IaaS services include: compute, storage, content delivery networks to improve performance and/or cost of serving web clients, and backup and recovery service.
Platform as a Service (PaaS). “[T]he capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations” [NIST-800-145]. PaaS services include: operating system, virtual desktop, web services delivery and development platforms, and database services.
Software as a Service (SaaS). “[T]he capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings” [NIST-800-145]. SaaS applications include: e-mail and office productivity; customer relationship management (CRM); enterprise resource planning (ERP); social networking; collaboration; and document and content management.
Figure 1.2 gives concrete examples of IaaS, PaaS, and SaaS offerings.
Figure 1.2. OpenCrowd’s Cloud Taxonomy.
Source: Copyright 2010, Image courtesy of OpenCrowd, opencrowd.com.
NIST recognizes four cloud deployment models:
Private Cloud. “the cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on premise or off premise” [NIST-800-145].
Community Cloud. “the cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on premise or off premise” [NIST-800-145].
Public Cloud. “the cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services” [NIST-800-145].
Hybrid Cloud. “the cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds)” [NIST-800-145].
Cloud service providers typically offer either private, community or public clouds, and cloud consumers select which of those three to use, or adopt a hybrid deployment strategy blending private, community and/or public clouds.
Cloud computing opens up interfaces between applications, platform, infrastructure, and network layers, thereby enabling different layers to be offered by different service providers. While NIST [NIST-C] and some other organizations propose new roles of cloud service consumers, cloud service distributors, cloud service developers and vendors, and cloud service providers, the authors will use the more traditional roles of suppliers, service providers, cloud consumers, and end users, as illustrated in Figure 1.3.
Figure 1.3. Roles in Cloud Computing.
Specific roles in Figure 1.3 are defined below.
Suppliers develop the equipment, software, and integration services that implement the cloud-based and client application software, the platform software, and the hardware-based systems that support the networking, compute, and storage that underpin cloud computing.
Service providers own, operate, and maintain the solutions, systems, equipment, and networking needed to deliver service to end users. The specific service provider roles are defined as follows:
IP network service providers carry IP communications between end users’ equipment and IaaS providers’ equipment, as well as between IaaS data centers. Network service providers operate network equipment and facilities to provide Internet access and/or wide area networking service. Note that while there will often be only a single infrastructure, platform, and software service provider for a particular cloud-based application, there may be several different network service providers involved in IP networking between the IaaS service provider’s equipment and end users’ equipment. Internet service providers and Internet access providers are examples of network service providers. While IP networking service is not explicitly recognized in NIST’s service model, these service providers have a crucial role in delivering end-to-end services to cloud users and can thus impact the quality of experience for end users.
IaaS providers “have control of hardware, hypervisor and operating system, to provide services to consumers. For IaaS, the provider maintains the storage, database, message queue or other middleware, or the hosting environment for virtual machines. The [PaaS/SaaS/cloud] consumer uses that service as if it was a disk drive, database, message queue, or machine, but they cannot access the infrastructure that hosts it” [NIST-C]. Most IaaS providers focus on providing complete computing platforms for consumers’ VMs, including operating system, memory, storage, and processing power. Cloud consumers often pay for only what they use, which fits nicely into most companies’ computing budgets.
PaaS providers “take control of hardware, hypervisor, OS and middleware, to provide services. For PaaS, the provider manages the cloud infrastructure for the platform, typically a framework for a particular type of application. The consumer’s application cannot access the infrastructure underneath the platform” [NIST-C]. PaaS providers give developers complete development environments in which to code, host, and deliver applications. The development environment typically includes the underlying infrastructure, development tools, APIs, and other related services.
SaaS providers “rely on hardware, hypervisor, OS, middleware, and application layers to provide services. For SaaS, the provider installs, manages and maintains the software. The provider does not necessarily own the physical infrastructure in which the software is running. Regardless, the consumer does not have access to the infrastructure; they can access only the application” [NIST-C]. Common SaaS offerings include desktop productivity, collaboration, sales and customer relationship management, and documentation management.
Cloud consumers (or simply “consumers”) are generally enterprises offering specific application services to end users by arranging to have appropriately configured software execute on XaaS resources hosted by one or more service providers. Cloud consumers pay service providers for the cloud XaaS resources consumed. End users are typically aware only of the enterprise’s application; the services offered by the various XaaS service providers are completely invisible to end users.
End users (or simply users) use the software applications hosted on the cloud. Users access cloud-based applications via IP networking from some user equipment, such as a smartphone, laptop, tablet, or PC.
There are likely to be several different suppliers and service providers supporting a single cloud consumer’s application to a community of end users. The cloud consumer may have some supplier role in developing and integrating the software and solution. It is possible that the end users are in the same organization as the one that offers the cloud-based service to end users.
The key benefit of cloud computing for many enterprises is that it turns IT from a capital intensive concern to a pay-as-you-go activity where operating expenses track usage—and ideally computing expenses track revenue. Beyond this strategic capital expense to operating expense shift, there are other benefits of cloud computing from [Kundra] and others:
Increased Flexibility. Rapid elasticity of cloud computing enables resources engaged for an application to promptly grow and later shrink to track the actual workload, so cloud consumers are better able to satisfy customer demand without taking financial risks associated with accurately predicting future demand.
Rapid Implementation. Cloud consumers no longer need to procure, install, and bring into service new compute capacity before offering new applications or serving increased workloads. Instead, they can easily buy the necessary computing capacity “off the shelf” from cloud service providers, thereby simplifying and shortening the service deployment cycle.
Increased Effectiveness. Cloud computing enables cloud consumers to focus their scarce resources on building services to solve enterprise problems rather than investing in deploying and maintaining computing infrastructure, thereby increasing their organizational effectiveness.
Energy Efficiency. Cloud service providers have the scale and infrastructure necessary to enable effective sharing of compute, storage, networking, and data center resources across a community of cloud consumers. This not only reduces the total number of servers required compared with dedicated IT resources, but also reduces the associated power, cooling, and floor space consumed. In essence, intelligent sharing of cloud computing infrastructure enables higher resource utilization of a smaller overall pool of resources compared with dedicated IT resources for each individual cloud consumer.
As cloud computing essentially outsources responsibility for critical IS/IT infrastructure to a service provider, the cloud consumer gives up some control and is confronted with a variety of new risks. These risks range from reduced operational control and visibility (e.g., timing and control of some software upgrades) to changes in accountability (e.g., provider service level agreements) and myriad other concerns. This book considers only the risks that service reliability and service availability of virtualized and cloud-based solutions will fail to achieve performance levels the same as or better than those that traditional deployment scenarios have achieved.
2
VIRTUALIZATION
Virtualization is the logical abstraction of physical assets, such as the hardware platform, operating system (OS), storage devices, data stores, or network interfaces. Virtualization was initially developed to improve resource utilization of mainframe computers, and has evolved to become a common characteristic of cloud computing. This chapter begins with a brief background of virtualization, then describes the characteristics of virtualization and the lifecycle of a virtual machine (VM), and concludes by reviewing popular use cases of virtualization technology.
The notion of virtualization has been around for decades. Dr. Christopher Strachey (later of Oxford University) used the term virtualization in his 1959 paper Time Sharing in Large Fast Computers. Computer time sharing meant that multiple engineers could share a computer and work on their software in parallel; this concept became known as multiprogramming. In 1962, one of the first supercomputers, the Atlas Computer, was commissioned. One of its key features was the supervisor, responsible for allocating system resources in support of multiprogramming; that supervisor is considered an early OS. The Atlas Computer also introduced the notion of virtual memory, that is, the separation of the physical memory store from the programs accessing it. IBM quickly followed suit with the M44/44X project, which coined the term virtual machine (VM). Virtual memory and VM technologies enabled programs to run in parallel without knowledge of the other executing programs. Virtualization was used to partition large mainframe computers into multiple VMs, allowing multiple applications and processes to run in parallel and thus better utilize hardware resources. With the advent of less expensive computers and distributed computing, this ability to maximize hardware utilization became less necessary.
The proliferation of computers in the 1990s created another opportunity for virtualization to improve resource utilization. VMware and others constructed virtualization products to enable myriad applications running on many lightly utilized computers to be consolidated onto a smaller number of servers. This server consolidation dramatically reduced hardware-related operating expenses, including data center floor space, cooling, and maintenance. By decoupling applications from the underlying hardware resources that support them to enable efficient resource sharing, virtualization technology enables the cloud computing business model that is proliferating today.
A simple analogy for virtualization is the picture-in-picture feature of some televisions and set-top boxes, which displays a small virtual television image on top of another television image, allowing both programs to play simultaneously. Computer virtualization is similar: several applications that would normally execute on dedicated computer hardware (analogous to individual television channels) actually run on a single hardware platform that supports virtualization, enabling multiple applications to execute simultaneously.
Virtualization can be implemented at various portions of the system architecture:
Network virtualization entails virtual IP management and segmentation.
Memory virtualization entails the aggregation of memory resources into a single pool of memory and managing that memory on behalf of the multiple applications using it.
Storage virtualization provides a layer of abstraction for the physical storage of data, either at the device level (referred to as block virtualization) or at the file level (referred to as file virtualization). Storage virtualization includes technologies such as storage area network (SAN) and network attached storage (NAS), which can efficiently manage storage in a central location for multiple applications across the network rather than requiring each application to manage its own storage on a physically attached device. A toy block-mapping sketch follows this list.
Processor virtualization enables a processor to be shared across multiple application instances.
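To make the block-virtualization idea concrete, the following toy Python sketch maps a logical block address onto a (device, physical block) pair, the same indirection a SAN volume manager performs at much larger scale. All names here are hypothetical illustrations, not any vendor's API; real volume managers add redundancy, caching, and online remapping.

# Toy illustration of block-level storage virtualization: a logical volume
# is presented to applications as one contiguous block device, while the
# mapping layer spreads the blocks across several physical devices.
class LogicalVolume:
    def __init__(self, extents):
        # extents: list of (device_name, start_block, block_count)
        self.extents = extents

    def map_block(self, logical_block):
        """Translate a logical block number to (device, physical block)."""
        offset = logical_block
        for device, start, count in self.extents:
            if offset < count:
                return device, start + offset
            offset -= count
        raise ValueError("logical block beyond end of volume")

# Example: a 3000-block logical volume concatenated across two disks.
vol = LogicalVolume([("disk_a", 0, 2048), ("disk_b", 4096, 952)])
print(vol.map_block(100))    # -> ('disk_a', 100)
print(vol.map_block(2100))   # -> ('disk_b', 4148)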
Virtualization decouples an application from the underlying physical hardware, including CPU, networking, memory, and nonvolatile data storage or disk. Application software experiences virtualization as a VM, which is defined by [OVF] as “an encapsulation of the virtual hardware, virtual disks, and the metadata associated with it.” Figure 2.1 gives a simple depiction of a typical virtualized server. One of the key components of virtualization is the hypervisor (also called the VM monitor, or VMM; the terms are used interchangeably in this chapter), which supports running multiple OSs concurrently on a single host computer. The hypervisor is responsible for managing the applications’ OSs (called the guest OSs) and their use of system resources (e.g., CPU, memory, and storage). VMs are isolated instances of application software and a guest OS that run like separate computers. It is the hypervisor’s responsibility to support this isolation and to manage the multiple VMs running on the same host computer.
Figure 2.1. Virtualizing Resources.
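As a concrete illustration of the hypervisor and guest relationship just described, the short Python sketch below uses the libvirt bindings to list the VMs a single host is running and the resources each has been granted. It assumes the python libvirt package is installed and that a QEMU/KVM hypervisor answers at the URI shown; both are assumptions for this sketch, not requirements of cloud computing in general.

# Minimal sketch: query a hypervisor for its guests via the libvirt
# Python bindings.
import libvirt

conn = libvirt.openReadOnly("qemu:///system")  # assumed local QEMU/KVM URI
print("Hypervisor host:", conn.getHostname())

for dom in conn.listAllDomains():
    # info() returns [state, max memory (KiB), memory (KiB), vCPUs, CPU time (ns)]
    state, max_mem_kib, mem_kib, vcpus, cpu_time_ns = dom.info()
    print(f"VM {dom.name()}: active={bool(dom.isActive())}, "
          f"vCPUs={vcpus}, memory={mem_kib // 1024} MiB")

conn.close()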
A virtual appliance is a software image delivered as a complete software stack installed on one or more VMs and managed as a unit. A virtual appliance is usually delivered as a set of Open Virtualization Format (OVF) files. The purpose of virtual appliances is to facilitate the deployment of applications, and they often come with web interfaces that simplify configuration and installation.
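Because an OVF package is essentially an XML descriptor plus disk images, its contents can be inspected with ordinary tooling. The hedged sketch below uses Python’s standard xml.etree module to list the virtual systems declared in a descriptor; the file name appliance.ovf is a placeholder, and XML namespaces are stripped for simplicity rather than hard-coding any particular schema URI.

# Sketch: list the virtual systems declared in an OVF descriptor.
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the XML namespace prefix from an element or attribute tag."""
    return tag.rsplit("}", 1)[-1]

tree = ET.parse("appliance.ovf")  # placeholder file name
for elem in tree.iter():
    if local_name(elem.tag) == "VirtualSystem":
        # Match the id attribute by local name to stay namespace-agnostic.
        ids = [v for k, v in elem.attrib.items() if local_name(k) == "id"]
        print("Virtual system:", ids[0] if ids else "<unnamed>")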
There are two types of hypervisors (pictured in Figure 2.2):
Type 1. The hypervisor runs directly on the hardware (i.e., “bare metal”) to control the hardware and monitor the guest OSs, which run at the level above the hypervisor. Type 1 represents the original implementation of the hypervisor.
Type 2. The hypervisor runs on top of an existing OS (referred to as the host OS) to monitor the guest OSs, which run at a third level above the hardware (above the host OS and hypervisor).
Figure 2.2. Type 1 and Type 2 Hypervisors.
In the industry, the terms virtualization and emulation are sometimes used interchangeably, but they actually refer to two separate technologies. Emulation entails making one system behave like another, so that software written to run on a particular system can run on a completely different system with the same interfaces and produce the same results. Emulation increases the flexibility to move software to different hardware platforms, but it usually carries a significant performance cost. Virtualization, in contrast, decouples an entity from its physical assets: VMs are isolated environments that are independent of the hardware they run on. Some virtualization technologies use emulation while others do not.
There are four types of server virtualization:
Full virtualization allows instances of software written for different OSs (referred to as guest OSs) to run concurrently on a host computer. Neither the application software nor the guest OS needs to be changed. Each VM is isolated from the others and managed by a hypervisor or VMM, which provides emulated hardware to the VMs so that application and OS software can run seamlessly on different virtualized hardware servers. Full virtualization supports multiple applications on multiple OSs on the same server, and failovers or migrations can be performed onto servers of different hardware generations. Full virtualization can be realized with hardware emulation that supports this separation of the hardware from the applications; however, that emulation incurs a performance overhead, which may be partially addressed by hardware-assisted virtualization.
Hardware-assisted virtualization is similar to full virtualization but has the added performance advantage of virtualization-aware processors. The system hardware interacts with the hypervisor and also allows the guest OSs to process privileged instructions directly without going through the hypervisor.
Paravirtualization is similar to full virtualization in that it supports VMs running multiple OSs; however, the guest OSs must be adapted to interface with the hypervisor. Paravirtualization provides a closer tie between the guest OS and the hypervisor. The benefit is better performance, since emulation is not required; however, to realize this tighter interface, the guest OS must be modified to make the customized API calls. Some products support paravirtualization with hardware assist to further improve performance.
OS virtualization supports partitioning of the OS software into individual virtual environments (sometimes referred to as containers), but these are limited to running on the same host OS. OS virtualization provides the best performance, since native OS calls can be made by the guest. Its simplicity derives from the requirement that the guest OS be the same OS as the host; however, that is also its disadvantage: OS virtualization cannot support multiple different OSs on the same server, although it can support hundreds of container instances on a single server. A minimal sketch of the kernel mechanisms behind container isolation follows this list.
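As a glimpse of how OS virtualization isolates containers on Linux, the sketch below simply lists the namespaces of the current process from /proc. Each container receives its own set of these namespaces (plus cgroups for resource limits), which is what gives it a private process table, network configuration, and file-system view. This is a Linux-specific illustration of the container concept, not a description of any particular product.

# Sketch: inspect the Linux kernel namespaces of the current process.
import os

ns_dir = "/proc/self/ns"
for ns in sorted(os.listdir(ns_dir)):
    # Each entry is a symlink such as "pid:[4026531836]"; processes in the
    # same container share the same namespace identifiers.
    print(f"{ns:10s} -> {os.readlink(os.path.join(ns_dir, ns))}")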
Full virtualization (depicted in Figure 2.3) uses a VM monitor (or hypervisor) to manage the allocation of hardware resources to the VMs. No changes are required of the guest OS: the hypervisor emulates privileged operations and returns control to the guest OS. Each VM contains the application software as well as its OS (referred to as the guest OS). With full virtualization, each VM acts as a separate computer, isolated from other VMs co-residing on that hardware. Since the hypervisor runs on bare metal, the various guest OSs can be different; this is unlike OS virtualization, which requires the virtual environments to be based on an OS consistent with the host OS.
Figure 2.3. Full Virtualization.
Hardware-assisted virtualization provides optimizations using virtualization-aware processors, that is, processors that know of the presence of the server virtualization stack and can therefore interact directly with the hypervisor or dedicate hardware resources to VMs. The hypervisor still provides isolation and control of the VMs and allocation of the system resources, but the guest OSs can process privileged instructions without going through the hypervisor. Intel and AMD are two of the main processor vendors that support hardware-assisted virtualization.
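Whether a given x86 processor is virtualization aware can be checked on a Linux host by looking for the vmx (Intel VT-x) or svm (AMD-V) CPU flags. The sketch below reads /proc/cpuinfo, one common Linux-specific way to perform that check; it is a diagnostic aid, not part of any hypervisor’s API.

# Sketch: detect x86 hardware virtualization support on a Linux host by
# scanning /proc/cpuinfo for the Intel VT-x ("vmx") or AMD-V ("svm") flags.
def hardware_virtualization_support(cpuinfo_path="/proc/cpuinfo"):
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                if "vmx" in flags:
                    return "Intel VT-x"
                if "svm" in flags:
                    return "AMD-V"
    return None

print(hardware_virtualization_support() or "no hardware-assisted virtualization detected")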
Paravirtualization (illustrated in Figure 2.4) takes a slightly different approach from full virtualization, one meant to improve performance and efficiency. The hypervisor multiplexes (or coordinates) all application access to the underlying host computer resources. A hardware environment is not simulated; however, the guest OS is executed in an isolated domain, as if running on a separate system. The guest OS software must be specifically modified to run in this environment, with kernel-mode drivers and application programming interfaces that directly access parts of the hardware, such as storage and memory. Some products combine paravirtualization (particularly for network and storage drivers) with hardware assist, taking the best of both for optimal performance.
Figure 2.4. Paravirtualization.
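On a Linux guest, one visible sign of paravirtualized I/O is the presence of virtio devices. The sketch below lists /sys/bus/virtio/devices, a path populated only when paravirtualized (virtio) network or storage drivers are bound; whether a particular guest uses virtio is an assumption about its configuration, not a property of every paravirtualized product.

# Sketch: look for paravirtualized (virtio) devices on a Linux guest.
import os

virtio_path = "/sys/bus/virtio/devices"
if os.path.isdir(virtio_path):
    devices = sorted(os.listdir(virtio_path))
    print(f"{len(devices)} virtio device(s):", ", ".join(devices) or "none")
else:
    print("no virtio bus present; guest is not using virtio paravirtualized drivers")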
Operating system virtualization consists of a layer that runs on top of the host OS and provides a set of libraries used by applications to isolate their use of the hardware resources, as shown in Figure 2.5. Each application or application instance can have its own file system, process table, network configuration, and system libraries. Each isolated instance is referred to as a virtual environment or a container. Since the virtual environment or container concept is similar to that of a VM, for consistency the term “virtual machine” is used in subsequent comparisons. The kernel provides resource management features to limit the impact of one container’s activities on the other containers. OS virtualization does not support OSs other than the host OS. Note that Figure 2.5