Reliability and Availability of Cloud Computing

Eric Bauer
Description

A holistic approach to service reliability and availability of cloud computing.

Reliability and Availability of Cloud Computing provides IS/IT system and solution architects, developers, and engineers with the knowledge needed to assess the impact of virtualization and cloud computing on service reliability and availability. It reveals how to select the most appropriate design for reliability diligence to assure that user expectations are met. Organized in three parts (basics, risk analysis, and recommendations), this resource is accessible to readers of diverse backgrounds and experience levels. Numerous examples and more than 100 figures throughout the book help readers visualize problems to better understand the topic, and the authors present risks and options in bulleted lists that can be applied directly to specific applications and problems.

Special features of this book include:

* Rigorous analysis of the reliability and availability risks that are inherent in cloud computing
* Simple formulas that explain the quantitative aspects of reliability and availability
* Enlightening discussions of the ways in which virtualized applications and cloud deployments differ from traditional system implementations and deployments
* Specific recommendations for developing reliable virtualized applications and cloud-based solutions

Reliability and Availability of Cloud Computing is the guide for IS/IT staff in business, government, academia, and non-governmental organizations who are moving their applications to the cloud. It is also an important reference for professionals in technical sales, product management, and quality management, as well as software and quality engineers looking to broaden their expertise.




Table of Contents

COVER

IEEE PRESS

TITLE PAGE

COPYRIGHT PAGE

DEDICATION

FIGURES

TABLES

EQUATIONS

INTRODUCTION

AUDIENCE

ORGANIZATION

ACKNOWLEDGMENTS

I: BASICS

1 CLOUD COMPUTING

1.1 ESSENTIAL CLOUD CHARACTERISTICS

1.2 COMMON CLOUD CHARACTERISTICS

1.3 BUT WHAT, EXACTLY, IS CLOUD COMPUTING?

1.4 SERVICE MODELS

1.5 CLOUD DEPLOYMENT MODELS

1.6 ROLES IN CLOUD COMPUTING

1.7 BENEFITS OF CLOUD COMPUTING

1.8 RISKS OF CLOUD COMPUTING

2 VIRTUALIZATION

2.1 BACKGROUND

2.2 WHAT IS VIRTUALIZATION?

2.3 SERVER VIRTUALIZATION

2.4 VM LIFECYCLE

2.5 RELIABILITY AND AVAILABILITY RISKS OF VIRTUALIZATION

3 SERVICE RELIABILITY AND SERVICE AVAILABILITY

3.1 ERRORS AND FAILURES

3.2 EIGHT-INGREDIENT FRAMEWORK

3.3 SERVICE AVAILABILITY

3.4 SERVICE RELIABILITY

3.5 SERVICE LATENCY

3.6 REDUNDANCY AND HIGH AVAILABILITY

3.7 HIGH AVAILABILITY AND DISASTER RECOVERY

3.8 STREAMING SERVICES

3.9 RELIABILITY AND AVAILABILITY RISKS OF CLOUD COMPUTING

II: ANALYSIS

4 ANALYZING CLOUD RELIABILITY AND AVAILABILITY

4.1 EXPECTATIONS FOR SERVICE RELIABILITY AND AVAILABILITY

4.2 RISKS OF ESSENTIAL CLOUD CHARACTERISTICS

4.3 IMPACTS OF COMMON CLOUD CHARACTERISTICS

4.4 RISKS OF SERVICE MODELS

4.5 IT SERVICE MANAGEMENT AND AVAILABILITY RISKS

4.6 OUTAGE RISKS BY PROCESS AREA

4.7 FAILURE DETECTION CONSIDERATIONS

4.8 RISKS OF DEPLOYMENT MODELS

4.9 EXPECTATIONS OF IAAS DATA CENTERS

5 RELIABILITY ANALYSIS OF VIRTUALIZATION

5.1 RELIABILITY ANALYSIS TECHNIQUES

5.2 RELIABILITY ANALYSIS OF VIRTUALIZATION TECHNIQUES

5.3 SOFTWARE FAILURE RATE ANALYSIS

5.4 RECOVERY MODELS

5.5 APPLICATION ARCHITECTURE STRATEGIES

5.6 AVAILABILITY MODELING OF VIRTUALIZED RECOVERY OPTIONS

6 HARDWARE RELIABILITY, VIRTUALIZATION, AND SERVICE AVAILABILITY

6.1 HARDWARE DOWNTIME EXPECTATIONS

6.2 HARDWARE FAILURES

6.3 HARDWARE FAILURE RATE

6.4 HARDWARE FAILURE DETECTION

6.5 HARDWARE FAILURE CONTAINMENT

6.6 HARDWARE FAILURE MITIGATION

6.7 MITIGATING HARDWARE FAILURES VIA VIRTUALIZATION

6.8 VIRTUALIZED NETWORKS

6.9 MTTR OF VIRTUALIZED HARDWARE

6.10 DISCUSSION

7 CAPACITY AND ELASTICITY

7.1 SYSTEM LOAD BASICS

7.2 OVERLOAD, SERVICE RELIABILITY, AND SERVICE AVAILABILITY

7.3 TRADITIONAL CAPACITY PLANNING

7.4 CLOUD AND CAPACITY

7.5 MANAGING ONLINE CAPACITY

7.6 CAPACITY-RELATED SERVICE RISKS

7.7 CAPACITY MANAGEMENT RISKS

7.8 SECURITY AND SERVICE AVAILABILITY

7.9 ARCHITECTING FOR ELASTIC GROWTH AND DEGROWTH

8 SERVICE ORCHESTRATION ANALYSIS

8.1 SERVICE ORCHESTRATION DEFINITION

8.2 POLICY-BASED MANAGEMENT

8.3 CLOUD MANAGEMENT

8.4 SERVICE ORCHESTRATION’S ROLE IN RISK MITIGATION

8.5 SUMMARY

9 GEOGRAPHIC DISTRIBUTION, GEOREDUNDANCY, AND DISASTER RECOVERY

9.1 GEOGRAPHIC DISTRIBUTION VERSUS GEOREDUNDANCY

9.2 TRADITIONAL DISASTER RECOVERY

9.3 VIRTUALIZATION AND DISASTER RECOVERY

9.4 CLOUD COMPUTING AND DISASTER RECOVERY

9.5 GEOREDUNDANCY RECOVERY MODELS

9.6 CLOUD AND TRADITIONAL COLLATERAL BENEFITS OF GEOREDUNDANCY

9.7 DISCUSSION

III: RECOMMENDATIONS

10 APPLICATIONS, SOLUTIONS, AND ACCOUNTABILITY

10.1 APPLICATION CONFIGURATION SCENARIOS

10.2 APPLICATION DEPLOYMENT SCENARIO

10.3 SYSTEM DOWNTIME BUDGETS

10.4 END-TO-END SOLUTIONS CONSIDERATIONS

10.5 ATTRIBUTABILITY FOR SERVICE IMPAIRMENTS

10.6 SOLUTION SERVICE MEASUREMENT

10.7 MANAGING RELIABILITY AND SERVICE OF CLOUD COMPUTING

11 RECOMMENDATIONS FOR ARCHITECTING A RELIABLE SYSTEM

11.1 ARCHITECTING FOR VIRTUALIZATION AND CLOUD

11.2 DISASTER RECOVERY

11.3 IT SERVICE MANAGEMENT CONSIDERATIONS

11.4 MANY DISTRIBUTED CLOUDS VERSUS FEWER HUGE CLOUDS

11.5 MINIMIZING HARDWARE-ATTRIBUTED DOWNTIME

11.6 ARCHITECTURAL OPTIMIZATIONS

12 DESIGN FOR RELIABILITY OF VIRTUALIZED APPLICATIONS

12.1 DESIGN FOR RELIABILITY

12.2 TAILORING DFR FOR VIRTUALIZED APPLICATIONS

12.3 RELIABILITY REQUIREMENTS

12.4 QUALITATIVE RELIABILITY ANALYSIS

12.5 QUANTITATIVE RELIABILITY BUDGETING AND MODELING

12.6 ROBUSTNESS TESTING

12.7 STABILITY TESTING

12.8 FIELD PERFORMANCE ANALYSIS

12.9 RELIABILITY ROADMAP

12.10 HARDWARE RELIABILITY

13 DESIGN FOR RELIABILITY OF CLOUD SOLUTIONS

13.1 SOLUTION DESIGN FOR RELIABILITY

13.2 SOLUTION SCOPE AND EXPECTATIONS

13.3 RELIABILITY REQUIREMENTS

13.4 SOLUTION MODELING AND ANALYSIS

13.5 ELEMENT RELIABILITY DILIGENCE

13.6 SOLUTION TESTING AND VALIDATION

13.7 TRACK AND ANALYZE FIELD PERFORMANCE

13.8 OTHER SOLUTION RELIABILITY DILIGENCE TOPICS

14 SUMMARY

14.1 SERVICE RELIABILITY AND SERVICE AVAILABILITY

14.2 FAILURE ACCOUNTABILITY AND CLOUD COMPUTING

14.3 FACTORING SERVICE DOWNTIME

14.4 SERVICE AVAILABILITY MEASUREMENT POINTS

14.5 CLOUD CAPACITY AND ELASTICITY CONSIDERATIONS

14.6 MAXIMIZING SERVICE AVAILABILITY

14.7 RELIABILITY DILIGENCE

14.8 CONCLUDING REMARKS

ABBREVIATIONS

REFERENCES

ABOUT THE AUTHORS

INDEX

IEEE Press

445 Hoes Lane

Piscataway, NJ 08854

IEEE Press Editorial Board 2012

John Anderson, Editor in Chief

Ramesh Abhari

Bernhard M. Haemmerli

Saeid Nahavandi

George W. Arnold

David Jacobson

Tariq Samad

Flavio Canavero

Mary Lanzerotti

George Zobrist

Dmitry Goldgof

Om P. Malik

Kenneth Moore, Director of IEEE Book and Information Services (BIS)

Technical Reviewers

Xuemei Zhang

Principal Member of Technical Staff

Network Design and Performance Analysis

AT&T Labs

Rocky Heckman, CISSP

Architect Advisor

Microsoft

cover image: © iStockphoto

cover design: Michael Rutkowski

ITIL® is a Registered Trademark of the Cabinet Office in the United Kingdom and other countries.

Copyright © 2012 by the Institute of Electrical and Electronics Engineers. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Bauer, Eric.

 Reliability and availability of cloud computing / Eric Bauer, Randee Adams.

p. cm.

 ISBN 978-1-118-17701-3 (hardback)

1. Cloud computing. 2. Computer software–Reliability. 3. Computer software–Quality control. 4. Computer security. I. Adams, Randee. II. Title.

 QA76.585.B394 2012

 004.6782–dc23

2011052839

To our families and friends for their continued encouragement and support.

FIGURES

Figure 1.1. Service Models
Figure 1.2. OpenCrowd’s Cloud Taxonomy
Figure 1.3. Roles in Cloud Computing
Figure 2.1. Virtualizing Resources
Figure 2.2. Type 1 and Type 2 Hypervisors
Figure 2.3. Full Virtualization
Figure 2.4. Paravirtualization
Figure 2.5. Operating System Virtualization
Figure 2.6. Virtualized Machine Lifecycle State Transitions
Figure 3.1. Fault Activation and Failures
Figure 3.2. Minimum Chargeable Service Disruption
Figure 3.3. Eight-Ingredient (“8i”) Framework
Figure 3.4. Eight-Ingredient plus Data plus Disaster (8i + 2d) Model
Figure 3.5. MTBF and MTTR
Figure 3.6. Service and Network Element Impact Outages of Redundant Systems
Figure 3.7. Sample DSL Solution
Figure 3.8. Transaction Latency Distribution for Sample Service
Figure 3.9. Requirements Overlaid on Service Latency Distribution for Sample Solution
Figure 3.10. Maximum Acceptable Service Latency
Figure 3.11. Downtime of Simplex Systems
Figure 3.12. Downtime of Redundant Systems
Figure 3.13. Simplified View of High Availability
Figure 3.14. High Availability Example
Figure 3.15. Disaster Recovery Objectives
Figure 3.16. ITU-T G.114 Bearer Delay Guideline
Figure 4.1. TL 9000 Outage Attributability Overlaid on Augmented 8i + 2d Framework
Figure 4.2. Outage Responsibilities Overlaid on Cloud 8i + 2d Framework
Figure 4.3. ITIL Service Management Visualization
Figure 4.4. IT Service Management Activities to Minimize Service Availability Risk
Figure 4.5. 8i + 2d Attributability by Process or Best Practice Areas
Figure 4.6. Traditional Error Vectors
Figure 4.7. IaaS Provider Responsibilities for Traditional Error Vectors
Figure 4.8. Software Supplier (and SaaS) Responsibilities for Traditional Error Vectors
Figure 5.1. Sample Reliability Block Diagram
Figure 5.2. Traversal of Sample Reliability Block Diagram
Figure 5.3. Nominal System Reliability Block Diagram
Figure 5.4. Reliability Block Diagram of Full Virtualization
Figure 5.5. Reliability Block Diagram of OS Virtualization
Figure 5.6. Reliability Block Diagram of Paravirtualization
Figure 5.7. Reliability Block Diagram of Coresident Application Deployment
Figure 5.8. Canonical Virtualization RBD
Figure 5.9. Latency of Traditional Recovery Options
Figure 5.10. Traditional Active-Standby Redundancy via Active VM Virtualization
Figure 5.11. Reboot of a Virtual Machine
Figure 5.12. Reset of a Virtual Machine
Figure 5.13. Redundancy via Paused VM Virtualization
Figure 5.14. Redundancy via Suspended VM Virtualization
Figure 5.15. Nominal Recovery Latency of Virtualized and Traditional Options
Figure 5.16. Server Consolidation Using Virtualization
Figure 5.17. Simplified Simplex State Diagram
Figure 5.18. Downtime Drivers for Redundancy Pairs
Figure 6.1. Hardware Failure Rate Questions
Figure 6.2. Application Reliability Block Diagram with Virtual Devices
Figure 6.3. Virtual CPU
Figure 6.4. Virtual NIC
Figure 7.1. Sample Application Resource Utilization by Time of Day
Figure 7.2. Example of Extraordinary Event Traffic Spike
Figure 7.3. The Slashdot Effect: Traffic Load Over Time (in Hours)
Figure 7.4. Offered Load, Service Reliability, and Service Availability of a Traditional System
Figure 7.5. Visualizing VM Growth Scenarios
Figure 7.6. Nominal Capacity Model
Figure 7.7. Implementation Architecture of Compute Capacity Model
Figure 7.8. Orderly Reconfiguration of the Capacity Model
Figure 7.9. Slew Rate of Square Wave Amplification
Figure 7.10. Slew Rate of Rapid Elasticity
Figure 7.11. Elasticity Timeline by ODCA SLA Level
Figure 7.12. Capacity Management Process
Figure 7.13. Successful Cloud Elasticity
Figure 7.14. Elasticity Failure Model
Figure 7.15. Virtualized Application Instance Failure Model
Figure 7.16. Canonical Capacity Management Failure Scenarios
Figure 7.17. ITU X.805 Security Dimensions, Planes, and Layers
Figure 7.18. Leveraging Security and Network Infrastructure to Mitigate Overload Risk
Figure 8.1. Service Orchestration
Figure 8.2. Example of Cloud Bursting
Figure 10.1. Canonical Single Data Center Application Deployment Architecture
Figure 10.2. RBD of Sample Application on Blade-Based Server Hardware
Figure 10.3. RBD of Sample Application on IaaS Platform
Figure 10.4. Sample End-to-End Solution
Figure 10.5. Sample Distributed Cloud Architecture
Figure 10.6. Sample Recovery Scenario in Distributed Cloud Architecture
Figure 10.7. Simplified Responsibilities for a Canonical Cloud Application
Figure 10.8. Recommended Cloud-Related Service Availability Measurement Points
Figure 10.9. Canonical Example of MP 1 and MP 2
Figure 10.10. End-to-End Service Availability Key Quality Indicators
Figure 11.1. Virtual Machine Live Migration
Figure 11.2. Active–Standby Markov Model
Figure 11.3. Pie Chart of Canonical Hardware Downtime Prediction
Figure 11.4. RBD for the Hypothetical Web Server Application
Figure 11.5. Horizontal Growth of Hypothetical Application
Figure 11.6. Outgrowth of Hypothetical Application
Figure 11.7. Aggressive Protocol Retry Strategy
Figure 11.8. Data Replication of Hypothetical Application
Figure 11.9. Disaster Recovery of Hypothetical Application
Figure 11.10. Optimal Availability Architecture of Hypothetical Application
Figure 12.1. Traditional Design for Reliability Process
Figure 12.2. Mapping Virtual Machines across Hypervisors
Figure 12.3. A Virtualized Server Failure Scenario
Figure 12.4. Robustness Testing Vectors for Virtualized Applications
Figure 12.5. System Design for Reliability as a Deming Cycle
Figure 13.1. Solution Design for Reliability
Figure 13.2. Sample Solution Scope and KQI Expectations
Figure 13.3. Sample Cloud Data Center RBD
Figure 13.4. Estimating MP 2
Figure 13.5. Modeling Cloud-Based Solution with Client-Initiated Recovery Model
Figure 13.6. Client-Initiated Recovery Model
Figure 14.1. Failure Impact Duration and High Availability Goals
Figure 14.2. Eight-Ingredient Plus Data Plus Disaster (8i + 2d) Model
Figure 14.3. Traditional Outage Attributability
Figure 14.4. Sample Outage Accountability Model for Cloud Computing
Figure 14.5. Outage Responsibilities of Cloud by Process
Figure 14.6. Measurement Points (MPs) 1, 2, 3, and 4
Figure 14.7. Design for Reliability of Cloud-Based Solutions

TABLES

Table 2.1. Comparison of Server Virtualization Technologies
Table 2.2. Virtual Machine Lifecycle Transitions
Table 3.1. Service Availability and Downtime Ratings
Table 3.2. Mean Opinion Scores
Table 4.1. ODCA’s Data Center Classification
Table 4.2. ODCA’s Data Center Service Availability Expectations by Classification
Table 5.1. Example Failure Mode Effects Analysis
Table 5.2. Failure Mode Effect Analysis Figure for Coresident Applications
Table 5.3. Comparison of Nominal Software Availability Parameters
Table 6.1. Example of Hardware Availability as a Function of MTTR/MTTRS
Table 7.1. ODCA IaaS Elasticity Objectives
Table 9.1. ODCA IaaS Recoverability Objectives
Table 10.1. Sample Traditional Five 9’s Downtime Budget
Table 10.2. Sample Basic Virtualized Five 9’s Downtime Budget
Table 10.3. Canonical Application-Attributable Cloud-Based Five 9’s Downtime Budget
Table 10.4. Evolution of Sample Downtime Budgets
Table 11.1. Example Service Transition Activity Failure Mode Effect Analysis
Table 11.2. Canonical Hardware Downtime Prediction
Table 11.3. Summary of Hardware Downtime Mitigation Techniques for Cloud Computing
Table 12.1. Sample Service Latency and Reliability Requirements at MP 2
Table 13.1. Sample Solution Latency and Reliability Requirements
Table 13.2. Modeling Input Parameters
Table 14.1. Evolution of Sample Downtime Budgets

EQUATIONS

Equation 3.1. Basic Availability Formula
Equation 3.2. Practical System Availability Formula
Equation 3.3. Standard Availability Formula
Equation 3.4. Estimation of System Availability from MTBF and MTTR
Equation 3.5. Recommended Service Availability Formula
Equation 3.6. Sample Partial Outage Calculation
Equation 3.7. Service Reliability Formula
Equation 3.8. DPM Formula
Equation 3.9. Converting DPM to Service Reliability
Equation 3.10. Converting Service Reliability to DPM
Equation 3.11. Sample DPM Calculation
Equation 6.1. Availability as a Function of MTBF/MTTR
Equation 11.1. Maximum Theoretical Availability across Redundant Elements
Equation 11.2. Maximum Theoretical Service Availability

INTRODUCTION

Cloud computing is a new paradigm for delivering information services to end users, offering distinct advantages over traditional IS/IT deployment models, including being more economical and offering a shorter time to market. Cloud computing is defined by a handful of essential characteristics: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. Cloud providers offer a variety of service models, including infrastructure as a service, platform as a service, and software as a service; and cloud deployment options include private clouds, community clouds, public clouds, and hybrid clouds. End users naturally expect services offered via cloud computing to deliver at least the same service reliability and service availability as traditional service implementation models.

This book analyzes the risks that cloud-based application deployments will fail to achieve the same service reliability and availability as traditional deployments, as well as the opportunities to improve service reliability and availability via cloud deployment. We consider the service reliability and service availability risks from the fundamental definition of cloud computing—the essential characteristics—rather than focusing on any particular virtualization hypervisor software or cloud service offering. Thus, the insights of this higher-level analysis and the recommendations should apply to all cloud service offerings and application deployments. This book also offers recommendations on architecture, testing, and engineering diligence to assure that cloud-deployed applications meet users’ expectations for service reliability and service availability.

Virtualization technology enables enterprises to move their existing applications from traditional deployment scenarios, in which applications are installed directly on native hardware, to more evolved scenarios that include hardware independence and server consolidation. Use of virtualization technology is a common characteristic of cloud computing that enables cloud service providers to better manage usage of their resource pools by multiple cloud consumers. This book also considers the reliability and availability risks along this evolutionary path to guide enterprises planning the evolution of their applications to virtualization and on to full cloud computing enablement over several releases.

AUDIENCE

The book is intended for IS/IT system and solution architects, developers, and engineers, as well as technical sales, product management, and quality management professionals.

ORGANIZATION

The book is organized into three parts: Part I, “Basics”; Part II, “Analysis”; and Part III, “Recommendations.” Part I, “Basics,” defines key terms and concepts of cloud computing, virtualization, service reliability, and service availability. Part I contains three chapters:

Chapter 1, “Cloud Computing.”

 This book uses the cloud terminology and taxonomy defined by the U.S. National Institute of Standards and Technology. This chapter defines cloud computing and reviews the essential and common characteristics of cloud computing. Standard service and deployment models of cloud computing are reviewed, as well as roles of key cloud-related actors. Key benefits and risks of cloud computing are summarized.

Chapter 2, “Virtualization.”

 Virtualization is a common characteristic of cloud computing. This chapter reviews virtualization technology, offers architectural models for virtualization that will be analyzed, and compares and contrasts “virtualized” applications to “native” applications.

Chapter 3, “Service Reliability and Service Availability.”

 This chapter defines service reliability and availability concepts, reviews how those metrics are measured in traditional deployments, and explains how they apply to virtualized and cloud-based deployments. As the telecommunications industry has very precise standards for quantification of service availability and service reliability measurements, concepts and terminology from the telecom industry will be presented in this chapter and used in Part II, “Analysis,” and Part III, “Recommendations.”

Part II, “Analysis,” methodically analyzes the service reliability and availability risks inherent in application deployments on cloud computing and virtualization technology based on the essential and common characteristics given in Part I.

Chapter 4, “Analyzing Cloud Reliability and Availability.”

 This chapter considers the service reliability and service availability risks that are inherent to the essential and common characteristics, service model, and deployment model of cloud computing. This includes implications of service transition activities, elasticity, and service orchestration. Identified risks are analyzed in detail in subsequent chapters of Part II.

Chapter 5, “Reliability Analysis of Virtualization.”

 This chapter analyzes full virtualization, OS virtualization, paravirtualization, and server virtualization and coresidency using standard reliability analysis methodologies. It also analyzes the software reliability risks of virtualization and cloud computing.

Chapter 6, “Hardware Reliability, Virtualization, and Service Availability.”

 This chapter considers how hardware reliability risks and responsibilities shift as applications migrate to virtualized and cloud-based hardware platforms, and how hardware attributed service downtime is determined.

Chapter 7, “Capacity and Elasticity.”

 The essential cloud characteristic of rapid elasticity enables cloud consumers to dispense with the business risk of locking-in resources weeks or months ahead of demand. Rapid elasticity does, however, introduce new risks to service quality, reliability, and availability that must be carefully managed.

Chapter 8, “Service Orchestration Analysis.”

 Service orchestration automates various aspects of IT service management, especially activities associated with capacity management. This chapter reviews policy-based management in the context of cloud computing and considers the associated risks to service reliability and service availability.

Chapter 9, “Geographic Distribution, Georedundancy, and Disaster Recovery.”

 Geographic distribution of application instances is a common characteristic of cloud computing and a best practice for disaster recovery. This chapter considers the service availability implications of georedundancy on applications deployed in clouds.

Part III, “Recommendations,” considers techniques to maximize service reliability and service availability of applications deployed on clouds, as well as the design for reliability diligence needed to assure that virtualized applications and cloud-based solutions meet or exceed the service reliability and availability of traditional deployments.

Chapter 10, “Applications, Solutions and Accountability.”

 This chapter considers how virtualized applications fit into service solutions, and explains how application service downtime budgets change as applications move to the cloud. This chapter also proposes four measurement points for service availability, and discusses how accountability for impairments in each of those measurement points is attributed.

Chapter 11, “Recommendations for Architecting a Reliable System.”

 This chapter covers architectures and techniques to maximize service availability and service reliability via virtualization and cloud deployment. A simple case study is given to illustrate key architectural points.

Chapter 12, “Design for Reliability of Virtualized Applications.”

 This chapter reviews how design for reliability diligence for virtualized applications differs from reliability diligence for traditional applications.

Chapter 13, “Design for Reliability of Cloud Solutions.”

 This chapter reviews how design for reliability diligence for cloud deployments differs from reliability diligence for traditional solutions.

Chapter 14, “Summary.”

 This chapter gives an executive summary of the analysis, insights, and recommendations on assuring that the reliability and availability of cloud-based solutions meet or exceed the performance of traditional deployments.

ACKNOWLEDGMENTS

The authors were greatly assisted by many deeply knowledgeable and insightful engineers at Alcatel-Lucent, especially: Mark Clougherty, Herbert Ristock, Shawa Tam, Rich Sohn, Bernard Bretherton, John Haller, Dan Johnson, Srujal Shah, Alan McBride, Lyle Kipp, and Ted East. Joe Tieu, Bill Baker, and Thomas Voith carefully reviewed the early manuscript and provided keen review feedback. Abhaya Asthana, Kasper Reinink, Roger Maitland, and Mark Cameron provided valuable input. Gary McElvany raised the initial architectural questions that ultimately led to this work. This work would not have been possible without the strong management support of Tina Hinch, Werner Heissenhuber, Annie Lequesne, Vickie Owens-Rinn, and Dor Skuler.

Cloud computing is an exciting, evolving technology with many avenues to explore. Readers with comments or corrections on topics covered in this book, or topics for a future edition of this book, are invited to send email to the authors ([email protected], [email protected], or [email protected]).

Eric Bauer
Randee Adams

I

BASICS

1

CLOUD COMPUTING

The U.S. National Institute of Standards and Technology (NIST) defines cloud computing as follows:

Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction

[NIST-800-145].

This definition frames cloud computing as a “utility” (or a “pay as you go”) consumption model for computing services, similar to the utility model deployed for electricity, water, and telecommunication service. Once a user is connected to the computing (or telecommunications, electricity, or water utility) cloud, they can consume as much service as they would like whenever they would like (within reasonable limits), and are billed for the resources consumed. Because the resources delivering the service can be shared (and hence amortized) across a broad pool of users, resource utilization and operational efficiency can be higher than they would be for dedicated resources for each individual user, and thus the price of the service to the consumer may well be lower from a cloud/utility provider compared with the alternative of deploying and operating private resources to provide the same service. Overall, these characteristics facilitate outsourcing production and delivery of these crucial “utility” services. For example, how many individuals or enterprises prefer to generate all of their own electricity rather than purchasing it from a commercial electric power supplier?

This chapter reviews the essential characteristics of cloud computing, as well as several common characteristics of cloud computing; considers how cloud data centers differ from traditional data centers; and discusses the cloud service and cloud deployment models. The terminology for the various roles in cloud computing that will be used throughout the book is defined. The chapter concludes by reviewing the benefits and risks of cloud computing.

1.1 ESSENTIAL CLOUD CHARACTERISTICS

Per [NIST-800-145], there are five essential functional characteristics of cloud computing:

1. on-demand self service;
2. broad network access;
3. resource pooling;
4. rapid elasticity; and
5. measured service.

Each of these is considered individually.

1.1.1 On-Demand Self-Service

Per [NIST-800-145], the essential cloud characteristic of “on-demand self-service” means “a consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service’s provider.” Modern telecommunications networks offer on-demand self-service: one has direct dialing access to any other telephone whenever one wants. This behavior of modern telecommunications networks contrasts with decades ago, when callers had to ask a human operator to place a long distance or international call on their behalf. In a traditional data center, users might have to order server resources to host applications weeks or months in advance. In the cloud computing context, on-demand self-service means that resources are “instantly” available to service user requests, such as via a service/resource provisioning website or via API calls.
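To make the self-service idea concrete, the following minimal sketch shows a consumer provisioning a virtual machine programmatically rather than filing a request with a human operator. The endpoint, token, and payload fields are hypothetical illustrations, not any particular provider's API.

import json
import urllib.request

API_ENDPOINT = "https://cloud.example.com/v1/instances"  # hypothetical provisioning endpoint
API_TOKEN = "replace-with-your-token"                    # hypothetical credential

def provision_instance(name: str, vcpus: int, memory_gb: int) -> dict:
    """Request a new virtual machine instance from the provider, with no human interaction."""
    payload = json.dumps({"name": name, "vcpus": vcpus, "memory_gb": memory_gb}).encode()
    request = urllib.request.Request(
        API_ENDPOINT,
        data=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # e.g., {"id": "...", "status": "provisioning"} from the hypothetical provider
        return json.load(response)

if __name__ == "__main__":
    print(provision_instance("web-01", vcpus=2, memory_gb=4))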

1.1.2 Broad Network Access

Per [NIST-800-145] “broad network access” means “capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).” Users expect to access cloud-based services anywhere there is adequate IP networking, rather than requiring the user to be in a particular physical location. With modern wireless networks, users expect good quality wireless service anywhere they go. In the context of cloud computing, this means users want to access the cloud-based service via whatever wireline or wireless network device they wish to use over whatever IP access network is most convenient.

1.1.3 Resource Pooling

Per [NIST-800-145], the essential characteristic of “resource pooling” is defined as: “the provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand.” Service providers deploy a pool of servers, storage devices, and other data center resources that are shared across many users to reduce costs to the service provider, as well as to the cloud consumers that pay for cloud services. Ideally, the cloud service provider will intelligently select which resources from the pool to assign to each cloud consumer’s workload to optimize the quality of service experienced by each user. For example, resources located on servers physically close to the end user (and which thus introduce less transport latency) may be selected, and alternate resources can be automatically engaged to mitigate the impact of a resource failure event. This is essentially the utility model applied to computing. For example, electricity consumers don’t expect that a specific electrical generator has been dedicated to them personally (or perhaps to their town); they just want to know that their electricity supplier has pooled the generator resources so that the utility will reliably deliver electricity despite inevitable failures, variations in load, and glitches.

Computing resources are generally used on a very bursty basis (e.g., when a key is pressed or a button is clicked). Timeshared operating systems were developed decades ago to enable a pool of users or applications with bursty demands to efficiently share a powerful computing resource. Today’s personal computer operating systems routinely support many simultaneous applications on a PC or laptop, such as simultaneously viewing multiple browser windows, doing e-mail and instant messaging, and having virus and malware scanners running in the background, as well as all the infrastructure software that controls the keyboard, mouse, display, networking, real-time clock, and so on. Just as intelligent resource sharing on your PC enables more useful work to be done cost-effectively than would be possible if each application had a dedicated computing resource, intelligent resource sharing in a computing cloud environment enables more applications to be served on less total computing hardware than would be required with dedicated computing resources. This resource sharing lowers costs for the data center hosting the computing resources for each application, and this enables lower prices to be charged to cloud consumers than would be possible for dedicated computing resources.
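As a rough illustration of that consolidation effect, here is a minimal sketch that packs application workloads onto a shared server pool (first-fit placement) instead of dedicating one server per application. The server capacity and workload demands are illustrative assumptions, not figures from the book.

SERVER_CAPACITY = 16  # e.g., vCPUs per pooled server (assumed)

def place_workloads(workload_demands):
    """Assign each workload to the first pooled server with spare capacity."""
    servers = []     # remaining capacity of each pooled server
    placement = []   # index of the server hosting each workload
    for demand in workload_demands:
        for index, free in enumerate(servers):
            if demand <= free:
                servers[index] -= demand
                placement.append(index)
                break
        else:
            servers.append(SERVER_CAPACITY - demand)  # start a new pooled server
            placement.append(len(servers) - 1)
    return placement, len(servers)

if __name__ == "__main__":
    demands = [4, 2, 8, 6, 3, 5, 2]  # per-application demands (assumed)
    placement, pooled_servers = place_workloads(demands)
    print("placements:", placement)
    print(f"pooled servers needed: {pooled_servers} versus {len(demands)} dedicated servers")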

1.1.4 Rapid Elasticity

[NIST-800-145] describes “rapid elasticity” as “capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out, and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.”

Forecasting future demand is always hard, and there is always the risk that unforeseen events will change plans and thereby increase or decrease the demand for service. For example, electricity demand spikes on hot summer afternoons when customers crank up their air conditioners, and business applications have peak usage during business hours, while entertainment applications peak in evenings and on weekends. In addition, most application services have time of day, day of week, and seasonal variations in traffic volumes. Elastically increasing service capacity during busy periods and releasing capacity during off-peak periods enables cloud consumers to minimize costs while meeting service quality expectations. For example, retailers might experience heavy workloads during the holiday shopping season and light workloads the rest of the year; elasticity enables them to pay only for the computing resources they need in each season, thereby enabling computing expenses to track more closely with revenue. Likewise, an unexpectedly popular service or particularly effective marketing campaign can cause demand for a service to spike beyond planned service capacity. End users expect available resources to “magically” expand to accommodate the offered service load with acceptable service quality. For cloud computing, this means all users are served with acceptable service quality rather than receiving “busy” or “try again later” messages, or experiencing unacceptable service latency or quality.

Just as electricity utilities can usually source additional electric power from neighboring electricity suppliers when their users’ demand outstrips the utility’s generating capacity, arrangements can be made to overflow applications from one cloud that is operating at capacity to other clouds that have available capacity. This notion of gracefully overflowing application load from one cloud to other clouds is called “cloud bursting.”
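The following is a minimal sketch of the kind of elasticity policy implied here: grow capacity when utilization is high, release capacity when utilization is low, and burst to another cloud when the local pool is exhausted. The thresholds and pool limit are illustrative assumptions, not values from the book.

LOCAL_POOL_LIMIT = 20       # maximum instances available in the local cloud (assumed)
SCALE_OUT_THRESHOLD = 0.80  # grow when average utilization exceeds 80% (assumed)
SCALE_IN_THRESHOLD = 0.30   # shrink when average utilization drops below 30% (assumed)

def elasticity_decision(active_instances: int, avg_utilization: float) -> str:
    """Return the action an elasticity policy might take for one measurement interval."""
    if avg_utilization > SCALE_OUT_THRESHOLD:
        if active_instances < LOCAL_POOL_LIMIT:
            return "scale out: add an instance in the local cloud"
        return "cloud burst: overflow new instances to another cloud"
    if avg_utilization < SCALE_IN_THRESHOLD and active_instances > 1:
        return "scale in: release an instance"
    return "hold: capacity matches demand"

if __name__ == "__main__":
    for instances, load in [(4, 0.92), (20, 0.95), (6, 0.15), (5, 0.55)]:
        print(instances, "instances at", load, "->", elasticity_decision(instances, load))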

1.1.5 Measured Service

[NIST-800-145] describes the essential cloud computing characteristic of “measured service” as “cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and the consumer of the utilized service.” Cloud consumers want the option of usage-based (or pay-as-you-go) pricing in which their price is based on the resources actually consumed, rather than being locked into a fixed pricing arrangement. Measuring resource consumption and appropriately charging cloud consumers for their actual resource consumption encourages them not to squander resources and to release unneeded resources so those resources can be used by other cloud consumers.
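A minimal sketch of the pay-as-you-go arithmetic follows. The unit rates and metered usage figures are illustrative assumptions, not any provider's actual prices.

UNIT_RATES = {
    "vcpu_hours": 0.05,         # $ per vCPU-hour (assumed)
    "storage_gb_months": 0.10,  # $ per GB-month of storage (assumed)
    "egress_gb": 0.08,          # $ per GB transferred out (assumed)
}

def monthly_charge(usage: dict) -> float:
    """Compute a usage-based bill from metered resource consumption."""
    return sum(UNIT_RATES[resource] * quantity for resource, quantity in usage.items())

if __name__ == "__main__":
    # e.g., two vCPUs running all month, 50 GB stored, 120 GB of egress traffic
    metered_usage = {"vcpu_hours": 2 * 24 * 30, "storage_gb_months": 50, "egress_gb": 120}
    print(f"Monthly charge: ${monthly_charge(metered_usage):.2f}")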

1.2 COMMON CLOUD CHARACTERISTICS

NIST originally included eight common characteristics of cloud computing in their definition [NIST-B], but as these characteristics were not essential, they were omitted from the formal definition of cloud computing. Nevertheless, six of these eight common characteristics do impact service reliability and service availability, and thus will be considered later in this book.

Virtualization

By untethering application software from specific dedicated hardware, virtualization technology (discussed in Chapter 2, “Virtualization”) gives cloud service providers control to manage workloads across massive pools of compute servers.

Geographic Distribution

Having multiple geographically distributed data center sites enables cloud providers flexibility to assign a workload to resources close to the end user. For example, for real-time gaming, users are more likely to have an excellent quality of experience via low service latency if they are served by resources geographically close to them than if they are served by resources on another continent. In addition, geographic distribution in the form of georedundancy is essential for disaster recovery and business continuity planning. Operationally, this means engineering for sufficient capacity and network access across several geographically distributed sites so that a single disaster will not adversely impact more than that single site, and the impacted workload can be promptly redeployed to nonaffected sites.
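As a rough sketch of that capacity-engineering point (with assumed numbers, not values from the book): if a workload is spread across k sites and any single site can be lost to a disaster, each surviving site must be able to absorb its share of the displaced load, roughly total load divided by (k - 1).

def per_site_capacity(total_load: float, sites: int) -> float:
    """Capacity each site needs if the full load must fit on (sites - 1) surviving sites."""
    if sites < 2:
        raise ValueError("need at least two sites to survive the loss of one")
    return total_load / (sites - 1)

if __name__ == "__main__":
    total_load = 100_000  # e.g., transactions per second across all users (assumed)
    for sites in (2, 3, 4):
        capacity = per_site_capacity(total_load, sites)
        print(f"{sites} sites -> each site needs capacity for {capacity:,.0f} "
              f"({100 / (sites - 1):.0f}% of total load)")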

Resilient Computing

Hardware devices, like hard disk drives, wear out and fail for well-understood physical reasons. As the pool of hardware resources increases, the probability that some hardware device will fail in any week, day, or hour increases as well. Likewise, as the number of online servers increases, so does the risk that software running on one of those online server instances will fail. Thus, cloud computing applications and infrastructure must be designed to routinely detect, diagnose, and recover service following inevitable failures without causing unacceptable impairments to user service.
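A minimal sketch of why failures become routine at scale: if each server independently fails in a given week with probability p, the chance that at least one of N servers fails that week is 1 - (1 - p)^N. The per-server failure probability and pool sizes below are illustrative assumptions.

def prob_at_least_one_failure(p: float, n: int) -> float:
    """Probability that at least one of n independent servers fails in the period."""
    return 1.0 - (1.0 - p) ** n

if __name__ == "__main__":
    weekly_failure_prob = 0.001  # assumed per-server weekly failure probability
    for servers in (100, 1_000, 5_000):
        chance = prob_at_least_one_failure(weekly_failure_prob, servers)
        print(f"{servers} servers -> {chance:.2%} chance of at least one failure this week")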

Advanced Security

Computing clouds are big targets for cybercriminals and others intent on disrupting service, and the homogeneity and massive scale of clouds make them particularly appealing. Advanced security techniques, tools, and policies are essential to assure that malevolent individuals or organizations don’t penetrate the cloud and compromise application service or data.

Massive Scale

To maximize operational efficiencies that drive down costs, successful cloud deployments will be of massive scale.

Homogeneity

To maximize operational efficiencies, successful cloud deployments will limit the range of different hardware, infrastructure, software platforms, policies and procedures they support.

1.3 BUT WHAT, EXACTLY, IS CLOUD COMPUTING?

Fundamentally, cloud computing is a new business model for operating data centers. Thus, one can consider cloud computing in two steps:

1. What is a data center?
2. How is a cloud data center different from a traditional data center?

1.3.1 What Is a Data Center?

A data center is a physical space that is environmentally controlled with clean electrical power and network connectivity that is optimized for hosting servers. The temperature and humidity of the data center environment are controlled to enable proper operation of the equipment, and the facility is physically secured to prevent deliberate or accidental damage to the physical equipment. This facility will have one or more connections to the public Internet, often via redundant and physically separated cables into redundant routers. Behind the routers will be security appliances, like firewalls or deep packet inspection elements, to enforce a security perimeter protecting servers in the data center. Behind the security appliances are often load balancers, which distribute traffic across front-end servers such as web servers. Often there are one or two tiers of servers behind the application front end, such as a second tier of servers implementing application or business logic and a third tier of database servers. Establishing and operating a traditional data center facility—including IP routers and infrastructure, security appliances, load balancers, servers, storage, and supporting systems—requires a large capital outlay and substantial operating expenses, all to support application software that often has widely varying load, so much of the resource capacity is often underutilized.

The Uptime Institute [Uptime and TIA942] defines four tiers of data centers that characterize the risk of service impact (i.e., downtime) due to both service management activities and unplanned failures:

Tier I: Basic
Tier II: Redundant components
Tier III: Concurrently maintainable
Tier IV: Fault tolerant

Tier I “basic” data centers must be completely shut down to execute planned and preventive maintenance, and are fully exposed to unplanned failures. [UptimeTiers] offers “Tier 1 sites typically experience 2 separate 12-hour, site-wide shutdowns per year for maintenance or repair work. In addition, across multiple sites and over a number of years, Tier I sites experience 1.2 equipment or distribution failures on an average year.” This translates to a data center availability rating of 99.67% with nominally 28.8 hours of downtime per year.

Tier II “redundant component” data centers include some redundancy and so are less exposed to service downtime. [UptimeTiers] offers “the redundant components of Tier II topology provide some maintenance opportunity leading to just 1 site-wide shutdown each year and reduce the number of equipment failures that affect the IT operations environment.” This translates to a data center availability rating of 99.75% with nominally 22 hours of downtime per year.

Tier III “concurrently maintainable” data centers are designed with sufficient redundancy that all service transition activities can be completed without disrupting service. [UptimeTiers] offers “experience in actual data centers shows that operating better maintained systems reduces unplanned failures to a 4-hour event every 2.5 years. … ” This translates to a data center availability rating of 99.98%, with nominally 1.6 hours of downtime per year.

Tier IV “fault tolerant” data centers are designed to withstand any single failure and permit service transition type activities, such as software upgrade to complete with no service impact. [UptimeTiers] offers “Tier IV provides robust, Fault Tolerant site infrastructure, so that facility events affecting the computer room are empirically reduced to (1) 4-hour event in a 5 year operating period. … ” This translates to a data center availability rating of 99.99% with nominally 0.8 hours of downtime per year.
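The downtime figures quoted for each tier follow directly from the availability ratings: expected annual downtime is roughly (1 - availability) x 8,760 hours. Below is a minimal sketch of that arithmetic using the ratings above; small differences from the quoted hours are due to rounding in the ratings.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours

tier_availability = {
    "Tier I": 0.9967,
    "Tier II": 0.9975,
    "Tier III": 0.9998,
    "Tier IV": 0.9999,
}

for tier, availability in tier_availability.items():
    # Expected annual downtime = unavailability x hours per year
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{tier}: {availability:.2%} availability ~ {downtime_hours:.1f} hours of downtime per year")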

1.3.2 How Does Cloud Computing Differ from Traditional Data Centers?

Not only are data centers expensive to build and maintain, but deploying an application into a data center may mean purchasing and installing the computing resources to host that application. Purchasing computing resources implies a need to do careful capacity planning to decide exactly how much computing resource to invest in; purchase too little, and users will experience poor service; purchase too much and excess resources will be unused and stranded. Just as electrical power utilities pool electric power-generating capacity to offer electric power as a service, cloud computing pools computing resources, offers those resources to cloud consumers on-demand, and bills cloud consumers for resources actually used. Virtualization technology makes operation and management of pooled computing resources much easier. Just as electric power utilities gracefully increase and decrease the flow of electrical power to customers to meet their individual demand, clouds elastically grow and shrink the computing resources available for individual cloud consumer’s workloads to match changes in demand. Geographic distribution of cloud data centers can enable computing services to be offered physically closer to each user, thereby assuring low transmission latency, as well as supporting disaster recovery to other data centers. Because multiple applications and data sets share the same physical resources, advanced security is essential to protect each cloud consumer. Massive scale and homogeneity enable cloud service providers to maximize efficiency and thus offer lower costs to cloud consumers than traditional or hosted data center options. Resilient computing architectures become important because hardware failures are inevitable, and massive data centers with lots of hardware means lots of failures; resilient computing architectures assure that those hardware failures cause minimal service disruption. Thus, the difference between a traditional data center and a cloud computing data center is primarily the business model along with the policies and software that support that business model.

1.4 SERVICE MODELS

NIST defines three service models for cloud computing: infrastructure as a service, platform as a service, and software as a service. These cloud computing service models logically sit above the IP networking infrastructure, which connects end users to the applications hosted on cloud services. Figure 1.1 visualizes the relationship between these service models.

Figure 1.1. Service Models.

The cloud computing service models are formally defined as follows.

Infrastructure as a Service (IaaS)

“[T]he capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls)” [NIST-800-145]. IaaS services include: compute, storage, content delivery networks to improve performance and/or cost of serving web clients, and backup and recovery service.

Platform as a Service (PaaS)

“[T]he capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations” [NIST-800-145]. PaaS services include: operating system, virtual desktop, web services delivery and development platforms, and database services.

Software as a Service (SaaS)

“[T]he capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings” [NIST-800-145]. SaaS applications include: e-mail and office productivity; customer relationship management (CRM), enterprise resource planning (ERP); social networking; collaboration; and document and content management.

Figure 1.2 gives concrete examples of IaaS, PaaS, and SaaS offerings.

Figure 1.2. OpenCrowd’s Cloud Taxonomy.

Source: Copyright 2010, Image courtesy of OpenCrowd, opencrowd.com.

1.5 CLOUD DEPLOYMENT MODELS

NIST recognizes four cloud deployment models:

Private Cloud

“the cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on premise or off premise.” [NIST-800-145]

Community Cloud

“the cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on premise or off premise” [NIST-800-145].

Public Cloud

“the cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services” [NIST-800-145].

Hybrid Cloud

“the cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds)” [NIST-800-145].

Cloud service providers typically offer either private, community or public clouds, and cloud consumers select which of those three to use, or adopt a hybrid deployment strategy blending private, community and/or public clouds.

1.6 ROLES IN CLOUD COMPUTING

Cloud computing opens up interfaces between applications, platform, infrastructure, and network layers, thereby enabling different layers to be offered by different service providers. While NIST [NIST-C] and some other organizations propose new roles of cloud service consumers, cloud service distributors, cloud service developers and vendors, and cloud service providers, the authors will use the more traditional roles of suppliers, service providers, cloud consumers, and end users, as illustrated in Figure 1.3.

Figure 1.3. Roles in Cloud Computing.

Specific roles in Figure 1.3 are defined below.

Suppliers

develop the equipment, software, and integration services that implement the cloud-based and client application software, the platform software, and the hardware-based systems that support the networking, compute, and storage that underpin cloud computing.

Service providers

own, operate, and maintain the solutions, systems, equipment, and networking needed to deliver service to end users. The specific service provider roles are defined as:

IP network service providers

carry IP communications between end user’s equipment and IaaS provider’s equipment, as well as between IaaS data centers. Network service providers operate network equipment and facilities to provide Internet access and/or wide area networking service. Note that while there will often be only a single infrastructure, platform, and software service provider for a particular cloud-based application, there may be several different network service providers involved in IP networking between the IaaS service provider’s equipment and end users’ equipment. Internet service providers and Internet access providers are examples of network service providers. While IP networking service is not explicitly recognized in NIST’s service model, these service providers have a crucial role in delivering end-to-end services to cloud users and can thus impact the quality of experience for end users.

IaaS providers

“have control of hardware, hypervisor and operating system, to provide services to consumers. For IaaS, the provider maintains the storage, database, message queue or other middleware, or the hosting environment for virtual machines. The [PaaS/SaaS/cloud] consumer uses that service as if it was a disk drive, database, message queue, or machine, but they cannot access the infrastructure that hosts it” [NIST-C]. Most IaaS providers focus on providing complete computing platforms for consumers’ VMs, including operating system, memory, storage, and processing power. Cloud consumers often pay for only what they use, which fits nicely into most companies’ computing budgets.

PaaS providers

“take control of hardware, hypervisor, OS and middleware, to provide services. For PaaS, the provider manages the cloud infrastructure for the platform, typically a framework for a particular type of application. The consumer’s application cannot access the infrastructure underneath the platform” [NIST-C]. PaaS providers give developers complete development environments in which to code, host, and deliver applications. The development environment typically includes the underlying infrastructure, development tools, APIs, and other related services.

SaaS providers

“rely on hardware, hypervisor, OS, middleware, and application layers to provide services. For SaaS, the provider installs, manages and maintains the software. The provider does not necessarily own the physical infrastructure in which the software is running. Regardless, the consumer does not have access to the infrastructure; they can access only the application”[NIST-C]. Common SaaS offerings include desktop productivity, collaboration, sales and customer relationship management, and documentation management.

Cloud consumers

(or simply “consumers”) are generally enterprises offering specific application services to end users by arranging to have appropriately configured software execute on XaaS resources hosted by one or more service providers. Cloud consumers pay service providers for cloud XaaS resources consumed. End users are typically aware only of the enterprise’s application; the services offered by the various XaaS service providers are completely invisible to end users.

End users

(or simply users) use the software applications hosted on the cloud. Users access cloud-based applications via IP networking from some user equipment, such as a smartphone, laptop, tablet, or PC.

There are likely to be several different suppliers and service providers supporting a single cloud consumer’s application to a community of end users. The cloud consumer may have some supplier role in developing and integrating the software and solution. It is possible that the end users are in the same organization as the one that offers the cloud-based service to end users.

1.7 BENEFITS OF CLOUD COMPUTING

The key benefit of cloud computing for many enterprises is that it turns IT from a capital-intensive concern into a pay-as-you-go activity in which operating expenses track usage, and ideally computing expenses track revenue. Beyond this strategic shift from capital expense to operating expense, there are other benefits of cloud computing, drawn from [Kundra] and others:

Increased Flexibility

The rapid elasticity of cloud computing enables the resources engaged for an application to promptly grow and later shrink to track the actual workload, so cloud consumers are better able to satisfy customer demand without taking on the financial risk of having to accurately predict future demand.

Rapid Implementation

Cloud consumers no longer need to procure, install, and bring into service new compute capacity before offering new applications or serving increased workloads. Instead, they can easily buy the necessary computing capacity “off the shelf” from cloud service providers, thereby simplifying and shortening the service deployment cycle.

Increased Effectiveness

Cloud computing enables cloud consumers to focus their scarce resources on building services to solve enterprise problems rather than investing in deploying and maintaining computing infrastructure, thereby increasing their organizational effectiveness.

Energy Efficiency

Cloud service providers have the scale and infrastructure necessary to enable effective sharing of compute, storage, networking, and data center resources across a community of cloud consumers. This not only reduces the total number of servers required compared with dedicated IT resources, but also reduces the associated power, cooling, and floor space consumed. In essence, intelligent sharing of cloud computing infrastructure enables higher resource utilization of a smaller overall pool of resources compared with dedicated IT resources for each individual cloud consumer.
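
To make the sharing argument concrete, here is a minimal worked example of consolidation arithmetic; the figures (20 dedicated servers averaging 10% utilization, a 60% utilization target for the shared pool, 400 W per server) are purely illustrative assumptions rather than measurements from any real deployment.

```python
import math

# Illustrative consolidation arithmetic (all figures are assumptions).
dedicated_servers = 20      # one lightly loaded server per application
avg_utilization = 0.10      # average utilization of each dedicated server
target_utilization = 0.60   # utilization a shared cloud pool might sustain
watts_per_server = 400      # assumed average power draw per server

# Useful work, expressed in "fully utilized server" equivalents.
useful_capacity = dedicated_servers * avg_utilization               # 2.0 servers' worth

# Servers needed if that same work is packed onto a shared pool.
pooled_servers = math.ceil(useful_capacity / target_utilization)    # 4 servers

power_saved = (dedicated_servers - pooled_servers) * watts_per_server
print(f"Dedicated servers: {dedicated_servers}, pooled servers: {pooled_servers}")
print(f"Approximate power reduction: {power_saved} W, before cooling savings")
```

Even with generous headroom left in the shared pool, the arithmetic shows why consolidation reduces server count and, with it, power, cooling, and floor space.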

1.8 RISKS OF CLOUD COMPUTING

As cloud computing essentially outsources responsibility for critical IS/IT infrastructure to a service provider, the cloud consumer gives up some control and is confronted with a variety of new risks. These risks range from reduced operational control and visibility (e.g., over the timing and control of some software upgrades) to changes in accountability (e.g., provider service level agreements) and myriad other concerns. This book considers only the risk that the service reliability and service availability of virtualized and cloud-based solutions will fail to meet or exceed the levels achieved by traditional deployments.

2

VIRTUALIZATION

Virtualization is the logical abstraction of physical assets, such as the hardware platform, operating system (OS), storage devices, data stores, or network interfaces. Virtualization was initially developed to improve resource utilization of mainframe computers, and has evolved to become a common characteristic of cloud computing. This chapter begins with a brief background of virtualization, then describes the characteristics of virtualization and the lifecycle of a virtual machine (VM), and concludes by reviewing popular use cases of virtualization technology.

2.1 BACKGROUND

The notion of virtualization has been around for decades. Dr. Christopher Strachey of Oxford University used the term in his paper Time Sharing in Large Fast Computers, presented in 1959. Computer time sharing meant that multiple engineers could share a computer and work on their software in parallel; this concept became known as multiprogramming. In 1962, one of the first supercomputers, the Atlas Computer, was commissioned. One of the key features of the Atlas Computer was the supervisor, which was responsible for allocating system resources in support of multiprogramming; that supervisor is considered an early OS. The Atlas Computer also introduced the notion of virtual memory, that is, the separation of the physical memory store from the programs accessing it. IBM quickly followed suit with the M44/44X project, which coined the term VM. Virtual memory and VM technologies enabled programs to run in parallel without knowledge of the other executing programs. Virtualization was used to partition large mainframe computers into multiple VMs, enabling multiple applications and processes to run in parallel and thus better utilize hardware resources. With the advent of less expensive computers and distributed computing, this ability to maximize hardware utilization became less necessary.

The proliferation of computers in the 1990s created another opportunity for virtualization to improve resource utilization. VMware and others constructed virtualization products to enable myriad applications running on many lightly utilized computers to be consolidated onto a smaller number of servers. This server consolidation dramatically reduced hardware-related operating expenses, including data center floor space, cooling, and maintenance. By decoupling applications from the underlying hardware resources that support them to enable efficient resource sharing, virtualization technology enables the cloud computing business model that is proliferating today.

2.2 WHAT IS VIRTUALIZATION?

A simple analogy for virtualization is the picture-in-picture feature of some televisions and set-top boxes, which displays a small virtual television image on top of another television image, thereby allowing both programs to be viewed simultaneously. Computer virtualization is similar in that several applications that would normally execute on dedicated computer hardware (analogous to individual television channels) actually run on a single hardware platform that supports virtualization, thereby enabling multiple applications to execute simultaneously.

Virtualization can be implemented at various portions of the system architecture:

Network virtualization

entails virtual IP management and segmentation.

Memory virtualization

entails the aggregation of memory resources into a single pool of memory and managing that memory on behalf of the multiple applications using it.

Storage virtualization

provides a layer of abstraction for the physical storage of data at the device level (referred to as block virtualization) or at the file level (referred to as file virtualization). Technologies such as storage area networks (SANs) for block-level storage and network attached storage (NAS) for file-level storage can efficiently manage storage in a central location for multiple applications across the network, rather than requiring each application to manage its own storage on a physically attached device.

Processor virtualization

enables a processor to be shared across multiple application instances.

Virtualization decouples an application from the underlying physical hardware, including CPU, networking, memory, and nonvolatile data storage or disk. Application software experiences virtualization as a VM, which is defined by [OVF] as “an encapsulation of the virtual hardware, virtual disks, and the metadata associated with it.” Figure 2.1 gives a simple depiction of a typical virtualized server. One of the key components of virtualization is the hypervisor, also called the VM monitor (VMM); these terms will be used interchangeably in this chapter. The hypervisor supports the running of multiple OSs concurrently on a single host computer and is responsible for managing the applications’ OSs (called the guest OSs) and their use of system resources (e.g., CPU, memory, and storage). Virtual machines (VMs) are isolated instances of application software and a guest OS that run like separate computers. It is the hypervisor’s responsibility to enforce this isolation and manage the multiple VMs running on the same host computer.

Figure 2.1. Virtualizing Resources.

A virtual appliance is a software image delivered as a complete software stack installed on one or more VMs and managed as a unit. A virtual appliance is usually delivered as a set of Open Virtualization Format (OVF) files. The purpose of virtual appliances is to facilitate the deployment of applications; they often come with web interfaces to simplify configuration and installation.
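
As a rough illustration of what an OVF-style appliance description contains, the sketch below assembles a heavily simplified descriptor in Python. The element and attribute names are loosely modeled on the OVF envelope structure but are simplified and partly invented for brevity; the output is not a valid OVF document.

```python
# Build a heavily simplified, OVF-flavored descriptor for a one-VM appliance.
# Illustrative only: real OVF uses XML namespaces, disk/file references, and
# standardized resource type codes that are omitted here.
import xml.etree.ElementTree as ET

envelope = ET.Element("Envelope")
vsystem = ET.SubElement(envelope, "VirtualSystem", {"id": "example-appliance"})
ET.SubElement(vsystem, "Info").text = "Single-VM virtual appliance (illustrative)"

hardware = ET.SubElement(vsystem, "VirtualHardwareSection")
ET.SubElement(hardware, "Item", {"resourceType": "CPU", "quantity": "2"})
ET.SubElement(hardware, "Item", {"resourceType": "Memory", "quantityMB": "4096"})
ET.SubElement(hardware, "Item", {"resourceType": "Disk", "sizeGB": "20"})

print(ET.tostring(envelope, encoding="unicode"))
```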

2.2.1 Types of Hypervisors

There are two types of hypervisors (pictured in Figure 2.2):

Type 1

The hypervisor runs directly on the hardware (i.e., on “bare metal”) to control the hardware and monitor the guest OSs, which run at the level above the hypervisor. Type 1 represents the original implementation of the hypervisor.

Type 2

The hypervisor runs on top of an existing OS (referred to as the host OS) to monitor the guest OSs, which are running at a third level above the hardware (above the host OS and hypervisor).

Figure 2.2. Type 1 and Type 2 Hypervisors.

2.2.2 Virtualization and Emulation

In the industry, the terms virtualization and emulation are sometimes used interchangeably, but they actually refer to two separate technologies. Emulation entails making one system behave like another so that software written for a particular system can run on a completely different system with the same interfaces and produce the same results. Emulation increases the flexibility to move software to different hardware platforms, but it usually carries a significant performance cost. Virtualization decouples an entity from its physical assets: VMs are isolated environments that are independent of the hardware they run on. Some virtualization technologies use emulation while others do not.

2.3 SERVER VIRTUALIZATION

There are three main types of server virtualization, along with a hardware-assisted variant of full virtualization:

Full virtualization

allows instances of software written for different OSs (referred to as guest OSs) to run concurrently on a host computer. Neither the application software nor the guest OS needs to be changed. Each VM is isolated from the others and managed by a hypervisor or VMM, which provides emulated hardware to the VMs so that application and OS software can seamlessly run on different virtualized hardware servers. Full virtualization thus supports multiple applications on multiple OSs on the same server, and failovers or migrations can be performed onto servers of different hardware generations. Full virtualization can be realized with hardware emulation that supports this separation of the hardware from the applications; however, that emulation incurs a performance overhead, which may be partially addressed by hardware-assisted virtualization.

Hardware-assisted virtualization

is similar to full virtualization but has the added performance advantage of virtualization-aware processors. The system hardware interacts with the hypervisor and also allows the guest OSs to directly process privileged instructions without going through the hypervisor.

Paravirtualization

is similar to full virtualization in that it supports VMs on multiple OSs; however, the guest OSs must be adapted to interface with the hypervisor. Paravirtualization provides a closer tie between the guest OS and the hypervisor. The benefit is better performance since emulation is not required; however, realizing this tighter interface requires changes to the guest OS so that it makes the customized API calls. Some products support paravirtualization with hardware assist to further improve performance.

OS virtualization

supports partitioning of the OS software into individual virtual environments (sometimes referred to as containers), which are limited to running on the same host OS. OS virtualization provides the best performance since native OS calls can be made directly. Its simplicity derives from the requirement that the guest OS be the same as the host OS; however, that is also its disadvantage: OS virtualization cannot support multiple different OSs on the same server, although it can support hundreds of container instances on a single server.
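
On Linux hosts, the container isolation described above is commonly built on kernel facilities such as namespaces and control groups. The sketch below, which assumes a Linux system, simply lists the namespaces of the current process; each entry (pid, net, mnt, uts, ipc, and so on) is one dimension along which an OS-virtualization container can be given its own isolated view of the shared host kernel.

```python
# List the Linux namespaces of the current process (Linux hosts only).
# Each namespace (pid, net, mnt, uts, ipc, user, ...) is one isolation
# dimension that OS virtualization ("containers") builds on.
import os

ns_dir = "/proc/self/ns"
for name in sorted(os.listdir(ns_dir)):
    # Each entry is a symbolic link such as "pid:[4026531836]"; processes in
    # the same container share the same namespace identifiers.
    target = os.readlink(os.path.join(ns_dir, name))
    print(f"{name:10s} -> {target}")
```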

2.3.1 Full Virtualization

Full virtualization (depicted in Figure 2.3) uses a VM monitor (or hypervisor) to manage the allocation of hardware resources for the VMs. No changes are required of the guest OS: when a guest OS attempts a privileged operation, the hypervisor traps it, emulates the operation, and returns control to the guest OS. The VMs contain the application software as well as its OS (referred to as the guest OS). With full virtualization, each VM acts as a separate computer, isolated from other VMs co-residing on that hardware. Since the hypervisor runs on bare metal, the various guest OSs can be different; this is unlike OS virtualization, which requires the virtual environments to be based on the host OS.

Figure 2.3. Full Virtualization.
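
To make the trap-and-emulate flow more tangible, the toy sketch below models it in Python. The “privileged operations,” handler names, and VM state are entirely hypothetical; the point is only to show control passing from the guest to the hypervisor for privileged operations and then back to the guest.

```python
# Toy model of trap-and-emulate in full virtualization (purely conceptual).
# A real hypervisor traps privileged CPU instructions; here, "privileged"
# operations are just strings dispatched to emulation handlers.

def emulate_privileged(vm_state, op):
    """Hypervisor side: emulate a privileged operation for a guest, then return."""
    handlers = {
        "disable_interrupts": lambda s: s.update(interrupts=False),
        "update_page_table":  lambda s: s.update(page_table="shadow copy updated"),
    }
    handlers[op](vm_state)   # emulate the effect against this VM's state
    return vm_state          # control returns to the guest

def run_guest(vm_state, instructions):
    """Guest side: unprivileged work runs directly; privileged ops trap."""
    for op in instructions:
        if op == "add":                      # unprivileged: runs natively
            vm_state["acc"] = vm_state.get("acc", 0) + 1
        else:                                # privileged: trap to the hypervisor
            vm_state = emulate_privileged(vm_state, op)
    return vm_state

print(run_guest({"interrupts": True}, ["add", "disable_interrupts", "add"]))
```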

2.3.1.1 Hardware-Assisted Virtualization. 

Hardware-assisted virtualization provides optimizations using virtualization-aware processors, that is, processors that are aware of the server virtualization stack and can therefore, for example, interact directly with the hypervisor or dedicate hardware resources to VMs. The hypervisor still provides isolation and control of the VMs and allocation of the system resources, but the guest OSs can process privileged instructions without going through the hypervisor. Intel and AMD are two of the main processor vendors that support hardware-assisted virtualization.
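
On a Linux host, one common way to check whether a processor advertises these extensions is to look for the “vmx” (Intel VT-x) or “svm” (AMD-V) CPU flags in /proc/cpuinfo. The short sketch below does exactly that and assumes a Linux system.

```python
# Report whether the CPU advertises hardware virtualization extensions
# (Linux hosts only): "vmx" indicates Intel VT-x, "svm" indicates AMD-V.
def hw_virtualization_flags(cpuinfo_path="/proc/cpuinfo"):
    with open(cpuinfo_path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                return {"vmx", "svm"} & flags
    return set()

found = hw_virtualization_flags()
print(f"Hardware-assisted virtualization flags: {found or 'none detected'}")
```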

2.3.2 Paravirtualization

Paravirtualization (illustrated in Figure 2.4) takes a slightly different approach from full virtualization, one meant to improve performance and efficiency. The hypervisor multiplexes (or coordinates) all application access to the underlying host computer resources. A hardware environment is not simulated; however, the guest OS is executed in an isolated domain, as if running on a separate system. Guest OS software must be specifically modified to run in this environment, using kernel-mode drivers and application programming interfaces to directly access parts of the hardware, such as storage and memory. Some products support a combination of paravirtualization (particularly for network and storage drivers) and hardware assist, taking the best of both for optimal performance.

Figure 2.4. Paravirtualization.

2.3.3 OS Virtualization

Operating system virtualization consists of a layer that runs on top of the host OS and provides a set of libraries used by applications to isolate their use of the hardware resources, as shown in Figure 2.5. Each application or application instance can have its own file system, process table, network configuration, and system libraries. Each isolated instance is referred to as a virtual environment or a container. Since the virtual environment or container concept is similar to that of a VM, the term “virtual machine” will be used in subsequent comparisons for consistency. The kernel provides resource management features to limit the impact of one container’s activities on the other containers. OS virtualization does not support OSs other than the host OS. Note that Figure 2.5