Oracle 11g R1/R2 Real Application Clusters Essentials - Syed Jaffer Hussain - E-Book

Oracle 11g R1/R2 Real Application Clusters Essentials E-Book

Syed Jaffer Hussain

Description

Oracle RAC, or Real Application Clusters, is a grid computing solution that allows multiple nodes (servers) in a clustered system to mount and open a single database that resides on shared disk storage. Should a single system (node) fail, the database service will still be available on the remaining nodes. Oracle RAC is an integral part of the Oracle database setup. You have one database with multiple users accessing it, in real time. This book will enable DBAs to get their finger on the pulse of the Oracle 11g RAC environment quickly and easily.

This book covers all areas of the Oracle RAC environment and is indispensable if you are an Oracle DBA who is charged with configuring and implementing Oracle 11g R1, with bonus R2 information included. It presents a complete method for the configuration, installation, and design of Oracle 11g RAC, ultimately enabling rapid administration of Oracle 11g RAC environments.

This practical handbook documents how to administer a complex Oracle 11g RAC environment. Packed with real-world examples, expert tips, and troubleshooting advice, the book begins by introducing the concept of Oracle RAC and High Availability. It then dives deep into the world of RAC configuration, installation, and design, enabling you to support complex RAC environments for real-world deployments. Chapters cover Oracle RAC and High Availability, Oracle 11g RAC Architecture, Oracle 11g RAC Installation, Automatic Storage Management, Troubleshooting, Workload Management, and much more.

By following the practical examples in this book, you will learn every concept of the RAC environment and how to successfully support complex Oracle 11g R1 and R2 RAC environments for various real-world deployments.

This book is the updated release of our previous Oracle 11g R1/R2 Real Application Clusters Handbook. If you already own a copy of that Handbook, there is no need to upgrade to this book.

You can read this e-book in the Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 684

Year of publication: 2011

Table of Contents

Oracle 11g R1/R2 Real Application Clusters Essentials
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Instant Updates on New Packt Books
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code for this book
Errata
Piracy
Questions
1. High Availability
High availability concepts
Planned versus unplanned downtime
Service Level Agreements for high availability
High availability interpretations
Recovery time and high availability
System design for high availability
Business Continuity and high availability
Disaster Recovery
Business Continuity and Disaster Recovery guidelines
Fault-tolerant systems and high availability
Requirements for implementing fault tolerance
Fault tolerance and replication
High availability solutions for Oracle
Oracle Data Guard
Oracle Streams
Oracle Application Server Clustering
High availability: Oracle 11g R1 Real Application Clusters (RAC)
High availability: Oracle 11g R2 Real Application Clusters (RAC)
Summary
2. Oracle 11g RAC Architecture
Oracle 11g RAC architecture
Certification matrix for Oracle 11g RAC architecture
Hardware architecture for Oracle 11g RAC
Server configurations for Oracle 11g RAC
CPU processors
Choosing between 32-bit and 64-bit CPU architectures
Dual core and multicore processors
Network architecture for Oracle 11g RAC
The private network and the Oracle 11g RAC interconnect
Choices for private interconnect and 11g RAC
Ethernet choices for 11g RAC interconnect
Alternatives to Gigabit Ethernet for the interconnect (Infiniband)
Interconnect choices for 11g RAC—how to decide?
Redundancy for Ethernet interconnects with 11g RAC
Network bonding (NIC teaming)
Storage architecture for Oracle 11g RAC
RAID configurations for Oracle 11g RAC
RAID 0 (striping)
RAID 1 (mirroring)
RAID 5 (striped with parity)
RAID 10 (striped mirrors)
Differences between RAID 10 and RAID 0+1
Third-party RAID implementations
IBM AIX LPAR disk volume management for RAID
Linux volume management for RAID configuration
Storage protocols for RAC
SCSI
Fibre Channel
Point-to-Point (FC-P2P)
Fibre Channel Arbitrated Loop (FC-AL)
Fibre Channel Switched Fabric (FC-SW)
Which Fibre Channel topology is best?
Fibre Channel Ports
Fibre Channel Generic Node Ports
iSCSI
Which storage protocol is best for RAC?
Asynchronous I/O versus Direct I/O for Oracle 11g RAC
Oracle 11g RAC components
Voting Disk
Oracle Cluster Registry
Oracle 11g R1 RAC background processes
ACMS Atomic Controlfile to Memory Service
GTX0-j Global Transaction Process
LMON Global Enqueue Service Monitor
LMD Global Enqueue Service Daemon
LMS Global Cache Service Process
LCK0 Instance Enqueue Process
RMSn Oracle RAC Management Processes
RSMN Remote Slave Monitor
Oracle 11g R2 RAC background processes
Grid Plug and Play
Grid Interprocess Communication
Multicast Domain Name Service
Oracle Grid Naming Service
How RAC differs from Oracle 11g single-instance implementations
New ASM features and RAC
New SYSASM privilege for Oracle 11g R1 ASM
Oracle 11g R2 ASM features
OCR and Voting Disk stored in ASM
Oracle Automatic Storage Management Cluster Filesystem (Oracle ACFS)
New Oracle 11g ASM Disk Group compatibility features
Summary
3. Clusterware Installation
Preparing for a cluster installation
Server (node) requirements
Network requirements
Kernel parameters
Operating system packages
OS groups and users
OS user settings
Configuring Secure Shell (SSH)
Verifying prerequisites with the CLUVFY utility
Oracle 11g R1 Clusterware installation
Initiating Oracle Universal Installer for Oracle 11g R1 Clusterware
What happens when orainstRoot.sh and root.sh are run?
Oracle 11g R1 Clusterware post-installation checks
Installing Oracle 11g R1 RAC software
Initiating Oracle Universal Installer for Oracle 11g R1 RAC software
Post-installation tasks
Oracle 11g R2 Clusterware installation
Initiating Oracle Universal Installer for 11g R2 Clusterware
What happens when the root.sh is run?
Oracle 11g R2 Clusterware post-installation checks
Installing Oracle 11g R2 RAC software
Initiating Oracle Universal Installer for Oracle 11g R2 RAC software
Post-installation tasks
Oracle 11g R2 Clusterware new features highlights
Removing/Reconfiguring a Grid Infrastructure configuration
Removing a successful Grid Infrastructure configuration
Reconfiguring a failed Grid Infrastructure configuration
Summary
4. Automatic Storage Management
Overview of Automatic Storage Management (ASM)
Filesystem versus ASM storage architecture
ASM disk
ASM disk group
ASM instance configuration and management
ASM initialization parameters
Creating an ASM instance
Initializing DBCA
ASM background processes
ASM dynamic views
V$ASM_DISK
V$ASM_DISKGROUP
V$ASM_OPERATION
V$ASM_DISK_STAT
V$ASM_DISKGROUP_STAT
V$ASM_CLIENT
ASM instance startup/shutdown
ASM disk group administration
Creating a disk group
Altering a disk group
Dropping a disk group
Overview of ASMCMD
ASMCMD in action
ASM 11g R1 new features
ASM fast mirror resync
ASM preferred mirror read
ASM fast rebalance
ASM disk group compatibility attributes
ASM performance enhancements
New SYSASM role
ASM 11g R2 new features
Automatic Storage Management Configuration Assistant (ASMCA)
Initiating ASMCA
Configuring a new ASM instance
Managing an ASM instance
Managing an ASM disk group
Creating an ASM disk group
Creating an ASM disk group in silent mode
Automatic Storage Management Dynamic Volume Manager (ADVM)
ADVM volume trivia
Creating ASM volumes
Creating an ASM volume with ASMCA
Creating an ASM Volume with ASMCMD
Creating an ASM volume with the ALTER DISKGROUP SQL statement
Querying V$ASM views to obtain information about ASM volumes
Automatic Storage Management Cluster File System (ACFS)
Using ACFS as Oracle database home
Oracle ACFS drivers
Prerequisites for creating ACFS
ACFS creation methods
Creating an ACFS filesystem using ASMCA
Creating ACFS for Oracle Binaries with ASMCA
Creating an ACFS filesystem with ASMCMD
ACFS mount registry
Managing ACFS
Acfsutil utility
ASM new background processes to support ACFS
Volume Driver Background (VDBG)
Volume Background process (VBGn)
ACFS Background process (ACFS)
Volume Membership Background processes (VMB0)
Querying V$ASM views to obtain information about ACFS
ACFS snapshots
Creating a snapshot
Removing a snapshot
ASM Intelligent Data Placement (IDP)
Managing IDP settings
Finding IDP setting information
ASMCMD enhancements
ASM backup strategies
md_backup and md_restore commands
Summary
5. Managing and Troubleshooting Oracle 11g Clusterware
Oracle 11g RAC Clusterware administration
About Oracle Clusterware
Oracle 11g Clusterware concepts
Oracle Cluster Registry
Voting disk
Initialization and shutdown scripts for Clusterware
Oracle 11g Clusterware background processes
Cluster Ready Services Daemon
Oracle Cluster Synchronization Services Daemon
Cluster Synchronization Services Daemon
Event Monitor Daemon
Additional background processes for Oracle 11g Clusterware
Oracle Clusterware Process Monitor Daemon OPROCD
OCLSVMON
OCLSOMON
DISKMON
OCLSKD
RACG
Fatal Clusterware processes and Oracle 11g RAC
Managing Oracle 11g Clusterware utilities
CRSCTL
CRS_STAT
OCRCHECK
OCRCONFIG
CLSCFG
CLUVFY
Troubleshooting Oracle 11g Clusterware
Failed, missing, or offline 11g Clusterware resources
Offline Clusterware resources for Oracle 11g RAC
Problems with the Voting disk and OCR
Vote disk issues with 11g Clusterware resources
Failed or corrupted Vote Disks
Failed or corrupted OCR
How to recover the OCR from backup
Steps to perform recovery of lost and/or corrupted OCR
Check status 11g RAC Clusterware
Root cause analysis 11g RAC
Oracle 11g Clusterware log file analysis
Oracle 11g RAC node reboot issues
Oracle 11g RAC Clusterware processes—node reboot issues
Oracle Process Daemon
OCLSOMON
Root cause analysis for solving node reboots with 11g RAC
Oracle 11g RAC log files for troubleshooting
OCSSD Reboots and 11g RAC
OPROCD failure and node reboots
OCLSOMON-RAC node reboot
Hardware, storage, and network problems with RAC
Hardware, storage, and network resolutions
New features in Oracle 11g R2 Clusterware
Oracle Real Application Clusters one node (Oracle RAC one node)
Improved Oracle Clusterware resource modeling
Policy-based cluster and capacity management
Cluster time synchronization service
Oracle Cluster Registry and voting disks within Oracle ASM
New features for upgrading to Oracle 11g Clusterware
Oracle 11g R2 Cluster Verification Utility new features
Zero downtime patching for Oracle Clusterware
Summary
6. RAC Database Administration and Workload Management
RAC database configuration and creation
Creating a database using DBCA
Choosing database storage options
We have created a database using DBCA—now what?
What's new in Oracle 11g R1 and R2 databases?
Automatic Memory Management
New AMM dynamic performance V$ views
Tuning AMM
Special considerations on Linux
/dev/shm sizing issues
Database Smart Flash Cache
Configuring Smart Flash Cache
Instance caging
New background processes in Oracle 11g
Finding the alert.log file location in Oracle 11g
Automatic Diagnostic Repository
Managing ADR using the Automatic Diagnostic Repository Command Interface tool
V$DIAG_INFO view
RAC database administration
Using the Server Control Utility
Automatic Workload Management
Overview of services
Creating and managing services
Creating services with the SRVCTL utility
SRVCTL ADD SERVICE syntax and options
Configuring a TNS entry with a service
DBMS_SERVICE PL/SQL package
Database internal default services
Service performance views
Services' characteristics
What's new in Oracle 11g services' behavior?
Scalability (Load Balancing)
Client Side Connect Time Load Balance
Configuring Client Side Connect Time Load Balancing
Server Side Listener Connection Load Balance
Configuring Server Side Listener Connection Load Balancing
Transparent Application Failover
Configuring Transparent Application Failover
FAILOVER_MODE parameters
Fast Connection Failover
Configuring Fast Connection Failover
Configuring Fast Connection Failover in the client environment
Summary
7. Backup and Recovery
An overview of backup and recovery
An overview of Recovery Manager (RMAN)
RMAN architecture
RMAN performance tuning tips
Backup types and methods
Logical backup
Physical backup
ONLINE RMAN backups
OFFLINE RMAN backups
RMAN new features in 11g R1 and 11g R2
Database Recovery Advisor
Multisection backups for very large datafiles
Undo tablespace backup optimization
Faster backup compression
Active database duplication
Active database duplication enhancements
Archivelog deletion policy enhancements
Automatic Block Recovery (ABR)
Tablespace point-in-time recovery enhancements
RMAN best practices for RAC
Configuring the Flash Recovery Area for a RAC database
Instance recovery versus Crash recovery in RAC
Parallelism for backup and recovery in RAC
Backing up a RAC database with RMAN
Configuring multiple channels
OCR and Voting disk backup and recovery strategies
Adding a Mirror location for the OCR and Voting disk
OCR automatic backups
Performing OCR manual backups
Voting disk manual backups
Restoring OCR
Restoring the Voting disk
Summary
8. Performance Tuning
Tuning differences: single instance versus RAC
Oracle 11g single instance database
Oracle RAC 11g database
New Oracle 11g performance tuning features
Database Replay
SQL Performance Analyzer
Database Health Monitor
PL/SQL Native Compilation
Server Result Cache
Client Side Result Cache
SQL Tuning Advisor
New performance features in Oracle 11gR2
In-Memory Parallel Execution
Analyzing the Cache Fusion impact on RAC performance
Cache Fusion
Latency statistics
RAC wait events
Monitoring RAC cluster interconnect performance
Oracle cluster interconnects
Monitoring RAC wait events
Summary
9. Oracle 11g Clusterware Upgrade
Overview of an upgrade
Upgrade sequence
Upgrading Oracle 10g R2 Clusterware to Oracle 11g R1
Kernel parameter values
Packages required on Linux 5
Oracle 11g R1 Clusterware upgrade steps
Performing preinstallation checks with cluvfy
Executing runInstaller.sh script
Post-upgrade steps for 11g R1 Clusterware
Upgrading to Oracle 11g R2 Clusterware
Overview of our environment
Upgrading nodes
11g R2 upgrade changes and restrictions
Kernel parameter values
Packages required on Linux 5
Performing preinstallation checks with cluvfy
Oracle 11g R2 Clusterware upgrade steps
Executing the runInstaller.sh script
Post-upgrade checks for 11g R2 Clusterware
Post-upgrade steps for 11g R2 Clusterware
Downgrading Oracle Clusterware after an upgrade
Summary
10. Real-world Scenarios
Adding a new node to an existing cluster
Performing prechecks with the cluvfy utility
addNode.sh
Adding a node in silent mode in Oracle 11g R2
Post-installation status checks for Clusterware
OCR file manual backup syntax
Voting Disk backup syntax
Installing ASM and RDBMS software using addNode.sh script
Cloning ASM software using addNode.sh script on Oracle 11g R1
Post-node addition steps
Removing a node from the cluster
Adding an RAC database instance
Adding a new instance using DBCA
We have added an instance—what next?
Verifying new instance status
Using DBCA in silent mode to add an instance
Post-add instance steps
Deleting an RAC database instance
Using DBCA in silent mode to delete an instance
Converting a single-instance database to an RAC database
Overview of RCONFIG command-line tool
What you need to accomplish the conversion task
Sample of a modified ConvertToRAC.xml input file
How to test a conversion without actually performing the conversion
Converting a single-instance database to an RAC database
How to resume a failed rconfig operation
Checking log files
How to optimize rconfig to run faster
Post-conversion steps
Relocating an RAC database and instances across nodes
Relocating the instance
Adding the instance example
Workaround when a database and instance are configured on the same node
Adding the database example
Post-relocation steps
Summary
11. Enabling RAC for EBS
EBS architecture
Oracle 11g RAC suitability
Installing EBS 12.1.1
EBS implementation on Oracle 11g RAC
RAC-enabling EBS 12.1.1
Configuration prerequisites
ASM and RAC-enabling the EBS database with the rconfig utility
Running AutoConfig
Copying AutoConfig to the new RDBMS ORACLE_HOME for Oracle 11g R1 11.1.0.7
Generating your database context file
Preparing for AutoConfig by completing the following AutoConfig steps
Generating and applying AutoConfig configuration files
Executing AutoConfig on all database nodes in the cluster
Performing Init file, tnsnames, and listener file activities
Establishing applications environment for Oracle RAC
Setting up load balancing
Configuring Parallel Concurrent Processing
Prerequisites for setting up Parallel Concurrent Processing
Cloning EBS concepts in brief
Preparing the source system
Copying the source system
Configuring the target system
Adding a new node to an existing EBS system
Setting up Parallel Concurrent Processing
Setting up Transaction Managers
Setting up load balancing on concurrent processing nodes
Summary
12. Maximum Availability
Oracle 11g Streams for RAC
Oracle 11g Streams architecture for RAC
Capture
Staging
Propagation
Consumption
Default apply
User-defined function apply
Explicit de-queue
Understanding Oracle Streams rules
Transformations and Streams
Capture and Apply processes in an RAC instance
Streams in the RAC environment
New features in Oracle 11g Streams
Synchronous Capture
Splitting and merging of a Stream Destination
Tracking LCRs through a Stream
Streams Topology and Performance Advisor
Combined Capture and Apply
Best practices for Streams in an RAC environment
Additional configuration of RAC environments for a Source Database
Queue ownership
Propagation restart
Changing the GLOBAL_NAME of the source database
Additional configuration for RAC environments for the Apply Database
Changing the GLOBAL_NAME of the Target Database
New features for Streams in Oracle 11g R2
XStream
Statement DML Handlers
Ability to record table changes
SQL generation
Support for compressed tables
Support for SecureFile LOBs
Automatic splitting and merging
New Apply process parameter
Monitoring jobs
New 11g R2 Oracle Streams view
Oracle 11g Data Guard and RAC
New features for Oracle 11g Data Guard
Active Data Guard
Snapshot Standby
Configuring Data Guard Physical Standby for 11g RAC
Configuring Oracle RAC primary database to send redo data
Design considerations in an Oracle RAC environment
Format for archived redo log filenames
Switchover considerations for 11g RAC and Data Guard
Troubleshooting Oracle 11g Data Guard and RAC
Switchover fails in an Oracle 11g RAC configuration
How to recover from corrupt datafile on standby
How to recover from a corrupt block on standby
Automatic repairing of corrupt data blocks
New features for Data Guard in Oracle 11g R2
New Oracle Data Guard 11g R2 features for Redo Apply
New Oracle 11g R2 Data Guard features for SQL Apply
Summary
A. Additional Resources and Tools for the Oracle RAC Professional
Sample configurations
Reviewing and resolving manual configuration changes
adchkcfg utility
Oracle RAC commands and tips
Cluster deconfig tool for Oracle RAC
Using the cluster deconfig tool
Limitations of the cluster deconfig tool
Problems and limitations of the cluster deconfig tool
Starting the cluster deconfig tool
Silent mode operations using cluster deconfig
Manual cleanup for RAC
Repairing the RAC environment without reinstalling
Reinitializing OCR and Voting Disks without reinstalling RAC
Using ROOTDELETE.SH in debug mode
Using rootdeinstall.sh
Reinstalling CRS on the same cluster in another CRS_HOME
Stopping CRS processes
Reinstalling CRS on same cluster in another CRS_HOME
Oracle 11g R2 cluster removal tools for RAC
Tracing RAC issues with Oradebug
Using Oradebug to trace Oracle 11g Clusterware
Server Control Utility
Oracle 11g R2 SRVCTL commands
Managing Oracle Clusterware with the CRSCTL utility
Differences between 11g R1 and 11g R2 syntax for CRSCTL
CRS_STAT
The kernel file OSM discovery tool
Operating system-level commands for tuning and diagnosis
Strace
Truss
GDB
Additional references and tips
Clusterware startup sequence for Oracle 11g R2
Log file locations for Oracle 11g RAC and ASM
Index

Oracle 11g R1/R2 Real Application Clusters Essentials

Copyright © 2011 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, Packt Publishing, nor its dealers or distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: May 2011

Production Reference: 2170611

Published by Packt Publishing Ltd.

32 Lincoln Road

Olton

Birmingham, B27 6PA, UK.

ISBN 978-1-849682-66-4

www.packtpub.com

Cover Image by Artie Ng (<[email protected]>)

Credits

Authors

Ben Prusinski

Syed Jaffer Hussain

Reviewers

Robert de Laat

Ann L. R. McKinnell

Fairlie Rego

Ronald Rood

Acquisition Editor

Kerry George

Development Editor

Meeta Rajani

Technical Editors

Conrad Sardinha

Azharuddin Sheikh

Project Coordinator

Zainab Bagasrawala

Proofreaders

Sandra Hopper

Bernadette Watkins

Chris Smith

Indexer

Rekha Nair

Graphics

Geetanjali Sawant

Nilesh Mohite

Production Coordinator

Arvindkumar Gupta

Cover Work

Arvindkumar Gupta

About the Authors

Ben Prusinski is an Oracle Certified Professional (OCP) and database architect with more than 14 years of experience with Oracle data warehouse and Oracle E-Business applications. As a corporate database consultant, Ben has provided services to dozens of Fortune 500 clients. He is an internationally recognized expert in Oracle high availability, performance tuning, database security, and ERP implementations. As a top Oracle expert, Ben received the prestigious Oracle ACE award in 2009 in recognition of his contributions to the Oracle community. As an Oracle RAC Certified Expert (OCE), Ben is also a popular speaker at major conferences such as Oracle OpenWorld, CLOUG in Latin America, IOUG, and Oracle Collaborate OAUG (Oracle Applications User Group). Ben is also a polyglot, being fluent in several languages (Spanish, French, Korean, and English) and enjoys traveling to exotic places. In his free time, he enjoys racing in autocross events, golf, martial arts, reading, and cooking.

In addition to Oracle consulting for clients, Ben regularly updates his Oracle blog at http://oracle-magician.blogspot.com with the latest database technology tips and information to share with the Oracle community. He frequently contributes to answering questions from Oracle users on the Oracle OTN forums. He can be contacted via e-mail at <[email protected]>.

I would like to extend my thanks and appreciation, first of all, to my co-author Mr. Syed Jaffer Hussain, who was great to work with on this book. I would also like to thank everyone at Packt for their tremendous efforts, patience, and help in the editorial process. Last but not least, my heartfelt appreciation goes out to our fantastic technical review team of Oracle experts (Fairlie Rego, Ronald Rood, and Ann L. R. McKinnell), who provided tons of great feedback to ensure technical accuracy and quality. I would also like to dedicate this book to all fellow Oracle DBAs and consultants who work in the trenches to solve real-world Oracle issues.

I would also like to thank the many Oracle professionals who have helped me over the years and made a book like this possible, including Julian Dyke (Oracle RAC master extraordinaire), Alex Gorbachev, Doug Hahn, Tanel Poder, Arup Nanda, Robert Freeman, Tim Hall, and all of my current and past clients who allowed me to help them with their Oracle database challenges.

Syed Jaffer Hussain is the Database Support Manager at Alinma Bank (Saudi Arabia) and has over 18 years of hands-on Information Technology (IT) experience, including over 10 years as a Production Oracle DBA. Apparently he is the first person in the Middle East to be awarded the prestigious Oracle ACE award. He holds several industry-recognized Oracle certifications, including Oracle 10g Certified Master (OCM), OCP DBA (v8i, 9i, 10g, and 11g), and Oracle 10g RAC Certified Expert. With broad knowledge of Oracle technologies such as RAC, Data Guard, RMAN, and Performance Tuning, he has completed several successful RAC implementations and Clusterware upgrades, and has set up disaster recovery solutions for many business-critical databases. Jaffer is a noted speaker at BrainSurface.com, and occasionally presents Oracle University five-day courses and one-day celebrity seminars on behalf of Oracle EMEA. He has also worked for a couple of multinational banks in Saudi Arabia.

Jaffer frequently contributes to the Oracle OTN forums and many other Oracle-related forums. He regularly updates his Oracle technology-related blog (http://jaffardba.blogspot.com), and he is reachable at <[email protected]>.

First and foremost, I owe a very big thank you to my wife Ayesha and my three sons (Ashfaq, Arfan, and Aahil) for sacrificing their invaluable time and allowing me to concentrate on the book. I am also thankful to Alinma Bank management, in particular Ahmed Darwish and my immediate boss Mr. Majed Saleh AlShuaibi, for their constant encouragement and continuous support. My special thanks go to my younger brother Sabdar and my friend Khusro Mohammed Khan (who are also Oracle DBAs) for preparing the required platforms, testing code, and doing an initial review of the chapters. I also want to thank all my colleagues, Mohammed Farooqui, Zaheer, Khaja Mainuddin, Shaukat Ali, Naresh Kumar Markapuram, Hussain AlKalifah, Faisal Bakhashwain, Naser Ali AlEssa, Mohammed Alsalahi, Wayne Philips, Chand Basha, Rizwan Siddiqi, Sadak, Anees, Shakir, Angelo G Train, Mohammed Khidir, Asad Khan, Ibrahim Ali, Rajesh Ankuru, Sandeep, Mohammad AlHiary, Ahmed Khanj, Ahmed Bakheet, Mohammed Azar, and other friends for their encouragement and motivation.

I thank all the staff at Packt Publishing who were involved with this book, with a special mention to James Lumsden, Kerry George, Meeta Rajani, Zainab Bagasrawala, and Azharuddin Sheikh for being patient with me during the course of this book.

Last but not least, I also thank other RAC book authors (not in any particular order), Murali Vallath, K. Gopalakrishnan, Nitin Vengulkar, Julian Dyke, Sandesh Rao, Arup Nanda, Bert Scalzo, Riyaj Shamsudeen, Tariq Farooq, and other Oracle experts worldwide for being my inspiration.

About the Reviewers

Robert de Laat is an experienced DBA who has worked with Oracle products for 12 years, primarily in large, complex environments where high availability is mandatory. With a background as a Unix administrator, he is skilled in system design, high availability, and getting the most out of an environment. Robert has worked with every version of RAC, primarily on Unix/Linux platforms. He has skills in many scripting languages, such as Perl, Python, Shell, and so on. He also knows his way around PL/SQL.

Currently Robert is a senior consultant, working for Ciber in The Netherlands, where he collaborates on many complex projects for large companies where high availability is key to the customers' success. Ciber (CBR) is an Oracle Platinum Partner and committed to the limit.

Besides spending time with Oracle, Robert enjoys woodworking and jewellery. Creating things gives him great pleasure, in his work and in his personal life. He also maintains his own website at www.dutchdba.nl, where he writes about technical stuff such as Oracle, Linux, Scripting, and so on.

I would like to thank my wife, Jose, for her patience while I spend many hours in my test lab. You truly are the light of my life.

Ann L. R. McKinnell has been an OCP since Oracle 7.3.4, with over eight years of experience as a senior technical member of Oracle Global Support, specializing in Database Server technologies. Ann has trained Oracle Support and Consulting personnel from many countries in Database Internals and Problem Solving techniques. She has served as a technical reviewer for Oracle University course material, numerous My Oracle Support (previously Metalink) notes and whitepapers, and various Oracle Database Administration user manuals. Ann is also a co-author of Packt Publishing’s Oracle 11g Streams Implementer’s Guide. With over 16 years in the IT industry, Ann continues to specialize in practical implementation strategies and the development of distributed Oracle database systems and database architecture, along with software and database design, integration, and engineering.

Fairlie Rego is a senior Oracle Database Consultant with over 10 years of experience in Oracle Database Technologies. He has worked on mission-critical, highly available systems built around the Maximum Availability architecture propounded by Oracle.

He has worked extensively in the financial, telecom, and educational sectors and has been fortunate to work in challenging environments that make use of state-of-the-art technologies.

I would like to thank my parents, who have always encouraged me in my endeavors. I would also like to acknowledge my wife, Abigail, for demonstrating extreme patience, given my very hectic work schedule.

Ronald Rood is an innovative Oracle DBA with over 20 years of IT experience. He has built and managed cluster databases on just about every platform that Oracle ever supported, from the famous OPS databases in version 7 to the latest RAC releases, currently 11g. Ronald is constantly looking for ways to get the most out of the database to make the investment for the customers even more valuable. He has extensive knowledge of how to handle the power of the rich Unix environment, which makes him a first-class troubleshooter and solution architect. Next to his many spoken languages such as Dutch, English, German, and French, he also writes fluently in many scripting languages.

Currently Ronald is a principal consultant working for Ciber in The Netherlands, where he collaborates on many complex projects for large companies where downtime is not an option. Ciber (CBR) is one of the few Oracle Partners on the Diamond level.

Ronald often replies on the Oracle forums, writes his own blog (http://ronr.blogspot.com) called 'from errors we learn', and writes for various Oracle-related magazines. He has also authored a book, Mastering Oracle Scheduler in Oracle 11g Databases, where he fills the gap between the Oracle Documentation and the customers' questions.

Ronald has lots of certifications, among them:

Oracle Certified Master
Oracle Certified Professional
Oracle Database 11g Tuning Specialist
Oracle Database 11g Data Warehouse Certified Implementation Specialist

Ronald fills his time with Oracle, his family, sky-diving, radio-controlled model airplane flying, running a scouting group, and having a lot of fun.

His quote is: "a problem is merely a challenge that might take a little time to solve".

 

I dedicate this book to my friends, family, colleagues, and all of the Oracle database professionals who work long and hard over weekends, nights, and holidays to keep their database systems online and running smoothly.

  --Ben Prusinski
 

I would like to dedicate this book to my parents Mr. and Mrs. Saifulla.

  --Syed Jaffer Hussain

Preface

Oracle Real Application Clusters, or Oracle RAC for short (formerly known as Oracle Parallel Server), is a clustering technology that provides the ability to scale performance and improve server availability for Oracle data center environments. Oracle RAC not only serves as a part of the Oracle Maximum Availability Architecture (MAA) for disaster recovery purposes, but it's also used for other purposes such as scaling up and out for performance by adding additional nodes to an Oracle data center environment. However, getting started with Oracle RAC can be difficult and challenging for the Oracle professional who is new to this technology and has worked only with single-instance Oracle databases. This book provides the guidance needed to overcome that difficulty, covering the key features of Oracle RAC. Each chapter introduces new features, allowing you to develop competency in the administration of this advanced technology.

By the end of the book, you will not only have experimented with numerous examples, but you will have also deployed a complete Oracle RAC environment and solution.

What this book covers

Chapter 1, High Availability, serves as the most basic introduction to concepts of high availability with regard to how Oracle RAC comes into play. The chapter includes a comprehensive review of the core areas of high availability and disaster recovery as well as how the Oracle RAC technology fits into a strategy for implementing high availability. The chapter also includes a review of key Oracle technologies that complement Oracle RAC for a disaster recovery implementation.

Chapter 2, Oracle 11g RAC Architecture, provides a blueprint from concept to finish of how to design an Oracle 11g RAC environment from the hardware and storage layers to the software and database layers.

Chapter 3, Clusterware Installation, provides step-by-step instructions on how to install Oracle 11g RAC. The chapter explains all of the steps required for the installation of the Oracle 11g R1 RAC Clusterware and Oracle 11g R1 RDBMS binaries, as well as the latest 11g R2 RAC Clusterware (grid) and RDBMS binaries.

Chapter 4, Automatic Storage Management, discusses the key concepts for Oracle 11g Automatic Storage Management (ASM) technology and provides an overview of ASM features. For this chapter, we assume that you have already worked with ASM in an Oracle Database 10g environment.

Chapter 5, Managing and Troubleshooting Oracle 11g Clusterware, explains how to resolve complex problems with Oracle 11g RAC clusterware failures. We show you how to identify the root cause of Oracle Clusterware issues along with timely solutions based on case study methods.

Chapter 6, RAC Database Administration and Workload Management, provides a deep dive into various methods that are available to create and manage RAC databases. The workload management segment in this chapter further explains how your application can take advantage of running on the RAC database to improve overall performance and scalability. It also discusses most of the useful new features introduced in 11g R1 and 11g R2 versions.

Chapter 7, Backup and Recovery, shows you how to back up and recover the Oracle RAC environment using different approaches. In this chapter, the emphasis will be placed on backup and recovery using RMAN. However, we also briefly discuss the various methods along with the pros and cons of each of them. In addition, new features of RMAN within Oracle 11g R1 and 11g R2 along with OCR and voting disk backup and recovery are covered in great detail.

Chapter 8, Performance Tuning, first explains how the differences between an Oracle RAC cluster and a non-RAC single-instance Oracle database pose unique challenges to the Oracle database professional. This chapter will focus on how to tune a massively parallel Oracle RAC database, consisting of many instances residing on different nodes of a cluster and accessing the same disk files residing on shared disk storage.

Chapter 9, Oracle 11g Clusterware Upgrade, explains the pros and cons of an upgrade process as well as the possibilities during upgrade scenarios. We also demonstrate how to upgrade the Oracle 10g R2 clusterware to 11g R1 and then perform a second upgrade from 11g R1 to 11g R2 versions. Furthermore, we also explain how to downgrade Oracle Clusterware to a previous version.

Chapter 10, Real-world Scenarios, teaches you how to perform many common real-world business scenarios, such as adding and removing cluster nodes, as well as how to convert non-RAC Oracle databases to Oracle RAC and how to relocate an Oracle RAC database instance, which are key skills for an Oracle RAC database administrator.

Chapter 11, Enabling RAC for EBS, discusses how to implement RAC for the Oracle R12 E-Business Suite (EBS) environment. We look at the cases for why Oracle RAC would be suitable for an Oracle R12 EBS environment to achieve a scalable and resilient architecture.

Chapter 12, Maximum Availability, discusses the complete picture of solutions for Oracle to enable high availability and disaster recovery. A detailed explanation is provided of these key technologies, including Oracle Streams and Oracle Data Guard, and how they complement the Oracle 11g RAC environment. We then move into a few case studies that show you how to enable Data Guard and Streams for Oracle RAC environments.

Appendix, Additional Resources and Tools for the Oracle RAC Professional, is a handy summary of beneficial My Oracle Support notes that can provide assistance for Oracle RAC environments.

What you need for this book

Oracle 11g RAC is a complex technology that demands many resources, from logistical to hardware, to implement. This means that you will need multiple servers and technical resources to be carefully orchestrated in order to achieve a successful RAC implementation. In a sense, the Oracle RAC architect is a symphony conductor who must skillfully bring all of the key players in on cue to avoid failure and delays with the deployment. As many DBAs have no prior RAC exposure, we gently introduce the basic concepts first, in order to familiarize you with RAC, before jumping into the deep ocean of RAC administration. Have no fear: while RAC is a complex beast, it can be tamed. The only prerequisite is to have at least a basic understanding of Oracle database concepts and Oracle database administration before learning about RAC. It is also useful to have access to a test or sandbox environment in which to install and configure a RAC environment with the examples in the book. Nothing beats hands-on experience.

Having an Internet connection and Oracle database server while reading is extremely useful as well. We also recommend that if you want a virtual environment to set up and play with RAC, then you consult the many whitepapers written by Dr. Tim Hall (http://oracle-base.com) that provide detailed step-by-step instructions on how to download, install, and configure an Oracle 11g RAC environment in both virtual server and standalone environments.

Who this book is for

If you are an Oracle DBA who wants to administer Real Application Clusters, then this book is for you. A basic understanding of Oracle database administration is required; no prior experience of RAC is needed.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text are shown as follows: "$CRS_HOME\log\nodename\racg contains logfiles for the VIP and ONS resources".

A block of code will be set as follows:

$ clscfg -concepts
clscfg: EXISTING configuration version 3 detected.
clscfg: version 3 is 11G Release 1.
clscfg -- concepts and terminology

Any command-line input or output is written as follows:

# crsctl set css misscount 90
Configuration parameter misscount is now set to 90.

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in our text like this: "Click on Finish to commence the database creation process."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply drop an e-mail to <[email protected]>, and mention the book title in the subject of your message.

If there is a book that you need and would like to see us publish, please send us a note in the SUGGEST A TITLE form on www.packtpub.com or e-mail <[email protected]>.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our contents, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in text or code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration, and help us to improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/support, selecting your book, clicking on the let us know link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata added to any list of existing errata. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. High Availability

High availability is a discipline within database technology that protects against data loss and against downtime, both of which are costly for mission-critical database systems. In this chapter, we introduce the high availability concepts and solutions that are workable for Oracle 11g, and explain what constitutes high availability and what does not. We discuss how Oracle 11g RAC provides options for minimizing outages and downtime, and how RAC fits into a comprehensive disaster recovery and business continuity policy. With this framework in place, you will understand how to leverage Oracle RAC and auxiliary technologies, including Oracle Data Guard, to maximize the Return On Investment (ROI) for your data center environment. In summary, we will discuss the following topics:

High availability concepts
Fault-tolerant systems and high availability
High availability solutions for Oracle 11g R1 and 11g R2 Real Application Clusters (RAC)

High availability concepts

High availability provides data center environments that run mission-critical database applications with the resiliency to withstand failures that may occur due to natural, human, or environmental conditions. For example, if a hurricane wipes out the production data center that hosts a financial application's production database, high availability would provide the much-needed protection to avoid data loss, minimize downtime, and maximize availability of the firm's resources and database applications. Let's now move to the high availability concepts.

Planned versus unplanned downtime

A distinction needs to be made between planned downtime and unplanned downtime. In most cases, planned downtime is the result of maintenance that is disruptive to system operations and cannot be avoided with the current system design of a data center. An example of planned downtime would be a DBA maintenance activity such as patching an Oracle database, which requires an outage during which the system is taken offline for a period of time. From the database administrator's perspective, planned downtime is usually the result of management-initiated events.

On the other hand, unplanned downtime frequently occurs due to a physical event caused by a hardware, software, or environmental failure, or by human error. A few examples of unplanned downtime events include failures of server hardware components such as CPUs or disks, and power outages.

Most data centers exclude planned downtime when calculating the current total availability percentage. Even so, both planned and unplanned downtime affect high availability. For instance, database upgrades require a few hours of downtime. Another example would be a SAN replacement. Such items make comprehensive four nines solutions nigh impossible to implement without additional considerations. The fact is that implementing true 100% availability is nearly impossible without exorbitant costs. Complete high availability for all components within the data center requires an architecture for all systems and databases that eliminates any Single Point of Failure (SPOF) and allows for total online availability of all server hardware, network, operating systems, applications, and database systems.

Service Level Agreements for high availability

High availability ratios are often expressed as the percentage of uptime in a given year. Service Level Agreements (SLAs) usually refer to monthly downtime or availability so that service levels can be calculated to match monthly financial cycles. The following table from the International Organization for Standardization (ISO) shows the approximate downtime per year, month, or week that is allowed at a given availability percentage, assuming that the system is required to operate continuously:

Availability %             Annual downtime    Monthly downtime*    Weekly downtime
90%                        36.5 days          72 hours             16.8 hours
95%                        18.25 days         36 hours             8.4 hours
98%                        7.30 days          14.4 hours           3.36 hours
99%                        3.65 days          7.20 hours           1.68 hours
99.5%                      1.83 days          3.60 hours           50.4 minutes
99.8%                      17.52 hours        86.4 minutes         20.16 minutes
99.9% ("three nines")      8.76 hours         43.2 minutes         10.1 minutes
99.95%                     4.38 hours         21.6 minutes         5.04 minutes
99.99% ("four nines")      52.6 minutes       4.32 minutes         1.01 minutes
99.999% ("five nines")     5.26 minutes       25.9 seconds         6.05 seconds
99.9999% ("six nines")     31.5 seconds       2.59 seconds         0.605 seconds

Note

For monthly calculations, a 30-day month is used.
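
The figures in the table follow from simple arithmetic: the allowed downtime is the length of the period multiplied by (1 - availability). The following Python sketch shows that calculation. It assumes a 365-day year, a 30-day month, and a 7-day week; the function name allowed_downtime is ours and is used for illustration only.

```python
from datetime import timedelta

# Period lengths assumed here: 365-day year, 30-day month, 7-day week.
PERIODS = {
    "year": timedelta(days=365),
    "month": timedelta(days=30),
    "week": timedelta(days=7),
}


def allowed_downtime(availability_pct: float, period: str) -> timedelta:
    """Return the downtime permitted in one period at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1 - availability_pct / 100
    return PERIODS[period] * unavailable_fraction


if __name__ == "__main__":
    # Print the permitted downtime in minutes for a few common targets.
    for pct in (99.0, 99.9, 99.99, 99.999):
        row = ", ".join(
            f"{name}: {allowed_downtime(pct, name).total_seconds() / 60:.2f} min"
            for name in PERIODS
        )
        print(f"{pct}% -> {row}")
```

A calculation like this is handy when an SLA states a target such as 99.95% and you need to know how many minutes of outage a single month can absorb before the target is missed.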

It should be noted that availability and uptime are not the same thing. For instance, a database system may be online but not available, as in the case of an application outage during which a user's SQL script cannot be executed.

In most cases, database and system professionals do not use the number of nines when measuring high availability for data center environments, because it is difficult to extrapolate such hard numbers without a large test environment. For practical purposes, availability is calculated more as a probability or as average downtime per year.

High availability interpretations

When it comes to how availability is measured, there is debate over the correct interpretation of high availability ratios. For instance, an Oracle database server that has been online for 365 days in a given non-leap year might have been eclipsed by an application failure that lasted for nine hours during a peak usage period. As a consequence, the users will see the complete system as unavailable, whereas the Oracle database administrator will claim 100% "uptime." However, given the true definition of availability, the system will be approximately 99.897% available (8,751 hours of available time out of 8,760 hours per non-leap year). Furthermore, Oracle database systems experiencing performance problems are often deemed partially or entirely unavailable by users, while in the eyes of the database administrator the system is fine and available.
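
The arithmetic behind that figure can be checked with a few lines of Python. This is a minimal sketch: the function name availability_pct is an illustrative assumption, and availability is measured from the user's point of view by counting the application outage as downtime.

```python
def availability_pct(total_hours: float, outage_hours: float) -> float:
    """Percentage of the period during which the service was usable."""
    if total_hours <= 0 or not 0 <= outage_hours <= total_hours:
        raise ValueError("outage must fit within a positive period")
    return (total_hours - outage_hours) / total_hours * 100


HOURS_PER_NON_LEAP_YEAR = 365 * 24  # 8760

# Nine-hour application outage in a non-leap year, as in the example above.
print(f"{availability_pct(HOURS_PER_NON_LEAP_YEAR, 9):.3f}%")  # prints 99.897%
```

The same function reports 100% if only database instance uptime is fed into it, which is exactly the gap between the DBA's view and the user's view described here.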

Another scenario that challenges the definition of availability is one in which a mission-critical application goes offline but the Oracle DBA does not regard the system as unavailable, because the database instance is still online. To the end user, however, the application is offline and therefore unavailable. This illustrates the key point that a true availability measure must be taken from a holistic perspective and not strictly from the database's point of view.

Availability should be measured with comprehensive monitoring tools that are themselves highly available and present the proper instrumentation. Where instrumentation is lacking, systems that process high volumes of transactions day and night, such as credit-card-processing database servers, are often inherently better monitored than systems that experience periodic lulls in demand. Currently, custom scripts can be developed in conjunction with third-party tools to provide a measure of availability. One such tool that we recommend for monitoring database, server, and application availability is Oracle Grid Control, which also includes Oracle Enterprise Manager.

Oracle Grid Control provides instrumentation via agents and plugin modules to measure availability and performance on a system-wide enterprise level, thereby greatly aiding the Oracle database professional to measure, track, and report to management and users on the status of availability with all mission-critical applications and system components. However, the current version of Oracle Enterprise Manager will not provide a true picture of availability until 11g Grid Control is released in the future.

Recovery time and high availability

Recovery time is closely related to the concept of high availability. Recovery time varies based on system design and failure experienced, in that a full recovery may well be impossible if the system design prevents such recovery options. For example, if the data center is not designed correctly with the required system and database backups and a standby disaster recovery site in place, then a major catastrophe such as a fire or earthquake will almost always result in complete unavailability until a complete MAA solution is implemented. In this case, only a partial recovery may be possible. This drives home the point that for all major data center operations, you should always have a backup plan with an offsite secondary disaster-recovery data center to protect against losing all critical systems and data.

In terms of database administration for Oracle data centers, the concept of data availability is essential when dealing with recovery time and planning for highly available options. Data availability refers to the degree to which databases such as Oracle record and report transactions. Data management professionals often focus just on data availability in order to judge what constitutes an acceptable data loss with different types of failure events. While application service interruptions are inconvenient and sometimes permitted, data loss is not to be tolerated. As one Chief Information Officer (CIO) told us while we were working for a large financial brokerage, "You can have the system down to perform maintenance, but never ever lose my data!"

The next item related to high availability and recovery standards is that of Service Level Agreements or SLAs for data center operations. The purpose of the Service Level Agreement is to actualize the availability objectives and requirements for a data center environment per business requirements into a standard corporate information technology (IT) policy.

System design for high availability

Ironically, by adding further components to the overall system and database architecture design, you may actually undermine your efforts to achieve true high availability for your Oracle data center environment. The reason for this is that, by their very nature, complex systems have more potential failure points and thus are more difficult to implement properly. The most highly available systems for Oracle adhere to a simple design pattern that makes use of a single, high quality, multipurpose physical system with comprehensive internal redundancy running all interdependent functions, paired with a second like system at a separate physical location. An example would be a primary Oracle RAC clustered site with a second Disaster Recovery site at another location with Oracle Data Guard and perhaps dual Oracle RAC clusters at both sites connected by stretch clusters. The best possible way to implement an active standby site with Oracle would be to have Oracle Streams and Oracle Data Guard. Large commercial banking and insurance institutions would benefit from this model for Oracle data center design to maximize system availability.

Business Continuity and high availability

Business Continuity Planning (BCP) refers to the creation and validation of a rehearsed operations plan for the IT organization that explains the procedures of how the data center and business unit will recover and restore, partially or completely, interrupted business functions within a predetermined time after a major disaster.

In its simplest terms, BCP is the foundation for the IT data center operations team to maintain critical systems in the event of disaster. Major incidents could include events such as fires, earthquakes, or national acts of terrorism.

BCP may also encompass corporate training efforts to help reduce operational risk factors associated with the lack of information technology (IT) management controls. These BCP processes may also be integrated with IT standards and practices to improve security and corporate risk management practices. An example would be to implement BCP controls as part of Sarbanes-Oxley (SOX) compliance requirements for publicly traded corporations.

The origins for BCP standards arose from the British Standards Institution (BSI) in 2006 when the BSI released a new independent standard for business continuity named BS 25999-1. Prior to the introduction of this standard for BCP, IT professionals had to rely on the previous BSI information security standard, BS 7799, which provided only limited standards for business continuity compliance procedures. One of the key benefits of these new standards was to extend additional practices for business continuity to a wider variety of organizations, to cover needs for public sector, government, non-profit, and private corporations.

Disaster Recovery

Disaster Recovery (DR) is the process, policies, and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organization after either a natural or human-caused disaster.

Disaster Recovery Planning (DRP) is a subset of larger processes such as Business Continuity and should include planning for resumption of applications, databases, hardware, networking, and other IT infrastructure components. A Business Continuity Plan includes planning for non-IT-related aspects, such as staff member activities, during a major disaster as well as site facility operations, and it should reference the Disaster Recovery Plan for IT-related infrastructure recovery and business continuity procedures and guidelines.

Business Continuity and Disaster Recovery guidelines

The following recommendations will provide you with a blueprint to formulate your requirements and implementation for a robust Business Continuity and Disaster Recovery plan:

Identifying the scope and boundaries of your Business Continuity Plan:

The first step enables you to define the scope of your new Business Continuity Plan. It provides you with an idea of the limitations and boundaries of the Business Continuity Plan. It also includes important audit and risk analysis reports for corporate assets.

Conducting a Business Impact Analysis session:

Business Impact Analysis (BIA) is the assessment of financial losses to institutions, which usually results as the consequence of destructive events such as the loss or unavailability of mission-critical business services.

Obtaining support for your business continuity plans and goals from the executive management team:

You will need to convince senior management to approve your business continuity plan, so that you can flawlessly execute your disaster recovery planning. Assign stakeholders as representatives on the project planning committee team, once approval is obtained from the corporate executive team.

Ensuring that each department understands its specific role:

In the event of a major disaster, each of your departments must be prepared to take immediate action. To recover your mission-critical database systems with minimal loss, each team must understand the BCP and DRP plans and follow them correctly. It is also important to maintain your DRP and BCP plans and to train your IT staff members periodically so that they can respond quickly to emergencies. Such "smoke tests" to train and keep your IT staff members up to date on the correct procedures and communications will pay major dividends in the event of an unforeseen disaster.

One useful tool for creating and managing BCP plans is available from the National Institute of Standards and Technology (NIST). The NIST documentation can be used to generate templates that serve as an excellent starting point for your Business Continuity and Disaster Recovery planning. We highly recommend that you download and review the following NIST publication for creating and evaluating BCP plans, Contingency Planning Guide for Information Technology Systems, which is available online at http://csrc.nist.gov/publications/nistpubs/800-34/sp800-34.pdf.

Additional NIST documents may also provide insight into how best to manage new or current BCP or DRP plans. A complete listing of NIST publications is available online at http://csrc.nist.gov/publications/PubsSPs.html.

Fault-tolerant systems and high availability

Fault tolerance is the property that enables a system to continue to function correctly when one or more faults occur within any key component of the system architecture or data center. In a fault-tolerant design, any decrease in operating quality is in proportion to the severity of the failure, whereas a poorly designed system will fail and break down completely even with a small failure. In other words, fault tolerance gives you that added layer of protection and support to avoid a total meltdown of your mission-critical data center and, in our case, Oracle servers and database systems. Fault tolerance is often associated with highly available systems such as those found with Oracle Data Guard and Oracle RAC technologies.

Data formats may also be designed to degrade gracefully. For example, in the case of Oracle RAC environments, services provide for load balancing to minimize performance issues in the event that one or more nodes in the cluster are lost due to an unforeseen event.

Recovery from errors in fault-tolerant systems takes the form of either rollforward or rollback operations. For instance, whenever the Oracle server detects an error condition and cannot find data from a missed transaction, a rollback will occur either at the instance level or at the application level (a transaction must be atomic in that all elements must commit or roll back). Oracle takes the system state at that time and rolls back transactional changes to be able to move forward. Whenever a rollback is required for a transaction within Oracle, Oracle reverts the system state to some earlier correct version, for example by using the database checkpoint and rollback process inherent in the Oracle database engine, and moves forward from there.

Rollback recovery requires that the operations between the checkpoint (implicit checkpoints are NEVER required for transactional recovery) and the detected erroneous state can be made to be transparent. Some systems make use of both rollforward and rollback recovery for different errors or different parts of one error.

For Oracle, database recovery always rolls forward first, applying the redo log to reconstruct changes, including the contents of the rollback or undo segments, and then uses those segments to roll back transactions that had not committed. However, when it comes to transaction-level recovery, Oracle only rolls back. Within the scope of an individual system, fault tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them, and in general, aiming for self-healing so that the system converges towards an error-free state. In any case, if the consequence of a system failure is catastrophic, the system must be able to use reversion to fall back to a safe mode. This is similar to rollback recovery but can be a human action if humans are present in the loop.

Requirements for implementing fault tolerance

The basic characteristics of fault tolerance are:

No single point of failure
No single point of repair
Fault isolation to the failing component
Fault containment to prevent propagation of the failure
Availability of reversion modes

In addition, fault-tolerant systems are characterized in terms of both planned and unplanned service outages. These are usually measured at the application level and not just at the hardware level. The figure of merit is called availability and is expressed as a percentage. For instance, a five-nines system would statistically provide 99.999% availability. In theory, this would be ideal; in reality, it is an elusive and impractical goal. Due to the time required to fail over, reestablish middle-tier connections, and perform application restarts, it is not realistic to have complete availability. In practice, four nines (99.99%) is the best goal we can obtain for high availability with Oracle systems. Fault-tolerant systems are typically based on the concept of redundancy. For Oracle RAC, you can deploy a fault-tolerant environment by using multiple network interface cards, dual Host Bus Adapters (HBAs), and multiple switches to avoid any single point of failure.
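To put these percentages into perspective, it helps to convert an availability figure into the downtime it allows per year. The following is a minimal sketch; the function name is ours, and it assumes a 365-day year with outages counted in minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability_percent):
    """Return the yearly downtime permitted by a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for label, percent in [("four nines", 99.99), ("five nines", 99.999)]:
    minutes = downtime_minutes_per_year(percent)
    print(f"{label} ({percent}%): about {minutes:.2f} minutes of downtime per year")
```

Five nines leaves little more than five minutes of total outage per year, while four nines allows a little under an hour, which shows how quickly a single failover with application restarts can consume the yearly budget.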

Fault tolerance and replication

By using spare components, we address the first fundamental characteristic of fault tolerance in the following two ways:

Replication: This provides multiple identical instances of the same system or subsystem and directs tasks or requests to all of them simultaneously. Oracle Streams and Oracle GoldenGate, as well as third-party solutions such as Quest SharePlex, are replication technologies.
Redundancy: This provides multiple identical instances of the same system and switches to one of the remaining instances in case of a failure. This switchover and failover process is available with standby database technology through Oracle Data Guard. Oracle RAC also provides node/server failover capability through services, by using Fast Connection Failover (FCF) together with Fast Application Notification (FAN).
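The difference between the two approaches can be shown with a small toy model. The sketch below does not use any Oracle API; the `InstanceDown` exception, the `node1` and `node2` functions, and both helper functions are hypothetical stand-ins for database instances and client logic.

```python
class InstanceDown(Exception):
    """Raised by a hypothetical instance that cannot serve a request."""

def replicate(request, instances):
    """Replication: send the same request to every instance and collect all results."""
    return [instance(request) for instance in instances]

def with_failover(request, instances):
    """Redundancy: use one instance and switch to the next survivor on failure."""
    for instance in instances:
        try:
            return instance(request)
        except InstanceDown:
            continue  # this instance failed, so try the next one
    raise InstanceDown("no surviving instance")

def node1(request):
    raise InstanceDown("node1 is down")

def node2(request):
    return f"node2 handled {request}"

print(with_failover("query", [node1, node2]))
print(replicate("query", [node2, node2]))
```

With redundancy, the caller sees a single answer from whichever instance survives; with replication, every instance processes the request.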

At the storage layer, the major implementations of RAID (Redundant Array of Independent Disks), with the exception of disk striping (RAID 0), provide fault-tolerant storage devices that also use data redundancy.

Bringing the replicas into synchrony requires making their internal stored states the same. They can be started from a fixed initial state, such as the reset state. Alternatively, the internal state of one replica can be copied to another replica.

One variant of dual modular redundancy (DMR) is pair-and-spare. Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating that there is an error. Another pair operates in exactly the same way. A final circuit selects the output of the pair that does not proclaim that it is in error. Pair-and-spare requires four replicas rather than the three of triple modular redundancy (TMR), but it has been used commercially.
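The pair-and-spare selection logic can be sketched in a few lines. This is a software illustration of a scheme normally built in hardware; the function names are ours, and each replica is modeled as a plain function of one input.

```python
def run_pair(replica_a, replica_b, value):
    """Run two replicas in lockstep; return (output, error_flag).

    The error flag is raised when the two outputs do not match.
    """
    out_a = replica_a(value)
    out_b = replica_b(value)
    return out_a, out_a != out_b

def pair_and_spare(pair1, pair2, value):
    """Select the output of whichever pair does not report an error."""
    out1, err1 = run_pair(pair1[0], pair1[1], value)
    out2, err2 = run_pair(pair2[0], pair2[1], value)
    if not err1:
        return out1
    if not err2:
        return out2
    raise RuntimeError("both pairs report an error")

def good(x):
    return x * 2

def faulty(x):
    return x * 2 + 1  # simulated hardware fault

# The first pair disagrees internally, so the spare pair supplies the result.
print(pair_and_spare((good, faulty), (good, good), 21))
```

Note that a mismatch only tells the selector that a pair is in error, not which of its two replicas failed, which is why a second complete pair is needed.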

If a system experiences a failure, it must continue to operate without interruption during the repair process.

When a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure-detection mechanisms that exist only for the purpose of fault isolation.

Recovery from a fault condition requires classifying the fault or failing component. The National Institute of Standards and Technology (NIST) categorizes faults based on locality, cause, duration, and effect.