Learning Ceph, Second Edition will give you all the skills you need to plan, deploy, and effectively manage your Ceph cluster. You will begin with the first module, where you will be introduced to Ceph use cases, its architecture, and core projects. In the next module, you will learn to set up a test cluster and make informed hardware selections. Once you can stand up Ceph clusters, the next module will teach you how to monitor cluster health, improve performance, and troubleshoot any issues that arise. In the last module, you will learn to integrate Ceph with other tools, such as OpenStack, Glance, Manila, Swift, and Cinder.
By the end of the book you will have learned to use Ceph effectively for your data storage requirements.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: January 2015
Second edition: October 2017
Production reference: 1121017
ISBN: 978-1-78712-791-3
www.packtpub.com
Authors
Anthony D'Atri
Vaibhav Bhembre
Karan Singh
Copy Editors
Safis Editing
Juliana Nair
Reviewer
Christian Kauhaus
Project Coordinator
Judie Jose
Acquisition Editor
Meeta Rajani
Proofreader
Safis Editing
Content Development Editor
Abhishek Jadhav
Indexer
Tejal Soni Daruwala
Technical Editor
Manish D Shanbhag
Graphics
Kirk D'Penha
Vaibhav Bhembre
Anthony D'Atri
Suzanne D'Atri
Production Coordinator
Aparna Bhagat
Anthony D'Atri's career in system administration has spanned laptops to vector supercomputers. He has brought his passion for fleet management and the underlying server components to bear on a holistic yet detailed approach to deployment and operations. Experience with the architecture, operation, and troubleshooting of NetApp, ZFS, SVM, and other storage systems dovetailed neatly into Ceph. Three years with Ceph as a petabyte-scale object and block backend to multiple OpenStack clouds at Cisco additionally built on Anthony's depth. Now helping deliver awesome storage to DigitalOcean's droplet customers, Anthony aims to help the growing community build success with Ceph.
Vaibhav Bhembre is a systems programmer working currently as a Technical Lead for cloud storage products at DigitalOcean. Before joining DigitalOcean, Vaibhav wore multiple hats leading backend engineering and reliability engineering teams at Sailthru Inc. From helping scale dynamically generated campaign sends to be delivered to over tens of millions of users on time, to architecting a cloud-scale compute and storage platform, Vaibhav has years of experience writing software across all layers of the stack.
Vaibhav holds a bachelor's degree in Computer Engineering from the University of Mumbai and a master's degree in Computer Science from the State University of New York at Buffalo. During his time in academia, Vaibhav co-published a novel graph algorithm that optimally computed closeness and betweenness in an incrementally updating social network. He also had the fortune of committing changes to a highly available distributed filesystem built on top of the iRODS data management framework as his master's project. This system, actively used live across 10+ educational institutions, was his foray into large-scale distributed storage, and his transition into using Ceph professionally was only natural.
Karan Singh is a senior storage architect working with Red Hat and living with his charming wife Monika in Finland. In his current role, Karan is doing solution engineering on Ceph together with partners, customers and exploring new avenues for software defined storage.
Karan devotes a part of his time to learning emerging technologies and enjoys the challenges that come with them. He also authored the first edition of Learning Ceph as well as Ceph Cookbook, both from Packt Publishing. You can reach him on Twitter @karansingh010.
Christian Kauhaus set up his first Ceph cluster using the Argonaut release back in 2012. He has been hooked on sysadmin work since helping to set up his school's computer room back in the nineties. After studying Computer Science in Rostock and Jena, he spent a few years with the High Performance Computing group at the University of Jena. Currently, he is working as a systems engineer at Flying Circus Internet Operations GmbH, a small company providing managed hosting and data center related services, located in Halle (Saale), Germany.
Apart from that, Christian likes to program in Python and Rust and is an active member of the NixOS community. He loves to play Jazz piano in his leisure time. He currently lives in Jena, Germany.
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787127915. If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Introducing Ceph Storage
The history and evolution of Ceph
Ceph releases
New since the first edition
The future of storage
Ceph as the cloud storage solution
Ceph is software-defined
Ceph is a unified storage solution
The next-generation architecture
RAID: the end of an era
Ceph Block Storage
Ceph compared to other storage solutions
GPFS
iRODS
HDFS
Lustre
Gluster
Ceph
Summary
Ceph Components and Services
Introduction
Core components
Reliable Autonomic Distributed Object Store (RADOS)
MONs
Object Storage Daemons (OSDs)
Ceph manager
RADOS GateWay (RGW)
Admin host
CephFS MetaData server (MDS)
The community
Core services
RADOS Block Device (RBD)
RADOS Gateway (RGW)
CephFS
Librados
Summary
Hardware and Network Selection
Introduction
Hardware selection criteria
Corporate procurement policies
Power requirements: amps, volts, and outlets
Compatibility with management infrastructure
Compatibility with physical infrastructure
Configuring options for one-stop shopping
Memory
RAM capacity and speed
Storage drives
Storage drive capacity
Storage drive form factor
Storage drive durability and speed
Storage drive type
Number of storage drive bays per chassis
Controllers
Storage HBA / controller type
Networking options
Network versus serial versus KVM management
Adapter slots
Processors
CPU socket count
CPU model
Emerging technologies
Summary
Planning Your Deployment
Layout decisions
Convergence: Wisdom or Hype?
Planning Ceph component servers
Rack strategy
Server naming
Architectural decisions
Pool decisions
Replication
Erasure Coding
Placement Group calculations
OSD decisions
Back end: FileStore or BlueStore?
OSD device strategy
Journals
Filesystem
Encryption
Operating system decisions
Kernel and operating system
Ceph packages
Operating system deployment
Time synchronization
Packages
Networking decisions
Summary
Deploying a Virtual Sandbox Cluster
Installing prerequisites for our Sandbox environment
Bootstrapping our Ceph cluster
Deploying our Ceph cluster
Scaling our Ceph cluster
Summary
Operations and Maintenance
Topology
The 40,000 foot view
Drilling down
OSD dump
OSD list
OSD find
CRUSH dump
Pools
Monitors
CephFS
Configuration
Cluster naming and configuration
The Ceph configuration file
Admin sockets
Injection
Configuration management
Scrubs
Logs
MON logs
OSD logs
Debug levels
Common tasks
Installation
Ceph-deploy
Flags
Service management
Systemd: the wave (tsunami?) of the future
Upstart
sysvinit
Component failures
Expansion
Balancing
Upgrades
Working with remote hands
Summary
Monitoring Ceph
Monitoring Ceph clusters
Ceph cluster health
Watching cluster events
Utilizing your cluster
OSD variance and fillage
Cluster status
Cluster authentication
Monitoring Ceph MONs
MON status
MON quorum status
Monitoring Ceph OSDs
OSD tree lookup
OSD statistics
OSD CRUSH map
Monitoring Ceph placement groups
PG states
Monitoring Ceph MDS
Open source dashboards and tools
Kraken
Ceph-dash
Decapod
Rook
Calamari
Ceph-mgr
Prometheus and Grafana
Summary
Ceph Architecture: Under the Hood
Objects
Accessing objects
Placement groups
Setting PGs on pools
PG peering
PG Up and Acting sets
PG states
CRUSH
The CRUSH Hierarchy
CRUSH Lookup
Backfill, Recovery, and Rebalancing
Customizing CRUSH
Ceph pools
Pool operations
Creating and listing pools
Ceph data flow
Erasure coding
Summary
Storage Provisioning with Ceph
Client Services
Ceph Block Device (RADOS Block Device)
Creating and Provisioning RADOS Block Devices
Resizing RADOS Block Devices
RADOS Block Device Snapshots
RADOS Block Device Clones
The Ceph Filesystem (CephFS)
CephFS with Kernel Driver
CephFS with the FUSE Driver
Ceph Object Storage (RADOS Gateway)
Configuration for the RGW Service
Performing S3 Object Operations Using s3cmd
Enabling the Swift API
Performing Object Operations using the Swift API
Summary
Integrating Ceph with OpenStack
Introduction to OpenStack
Nova
Glance
Cinder
Swift
Ganesha / Manila
Horizon
Keystone
The Best Choice for OpenStack storage
Integrating Ceph and OpenStack
Guest Operating System Presentation
Virtual OpenStack Deployment
Summary
Performance and Stability Tuning
Ceph performance overview
Kernel settings
pid_max
kernel.threads-max, vm.max_map_count
XFS filesystem settings
Virtual memory settings
Network settings
Jumbo frames
TCP and network core
iptables and nf_conntrack
Ceph settings
max_open_files
Recovery
OSD and FileStore settings
MON settings
Client settings
Benchmarking
RADOS bench
CBT
FIO
Fill volume, then random 1M writes for 96 hours, no read verification:
Fill volume, then small block writes for 96 hours, no read verification:
Fill volume, then 4k random writes for 96 hours, occasional read verification:
Summary
Learning Ceph, Second Edition will give you all the skills you need to plan, deploy, and effectively manage Ceph clusters. You will begin with an introduction to Ceph use cases and components, then progress through advice on selecting and planning servers and components. We then cover a number of important decisions to make before provisioning your servers and clusters and walk through hands-on deployment of a fully functional virtualized sandbox cluster. A wide range of common (and not so common) management tasks are explored. A discussion of monitoring is followed by a deep dive into the inner workings of Ceph, a selection of topics related to provisioning storage, and an overview of Ceph's role as an OpenStack storage solution. Rounding out our chapters is advice on benchmarking and tuning for performance and stability.
By the end of the book you will have learned to deploy and use Ceph effectively for your data storage requirements.
Chapter 1, Introducing Ceph Storage, here we explore Ceph's role as storage needs evolve and embrace cloud computing. Also covered are Ceph's release cycle and history, including changes since the first edition of Learning Ceph.
Chapter 2, Ceph Components and Services, a tour through Ceph’s major components and the services they implement. Examples of each service’s use cases are given.
Chapter 3, Hardware and Network Selection, is a comprehensive journey through the maze of hardware choices that one faces when provisioning Ceph infrastructure. While Ceph is highly adaptable to diverse server components, prudent planning can help you optimize your choices for today and tomorrow.
Chapter 4, Planning Your Deployment, is the software complement to Chapter 3, Hardware and Network Selection. We guide the reader through deployment and provisioning decisions for both Ceph and the underlying operating system.
Chapter 5, Deploying a Virtual Sandbox Cluster, is an automated, fully functional Ceph deployment on virtual machines. This provides opportunities for hands-on experience within minutes.
Chapter 6, Ceph Operations and Maintenance, is a deep and wide inventory of day-to-day operations. We cover management of Ceph topologies, services, and configuration settings, as well as maintenance and debugging.
Chapter 7, Monitoring Ceph, a comprehensive collection of commands, practices, and dashboard software to help keep a close eye on the health of Ceph clusters.
Chapter 8, Ceph Architecture: Under the Hood, deep-dives into the inner workings of Ceph with low-level architecture, processes, and data flow.
Chapter 9, Storage Provisioning with Ceph, here we practice the client side of provisioning and managing each of Ceph’s major services.
Chapter 10, Integrating Ceph with OpenStack, explores the components of the OpenStack cloud infrastructure platform and how they can exploit Ceph for resilient and scalable storage.
Chapter 11, Performance and Stability Tuning, provides a collection of Ceph, networks, filesystems, and underlying operating system settings to optimize cluster performance and stability. Benchmarking of cluster performance is also explored.
A basic knowledge of storage terms, server hardware, and networking will help you digest the wealth of information provided. The virtual deployment sandbox was tested on macOS with the specified software versions. It should work readily on Linux or other desktop operating systems. While execution of the virtual deployment is valuable, it is not strictly required to reap the benefits of this book.
A basic knowledge of GNU/Linux, storage systems, and server components is assumed. If you have no experience with software-defined storage solutions and Ceph, but are eager to learn about them, this is the book for you. Those already managing Ceph deployments will also find value in the breadth and depth of the material presented.
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:
1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account. Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Ceph-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/LearningCephSecondEdition_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata is verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
Ceph is an open source project that provides a solution for software-defined, network-available storage with high performance and no single point of failure. It is designed to be highly scalable to the exabyte level and beyond while running on general-purpose commodity hardware.
In this chapter, we will cover the following topics:
The history and evolution of Ceph
What's new since the first edition of Learning Ceph
The future of storage
Ceph compared with other storage solutions
Ceph garners much of the buzz in the storage industry due to its open, scalable, and distributed nature. Today public, private, and hybrid cloud models are dominant strategies for scalable and scale-out infrastructure. Ceph's design and features including multi-tenancy are a natural fit for cloud Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) deployments: at least 60% of OpenStack deployments leverage Ceph.
Ceph is architected deliberately to deliver enterprise-quality services on a variety of commodity hardware. Ceph's architectural philosophy includes the following:
Every component must be scalable
No individual process, server, or other component can be a single point of failure
The solution must be software-based, open source, and adaptable
Ceph software should run on readily available commodity hardware without vendor lock-in
Everything must be self-manageable wherever possible
Ceph provides great performance, limitless scalability, power, and flexibility to enterprises, helping them move on from expensive proprietary storage silos. The Ceph universal storage system provides block, file, and object storage from a single, unified back-end, enabling customers to access storage as their needs evolve and grow.
The foundation of Ceph is objects, building blocks from which complex services are assembled. Any flavor of data, be it a block, object, or file, is represented by objects within the Ceph backend. Object storage is the flexible solution for unstructured data storage needs today and in the future. An object-based storage system offers advantages over traditional file-based storage solutions that include platform and hardware independence. Ceph manages data carefully, replicating across storage devices, servers, data center racks, and even data centers to ensure reliability, availability, and durability. Within Ceph, objects are not tied to a physical path, making objects flexible and location-independent. This enables Ceph to scale linearly from the petabyte level to the exabyte level.
Ceph was developed at the University of California, Santa Cruz, by Sage Weil in 2003 as a part of his PhD project. The initial implementation provided the Ceph Filesystem (CephFS) in approximately 40,000 lines of C++ code. This was open sourced in 2006 under a Lesser GNU Public License (LGPL) to serve as a reference implementation and research platform. Lawrence Livermore National Laboratory supported Sage's early follow-up work from 2003 to 2007.
DreamHost, a Los-Angeles-based web hosting and domain registrar company also co-founded by Sage Weil, supported Ceph development from 2007 to 2011. During this period Ceph as we know it took shape: the core components gained stability and reliability, new features were implemented, and the road map for the future was drawn. During this time a number of key developers began contributing, including Yehuda Sadeh-Weinraub, Gregory Farnum, Josh Durgin, Samuel Just, Wido den Hollander, and Loïc Dachary.
In 2012 Sage Weil founded Inktank to enable the widespread adoption of Ceph. Their expertise, processes, tools, and support enabled enterprise-subscription customers to effectively implement and manage Ceph storage systems. In 2014 Red Hat, Inc., the world's leading provider of open source solutions, agreed to acquire Inktank.
The term Ceph is a common nickname given to pet octopuses and is an abbreviation of cephalopod, marine animals belonging to the Cephalopoda class of molluscs. Ceph's mascot is an octopus, referencing the highly parallel behavior of an octopus; it was also chosen to connect the file system with UCSC's mascot, a banana slug named Sammy. Banana slugs are gastropods, which are also a class of molluscs. As Ceph is not an acronym, it should not be uppercased as CEPH.
Each release of Ceph has a numeric version. Major releases also receive cephalopod code names in alphabetical order. Through the Luminous release, the Ceph community tagged a new major version about twice a year, alternating between Long Term Support (LTS) and stable releases. The two most recent LTS releases were officially supported, but only the single most recent stable release.
The release numbering scheme has changed since the first edition of Learning Ceph was published. Earlier major releases were tagged initially with a version number (0.87) and were followed by multiple point releases (0.87.1, 0.87.2, ...). Releases beginning with Infernalis however are numbered as shown:
The major release number matches the letter of the alphabet of its code name (for example I is the ninth letter of the English alphabet, so 9.2.1 was named Infernalis). As we write, there have been four releases following this numbering convention: Infernalis, Jewel, Kraken, and Luminous.
The early versions of each major release have a type of 0 in the second field, which indicates active pre-release development status for early testers and the brave of heart. Later release candidates have a type of 1 and are targeted at test clusters and brave users. A type of 2 represents a general-availability, production-ready release. Point releases mostly contain security and bug fixes, but sometimes offer functionality improvements as well.
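As a quick, hedged illustration of this scheme, you can print the version of the Ceph package installed on any node and read the fields against the convention above:

ceph --version  # on a Jewel cluster, for example, the version begins with 10 (J), then the type (2 = stable), then the point release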
Ceph release name | Ceph package version | Release date
Argonaut          | 0.48 (LTS)           | July 2012
Bobtail           | 0.56 (LTS)           | January 2013
Cuttlefish        | 0.61                 | May 2013
Dumpling          | 0.67 (LTS)           | August 2013
Emperor           | 0.72                 | November 2013
Firefly           | 0.80 (LTS)           | May 2014
Giant             | 0.87                 | October 2014
Hammer            | 0.94 (LTS)           | April 2015
Infernalis        | 9.2.1                | November 2015
Jewel             | 10.2.3 (LTS)         | April 2016
Kraken            | 11.2.0               | January 2017
Luminous          | 12.2.0 (LTS)         | August 2017
Mimic             | 13.2.0               | 2018 (expected)
Nautilus          | 14.2.0               | 2019 (expected)
The Jewel LTS release brought a number of significant changes:
Unified queue of client I/O, recovery, scrubs, and snapshot trimming
Daemons now run as the ceph user, which must be addressed when upgrading
Cache tier improvements
SHEC erasure coding is no longer experimental
The Swift API now supports object expiration
RBD improvements (now supports suffixes)
rbd du shows actual and provisioned usage quickly via object-map and fast-diff features (see the sketch after this list)
New rbd status command
deep-flatten now handles snapshots
CephFS snapshots can now be renamed
And CephFS is considered stable!
Scrubbing improvements
TCMalloc improvements
Multisite functionality in RGW significantly improved
OpenStack Keystone v3 support
Swift per-tenant namespace
Async RBD mirroring
A new look for ceph status
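To make the rbd du and rbd status items above concrete, here is a minimal sketch against a Jewel or later cluster; the pool and image names are hypothetical:

rbd du rbd/test-image      # actual versus provisioned usage; fast when object-map and fast-diff are enabled
rbd status rbd/test-image  # lists clients (watchers) currently attached to the image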
As we write, the major Luminous LTS release has just reached general availability. Early experiences are positive and it is the best choice for new deployments. Much-anticipated features in Luminous include:
The BlueStore back end is supported
In-line compression and read checksums
Erasure coding for RBD volumes (see the sketch after this list)
Better tools for uniform OSD utilization
Improved tools for the OSD lifecycle
Enhanced CLI
Multiple active CephFS MDS servers are supported
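As a minimal sketch of the erasure-coding item above, assuming a Luminous cluster with BlueStore OSDs and hypothetical pool and image names, RBD data can now live on an erasure-coded pool while image metadata stays on a replicated pool:

ceph osd pool create rbd-ec 64 64 erasure            # create an erasure-coded data pool
ceph osd pool set rbd-ec allow_ec_overwrites true    # overwrites must be enabled for RBD use
rbd create --size 10G --data-pool rbd-ec rbd/ec-vol  # metadata in the replicated rbd pool, data in rbd-ec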
Enterprise storage requirements have grown explosively over the last decade. Research has shown that data in large enterprises is growing at a rate of 40 to 60 percent annually, and many companies are doubling their data footprint each year. IDC analysts estimated that there were 54.4 exabytes of total digital data worldwide in the year 2000. By 2007, this reached 295 exabytes; by 2012, 2,596 exabytes; and by the end of 2020 it is expected to reach 40,000 exabytes worldwide.
Traditional and proprietary storage solutions often suffer from breathtaking cost, limited scalability and functionality, and vendor lock-in. Each of these factors confounds seamless growth and upgrades for speed and capacity.
Closed source software and proprietary hardware leave one between a rock and a hard place when a product line is discontinued, often requiring a lengthy, costly, and disruptive forklift-style total replacement of EOL deployments.
Modern storage demands a system that is unified, distributed, reliable, highly performant, and most importantly, massively scalable to the exabyte level and beyond. Ceph is a true solution for the world's growing data explosion. A key factor in Ceph's growth and adoption at lightning pace is the vibrant community of users who truly believe in the power of Ceph. Data generation is a never-ending process and we need to evolve storage to accommodate the burgeoning volume.
Ceph is the perfect solution for modern, growing storage: its unified, distributed, cost-effective, and scalable nature is the solution to today's and the future's data storage needs. The open source Linux community saw Ceph's potential as early as 2008, and added support for Ceph into the mainline Linux kernel.
One of the most problematic yet crucial components of cloud infrastructure development is storage. A cloud environment needs storage that can scale up and out at low cost and that integrates well with other components. Such a storage system is a key contributor to the total cost of ownership (TCO) of the entire cloud platform. There are traditional storage vendors who claim to provide integration to cloud frameworks, but we need additional features beyond just integration support. Traditional storage solutions may have proven adequate in the past, but today they are not ideal candidates for a unified cloud storage solution. Traditional storage systems are expensive to deploy and support in the long term, and scaling up and out is uncharted territory. We need a storage solution designed to fulfill current and future needs, a system built upon open source software and commodity hardware that can provide the required scalability in a cost-effective way.
Ceph has rapidly evolved in this space to fill the need for a true cloud storage backend. It is favored by major open source cloud platforms including OpenStack, CloudStack, and OpenNebula. Ceph has built partnerships with Canonical, Red Hat, and SUSE, the giants in the Linux space who favor distributed, reliable, and scalable Ceph storage clusters for their Linux and cloud software distributions. The Ceph community is working closely with these Linux giants to provide a reliable multi-featured storage backend for their cloud platforms.
Public and private clouds have gained momentum thanks to the OpenStack platform. OpenStack has proven itself as an end-to-end cloud solution. It includes two core storage components: Swift, which provides object-based storage, and Cinder, which provides block storage volumes to instances. Ceph excels as the back end for both object and block storage in OpenStack deployments.
Swift is limited to object storage. Ceph is a unified storage solution for block, file, and object storage and benefits OpenStack deployments by serving multiple storage modalities from a single backend cluster. The OpenStack and Ceph communities have worked together for many years to develop a fully supported Ceph storage backend for the OpenStack cloud. Since OpenStack's Folsom release, Ceph has been fully integrated. Ceph's developers ensure that Ceph works well with each new release of OpenStack, contributing new features and bug fixes. OpenStack's Cinder and Glance components utilize Ceph's key RADOS Block Device (RBD) service. Ceph RBD enables OpenStack deployments to rapidly provision hundreds of virtual machine instances by providing thin-provisioned snapshot and cloned volumes that are quickly and efficiently created.
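As a hedged illustration (exact option names vary somewhat across OpenStack releases), a Cinder backend stanza for Ceph RBD looks roughly like this; the pool and user names here are assumptions:

[ceph]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_user = cinder
rbd_ceph_conf = /etc/ceph/ceph.conf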
Cloud platforms with Ceph as a storage backend provide much-needed flexibility to service providers who build Storage as a Service (STaaS) and Infrastructure as a Service (IaaS) solutions that they cannot realize with traditional enterprise storage solutions. By leveraging Ceph as a backend for cloud platforms, service providers can offer low-cost cloud services to their customers. Ceph enables them to offer relatively low storage prices with enterprise features when compared to other storage solutions.
Dell, SUSE, Red Hat, and Canonical offer and support deployment and configuration management tools such as Dell Crowbar, Red Hat's Ansible, and Juju for automated and easy deployment of Ceph storage for their OpenStack cloud solutions. Other configuration management tools such as Puppet, Chef, and SaltStack are popular for automated Ceph deployment. Each of these tools has open source, ready-made Ceph modules available that can be easily leveraged for Ceph deployment. With Red Hat's acquisition of Ansible, the open source ceph-ansible suite is becoming a favored deployment and management tool. In distributed cloud (and other) environments, every component must scale. These configuration management tools are essential to quickly scale up your infrastructure. Ceph is fully compatible with these tools, allowing customers to deploy and extend a Ceph cluster instantly.
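For example, the open source ceph-ansible playbooks can be fetched directly from the project's public repository and adapted to local needs:

git clone https://github.com/ceph/ceph-ansible.git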
Storage infrastructure architects increasingly favor Software-defined Storage (SDS) solutions. SDS offers an attractive solution to organizations with a large investment in legacy storage who are not getting the flexibility and scalability they need for evolving needs. Ceph is a true SDS solution:
Open source software
Runs on commodity hardware
No vendor lock-in
Low cost per GB
An SDS solution provides much-needed flexibility with respect to hardware selection. Customers can choose commodity hardware from any manufacturer and are free to design a heterogeneous hardware solution that evolves over time to meet their specific needs and constraints. Ceph's software-defined storage built from commodity hardware flexibly provides agile enterprise storage features from the software layer.
In Chapter 3, Hardware and Network Selection, we'll explore a variety of factors that influence the hardware choices you make for your Ceph deployments.
Unified storage from a storage vendor's perspective is defined as file-based Network-Attached Storage (NAS) and block-based Storage Area Network (SAN) access from a single platform. NAS and SAN technologies became popular in the late 1990s and early 2000s, but when we look to the future, are we sure that traditional, proprietary NAS and SAN technologies can manage storage needs 50 years down the line? Do they have what it takes to handle exabytes of data?
With Ceph, the term unified storage means much more than just what traditional storage vendors claim to provide. Ceph is designed from the ground up to be future-ready; its building blocks are scalable to handle enormous amounts of data and the open source model ensures that we are not bound to the whim or fortunes of any single vendor. Ceph is a true unified storage solution that provides block, file, and object services from a single unified software-defined backend. Object storage is a better fit for today's mix of unstructured data strategies than are blocks and files. Access is through a well-defined RESTful network interface, freeing application architects and software engineers from the nuances and vagaries of operating system kernels and filesystems. Moreover, object-backed applications scale readily by freeing users from managing the limits of discrete-sized block volumes. Block volumes can sometimes be expanded in-place, but this is rarely a simple, fast, or non-disruptive operation. Applications can be written to access multiple volumes, either natively or through layers such as the Linux LVM (Logical Volume Manager), but these also can be awkward to manage and scaling can still be painful. Object storage from the client perspective does not require management of fixed-size volumes or devices.
Rather than managing the complexity of blocks and files behind the scenes, Ceph manages low-level RADOS objects and defines block- and file-based storage on top of them. If you think of a traditional file-based storage system, files are addressed via a directory and file path, and in a similar way, objects in Ceph are addressed by a unique identifier and are stored in a flat namespace.
Traditional storage systems lack an efficient way of managing metadata. Metadata is information (data) about the actual user payload data, including where the data will be written to and read from. Traditional storage systems maintain a central lookup table to keep track of their metadata. Every time a client sends a request for a read or write operation, the storage system first performs a lookup into the huge metadata table. After receiving the results, it performs the client operation. For a smaller storage system, you might not notice the performance impact of this centralized bottleneck, but as storage domains grow large, the performance and scalability limits of this approach become increasingly problematic.
Ceph does not follow the traditional storage architecture; it has been totally reinvented for the next generation. Rather than centrally storing, manipulating, and accessing metadata, Ceph introduces a new approach, the Controlled Replication Under Scalable Hashing (CRUSH) algorithm.
Instead of performing a lookup in the metadata table for every client request, the CRUSH algorithm enables the client to independently compute where data should be written to or read from. By deriving this metadata dynamically, there is no need to manage a centralized table. Modern computers can perform a CRUSH lookup very quickly; moreover, a smaller computing load can be distributed across cluster nodes, leveraging the power of distributed storage.
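You can observe this computation directly: the following one-liner asks the cluster to run the same CRUSH calculation a client performs, mapping a hypothetical object name in a hypothetical pool to its placement group and OSDs:

ceph osd map mypool myobject  # prints the placement group and the set of OSDs that serve this object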
CRUSH accomplishes this via infrastructure awareness. It understands the hierarchy and capacities of the various components of your logical and physical infrastructure: drives, nodes, chassis, datacenter racks, pools, network switch domains, datacenter rows, even datacenter rooms and buildings as local requirements dictate. These are the failure domains for any infrastructure. CRUSH stores data safely replicated so that data will be protected (durability) and accessible (availability) even if multiple components fail within or across failure domains. Ceph managers define these failure domains for their infrastructure within the topology of Ceph's CRUSH map. The Ceph backend and clients share a copy of the CRUSH map, and clients are thus able to derive the location, drive, server, datacenter, and so on, of desired data and access it directly without a centralized lookup bottleneck.
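The hierarchy that CRUSH consults is visible at any time; this command prints the topology of buckets (roots, racks, hosts) with the OSDs and weights beneath them:

ceph osd tree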
CRUSH enables Ceph's self-management and self-healing. In the event of component failure, the CRUSH map is updated to reflect the down component. The back end transparently determines the effect of the failure on the cluster according to defined placement and replication rules. Without administrative intervention, the Ceph back end performs behind-the-scenes recovery to ensure data durability and availability. The back end creates replicas of data from surviving copies on other, unaffected components to restore the desired degree of safety. A properly designed CRUSH map and CRUSH rule set ensure that the cluster will maintain more than one copy of data distributed across the cluster on diverse components, avoiding data loss from single or multiple component failures.
Redundant Array of Independent Disks (RAID) has been a fundamental storage technology for the last 30 years. However, as data volume and component capacities scale dramatically, RAID-based storage systems are increasingly showing their limitations and fall short of today's and tomorrow's storage needs.
Disk technology has matured over time. Manufacturers are now producing enterprise-quality magnetic disks with immense capacities at ever lower prices. We no longer speak of 450 GB, 600 GB, or even 1 TB disks as drive capacity and performance has grown. As we write, modern enterprise drives offer up to 12 TB of storage; by the time you read these words capacities of 14 or more TB may well be available. Solid State Drives (SSDs) were formerly an expensive solution for small-capacity high-performance segments of larger systems or niches requiring shock resistance or minimal power and cooling. In recent years SSD capacities have increased dramatically as prices have plummeted. Since the publication of the first edition of Learning Ceph, SSDs have become increasingly viable for bulk storage as well.
Consider an enterprise RAID-based storage system built from numerous 4 or 8 TB disk drives; in the event of disk failure, RAID will take many hours or even days to recover from a single failed drive. If another drive fails during recovery, chaos will ensue and data may be lost. Recovering from the failure or replacement of multiple large disk drives using RAID is a cumbersome process that can significantly degrade client performance.
Traditional RAID technologies include RAID 1 (mirroring), RAID 10 (mirroring plus striping), and RAID 5 (parity).
Effective RAID implementations require entire dedicated drives to be provisioned as hot spares. This impacts TCO, and running out of spare drives can be fatal. Most RAID strategies assume a set of identically-sized disks, so you will suffer efficiency and speed penalties or even failure to recover if you mix in drives of differing speeds and sizes. Often a RAID system will be unable to use a spare or replacement drive that is very slightly smaller than the original, and if the replacement drive is larger, the additional capacity is usually wasted.
Another shortcoming of traditional RAID-based storage systems is that they rarely offer any detection or correction of latent or bit-flip errors, aka bit-rot. The microscopic footprint of data on modern storage media means that sooner or later what you read from the storage device won't match what you wrote, and you may not have any way to know when this happens. Ceph runs periodic scrubs that compare checksums and remove altered copies of data from service. With the Luminous release Ceph also gains the ZFS-like ability to checksum data at every read, additionally improving the reliability of your critical data.
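Scrubs can also be requested on demand; a minimal sketch, using a hypothetical placement group ID:

ceph pg deep-scrub 0.1f  # read every replica within PG 0.1f and compare checksums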
Enterprise RAID-based systems often require expensive, complex, and fussy RAID-capable HBA cards that increase management overhead, complicate monitoring, and increase the overall cost. RAID can hit the wall when size limits are reached. This author has repeatedly encountered systems that cannot expand a storage pool past 64TB. Parity RAID implementations including RAID 5 and RAID 6 also suffer from write throughput penalties, and require complex and finicky caching strategies to enable tolerable performance for most applications. Often the most limiting shortcoming of traditional RAID is that it only protects against disk failure; it cannot protect against switch and network failures, those of server hardware and operating systems, or even regional disaster. Depending on strategy, the maximum protection you may realize from RAID is surviving through one or at most two drive failures. Strategies such as RAID 60 can somewhat mitigate this risk, though they are not universally available, are inefficient, may require additional licensing, and still deliver incomplete protection against certain failure patterns.
For modern storage capacity, performance, and durability needs, we need a system that can overcome all these limitations in a performance- and cost-effective way. Back in the day a common solution for component failure was a backup system, which itself could be slow, expensive, capacity-limited, and subject to vendor lock-in. Modern data volumes are such that traditional backup strategies are often infeasible due to scale and volatility.
A Ceph storage system is the best solution available today to address these problems. For data reliability, Ceph makes use of data replication (including erasure coding). It does not use traditional RAID, and because of this, it is free of the limitations and vulnerabilities of a traditional RAID-based enterprise storage system. Since Ceph is software-defined and exploits commodity components we do not require specialized hardware for data replication. Moreover, the replication level is highly configurable by Ceph managers, who can easily manage data protection strategies according to local needs and underlying infrastructure. Ceph's flexibility even allows managers to define multiple types and levels of protection to address the needs of differing types and populations of data within the same back end.
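As a small illustration of that configurability, and assuming a pool named mypool, the protection level is a simple per-pool setting:

ceph osd pool set mypool size 3      # maintain three copies of every object
ceph osd pool set mypool min_size 2  # keep serving I/O while at least two copies remain available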
Ceph's replication is superior to traditional RAID when components fail. Unlike RAID, when a drive (or server!) fails, the data that was held by that drive is recovered from a large number of surviving drives. Since Ceph is a distributed system driven by the CRUSH map, the replicated copies of data are scattered across many drives. By design no primary and replicated copies reside on the same drive or server; they are placed within different failure domains. A large number of cluster drives participate in data recovery, distributing the workload and minimizing the contention with and impact on ongoing client operations. This makes recovery operations amazingly fast without performance bottlenecks.
Moreover, recovery does not require spare drives; data is replicated to unallocated space on other drives within the cluster. Ceph implements a weighting mechanism for drives and sorts data independently at a granularity smaller than any single drive's capacity. This avoids the limitations and inefficiencies that RAID suffers with non-uniform drive sizes. Ceph stores data based on each drive's and each server's weight, which is adaptively managed via the CRUSH map. Replacing a failed drive with a smaller drive results in a slight reduction of cluster aggregate capacity, but unlike traditional RAID it still works. If a replacement drive is larger than the original, even many times larger, the cluster's aggregate capacity increases accordingly. Ceph does the right thing with whatever you throw at it.
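Weights can also be adjusted by hand when drives are swapped; a sketch with a hypothetical OSD ID, where the weight conventionally reflects the drive's capacity in TB:

ceph osd crush reweight osd.3 3.64  # tell CRUSH that osd.3 now backs a roughly 4 TB drive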
In addition to replication, Ceph also supports another advanced method of ensuring data durability: erasure coding, which is a type of Forward Error Correction (FEC). Erasure-coded pools require less storage space than replicated pools, resulting in a greater ratio of usable to raw capacity. In this process, data on failed components is regenerated algorithmically. You can use both replication and erasure coding on different pools with the same Ceph cluster. We will explore the benefits and drawbacks of erasure-coding versus replication in coming chapters.
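A minimal sketch of carving out an erasure-coded pool, with hypothetical names: k=4 data chunks plus m=2 coding chunks survive any two failures while consuming only 1.5x raw capacity, versus 3x for triple replication:

ceph osd erasure-code-profile set ec-profile-k4m2 k=4 m=2
ceph osd pool create ecpool 128 128 erasure ec-profile-k4m2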
Block storage will be familiar to those who have worked with traditional SAN (Storage Area Network) technologies. Allocations of desired capacity are provisioned on demand and presented as contiguous, statically-sized volumes (sometimes referred to as images). Ceph RBD supports volumes up to 16 exabytes in size. These volumes are attached to the client operating system as virtualized disk drives that can be utilized much like local physical drives. In virtualized environments the attachment point is often at the hypervisor level (e.g., QEMU/KVM). The hypervisor then presents volumes to the guest operating system via the virtio driver or as an emulated IDE or SCSI disk. Usually a filesystem is then created on the volume for traditional file storage. This strategy has the appeal that guest operating systems do not need to know about Ceph, which is especially useful for software delivered as an appliance image. Client operating systems running on bare metal can also directly map volumes using a Ceph kernel driver.
Ceph's block storage component is RBD, the RADOS Block Device. We will discuss RADOS in depth in the following chapters, but for now we'll note that RADOS is the underlying technology on which RBD is built. RBD provides reliable, distributed, and high performance block storage volumes to clients. RBD volumes are effectively striped over numerous objects scattered throughout the entire Ceph cluster, a strategy that is key for providing availability, durability, and performance to clients. The Linux kernel bundles a native RBD driver; thus clients need not install layered software to enjoy Ceph's block service. RBD also provides enterprise features including incremental (diff) and full-volume snapshots, thin provisioning, copy-on-write (COW) cloning, layering, and others. RBD clients also support in-memory caching, which can dramatically improve performance.
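A minimal end-to-end sketch with hypothetical names: create a thin-provisioned volume, attach it with the native kernel driver, and put a filesystem on it:

rbd create --size 10G mypool/vol0  # thin-provisioned: space is consumed only as data is written
sudo rbd map mypool/vol0           # attaches via the kernel RBD driver, typically as /dev/rbd0
sudo mkfs.xfs /dev/rbd0            # from here it behaves like any local disk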
The Ceph RBD service is exploited by cloud platforms including OpenStack and CloudStack to provision both primary / boot devices and supplemental volumes. Within OpenStack, Ceph's RBD service is configured as a backend for the abstracted Cinder (block) and Glance (base image) components. RBD's copy-on-write functionality enables one to quickly spin up hundreds or even thousands of thin-provisioned instances (virtual machines).
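The copy-on-write workflow behind such rapid provisioning is a three-step sketch, again with hypothetical names: snapshot a golden image, protect the snapshot, and stamp out clones that initially share all data with their parent:

rbd snap create mypool/golden@base
rbd snap protect mypool/golden@base             # a snapshot must be protected before it can be cloned
rbd clone mypool/golden@base mypool/vm-disk-001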
The enterprise storage market is experiencing a fundamental realignment. Traditional proprietary storage systems are incapable of meeting future data storage needs, especially within a reasonable budget. Appliance-based storage is declining even as data usage grows by leaps and bounds.
The high TCO of proprietary systems does not end with hardware procurement: nickel-and-dime feature licenses, yearly support, and management add up to a breathtakingly expensive bottom line. One would previously purchase a pallet-load of hardware, pay for a few years of support, then find that the initial deployment has been EOL'd and thus can't be expanded or even maintained. This perpetuates a cycle of successive rounds of en-masse hardware acquisition. Concomitant support contracts to receive bug fixes and security updates often come at spiraling cost. After a few years (or even sooner) your once-snazzy solution becomes unsupported scrap metal, and the cycle repeats. Pay, lather, rinse, repeat. When the time comes to add a second deployment, the same product line may not even be available, forcing you to implement, document, and support a growing number of incompatible, one-off solutions. I daresay your organization's money and your time can be better spent elsewhere, like giving you a well-deserved raise.
With Ceph new software releases are always available, no licenses expire, and you're welcome to read the code yourself and even contribute. You can also expand your solution along many axes, compatibly and without disruption. Unlike one-size-fits-none proprietary solutions, you can pick exactly the scale, speed, and components that make sense today while effortlessly growing tomorrow, with the highest levels of control and customization.
Open source storage technologies, however, have demonstrated performance, reliability, scalability, and lower TCO without fear of product line or model phase-outs or vendor lock-in. Many corporations as well as government, university, research, healthcare, and HPC (High Performance Computing) organizations are already successfully exploiting open source storage solutions.
Ceph is garnering tremendous interest and gaining popularity, increasingly winning over other open source as well as proprietary storage solutions. In the remainder of this chapter we'll compare Ceph to other open source storage solutions.
General Parallel File System (GPFS) is a distributed filesystem developed and owned by IBM. This is a proprietary and closed source storage system, which limits its appeal and adaptability. Licensing and support cost added to that of storage hardware add up to an expensive solution. Moreover, it has a very limited set of storage interfaces: it provides neither block storage (like RBD) nor RESTful (like RGW) access to the storage system, limiting the constellation of use-cases that can be served by a single backend.
In 2015 GPFS was rebranded as IBM Spectrum Scale.
iRODS stands for Integrated Rule-Oriented Data System, an open source data-management system released under a 3-clause BSD license. iRODS is not highly available and can be bottlenecked: its iCAT metadata server is a single point of failure (SPoF) without true high availability (HA) or scalability. Moreover, it implements a very limited set of storage interfaces, providing neither block storage nor RESTful access modalities. iRODS is more effective at storing a relatively small number of large files than a large number of mixed small and large files. iRODS implements a traditional metadata architecture, maintaining an index of the physical location of each filename.
HDFS is a distributed, scalable filesystem written in Java for the Hadoop processing framework. HDFS is not a fully POSIX-compliant filesystem and does not offer a block interface. The reliability of HDFS is of concern as it lacks high availability: the single NameNode in HDFS is a SPoF and performance bottleneck. HDFS again is suited primarily to storing a small number of large files rather than the mix of small and large files at scale that modern deployments demand.
Lustre is a parallel distributed filesystem driven by the open source community and is available under the GNU General Public License (GPL). Lustre relies on a single server for storing and managing metadata. Thus, I/O requests from clients are totally dependent on a single server's computing power, which can be a bottleneck for enterprise-level consumption. Like iRODS and HDFS, Lustre is better suited to a small number of large files than to a more typical mix of files of various sizes. Like iRODS, Lustre manages an index file that maps filenames to physical addresses, which makes its traditional architecture prone to performance bottlenecks. Lustre lacks a mechanism for failure detection and correction: when a node fails, clients must connect to another node.
GlusterFS was originally developed by Gluster Inc., which was acquired by Red Hat in 2011. GlusterFS is a scale-out network-attached filesystem in which administrators must determine the placement strategy to use to store data replicas on geographically spread racks. Gluster does not provide block access, filesystem, or remote replication as intrinsic functions; rather, it provides these features as add-ons.
Ceph stands out from the storage solution crowd by virtue of its feature set. It has been designed to overcome the limitations of existing storage systems, and effectively replaces old and expensive proprietary solutions. Ceph is economical by being open source and software-defined and by running on most any commodity hardware. Clients enjoy the flexibility of Ceph's variety of client access modalities with a single backend.
Every Ceph component is reliable and supports high availability and scaling. A properly configured Ceph cluster is free from single points of failure and accepts an arbitrary mix of file types and sizes without performance penalties.
Ceph, by virtue of being distributed, does not follow the traditional centralized metadata approach and is thus free of the lookup bottlenecks and scaling limits described earlier in this chapter.