Apache Cassandra Essentials - Nitin Padalia - E-Book

Apache Cassandra Essentials E-Book

Nitin Padalia

Description

Apache Cassandra Essentials takes you step-by-step from the basics of installation to advanced installation options and database design techniques. It gives you all the information you need to design a well-distributed, high-performance database. You’ll get to know the steps a Cassandra node performs when you execute a read/write query, which is essential for properly maintaining a Cassandra cluster and debugging any issues. Next, you’ll discover how to integrate a Cassandra driver into your applications and perform read/write operations. Finally, you’ll learn about the various tools Cassandra provides for serviceability aspects such as logging, metrics, backup, and recovery.

You can read this e-book in the Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Number of pages: 187

Year of publication: 2015




Table of Contents

Apache Cassandra Essentials
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Your Cassandra Cluster Ready
Installation
Prerequisites
Compiling Cassandra from source and installing
Installation from a precompiled binary
The installation layout
The directory layout in tarball installations
The directory layout in package-based installation
Configuration files
cassandra.yaml
Running a Cassandra server
Running a Cassandra node
Setting up the cluster
Viewing the cluster status
Summary
2. An Architectural Overview
Background
Cassandra cluster overview
The Gossip protocol
Failure detection
Data distribution
Replication
SimpleStrategy
NetworkTopologyStrategy
Snitches
Virtual nodes
Adding nodes to our cluster
Create keyspace and column family
Summary
3. Creating Database and Schema
A database and schema
Keyspace
Column families
Static rows
Wide rows
A primary key
Partition keys and clustering columns
A composite partition key
Multiple clustering columns
Static columns
Modifying a table
Data types
Counters
Collections
Sets
Lists
Map
UDTs
Secondary indexes
Allowing filtering
TTL
Conditional querying
Conditions on a partition key
Conditions on a partition key and clustering columns
Sorting query results
Write operations
Lightweight transactions
Batch statements
Summary
4. Read and Write – Behind the Scenes
Write operations
CommitLog
Anatomy of Memtable
SSTable explained
SSTable Compaction strategies
Size-tiered compaction
Leveled compaction
DateTiered compaction
Read operations
Reads from row cache
Read operations for row cache miss
Key is in KeyCache
Key search miss both the key cache and the row cache
Delete operations
Data consistency
Read operation
Digest reads
Read repair
Consistency levels
Write operation
Hinted handoff
Consistency levels
Tracing Cassandra queries
Summary
5. Writing Your Cassandra Client
Connecting to a Cassandra cluster
Driver Connection policies
Load balancing policies
Retry policies
Reconnection policies
Reading and writing to the Cassandra cluster
QueryBuilder
Reading and writing asynchronously
Prepared statements
Example REST service using prepared statement
Batch statements
Mapping API
Tracing Cassandra queries using Java driver
Summary
6. Monitoring and Tuning a Cassandra Cluster
Monitoring a Cassandra cluster
Use logging for debugging
Monitoring using command-line utilities
nodetool cfstats
nodetool cfhistograms
nodetool netstats
nodetool tpstats
JConsole
Third-party tools
Tuning Cassandra nodes
Configuring Cassandra caches
Tuning Bloom filters
Configuring and tuning Java
Summary
7. Backup and Restore
Taking backup of a Cassandra cluster
Manual backup
Deleting snapshots
Incremental backup
Restoring data to Cassandra
The Cassandra bulk loader
Exporting and importing data using the Cassandra JSON utility
Loading external data into Cassandra
Removing nodes from Cassandra cluster
Adding nodes to a Cassandra cluster
Replacing dead nodes in a cluster
Summary
Index

Apache Cassandra Essentials

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: November 2015

Production reference: 1161115

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78398-910-2

www.packtpub.com

Credits

Author

Nitin Padalia

Reviewers

Ranjeet Kumar Jha

Sonal Raj

Chaoran Yu

Commissioning Editor

Akram Hussain

Acquisition Editor

Meeta Rajani

Content Development Editor

Aparna Mitra

Technical Editor

Rohan Uttam Gosavi

Copy Editor

Pranjali Chury

Project Coordinator

Mary Alex

Proofreader

Safis Editing

Indexer

Mariammal Chettiyar

Graphics

Disha Haria

Production Coordinator

Nilesh Mohite

Cover Work

Nilesh Mohite

About the Author

Nitin Padalia is the technical leader at Aricent Group, where he is involved in building highly scalable distributed applications in the field of telecommunications. He has worked in telecommunications since the beginning of his career, on protocols and technologies such as SMPP, RTP, SIP, and VoIP. Throughout, he has developed applications designed to scale out while delivering the highest possible performance. He has experience developing applications for bare-metal hardware, virtualized environments, and the cloud, using various languages and technologies.

I would like to thank all the reviewers of this book; their comments helped me to present data effectively.

Meeta Rajani, for setting things up and providing input during the initial phase of the book.

Anish Sukumaran, for helping me through his comments and input till the completion of this book.

Chaoran Yu, for good suggestions regarding presenting data and examples in a way that could be more helpful from the readers' perspective.

Ranjit, for his input throughout the book.

I also would like to thank my family—my mother, father, wife, and kids—for letting me take some time out to write this book.

About the Reviewers

Ranjeet Kumar Jha has over 12 years (three years in the big data field) of experience in various phases of the project life cycle, including the development and design phases. He has also been part of production support for Java/JEE and big data-based applications. He is a certified enterprise architect, that is, Oracle Certified Master Enterprise Java JEE Architect, and has worked for over six years as an architect in Java JEE technologies (over three years in the big data field). He has worked in various domains such as finance, insurance, e-commerce, digital media, CMS, security, and online advertisements.

He has worked as a programmer, designer, mentor, and architect on all types of projects related to Java, especially JEE and big data. He is the reviewer of the book Real-time Analytics with Storm and Cassandra.

To find out more about him, visit his LinkedIn profile at https://www.linkedin.com/in/jharanjeet.

I would like to thank my family—my wife, Anila Jha, and two kids, Anushka Jha and Tanisha Jha, for their constant support, encouragement, and patience. Without you, I wouldn't have achieved so much! Love you all immensely.

Sonal Raj is a hacker, Pythonista, big data believer, and a technology dreamer. He has a passion for design and is an artist at heart. He blogs about technology, design, and gadgets at http://www.sonalraj.com/. When not working on projects, he can be found travelling, stargazing, or reading.

He has pursued engineering in computer science and holds a master's degree in IT. He loves to work on community projects. He has been a research fellow at IISc and has taken up projects on graph computations using Neo4j, Storm, and NoSQL databases. He has been a speaker at PyCon India and local meetups and has also published articles and research papers in leading magazines and international journals. He has contributed to several open source projects.

He is the author of Neo4j High Performance, Packt Publishing, and has reviewed titles on technologies such as Storm and Neo4j.

I am grateful to the author for patiently listening to my critiques. I'd like to thank the open source community for keeping their passions alive and contributing to such remarkable projects. A special thank you to my parents, without whom I never would have grown to love learning as much as I do.

Chaoran Yu obtained his bachelor's degree with high honors from UC Berkeley Department of Electrical Engineering and Computer Science in May 2014. He has been a software developer with the data analytics team of Ericsson MediaFirst, a leading IPTV solution, since then. The technologies that he has worked on include Apache Cassandra, Spark, and the Microsoft .NET framework. He organized service and client logging and performance data and wrote code to store them in Cassandra, which he then processed with Spark jobs to generate real-time reports for TV operators. His passion for open source technologies, especially for distributed and scalable systems, makes him an avid learner in this ever-changing technology landscape.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Preface

Traditional database management systems can become a bottleneck for modern applications that need to be highly available, scalable, and ultra-responsive, because they cannot satisfy the storage and retrieval needs of such applications while delivering all of these attributes. Apache Cassandra, a highly available, massively scalable, query-driven NoSQL database, helps our applications achieve these must-have attributes. Its core features include handling large volumes of data with the flexibility to configure responsiveness, scalability, and high availability at the same time to suit our requirements.

In this book, I've provided step-by-step information starting from the basic installation to the advanced installation options and database design techniques. It gives all the information that you will need to design a well-distributed and high performance database. This book focuses on explaining core concepts with simple and easy-to-understand examples. I've also incorporated some code examples with this book. You can use these examples while working on your day-to-day tasks with Cassandra.

What this book covers

Chapter 1, Getting Your Cassandra Cluster Ready, gives an introduction to Cassandra and helps you to set up your cluster. It also introduces you to the various configuration options available to set up your cluster, which can be referred to while fine tuning the cluster.

Chapter 2, An Architectural Overview, helps you to understand the internal architecture of a Cassandra cluster. It details the strategies Cassandra uses to distribute data among the nodes in the cluster and describes how Cassandra becomes highly available by employing various replication strategies.

Chapter 3, Creating Database and Schema, details the concepts used by Cassandra. We'll learn to use CQL (Cassandra Query Language), which is used by Cassandra clients to describe data models, to create our databases and tables. Also, we'll discuss various techniques provided by Cassandra that can be used based on our storage and data retrieval requirements.

Chapter 4, Read and Write – Behind the Scenes, is written to help the reader understand the core internals of the system. We'll discuss the operations that Cassandra performs for every read and write query, along with all the data structures and caches it uses. We'll also discuss the configuration options it provides to control the trade-off between consistency and latency. In the later parts of this chapter, we'll see how to trace a Cassandra read/write query to debug performance issues.

Chapter 5, Writing Your Cassandra Client, builds on the earlier chapters, in which you set up your cluster, learned the core concepts of Cassandra, and created your database and schema. It shows, with code samples, how our application connects to the Cassandra cluster and performs read/write operations.

Chapter 6, Monitoring and Tuning a Cassandra Cluster, covers various tools that can be used to monitor your Cassandra cluster. After you set up your application and cluster, you need to know how to monitor the cluster in order to keep it running reliably. We'll also discuss various tuning parameters that are used to fine-tune Cassandra for our hardware and networking environments.

Chapter 7, Backup and Restore, covers backup and restore techniques. Although Cassandra is highly available with no single point of failure, there can be scenarios in which we need to restore data from an old snapshot; for example, suppose a buggy client corrupted our data and we want to recover from the previous day's snapshot. For situations like this, Cassandra provides options to take a backup of data and various restore techniques. You'll learn about these techniques in this chapter.

What you need for this book

In this book, we'll set up a Cassandra cluster. The latest Cassandra server code can be downloaded from http://cassandra.apache.org/download/. Our examples use Cassandra server version 2.x or later, which requires Java 1.7 or later and Python 2.6 or later. Python is required to run cqlsh, the CQL client provided by Cassandra. In later chapters, we use the Datastax Java driver as the Cassandra client; it can be downloaded from https://github.com/datastax/java-driver. We use driver version 2.1.2 in our examples. If you set up a cluster for your development environment, your development machine should have at least 4 GB of RAM and at least a dual-core CPU. For the Java client, we expect you to have a basic knowledge of Java. You can use any IDE, for example, Eclipse (https://eclipse.org/), to build the client. I've provided dependencies for both the Maven (https://maven.apache.org/) and Gradle (https://gradle.org/) build tools.
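As a quick way to check the Java requirement above, the following is a minimal sketch in Python that extracts the major version from the text printed by java -version. The function names parse_java_major and meets_minimum are illustrative, not part of any Cassandra tooling, and the sketch assumes the version appears in double quotes, either in the legacy "1.7.0_45" form or in the newer "17.0.2" form.

```python
import re


def parse_java_major(version_output):
    """Return the Java major version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = int(match.group(2)) if match.group(2) else 0
    # Legacy scheme: "1.7.0_45" means Java 7; newer scheme: "17.0.2" means Java 17.
    return second if first == 1 else first


def meets_minimum(version_output, minimum=7):
    """Return True if the reported Java version is at least `minimum`."""
    major = parse_java_major(version_output)
    return major is not None and major >= minimum


if __name__ == "__main__":
    sample = 'java version "1.7.0_45"'
    print(parse_java_major(sample), meets_minimum(sample))
```

In practice, the text would come from running java -version (which writes to standard error); the sketch takes a string so that it can be tried without a JVM installed.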

Who this book is for

This book is written for developers at both the beginner and intermediate levels. It also covers maintaining and fine-tuning Cassandra, as well as debugging your queries, so that you can get the best out of it. It is useful for anyone who works with huge datasets and wants to learn Cassandra because traditional relational databases cannot satisfy their need for high performance, availability, and scalability. Readers are not required to be familiar with traditional relational concepts. In fact, not knowing the relational model at all might help in some cases, because when you design your database, you won't be thinking about it from the traditional relational database perspective.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Apache provides source as well as binary tarballs and Debian packages."

A block of code is set as follows:

$ sudo mkdir -p /var/log/cassandra
$ sudo chown -R `whoami` /var/log/cassandra
$ sudo mkdir -p /var/lib/cassandra
$ sudo chown -R `whoami` /var/lib/cassandra

Any command-line input or output is written as follows:

$ java -version
java version "1.7.0_45"

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "OrderPreservingPartitioner is similar to above with same challenges and additional limitation that it assumes that keys are UTF8 strings".

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. Getting Your Cassandra Cluster Ready

In this chapter, you'll learn how to set up and run your own Cassandra cluster. We'll look at the prerequisites that need to be considered before setting up a Cassandra cluster. We'll also see the Cassandra installation layout, so that we can easily locate different configuration files, tools, and utilities later on. We will discuss the key configuration options that are required for cluster deployment. Then, we'll run our cluster and use Cassandra tools to verify our cluster status, some stats, and its version.

Installation

Apache provides source as well as binary tarballs and Debian packages. Third-party vendors, such as Datastax, additionally provide an MSI installer, Linux RPM and Debian packages, and UNIX and Mac OS X binaries in the form of a community edition, which is a free packaged distribution of Apache Cassandra by Datastax. Here, we'll cover installation using the binary tarball and the source tarball.

Prerequisites

The following are the prerequisites for installing Cassandra:

Hardware requirements: Cassandra employs various caching techniques to enable ultra-fast read operations, so more memory lets Cassandra cache more data and generally leads to better performance. A minimum of 4 GB of memory is recommended for development environments and a minimum of 8 GB for production environments. If our dataset is bigger, we should consider increasing the memory available to Cassandra. We'll discuss tuning Cassandra memory in later chapters. Similarly, more CPU cores help Cassandra perform better, as Cassandra performs its tasks concurrently. For bare-metal hardware, 8-core servers are recommended; for virtualized machines, it's recommended that the CPU cycles allocated to a machine can grow on demand; for example, some vendors, such as Rackspace and Amazon, use CPU bursting. For development environments, a single-disk machine is sufficient; production machines should ideally have at least two disks. One disk is used for the commitlog and the other for storing data files called SSTables, so that these two operations don't contend for I/O. The commitlog file is used by Cassandra to make write requests durable. Every write request is first written to this file in append-only mode and to an in-memory representation of the column family called a memtable.

Java: Cassandra can run on Oracle/Sun JVM, OpenJDK, and IBM JVM. The current stable version of Cassandra requires Java 7 or later. Set your JAVA_HOME environment variable to the correct version of Java if you are using multiple Java versions on your machine.

Python: The current version of Cassandra requires Python 2.6 or above. Cassandra tools, such as cqlsh, are based on Python.

Firewall configurations: Since we are setting up a cluster, let's see which ports are used by Cassandra on various interfaces. If the firewall blocks these ports because we fail to configure them, then our cluster won't function properly. For example, if the internode communication port is blocked, then nodes will not be able to join the cluster.

Let's have a look at the following table:

| Port/Protocol | Configuration file | Configuration name | Firewall setting | Description |
| --- | --- | --- | --- | --- |
| 7000/tcp | cassandra.yaml | storage_port | Open among nodes in the cluster | It acts as an internode communication port in a Cassandra cluster. |
| 7001/tcp | cassandra.yaml | ssl_storage_port | Open among nodes in the cluster | It is an SSL port for encrypted communication among cluster nodes. |
| 9042/tcp | cassandra.yaml | native_transport_port | Between the Cassandra client and the cluster | Cassandra clients, for example cqlsh, or clients using the Java driver use this port to communicate with the Cassandra server. |
| 9160/tcp | cassandra.yaml | rpc_port | Between the Thrift client and the Cassandra cluster | Thrift uses this port for client connections. |
| 7199/tcp | cassandra-env.sh | JMX_PORT | Between the JMX console and the Cassandra cluster | It acts as a JMX console port for monitoring the Cassandra server. |
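To verify that a firewall is not blocking the ports listed in the table above, a small reachability check can help. The following is a minimal sketch using only the Python standard library; the helper names is_port_open and check_node are hypothetical, the port numbers are the ones from the table (they may differ if your configuration files override them), and a failed connection can mean either a firewall block or simply that nothing is listening yet.

```python
import socket

# Ports and configuration names taken from the table above (assumed unchanged in your config).
CASSANDRA_PORTS = {
    7000: "storage_port",
    7001: "ssl_storage_port",
    9042: "native_transport_port",
    9160: "rpc_port",
    7199: "JMX_PORT",
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_node(host, ports=CASSANDRA_PORTS):
    """Map each configuration name to whether its port is reachable on `host`."""
    return {name: is_port_open(host, port) for port, name in ports.items()}


if __name__ == "__main__":
    for name, reachable in check_node("127.0.0.1").items():
        print(name, "reachable" if reachable else "blocked or not listening")
```

Running the check from each machine that needs access (other nodes for the storage ports, client machines for the native transport port) gives a quick picture of which firewall rules still need attention.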

Clock synchronization: Since Cassandra depends heavily on timestamps for data consistency purposes, all nodes of our cluster should be time synchronized, and we should verify that they are. One method for time synchronization is to configure NTP on each node. NTP (Network Time Protocol) is a widely used protocol for clock synchronization of computers over a network.
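To see why unsynchronized clocks are a problem, consider a deliberately simplified model of timestamp-based conflict resolution, in which the version with the larger write timestamp wins. The sketch below is an illustration only, not Cassandra's actual code: the Cell type, the write and reconcile helpers, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import namedtuple

# A toy cell: a value plus the write timestamp (in microseconds) taken from the writer's clock.
Cell = namedtuple("Cell", ["value", "timestamp"])


def reconcile(a, b):
    """Last-write-wins: keep the cell with the larger timestamp (ties keep b, an arbitrary choice)."""
    return a if a.timestamp > b.timestamp else b


def write(real_time_us, value, clock_skew_us=0):
    """Stamp a write using a clock that is off by `clock_skew_us` microseconds."""
    return Cell(value, real_time_us + clock_skew_us)


if __name__ == "__main__":
    # Synchronized clocks: the write that really happened later wins.
    first = write(1_000_000, "old")
    second = write(2_000_000, "new")
    print(reconcile(first, second).value)

    # The first writer's clock runs 5 seconds fast, so its earlier write masks the later one.
    first_skewed = write(1_000_000, "old", clock_skew_us=5_000_000)
    print(reconcile(first_skewed, second).value)
```

In the second case, the stale value survives even though it was written first in real time, which is the kind of inconsistency that keeping all nodes on NTP is meant to avoid.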

Compiling Cassandra from source and installing

The following method of installation is less used. One of the