Mastering Apache Cassandra 3.x - Aaron Ploetz - E-Book

Description

Build, manage, and configure a high-performing, reliable NoSQL database for your applications with Cassandra




Key Features



  • Write programs more efficiently using Cassandra's features with the help of examples



  • Configure Cassandra and fine-tune its parameters depending on your needs



  • Integrate the Cassandra database with Apache Spark and build a strong data analytics pipeline





Book Description



With ever-increasing rates of data creation comes the demand to store data as quickly and reliably as possible. Apache Cassandra is an excellent choice for building fault-tolerant and scalable databases. Mastering Apache Cassandra 3.x teaches you how to build and architect your clusters, configure and work with your nodes, and program in a high-throughput environment, helping you understand the power of Cassandra's new features.







Once you've covered a brief recap of the basics, you'll move on to deploying and monitoring a production setup, and to optimizing and integrating it with other software. You'll work with the advanced features of CQL and the new storage engine in order to understand how they function on the server side. You'll explore the integration and interaction of Cassandra components, followed by a detailed look at features such as the token allocation algorithm, CQL3, vnodes, lightweight transactions, and data modeling. Last but not least, you will get to grips with Apache Spark.







By the end of this book, you'll be able to analyze big data, and build and manage high-performance databases for your application.





What you will learn



  • Write programs more efficiently using Cassandra's features



  • Exploit the given infrastructure, improve performance, and tweak the Java Virtual Machine (JVM)



  • Use CQL3 in your application in order to simplify working with Cassandra



  • Configure Cassandra and fine-tune its parameters depending on your needs



  • Set up a cluster and learn how to scale it



  • Monitor a Cassandra cluster in different ways



  • Use Apache Spark and other big data processing tools






Who this book is for



Mastering Apache Cassandra 3.x is for you if you are a big data administrator, database administrator, architect, or developer who wants to build a high-performing, scalable, and fault-tolerant database. Prior knowledge of core concepts of databases is required.




Mastering Apache Cassandra 3.x Third Edition

 

An expert guide to improving database scalability and availability without compromising performance

 

 

 

 

 

 

 

 

 

 

 

Aaron Ploetz
Tejaswi Malepati
Nishant Neeraj

 

 

 

 

 

 

BIRMINGHAM - MUMBAI

Mastering Apache Cassandra 3.x Third Edition

 

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Pravin Dhandre
Acquisition Editor: Divya Poojari
Content Development Editor: Chris D'cruz
Technical Editor: Suwarna Patil
Copy Editor: Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Tom Scaria
Production Coordinator: Arvindkumar Gupta

First published: October 2013
Second edition: March 2015
Third edition: October 2018

Production reference: 1311018

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78913-149-9

www.packtpub.com

 
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

Foreword

Being asked to write the next edition of Mastering Apache Cassandra was a bit of a tall order. After all, writing a master-level book sort of implies that I have mastered whatever subject the book entails, which, by proxy, means that I should have mastered Apache Cassandra in order to be asked to author such a book. Honestly, that seems pretty far from the truth.

I feel privileged to have been a part of the Apache Cassandra community since 2012. Since then, I've been helping out by answering questions on Stack Overflow, submitting Jira tickets, and, a couple of years later, contributing the first of my patches to the project. During that time I've also written a few articles about Apache Cassandra, been selected as a DataStax MVP (most valuable professional) a few times, and presented at several NoSQL events and conferences.

Talking with other experts at those events has humbled me, as I continue to find aspects of Apache Cassandra that I have yet to fully understand. And that is really the best part. Throughout my career, I have found that maintaining a student mentality has allowed me to continue to grow and get better. While I have managed to develop an understanding of several aspects of Apache Cassandra, there are some areas where I still feel like I am very much a student. In fact, one of the reasons I asked my good friend and co-worker Tejaswi Malepati to help me out with this book is that there are aspects of the Apache Cassandra ecosystem that he understands and can articulate better than I can.

Ultimately, I hope this book helps you to foster your own student mentality. While reading, this book should inspire you to push the bounds of your own knowledge. Throughout, you will find areas in which we have offered tips. These pointers are pieces of advice that can provide further context and understanding, based on real-world experience. Hopefully, these will help to point you in the correct direction and ultimately lead to resolution.

Thank you, and enjoy!

Aaron Ploetz

Lead Engineer, Target Corp. and Cassandra MVP.

Contributors

About the authors

Aaron Ploetz is the NoSQL Engineering Lead for Target, where his DevOps team supports Cassandra, MongoDB, and Neo4j. He has been named a DataStax MVP for Apache Cassandra three times and has presented at multiple events, including the DataStax Summit and Data Day Texas. Aaron earned a BS in Management/Computer Systems from the University of Wisconsin-Whitewater, and an MS in Software Engineering from Regis University. He and his wife, Coriene, live with their three children in the Twin Cities area.

I'd like to thank my wife, Coriene, for all of her support through this endeavor. Sometimes, I think she is more excited about my authoring projects than I am.

 

 

 

Tejaswi Malepati is the Cassandra Tech Lead for Target. He has been instrumental in designing and building custom Cassandra integrations, including web-based SQL interface and data validation frameworks between Oracle and Cassandra. Tejaswi earned a master's degree in computer science from the University of New Mexico, and a bachelor's degree in Electronics and Communication from Jawaharlal Nehru Technological University in India. He is passionate about identifying and analyzing data patterns in datasets using R, Python, Spark, and Cassandra.

A very special thanks to Aaron Ploetz, who provided me with this opportunity to be a co-author. Also, I am grateful to my family, friends, and team for their constant encouragement in completing my first book.

 

 

 

Nishant Neeraj is an independent software developer with experience in developing and planning out architectures for massively scalable data storage and data processing systems. Over the years, he has helped to design and implement a wide variety of products and systems for companies, ranging from small start-ups to large multinational companies. Currently, he helps drive WealthEngine's core product to the next level by leveraging a variety of big data technologies.

About the reviewers

Sourav Gulati has been associated with the software industry for more than 8 years. He started his career with Unix/Linux and Java, then moved into the big data and NoSQL space. He has been designing and implementing big data solutions for the last few years. He is also the co-author of Apache Spark 2.x for Java Developers, published by Packt. Apart from the IT world, he likes to play lawn tennis and to read about mythology.

 

 

Ahmed Sherif is a data scientist who has been working with data in various roles since 2005. He started off with BI solutions and transitioned to data science in 2013. In 2016, he obtained a master's in Predictive Analytics from Northwestern University, where he studied the science and application of ML and predictive modeling using both Python and R. Lately, he has been developing ML and deep learning solutions on the cloud using Azure. In 2016, his first book, Practical Business Intelligence, was published by Packt. He currently works as a technology solution professional in data and AI for Microsoft.

 

 

Amrith Ravindra is a machine learning (ML) enthusiast who holds degrees in electrical and industrial engineering. While pursuing his Master's, he delved deeper into the world of ML and developed a love for data science. Graduate level courses in engineering gave him the mathematical background to launch himself into a career in ML. He met Ahmed Sherif at a local data science meetup in Tampa. They decided to put their brains together to write a book on their favorite ML algorithms. He hopes that this book will help him to achieve his ultimate goal of becoming a data scientist and actively contributing to ML.

 

 

Valentina Crisan is a product architecture consultant in the big data domain and a trainer for big data technologies (Apache Cassandra, Apache Hadoop architecture, and Apache Kafka). With a background in computer science, she has more than 15 years' experience in telecoms, architecting telecom and value-added service solutions, and has headed technical teams over a number of years. Passionate about the opportunities that cloud and data can bring to different domains, for the past 4 years she has been delivering training courses on big data architectures and working on different projects related to these domains.

Swathi Kurunji is a software engineer at Actian Corporation. She has a PhD in computer science from the University of Massachusetts, Lowell, USA. She has worked as a software development intern with IT companies including EMC and SAP. At EMC, she gained experience on Apache Cassandra data modeling and performance analysis. She worked with Wipro Technologies in India as a project engineer managing application servers. She has experience with database systems, such as Apache Cassandra, Sybase IQ, Oracle, MySQL, and MS Access. Her interests include software design and development, big data analysis, the optimization of databases, and cloud computing. She has previously reviewed Cassandra Data Modeling and Analysis published by Packt.

I would like to thank my husband and my family for all their support.

 

 

 

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Mastering Apache Cassandra 3.x Third Edition

Packt Upsell

Why subscribe?

Packt.com

Foreword

Contributors

About the authors

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Get in touch

Reviews

Quick Start

Introduction to Cassandra

High availability

Distributed

Partitioned row store

Installation

Configuration

cassandra.yaml

cassandra-rackdc.properties

Starting Cassandra

Cassandra Cluster Manager

A quick introduction to the data model

Using Cassandra with cqlsh

Shutting down Cassandra

Summary

Cassandra Architecture

Why was Cassandra created?

RDBMS and problems at scale

Cassandra and the CAP theorem

Cassandra's ring architecture

Partitioners

ByteOrderedPartitioner

RandomPartitioner

Murmur3Partitioner

Single token range per node

Vnodes

Cassandra's write path

Cassandra's read path

On-disk storage

SSTables

How data was structured in prior versions

How data is structured in newer versions

Additional components of Cassandra

Gossiper

Snitch

Phi failure-detector

Tombstones

Hinted handoff

Compaction

Repair

Merkle tree calculation

Streaming data

Read repair

Security

Authentication

Authorization

Managing roles

Client-to-node SSL

Node-to-node SSL

Summary

Effective CQL

An overview of Cassandra data modeling

Cassandra storage model for early versions up to 2.2

Cassandra storage model for versions 3.0 and beyond

Data cells

cqlsh

Logging into cqlsh

Problems connecting to cqlsh

Local cluster without security enabled

Remote cluster with user security enabled

Remote cluster with auth and SSL enabled

Connecting with cqlsh over SSL

Converting the Java keyStore into a PKCS12 keyStore

Exporting the certificate from the PKCS12 keyStore

Modifying your cqlshrc file

Testing your connection via cqlsh

Getting started with CQL

Creating a keyspace

Single data center example

Multi-data center example

Creating a table

Simple table example

Clustering key example

Composite partition key example

Table options

Data types

Type conversion

The primary key

Designing a primary key

Selecting a good partition key

Selecting a good clustering key

Querying data

The IN operator

Writing data

Inserting data

Updating data

Deleting data

Lightweight transactions

Executing a BATCH statement

The expiring cell

Altering a keyspace

Dropping a keyspace

Altering a table

Truncating a table

Dropping a table

Truncate versus drop

Creating an index

Caution with implementing secondary indexes

Dropping an index

Creating a custom data type

Altering a custom type

Dropping a custom type

User management

Creating a user and role

Altering a user and role

Dropping a user and role

Granting permissions

Revoking permissions

Other CQL commands

COUNT

DISTINCT

LIMIT

STATIC

User-defined functions

cqlsh commands

CONSISTENCY

COPY

DESCRIBE

TRACING

Summary

Configuring a Cluster

Evaluating instance requirements

RAM

CPU

Disk

Solid state drives

Cloud storage offerings

SAN and NAS

Network

Public cloud networks

Firewall considerations

Strategy for many small instances versus few large instances

Operating system optimizations

Disable swap

XFS

Limits

limits.conf

sysctl.conf

Time synchronization

Configuring the JVM

Garbage collection

CMS

G1GC

Garbage collection with Cassandra

Installation of JVM

JCE

Configuring Cassandra

cassandra.yaml

cassandra-env.sh

cassandra-rackdc.properties

dc

rack

dc_suffix

prefer_local

cassandra-topology.properties

jvm.options

logback.xml

Managing a deployment pipeline

Orchestration tools

Configuration management tools

Recommended approach

Local repository for downloadable files

Summary

Performance Tuning

Cassandra-Stress

The Cassandra-Stress YAML file

name

size

population

cluster

Cassandra-Stress results

Write performance

Commitlog mount point

Scaling out

Scaling out a data center

Read performance

Compaction strategy selection

Optimizing read throughput for time-series models

Optimizing tables for read-heavy models

Cache settings

Appropriate uses for row-caching

Compression

Chunk size

The bloom filter configuration

Read performance issues

Other performance considerations

JVM configuration

Cassandra anti-patterns

Building a queue

Query flexibility

Querying an entire table

Incorrect use of BATCH

Network

Summary

Managing a Cluster

Revisiting nodetool

A warning about using nodetool

Scaling up

Adding nodes to a cluster

Cleaning up the original nodes

Adding a new data center

Adjusting the cassandra-rackdc.properties file

A warning about SimpleStrategy

Streaming data

Scaling down

Removing nodes from a cluster

Removing a live node

Removing a dead node

Other removenode options

When removenode doesn't work (nodetool assassinate)

Assassinating a node on an older version

Removing a data center

Backing up and restoring data

Taking snapshots

Enabling incremental backups

Recovering from snapshots

Maintenance

Replacing a node

Repair

A warning about incremental repairs

Cassandra Reaper

Forcing read repairs at consistency – ALL

Clearing snapshots and incremental backups

Snapshots

Incremental backups

Compaction

Why you should never invoke compaction manually

Adjusting compaction throughput due to available resources

Summary

Monitoring

JMX interface

MBean packages exposed by Cassandra

JConsole (GUI)

Connection and overview

Viewing metrics

Performing an operation

JMXTerm (CLI)

Connection and domains

Getting a metric

Performing an operation

The nodetool utility

Monitoring using nodetool

describecluster

gcstats

getcompactionthreshold

getcompactionthroughput

getconcurrentcompactors

getendpoints

getlogginglevels

getstreamthroughput

gettimeout

gossipinfo

info

netstats

proxyhistograms

status

tablestats

tpstats

verify

Administering using nodetool

cleanup

drain

flush

resetlocalschema

stopdaemon

truncatehints

upgradeSSTable

Metric stack

Telegraf

Installation

Configuration

JMXTrans

Installation

Configuration

InfluxDB

Installation

Configuration

InfluxDB CLI

Grafana

Installation

Configuration

Visualization

Alerting

Custom setup

Log stack

The system/debug/gc logs

Filebeat

Installation

Configuration

Elasticsearch

Installation

Configuration

Kibana

Installation

Configuration

Troubleshooting

High CPU usage

Different garbage-collection patterns

Hotspots

Disk performance

Node flakiness

All-in-one Docker

Creating a database and other monitoring components locally

Web links

Summary

Application Development

Getting started

The path to failure

Is Cassandra the right database?

Good use cases for Apache Cassandra

Use and expectations around application data consistency

Choosing the right driver

Building a Java application

Driver dependency configuration with Apache Maven

Connection class

Other connection options

Retry policy

Default keyspace

Port

SSL

Connection pooling options

Starting simple – Hello World!

Using the object mapper

Building a data loader

Asynchronous operations

Data loader example

Summary

Integration with Apache Spark

Spark

Architecture

Installation

Running custom Spark Docker locally

Configuration

The web UI

Master

Worker

Application

PySpark

Connection config

Accessing Cassandra data

SparkR

Connection config

Accessing Cassandra data

RStudio

Connection config

Accessing Cassandra data

Jupyter

Architecture

Installation

Configuration

Web UI

PySpark through Jupyter

Summary

References

Chapter 1 – Quick Start

Chapter 2 – Cassandra Architecture

Chapter 3 – Effective CQL

Chapter 4 – Configuring a Cluster

Chapter 5 – Performance Tuning

Chapter 6 – Managing a Cluster

Chapter 7 – Monitoring

Chapter 8 – Application Development

Chapter 9 – Integration with Apache Spark

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

This book is intended to help you understand the Apache Cassandra NoSQL database. It will describe procedures and methods for configuring, installing, and maintaining a high-performing Cassandra cluster. This book can serve as a reference for some of the more obscure configuration parameters and commands that may be needed to operate and maintain a cluster. Also, tools and methods will be suggested for integrating Cassandra with Apache Spark, as well as practices for building Java applications that use Cassandra.

Who this book is for

This book is intended for a DBA who is responsible for supporting Apache Cassandra clusters. It may also benefit full-stack engineers at smaller companies. While these individuals primarily build applications, they may also have to maintain and support their Cassandra cluster out of necessity.

What this book covers

Chapter 1, Quick Start, walks the reader through getting started with Apache Cassandra. As the title suggests, explanations will be brief in favor of guiding the reader toward quickly standing up an Apache Cassandra single-node cluster.

Chapter 2, Cassandra Architecture, covers the ideas and theories behind how Apache Cassandra works. These concepts will be useful going forward, as an understanding of Cassandra's inner workings can help in building high-performing data models.

Chapter 3, Effective CQL, introduces the reader to CQL, the Cassandra Query Language. It describes building appropriate data models and how to leverage CQL to get the most out of your cluster.

Chapter 4, Configuring a Cluster, details the configuration files and settings that go into building an Apache Cassandra Cluster. In addition, this chapter also describes the effects that some of the settings have, and how they can be used to keep your cluster running well.

Chapter 5, Performance Tuning, discusses the extra settings, configurations, and design considerations that can help to improve performance or mitigate issues.

Chapter 6, Managing a Cluster, goes into detail when describing the nodetool utility, and how it can be used for operations on an Apache Cassandra cluster. Adding and removing nodes is covered, as well as taking and restoring from backups.

Chapter 7, Monitoring, describes how to integrate a technology stack that provides a window into an Apache Cassandra cluster's history and performance metrics.

Chapter 8, Application Development, takes the reader through design considerations around coding Java applications to work with an Apache Cassandra cluster.

Chapter 9, Integration with Apache Spark, talks about installing and using Apache Spark in order to analyze and discover value in your data.

Appendix A, References, provides links for the various references found throughout the book.

To get the most out of this book

This book assumes that you have access to hardware on which you can install, configure, and code against an Apache Cassandra instance. Having elevated admin or sudo privileges on the aforementioned machine will be essential to carrying out some of the tasks described.

This book is written from the perspective of running Apache Cassandra on a macOS or Linux instance. As OS-specific system administration is not within the scope of this book, readers who are new to Linux may find value in seeking out a separate tutorial prior to attempting some of the examples.

The Java coding examples will be easier to do from within an integrated development environment (IDE), with Apache Maven installed for dependency management. You may need to look up additional resources to ensure that these components are configured properly. Several IDEs have a plugin that allows for direct integration with Apache Maven.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packt.com.

Select the SUPPORT tab.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Apache-Cassandra-3.x-Third-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "This will store the PID of the Cassandra process in a file named cassandra.pid in the local/cassandra directory."

A block of code is set as follows:

<dependencies>
  <dependency>
    <groupId>com.datastax.cassandra</groupId>
    <artifactId>cassandra-driver-core</artifactId>
    <version>3.6.0</version>
  </dependency>
</dependencies>

Any command-line input or output is written as follows:

cassdba@cqlsh> use packt;

cassdba@cqlsh:packt>

Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. 

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Quick Start

Welcome to the world of Apache Cassandra! In this first chapter, we will briefly introduce Cassandra, along with a quick, step-by-step process to get your own single-node cluster up and running. Even if you already have experience working with Cassandra, this chapter will help to provide assurance that you are building everything properly. If this is your first foray into Cassandra, then get ready to take your first steps into a larger world.

In this chapter, we will cover the following topics:

Introduction to Cassandra

Installation and configuration

Starting up and shutting down Cassandra

Cassandra Cluster Manager (CCM)

By the end of this chapter, you will have built a single-node cluster of Apache Cassandra. This will be a good exercise to help you start to see some of the configuration and thought that goes into building a larger cluster. As this chapter progresses and the material gets more complex, you will start to connect the dots and understand exactly what is happening between installation, operation, and development.

Introduction to Cassandra

Apache Cassandra is a highly available, distributed, partitioned row store. It is one of the more popular NoSQL databases used by both small and large companies all over the world to store and efficiently retrieve large amounts of data. While there are licensed, proprietary versions available (which include enterprise support), Cassandra is also a top-level project of the Apache Software Foundation, and has deep roots in the open source community. This makes Cassandra a proven and battle-tested approach to scaling high-throughput applications.

High availability

Cassandra's design is premised on the points outlined in the Dynamo: Amazon's Highly Available Key-value Store paper (https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). Specifically, when you have large networks of interconnected hardware, something is always in a state of failure. In reality, every piece of hardware being in a healthy state is the exception, rather than the rule. Therefore, it is important that a data storage system is able to deal with (and account for) issues such as network or disk failure.

Depending on the Replication Factor (RF) and required consistency level, a Cassandra cluster is capable of sustaining operations with one or two nodes in a failure state. For example, let's assume that a cluster with a single data center has a keyspace configured for an RF of three. This means that the cluster contains three copies of each row of data. If an application queries with a consistency level of one, then it can still function properly with one or two nodes in a down state.
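As a quick, hedged sketch of how these two settings appear in practice (the keyspace and data center names here are assumptions for illustration, not examples from later chapters), the RF is declared once when a keyspace is created, while the consistency level is chosen by the client per session or per query, for instance in cqlsh:

CREATE KEYSPACE app_data WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '3'};

CONSISTENCY ONE;

With three replicas and queries running at a consistency level of ONE, any single healthy replica can satisfy a request, which is what allows the cluster to keep serving traffic while a node or two is down.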

Distributed

Cassandra is known as a distributed database. A Cassandra cluster is a collection of nodes (individual instances running Cassandra) all working together to serve the same dataset. Nodes can also be grouped together into logical data centers. This is useful for providing data locality for an application or service layer, as well as for working with Cassandra instances that have been deployed in different regions of a public cloud.

Cassandra clusters can scale out to suit both an expanding disk footprint and higher operational throughput. Essentially, as nodes are added, each node becomes responsible for a smaller percentage of the total data size. Assuming that the 500 GB disks of a six-node cluster (RF of three) start to reach their maximum capacity, adding three more nodes (for a total of nine) accomplishes the following:

Brings the total disk available to the cluster up from 3 TB to 4.5 TB

The percentage of data that each node is responsible for drops from 50% down to 33%

Additionally, let's assume that before the expansion of the cluster (from the prior example), the cluster was capable of supporting 5,000 operations per second. Cassandra's operational throughput scales linearly with the number of nodes. After increasing the cluster from six nodes to nine, the cluster should then be expected to support 7,500 operations per second.

Partitioned row store

In Cassandra, rows of data are stored in tables based on the hashed value of the partition key, called a token. Each node in the cluster is assigned multiple token ranges, and rows are stored on nodes that are responsible for their tokens.

Each keyspace (collection of tables) can be assigned an RF. The RF designates how many copies of each row should be stored in each data center. If a keyspace has an RF of three, then each node is assigned primary, secondary, and tertiary token ranges. As data is written, it is written to all of the nodes that are responsible for its token.
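A quick way to see this hashing in action is CQL's token() function, which returns the token that the partitioner computes for a row's partition key. As a minimal sketch, assuming a hypothetical table named users with a partition key column user_id:

SELECT user_id, token(user_id) FROM users;

Rows whose partition keys hash into the same token range are stored on the same set of replica nodes.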

Installation

To get started with Cassandra quickly, we'll step through a single-node, local installation.

The following are the requirements to run Cassandra locally:

A flavor of Linux or macOS

A system with between 4 GB and 16 GB of random access memory (RAM)

A local installation of the Java Development Kit (JDK) version 8, latest patch

A local installation of Python 2.7 (for cqlsh)

Your user must have sudo rights to your local system

While you don't need to have sudo rights to run Apache Cassandra, it is required for some of the operating system configurations.
Apache Cassandra 3.11.2 breaks with JDK 1.8.0_161. Make sure to use either an older or newer version of the JDK.

Head to the Apache download site for the Cassandra project (http://cassandra.apache.org/download/), choose 3.11.2, and select a mirror to download that version of Cassandra. When complete, copy the .tar.gz file to a location that your user has read and write permissions for. This example will assume that this is going to be the ~/local/ directory:

mkdir ~/local

cd ~/local

cp ~/Downloads/apache-cassandra-3.11.2-bin.tar.gz .

Untar the file to create your cassandra directory:

tar -zxvf apache-cassandra-3.11.2-bin.tar.gz

Some people prefer to rename this directory, like so:

mv apache-cassandra-3.11.2/ cassandra/

Configuration

At this point, you could start your node with no further configuration. However, it is good to get into the habit of checking and adjusting the properties that are indicated as follows.

cassandra.yaml

It is usually a good idea to rename your cluster. Inside the conf/cassandra.yaml file, specify a new cluster_name property, overwriting the default Test Cluster:

cluster_name: 'PermanentWaves'

The num_tokens property default of 256 has proven to be too high for the newer, 3.x versions of Cassandra. Go ahead and set that to 24:

num_tokens: 24

To enable user security, change the authenticator and authorizer properties (from their defaults) to the following values:

authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer

Cassandra installs with all security disabled by default. Even if you are not concerned with security on your local system, it makes sense to enable it to get used to working with authentication and authorization from a development perspective.
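Once the node is running (starting Cassandra is covered shortly), cqlsh will require credentials. As a quick, hedged example, a fresh installation with PasswordAuthenticator enabled ships with a default superuser named cassandra whose password is cassandra, so a first login could look like this:

bin/cqlsh 192.168.0.101 -u cassandra -p cassandra

It is common practice to replace this default superuser with your own administrative role once security is in place; user and role management is covered in Chapter 3, Effective CQL.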

By default, Cassandra will come up bound to localhost or 127.0.0.1. For your own local development machine, this is probably fine. However, if you want to build a multi-node cluster, you will want to bind to your machine's IP address. For this example, I will use 192.168.0.101. To configure the node to bind to this IP, adjust the listen_address and rpc_address properties:

listen_address: 192.168.0.101
rpc_address: 192.168.0.101

If you set listen_address and rpc_address, you'll also need to adjust your seed list (which defaults to 127.0.0.1):

seeds: 192.168.0.101

I will also adjust my endpoint_snitch property to use GossipingPropertyFileSnitch:

endpoint_snitch: GossipingPropertyFileSnitch

cassandra-rackdc.properties

In terms of NoSQL databases, Apache Cassandra handles multi-data center awareness better than any other. To configure this, each node must use GossipingPropertyFileSnitch (as previously mentioned in the preceding cassandra.yaml configuration process) and must have its local data center (and rack) settings defined. Therefore, I will set the dc and rack properties in the conf/cassandra-rackdc.properties file:

dc=ClockworkAngels
rack=R40

Starting Cassandra

To start Cassandra locally, execute the Cassandra script. With no arguments, it will run in the background; pass the -f flag to keep it in the foreground. To record the process ID, pass the -p flag with a destination file for the Process ID (PID):

cd cassandra

bin/cassandra -p cassandra.pid

This will store the PID of the Cassandra process in a file named cassandra.pid in the local/cassandra directory. Several messages will be dumped to the screen. The node is successfully running when you see this message:

Starting listening for CQL clients on localhost/192.168.0.101:9042 (unencrypted).

This can also be verified with the nodetool status command:

bin/nodetool status

Datacenter: ClockworkAngels

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address Load Tokens Owns (effective) Host ID Rack

UN 192.168.0.101 71.26 KiB 24 100.0% 0edb5efa... R40

Cassandra Cluster Manager

If you want an even faster way to install Cassandra, you can use an open source tool called CCM. CCM installs Cassandra for you, with very minimal configuration. In addition to ease of installation, CCM also allows you to run multiple Cassandra nodes locally.

First, let's clone the CCM repository from GitHub, and cd into the directory:

git clone https://github.com/riptano/ccm.git
cd ccm

Next, we'll run the setup program to install CCM:

sudo ./setup.py install
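With CCM installed, a local multi-node cluster can be created and started with a single command. The following is a minimal sketch (the cluster name, Cassandra version, and node count are assumptions chosen for this example):

ccm create demo_cluster -v 3.11.2 -n 3 -s

Here, -v tells CCM which version of Cassandra to download, -n sets the number of nodes, and -s starts the nodes as soon as they are created.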

To verify that my local cluster is working, I'll invoke nodetool status via node1:

ccm node1 status

Datacenter: datacenter1

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address Load Tokens Owns (effective) Host ID Rack

UN 127.0.0.1 100.56 KiB 1 66.7% 49ecc8dd... rack1

UN 127.0.0.2 34.81 KiB 1 66.7% 404a8f97... rack1

UN 127.0.0.3 34.85 KiB 1 66.7% eed33fc5... rack1

To shut down your cluster, go ahead and send the stop command to each node:

ccm stop node1

ccm stop node2

ccm stop node3

Note that CCM requires a working installation of Python 2.7 or later, as well as a few additional libraries (pyYAML, six, ant, and psutil), and local IPs 127.0.0.1 through 127.0.0.3 to be available. Visit https://github.com/riptano/ccm for more information.

Using CCM actually changes many of the commands that we will follow in this book. While it is a great tool for quickly spinning up a small cluster for demonstration purposes, it can complicate the process of learning how to use Cassandra.

A quick introduction to the data model

Now that we have a Cassandra cluster running on our local machine, we will demonstrate its use with some quick examples. We will start with cqlsh, and use that as our primary means of working with the Cassandra data model.
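A detailed treatment of CQL comes in Chapter 3, Effective CQL; for now, the following is a minimal sketch of what such a session can look like (the keyspace, table, and sample row are illustrative assumptions, with the data center name matching the earlier cassandra-rackdc.properties example):

bin/cqlsh 192.168.0.101 -u cassandra -p cassandra

CREATE KEYSPACE packt WITH replication = {'class': 'NetworkTopologyStrategy', 'ClockworkAngels': '1'};

USE packt;

CREATE TABLE hello_world (id TEXT PRIMARY KEY, greeting TEXT);

INSERT INTO hello_world (id, greeting) VALUES ('en', 'Hello, world!');

SELECT * FROM hello_world WHERE id = 'en';

Using NetworkTopologyStrategy with an explicit data center name is what allows the same keyspace definition to carry over cleanly to a multi-data center cluster later on.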

Shutting down Cassandra

Before shutting down your cluster instances, there are some additional commands that should be run. Again, with your own local node(s), these are not strictly necessary. But it is a good idea to get used to running them, should you ever need to properly shut down a production node that may contain data that people actually care about.

First, we will disable gossip. This keeps other nodes from communicating with the node while we are trying to bring it down:

bin/nodetool disablegossip

Next, we will disable the native binary protocol to keep this node from serving client requests:

bin/nodetool disablebinary

Then, we will drain the node. This will prevent it from accepting writes, and force all in-memory data to be written to disk:

bin/nodetool drain

With the node drained, we can kill the PID:

kill $(cat cassandra.pid)

We can verify that the node has stopped by tailing the log:

tail logs/system.log

INFO [RMI TCP Connection(2)-127.0.0.1] 2018-03-31 17:49:05,789 StorageService.java:2292 - Node localhost/192.168.0.101 state jump to shutdown

INFO [RMI TCP Connection(4)-127.0.0.1] 2018-03-31 17:49:49,492 Server.java:176 - Stop listening for CQL clients

INFO [RMI TCP Connection(6)-127.0.0.1] 2018-03-31 17:50:11,312 StorageService.java:1449 - DRAINING: starting drain process

INFO [RMI TCP Connection(6)-127.0.0.1] 2018-03-31 17:50:11,313 HintsService.java:220 - Paused hints dispatch

INFO [RMI TCP Connection(6)-127.0.0.1] 2018-03-31 17:50:11,314 Gossiper.java:1540 - Announcing shutdown

INFO [RMI TCP Connection(6)-127.0.0.1] 2018-03-31 17:50:11,314 StorageService.java:2292 - Node localhost/192.168.0.101 state jump to shutdown

INFO [RMI TCP Connection(6)-127.0.0.1] 2018-03-31 17:50:13,315 MessagingService.java:984 - Waiting for messaging service to quiesce

INFO [ACCEPT-localhost/192.168.0.101] 2018-03-31 17:50:13,316 MessagingService.java:1338 - MessagingService has terminated the accept() thread

INFO [RMI TCP Connection(6)-127.0.0.1] 2018-03-31 17:50:14,764 HintsService.java:220 - Paused hints dispatch

INFO [RMI TCP Connection(6)-127.0.0.1] 2018-03-31 17:50:14,861 StorageService.java:1449 - DRAINED

Summary

In this chapter, we introduced Apache Cassandra and some of its design considerations and components. These aspects were discussed and a high-level description of each was given, as well as how each affects things such as cluster layout and data storage. Additionally, we built our own local, single-node cluster. CCM was also introduced, with minimal discussion. Some basic commands with Cassandra's nodetool were introduced and put to use.

With a single-node cluster running, the cqlsh tool was introduced. We created a keyspace that will work in a multi-data center configuration. The concept of query tables was also introduced, as well as running some simple read and write operations.

In the next chapter, we will take an in-depth look at Cassandra's underlying architecture, and understand what is key to making good decisions about cluster deployment and data modeling. From there, we'll discuss various aspects to help fine-tune a production cluster and its deployment process. That will bring us to monitoring and application development, and put you well on your way to mastering Cassandra!

Cassandra Architecture

In this chapter, we will discuss the architecture behind Apache Cassandra in detail. We will discuss how Cassandra was designed and how it adheres to Brewer's CAP theorem, which will give us insight into the reasons for its behavior. Specifically, this chapter will cover:

Problems that Cassandra was designed to solve

Cassandra's read and write paths

The role that horizontal scaling plays

How data is stored on-disk

How Cassandra handles failure scenarios

This chapter will help you to build a good foundation of understanding that will prove very helpful later on. Knowing how Apache Cassandra works under the hood helps with later operational tasks. Building high-performing, scalable data models also requires an understanding of the architecture, and your architecture can be the difference between an unsuccessful and a successful cluster.

Why was Cassandra created?

Understanding how Apache Cassandra works under the hood can greatly improve your chances of running a successful cluster or application. We will reach that understanding by asking some simple, fundamental questions. What types of problems was Cassandra designed to solve? Why does a relational database management system (RDBMS) have difficulty handling those problems? If Cassandra works this way, how should I design my data model to achieve the best performance?

RDBMS and problems at scale

As the internet grew in popularity around the turn of the century, the systems behind internet architecture began to change. When good ideas were built into popular websites, user traffic increased exponentially. It was not uncommon in 2001 for too much web traffic to be the reason that a popular site was slow or that a web server went down. Web architects quickly figured out that they could build out multiple instances of their website or application, and distribute traffic with load balancers.