Build, manage, and configure high-performing, reliable NoSQL database for your applications with Cassandra
Key Features
Book Description
With ever-increasing rates of data creation, the demand for storing data quickly and reliably has become a necessity. Apache Cassandra is an excellent choice for building fault-tolerant and scalable databases. Mastering Apache Cassandra 3.x teaches you how to build and architect your clusters, configure and work with your nodes, and program in a high-throughput environment, helping you understand the power of Cassandra and its newer features.
Once you've covered a brief recap of the basics, you'll move on to deploying and monitoring a production setup, and to optimizing it and integrating it with other software. You'll work with the advanced features of CQL and the new storage engine in order to understand how they function on the server side. You'll explore the integration and interaction of Cassandra components, and then discover features such as the token allocation algorithm, CQL3, vnodes, lightweight transactions, and data modeling in detail. Last but not least, you will get to grips with Apache Spark.
By the end of this book, you'll be able to analyze big data, and build and manage high-performance databases for your application.
What you will learn
Who this book is for
Mastering Apache Cassandra 3.x is for you if you are a big data administrator, database administrator, architect, or developer who wants to build a high-performing, scalable, and fault-tolerant database. Prior knowledge of core concepts of databases is required.
You can read this e-book in Legimi apps or in any app that supports the following format:
Page count: 382
Publication year: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Divya Poojari
Content Development Editor: Chris D'cruz
Technical Editor: Suwarna Patil
Copy Editor: Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Tom Scaria
Production Coordinator: Arvindkumar Gupta
First published: October 2013
Second Edition: March 2015
Third Edition: October 2018
Production reference: 1311018
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78913-149-9
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Being asked to write the next edition of Mastering Apache Cassandra was a bit of a tall order. After all, writing a master-level book sort of implies that I have mastered whatever subject the book entails, which, by proxy, means that I should have mastered Apache Cassandra in order to be asked to author such a book. Honestly, that seems pretty far from the truth.
I feel privileged to have been a part of the Apache Cassandra community since 2012. I've been helping out by answering questions on Stack Overflow since then, as well as submitting Jira tickets, and the first of my patches to the project a couple of years later. During that time I've also written a few articles (about Apache Cassandra), managed to be selected as a DataStax MVP (most valuable professional) a few times, and have presented at several NoSQL events and conferences.
Talking with other experts at those events has humbled me, as I continue to find aspects of Apache Cassandra that I have yet to fully understand. And that is really the best part. Throughout my career, I have found that maintaining a student mentality has allowed me to continue to grow and get better. While I have managed to develop an understanding of several aspects of Apache Cassandra, there are some areas where I still feel like I am very much a student. In fact, one of the reasons I asked my good friend and co-worker Tejaswi Malepati to help me out with this book, is that there are aspects of the Apache Cassandra ecosystem that he understands and can articulate better than I.
Ultimately, I hope this book helps you to foster your own student mentality. While reading, this book should inspire you to push the bounds of your own knowledge. Throughout, you will find areas in which we have offered tips. These pointers are pieces of advice that can provide further context and understanding, based on real-world experience. Hopefully, these will help to point you in the correct direction and ultimately lead to resolution.
Thank you, and enjoy!
Aaron Ploetz
Lead Engineer, Target Corp. and Cassandra MVP.
Aaron Ploetz is the NoSQL Engineering Lead for Target, where his DevOps team supports Cassandra, MongoDB, and Neo4j. He has been named a DataStax MVP for Apache Cassandra three times and has presented at multiple events, including the DataStax Summit and Data Day Texas. Aaron earned a BS in Management/Computer Systems from the University of Wisconsin-Whitewater, and an MS in Software Engineering from Regis University. He and his wife, Coriene, live with their three children in the Twin Cities area.
Tejaswi Malepati is the Cassandra Tech Lead for Target. He has been instrumental in designing and building custom Cassandra integrations, including web-based SQL interface and data validation frameworks between Oracle and Cassandra. Tejaswi earned a master's degree in computer science from the University of New Mexico, and a bachelor's degree in Electronics and Communication from Jawaharlal Nehru Technological University in India. He is passionate about identifying and analyzing data patterns in datasets using R, Python, Spark, and Cassandra.
Nishant Neeraj is an independent software developer with experience in developing and planning out architectures for massively scalable data storage and data processing systems. Over the years, he has helped to design and implement a wide variety of products and systems for companies, ranging from small start-ups to large multinational companies. Currently, he helps drive WealthEngine's core product to the next level by leveraging a variety of big data technologies.
Sourav Gulati has been associated with the software industry for more than 8 years. He started his career with Unix/Linux and Java, and then moved into the big data and NoSQL space. He has been designing and implementing big data solutions for the last few years. He is also the co-author of Apache Spark 2.x for Java Developers, published by Packt. Outside of the IT world, he likes to play lawn tennis and read about mythology.
Ahmed Sherif is a data scientist who has been working with data in various roles since 2005. He started off with BI solutions and transitioned to data science in 2013. In 2016, he obtained a master's in Predictive Analytics from Northwestern University, where he studied the science and application of ML and predictive modeling using both Python and R. Lately, he has been developing ML and deep learning solutions on the cloud using Azure. In 2016, his first book, Practical Business Intelligence, was published by Packt. He currently works as a technology solution professional in data and AI for Microsoft.
Amrith Ravindra is a machine learning (ML) enthusiast who holds degrees in electrical and industrial engineering. While pursuing his Master's, he delved deeper into the world of ML and developed a love for data science. Graduate level courses in engineering gave him the mathematical background to launch himself into a career in ML. He met Ahmed Sherif at a local data science meetup in Tampa. They decided to put their brains together to write a book on their favorite ML algorithms. He hopes that this book will help him to achieve his ultimate goal of becoming a data scientist and actively contributing to ML.
Valentina Crisan is a product architecture consultant in the big data domain and a trainer for big data technologies (Apache Cassandra, Apache Hadoop architecture, and Apache Kafka). With a background in computer science, she has more than 15 years' experience in telecoms, architecting telecom and value-added service solutions, and has headed technical teams for a number of years. Passionate about the opportunities that cloud and data can bring to different domains, she has spent the past 4 years delivering training courses on big data architectures and working on projects related to these domains.
Swathi Kurunji is a software engineer at Actian Corporation. She has a PhD in computer science from the University of Massachusetts, Lowell, USA. She has worked as a software development intern with IT companies including EMC and SAP. At EMC, she gained experience on Apache Cassandra data modeling and performance analysis. She worked with Wipro Technologies in India as a project engineer managing application servers. She has experience with database systems, such as Apache Cassandra, Sybase IQ, Oracle, MySQL, and MS Access. Her interests include software design and development, big data analysis, the optimization of databases, and cloud computing. She has previously reviewed Cassandra Data Modeling and Analysis published by Packt.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Mastering Apache Cassandra 3.x Third Edition
Packt Upsell
Why subscribe?
Packt.com
Foreword
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Reviews
Quick Start
Introduction to Cassandra
High availability
Distributed
Partitioned row store
Installation
Configuration
cassandra.yaml
cassandra-rackdc.properties
Starting Cassandra
Cassandra Cluster Manager
A quick introduction to the data model
Using Cassandra with cqlsh
Shutting down Cassandra
Summary
Cassandra Architecture
Why was Cassandra created?
RDBMS and problems at scale
Cassandra and the CAP theorem
Cassandra's ring architecture
Partitioners
ByteOrderedPartitioner
RandomPartitioner
Murmur3Partitioner
Single token range per node
Vnodes
Cassandra's write path
Cassandra's read path
On-disk storage
SSTables
How data was structured in prior versions
How data is structured in newer versions
Additional components of Cassandra
Gossiper
Snitch
Phi failure-detector
Tombstones
Hinted handoff
Compaction
Repair
Merkle tree calculation
Streaming data
Read repair
Security
Authentication
Authorization
Managing roles
Client-to-node SSL
Node-to-node SSL
Summary
Effective CQL
An overview of Cassandra data modeling
Cassandra storage model for early versions up to 2.2
Cassandra storage model for versions 3.0 and beyond
Data cells
cqlsh
Logging into cqlsh
Problems connecting to cqlsh
Local cluster without security enabled
Remote cluster with user security enabled
Remote cluster with auth and SSL enabled
Connecting with cqlsh over SSL
Converting the Java keyStore into a PKCS12 keyStore
Exporting the certificate from the PKCS12 keyStore
Modifying your cqlshrc file
Testing your connection via cqlsh
Getting started with CQL
Creating a keyspace
Single data center example
Multi-data center example
Creating a table
Simple table example
Clustering key example
Composite partition key example
Table options
Data types
Type conversion
The primary key
Designing a primary key
Selecting a good partition key
Selecting a good clustering key
Querying data
The IN operator
Writing data
Inserting data
Updating data
Deleting data
Lightweight transactions
Executing a BATCH statement
The expiring cell
Altering a keyspace
Dropping a keyspace
Altering a table
Truncating a table
Dropping a table
Truncate versus drop
Creating an index
Caution with implementing secondary indexes
Dropping an index
Creating a custom data type
Altering a custom type
Dropping a custom type
User management
Creating a user and role
Altering a user and role
Dropping a user and role
Granting permissions
Revoking permissions
Other CQL commands
COUNT
DISTINCT
LIMIT
STATIC
User-defined functions
cqlsh commands
CONSISTENCY
COPY
DESCRIBE
TRACING
Summary
Configuring a Cluster
Evaluating instance requirements
RAM
CPU
Disk
Solid state drives
Cloud storage offerings
SAN and NAS
Network
Public cloud networks
Firewall considerations
Strategy for many small instances versus few large instances
Operating system optimizations
Disable swap
XFS
Limits
limits.conf
sysctl.conf
Time synchronization
Configuring the JVM
Garbage collection
CMS
G1GC
Garbage collection with Cassandra
Installation of JVM
JCE
Configuring Cassandra
cassandra.yaml
cassandra-env.sh
cassandra-rackdc.properties
dc
rack
dc_suffix
prefer_local
cassandra-topology.properties
jvm.options
logback.xml
Managing a deployment pipeline
Orchestration tools
Configuration management tools
Recommended approach
Local repository for downloadable files
Summary
Performance Tuning
Cassandra-Stress
The Cassandra-Stress YAML file
name
size
population
cluster
Cassandra-Stress results
Write performance
Commitlog mount point
Scaling out
Scaling out a data center
Read performance
Compaction strategy selection
Optimizing read throughput for time-series models
Optimizing tables for read-heavy models
Cache settings
Appropriate uses for row-caching
Compression
Chunk size
The bloom filter configuration
Read performance issues
Other performance considerations
JVM configuration
Cassandra anti-patterns
Building a queue
Query flexibility
Querying an entire table
Incorrect use of BATCH
Network
Summary
Managing a Cluster
Revisiting nodetool
A warning about using nodetool
Scaling up
Adding nodes to a cluster
Cleaning up the original nodes
Adding a new data center
Adjusting the cassandra-rackdc.properties file
A warning about SimpleStrategy
Streaming data
Scaling down
Removing nodes from a cluster
Removing a live node
Removing a dead node
Other removenode options
When removenode doesn't work (nodetool assassinate)
Assassinating a node on an older version
Removing a data center
Backing up and restoring data
Taking snapshots
Enabling incremental backups
Recovering from snapshots
Maintenance
Replacing a node
Repair
A warning about incremental repairs
Cassandra Reaper
Forcing read repairs at consistency – ALL
Clearing snapshots and incremental backups
Snapshots
Incremental backups
Compaction
Why you should never invoke compaction manually
Adjusting compaction throughput due to available resources
Summary
Monitoring
JMX interface
MBean packages exposed by Cassandra
JConsole (GUI)
Connection and overview
Viewing metrics
Performing an operation
JMXTerm (CLI)
Connection and domains
Getting a metric
Performing an operation
The nodetool utility
Monitoring using nodetool
describecluster
gcstats
getcompactionthreshold
getcompactionthroughput
getconcurrentcompactors
getendpoints
getlogginglevels
getstreamthroughput
gettimeout
gossipinfo
info
netstats
proxyhistograms
status
tablestats
tpstats
verify
Administering using nodetool
cleanup
drain
flush
resetlocalschema
stopdaemon
truncatehints
upgradeSSTable
Metric stack
Telegraf
Installation
Configuration
JMXTrans
Installation
Configuration
InfluxDB
Installation
Configuration
InfluxDB CLI
Grafana
Installation
Configuration
Visualization
Alerting
Custom setup
Log stack
The system/debug/gc logs
Filebeat
Installation
Configuration
Elasticsearch
Installation
Configuration
Kibana
Installation
Configuration
Troubleshooting
High CPU usage
Different garbage-collection patterns
Hotspots
Disk performance
Node flakiness
All-in-one Docker
Creating a database and other monitoring components locally
Web links
Summary
Application Development
Getting started
The path to failure
Is Cassandra the right database?
Good use cases for Apache Cassandra
Use and expectations around application data consistency
Choosing the right driver
Building a Java application
Driver dependency configuration with Apache Maven
Connection class
Other connection options
Retry policy
Default keyspace
Port
SSL
Connection pooling options
Starting simple – Hello World!
Using the object mapper
Building a data loader
Asynchronous operations
Data loader example
Summary
Integration with Apache Spark
Spark
Architecture
Installation
Running custom Spark Docker locally
Configuration
The web UI
Master
Worker
Application
PySpark
Connection config
Accessing Cassandra data
SparkR
Connection config
Accessing Cassandra data
RStudio
Connection config
Accessing Cassandra data
Jupyter
Architecture
Installation
Configuration
Web UI
PySpark through Jupyter
Summary
References
Chapter 1 – Quick Start
Chapter 2 – Cassandra Architecture
Chapter 3 – Effective CQL
Chapter 4 – Configuring a Cluster
Chapter 5 – Performance Tuning
Chapter 6 – Managing a Cluster
Chapter 7 – Monitoring
Chapter 8 – Application Development
Chapter 9 – Integration with Apache Spark
Other Books You May Enjoy
Leave a review - let other readers know what you think
This book is intended to help you understand the Apache Cassandra NoSQL database. It will describe procedures and methods for configuring, installing, and maintaining a high-performing Cassandra cluster. This book can serve as a reference for some of the more obscure configuration parameters and commands that may be needed to operate and maintain a cluster. Also, tools and methods will be suggested for integrating Cassandra with Apache Spark, as well as practices for building Java applications that use Cassandra.
This book is intended for a DBA who is responsible for supporting Apache Cassandra clusters. It may also benefit full-stack engineers at smaller companies. While these individuals primarily build applications, they may also have to maintain and support their Cassandra cluster out of necessity.
Chapter 1, Quick Start, walks the reader through getting started with Apache Cassandra. As the title suggests, explanations will be brief in favor of guiding the reader toward quickly standing up an Apache Cassandra single-node cluster.
Chapter 2, Cassandra Architecture, covers the ideas and theories behind how Apache Cassandra works. These concepts will be useful going forward, as an understanding of Cassandra's inner workings can help in building high-performing data models.
Chapter 3, Effective CQL, introduces the reader to CQL, the Cassandra Query Language. It describes building appropriate data models and how to leverage CQL to get the most out of your cluster.
Chapter 4, Configuring a Cluster, details the configuration files and settings that go into building an Apache Cassandra Cluster. In addition, this chapter also describes the effects that some of the settings have, and how they can be used to keep your cluster running well.
Chapter 5, Performance Tuning, discusses the extra settings, configurations, and design considerations that can help to improve performance or mitigate issues.
Chapter 6, Managing a Cluster, goes into detail when describing the nodetool utility, and how it can be used for operations on an Apache Cassandra cluster. Adding and removing nodes is covered, as well as taking and restoring from backups.
Chapter 7, Monitoring, describes how to integrate a technology stack that provides a window into an Apache Cassandra cluster's history and performance metrics.
Chapter 8, Application Development, takes the reader through design considerations around coding Java applications to work with an Apache Cassandra cluster.
Chapter 9, Integration with Apache Spark, talks about installing and using Apache Spark in order to analyze and discover value in your data.
Appendix A, References, contains links to the various references found throughout the book.
This book assumes that you have access to hardware on which you can install, configure, and code against an Apache Cassandra instance. Having elevated admin or sudo privileges on the aforementioned machine will be essential to carrying out some of the tasks described.
This book is written from the perspective of running Apache Cassandra on a macOS or Linux instance. As OS-specific system administration is not within the scope of this book, readers who are new to Linux may find value in seeking out a separate tutorial prior to attempting some of the examples.
The Java coding examples will be easier to do from within an integrated development environment (IDE), with Apache Maven installed for dependency management. You may need to look up additional resources to ensure that these components are configured properly. Several IDEs have a plugin that allows for direct integration with Apache Maven.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Apache-Cassandra-3.x-Third-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "This will store the PID of the Cassandra process in a file named cassandra.pid in the local/cassandra directory."
A block of code is set as follows:
<dependencies>
  <dependency>
    <groupId>com.datastax.cassandra</groupId>
    <artifactId>cassandra-driver-core</artifactId>
    <version>3.6.0</version>
  </dependency>
</dependencies>
Any command-line input or output is written as follows:
cassdba@cqlsh> use packt;
cassdba@cqlsh:packt>
Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Welcome to the world of Apache Cassandra! In this first chapter, we will briefly introduce Cassandra, along with a quick, step-by-step process to get your own single-node cluster up and running. Even if you already have experience working with Cassandra, this chapter will help to provide assurance that you are building everything properly. If this is your first foray into Cassandra, then get ready to take your first steps into a larger world.
In this chapter, we will cover the following topics:
Introduction to Cassandra
Installation and configuration
Starting up and shutting down Cassandra
Cassandra Cluster Manager (CCM)
By the end of this chapter, you will have built a single-node cluster of Apache Cassandra. This will be a good exercise to help you start to see some of the configuration and thought that goes into building a larger cluster. As this chapter progresses and the material gets more complex, you will start to connect the dots and understand exactly what is happening between installation, operation, and development.
Apache Cassandra is a highly available, distributed, partitioned row store. It is one of the more popular NoSQL databases used by both small and large companies all over the world to store and efficiently retrieve large amounts of data. While there are licensed, proprietary versions available (which include enterprise support), Cassandra is also a top-level project of the Apache Software Foundation, and has deep roots in the open source community. This makes Cassandra a proven and battle-tested approach to scaling high-throughput applications.
Cassandra's design is premised on the points outlined in the Dynamo: Amazon's Highly Available Key-value Store paper (https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). Specifically, when you have large networks of interconnected hardware, something is always in a state of failure. In reality, every piece of hardware being in a healthy state is the exception, rather than the rule. Therefore, it is important that a data storage system is able to deal with (and account for) issues such as network or disk failure.
Depending on the Replication Factor (RF) and required consistency level, a Cassandra cluster is capable of sustaining operations with one or two nodes in a failure state. For example, let's assume that a cluster with a single data center has a keyspace configured for a RF of three. This means that the cluster contains three copies of each row of data. If an application queries with a consistency level of one, then it can still function properly with one or two nodes in a down state.
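As a sketch of how this is expressed in CQL (the keyspace name packt and the data center name ClockworkAngels come from the examples used later in this book), a keyspace with an RF of three and a query consistency of one can be configured from cqlsh like this:

```sql
-- Three copies of every row in the ClockworkAngels data center
CREATE KEYSPACE packt WITH replication =
  {'class': 'NetworkTopologyStrategy', 'ClockworkAngels': '3'};

-- cqlsh command: subsequent reads/writes succeed as long as
-- a single replica responds, even with two replicas down
CONSISTENCY ONE;
```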
Cassandra is known as a distributed database. A Cassandra cluster is a collection of nodes (individual instances running Cassandra) all working together to serve the same dataset. Nodes can also be grouped together into logical data centers. This is useful for providing data locality for an application or service layer, as well as for working with Cassandra instances that have been deployed in different regions of a public cloud.
Cassandra clusters can scale to suit both expanding disk footprint and higher operational throughput. Essentially, this means that each node becomes responsible for a smaller percentage of the total data size as the cluster grows. Assuming that the 500 GB disks of a six-node cluster (RF of three) start to reach their maximum capacity, adding three more nodes (for a total of nine) accomplishes the following:
Brings the total disk available to the cluster up from 3 TB to 4.5 TB
The percentage of data that each node is responsible for drops from 50% down to 33%
Additionally, let's assume that before the expansion of the cluster (from the prior example), the cluster was capable of supporting 5,000 operations per second. Cassandra scales linearly to support operational throughput. After increasing the cluster from six nodes to nine, the cluster should then be expected to support 7,500 operations per second.
In Cassandra, rows of data are stored in tables based on the hashed value of the partition key, called a token. Each node in the cluster is assigned multiple token ranges, and rows are stored on nodes that are responsible for their tokens.
Each keyspace (collection of tables) can be assigned a RF. The RF designates how many copies of each row should be stored in each data center. If a keyspace has a RF of three, then each node is assigned primary, secondary, and tertiary token ranges. As data is written, it is written to all of the nodes that are responsible for its token.
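To see the hashed token value that determines row placement, cqlsh can apply the built-in token function to a table's partition key. As an illustrative sketch (the users table and its user_id partition key are hypothetical):

```sql
-- Returns the Murmur3 hash (token) that decides which
-- nodes' token ranges this row falls into
SELECT token(user_id), user_id FROM users;
```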
To get started with Cassandra quickly, we'll step through a single-node, local installation.
The following are the requirements to run Cassandra locally:
A flavor of Linux or macOS
A system with between 4 GB and 16 GB of random access memory (RAM)
A local installation of the Java Development Kit (JDK) version 8, latest patch
A local installation of Python 2.7 (for cqlsh)
Your user must have sudo rights to your local system
Head to the Apache download site for the Cassandra project (http://cassandra.apache.org/download/), choose version 3.11.2, and select a mirror to download it. When complete, copy the .tar.gz file to a location that your user has read and write permissions for. This example will assume that this is the ~/local/ directory:
mkdir ~/local
cd ~/local
cp ~/Downloads/apache-cassandra-3.11.2-bin.tar.gz .
Untar the file to create your cassandra directory:
tar -zxvf apache-cassandra-3.11.2-bin.tar.gz
Some people prefer to rename this directory, like so:
mv apache-cassandra-3.11.2/ cassandra/
At this point, you could start your node with no further configuration. However, it is good to get into the habit of checking and adjusting the properties that are indicated as follows.
It is usually a good idea to rename your cluster. Inside the conf/cassandra.yaml file, specify a new cluster_name property, overwriting the default Test Cluster:
cluster_name: 'PermanentWaves'
The num_tokens property default of 256 has proven to be too high for the newer, 3.x versions of Cassandra. Go ahead and set that to 24:
num_tokens: 24
To enable user security, change the authenticator and authorizer properties (from their defaults) to the following values:
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
By default, Cassandra will come up bound to localhost or 127.0.0.1. For your own local development machine, this is probably fine. However, if you want to build a multi-node cluster, you will want to bind to your machine's IP address. For this example, I will use 192.168.0.101. To configure the node to bind to this IP, adjust the listen_address and rpc_address properties:
listen_address: 192.168.0.101
rpc_address: 192.168.0.101
If you set listen_address and rpc_address, you'll also need to adjust your seed list (which defaults to 127.0.0.1):
- seeds: "192.168.0.101"
I will also adjust my endpoint_snitch property to use GossipingPropertyFileSnitch:
endpoint_snitch: GossipingPropertyFileSnitch
In terms of NoSQL databases, Apache Cassandra handles multi-data center awareness better than any other. To configure this, each node must use GossipingPropertyFileSnitch (as mentioned in the preceding cassandra.yaml configuration) and must have its local data center (and rack) settings defined. Therefore, I will set the dc and rack properties in the conf/cassandra-rackdc.properties file:
dc=ClockworkAngels
rack=R40
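Since all of these are simple key/value edits, they can be scripted. The following sketch (not from this chapter) applies the cassandra.yaml and cassandra-rackdc.properties changes above with sed. For demonstration, it writes the stock default lines to scratch files under /tmp; on a real node, you would point CONF and RACKDC at the files in your conf/ directory instead and keep the .bak backups it creates:

```shell
# Sketch: apply the configuration edits above non-interactively.
# Demo paths; on a real node use conf/cassandra.yaml and
# conf/cassandra-rackdc.properties.
CONF=/tmp/cassandra-demo.yaml
RACKDC=/tmp/cassandra-rackdc-demo.properties
IP=192.168.0.101

# Stock 3.11.x defaults for the properties this chapter changes.
cat > "$CONF" <<'EOF'
cluster_name: 'Test Cluster'
num_tokens: 256
authenticator: AllowAllAuthenticator
authorizer: AllowAllAuthorizer
listen_address: localhost
rpc_address: localhost
          - seeds: "127.0.0.1"
endpoint_snitch: SimpleSnitch
EOF
cat > "$RACKDC" <<'EOF'
dc=dc1
rack=rack1
EOF

sed -i.bak \
  -e "s/^cluster_name:.*/cluster_name: 'PermanentWaves'/" \
  -e "s/^num_tokens:.*/num_tokens: 24/" \
  -e "s/^authenticator:.*/authenticator: PasswordAuthenticator/" \
  -e "s/^authorizer:.*/authorizer: CassandraAuthorizer/" \
  -e "s/^listen_address:.*/listen_address: $IP/" \
  -e "s/^rpc_address:.*/rpc_address: $IP/" \
  -e "s/- seeds:.*/- seeds: \"$IP\"/" \
  -e "s/^endpoint_snitch:.*/endpoint_snitch: GossipingPropertyFileSnitch/" \
  "$CONF"

sed -i.bak \
  -e "s/^dc=.*/dc=ClockworkAngels/" \
  -e "s/^rack=.*/rack=R40/" \
  "$RACKDC"

grep -E 'cluster_name|num_tokens|seeds' "$CONF"
```

The -i.bak form works with both GNU and BSD sed, so the same script runs on Linux and macOS.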
To start Cassandra locally, execute the cassandra script. With no arguments, it will run in the background; pass the -f flag to keep it in the foreground. To record the process ID (PID), send the -p flag with a destination file:
cd cassandra
bin/cassandra -p cassandra.pid
This will store the PID of the Cassandra process in a file named cassandra.pid in the ~/local/cassandra directory. Several messages will be dumped to the screen. The node is successfully running when you see this message:
Starting listening for CQL clients on localhost/192.168.0.101:9042 (unencrypted).
This can also be verified with the nodetool status command:
bin/nodetool status
Datacenter: ClockworkAngels
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.0.101 71.26 KiB 24 100.0% 0edb5efa... R40
If you want an even faster way to install Cassandra, you can use an open source tool called CCM (Cassandra Cluster Manager). CCM installs Cassandra for you, with very minimal configuration. In addition to ease of installation, CCM also allows you to run multiple Cassandra nodes locally.
First, let's clone the CCM repository from GitHub, and cd into the directory:
git clone https://github.com/riptano/ccm.git
cd ccm
Next, we'll run the setup program to install CCM:
sudo ./setup.py install
Once a cluster has been created and started with CCM, verify that it is working by invoking nodetool status via node1:
ccm node1 status
Datacenter: datacenter1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 100.56 KiB 1 66.7% 49ecc8dd... rack1
UN 127.0.0.2 34.81 KiB 1 66.7% 404a8f97... rack1
UN 127.0.0.3 34.85 KiB 1 66.7% eed33fc5... rack1
To shut down your cluster, go ahead and send the stop command to each node:
ccm stop node1
ccm stop node2
ccm stop node3
Note that CCM requires a working installation of Python 2.7 or later, a few additional libraries (PyYAML, six, ant, and psutil), and local IPs 127.0.0.1 through 127.0.0.3 to be available. Visit https://github.com/riptano/ccm for more information.
Now that we have a Cassandra cluster running on our local machine, we will demonstrate its use with some quick examples. We will start with cqlsh, and use that as our primary means of working with the Cassandra data model.
Before shutting down your cluster instances, there are some additional commands that should be run. With your own local node(s), these are not strictly necessary, but it is a good idea to get used to running them, should you ever need to properly shut down a production node containing data that people actually care about.
First, we will disable gossip. This keeps other nodes from communicating with the node while we are trying to bring it down:
bin/nodetool disablegossip
Next, we will disable the native binary protocol to keep this node from serving client requests:
bin/nodetool disablebinary
Then, we will drain the node. This will prevent it from accepting writes, and force all in-memory data to be written to disk:
bin/nodetool drain
With the node drained, we can kill the PID:
kill `cat cassandra.pid`
We can verify that the node has stopped by tailing the log:
tail logs/system.log
INFO [RMI TCP Connection(2)-127.0.0.1] 2018-03-31 17:49:05,789 StorageService.java:2292 - Node localhost/192.168.0.101 state jump to shutdown
INFO [RMI TCP Connection(4)-127.0.0.1] 2018-03-31 17:49:49,492 Server.java:176 - Stop listening for CQL clients
INFO [RMI TCP Connection(6)-127.0.0.1] 2018-03-31 17:50:11,312 StorageService.java:1449 - DRAINING: starting drain process
INFO [RMI TCP Connection(6)-127.0.0.1] 2018-03-31 17:50:11,313 HintsService.java:220 - Paused hints dispatch
INFO [RMI TCP Connection(6)-127.0.0.1] 2018-03-31 17:50:11,314 Gossiper.java:1540 - Announcing shutdown
INFO [RMI TCP Connection(6)-127.0.0.1] 2018-03-31 17:50:11,314 StorageService.java:2292 - Node localhost/192.168.0.101 state jump to shutdown
INFO [RMI TCP Connection(6)-127.0.0.1] 2018-03-31 17:50:13,315 MessagingService.java:984 - Waiting for messaging service to quiesce
INFO [ACCEPT-localhost/192.168.0.101] 2018-03-31 17:50:13,316 MessagingService.java:1338 - MessagingService has terminated the accept() thread
INFO [RMI TCP Connection(6)-127.0.0.1] 2018-03-31 17:50:14,764 HintsService.java:220 - Paused hints dispatch
INFO [RMI TCP Connection(6)-127.0.0.1] 2018-03-31 17:50:14,861 StorageService.java:1449 - DRAINED
In this chapter, we introduced Apache Cassandra and some of its design considerations and components. Each of these aspects was given a high-level description, along with how it affects things like cluster layout and data storage. Additionally, we built our own local, single-node cluster, and CCM was introduced with minimal discussion. Some basic commands with Cassandra's nodetool were also introduced and put to use.
With a single-node cluster running, the cqlsh tool was introduced. We created a keyspace that will work in a multi-data center configuration. The concept of query tables was also introduced, as well as running some simple read and write operations.
In the next chapter, we will take an in-depth look at Cassandra's underlying architecture, and understand what is key to making good decisions about cluster deployment and data modeling. From there, we'll discuss various aspects to help fine-tune a production cluster and its deployment process. That will bring us to monitoring and application development, and put you well on your way to mastering Cassandra!
In this chapter, we will discuss the architecture behind Apache Cassandra in detail. We will discuss how Cassandra was designed and how it adheres to Brewer's CAP theorem, which will give us insight into the reasons for its behavior. Specifically, this chapter will cover:
Problems that Cassandra was designed to solve
Cassandra's read and write paths
The role that horizontal scaling plays
How data is stored on-disk
How Cassandra handles failure scenarios
This chapter will help you to build a good foundation of understanding that will prove very helpful later on. Knowing how Apache Cassandra works under the hood helps with later operational tasks. Building high-performing, scalable data models also requires an understanding of the architecture, and that understanding can be the difference between an unsuccessful and a successful cluster.
Understanding how Apache Cassandra works under the hood can greatly improve your chances of running a successful cluster or application. We will reach that understanding by asking some simple, fundamental questions. What types of problems was Cassandra designed to solve? Why does a relational database management system (RDBMS) have difficulty handling those problems? If Cassandra works this way, how should I design my data model to achieve the best performance?
As the internet grew in popularity around the turn of the century, the systems behind internet architecture began to change. When good ideas were built into popular websites, user traffic increased exponentially. In 2001, it was not uncommon for too much web traffic to be the reason a popular site was slow or a web server went down. Web architects quickly figured out that they could build out multiple instances of their website or application and distribute traffic with load balancers.
