Description

This title is one of the "Essentials" IT books published by TechNet Publications Limited. It is a helpful practical guide for beginners in the topic that can be used as learning material by undergraduate and graduate students at universities and colleges, and by anyone who wants to learn the topic through a short and complete resource. We hope you find this book useful in shaping your future career.

Publication year: 2016




John Garcia

Getting Started with Cassandra

BookRix GmbH & Co. KG, 81371 Munich

Chapter 1 Introduction

Apache Cassandra is a highly scalable, open-source NoSQL database designed as a peer-to-peer distributed system in which all nodes are equal and data is distributed among the nodes in the cluster. Since there is no single point of failure, having nodes out of order in a cluster is not a big deal for Cassandra. It is common for nodes to be added to and taken out of the cluster during regular working hours, without having to wait for the system load to drop during the night. Also, one of Cassandra's key features is that it runs on commodity hardware and is easily deployed on cloud-based infrastructure.

Note: In Cassandra terminology, a node is one Cassandra instance. A cluster is a group of two or more nodes working together and communicating with the gossip protocol.

The Name Cassandra

Let’s start with the name. Cassandra was the very beautiful daughter of King Priam and Queen Hecuba of Troy. Apollo, Greek god of music, poetry, art, oracles, archery, plague, medicine, sun, light, and knowledge, fell in love with Cassandra the first time he saw her. He offered her the gift of prophecy in exchange for a kiss. Cassandra agreed and she received the gift of prophecy, but she then refused Apollo. The gift could not be taken back, so Apollo cursed her so that nobody would believe her prophecies.

The story behind the name is not as simple as it might seem at first. Some say that the chosen name is actually related to the name of a popular storage technology, Oracle, and believe that the engineers at Facebook chose the name Cassandra because Cassandra was a cursed oracle.

History of Apache Cassandra

Many concepts Cassandra relies on come from Google’s BigTable developed in 2005, and Amazon’s Dynamo from 2007. Avinash Lakshman, one of the authors of the Dynamo paper, and Prashant Malik developed the first version of Cassandra at Facebook to support the Inbox Search feature.

Facebook then released it on Google Code in 2008. In 2009, it became an Apache Incubator project; it was promoted to a top-level project in 2010. Since then, Cassandra has gone through more than eight releases and is used by more than 400 companies in 39 countries. The impact that Cassandra has had on the NoSQL scene is pretty significant, especially considering that 25 percent of the Fortune 100 companies use Cassandra even though there are literally hundreds of other NoSQL solutions available today.

Let’s take a look at some companies and organizations that use Cassandra. Netflix uses Cassandra for storing customer viewing data, eBay uses it for fraud detection and the social component of its platform, Expedia uses it for caching hotel room prices, SoundCloud uses it to store users’ dashboards, and CERN uses it for monitoring the data acquisition system of the ATLAS experiment (one of two systems involved in the discovery of the Higgs boson).

As mentioned previously, Cassandra is an open-source project and there are many committers working on it. Cassandra committers are employed at companies such as Facebook, Apple, Twitter, and nScaled—with the majority of the committers coming from DataStax. DataStax’s business model is based on providing distribution and support for an enterprise-grade version of Cassandra. DataStax also supports and hosts many Cassandra community-related activities.

Basic Theory behind Cassandra

To understand the principles behind Cassandra and distributed systems in general (and the motivation that led to their development), we have to take a step back and look at how it was all done with systems that didn’t have to cope with the scale of data that today’s systems often do.

Vertical Scaling

Back in the day, big centralized servers managed all the storage. If the system needed to scale, one bought a server with better performance characteristics and moved the data to the new hardware. This type of scaling is called vertical scaling, or scaling up.

Figure 1: Vertical Scaling

This approach to scaling is limited by the hardware currently available, not to mention that the bigger the hardware, the bigger the price. Another issue is that if one needed just 10 percent more capacity (or something in that range), new hardware still had to be acquired. Now, there is nothing wrong with this approach; it worked just fine for decades and is still widely used today. However, as systems evolved and grew in scale, the vertical scaling approach hit a wall, and it became very difficult to move gigabytes and gigabytes of data around for even the smallest changes to the system. Also, losing a server often meant losing data, or it involved a complicated restore process.

Relational Databases

For many decades, relational databases were the main technology for handling data. In a relational database, a unit of work performed against the database is called a transaction. Relational databases are often referred to as transactional systems, and a relational database basically has to provide an environment for executing transactions.

Transaction properties are:

Atomicity: If one part of the transaction fails, the whole transaction fails.
Consistency: Every change has to be in accordance with constraints, triggers, etc.
Isolation: Ongoing transactions don’t interfere with one another.
Durability: After the transaction is finished, the data remains stored in case of a failure.

The transaction properties (often referred to as ACID) are harder to provide when the data is distributed across multiple physical machines. For instance, ACID would require distributed locking of resources. This locking becomes a problem when the number of transactions increases and the system starts to spend more and more time on coordinating while nodes wait for other nodes to finish and acknowledge operations.

Oftentimes, some sort of master/slave architecture is used. The master decides where the data is stored and waits for the slaves to respond. The scaling problem is twofold: throughput depends on the master’s capacity, and the more nodes there are in the system, the more communication is required to maintain the ACID properties of the transactions.

Figure 2: Client waiting for operation to finish in an ACID-distributed system with master node

Take into account that in the previous figure, the number of requests from the master to the slaves is actually twice as large because, in the first phase, the master notifies the slaves of what the change is going to be, and every slave acknowledges whether it is all right. The master then has to notify the slaves that the other nodes agreed and that the transaction went through, and it has to wait for their responses once again.
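To make the message count concrete, here is a minimal Python sketch of the two round trips just described. The master, slaves, and key names are made up for illustration; this is a generic sketch of the idea under those assumptions, not how any particular relational database implements its commits.

class Slave:
    def __init__(self):
        self.pending = None
        self.data = {}

    def prepare(self, key, value):
        # Phase 1: remember the proposed change and acknowledge it.
        self.pending = (key, value)
        return True

    def commit(self):
        # Phase 2: apply the change once the master says everyone agreed.
        key, value = self.pending
        self.data[key] = value
        self.pending = None


def master_write(slaves, key, value):
    messages = 0
    # Phase 1: tell every slave what the change is going to be and wait for acknowledgements.
    for slave in slaves:
        messages += 1
        if not slave.prepare(key, value):
            return messages, False   # any refusal aborts the transaction
    # Phase 2: tell every slave that all nodes agreed, and wait for their responses again.
    for slave in slaves:
        messages += 1
        slave.commit()
    return messages, True


slaves = [Slave() for _ in range(3)]
messages, committed = master_write(slaves, "user:42", "some value")
print(messages, committed)   # 6 messages for 3 slaves: two round trips instead of one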

CAP Theorem

In 2000, Eric Brewer published the CAP theorem. Seth Gilbert and Nancy Lynch of MIT formally proved the theorem in 2002. In short, the theorem states that a distributed system can simultaneously provide just two of the three following guarantees:

Consistency: Every read gets the data from the most recent write.
Availability: Every running node always responds to a query.
Partition tolerance: The system continues to operate despite node outage.

This choice is often depicted with the following triangle.

Figure 3: CAP Theorem Choices

Although there’s a fair amount of theory behind this “two out of three” rule, Brewer stated in 2012 that it’s a bit misunderstood. First, most systems aren’t partitioned, so choosing between A and C is often not really necessary because P is irrelevant within a single-node system. Second, the choice between A and C can happen at many levels within the system itself. Finally, all three properties are more continuous than binary. Brewer also pointed out that CAP consistency is not the same as ACID consistency, and that CAP refers only to single-copy consistency, which is a subset of ACID consistency.

Note: In distributed systems, it comes down to availability or consistency.

The previous reasoning about misunderstood concepts in CAP might seem a bit complicated, but in practice “two out of three” comes down to “one out of two”, so it’s more like picking between “A” and “C”. To provide availability, data is replicated to different machines. Consistency is achieved by updating other nodes before allowing further reads. So, systems that allow reads before other nodes are updated score high on “A”, while systems that lock other nodes before allowing reads score high on “C”. With Cassandra, this choice leans more toward “A”.

Consistent Hashing

Cassandra is a peer-to-peer system. We have already mentioned a couple of times that having some master node or other central place decide how to distribute data within the cluster is, by itself, very inefficient once the cluster passes a certain size. So, the basic idea is to provide some way for the nodes themselves to know on which node within the cluster to place the data. For now, let’s assume that the data key is going to be a really large number, large enough for us to pick it randomly without any fear of this number appearing again. Let’s call it “RowID”.

The simplest algorithm to distribute this data to N nodes would be to divide RowID by N and then look at the remainder. The remainder will always range from 0 to N-1. If we assign all of the keys with a remainder of 0 to Node A, 1 to Node B, 2 to Node C, and so on, we have a nice way of knowing which node a row belongs to. Now, as great as this might sound, this way of distributing data has a very serious weakness: when the number of nodes changes, nearly all of the data stored in the cluster suddenly belongs to some other node. In practice, this means that taking any node down while the system is running would cause the cluster to become unresponsive and probably useless until all of the data was shifted to the appropriate server.
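To see how drastic the reshuffling is, here is a minimal Python sketch of this modulo-based placement. The node names and RowID values are made up for illustration; the point is only that adding a fourth node changes the owner of roughly three quarters of the rows.

def node_for(row_id, nodes):
    # Naive placement: the remainder of RowID divided by the node count picks the node.
    return nodes[row_id % len(nodes)]

row_ids = range(1000)
three_nodes = ["A", "B", "C"]
four_nodes = ["A", "B", "C", "D"]

moved = sum(
    1 for row_id in row_ids
    if node_for(row_id, three_nodes) != node_for(row_id, four_nodes)
)
print(f"{moved} of {len(row_ids)} rows change owner")   # roughly 3 out of every 4 rows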

One way of avoiding shifting data on such a scale is to partition the whole space of possible RowID values into ranges of equal size.

Figure 4: Partitioning with a range of values

Figure 4 shows RowID values from 0 to 99. Remember, RowID is a very large number representing a row, but for simplicity we’re working with a range from 0 to 99. The first 25 RowID values belong to A, the next 25 to B, and so on. If, for some reason, another node joins the cluster, only a part of the data would be shifted between the nodes, not all of it. Cassandra functioned like this for a long time. One didn’t specify the numbers from 0 to 99, but used special tools to generate these boundary values for nodes, and the boundary values were held in configuration files. The weakness of this approach was that it had to be done manually, and thus was prone to errors.

Before going into a final round of explaining what consistent hashing is and why it is used, let’s take a step back. Previously, we had this large number, RowID, identifying the data. We assumed it was just some random number, but that’s not quite the truth of it. Every piece of data in Cassandra has a row key, some deterministic value that identifies the row we want to store and manipulate.

This value is determined on an application level and is used to retrieve data when necessary. Usually it is referred to as a key. This key value is unique, but since it probably isn’t a number (e.g., it is a username or e-mail address), it’s not possible to map it directly to a unique number that we could use to determine which node the data belongs to. Being able to deterministically derive a number from the application key to place the data is very important. The function that enables us to turn a key of any size into the fixed-size number we need to manipulate the data is called a hash function. In Cassandra’s case, the Murmur3 hash function is used, and its output range is from -2^63 to 2^63-1.
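As a small illustration of the idea, the following Python sketch turns an arbitrary application key into a fixed-size signed 64-bit number. The standard library’s blake2b hash is used here purely as a stand-in, since Cassandra’s actual Murmur3 partitioner produces different token values; only the principle (any key, deterministically mapped into the range -2^63 to 2^63-1) is the same.

import hashlib

def token_for(key):
    # Hash the key down to 8 bytes and read them as a signed 64-bit integer.
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)

print(token_for("alice@example.com"))
print(token_for("alice@example.com"))   # same key, same token: the mapping is deterministic
print(token_for("bob@example.com"))     # a different key lands somewhere else in the range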

To avoid manual assignment of hash function ranges to nodes, a simple algorithm could be used. We could, for instance, take a node’s name and calculate its hash value, which would produce a number. If we did this for every node in the cluster, we could partition the possible hash ranges with boundary marker values defined by the numbers produced as hash values of the node names.

Figure 5: Dynamic partitioning with hash values of node names

The technique shown in the previous figure is called consistent hashing. Note that there are no guarantees of any kind about the partition sizes or their ordering. The algorithm is pretty simple, and any node participating in the cluster can reconstruct it and know where a row with some hash value belongs. Whatever the hash of a row key turns out to be, the owning node is found by moving clockwise (or counterclockwise, as long as the direction is fixed) until a marker belonging to a node is found. The calculated RowID then belongs to the node owning that marker.

The output of the hash function is not guaranteed to fall into any kind of symmetrical range. As depicted in the previous figure, some nodes (Node D in our case) may hold much more data than others. This could lead to one node being underutilized and another working under heavy load. To distribute the data evenly among nodes, one can create a lot of smaller partitions. There is practically no limit on how many markers we set on this range. If we simply combine some predefined numerical or alphabetical prefix or suffix with the node’s name and put the resulting hashes into the range as markers, we would probably get a much better distribution of ranges. The algorithm remains the same: move clockwise or counterclockwise until a marker is found. By default, Cassandra puts 256 markers on the range for every node.

Figure 6: Sample of Murmur3 hash distribution on Cassandra cluster with three nodes

If we want some nodes to hold smaller amounts of data, we assign them fewer markers (or more markers for larger amounts of data). Having more markers on the hash output range means a greater probability of receiving more data, and vice versa. This technique has many advantages: if we add a node to the cluster, all nodes distribute a bit of their data to the new node, and if a node leaves the cluster, the remaining nodes split its data almost equally. Later we will cover Cassandra’s virtual nodes technology. As complex as it might seem, virtual nodes are just a lot of markers on the output range of the Murmur3 hash function, as depicted earlier.
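The following Python sketch pulls the previous paragraphs together: every node gets many markers on the hash range, and a row is placed by moving clockwise from its token to the nearest marker. Node names are made up, and the stand-in token_for() hash from the earlier sketch is reused instead of Cassandra’s Murmur3, so the numbers differ from a real cluster, but the lookup logic is the same idea.

import bisect
import hashlib
from collections import Counter

def token_for(key):
    # Stand-in for Murmur3: map any key to a signed 64-bit number.
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)

class Ring:
    def __init__(self, nodes, markers_per_node=256):
        # Each node gets many markers: the hashes of "node-name:0", "node-name:1", and so on.
        self._markers = sorted(
            (token_for(f"{node}:{i}"), node)
            for node in nodes
            for i in range(markers_per_node)
        )
        self._tokens = [token for token, _ in self._markers]

    def node_for(self, row_key):
        # Move clockwise to the first marker at or after the row's token,
        # wrapping around to the first marker if we fall off the end of the range.
        index = bisect.bisect_left(self._tokens, token_for(row_key))
        if index == len(self._markers):
            index = 0
        return self._markers[index][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("alice@example.com"))

# Placing many keys shows a roughly even split between the three nodes.
print(Counter(ring.node_for(f"user-{i}") for i in range(30000)))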

Architecture Overview

Cassandra is a row-oriented database. Every row is identified by its key. One of Cassandra’s limitations is that a single row must fit on a single node. The placement of rows on nodes happens automatically because there is no central node determining where a row is stored. All of the nodes use the same algorithm to determine data distribution: calculate a hash value for the row key and look up the nearest marker.

The hash function’s output range is divided among the nodes, so depending on the nodes available in the cluster, each node is responsible for a certain number of rows. If a node leaves the cluster, the key hashes remain the same, but some other node or nodes become responsible for the hash ranges the failed node was responsible for. The story is similar when a new node is added to the cluster.

It’s important to note that a single row is usually not stored on just one node, because the system has to be fault tolerant. A row is stored on just one node usually only in development and test environments. Having rows stored on multiple nodes is important because any node can be taken out or suffer some sort of failure at any time.

Replication Factor

When talking about data replication, one of the first concerns is how many copies of a single row are actually needed. This question is not easy to answer; it depends on a lot of circumstances. When discussing data replication, the most important term to remember is replication factor, which is the number of nodes that store the same row.

For instance, a replication factor of two guarantees that there will be two copies of each row on the nodes in the cluster. Choosing two is fine for covering single-node failure and is the minimum for production-level deployments (although it is not advisable, because if anything happens to one node, the remaining node has to handle all of the requests, which is always a bad idea).

When nodes fail more often, a higher replication factor is desirable. Also, a higher replication factor will, in some cases, increase read speed because multiple nodes responding with parts of the data can be faster than a single node transmitting all of the required data. On the other hand, replicating data from node to node takes longer. As a rule of thumb, going below or above three must be justified by a design or environment circumstance.

Keyspace and Column Family

Usually, not all data is equally important. Some historical logging information might be less valuable than measurement data. Cassandra allows grouping of data into so-called keyspaces. Generally speaking, a keyspace is a container for the application data, and a cluster typically has one keyspace per application. From Cassandra’s point of view, however, the main purpose of a keyspace is to control replication.

Note: Replication is defined on the keyspace level.

The next subunit for storing data is called a column family. A column family is a container for an ordered collection of rows, and every row is an ordered collection of columns. Most literature on Cassandra describes the column family as something similar to a table in the relational world.
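As a hedged illustration of how this looks in practice, the following sketch creates a keyspace and a column family (table) by sending CQL statements through the DataStax Python driver. The keyspace name, table definition, and contact point are made up for this example; replication strategies themselves are covered in the next section.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # connect through one node; it knows about the rest via gossip
session = cluster.connect()

# Replication is defined on the keyspace level.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS my_app
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# A column family (table) is a container for rows; every row is identified by its key.
session.execute("""
    CREATE TABLE IF NOT EXISTS my_app.users (
        username text PRIMARY KEY,
        email    text
    )
""")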

Replication Strategy

Earlier we covered nodes and determining which node the data is going to be stored on. To achieve a specific replication factor, the simplest approach is to copy the data to the next node on the hash distribution ring, continuing until the desired replication factor is reached; in Cassandra this is called SimpleStrategy. It is just fine for many smaller-scale systems, but sometimes simply copying the data around the data center is not enough. Some systems have different topologies, and constantly replicating data from coast to coast is not the best solution if every millisecond counts.
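Here is a minimal Python sketch of that idea: starting from the marker a row’s token maps to, walk the ring clockwise and collect distinct nodes until the replication factor is reached. The tokens and node names are made up, and this is only an illustration of the SimpleStrategy concept, not Cassandra’s actual replica-placement code.

import bisect

# (token, node) markers on the ring, already sorted by token.
MARKERS = [
    (-70, "node-a"), (-40, "node-b"), (-10, "node-c"),
    (20, "node-a"), (50, "node-b"), (80, "node-c"),
]
TOKENS = [token for token, _ in MARKERS]

def replicas_for(row_token, replication_factor):
    replicas = []
    start = bisect.bisect_left(TOKENS, row_token) % len(MARKERS)
    for step in range(len(MARKERS)):              # at most one full lap around the ring
        node = MARKERS[(start + step) % len(MARKERS)][1]
        if node not in replicas:                  # skip markers of nodes we already picked
            replicas.append(node)
        if len(replicas) == replication_factor:
            break
    return replicas

print(replicas_for(25, 2))   # ['node-b', 'node-c']: the owning node plus the next distinct node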

The second replication strategy in Cassandra is called NetworkTopologyStrategy. This strategy is the best choice if you plan to distribute your data among multiple data centers. The main difference between the NetworkTopologyStrategy and the SimpleStrategy is that determining the next node to hold the replicated data continues clockwise until a node outside of the same rack is found. Nodes in a rack often fail together because of power failures or facility conditions such as broken air conditioners or faulty network equipment.

The most important problem here is determining the number of replicas within each data center needed to satisfy reads without going to other data centers for the data. With two replicas in each data center, one node can fail at any time and rows will still be readable; three replicas allow two nodes to fail, and so on. It’s also possible to have different replication factors in different data centers; for instance, three replicas in one data center to serve real-time requests and a single replica in another data center for batch processing or analytics.
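As a sketch of that last example, the following CQL (sent through the Python driver) declares a keyspace that keeps three replicas in one data center for real-time requests and a single replica in another for analytics. The keyspace and data-center names are made up and would have to match the data-center names your cluster actually uses.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS my_multi_dc_app
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc_realtime': 3,
        'dc_analytics': 1
    }
""")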

Gossip Protocol and the Snitch

The protocol used for sharing node locations and state information about the nodes participating in a Cassandra cluster is called gossip. Cassandra nodes are constantly gossiping; every node initiates a gossip exchange approximately once per second with up to three nodes in the cluster.