Cassandra Design Patterns

Rajanarayanan Thottuvaikkatumana

Description

If you are new to Cassandra but well-versed in RDBMS modeling and design, then it is natural to model data the same way in Cassandra, resulting in poorly performing applications that defeat the very purpose of adopting Cassandra. If you want to learn to make the most of Cassandra, this book is for you.
This book starts with strategies to integrate Cassandra with other legacy data stores and progresses to the ways in which a migration from RDBMS to Cassandra can be accomplished. The journey continues with ideas to migrate data from cache solutions to Cassandra. With this, the stage is set and the book moves on to some of the most commonly seen problems in applications when dealing with consistency, availability, and partition tolerance guarantees.
Cassandra is exceptionally good at dealing with temporal data and patterns such as the time-series pattern and log pattern, which are covered next. Many NoSQL data stores fail miserably when a huge amount of data is read for analytical purposes, but Cassandra is different in this regard. Keeping analytical needs in mind, you’ll walk through different and interesting design patterns.
No theoretical discussion is complete without a good set of use cases to apply the knowledge gained, so the book concludes with a set of use cases to which you can apply the patterns you've learned.


Table of Contents

Cassandra Design Patterns Second Edition
Credits
About the Author
Acknowledgements
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Co-existence Patterns
A brief overview of Cassandra
Denormalization pattern
Motivations/solutions
Best practices
Example
Reporting pattern
Motivations/solutions
Best practices
Example
Aggregation pattern
Motivations/solutions
Best practices
Example
References
Summary
2. RDBMS Migration Patterns
A brief overview
List pattern
Motivations/solutions
Best practices
Example
Set pattern
Motivations/solutions
Best practices
Example
Map pattern
Motivations/solutions
Best practices
Example
Distributed Counter pattern
Motivations/solutions
Best practices
Example
Purge pattern
Motivations/solutions
Best practices
Example
References
Summary
3. Cache Migration Patterns
A brief overview
Cache to NoSQL pattern
Motivations/solutions
Best practices
Example
References
Summary
4. CAP Patterns
A brief overview
Write-heavy pattern
Motivations/solutions
Best practices
Example
Read-heavy pattern
Motivations/solutions
Best practices
Example
Read-write balanced pattern
Motivations/solutions
Best practices
Example
References
Summary
5. Temporal Patterns
A brief overview
Time series pattern
Motivations/solutions
Best practices
Example
Log pattern
Motivations/solutions
Best practices
Example
Conversation pattern
Motivations/solutions
Best practices
Example
References
Summary
6. Analytics Patterns
Processing big data
Apache Hadoop
Apache Spark
Transforming data
A brief overview
Map/Reduce pattern
Motivations/solutions
Best practices
Example
Transformation pattern
Motivations/solutions
Best practices
Example
References
Summary
7. Designing Applications
A brief overview
Application design and use cases
Service management and use cases
References
Summary
Index

Cassandra Design Patterns Second Edition

Cassandra Design Patterns Second Edition

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: January 2014

Second edition: October 2015

Production reference: 1261015

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78528-570-7

www.packtpub.com

This book is an update to Cassandra Design Patterns by Sanjay Sharma.

Credits

Author

Rajanarayanan Thottuvaikkatumana

Reviewers

William Berg

Mark Kerzner

Alex Shvid

Commissioning Editor

Priya Singh

Acquisition Editor

Tushar Gupta

Content Development Editor

Samantha Gonsalves

Technical Editor

Anushree Arun Tendulkar

Copy Editor

Vatsal Surti

Project Coordinator

Kinjal Bari

Proofreader

Safis Editing

Indexer

Tejal Daruwale Soni

Production Coordinator

Aparna Bhagat

Cover Work

Aparna Bhagat

About the Author

Rajanarayanan Thottuvaikkatumana, "Raj", is a seasoned technologist with more than 23 years of software development experience at various multinational companies. He has lived and worked in India, Singapore, and the USA, and is presently based in the UK. His experience includes architecting, designing, and developing software applications. He has worked on various technologies including major databases, application development platforms, web technologies, and big data technologies. Since 2000, he has been working mainly in Java-based technologies, doing heavy-duty server-side programming in Java and Scala. He has worked on highly concurrent, highly distributed, and high-transaction-volume systems with NoSQL data stores such as Cassandra and Riak, and caching technologies such as Redis, Ehcache, and Chronicle Map. Raj has a lot of experience in integrating Cassandra with Spark and has shared the Scala code repository on GitHub.

Raj holds a master's degree in Mathematics and a master's degree in Computer Information Systems, and has many certifications in ITIL and Cloud Computing to his credit.

Apart from all this, Raj is a prolific corporate trainer on various technical subjects and has contributed to the Apache Cassandra project.

When not working with computers, Raj watches a lot of tennis and he is an avid listener of classical music.

Even though Raj has worked on many white papers and training materials, this is his first publication in the form of a book.

Acknowledgements

I would like to thank my father for showing me that there is no age barrier for embarking upon something totally new. I would like to thank my mother for showing me that relentless work will be fruitful one day. I would like to thank my wife for showing me that working towards perfection culminates in something beyond comparison. I would like to thank my teachers who have helped me to see learning as a continuous process. I would like to thank my geeky friends who collectively have solutions for almost any technical problem. Last but not least, I would like to thank my present employer for gracefully giving me official permission to work on this project.

About the Reviewers

William Berg has been engineering software for the last several years, and has worked with Cassandra all that time. He works mainly with Java. He has also reviewed Cassandra Design Patterns, another Packt Publishing title. He also plays the bass guitar and produces electronic music.

Mark Kerzner holds degrees in law, math, and computer science. He is a software architect and has been working with Big Data for the last 7 years. He is a cofounder of Elephant Scale, a Big Data training and implementation company, and is the author of FreeEed, an open-source platform for eDiscovery based on Apache Hadoop. He has authored many books and has patents to his credit. He loves learning languages, and is currently perfecting his Hebrew and Chinese.

I would like to acknowledge the help of my colleagues, in particular Sujee Maniyam, and, last but not least, my multitalented family.

Alex Shvid is a Data Grid architect with more than 10 years of software experience at Fortune 500 companies, with a focus on financial institutions. He has worked in the USA, Argentina, and Russia, and holds many architect and developer certifications, including those from Pivotal/SpringSource and Oracle. He is a regular speaker at user groups and conferences around the world, such as JavaOne and Cassandra meetups. Alex works for PayPal in Silicon Valley, developing low-latency big data real-time solutions. His major specialization is in big data and fast data framework adoption for enterprise environments. He has participated in the open-source Spring Data Cassandra module and developed a Dell Crowbar automation barclamp for Cassandra. Among his recent fast data projects are the integration of GemFire from Pivotal as an event-processing middleware solution and caching system for Gire (Buenos Aires, Argentina), Visa (Foster City, CA, USA), and VMware (Palo Alto, CA, USA), as well as Coherence from Oracle for Analog (Boston, MA, USA) and RCI (Parsippany, NJ, USA), and a custom data grid solution for Deutsche Bank (New York, NY, USA). When he is not working, Alex can usually be found hiking with his wife along the Coastal Trail in the San Francisco Bay Area.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Preface

Apache Cassandra is one of the most popular NoSQL data stores, based on the research papers Dynamo: Amazon's Highly Available Key-value Store and Bigtable: A Distributed Storage System for Structured Data. Cassandra incorporates the best features from both of these papers. In general, NoSQL data stores can be classified into the following groups:

Key-value data store
Column-family data store
Document data store
Graph data store

Cassandra belongs to the column-family data store group. Cassandra's peer-to-peer architecture avoids single points of failure in a cluster of Cassandra nodes and gives the ability to distribute the nodes across racks or data centers. This makes Cassandra a linearly scalable data store: the greater your processing need, the more Cassandra nodes you can add to your cluster. Cassandra's multi-data-center support makes it a perfect choice for replicating data stores across data centers for disaster recovery, high availability, and the separation of transaction-processing and analytical environments, building resiliency into the data store infrastructure.
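
As an illustration, a keyspace replicated across two data centers might be defined as follows. This is a minimal sketch; the keyspace name, data center names, and replica counts are hypothetical.

-- Hypothetical keyspace: three replicas in data center DC1 (transactional)
-- and two in DC2 (analytics/disaster recovery).
CREATE KEYSPACE multi_dc_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};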

The basic data abstraction in Cassandra starts with a column consisting of a name, a value, a timestamp, and an optional time-to-live attribute. A row comes with a row key and a collection of sorted columns. A column family, or table, is a collection of rows. A keyspace is a collection of column families.
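
To make this concrete, here is a minimal CQL sketch; the keyspace, table, and column names are hypothetical. It shows a keyspace containing a table (column family), and how the timestamp and time-to-live attributes of a column can be observed.

CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- A table (column family) in the demo keyspace; user_id is the row key.
CREATE TABLE IF NOT EXISTS demo.users (
  user_id text PRIMARY KEY,
  name    text,
  email   text
);

-- Every column value carries a timestamp and an optional time-to-live.
INSERT INTO demo.users (user_id, name, email)
  VALUES ('u1', 'Alice', 'alice@example.com')
  USING TTL 86400;  -- this data expires after one day

SELECT name, WRITETIME(name), TTL(name) FROM demo.users WHERE user_id = 'u1';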

Cassandra 2.1 comes with a lot of new features, making it an even more powerful data store than ever before. The CQL clause IF NOT EXISTS lets you check the existence of an object before Cassandra creates a new one. Lightweight transactions and the batching of CQL commands give the user the ability to perform multistep atomic operations. Marking some columns in a column family as STATIC gives the user the ability to share data across all the rows of a given partition. User-defined data types give you the power to model your data store very close to real-world objects and to the objects used in applications written in object-oriented programming languages. Collection indexes may be used to index and query collection data types in Cassandra. Row cache improvements, changes to reads and writes, off-heap memory tables, incremental node repair, and the new counter implementation all make Cassandra perform much better than its previous releases.
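
A hedged sketch of a few of these features in CQL follows; all names are hypothetical. It shows a user-defined type, a static column shared across a partition, and a lightweight transaction.

-- User-defined type: model an address close to the application object.
CREATE TYPE IF NOT EXISTS demo.address (
  street text,
  city   text,
  pin    text
);

CREATE TABLE IF NOT EXISTS demo.accounts (
  account_id   text,
  tx_id        timeuuid,
  -- A static column holds one value shared by all rows of a partition.
  branch_name  text STATIC,
  home_address frozen<address>,
  amount       decimal,
  PRIMARY KEY (account_id, tx_id)
);

-- Lightweight transaction: insert only if the row does not already exist.
INSERT INTO demo.accounts (account_id, tx_id, amount)
  VALUES ('A100', now(), 250.00)
  IF NOT EXISTS;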

All the code samples used in this book are written for Cassandra 2.1.5, and all the examples follow the CQL 3.x specification. The pre-CQL, Thrift API-based Cassandra CLI is used to list the physical layout of the column families. An insight into the physical layout is very important because a wrong choice of partition key or primary key will result in insidious performance problems. As a best practice, it is a good idea to create the column family, insert a couple of records, and use the list command in the Cassandra CLI with the column-family name; it will show the physical layout.
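
For instance, after creating a hypothetical table and inserting a record through cqlsh, the deprecated Thrift-based CLI can reveal the physical layout; the exact output format varies by version, so treat this as a sketch.

-- In cqlsh: create a sample table and insert one record.
CREATE TABLE IF NOT EXISTS demo.events (
  day      text,
  event_ts timeuuid,
  payload  text,
  PRIMARY KEY (day, event_ts)
);

INSERT INTO demo.events (day, event_ts, payload)
  VALUES ('2015-06-01', now(), 'login');

-- Then, in the Cassandra CLI (started with the cassandra-cli command):
--   use demo;
--   list events;
-- The output shows one physical row per partition key ('2015-06-01'),
-- with the clustering column value folded into each physical column name.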

The term "design patterns" is a highly misinterpreted term in the software development community. In a very general sense, it is a set of solutions for some known problems in a very specific context. The way it is being used in this book is to describe a pattern of using certain features of Cassandra to solve some real-world problems. To refer to them and to identify them later, a name is also given to each of such design patterns. These pattern names may not be related at all to any similar sounding design pattern name used in other contexts and in other software development paradigms.

Users love Cassandra because of its SQL-like interface, CQL, whose features are very closely related to those of an RDBMS even though the paradigm is totally new. Application developers love Cassandra because of the plethora of drivers available in the market, so that they can write applications in their preferred programming language. Architects love Cassandra because they can store structured, semi-structured, and unstructured data in it. Database administrators love Cassandra because it comes with almost no maintenance overhead. Service managers love Cassandra because of the wonderful monitoring tools available in the market. CIOs love Cassandra because it gives value for their money. And, Cassandra works!

What this book covers

Chapter 1, Co-existence Patterns, discusses how Cassandra may be used in a legacy environment coexisting with RDBMSs.

Chapter 2, RDBMS Migration Patterns, discusses how some of the unique Cassandra features may be used to provide value and hence migrate traditional RDBMS data to Cassandra. It is a natural progression from coexistence with other legacy RDBMSs.

Chapter 3, Cache Migration Patterns, deals with some of the pitfalls of using caching solutions and how Cassandra may be used to overcome them.

Chapter 4, CAP Patterns, talks about data integrity considerations, consistency, availability, and partition tolerance and how some of the fine-tuning possibilities in Cassandra may be used to design powerful data stores.

Chapter 5, Temporal Patterns, discusses temporal data and how some of the features in Cassandra may be used to design powerful temporal data stores.

Chapter 6, Analytics Patterns, talks about the need for data analytics and how Cassandra in conjunction with Spark may be used to serve the data analysis use cases.

Chapter 7, Designing Applications, discusses designing a complete application that makes use of all the design patterns discussed in this book.

What you need for this book

Readers are advised to go through Cassandra data modeling before starting the journey of understanding Cassandra Design Patterns, Second Edition. An excellent book to start with data modeling is Cassandra Data Modeling and Analysis, C.Y. Kan, Packt Publishing. An understanding of RDBMS data modeling is a definite plus point.

This book has some version-specific content. The code examples refer to Cassandra Query Language (CQL). Cassandra 2.1.5 or above is the preferred version for references as well as for running the CQL code samples given in this book.

Who this book is for

This book is perfect for Cassandra developers who want to make use of the real power of Cassandra by taking their solutions to the next level. If you are an architect who is designing scalable Cassandra-based data solutions, this book is ideal for you to make use of the right Cassandra features in the right context to solve real-world problems. If you are already using Cassandra, this book will help you in leveraging its full potential.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Co-existence Patterns

 

"It's coexistence or no existence"

  --Bertrand Russell

Relational Database Management Systems (RDBMS) have been pervasive since the '70s. It is very difficult to find an organization without any RDBMS in its solution stack. Huge efforts have gone into the standardization of RDBMS; because of that, if you are familiar with one RDBMS, switching over to another is not a big problem. You remain in the same paradigm without any major shifts. Pretty much all the RDBMS vendors offer a core set of features with standard interfaces and then include their own value-added features on top. There is a standardized language to interact with RDBMS called Structured Query Language (SQL). The same queries written against one RDBMS will work without significant changes against another. From a skill set perspective, this is a big advantage because you need not learn and relearn new dialects of these query languages as the products evolve. All this makes migration from one RDBMS to another a relatively painless task. Many application designers design their applications in an RDBMS-agnostic way; in other words, the applications will work with multiple RDBMS. Just change some configuration properties of the application, and it will start working with a different, supported RDBMS. Many software products are designed to support multiple RDBMS through their configuration file settings to suit customers' preferred choice of RDBMS.

In most RDBMS, a database schema organizes objects such as tables, views, indexes, stored procedures, sequences, and so on into a logical group. Structured and related data is stored in tables as rows and columns. The primary key in a table uniquely identifies a row. There is a very strong theoretical background in the way data is stored in a table.

A table consists of rows and columns. Columns contain the fields, and rows contain the values of the data. Rows are also called records or tuples. Tuple calculus, which was introduced by Edgar F. Codd as part of the relational model, serves as the basis for the structured query language, or SQL, for this type of data model. Redundancy is avoided as much as possible. Wikipedia defines database normalization as follows:

"Database normalizationis the process of organizing the attributes and tables of a relational database to minimize data redundancy."

Since the emphasis is on avoiding redundancy, related data is spread across multiple tables, and the tables are joined together with SQL to present data in various application contexts. Multiple indexes defined on various columns in a table can help with data retrieval, sorting needs, and maintaining data integrity.

In recent years, the amount of data generated by various applications has grown enormously, and traditional RDBMS have started showing their age. Most RDBMS cannot ingest the sheer variety of data into their schemas. When data flows in quick succession, traditional RDBMS often become bottlenecks. When data is written into RDBMS data stores at such speed, the need to add more nodes to the RDBMS cluster arises in a very short period of time, and SQL performance degrades on distributed RDBMS. In other words, as we enter the era of big data, RDBMS cannot handle the three Vs of data: Volume, Variety, and Velocity.

Many RDBMS vendors came up with solutions for handling the three Vs of data, but these came at a huge cost. The cost involved in software licensing, the sophisticated hardware required, and the related ecosystem of building a fault-tolerant solution stack started affecting the bottom line in a big way. New-generation Internet companies started thinking of different solutions to this problem, and very specialized data stores began to emerge from these organizations and open source communities, based on some of the popular research papers. These data stores are generally termed NoSQL data stores, and they address very specific data storage and retrieval needs. Cassandra is one of the most successful NoSQL data stores, and it has a very good similarity with traditional RDBMS. The advantage of this similarity comes in handy when Cassandra is adopted by an enterprise: because the abstractions of a typical RDBMS and of Cassandra have a few similarities, new users can relate Cassandra concepts to familiar RDBMS concepts. From a logical perspective, Cassandra tables look similar to RDBMS tables in the view of the users, even though the underlying structures of these tables are totally different. Because of this, Cassandra is a good fit to be deployed alongside a traditional RDBMS to solve some of the problems that the RDBMS is not able to handle.

The caveat here is that, because of the similarity of RDBMS tables and Cassandra column families (also known as Cassandra tables) in the view of the end users, many users and data modelers try to model and use Cassandra in exactly the same way as an RDBMS schema, and they get into serious deployment issues. How do you prevent such pitfalls? At the outset, Cassandra may look like a traditional RDBMS data store, but the fact is that it is not the same. The key here is to understand the differences from a theoretical perspective as well as a practical perspective, and to follow the best practices prescribed by the creators of Cassandra.

Tip

In Cassandra, the terms "column family" and "table" are synonymous. The Cassandra Query Language (CQL) command syntax uses the term "table."

Why can Cassandra be used along with other RDBMS? The answer lies in the limitations of RDBMS. Some of the obvious ones are cost savings, the need to scale out, handling high-volume traffic, complex queries slowing down response times, increasingly complex data types, and so on. The most important aspect of the need for Cassandra to coexist with legacy RDBMS is that you need to preserve the investments already made and make sure that the current applications keep working without any problems. So, you should protect your investments, make your future investments in a smart NoSQL store such as Cassandra, and follow a one-step-at-a-time approach.

A brief overview of Cassandra

Where do you start with Cassandra? The best place is to look at the new application development requirements and take it from there. Look at cases where there is a need to denormalize the RDBMS tables and keep together all the data items that would have been distributed if you were to design the same solution in an RDBMS. If an application writes a set of data items together into a data store, why do you want to separate them out? There is no need to worry about redundancy. This is the new NoSQL philosophy, and the new way to look at data modeling in NoSQL data stores. Cassandra supports fast writes and reads. Initial versions of Cassandra had some performance problems, but a huge number of optimizations have made the latest version of Cassandra perform much better for reads as well as writes. Consuming space is not a problem because secondary storage is getting cheaper and cheaper. A word of caution here: it is fine to write data into Cassandra at whatever level of redundancy, but the data access use cases have to be thought through carefully before settling on the Cassandra data model. The data is stored on disk to be read at a later date, and these reads have to be efficient, returning the required data in the desired sort order.

In a nutshell, you should decide how you want to store the data and make sure that it gives you the data in the desired sort order. There is no hard and fast rule for this; it is purely up to the application requirements. That is the other shift in the thought process.

Instead of thinking from a pure data model perspective, start thinking in terms of the application's perspective: how the data is generated by the application, what the read requirements are, what the write requirements are, what response times are expected from some of the use cases, and so on. Depending on these aspects, design the data model. In the big data world, the application becomes the first-class citizen, and the data model gives up the driving seat in the application design. Design the data model to serve the needs of the applications.
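
As a hedged illustration of designing for the read path, the hypothetical table below stores each customer's orders so that a single-partition read returns them newest first.

-- Partition by customer so all of a customer's orders live together;
-- cluster by order time, descending, so reads come back newest first.
CREATE TABLE IF NOT EXISTS demo.orders_by_customer (
  customer_id text,
  order_ts    timestamp,
  order_id    uuid,
  total       decimal,
  PRIMARY KEY (customer_id, order_ts, order_id)
) WITH CLUSTERING ORDER BY (order_ts DESC, order_id ASC);

-- The read use case drives the model: one partition, already sorted.
SELECT order_ts, total FROM demo.orders_by_customer
  WHERE customer_id = 'C001' LIMIT 10;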

In any organization, new reporting requirements come up all the time. The major challenge in generating reports is the underlying data store. In the RDBMS world, reporting is always a challenge: you may have to join multiple tables to generate even simple reports. Even though RDBMS objects such as views, stored procedures, and indexes may be used to get the desired data for the reports, the query plan at report-generation time is going to be very complex most of the time. The processing power consumed by generating such reports on the fly is another consideration. Because of these complexities, it is common to keep separate tables for reporting, containing data exported from the transactional tables. Martin Fowler emphasizes the need for separating reporting data from operational data in his article, Reporting Database. He states:

"Most Enterprise Applications store persistent data with a database. This database supports operational updates of the application's state, and also various reports used for decision support and analysis. The operational needs and the reporting needs are, however, often quite different - with different requirements from a schema and different data access patterns. When this happens it's often a wise idea to separate the reporting needs into a reporting database, which takes a copy of the essential operational data but represents it in a different schema".

This is a great opportunity to start with NoSQL stores such as Cassandra as a reporting data store.
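
A minimal CQL sketch of that idea, with hypothetical names: operational data is copied into a column family whose shape matches the report rather than the transaction.

-- Reporting column family shaped for one report: monthly sales per region,
-- populated by a batch export from the operational store, not by joins.
CREATE TABLE IF NOT EXISTS demo.sales_report_by_region (
  region  text,
  month   text,      -- for example, '2015-06'
  product text,
  units   int,
  revenue decimal,
  PRIMARY KEY ((region, month), product)
);

-- The whole report becomes a single-partition read with no query-time joins.
SELECT product, units, revenue
  FROM demo.sales_report_by_region
  WHERE region = 'EMEA' AND month = '2015-06';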

Data aggregation and summarization are common requirements in any organization. They help control data growth by storing only summary statistics and moving the transactional data into archives. Often, this aggregated and summarized data is used for statistical analysis. On many websites, you can see a summary of your data instantaneously when you log in to the site or when you perform transactions. Some examples include the available credit limit of credit cards, the available number of text messages, the remaining international call minutes in a mobile phone account, and so on. Making the summary accurate and easily accessible is a big challenge. Most of the time, data aggregation and reporting go hand in hand, and the aggregated data is used heavily in reports. The aggregation process speeds up queries to a great extent. In RDBMS, it is always a challenge to aggregate data, and new requirements keep coming all the time. This is another place where you can start with NoSQL stores such as Cassandra.
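
One hedged way to keep such summaries instantly readable in Cassandra is a counter column family, updated as each transaction arrives; the table and column names here are hypothetical.

-- Running summary per account: incremented on every transaction,
-- so a "remaining minutes" style lookup is a single-row read.
CREATE TABLE IF NOT EXISTS demo.usage_summary (
  account_id   text PRIMARY KEY,
  sms_sent     counter,
  call_minutes counter
);

UPDATE demo.usage_summary
  SET sms_sent = sms_sent + 1, call_minutes = call_minutes + 5
  WHERE account_id = 'A100';

SELECT sms_sent, call_minutes FROM demo.usage_summary
  WHERE account_id = 'A100';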

Now, we are going to discuss some aspects of the denormalization, reporting, and aggregation of data using Cassandra as the preferred NoSQL data store.

Denormalization pattern

Denormalize the data and store it as column families in Cassandra. This is a very common practice in NoSQL data stores, and there are many reasons to do it in Cassandra. The most important is that Cassandra doesn't support joins between column families. Redundancy is acceptable in Cassandra because storage is cheap, and this is all the more relevant because Cassandra runs on commodity hardware, while many RDBMS need much better hardware specifications for optimal performance in production environments. Moreover, read and write operations remain highly efficient even if the column families are huge in terms of the number of columns or rows. In a traditional RDBMS, you can create multiple indexes on a single table on various columns, but in Cassandra, secondary indexes are very costly and affect the performance of reads and writes.
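
As a sketch of the pattern, with hypothetical names: data that an RDBMS would split into customer and order tables and join at read time is stored together, once per query path.

-- Denormalized: customer attributes are repeated on every order row,
-- so the common read needs no join and touches a single partition.
CREATE TABLE IF NOT EXISTS demo.orders_with_customer (
  customer_id    text,
  order_id       timeuuid,
  customer_name  text,      -- redundant copy of customer data;
  customer_email text,      -- acceptable because storage is cheap
  item           text,
  amount         decimal,
  PRIMARY KEY (customer_id, order_id)
);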

Motivations/solutions

In many situations, whenever a new requirement comes in, thinking in the traditional RDBMS way leads to many problems such as poor read/write performance, long-running processes, overly complex queries, and so on. In these situations, one of the best approaches is to apply denormalization principles and design column families in Cassandra.