A beginner's guide to get you up and running with Cassandra, DynamoDB, HBase, InfluxDB, MongoDB, Neo4j, and Redis
This is the golden age of open source NoSQL databases. With enterprises having to work with large amounts of unstructured data and moving away from expensive monolithic architecture, the adoption of NoSQL databases is rapidly increasing. Being familiar with the popular NoSQL databases and knowing how to use them is a must for budding DBAs and developers.
This book introduces you to the different types of NoSQL databases and gets you started with seven of the most popular NoSQL databases used by enterprises today. We start off with a brief overview of what NoSQL databases are, followed by an explanation of why and when to use them. The book then covers seven of the most popular databases across these categories: MongoDB, Amazon DynamoDB, Redis, HBase, Cassandra, InfluxDB, and Neo4j. The book doesn't go into too much detail about each database, but teaches you enough to get started with them.
By the end of this book, you will have a thorough understanding of the different NoSQL databases and their functionalities, empowering you to select and use the right database according to your needs.
If you are a budding DBA or a developer who wants to get started with the fundamentals of NoSQL databases, this book is for you. Relational DBAs who want to get insights into the various offerings of popular NoSQL databases will also find this book to be very useful.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Prachi Bisht
Content Development Editor: Eisha Dsouza
Technical Editor: Nirbhaya Shaji
Copy Editors: Laxmi Subramanian and Safis Editing
Project Coordinator: Kinjal Bari
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Jisha Chirayil
Production Coordinator: Aparna Bhagat
First published: March 2018
Production reference: 1270318
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78728-886-7
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Aaron Ploetz is the NoSQL Engineering Lead for Target, where his DevOps team supports Cassandra, MongoDB, Redis, and Neo4j. He has been named a DataStax MVP for Apache Cassandra three times, and has presented at multiple events, including the DataStax Summit and Data Day Texas. He earned a BS in Management/Computer Systems from the University of Wisconsin-Whitewater, and a MS in Software Engineering from Regis University. He and his wife, Coriene, live with their three children in the Twin Cities area.
Devram Kandhare has 4 years of experience working with the SQL database MySQL and the NoSQL databases MongoDB and DynamoDB. He has worked as a database designer and developer, has delivered various projects using the Agile development model, and is experienced in building web-based applications and REST APIs.
Sudarshan Kadambi has a background in distributed systems and database design. He has been a user and contributor to various NoSQL databases and is passionate about solving large-scale data management challenges.
Xun (Brian) Wu has more than 15 years of experience in web/mobile development, big data analytics, cloud computing, blockchain, and IT architecture. He holds a master's degree in computer science from NJIT. He is always enthusiastic about exploring new ideas, technologies, and opportunities that arise. He has previously reviewed more than 40 books from Packt Publishing.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Seven NoSQL Databases in a Week
Dedication
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the authors
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to NoSQL Databases
Consistency versus availability
ACID guarantees
Hash versus range partition
In-place updates versus appends
Row versus column versus column-family storage models
Strongly versus loosely enforced schemas
Summary
MongoDB
Installing MongoDB
MongoDB data types
The MongoDB database
MongoDB collections
MongoDB documents
The create operation
The read operation
Applying filters on fields
Applying conditional and logical operators on the filter parameter
The update operation
The delete operation
Data models in MongoDB
The references document data model
The embedded data model
Introduction to MongoDB indexing
The default _id index
Replication
Replication in MongoDB
Automatic failover in replication
Read operations
Sharding
Sharded clusters
Advantages of sharding
Storing large data in MongoDB
Summary
Neo4j
What is Neo4j?
How does Neo4j work?
Features of Neo4j
Clustering
Neo4j Browser
Cache sharding
Help for beginners
Evaluating your use case
Social networks
Matchmaking
Network management
Analytics
Recommendation engines
Neo4j anti-patterns
Applying relational modeling techniques in Neo4j
Using Neo4j for the first time on something mission-critical
Storing entities and relationships within entities
Improper use of relationship types
Storing binary large object data
Indexing everything
Neo4j hardware selection, installation, and configuration
Random access memory
CPU
Disk
Operating system
Network/firewall
Installation
Installing JVM
Configuration
High-availability clustering
Causal clustering
Using Neo4j
Neo4j Browser
Cypher
Python
Java
Taking a backup with Neo4j
Backup/restore with Neo4j Enterprise
Backup/restore with Neo4j Community
Differences between the Neo4j Community and Enterprise Editions
Tips for success
Summary
References 
Redis
Introduction to Redis
What are the key features of Redis?
Performance
Tunable data durability
Publish/Subscribe
Useful data types
Expiring data over time
Counters
Server-side Lua scripting
Appropriate use cases for Redis
Data fits into RAM
Data durability is not a concern
Data at scale
Simple data model
Features of Redis matching part of your use case
Data modeling and application design with Redis
Taking advantage of Redis' data structures
Queues
Sets
Notifications
Counters
Caching
Redis anti-patterns
Dataset cannot fit into RAM
Modeling relational data
Improper connection management
Security
Using the KEYS command
Unnecessary trips over the network
Not disabling THP
Redis setup, installation, and configuration
Virtualization versus on-the-metal
RAM
CPU
Disk
Operating system
Network/firewall
Installation
Configuration files
Using Redis
redis-cli
Lua
Python
Java
Taking a backup with Redis
Restoring from a backup
Tips for success
Summary
References
Cassandra
Introduction to Cassandra
What problems does Cassandra solve?
What are the key features of Cassandra?
No single point of failure
Tunable consistency
Data center awareness
Linear scalability
Built on the JVM
Appropriate use cases for Cassandra
Overview of the internals
Data modeling in Cassandra
Partition keys
Clustering keys
Putting it all together
Optimal use cases
Cassandra anti-patterns
Frequently updated data
Frequently deleted data
Queues or queue-like data
Solutions requiring query flexibility
Solutions requiring full table scans
Incorrect use of BATCH statements
Using Byte Ordered Partitioner
Using a load balancer in front of Cassandra nodes
Using a framework driver
Cassandra hardware selection, installation, and configuration
RAM
CPU
Disk
Operating system
Network/firewall
Installation using apt-get
Tarball installation
JVM installation
Node configuration
Running Cassandra
Adding a new node to the cluster
Using Cassandra
Nodetool
CQLSH
Python
Java
Taking a backup with Cassandra
Restoring from a snapshot
Tips for success
Run Cassandra on Linux
Open ports 7199, 7000, 7001, and 9042
Enable security
Use solid state drives (SSDs) if possible
Configure only one or two seed nodes per data center
Schedule weekly repairs
Do not force a major compaction
Remember that every mutation is a write
The data model is key
Consider a support contract
Cassandra is not a general purpose database
Summary
References
HBase
Architecture
Components in the HBase stack
Zookeeper
HDFS
HBase master
HBase RegionServers
Reads and writes
The HBase write path
HBase writes – design motivation
The HBase read path
HBase compactions
System trade-offs
Logical and physical data models
Interacting with HBase – the HBase shell
Interacting with HBase – the HBase Client API
Interacting with secure HBase clusters
Advanced topics
HBase high availability
Replicated reads
HBase in multiple regions
HBase coprocessors
SQL over HBase
Summary
DynamoDB
The difference between SQL and DynamoDB
Setting up DynamoDB
Setting up locally
Setting up using AWS
The difference between downloadable DynamoDB and DynamoDB web services
DynamoDB data types and terminology
Tables, items, and attributes
Primary key
Secondary indexes
Streams
Queries
Scan
Data types
Data models and CRUD operations in DynamoDB
Limitations of DynamoDB
Best practices
Summary
InfluxDB
Introduction to InfluxDB
Key concepts and terms of InfluxDB
Data model and storage engine
Storage engine
Installation and configuration
Installing InfluxDB
Configuring InfluxDB
Production deployment considerations
Query language and API
Query language
Query pagination
Query performance optimizations
Interaction via Rest API
InfluxDB API client
InfluxDB with Java client
InfluxDB with a Python client
InfluxDB with Go client
InfluxDB ecosystem
Telegraf
Telegraf data management
Kapacitor
InfluxDB operations
Backup and restore
Backups
Restore
Clustering and HA
Retention policy
Monitoring
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
The book will help you understand the fundamentals of each database, and understand how their functionalities differ, while still giving you a common result – a database solution with speed, high performance, and accuracy.
If you are a budding DBA or a developer who wants to get started with the fundamentals of NoSQL databases, this book is for you. Relational DBAs who want to get insights into the various offerings of the popular NoSQL databases will also find this book to be very useful.
Chapter 1, Introduction to NoSQL Databases, introduces the topic of NoSQL and distributed databases. The design principles and trade-offs involved in NoSQL database design are described. These design principles provide context around why individual databases covered in the following chapters are designed in a particular way and what constraints they are trying to optimize for.
Chapter 2, MongoDB, covers installation and basic CRUD operations. High-level concepts such as indexing allow you to speed up database operations, sharding, and replication. Also, it covers data models, which help us with application database design.
Chapter 3, Neo4j, introduces the Neo4j graph database. It discusses Neo4j's architecture, use cases, administration, and application development.
Chapter 4, Redis, discusses the Redis data store. Redis’ unique architecture and behavior will be discussed, as well as installation, application development, and server-side scripting with Lua.
Chapter 5, Cassandra, introduces the Cassandra database. Cassandra’s highly-available, eventually consistent design will be discussed along with the appropriate use cases. Known anti-patterns will also be presented, as well as production-level configuration, administration, and application development.
Chapter 6, HBase, introduces HBase, that is, the Hadoop Database. Inspired by Google's Bigtable, HBase is a widely deployed key-value store today. This chapter covers HBase's architectural internals, data model, and API.
Chapter 7, DynamoDB, covers how to set up a local and AWS DynamoDB service and perform CRUD operations. It also covers how to deal with partition keys, sort keys, and secondary indexes. It covers various advantages and disadvantages of DynamoDB over other databases, which makes it easy for developers to choose a database for an application.
Chapter 8, InfluxDB, describes InfluxDB and its key concepts and terms. It also covers InfluxDB installation and configuration. It explores the query language and APIs. It helps you set up Telegraf and Kapacitor as an InfluxDB ecosystem's key components to collect and process data. At the end of the chapter, you will also find information about InfluxDB operations.
This book assumes that you have access to hardware on which you can install, configure, and code against a database instance. Having elevated admin or sudo privileges on the aforementioned machine will be essential to carrying out some of the tasks described.
Some of the NoSQL databases discussed will only run on a Linux-based operating system. Therefore, prior familiarity with Linux is recommended. As OS-specific system administration is not within the scope of this book, readers who are new to Linux may find value in seeking out a separate tutorial prior to attempting some of the examples.
The Java Virtual Machine (JVM)-based NoSQL databases will require a Java Runtime Environment (JRE) to be installed. Do note that some of them may require a specific version of the JRE to function properly. This will necessitate updating or installing a new JRE, depending on the database.
The Java coding examples will be easier to do from within an Integrated Development Environment (IDE), with Maven installed for dependency management. You may need to look up additional resources to ensure that these components are configured properly.
In Chapter 6, HBase, you can install the Hortonworks sandbox to get a small HBase cluster set up on your laptop. The sandbox can be installed for free from https://hortonworks.com/products/sandbox/.
In Chapter 8, InfluxDB, to run the examples you will need to install InfluxDB in a UNIX or Linux environment. In order to run different InfluxDB API client examples, you also need to install a programming language environment and related InfluxDB client packages:
Run the InfluxDB Java client: Install JDK and an editor (Eclipse or IntelliJ).
Run the InfluxDB Python client: Install Python.
Run the InfluxDB Go client: Install Go and the InfluxDB Go client; you can use JetBrains Goland to run the Go code.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Seven-NoSQL-Databases-in-a-Week. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/SevenNoSQLDatabasesinaWeek_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Now is also a good time to change the initial password. Neo4j installs with a single default admin username and password of neo4j/neo4j."
A block of code is set as follows:
# Paths of directories in the installation.
#dbms.directories.data=data
#dbms.directories.plugins=plugins
#dbms.directories.certificates=certificates
#dbms.directories.logs=logs
#dbms.directories.lib=lib
#dbms.directories.run=run
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
# Paths of directories in the installation.
#dbms.directories.data=data
#dbms.directories.plugins=plugins
#dbms.directories.certificates=certificates
#dbms.directories.logs=logs
#dbms.directories.lib=lib
#dbms.directories.run=run
Any command-line input or output is written as follows:
sudo mkdir /local
sudo chown $USER:$USER /local
cd /local
mv ~/Downloads/neo4j-community-3.3.3-unix.tar.gz .
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "To create a table, click on the Create table button. This will take you to the Create table screen."
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
Over the last decade, the volume and velocity with which data is generated within organizations has grown exponentially. Consequently, there has been an explosion of database technologies that have been developed to address these growing data needs. These databases have typically had distributed implementations, since the volume of data being managed far exceeds the storage capacity of a single node. In order to support the massive scale of data, these databases have provided fewer of the features that we've come to expect from relational databases.
The first generation of these so-called NoSQL databases only provided rudimentary key-value get/put APIs. They were largely schema-free and didn't require well-defined types to be associated with the values being stored in the database. Over the last decade, however, a number of features that we've come to expect from standard databases—such as type systems and SQL, secondary indices, materialized views, and some kind of concept of transactions—have come to be incorporated and overlaid over those rudimentary key-value interfaces.
Today, there are hundreds of NoSQL databases available in the world, with a few popular ones, such as MongoDB, HBase, and Cassandra, having the lion's share of the market, followed by a long list of other, less popular databases.
These databases have different data models, ranging from the document model of MongoDB, to the column-family model of HBase and Cassandra, to the columnar model of Kudu. These databases are widely deployed in hundreds of organizations and at this point are considered mainstream and commonplace.
This book covers some of the most popular and widely deployed NoSQL databases. Each chapter covers a different NoSQL database, how it is architected, how to model your data, and how to interact with the database. Before we jump into each of the NoSQL databases covered in this book, let's look at some of the design choices that should be considered when one is setting out to build a distributed database.
Knowing about some of these database principles will give us insight into why different databases have been designed with different architectural choices in mind, based on the use cases and workloads they were originally designed for.
A database's consistency refers to how reliably it reflects its write operations. A consistent system is one in which reads return the value of the last write, and reads at a given point in time return the same value regardless of where they were initiated.
NoSQL databases support a range of consistency models, such as the following:
Strong consistency: A system that is strongly consistent ensures that updates to a given key are ordered and reads reflect the latest update that has been accepted by the system.
Timeline consistency: A system that is timeline consistent ensures that updates to a given key are applied in the same order on all the replicas, but reads at a given replica might be stale and may not reflect the latest update that has been accepted by the system.
Eventual consistency: A system that is eventually consistent makes no guarantees about whether updates will be applied in order on all the replicas, nor does it make guarantees about when a read will reflect a prior update accepted by the system.
A database's availability refers to the system's ability to complete a certain operation. Like consistency, availability is a spectrum. A system can be unavailable for writes while being available for reads. A system can be unavailable for admin operations while being available for data operations.
As is well known at this point, there's tension between consistency and availability. A system that is highly available needs to allow operations to succeed even if some nodes in the system are unreachable (either dead or partitioned off by the network). However, since it is unknown as to whether those nodes are still alive and are reachable by some clients or are dead and reachable by no one, there are no guarantees about whether those operations left the system in a consistent state or not.
So, a system that guarantees consistency must make sure that all of the nodes that contain data for a given key must be reachable and participate in the operation. The degenerate case is that a single node is responsible for operations on a given key. Since there is just a single node, there is no chance of inconsistency of the sort we've been discussing. The downside is that when a node goes down, there is a complete loss of availability for operations on that key.
Relational databases have provided the traditional properties of ACID: atomicity, consistency, isolation, and durability:
Atomicity is self-explanatory and refers to the all-or-nothing nature of a set of operations.
Consistency in ACID and consistency in the CAP theorem refer to different things. Consistency in ACID refers to the principle that the system must be left in a consistent state when processing transactions: it either reflects the state after the successful completion of the transaction, or rolls back to the state prior to the start of the transaction.
Isolation refers to the interaction effects between transactions: under what conditions is the state modified by one transaction visible to other active transactions in the system? It ranges from weak isolation levels, such as read-committed, all the way to linearizable.
Durability indicates that once a transaction has committed, the effects of the transaction remain despite events such as errors and crashes.
NoSQL databases vary widely in their support for these guarantees, with most of them not approaching the level of strong guarantees provided by relational databases (since these are hard to support in a distributed setting).
Once you've decided to distribute data, how should the data be distributed?
Firstly, data needs to be distributed using a partitioning key in the data. The partitioning key can be the primary key or any other unique key. Once you've identified the partitioning key, you need to decide how to assign a key to a given shard.
One way to do this would be to take a key and apply a hash function. Based on the hash bucket and the number of shards to map keys into, the key would be assigned to a shard. There's a bit of nuance here in the sense that an assignment scheme based on a modulo by the number of nodes currently in the cluster will result in a lot of data movement when nodes join or leave the cluster (since all of the assignments need to be recalculated). This is addressed by something called consistent hashing, a detailed description of which is outside the scope of this chapter.
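As a minimal Python sketch (not from the book; the md5-based hash and the shard counts are illustrative assumptions), the churn problem with naive modulo assignment can be seen directly:

import hashlib

def shard_for(key, num_shards):
    # Hash the key and map the digest onto one of num_shards buckets.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % num_shards

keys = ["user:%d" % i for i in range(100)]
before = {k: shard_for(k, 4) for k in keys}   # cluster with 4 nodes
after = {k: shard_for(k, 5) for k in keys}    # a fifth node joins
moved = [k for k in keys if before[k] != after[k]]
print("%d of %d keys must move" % (len(moved), len(keys)))

Most keys land on a different shard when the node count changes, which is exactly the data movement that consistent hashing is designed to avoid.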
Another way to do assignments would be to take the entire keyspace and break it up into a set of ranges. Each range corresponds to a shard and is assigned to a given node in the cluster. Given a key, you would then do a binary search to find out the node it is meant to be assigned to. A range partition doesn't have the churn issue that a naive hashing scheme would have. When a node joins, shards from existing nodes will migrate onto the new node. When a node leaves, the shards on the node will migrate to one or more of the existing nodes.
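A hedged sketch of the range-based lookup, again in Python with made-up range boundaries and node names:

import bisect

# Sorted, exclusive upper bounds of each shard's key range; "~" acts as a catch-all.
range_upper_bounds = ["g", "n", "t", "~"]
shard_nodes = ["node1", "node2", "node3", "node4"]

def shard_for(key):
    # Binary search for the first range whose upper bound lies past the key.
    return shard_nodes[bisect.bisect_right(range_upper_bounds, key)]

print(shard_for("melon"))  # node2, since "g" <= "melon" < "n"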
What impact do the hash and range partitions have on the system design? A hash-based assignment can be built in a decentralized manner, where all nodes are peers of each other and there are no special master-slave relationships between nodes. Ceph and Cassandra both do hash-based partition assignment.
On the other hand, a range-based partitioning scheme requires that range assignments be kept in some special service. Hence, databases that do range-based partitioning, such as Bigtable and HBase, tend to be centralized rather than peer-to-peer, with nodes that have special roles and responsibilities.
Another key difference between database systems is how they handle updates to the physical records stored on disk.
Relational databases, such as MySQL, maintain a variety of structures in memory and on disk, where writes from in-flight transactions and writes from completed transactions are persisted. Once a transaction has been committed, the physical record on disk for a given key is updated to reflect that. On the other hand, many NoSQL databases, such as HBase and Cassandra, are variants of what is called a log-structured merge (LSM) database.
In such an LSM database, updates aren't applied to the record at transaction commit. Instead, updates are applied in memory. Once the memory structure gets full, its contents are flushed to disk. This means that updates to a single record can be fragmented across separate flush files created over time. A read for that record must then fetch the record's fragments from the different flush files and merge them in reverse time order to construct the latest snapshot of the given record. We will discuss the mechanics of how an LSM database works in Chapter 6, HBase.
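The following is a toy Python sketch of that read-side merge (an illustration only, not any database's actual implementation; real LSM stores also handle tombstones and compaction):

# Flush files for record "user:1", oldest first; each holds a fragment of the record.
flush_files = [
    {"name": "Ada", "city": "Oslo"},     # flushed at t1
    {"city": "Bergen"},                  # flushed at t2
    {"email": "ada@example.com"},        # flushed at t3
]

# Merge in reverse time order: the newest value seen for a column wins.
snapshot = {}
for fragment in reversed(flush_files):
    for column, value in fragment.items():
        snapshot.setdefault(column, value)

print(snapshot)  # {'email': 'ada@example.com', 'city': 'Bergen', 'name': 'Ada'}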
When you have a logical table with a bunch of rows and columns, there are multiple ways in which they can be stored physically on a disk.
You can store the contents of entire rows together so that all of the columns of a given row would be stored together. This works really well if the access pattern accesses a lot of the columns for a given set of rows. MySQL uses such a row-oriented storage model.
On the other hand, you could store the contents of entire columns together. In this scheme, all of the values from all of the rows for a given column are stored together. This is optimized for analytic use cases where you might need to scan through the entire table for a small set of columns. Storing data as column vectors allows for better compression (since there is less entropy between values within a column than between values across a row). Also, these column vectors can be retrieved from disk and processed quickly in a vectorized fashion through the SIMD capabilities of modern processors. SIMD processing on column vectors can approach throughputs of a billion data points per second on a personal laptop.
Hybrid schemes are possible as well. Rather than storing an entire column vector together, it is possible to first break up all of the rows in a table into distinct row groups, and then, within a row group, you could store all of the column vectors together. Parquet and ORC use such a data placement strategy.
Another variant is that data is stored row-wise, but the rows are divided into row groups such that a row group is assigned to a shard. Within a row group, groups of columns that are often queried together, called column families, are then stored physically together on the disk. This storage model is used by HBase and is discussed in more detail in Chapter 6, HBase.
Databases can decide up-front how prescriptive they want to be about specifying a schema for the data.
When NoSQL databases came to the fore a decade ago, a key selling point was that they didn't require a schema. The schema could be encoded and enforced in the application rather than in the database. Schemas were thought to be a hindrance in dealing with all of the semi-structured data being produced in the modern enterprise. Because the early NoSQL systems didn't have a type system and didn't enforce that all rows in a table share the same structure, they didn't enforce much at all.
However, today, most of these NoSQL databases have acquired an SQL interface, and most have acquired a rich type system. One reason for this has been the realization that SQL is widely known and reduces the onboarding friction of working with a new database. Getting started is easier with an SQL interface than it is with an obscure key-value API. More importantly, having a type system frees application developers from having to remember how a particular value was encoded and decode it appropriately.
Hence, Cassandra deprecated the Thrift API and made CQL the default. HBase still doesn't support SQL access natively, but use of HBase is increasingly pivoting towards SQL interfaces over HBase, such as Phoenix.
In this chapter, we introduced the notion of a NoSQL database and considered some of the principles that go into the design of such a database. We now understand that there are many trade-offs to be considered in database design based on the specific use cases and types of workloads the database is being designed for. In the following chapters, we are going to be looking in detail at seven popular NoSQL databases. We will look at their architecture, data, and query models, as well as some practical tips on how you can get started using these databases, if they are a fit for your use case.
MongoDB is an open source, document-oriented, cross-platform database, written primarily in C++. It is the leading NoSQL database, ranking fifth in popularity among all databases, just behind PostgreSQL. It provides high performance, high availability, and easy scalability. MongoDB stores JSON-like documents with dynamic schemas. MongoDB, developed by MongoDB Inc., is free to use. It is published under a combination of the GNU Affero General Public License and the Apache License.
Let's go through the MongoDB features:
Rich query support: We can query the database much as we do with SQL databases. A large set of query operations supports insert, update, delete, and select. MongoDB supports field queries, range queries, and regular expressions. Queries also support projection, where they return only the values of specific keys.
Indexing: MongoDB supports primary and secondary indexes on document fields.
Replication: Replication means keeping more than one copy of the data. MongoDB maintains multiple copies of the data across multiple servers, providing fault tolerance: if one database server goes down, the application uses another.
Load balancing: Replica sets provide multiple copies of data, and MongoDB can scale read operations by directing client requests to secondary nodes. This divides the load across multiple servers.
File storage: We can store documents of up to 16 MB directly as MongoDB JSON documents. For files exceeding the 16 MB size limit, MongoDB provides GridFS, which stores them in chunks.
Aggregation: Aggregate functions take a number of records and calculate single results, such as sum, min, and max. MongoDB provides a multi-stage aggregation pipeline that moves data through a series of stages, which improves performance on large datasets.
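As a minimal sketch of such a pipeline (the orders collection and its fields here are hypothetical), two stages can filter and then summarize documents:

db.orders.aggregate([
    { $match: { status: "complete" } },                             // stage 1: filter
    { $group: { _id: "$customerId", total: { $sum: "$amount" } } }  // stage 2: aggregate
])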
You can download the latest version of MongoDB here: https://www.mongodb.com/download-center#community. Follow the setup instructions to install it.
Once MongoDB is installed on your Windows PC, you have to create the following directory:
Data directory C:\data\db
Once you have successfully installed MongoDB, you will find the mongod server and mongo shell executables in the installation's bin directory.
We have to start a mongod instance to begin working with MongoDB. To start it, execute mongod from the command prompt:
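For example, assuming the data directory created earlier (on Windows, mongod looks in C:\data\db by default, so the flag is optional):

mongod --dbpath C:\data\db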
Once mongod has started, we connect to this instance using the mongo shell client:
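With mongod listening on its default port of 27017, running the client with no arguments is sufficient:

mongo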
Once we are connected to the database, we can start working on the database operations.
Documents in MongoDB are JSON-like objects. JSON is a simple representation of data. It supports the following data types:
null: The null data type is used to represent the null value, as well as a value that does not exist.
boolean: The boolean type is used to represent true and false values.
number: By default, the MongoDB shell supports 64-bit floating-point numbers. To handle integer and long values, MongoDB provides NumberInt and NumberLong, which represent 4-byte and 8-byte integers, respectively.
string: The string data type represents a collection of characters. MongoDB supports UTF-8 character encoding by default.
date: MongoDB stores dates as the number of milliseconds since the epoch; time zone information is not saved. After inserting a date this way, querying the document with find returns the date wrapped in an ISODate object.
array: A set or list of values is represented as an array; multiple JSON objects can also form an array of elements, for example, an array of city values, as shown in the combined example below.
Embedded document: A MongoDB document can itself be stored as the value of a field, allowing documents to be nested. For example, we can store address fields as an array of embedded address documents instead of creating a separate collection of addresses.
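As a combined illustration (the users collection and all field values here are hypothetical), a single document can mix these types:

db.users.insert({
    "name": "Alice",                                  // string
    "age": NumberInt(30),                             // 4-byte integer
    "score": 99.5,                                    // 64-bit floating point
    "active": true,                                   // boolean
    "middleName": null,                               // null
    "createdAt": new Date(),                          // date (milliseconds since the epoch)
    "cities": ["Pune", "Mumbai"],                     // array
    "address": { "city": "Pune", "zip": "411001" }    // embedded document
})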
Data is stored in a database in the form of collections. The database is a container for collections, just as in SQL databases the database is a container for tables.
To create a database in MongoDB, we use the following command:
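In the mongo shell, that command is use (the database is created lazily, once the first document is stored in it):

use sample_db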
This command creates a database called sample_db, which can be used as a container for storing collections.
The default database for mongo is test. If we do not specify a database before storing our collection, MongoDB will store the collection in the test database.
Each database has its own set of files on the filesystem. A MongoDB server can have multiple databases. We can see the list of all the databases using the following command:
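In the mongo shell:

show dbs

Note that a newly created database will not appear in this list until at least one document has been stored in it.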
The collection is a container for MongoDB documents. It is equivalent to an SQL table, which stores data in rows. A collection should only store related documents. For example, the user_profiles collection should only store data related to user profiles. It should not contain a user's friend list, as this is not part of a user's profile; instead, this should fall under the users_friend collection.
To create a new collection, you can use the following command:
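In the mongo shell, this is:

db.createCollection("users_profile")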
Here, db represents the database in which we are storing a collection and users_profile is the new collection we are creating.
Documents in a collection should have a similar or related purpose. A database cannot have multiple collections with the same name; collection names are unique within a given database.