A beginner's guide to get you up and running with Cassandra, DynamoDB, HBase, InfluxDB, MongoDB, Neo4j, and Redis
This is the golden age of open source NoSQL databases. With enterprises having to work with large amounts of unstructured data and moving away from expensive monolithic architecture, the adoption of NoSQL databases is rapidly increasing. Being familiar with the popular NoSQL databases and knowing how to use them is a must for budding DBAs and developers.
This book introduces you to the different types of NoSQL databases and gets you started with seven of the most popular NoSQL databases used by enterprises today. We start off with a brief overview of what NoSQL databases are, followed by an explanation of why and when to use them. The book then covers seven of the most popular databases across these categories: MongoDB, Amazon DynamoDB, Redis, HBase, Cassandra, InfluxDB, and Neo4j. The book doesn't go into too much detail about each database, but teaches you enough to get started with them.
By the end of this book, you will have a thorough understanding of the different NoSQL databases and their functionalities, empowering you to select and use the right database according to your needs.
If you are a budding DBA or a developer who wants to get started with the fundamentals of NoSQL databases, this book is for you. Relational DBAs who want to get insights into the various offerings of popular NoSQL databases will also find this book to be very useful.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Prachi Bisht
Content Development Editor: Eisha Dsouza
Technical Editor: Nirbhaya Shaji
Copy Editors: Laxmi Subramanian and Safis Editing
Project Coordinator: Kinjal Bari
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Jisha Chirayil
Production Coordinator: Aparna Bhagat
First published: March 2018
Production reference: 1270318
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78728-886-7
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Aaron Ploetz is the NoSQL Engineering Lead for Target, where his DevOps team supports Cassandra, MongoDB, Redis, and Neo4j. He has been named a DataStax MVP for Apache Cassandra three times, and has presented at multiple events, including the DataStax Summit and Data Day Texas. He earned a BS in Management/Computer Systems from the University of Wisconsin-Whitewater, and a MS in Software Engineering from Regis University. He and his wife, Coriene, live with their three children in the Twin Cities area.
Devram Kandhare has 4 years of experience working with the SQL database MySQL and the NoSQL databases MongoDB and DynamoDB. He has worked as a database designer and developer, has delivered various projects using the Agile development model, and is experienced in building web-based applications and REST APIs.
Sudarshan Kadambi has a background in distributed systems and database design. He has been a user and contributor to various NoSQL databases and is passionate about solving large-scale data management challenges.
Xun (Brian) Wu has more than 15 years of experience in web/mobile development, big data analytics, cloud computing, blockchain, and IT architecture. He holds a master's degree in computer science from NJIT. He is always enthusiastic about exploring new ideas, technologies, and opportunities that arise. He has previously reviewed more than 40 books from Packt Publishing.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Seven NoSQL Databases in a Week
Dedication
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the authors
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to NoSQL Databases
Consistency versus availability
ACID guarantees
Hash versus range partition
In-place updates versus appends
Row versus column versus column-family storage models
Strongly versus loosely enforced schemas
Summary
MongoDB
Installing MongoDB
MongoDB data types
The MongoDB database
MongoDB collections
MongoDB documents
The create operation
The read operation
Applying filters on fields
Applying conditional and logical operators on the filter parameter
The update operation
The delete operation
Data models in MongoDB
The references document data model
The embedded data model
Introduction to MongoDB indexing
The default _id index
Replication
Replication in MongoDB
Automatic failover in replication
Read operations
Sharding
Sharded clusters
Advantages of sharding
Storing large data in MongoDB
Summary
Neo4j
What is Neo4j?
How does Neo4j work?
Features of Neo4j
Clustering
Neo4j Browser
Cache sharding
Help for beginners
Evaluating your use case
Social networks
Matchmaking
Network management
Analytics
Recommendation engines
Neo4j anti-patterns
Applying relational modeling techniques in Neo4j
Using Neo4j for the first time on something mission-critical
Storing entities and relationships within entities
Improper use of relationship types
Storing binary large object data
Indexing everything
Neo4j hardware selection, installation, and configuration
Random access memory
CPU
Disk
Operating system
Network/firewall
Installation
Installing JVM
Configuration
High-availability clustering
Causal clustering
Using Neo4j
Neo4j Browser
Cypher
Python
Java
Taking a backup with Neo4j
Backup/restore with Neo4j Enterprise
Backup/restore with Neo4j Community
Differences between the Neo4j Community and Enterprise Editions
Tips for success
Summary
References 
Redis
Introduction to Redis
What are the key features of Redis?
Performance
Tunable data durability
Publish/Subscribe
Useful data types
Expiring data over time
Counters
Server-side Lua scripting
Appropriate use cases for Redis
Data fits into RAM
Data durability is not a concern
Data at scale
Simple data model
Features of Redis matching part of your use case
Data modeling and application design with Redis
Taking advantage of Redis' data structures
Queues
Sets
Notifications
Counters
Caching
Redis anti-patterns
Dataset cannot fit into RAM
Modeling relational data
Improper connection management
Security
Using the KEYS command
Unnecessary trips over the network
Not disabling THP
Redis setup, installation, and configuration
Virtualization versus on-the-metal
RAM
CPU
Disk
Operating system
Network/firewall
Installation
Configuration files
Using Redis
redis-cli
Lua
Python
Java
Taking a backup with Redis
Restoring from a backup
Tips for success
Summary
References
Cassandra
Introduction to Cassandra
What problems does Cassandra solve?
What are the key features of Cassandra?
No single point of failure
Tunable consistency
Data center awareness
Linear scalability
Built on the JVM
Appropriate use cases for Cassandra
Overview of the internals
Data modeling in Cassandra
Partition keys
Clustering keys
Putting it all together
Optimal use cases
Cassandra anti-patterns
Frequently updated data
Frequently deleted data
Queues or queue-like data
Solutions requiring query flexibility
Solutions requiring full table scans
Incorrect use of BATCH statements
Using Byte Ordered Partitioner
Using a load balancer in front of Cassandra nodes
Using a framework driver
Cassandra hardware selection, installation, and configuration
RAM
CPU
Disk
Operating system
Network/firewall
Installation using apt-get
Tarball installation
JVM installation
Node configuration
Running Cassandra
Adding a new node to the cluster
Using Cassandra
Nodetool
CQLSH
Python
Java
Taking a backup with Cassandra
Restoring from a snapshot
Tips for success
Run Cassandra on Linux
Open ports 7199, 7000, 7001, and 9042
Enable security
Use solid state drives (SSDs) if possible
Configure only one or two seed nodes per data center
Schedule weekly repairs
Do not force a major compaction
Remember that every mutation is a write
The data model is key
Consider a support contract
Cassandra is not a general purpose database
Summary
References
HBase
Architecture
Components in the HBase stack
Zookeeper
HDFS
HBase master
HBase RegionServers
Reads and writes
The HBase write path
HBase writes – design motivation
The HBase read path
HBase compactions
System trade-offs
Logical and physical data models
Interacting with HBase – the HBase shell
Interacting with HBase – the HBase Client API
Interacting with secure HBase clusters
Advanced topics
HBase high availability
Replicated reads
HBase in multiple regions
HBase coprocessors
SQL over HBase
Summary
DynamoDB
The difference between SQL and DynamoDB
Setting up DynamoDB
Setting up locally
Setting up using AWS
The difference between downloadable DynamoDB and DynamoDB web services
DynamoDB data types and terminology
Tables, items, and attributes
Primary key
Secondary indexes
Streams
Queries
Scan
Data types
Data models and CRUD operations in DynamoDB
Limitations of DynamoDB
Best practices
Summary
InfluxDB
Introduction to InfluxDB
Key concepts and terms of InfluxDB
Data model and storage engine
Storage engine
Installation and configuration
Installing InfluxDB
Configuring InfluxDB
Production deployment considerations
Query language and API
Query language
Query pagination
Query performance optimizations
Interaction via Rest API
InfluxDB API client
InfluxDB with Java client
InfluxDB with a Python client
InfluxDB with Go client
InfluxDB ecosystem
Telegraf
Telegraf data management
Kapacitor
InfluxDB operations
Backup and restore
Backups
Restore
Clustering and HA
Retention policy
Monitoring
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
The book will help you understand the fundamentals of each database, and understand how their functionalities differ, while still giving you a common result – a database solution with speed, high performance, and accuracy.
If you are a budding DBA or a developer who wants to get started with the fundamentals of NoSQL databases, this book is for you. Relational DBAs who want to get insights into the various offerings of the popular NoSQL databases will also find this book to be very useful.
Chapter 1, Introduction to NoSQL Databases, introduces the topic of NoSQL and distributed databases. The design principles and trade-offs involved in NoSQL database design are described. These design principles provide context around why individual databases covered in the following chapters are designed in a particular way and what constraints they are trying to optimize for.
Chapter 2, MongoDB, covers installation and basic CRUD operations. High-level concepts such as indexing allow you to speed up database operations, sharding, and replication. Also, it covers data models, which help us with application database design.
Chapter 3, Neo4j, introduces the Neo4j graph database. It discusses Neo4j's architecture, use cases, administration, and application development.
Chapter 4, Redis, discusses the Redis data store. Redis’ unique architecture and behavior will be discussed, as well as installation, application development, and server-side scripting with Lua.
Chapter 5, Cassandra, introduces the Cassandra database. Cassandra’s highly-available, eventually consistent design will be discussed along with the appropriate use cases. Known anti-patterns will also be presented, as well as production-level configuration, administration, and application development.
Chapter 6, HBase, introduces HBase, that is, the Hadoop Database. Inspired by Google's Bigtable, HBase is a widely deployed key-value store today. This chapter covers HBase's architectural internals, data model, and API.
Chapter 7, DynamoDB, covers how to set up a local and AWS DynamoDB service and perform CRUD operations. It also covers how to deal with partition keys, sort keys, and secondary indexes. It covers various advantages and disadvantages of DynamoDB over other databases, which makes it easy for developers to choose a database for an application.
Chapter 8, InfluxDB, describes InfluxDB and its key concepts and terms. It also covers InfluxDB installation and configuration. It explores the query language and APIs. It helps you set up Telegraf and Kapacitor as an InfluxDB ecosystem's key components to collect and process data. At the end of the chapter, you will also find information about InfluxDB operations.
This book assumes that you have access to hardware on which you can install, configure, and code against a database instance. Having elevated admin or sudo privileges on the aforementioned machine will be essential to carrying out some of the tasks described.
Some of the NoSQL databases discussed will only run on a Linux-based operating system. Therefore, prior familiarity with Linux is recommended. As OS-specific system administration is not within the scope of this book, readers who are new to Linux may find value in seeking out a separate tutorial prior to attempting some of the examples.
The Java Virtual Machine (JVM)-based NoSQL databases will require a Java Runtime Environment (JRE) to be installed. Do note that some of them may require a specific version of the JRE to function properly. This will necessitate updating or installing a new JRE, depending on the database.
The Java coding examples will be easier to do from within an Integrated Development Environment (IDE), with Maven installed for dependency management. You may need to look up additional resources to ensure that these components are configured properly.
In Chapter 6, HBase, you can install the Hortonworks sandbox to get a small HBase cluster set up on your laptop. The sandbox can be installed for free from https://hortonworks.com/products/sandbox/.
In Chapter 8, InfluxDB, to run the examples you will need to install InfluxDB in a UNIX or Linux environment. In order to run different InfluxDB API client examples, you also need to install a programming language environment and related InfluxDB client packages:
Run the InfluxDB Java client: Install JDK and an editor (Eclipse or IntelliJ).
Run the InfluxDB Python client: Install Python.
Run the InfluxDB Go client: Install Go and the InfluxDB Go client; you can use JetBrains Goland to run the Go code.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Seven-NoSQL-Databases-in-a-Week. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/SevenNoSQLDatabasesinaWeek_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Now is also a good time to change the initial password. Neo4j installs with a single default admin username and password of neo4j/neo4j."
A block of code is set as follows:
# Paths of directories in the installation.
#dbms.directories.data=data
#dbms.directories.plugins=plugins
#dbms.directories.certificates=certificates
#dbms.directories.logs=logs
#dbms.directories.lib=lib
#dbms.directories.run=run
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
# Paths of directories in the installation.
#dbms.directories.data=data
#dbms.directories.plugins=plugins
#dbms.directories.certificates=certificates
#dbms.directories.logs=logs
#dbms.directories.lib=lib
#dbms.directories.run=run
Any command-line input or output is written as follows:
sudo mkdir /local
sudo chown $USER:$USER /local
cd /local
mv ~/Downloads/neo4j-community-3.3.3-unix.tar.gz .
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "To create a table, click on the Create table button. This will take you to the Create table screen."
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
Over the last decade, the volume and velocity with which data is generated within organizations has grown exponentially. Consequently, there has been an explosion of database technologies that have been developed to address these growing data needs. These databases have typically had distributed implementations, since the volume of data being managed far exceeds the storage capacity of a single node. In order to support the massive scale of data, these databases have provided fewer of the features that we've come to expect from relational databases.
The first generation of these so-called NoSQL databases only provided rudimentary key-value get/put APIs. They were largely schema-free and didn't require well-defined types to be associated with the values being stored in the database. Over the last decade, however, a number of features that we've come to expect from standard databases—such as type systems and SQL, secondary indices, materialized views, and some kind of concept of transactions—have come to be incorporated and overlaid over those rudimentary key-value interfaces.
Today, there are hundreds of NoSQL databases available in the world, with a few popular ones, such as MongoDB, HBase, and Cassandra, having the lion's share of the market, followed by a long list of other, less popular databases.
These databases have different data models, ranging from the document model of MongoDB, to the column-family model of HBase and Cassandra, to the columnar model of Kudu. These databases are widely deployed in hundreds of organizations and at this point are considered mainstream and commonplace.
This book covers some of the most popular and widely deployed NoSQL databases. Each chapter covers a different NoSQL database, how it is architected, how to model your data, and how to interact with the database. Before we jump into each of the NoSQL databases covered in this book, let's look at some of the design choices that should be considered when one is setting out to build a distributed database.
Knowing about some of these database principles will give us insight into why different databases have been designed with different architectural choices in mind, based on the use cases and workloads they were originally designed for.
A database's consistency refers to how reliably it reflects its write operations. A consistent system is one in which reads return the value of the last write, and reads at a given point in time return the same value regardless of where they were initiated.
NoSQL databases support a range of consistency models, such as the following:
Strong consistency: A system that is strongly consistent ensures that updates to a given key are ordered and reads reflect the latest update that has been accepted by the system.
Timeline consistency: A system that is timeline consistent ensures that updates to a given key are applied in the same order on all the replicas, but reads at a given replica might be stale and may not reflect the latest update that has been accepted by the system.
Eventual consistency: A system that is eventually consistent makes no guarantees about whether updates will be applied in order on all the replicas, nor does it make guarantees about when a read will reflect a prior update accepted by the system.
A database's availability refers to the system's ability to complete a certain operation. Like consistency, availability is a spectrum. A system can be unavailable for writes while being available for reads. A system can be unavailable for admin operations while being available for data operations.
As is well known at this point, there's tension between consistency and availability. A system that is highly available needs to allow operations to succeed even if some nodes in the system are unreachable (either dead or partitioned off by the network). However, since it is unknown as to whether those nodes are still alive and are reachable by some clients or are dead and reachable by no one, there are no guarantees about whether those operations left the system in a consistent state or not.
So, a system that guarantees consistency must make sure that all of the nodes that contain data for a given key must be reachable and participate in the operation. The degenerate case is that a single node is responsible for operations on a given key. Since there is just a single node, there is no chance of inconsistency of the sort we've been discussing. The downside is that when a node goes down, there is a complete loss of availability for operations on that key.
Relational databases have provided the traditional properties of ACID: atomicity, consistency, isolation, and durability:
Atomicity is self-explanatory and refers to the all-or-nothing nature of a set of operations.
Consistency in ACID and consistency in the CAP theorem refer to different things. Consistency in ACID refers to the principle that the system must be left in a consistent state when processing transactions: it either reflects the state after the successful completion of the transaction, or rolls back to the state prior to the start of the transaction.
Isolation refers to the interaction effects between transactions: under what conditions is the state modified by one transaction visible to other active transactions in the system? It ranges from weak isolation levels, such as read-committed, all the way to linearizable.
Durability indicates that once a transaction has committed, the effects of the transaction remain despite events such as errors and crashes.
NoSQL databases vary widely in their support for these guarantees, with most of them not approaching the level of strong guarantees provided by relational databases (since these are hard to support in a distributed setting).
Once you've decided to distribute data, how should the data be distributed?
Firstly, data needs to be distributed using a partitioning key in the data. The partitioning key can be the primary key or any other unique key. Once you've identified the partitioning key, you need to decide how to assign a key to a given shard.
One way to do this would be to take a key and apply a hash function. Based on the hash bucket and the number of shards to map keys into, the key would be assigned to a shard. There's a bit of nuance here in the sense that an assignment scheme based on a modulo by the number of nodes currently in the cluster will result in a lot of data movement when nodes join or leave the cluster (since all of the assignments need to be recalculated). This is addressed by something called consistent hashing, a detailed description of which is outside the scope of this chapter.
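As a minimal Python sketch (not from the book; the md5-based hash and the shard counts are illustrative assumptions), the churn problem with naive modulo assignment can be seen directly:

import hashlib

def shard_for(key, num_shards):
    # Hash the key and map the digest onto one of num_shards buckets.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % num_shards

keys = ["user:%d" % i for i in range(100)]
before = {k: shard_for(k, 4) for k in keys}   # cluster with 4 nodes
after = {k: shard_for(k, 5) for k in keys}    # a fifth node joins
moved = [k for k in keys if before[k] != after[k]]
print("%d of %d keys must move" % (len(moved), len(keys)))

Most keys land on a different shard when the node count changes, which is exactly the data movement that consistent hashing is designed to avoid.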
Another way to do assignments would be to take the entire keyspace and break it up into a set of ranges. Each range corresponds to a shard and is assigned to a given node in the cluster. Given a key, you would then do a binary search to find out the node it is meant to be assigned to. A range partition doesn't have the churn issue that a naive hashing scheme would have. When a node joins, shards from existing nodes will migrate onto the new node. When a node leaves, the shards on the node will migrate to one or more of the existing nodes.
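A hedged sketch of the range-based lookup, again in Python with made-up range boundaries and node names:

import bisect

# Sorted, exclusive upper bounds of each shard's key range; "~" acts as a catch-all.
range_upper_bounds = ["g", "n", "t", "~"]
shard_nodes = ["node1", "node2", "node3", "node4"]

def shard_for(key):
    # Binary search for the first range whose upper bound lies past the key.
    return shard_nodes[bisect.bisect_right(range_upper_bounds, key)]

print(shard_for("melon"))  # node2, since "g" <= "melon" < "n"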
What impact do the hash and range partitions have on the system design? A hash-based assignment can be built in a decentralized manner, where all nodes are peers of each other and there are no special master-slave relationships between nodes. Ceph and Cassandra both do hash-based partition assignment.
On the other hand, a range-based partitioning scheme requires that range assignments be kept in some special service. Hence, databases that do range-based partitioning, such as Bigtable and HBase, tend to be centralized rather than peer-to-peer, with nodes that have special roles and responsibilities.
Another key difference between database systems is how they handle updates to the physical records stored on disk.
Relational databases, such as MySQL, maintain a variety of structures in memory and on disk, where writes from in-flight transactions and writes from completed transactions are persisted. Once a transaction has been committed, the physical record on disk for a given key is updated to reflect that. On the other hand, many NoSQL databases, such as HBase and Cassandra, are variants of what is called a log-structured merge (LSM) database.
In such an LSM database, updates aren't applied to the record at transaction commit. Instead, updates are applied in memory. Once the memory structure gets full, its contents are flushed to disk. This means that updates to a single record can be fragmented across separate flush files created over time. A read for that record must then fetch the record's fragments from the different flush files and merge them in reverse time order to construct the latest snapshot of the given record. We will discuss the mechanics of how an LSM database works in Chapter 6, HBase.
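The following is a toy Python sketch of that read-side merge (an illustration only, not any database's actual implementation; real LSM stores also handle tombstones and compaction):

# Flush files for record "user:1", oldest first; each holds a fragment of the record.
flush_files = [
    {"name": "Ada", "city": "Oslo"},     # flushed at t1
    {"city": "Bergen"},                  # flushed at t2
    {"email": "ada@example.com"},        # flushed at t3
]

# Merge in reverse time order: the newest value seen for a column wins.
snapshot = {}
for fragment in reversed(flush_files):
    for column, value in fragment.items():
        snapshot.setdefault(column, value)

print(snapshot)  # {'email': 'ada@example.com', 'city': 'Bergen', 'name': 'Ada'}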
When you have a logical table with a bunch of rows and columns, there are multiple ways in which they can be stored physically on a disk.
You can store the contents of entire rows together so that all of the columns of a given row would be stored together. This works really well if the access pattern accesses a lot of the columns for a given set of rows. MySQL uses such a row-oriented storage model.
On the other hand, you could store the contents of entire columns together. In this scheme, all of the values from all of the rows for a given column are stored together. This is optimized for analytic use cases where you might need to scan through the entire table for a small set of columns. Storing data as column vectors allows for better compression (since there is less entropy between values within a column than between values across a row). Also, these column vectors can be retrieved from disk and processed quickly in a vectorized fashion through the SIMD capabilities of modern processors. SIMD processing on column vectors can approach throughputs of a billion data points per second on a personal laptop.
Hybrid schemes are possible as well. Rather than storing an entire column vector together, it is possible to first break up all of the rows in a table into distinct row groups, and then, within a row group, you could store all of the column vectors together. Parquet and ORC use such a data placement strategy.
Another variant is that data is stored row-wise, but the rows are divided into row groups such that a row group is assigned to a shard. Within a row group, groups of columns that are often queried together, called column families, are then stored physically together on the disk. This storage model is used by HBase and is discussed in more detail in Chapter 6, HBase.
Databases can decide up-front how prescriptive they want to be about specifying a schema for the data.
When NoSQL databases came to the fore a decade ago, a key selling point was that they didn't require a schema. The schema could be encoded and enforced in the application rather than in the database. Schemas were thought to be a hindrance in dealing with all of the semi-structured data being produced in the modern enterprise. Because the early NoSQL systems didn't have a type system and didn't enforce that all rows in a table share the same structure, they didn't enforce much at all.
However, today, most of these NoSQL databases have acquired an SQL interface, and most have acquired a rich type system. One reason for this has been the realization that SQL is widely known and reduces the onboarding friction of working with a new database. Getting started is easier with an SQL interface than it is with an obscure key-value API. More importantly, having a type system frees application developers from having to remember how a particular value was encoded and decode it appropriately.
Hence, Cassandra deprecated the Thrift API and made CQL the default. HBase still doesn't support SQL access natively, but use of HBase is increasingly pivoting towards SQL interfaces over HBase, such as Phoenix.
In this chapter, we introduced the notion of a NoSQL database and considered some of the principles that go into the design of such a database. We now understand that there are many trade-offs to be considered in database design based on the specific use cases and types of workloads the database is being designed for. In the following chapters, we are going to be looking in detail at seven popular NoSQL databases. We will look at their architecture, data, and query models, as well as some practical tips on how you can get started using these databases, if they are a fit for your use case.
MongoDB is an open source, document-oriented, cross-platform database, written primarily in C++. It is the leading NoSQL database, ranking fifth in popularity among all databases, just behind PostgreSQL. It provides high performance, high availability, and easy scalability. MongoDB stores JSON-like documents with dynamic schemas. MongoDB, developed by MongoDB Inc., is free to use. It is published under a combination of the GNU Affero General Public License and the Apache License.
Let's go through the MongoDB features:
Rich query support: We can query the database much as we do with SQL databases. A large set of query operations supports insert, update, delete, and select. MongoDB supports field queries, range queries, and regular expressions. Queries also support projection, where they return only the values of specific keys.
Indexing: MongoDB supports primary and secondary indexes on document fields.
Replication: Replication means keeping more than one copy of the data. MongoDB maintains multiple copies of the data across multiple servers, providing fault tolerance: if one database server goes down, the application uses another.
Load balancing: Replica sets provide multiple copies of data, and MongoDB can scale read operations by directing client requests to secondary nodes. This divides the load across multiple servers.
File storage: We can store documents of up to 16 MB directly as MongoDB JSON documents. For files exceeding the 16 MB size limit, MongoDB provides GridFS, which stores them in chunks.
Aggregation: Aggregate functions take a number of records and calculate single results, such as sum, min, and max. MongoDB provides a multi-stage aggregation pipeline that moves data through a series of stages, which improves performance on large datasets.
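As a minimal sketch of such a pipeline (the orders collection and its fields here are hypothetical), two stages can filter and then summarize documents:

db.orders.aggregate([
    { $match: { status: "complete" } },                             // stage 1: filter
    { $group: { _id: "$customerId", total: { $sum: "$amount" } } }  // stage 2: aggregate
])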
You can download the latest version of MongoDB here: https://www.mongodb.com/download-center#community. Follow the setup instructions to install it.
Once MongoDB is installed on your Windows PC, you have to create the following directory:
Data directory C:\data\db
Once you have successfully installed MongoDB, you will find the mongod server and mongo shell executables in the installation's bin directory.
We have to start a mongod instance to begin working with MongoDB. To start it, execute mongod from the command prompt:
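For example, assuming the data directory created earlier (on Windows, mongod looks in C:\data\db by default, so the flag is optional):

mongod --dbpath C:\data\db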
Once mongod has started, we connect to this instance using the mongo shell client:
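With mongod listening on its default port of 27017, running the client with no arguments is sufficient:

mongo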
Once we are connected to the database, we can start working on the database operations.
Documents in MongoDB are JSON-like objects. JSON is a simple representation of data. It supports the following data types:
null: The null data type is used to represent the null value, as well as a value that does not exist.
boolean: The boolean type is used to represent true and false values.
number: By default, the MongoDB shell supports 64-bit floating-point numbers. To handle integer and long values, MongoDB provides NumberInt and NumberLong, which represent 4-byte and 8-byte integers, respectively.
string: The string data type represents a collection of characters. MongoDB supports UTF-8 character encoding by default.
date: MongoDB stores dates as the number of milliseconds since the epoch; time zone information is not saved. After inserting a date this way, querying the document with find returns the date wrapped in an ISODate object.
array: A set or list of values is represented as an array; multiple JSON objects can also form an array of elements, for example, an array of city values, as shown in the combined example below.
Embedded document: A MongoDB document can itself be stored as the value of a field, allowing documents to be nested. For example, we can store address fields as an array of embedded address documents instead of creating a separate collection of addresses.
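As a combined illustration (the users collection and all field values here are hypothetical), a single document can mix these types:

db.users.insert({
    "name": "Alice",                                  // string
    "age": NumberInt(30),                             // 4-byte integer
    "score": 99.5,                                    // 64-bit floating point
    "active": true,                                   // boolean
    "middleName": null,                               // null
    "createdAt": new Date(),                          // date (milliseconds since the epoch)
    "cities": ["Pune", "Mumbai"],                     // array
    "address": { "city": "Pune", "zip": "411001" }    // embedded document
})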
Data is stored in a database in the form of collections. The database is a container for collections, just as in SQL databases the database is a container for tables.
To create a database in MongoDB, we use the following command:
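In the mongo shell, that command is use (the database is created lazily, once the first document is stored in it):

use sample_db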
This command creates a database called sample_db, which can be used as a container for storing collections.
The default database for mongo is test. If we do not specify a database before storing our collection, MongoDB will store the collection in the test database.
Each database has its own set of files on the filesystem. A MongoDB server can have multiple databases. We can see the list of all the databases using the following command:
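In the mongo shell:

show dbs

Note that a newly created database will not appear in this list until at least one document has been stored in it.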
The collection is a container for MongoDB documents. It is equivalent to an SQL table, which stores data in rows. A collection should only store related documents. For example, the user_profiles collection should only store data related to user profiles. It should not contain a user's friend list, as this is not part of a user's profile; instead, this should fall under the users_friend collection.
To create a new collection, you can use the following command:
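In the mongo shell, this is:

db.createCollection("users_profile")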
Here, db represents the database in which we are storing a collection and users_profile is the new collection we are creating.
Documents in a collection should have a similar or related purpose. A database cannot have multiple collections with the same name; collection names are unique within a given database.