An expert's guide to building fault-tolerant MongoDB applications
Mastering MongoDB is a book for database developers, architects, and administrators who want to learn how to use MongoDB more effectively and productively.
If you have experience in, and are interested in working with, NoSQL databases to build apps and websites, then this book is for you.
MongoDB has grown to become the de facto NoSQL database, with millions of users, from small startups to Fortune 500 companies. Addressing the limitations of SQL schema-based databases, MongoDB pioneered a shift of focus for DevOps and offered sharding and replication maintainable by DevOps teams. The book is based on MongoDB 3.x and covers topics ranging from database querying using the shell, built-in drivers, and popular ODM mappers, to more advanced topics such as sharding, high availability, and integration with big data sources.
You will get an overview of MongoDB and how to play to its strengths, with relevant use cases. After that, you will learn how to query MongoDB effectively and make use of indexes as much as possible. The next part deals with the administration of MongoDB installations on-premises or in the cloud. We deal with database internals in the following section, explaining storage systems and how they can affect performance. The last section of this book deals with replication and MongoDB scaling, along with integration with heterogeneous data sources. By the end of this book, you will be equipped with all the industry skills and knowledge required to become a certified MongoDB developer and administrator.
This book takes a practical, step-by-step approach to explain the concepts of MongoDB. Practical use-cases involving real-world examples are used throughout the book to clearly explain theoretical concepts.
You can read the e-book in the Legimi apps on:
Page count: 381
Year of publication: 2017
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, its dealers, or its distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: November 2017
Production reference: 1151117
ISBN 978-1-78398-260-8
www.packtpub.com
Author
Alex Giamas
Copy Editors
Safis Editing
Reviewers
Juan Tomás Oliva Ramos
Nilap Shah
Project Coordinator
Nidhi Joshi
Commissioning Editor
Amey Varangaonkar
Proofreader
Safis Editing
Acquisition Editor
Vinay Argekar
Indexer
Aishwarya Gangawane
Content Development Editor
Mayur Pawanikar
Graphics
Tania Dutta
Technical Editor
Prasad Ramesh
Production Coordinator
Shantanu Zagade
Alex Giamas is a Senior Software Engineer at the Department for International Trade, UK Government. He has also worked as a consultant for various startups. He is an experienced professional in systems engineering, NoSQL and big data technologies, with experience spanning from co-founding a digital health startup to Fortune 15 companies.
He has been developing with MongoDB since 2009, starting with the early 1.x versions, and has used it in several projects for data storage and analytical processing. He has been developing with Apache Hadoop since 2007, during its incubation.
He has worked with a wide array of NoSQL and big data technologies, building scalable and highly available distributed software systems in C++, Java, Ruby, and Python.
Alex holds an MSc in Information Networking from Carnegie Mellon University and has attended professional courses at Stanford University. He is a graduate of the National Technical University of Athens, Greece, in Electrical and Computer Engineering. He is a MongoDB Certified Developer and a Cloudera Certified Developer for Apache Hadoop and Data Science Essentials.
He has been publishing at InfoQ for the past four years on NoSQL, big data, and data science topics.
Juan Tomás Oliva Ramos is an environmental engineer from the University of Guanajuato, Mexico, with a master's degree in administrative engineering and quality. He has more than 5 years of experience in the management and development of patents, technological innovation projects, and the development of technological solutions through the statistical control of processes.
He has been a teacher of statistics, entrepreneurship, and the technological development of projects since 2011. He became an entrepreneur mentor and started a new department of technology management and entrepreneurship at Instituto Tecnológico Superior de Purisima del Rincon Guanajuato, Mexico.
Juan is an Alfaomega reviewer and has worked on the book Wearable Designs for Smart Watches, Smart TVs and Android Mobile Devices.
Juan has also developed prototypes through programming and automation technologies for the improvement of operations, which have been registered for patents.
Nilap Shah is a lead software consultant with experience across various fields and technologies. He is an expert in .NET, UiPath (robotics), and MongoDB. He is a certified MongoDB developer and DBA. He is a technical writer as well as a technical speaker. He also provides MongoDB corporate training. Currently, Nilap is working as a lead MongoDB consultant and provides solutions with MongoDB (DBA and developer projects). His LinkedIn profile can be found at https://www.linkedin.com/in/nilap-shah-8b6780a/ and you can reach him on WhatsApp at +91-9537047334.
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1783982608.
If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
MongoDB – A Database for the Modern Web
Web history
Web 1.0
Web 2.0
Web 3.0
SQL and NoSQL evolution
MongoDB evolution
Major feature set for versions 1.0 and 1.2
Version 2
Version 3
Version 3+
MongoDB for SQL developers
MongoDB for NoSQL developers
MongoDB key characteristics and use cases
Key characteristics
What is the use case for MongoDB?
MongoDB criticism
MongoDB configuration and best practices
Operational best practices
Schema design best practices
Best practices for write durability
Best practices for replication
Best practices for sharding
Best practices for security
Best practices for AWS
Reference documentation
MongoDB documentation
Packt references
Further reading
Summary
Schema Design and Data Modeling
Relational schema design
MongoDB schema design
Read-write ratio
Data modeling
Data types
Comparing different data types
Date type
ObjectId
Modeling data for atomic operations
Write isolation
Read isolation and consistency
Modeling relationships
One-to-one
One-to-many, many-to-many
Modeling data for keyword searches
Connecting to MongoDB
Connecting using Ruby
Mongoid ODM
Inheritance with Mongoid models
Connecting using Python
PyMODM ODM
Inheritance with PyMODM models
Connecting using PHP
Doctrine ODM
Inheritance with Doctrine
Summary
MongoDB CRUD Operations
CRUD using the shell
Scripting for the mongo shell
Differences between scripting for the mongo shell and using it directly
Batch inserts using the shell
Batch operations using the mongo shell
Administration
fsync
compact
currentOp/killOp
collMod
touch
MapReduce in the mongo shell
MapReduce concurrency
Incremental MapReduce
Troubleshooting MapReduce
Aggregation framework
SQL to aggregation
Aggregation versus MapReduce
Securing the shell
Authentication and authorization
Authorization with MongoDB
Security tips for MongoDB
Encrypting communication using TLS/SSL
Encrypting data
Limiting network exposure
Firewalls and VPNs
Auditing
Use secure configuration options
Authentication with MongoDB
Enterprise Edition
Kerberos authentication
LDAP authentication
Summary
Advanced Querying
MongoDB CRUD operations
CRUD using the Ruby driver
Creating documents
Read
Chaining operations in find()
Nested operations
Update
Delete
Batch operations
CRUD in Mongoid
Read
Scoping queries
Create, update, and delete
CRUD using the Python driver
Create and delete
Finding documents
Updating documents
CRUD using PyMODM
Creating documents
Updating documents
Deleting documents
Querying documents
CRUD using the PHP driver
Create and delete
Bulk write
Read
Update
CRUD using Doctrine
Create, update, and delete
Read
Best practices
Comparison operators
Update operators
Smart querying
Using regular expressions
Query results and cursors
Storage considerations on delete
Summary
Aggregation
Why aggregation?
Aggregation operators
Aggregation stage operators
Expression operators
Expression Boolean operators
Expression comparison operators
Set expression and array operators
Expression date operators
Expression string operators
Expression arithmetic operators
Aggregation accumulators
Conditional expressions
Other operators
Text search
Variable
Literal
Parsing data type
Limitations
Aggregation use case
Summary
Indexing
Index internals
Index types
Single field indexes
Indexing embedded fields
Indexing embedded documents
Background indexes
Compound indexes
Sorting using compound indexes
Reusing compound indexes
Multikey indexes
Special types of index
Text
Hashed
TTL
Partial
Sparse
Unique
Case-insensitive
Geospatial
Building and managing indexes
Forcing index usage
Hint and sparse indexes
Building indexes on replica sets
Managing indexes
Naming indexes
Special considerations
Using indexes efficiently
Measuring performance
Improving performance
Index intersection
References
Summary
Monitoring, Backup, and Security
Monitoring
What should we monitor?
Page faults
Resident memory
Virtual and mapped memory
Working set
Monitoring memory usage in WiredTiger
Tracking page faults
Tracking B-tree misses
I/O wait
Read and write queues
Lock percentage
Background flushes
Tracking free space
Monitoring replication
Oplog size
Working set calculations
Monitoring tools
Hosted tools
Open source tools
Backups
Backup options
Cloud-based solutions
Backups with file system snapshots
Taking a backup of a sharded cluster
Backups using mongodump
Backups by copying raw files
Backups using queueing
EC2 backup and restore
Incremental backups
Security
Authentication
Authorization
User roles
Database administration roles
Cluster administration roles
Backup restore roles
Roles across all databases
Superuser
Network level security
Auditing security
Special cases
Overview
Summary
Storage Engines
Pluggable storage engines
WiredTiger
Document-level locking
Snapshots and checkpoints
Journaling
Data compression
Memory usage
readConcern
WiredTiger collection-level options
WiredTiger performance strategies
WiredTiger B-tree versus LSM indexes
Encrypted
In-memory
MMAPv1
MMAPv1 storage optimization
Mixed usage
Other storage engines
RocksDB
TokuMX
Locking in MongoDB
Lock reporting
Lock yield
Commonly used commands and locks
Commands requiring a database lock
References
Summary
Harnessing Big Data with MongoDB
What is big data?
Big data landscape
Message queuing systems
Apache ActiveMQ
RabbitMQ
Apache Kafka
Data warehousing
Apache Hadoop
Apache Spark
Spark comparison with Hadoop MapReduce
MongoDB as a data warehouse
Big data use case
Kafka setup
Hadoop setup
Steps
Hadoop to MongoDB pipeline
Spark to MongoDB
References
Summary
Replication
Replication
Logical or physical replication
Different high availability types
Architectural overview
How do elections work?
What is the use case for a replica set?
Setting up a replica set
Converting a standalone server to a replica set
Creating a replica set
Read preference
Write concern
Custom write concern
Priority settings for replica set members
Priority zero replica set members
Hidden replica set members
Delayed replica set members
Production considerations
Connecting to a replica set
Replica set administration
How to perform maintenance on replica sets
Resyncing a member of a replica set
Changing the oplog size
Reconfiguring a replica set when we have lost the majority of our servers
Chained replication
Cloud options for a replica set
mLab
MongoDB Atlas
Replica set limitations
Summary
Sharding
Advantages of sharding
Architectural overview
Development, continuous deployment, and staging environments
Planning ahead on sharding
Sharding setup
Choosing the shard key
Changing the shard key
Choosing the correct shard key
Range-based sharding
Hash-based sharding
Coming up with our own key
Location-based data
Sharding administration and monitoring
Balancing data – how to track and keep our data balanced
Chunk administration
Moving chunks
Changing the default chunk size
Jumbo chunks
Merging chunks
Adding and removing shards
Sharding limitations
Querying sharded data
The query router
Find
Sort/limit/skip
Update/remove
Querying using Ruby
Performance comparison with replica sets
Sharding recovery
Mongos
Mongod process
Config server
A shard goes down
The entire cluster goes down
References
Summary
Fault Tolerance and High Availability
Application design
Schema-less doesn't mean schema design-less
Read performance optimization
Consolidating read querying
Defensive coding
Monitoring integrations
Operations
Security
Enabling security by default
Isolating our servers
Checklists
References
Summary
MongoDB has grown to become the de facto NoSQL database with millions of users, from small start-ups to Fortune 500 companies. Addressing the limitations of SQL schema-based databases, MongoDB pioneered a shift of focus for DevOps and offered sharding and replication maintainable by DevOps teams. This book is based on MongoDB 3.x and covers topics ranging from database querying using the shell, built-in drivers, and popular ODM mappers, to more advanced topics such as sharding, high availability, and integration with big data sources.
You will get an overview of MongoDB and how to play to its strengths, with relevant use cases. After that, you will learn how to query MongoDB effectively and make use of indexes as much as possible. The next part deals with the administration of MongoDB installations on-premises or in the cloud. We deal with database internals in the following section, explaining storage systems and how they can affect performance. The last section of this book deals with replication and MongoDB scaling, along with integration with heterogeneous data sources. By the end of this book, you will be equipped with all the industry skills and knowledge required to become a certified MongoDB developer and administrator.
Chapter 1, MongoDB – A Database for the Modern Web, takes us on a journey through web, SQL, and NoSQL technologies from inception to current state.
Chapter 2, Schema Design and Data Modeling, teaches schema design for relational databases and MongoDB, and how we can achieve the same goal starting from a different point.
Chapter 3, MongoDB CRUD Operations, gives a bird's-eye view of CRUD operations.
Chapter 4, Advanced Querying, covers advanced querying concepts using Ruby, Python, and PHP, using both the official drivers and an ODM.
Chapter 5, Aggregation, dives deep into the aggregation framework. We also discuss why and when we should use aggregation, as opposed to MapReduce and querying the database.
Chapter 6, Indexing, explores one of the most important properties of every database, which is indexing.
Chapter 7, Monitoring, Backup, and Security, discusses the operational aspects of MongoDB. Monitoring, backup, and security should not be an afterthought but rather a necessary process before deploying MongoDB in a production environment.
Chapter 8, Storage Engines, teaches about different storage engines in MongoDB. We identify the pros and cons of each one and the use cases for choosing each storage engine.
Chapter 9, Harnessing Big Data with MongoDB, shows more about how MongoDB fits into the wider big data landscape and ecosystem.
Chapter 10, Replication, discusses replica sets and how to administer them. Starting from an architectural overview of replica sets and replica set internals around elections, we dive deep into setting up and configuring a replica set.
Chapter 11, Sharding, explores sharding, one of the most interesting features of MongoDB. We start from an architectural overview of sharding and move on to how we can design a shard, and especially choose the right shard key.
Chapter 12, Fault Tolerance and High Availability, covers the information that we didn't manage to fit into the previous chapters and places extra emphasis on some topics already discussed.
You will need the following software to be able to smoothly sail through the chapters:
MongoDB version 3+
Apache Kafka 1
Apache Spark 2+
Apache Hadoop 2+
Mastering MongoDB 3.x is a book for database developers, architects, and administrators who want to learn how to use MongoDB more effectively and productively. If you have experience in, and are interested in working with, NoSQL databases to build apps and websites, then this book is for you.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "In a sharded environment, each mongod applies its own locks, thus greatly improving concurrency."
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
> db.types.find().sort({a:-1})
{ "_id" : ObjectId("5908d59d55454e2de6519c4a"), "a" : [ 2, 5 ] }
{ "_id" : ObjectId("5908d58455454e2de6519c49"), "a" : [ 1, 2, 3 ] }
Any command-line input or output is written as follows:
> db.types.insert({"a":4})
WriteResult({ "nInserted" : 1 })
New terms and important words are shown in bold.
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:
1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer over the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-MongoDB-3x and https://github.com/agiamas/mastering-mongodb. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
In this chapter, we will lay the foundations for understanding MongoDB and how it is a database designed for the modern web. We will cover the following topics:
The web, SQL, and MongoDB's history and evolution.
MongoDB from the perspective of SQL and other NoSQL technology users.
MongoDB's common use cases and why they matter.
Configuration best practices:
Operational
Schema design
Write durability
Replication
Sharding
Security
AWS
Learning to learn. Nowadays, learning how to learn is as important as learning in the first place. We will go through references that have the most up-to-date information about MongoDB for both new and experienced users.
In March 1989, more than 28 years ago, Sir Tim Berners-Lee unveiled his vision for what would later be named the World Wide Web (WWW) in a document called Information Management: A Proposal (http://info.cern.ch/Proposal.html). Since then, the WWW has grown to be a tool of information, communication, and entertainment for more than two of every five people on our planet.
The first version of the WWW relied exclusively on web pages and hyperlinks between them, a concept retained to the present day. It was mostly read-only, with limited support for interaction between the user and the web page. Brick-and-mortar companies used it to put up their informational pages. Finding websites could only be done using hierarchical directories such as Yahoo! and DMOZ. The web was meant to be an information portal.
This, while not being Sir Tim Berners-Lee's vision, allowed media outlets such as the BBC and CNN to create a digital presence and start pushing out information to users. It revolutionized information access, as everyone in the world could get first-hand access to quality information at the same time.
Web 1.0 was totally device and software independent, allowing every device to access all information. Resources were identified by address (the website's URL), and the open HTTP methods (GET, POST, PUT, DELETE) could be used to access content resources.
Hyper Text Markup Language (HTML) was used to develop websites that served static content. There was no notion of Cascading Style Sheets (CSS); positioning of elements in a page could only be controlled using tables, and framesets were used extensively to embed information in pages.
This proved to be severely limiting, and so browser vendors back then started adding custom HTML tags such as <blink> and <marquee>, which led to the first browser wars, with rivals Microsoft (Internet Explorer) and Netscape racing to extend HTML's functionality. Web 1.0 reached 45 million users by 1996.
Here is the Lycos start page as it appeared in Web 1.0 (http://www.lycos.com/):
Here is the Yahoo! start page as it appeared in Web 1.0 (http://www.yahoo.com):
Web 2.0, a term first defined and formulated by Tim O'Reilly, describes our current WWW sites and services. Its main characteristic is that the web moved from a read-only to a read-write state. Websites evolved into services, and human collaboration plays an ever more important part in Web 2.0.
From simple information portals, we now have many more types of services such as:
Audio
BlogPod
Blogging
Bookmarking
Calendars
Chat
Collaboration
Communication
Community
CRM
E-commerce
E-learning
Filesharing
Forums
Games
Images
Knowledge
Mapping
Mashups
Multimedia
Portals
RSS
Wikis
Web 2.0 reached more than 1 billion users in 2006 and 3.77 billion users at the time of writing this book (late 2017). Building communities was the differentiating factor for Web 2.0, allowing internet users to connect over common interests, communicate, and share information.
Personalization plays an important part in Web 2.0, with many websites offering tailored content to their users. Recommendation algorithms and human curation decide the content shown to each user.
Browsers can support more and more desktop-like applications by using Adobe Flash and Asynchronous JavaScript and XML (AJAX) technologies. Most desktop applications have web counterparts that either supplement or have completely replaced the desktop versions. The most notable examples are office productivity suites (Google Docs, Microsoft Office 365), digital design (Sketch), and image editing and manipulation (Google Photos, Adobe Creative Cloud).
Moving from websites to web applications also unveiled the era of Service Oriented Architecture (SOA). Applications can interconnect with each other, exposing data through Application Programming Interfaces (APIs), allowing developers to build more complex applications on top of existing application layers.
One of the application categories that defined Web 2.0 is social apps. Facebook, with 1.86 billion monthly active users at the end of 2016, is the most well-known example. We use social networks, and many web applications include social aspects that allow us to communicate with peers and extend our social circles.
It's not here yet, but Web 3.0 is expected to bring Semantic Web capabilities. Advanced as Web 2.0 applications may seem, they still rely mostly on structured information. We use the same concept of searching for keywords and matching these keywords with web content, without much understanding of the context and content of the page or the intention behind the user's request. Also called the Web of Data, Web 3.0 will rely on inter-machine communication and algorithms to provide rich interaction via diverse human-computer interfaces.
Structured Query Language (SQL) existed even before the WWW. Dr. E. F. Codd originally published the paper A Relational Model of Data for Large Shared Data Banks in June 1970 in the Association for Computing Machinery (ACM) journal Communications of the ACM. SQL was initially developed at IBM by Chamberlin and Boyce in 1974. Relational Software (now Oracle Corporation) was the first to develop a commercially available implementation of SQL, targeted at United States government agencies.
The first American National Standards Institute (ANSI) SQL standard came out in 1986 and since then there have been eight revisions with the most recent being published in 2016 (SQL:2016).
SQL was not particularly popular at the start of the WWW. Static content could just be hardcoded into the HTML page without much fuss. However, as the functionality of websites grew, webmasters wanted web page content to be driven by offline data sources, so that content could change over time without redeploying code.
Common Gateway Interface (CGI) scripts written in Perl or Unix shell drove early database-driven websites in Web 1.0. With Web 2.0, the web evolved from directly injecting SQL results into the browser to using two- and three-tier architectures that separated views from business and model logic, allowing SQL queries to be modular and isolated from the rest of the web application.
Not only SQL (NoSQL), on the other hand, is much more modern, having risen alongside Web 2.0 technologies. The term was first coined by Carlo Strozzi in 1998 for his open source database, which did not follow the SQL standard but was still relational.
This is not what we currently expect from a NoSQL database. Johan Oskarsson, a developer at Last.fm at the time, reintroduced the term in early 2009 to group a set of distributed, non-relational data stores that were being developed. Many of them were based on Google's Bigtable and MapReduce papers, or on Amazon's Dynamo, a highly available key-value storage system.
NoSQL's foundations grew upon relaxed ACID (atomicity, consistency, isolation, durability) guarantees in favor of performance, scalability, flexibility, and reduced complexity. Most NoSQL databases have gone one way or another in providing as many of these qualities as possible, even offering tunable guarantees to the developer.
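To make the idea of tunable guarantees concrete, here is a minimal Python sketch, a toy model rather than MongoDB's actual implementation (the function and data structures are invented for illustration), of a write path where the caller chooses how many replica acknowledgements to wait for, trading durability against latency:

```python
# Toy model of a tunable write concern: the caller picks how many replica
# acknowledgements ("w") a write must collect before it is reported as
# successful. Higher w means stronger durability but higher latency.

def acknowledged_write(replicas, document, w):
    """Apply `document` to replicas; succeed once `w` of them confirm."""
    if w > len(replicas):
        raise ValueError("w cannot exceed the number of replicas")
    acks = 0
    for replica in replicas:
        replica.append(document)  # simulate replicating to this node
        acks += 1
        if acks >= w:
            # Enough nodes confirmed; in a real system the remaining
            # replicas would catch up asynchronously.
            return True
    return False

# Three empty "replicas"; wait for a majority (2 of 3) to acknowledge.
replicas = [[], [], []]
ok = acknowledged_write(replicas, {"a": 4}, w=2)
print(ok)  # True
```

In MongoDB, this trade-off surfaces as the write concern's w value, which the book covers in the replication chapter.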
10gen started developing a cloud computing stack in 2007 and soon realized that the most important innovation was centered around the document-oriented database that they had built to power it, MongoDB. MongoDB was initially released on August 27, 2009.
Version 1 of MongoDB was pretty basic in terms of features, authorization, and ACID guarantees and made up for these shortcomings with performance and flexibility.
In the following sections, we can see the major features along with the version number with which they were introduced.
Document-based model
Global lock (process level)
Indexes on collections
CRUD operations on documents
No authentication (authentication was handled at the server level)
Master/slave replication
MapReduce (introduced in v1.2)
Stored JavaScript functions (introduced in v1.2)
Background index creation (since v.1.4)
Sharding (since v.1.6)
More query operators (since v.1.6)
Journaling (since v.1.8)
Sparse and covered indexes (since v.1.8)
Compact command to reduce disk usage
More efficient memory usage
Concurrency improvements
Index performance enhancements
Replica sets are now more configurable and data center aware
MapReduce improvements
Authentication (since 2.0 for sharding and most database commands)
Geospatial features introduced
Aggregation framework (since v2.2) and enhancements (since v2.6)
TTL collections (since v2.2)
Concurrency improvements, including DB-level locking (since v2.2)
Text search (since v2.4) and integration (since v2.6)
Hashed indexes (since v2.4)
Security enhancements, role-based access (since v2.4)
V8 JavaScript engine instead of SpiderMonkey (since v2.4)
Query engine improvements (since v2.6)
Pluggable storage engine API
WiredTiger storage engine introduced, with document-level locking, while the previous storage engine (now called MMAPv1) supports collection-level locking
Replication and sharding enhancements (since v3.2)
Document validation (since v3.2)
Enhanced aggregation framework operations (since v3.2)
Multiple storage engines (since v3.2, only in Enterprise Edition)
As one can observe, version 1 was pretty basic, whereas version 2 introduced most of the features present in the current version such as sharding, usable and special indexes, geospatial features, and memory and concurrency improvements.
On the way from version 2 to version 3, the aggregation framework was introduced, mainly as a supplement to the ageing MapReduce framework (which was never up to par with dedicated frameworks such as Hadoop). Text search was then added, and performance, stability, and security were slowly but surely improved to adapt to the increasing enterprise loads of customers using MongoDB.
With WiredTiger's introduction in version 3, locking became much less of an issue for MongoDB, as its granularity was brought down from the process level (global lock) to the document level, close to the most granular level possible.
At its current state, MongoDB is a database that can handle loads ranging from startup MVPs and POCs to enterprise applications with hundreds of servers.
As MongoDB has grown from being a niche database solution to the Swiss Army knife of NoSQL technologies, more developers are coming to it from a NoSQL background as well.
Setting the SQL-to-NoSQL differences aside, users coming from columnar databases face the most challenges. With Cassandra and HBase being the most popular column-oriented database management systems, we will examine the differences between them and MongoDB, and how a developer can migrate a system to MongoDB.
Flexibility
: MongoDB's notion of documents that can contain sub-documents nested in complex hierarchies is really expressive and flexible. This advantage is similar to the one MongoDB holds over SQL, with the added benefit that documents map more easily to plain old objects in any programming language, allowing for easy development and maintenance.
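To make this concrete, here is a minimal sketch of the document model using a plain JavaScript object; in the mongo shell the same structure would be passed directly to an insert call. The field names used here are illustrative, not taken from any particular application.

```javascript
// A single MongoDB document can nest sub-documents and arrays
// to arbitrary depth -- no JOINs or link tables are needed.
const user = {
  name: "Alice",
  address: {                      // nested sub-document
    city: "Athens",
    geo: { lat: 37.98, lng: 23.72 }
  },
  orders: [                       // array of sub-documents
    { sku: "A1", qty: 2 },
    { sku: "B7", qty: 1 }
  ]
};

// Property access maps naturally to MongoDB's dot notation,
// e.g. the query filter { "address.city": "Athens" }.
console.log(user.address.city);   // "Athens"
console.log(user.orders.length);  // 2
```

Because the document mirrors the in-memory object one-to-one, no object-relational mapping layer is strictly required to persist it.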
Flexible query model
: A user can selectively index some parts of each document; query based on attribute values, regular expressions, or ranges; and have as many properties per object as the application layer needs. Primary and secondary indexes, as well as special types of indexes such as sparse ones, can help greatly with query efficiency. Using a JavaScript shell with MapReduce makes it really easy for most developers, and many data analysts, to quickly take a look at data and gain valuable insights.
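The range and regular-expression queries described above can be sketched in plain JavaScript, with the equivalent shell syntax shown in comments; the collection and field names are illustrative assumptions, not from the book. Note that the third document has no `tags` property at all, which is exactly the situation a sparse index is designed for.

```javascript
// In-memory stand-in for a collection, to illustrate the query
// model without a live MongoDB instance.
const products = [
  { name: "mug",    price: 8,  tags: ["kitchen"] },
  { name: "mixer",  price: 45, tags: ["kitchen", "electric"] },
  { name: "mirror", price: 30 }   // no "tags" property at all
];

// Shell equivalent: db.products.find({ price: { $gte: 10, $lte: 50 } })
const midRange = products.filter(p => p.price >= 10 && p.price <= 50);

// Shell equivalent: db.products.find({ name: /^mi/ })
const miNames = products.filter(p => /^mi/.test(p.name));

console.log(midRange.map(p => p.name)); // ["mixer", "mirror"]
console.log(miNames.map(p => p.name));  // ["mixer", "mirror"]
```

A sparse index on `tags` would index only the first two documents, keeping the index small when a property is present in just a subset of the collection.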
Native aggregation
: The aggregation framework provides an ETL pipeline for users to extract and transform data from MongoDB and either load it in a new format or export it from MongoDB to other data sources. It can also help data analysts and scientists get the slice of data they need, performing data wrangling along the way.
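The pipeline idea can be illustrated with a two-stage example, simulated here in plain JavaScript; the real shell invocation appears in the comment, and the collection, field names, and values are illustrative assumptions.

```javascript
// Shell equivalent (names are illustrative):
//   db.orders.aggregate([
//     { $match: { status: "done" } },
//     { $group: { _id: "$sku", total: { $sum: "$qty" } } }
//   ])
const orders = [
  { sku: "A1", qty: 2, status: "done" },
  { sku: "A1", qty: 3, status: "done" },
  { sku: "B7", qty: 1, status: "open" }
];

// Stage 1: $match -- keep only completed orders.
const matched = orders.filter(o => o.status === "done");

// Stage 2: $group -- sum quantities per SKU.
const totals = {};
for (const o of matched) {
  totals[o.sku] = (totals[o.sku] || 0) + o.qty;
}

console.log(totals); // { A1: 5 }
```

Each stage consumes the output of the previous one, which is what lets the transformation run inside the database instead of in a separate ETL tool.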
Schemaless model
: This is a result of MongoDB's design philosophy of giving applications the power and responsibility to interpret the different properties found in a collection's documents. In contrast to Cassandra's or HBase's schema-based approach, MongoDB lets a developer store and process dynamically generated attributes.
In this section, we will analyze MongoDB's characteristics as a database. Understanding the features that MongoDB provides can help developers and architects evaluate the requirement at hand and how MongoDB can help fulfill it. Also, we will go through some common use cases from MongoDB Inc's experience that have delivered the best results for its users.
MongoDB has grown to a general purpose NoSQL database, offering the best of both RDBMS and NoSQL worlds. Some of the key characteristics are:
It's a general purpose database. In contrast with other NoSQL databases that are built for purpose (for example, graph databases), MongoDB can serve heterogeneous loads and multiple purposes within an application.
Flexible schema design. Document oriented approaches with non-defined attributes that can be modified on the fly is a key contrast between MongoDB and relational databases.
It's built with high availability from the ground up. In our era of five nines in availability, this has to be a given. Coupled with automatic failover on detection of a server failure, this can help achieve high uptime.
Feature rich. Offering the full range of SQL equivalent operators along with features such as MapReduce, aggregation framework, TTL/capped collections, and secondary indexing, MongoDB can fit many use cases, no matter how diverse the requirements are.
Scalability and load balancing. It's built to scale vertically and, more importantly, horizontally. Using sharding, an architect can distribute load between different instances and achieve both read and write scalability. Data balancing happens automatically and transparently to the user, thanks to the shard balancer.
Aggregation framework. Having an extract transform load framework built in the database means that a developer can perform most of the ETL logic before the data leaves the database, eliminating in many cases the need for complex data pipelines.
Native replication. Data will get replicated across a replica set without complicated setup.
Security features. Both authentication and authorization are taken into account so that an architect can secure her MongoDB instances.
JSON (stored internally as BSON, Binary JSON) objects for storing and transmitting documents. JSON is widely used across the web for frontend and API communication, and as such it's easier when the database speaks the same language.
MapReduce. Even though the MapReduce engine isn't as advanced as it is in dedicated frameworks, it is nonetheless a great tool for building data pipelines.
Querying and geospatial information in 2D and 3D. This may not be critical for many applications, but if it is for your use case then it's really convenient to be able to use the same database for geospatial calculations along with data storage.
MongoDB being a hugely popular NoSQL database means that there are several use cases where it has succeeded in supporting quality applications with a great time to market.
Many of its most successful use cases center around the following areas:
Integration of siloed data, providing a single view of it
Internet of Things
Mobile applications
Real-time analytics
Personalization
Catalog management
Content management
All these success stories share some common characteristics. We will try to break them down in order of relative importance.
Schema flexibility is probably the most important characteristic. Being able to store documents with different properties inside the same collection helps both during the development phase and when ingesting data from heterogeneous sources that may or may not share the same properties. In contrast with an RDBMS, where columns need to be predefined and sparse data can be penalized, in MongoDB this is the norm, and it's a feature that most use cases share. Being able to deeply nest attributes in documents and add arrays of values to attributes, all while being able to search and index these fields, helps application developers exploit the schema-less nature of MongoDB.
Scaling and sharding are the most common patterns in MongoDB use cases. Easily scaling using built-in sharding, and using replica sets for data replication and for offloading read load from primary servers, helps developers store data effectively.
Many use cases also use MongoDB as a way of archiving data. Used as a pure data store and not having the need to define schemas, it's fairly easy to dump data into MongoDB, only to be analyzed at a later date by business analysts either using the shell or some of the numerous BI tools that can integrate easily with MongoDB. Breaking data down further based on time caps or document count can help serve these datasets from RAM, the use case where MongoDB is most effective.
Related to this, keeping datasets in RAM is another common pattern. In most versions up to the most recent ones, MongoDB uses memory-mapped storage (called MMAPv1), which delegates data mapping to the underlying operating system. This means that on most GNU/Linux-based systems, working with collections that fit in RAM will dramatically increase performance. This is less of an issue with the introduction of pluggable storage engines such as WiredTiger; more on that in Chapter 8, Storage Engines.
Capped collections are also a feature used in many use cases. A capped collection restricts the documents it holds by count or by the overall size of the collection. In the latter case, we need an estimate of the size per document to calculate how many documents will fit our target size. Capped collections are a quick and dirty solution for requests like "Give me the last hour's overview of the logs" without any need for maintenance or for running asynchronous background jobs to clean the collection. Oftentimes, they are used to quickly build and operate a queuing system. Instead of deploying and maintaining a dedicated queuing system such as ActiveMQ, a developer can use a collection to store messages and then use MongoDB's native tailable cursors to iterate through results as they pile up and feed an external system.
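The count-capped behaviour described above can be sketched in a few lines of plain JavaScript; the real collection would be created with the shell command shown in the comment, with the name and sizes being illustrative.

```javascript
// Shell equivalent (names and sizes illustrative):
//   db.createCollection("log", { capped: true, size: 1e6, max: 3 })
// A capped collection keeps insertion order and evicts the oldest
// documents once the cap is exceeded.
class CappedCollection {
  constructor(max) { this.max = max; this.docs = []; }
  insert(doc) {
    this.docs.push(doc);
    if (this.docs.length > this.max) this.docs.shift(); // drop oldest
  }
}

const log = new CappedCollection(3);
["a", "b", "c", "d"].forEach(msg => log.insert({ msg }));

// Only the three most recent documents remain, in insertion order --
// which is exactly why capped collections suit "last N events" queries.
console.log(log.docs.map(d => d.msg)); // ["b", "c", "d"]
```

A tailable cursor on such a collection behaves like `tail -f` on a log file: the consumer blocks waiting for new documents instead of polling.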
Low operational overhead is also a common pattern in use cases. Developers working in agile teams can operate and maintain clusters of MongoDB servers without the need for a dedicated DBA. MongoDB Management Service can greatly help in reducing administrative overhead, whereas MongoDB Atlas, the hosted solution by MongoDB Inc., means that developers don't need to deal with operational headaches.
In terms of business sectors using MongoDB, there is a huge variety, coming from almost all industries. Where there seems to be greater penetration, though, is in cases that deal with lots of data of relatively low business value per data point. Fields like IoT can benefit the most by exploiting MongoDB's availability-over-consistency design, storing lots of data from sensors in a cost-efficient way. Financial services, on the other hand, often have absolutely stringent consistency requirements, aligned with proper ACID characteristics, that make MongoDB more of a challenge to adopt. A transaction carrying financial data can be a few bytes yet have an impact of millions of dollars, hence all the safety nets around transmitting this type of information correctly.
Location-based data is also a field where MongoDB has thrived, Foursquare being one of its most prominent early clients. MongoDB offers quite a rich set of features around 2D and 3D geolocation data, such as searching by distance, geofencing, and intersection between geographical areas.
Overall, the rich feature set is the common pattern across different use cases. By providing features that can be used in many different industries and applications, MongoDB can be a unified solution for all business needs, offering users the ability to minimize operational overhead and at the same time iterate quickly in product development.
MongoDB has had its fair share of criticism over the years. The web-scale proposition has been met with skepticism by many developers. The counter-argument is that scale is not needed most of the time and we should focus on other design considerations. While this may be true on several occasions, it's a false dichotomy; in an ideal world we would have both, and MongoDB is as close as it gets to combining scalability with rich features and ease of use/time to market.
