Expert techniques to run high-volume and fault-tolerant database solutions using MongoDB 6.x
Alex Giamas
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Heramb Bhavsar
Senior Editor: Tazeen Shaikh
Content Development Editor: Joseph Sunil
Technical Editor: Sweety Pagaria
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Alishon Mendonca
Marketing Coordinator: Nivedita Singh
First published: November 2017
Second edition: March 2019
Third Edition: September 2022
Production reference: 1180822
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80324-386-3
www.packt.com
In memory of my dearest mother Evi (1954-2020) and my highly esteemed father-in-law, Christos (1950-2022) who gave me the guidance and strength to carry on. You will always be in my mind, in my heart and live forever through our memories.
To my son Dimitris, with the wish that he will grow up in a better world than ours. I will always have your back, stand next to you and be your greatest fan. Never doubt my love and only be the best that you can, for yourself and your loved ones. Make the world a better place, please.
Alex Giamas is a freelance consultant and a hands-on Lead Technical and Data Architect. Over the past 15 years, he has gained expertise in designing and developing systems for UK Government (HMRC, Cabinet Office, DIT) and private sector (Amazon ProServe, PwC, Fintech Fortune 500, Yahoo!, Verizon) clients. Alex is an alumnus of the MassChallenge London cohort as the co-founder and CTO of a digital health startup. Alex has authored Mastering MongoDB 3.x and 4.x, both by Packt Publishing. Alex has developed large-scale robust, distributed software systems in Python, JavaScript, Ruby, and Java. He is a MongoDB Certified Developer, a Cloudera Hadoop Certified Developer with Data Science Essentials, and a Carnegie Mellon and Stanford graduate.
I would like to thank my wife Mary for her support, patience, and understanding all throughout the journey of writing 3 books on MongoDB in the past 5 years. You are the architect of our life together; you always support me by modeling our non-relational and unstructured daily routine data. I wouldn’t have made it without you by my side.
I would like to thank the team at Packt Publishing for their support and understanding when major life and death events got in the way. You people rock!
Amit Phaltankar is a software developer and blogger with more than 13 years’ experience of building lightweight and efficient software components. He specializes in writing web-based applications as well as handling large-scale datasets using traditional SQL, NoSQL, and big data technologies. He has gained professional experience in a wide range of technology stacks and loves learning and adapting to new technology trends. Amit has a huge passion for improving his skill set and loves guiding his peers and contributing to blogs. During the last 6 years, he has used MongoDB effectively in various ways to build faster systems.
Kevin Smith is a Microsoft MVP and has been working with MongoDB since the early releases in 2010, with the first deployment being a 16-shard cluster. He has been a technology enthusiast from a young age and enjoys working on a wide range of technologies. In his day-to-day work he is focused on, but not limited to, .NET and TypeScript, and using AWS and Azure Cloud Services. He is heavily involved in the community, running three community groups: two in the North of England, dotnet York and dotnetsheff, and one virtual hackathon group, MiniHack. He is passionate about helping and sharing knowledge with others, speaking at user groups and conferences, and contributing to the open source community.
MongoDB is the leading non-relational database. This book covers all the major features of MongoDB including the latest version, 6. MongoDB 6.x adds many new features and expands on existing ones, such as aggregation, indexing, replication, sharding, and MongoDB Atlas tools. Some of the MongoDB Atlas tools that you will master include Atlas dedicated clusters and Serverless, Atlas Search, Charts, Realm Application Services/Sync, Compass, Cloud Manager, and Data Lake.
Learning from experience and demonstrating code using realistic use cases, you will master the art of modeling, shaping, and querying your data and become the MongoDB oracle for the business. You will dive deep into broadly used as well as niche areas such as optimizing queries, configuring large-scale clusters, configuring your cluster for high performance and availability, and many more. With this under your belt, you will be proficient in auditing, monitoring, and securing your clusters using a structured and organized approach.
By the end of the book, you will have grasped all the practical understanding needed to design, develop, administer, and scale MongoDB-based database applications both on-premises and in the cloud.
The book is geared towards MongoDB developers and database administrators who wish to learn in depth how to model their data using MongoDB, for both greenfield and existing projects. Some understanding of MongoDB, shell command skills, and basic database design concepts is required to get the most out of the book.
Chapter 1, MongoDB – A Database for the Modern Web, will act as a quick refresher on MongoDB's structure and its key components for businesses. You will learn how the database has evolved over time and how different designs are driven by data modeling.
Chapter 2, Schema Design and Data Modeling, will explain the pros and cons of each data modeling approach (key-value, document-based, and graph, along with the CAP theorem) and help you identify the best route to take in each case. You will learn how to model your data for different use cases, along with the trade-offs of the different designs. Furthermore, you will learn how to configure the drivers for each language to make sure that you are making the most of MongoDB.
Chapter 3, MongoDB CRUD Operations, will showcase the MongoDB shell and its capabilities. This chapter will show you how to perform all the CRUD operations and administration tasks using the shell. You will learn to use the aggregation framework for prototyping and getting quick insights from data, and see where it shines compared to the older MapReduce framework. You will also learn how to use the new shell, mongosh, and how to migrate from the old one. Finally, you will learn how to use the versioned API and the rapid, regular MongoDB release cycle to ensure code sustainability.
Chapter 4, Auditing, explores what auditing is and how it differs from regular application logging. You will learn how to set up auditing on-premises and in the cloud and how to identify irregular activity and single it out. Finally, you will have a case study bringing it all together that can serve as a reference for an end-to-end auditing implementation.
Chapter 5, Advanced Querying, will teach you how to query MongoDB from Ruby, using both an ODM and the driver directly, along with instructions from the PHP and Python perspectives. It will show you how to avoid expensive operations and design queries so that they require the least possible maintenance down the line. You will be able to update and delete documents without impacting the underlying storage, and perform complex queries using regular expressions and arrays. You will also learn how and when to use change streams.
Chapter 6, Multi-Document ACID Transactions, explores the theory behind transactions. How do the different transaction levels compare? When should you use transactions, and what are their drawbacks? You will also configure different concern levels for transactions, and find out what the limitations of multi-document transactions in MongoDB are as of version 6.x.
Chapter 7, Aggregation, acts as a deep dive into the aggregation framework and how it can be a solution instead of complex queries or building data pipelines for ETL in code. You will learn how to use aggregation for a generic use case, as well as for more specific cases, such as window operators and time series. You will also be able to create and update materialized views from an RDBMS perspective.
Chapter 8, Indexing, showcases the different types of indexes and how to use them to improve querying efficiency. You will also learn to troubleshoot slow queries and optimize them, and understand the drawbacks of using too many indexes.
Chapter 9, Monitoring, Backup, and Security, explores monitoring, backups, and deployment for developers. The chapter shows how a developer or DBA can administer a MongoDB server or servers in their own data center or the cloud using MongoDB Atlas. It also shows how you can monitor ongoing operations and keep an eye on the cluster’s health, and how you can comply with GDPR using security updates, new in 6.x. Finally, you will learn how to find the optimal path for deployment and upgrading existing clusters.
Chapter 10, Managing Storage Engines, introduces the concept of storage engines. This chapter explains why they matter and how WiredTiger can help users administer MongoDB better.
Chapter 11, MongoDB Tooling, provides an answer to the following questions: how does tooling in the MongoDB ecosystem work? How can you use Realm, Search, Serverless, and Charts to build applications more quickly and robustly? How can you deploy and administer Atlas Kubernetes Operator to manage resources in Atlas without leaving Kubernetes? How can you use data from multiple devices, including IoT and mobile? How is MongoDB innovative in these use cases?
Chapter 12, Harnessing Big Data with MongoDB, showcases the integration of MongoDB with other data sources in the big data ecosystem: HDFS, message queues, and Kafka. It tells you how to design your MongoDB deployment and where your source of truth and aggregator operations should lie when you store and process datasets. You will also learn to use MongoDB Atlas Data Lake as a warehouse and when you should use it as opposed to AWS Data Lake solutions.
Chapter 13, Mastering Replication, explores replication for MongoDB, along with setting up and administering replica sets. You will learn why we need replication as a concept and which workloads require replication as a first-class concern. You will learn to connect and administer replica sets from different drivers. Finally, you will see the latest updates to replica set setup and administration in MongoDB 6.x.
Chapter 14, Mastering Sharding, shows how to horizontally scale a MongoDB installation and how to set up and administer a sharded cluster. You will also learn how to decide on a sharding strategy and reshard data if the requirements change.
Chapter 15, Fault Tolerance and High Availability, brings together the concepts covered in the previous chapters, along with various tips and tricks for keeping your deployments fault-tolerant and highly available.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
The code in this book has been tested on Windows and runs as intended. In case of any issues, please raise them in the GitHub repository.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Mastering-MongoDB-6.x. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/k275B.
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Fields such as ts (timestamp), users, and rules are included in every audit log by default.”
A block of code is set as follows:
apiVersion: v1
kind: ConfigMap
metadata:
  name: <<any sample name we choose(1)>>
  namespace: mongodb
data:
  projectId: <<Project ID from above>>
  baseUrl: <<BaseURI from above>>

Any command-line input or output is written as follows:
mongod --auditFilter '{ atype: "authenticate", "param.db": "test" }'
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “The project ID is typically a long hex string, for example, "620173c921b1ab3de3e8e610", which we can retrieve from the Organization | Projects page.”
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Mastering MongoDB 6.x, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
This part of the book covers all the basics surrounding MongoDB and will serve as an introduction to the world of MongoDB. We will go through the history of databases and how the need for MongoDB grew out of database evolution. We will learn how to design our database and model our data for efficiency, ease of development, and maintainability.
This part contains the following chapters:
Chapter 1, MongoDB – A Database for the Modern Web

Chapter 2, Schema Design and Data Modeling

In this chapter, we will lay the foundations for understanding MongoDB. We will explore how it is a database designed for the modern web and beyond. Learning is as important as knowing how to learn in the first place. We will go through the references that have the most up-to-date information about MongoDB, for both new and experienced users.
By the end of this chapter, you will have learned where MongoDB is best suited to be used and when it might be sub-optimal to use it. Learning about the evolution of MongoDB and the wider ecosystem will allow you to apply critical thinking when evaluating different database options early on in the software development life cycle.
In this chapter, we will cover the following topics:
- SQL and MongoDB’s history and evolution
- MongoDB from the perspective of SQL and other NoSQL technology users
- MongoDB’s common use cases and why they matter
- MongoDB’s configuration and best practices

To sail smoothly through the chapter, you will need MongoDB version 5 installed or a free tier account in MongoDB Atlas. The code that has been used for all of the chapters in this book can be found at https://github.com/PacktPublishing/Mastering-MongoDB-6.x.
Structured Query Language (SQL) existed even before the World Wide Web (WWW). Dr. E. F. Codd originally published a paper, A Relational Model of Data for Large Shared Data Banks, in June 1970, in the Association for Computing Machinery (ACM) journal, Communications of the ACM. SQL was initially developed at IBM by Chamberlin and Boyce, in 1974. Relational Software (now known as Oracle Corporation) was the first to develop a commercially available implementation of SQL, which was targeted at United States governmental agencies.
The first American National Standards Institute (ANSI) SQL standard came out in 1986. Since then, there have been eight revisions, with the most recent being published in 2016 (SQL:2016).
SQL was not particularly popular at the start of the WWW. Static content could just be hardcoded onto the HTML page without much fuss. However, as the functionality of websites grew, webmasters wanted to generate web page content driven by offline data sources, in order to generate content that could change over time without redeploying code.
Common Gateway Interface (CGI) scripts, written in Perl or as Unix shell scripts, were driving early database-driven websites in Web 1.0. With Web 2.0, the web evolved from directly injecting SQL results into the browser to using two-tier and three-tier architectures that separated views from the business and model logic, allowing for SQL queries to be modular and isolated from the rest of the web application.
On the other hand, Not only SQL (NoSQL) is much more modern, emerging alongside the evolution of the web and rising at the same time as Web 2.0 technologies. The term was first coined by Carlo Strozzi, in 1998, for his open source database that did not follow the SQL standard but was still relational.
This is not what we currently expect from a NoSQL database. Johan Oskarsson, a developer at Last.fm, reintroduced the term in early 2009, in order to group a set of distributed, non-relational data stores that were being developed. Many of them were based on Google’s Bigtable and MapReduce papers or Amazon’s Dynamo paper, which describes a highly available key-value storage system.
NoSQL’s foundations grew upon relaxed atomicity, consistency, isolation, and durability (ACID) properties, trading strict guarantees for performance, scalability, flexibility, and reduced complexity. Most NoSQL databases have gone one way or the other in providing as many of the previously mentioned qualities as possible, even offering adjustable guarantees to the developer. The following diagram describes the evolution of SQL and NoSQL:
Figure 1.1 – Database evolution
In the next section, we will learn more about how MongoDB has evolved over time, from a basic object store to a full-fledged general-purpose database system.
MongoDB Inc’s former name, 10gen Inc., started to develop a cloud computing stack in 2007 and soon realized that the most important innovation was centered around the document-oriented database that they built to power it, which was MongoDB. MongoDB shifted from a Platform as a Service (PaaS) to an open source model and released MongoDB version 1.0 on August 27, 2009.
Version 1 of MongoDB was pretty basic in terms of features, authorization, and ACID guarantees, but it made up for these shortcomings with performance and flexibility.
In the following sections, we will highlight the major features of MongoDB, along with the version numbers with which they were introduced.
The major new features of versions 1.0 and 1.2 are listed as follows:
- A document-based model
- A global lock (process level)
- Indexes on collections
- CRUD operations on documents
- No authentication (authentication was handled at the server level)
- Primary and secondary replication: back then, they were named master and slave, respectively, and were changed to their current names with the SERVER-20608 ticket, in version 4.9.0
- MapReduce (introduced in v1.2)
- Stored JavaScript functions (introduced in v1.2)

The major new features of version 2 are listed as follows:
- Background index creation (since v1.4)
- Sharding (since v1.6)
- More query operators (since v1.6)
- Journaling (since v1.8)
- Sparse and covered indexes (since v1.8)
- Compact commands to reduce disk usage
- More efficient memory usage
- Concurrency improvements
- Index performance enhancements
- Replica sets that are more configurable and data center-aware
- MapReduce improvements
- Authentication (since v2.0, for sharding and most database commands)
- Geospatial features
- The aggregation framework (since v2.2) and enhancements (since v2.6)
- Time-to-Live (TTL) collections (since v2.2)
- Concurrency improvements, among which is DB-level locking (since v2.2)
- Text searching (since v2.4) and integration (since v2.6)
- Hashed indexes (since v2.4)
- Security enhancements and role-based access (since v2.4)
- A V8 JavaScript engine instead of SpiderMonkey (since v2.4)
- Query engine improvements (since v2.6)

The major new features of version 3 are listed as follows:

- A pluggable storage engine API
- The WiredTiger storage engine, with document-level locking, while the previous storage engine (now called MMAPv1) supports collection-level locking
- Replication and sharding enhancements (since v3.2)
- Document validation (since v3.2)
- The aggregation framework’s enhanced operations (since v3.2)
- Multiple storage engines (since v3.2, only in Enterprise Edition)
- Query language and index collation (since v3.4)
- Read-only database views (since v3.4)
- Linearizable read concern (since v3.4)

The major new features of version 4 are listed as follows:
- Multi-document ACID transactions (since v4.0)
- Change streams (since v4.0)
- MongoDB tools (Stitch, Mobile, Sync, and Kubernetes Operator) (since v4.0)
- Retryable writes (since v4.0)
- Distributed transactions (since v4.2)
- Removal of the outdated MMAPv1 storage engine (since v4.2)
- Updating the shard key (since v4.2)
- On-demand materialized views using aggregation pipelines (since v4.2)
- Wildcard indexes (since v4.2)
- Streaming replication in replica sets (since v4.4)
- Hidden indexes (since v4.4)

The major new features of version 5 are listed as follows:
- A quarterly MongoDB release schedule going forward
- Window operators using aggregation pipelines (since v5.0)
- A new MongoDB shell – mongosh (since v5.0)
- Native time series collections (since v5.0)
- Live resharding (since v5.0)
- Versioned APIs (since v5.0)
- Multi-cloud client-side field level encryption (since v5.0)
- Cross-shard joins and graph traversals (since v5.1)

The following diagram shows MongoDB’s evolution over time:
Figure 1.2 – MongoDB’s evolution
As you can see, version 1 was pretty basic, whereas version 2 introduced most of the features present in the current version, such as sharding, usable and spatial indexes, geospatial features, and memory and concurrency improvements.
On the way from version 2 to version 3, the aggregation framework was introduced, mainly as a supplement to the aging MapReduce framework that didn’t keep up to speed with dedicated frameworks, such as Hadoop. Then, text search was added, and slowly but surely, the performance, stability, and security of the framework improved, adapting to the increasing enterprise loads of customers using MongoDB.
With WiredTiger’s introduction in version 3, locking became much less of an issue for MongoDB, as it was brought down from the process (global lock) to the document level, which is almost the most granular level possible.
Version 4 marked a major transition, bridging the SQL and NoSQL world with the introduction of multi-document ACID transactions. This allowed for a wider range of applications to use MongoDB, especially applications that require a strong real-time consistency guarantee. Further, the introduction of change streams allowed for a faster time to market for real-time applications using MongoDB. Additionally, a series of tools have been introduced to facilitate serverless, mobile, and Internet of Things (IoT) development.
With version 5, MongoDB is now a cloud-first database, with MongoDB Atlas offering full customer support for all major and minor releases going forward. In comparison, non-cloud users only get official support for major releases (for example, version 5 and then version 6). This is complemented by the newly released versioned API approach, which future-proofs applications. Live resharding addresses the major risk of choosing the wrong shard key, whereas native time series collections and cross-shard lookups using $lookup and $graphLookup greatly improve analytics capabilities and unlock new use cases. End-to-end encryption and multi-cloud support can help implement systems in industries that have unique regulatory needs and also avoid vendor lock-in. The new mongosh shell is a major improvement over the legacy mongo shell.
Version 6 brings many incremental improvements. Time series collections now support sharding, compression, an extended range of secondary indexes, and updates and deletes (with limitations), making them suitable for production use. The new slot-based query execution engine is used for eligible queries, such as those built around $group and $lookup stages, improving execution time by optimizing query calculations. Finally, queryable encryption and cluster-to-cluster syncing improve the operational and management aspects of MongoDB.
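To make this concrete, the following is a minimal sketch of an aggregation pipeline of the kind eligible for the slot-based engine. The collection and field names (orders, customers, customer_id, amount) are illustrative assumptions, and the pipeline is written as the plain dictionaries a driver would pass to an aggregate() call:

```python
# A sketch of a $lookup + $group aggregation pipeline, the stage types that the
# slot-based execution engine in 6.x can accelerate. All collection and field
# names here ("orders", "customers", "customer_id", "amount") are hypothetical.
pipeline = [
    {
        "$lookup": {                     # join each order with its customer
            "from": "customers",
            "localField": "customer_id",
            "foreignField": "_id",
            "as": "customer",
        }
    },
    {
        "$group": {                      # one output document per customer
            "_id": "$customer_id",
            "total": {"$sum": "$amount"},
        }
    },
]

# In a real deployment, this would be executed as db.orders.aggregate(pipeline).
```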
In its current state, MongoDB is a database that can handle heterogeneous workloads ranging from startup Minimum Viable Product (MVP) and Proof of Concept (PoC) to enterprise applications with hundreds of servers.
MongoDB was developed in the Web 2.0 era. By then, most developers were using SQL or object-relational mapping (ORM) tools from their language of choice to access RDBMS data. As such, these developers needed an easy way to get acquainted with MongoDB from their relational background.
Thankfully, there have been several attempts at making SQL-to-MongoDB cheat sheets that explain the MongoDB terminology in SQL terms.
On a higher level, we have the following:
- Databases and indexes (SQL databases)
- Collections (SQL tables)
- Documents (SQL rows)
- Fields (SQL columns)
- Embedded and linked documents (SQL joins)

Further examples of common operations in SQL and their equivalents in MongoDB are shown in the following table:
Table 1.1 – Common operations in SQL/MongoDB
A few more examples of common operations can be seen at https://s3.amazonaws.com/info-mongodb-com/sql_to_mongo.pdf.
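As a minimal, hedged illustration of this mapping (the table, collection, and field names are hypothetical), here is a simple SQL query alongside the documents a MongoDB driver would send for the equivalent find():

```python
# SQL:     SELECT name, age FROM users WHERE age > 30 ORDER BY age DESC
# MongoDB: db.users.find({"age": {"$gt": 30}}, {"name": 1, "age": 1})
#                  .sort({"age": -1})
# The dictionaries below are what a driver would send for that query; "users",
# "name", and "age" are hypothetical names used only for illustration.
query_filter = {"age": {"$gt": 30}}   # WHERE age > 30
projection = {"name": 1, "age": 1}    # SELECT name, age
sort_spec = [("age", -1)]             # ORDER BY age DESC
```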
Next, we will check out the features that MongoDB has brought for NoSQL developers.
As MongoDB has grown from being a niche database solution to the Swiss Army knife of NoSQL technologies, more developers are also coming to it from a NoSQL background.
Putting the SQL versus NoSQL differences aside, it is the users from the columnar-type databases that face the most challenges. With Cassandra and HBase being the most popular column-oriented database management systems, we will examine the differences between them and how a developer can migrate a system to MongoDB. The different features of MongoDB for NoSQL developers are listed as follows:
- Flexibility: MongoDB’s notion of documents that can contain sub-documents nested in complex hierarchies is really expressive and flexible. This is similar to the comparison between MongoDB and SQL, with the added benefit that MongoDB can more easily map to plain old objects from any programming language, allowing for easy deployment and maintenance.
- Flexible query model: A user can selectively index some parts of each document; query based on attribute values, regular expressions, or ranges; and have as many properties per object as needed by the application layer. Primary and secondary indexes, along with special types of indexes (such as sparse ones), can help greatly with query efficiency. Using a JavaScript shell with MapReduce makes it really easy for most developers (and many data analysts) to quickly take a look at data and get valuable insights.
- Native aggregation: The aggregation framework provides an extract-transform-load (ETL) pipeline for users to extract and transform data from MongoDB, and either load it in a new format or export it from MongoDB to other data sources. This can also help data analysts and scientists to get the slice of data they need, performing data wrangling along the way.
- Schema-less model: This is a result of MongoDB’s design philosophy to give applications the power and responsibility to interpret the different properties found in a collection’s documents. In contrast to Cassandra’s or HBase’s schema-based approach, in MongoDB, a developer can store and process dynamically generated attributes.

After learning about the major features MongoDB offers to its users, in the next section, we will learn more about the key characteristics and the most widely deployed use cases.
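Before moving on, here is a minimal sketch of the schema-less model described above. The documents and field names are hypothetical, and the query is simulated with plain Python to show which documents a filter on an optional field would match:

```python
# Documents in one MongoDB collection need not share the same attributes;
# this is the norm rather than the exception. All names here are hypothetical.
docs = [
    {"_id": 1, "name": "sensor-a", "temp": 21.5},
    {"_id": 2, "name": "sensor-b", "temp": 19.0, "humidity": 40},
]

# A query such as db.sensors.find({"humidity": {"$exists": True}}) matches
# only documents that actually carry the optional field; simulated here:
with_humidity = [d for d in docs if "humidity" in d]
```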
In this section, we will analyze MongoDB’s characteristics as a database. Understanding the features that MongoDB provides can help developers and architects to evaluate the requirements at hand and how MongoDB can help to fulfill them. Also, we will go over some common use cases from the experience of MongoDB, Inc. that have delivered the best results for its users. Finally, we will uncover some of the most common points of criticism against MongoDB and non-relational databases in general.
MongoDB has grown to become a general-purpose NoSQL database, offering the best of both the RDBMS and NoSQL worlds. Some of the key characteristics are listed as follows:
- It is a general-purpose database: In contrast to other NoSQL databases that are built for specific purposes (for example, graph databases), MongoDB can serve heterogeneous loads and multiple purposes within an application. This became even more true after version 4.0 introduced multi-document ACID transactions, further expanding the use cases in which it can be effectively used.
- Flexible schema design: A document-oriented approach with non-defined attributes that can be modified on the fly is a key contrast between MongoDB and relational databases.
- It is built with high availability, from the ground up: In the era of five-nines availability, this has to be a given. Coupled with automatic failover upon detection of a server failure, this can help to achieve high uptime.
- Feature-rich: Offering the full range of SQL-equivalent operators, along with features such as MapReduce, the aggregation framework, TTL and capped collections, and secondary indexing, MongoDB can fit many use cases, no matter how diverse the requirements are.
- Scalability and load balancing: It is built to scale, both vertically and (mainly) horizontally. Using sharding, an architect can distribute the load between different instances and achieve both read and write scalability. Data balancing happens automatically (and transparently to the user) via the shard balancer.
- Aggregation framework: Having an ETL framework built into the database means that a developer can perform most of the ETL logic before the data leaves the database, eliminating, in many cases, the need for complex data pipelines.
- Native replication: Data will get replicated across a replica set without a complicated setup.
- Security features: Both authentication and authorization are taken into account so that an architect can secure their MongoDB instances.
- JSON (BSON, or Binary JSON) objects for storing and transmitting documents: JSON is widely used across the web for frontend and API communication, and, as such, it is easier when the database is using the same protocol.
- MapReduce: Even though the MapReduce engine is not as advanced as it is in dedicated frameworks, it is nonetheless a great tool for building data pipelines.
- Querying and geospatial information in 2D and 3D: This might not be critical for many applications, but if it is for your use case, then it is really convenient to be able to use the same database for geospatial calculations and data storage.
- Multi-document ACID transactions: Starting from version 4.0, MongoDB supports ACID transactions across multiple documents.
- Mature tooling: The tooling for MongoDB has evolved to support everything from DBaaS to Sync, Mobile, and serverless (Stitch).

Since MongoDB is a highly popular NoSQL database, there have been several use cases where it has succeeded in supporting quality applications, with a great delivery time to the market.
Many of its most successful use cases center around the following list of areas:
- The integration of siloed data, providing a single view of it
- IoT
- Mobile applications
- Real-time analytics
- Personalization
- Catalog management
- Content management

All of these success stories share some common characteristics. We will try to break them down in order of relative importance:
- Schema flexibility is probably the most important one. Being able to store documents inside a collection that can have different properties can help both during the development phase and when ingesting data from heterogeneous sources that may or may not have the same properties. This is in contrast with an RDBMS, where columns need to be predefined and sparse data can be penalized. In MongoDB, this is the norm, and it is a feature that most use cases share. Having the ability to deeply nest attributes into documents and add arrays of values into attributes, while also being able to search and index these fields, helps application developers to exploit the schema-less nature of MongoDB.
- Scaling and sharding are the most common patterns for MongoDB use cases. Easily scaling using built-in sharding, using replica sets for data replication, and offloading primary servers from read loads can help developers store data effectively.
- Additionally, many use cases use MongoDB as a way of archiving data. Used as a pure data store (and without the need to define schemas), it is fairly easy to dump data into MongoDB to be analyzed at a later date by business analysts, using either the shell or some of the numerous BI tools that can integrate easily with MongoDB. Breaking data down further, based on time caps or document counts, can help serve these datasets from RAM, the use case in which MongoDB is most effective.
- Capped collections are also a feature used in many use cases. Capped collections can restrict documents in a collection by count or by the overall size of the collection. In the latter case, we need to have an estimate of the size per document in order to calculate how many documents will fit into our target size. Capped collections are a quick and dirty solution used to answer requests such as “Give me the last hour’s overview of the logs” without the need for maintenance and running async background jobs to clean our collection. Oftentimes, they might be used to quickly build and operate a queuing system. Instead of deploying and maintaining a dedicated queuing system, such as ActiveMQ, a developer can use a collection to store messages, and then use the native tailable cursors provided by MongoDB to iterate through the results as they pile up and feed an external system. Alternatively, a TTL index within a regular collection can be used if greater flexibility is required.
- Low operational overhead is also a common pattern in many use cases. Developers working in agile teams can operate and maintain clusters of MongoDB servers without the need for a dedicated DBA. The free cloud monitoring service can greatly help in reducing administrative overhead for Community Edition users, whereas MongoDB Atlas, the hosted solution by MongoDB, Inc., means that developers do not need to deal with operational headaches.
- In terms of business sectors using MongoDB, there is a huge variety, coming from almost all industries. A common pattern seems to be higher usage where we are more interested in aggregated data than individual transaction-level data. Fields such as IoT can benefit the most by exploiting the availability-over-consistency design, storing lots of data from sensors in a cost-efficient way. On the other hand, financial services have absolutely stringent consistency requirements, aligned with proper ACID characteristics, that make MongoDB more of a challenge to adopt. A financial transaction might be small in size but big in impact, which means that we cannot afford to leave a single message without proper processing.
- Location-based data is also a field where MongoDB has thrived, with Foursquare being one of the most prominent early clients. MongoDB offers quite a rich set of features around two-dimensional and three-dimensional geolocation data, such as searching by distance, geofencing, and intersections between geographical areas.

Overall, the rich feature set is a common pattern across different use cases. By providing features that can be used in many different industries and applications, MongoDB can be a unified solution for all business needs, offering users the ability to minimize operational overhead and, at the same time, iterate quickly in product development.

MongoDB’s criticism can be broken down into the following points:
- MongoDB has had its fair share of criticism throughout the years. The web-scale proposition has been met with skepticism by many developers. The counterargument is that scale is not needed most of the time, and the focus should be on other design considerations. While this might occasionally be true, it is a false dichotomy, and in an ideal world, we would have both. MongoDB is as close as it can get to combining scalability with features, ease of use, and time to market.
- MongoDB’s schema-less nature is also a big point of debate and argument. Schema-less can be really beneficial in many use cases, as it allows for heterogeneous data to be dumped into the database without complex cleansing and without ending up with lots of empty columns or blocks of text stuffed into a single column. On the other hand, this is a double-edged sword, as a developer could end up with many documents in a collection that have loose semantics in their fields, and it can become really hard to extract these semantics at the code level. If our schema design is not optimal, we could end up with a data store, rather than a database.
- A lack of proper ACID guarantees is a recurring complaint from the relational world. Indeed, if a developer needs access to more than one document at a time, it is not easy to guarantee RDBMS properties without transactions. In the RDBMS sense, having no transactions also means that complex writes will need application-level logic to roll back. If you need to update three documents in two collections to mark an application-level transaction complete, and the third document does not get updated for whatever reason, the application will need to undo the previous two writes, something that might not exactly be trivial. With the introduction of multi-document transactions in version 4, MongoDB can cope with ACID transactions at the expense of speed. While this is not ideal, and transactions are not meant to be used for every CRUD operation in MongoDB, it does address the main source of criticism.
- The default configuration settings favored ease of setting up MongoDB over operating it safely in a production environment, which drew disapproval. For years, the default write behavior was fire and forget; sending a write wouldn’t wait for an acknowledgment before attempting the next write, resulting in impressive write speeds but poor behavior in the case of failure. Also, authentication was an afterthought, leaving thousands of MongoDB databases on the public internet prey to whoever wanted to read the stored data. Even though these were conscious design decisions, they are decisions that have affected developers’ perceptions of MongoDB.

It’s important to note that MongoDB has addressed all of these shortcomings throughout the years, with the aim of becoming a versatile and resilient general-purpose database system. Now that we understand the characteristics and features of MongoDB, we will learn how to configure and set up MongoDB efficiently.
In this section, we will present some of the best practices around operations, schema design, durability, replication, sharding, security, and AWS. Further information on when to implement these best practices will be presented in later chapters.
As a database, MongoDB is built with developers in mind, and it was developed during the web era, so it does not require as much operational overhead as traditional RDBMSs. That being said, there are some best practices that need to be followed to be proactive and achieve high-availability goals.
In order of importance, the best practices are as follows:
- Mind the location of your data files: Data files can be mounted anywhere by using the --dbpath command-line option. It is really important to ensure that data files are stored in partitions with sufficient disk space, preferably formatted with XFS, or at the very least ext4.
- Keep yourself updated with versions: Before version 5, there was a different version-naming convention, in which releases with an even minor number were the stable ones. So, 3.2 was stable, whereas 3.3 was not; 3.3 was the development version that would eventually materialize into the stable 3.4 version. It is a good practice to always update to the latest security release of your stable version and to consider upgrading as soon as the next stable version comes out. From version 5 onward, MongoDB has become cloud-first. The newest versions are automatically applied in MongoDB Atlas, with the ability to opt out of them, whereas all versions are available to download for evaluation and development purposes. Chapter 3, MongoDB CRUD Operations, goes into more detail about the new rapid release schedule and how it affects developers and architects.
- Use MongoDB Cloud monitoring: The free MongoDB, Inc. monitoring service is a great tool to get an overview of a MongoDB cluster, with notifications and alerts, and to be proactive about potential issues.
- Scale up if your metrics show heavy use: Do not wait until it is too late. Utilizing more than 65% of CPU or RAM, or starting to notice disk swapping, should be the threshold to start thinking about scaling, either vertically (by using bigger machines) or horizontally (by sharding).
- Be careful when sharding: Sharding is a strong commitment to your shard key. If you make the wrong decision, it might be really difficult to go back from an operational perspective. When designing for sharding, architects need to give deep consideration to the current workloads (reads/writes) and what the current and expected data access patterns are. Live resharding, which was introduced in version 5, mitigates the risk compared to previous versions, but it’s still better to spend more time upfront instead of resharding after the fact. Always use the shard key in queries, or else MongoDB will have to query all shards in the cluster, negating the major sharding advantage.
- Use an application driver maintained by the MongoDB team: These drivers are supported and tend to get updated faster than drivers with no official support. If MongoDB does not support the language that you are using yet, please open a ticket in MongoDB’s JIRA tracking system.
- Schedule regular backups: No matter whether you are using standalone servers, replica sets, or sharding, a regular backup policy should also be used as a second-level guard against data loss. XFS is a great choice as a filesystem, as it can perform snapshot backups.
- Avoid manual backups: When possible, regular, automated backups should be used. If we need to resort to a manual backup, we can use a hidden member in a replica set to take the backup from. We have to make sure that we use the fsync command with {lock: true} on this member, to get the maximum consistency at this node, along with having journaling turned on. If this volume is on AWS, we can simply take an EBS snapshot straight away.
- Enable database access control: Never put a database into a production system without access control. Access control should be implemented at the node level, by a proper firewall that only allows specific application servers to access the database, and at the DB level, by using the built-in roles or defining custom ones. This has to be initialized at start-up time by using the --auth command-line parameter and can be configured by using the admin collection.
- Test your deployment using real data: Since MongoDB is a schema-less, document-oriented database, you might have documents with varying fields. This means that it is even more important than with an RDBMS to test using data that resembles production data as closely as possible. A document with an extra field of an unexpected value can make the difference between an application working smoothly or crashing at runtime. Try to deploy a staging server using production-level data, or at least fake your production data in staging by using an appropriate library, such as Faker for Ruby.

MongoDB is schema-less, and you need to design your collections and indexes to accommodate this fact:
- Index early and often: Identify common query patterns using cloud monitoring, the GUI that MongoDB Compass offers, or logs. Analyzing the results, you should create indexes that cover the most common query patterns, using as many indexes as possible at the beginning of a project.
- Eliminate unnecessary indexes: This is a bit counter-intuitive given the preceding suggestion, but monitor your database for changing query patterns, and drop the indexes that are not being used. An index consumes RAM and I/O, as it needs to be stored and updated along with the documents in the database. Using an aggregation pipeline and $indexStats, a developer can identify the indexes that are seldom used and eliminate them.
- Use a compound index, rather than index intersection: Most of the time, querying with multiple predicates (A and B, C or D and E, and so on) will work better with a single compound index than with multiple simple indexes. Also, a compound index will have its data ordered by field, and we can use this to our advantage when querying. An index on fields A, B, and C will be used in queries for A, (A,B), and (A,B,C), but not in queries for (B,C) or (C).
- Avoid low-selectivity indexes: Indexing a field such as gender, for example, will statistically return half of our documents, whereas an index on last name will only return a handful of documents with the same last name.
- Use regular expressions carefully: Again, since indexes are ordered by value, searching using a regular expression with leading wildcards (that is, /.*BASE/) won’t be able to use the index. Searching with an expression anchored at the start with only trailing wildcards (that is, /^DATA.*/) can be efficient, as long as there are enough case-sensitive characters in the expression.
- Avoid negation in queries: Indexes index values, not the absence of them. Using NOT in queries can result in full table scans instead of using the index.
- Use partial indexes: If we need to index a subset of the documents in a collection, partial indexes can help us to minimize the index set and improve performance. A partial index includes a condition on the filter that we use in the desired query.
- Use document validation: Use document validation to monitor for new attributes being inserted into your documents and decide what to do with them. With document validation set to warn, we can keep a log of documents that were inserted with new, never-seen-before attributes that we did not expect during the design phase, and decide whether we need to update our indexes or not.
- Use MongoDB Compass: MongoDB’s free visualization tool is great for getting a quick overview of our data and how it grows over time.
- Respect the maximum document size of 16 MB: The maximum document size for MongoDB is 16 MB. This is a fairly generous limit, but it is one that should not be violated under any circumstances. Allowing documents to grow unbounded should not be an option, and, as efficient as it might be to embed documents, we should always keep in mind that this needs to be kept under control. Additionally, we should keep track of the average and maximum document sizes, using monitoring or the $bsonSize aggregation operator.
- Use the appropriate storage engine: MongoDB has introduced several new storage engines since version 3.2. The in-memory storage engine should be used for real-time workloads, whereas the encrypted storage engine (only available in MongoDB Enterprise Edition) should be the engine of choice when there are strict requirements around data security. Otherwise, the default WiredTiger storage engine is the best option for general-purpose workloads.

Having examined some schema design best practices, we will move on to the best practices for write durability as of MongoDB version 6.
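Before moving on, the compound-index prefix rule mentioned above can be illustrated with a toy sketch. This is not server code: it only simulates the prefix check for equality predicates, whereas the real query planner also considers sorts, ranges, and other factors. The function name is mine, for illustration only.

```javascript
// Toy model of the compound-index prefix rule: a query's equality fields
// can use the index only if they form a prefix of the index's field order.
function canUsePrefix(indexFields, queryFields) {
  const wanted = new Set(queryFields);
  let covered = 0;
  // Walk the index fields in order; stop at the first field the query lacks.
  for (const field of indexFields) {
    if (wanted.has(field)) covered++;
    else break;
  }
  // Usable only if every query field fell inside that leading prefix.
  return covered === queryFields.length;
}

console.log(canUsePrefix(["A", "B", "C"], ["A", "B"])); // true
console.log(canUsePrefix(["A", "B", "C"], ["B", "C"])); // false
```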
Write durability can be fine-tuned in MongoDB, and, according to our application design, it should be as strict as possible, without affecting our performance goals.
Fine-tune the interval at which data is flushed to disk in the WiredTiger storage engine; the default is to flush data to disk every 60 seconds after the last checkpoint. This can be changed by using the --wiredTigerCheckpointDelaySecs command-line option.
MongoDB version 5 has changed the default settings for read and write concerns.
The default write concern is now majority writes, which means that in a replica set of three nodes (with one primary and two secondaries), the operation returns as soon as two of the nodes acknowledge it by writing it to the disk. Writes always go to the primary and then get propagated asynchronously to the secondaries. In this way, MongoDB eliminates the possibility of data rollback in the event of a node failure.
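The "two of three nodes" figure follows the standard majority calculation, which can be sketched as follows (the helper name is mine, not part of any MongoDB driver API):

```javascript
// Majority of a replica set: more than half of the voting members.
// Illustrative helper, not a MongoDB API.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

console.log(majority(3)); // 2: a three-node set needs two acknowledgments
console.log(majority(5)); // 3
```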
If we use arbiters in our replica set, then writes will still be acknowledged solely by the primary if the following formula resolves to true:
#arbiters > #nodes*0.5 - 1
For example, in a replica set of three nodes of which one is the arbiter and two are storing data, this formula resolves to the following:
1 > 3*0.5 - 1 ... 1 > 0.5 ... true
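The same arithmetic can be expressed directly in code (the function name is mine, for illustration only):

```javascript
// Condition quoted above: writes are acknowledged solely by the primary
// when #arbiters > #nodes * 0.5 - 1. Illustrative helper, not a MongoDB API.
function primaryAcksAlone(arbiters, nodes) {
  return arbiters > nodes * 0.5 - 1;
}

console.log(primaryAcksAlone(1, 3)); // true: 1 > 0.5, as in the example above
console.log(primaryAcksAlone(1, 5)); // false: 1 > 1.5 does not hold
```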
Note
MongoDB 6 restricts the number of arbiters to a maximum of one.
The default read concern is now local instead of available, which mitigates the risk of returning orphaned documents for reads in sharded collections. Orphaned documents might be returned during chunk migrations, which can be triggered either by MongoDB or, since version 5, also by the user when applying live resharding to the sharded collection.
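If these cluster-wide defaults ever need to be set explicitly, the setDefaultRWConcern administrative command can be used. The following is an illustrative mongosh configuration fragment, shown for reference only; it assumes a connected replica set or sharded cluster and a user with the appropriate privileges:

```javascript
// Illustrative: set the cluster-wide default read and write concerns
// to the documented version 5+ defaults. Requires a live cluster.
db.adminCommand({
  setDefaultRWConcern: 1,
  defaultWriteConcern: { w: "majority" },
  defaultReadConcern: { level: "local" }
});
```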
Multi-document ACID transactions and the transactional guarantees that they have provided since MongoDB 4.2, coupled with the introduction of streaming replication and replicate-before-journaling behavior, have improved replication performance. Additionally, they allow for more durable and consistent default write and read concerns without affecting performance as much. The new defaults are promoting durability and consistent reads and should be carefully evaluated before changing them.
Under the right conditions, replica sets are MongoDB’s mechanism for providing redundancy, high availability, and higher read throughput. Replication in MongoDB is easy to configure, so the following best practices focus on operational concerns:
- Always use replica sets: Even if your dataset is currently small and you don’t expect it to grow exponentially, you never know when that might happen. Also, having a replica set of at least three servers helps to design for redundancy, separating the workloads between real time and analytics (using a secondary) and having data redundancy built in from day one. Finally, there are some corner cases that you will identify earlier by using a replica set instead of a single standalone server, even for development purposes.
- Use a replica set to your advantage: A replica set is not just for data replication. We can (and, in most cases, should) use the primary server for writes and prefer reads from one of the secondaries to offload the primary server. This can be done by setting read preferences for reads, along with the correct write concern, to ensure that writes propagate as needed.
- Use an odd number of replicas in a MongoDB replica set: If a server is down or loses connectivity with the rest of them (network partitioning), the rest have to vote on which one will be elected as the primary server. If we have an odd number of replica set members, we can guarantee that each subset of servers knows whether it belongs to the majority or the minority of the replica set members. If we cannot have an odd number of replicas, we need to have one extra host set as an arbiter, with the sole purpose of voting in the election process. Even a micro-instance in EC2 could serve this purpose.

Sharding is MongoDB’s solution for horizontal scaling. In Chapter 9, Monitoring, Backup, and Security, we will go over its usage in more detail, but the following list offers some best practices, based on the underlying data architecture:
- Think about query routing: Based on different shard keys and techniques, the mongos query router might direct the query to some (or all) of the shards. It is important to take our queries into account when designing sharding, so that we don’t end up with our queries hitting all of our shards.
- Use tag-aware sharding: Tags can provide a more fine-grained distribution of data across our shards. Using the right set of tags for each shard, we can ensure that subsets of data get stored in a specific set of shards. This can be useful for data proximity between application servers, MongoDB shards, and the users.

Security is always a multi-layered approach, and the following recommendations do not form an exhaustive list; they are just the bare basics that need to be covered in any MongoDB database:
- Always turn authentication on. There have been multiple incidents over the years where open MongoDB servers were hacked for fun or profit, such as being backed up and deleted to extort administrators into paying. It is good practice to set up authentication even in non-production environments, to decrease the possibility of human error.
- The HTTP status interface should be disabled.
- The RESTful API should be disabled.
- The JSON API should be disabled.
- Connect to MongoDB using TLS/SSL.
- Audit the system activity.
- Use a dedicated system user to access MongoDB, with appropriate system-level access.
- Disable server-side scripting if it is not needed. This will affect MapReduce, the built-in db.group() command, and $where operations. If they are not used in your code base, it is better to disable server-side scripting at startup by using the --noscripting parameter or by setting security.javascriptEnabled to false in the configuration file.

After examining the best practices for security in general, we will dive into the best practices for AWS deployments.
When we are using MongoDB, we can use our own servers in a data center, a MongoDB-hosted solution such as MongoDB Atlas, or we can rent instances from Amazon by using EC2. EC2 instances are virtualized and share resources transparently with co-located VMs on the same physical host. So, there are some more considerations to take into account if you wish to go down that route, as follows:
- Use EBS-optimized EC2 instances.
- Get EBS volumes with provisioned I/O operations per second (IOPS) for consistent performance.
- Use EBS snapshotting for backup and restore.
- Use different availability zones for high availability and different regions for disaster recovery. Using different availability zones within each region that Amazon provides guarantees that our data will be highly available. Different regions should mostly be used for disaster recovery, in case a catastrophic event ever takes out an entire region. A region might be EU-West-2 (for London), whereas an availability zone is a subdivision within a region; currently, three availability zones are available in the London region.
- Deploy globally, access locally. For truly global applications with users from different time zones, we should have application servers in different regions access the data that is closest to them, using the right read preference configuration in each server.

Reading a book is great (and reading this book is even better), but continuous learning is the only way to keep up to date with MongoDB.
The online documentation available at https://docs.mongodb.com/manual/ is the perfect starting point for every developer, new or seasoned.
The JIRA tracker is a great place to take a look at fixed bugs and the features that are coming up next: https://jira.mongodb.org/browse/SERVER/.
Some other great books on MongoDB are listed as follows:
- MongoDB Fundamentals: A hands-on guide to using MongoDB and Atlas in the real world, by Amit Phaltankar and Juned Ahsan
- MongoDB: The Definitive Guide, 3rd Edition: Powerful and Scalable Data Storage, by Shannon Bradshaw and Eoin Brazil
- MongoDB Topology Design: Scalability, Security, and Compliance on a Global Scale, by Nicholas Cottrell
- Any book by Kristina Chodorow
- The MongoDB user group (https://groups.google.com/forum/#!forum/mongodb-user)
