Mastering MongoDB 3.x - Alex Giamas - E-Book

Description

An expert's guide to building fault-tolerant MongoDB applications

About This Book

  • Master the advanced modeling, querying, and administration techniques in MongoDB and become a MongoDB expert
  • Covers the latest updates and Big Data features frequently used by professional MongoDB developers and administrators
  • If your goal is to become a certified MongoDB professional, this book is your perfect companion

Who This Book Is For

Mastering MongoDB is a book for database developers, architects, and administrators who want to learn how to use MongoDB more effectively and productively.

If you have experience in, and are interested in working with, NoSQL databases to build apps and websites, then this book is for you.

What You Will Learn

  • Get hands-on with advanced querying techniques such as indexing, expressions, arrays, and more
  • Configure, monitor, and maintain a highly scalable MongoDB environment like an expert
  • Master replication and data sharding to optimize read/write performance
  • Design secure and robust applications based on MongoDB
  • Administer MongoDB-based applications on-premises or in the cloud
  • Scale MongoDB to achieve your design goals
  • Integrate MongoDB with big data sources to process huge amounts of data

In Detail

MongoDB has grown to become the de facto NoSQL database with millions of users, from small startups to Fortune 500 companies. Addressing the limitations of SQL schema-based databases, MongoDB pioneered a shift of focus for DevOps and offered sharding and replication maintainable by DevOps teams. The book is based on MongoDB 3.x and covers topics ranging from database querying using the shell, built-in drivers, and popular ODM mappers, to more advanced topics such as sharding, high availability, and integration with big data sources.

You will get an overview of MongoDB and how to play to its strengths, with relevant use cases. After that, you will learn how to query MongoDB effectively and make use of indexes as much as possible. The next part deals with the administration of MongoDB installations on-premises or in the cloud. We deal with database internals in the next section, explaining storage systems and how they can affect performance. The last section of this book deals with replication and MongoDB scaling, along with integration with heterogeneous data sources. By the end of this book, you will be equipped with all the required industry skills and knowledge to become a certified MongoDB developer and administrator.

Style and approach

This book takes a practical, step-by-step approach to explain the concepts of MongoDB. Practical use-cases involving real-world examples are used throughout the book to clearly explain theoretical concepts.




Mastering MongoDB 3.x

An expert's guide to building fault-tolerant MongoDB applications

Alex Giamas

BIRMINGHAM - MUMBAI

Mastering MongoDB 3.x

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: November 2017

 

Production reference: 1151117

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78398-260-8

 

www.packtpub.com

Credits

Author: Alex Giamas

Copy Editors: Safis Editing

Reviewers: Juan Tomás Oliva Ramos, Nilap Shah

Project Coordinator: Nidhi Joshi

Commissioning Editor: Amey Varangaonkar

Proofreader: Safis Editing

Acquisition Editor: Vinay Argekar

Indexer: Aishwarya Gangawane

Content Development Editor: Mayur Pawanikar

Graphics: Tania Dutta

Technical Editor: Prasad Ramesh

Production Coordinator: Shantanu Zagade

About the Author

Alex Giamas is a Senior Software Engineer at the Department for International Trade, UK Government. He has also worked as a consultant for various startups. He is an experienced professional in systems engineering, NoSQL, and big data technologies, with a career spanning from co-founding a digital health startup to Fortune 15 companies.

He has been developing with MongoDB since 2009 and its early 1.x versions, using it for several projects around data storage and analytical processing. He has been developing with Apache Hadoop since 2007, while working on its incubation.

He has worked with a wide array of NoSQL and big data technologies, building scalable and highly available distributed software systems in C++, Java, Ruby and Python.

Alex holds an MSc in Information Networking from Carnegie Mellon University and has attended professional courses at Stanford University. He is a graduate of the National Technical University of Athens, Greece, in Electrical and Computer Engineering. He is a MongoDB Certified Developer and a Cloudera Certified Developer for Apache Hadoop and Data Science Essentials.

He has been publishing regularly at InfoQ for the past four years, on NoSQL, big data, and data science topics.

I would like to thank my parents for their support and advice all these years. I would like to thank my fiancée Mary for her patience and support throughout the days and nights, weekdays and weekends, that I spent writing this book.

About the Reviewers

Juan Tomás Oliva Ramos is an environmental engineer from the University of Guanajuato, Mexico, with a master's degree in administrative engineering and quality. He has more than 5 years of experience in the management and development of patents, technological innovation projects, and the development of technological solutions through the statistical control of processes.

He has been a teacher of statistics, entrepreneurship, and the technological development of projects since 2011. He became an entrepreneur mentor and started a new department of technology management and entrepreneurship at Instituto Tecnológico Superior de Purisima del Rincon Guanajuato, Mexico.

Juan is an Alfaomega reviewer and has worked on the book Wearable Designs for Smart Watches, Smart TVs and Android Mobile Devices.

Juan has also developed prototypes through programming and automation technologies for the improvement of operations, which have been registered for patents.

I want to thank God for giving me the wisdom and humility to review this book. I thank Packt for giving me the opportunity to review this amazing book and to collaborate with a group of committed people. I want to thank my beautiful wife, Brenda, our two magic princesses (Maria Regina and Maria Renata), and our next member (Angel Tadeo); all of you give me the strength, happiness, and joy to start a new day. Thanks for being my family.

Nilap Shah is a lead software consultant with experience across various fields and technologies. He is an expert in .NET, UiPath (robotics), and MongoDB. He is a certified MongoDB developer and DBA. He is a technical writer as well as a technical speaker. He also provides MongoDB corporate training. Currently, Nilap is working as a lead MongoDB consultant and provides solutions with MongoDB (DBA and developer projects). His LinkedIn profile can be found at https://www.linkedin.com/in/nilap-shah-8b6780a/ and you can reach him on WhatsApp at +91-9537047334.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1783982608.

If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

MongoDB – A Database for the Modern Web

Web history

Web 1.0

Web 2.0

Web 3.0

SQL and NoSQL evolution

MongoDB evolution

Major feature set for versions 1.0 and 1.2

Version 2

Version 3

Version 3+

MongoDB for SQL developers

MongoDB for NoSQL developers

MongoDB key characteristics and use cases

Key characteristics

What is the use case for MongoDB?

MongoDB criticism

MongoDB configuration and best practices

Operational best practices

Schema design best practices

Best practices for write durability

Best practices for replication

Best practices for sharding

Best practices for security

Best practices for AWS

Reference documentation

MongoDB documentation

Packt references

Further reading 

Summary

Schema Design and Data Modeling

Relational schema design

MongoDB schema design

Read-write ratio

Data modeling

Data types

Comparing different data types

Date type

ObjectId

Modeling data for atomic operations

Write isolation

Read isolation and consistency

Modeling relationships

One-to-one

One-to-many, many-to-many

Modeling data for keyword searches

Connecting to MongoDB

Connecting using Ruby

Mongoid ODM

Inheritance with Mongoid models

Connecting using Python

PyMODM ODM

Inheritance with PyMODM models

Connecting using PHP

Doctrine ODM

Inheritance with Doctrine

Summary

MongoDB CRUD Operations

CRUD using the shell

Scripting for the mongo shell

Differences between scripting for the mongo shell and using it directly

Batch inserts using the shell

Batch operations using the mongo shell

Administration

fsync

compact

currentOp/killOp

collMod

touch

MapReduce in the mongo shell

MapReduce concurrency

Incremental MapReduce

Troubleshooting MapReduce

Aggregation framework

SQL to aggregation

Aggregation versus MapReduce

Securing the shell

Authentication and authorization

Authorization with MongoDB

Security tips for MongoDB

Encrypting communication using TLS/SSL

Encrypting data

Limiting network exposure

Firewalls and VPNs

Auditing

Use secure configuration options

Authentication with MongoDB

Enterprise Edition

Kerberos authentication

LDAP authentication

Summary

Advanced Querying

MongoDB CRUD operations

CRUD using the Ruby driver

Creating documents

Read

Chaining operations in find()

Nested operations

Update

Delete

Batch operations

CRUD in Mongoid

Read

Scoping queries

Create, update, and delete

CRUD using the Python driver

Create and delete

Finding documents

Updating documents

CRUD using PyMODM

Creating documents

Updating documents

Deleting documents

Querying documents

CRUD using the PHP driver

Create and delete

Bulk write

Read

Update

CRUD using Doctrine

Create, update, and delete

Read

Best practices

Comparison operators

Update operators

Smart querying

Using regular expressions

Query results and cursors

Storage considerations on delete

Summary

Aggregation

Why aggregation?

Aggregation operators

Aggregation stage operators

Expression operators

Expression Boolean operators

Expression comparison operators

Set expression and array operators

Expression date operators

Expression string operators

Expression arithmetic operators

Aggregation accumulators

Conditional expressions

Other operators

Text search

Variable

Literal

Parsing data type

Limitations

Aggregation use case

Summary

Indexing

Index internals

Index types

Single field indexes

Indexing embedded fields

Indexing embedded documents

Background indexes

Compound indexes

Sorting using compound indexes

Reusing compound indexes

Multikey indexes

Special types of index

Text

Hashed

TTL

Partial

Sparse

Unique

Case-insensitive

Geospatial

Building and managing indexes

Forcing index usage

Hint and sparse indexes

Building indexes on replica sets

Managing indexes

Naming indexes

Special considerations

Using indexes efficiently

Measuring performance

Improving performance

Index intersection

References

Summary

Monitoring, Backup, and Security

Monitoring

What should we monitor?

Page faults

Resident memory

Virtual and mapped memory

Working set

Monitoring memory usage in WiredTiger

Tracking page faults

Tracking B-tree misses

I/O wait

Read and write queues

Lock percentage

Background flushes

Tracking free space

Monitoring replication

Oplog size

Working set calculations

Monitoring tools

Hosted tools

Open source tools

Backups

Backup options

Cloud-based solutions

Backups with file system snapshots

Taking a backup of a sharded cluster

Backups using mongodump

Backups by copying raw files

Backups using queueing

EC2 backup and restore

Incremental backups

Security

Authentication

Authorization

User roles

Database administration roles

Cluster administration roles

Backup restore roles

Roles across all databases

Superuser

Network level security

Auditing security

Special cases

Overview

Summary

Storage Engines

Pluggable storage engines

WiredTiger

Document-level locking

Snapshots and checkpoints

Journaling

Data compression

Memory usage

readConcern

WiredTiger collection-level options

WiredTiger performance strategies

WiredTiger B-tree versus LSM indexes

Encrypted

In-memory

MMAPv1

MMAPv1 storage optimization

Mixed usage

Other storage engines

RocksDB

TokuMX

Locking in MongoDB

Lock reporting

Lock yield

Commonly used commands and locks

Commands requiring a database lock

References

Summary

Harnessing Big Data with MongoDB

What is big data?

Big data landscape

Message queuing systems

Apache ActiveMQ

RabbitMQ

Apache Kafka

Data warehousing

Apache Hadoop

Apache Spark

Spark comparison with Hadoop MapReduce

MongoDB as a data warehouse

Big data use case

Kafka setup

Hadoop setup

Steps

Hadoop to MongoDB pipeline

Spark to MongoDB

References

Summary

Replication

Replication

Logical or physical replication

Different high availability types

Architectural overview

How do elections work?

What is the use case for a replica set?

Setting up a replica set

Converting a standalone server to a replica set

Creating a replica set

Read preference

Write concern

Custom write concern

Priority settings for replica set members

Priority zero replica set members

Hidden replica set members

Delayed replica set members

Production considerations

Connecting to a replica set

Replica set administration

How to perform maintenance on replica sets

Resyncing a member of a replica set

Changing the oplog size

Reconfiguring a replica set when we have lost the majority of our servers

Chained replication

Cloud options for a replica set

mLab

MongoDB Atlas

Replica set limitations

Summary

Sharding

Advantages of sharding

Architectural overview

Development, continuous deployment, and staging environments

Planning ahead on sharding

Sharding setup

Choosing the shard key

Changing the shard key

Choosing the correct shard key

Range-based sharding

Hash-based sharding

Coming up with our own key

Location-based data

Sharding administration and monitoring

Balancing data – how to track and keep our data balanced

Chunk administration

Moving chunks

Changing the default chunk size

Jumbo chunks

Merging chunks

Adding and removing shards

Sharding limitations

Querying sharded data

The query router

Find

Sort/limit/skip

Update/remove

Querying using Ruby

Performance comparison with replica sets

Sharding recovery

Mongos

Mongod process

Config server

A shard goes down

The entire cluster goes down

References

Summary

Fault Tolerance and High Availability

Application design

Schema-less doesn't mean schema design-less

Read performance optimization

Consolidating read querying

Defensive coding

Monitoring integrations

Operations

Security

Enabling security by default

Isolating our servers

Checklists

References

Summary

Preface

MongoDB has grown to become the de facto NoSQL database with millions of users, from small start-ups to Fortune 500 companies. Addressing the limitations of SQL schema-based databases, MongoDB pioneered a shift of focus for DevOps and offered sharding and replication maintainable by DevOps teams. This book is based on MongoDB 3.x and covers topics ranging from database querying using the shell, built-in drivers, and popular ODM mappers, to more advanced topics such as sharding, high availability, and integration with big data sources.

You will get an overview of MongoDB and how to play to its strengths, with relevant use cases. After that, you will learn how to query MongoDB effectively and make use of indexes as much as possible. The next part deals with the administration of MongoDB installations on-premises or in the cloud. We deal with database internals in the next section, explaining storage systems and how they can affect performance. The last section of this book deals with replication and MongoDB scaling, along with integration with heterogeneous data sources. By the end of this book, you will be equipped with all the required industry skills and knowledge to become a certified MongoDB developer and administrator.

What this book covers

Chapter 1, MongoDB – A Database for the Modern Web, takes us on a journey through web, SQL, and NoSQL technologies from inception to current state.

Chapter 2, Schema Design and Data Modeling, teaches schema design for relational databases and MongoDB, and how we can achieve the same goal starting from a different point.

Chapter 3, MongoDB CRUD Operations, gives a bird's-eye view of CRUD operations.

Chapter 4, Advanced Querying, covers advanced querying concepts using Ruby, Python, and PHP, using both the official drivers and an ODM.

Chapter 5, Aggregation, dives deep into the aggregation framework. We also discuss why and when we should use aggregation, as opposed to MapReduce and querying the database.

Chapter 6, Indexing, explores one of the most important properties of every database, which is indexing.

Chapter 7, Monitoring, Backup, and Security, discusses the operational aspects of MongoDB. Monitoring, backup, and security should not be an afterthought but rather a necessary process before deploying MongoDB in a production environment.

Chapter 8, Storage Engines, teaches about different storage engines in MongoDB. We identify the pros and cons of each one and the use cases for choosing each storage engine.

Chapter 9, Harnessing Big Data with MongoDB, shows more about how MongoDB fits into the wider big data landscape and ecosystem.

Chapter 10, Replication, discusses replica sets and how to administer them. Starting from an architectural overview of replica sets and replica set internals around elections, we dive deep into setting up and configuring a replica set.

Chapter 11, Sharding, explores sharding, one of the most interesting features of MongoDB. We start from an architectural overview of sharding and move on to how we can design a shard, and especially choose the right shard key.

Chapter 12, Fault Tolerance and High Availability, fits in the information that we didn't manage to discuss in the previous chapters and places emphasis on topics such as application design, operations, security, and checklists.

What you need for this book

You will need the following software to be able to smoothly sail through the chapters:

MongoDB version 3+

Apache Kafka 1

Apache Spark 2+

Apache Hadoop 2+

Who this book is for

Mastering MongoDB 3.x is a book for database developers, architects, and administrators who want to learn how to use MongoDB more effectively and productively. If you have experience in, and are interested in working with, NoSQL databases to build apps and websites, then this book is for you.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "In a sharded environment, each mongod applies its own locks, thus greatly improving concurrency."

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

> db.types.find().sort({a:-1})
{ "_id" : ObjectId("5908d59d55454e2de6519c4a"), "a" : [ 2, 5 ] }
{ "_id" : ObjectId("5908d58455454e2de6519c49"), "a" : [ 1, 2, 3 ] }

Any command-line input or output is written as follows:

> db.types.insert({"a":4})
WriteResult({ "nInserted" : 1 })

New terms and important words are shown in bold.

Warnings or important notes appear like this.
Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:

1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-MongoDB-3x and https://github.com/agiamas/mastering-mongodb. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

MongoDB – A Database for the Modern Web

In this chapter, we will lay the foundations for understanding MongoDB and how it is a database designed for the modern web. We will cover the following topics:

The web, SQL, and MongoDB's history and evolution.

MongoDB from the perspective of SQL and other NoSQL technology users.

MongoDB's common use cases and why they matter.

Configuration best practices:

Operational

Schema design

Write durability

Replication

Sharding

Security

AWS

Learning to learn. Nowadays, learning how to learn is as important as learning in the first place. We will go through references that have the most up-to-date information about MongoDB, for both new and experienced users.

Web history

In March 1989, more than 28 years ago, Sir Tim Berners-Lee unveiled his vision for what would later be named the World Wide Web (WWW) in a document called Information Management: A Proposal (http://info.cern.ch/Proposal.html). Since then, the WWW has grown to be a tool of information, communication, and entertainment for more than two of every five people on our planet.

Web 1.0

The first version of the WWW relied exclusively on web pages and hyperlinks between them, a concept that has been kept up to the present day. It was mostly read-only, with limited support for interaction between the user and the web page. Brick-and-mortar companies were using it to put up their informational pages. Finding websites could only be done using hierarchical directories like Yahoo! and DMOZ. The web was meant to be an information portal.

This, while not being Sir Tim Berners-Lee's vision, allowed media outlets such as the BBC and CNN to create a digital presence and start pushing out information to the users. It revolutionized information access as everyone in the world could get first-hand access to quality information at the same time.

Web 1.0 was totally device and software independent, allowing every device to access all information. Resources were identified by address (the website's URL), and open protocol methods (GET, POST, PUT, DELETE) could be used to access content resources.

Hyper Text Markup Language (HTML) was used to develop websites that served static content. There was no notion of Cascading Style Sheets (CSS), as the positioning of elements in a page could only be modified using tables, and framesets were used extensively to embed information in pages.

This proved to be severely limiting, and so browser vendors back then started adding custom HTML tags like <blink> and <marquee>, which led to the first browser wars, with rivals Microsoft (Internet Explorer) and Netscape racing to extend HTML's functionality. Web 1.0 reached 45 million users by 1996.

Here is the Lycos start page as it appeared in Web 1.0 (http://www.lycos.com/):

Yahoo! as it appeared in Web 1.0 (http://www.yahoo.com):

Web 2.0

Web 2.0, a term first defined and formulated by Tim O'Reilly, is what we use to describe our current WWW sites and services. Its main characteristic is that the web moved from a read-only to a read-write state. Websites evolved into services, and human collaboration plays an ever more important part in Web 2.0.

From simple information portals, we now have many more types of services such as:

Audio

BlogPod

Blogging

Bookmarking

Calendars

Chat

Collaboration

Communication

Community

CRM

E-commerce

E-learning

Email

Filesharing

Forums

Games

Images

Knowledge

Mapping

Mashups

Multimedia

Portals

RSS

Wikis

Web 2.0 reached 1+ billion users in 2006 and 3.77 billion users at the time of writing this book (late 2017). Building communities was the differentiating factor for Web 2.0, allowing internet users to connect on common interests, communicate, and share information.

Personalization plays an important part in Web 2.0, with many websites offering tailored content to their users. Recommendation algorithms and human curation decide the content shown to each user.

Browsers can support more and more desktop-like applications by using Adobe Flash and Asynchronous JavaScript and XML (AJAX) technologies. Most desktop applications have web counterparts that either supplement or have completely replaced the desktop versions. The most notable examples are office productivity (Google Docs, Microsoft Office 365), digital design (Sketch), and image editing and manipulation (Google Photos, Adobe Creative Cloud).

Moving from websites to web applications also ushered in the era of Service Oriented Architecture (SOA). Applications can interconnect with each other, exposing data through Application Programming Interfaces (APIs), allowing more complex applications to be built on top of application layers.

One class of applications that defined Web 2.0 is social apps. Facebook, with 1.86 billion monthly active users at the end of 2016, is the best-known example. We use social networks, and many web applications share social aspects that allow us to communicate with peers and extend our social circle.

Web 3.0

It's not yet here, but Web 3.0 is expected to bring Semantic Web capabilities. Advanced as Web 2.0 applications may seem, they all rely mostly on structured information. We use the same concept of searching for keywords and matching these keywords with web content, without much understanding of the context, content, and intention of the user's request. Also called the Web of Data, Web 3.0 will rely on inter-machine communication and algorithms to provide rich interaction via diverse human-computer interfaces.

SQL and NoSQL evolution

Structured Query Language (SQL) existed even before the WWW. Dr. E. F. Codd originally published the paper A Relational Model of Data for Large Shared Data Banks in June 1970, in the Association for Computing Machinery (ACM) journal, Communications of the ACM. SQL was initially developed at IBM by Chamberlin and Boyce in 1974. Relational Software (now Oracle Corporation) was the first to develop a commercially available implementation of SQL, targeted at United States governmental agencies.

The first American National Standards Institute (ANSI) SQL standard came out in 1986 and since then there have been eight revisions with the most recent being published in 2016 (SQL:2016).

SQL was not particularly popular at the start of the WWW. Static content could just be hardcoded into the HTML page without much fuss. However, as the functionality of websites grew, webmasters wanted web page content to be driven by offline data sources, so that content could change over time without redeploying code.

Common Gateway Interface (CGI) scripts, in Perl or Unix shell, were driving early database-driven websites in Web 1.0. With Web 2.0, the web evolved from directly injecting SQL results into the browser to using two- and three-tier architectures that separated views from business and model logic, allowing SQL queries to be modular and isolated from the rest of a web application.

Not only SQL (NoSQL), on the other hand, is a much more modern term that follows web evolution, rising at the same time as Web 2.0 technologies. The term was first coined by Carlo Strozzi in 1998 for his open source database that did not follow the SQL standard but was still relational.

This is not what we currently expect from a NoSQL database. Johan Oskarsson, a developer at Last.fm at the time, reintroduced the term in early 2009 to group a set of distributed, non-relational data stores that were being developed. Many of them were based on Google's Bigtable and MapReduce papers, or on Amazon's Dynamo, a highly available key-value storage system.

NoSQL foundations grew upon relaxed ACID (atomicity, consistency, isolation, durability) guarantees in favor of performance, scalability, flexibility and reduced complexity. Most NoSQL databases have gone one way or another in providing as many of the previously mentioned qualities as possible, even offering tunable guarantees to the developer.

Timeline of SQL and NoSQL evolution

MongoDB evolution

10gen started developing a cloud computing stack in 2007 and soon realized that the most important innovation was centered around the document-oriented database that they had built to power it, MongoDB. MongoDB was initially released on August 27th, 2009.

Version 1 of MongoDB was pretty basic in terms of features, authorization, and ACID guarantees and made up for these shortcomings with performance and flexibility.

In the following sections, we can see the major features along with the version number with which they were introduced.

Major feature set for versions 1.0 and 1.2

Document-based model

Global lock (process level)

Indexes on collections

CRUD operations on documents

No authentication (authentication was handled at the server level)

Master/slave replication

MapReduce (introduced in v1.2)

Stored JavaScript functions (introduced in v1.2)

Version 2

Background index creation (since v.1.4)

Sharding (since v.1.6)

More query operators (since v.1.6)

Journaling (since v.1.8)

Sparse and covered indexes (since v.1.8)

Compact command to reduce disk usage

Memory usage more efficient

Concurrency improvements

Index performance enhancements

Replica sets are now more configurable and data center aware

MapReduce improvements

Authentication (since 2.0 for sharding and most database commands)

Geospatial features introduced

Version 3

Aggregation framework (since v.2.2) and enhancements (since v.2.6)

TTL collections (since v.2.2)

Concurrency improvements, among which is DB-level locking (since v.2.2)

Text search (since v.2.4) and integration (since v.2.6)

Hashed index (since v.2.4)

Security enhancements, role based access (since v.2.4)

V8 JavaScript engine instead of SpiderMonkey (since v.2.4)

Query engine improvements (since v.2.6)

Pluggable storage engine API

WiredTiger storage engine introduced, with document-level locking, while the previous storage engine (now called MMAPv1) supports collection-level locking

Version 3+

Replication and sharding enhancements (since v.3.2)

Document validation (since v.3.2)

Aggregation framework enhanced operations (since v.3.2)

Multiple storage engines (since v.3.2, only in Enterprise Edition)

MongoDB evolution diagram

As one can observe, version 1 was pretty basic, whereas version 2 introduced most of the features present in the current version such as sharding, usable and special indexes, geospatial features, and memory and concurrency improvements.

On the way from version 2 to version 3, the aggregation framework was introduced, mainly as a supplement to the ageing MapReduce framework (which was never up to par with dedicated frameworks like Hadoop). Text search was then added, and performance, stability, and security slowly but surely improved to adapt to the increasing enterprise loads of customers using MongoDB.

With WiredTiger's introduction in version 3, locking became much less of an issue for MongoDB as it was brought down from process (global lock) to document level, almost the most granular level possible.

At its current state, MongoDB is a database that can handle loads ranging from startup MVPs and POCs to enterprise applications with hundreds of servers.

MongoDB for NoSQL developers

As MongoDB has grown from being a niche database solution to the Swiss Army knife of NoSQL technologies, more developers are coming to it from a NoSQL background as well.

Setting the SQL to NoSQL differences aside, users coming from columnar databases face the most challenges. With Cassandra and HBase being the most popular column-oriented database management systems, we will examine the differences and how a developer can migrate a system to MongoDB.

Flexibility: MongoDB's notion of documents that can contain sub-documents nested in complex hierarchies is really expressive and flexible. This is similar to the comparison between MongoDB and SQL, with the added benefit that MongoDB maps more easily to plain old objects from any programming language, allowing for easy deployment and maintenance.

Flexible query model: A user can selectively index some parts of each document; query based on attribute values, regular expressions, or ranges; and have as many properties per object as needed by the application layer. Primary and secondary indexes, as well as special types of indexes like sparse ones, can help greatly with query efficiency. Using a JavaScript shell with MapReduce makes it really easy for most developers, and many data analysts, to quickly take a look into data and get valuable insights.
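As a minimal illustration in the mongo shell (the collection and field names here are hypothetical, not taken from the book's examples), selective indexing together with value, regular expression, and range queries could look like this:

> db.books.createIndex({title: 1})                        // index only the fields we query on
> db.books.createIndex({"meta.isbn": 1}, {sparse: true})  // a sparse index skips documents missing the field
> db.books.find({title: /^Mastering/})                    // regular expression match
> db.books.find({price: {$gte: 20, $lte: 40}})            // range query on an attribute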

Native aggregation: The aggregation framework provides an ETL pipeline for users to extract and transform data from MongoDB and either load it in a new format or export it from MongoDB to other data sources. This can also help data analysts and scientists get the slice of data they need, performing data wrangling along the way.
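A short, hypothetical sketch of such a pipeline in the mongo shell follows; the orders collection and its fields are assumptions made purely for illustration:

> db.orders.aggregate([
    {$match: {status: "completed"}},                           // extract only the documents we care about
    {$group: {_id: "$customerId", total: {$sum: "$amount"}}},  // transform by grouping and summing
    {$sort: {total: -1}},
    {$out: "customer_totals"}                                  // load the result into a new collection
  ])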

Schemaless model: This is a result of MongoDB's design philosophy to give applications the power and responsibility to interpret the different properties found in a collection's documents. In contrast to Cassandra's or HBase's schema-based approach, in MongoDB a developer can store and process dynamically generated attributes.
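For instance, documents with entirely different attributes can live in the same collection and still be queried; the events collection below is a hypothetical example:

> db.events.insert({type: "click", page: "/home", ts: new Date()})
> db.events.insert({type: "purchase", sku: "A-1001", amount: 25.5, coupon: {code: "SPRING", pct: 10}})
> db.events.find({"coupon.code": "SPRING"})  // only documents carrying this attribute will match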

MongoDB key characteristics and use cases

In this section, we will analyze MongoDB's characteristics as a database. Understanding the features that MongoDB provides can help developers and architects evaluate the requirement at hand and how MongoDB can help fulfill it. Also, we will go through some common use cases from MongoDB Inc's experience that have delivered the best results for its users.

Key characteristics

MongoDB has grown to a general purpose NoSQL database, offering the best of both RDBMS and NoSQL worlds. Some of the key characteristics are:

It's a general purpose database. In contrast with other NoSQL databases that are built for purpose (for example, graph databases), MongoDB can serve heterogeneous loads and multiple purposes within an application.

Flexible schema design. A document-oriented approach with attributes that are not predefined and can be modified on the fly is a key contrast between MongoDB and relational databases.

It's built with high availability from the ground up. In our era of five nines in availability, this has to be a given. Coupled with automatic failover on detection of a server failure, this can help achieve high uptime.

Feature rich. Offering the full range of SQL equivalent operators along with features such as MapReduce, aggregation framework, TTL/capped collections, and secondary indexing, MongoDB can fit many use cases, no matter how diverse the requirements are.

Scalability and load balancing. It's built to scale, both vertically and, most importantly, horizontally. Using sharding, an architect can share load between different instances and achieve both read and write scalability. Data balancing happens automatically and transparently to the user, thanks to the shard balancer.

Aggregation framework. Having an extract-transform-load (ETL) framework built into the database means that a developer can perform most of the ETL logic before the data leaves the database, eliminating, in many cases, the need for complex data pipelines.

Native replication. Data will get replicated across a replica set without complicated setup.

Security features. Both authentication and authorization are taken into account so that an architect can secure her MongoDB instances.

JSON (BSON, Binary JSON) objects for storing and transmitting documents. JSON is widely used across the web for frontend and API communication and as such it's easier when the database is using the same protocol.

MapReduce. Even though the MapReduce engine isn't as advanced as it is in dedicated frameworks, it is nonetheless a great tool for building data pipelines.

Querying and geospatial information in 2D and 3D. This may not be critical for many applications, but if it is for your use case then it's really convenient to be able to use the same database for geospatial calculations along with data storage.

What is the use case for MongoDB?

MongoDB being a hugely popular NoSQL database means that there are several use cases where it has succeeded in supporting quality applications with a great time to market.

Many of its most successful use cases center around the following areas:

Integration of siloed data, providing a single view of it

Internet of Things

Mobile applications

Real-time analytics

Personalization

Catalog management

Content management

All these success stories share some common characteristics. We will try and break these down in order of relative importance.

Schema flexibility is probably the most important one. Being able to store documents with different properties inside one collection can help both during the development phase and when ingesting data from heterogeneous sources that may or may not have the same properties. In contrast with an RDBMS, where columns need to be predefined and sparse data can be penalized, in MongoDB this is the norm, and it's a feature that most use cases share. Having the ability to deeply nest attributes into documents, add arrays of values into attributes, and all the while be able to search and index these fields helps application developers exploit the schema-less nature of MongoDB.
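A brief shell sketch of this pattern, using a hypothetical products collection, shows nested attributes and arrays that remain searchable and indexable:

> db.products.insert({name: "t-shirt", tags: ["cotton", "summer"], specs: {sizes: ["S", "M", "L"], color: "blue"}})
> db.products.createIndex({tags: 1})            // multikey index over the array values
> db.products.createIndex({"specs.color": 1})   // index on a nested attribute
> db.products.find({tags: "summer", "specs.color": "blue"})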

Scaling and sharding are the most common patterns for MongoDB use cases. Easily scaling using built-in sharding and using replica sets for data replication and offloading primary servers from read load can help developers store data effectively.

Many use cases also use MongoDB as a way of archiving data. Used as a pure data store and not having the need to define schemas, it's fairly easy to dump data into MongoDB, only to be analyzed at a later date by business analysts either using the shell or some of the numerous BI tools that can integrate easily with MongoDB. Breaking data down further based on time caps or document count can help serve these datasets from RAM, the use case where MongoDB is most effective.

On this point, keeping datasets in RAM is another common pattern. MongoDB uses memory-mapped storage (called MMAPv1) in most versions up to the most recent, which delegates data mapping to the underlying operating system. This means that on most GNU/Linux-based systems, working with collections that can be stored in RAM will dramatically increase performance. This is less of an issue with the introduction of pluggable storage engines like WiredTiger; more on that in Chapter 8, Storage Engines.

Capped collections are also a feature used in many use cases. Capped collections restrict the documents in a collection by count or by the overall size of the collection. In the latter case, we need an estimate of the size per document to calculate how many documents will fit in our target size. Capped collections are a quick and dirty solution to answer requests like "Give me the last hour's overview of the logs" without any need for maintenance or running async background jobs to clean up our collection. Oftentimes, these may be used to quickly build and operate a queuing system. Instead of deploying and maintaining a dedicated queuing system like ActiveMQ, a developer can use a collection to store messages and then use the native tailable cursors provided by MongoDB to iterate through results as they pile up and feed them to an external system.
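A minimal sketch of this queue-like pattern in the mongo shell follows; the collection name, size limits, and message format are assumptions, and the tailable cursor flags shown are those exposed by the legacy shell:

> db.createCollection("logBuffer", {capped: true, size: 104857600, max: 500000})  // cap by bytes and by document count
> db.logBuffer.insert({level: "INFO", msg: "user logged in", ts: new Date()})
> var cursor = db.logBuffer.find().addOption(DBQuery.Option.tailable).addOption(DBQuery.Option.awaitData)
> while (cursor.hasNext()) { printjson(cursor.next()) }  // feed each message to an external consumer

Tailable cursors only work on capped collections, which is what makes this pattern possible.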

Low operational overhead is also a common pattern in use cases. Developers working in agile teams can operate and maintain clusters of MongoDB servers without the need for a dedicated DBA. MongoDB Management Service can greatly help in reducing administrative overhead, whereas MongoDB Atlas, the hosted solution by MongoDB Inc., means that developers don't need to deal with operational headaches.

In terms of business sectors using MongoDB, there is a huge variety coming from almost all industries. Where there seems to be greater penetration, though, is in use cases that have to deal with lots of data with relatively low business value in each single data point. Fields like IoT can benefit the most by exploiting an availability-over-consistency design, storing lots of data from sensors in a cost-efficient way. Financial services, on the other hand, often have absolutely stringent consistency requirements, aligned with proper ACID characteristics, that make MongoDB more of a challenge to adopt. Transactions carrying financial data can be a few bytes but have an impact of millions of dollars, hence all the safety nets around transmitting this type of information correctly.

Location-based data is also a field where MongoDB has thrived, with Foursquare being one of its most prominent early clients. MongoDB offers quite a rich set of features around 2D and 3D geolocation data, such as searching by distance, geofencing, and intersection between geographical areas.

Overall, the rich feature set is the common pattern across different use cases. By providing features that can be used in many different industries and applications, MongoDB can be a unified solution for all business needs, offering users the ability to minimize operational overhead and at the same time iterate quickly in product development.

MongoDB criticism

MongoDB has had its fair share of criticism throughout the years. The web-scale proposition has been met with skepticism by many developers. The counter argument is that scale is not needed most of the time and we should focus on other design considerations. While this may be true on several occasions, it's a false dichotomy and in an ideal world we would have both. MongoDB is as close as it can get to combining scalability with features and ease of use/time to market.