Officially endorsed by MongoDB, Inc., Practical MongoDB Aggregations helps you unlock the full potential of the MongoDB aggregation framework, including the latest features of MongoDB 7.0. This book provides practical, easy-to-digest principles and approaches for increasing your effectiveness in developing aggregation pipelines, supported by examples for building pipelines to solve complex data manipulation and analytical tasks.
This book is customized for developers, architects, data analysts, data engineers, and data scientists with some familiarity with the aggregation framework. It begins by explaining the framework's architecture and then shows you how to build pipelines optimized for productivity and scale.
Given the critical role arrays play in MongoDB's document model, the book delves into best practices for optimally manipulating arrays. The latter part of the book equips you with examples to solve common data processing challenges so you can apply the lessons you've learned to practical situations. By the end of this MongoDB book, you’ll have learned how to utilize the MongoDB aggregation framework to streamline your data analysis and manipulation processes effectively.
The official guide to developing optimal aggregation pipelines with MongoDB 7.0
Paul Done
BIRMINGHAM—MUMBAI
Practical MongoDB Aggregations
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Acquisition Editor: Sathya Mohan
Lead Development Editor: Siddhant Jain
Development Editor: Afzal Shaikh
Content Development Editor: Rhea Gangavkar
Project Coordinator: Yash Basil
Copy Editor: Safis Editing
Proofreader: Safis Editing
Production Designer: Deepak Chavan
Production reference: 1050923
Published by Packt Publishing Ltd.
Grosvenor House, 11 St Paul's Square, Birmingham, B3 1RB, UK.
ISBN 978-1-83508-064-1
www.packtpub.com
About the author
Paul Done is one of two distinguished solutions architects at MongoDB, Inc., having been at MongoDB for 10 years. He has previously held roles in various software disciplines, including engineering, consulting, and pre-sales, at companies such as Oracle, Novell, and BEA Systems. Paul specializes in databases and middleware, focusing on resiliency, scalability, transactions, event processing, and applying evolvable data model approaches. He spent most of the early 2000s building Java EE (J2EE) transactional systems on WebLogic, integrated with relational databases such as Oracle RAC and messaging systems such as MQSeries.
About the reviewers
Rachelle Palmer is the Product Leader for Developer Database Experience and Developer Education at MongoDB, overseeing the driver client libraries, documentation, framework integrations, and MongoDB University. She is passionate about being hands-on and has built sample applications for MongoDB in Java, PHP, Rust, Python, Node.js, and Ruby. She joined MongoDB in 2013 and was previously director of the technical services engineering team, creating and managing the team that provided support and CloudOps to MongoDB Atlas. She is also passionate about education for all, and runs a family scholarship fund for youth interested in STEM majors.
Nicholas Cottrell has used MongoDB as the data layer for dozens of software development projects since version 2.0, long before joining the company as a consulting engineer in 2017. He now helps technical services engineers learn about all aspects of MongoDB to support an ever-growing range of customer use cases. He holds dual Australian and Swedish citizenship and lives in Normandy, France.
Creating this book has been a labor of love for me over the last few years. Writing a book is hard, but the process is much easier when the book’s subject (MongoDB software) is a joy to use. With that said, this book only exists due to the valuable input and feedback from many of my colleagues at MongoDB, Inc.
Two of these colleagues are the reviewers of this book—Rachelle Palmer and Nicholas Cottrell. Thank you for your tireless work of catching errors and poor phrasing.
Earlier incarnations of the book’s content also benefitted from insight and advice from other colleagues, to whom I will be eternally grateful: Jake McInteer, John Page, Asya Kamsky, Mat Keep, Brian Leonard, Marcus Eagan, Elle Shwer, Ethan Steininger, and Andrew Morgan.
In particular, I would like to single out Asya Kamsky (who also wrote the Foreword for this book). I often refer to Asya as MongoDB Aggregations Royalty. Whatever Asya doesn’t know about aggregations isn’t worth knowing, and being able to lean on her knowledge helped improve this book immensely.
There was also a whole team of people at Packt involved in the production who went above and beyond with a razor-sharp focus on quality, which I greatly appreciated—thank you.
Finally, I’d like to thank my wife, Helen, and my daughters, Caitlin and Elsie. They will have zero interest in reading this book, but they’ve put up with me being ‘zoned out’ for too long while I was creating it. Thank you for just being you and making me laugh every day!
I've been involved with databases since I joined a small database company in the early 1990s. For the next two decades, I believed databases were synonymous with SQL until someone asked me what I thought about the new NoSQL databases, and MongoDB—in particular. This led me to try MongoDB for a small side project I was working on. The rest, as they say, is history.
When I joined 10gen, the company that created MongoDB, in early 2012, the query language was simple and straightforward. There weren't options for easy data aggregation because the general advice was, "Store the data the way you expect to access the data," which was a fantastic approach for fast point queries.
However, as time went on, it became clear that there are times when you must answer questions you never expected when you were first designing the application and database schema. The options for this type of analytical data aggregation within MongoDB were limited. MongoDB's initial pass at data aggregation was the mapReduce command. However, map-reduce was hard to understand and get right. Plus, it required writing and running inefficient JavaScript.
Developers needed a new way to aggregate data natively on the server that was both intuitive and efficient. We called what we came up with the aggregation framework. Since the stages of data processing in the aggregation framework were organized as a pipeline (familiarly evoking processing files on the Unix command line, for those of us who did such things a lot), we also referred to it as the aggregation pipeline. Very quickly, agg became my favorite feature of MongoDB because of its flexibility, power, and ease of debugging.
We've come a long way in the intervening years. We started with just seven stages and three dozen expressions operating on a single collection. Today, there are over 30 stages, more than 150 expressions, and the ability to run aggregations across multiple collections.
The nature of data is such that it never reveals up front all the questions we might have about it in the future. Being able to construct complex aggregations on that data as it evolves is critical for successfully using it. While complex data processing can be performed in any programming language you are comfortable with, the ability to analyze your data without having to move it from the database where it is stored provides a tremendous advantage over exporting and loading the data elsewhere.
For years, I've given talks about the power of aggregation pipelines, answered questions from users about how to do complex analysis, and frequently fielded requests for a comprehensive Aggregation Cookbook. Of course, it would be great to have a repository of recipes with which to solve common data tasks that involve more than a single stage or expression combination, but it's hard to find the time to sit down and write something like that. This is why I was so stoked to see that my colleague, Paul Done, wrote this book, Practical MongoDB Aggregations, and laid the foundation for developers around the world—and it's also why this book is the first in the new MongoDB Press Imprint.
I hope you find this collection of suggestions, general principles, and specific pipeline examples useful in your own application development. I look forward to seeing how people use this book in the coming years to realize the full power of their data.
Asya Kamsky
Principal Engineer
MongoDB, Inc.
This book is about the MongoDB aggregation framework. It provides a set of practical, easy-to-digest principles and approaches for increasing your effectiveness in developing aggregation pipelines, supported by examples for building pipelines to solve various data manipulation and analysis tasks.
The aggregation framework is a runtime component of MongoDB with an API for client applications. By using the aggregation framework, developers can invoke queries to perform sophisticated transformations, analytics, and computations on the data held directly within the database. MongoDB aggregations help with tasks such as filtering, grouping, sorting, and calculating aggregated values from large datasets without needing to extract and process the data externally.
Mastering the MongoDB aggregation framework can seem overwhelming. The focus of this book is to streamline your learning process for aggregations and make difficult concepts and steps simpler to understand. You'll be able to craft aggregation pipelines that exhibit increased performance and scalability by using the guidance and practical examples shared in this book. Instead of inefficiently extracting vast amounts of data for external processing, you'll learn how to shift more of your data processing tasks directly to the database. Furthermore, your developed pipelines will be more robust and adaptable, ready to evolve in response to changing business requirements.
This book is for developers, architects, data analysts, data engineers, and data scientists who have a foundational grasp of MongoDB and preliminary experience with its aggregation framework; that is, a working knowledge of MongoDB is assumed. This book is not for beginners who want to learn about aggregation pipelines from the ground up.
Given the programming-centric approach of MongoDB aggregations, this book is for readers with some coding experience. While knowledge of JavaScript is advantageous, proficiency in any modern programming language is sufficient.
This book will empower readers to elevate their capabilities regarding the richness, agility, performance, and scalability of the pipelines they develop.
Chapter 1, MongoDB Aggregations Explained, provides a level-set of what aggregations are and how to use them.
Chapter 2, Optimizing Pipelines for Productivity, helps you to develop composable and adaptable pipelines.
Chapter 3, Optimizing Pipelines for Performance, informs you how to reduce the latency of your aggregations.
Chapter 4, Harnessing the Power of Expressions, helps you leverage the power of expressions for transforming data, especially arrays.
Chapter 5, Optimizing Pipelines for Sharded Clusters, provides considerations for executing your pipelines against large volumes of data.
Chapter 6, Foundational Examples: Filtering, Grouping, and Unwinding, provides examples of common data manipulation patterns used in many aggregation pipelines, which are relatively straightforward to understand and adapt.
Chapter 7, Joining Data Examples, offers guidance on joining together data from different collections.
Chapter 8, Fixing and Generating Data Examples, provides tools and techniques to clean data within a dataset.
Chapter 9, Trend Analysis Examples, showcases the capabilities of the MongoDB aggregation framework in performing advanced data analytics.
Chapter 10, Securing Data Examples, helps you discover ways to use aggregation pipelines to secure the data in a MongoDB database and reduce the risk of a data breach.
Chapter 11, Time-Series Examples, shows examples of how you can use aggregation pipelines to extract insight from time-series data.
Chapter 12, Array Manipulation Examples, shows how to break down array manipulation problems into manageable pieces, streamlining your assembly of solutions.
Chapter 13, Full-Text Search Examples, demonstrates how to build aggregation pipelines that leverage full-text search capabilities in MongoDB Atlas.
You will require the following software:
Software covered in the book      Operating system requirements
MongoDB version 4.4 or newer      Windows, macOS, or Linux
MongoDB Atlas Search              Windows, macOS, or Linux
MongoDB Shell                     Windows, macOS, or Linux
After reading this book, we encourage you to check out some of the other resources available at https://www.mongodb.com/developer or https://learn.mongodb.com/.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Practical-MongoDB-Aggregations. If there's an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database collection names, folder names, filenames, file extensions, pathnames, dummy URLs, and user input. Here is an example: "When considering the $sort and $group stages, it becomes evident why they have to block."
A block of code is set as follows:
db.persons.find(
  {"vocation": "ENGINEER"},
  {"_id": 0, "vocation": 0, "address": 0},
).sort(
  {"dateofbirth": -1}
).limit(3);

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "A MongoDB database, version 4.2 or greater, that is network accessible from your workstation."
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
1. Scan the QR code or visit the link below:

https://packt.link/free-ebook/9781835080641
2. Submit your proof of purchase.

That’s it! We’ll send your free PDF and other benefits to your email directly.

Getting insights from data stored in a database can be challenging, especially when there are millions or even billions of records to process.
In this chapter, you will learn how the MongoDB aggregation framework is designed to make mass data processing, analysis, and reporting intuitive and performant. Even though you may already be familiar with building basic aggregation pipelines, this chapter will lay a solid foundation to help you understand the mindset required for building more powerful, optimized aggregations for the real world.
By the end of this chapter, you will have a grasp of the following:
- The purpose and design of the MongoDB aggregation framework
- The MongoDB aggregation language's approach for building aggregation pipelines
- Relevant use cases for the MongoDB aggregation framework
- Suggestions for tools to use to run aggregation pipelines and how to get help if you get stuck

The MongoDB aggregation framework enables you to perform data processing and manipulation on the documents in one or more MongoDB collections. It allows you to perform data transformations and gather summary data using various operators for filtering, grouping, sorting, and reshaping documents. You construct a pipeline consisting of one or more stages, each applying a specific transformation operation on the documents as they pass through the pipeline. One of the common uses of an aggregation pipeline is to calculate sums and averages, similar to using SQL's GROUP BY clause in a relational database but tailored to the MongoDB document-oriented structure.
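For instance, here is a minimal sketch of a three-stage pipeline in MongoDB Shell syntax; the orders collection and its status, category, and amount fields are hypothetical:

// Hypothetical 'orders' collection with 'status', 'category', and 'amount' fields
db.orders.aggregate([
  // Stage 1: filter - keep only completed orders
  { $match: { status: "COMPLETE" } },
  // Stage 2: group - total the order amounts per category (akin to SQL's GROUP BY)
  { $group: { _id: "$category", totalAmount: { $sum: "$amount" } } },
  // Stage 3: sort - highest-spending categories first
  { $sort: { totalAmount: -1 } },
]);

Each stage receives the entire output of the previous one, so you can read the pipeline top to bottom as a series of transformations.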
The MongoDB aggregation framework enables users to send an analytics or data processing workload—written using an aggregation language—to the database to execute the workload against the data it holds. The MongoDB aggregation framework has two parts:
- An aggregation API provided by the MongoDB driver that you embed in your application. You define an aggregation pipeline in your application's code and send it to the database for processing.
- The aggregation runtime in the database that receives the pipeline request from the application and executes the pipeline against the persisted data.

Figure 1.1 illustrates these two elements and their relationship:
Figure 1.1: MongoDB aggregation framework
Each driver provides APIs to enable an application to use both the MongoDB Query Language (MQL) and the aggregation framework. In the database, the aggregation runtime reuses the query runtime to efficiently execute the query part of an aggregation workload that typically appears at the start of an aggregation pipeline.
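As an illustration of that reuse, the leading $match stage of a pipeline expresses the same predicate as an MQL find(); the persons collection and vocation field below are borrowed from this book's later examples:

// MQL query executed by the query runtime
db.persons.find({ "vocation": "ENGINEER" });

// Equivalent aggregation; its leading $match is handed to that same query runtime
db.persons.aggregate([
  { $match: { "vocation": "ENGINEER" } }
]);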
MongoDB's aggregation pipeline language is somewhat of a paradox. It can appear daunting, yet it is straightforward. It can seem verbose, yet it is lean and to the point. It is Turing complete and able to solve any business problem. Conversely, it is a strongly opinionated domain-specific language (DSL); if you attempt to veer away from its core purpose of mass data manipulation, it will try its best to resist you.
Invariably, for beginners, the aggregation framework seems difficult to understand and comes with an initially steep learning curve that you must overcome to become productive. In some programming languages, you only need to master a small set of the language's aspects to be largely effective. With MongoDB aggregations, the initial effort you must invest is slightly greater. However, once mastered, users find it provides an elegant, natural, and efficient solution to breaking down a complex set of data manipulations into a series of simple, easy-to-understand steps.
The MongoDB aggregation pipeline language is focused on data-oriented problem-solving rather than business process problem-solving. It can be regarded as a functional programming language rather than a procedural programming language. Since an aggregation pipeline is an ordered series of statements, called stages, the entire output of one stage forms the entire input of the next stage, with no side effects. This functional nature is why many users regard the aggregation framework as having a steeper learning curve than many languages—not because it is inherently more difficult to understand but because most developers come from a procedural programming background and not a functional one. Most developers also have to learn how to think like a functional programmer to learn the aggregation framework.
The functional characteristics of the aggregation framework ultimately make it especially powerful for processing massive datasets. Users focus more on defining the what in terms of the required outcome and less on the how of specifying the exact logic to apply to achieve each transformation. You provide one specific and clearly advertised purpose for each stage in the pipeline. At runtime, the database engine can then understand the exact intent of each stage. For example, the database engine can obtain clear answers to the questions it asks, such as, "Is this stage for performing a filter or is this stage for grouping on some fields?" With this knowledge, the database engine has the opportunity to optimize the pipeline at runtime. Figure 1.2 shows an example of the database performing a pipeline optimization. It may decide to reorder stages to optimally use an index while ensuring that the output hasn't changed. Alternatively, it may choose to execute some steps in parallel against subsets of the data in different shards, reducing the response time while again ensuring the output hasn't changed.
Figure 1.2: Database performing a pipeline optimization
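As a sketch of the kind of rewrite shown in Figure 1.2 (the field names here are illustrative), the engine can move a filter ahead of a sort because the result is provably identical:

// Pipeline as written by the developer: sort everything, then filter
db.persons.aggregate([
  { $sort: { "dateofbirth": -1 } },
  { $match: { "vocation": "ENGINEER" } }
]);

// The optimizer can execute it as if $match came first, sorting far fewer
// documents (and potentially using an index) while the output is unchanged:
// [ { $match: { "vocation": "ENGINEER" } }, { $sort: { "dateofbirth": -1 } } ]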
Last and least in terms of importance is the syntax. So far, MongoDB aggregations have been described here as a programming language. However, what syntax do you use to construct a MongoDB aggregation pipeline? The answer is it depends, and the answer is mostly irrelevant.
This book will highlight pipeline examples using MongoDB Shell and the JavaScript interpreter it runs in. The book will express aggregation pipelines using a JSON-based syntax. However, if you are using one of the many programming language drivers that MongoDB offers, you will be using that language to construct an aggregation pipeline, not JSON. To learn more about MongoDB drivers, see https://docs.mongodb.com/drivers/. An aggregation is specified as an array of objects, regardless of how the programming language may facilitate it. This programmatic rather than textual format has a couple of advantages compared to querying with a string. It has a low vulnerability to injection attacks, and it is highly composable.
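Because a pipeline is just an array of stage objects, you can compose it from reusable fragments in ordinary JavaScript; the variable names below are illustrative:

// Reusable stage fragments defined once
const filterEngineers = { $match: { "vocation": "ENGINEER" } };
const newestFirst = { $sort: { "dateofbirth": -1 } };

// Different pipelines assembled from the same building blocks
db.persons.aggregate([ filterEngineers, newestFirst, { $limit: 3 } ]);
db.persons.aggregate([ filterEngineers, { $count: "numEngineers" } ]);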
The aggregation framework is versatile and used for many different data processing and manipulation tasks. Some typical use cases include the following:
- Generating business reports, which include roll-ups, sums, and averages
- Performing real-time analytics to generate insight and actions for end users
- Presenting real-time business dashboards with an up-to-date summary status
- Performing data masking to securely obfuscate and redact sensitive data ready to expose to consumers via views
- Joining data together from different collections on the server side rather than in the client application for improved performance
- Conducting data science activities such as data discovery and data wrangling
- Performing mass data analysis at scale (i.e., big data) as a faster and more intuitive alternative to technologies such as Hadoop
- Executing real-time queries where deeper server-side data post-processing is required than what is available via the default MongoDB Query Language
- Navigating a graph of relationships between records, looking for patterns
- Performing the transform part of an extract, load, transform (ELT) workload to transform data landed in MongoDB into a more appropriate shape for consuming applications to use
- Enabling data engineers to report on the quality of data in the database and perform data-cleansing activities
- Updating a materialized view with the results of the most recent source data changes so that real-time applications don't have to wait for long-running analytics jobs to complete
- Performing full-text search and fuzzy search on data using MongoDB Atlas Search (see https://www.mongodb.com/atlas/search)
- Exposing MongoDB data to analytics tools that don't natively integrate with MongoDB via SQL, ODBC, or JDBC (using MongoDB BI Connector, see https://www.mongodb.com/docs/bi-connector/current/, or Atlas SQL, see https://www.mongodb.com/atlas/sql)
- Supporting machine learning frameworks for efficient data analysis (e.g., via MongoDB Spark Connector, see https://docs.mongodb.com/spark-connector)

MongoDB released the first major version of the database (version 1.0) in early 2009. Back then, users and the predominant company behind the database, MongoDB, Inc. (then called 10gen), were still establishing the sort of use cases the database would excel at and where the critical gaps were. Within half a year of this first major release, the engineering team at MongoDB identified an essential requirement to generate materialized views on demand. Users needed this capability to maintain counts, sums, and averages for their real-time client applications to query. By the end of 2009, in time for the following major release (1.2), the database engineers introduced a quick tactical solution to address this gap. This solution involved embedding a JavaScript engine in the database and allowing client applications to submit and execute server-side logic using a simple map-reduce-style API. Although from a functional perspective, the MongoDB map-reduce capability provided a solution to the typical data processing requirements of users, it came with some drawbacks:
- The database used an inherently slow JavaScript engine to execute the user's code.
- Users had to provide two sets of JavaScript logic: a map (or matching) function and a reduce (or grouping) function. Both were unintuitive to develop and lacked a solid data-oriented bias.
- At runtime, the database could not determine the specific intent of an arbitrary piece of logic. The database engine had no opportunity to identify and apply optimizations. It couldn't easily target indexes or reorder logic for more efficient processing. The database had to be conservative, executing the workload with minimal concurrency and employing locks at various times to prevent race conditions.
- If returning the response to the client application, rather than sending the output to a collection, the response payload had to be less than 16 MB.

Over the subsequent two years, MongoDB engineers envisioned a better solution as user behavior with the map-reduce capability became more understood. Given the ability to hold large datasets in MongoDB, users increasingly tried to use map-reduce to perform mass data processing. They were hitting the same map-reduce limitations. Users desired a more targeted capability leveraging a data-oriented DSL. The engineers saw how to deliver a framework enabling developers to define data manipulation steps with valuable composability characteristics. Each step would have a clearly advertised intent, allowing the database engine to apply optimizations at runtime. The engineers could also design a framework that would execute natively in the database and not require a JavaScript engine. In mid-2012, the database introduced the aggregation framework solution in the 2.2 version of MongoDB, which provided a far more powerful, efficient, scalable, and easy-to-use replacement to map-reduce.
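To make the contrast concrete, here is a sketch of the deprecated map-reduce style beside its aggregation framework equivalent; the orders collection and its fields are hypothetical:

// Deprecated map-reduce: two opaque JavaScript functions the engine cannot inspect
db.orders.mapReduce(
  function () { emit(this.customerId, this.amount); },   // map
  function (key, values) { return Array.sum(values); },  // reduce
  { out: { inline: 1 } }
);

// Aggregation framework equivalent: one declarative stage the engine can optimize
db.orders.aggregate([
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } }
]);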
Within its first year, the aggregation framework rapidly became the go-to tool for processing large volumes of data in MongoDB. Now, over a decade on, it is as if the aggregation framework has always been part of MongoDB. It feels like part of the database's core DNA. The old map-reduce capability in MongoDB is deprecated and offers no value nowadays. A MongoDB aggregation pipeline is always the correct answer for processing data in the database!
The following is a summary of the evolution of the aggregation framework in terms of significant capabilities added in each major release of MongoDB from when the framework debuted in MongoDB 2.2:
- MongoDB 2.2 (August 2012): Marked the initial release of the MongoDB aggregation framework
- MongoDB 2.4 (March 2013): Focused predominantly on aggregation performance improvements, especially for sorting data, but also included a new string concatenation operator
- MongoDB 2.6 (April 2014)