Master the capabilities of RethinkDB and implement them to develop efficient real-time web applications. The way to better database development is here!
This book caters to real-time application developers looking to master their skills with RethinkDB. A basic understanding of RethinkDB and Node.js is essential to get the most out of this book.
RethinkDB has a lot of cool things to be excited about: ReQL (its readable, highly functional syntax), cluster management, primitives for 21st century applications, and changefeeds. This book starts with a brief overview of the RethinkDB architecture and data modeling, and covers advanced ReQL queries for working with JSON documents. Then, you will quickly move on to implementing these concepts in real-world scenarios by building real-time applications for polling, data synchronization, the stock market, and the geospatial domain using RethinkDB and Node.js. You will also see how to tweak RethinkDB's capabilities to ensure faster data processing by exploring sharding and replication techniques in depth.
Then, we will take you through the more advanced administration tasks and show you various deployment techniques using PaaS, Docker, and Compose. By the time you have finished reading this book, you will have taken your knowledge of RethinkDB to the next level and will be able to use its concepts to develop efficient, real-time applications with ease.
This book is a unique blend of comprehensive theory and real-world examples to help you master RethinkDB.
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2016
Production reference: 1131216
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78646-107-0
www.packtpub.com
Author
Shahid Shaikh
Copy Editor
Vikrant Phadkay
Reviewer
Rafael Ferreira dos Santos
Project Coordinator
Shweta H Birwatkar
Commissioning Editor
Amey Varangaonkar
Proofreader
Safis Editing
Acquisition Editor
Vinay Argekar
Indexer
Mariammal Chettiyar
Content Development Editor
Amrita Noronha
Graphics
Disha Haria
Technical Editor
Akash Patel
Production Coordinator
Arvindkumar Gupta
Shahid Shaikh is an engineer, blogger, and author living in Mumbai, India. He is a full-time professional and a part-time blogger. He loves solving programming problems and is an expert in software backend design and development.
Shahid has been blogging and teaching programming in a practical way for more than two years on his blog. His blog is well known among developers, and people from all around the world take advantage of his expertise in various programming problems related to backend development.
Shahid has also authored a book on Sails.js, an MVC framework for Node.js, published by Packt.
I would like to thank my parents, my family, and my friends for being kind and supportive during the book development period. I would like to thank my friends, who changed their plans according to my schedule for various occasions. I would also like to thank the RethinkDB team for helping me out with various architectural questions. You guys are awesome!
Rafael Ferreira dos Santos
Rafael is Ted's father, Geysla's husband, and a developer, entrepreneur, and BJJ addict with 10 years of experience in software development. He loves to code, especially in ASP.NET and Node.js.
Thanks to Glenn Morton and the QuizJam team for such an amazing workplace. I would like to thank God and my wife for all the support and love that they give to me. I would not be in such an amazing moment without you.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
RethinkDB is a database built for the real-time web. It offers a variety of features over the NoSQL databases already on the market, such as a very powerful query language, changefeeds, and easy scaling.
In this book, we cover RethinkDB in depth, including mastering-level topics such as scaling, integration, and deployment. We also cover programming with RethinkDB, with step-by-step screenshots to help you understand the concepts easily.
Chapter 1, The RethinkDB Architecture and Data Model, covers the architecture of RethinkDB and data modeling, along with revisiting the concepts of RethinkDB.
Chapter 2, RethinkDB Query Language, covers RethinkDB query language, or ReQL, which is the core and essential learning curve of RethinkDB. ReQL provides various SQL-like features such as join, indexing, and foreign keys, along with document-based storage with NoSQL.
Chapter 3, Data Exploration Using RethinkDB, covers data extraction and loading along with example use cases using ReQL.
Chapter 4, Performance Tuning in RethinkDB, covers various methods and tricks to improve the performance of RethinkDB.
Chapter 5, Administration and Troubleshooting Tasks in RethinkDB, covers failover mechanisms along with example use cases.
Chapter 6, RethinkDB Deployment, covers various options available to deploy RethinkDB on production.
Chapter 7, Extending RethinkDB, covers the integration of RethinkDB with other products, such as Elasticsearch.
Chapter 8, Full Stack Development with RethinkDB, covers the implementation of full stack JavaScript application using RethinkDB.
Chapter 9, Polyglot Persistence Using RethinkDB, covers complex synchronization application development using RethinkDB.
Chapter 10, Using RethinkDB and Horizon, covers the RethinkDB-powered framework called Horizon, with a demo application.
A computer with at least 2 GB of RAM that can support Node.js and Java.
This book caters to real-time application developers looking to master their skills with RethinkDB. A basic understanding of RethinkDB and Node.js is essential to get the most out of this book. Backend developers, full stack developers, and database architects will find this book useful.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to [email protected], and mention the book title via the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on https://www.packtpub.com/books/info/packt/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-RethinkDB. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from: https://www.packtpub.com/sites/default/files/downloads/MasteringRethinkDB_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
You can contact us at [email protected] if you are having a problem with any aspect of the book, and we will do our best to address it.
RethinkDB is a real-time, open source distributed database. It stores JSON documents (basically unstructured data) in an operable format with distribution (sharding and replication). It also provides real-time push of JSON data to the application server, which redefines real-time web application development.
In this chapter, we will look at its architecture and key elements in order to understand how RethinkDB supports these awesome features with high performance. We will also look at data modeling, along with SQL-like operations in NoSQL, that is, joins.
Here is the list of topics we are going to cover in this chapter:
RethinkDB's architecture consists of various components, such as the cluster, the query execution engine, filesystem storage, push changes (real-time feeds), and of course the RethinkDB client drivers.
Refer to the following diagram to understand the block-level components of RethinkDB:
RethinkDB provides official client drivers for Node.js, Python, Ruby, and Java, plus various unofficial community drivers, which are listed on the official website (https://rethinkdb.com/docs/install-drivers/). At the time of writing this book, only these languages were officially supported. In this book, the code examples use Node.js.
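As a quick illustration, here is a minimal sketch of connecting with the official Node.js driver and running a trivial query; localhost and 28015 are the driver's default host and client port:

var r = require('rethinkdb');

// connect to a local RethinkDB instance on the default client port
r.connect({ host: 'localhost', port: 28015 }, function (err, conn) {
  if (err) throw err;
  // run a trivial query to verify that the connection works
  r.dbList().run(conn, function (err, result) {
    if (err) throw err;
    console.log(result); // for example: [ 'rethinkdb', 'test' ]
    conn.close();
  });
});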
The RethinkDB query handler, as the name implies, executes queries and returns the response to the client. It does so by performing a lot of internal operations, such as sorting, indexing, locating the right cluster node, or merging data from various nodes. All of these operations are performed by the RethinkDB query handler. We will look at this in detail in the upcoming section.
RethinkDB is a distributed database designed for high-performance, real-time operations. It manages distribution through clustering (sharding and replication). Each node in a RethinkDB cluster is just another instance of the main RethinkDB process and stores data. We will look at sharding and replication in detail in the upcoming section.
This is a revolutionary concept introduced by RethinkDB. Consider this scenario: you are developing an application for the stock market, where a great many changes happen in a short amount of time. Obviously, we store every entry in the database and need to make sure that the other connected nodes or clients know about these changes.
In order to do so, the conventional way is to keep looking (polling) for data in the particular collection or table in order to spot changes. This increases the latency and turnaround time of packets, and we all know that a network call over a wide area network (WAN), such as an HTTP call, is really costly.
Then came sockets. With sockets, we still perform the polling operation, but at the socket layer rather than the HTTP layer. The size of network requests may be reduced, but we are still polling.
Socket.io is one of the popular projects available for real-time web development.
RethinkDB proposes the reverse approach: what if the database itself tells you:
Hey, some changes just happened in the stock value, and here are the new and old values.
This is exactly what RethinkDB's push changes (changefeeds, in technical terms) do. Once you subscribe to a particular table to watch for its changes, RethinkDB just keeps pushing the old and new values of every change to the connected client. By "connected client," I mean a RethinkDB client and not a web application client. The difference between polling and push changes is shown here:
So you will get the data changes in one of the RethinkDB clients, say Node.js, and then you can simply broadcast them over the network, probably using sockets.
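Here is a minimal sketch of subscribing to a changefeed from Node.js; the stocks table and the commented-out Socket.io broadcast are assumptions for illustration:

var r = require('rethinkdb');

r.connect({ host: 'localhost', port: 28015 }, function (err, conn) {
  if (err) throw err;
  // changes() keeps the cursor open and pushes every write on the table
  r.db('test').table('stocks').changes().run(conn, function (err, cursor) {
    if (err) throw err;
    cursor.each(function (err, change) {
      if (err) throw err;
      // each change carries the previous and the new version of the document
      console.log('old:', change.old_val, 'new:', change.new_val);
      // io.emit('stock-update', change); // hypothetical Socket.io broadcast
    });
  });
});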
But why are we using sockets at all when RethinkDB can provide us the changes in the data? Because RethinkDB delivers changes to the middle layer, not the client layer; having clients talk directly to the database can be risky. Hence, it has not been allowed yet.
But the RethinkDB team is working on another project called Horizon, which solves the issue mentioned previously by allowing clients to communicate with the database through a secure middle tier. We will look at Horizon in detail in Chapter 10, Using RethinkDB and Horizon.
The RethinkDB query engine is a very critical and important part of RethinkDB. It performs various computations and internal logic operations to maintain high performance along with good system throughput.
Refer to the following diagram to understand query execution:
Upon the arrival of a query, RethinkDB divides it into various stacks. Each stack contains various methods and internal logic to perform its operation, but there are three core methods that play key roles:
To speed up the process, these stacks are transported to every related server, and each server evaluates them in parallel with the other servers. This process runs recursively in order to merge the data into a stream for the client.
The stack in the node grabs the data from the stack after it and performs its own method of execution and transformation. The data from each server is then combined into a single result set and streamed to the client.
In order to speed up the process and maintain high performance, every query is completely parallelized across the relevant cluster nodes. Each node then performs its part of the query execution, and the data is merged back together into a single result set.
The query engine maintains efficiency in the process too; for example, if a client requests only a certain result that does not live on a sharded or replicated server, RethinkDB will not execute the parallel operation and will just return the result set. This is also referred to as lazy execution.
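To make this concrete, here is a sketch (assuming a stocks table and an open connection conn) contrasting a primary key lookup, which can be routed straight to the one shard that owns the key, with a filter on a non-indexed field, which has to be evaluated on every shard in parallel:

// routed directly to the shard that owns this primary key
r.db('test').table('stocks')
  .get('d0041fcf-9a3a-460d-8450-4380b00ffac0')
  .run(conn, function (err, doc) {
    if (err) throw err;
    console.log(doc);
  });

// evaluated on all shards in parallel, results merged into one set
r.db('test').table('stocks')
  .filter({ symbol: 'ACME' })
  .run(conn, function (err, cursor) {
    if (err) throw err;
    cursor.toArray(function (err, rows) {
      if (err) throw err;
      console.log(rows);
    });
  });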
To maintain concurrency and high performance of query execution, RethinkDB uses block-level Multiversion Concurrency Control (MVCC). If one user is reading some data while other users are writing to it, there is a high chance of inconsistent data, and to avoid that we use a concurrency control algorithm. One of the simplest and most commonly used methods, employed by SQL databases, is to lock the transaction, that is, make the user wait while a write operation is being performed on the data. This slows down the system, and since big data promises fast read times, it simply won't work.
Multiversion concurrency control takes a different approach. Here, each user sees a snapshot of the data (that is, a child copy of the master data); if changes are in progress on the master copy, the child copies or snapshots are not updated until the change has been committed:
RethinkDB does use block-level MVCC, and this is how it works. Whenever an update or write operation is performed during a read operation, RethinkDB takes a snapshot of each shard and maintains a different version of each block to make sure every read and write operation can proceed in parallel. RethinkDB does take exclusive block-level locks when multiple updates happen on the same document, but these locks are very short in duration because the blocks are cached; hence the database appears lock-free in practice.
RethinkDB provides atomicity at the level of a JSON document. This is different from other NoSQL systems; most NoSQL systems provide atomicity only for individual small operations performed on a document before the actual commit. RethinkDB does the opposite: it provides atomicity for the document no matter what combination of operations is performed on it.
For example, a user may want to read some data (say, the first name from one document), change it to uppercase, append the last name coming from another JSON document, and then update the JSON document. All of these operations will be performed atomically in one update operation.
RethinkDB does limit this atomicity for a few kinds of operations. For example, values produced by JavaScript code cannot be written atomically, the result of a subquery is not atomic, and replace operations are not performed atomically.
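As a sketch, here is a simplified single-document variant of the uppercase example, assuming a users table with first_name and last_name fields, a userId value, and an open connection conn:

// read first_name, uppercase it, append last_name, and write the result
// back; document-level atomicity makes this one indivisible update
r.db('test').table('users').get(userId).update({
  full_name: r.row('first_name').upcase().add(' ').add(r.row('last_name'))
}).run(conn, function (err, result) {
  if (err) throw err;
  console.log(result); // for example: { replaced: 1, ... }
});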
RethinkDB supports widely used filesystems such as NTFS, ext4, and so on. RethinkDB also supports direct I/O for efficiency and performance, but it is not enabled by default.
A file is stored on disk, and when it is requested by a program, the operating system first loads it into main memory for faster reads. The operating system could read directly from the disk too, but that would slow down response times because of the heavy cost of I/O operations. Hence, the operating system first puts the file into main memory. This is called the buffer cache.
Databases generally manage data caching at the application level and do not need the operating system to cache data for them. In such cases, buffering in two places (main memory and the application cache) becomes an overhead, since data is first moved to main memory and then to the application cache.
This double buffering of data results in more CPU consumption and load on the memory too.
Direct I/O is a filesystem feature for applications that want to avoid buffering in main memory and read files directly from disk. When direct I/O is used, data is transferred directly to the application buffer instead of the memory buffer, as shown in the following diagram:
Direct I/O can be used in two ways:
Direct I/O provides great efficiency and performance by reducing CPU consumption and the overhead of managing two buffers.
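RethinkDB enables this behavior through a startup flag; a minimal sketch of starting the server with it (remember, it is off by default):

# start the server with direct I/O enabled
rethinkdb --direct-io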
RethinkDB uses a custom-built storage engine inspired by the B-tree filesystem (Btrfs) originally developed at Oracle. There is not much information available about RethinkDB's custom storage engine right now, but it makes the following promises:
Due to these features, RethinkDB can handle large amounts of data with a very small memory footprint.
Sharding is partitioning, where the database is split across multiple smaller databases to improve performance and read times. In replication, we basically copy the database across multiple servers to provide quicker lookups and lower response times. Content delivery networks are the best example of this.
RethinkDB, just like other NoSQL databases, also uses sharding and replication to provide fast response and greater availability. Let's look at it in detail bit by bit.
RethinkDB makes use of a range-sharding algorithm to provide the sharding feature. It performs sharding on the table's primary key to partition the data; it cannot use any other key to do so. In RethinkDB, the shard key and the primary key are the same.
Upon a request to create a new shard for a particular table, RethinkDB examines the table and tries to find the optimal split point to create evenly sized shards.
For example, say you have a table with 1,000 rows, the primary key ranging from 0 to 999, and you've asked RethinkDB to create two shards for you.
RethinkDB will likely pick primary key 500 as the split point. It will store every entry ranging from 0 to 499 in shard 1, while data with primary keys 500 to 999 will be stored in shard 2. The shards will be distributed across the cluster automatically.
You can specify the sharding and replication settings at the time of table creation or alter them later. You cannot specify the split point manually; that is RethinkDB's job, done internally. You cannot have fewer servers than shards.
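As a sketch (assuming a test database and an open connection conn), creating a table with a specific sharding and replication configuration looks like this in ReQL:

// create a table pre-split into two shards, each kept as two replicas
r.db('test').tableCreate('stocks', { shards: 2, replicas: 2 })
  .run(conn, function (err, result) {
    if (err) throw err;
    console.log(result.config_changes); // the table's new configuration
  });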
You can always visit the RethinkDB administrative screen to increase the number of shards or replicas:
We will look at this in more detail, with practical use cases, in Chapter 5, Administration and Troubleshooting Tasks in RethinkDB, which is totally focused on RethinkDB administration.
Let's see in more detail how range-based sharding works. Sharding can be basically done in two ways, using vertical partitioning or horizontal partitioning:
In the range-sharding algorithm, we use a service called a locator to determine the entries in a particular table. The locator service finds the data using range queries, which makes it faster than the alternatives. If you do not have a range, or some other indicator of which data belongs to which shard on which server, you need to look through every database to find a particular document, which no doubt turns into a very slow process.
RethinkDB maintains a relevant piece of metadata, which it refers to as the directory. The directory maintains a list of which nodes (RethinkDB instances) are responsible for each shard. Each node is responsible for keeping its copy of the directory up to date.
RethinkDB allows users to choose the location of shards. You can again go to the web-based administrative screen to perform this. However, you need to set up the RethinkDB servers themselves manually using the command line; that part cannot be done via the web-based interface.
Replication provides a copy of the data in order to improve performance, availability, and failover handling. Each shard in RethinkDB can have a configurable number of replicas, and any RethinkDB instance (node) in the cluster can act as a replication node for any shard. You can always change the replication settings from the RethinkDB web console.
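From code, the same change is a one-liner with the reconfigure() command; a sketch assuming the stocks table from earlier and an open connection conn:

// keep two shards, but raise the replica count to three
r.db('test').table('stocks').reconfigure({ shards: 2, replicas: 3 })
  .run(conn, function (err, result) {
    if (err) throw err;
    // config_changes lists the old and new table configurations
    console.log(result.config_changes);
  });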
Currently, RethinkDB does not allow more than one replica of the same data on a single RethinkDB instance, due to some technical limitations. Every RethinkDB instance stores the metadata of tables. In case of changes in the metadata, RethinkDB sends those changes to the other RethinkDB instances in the cluster in order to keep the metadata up to date across every shard and replica.
RethinkDB uses the primary key by default to index a document in a table. If the user does not provide primary key information during the creation of the table, RethinkDB creates one with the default name, id.
The default-generated primary key contains information about the shard's location in order to fetch the information directly from the appropriate shard. The primary key of each shard is indexed using a B-tree data structure.
An example of a RethinkDB-generated primary key is as follows:
d0041fcf-9a3a-460d-8450-4380b00ffac0
RethinkDB also provides secondary keys and compound keys (combinations of keys). It even provides a multi-index feature that allows arrays of values to act as keys, which in turn can themselves be compound keys.
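Here is a sketch of creating each kind of index, assuming a users table with email, last_name, and first_name fields and a tags array:

function done(err) { if (err) throw err; }

// simple secondary index on a single field
r.table('users').indexCreate('email').run(conn, done);

// compound index built from two fields
r.table('users').indexCreate('full_name',
  [r.row('last_name'), r.row('first_name')]).run(conn, done);

// multi index: every element of the tags array becomes a key
r.table('users').indexCreate('tags', { multi: true }).run(conn, done);

// querying through a secondary index with getAll
r.table('users').getAll('jane@example.com', { index: 'email' })
  .run(conn, function (err, cursor) {
    if (err) throw err;
    cursor.each(function (err, doc) { console.log(doc); });
  });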
Having system-generated primary keys is very efficient and fast, because the query execution engine can immediately determine which shard holds the data; hence, there is no need for extra routing. A custom primary key, say an alphabetic string or a number, may force RethinkDB to search for the data across various cluster nodes, which slows down performance. You can always use secondary keys of your choice to perform further indexing and searching based on your application's needs.
RethinkDB provides automatic failover handling in multi-server configurations where multiple replicas of a table are present. If a node fails for any reason, RethinkDB finds another node to divert requests to and maintains availability. However, there are some requirements that must be met before automatic failover handling kicks in:
Every table, by default, has a primary replica created by RethinkDB. You can always change that using the reconfigure() command. If the table's primary replica fails, then as long as more than half of the voting replicas are available, one of them will be internally elected as the new primary replica. There will be a brief period of unavailability while the election is going on in RethinkDB, but it will be very short and no data will be lost.
As soon as the failed primary replica comes back online, RethinkDB automatically syncs it with the latest documents and switches the primary replica role back to it.
By default, every replica in RethinkDB is created as a voting replica. That means those replicas take part in the failover process to elect the next primary replica. You can also change this option using the reconfigure() command.
Automatic failover requires a cluster of at least three servers, with three replicas of the table. A two-server cluster is not covered by the automatic failover process, and the system may go down during the failure of any RethinkDB instance.
In such cases, where RethinkDB cannot perform failover on its own, you need to do it manually using the reconfigure() command, passing the emergency repair mode key.
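A sketch of the emergency repair call, assuming a users table and an open connection conn; unsafe_rollback is one of the documented repair modes:

// last-resort repair when a majority of a table's voting replicas is lost
r.table('users').reconfigure({ emergencyRepair: 'unsafe_rollback' })
  .run(conn, function (err, result) {
    if (err) throw err;
    console.log(result);
  });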
