Mastering RethinkDB - Shahid Shaikh - E-Book

Description

Master the capabilities of RethinkDB and implement them to develop efficient real-time web applications. The way to better database development is here!

About This Book

  • Master the powerful ReQL queries to manipulate your JSON data
  • Learn how to develop scalable, real-time web applications using RethinkDB and Node.js and deploy them for production
  • A detailed, step-by-step guide to help you master the concepts of RethinkDB programming with ease

Who This Book Is For

This book caters to real-time application developers looking to master their skills with RethinkDB. A basic understanding of RethinkDB and Node.js is essential to get the most out of this book.

What You Will Learn

  • Master the web-based management console for data-center configuration (sharding, replication, and more), database monitoring, and testing queries
  • Run queries using the ReQL language
  • Perform geospatial queries (such as finding all the documents with locations within 5 km of a given point)
  • Deal with time series data, especially across various time zones
  • Extend the functionality of RethinkDB and integrate it with third-party libraries such as ElasticSearch to enhance search

In Detail

RethinkDB has a lot of cool things to be excited about: ReQL (its readable, highly functional syntax), cluster management, primitives for 21st century applications, and changefeeds. This book starts with a brief overview of the RethinkDB architecture and data modeling, and coverage of the advanced ReQL queries to work with JSON documents. Then, you will quickly jump to implementing these concepts in real-world scenarios, by building real-time applications on polling, data synchronization, share market, and the geospatial domain using RethinkDB and Node.js. You will also see how to tweak RethinkDB's capabilities to ensure faster data processing by exploring the sharding and replication techniques in depth.

Then, we will take you through the more advanced administration tasks as well as show you the various deployment techniques using PaaS, Docker, and Compose. By the time you have finished reading this book, you will have taken your knowledge of RethinkDB to the next level, and will be able to use the concepts in RethinkDB to develop efficient, real-time applications with ease.

Style and approach

This book is a unique blend of comprehensive theory and real-world examples to help you master RethinkDB.


Page count: 280

Year of publication: 2016




Table of Contents

Mastering RethinkDB
Credits
About the Author
About the Reviewer
www.PacktPub.com
Why subscribe?
Preface
What this book covers
What you need for this book 
Who this book is for 
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. The RethinkDB Architecture and Data Model
RethinkDB architectural components
Client drivers
RethinkDB query engine
RethinkDB clusters
Pushing changes to a RethinkDB client
Query execution in RethinkDB
Filesystem and data storage
About direct I/O
Data storage
Sharding and replication
Sharding in RethinkDB
Range-based sharding
Replication in RethinkDB
Indexing in RethinkDB
Automatic failover handling in RethinkDB
About voting replicas
The RethinkDB data model
RethinkDB data types
Binary objects
Geospatial queries in RethinkDB
Supported data types
RethinkDB model relationships
Embedded arrays
Merits of embedded arrays
Demerits of embedded arrays
Document linking in multiple tables
Merits of document linking
Demerits of document linking
Constraints and limitations in RethinkDB
Summary
2. RethinkDB Query Language
Embedding ReQL in a programming language
Performing CRUD operations using RethinkDB and Node
Creating new records
Reading the document data
Updating the document
Deleting the document
ReQL queries are chainable
ReQL queries are executed on a server
Performing conditional queries
Performing string operations
Performing MapReduce operations
Grouping the data
Counting the data
Sum
Avg
Min and Max
Distinct
Contains
Map and reduce
Calling HTTP APIs using ReQL
Handling binary objects
Performing JOINS
Accessing changefeed (real-time feed) in RethinkDB
Applications of changefeed
Performing geolocation operations
Storing a coordinate
Finding the distance between points
Performing administrative operations
Summary
3. Data Exploration Using RethinkDB
Generating mock data
Importing data in RethinkDB using HTTP
Importing data via file read
Executing data exploration use cases
Finding duplicate elements
Finding the list of countries
Finding the top 10 employees with the highest salary
Displaying employee records with a specific name and location
Finding employees living in Russia with a salary less than 50,000 dollars
Finding employees with a constant contact e-mail address
Finding employees who use a class C IP address
Summary
4. Performance Tuning in RethinkDB
Clustering
Creating and handling a RethinkDB cluster
Creating a RethinkDB cluster in the same machine
Creating a RethinkDB cluster using different machines
Creating a RethinkDB cluster in production
Securing our RethinkDB cluster
Using transport layer security
Binding the web administrative port
Executing ReQL queries in a cluster
Performing replication of tables in RethinkDB
Sharding the table to scale the database
Running a RethinkDB proxy node
Optimizing query performance
Summary
5. Administration and Troubleshooting Tasks in RethinkDB
Understanding access controls and permission in RethinkDB
RethinkDB user management
Failover handling in RethinkDB
Performing a manual and automatic backup in RethinkDB
Performing automatic backups
Restoring a RethinkDB database
Data import and export in RethinkDB
Importing data from MySQL to RethinkDB
Importing data from MongoDB to RethinkDB
Data migration to an updated version
Crash recovery in RethinkDB
Using third-party tools
ReQLPro
Chateau
Summary
6. RethinkDB Deployment
Deploying RethinkDB using PaaS services
Deploying RethinkDB on AWS
Deploying RethinkDB on Compose.io
Deploying RethinkDB on DigitalOcean
Deploying RethinkDB using Docker
The need for Docker
Installing Docker
Creating a Docker image
Deploying the Docker image
Deploying RethinkDB on a standalone server
Summary
7. Extending RethinkDB
Integrating RethinkDB with ElasticSearch
Introducing ElasticSearch
Installing ElasticSearch
Performing operations in ElasticSearch
The problem statement
Integration use cases
Search engine
Static website search
Integrating RethinkDB with RabbitMQ
Installing RabbitMQ
Developing producer code
Connecting to the RethinkDB instance
Creating a database and table if they do not exist
Connecting to RabbitMQ
Creating a channel and queue
Attaching a changefeed and integrating the RabbitMQ queue
Developing the consumer code
Connecting to RabbitMQ
Creating a channel and binding the queue
Understanding the RethinkDB protocol
Third-party libraries and tools
Summary
8. Full Stack Development with RethinkDB
Project structure
Data modeling for our application
Creating a Node.js server and routes
Integrating RethinkDB with Node.js
Integrating AngularJS in the frontend
Socket.io integration for message broadcasting
Summary
9. Polyglot Persistence Using RethinkDB
Introducing Polyglot Persistence
Using the RethinkDB changefeed as a Polyglot agent
Developing a proof-of-concept application with MongoDB and MySQL
Setting up the project
Developing a server using Express
Case - reading all users' data
Case - creating a new user
Case - updating user data
Case - deleting the user
Developing the Polyglot agent
Developing event consumers
Observer pattern
MySQL consumer
MongoDB consumer
Running the app
Further improvements
Integrating the message broker
Developing a distributed transaction
Summary
10. Using RethinkDB and Horizon
Workings of Horizon
Installing and configuring Horizon
Class Horizon
Connect()
Collection class
Method subscribe()
Watch method
CRUD methods
Developing a simple web application using Horizon
Setting up the project
Developing the JavaScript code
Developing the frontend
Horizon user management
Summary

Mastering RethinkDB

Mastering RethinkDB

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2016

Production reference: 1131216

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham 

B3 2PB, UK.

ISBN 978-1-78646-107-0

www.packtpub.com

Credits

Author

Shahid Shaikh

Copy Editor

Vikrant Phadkay

Reviewer

Rafael Ferreira dos Santos

Project Coordinator

Shweta H Birwatkar 

Commissioning Editor

Amey Varangaonkar

Proofreader

Safis Editing

Acquisition Editor

Vinay Argekar

Indexer

Mariammal Chettiyar

Content Development Editor

Amrita Noronha

Graphics

Disha Haria

Technical Editor

Akash Patel

Production Coordinator

Arvindkumar Gupta

About the Author

Shahid Shaikh is an engineer, blogger, and author living in Mumbai, India. He is a full-time professional and a part-time blogger. He loves solving programming problems and is an expert in software backend design and development.

Shahid has been blogging and teaching programming in a practical way for more than two years on his blog. His blog is quite famous among developers, and people all around the world take advantage of his expertise in various programming problems related to backend development.

Shahid has also authored a book on Sails.js, an MVC framework for Node.js, published by Packt.

I would like to thank my parents, my family, and my friends for being kind and supportive during the book development period. I would like to thank my friends, who changed their plans according to my schedule for various occasions. I would also like to thank the RethinkDB team for helping me out with various architectural questions. You guys are awesome!

About the Reviewer

Rafael Ferreira dos Santos

Rafael is Ted's father, Geysla's husband, a developer and entrepreneur, and a BJJ addict. He has worked in software development for 10 years and loves to code, especially in ASP.NET and Node.js.

Thanks to Glenn Morton and the QuizJam team for such an amazing workplace. I would like to thank God and my wife for all the support and love that they give to me. I would not be in such an amazing moment without you.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Preface

RethinkDB is a database built for the real-time web. It offers a variety of features over the NoSQL databases already on the market, such as a very powerful query language, changefeeds, and easy scaling.

In this book, we cover RethinkDB in depth, including mastery-level topics such as scaling, integration, and deployment. We also cover programming with RethinkDB, along with step-by-step screenshots to help you understand the concepts easily.

What this book covers

Chapter 1, The RethinkDB Architecture and Data Model, covers the architecture of RethinkDB and data modeling, along with revisiting the concepts of RethinkDB.

Chapter 2, RethinkDB Query Language, covers RethinkDB query language, or ReQL, which is the core and essential learning curve of RethinkDB. ReQL provides various SQL-like features such as join, indexing, and foreign keys, along with document-based storage with NoSQL.

Chapter 3, Data Exploration Using RethinkDB, covers data extraction and loading along with example use cases using ReQL.

Chapter 4, Performance Tuning in RethinkDB, covers various methods and tricks to improve the performance of RethinkDB.

Chapter 5, Administration and Troubleshooting Tasks in RethinkDB, covers failover mechanisms along with example use cases.

Chapter 6, RethinkDB Deployment, covers various options available to deploy RethinkDB on production.

Chapter 7, Extending RethinkDB, covers the integration of RethinkDB with other products, such as ElasticSearch.

Chapter 8, Full Stack Development with RethinkDB, covers the implementation of a full-stack JavaScript application using RethinkDB.

Chapter 9, Polyglot Persistence Using RethinkDB, covers complex synchronization application development using RethinkDB.

Chapter 10, Using RethinkDB and Horizon, covers the RethinkDB-powered framework called Horizon with a demo application.

What you need for this book 

A computer with at least 2 GB of RAM that can support Node.js and Java.

Who this book is for 

This book caters to real-time application developers looking to master their skills with RethinkDB. A basic understanding of RethinkDB and Node.js is essential to get the most out of this book. Backend developers, full-stack developers, and database architects will find this book useful.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to [email protected], and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on https://www.packtpub.com/books/info/packt/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer over the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-RethinkDB. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from: https://www.packtpub.com/sites/default/files/downloads/MasteringRethinkDB_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at [email protected] if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. The RethinkDB Architecture and Data Model

RethinkDB is a real-time, open source distributed database. It stores JSON documents (basically unstructured data) in an operable format with distribution (sharding and replication). It also provides the real-time push of JSON data to the server, which redefines the entire real-time web application development.

In this chapter, we will look at its architecture and key elements in order to understand how RethinkDB supports these awesome features with high performance. We will also look at data modeling, along with SQL-like operations in NoSQL, such as joins.

Here is the list of topics we are going to cover in this chapter:

  • RethinkDB architectural components
  • Sharding and replication in RethinkDB
  • RethinkDB failover handling
  • The RethinkDB data model
  • Data modeling in RethinkDB

RethinkDB architectural components

RethinkDB's architecture consists of various components such as cluster, query execution engine, filesystem storage, push changes (real-time feed), and of course RethinkDB client drivers.

Refer to the following diagram to understand the block-level components of RethinkDB:

Client drivers

RethinkDB provides official client drivers for Node.js, Python, Ruby, and Java, plus various unofficial community drivers, which are listed on the official website (https://rethinkdb.com/docs/install-drivers/). At the time of writing this book, only these languages were supported. In this book, we will refer to code examples in Node.js.

RethinkDB query engine

The RethinkDB query engine, as the name implies, executes queries and returns the response to the client. It does so by performing a lot of internal operations such as sorting, indexing, finding the right cluster node, or merging data from various nodes. All of these operations are performed by the RethinkDB query engine. We will look at this in detail in the upcoming section.

RethinkDB clusters

RethinkDB is a distributed database designed for high-performance, real-time operations. RethinkDB manages distribution by clustering (sharding and replication). Each node in a RethinkDB cluster is just another instance of the main RethinkDB process that stores data. We will look at sharding and replication in detail in the upcoming section.

Pushing changes to a RethinkDB client

This is a revolutionary concept introduced by RethinkDB. Consider this scenario: you are developing an application for the stock market, where there are many changes in a given amount of time. Obviously, we store every entry in the database and make sure that other connected nodes or clients know about these changes.

The conventional way to do this is to keep looking (polling) at the particular collection or table in order to find changes. This increases the latency and turnaround time of packets, and an HTTP call over a wide area network (WAN) is really costly.

Then came sockets. With sockets, we still perform the polling operation, but at the socket layer rather than the HTTP layer. The size of network requests may be reduced, but we are still polling.

Note

Socket.io is one of the popular projects available for real-time web development.

RethinkDB proposes the reverse approach: what if the database itself tells you:

Hey, some stock values have changed, and here are the new and old values.

This is exactly what RethinkDB's push changes (changefeed in technical terms) does. Once you subscribe to a particular table to look for its changes, RethinkDB just keeps pushing the old and new values of the changes to the connected client. By "connected client", I mean a RethinkDB client and not a web application client. The difference between polling and push changes is shown here:

So you will get the changes in the data in one of the RethinkDB clients, say Node.js. And then you can simply broadcast it over the network, using socket probably.

But why are we using a socket when RethinkDB can provide us the changes in the data? Because RethinkDB provides the changes to the middle layer, not the client layer; having web clients talk directly to the database can be risky, and hence it has not been allowed yet.

But the RethinkDB team is working on another project called Horizon, which solves the issue mentioned previously by allowing clients to communicate with the database through a secure middle tier. We will look at Horizon in detail in Chapter 10, Using RethinkDB and Horizon.

Query execution in RethinkDB

The RethinkDB query engine is a critical and important part of RethinkDB. It performs various computations and internal logic operations to maintain high performance along with good throughput.

Refer to the following diagram to understand query execution:

Upon the arrival of a query, RethinkDB divides it into various stacks. Each stack contains various methods and internal logic to perform its operation, but three core methods play the key roles:

  • The first method decides how to execute the query, or a subset of the query, on each server in a particular cluster
  • The second method decides how to merge the data coming from various nodes into a meaningful result
  • The third method, which is very important, deals with transmitting that data as a stream rather than as a whole

To speed up the process, these stacks are transported to every related server and each server begins to evaluate it in parallel to other servers. This process runs recursively in order to merge the data to stream to the client.

The stack in the node grabs the data from the stack after it and performs its own method of execution and transformation. The data from each server is then combined into a single result set and streamed to the client.

In order to speed up the process and maintain high performance, every query is completely parallelized across the relevant nodes of the cluster. Each node performs its part of the query execution, and the data is then merged together into a single result set.
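The scatter-gather idea above can be sketched in plain JavaScript. The shard contents and the query predicate are invented for illustration; RethinkDB's real engine evaluates the fragments in parallel on separate servers and streams results, while this toy version just fans a filter out over in-memory shards and merges the partial results:

```javascript
// Two toy shards holding invented employee documents.
const shards = [
  [{ id: 1, salary: 40 }, { id: 2, salary: 90 }],
  [{ id: 3, salary: 70 }, { id: 4, salary: 20 }],
];

// Stands in for a per-server stack evaluating its query fragment.
function runOnShard(shard, predicate) {
  return shard.filter(predicate);
}

// Fan the fragment out to every shard (parallel on real servers),
// then merge the partial results into a single ordered result set.
function scatterGather(predicate) {
  const partials = shards.map(s => runOnShard(s, predicate));
  return partials.flat().sort((a, b) => a.id - b.id);
}

const result = scatterGather(doc => doc.salary >= 40);
```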

The RethinkDB query engine maintains efficiency in the process too; for example, if a client only requests a result that does not live on a sharded or replicated server, RethinkDB will not execute the parallel operation and will just return the result set. This process is also referred to as lazy execution.

To maintain concurrency and high performance of query execution, RethinkDB uses block-level Multiversion Concurrency Control (MVCC). If one user is reading some data while other users are writing to it, there is a high chance of inconsistent data, and to avoid that we use a concurrency control algorithm. One of the simplest and most commonly used methods in SQL databases is to lock the transaction, that is, make the user wait while a write operation is being performed on the data. This slows down the system, and since big data promises fast read times, it simply won't work.

Multiversion concurrency control takes a different approach. Here, each user sees a snapshot of the data (that is, a child copy of the master data); if changes are in progress on the master copy, the child copies or snapshots are not updated until the change has been committed:

RethinkDB uses block-level MVCC, and this is how it works: whenever an update or write operation is performed during a read operation, RethinkDB takes a snapshot of each shard and maintains a different version of the block to make sure every read and write operation can work in parallel. RethinkDB does use exclusive block-level locks when multiple updates happen on the same document, but these locks are very short in duration because the blocks are all cached; hence the system always appears to be lock-free.
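A toy model of the snapshot behavior, assuming an invented MvccStore API (the version numbers and method names are not RethinkDB's): a reader pins a version at snapshot time, a concurrent writer appends a new version, and the reader's view is unaffected until it takes a fresh snapshot:

```javascript
// Toy MVCC: writers append immutable versions; readers pin one.
class MvccStore {
  constructor() {
    this.versions = [{ version: 0, data: {} }];
  }
  head() {
    return this.versions[this.versions.length - 1];
  }
  snapshot() {
    return this.head(); // readers pin the current version
  }
  commit(mutator) {
    // Writers never touch old versions; they append a new one.
    const next = {
      version: this.head().version + 1,
      data: mutator({ ...this.head().data }),
    };
    this.versions.push(next);
    return next;
  }
}

const store = new MvccStore();
store.commit(d => ({ ...d, price: 100 }));

const readerView = store.snapshot();       // reader takes a snapshot
store.commit(d => ({ ...d, price: 105 })); // concurrent write commits

// readerView still reflects the data as of its snapshot,
// while store.head() reflects the committed write.
```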

RethinkDB provides atomicity per JSON document. This is different from other NoSQL systems; most NoSQL systems provide atomicity only for each small operation done on the document before the actual commit. RethinkDB does the opposite: it provides atomicity for a document no matter what combination of operations is being performed.

For example, a user may want to read some data (say, the first name from one document), change it to uppercase, append the last name coming from another JSON document, and then update the JSON document. All of these operations will be performed atomically in one update operation.

RethinkDB does limit this atomicity for a few operations. For example, values computed by JavaScript code cannot be written atomically, the result of a subquery is not atomic, and replace operations cannot be performed atomically.
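The first-name/last-name example above can be expressed as a single transformation over the document, which is the shape RethinkDB applies atomically in one update. The field names and the helper function are invented for illustration:

```javascript
// One transformation combining several steps: read the first name,
// uppercase it, append the last name from another document, and
// produce the updated document. RethinkDB would apply this whole
// combination atomically in a single update.
function buildFullName(personDoc, otherDoc) {
  return {
    ...personDoc,
    fullName: personDoc.firstName.toUpperCase() + ' ' + otherDoc.lastName,
  };
}

const a = { id: 1, firstName: 'shahid' };
const b = { id: 2, lastName: 'Shaikh' };
const updated = buildFullName(a, b); // original documents untouched
```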

Filesystem and data storage

RethinkDB supports widely used filesystems such as NTFS, ext, and so on. RethinkDB also supports direct I/O for efficiency and performance, but it is not enabled by default.

About direct I/O

A file is stored on disk, and when it is requested by a program, the operating system first loads it into main memory for faster reads. The operating system could read directly from disk each time, but that would slow down the response time because disk I/O is a heavy-cost operation; hence, the operating system first puts the file into main memory. This is called the buffer cache.

Databases generally manage data caching at the application level and do not need the operating system to cache it for them. In such cases, buffering in two places (main memory and the application cache) becomes an overhead, since data is first moved to main memory and then to the application cache.

This double buffering of data results in more CPU consumption and load on the memory too.

Direct I/O is a filesystem access mode for applications that want to avoid buffering in main memory and read files directly from disk. When direct I/O is used, data is transferred directly to the application buffer instead of the memory buffer, as shown in the following diagram:

Direct I/O can be used in two ways:

  • Mounting the filesystem using direct I/O (options vary from OS to OS)
  • Opening the file using the O_DIRECT option specified in the open() system call

Direct I/O provides great efficiency and performance by reducing CPU consumption and the overhead of managing two buffers.

Data storage

RethinkDB uses a custom-built storage engine inspired by the B-tree filesystem (Btrfs), originally developed at Oracle. There is not much information available on the RethinkDB custom storage engine right now, but it makes the following promises:

  • Fully concurrent garbage compactor
  • Low CPU overhead
  • Efficient multi-core operation
  • SSD optimization
  • Power failure recovery
  • Data consistency in case of failure
  • MVCC support

Due to these features, RethinkDB can handle large amounts of data in very little memory storage.

Sharding and replication

Sharding is partitioning: the database is split across multiple smaller databases to improve performance and read times. In replication, we basically copy the database across multiple databases to provide quicker lookups and lower response times. Content delivery networks are the best example of this.

RethinkDB, just like other NoSQL databases, also uses sharding and replication to provide fast response and greater availability. Let's look at it in detail bit by bit.

Sharding in RethinkDB

RethinkDB makes use of a range sharding algorithm to provide the sharding feature. It performs sharding on the table's primary key to partition the data. RethinkDB uses the table's primary key to perform all sharding operations and it cannot use any other keys to do so. In RethinkDB, the shard key and primary key are the same.

Upon a request to create a new shard for a particular table, RethinkDB examines the table and tries to find the optimal breakpoint to split the data evenly across the shards.

For example, say you have a table with 1,000 rows, the primary key ranging from 0 to 999, and you've asked RethinkDB to create two shards for you.

RethinkDB will likely find primary key 500 as the breaking point. It will store every entry ranging from 0 to 499 in shard 1, while data with primary keys 500 to 999 will be stored in shard 2. The shards will be distributed across clusters automatically.
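The breakpoint logic from this example can be sketched as a small helper. This is illustrative only, not RethinkDB's internal code; the function name and signature are invented:

```javascript
// Range sharding: split primary keys [minKey, maxKey] into
// shardCount contiguous ranges and return a function that maps
// a key to its shard index. For 0-999 split into 2 shards, the
// breakpoint falls at 500, as in the example above.
function makeRangeShards(minKey, maxKey, shardCount) {
  const span = Math.ceil((maxKey - minKey + 1) / shardCount);
  const boundaries = [];
  for (let i = 1; i < shardCount; i++) {
    boundaries.push(minKey + i * span); // breakpoints between shards
  }
  // A key's shard index is the number of breakpoints at or below it.
  return key => boundaries.filter(b => key >= b).length;
}

const shardFor = makeRangeShards(0, 999, 2);
```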

You can specify the sharding and replication settings when you create the table, or alter them later. You cannot specify the split point manually; that is RethinkDB's job to do internally. You cannot have fewer servers than shards.

You can always visit the RethinkDB administrative screen to increase the number of shards or replicas:

We will look at this in more detail, with practical use cases, in Chapter 5, Administration and Troubleshooting Tasks in RethinkDB, which is totally focused on RethinkDB administration.

Let's see in more detail how range-based sharding works. Sharding can be basically done in two ways, using vertical partitioning or horizontal partitioning:

  • In vertical partitioning, we store data in different tables, with different documents in different databases
  • In horizontal partitioning, we store documents of the same table in separate databases

The range shard algorithm is a dynamic algorithm that determines the breaking point of the table and stores data in different shards based on that calculation.

Range-based sharding

In the range sharding algorithm, a service called the locator determines which shard holds the entries of a particular table. Because the locator finds data using range queries, lookups are fast. Without a range, or some other indicator of which data belongs to which shard on which server, you would have to scan every shard to find a particular document, which is obviously a very slow process.
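The locator idea can be sketched as a lookup over sorted shard boundaries. This is a simplified illustration in plain JavaScript; the function and variable names are ours, not RethinkDB's:

```javascript
// Each shard covers a key range starting at a lower bound.
// Given the sorted lower bounds, find the shard responsible for
// a key without scanning every shard.
function shardFor(key, lowerBounds) {
  // lowerBounds is sorted ascending, e.g. [0, 500] for two shards
  let shard = 0;
  for (let i = 0; i < lowerBounds.length; i++) {
    if (key >= lowerBounds[i]) shard = i;
  }
  return shard;
}

console.log(shardFor(42, [0, 500]));  // 0 -- falls in the 0..499 range
console.log(shardFor(750, [0, 500])); // 1 -- falls in the 500..999 range
```

A production locator would use binary search over the boundaries, but the principle is the same: the key range alone identifies the shard.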

RethinkDB maintains a relevant piece of metadata, referred to as the directory, which lists which nodes (RethinkDB instances) are responsible for each shard. Each node is responsible for keeping its copy of the directory up to date.

RethinkDB allows users to choose the location of shards. You can again use the web-based administrative screen to do this. However, the RethinkDB servers themselves must be set up manually using the command line; that part cannot be done via the web-based interface.
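Shard locations can also be set programmatically by writing to RethinkDB's system tables. A sketch, assuming a running cluster, an open connection `conn`, and illustrative database, table, and server names:

```js
// Pin the two shards of 'players' to specific servers by updating
// the table_config system table
r.db('rethinkdb').table('table_config')
  .filter({db: 'test', name: 'players'})
  .update({shards: [
    {primary_replica: 'server_a', replicas: ['server_a', 'server_b']},
    {primary_replica: 'server_b', replicas: ['server_b', 'server_c']}
  ]}).run(conn);
```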

Replication in RethinkDB

Replication creates copies of data in order to improve performance, availability, and failover handling. Each shard in RethinkDB can have a configurable number of replicas, and any RethinkDB instance (node) in the cluster can act as a replication node for any shard. You can always change the replication settings from the RethinkDB web console.
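Replication settings can also be changed from code with the reconfigure() command. A sketch, assuming a running cluster, an open connection `conn`, and an illustrative table name:

```js
// Raise the replica count of an existing table; RethinkDB decides
// which servers hold the new replicas
r.table('players').reconfigure({shards: 2, replicas: 3}).run(conn);
```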

Currently, due to technical limitations, RethinkDB does not allow more than one replica of a shard on a single RethinkDB instance. Every RethinkDB instance stores table metadata; when the metadata changes, RethinkDB propagates those changes to the other RethinkDB instances in the cluster so that every shard and replica sees the updated metadata.

Indexing in RethinkDB

By default, RethinkDB indexes the documents in a table by their primary key. If the user does not specify a primary key when creating the table, RethinkDB uses the default field named id.
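A custom field can be nominated as the primary key at creation time. A sketch, assuming a running cluster, an open connection `conn`, and an illustrative table name:

```js
// Use 'email' as the primary key instead of the default 'id'
r.db('test').tableCreate('users', {primaryKey: 'email'}).run(conn);
```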

The default-generated primary key contains information about the shard's location in order to directly fetch the information from the appropriate shard. The primary key of each shard is indexed using the B-Tree data structure.

An example of a RethinkDB-generated primary key is as follows:

d0041fcf-9a3a-460d-8450-4380b00ffac0

RethinkDB also supports secondary keys and compound keys (combinations of keys). It even provides multi indexes, which allow an array of values to act as keys, and those values can themselves be compound keys.
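The index variants above look like this in the JavaScript driver. A sketch, assuming a running cluster, an open connection `conn`, and illustrative table, field, and index names:

```js
// Simple secondary index on one field
r.table('users').indexCreate('lastName').run(conn);

// Compound index built from two fields
r.table('users').indexCreate('fullName',
  [r.row('lastName'), r.row('firstName')]).run(conn);

// Multi index: every element of the 'tags' array becomes a key
r.table('users').indexCreate('tags', {multi: true}).run(conn);

// Query through a secondary index
r.table('users').getAll('Shaikh', {index: 'lastName'}).run(conn);
```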

System-generated primary keys are very efficient and fast, because the query execution engine can immediately determine which shard holds the data, so no extra routing is needed. A custom primary key, say a string or a number, may force RethinkDB to search for the data across several nodes in the cluster, which slows down performance. You can always use secondary keys of your choice to perform further indexing and searching based on your application's needs.

Automatic failover handling in RethinkDB

RethinkDB provides automatic failover handling in multi-server configurations where multiple replicas of a table are present. If a node fails for any reason, RethinkDB diverts requests to another node to maintain availability. However, some requirements must be met before automatic failover can take place:

The cluster must have three or more nodes (RethinkDB servers)

The table must be configured with three or more replicas that have the voting option set

During failover, a majority of replicas for the table (more than half of all replicas) must be online
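The majority requirement above can be sketched as a simple check. This is our own illustrative helper, not a RethinkDB API:

```javascript
// A table can fail over automatically only while strictly more
// than half of its voting replicas remain reachable.
function canFailOver(votingReplicas, onlineVotingReplicas) {
  return onlineVotingReplicas > votingReplicas / 2;
}

console.log(canFailOver(3, 2)); // true  -- 2 of 3 voting replicas online
console.log(canFailOver(4, 2)); // false -- exactly half is not a majority
```

This is also why at least three voting replicas are needed: with only two, the loss of either one leaves exactly half online, which is not a majority.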

Every table, by default, has a primary replica created by RethinkDB. You can always change that using the reconfigure() command. If the table's primary replica fails, then as long as more than half of the voting replicas are available, one of them is internally elected as the new primary replica. The table is briefly unavailable while the election takes place, but the interruption is minor and no data is lost.

As soon as the failed primary replica comes back online, RethinkDB automatically syncs it with the latest documents and hands the primary role back to it.

About voting replicas

By default, every replica in RethinkDB is created as a voting replica, meaning it takes part in the election of the next primary replica during failover. You can change this using the reconfigure() command.
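Nonvoting replicas are configured through reconfigure() using server tags. A sketch, assuming a running cluster, an open connection `conn`, and illustrative table and tag names:

```js
// Replicas tagged 'backup' still receive data but do not take part
// in primary elections
r.table('players').reconfigure({
  shards: 1,
  replicas: {main: 3, backup: 2},
  primaryReplicaTag: 'main',
  nonvotingReplicaTags: ['backup']
}).run(conn);
```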

Automatic failover requires a cluster of at least three servers, with three replicas of the table. A two-server cluster is not covered by the automatic failover process, and the system may go down if either RethinkDB instance fails.

In such cases, where RethinkDB cannot perform automatic failover, you need to do it manually using the reconfigure() command, passing the emergency repair option.
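An emergency repair looks like this in the JavaScript driver. A sketch, assuming a running cluster, an open connection `conn`, and an illustrative table name; this is a last resort when a majority of replicas is permanently lost:

```js
// Rebuild the table configuration from whatever replicas survive
r.table('players').reconfigure({emergencyRepair: 'unsafe_rollback'}).run(conn);

// If some shards have no replicas left at all, allow them to be erased
r.table('players').reconfigure({emergencyRepair: 'unsafe_rollback_or_erase'}).run(conn);
```

As the option names suggest, both modes can discard writes, so they should only be used when normal reconfiguration is impossible.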