A guide to using a modern, cloud-native, and distributed SQL database for your data-intensive apps
Kishen Das Kondabagilu Rajanna
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Sunith Shetty
Senior Editors: Roshan Kumar, Nazia Shaikh
Content Development Editor: Tazeen Shaikh
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Jyoti Chauhan
Marketing Coordinator: Priyanka Mhatre
First published: March 2022
Production reference: 1090322
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80056-065-9
www.packt.com
During my childhood, I spent more time watching insects than interacting with human beings. So, I dedicate this book to cockroaches and all the other insects that have fascinated me throughout my life.
Kishen Das Kondabagilu Rajanna is currently working as a distributed query engineer at Adobe. His previous experience includes leading the data warehouse team at Cloudera, managing the SaaS platform at Rubrik, writing core services for the Oracle public cloud, and managing data infrastructure at Apple.
Nadir Doctor is a database and data warehousing architect and DBA who has worked in various industries with multiple OLTP and OLAP technologies, as well as major data platforms, including CockroachDB, Snowflake, Databricks, DataStax, Cassandra, ScyllaDB, Redis, MS SQL Server, Oracle, Db2 Cloud, AWS, Azure, and GCP. A major focus of his is health check scripting for security, high availability, performance optimization, cost reduction, and operational excellence. He has presented at several technical conferences, is active in user groups, and can be reached on LinkedIn.
Thank you to Kishen and all the staff at Packt. I'm grateful for the immense support of my loving wife, children, and family during the technical review of this book. I hope that you all find the content enjoyable, inspiring, and useful.
Scott Ling is a technology specialist with over 30 years' experience working at the forefront of technology in various roles, in companies ranging from start-ups to listed companies worth more than 10 billion, with a focus on distributed technologies, software as a service, and product/project management. He is currently working on a free product designed to make it easy for anyone to create and manage a SaaS product, service, or business, as his way of giving back to the community. Scott is also an established technical author with a bestselling book on .NET published back in 2001, and he has worked with authors and publishers on various books and technologies over the years.
This book will introduce you to the inner workings of CockroachDB and help you understand how it provides faster access to distributed data through a SQL interface. You'll learn how you can use the database to provide solutions that require data to be highly available.
Starting with CockroachDB's installation, setup, and configuration, this book will familiarize you with the database architecture and database design principles. You'll then discover several options that CockroachDB provides to store multiple copies of your data to ensure fast data access. The book covers the internals of CockroachDB, how to deploy and manage it on the cloud, performance tuning to get the best out of CockroachDB, and how to scale data across continents and serve it locally. In addition to this, you'll get to grips with fault tolerance and auto-rebalancing, how indexes work, and the CockroachDB Admin UI. The book will guide you in building scalable cloud services on top of CockroachDB, covering administrative and security aspects and tips for troubleshooting, performance enhancements, and a brief guideline on migrating from traditional databases.
By the end of this book, you'll have enough knowledge to manage your data on CockroachDB and interact with it from your application layer.
Software engineers, database developers, database administrators, and anyone who wishes to learn about the features of CockroachDB and how to build fast, highly available database solutions for business-critical applications will find this book useful. Although no prior exposure to CockroachDB is required, familiarity with database concepts will help you to get the most out of this book.
Chapter 1, CockroachDB – A Brief Introduction, talks about databases and how they have evolved over time. You will also get to know about the high-level architecture of CockroachDB.
Chapter 2, How Does CockroachDB Work Internally?, explores various layers of CockroachDB and some of its inner workings.
Chapter 3, Atomicity, Consistency, Isolation, and Durability (ACID), introduces you to ACID properties and how they are implemented in CockroachDB.
Chapter 4, Geo-Partitioning, explains the concept of geo-partitioning, why we need it, and the various options for geographically distributing data using CockroachDB.
Chapter 5, Fault Tolerance and Auto-Rebalancing, explores the concept of fault tolerance and auto-recovery strategies. It also covers a few experiments to understand these concepts better.
Chapter 6, How Indexes Work in CockroachDB, is all about database indexes, how they are useful in improving query performance, the different types of indexes that are supported in CockroachDB, and some of the best practices that you can follow while using indexes.
Chapter 7, Schema Creation and Management, introduces you to SQL syntax for DDL, DML, and DQL with examples, the different data types available in CockroachDB, sequences, and how to manage schema changes.
Chapter 8, Exploring the Admin User Interface, explores the admin user interface that comes by default when you deploy a CockroachDB cluster. We examine all the metrics and other information that are available in the user interface and how they are useful in troubleshooting issues.
Chapter 9, An Overview of Security Aspects, touches upon the key security aspects that you have to pay attention to when using CockroachDB. We also learn about authentication, authorization, how to protect data with encryption while at rest and in flight, and how to achieve desired data protection by defining the correct strategies for RTO and RPO. Network security and some security best practices are also covered.
Chapter 10, Troubleshooting Issues, familiarizes you with troubleshooting by collecting logs, looking at key metrics, understanding and tracking slow queries, and integrating logs with external sinks. This chapter also covers advanced debugging options at the end.
Chapter 11, Performance Benchmarking and Migration, discusses performance, what the key indicators of performance are, and how to measure them. You will also learn about migrating from other traditional databases to CockroachDB.
Appendix: Bibliography and Additional Resources, provides additional resource material that you can go through to become more familiar with CockroachDB.
You should have access to the internet so that you can download CockroachDB and try it on your laptop.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Getting-Started-with-CockroachDB. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781800560659_ColorImages.pdf.
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in the text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Here, in the SELECT query, you should use AS OF SYSTEM TIME follower_read_timestamp()."
A block of code is set as follows:
SHOW TABLES
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
DROP DATABASE <DATABASE_NAME>
DROP ROLE <ROLE_NAME>
DROP TABLE <TABLE_NAME>
Any command-line input or output is written as follows:
$ cockroach cert create-client <user_name> \
--certs-dir=<certs_directory> \
--ca-key=<CA_key_directory>
Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "The Sessions dashboard gives information about all the active client sessions within the CockroachDB cluster."
Tips or Important Notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you've read Getting Started with CockroachDB, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
In this section, we will provide a brief introduction to CockroachDB and the motivation behind creating this new database, as well as go into its overall architecture and design concepts.
This section comprises the following chapters:
Chapter 1, CockroachDB – A Brief Introduction

Chapter 2, How Does CockroachDB Work Internally?

In this chapter, we will go over the history of databases, where we will learn about the evolution of SQL, NoSQL, and NewSQL databases, various relational models, different categories for classifying databases, and timelines. Later, we will discuss the CAP theorem. Finally, we will briefly discuss the motivation for creating a new database and learn about the basic architecture of CockroachDB.
The following topics will be covered in this chapter:
The history and evolution of databases

Database concepts

CAP theorem

CockroachDB

A database is a collection of data that can be organized, managed, modified, and retrieved using a computer. The system that helps with managing data in a database is called a database management system (DBMS).
In the 1950s and 1960s, several advancements were made in terms of processors, storage, memory, and networks. We also had our first programming languages, COBOL and FORTRAN. The development of hard disk drives for data storage further spurred the development of databases. Around the same time, the first notion of a modern-day computer with a mouse and graphical user interface came into existence, making it easy for the general public to consume it. In this section, we will discuss how various types of databases evolved.
The first database was designed by Charles William Bachman III, an American computer scientist. In 1963, he developed the Integrated Data Store (IDS), which gave rise to the concept of the navigational database. In navigational databases, we can find records by chasing references from other objects. For example, let's say that in a school database, you want to find all the students from a specific grade in a specific school. In a navigational database, first, you have to go to the group of students that belong to a particular school and then to the group that belongs to a particular grade. So, records are accessed by hierarchical navigation. Based on IDS, Bachman later contributed to the CODASYL database model in 1969. CODASYL stands for Conference/Committee on Data Systems Languages, a consortium formed to guide the development of programming languages. Around the same time, IBM developed the Information Management System (IMS), which was based on the hierarchical database model. A hierarchical database model is a data model in which the data is organized in a tree-like structure. In 1970, Edgar F. Codd, an IBM employee, published the relational model of data. Building on Codd's work, Donald D. Chamberlin and Raymond F. Boyce developed Structured Query Language (SQL) in the early 1970s. They initially called it Structured English Query Language (SEQUEL), and it was used in System R, an experimental relational database developed by a group at the IBM San Jose Research Laboratory. In 1976, QUEL, a relational database query language designed by Michael Stonebraker, was developed as part of the Interactive Graphics Retrieval System (INGRES) database management system at the University of California, Berkeley.
Based on QUEL and SQL, several databases were implemented. Some of the most prominent ones include Post Ingres (Postgres), Sybase, Microsoft SQL, IBM DB2, Oracle, MariaDB, and MySQL.
In the 1980s, object-oriented database management systems (OODBMSes) grew in popularity. In an OODBMS, information is represented as objects, rather than as rows in the tables of a relational database. Some of the important ones include GemStone/S, Objectivity/DB, InterSystems Caché, Perst, ZODB, Wakanda, ObjectDB, ODABA, and Realm.
The concept of non-SQL or non-relational databases has existed since the 1960s, but the term NoSQL has become much more popular in the last decade. NoSQL databases focus on performance and scaling and mostly rely on a non-relational data model, such as a document, key-value, wide-column, or graph model, to organize the data. Some of the most popular ones in this category include Cassandra, MongoDB, Couchbase, Dynamo, FoundationDB, Neo4j, and HBase.
With the introduction of on-demand compute, storage, and network resources and the pay-as-you-go model, collectively known as cloud computing, the amount of data that we collect, process, manage, and analyze has been growing exponentially. Although it was relatively easy for some NoSQL databases to adapt to the cloud, it is much harder for traditional SQL databases to do so. Many of them are better suited to vertical scaling and did not consider geographically distributed data, the shared-nothing architecture, or enormous scale as part of their initial design. This created a void: we needed SQL databases that are cloud-native, scale well with data growth, and are easy to manage. Many companies developed in-house solutions on top of existing SQL databases:
Facebook developed TAO, a NoSQL graph API built on top of sharded MySQL.

YouTube developed Vitess to easily scale and manage MySQL clusters.

Dropbox developed Edgestore, a metadata store that powers its services and products, which again was built on top of MySQL.

Greenplum developed a massively parallel data platform of the same name for analytics, machine learning, and AI on top of Postgres.

However, it was still relatively hard and painful to manage the data, as the underlying databases were not built to scale.
In 2012, Google published a seminal paper on Google Spanner, a globally distributed database service and storage solution. Spanner essentially combined the important features of SQL databases, such as ACID transactions, strongly consistent reads, and the SQL interface, with features that were previously only available in NoSQL databases, such as scaling across geographical locations, multi-site replication, and failover. It created a new category of databases called NewSQL, which is meant to indicate a combination of SQL features at NoSQL scale. YugabyteDB and CockroachDB were developed later, both of them inspired by Google Spanner.
In this section, we will learn about some of the core database concepts, including cardinality, database models, and various processing models.
Before we discuss database models, it is important to know about cardinality. Cardinality refers to the relationship between two entities or tables. The most common kinds are one-to-one, one-to-many, and many-to-many.
In the case of a one-to-one relationship, a row or entry in one entity or table can be related to only one row in another entity or table. For example, in a Department of Motor Vehicles database, let's say there are two tables called License Info and Driver Info, as shown in the following diagram:
Figure 1.1 – An example of a one-to-one relationship
Here, Driver ID can only be assigned to one driver as it has to uniquely identify a driver. Also, a driver can only be assigned one Driver ID. So, here, any row in the License Info table will be associated with a specific row in the Driver Info table.
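One way to sketch this one-to-one relationship in SQL is with a UNIQUE foreign key; the following is an illustrative example (the table and column names are assumptions, not taken from the book's sample code):

```sql
-- Each license row points at exactly one driver, and the
-- UNIQUE constraint stops a driver from having two licenses.
CREATE TABLE driver_info (
    driver_id INT PRIMARY KEY,
    name      STRING
);

CREATE TABLE license_info (
    license_id INT PRIMARY KEY,
    driver_id  INT UNIQUE NOT NULL REFERENCES driver_info (driver_id)
);
```

With this schema, inserting a second license row for the same driver_id violates the UNIQUE constraint, which is what enforces the one-to-one shape.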
In a one-to-many relationship, a single row from one entity or table can be associated with multiple rows in another entity or table.
For example, let's consider the Driver Info and City Info tables shown in the following diagram:
Figure 1.2 – An example of a one-to-many relationship
Here, for every row in City Info, there will be multiple rows in Driver Info, as there can be many drivers that live in a particular city.
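In SQL, a one-to-many relationship is usually modeled with a plain (non-unique) foreign key on the "many" side; here is a hedged sketch with illustrative names, not the book's sample code:

```sql
-- Many driver rows may share the same city_id,
-- but each driver row references exactly one city.
CREATE TABLE city_info (
    city_id   INT PRIMARY KEY,
    city_name STRING
);

CREATE TABLE driver_residence (
    driver_id INT PRIMARY KEY,
    name      STRING,
    city_id   INT NOT NULL REFERENCES city_info (city_id)
);
```

Because city_id in driver_residence is not unique, any number of drivers can point at the same city row.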
In a many-to-many relationship, a single row in one entity or table can be associated with multiple rows in another entity or table and vice versa.
For example, let's consider two tables: Vehicle Ownership History, where we are maintaining the history of ownership of a given vehicle, and Driver Ownership History, where we are maintaining the history of vehicles owned by a given driver:
Figure 1.3 – An example of a many-to-many relationship
Here, a driver can own multiple vehicles and a vehicle can have multiple owners over time. So, a given row in the Vehicle Ownership History table can be associated with multiple rows in the Driver Ownership History table. Similarly, a given row in the Driver Ownership History table can be associated with multiple rows in the Vehicle Ownership History table.
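A common way to express a many-to-many relationship in SQL is a junction table whose rows pair up the two sides; the following sketch uses illustrative names rather than the exact tables from the figure:

```sql
-- One driver_id can appear with many vehicle_ids, and one
-- vehicle_id can appear with many driver_ids over time.
CREATE TABLE ownership_history (
    driver_id  INT,
    vehicle_id INT,
    owned_from DATE,
    owned_to   DATE,
    PRIMARY KEY (driver_id, vehicle_id, owned_from)
);
```

Querying this one table in either direction (by driver_id or by vehicle_id) answers both "which vehicles has this driver owned?" and "who has owned this vehicle?".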
Now, let's take a look at some of the most important database models.
A database model determines how the data is stored, organized, and modified. Databases are typically implemented based on a specific data model. It is also possible to borrow concepts from multiple database models when you are designing a new database. The relational database model happens to be the most widely known and has been popularized by databases such as Oracle, IBM DB2, and MySQL.
In the hierarchical database model, the data is organized in the form of a tree. There is a root at the first level and multiple children at the subsequent levels. Since a single parent can have multiple children, one-to-many relationships can easily be represented here. However, a child cannot have multiple parents, so many-to-many relationships cannot be modeled in a hierarchical database.
IBM's Information Management System (IMS) was the first database that implemented this data model.
The following diagram shows an example of a hierarchical database model:
Figure 1.4 – An example of a hierarchical database model
Typically, the tree starts with a single root and the data is organized into this tree. Any node except the leaves can have multiple children, but a child can have only one parent.
The network model was developed as an enhancement of the hierarchical database model to accommodate many-to-many relationships. The network model relies on a graph structure to organize its data. So, there is no concept of a single root; a child can have multiple parents, and a parent can have multiple children. Integrated Data Store (IDS), the Integrated Database Management System (IDMS), and Raima Database Manager (RDM) are some of the popular databases that use the network model.
As shown in the following diagram, there is no single root, and a given child (for example, Object 2) can have multiple parents (that is, Object 1 and Object 3):
Figure 1.5 – An example of a network model
Although the network model was an improvement over the hierarchical model, it was still a little restrictive when it came to representing data. In the relational model, any record can have a relationship with any other with the help of a common field. This drastically reduced the design's complexity and made it easier to independently add, update, and access records, without having to walk down the tree or traverse the graph. SQL was combined with the relational database model to provide a simple query interface to add and retrieve data.
All the popular traditional databases such as Oracle database, IBM DB2, MySQL, MariaDB, and Microsoft SQL Server implement relational data models.
Let's look at two tables called Employee and Employee Info:
Figure 1.6 – Employee tables showing the column names
Here, Employee ID is the common field or column between the Employee and Employee Info tables. The Employee table is responsible for ensuring that a given Employee ID is unique, while Employee Info is responsible for more detailed information about a given employee.
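In practice, the common Employee ID column is what lets a query combine the two tables with a join; a minimal sketch, assuming lowercase table names and illustrative detail columns:

```sql
-- Pull basic and detailed employee data together
-- via the shared employee_id column.
SELECT e.employee_id, e.name, i.department, i.salary
FROM employee AS e
JOIN employee_info AS i
  ON e.employee_id = i.employee_id;
```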
The object-relational model, as the name suggests, combines the best of the relational and object data models. The concepts of objects, classes, and inheritance are directly supported as first-class citizens in the database and in its queries. SQL:1999, the fourth revision of the SQL standard, introduced several features for embedding object concepts into the relational database. One of the main features was the ability to create structured user-defined types with CREATE TYPE to define an object's structure.
Over time, relational databases have added more support for objects. There is a varying degree of support for object concepts in Oracle database, IBM DB2, PostgreSQL, and Microsoft SQL Server.
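As a sketch of the SQL:1999 idea, the PostgreSQL flavor of CREATE TYPE looks like the following; this is illustrative, and the exact degree of support (including in CockroachDB) varies by database and version:

```sql
-- Define a structured type, then use it as a column type.
CREATE TYPE address AS (
    street TEXT,
    city   TEXT,
    zip    TEXT
);

CREATE TABLE customer (
    customer_id  INT PRIMARY KEY,
    home_address address
);
```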
Given the scope of this book, we will not discuss the entity-relational model, object model, document model, star schema, snowflake schema, and many other less well-known models.
Now, let's look at how databases can be classified based on what kinds of workload they can be used for.
Based on how you want to consume and process data, databases can be categorized into four different processing systems. Let's take a look.
Online transaction processing (OLTP) systems support the concept of transactions. A transaction refers to the ability to atomically apply a set of changes (inserts, updates, deletes, and reads) to a given system. One popular example is a bank, where withdrawing money from or depositing money into a given bank account must be done atomically to ensure data is not lost or made incorrect. So, the main purpose here is to maintain data integrity and consistency. These systems are generally suited for fast-running queries.
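The bank example can be sketched as a single transaction; the account table, column names, and amounts here are illustrative:

```sql
-- Transfer 100 between two accounts; either both UPDATE
-- statements take effect, or neither does.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;
```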
Online analytical processing (OLAP) focuses mostly on running queries to analyze multi-dimensional data and to extract intelligence or patterns from it. Typically, such systems support generating reports that can be used for marketing, sales, finance, budgeting, management, and more. Data mining and data analytics applications typically rely on an OLAP system in some form. OLAP doesn't deal with transactions; the emphasis is more on analyzing large amounts of data from different sources to extract business intelligence. Some databases also provide built-in support for MapReduce to run queries across large datasets.
A data warehouse is a system used for reporting and data analysis. Warehouses are typically developed for OLAP workloads. It is also very common to retrieve data from OLTP systems in batches or bulk, run it through an Extract, Load, and Transform (ELT) or Extract, Transform, and Load (ETL) data transformation pipeline, and store it in an OLAP system.
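A batch ELT hand-off from an OLTP store into a warehouse can be sketched as follows; the schemas, table names, and date filter are assumptions for illustration only:

```sql
-- Extract and load raw rows in bulk, then transform
-- inside the warehouse with an aggregate query.
INSERT INTO warehouse.raw_orders
SELECT * FROM oltp.orders WHERE created_at >= '2022-01-01';

INSERT INTO warehouse.daily_sales (order_date, revenue)
SELECT order_date, SUM(total)
FROM warehouse.raw_orders
GROUP BY order_date;
```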
Online event processing (OLEP) guarantees strong consistency without the traditional atomic commit protocols or distributed locking. OLEP also focuses on high performance, larger scale, and fault tolerance.
As the name suggests, this