Leverage the power of PostgreSQL 10 to build powerful database and data warehousing applications.
If you're interested in learning more about PostgreSQL, one of the most popular relational databases in the world, then this book is for you. Those looking to build solid database or data warehousing applications with PostgreSQL 10 will also find this book a useful resource. No prior knowledge of database programming or administration is required to get started with this book.
PostgreSQL is one of the most popular open source databases in the world, and supports the most advanced features included in SQL standards and beyond. This book will familiarize you with the latest new features released in PostgreSQL 10, and get you up and running with building efficient PostgreSQL database solutions from scratch.
We'll start with the concepts of relational databases and their core principles. Then you'll get a thorough introduction to PostgreSQL and the new features introduced in PostgreSQL 10. We'll cover the Data Definition Language (DDL) with an emphasis on PostgreSQL, and the common DDL commands supported by ANSI SQL. You'll learn to create tables, define integrity constraints, build indexes, and set up views and other schema objects.
Moving on, you'll get to know the concepts of Data Manipulation Language (DML) and PostgreSQL server-side programming capabilities using PL/pgSQL. This will give you a very robust background to develop, tune, test, and troubleshoot your database application. We'll also explore the NoSQL capabilities of PostgreSQL and connect to your PostgreSQL database to manipulate data objects.
By the end of this book, you'll have a thorough understanding of the basics of PostgreSQL 10 and will have the necessary skills to build efficient database solutions.
This book is a comprehensive beginner-level tutorial on PostgreSQL that introduces the features of the newest version 10, with concepts explained in an easy-to-understand manner. Practical tips and examples are provided at every step to ensure you are able to grasp each topic as quickly as possible.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: November 2015
Second Edition: November 2017
Production reference: 1231117
ISBN 978-1-78839-201-3
www.packtpub.com
Authors
Salahaldin Juba
Andrey Volkov
Copy Editors
Tasneem Fatehi
Safis Editing
Reviewers
Dr. Isabel M.D. Rosa
Sheldon E. Strauch
Project Coordinator
Manthan Patel
Commissioning Editor
Amey Varangaonkar
Proofreader
Safis Editing
Acquisition Editor
Varsha Shetty
Indexer
Rekha Nair
Content Development Editor
Aaryaman Singh
Graphics
Tania Dutta
Technical Editor
Dinesh Chaudhary
Production Coordinator
Melwyn Dsa
Salahaldin Juba has more than a decade of experience in industry and academia, with a focus on database development for large-scale and enterprise applications. He holds a master of science degree in environmental management with distinction, and a bachelor of engineering degree in computer systems. He is also a Microsoft Certified Solution Developer (MCSD).
He has worked mainly with SQL Server, PostgreSQL, and Greenplum databases. He has developed applications for scientific communities to process GIS information in a distributed manner, and he has participated in many international projects and standards for image processing during his work in the academic sector.
As a software engineer, he works mainly on defining ETL processes with external parties, defining software solutions, promoting SQL best practices, designing OLTP and OLAP applications, scouting and evaluating new technologies, and providing training and consultation services.
Andrey Volkov studied information systems in banking. He started his career as a financial analyst in a commercial bank. Using databases as the main tool in his work, he realized that querying the database directly and mastering SQL is much more efficient for ad hoc analysis than using any visual reporting software. He joined the data warehouse team and, after some time, led the team, taking the position of data warehouse architect.
He worked mainly with Oracle, developing logical and physical models of financial and accounting data, implementing them in the database, developing ETL processes, and performing analytical tasks. He was also responsible for teaching users how to use the data warehouse and BI tools; SQL training was also a part of that work.
After 10 years in the financial sector, he changed his field and now works as a senior database developer in a telecommunications company. Here, he works mainly with PostgreSQL databases, being responsible for data modeling and implementing physical data structures, developing stored procedures, integrating databases with other software components, and developing a data warehouse.
Having a lot of experience with both Oracle and PostgreSQL, the leading commercial RDBMS and one of the most advanced open source RDBMSs, he is able to compare them and to recognize and evaluate the key advantages of both. Working as a software developer implementing different kinds of database applications, as well as working as a business analyst using databases as a tool for analysis, let him learn and understand different database features in different use cases. This experience made him both able and willing to work on this book.
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1788392019. If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Relational Databases
Database management systems
A brief history
Database categories
The NoSQL databases
The CAP theorem
NoSQL motivation
Key-value databases
Columnar databases
Document databases
Graph databases
Relational and object relational databases
ACID properties
The SQL language
Relational model concepts
Relation
Tuple
NULL value
Attribute
Constraint
Domain integrity constraint
Entity integrity constraint
Referential integrity constraints
Semantic constraints
Relational algebra
The select and project operations
The rename operation
The set theory operations
The cartesian product operation
Data modeling
Data model perspectives
The entity-relation model
Sample application
Entities, attributes, and keys
Mapping ER to relations
UML class diagrams
Summary
PostgreSQL in Action
An overview of PostgreSQL
PostgreSQL history
The advantages of PostgreSQL
Business advantages of PostgreSQL
PostgreSQL user advantages
PostgreSQL applications
Success stories
Forks
PostgreSQL architecture
The PostgreSQL community
PostgreSQL capabilities
Replication
Security
Extension
NoSQL capabilities
Foreign data wrappers
Performance
Installing PostgreSQL
Installing PostgreSQL using Advanced Package Tool
Client installation
Server installation
Basic server configuration
Installing PostgreSQL on Windows
The PostgreSQL clients
The psql client
psql advanced settings
PostgreSQL utility tools
Summary
PostgreSQL Basic Building Blocks
Database coding
Database naming conventions
PostgreSQL identifiers
Documentation
Version control system
Database migration tool
PostgreSQL objects hierarchy
Template databases
User databases
Roles
Tablespace
Template procedural languages
Settings
PostgreSQL high-level object interaction
PostgreSQL database components
Schema
Schema usages
Table
PostgreSQL native data types
Numeric types
Character types
Date and time types
The car web portal database
Summary
PostgreSQL Advanced Building Blocks
Views
View synopsis
View categories
Materialized views
Updatable views
Indexes
Index synopsis
Index selectivity
Index types
Index categories
Best practices on indexes
Functions
PostgreSQL native programming language
Creating a function in the C language
Function usages
Function dependency
PostgreSQL function categories
PostgreSQL anonymous functions
User-defined data types
Triggers and rule systems
Rule system
Trigger system
Triggers with arguments
Triggers and updatable views
Summary
SQL Language
SQL fundamentals
SQL lexical structure
Querying data with SELECT statements
Structure of a SELECT query
Select-list
SQL expressions
DISTINCT
FROM clause
Selecting from multiple tables
Self-joins
The WHERE clause
Comparison operators
Pattern matching
Row and array comparison constructs
Grouping and aggregation
The GROUP BY clause
The HAVING clause
Ordering and limiting the results
Subqueries
Set operations – UNION, EXCEPT, and INTERSECT
Dealing with NULLs
Changing the data in the database
INSERT statement
The UPDATE statement
UPDATE using sub-select
UPDATE using additional tables
The DELETE statement
The TRUNCATE statement
Summary
Advanced Query Writing
Common table expressions
Reusing SQL code with CTE
Recursive and hierarchical queries
Changing data in multiple tables at a time
Window functions
Window definition
The WINDOW clause
Using window functions
Window functions with grouping and aggregation
Advanced SQL techniques
Selecting the first records
Selecting a data sample
Set returning functions
Lateral subqueries
Advanced grouping
Advanced aggregation
Summary
Server-Side Programming with PL/pgSQL
SQL language and PL/pgSQL – a comparison
PostgreSQL function parameters
Function authorization-related parameters
Function planner-related parameters
Function configuration-related parameters
The PostgreSQL PL/pgSQL control statements
Declaration statements
Assignment statements
Conditional statements
Iteration
Loop statement
While loop statement
For loop statement
Returning from the function
Returning void
Returning a single row
Returning multiple rows
Function predefined variables
Exception handling
Dynamic SQL
Executing DDL statements in dynamic SQL
Executing DML statements in dynamic SQL
Dynamic SQL and the caching effect
Recommended practices for dynamic SQL usage
Summary
OLAP and Data Warehousing
Online analytical processing
Extract, transform, and load
Data modeling for OLAP
Aggregation
Partitioning
Parallel query
Index-only scans
Summary
Beyond Conventional Data Types
Arrays
Common functions of arrays and their operators
Modifying and accessing arrays
Indexing arrays
The hash store data structure
Indexing an hstore
The JSON data structure
JSON and XML
JSON data types for PostgreSQL
Modifying and accessing JSON types
Indexing a JSON data type
Providing a RESTful API interface for PostgreSQL
A PostgreSQL full text search
The tsquery and tsvector data types
The tsvector data type
The tsquery data type
Pattern matching
Full text search indexing
Summary
Transactions and Concurrency Control
Transactions
Transaction and ACID properties
Transaction and concurrency
MVCC in PostgreSQL
Transaction isolation levels
Explicit locking
Table-level locks
Table locking modes
Row-level locks
Row-level lock modes
Deadlocks
Advisory locks
Summary
PostgreSQL Security
Authentication in PostgreSQL
PostgreSQL pg_hba.conf
Listen addresses
Authentication best practices
PostgreSQL default access privileges
Role system and proxy authentication
PostgreSQL security levels
Database security level
Schema security level
Table-level security
Column-level security
Row-level security
Encrypting data
PostgreSQL role password encryption
pgcrypto
One-way encryption
Two-way encryption
Summary
The PostgreSQL Catalog
The system catalog
System catalog for administrators
Getting the database cluster and client tools version
Terminating and canceling user sessions
Defining and getting database cluster settings
Getting the database and database object size
Cleaning up the database
Cleaning up the database data
Tuning for performance
Selective dump
Summary
Optimizing Database Performance
PostgreSQL configuration tuning
Maximum number of connections
Memory settings
Hard disk settings
Planner-related settings
Benchmarking is your friend
Tuning performance for write
Tuning performance for read
Explain command and execution plan
Detecting problems in query plans
Common mistakes in writing queries
Unnecessary operations
Misplaced or missing indexes
Using CTE when not mandatory
Using the PL/pgSQL procedural language consideration
Cross column correlation
Table partitioning
Constraint exclusion limitations
Query rewriting
Summary
Testing
Unit testing
Specificity of unit testing in databases
Unit test frameworks
Schema difference
Database abstraction interfaces
Data difference
Performance testing
Summary
Using PostgreSQL in Python Applications
Python DB API 2.0
Low-level database access with psycopg2
Connecting to a database
Connection pooling
Executing SQL commands
Reading data from a database
The COPY command
Asynchronous access
Alternative drivers for PostgreSQL
pg8000
asyncpg
Object relational mapping with SQLAlchemy
Main components of SQLAlchemy
Connecting to a database and retrieving data with SQL Expression Language
ORM
Summary
Scalability
The problem of scalability and the CAP theorem
Data replication in PostgreSQL
Transaction log
Physical replication
Log shipping replication
Streaming replication
Synchronous replication
Logical replication
Using replication to scale PostgreSQL
Scaling for heavy querying
Data sharding
Scaling for a large number of connections
Summary
Picking the right database management system is a difficult task due to the vast number of options on the market. Depending on the business model, one can pick a commercial database or an open source database with commercial support. In addition to this, there are several technical and non-technical factors to assess. When it comes to a relational database management system, PostgreSQL stands at the top for several reasons. The PostgreSQL slogan, The world's most advanced open source database, shows the sophistication of its features and community confidence.
PostgreSQL is an open source object relational database management system. It emphasizes extensibility and competes with major relational database vendors such as Oracle, SQL Server, and MySQL. Due to its rich extensions and open source license, it is often used for research purposes, but PostgreSQL code is also the base for many open source and commercial database management systems such as Greenplum and Vertica. Furthermore, start-ups often favor PostgreSQL due to its licensing costs and because there are a lot of companies that provide commercial support.
PostgreSQL runs on most modern operating systems, including Windows, macOS, and Linux flavors. There are also several tools to access, manage, and monitor PostgreSQL clusters, such as pgAdmin, OmniDB, and psql. PostgreSQL installation and configuration is moderately easy, as it is supported by most packaging tools, such as yum and apt. Database developers can easily learn and use PostgreSQL because it complies with ANSI SQL standards. Beyond this, there are a lot of resources to help developers learn PostgreSQL; it has very good documentation and a very active and organized community.
PostgreSQL can be used for both OLTP and OLAP applications. As it is ACID compliant, it can be used out of the box for OLTP applications. For OLAP applications, PostgreSQL supports window functions, foreign data wrappers (FDWs), and table inheritance; there are many external extensions for this purpose as well.
Even though PostgreSQL is ACID compliant, it has very good performance, as it utilizes state-of-the-art algorithms and techniques. For example, PostgreSQL uses MVCC (multi-version concurrency control) to allow concurrent access to data. In addition, PostgreSQL supports both pessimistic and optimistic concurrency control, and the locking behavior can be altered based on the use case. PostgreSQL also provides a very good analyzer and advanced features, such as data partitioning using table inheritance and constraint exclusion, to speed up the handling of very large amounts of data. It supports several index types, such as B-tree, GIN, GiST, and BRIN indexes, and parallel query execution has been supported since PostgreSQL 9.6. Finally, one can use replication to balance the load across different database clusters.
PostgreSQL is scalable thanks to the many replication solutions in the market, such as Slony and pgpool-II. Additionally, PostgreSQL supports out-of-the-box synchronous and asynchronous streaming replication, as well as logical replication. This makes PostgreSQL very attractive because it can be used to set up highly available and performant systems.
Chapter 01, Relational Databases, introduces relational database system concepts, including relational database properties, relational algebra, and database modeling. Also, it describes different database management systems such as graph, document, key value, and columnar databases.
Chapter 02, PostgreSQL in Action, provides first-hand experience of installing the PostgreSQL server and client tools on different platforms. This chapter also introduces PostgreSQL capabilities, such as out-of-the-box replication support and its very rich data types.
Chapter 03, PostgreSQL Basic Building Blocks, provides some coding best practices, such as coding conventions of PostgreSQL and identifier names. This chapter describes the basic building blocks and the interaction between these blocks, mainly template databases, user databases, tablespaces, roles, and settings. Also, it describes basic data types and tables.
Chapter 04, PostgreSQL Advanced Building Blocks, introduces several building blocks, including views, indexes, functions, user-defined data types, triggers, and rules. This chapter provides use cases of these building blocks and compares building blocks that can be used for the same case, such as rules and triggers.
Chapter 05, SQL Language, introduces the structured query language (SQL), which is used to interact with a database: creating and maintaining data structures, and entering, changing, retrieving, and deleting data. SQL has commands related to the data definition language (DDL), data manipulation language (DML), and data control language (DCL). Four SQL statements form the basis of DML and are described in this chapter. The SELECT statement is examined in detail to explain SQL concepts such as grouping and filtering, to show what SQL expressions and conditions are, and to demonstrate how to use subqueries. Some relational algebra topics are also covered in application to joining tables.
Chapter 06, Advanced Query Writing, describes advanced SQL concepts and features, such as common table expressions and window functions. These help you implement logic that would not otherwise be possible, such as recursive queries. Other techniques explained here, such as the DISTINCT ON clause, the FILTER clause, and lateral subqueries, are not irreplaceable; however, they can help make a query smaller, easier, and faster.
Chapter 07, Server-Side Programming with PL/pgSQL, covers function parameters, such as the number of returned rows and function cost, which is mainly used by the query planner. Also, it presents control statements such as conditionals and iteration. Finally, it explains the concept of dynamic SQL and some recommended practices when using dynamic SQL.
Chapter 08, OLAP and Data Warehousing, introduces several concepts regarding the usage of relational databases in the realm of analytical processing. It discusses the difference between OLTP and OLAP loads, as well as the modeling aspects of OLAP applications. In addition, it discusses some technical methods to perform ETL (extract, transform, and load) operations, such as the COPY command, and some features of PostgreSQL that increase data retrieval performance, such as index-only scans and table partitioning.
Chapter 09, Beyond Conventional Data Types, discusses several rich data types, including arrays, hash stores, JSON documents, and full-text search. It presents operations and functions for each data type to initialize, insert, update, access, and delete data of these types. Finally, it shows how PostgreSQL can be combined with Nginx to serve read-only RESTful requests.
Chapter 10, Transactions and Concurrency Control, discusses the ACID properties in detail and the relation between these properties and concurrency controls. This chapter also discusses concepts such as isolation levels and their side effects, and it shows these side effects using SQL examples. Finally, the chapter discusses different locking methods, including pessimistic locking strategies such as row locking and advisory locks.
Chapter 11, PostgreSQL Security, covers concepts of authentication and authorization. It describes PostgreSQL authentication methods and explains the structure of a PostgreSQL host-based authentication configuration file. It also discusses the permissions that can be granted to database objects such as schemas, tables, views, indexes, and columns. Finally, it shows how sensitive data, such as passwords, can be protected using different techniques, including one-way and two-way encryption.
Chapter 12, The PostgreSQL Catalog, provides several recipes to maintain a database cluster, including cleaning up data, maintaining user processes, cleaning up indexes and unused database objects, discovering and adding indexes to foreign keys, and so on.
Chapter 13, Optimizing Database Performance, discusses several approaches to optimize performance. It presents PostgreSQL cluster configuration settings, which are used in tuning the whole cluster's performance. Also, it presents common mistakes in writing queries and discusses several approaches to increase performance, such as using indexes or table partitioning and constraint exclusion.
Chapter 14, Testing, covers some aspects of the software testing process and how it can be applied to databases. Unit tests for databases can be written as SQL scripts or stored functions in a database. There are several frameworks that help us write unit tests and process the results of testing.
Chapter 15, Using PostgreSQL in Python Applications, discusses several advanced concepts, such as connection pooling, asynchronous access, and object relational mappers (ORMs). The chapter shows, by example, how to connect to a database, query it, and perform updates using Python. Finally, it introduces different technologies that interact with PostgreSQL, giving the developer a broad overview of state-of-the-art technologies.
Chapter 16, Scalability, discusses the problem of scalability and the CAP theorem in detail. Also, it covers data replication in PostgreSQL, including physical replication and logical replication. Finally, it shows different scaling scenarios and their implementation in PostgreSQL.
In general, PostgreSQL server and client tools do not need exceptional hardware. PostgreSQL can be installed on almost all modern platforms, including Linux, Windows, and Mac. Also, in the book, when a certain library is needed, the installation instructions are given.
You need PostgreSQL version 10; however, most of the examples can be executed on earlier versions as well. In order to execute the sample code, scripts, and examples provided in the book, you need to have at least a PostgreSQL client tool installed on your machine—preferably psql—and access to a remote server running the PostgreSQL server.
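For example, connecting to a server with the psql client might look like the following, where the host, port, user, and database names are placeholders to adapt to your environment:

    psql -h localhost -p 5432 -U postgres -d car_portal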
In a Windows environment, the cmd.exe command prompt is not very convenient; thus, the user might consider using Cygwin (http://www.cygwin.com/) or another alternative, such as PowerShell.
Some chapters might require additional software. For example, in Chapter 15, Using PostgreSQL in Python Applications, you need to install Python and the required Python libraries to interact with PostgreSQL. In Chapter 16, Scalability, Docker is used to give the reader a better user experience.
For other chapters, such as Chapter 09, Beyond Conventional Data Types, and Chapter 11, PostgreSQL Security, it is better to use Linux, because some of the software used, such as Nginx and GnuPG, is not very convenient on Windows. To run a Linux distribution on your Windows machine, you can use VirtualBox (https://www.virtualbox.org/).
If you're interested in learning more about PostgreSQL, one of the most popular relational databases in the world, then this book is for you. Those looking to build solid database or data warehousing applications with PostgreSQL 10 will also find this book a useful resource. No prior knowledge of database programming or administration is required to get started with this book.
Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:
1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-PostgreSQL-10-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/LearningPostgreSQL10SecondEdition_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
This chapter and the following ones provide a high-level overview of topics related to database development, covering the theoretical aspects of relational databases. The first two chapters summarize theoretical topics that come up on a daily basis. Understanding these theoretical concepts will enable developers not only to come up with clean designs, but also to master relational databases.
This chapter is not restricted to learning PostgreSQL, but covers all relational databases. The topics covered in this chapter include the following:
Database management systems: Understanding the different database categories enables the developer to get the best of each world
Relational algebra: Understanding relational algebra enables developers to master the SQL language, especially SQL code rewriting
Data modeling: Using data modeling techniques leads to better communication
Different database management systems support diverse application scenarios, use cases, and requirements. Database management systems have a long history. First, we will quickly take a look at the recent history, and then explore the market-dominant database management system categories.
Broadly, the term database can be used to refer to a collection of things. Moreover, this term brings to mind many other terms, including data, information, data structure, and management. A database can be defined as a collection or repository of data, which has a certain structure, managed by a database management system (DBMS). Data can be structured as tabular data, semi-structured as XML documents, or unstructured, as data that does not fit a predefined data model.
In the early days, databases were mainly aimed at supporting business applications; this led us to the well-defined relational algebra and relational database systems. With the introduction of object-oriented languages, new paradigms of database management systems appeared, such as object-relational databases and object-oriented databases. Also, many business as well as scientific applications use arrays, images, and spatial data; thus, new models such as raster, map, and array algebra are supported. Graph databases are used to support graph queries such as the shortest path from one node to another, along with easily supporting traversal queries.
With the advent of web applications such as social portals, it is now necessary to support a huge number of requests in a distributed manner. This has led to another new paradigm of databases called NoSQL (Not Only SQL), which prioritizes requirements such as performance and horizontal scaling capabilities. In general, the timeline of database evolution was greatly affected by many factors, such as the following:
Functional requirements: The nature of the applications using a DBMS has led to the development of extensions on top of relational databases, such as PostGIS (for spatial data), and even dedicated DBMSs, such as SciDB (for scientific data analytics).
Nonfunctional requirements: The success of object-oriented programming languages has created new trends, such as object-oriented databases. Object relational database management systems have appeared to bridge the gap between relational databases and object-oriented programming languages. Data explosion and the necessity to handle terabytes of data on commodity hardware have led to columnar databases, which can easily scale horizontally.
Many database models have appeared and vanished, such as the network model and the hierarchical model. The predominant categories in the market now are relational databases, object-relational databases, and NoSQL databases. One should not think of NoSQL and SQL databases as rivals; they are complementary to each other. By utilizing different database systems, one can overcome many limitations and get the best of different technologies.
The NoSQL databases are affected by the CAP theorem, also known as Brewer's theorem. In 2002, S. Gilbert and N. Lynch published a formal proof of the CAP theorem in their article, Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. In 2009, the NoSQL movement began. Currently, there are over 150 NoSQL databases (nosql-database.org).
The CAP theorem states that it is impossible for a distributed computing system to simultaneously provide all three of the following guarantees:
Consistency: All clients see (immediately) the latest data, even in the case of updates.
Availability: All clients can find a replica of some data, even in the case of a node failure. This means that even if some part of the system goes down, the clients can still access the data.
Partition tolerance: The system continues to work regardless of arbitrary message loss or failure of part of the system.
The choice of which feature to discard determines the nature of the system. For example, one could sacrifice consistency to get a scalable, simple, and high-performance database management system. Often, the main difference between a relational database and a NoSQL database is consistency. A relational database enforces the ACID (atomicity, consistency, isolation, durability) properties. In contrast, many NoSQL databases adopt the basically available, soft-state, eventual-consistency (BASE) model.
A NoSQL database provides a means for data storage, manipulation, and retrieval for non-relational data. The NoSQL databases are distributed, open source, and horizontally scalable. NoSQL often adopts the BASE model, which prizes availability over consistency, and informally guarantees that if no new updates are made on a data item, eventually all access to that data item will return the latest version of that data item. The advantages of this approach include the following:
Simplicity of design
Horizontal scaling and easy replication
Schema free
Huge amount of data support
We will now explore a few types of NoSQL databases.
The key-value store is the simplest database store. In this database model, the storage, as its name suggests, is based on maps or hash tables. Some key-value databases allow complex values to be stored as lists and hash tables. Key-value pairs are extremely fast for certain scenarios, but lack the support for complex queries and aggregation. Some of the existing open source key-value databases are Riak, Redis, Membase, and MemcacheDB.
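Although key-value stores are usually standalone systems, PostgreSQL can model the same structure through its hstore extension, covered later in Chapter 09, Beyond Conventional Data Types. The following is a minimal sketch with illustrative table and key names:

    CREATE EXTENSION IF NOT EXISTS hstore;
    -- Each row carries an arbitrary set of key-value pairs:
    CREATE TABLE session_store (session_id int PRIMARY KEY, attributes hstore);
    INSERT INTO session_store VALUES (1, 'user => alice, theme => dark');
    -- Look a single key up, as a key-value store would:
    SELECT attributes -> 'theme' FROM session_store WHERE session_id = 1;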
Columnar or column-oriented databases are based on columns. Data in a certain column in a two-dimensional relation is stored together.
HBase is one of the most famous columnar databases; it is based on the Google Bigtable storage system. Column-oriented databases are designed for huge data scenarios, so they scale easily. For small datasets, however, HBase is not a suitable architecture: the recommended hardware topology for HBase is a five-node server deployment, and it needs a lot of administration and is difficult to master and learn.
A document-oriented database is suitable for documents and semi-structured data. The central concept of a document-oriented database is the notion of a document. Documents encapsulate and encode data (or information) in some standard formats or encodings such as XML, JSON, and BSON. Documents do not adhere to a standard schema or have the same structure, so they provide a high degree of flexibility. Unlike relational databases, changing the structure of the document is simple and does not lock the clients from accessing the data.
Document databases merge the power of relational databases and column-oriented databases. They provide support for ad hoc queries and can be scaled easily. Design goals differ from one document database to another: MongoDB, for example, is designed to handle huge amounts of data efficiently, while CouchDB provides high availability even in the case of hardware failure.
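PostgreSQL itself offers document-style storage through its JSON types, also covered in Chapter 09. The following minimal sketch, with illustrative names, shows that documents in the same table need not share one structure:

    CREATE TABLE product_doc (id int PRIMARY KEY, body jsonb);
    INSERT INTO product_doc VALUES
      (1, '{"name": "phone", "tags": ["mobile", "android"]}'),
      (2, '{"name": "book", "pages": 649}');  -- a different structure in the same table
    -- Extract a field and filter on an array element inside the document:
    SELECT body ->> 'name' FROM product_doc WHERE body -> 'tags' ? 'mobile';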
Graph databases are based on graph theory: a database consists of nodes and edges, and both nodes and edges can be assigned data. Graph databases allow traversing between the nodes using edges. As a graph is a generic data structure, graph databases are capable of representing different kinds of data. A famous open source, commercially supported graph database is Neo4j.
Relational database management systems are among the most widely used DBMSs in the world. It is highly unlikely that any organization, institution, or personal computer today does not run a piece of software that relies on an RDBMS. Software applications can use relational databases via dedicated database servers or via lightweight RDBMS engines, embedded in the software applications as shared libraries. The capabilities of a relational database management system vary from one vendor to another, but most of them adhere to the ANSI SQL standards. A relational database is formally described by relational algebra and is based on the relational model. Object-relational databases (ORDs) are similar to relational databases; they additionally support the following object-oriented model concepts:
User-defined and complex data types
Inheritance
In a relational database, a single logical operation is called a transaction. The technical translation of a transaction is a set of database operations: create, read, update, and delete (CRUD). An example of a transaction is assigning budgets to several projects in a company, assuming we have a fixed amount of money: if we increase a certain project's budget, we need to deduct the amount of the increase from another project. The ACID properties in this context can be described as follows:
Atomicity: All or nothing, which means that if a part of a transaction fails, then the transaction fails as a whole.
Consistency: Any transaction takes the database from one valid state to another valid state. Database consistency is normally governed by data constraints and the relations between data, and any combination thereof. For example, imagine that someone wants to completely purge their account on a shopping service. In order to purge the account, the account details, such as the list of addresses, also need to be purged. This is governed by foreign key constraints, which will be explained in detail in the next chapter.
Isolation: Concurrent execution of transactions results in a system state that would be obtained if the transactions were executed serially.
Durability: Transactions that are committed, that is, executed successfully, are persistent even through power loss or server crashes. In PostgreSQL, this is normally achieved by a technique called write-ahead logging (WAL); other databases, such as Oracle, refer to this as the transaction log.
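The budget example can be expressed as a single transaction. The following is a minimal sketch, assuming a hypothetical project table with project_name and budget columns:

    BEGIN;
    UPDATE project SET budget = budget - 10000 WHERE project_name = 'Project A';
    UPDATE project SET budget = budget + 10000 WHERE project_name = 'Project B';
    COMMIT;  -- atomicity: both updates take effect, or, after a failure or ROLLBACK, neither does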
Relational databases are often linked to the structured query language (SQL). SQL is a declarative programming language and is the standard relational database language. The American National Standards Institute (ANSI) and the International Organization for Standardization (ISO) published the SQL standard for the first time in 1986, followed by many revisions, such as SQL:1999, SQL:2003, SQL:2006, SQL:2008, SQL:2011, and SQL:2016. The SQL language has several parts:
Data definition language (DDL): It defines and amends the relational structure
Data manipulation language (DML): It retrieves and extracts information from the relations
Data control language (DCL): It controls the access rights to relations
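As a quick illustration, the following shows one hypothetical statement from each part of the language:

    CREATE TABLE customer (customer_id int, first_name text);  -- DDL: defines the structure
    INSERT INTO customer VALUES (1, 'thomas');                 -- DML: changes the data
    SELECT first_name FROM customer;                           -- DML: retrieves information
    GRANT SELECT ON customer TO report_reader;                 -- DCL: controls access rights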
The relational model is based on first-order predicate logic and was first introduced by Edgar F. Codd in 1970 in his paper, A Relational Model of Data for Large Shared Data Banks. A database is represented as a collection of relations. The state of the whole database is defined by the state of all the relations in the database. Different information can be extracted from the relations by joining and aggregating data from different relations and by applying filters on the data. In this section, the basic concepts of the relational model are introduced using a top-down approach, first describing the relation, tuple, attribute, and domain.
Think of a relation as a table with a header, columns, and rows. The table name and the header help in interpreting the data in the rows. Each row represents a group of related data, which points to a certain object.
A relation is represented by a set of tuples. Tuples should have the same set of ordered attributes. Attributes have a domain, that is, a type and a name:
customer_id | first_name | last_name
------------+------------+----------
          1 | thomas     | sieh
          2 | wang       | kim

Each row of the table is a tuple, and each column is an attribute.
The relation schema is denoted by the relation name and the relation attributes. For example, customer (customer_id, first_name, last_name, email) is the relation schema for the customer relation. The relation state is defined by the set of relation tuples; thus, adding, deleting, or amending a tuple changes the relation to another state.
Tuple order or position in the relation is not important, and the relation is not sensitive to tuple order. The tuples in the relation could be ordered by a single attribute or a set of attributes. Also, a relation cannot have duplicate tuples.
A relation can represent entities in the real world, such as a customer, or can be used to represent an association between relations. For example, the customer could have several services and a service can be offered to several customers. This could be modeled by three relations: customer, service, and customer_service. The customer_service relation associates the customer and the service relations. Separating the data in different relations is a key concept in relational database modeling, which is called normalization. Normalization is the process of organizing relation columns and relations to reduce data redundancy. For example, assume that a collection of services is stored in the customer relation. If a service is assigned to multiple customers, this would result in data redundancy. Also, updating a certain service would require updating all its copies in the customer table.
Each attribute has a name and a domain, and the name should be distinct within the relation. The domain defines the possible set of values that the attribute can have. One way to define the domain is to define the data type and a constraint on this data type. For example, hourly wage should be a positive real number and bigger than five if we assume that the minimum hourly wage is five dollars. The domain could be continuous, such as salary, which is any positive real number, or discrete, such as gender.
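In PostgreSQL, such a domain can be declared explicitly. The following is a minimal sketch of the hourly wage example; the domain and table names are illustrative:

    CREATE DOMAIN hourly_wage AS numeric
      CHECK (VALUE > 5);                     -- assuming a five-dollar minimum hourly wage
    CREATE TABLE employee (
      employee_id int,
      wage        hourly_wage
    );
    INSERT INTO employee VALUES (1, 4.5);    -- rejected: violates the domain constraint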
The formal relational model puts a constraint on the domain: the value should be atomic. Atomic means that each value in the domain is indivisible. For instance, the name attribute domain is not atomic because it can be divided into first name and last name. Some examples of domains are as follows:
Phone number: Numeric text with a certain length.
Country code: Defined by ISO 3166 as a list of two-letter codes (ISO alpha-2) and three-letter codes (ISO alpha-3). The country codes for Germany are DE and DEU for alpha-2 and alpha-3, respectively.
The relational model defines many constraints in order to control data integrity, redundancy, and validity:
Redundancy: Duplicate tuples are not allowed in the relation.
Validity: Domain constraints control data validity.
Integrity: The relations within a single database are linked to each other. An action on a relation, such as updating or deleting a tuple, might leave the other relations in an invalid state.
We could classify the constraints in a relational database roughly into two categories:
Inherited constraints from the relational model: Domain integrity, entity integrity, and referential integrity constraints.
Semantic constraints, business rules, and application-specific constraints: These constraints cannot be expressed explicitly by the relational model. However, with the introduction of procedural SQL languages such as PL/pgSQL for PostgreSQL, relational databases can also be used to model these constraints.
The domain integrity constraint ensures data validity. The first step in defining the domain integrity constraint is to determine the appropriate data type. The domain data types could be integer, real, boolean, character, text, inet, and so on. For example, the data type of first name and email address is text. After specifying the data type, check constraints, such as the mail address pattern, need to be defined.
Check constraint: A check constraint can be applied to a single attribute or a combination of many attributes in a tuple. Let's assume that the customer_service schema is defined as (customer_id, service_id, start_date, end_date, order_date). For this relation, we can have a check constraint to make sure that start_date and end_date are entered correctly by applying the following check: start_date < end_date.
Default constraint: The attribute can have a default value. The default value could be a fixed value, such as the default hourly wage of the employees, for example, $10. It may also have a dynamic value based on a function, such as random, current time, or date. For example, in the customer_service relation, order_date can have a default value, which is the current date.
Unique constraint: A unique constraint guarantees that the attribute has a distinct value in each tuple. It allows null values. For example, let's assume that we have a relation player defined as player (player_id, player_nickname). The player uses his ID to play with others; he can also pick a nickname, which is also unique, to identify himself.
Not null constraint: By default, the attribute value can be null. The not null constraint restricts an attribute from having a null value. For example, each person in the birth registry record should have a name.
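The following sketch shows how these constraints might be declared for the customer_service and player relations described above; the column data types are assumptions:

    CREATE TABLE customer_service (
      customer_id int NOT NULL,              -- not null constraint
      service_id  int NOT NULL,
      start_date  date,
      end_date    date,
      order_date  date DEFAULT current_date, -- default constraint
      CHECK (start_date < end_date)          -- check constraint
    );
    CREATE TABLE player (
      player_id       int PRIMARY KEY,
      player_nickname text UNIQUE            -- unique constraint; null is still allowed
    );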
In the relational model, a relation is defined as a set of tuples. This means that all the tuples in a relation must be distinct. The entity integrity constraint is enforced by having a primary key which is an attribute/set of attributes having the following characteristics:
The attributes should be unique
The attributes should not be null
Each relation must have only one primary key, but can have many unique keys. A candidate key is a minimal set of attributes that can identify a tuple. All unique, not null attributes can be candidate keys. The set of all attributes forms a super key. In practice, we often pick a single attribute to be a primary key instead of a compound key (a key that consists of two or more attributes that uniquely identify a tuple) to ease the joining of relations with each other.
If the primary key is generated by the DBMS, then it is called a surrogate key or a synthetic key; otherwise, it is called a natural key. Surrogate key candidates include sequences and universally unique identifiers (UUIDs). A surrogate key has many advantages, such as performance, tolerance to requirement changes, agility, and compatibility with object relational mappers. Its chief disadvantage is that it makes redundant tuples possible.
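The following is a minimal sketch of a surrogate key in PostgreSQL 10, which introduced identity columns; the email column stands in for a natural candidate key:

    CREATE TABLE customer (
      customer_id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, -- surrogate key
      email       text UNIQUE NOT NULL                              -- natural candidate key
    );

Without the unique constraint on email, two tuples differing only in customer_id could describe the same customer, which is exactly the redundancy risk mentioned above.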
Relations are associated with each other via common attributes. Referential integrity constraints govern the association between two relations and ensure data consistency between tuples. If a tuple in one relation references a tuple in another relation, then the referenced tuple must exist. In the customer service example, if a service is assigned to a customer, then the service and the customer must exist. For instance, in the customer_service relation, we cannot have a tuple with the values (5, 1, 01-01-2014, NULL), because we do not have a customer with customer_id equal to 5.
The lack of referential integrity constraints can lead to many problems:
Invalid data in the common attributes
Invalid information during joining of data from different relations
Performance degradation due to bad execution plans generated by the PostgreSQL planner or by third-party tools
Referential integrity constraints are achieved via foreign keys. A foreign key is an attribute or a set of attributes that can identify a tuple in the referenced relation. As the purpose of a foreign key is to identify a tuple in the referenced relation, foreign keys are generally primary keys in the referenced relation. Unlike a primary key, a foreign key can have a null value; it can also reference a unique attribute in the referenced relation. Allowing a foreign key to have a null value enables us to model different cardinality constraints. Cardinality constraints define the participation between two different relations. For example, a parent can have more than one child; this is called a one-to-many relationship, because one tuple in the referenced relation is associated with many tuples in the referencing relation. Also, a relation can reference itself; this foreign key is called a self-referencing or recursive foreign key.
For example, a company acquired by another company can be modeled with a recursive foreign key on the company relation, as shown in the sketch after the following list of referential actions.
To ensure data integrity, foreign keys can be used to define several behaviors when a tuple in the referenced relation is updated or deleted. The following behaviors are called referential actions:
Cascade: When a tuple is deleted or updated in the referenced relation, the tuples in the referencing relation are also updated or deleted
Restrict: The tuple cannot be deleted, or the referenced attribute cannot be updated, if it is referenced by another relation
No action: Similar to restrict, but it is deferred to the end of the transaction
Set default: When a tuple in the referenced relation is deleted or the referenced attribute is updated, the foreign key value is assigned the default value
Set null: The foreign key attribute value is set to null when the referenced tuple is deleted
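A minimal sketch of these referential actions, and of the recursive foreign key from the company example; the table layouts are assumptions:

    CREATE TABLE customer (customer_id int PRIMARY KEY);
    CREATE TABLE service  (service_id  int PRIMARY KEY);
    CREATE TABLE customer_service (
      customer_id int REFERENCES customer ON DELETE CASCADE,  -- deleting a customer deletes its assignments
      service_id  int REFERENCES service  ON DELETE RESTRICT  -- a referenced service cannot be deleted
    );
    -- A self-referencing (recursive) foreign key: a company acquired by another company
    CREATE TABLE company (
      company_id  int PRIMARY KEY,
      acquired_by int REFERENCES company (company_id)          -- null means not (yet) acquired
    );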
Semantic constraints, or business logic constraints, describe the database application constraints in general. These constraints are enforced either by the business logic tier of the application program or by SQL procedural languages. Trigger and rule systems can also be used for this purpose. For example, a customer should have at most one active service at a time. Based on the nature of the application, one could favor using an SQL procedural language or a high-level programming language to meet the semantic constraints, or mix the two approaches.
The advantages of using an SQL procedural language are as follows:
Performance: RDBMSs often have complex analyzers to generate efficient execution plans. Also, in some cases, such as data mining, the amount of data that needs to be manipulated is very large; manipulating the data using a procedural SQL language eliminates network data transfer. Finally, some procedural SQL languages utilize clever caching algorithms.
Last-minute changes: With SQL procedural languages, one can deploy bug fixes without service disruption.
