Description

Leverage the power of PostgreSQL 10 to build powerful database and data warehousing applications.

About This Book

  • Be introduced to the concept of relational databases and PostgreSQL, one of the fastest growing open source databases in the world
  • Learn client-side and server-side programming in PostgreSQL, and how to administer PostgreSQL databases
  • Discover tips on implementing efficient database solutions with PostgreSQL 10

Who This Book Is For

If you're interested in learning more about PostgreSQL, one of the most popular relational databases in the world, then this book is for you. Those looking to build solid database or data warehousing applications with PostgreSQL 10 will also find this book a useful resource. No prior knowledge of database programming or administration is required to get started with this book.

What You Will Learn

  • Understand the fundamentals of relational databases, relational algebra, and data modeling
  • Install a PostgreSQL cluster, create a database, and implement your data model
  • Create tables and views, define indexes, and implement triggers, stored procedures, and other schema objects
  • Use the Structured Query Language (SQL) to manipulate data in the database
  • Implement business logic on the server side with triggers and stored procedures using PL/pgSQL
  • Make use of advanced data types supported by PostgreSQL 10: Arrays, hstore, JSONB, and others
  • Develop OLAP database solutions using the most recent features of PostgreSQL 10
  • Connect your Python applications to a PostgreSQL database and work with the data efficiently
  • Test your database code, find bottlenecks, improve performance, and enhance the reliability of the database applications

In Detail

PostgreSQL is one of the most popular open source databases in the world, and supports the most advanced features included in SQL standards and beyond. This book will familiarize you with the latest new features released in PostgreSQL 10, and get you up and running with building efficient PostgreSQL database solutions from scratch.

We'll start with the concepts of relational databases and their core principles. Then you'll get a thorough introduction to PostgreSQL and the new features introduced in PostgreSQL 10. We'll cover the Data Definition Language (DDL) with an emphasis on PostgreSQL, and the common DDL commands supported by ANSI SQL. You'll learn to create tables, define integrity constraints, build indexes, and set up views and other schema objects.

Moving on, you'll get to know the concepts of Data Manipulation Language (DML) and PostgreSQL server-side programming capabilities using PL/pgSQL. This will give you a very robust background to develop, tune, test, and troubleshoot your database application. We'll also explore the NoSQL capabilities of PostgreSQL and connect to your PostgreSQL database to manipulate data objects.

By the end of this book, you'll have a thorough understanding of the basics of PostgreSQL 10 and will have the necessary skills to build efficient database solutions.

Style and approach

This book is a comprehensive, beginner-level tutorial on PostgreSQL that introduces the features of the newest version, 10, and explains concepts in an easy-to-understand manner. Practical tips and examples are provided at every step to ensure that you can grasp each topic as quickly as possible.


Page count: 649

Year of publication: 2017




Learning PostgreSQL 10

Second Edition

A beginner’s guide to building high-performance PostgreSQL database solutions

Salahaldin Juba
Andrey Volkov

BIRMINGHAM - MUMBAI

Learning PostgreSQL 10

Second Edition

 

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: November 2015

Second Edition: November 2017

 

Production reference: 1231117

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78839-201-3

 

www.packtpub.com

Credits

Authors

Salahaldin Juba

Andrey Volkov

Copy Editors

Tasneem Fatehi

Safis Editing

Reviewers

Dr. Isabel M.D. Rosa

Sheldon E. Strauch

Project Coordinator

Manthan Patel

Commissioning Editor

Amey Varangaonkar

Proofreader

Safis Editing

Acquisition Editor

Varsha Shetty

Indexer

Rekha Nair

Content Development Editor

Aaryaman Singh

Graphics

Tania Dutta

Technical Editor

Dinesh Chaudhary

Production Coordinator

Melwyn Dsa

About the Authors

Salahaldin Juba has over a decade of experience in industry and academia, with a focus on database development for large-scale and enterprise applications. He holds a master's degree in environmental management, with distinction, and a bachelor's degree in computer systems engineering. He is also a Microsoft Certified Solution Developer (MCSD).

He has worked mainly with SQL Server, PostgreSQL, and Greenplum databases. He has developed applications for scientific communities to process GIS information in a distributed manner, and he has participated in many international projects and standards for image processing during his work in the academic sector.

As a software engineer, he works mainly on defining ETL processes with external parties, defining software solutions, promoting SQL best practices, designing OLTP and OLAP applications, scouting and evaluating new technologies, and providing training and consultation services.

I would like to express my deepest gratitude to my colleague Andrey Volkov for making this work possible. Also, I would like to thank all the people who provided support, especially the team at Packt for their great support and feedback with proofreading, design, comments, and remarks. I would also like to thank my family for their support despite all of the time that I had to devote to this book over them. Finally, very warm and deep thanks to my late father, Ikrayem Juba, for his utmost support, help, and guidance. 

Andrey Volkov studied information systems in banking. He started his career as a financial analyst in a commercial bank. Using databases as a main tool in his work, he realized that querying the database directly and mastering SQL is much more efficient for ad hoc analysis than using any visual reporting software. He joined the data warehouse team, and after some time, he led the team, taking the position of data warehouse architect.

He has worked mainly with Oracle and used to develop logical and physical models of financial and accounting data, implement them in the database, develop ETL processes, and perform analytical tasks. He was also responsible for teaching the users how to use the data warehouse and BI tools. SQL training was also a part of that work.

After 10 years in the financial sector, he changed his field and now works as a senior database developer in a telecommunications company. Here, he works mainly with PostgreSQL databases, being responsible for data modeling and implementing physical data structures, developing stored procedures, integrating databases with other software components, and developing a data warehouse.

Having a lot of experience with both Oracle and PostgreSQL, the leading commercial RDBMS and one of the most advanced open source RDBMSes, respectively, he is able to compare them and to recognize and evaluate the key advantages of both. Working as a software developer implementing different kinds of database applications, as well as working as a business analyst using databases as a tool for analysis, let him learn and understand different database features in different use cases. This experience made him able and willing to work on this book.

I would like to thank my wife and son for their support and for letting me work on weekends and evenings. Big thanks to the editorial team for their support, guidance, and organization. And most of all, I would like to thank the main author of the book, Salahaldin Juba, who invited me to work on the book, introduced me to the team, and, in fact, has done most of the work.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1788392019. If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Relational Databases

Database management systems

A brief history

Database categories

The NoSQL databases

The CAP theorem

NoSQL motivation

Key-value databases

Columnar databases

Document databases

Graph databases

Relational and object relational databases

ACID properties

The SQL language

Relational model concepts

Relation

Tuple

NULL value

Attribute

Constraint

Domain integrity constraint

Entity integrity constraint

Referential integrity constraints

Semantic constraints

Relational algebra

The select and project operations

The rename operation

The set theory operations

The cartesian product operation

Data modeling

Data model perspectives

The entity-relation model

Sample application

Entities, attributes, and keys

Mapping ER to relations

UML class diagrams

Summary

PostgreSQL in Action

An overview of PostgreSQL

PostgreSQL history

The advantages of PostgreSQL

Business advantages of PostgreSQL

PostgreSQL user advantages

PostgreSQL applications

Success stories

Forks

PostgreSQL architecture

The PostgreSQL community

PostgreSQL capabilities

Replication

Security

Extension

NoSQL capabilities

Foreign data wrappers

Performance

Installing PostgreSQL

Installing PostgreSQL using Advanced Package Tool

Client installation

Server installation

Basic server configuration

Installing PostgreSQL on Windows

The PostgreSQL clients

The psql client

psql advanced settings

PostgreSQL utility tools

Summary

PostgreSQL Basic Building Blocks

Database coding

Database naming conventions

PostgreSQL identifiers

Documentation

Version control system

Database migration tool

PostgreSQL objects hierarchy

Template databases

User databases

Roles

Tablespace

Template procedural languages

Settings

PostgreSQL high-level object interaction

PostgreSQL database components

Schema

Schema usages

Table

PostgreSQL native data types

Numeric types

Character types

Date and time types

The car web portal database

Summary

PostgreSQL Advanced Building Blocks

Views

View synopsis

View categories

Materialized views

Updatable views

Indexes

Index synopsis 

Index selectivity

Index types

Index categories

Best practices on indexes

Functions

PostgreSQL native programming language

Creating a function in the C language

Function usages

Function dependency

PostgreSQL function categories

PostgreSQL anonymous functions

User-defined data types

Triggers and rule systems

Rule system

Trigger system

Triggers with arguments

 Triggers and updatable views

Summary

SQL Language

SQL fundamentals

SQL lexical structure

Querying data with SELECT statements

Structure of a SELECT query

Select-list

SQL expressions

DISTINCT

FROM clause

Selecting from multiple tables

Self-joins

The WHERE clause

Comparison operators

Pattern matching

Row and array comparison constructs

Grouping and aggregation

The GROUP BY clause

The HAVING clause

Ordering and limiting the results

Subqueries

Set operations – UNION, EXCEPT, and INTERSECT

Dealing with NULLs

Changing the data in the database

INSERT statement

The UPDATE statement

UPDATE using sub-select

UPDATE using additional tables

The DELETE statement

The TRUNCATE statement

Summary

Advanced Query Writing

Common table expressions

Reusing SQL code with CTE

Recursive and hierarchical queries

Changing data in multiple tables at a time

Window functions

Window definition

The WINDOW clause

Using window functions

Window functions with grouping and aggregation

Advanced SQL techniques

Selecting the first records

Selecting a data sample

Set returning functions

Lateral subqueries

Advanced grouping

Advanced aggregation

Summary

Server-Side Programming with PL/pgSQL

SQL language and PL/pgSQL – a comparison

PostgreSQL function parameters

Function authorization-related parameters

Function planner-related parameters

Function configuration-related parameters

The PostgreSQL PL/pgSQL control statements

Declaration statements

Assignment statements

Conditional statements

Iteration

Loop statement

While loop statement

For loop statement

Returning from the function

Returning void

Returning a single row

Returning multiple rows

Function predefined variables

Exception handling

Dynamic SQL

Executing DDL statements in dynamic SQL

Executing DML statements in dynamic SQL

Dynamic SQL and the caching effect

Recommended practices for dynamic SQL usage

Summary

OLAP and Data Warehousing

Online analytical processing

Extract, transform, and load

Data modeling for OLAP

Aggregation

Partitioning

Parallel query

Index-only scans

Summary

Beyond Conventional Data Types

Arrays

Common functions of arrays and their operators

Modifying and accessing arrays

Indexing arrays

The hash store data structure

Indexing an hstore

The JSON data structure

JSON and XML

JSON data types for PostgreSQL

Modifying and accessing JSON types

Indexing a JSON data type

Providing a RESTful API interface for PostgreSQL

A PostgreSQL full text search

The tsquery and tsvector data types

The tsvector data type

The tsquery data type

Pattern matching

Full text search indexing

Summary

Transactions and Concurrency Control

Transactions

Transaction and ACID properties

Transaction and concurrency

MVCC in PostgreSQL

Transaction isolation levels

Explicit locking

Table-level locks 

Table locking modes

Row-level locks

Row-level lock modes

Deadlocks

Advisory locks

Summary

PostgreSQL Security

Authentication in PostgreSQL

PostgreSQL pg_hba.conf

Listen addresses

Authentication best practices

PostgreSQL default access privileges

Role system and proxy authentication

PostgreSQL security levels

Database security level

Schema security level

Table-level security

Column-level security

Row-level security 

Encrypting data

PostgreSQL role password encryption

pgcrypto

One-way encryption

Two-way encryption

Summary

The PostgreSQL Catalog

The system catalog

System catalog for administrators

Getting the database cluster and client tools version

Terminating and canceling user sessions

Defining and getting database cluster settings

Getting the database and database object size

Cleaning up the database

Cleaning up the database data

Tuning for performance 

Selective dump

Summary 

Optimizing Database Performance

PostgreSQL configuration tuning

Maximum number of connections

Memory settings

Hard disk settings

Planner-related settings

Benchmarking is your friend

Tuning performance for write

Tuning performance for read 

Explain command and execution plan

Detecting problems in query plans

Common mistakes in writing queries

Unnecessary operations

Misplaced or missing indexes

Using CTE when not mandatory

PL/pgSQL procedural language considerations

Cross column correlation

Table partitioning

Constraint exclusion limitations

Query rewriting

Summary

Testing

Unit testing

Specificity of unit testing in databases

Unit test frameworks

Schema difference

Database abstraction interfaces

Data difference

Performance testing

Summary

Using PostgreSQL in Python Applications

Python DB API 2.0

Low-level database access with psycopg2

Connecting to a database

Connection pooling

Executing SQL commands

Reading data from a database

The COPY command

Asynchronous access

Alternative drivers for PostgreSQL

pg8000

asyncpg

Object relational mapping with SQLAlchemy

Main components of SQLAlchemy

Connecting to a database and retrieving data with SQL Expression Language

ORM

Summary

Scalability

The problem of scalability and the CAP theorem

Data replication in PostgreSQL

Transaction log

Physical replication

Log shipping replication

Streaming replication

Synchronous replication

Logical replication

Using replication to scale PostgreSQL

Scaling for heavy querying

Data sharding

Scaling for a large number of connections

Summary

Preface

Picking the right database management system is a difficult task due to the vast number of options on the market. Depending on the business model, one can pick a commercial database or an open source database with commercial support. In addition to this, there are several technical and non-technical factors to assess. When it comes to a relational database management system, PostgreSQL stands at the top for several reasons. The PostgreSQL slogan, "The world's most advanced open source database", shows the sophistication of its features and community confidence.

PostgreSQL is an open source object relational database management system. It emphasizes extensibility and competes with major relational database vendors such as Oracle, SQL Server, and MySQL. Due to its rich extensions and open source license, it is often used for research purposes, but PostgreSQL code is also the base for many open source and commercial database management systems such as Greenplum and Vertica. Furthermore, start-ups often favor PostgreSQL due to its licensing costs and because there are a lot of companies that provide commercial support.

PostgreSQL runs on most modern operating systems, including Windows, Mac, and Linux flavors. Also, there are several tools to access, manage, and monitor PostgreSQL clusters, such as pgAdmin, OmniDB, and psql. PostgreSQL installation and configuration is moderately easy, as it is supported by most packaging tools, such as yum and apt. Database developers can easily learn and use PostgreSQL because it complies with ANSI SQL standards. Other than this, there are a lot of resources to help developers learn PostgreSQL; it has a very good documentation manual and a very active and organized community.

PostgreSQL can be used for both OLTP and OLAP applications. As it is ACID compliant, it can be used out of the box for OLTP applications. For OLAP applications, PostgreSQL supports window functions, FDWs, and table inheritance; there are many external extensions for this purpose as well.

Even though PostgreSQL is ACID compliant, it has very good performance, as it utilizes state-of-the-art algorithms and techniques. For example, PostgreSQL utilizes the MVCC architecture to allow concurrent access to data. In addition to that, PostgreSQL supports both pessimistic and optimistic concurrency control, and the locking behavior can be altered based on the use case. Also, PostgreSQL provides a very good query analyzer and advanced features, such as data partitioning using table inheritance and constraint exclusion, to speed up the handling of very large amounts of data. PostgreSQL supports several types of indexes, such as B-tree, GIN, GiST, and BRIN. Also, parallel query execution has been supported since PostgreSQL 9.6. Finally, one can use replication to balance the load across different database clusters.
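As a taste of the syntax, the different index types mentioned above are all created with CREATE INDEX in PostgreSQL 10; the table and column names below are illustrative only, not taken from the book's examples:

```sql
-- B-tree: the default, good for equality and range predicates
CREATE INDEX idx_account_email ON account (email);

-- GIN: suited to composite values such as arrays and jsonb
CREATE INDEX idx_car_features ON car USING GIN (features);

-- GiST: geometric and full-text search use cases
CREATE INDEX idx_city_location ON city USING GIST (location);

-- BRIN: compact block-range index for very large, naturally ordered tables
CREATE INDEX idx_log_created ON web_log USING BRIN (created_at);
```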

PostgreSQL is scalable thanks to the many replication solutions in the market, such as Slony and pgpool-II. Additionally, PostgreSQL supports out-of-the-box synchronous and asynchronous streaming replication, as well as logical replication. This makes PostgreSQL very attractive because it can be used to set up highly available and performant systems. 

What this book covers

Chapter 01, Relational Databases, introduces relational database system concepts, including relational database properties, relational algebra, and database modeling. Also, it describes different database management systems, such as graph, document, key-value, and columnar databases.

Chapter 02, PostgreSQL in Action, provides first-hand experience of installing the PostgreSQL server and client tools on different platforms. This chapter also introduces PostgreSQL capabilities, such as out-of-the-box replication support and its very rich data types.

Chapter 03, PostgreSQL Basic Building Blocks, provides some coding best practices, such as coding conventions of PostgreSQL and identifier names. This chapter describes the basic building blocks and the interaction between these blocks, mainly template databases, user databases, tablespaces, roles, and settings. Also, it describes basic data types and tables.

Chapter 04, PostgreSQL Advanced Building Blocks, introduces several building blocks, including views, indexes, functions, user-defined data types, triggers, and rules. This chapter provides use cases of these building blocks and compares building blocks that can be used for the same case, such as rules and triggers.

Chapter 05, SQL Language, introduces the Structured Query Language (SQL), which is used to interact with a database: create and maintain data structures, and enter data into databases, as well as change it, retrieve it, and delete it. SQL has commands related to Data Definition Language (DDL), Data Manipulation Language (DML), and Data Control Language (DCL). Four SQL statements form the basis of DML and are described in this chapter. The SELECT statement is examined in detail to explain SQL concepts such as grouping and filtering, to show what SQL expressions and conditions are, and to demonstrate how to use subqueries. Some relational algebra topics are also covered in application to joining tables.
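A typical statement of the kind covered in this chapter combines filtering, grouping, and a subquery in one query; the car table and its columns here are a hypothetical illustration:

```sql
-- Categories with more than ten above-average-priced cars for sale
SELECT category, count(*) AS cars_for_sale
FROM car
WHERE price > (SELECT avg(price) FROM car)
GROUP BY category
HAVING count(*) > 10
ORDER BY cars_for_sale DESC;
```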

Chapter 06, Advanced Query Writing, describes advanced SQL concepts and features such as common table expressions and window functions. This helps you implement a logic that would not be possible without them, such as recursive queries. Other techniques explained here, such as the DISTINCT ON clause, the FILTER clause, and lateral subqueries, are not irreplaceable. However, they can help make a query smaller, easier, and faster.
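To illustrate two of these techniques (with hypothetical table and column names), DISTINCT ON and the FILTER clause can each shorten a query that would otherwise need a correlated subquery or CASE expressions:

```sql
-- Latest advertisement per car, without a correlated subquery
SELECT DISTINCT ON (car_id) car_id, advertisement_date, price
FROM advertisement
ORDER BY car_id, advertisement_date DESC;

-- Conditional aggregation without CASE expressions
SELECT seller_id,
       count(*) FILTER (WHERE sold)     AS sold_cars,
       count(*) FILTER (WHERE NOT sold) AS unsold_cars
FROM car
GROUP BY seller_id;
```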

Chapter 07, Server-Side Programming with PL/pgSQL, covers function parameters, such as the number of returned rows and function cost, which is mainly used by the query planner. Also, it presents control statements such as conditionals and iteration. Finally, it explains the concept of dynamic SQL and some recommended practices when using dynamic SQL.
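A minimal PL/pgSQL function of the kind developed in this chapter, showing a declaration, a conditional, and a return value; the function and its price thresholds are a made-up example:

```sql
CREATE OR REPLACE FUNCTION classify_price(price numeric)
RETURNS text AS $$
DECLARE
    result text;
BEGIN
    IF price IS NULL THEN
        RETURN 'unknown';
    ELSIF price < 10000 THEN
        result := 'budget';
    ELSE
        result := 'premium';
    END IF;
    RETURN result;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
```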

Chapter 08, OLAP and Data Warehousing, introduces several concepts regarding the usage of relational databases in the realm of analytical processing. It discusses the difference between OLTP and OLAP loads; furthermore, it discusses the modeling aspects of OLAP applications. In addition to that, it discusses some technical methods of performing ETL (extract, transform, and load) operations, such as the COPY command. Also, it discusses some features of PostgreSQL that increase data retrieval performance, such as index-only scans and table partitioning.

Chapter 09, Beyond Conventional Data Types, discusses several rich data types, including arrays, hash stores, JSON documents, and full-text search. It presents operations and functions for each data type to initialize, insert, update, access, and delete these data types. Finally, it shows how PostgreSQL can be combined with Nginx to serve read-only RESTful requests.
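For instance, jsonb documents can be stored, queried with dedicated operators, and indexed with GIN; the table and document keys below are illustrative only:

```sql
CREATE TABLE car_document (
    id  serial PRIMARY KEY,
    doc jsonb NOT NULL
);

INSERT INTO car_document (doc)
VALUES ('{"make": "Volvo", "extras": ["sunroof", "gps"]}');

-- ->> extracts a field as text; @> tests containment
SELECT doc ->> 'make' FROM car_document WHERE doc @> '{"make": "Volvo"}';

-- A GIN index accelerates containment queries
CREATE INDEX idx_car_document_doc ON car_document USING GIN (doc);
```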

Chapter 10, Transactions and Concurrency Control, discusses in detail the ACID properties and the relation between these properties and concurrency control. This chapter also discusses concepts such as isolation levels and their side effects, and it shows these side effects using SQL examples. Finally, the chapter discusses different locking methods, including pessimistic locking strategies such as row locking and advisory locks.
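As a preview of the syntax used throughout the chapter, an isolation level is chosen per transaction, and a pessimistic row-level lock is taken explicitly; the account table here is hypothetical:

```sql
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- FOR UPDATE blocks concurrent writers to this row until COMMIT
SELECT balance FROM account WHERE id = 42 FOR UPDATE;

UPDATE account SET balance = balance - 100 WHERE id = 42;
COMMIT;
```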

Chapter 11, PostgreSQL Security, covers concepts of authentication and authorization. It describes PostgreSQL authentication methods and explains the structure of a PostgreSQL host-based authentication configuration file. It also discusses the permissions that can be granted to database building objects such as schemas, tables, views, indexes, and columns. Finally, it shows how sensitive data, such as passwords, can be protected using different techniques, including one-way and two-way encryption.

Chapter 12, The PostgreSQL Catalog, provides several recipes to maintain a database cluster, including cleaning up data, maintaining user processes, cleaning up indexes and unused database objects, discovering and adding indexes to foreign keys, and so on.

Chapter 13, Optimizing Database Performance, discusses several approaches to optimize performance. It presents PostgreSQL cluster configuration settings, which are used in tuning the whole cluster's performance. Also, it presents common mistakes in writing queries and discusses several approaches to increase performance, such as using indexes or table partitioning and constraint exclusion.

Chapter 14, Testing, covers some aspects of the software testing process and how it can be applied to databases. Unit tests for databases can be written as SQL scripts or stored functions in a database. There are several frameworks that help us write unit tests and process the results of testing.

Chapter 15, Using PostgreSQL in Python Applications, discusses several advanced concepts, such as connection pooling, asynchronous access, and object relational mappers (ORMs). The chapter shows by example how to connect to a database, query it, and perform updates using Python. Finally, it introduces different technologies that interact with PostgreSQL, giving the developer a broad overview of state-of-the-art technologies.

Chapter 16, Scalability, discusses the problem of scalability and the CAP theorem in detail. Also, it covers data replication in PostgreSQL, including physical replication and logical replication. Finally, it shows different scaling scenarios and their implementation in PostgreSQL. 

What you need for this book

In general, PostgreSQL server and client tools do not need exceptional hardware. PostgreSQL can be installed on almost all modern platforms, including Linux, Windows, and Mac. Also, in the book, when a certain library is needed, the installation instructions are given.

You need PostgreSQL version 10; however, most of the examples can be executed on earlier versions as well. In order to execute the sample code, scripts, and examples provided in the book, you need to have at least a PostgreSQL client tool installed on your machine—preferably psql—and access to a remote server running the PostgreSQL server.

In a Windows environment, the cmd.exe command prompt is not very convenient; thus, the user might consider using Cygwin (http://www.cygwin.com/) or another alternative, such as PowerShell.

Some chapters might require additional software. For example, in Chapter 15, Using PostgreSQL in Python Applications, you need to install Python and the required Python libraries to interact with PostgreSQL. In Chapter 16, Scalability, Docker is used to give the reader a better user experience.

For other chapters, such as Chapter 09, Beyond Conventional Data Types, and Chapter 11, PostgreSQL Security, it is better to use Linux, because some of the software used, such as Nginx and GnuPG, is not very convenient on Windows. To run a Linux distribution on your Windows machine, you can use VirtualBox (https://www.virtualbox.org/).

Who this book is for

If you're interested in learning more about PostgreSQL, one of the most popular relational databases in the world, then this book is for you. Those looking to build solid database or data warehousing applications with PostgreSQL 10 will also find this book a useful resource. No prior knowledge of database programming or administration is required to get started with this book.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or disliked. Reader feedback is important for us, as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:

1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer over the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-PostgreSQL-10-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/LearningPostgreSQL10SecondEdition_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to the list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Relational Databases

This chapter and the following chapters provide a high-level overview of topics related to database development. They cover the theoretical aspects of relational databases; the first two chapters summarize theoretical topics that developers encounter on a daily basis. Understanding these theoretical concepts enables developers not only to come up with clean designs, but also to master relational databases.

This chapter is not restricted to learning PostgreSQL, but covers all relational databases. The topics covered in this chapter include the following:

Database management systems

: Understanding the different database categories enables the developer to utilize the best of each world

Relational algebra

: Understanding relational algebra enables the developers to master the SQL language, especially SQL code rewriting

Data modeling

: Using data modeling techniques leads to better communication

Database management systems

Different database management systems support diverse application scenarios, use cases, and requirements. Database management systems have a long history. First, we will quickly take a look at the recent history, and then explore the market-dominant database management system categories.

A brief history

Broadly, the term database can be used to refer to a collection of things. Moreover, this term brings to mind many other terms including data, information, data structure, and management. A database can be defined as a collection or repository of data, which has a certain structure, managed by a database management system (DBMS). Data can be structured as tabular data, semi-structured as XML documents, or unstructured data that does not fit a predefined data model.

In the early days, databases were mainly aimed at supporting business applications; this led us to the well-defined relational algebra and relational database systems. With the introduction of object-oriented languages, new paradigms of database management systems appeared such as object-relational databases and object-oriented databases. Also, many businesses as well as scientific applications use arrays, images, and spatial data; thus, new models such as raster, map, and array algebra are supported. Graph databases are used to support graph queries such as the shortest path from one node to another, along with supporting traversal queries easily.

With the advent of web applications such as social portals, it is now necessary to support a huge number of requests in a distributed manner. This has led to another new paradigm of databases called NoSQL (Not Only SQL), which favors requirements such as performance and horizontal scaling capabilities over strict consistency. In general, the timeline of database evolution has been greatly affected by many factors, such as the following:

Functional requirements

: The nature of the applications using a DBMS has led to the development of extensions on top of relational databases such as PostGIS (for spatial data) or even dedicated DBMSs such as SciDB (for scientific data analytics).

Nonfunctional requirements

: The success of object-oriented programming languages has created new trends such as object-oriented databases. Object-relational database management systems have appeared to bridge the gap between relational databases and object-oriented programming languages. Data explosion and the necessity to handle terabytes of data on commodity hardware have led to columnar databases, which can easily scale up horizontally.

Database categories

Many database models have appeared and vanished, such as the network model and the hierarchical model. The predominant categories now in the market are relational databases, object-relational databases, and NoSQL databases. One should not think of NoSQL and SQL databases as rivals; they are complementary to each other. By utilizing different database systems, one can overcome many limitations and get the best of different technologies.

The NoSQL databases

The NoSQL databases are affected by the CAP theorem, also known as Brewer's theorem. In 2002, S. Gilbert and N. Lynch published a formal proof of the CAP theorem in their article, Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. In 2009, the NoSQL movement began. Currently, there are over 150 NoSQL databases (nosql-database.org).

The CAP theorem

The CAP theorem states that it is impossible for a distributed computing system to simultaneously provide all three of the following guarantees:

Consistency

: All clients immediately see the latest data, even in the case of updates.

Availability

: All clients can find a replica of some data even in the case of a node failure. This means that even if some part of the system goes down, the clients can still access the data.

Partition tolerance

: The system continues to work regardless of arbitrary message loss or failure of part of the system.

The choice of which feature to discard determines the nature of the system. For example, one could sacrifice consistency to get a scalable, simple, and high-performance database management system. Often, the main difference between a relational database and a NoSQL database is consistency. A relational database enforces the ACID (atomicity, consistency, isolation, durability) properties. In contrast, many NoSQL databases adopt the basically available, soft state, eventually consistent (BASE) model.

NoSQL motivation

A NoSQL database provides a means for data storage, manipulation, and retrieval for non-relational data. The NoSQL databases are distributed, open source, and horizontally scalable. NoSQL often adopts the BASE model, which prizes availability over consistency, and informally guarantees that if no new updates are made to a data item, eventually all accesses to that data item will return its latest version. The advantages of this approach include the following:

Simplicity of design

Horizontal scaling and easy replication

Schema free

Huge amount of data support

We will now explore a few types of NoSQL databases.

Key-value databases

The key-value store is the simplest database store. In this database model, the storage is, as its name suggests, based on maps or hash tables. Some key-value databases allow complex values, such as lists and hash tables, to be stored. Key-value stores are extremely fast for certain scenarios, but lack support for complex queries and aggregation. Some of the existing open source key-value databases are Riak, Redis, Membase, and MemcacheDB.

Columnar databases

Columnar or column-oriented databases are based on columns. Data in a certain column in a two-dimensional relation is stored together.

Unlike relational databases, adding columns is inexpensive and is done on a row-by-row basis. Rows can have a different set of columns. Tables can benefit from this structure by eliminating the storage cost of the null values. This model is best suited for distributed databases.

HBase is one of the most famous columnar databases. It is based on the Google Bigtable storage system. Column-oriented databases are designed for huge data scenarios, so they scale up easily. For small datasets, HBase is not a suitable choice: the recommended hardware topology for HBase is a deployment of at least five nodes (servers), and it needs a lot of administration and is difficult to learn and master.

Document databases

A document-oriented database is suitable for documents and semi-structured data. The central concept of a document-oriented database is the notion of a document. Documents encapsulate and encode data (or information) in some standard formats or encodings such as XML, JSON, and BSON. Documents do not adhere to a standard schema or have the same structure, so they provide a high degree of flexibility. Unlike relational databases, changing the structure of the document is simple and does not lock the clients from accessing the data.

Document databases merge the power of relational databases and column-oriented databases. They provide support for ad hoc queries and can be scaled up easily. Different document databases emphasize different goals: MongoDB is designed to handle huge amounts of data efficiently, while CouchDB provides high availability even in the case of hardware failure.

Graph databases

Graph databases are based on graph theory, where a database consists of nodes and edges. Both the nodes and the edges can be assigned data. Graph databases allow traversing between the nodes using edges. As a graph is a generic data structure, graph databases are capable of representing many kinds of data. A famous implementation of an open source, commercially supported graph database is Neo4j.

Relational and object relational databases

Relational database management systems are among the most widely used DBMSs in the world. It is highly unlikely that any organization, institution, or personal computer today does not have or use a piece of software that relies on an RDBMS. Software applications can use relational databases via dedicated database servers or via lightweight RDBMS engines embedded in the applications as shared libraries. The capabilities of a relational database management system vary from one vendor to another, but most of them adhere to the ANSI SQL standards. A relational database is formally described by relational algebra and is based on the relational model. Object-relational databases (ORDs) are similar to relational databases. They support the following object-oriented model concepts:

User-defined and complex data types

Inheritance

ACID properties

In a relational database, a single logical operation is called a transaction. The technical translation of a transaction is a set of database operations: create, read, update, and delete (CRUD). An example of a transaction is assigning a budget to several projects in a company, assuming we have a fixed amount of money: if we increase the budget of one project, we need to deduct the amount of the increase from another project. The ACID properties in this context can be described as follows:

Atomicity

: All or nothing, which means that if a part of a transaction fails, then the transaction fails as a whole.

Consistency

: Any transaction takes the database from one valid state to another valid state. Database consistency is normally governed by data constraints and the relations between data, or any combination thereof. For example, imagine a customer who wants to completely purge their account on a shopping service. In order to purge the account, the account details, such as the list of addresses, also need to be purged. This is governed by foreign key constraints, which will be explained in detail in the next chapter.

Isolation

: Concurrent execution of transactions results in a system state that would be obtained if the transactions were executed serially.

Durability

: The transactions that are committed, that is, executed successfully, are persistent even in the event of power loss or server crashes. In PostgreSQL, this is normally achieved by a technique called write-ahead logging (WAL). Other databases refer to this as a transaction log, as in Oracle.
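The all-or-nothing behavior of atomicity can be shown with a short, runnable sketch. The example below uses Python's built-in sqlite3 module purely for illustration (the project names, amounts, and CHECK constraint are made up for this sketch); the same transactional pattern applies to PostgreSQL through any client driver:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE project (name TEXT PRIMARY KEY, "
             "budget INTEGER CHECK (budget >= 0))")
conn.execute("INSERT INTO project VALUES ('alpha', 1000), ('beta', 500)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move budget between two projects: all or nothing."""
    try:
        with conn:  # transaction: commits on success, rolls back on error
            conn.execute("UPDATE project SET budget = budget - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE project SET budget = budget + ? WHERE name = ?",
                         (amount, dst))
    except sqlite3.IntegrityError:
        pass  # the CHECK constraint fired; the whole transaction was rolled back

transfer(conn, "alpha", "beta", 2000)  # would drive alpha below zero
budgets = dict(conn.execute("SELECT name, budget FROM project"))
print(budgets)  # {'alpha': 1000, 'beta': 500} -- both rows unchanged
```

Because the failing UPDATE violates the CHECK constraint, the whole transfer is rolled back and neither budget changes; a valid transfer commits both updates together.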

The SQL language

Relational databases are often linked to the Structured Query Language (SQL). SQL is a declarative programming language and is the standard relational database language. The American National Standards Institute (ANSI) and the International Organization for Standardization (ISO) published the SQL standard for the first time in 1986, followed by many revisions such as SQL:1999, SQL:2003, SQL:2006, SQL:2008, SQL:2011, and SQL:2016. The SQL language has several parts:

Data definition language

(

DDL

): It defines and amends the relational structure

Data manipulation language

(

DML

): It retrieves and extracts information from the relations

Data control language

(

DCL

): It controls the access rights to relations
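As a small, runnable illustration of these sub-languages, the following sketch uses Python's built-in sqlite3 module (the table and column names are made up). DCL statements such as GRANT appear only as comments, since SQLite has no access control, whereas PostgreSQL does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define and amend the relational structure
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, first_name TEXT)")
conn.execute("ALTER TABLE customer ADD COLUMN last_name TEXT")

# DML: insert and retrieve information from the relations
conn.execute("INSERT INTO customer (first_name, last_name) VALUES ('thomas', 'sieh')")
rows = conn.execute("SELECT first_name, last_name FROM customer").fetchall()
print(rows)  # [('thomas', 'sieh')]

# DCL (PostgreSQL syntax; SQLite itself has no access control):
#   GRANT SELECT ON customer TO report_user;
#   REVOKE SELECT ON customer FROM report_user;
```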

Relational model concepts

The relational model is based on first-order predicate logic and was first introduced by Edgar F. Codd in 1970 in his paper, A Relational Model of Data for Large Shared Data Banks. A database is represented as a collection of relations. The state of the whole database is defined by the state of all the relations in the database. Different information can be extracted from the relations by joining and aggregating data from different relations and by applying filters on the data. In this section, the basic concepts of the relational model are introduced using a top-down approach, first describing the relation, tuple, attribute, and domain.

The terms relation, tuple, attribute, and unknown, which are used in the formal relational model, are equivalent to table, row, column, and null in the SQL language.

Relation

Think of a relation as a table with a header, columns, and rows. The table name and the header help in interpreting the data in the rows. Each row represents a group of related data, which points to a certain object.

A relation is represented by a set of tuples. Tuples should have the same set of ordered attributes. Attributes have a domain, that is, a type and a name:

             customer_id   first_name    last_name    email
Tuple →      1             thomas        sieh         [email protected]
Tuple →      2             wang          kim          [email protected]
             Attribute ↑   Attribute ↑   Attribute ↑  Attribute ↑

The relation schema is denoted by the relation name and the relation attributes. For example, customer (customer_id, first_name, last_name, email) is the relation schema for the customer relation. Relation state is defined by the set of relation tuples; thus, adding, deleting, and amending a tuple will change the relation to another state.

Tuple order or position in the relation is not important, and the relation is not sensitive to tuple order. The tuples in the relation could be ordered by a single attribute or a set of attributes. Also, a relation cannot have duplicate tuples.

A relation can represent entities in the real world, such as a customer, or can be used to represent an association between relations. For example, the customer could have several services and a service can be offered to several customers. This could be modeled by three relations: customer, service, and customer_service. The customer_service relation associates the customer and the service relations. Separating the data in different relations is a key concept in relational database modeling, which is called normalization. Normalization is the process of organizing relation columns and relations to reduce data redundancy. For example, assume that a collection of services is stored in the customer relation. If a service is assigned to multiple customers, this would result in data redundancy. Also, updating a certain service would require updating all its copies in the customer table.
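The three-relation design described above can be sketched as runnable code. This illustration uses Python's built-in sqlite3 module and made-up service names; the point is that each service is stored exactly once, and the customer_service association relation resolves the many-to-many relationship:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, first_name TEXT,
                       last_name TEXT, email TEXT);
CREATE TABLE service  (service_id  INTEGER PRIMARY KEY, name TEXT);
-- the association relation resolves the many-to-many relationship
CREATE TABLE customer_service (
    customer_id INTEGER REFERENCES customer,
    service_id  INTEGER REFERENCES service,
    PRIMARY KEY (customer_id, service_id)
);
INSERT INTO customer VALUES (1, 'thomas', 'sieh', NULL), (2, 'wang', 'kim', NULL);
INSERT INTO service  VALUES (1, 'backup'), (2, 'hosting');
INSERT INTO customer_service VALUES (1, 1), (1, 2), (2, 1);
""")
# Each service is stored once; joining reconstructs who uses what.
rows = conn.execute("""
    SELECT c.first_name, s.name
    FROM customer c
    JOIN customer_service cs ON cs.customer_id = c.customer_id
    JOIN service s ON s.service_id = cs.service_id
    ORDER BY c.customer_id, s.service_id
""").fetchall()
print(rows)  # [('thomas', 'backup'), ('thomas', 'hosting'), ('wang', 'backup')]
```

Updating the name of a service touches exactly one tuple in the service relation, which is precisely the redundancy that normalization removes.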

Attribute

Each attribute has a name and a domain, and the name should be distinct within the relation. The domain defines the possible set of values that the attribute can have. One way to define the domain is to define the data type and a constraint on this data type. For example, hourly wage should be a positive real number and bigger than five if we assume that the minimum hourly wage is five dollars. The domain could be continuous, such as salary, which is any positive real number, or discrete, such as gender.

The formal relational model puts a constraint on the domain: the value should be atomic. Atomic means that each value in the domain is indivisible. For instance, the name attribute domain is not atomic because it can be divided into first name and last name. Some examples of domains are as follows:

Phone number

: Numeric text with a certain length.

Country code

: Defined by ISO 3166 as a list of two letter codes (ISO alpha-2) and three letter codes (ISO alpha-3). The country codes for Germany are DE and DEU for alpha-2 and alpha-3 respectively.

In real-life applications, it is better to use ISO and other international standards for lookup tables such as country and currency. This enables you to expose your data to third-party software much more easily and increases your data quality.

Constraint

The relational model defines many constraints in order to control data integrity, redundancy, and validity:

Redundancy

: Duplicate tuples are not allowed in the relation.

Validity

: Domain constraints control data validity.

Integrity

: The relations within a single database are linked to each other. An action on a relation such as updating or deleting a tuple might leave the other relations in an invalid state.

We could classify the constraints in a relational database roughly into two categories:

Inherited constraints from the relational model: Domain integrity, entity integrity, and referential integrity constraints.

Semantic constraints, business rules, and application-specific constraints: these constraints cannot be expressed explicitly by the relational model. However, with the introduction of procedural SQL languages such as PL/pgSQL for PostgreSQL, relational databases can also be used to model these constraints.

Domain integrity constraint

The domain integrity constraint ensures data validity. The first step in defining the domain integrity constraint is to determine the appropriate data type. The domain data types could be integer, real, boolean, character, text, inet, and so on. For example, the data type of first name and email address is text. After specifying the data type, check constraints, such as the mail address pattern, need to be defined.

Check constraint

: A check constraint can be applied to a single attribute or a combination of many attributes in a tuple. Let's assume that the customer_service schema is defined as customer_service (customer_id, service_id, start_date, end_date, order_date). For this relation, we can have a check constraint to make sure that start_date and end_date are entered correctly by applying the check start_date < end_date.

Default constraint

: The attribute can have a default value. The default value could be a fixed value, such as the default hourly wage of the employees, for example, $10. It may also have a dynamic value based on a function, such as random, current time, and date. For example, in the customer_service relation, order_date can have a default value, which is the current date.

Unique constraint

: A unique constraint guarantees that the attribute has a distinct value in each tuple. It allows null values. For example, let's assume that we have a relation player defined as player (player_id, player_nickname). The player uses his ID to play with others; he can also pick a nickname, which is also unique, to identify himself.

Not null constraint

: By default, the attribute value can be null. The not null constraint restricts an attribute from having a null value. For example, each person in the birth registry record should have a name.
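The four domain integrity constraints above can be sketched in a runnable form. The example uses Python's built-in sqlite3 module for illustration; the customer_service and player definitions follow the text, while the player name column is an extra column added here only to illustrate NOT NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_service (
    customer_id INTEGER,
    service_id  INTEGER,
    start_date  TEXT,
    end_date    TEXT,
    order_date  TEXT DEFAULT CURRENT_DATE,   -- default constraint (current date)
    CHECK (start_date < end_date)            -- check constraint
);
CREATE TABLE player (
    player_id       INTEGER PRIMARY KEY,
    player_nickname TEXT UNIQUE,             -- unique constraint; null is allowed
    name            TEXT NOT NULL            -- not null constraint (illustrative extra column)
);
""")
conn.execute("INSERT INTO customer_service (customer_id, service_id, start_date, end_date) "
             "VALUES (1, 1, '2014-01-01', '2014-12-31')")
try:
    # start_date is not before end_date, so the check constraint rejects the row
    conn.execute("INSERT INTO customer_service (customer_id, service_id, start_date, end_date) "
                 "VALUES (1, 2, '2015-01-01', '2014-01-01')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```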

Entity integrity constraint

In the relational model, a relation is defined as a set of tuples. This means that all the tuples in a relation must be distinct. The entity integrity constraint is enforced by having a primary key which is an attribute/set of attributes having the following characteristics:

The attributes should be unique

The attributes should not be null

Each relation must have only one primary key, but can have many unique keys. A candidate key is a minimal set of attributes that can identify a tuple. All unique, not null attributes can be candidate keys. The set of all attributes forms a super key. In practice, we often pick a single attribute to be a primary key instead of a compound key (a key that consists of two or more attributes that uniquely identify a tuple) to ease joining relations with each other.

If the primary key is generated by the DBMS, it is called a surrogate key or synthetic key; otherwise, it is called a natural key. The surrogate key candidates are sequences and universally unique identifiers (UUIDs). A surrogate key has many advantages, such as performance, tolerance of requirement changes, agility, and compatibility with object-relational mappers. Its chief disadvantage is that it makes redundant tuples possible.
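A minimal sketch of the two surrogate key candidates mentioned above, using Python's built-in sqlite3 and uuid modules (the account table and email value are made up); in PostgreSQL, the auto-generated integer would typically come from a sequence via a serial or identity column:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, email TEXT)")
cur = conn.execute("INSERT INTO account (email) VALUES ('someone@example.com')")
print(cur.lastrowid)  # the DBMS-generated surrogate key: 1

# A UUID surrogate key can be generated without a round trip to the database
print(uuid.uuid4())
```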

Referential integrity constraints

Relations are associated with each other via common attributes. Referential integrity constraints govern the association between two relations and ensure data consistency between tuples. If a tuple in one relation references a tuple in another relation, then the referenced tuple must exist. In the customer service example, if a service is assigned to a customer, then both the service and the customer must exist. For instance, in the customer_service relation, we cannot have a tuple with the values (5, 1, 01-01-2014, NULL), because we do not have a customer with customer_id equal to 5.

The lack of referential integrity constraints can lead to many problems:

Invalid data in the common attributes

Invalid information during joining of data from different relations

Performance degradation, either due to bad execution plans generated by the PostgreSQL planner or by a third-party tool

Foreign keys can increase performance when reading data from multiple tables, because the query execution planner will have a better estimation of the number of rows that need to be processed. Disabling foreign keys when doing a bulk insert can lead to a performance boost.

Referential integrity constraints are achieved via foreign keys. A foreign key is an attribute or a set of attributes that can identify a tuple in the referenced relation. As the purpose of a foreign key is to identify a tuple in the referenced relation, foreign keys are generally primary keys in the referenced relation. Unlike a primary key, a foreign key can have a null value. It can also reference a unique attribute in the referenced relation. Allowing a foreign key to have a null value enables us to model different cardinality constraints. Cardinality constraints define the participation between two relations. For example, a parent can have more than one child; this is called a one-to-many relationship, because one tuple in the referenced relation is associated with many tuples in the referencing relation. A relation can also reference itself; such a foreign key is called a self-referencing or recursive foreign key.

For example, a company can be acquired by another company.

To ensure data integrity, foreign keys can be used to define several behaviors when a tuple in the referenced relation is updated or deleted. The following behaviors are called referential actions:

Cascade

: When a tuple is deleted or updated in the referenced relation, the tuples in the referencing relation are also updated or deleted

Restrict

: The tuple cannot be deleted or the referenced attribute cannot be updated if it is referenced by another relation

No action

: Similar to restrict, but it is deferred to the end of the transaction

Set default

: When a tuple in the referenced relation is deleted or the referenced attribute is updated, then the foreign key value is assigned the default value

Set null

: The foreign key attribute value is set to null when the referenced tuple is deleted
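The referencing behavior and the cascade action can be sketched as follows. This illustration uses Python's built-in sqlite3 module, where foreign key enforcement must be switched on explicitly (PostgreSQL enforces it by default); the table contents are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific; PostgreSQL always enforces FKs
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer_service (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer ON DELETE CASCADE,
    service     TEXT
);
INSERT INTO customer VALUES (1, 'thomas');
INSERT INTO customer_service (customer_id, service) VALUES (1, 'backup');
""")
# A tuple referencing a non-existent customer is rejected:
try:
    conn.execute("INSERT INTO customer_service (customer_id, service) VALUES (5, 'backup')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
# Deleting the referenced customer cascades to the referencing tuples:
conn.execute("DELETE FROM customer WHERE customer_id = 1")
print(conn.execute("SELECT COUNT(*) FROM customer_service").fetchone())  # (0,)
```

Swapping ON DELETE CASCADE for RESTRICT, SET NULL, or SET DEFAULT in the column definition selects one of the other referential actions listed above.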

Semantic constraints

Semantic integrity constraints, or business logic constraints, describe the database application constraints in general. These constraints are enforced either by the business logic tier of the application program or by SQL procedural languages. Trigger and rule systems can also be used for this purpose. For example, a customer should have at most one active service at a time. Based on the nature of the application, one could favor using an SQL procedural language or a high-level programming language to meet the semantic constraints, or mix the two approaches.

The advantages of using the SQL programming language are as follows:

Performance

: RDBMSs often have complex analyzers to generate efficient execution plans. Also, in some cases such as data mining, the amount of data that needs to be manipulated is very large. Manipulating the data using procedural SQL language eliminates the network data transfer. Finally, some procedural SQL languages utilize clever caching algorithms.

Last minute change

: For the SQL procedural languages, one could deploy bug fixes without service disruption.

Implementing business logic in the database tier has many pros and cons, and it is a highly debatable topic. For example, some disadvantages of implementing business logic in the database are reduced visibility, lower developer efficiency in writing code due to a lack of proper tools and IDEs, and limited code reuse.
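As a sketch of enforcing the "at most one active service" rule mentioned above with a trigger, the following example uses Python's built-in sqlite3 module; in PostgreSQL, the same rule would typically be a trigger calling a PL/pgSQL function. The schema here is a simplified, made-up version in which a NULL end_date marks an active service:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_service (
    customer_id INTEGER,
    service_id  INTEGER,
    end_date    TEXT        -- NULL means the service is still active
);
CREATE TRIGGER one_active_service
BEFORE INSERT ON customer_service
WHEN NEW.end_date IS NULL
 AND EXISTS (SELECT 1 FROM customer_service
             WHERE customer_id = NEW.customer_id AND end_date IS NULL)
BEGIN
    SELECT RAISE(ABORT, 'customer already has an active service');
END;
""")
conn.execute("INSERT INTO customer_service VALUES (1, 1, NULL)")      # first active service: fine
try:
    conn.execute("INSERT INTO customer_service VALUES (1, 2, NULL)")  # second active one: rejected
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

Once the first service is closed by setting its end_date, a new active service can be inserted for the same customer.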