PostgreSQL database servers have a common set of problems that they encounter as their usage gets heavier and requirements get more demanding. Peek into the future of your PostgreSQL 10 database's problems today. Know the warning signs to look for and how to avoid the most common issues before they even happen.
Surprisingly, most PostgreSQL database applications evolve in the same way: choose the right hardware, tune the operating system and server memory use, optimize queries against the database and CPUs with the right indexes, and monitor every layer, from hardware to queries, using tools from inside and outside PostgreSQL. Also, using monitoring insight, PostgreSQL database applications continuously rework the design and configuration. On reaching the limits of a single server, they break things up; connection pooling, caching, partitioning, replication, and parallel queries can all help handle increasing database workloads.
By the end of this book, you will have all the knowledge you need to design, run, and manage your PostgreSQL solution while ensuring high performance and high availability.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Namrata Patil
Content Development Editor: Amrita Noronha
Technical Editor: Jovita Alva
Copy Editors: Safis Editing, Vikrant Phadke
Project Coordinator: Shweta H Birwatkar
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Jisha Chirayil
Production Coordinator: Nilesh Mohite
First published: April 2018
Production reference: 1300418
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78847-448-1
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Ibrar Ahmed is a senior database architect at an enterprise PostgreSQL company. He started his development career in 1999. He has contributed to the PostgreSQL community, the company's PostgreSQL fork, and other open source communities such as Google Chrome. He also has experience in MySQL, Oracle, MongoDB, and Hadoop (Hive, HBase, Spark). He is a major contributor in integrating other databases with PostgreSQL. He also authored PostgreSQL Developer's Guide. He holds a master's degree in computer science (1999) and an MS in computer engineering (2015), and is finalizing his PhD.
Gregory Smith is a chief PostgreSQL evangelist in Charleston, South Carolina, for Crunchy Data Solutions. He's been providing database advice to clients in industries such as manufacturing, finance, and web development for 20 years. He has contributed feature additions to every PostgreSQL version since 8.3. He's also the creator of a growing set of add-on tools for the database, currently including PgTune, pgbench-tools, peg, and 2warm.
Enrico Pirozzi, EnterpriseDB certified on implementation management and tuning, with a master's in computer science, has been a PostgreSQL DBA since 2003. Based in Italy, he has been providing database advice to clients in industries such as manufacturing and web development for 10 years. He has been training others on PostgreSQL since 2008. Dedicated to open source technology since early in his career, he is a cofounder of the PostgreSQL Italian mailing list, PostgreSQL-it, and of the PostgreSQL Italian community site, PSQL.it.
Feng Tan is from China. He is a member of the China PostgreSQL User Group Association and a special expert in the PostgreSQL branch of China OSS (Open Source Software) Promotion Union. He is one of the translators of PostgreSQL 9 Administration Cookbook Chinese Edition, as well as a technical reviewer of PostgreSQL High Performance Cookbook English Edition.
He was a PostgreSQL DBA at SkyMobi for more than 6 years, maintaining over 100 PostgreSQL instances. He enjoys sharing PostgreSQL knowledge and has posted over 500 technical blog entries.
Srivathsava Rangarajan is a lead software engineer at a publicly traded financial company in downtown Chicago. He was introduced to PostgreSQL after an interesting start in the NoSQL world of CouchDB and Cassandra. Impressed with PostgreSQL's robustness, he is now its avid advocate for the right applications and co-organizes the Chicago PostgreSQL User Group to help in its adoption. He loves data and helps spread the understanding of data structures, domains, and data models to foster healthier interactions with databases.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
PostgreSQL 10 High Performance
Dedication
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the authors
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
PostgreSQL Versions
Performance of historical PostgreSQL releases
Choosing a version to deploy
Upgrading to a newer major version
Upgrades to PostgreSQL 8.3+ from earlier ones
Minor version upgrades
Migrating from PostgreSQL 9.x to 10.x – a new way to work
PostgreSQL or another database?
PostgreSQL 10.x and NoSQL
PostgreSQL as HUB
PostgreSQL tools
PostgreSQL contrib
Finding contrib modules on your system
Installing a contrib module from source
Using a contrib module
pgFoundry
PGXN
Additional PostgreSQL-related software
PostgreSQL application scaling life cycle
Performance tuning as a practice
Summary
Database Hardware
Balancing hardware spending
CPUs
Memory
Disks
RAID
Drive error handling
Hard drive reliability studies
Drive firmware and RAID
SSDs
Disk controllers
Hardware and software RAID
Recommended disk controllers
Attached storage – SAN and NAS
Reliable controller and disk setup
Write-back caches
Sources of write-back caching
Disk controller monitoring
Disabling drive write caches
Performance impact of write-through caching
Summary
Database Hardware Benchmarking
CPU and memory benchmarking
Memtest86+
STREAM memory testing
STREAM and Intel versus AMD
CPU benchmarking
Sources of slow memory and processors
Physical disk performance
Random access and input/outputs per second
Sequential access and ZCAV
Short stroking
Commit rate
PostgreSQL test_fsync
INSERT rate
Windows commit rate
Disk benchmarking tools
HD Tune
Short stroking tests
IOPS
Unpredictable performance and Windows
Disk throughput in case of sequential read and write
Bonnie++
Bonnie++ 2.0
Bonnie++ ZCAV
sysbench
pgbench
Seek rate
Removing test files
fsync commit rate
Complicated disk benchmarks
Sample disk results
Disk performance expectations
Sources of slow disk and array performance
Summary
Disk Setup
Maximum filesystem sizes
Filesystem crash recovery
Journaling filesystems
Linux filesystems
ext2
ext3
ext4
XFS
Benchmarks
Other Linux filesystems
Write barriers
Drive support for barriers
Filesystem support for barriers
General Linux filesystem tuning
Read-ahead
File access times
Read caching and swapping
Write cache sizing
I/O scheduler elevator
Solaris and FreeBSD filesystems
Solaris UFS
FreeBSD UFS2
ZFS
Windows filesystems
FAT32
NTFS
Adjusting mounting behavior
Disk layout for PostgreSQL
Symbolic links
Tablespaces
Database directory tree
Temporary files
Disk arrays, RAID, and disk layout
Disk layout guidelines
Summary
Memory for Database Caching
Memory units in postgresql.conf
Increasing Unix shared memory parameters for larger buffer sizes
Kernel semaphores
Estimating shared memory allocation
Inspecting the database cache
Installing pg_buffercache into a database
Database disk layout
Creating a new block in a database
Writing dirty blocks to disk
Crash recovery and the buffer cache
Checkpoint processing basics
Write-ahead log and recovery processing
Checkpoint timing
Checkpoint spikes
Spread checkpoints
Database block life cycle
Dirty block write paths
Database buffer cache versus operating system cache
Doubly cached data
Inspecting the OS cache
Checkpoint overhead
Starting size guidelines
Platform, version, and workload limitations
Analyzing buffer cache contents
Inspection of the buffer cache queries
Top relations in the cache
Summary by usage count
Buffer content summary with percentages
Buffer usage count distribution
Using buffer cache inspection for sizing feedback
Summary
Server Configuration Tuning
Interacting with the live configuration
Defaults and reset values
Allowed change context
Reloading the configuration file
Commented out settings
Server-wide settings
Database connections
listen_addresses
max_connections
Shared memory
shared_buffers
Free Space Map settings
Logging
log_line_prefix
log_statement
log_min_duration_statement
News on PostgreSQL 10
Vacuuming and statistics
autovacuum
Enabling autovacuum on older versions
maintenance_work_mem
default_statistics_target
Checkpoints
checkpoint_segments – max_wal_size
checkpoint_timeout
checkpoint_completion_target
WAL settings
wal_buffers
wal_sync_method
PITR and WAL replication
Per-client settings
effective_cache_size
synchronous_commit
work_mem
random_page_cost
constraint_exclusion
Tunables to avoid
fsync
full_page_writes
commit_delay and commit_siblings
max_prepared_transactions
Querying enable parameters
New server tuning
Dedicated server guidelines
Shared server guidelines
PgTune
Summary
Routine Maintenance
Transaction visibility with multiversion concurrency control
Visibility computation internals
Updates
Row lock conflicts
Serialization
Deletions
Advantages of MVCC
Disadvantages of MVCC
Transaction ID wraparound
Vacuum
Vacuum implementation
Regular vacuum
Returning free disk space
Full vacuum
HOT
Cost-based vacuuming
autovacuum
autovacuum logging
autovacuum monitoring
Autovacuum triggering
Per-table adjustments
Common vacuum and autovacuum problems
autovacuum is running even though it was turned off
autovacuum is constantly running
Out of memory errors
Not keeping up with a busy server
autovacuum is too disruptive
Long running transactions
FSM exhaustion
Recovering from major problems
Autoanalyze
Index bloat
Measuring index bloat
Detailed data and index page monitoring
Monitoring query logs
Basic PostgreSQL log setup
Log collection
log_line_prefix
Multiline queries
Using syslog for log messages
CSV logging
Logging difficult queries
auto_explain
Log file analysis
Normalized query fingerprints
pg_stat_statements
pgBadger
Summary
Database Benchmarking
pgbench default tests
Table definition
Scale detection
Query script definition
Configuring the database server for pgbench
Sample server configuration
Running pgbench manually
Graphing results with pgbench-tools
Configuring pgbench-tools
Customizing for 8.3
Sample pgbench test results
Select-only test
TPC-B-like test
Latency analysis
Sources of bad results and variation
Developer PostgreSQL builds
Worker threads and pgbench program limitations
pgbench custom tests
Insert speed test
Transaction Processing Performance Council benchmarks
Summary
Database Indexing
Indexing example walkthrough
Measuring query disk and index block statistics
Running the example
Sample data setup
Simple index lookups
Full table scans
Index creation
Lookup with an inefficient index
Combining indexes
Switching from indexed to sequential scans
Planning for plan changes
Clustering against an index
Explain with buffer counts
Index creation and maintenance
Unique indexes
Concurrent index creation
Clustering an index
Fill factor
Reindexing
Index types
B-tree
Text operator classes
Hash
GIN
B-tree GIN versus bitmap indexes
GiST
Advanced index use
Multicolumn indexes
Indexes for sorting
Partial indexes
Expression-based indexes
Indexing for full-text search
Indexing LIKE or ILIKE queries with pg_trgm contrib
Indexing JSON datatype
Summary
Query Optimization
Sample data sets
Pagila
dellstore2
EXPLAIN basics
Timing overhead
Hot and cold cache behavior
Clearing the cache
Query plan node structure
Basic cost computation
Estimated costs and real-world costs
Explain analysis tools
Visual explain
Verbose output
Machine-readable EXPLAIN output
Plan analysis tools
Assembling row sets
Tuple ID
Object ID
Sequential scan
Index Scan
Bitmap heap and index scans
Processing nodes
Sort
Limit
Offsets
Aggregate
HashAggregate
Unique
WindowAgg
Result
Append
Group
Subquery scan and subplan
Subquery conversion and IN lists
Set operations
Materialize
CTE scan
Joins
Nested loop
Nested loop with inner index scan
Merge Join
Nested Loop and Merge Join materialization
Hash joins
Hash semi and anti-joins
Join ordering
Forcing join order
Join removal
Genetic query optimizer
Statistics
Viewing and estimating with statistics
Statistics targets
Adjusting a column target
Distinct values
Difficult areas to estimate
Other query-planning parameters
effective_cache_size
work_mem
constraint_exclusion
cursor_tuple_fraction
Executing other statement types
Improving queries
Optimizing for fully cached data sets
Testing for query equivalence
Disabling optimizer features
Working around optimizer bugs
Avoiding plan restructuring with OFFSET
External trouble spots
Parallel queries
SQL limitations
Numbering rows in SQL
Using Window functions for numbering
Using Window functions for cumulatives
Summary
Database Activity and Statistics
Statistics views
Cumulative and live views
Table statistics
Table I/O
Index statistics
Index I/O
Database-wide totals
Connections and activity
Locks
Virtual transactions
Decoding lock information
Transaction lock waits
Table lock waits
Logging lock information
Deadlocks
Disk usage
Buffer, background writer, and checkpoint activity
Saving pg_stat_bgwriter snapshots
Tuning using background writer statistics
Summary
Monitoring and Trending
UNIX monitoring tools
Sample setup
vmstat
iostat
iotop for Linux
Examples of good performance
Overloaded system samples
top
Solaris top replacements
htop for Linux
sysstat and sar
Enabling sysstat and its optional features
Graphing with kSar
Windows monitoring tools
Task Manager
Sysinternals tools
Windows system monitor
Saving Windows system monitor data
Trending software
Types of monitoring and trending software
Storing historical trend data
Nagios
Nagios and PostgreSQL
Nagios and Windows
Cacti
Cacti and PostgreSQL
Cacti and Windows
Munin
Other trending packages
pgstatspack
Zenoss
Hyperic HQ
Reconnoiter
Staplr
SNMP tools
Summary
Pooling and Caching
Connection pooling
Pooling connection counts
pgpool-II
pgpool-II load balancing for replication scaling
pgBouncer
Application server pooling
Database caching
memcached
pgmemcache
Summary
Scaling with Replication
Hot Standby
Terminology
Setting up WAL shipping
Streaming replication
Tuning Hot Standby
Replication queue managers
Synchronous replication
Logical replication
Slony
Londiste
Read scaling with replication queue software
Special application requirements
Bucardo
pgpool-II
Other interesting replication projects
Replication solution comparison
Summary
Partitioning Data
Table range partitioning
Determining a key field to partition over
Sizing the partitions
List partitioning
Creating the partitions
Redirecting INSERT statements to the partitions
Dynamic trigger functions
Partition rules
Empty partition query plans
Date change update trigger
Live migration of a partitioned table
PostgreSQL 10 – declarative partitioning – the built-in partitioning
Range partitioning
Partition maintenance
Caveats
Partitioned queries
Creating new partitions
Scheduled creation
Dynamic creation
Partitioning advantages
Common partitioning mistakes
Horizontal partitioning with PL/Proxy
Hash generation
Scaling with PL/Proxy
Sharding
Scaling with GridSQL
Summary
Avoiding Common Problems
Bulk loading
Loading methods
External loading programs
Tuning for bulk loads
Skipping WAL acceleration
Recreating indexes and adding constraints
Parallel restore
Post-load cleanup
Backup
pg_dump
Continuous archiving and point in time recovery
Common performance issues
Counting rows
Unexplained writes
Slow function and prepared statement execution
PL/pgSQL benchmarking
High foreign key overhead
Trigger memory use
Transition tables for trigger
Heavy statistics collector overhead
Targeted statistics resets
Extended statistics
Materialized views
Foreign data wrapper
The amcheck module
pgAdmin
gprof
OProfile
Visual Studio
DTrace
DTrace on FreeBSD
Linux SystemTap emulation of DTrace
Performance-related features by version
Aggressive PostgreSQL version upgrades
Version 9.0
Replication
Queries and EXPLAIN
Database development
Configuration and monitoring
Tools
Internals
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
PostgreSQL has become an increasingly viable database platform to serve as storage for applications, from classic corporate database use to the latest web apps. But getting the best performance from it is not an easy subject to learn. You need just the right combination of rules of thumb to get started, solid monitoring and maintenance to keep your system running well, suggestions for troubleshooting, and add-on tools to add the features the core database doesn't try to handle on its own. This book is an update of the book PostgreSQL 9.6 High Performance.
This book is aimed at intermediate to advanced database administrators who are using, or planning to use, PostgreSQL. Portions will also interest systems administrators looking to build or monitor a PostgreSQL installation, as well as developers interested in advanced database internals that impact application design.
Chapter 1, PostgreSQL Versions, introduces how PostgreSQL performance has improved in the most recent versions of the database. It makes a case for using the most recent version feasible, in contrast to the common presumption that newer versions of any software are buggier and slower than their predecessors.
Chapter 2, Database Hardware, discusses how the main components in server hardware, including processors, memory, and disks, need to be carefully selected for reliable database storage and a balanced budget. In particular, accidentally using volatile write-back caching in disk controllers and drives can easily introduce database corruption.
Chapter 3, Database Hardware Benchmarking, moves on to quantifying the different performance aspects of database hardware. Just how fast are the memory and raw drives in your system? Does performance scale properly as more drives are added?
Chapter 4, Disk Setup, looks at popular filesystem choices and suggests the trade-offs of various ways to lay out your database on disk. Some common and effective filesystem tuning tweaks are also discussed.
Chapter 5, Memory for Database Caching, digs into how the database is stored on disk and in memory, and how the checkpoint process serves to reconcile the two safely. It also suggests how you can actually look at the data being cached by the database to confirm that what's being stored in memory matches what you'd expect to be there.
Chapter 6, Server Configuration Tuning, covers the most important settings in the postgresql.conf file, what they mean, and how you should set them. The settings that may cause you some trouble if you change them are pointed out too.
Chapter 7, Routine Maintenance, starts by explaining how PostgreSQL determines which rows are visible to which clients. The way visibility information is stored requires a cleanup process named VACUUM to reuse leftover space properly. Common issues and general tuning suggestions for it and the ever-running autovacuum are covered. Finally, there's a look at adjusting the amount of data logged by the database, and using a query log analyzer on the result to help find query bottlenecks.
Chapter 8, Database Benchmarking, investigates how to get useful benchmark results from the built-in pgbench testing program included with PostgreSQL.
Chapter 9, Database Indexing, introduces indexes in terms of how they can reduce the number of data blocks read to answer a query. That approach allows for thoroughly investigating common questions such as why a query is using a sequential scan instead of an index in a robust way. In this chapter, we present new PostgreSQL 10 index types.
Chapter 10, Query Optimization, is a guided tour of the PostgreSQL optimizer, exposed by showing the way sample queries are executed differently based on what they are asking for and how the database parameters are set. This chapter also covers parallel queries, the new way to execute a query across multiple cores.
Chapter 11, Database Activity and Statistics, looks at the statistics collected inside the database, and which of them are useful for finding problems. The views that let you watch query activity and locking behavior are also explored.
Chapter 12, Monitoring and Trending, starts with how to use basic operating system monitoring tools to determine what the database is doing. Then, it moves onto suggestions for trending software that can be used to graph this information over time.
Chapter 13, Pooling and Caching, explains the difficulties you can encounter when a large number of connections are made to the database at once. Two types of software packages are suggested to help: connection poolers, to better queue incoming requests, and caches, which can answer user requests without connecting to the database.
Chapter 14, Scaling with Replication, covers approaches for handling heavier system loads by replicating the data across multiple nodes, typically a set of read-only nodes synchronized to a single writable master. This chapter also covers the new logical replication method.
Chapter 15, Partitioning Data, explores how data might be partitioned into subsets usefully, such that queries can execute against a smaller portion of the database. The approaches discussed include the standard single-node database table partitioning and the new declarative partition method.
Chapter 16, Avoiding Common Problems, discusses parts of PostgreSQL that regularly seem to frustrate newcomers to the database. Bulk loading, counting records, and foreign key handling are some examples. This chapter ends with a detailed review of which performance-related features have changed between each version of PostgreSQL from 8.1 to 10.3. Sometimes, the best way to avoid a common problem is to upgrade to a version in which it doesn't happen anymore.
In order for this book to be useful, you need at least access to a PostgreSQL client that is allowed to execute queries on a server. Ideally, you'll also be the server administrator. Full client and server packages for PostgreSQL are available for most popular operating systems at http://www.postgresql.org/download/. All the examples here are executed at a command prompt, usually running the psql program. This makes them applicable to most platforms. It's straightforward to do many of these operations using a GUI tool for PostgreSQL instead, such as the pgAdmin program. Some scripts that are written in the bash scripting language are provided.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/PostgreSQL-10-High-Performance. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/PostgreSQL10HighPerformance_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The sync here is to try and flush all the data to the disk before we just blow away the caches."
A block of code is set as follows:
for each outer row:
    for each inner row:
        if join condition is true:
            output combined row
Any command-line input or output is written as follows:
EXPLAIN ANALYZE SELECT * FROM orders WHERE customerid IN (SELECT customerid FROM customers WHERE customerid=1000 OR customerid=2000);
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "HD Tune includes a Random Access test that gives its results in terms of the standard IOPS figure, at various block sizes."
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
PostgreSQL certainly has a reputation. It's known for having a rich feature set and very stable software releases. The secure stance that its default configuration takes is simultaneously praised by security fans and criticized for its learning curve. The SQL-specification conformance and data integrity features allow only the strictest ways to interact with the database, which is surprising to those who come from a background working with looser desktop database software. All of these points have an element of truth to them.
Another part of PostgreSQL's reputation is that it's slow. This, too, has some truth to it, even today. There are many database operations where the right thing takes longer to do than the alternative. As the simplest example of this, consider the date February 29, 2009. With no leap year in 2009, that date is only valid as an abstract one. It's not possible for this to be the real date of something that happened. If you ask the database to store this value into a standard date field, it can just do that, the fast approach. Alternatively, it can check whether that date is valid to store into the destination field, note that there is no such date in a regular calendar, and reject your change. That's always going to be slower. PostgreSQL is designed by, and intended for, the sort of people who don't like cutting corners just to make things faster or easier, and in cases where the only way you can properly handle something takes a while, that may be the only option available.
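As a minimal illustration (output from a psql session; the exact wording can vary by version), PostgreSQL refuses to store that impossible date rather than silently accepting it:
postgres=# SELECT '2009-02-29'::date;
ERROR:  date/time field value out of range: "2009-02-29"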
However, once you have a correct implementation of something, you can then go back and optimize it. That's the mode PostgreSQL has been in for the last few years. PostgreSQL usually rises above these smaller issues to give excellent database performance. Parts of it have the sort of great design that outperforms simpler approaches, even after paying the overhead that complexity can introduce. This is a fairly recent phenomenon though, which explains quite a bit about the perception that PostgreSQL is a slower database than its competitors. In this chapter, we will cover the following topics:
Performance of historical PostgreSQL releases
PostgreSQL or another database?
PostgreSQL tools
PostgreSQL application scaling life cycle
Performance tuning as a practice
In November 2005, PostgreSQL 8.1 was released. It included a number of internal architectural changes, some of which aimed to improve how fast the database would run on a multiprocessor system with many active clients. The result was a major improvement in the ability of the database to scale upwards to handle a heavy load. Benchmarks on modern hardware really highlight just how far that version leapfrogged earlier ones. You can find an excellent performance comparison of versions 8.0 through 8.4 from György Vilmos at http://suckit.blog.hu/2009/09/29/postgresql_history. This shows exactly how dramatic these improvements have been.
This test gives a transactions per second (TPS) figure that measures the total system speed, and you can run it in either a read-only mode or one that includes writes. The read-only performance improved by over four times from 8.0 to 8.1 and more than doubled again by 8.3:
Version    Peak read-only TPS    # of clients at peak
8.0.21     1256                  4
8.1.17     5620                  14
8.2.13     8109                  18
8.3.7      13984                 22
8.4.1      13546                 22
The rise in the number of clients at the peak load gives us an idea of how well the database internals handle access to shared resources. Version 8.1 in particular included a significant upgrade in this area. Performance improved similarly on the write side, with almost an eightfold gain between 8.0 and 8.3:
Version    Peak write TPS    # of clients at peak
8.0.21     361               2
8.1.17     873               10
8.2.13     1358              14
8.3.7      2795              18
8.4.1      2713              12
The small decrease in performance from 8.3 to 8.4 in both these tests is due to some subtle retuning of the database to improve its worst-case performance. More statistics are collected in 8.4 to improve complicated queries, at the expense of slightly slowing the sort of trivial ones tested here.
These improvements have been confirmed by other benchmarking results, albeit normally not covering such a wide range of versions. It's easy to see that any conclusion about PostgreSQL performance reached before late 2005, when 8.1 shipped, is completely out of date at this point. The speed improvement in 2008's 8.3 release was an additional large leap. Versions before 8.3 are not representative of the current performance and there are other reasons to prefer using that one or a later one too.
Because of these dramatic gains, if you have an older PostgreSQL system you'd like to make faster, the very first thing you should ask yourself is not how to tweak its settings, but instead if it's possible to upgrade to a newer version. If you're starting a new project, 8.3 is the earliest version you should consider. In addition to the performance improvements, there were some changes to that version that impact application coding that you'd be better off to start with to avoid needing to retrofit later.
Chapter 16, Avoiding Common Problems, includes a reference guide to what performance-related features were added to each major version of PostgreSQL from 8.1 through 10.0. You might discover that one of the features only available in a very recent version is compelling to you, and therefore you have a strong preference to use that one. Many of these version-specific changes are also highlighted throughout the book.
Until very recently, the only way to upgrade an existing PostgreSQL version to a newer major version, such as going from 8.1.X to 8.2.X, was to dump and reload. The pg_dump and/or pg_dumpall programs are used to write the entire content of the database to a file, using the newer versions of those programs. That way, if any changes need to be made to upgrade, the newer dumping program can try to handle them. Not all upgrade changes will happen automatically, though. Then, depending on the format you dumped in, you can either restore it just by running the script it generates or use the pg_restore program to handle that task. pg_restore can be a much better alternative in newer PostgreSQL versions that include a version with parallel restore capabilities.
Dumping can take a while, and restoring can take even longer. While this is going on, your database likely needs to be down, so that you don't allow any changes that won't then be migrated over by the dump. For large databases, this downtime can be both large and unacceptable.
The most demanding sites prefer near-zero downtime, to run 24/7. There, a dump and reload is never an acceptable option. Until recently, the only real approach available for doing PostgreSQL upgrades in those environments has been using statement replication to do so. Slony is the most popular tool for that, and more information about it is available in Chapter 14, Scaling with Replication. One of Slony's features is that you don't have to be running the same version of PostgreSQL on all the nodes you are replicating to. You can bring up a new node running a newer PostgreSQL version, wait for replication to complete, and then switch over once it matches the original.
Another tool used for asynchronous primary/secondary replication is Londiste, from SkyTools. One of the benefits of Londiste over the streaming replication that's in the core of PostgreSQL is that Londiste can replicate a single database or a single table from a database; streaming replication creates an exact copy of the database server. Londiste provides more granularity for replication, which makes it ideal for this kind of migration. It allows us to move databases from several servers to one unified server.
Now, there is another way available that works without needing any replication software. A program originally called pg_migrator is capable of upgrading from 8.3 to 8.4 without the dump and reload. This process is called in-place upgrading. You need to test this carefully, and there are both known limitations and likely still unknown ones related to less popular PostgreSQL features. Be sure to read the documentation of the upgrade tool very carefully. Starting in PostgreSQL 9.0, this module is included with the core database, with the name changed to pg_upgrade. pg_upgrade is a native PostgreSQL command and must be run while the server is offline. While all in-place upgrades have some risk and need careful testing, in many cases, these will take you from 8.3 or 8.4 to 10.0 and hopefully beyond.
The PostgreSQL development community is now moving to an online replication approach; for example, the pglogical extension for PostgreSQL provides much faster replication than Slony, Bucardo, or Londiste, as well as cross-version upgrades.
The major internal changes to 8.3 make it impossible to upgrade from any earlier version past it without dumping the entire database and reloading it into the later one. This makes 8.3 a doubly important version milestone to cross. Not only is it much faster than 8.2, once your data is in 8.3, you can perform in-place upgrades from there.
Going from an earlier version to PostgreSQL 8.3 or later can be a difficult change. Some older applications rely on non-character data types being transparently cast to the text type, a behavior removed from 8.3 for a variety of reasons. For details, see http://www.postgresql.org/docs/8.3/static/release-8-3.html.
While there's always a chance that upgrading your database version can introduce new issues, it is particularly likely that applications written against an earlier version will need to be updated to work against 8.3 or later. It is possible to work around this issue by manually adding back the automatic typecasting features that were removed. However, fixing the behavior in your application instead is a more robust and sustainable solution to the problem. The old behavior was eliminated because it caused subtle application issues. If you just add it back, you'll both be exposed to those and need to continue performing this extra step, at additional cost, with every new PostgreSQL release. There is more information available at https://www.endpoint.com/blog/2010/01 on this topic and on the general challenges of doing a major PostgreSQL upgrade.
A dump/reload, or the use of tools such as pg_upgrade, is not needed for minor version updates, for example, going from 8.4.1 to 8.4.2. These simply require stopping the server, installing the new version, and then running the newer database binary against the existing server data files. Some people avoid ever doing such upgrades once their application is running for fear that a change in the database will cause a problem. This should never be the case for PostgreSQL.
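On a Unix-like system, for example, a minor update amounts to something like the following sketch; the data directory path here is only an assumption, and packaged installations usually wrap these steps in their own service scripts:
$ pg_ctl -D /usr/local/pgsql/data stop
# install the updated PostgreSQL packages or binaries here
$ pg_ctl -D /usr/local/pgsql/data start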
You should never find an unexpected change that breaks an application in a minor PostgreSQL upgrade. Bug, security, and corruption fixes are always done in a way that minimizes the odds of introducing an externally visible behavior change, and if that's not possible, the reason why and the suggested workarounds will be detailed in the release notes. What you will find is that some subtle problems, resulting from resolved bugs, can clear up even after a minor version update. It's not uncommon to discover that the reporting of a problem to one of the PostgreSQL mailing lists is resolved in the latest minor version update compatible with that installation, and upgrading to that version is all that's needed to make the issue go away.
Starting from version 9.0, it is possible to migrate a complete cluster (users and databases) using pg_upgrade. It is useful for migrating from one major version to another, for example from PostgreSQL 9.6 to PostgreSQL 10. This approach is safe and faster than a dump and restore, because pg_upgrade migrates PostgreSQL data pages in a binary way and does not need to rebuild any indexes.
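A sketch of what such an upgrade can look like follows; the Debian-style binary and data directory paths are assumptions to adapt to your installation, and the --check flag makes pg_upgrade report incompatibilities without migrating anything:
$ pg_upgrade --check \
    -b /usr/lib/postgresql/9.6/bin -B /usr/lib/postgresql/10/bin \
    -d /var/lib/postgresql/9.6/main -D /var/lib/postgresql/10/main
Running the same command without --check, with both clusters shut down, performs the actual migration.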
As mentioned above, another approach may be to use pglogical. pglogical is a logical replication system implemented entirely as a PostgreSQL extension. Fully integrated, it requires no triggers or external programs. This alternative to physical replication is a highly efficient method of replicating data, using a publish/subscribe model for selective replication. Using pglogical, we can migrate and upgrade PostgreSQL with almost zero downtime.
There are certainly situations where other database solutions will perform better. For example, PostgreSQL is missing features needed to perform well on some of the more difficult queries in the TPC-H test suite (see Chapter 8, Database Benchmarking, for more details). It's correspondingly less suitable for running large data warehouse applications than many of the commercial databases. If you need queries along the lines of some of the very heavy ones TPC-H includes, you may find that databases such as Oracle, DB2, and SQL Server still have a performance advantage worth paying for. There are also several PostgreSQL-derived databases that include features making them more appropriate for data warehouses and similar larger systems. Examples include Greenplum, Aster Data, and Netezza.
For some types of web applications, you can only get acceptable performance by cutting corners on the data integrity features in ways that PostgreSQL just won't allow. These applications might be better served by a less strict database, such as MySQL, or even a really minimal one, such as SQLite. Unlike the fairly mature data warehouse market, the design of this type of application is still moving around quite a bit. Approaches using the key/value-based NoSQL model, including CouchDB, MongoDB, and Cassandra, are all becoming more popular at the time of writing. All of them can easily outperform a traditional database, provided you have no need to run the sort of advanced queries that key/value stores are slower at handling. PostgreSQL also natively supports and indexes the JSON data type for a NoSQL data approach.
Starting from version 9.4, PostgreSQL has the jsonb field type, and it can be used as a NoSQL system. jsonb fields are indexable, and starting from version 10.x, new operators and functions are present in PostgreSQL that allow deleting, modifying, or inserting values into jsonb values, including at specific path locations.
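A minimal sketch of that usage follows; the table and data are invented for illustration. The GIN index can accelerate containment (@>) searches, and jsonb_set (available since 9.5) modifies the value at a given path:
CREATE TABLE docs (id serial PRIMARY KEY, doc jsonb);
CREATE INDEX docs_doc_idx ON docs USING GIN (doc);
INSERT INTO docs (doc) VALUES ('{"name": "widget", "tags": ["a", "b"]}');
-- Containment search that can use the GIN index
SELECT id FROM docs WHERE doc @> '{"name": "widget"}';
-- Replace the value stored under the name key
UPDATE docs SET doc = jsonb_set(doc, '{name}', '"gadget"');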
Starting from version 9.3, PostgreSQL has foreign data wrapper (fdw) support. With an fdw, PostgreSQL can connect to many external database management systems (DBMSes), and it can see foreign tables (for example, MySQL or Oracle tables) as local tables. Some of the best-known fdws are:
Oracle
MySQL
Informix
Firebird
SQLite
CSV files
Sybase
Microsoft SQL Server
MongoDB
Cassandra
The complete list is available at https://wiki.postgresql.org/wiki/Foreign_data_wrappers.
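As a brief sketch of how this looks in practice, here is roughly how postgres_fdw, the contrib wrapper for other PostgreSQL servers, exposes a remote table as a local one. The server address, credentials, and table definition below are invented placeholders:
CREATE EXTENSION postgres_fdw;
CREATE SERVER remote_pg FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'sales');
CREATE USER MAPPING FOR CURRENT_USER SERVER remote_pg
    OPTIONS (user 'reader', password 'secret');
CREATE FOREIGN TABLE remote_orders (id integer, total numeric)
    SERVER remote_pg OPTIONS (table_name 'orders');
-- Queries against remote_orders are forwarded to the remote server
SELECT count(*) FROM remote_orders;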
If you're used to your database vendor supplying a full tool chain with the database itself, from server management to application development, PostgreSQL may be a shock to you. Like many successful open source projects, PostgreSQL tries to stay focused on the features it's uniquely good at. This is what the development community refers to as the PostgreSQL core: the main database server, and associated utilities, that can only be developed as a part of the database itself. When new features are proposed, if it's possible for them to be built and distributed out of core, this is the preferred way to do things. This approach keeps the database core as streamlined as possible, as well as allowing those external projects to release their own updates without needing to synchronize them against the main database's release schedule.
Successful PostgreSQL deployments should recognize that a number of additional tools, each with their own specialized purpose, will need to be integrated with the database core server to build a complete system.
One part of the PostgreSQL core that you may not necessarily have installed is what's called the contrib modules (it is named after the directory they are stored in). These are optional utilities shipped with the standard package, but that aren't necessarily installed by default on your system. The contrib code is maintained and distributed as part of the PostgreSQL core, but not required for the server to operate.
From a code quality perspective, the contrib modules aren't held to quite as high a standard, primarily by how they're tested. The main server includes heavy regression tests for every feature, run across a large build farm of systems that look for errors and look for greater performance and greater stability. The optional contrib modules don't get that same level of testing coverage. However, the code itself is maintained by the same development team, and some of the modules are extremely popular and well tested by users.
A list of all the contrib modules available can be found at http://www.postgresql.org/docs/current/static/contrib.html.
One good way to check whether you have the contrib modules installed is to see if the pgbench program is available. That's one of the few contrib components that installs a full program, rather than just scripts you can use. Here's a Unix example of checking for pgbench:
$ pgbench -V
pgbench (PostgreSQL) 10.0
If you're using an RPM or DEB packaged version of PostgreSQL, as would be the case on many Linux systems, the optional postgresql-contrib package contains all of the contrib modules and their associated installer scripts. You may have to add that package using yum, apt-get, or a similar mechanism if it wasn't installed already. On Solaris, the package is named SUNWpostgr-contrib.
If you're not sure where your system's PostgreSQL contrib modules are installed, you can use a filesystem utility to search. locate works well for this purpose on many Unix-like systems, as does the find command. The file search utilities available on the Windows Start menu will also work. A sample file you could look for is pg_buffercache.sql, which will be used in Chapter 5, Memory for Database Caching, on memory allocation. Here's where that might be on some of the platforms that PostgreSQL supports:
RHEL and CentOS Linux systems will put the main file you need into /usr/share/pgsql/contrib/pg_buffercache.sql
Debian or Ubuntu Linux systems will install the file at /usr/share/postgresql/version/contrib/pg_buffercache.sql
Solaris installs it into /usr/share/pgsql/contrib/pg_buffercache.sql
The standard Windows one-click installer with the default options will always include the contrib modules, and this one will be in C:\Program Files\PostgreSQL\version\share\contrib\pg_buffercache.sql
Building your own PostgreSQL from source code can be a straightforward exercise on some platforms if you have the appropriate requirements already installed on the server. Details are documented at http://www.postgresql.org/docs/current/static/install-procedure.html.
After building the main server code, you'll also need to compile contrib modules by yourself too. Here's an example of how that would work, presuming that your PostgreSQL destination is /usr/local/postgresql, and that there's a directory there named source you put the source code into (this is not intended to be a typical or recommended structure you should use):
$ cd /usr/local/postgresql/source
$ cd contrib/pg_buffercache/
$ make
$ make install
/bin/mkdir -p '/usr/local/postgresql/lib/postgresql'
/bin/mkdir -p '/usr/local/postgresql/share/postgresql/contrib'
/bin/sh ../../config/install-sh -c -m 755 pg_buffercache.so '/usr/local/postgresql/lib/postgresql/pg_buffercache.so'
/bin/sh ../../config/install-sh -c -m 644 ./uninstall_pg_buffercache.sql '/usr/local/postgresql/share/postgresql/contrib'
/bin/sh ../../config/install-sh -c -m 644 pg_buffercache.sql '/usr/local/postgresql/share/postgresql/contrib'
It's also possible to build and install all the contrib modules at once by running make and make install from the contrib directory.
While some contrib programs, such as pgbench, are directly executable, most are utilities that you install into a database in order to add extra features to it.
As an example, to install the pg_buffercache module into a database named abc, the following command line would work (assuming the RedHat location of the file):
$ psql -d abc -f /usr/share/pgsql/contrib/pg_buffercache.sql
You could use the pgAdmin III GUI management utility, which is bundled with the Windows installer for PostgreSQL, instead of the command line:
1. Navigate to the database you want to install the module into.
2. Click on the SQL icon in the toolbar to bring up the command editor.
3. Choose File/Open. Navigate to C:\Program Files\PostgreSQL\version\share\contrib\pg_buffercache.sql and open that file.
4. Execute using either the green arrow or Query/Execute.
You can do a quick test of the module installed on any type of system by running the following quick query:
SELECT * FROM pg_buffercache;
If any results come back, the module was installed. Note that pg_buffercache will only be installable and usable by database superusers.
The official home of many PostgreSQL-related projects is pgFoundry.
pgFoundry only hosts software for PostgreSQL, and it provides resources such as mailing lists and bug tracking, in addition to file distribution. Many of the most popular PostgreSQL add-on programs are hosted there:
Windows software allowing access to PostgreSQL through .NET and OLE
Connection poolers, such as pgpool and pgBouncer
Database management utilities, such as pgFouine, SkyTools, and PgTune
While sometimes maintained by the same people who work on the PostgreSQL core, pgFoundry code varies significantly in quality. One way to help spot the healthier projects is to note how regularly and recently new versions have been released.
Another site where it is possible to find many PostgreSQL-related projects is PGXN. PGXN is more recent than pgFoundry and it is possible to find recent extensions there.
The PostgreSQL Extension Network (PGXN) is a central distribution system for open source PostgreSQL extension libraries. It consists of four basic parts:
PGXN Manager: An upload and distribution infrastructure for extension developers
PGXN API: A centralized index and API of distribution metadata
PGXN Search: A site for searching extensions and perusing their documentation
PGXN Client: A command-line client for downloading, testing, and installing extensions
The difference between pgFoundry and PGXN is that pgFoundry is about project management and PGXN is about distribution and exposure.
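To give a feel for the workflow, a session with the PGXN Client might look like the following sketch; the extension name pg_partman is just an example of a distribution hosted on PGXN, and installing the client via pip assumes a Python environment is available:
$ pip install pgxnclient        # install the PGXN command-line client
$ pgxn search partition         # search the distribution metadata
$ pgxn install pg_partman       # download, build, and install the extension
$ pgxn load -d mydb pg_partman  # run CREATE EXTENSION in the target database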
Beyond what comes with the PostgreSQL core, the contrib modules, and software available on pgFoundry, there are plenty of other programs that will make PostgreSQL easier and more powerful. These are available from sources all over the internet. There are actually so many available that choosing the right package for a requirement can itself be overwhelming.
Some of the best programs will be highlighted throughout the book, to help provide a short list of the ones you should consider early. This approach, where you get a basic system running and then add additional components as needed, is the standard way large open source projects are built.
It can be difficult for some corporate cultures to adapt to that style, such as ones where any software installation requires everything from approval to a QA cycle. In order to improve the odds of your PostgreSQL installation being successful in such environments, it's important to start introducing this concept early on. Additional programs that add components on top of the intentionally slim database core will be needed later, and not all of what's needed will be obvious at the beginning.
While every application has unique growth aspects, there are many common techniques that you'll find necessary as an application using a PostgreSQL database becomes used more heavily. The chapters of this book each focus on one of the common aspects of this process. The general path that database servers follow includes the following steps:
Select hardware to run the server on. Ideally, you'll test that hardware to make sure it performs as expected too.
Set up all the parts of database disk layout: RAID level, filesystem, and possibly table/index layout on disk.
Optimize the server configuration.
Monitor server performance and how well queries are executing.
Improve queries to execute more efficiently, or add indexes to help accelerate them.
As it gets more difficult to just tune the server to do more work, instead reduce the amount it has to worry about by introducing connection pooling and caching.
Partition larger tables into sections. Eventually, really large ones may need to be split so that they're written to multiple servers simultaneously.
This process is by no means linear. You can expect to make multiple passes over optimizing the server parameters. It may be the case that you decide to buy newer hardware first, rather than launching into replication or partitioning work that requires application redesign work. Some designs might integrate caching into the design from the very beginning. The important thing is to be aware of the various options available and to collect enough data about what limits the system is reaching to decide which of the potential changes is most likely to help.
Work on improving database performance has its own terminology, just like any other field. Here are some terms or phrases that will be used throughout the book:
Bottleneck or limiting factor: Both of these terms will be used to refer to the current limitation that is preventing performance from getting better.
Benchmarking: Running a test to determine how fast a particular operation can run. This is often done to figure out where the bottleneck of a program or system is.
Profiling: Monitoring what parts of a program are using the most resources when running a difficult operation, such as a benchmark. This is typically to help prove where the bottleneck is, and whether it's been removed as expected after a change. Profiling a database application usually starts with monitoring tools, such as vmstat and iostat. Popular profiling tools at the code level include gprof, OProfile, and DTrace.
One of the interesting principles of performance tuning work is that, in general, you cannot figure out what bottleneck an application will next run into until you remove the current one. When presented with a system that's not as fast as someone would expect it to be, you'll often see people guessing what the current bottleneck is, or what the next one will be. That's generally a waste of time. You're always better off measuring performance, profiling the parts of the system that are slow, and using that to guess at causes and guide changes.
Let's say what you've looked at suggests that you should significantly increase shared_buffers, the primary tunable for memory used to cache database reads and writes. This normally has some positive impact, but there are potential negative things you could encounter instead. The information needed to figure out which category a new application will fall into, whether this change will increase or decrease performance, cannot be predicted from watching the server running with the smaller setting. This falls into the category of chaos theory: even a tiny change in the starting conditions can end up rippling out to a very different end condition, as the server makes millions of decisions and they can be impacted to a small degree by that change. Similarly, if shared_buffers is set too small, several other parameters won't work as expected at all, such as those governing database checkpoints.
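For reference, inspecting and changing that setting looks like the following; the 2GB value is only a placeholder, not a recommendation:
SHOW shared_buffers;
-- Persist a new value (ALTER SYSTEM is available from 9.4 onward);
-- shared_buffers only takes effect after a server restart
ALTER SYSTEM SET shared_buffers = '2GB';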
Since you can't predict what's going to happen most of the time, the mindset you need to adopt is one of heavy monitoring and change control.
Introduce a small targeted change. Try to quantify what's different and be aware that some changes you have rejected as not positive won't always stay that way forever. Move the bottleneck to somewhere else, and you may discover that some parameter that didn't matter before is now suddenly the next limiting factor.
There's a popular expression on the mailing list devoted to PostgreSQL performance when people speculate about root causes without doing profiling to prove their theories: less talk, more gprof. While gprof may not be the tool of choice for every performance issue, given it's more of a code profiling tool than a general monitoring one, the idea that you measure as much as possible before speculating as to the root causes is always a sound one. You should also measure again to verify that your change did what you expected too.
Another principle that you'll find is a recurring theme in this book is that you must be systematic about investigating performance issues. Do not assume your server is fast because you bought it from a reputable vendor; benchmark the individual components yourself. Don't start your database performance testing with application-level tests; run synthetic database performance tests that you can compare against other people's first. That way, when you run into the inevitable application slowdown, you'll already know your hardware is operating as expected and that the database itself is running well. Once your system goes into production, some of the basic things you might need to do in order to find a performance problem, such as testing hardware speed, become impossible without taking the system down.
You'll be in much better shape if every server you deploy is tested with a common methodology, which is exactly what later chapters here lead you through. Just because you're not a hardware person, it doesn't mean you should skip over the parts here that cover things such as testing your disk performance. You need to perform work like that as often as possible when exposed to new systems; that's the only way to get a basic feel for whether something is operating within the standard range of behavior or whether instead there's something wrong.
PostgreSQL has come a long way in the last five years. After building solid database fundamentals, the many developers adding features across the globe have made significant strides in adding both new features and performance improvements in recent releases. The features added to the latest PostgreSQL, 10.0, making replication and read scaling easier than ever before, are expected to further expand the range of applications the database is appropriate for.
The extensive performance improvements in PostgreSQL 9.x and 10.x in particular shatter some earlier notions that the database server was slower than its main competitors.
There are still some situations where PostgreSQL's feature set results in slower query processing than some of the commercial databases it might otherwise displace.
If you're starting a new project using PostgreSQL, use the latest version possible (your preference really should be to deploy version 8.3 or later).
PostgreSQL works well in many common database applications, but certainly there are applications it's not the best choice for.