DuckDB is a fast in-process analytical database. Getting Started with DuckDB offers a practical overview of its usage. You'll learn to load, transform, and query various data formats, including CSV, JSON, and Parquet. The book covers DuckDB's optimizations, SQL enhancements, and extensions for specialized applications. Working through examples in SQL, Python, and R, you'll analyze public datasets and discover tools that enhance DuckDB workflows. This guide suits both experienced and new data practitioners, quickly equipping you to apply DuckDB's capabilities in analytical projects and to integrate it effectively into your data workflows.
Getting Started with DuckDB
A practical guide for accelerating your data science, data analytics, and data engineering workflows
Simon Aubury
Ned Letcher
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Managers: Kaustubh Mangrulkar and Apeksha Shetty
Publishing Product Manager: Deepesh Patel
Book Project Managers: Kirti Pisat and Hemangi Lotlikar
Senior Content Development Editor: Shreya Moharir
Technical Editor: Seemanjay Ameriya
Copy Editor: Safis Editing
Proofreader: Shreya Moharir
Indexer: Manju Arasan
Production Designer: Prashant Ghare
Senior DevRel Marketing Executive: Nivedita Singh
First published: June 2024
Production reference: 1240524
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80324-100-5
www.packtpub.com
To the many teachers who have inspired me, to my loving wife and children who have supported me, and to Snowy the cat who has entertained me.
– Simon Aubury
To Libby and Marcus. I am ever grateful for your love and support.
– Ned Letcher
It’s easy to convince someone that DuckDB is fast. Just run a query that aggregates a billion rows—you’ll have some pretty compelling evidence about a second later. But showing someone how user-friendly it is? That takes some more time. Beyond all the effort that’s gone into making DuckDB fast lies a second ocean of effort that’s gone into making your life as a data wrangler easier—from the little helpers such as pattern matching on column names (so you can select all the columns that end in '_price') to much bigger features such as being able to read directly from CSV files or JSON endpoints. DuckDB has so many affordances and hidden gems that you could spend years stumbling across them by chance, or by peering over the shoulder of a colleague. Luckily, you don’t have to.
In this excellent book, Simon and Ned have combined the practicalities of what you need to know now with a wealth of hints and tips for getting the most out of DuckDB. Tips for doing more, much more easily. As I read through, I collected several pages of notes for things I’ll start using immediately, and things to file away for future projects.
The chapter on DuckDB’s extensions is particularly fruitful if you’re looking to perform minor data miracles. You will learn how to pull raw data off of S3, chew through it in seconds, and export it into an Excel spreadsheet, instantly becoming the favorite data guru of an entire marketing department. (Although you may want to think twice before sealing that pact…)
– Kris Jenkins
Host of Developer Voices and Co-Founder of BullionVault
Simon Aubury has been working in the IT industry since 2000 as a data engineering specialist. He has an extensive background in building large, flexible, highly available distributed data systems. Simon has delivered critical data systems for finance, transport, healthcare, insurance, and telecommunications clients in Australia, Europe, and Asia Pacific. In 2019, Simon joined Thoughtworks as a principal data engineer and today is associate director of data platforms at Simple Machines in Sydney, Australia. Simon is active in the data community, a regular conference speaker, and the organizer of local and international meetups and data engineering conferences.
I want to thank the vibrant DuckDB community, which like many open source projects is built with the hard work, dedication, and passion of hundreds of contributors working together to build incredible things for us all.
Ned Letcher has worked as a data science and software engineering consultant since completing his PhD in computational linguistics in 2018 and currently works at Thoughtworks. He has designed and developed data-powered products and services across a range of industries and helped organizations and teams improve the effectiveness of their data processes and workflows. Ned has also worked as a Python trainer, supporting both tertiary students and data professionals across various organizations. He is active in the data community, speaking at and helping organize meetups and conferences, as well as contributing to a range of open source projects.
Thank you to everyone who's supported me on this odyssey, Dan and Lilly in particular: I couldn't have done this without you. Thanks to Craig Savage, Nathan Dines, and Elliana May for their technical advice and support. A big shoutout to the lovely folks at Code Black Brunswick, Ramblin' Man, and the Sporting Club Hotel, for keeping me fueled while I worked on my laptop in a fugue state.
Torsten Grust is a professor of computer science at Universität Tübingen, Germany, where he has led the database systems research group since 2008. Torsten performs research into the design, compilation, optimization, and evaluation of a variety of database languages, modern dialects of SQL in particular. In this work, he often walks the fine line between database query and programming language technology. His group develops techniques that turn relational database systems into scalable processors for non-relational query and programming languages as well. Torsten is at his happiest whenever he finds new evidence that database and programming language research can mutually benefit each other.
Louisa Lambrecht, Björn Bamberg, Tim Fischer, and Denis Hirn are research assistants in Torsten’s group, where they tend to spend the lion’s share of their (working) days dabbling with the intricacies of SQL processors and the innards of various database systems, including DuckDB.
Data is everywhere, stored in a huge variety of systems across many different formats, and with an ever-growing number of tools available to data practitioners to practice their craft. DuckDB is a relatively new and explosively popular database management system (DBMS) that is increasingly being adopted for analytical data workloads by data scientists, data analysts, data engineers, and software engineers. DuckDB is open source software that is made available under the permissive MIT license, making it friendly to both commercial and non-commercial applications alike. The non-profit DuckDB Foundation stewards the long-term health of the DuckDB project, and the development of DuckDB is supported by DuckDB Labs, which employs the project’s core contributors.
In this chapter, we’ll unpack what type of database DuckDB is and identify use cases that DuckDB is well suited to and that data practitioners are increasingly adopting it for. We’ll also outline the different deployment options DuckDB comes with and take you through how to install it on your own system so that you’re ready to dive into the hands-on examples in this book. Finally, we’ll go through a quick primer on Structured Query Language (SQL), the query language DuckDB uses for its primary interface that we’ll be using for many of the exercises in this book. If you’ve wrangled your fair share of SQL before, you may want to just skim through this section. If you’re newer to using SQL, or it’s been a while between queries, then you’ll want to dive into these hands-on exercises.
By the end of this chapter, you’ll be able to orient DuckDB within the landscape of data tooling and understand what kinds of use cases you may want to consider leveraging it for, as well as be able to recognize when other data processing tooling may be more appropriate.
Across the rest of the book, we'll show you how to put DuckDB through its paces, and in doing so, hopefully impart a sense of why there is so much enthusiasm around it. Right now, let's jump into setting the scene for our DuckDB explorations by covering the following topics:
- What is DuckDB?
- Why use DuckDB?
- DuckDB deployment options and installation
- A short SQL primer
To follow along with the examples in this book, you'll need access to a computer running Windows, macOS, or Linux, and an internet connection to download and then install DuckDB. In later chapters, you'll also need to download some datasets that we'll be using to explore DuckDB's analytical capabilities. The examples we present are available for you to access in this GitHub repository: https://github.com/PacktPublishing/Getting-Started-with-DuckDB.
Whether you’re an experienced data practitioner or just getting started working with data, you will almost certainly find yourself having to navigate the dizzying number of databases and data processing tools that you can choose from to support data-centric applications and operational systems. The reason for this overwhelming choice is that when it comes to data processing and management architectures, there is no one-size-fits-all. Each tool necessarily comes with its own set of trade-offs that make it well suited to a particular flavor of application and less so to others.
With that in mind, let’s dig into what kind of database DuckDB is and where it sits in the data-tooling landscape so that we can unpack what kinds of applications and use cases it is well suited to. One description of DuckDB, which you might encounter when poking around online resources, is the following:
DuckDB is an in-process SQL OLAP DBMS.
While this is a fairly dense description, invoking several distinct concepts from the world of databases and software applications, it does a great job of positioning where DuckDB sits in relation to other databases and data processing tools. So, let’s break this description down, going through each component and working our way from right to left:
- A database management system (DBMS) is a software application for managing structured data in a database, allowing users and applications to store, manipulate, delete, and query records. While you might hear the term database being used as shorthand for DBMS, it's worth noting that a DBMS provides additional functionality on top of the core features of a database—which is essentially to store data in a structured format that supports efficient retrieval and manipulation. A DBMS provides an interface between the database and its users, enabling them to effectively create, read, update, and delete data, while also managing the integrity, scalability, and security of the database. DuckDB is a fully-fledged DBMS that manages all these concerns for users.
- Online analytical processing (OLAP) is a data processing paradigm that is characterized by complex queries over large volumes of multidimensional data, which often involve processing significant portions of a dataset. These analytical workloads often involve applying column-wise aggregation functions over entire tables and joining large tables together. The term was created in contrast to online transaction processing (OLTP), which describes transaction-oriented DBMS tools, such as PostgreSQL, MySQL, and SQLite, which are typically used as operational databases supporting software applications, where frequent reading and writing of individual records is the dominant access pattern. DuckDB is designed and optimized for fast and efficient performance over OLAP workloads.
- SQL is a popular programming language used for storing, manipulating, and querying records in a wide variety of databases and data stores. It is a standard interface used for interacting with and managing relational databases, which are databases characterized by the representation of data as tables of rows and columns, with formal relationships defined across tables. SQL's increasing ubiquity has made it something of a de facto choice for code-defined data-querying interfaces. DuckDB has its own SQL dialect, which forms the primary interface for interacting with DuckDB databases. As we will see, there are also non-SQL interfaces available for users to work with DuckDB databases. In the last section of this chapter, A short SQL primer, we'll cover a brief introduction to the fundamentals of working with SQL for those who are new to working with it or a little rusty.
- In-process means that DuckDB runs embedded within a host process. This is in contrast to most DBMSs, which typically operate standalone, running in a separate process from consuming applications, often on a remote server. By adopting an in-process model rather than a client-server architecture, DuckDB greatly simplifies installation and integration, removing the need to install and manage a standalone DBMS service, as well as the need to connect and authenticate with a remote server. A notable example of an in-process DBMS that you may have encountered is SQLite, which is a popular choice for software developers distributing apps that require reading and writing local transactional data, such as user data for mobile apps and lightweight web apps.
Putting all these pieces together, we can see that DuckDB is a fully featured relational DBMS (RDBMS) that is designed for analytical workloads, provides a SQL interface, and runs entirely embedded in a host process.
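To make the OLAP idea concrete, here is the shape of a typical analytical query: a column-wise aggregation over a whole table, grouped along several dimensions. The sales table and its columns here are hypothetical, used purely for illustration:

SELECT
    region,
    product_category,
    sum(amount) AS total_sales,   -- aggregates computed over entire columns
    avg(amount) AS avg_sale
FROM sales
GROUP BY region, product_category
ORDER BY total_sales DESC;

An OLTP workload, by contrast, would look more like fetching or updating a single order row by its primary key.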
When compared with other popular databases, DuckDB is perhaps most similar to the ubiquitous SQLite in that they are both simple in-process DBMSs that write to a single-file storage format, and they are also both free and open source. The key difference between the two tools is that SQLite is optimized for row-oriented OLTP workloads and hence does not perform well on complex analytical workloads, whereas DuckDB is purpose-built for these workloads, offering extremely good performance over them. It’s for this reason that DuckDB is sometimes described as SQLite for OLAP. In fact, DuckDB appears to be the first production-ready in-process OLAP DBMS.
In the next section, we’ll explore the reasons why people are increasingly adopting DuckDB and finding it to be a valuable workhorse in their analytical data toolkit.
So, why might you want to use DuckDB? Let’s start by zooming all the way out. As a data practitioner, there are two broad contexts where you might find yourself getting excited about leveraging DuckDB:
- Using DuckDB to scale and supercharge data science, data analytics, and ad hoc data-wrangling workflows.
- Using DuckDB as a building block to build operational data engineering infrastructure and interactive data products.
The first of these is likely to be of interest to data practitioners with analytical workflows, such as data scientists, data analysts, and machine learning engineers, whereas the second is more likely to be relevant to data engineers and machine learning engineers building data infrastructure, as well as software engineers building user-facing data products. In this book, we'll be focusing more on using DuckDB to supercharge analytical workflows; however, if you're looking to use DuckDB for building operational data infrastructure and data products, this book will still be a great starting point to get you up to speed with DuckDB's capabilities that make it well suited to these kinds of applications.
In this section, we’ll first go through some use cases that land in DuckDB’s sweet spot, before looking at DuckDB’s features that make it especially well-suited to these applications. We’ll finish up by discussing contexts where other tools may be more appropriate.
DuckDB is an incredibly versatile tool for analytical data processing and management, so any attempt to describe its full range of potential applications will almost certainly be incomplete. To give you a sense of the flavor of possible applications, we’ll go through a range of use cases for DuckDB across the two broad categories mentioned previously: analytical workflows and building operational data infrastructure and products.
A major component of the workflows of data scientists and data analysts involves processing often quite large datasets: cleaning data, transforming it into the right shape, modeling structured data, running statistical algorithms, and training machine learning models. If you talk to a practitioner who has been in the trenches for a while, they will likely tell you that sometimes they feel like they're spending more time fighting the tools they use for these tasks than being productive. Often the size of the data is a limiting factor, with many popular data processing tools, such as pandas dataframes in Python and dataframes in R, simply not being able to handle the size of target datasets within the memory of your workstation or, if they can, taking a frustrating amount of time to process them.
Once you’ve hit the limits of your local machine, conventional wisdom is that you need to take your workload to a distributed data compute framework such as Apache Spark, Dask, or Ray, or perhaps ingest your data into a cloud data warehouse or a data lake, where a distributed SQL query engine such as Google BigQuery, Trino, or Amazon Athena can be used to run queries at scale. These solutions significantly increase the complexity of your workflows, requiring complex supporting infrastructure that must be managed and maintained, with a hefty price tag often associated with such managed services. If you’re lucky enough to have access to these tools, they still come with additional challenges, such as working with unfamiliar or constrained interfaces, and when things go wrong, you may often find yourself having to debug arcane and confusing stack traces from the underlying compute engine.
This is where DuckDB can come to the rescue, offering the simplicity of an in-process tool, with a familiar SQL interface (as well as non-SQL interfaces if you prefer) that is optimized for running complex OLAP queries over large datasets. Not only is DuckDB blazingly fast, but it is also able to handle out-of-core workloads (datasets that don’t fit into memory), enabling you to scale your workflows on a single machine much further before you need to consider more complex distributed data processing solutions.
In recent times, there have been developments in dataframe libraries that help address the performance limitations of tools such as pandas dataframes and R dataframes, such as Dask and Modin, which allow you to perform simple parallelization of dataframe operations across your CPU cores, as well as providing on-ramps to run the same queries across a distributed cluster. We also have dataframe libraries such as Polars and Vaex, which are built on top of Apache Arrow, providing more efficient memory utilization, parallelization, and the ability to handle some out-of-core workloads. These innovations in the data ecosystem are pleasing to see; however, these tools are still ultimately dataframe tools, focusing primarily on querying and data transformation—they do not give you the data management features of a DBMS.
By virtue of being a fully-fledged DBMS, DuckDB provides a range of affordances that data practitioners may not realize they’re missing from their current analytical processing workflows:
- DuckDB provides transactional guarantees through ACID properties (atomicity, consistency, isolation, and durability), meaning that you don't have to worry about corrupted data if your Python or R process crashes midway through a job.
- Data integrity guarantees can be enabled through the specification of constraints that enforce properties over data inserted into tables. DuckDB allows you to specify PRIMARY KEY constraints, which enforce uniqueness across rows within a table, FOREIGN KEY constraints, which enforce referential integrity for relationships across tables, and NOT NULL constraints over column values. DuckDB also provides the ability to apply arbitrary CHECK constraints to column values in the form of Boolean expressions, such as ensuring that string identifiers only contain alphanumeric characters (see the sketch after this list).
- While you can use DuckDB as an entirely in-memory database, a DuckDB database can also be persisted to disk and used across processes, even allowing multiple processes to read concurrently. This enables workflows and consuming patterns that dataframe libraries cannot readily support on their own.
- DuckDB also includes a rich suite of data integrations, with an eye toward performance. Notable examples include optimized CSV, Parquet, and JSON loaders, which can read files in parallel, the ability to read Hive-partitioned files, and the PostgreSQL, MySQL, and SQLite extensions, which allow DuckDB to query directly from source tables in external databases, rather than having to rely on bulk imports that must be periodically refreshed.
- When a data team starts to hit the limits of their existing tooling, whether due to missing data management features or insufficient performance, it's common for the team to start building out their own bespoke tools and packages. Since this kind of custom tooling is typically not core to the value the team is providing, these resources can suffer from defects because too few development resources can be dedicated to them. Using a well-maintained and tested DBMS that is optimized for analytical workloads removes the busywork associated with maintaining tooling that doesn't represent your core value proposition.
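As a brief illustration of these integrity guarantees, here is a minimal sketch of how such constraints can be declared in DuckDB's SQL dialect. The customers and orders tables and their columns are hypothetical:

CREATE TABLE customers (
    customer_id VARCHAR PRIMARY KEY,  -- uniqueness enforced across rows
    name        VARCHAR NOT NULL,     -- NULL values rejected
    -- arbitrary Boolean CHECK constraint: identifiers must be alphanumeric
    CHECK (regexp_full_match(customer_id, '[A-Za-z0-9]+'))
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    -- FOREIGN KEY: every order must reference an existing customer
    customer_id VARCHAR REFERENCES customers (customer_id),
    amount      DECIMAL(10, 2) CHECK (amount > 0)
);

-- This insert is rejected by the CHECK constraint, as the identifier
-- contains non-alphanumeric characters:
INSERT INTO customers VALUES ('cust-01!', 'Gerald');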
DuckDB’s powerful feature set makes it a versatile tool for a range of analytical workflows, whether you’re performing exploratory data analysis (EDA), quickly transforming between common data formats, or building complex data science pipelines. DuckDB enables you to slurp up large datasets from across heterogeneous data sources, with a rich set of features for cleaning dirty data and normalizing inconsistent schema, through a simple interface and with blazing performance. DuckDB also has great integrations with familiar analytical tools commonly used in the data ecosystem, allowing you to mix and match DuckDB with complementary tools to assemble your own effective workflows. For example, DuckDB can query directly from and write to pandas and Polars dataframes and R dataframes, as well as Apache Arrow tables. It also offers the ability to use alternative query interfaces to SQL that may be more familiar to data scientists and data analysts, such as dplyr in R and Ibis in Python. In addition to being a powerful workhorse for complex analytical queries, all this versatility makes DuckDB a valuable Swiss Army knife that is worth having in your analytical data toolkit.
Finally, data scientists and data analysts often find themselves building custom interactive data apps or dashboards for use as proof of concepts (POCs), bespoke tools that support common workflows, or for publishing internal decision-support tools within their organization. Powerful open source dashboarding tools such as R Shiny, Streamlit, and Plotly Dash streamline the development of such data apps; however, they typically leave the integration of a data source up to the developer. DuckDB is a perfect complement to these tools, offering simple in-process integration with no external dependencies and enabling fast analytical querying performance, which is important for low-latency response times that improve the user experience of your data apps. We unpack this particular application of DuckDB further in the DuckDB-powered data apps section in Chapter 12.
While much of the explosive growth and excitement around DuckDB has been driven by folks adopting it for the types of analytical workflows we have just discussed, another area of application is starting to see increased activity and demand, one that draws upon similar themes of doing more with less and simplifying and streamlining workloads. This sees DuckDB being used as a building block in modern data infrastructure for use cases that involve small-to-medium data rather than truly big data, as well as use cases that require low-latency responsiveness for consumer-facing interactive data apps. Common to these applications is a shift away from the paradigm of moving your compute to the data, which is seen as the conventional wisdom for effectively working with big data, and a move toward bringing your data to the compute. For smaller data workloads, this can be faster, more efficient, and cheaper to build and maintain.
Much of the development in modern data processing technologies has been dominated by the needs of hyperscale organizations, with scale-out tools such as MapReduce, Hadoop, and Apache Spark, as well as cloud data warehouses such as Snowflake and BigQuery, dominating the landscape. Most organizations, however, do not operate at hyperscale, and oftentimes data processing needs are quite moderate in comparison. The cleaned and enriched datasets that drive the modern data-informed business—providing business intelligence (BI) across sales, marketing, growth, and product innovation—tend not to reach the petabyte scale. There is an opportunity for data teams in many organizations to adopt leaner data architectures that are optimized for more moderate data workloads and that come with the benefits of reduced complexity and much lower total cost of ownership.
DuckDB’s performance characteristics make it well placed to be a core building block in such architectures. Some examples include the following:
- Using DuckDB to perform transformations in extract, transform, and load (ETL) pipelines as an alternative to tools such as Apache Spark. Compute instances can be spun up on demand, invoking DuckDB to pull down and transform data.
- For data lake contexts, where structured and semi-structured data has been landed in object storage, DuckDB can be used as a lightweight alternative to distributed SQL query engines, which data teams might otherwise reach for, such as Google BigQuery, Amazon Athena, Trino, and Dremio.
- For some scenarios, DuckDB also offers the potential to replace the use of cloud data warehouses such as Snowflake or OLAP engines such as ClickHouse, where utilization of these powerful resources would be low. If your organization is only consuming a handful of data sources to produce conformed tables that drive a small number of reporting use cases, then using DuckDB to build small, targeted data cubes may well be sufficient for your needs.
Some folks have already started to roll their own solutions for adopting these architectures. See, for example, the Modern Data Stack in a Box with DuckDB post by Jacob Matson (https://duckdb.org/2022/10/12/modern-data-stack-in-a-box.html), which explores the use of open source tools to create an end-to-end lightweight data stack, with DuckDB at its core. Another post, Build a poor man's data lake from scratch with DuckDB, by Pete Hunt and Sandy Ryza (https://dagster.io/blog/duckdb-data-lake), explores using DuckDB as a SQL query engine on top of a data lake. Meanwhile, there are also emerging companies offering hosted, serverless analytics platforms driven by DuckDB, the most notable example being MotherDuck (https://motherduck.com).
Another area where traditional scale-out approaches to data processing have proven not always fit for purpose is interactive data applications, such as BI dashboards and bespoke data apps. In such applications, low-latency query results in response to user interaction are crucial for supporting dynamic and ad hoc workloads with a positive user experience. However, most cloud data warehouses and distributed data processing engines are simply not able to provide the low-latency response times required for these types of workloads and must be augmented with different types of pre-aggregation and caching strategies, often in the form of a separate service, which further increases complexity and architectural surface area. DuckDB's blazing fast speeds over analytical workloads make it a compelling choice as the backing query engine for interactive data applications. For example, the hosted BI service Mode recently switched to using DuckDB as their in-memory query engine in order to improve the speed of queries (https://mode.com/blog/how-we-switched-in-memory-data-engine-to-duck-db-to-boost-visual-data-exploration-speed). Hex and Observable are two hosted data analytics notebook services offering rich visualizations and interactivity that both recently added DuckDB integration to supercharge users' workflows. Another notable example is Rill Data Developer, an open source tool for building dashboards, which is built on DuckDB to provide rapid response times for queries.
The use of DuckDB as a building block for data infrastructure and interactive data applications is a notable emerging trend and one we think is worth paying attention to. In the next section, we’ll further unpack the features of DuckDB that serve to make it appealing for both analytical workflows and building operational data infrastructure and data products.
You may find yourself asking, what makes DuckDB so well suited to scaling analytical workflows and being used as a building block in data infrastructure? Here are some key features of DuckDB that have led to it increasingly being adopted by data practitioners.
DuckDB is optimized for OLAP workloads, making it blazingly fast and efficient for the kinds of queries frequently seen in analytical workflows. It achieves this through a range of design choices and optimizations:
- As with most modern OLAP engines, DuckDB employs a column-based execution model to enable better performance over operations that are characteristic of analytical workloads. DuckDB uses a highly tuned vectorized query engine that works by processing chunks of columns at a time. Operating on column chunks rather than entire columns means that queries involving multiple operations, which require intermediate aggregations, are less likely to result in out-of-memory errors. The chunks are also tuned to be small enough so that they remain inside the CPU's ultra-low-latency L1 cache—the CPU's fastest dedicated memory, which is drastically faster than main memory.
- DuckDB leverages a range of compression algorithms, which exploit similarities in values within columns, to reduce its storage size on disk, which in turn improves read times.
- DuckDB employs an end-to-end query optimizer. This means that rather than executing queries as they are written, DuckDB can automatically rewrite queries to be much more efficient.
- Almost all of DuckDB's operations come with automatic parallelism, allowing it to distribute operations over multiple CPU threads, resulting in reduced processing time.
- DuckDB is also able to support out-of-core workloads, where the data to be processed does not fit within available memory. It does this by spilling over into temporary disk storage when memory is exhausted. This does increase processing times due to the slower read times of persistent storage compared to memory; however, this is typically preferable to the query failing outright. These costs can also be mitigated by the selective use of low-latency SSD drives for applications where this is a concern. (The configuration sketch after this list shows how some of these behaviors can be observed and tuned.)
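Most of this happens automatically, but several of these behaviors can be observed and tuned through configuration settings. A minimal sketch (the Parquet file name is hypothetical):

-- Limit the number of worker threads used for automatic parallelism
-- (by default, DuckDB uses all available CPU cores)
SET threads = 4;

-- Cap memory usage; workloads that exceed this limit spill over
-- into temporary disk storage for out-of-core processing
SET memory_limit = '4GB';

-- Inspect the optimizer's rewritten plan and per-operator timings
EXPLAIN ANALYZE SELECT count(*) FROM 'events.parquet';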
The design choice of operating in-process means that users of DuckDB don’t need to concern themselves with installing, maintaining, and authenticating with a standalone database server. Another key design decision of DuckDB was for it not to make use of any third-party dependencies. This makes DuckDB extremely portable across platforms and has also enabled DuckDB to be made available for a wide range of languages and runtimes. This feature of DuckDB has increased its accessibility to a diverse range of consumers, allowing it to be readily incorporated into a wide variety of workflows and tech stacks.
DuckDB also has a strong focus on improving the ergonomics of working with SQL. It has a PostgreSQL-like SQL dialect, making it familiar to many data practitioners, and also includes a wide range of alternative function aliases, matching names used in other popular databases that many practitioners will be familiar with. Notably, DuckDB’s SQL dialect has a range of enhancements designed to improve productivity when writing analytical SQL queries. Some of these include the following:
- Automatic casting of data types where possible, which serves to simplify SQL queries.
- Simple creation of LIST and STRUCT data types using literal values.
- Accessing attributes of STRUCT data types using dot notation.
- Simple string and list slicing syntax similar to Python's.
- The ability to define anonymous lambda functions within SQL queries that can be used for transforming and filtering lists.
- List comprehension syntax similar to Python's, enabling more convenient list transformation and filtering.
- Improved ergonomics for column selection using SELECT *, with the EXCLUDE and REPLACE clauses, which allow you to leverage wildcard selection for more concise queries, while still being able to exclude or transform specific columns. This pattern-matching power can be further enhanced with the COLUMNS() expression, which allows you to select columns that match a regular expression or the output of a lambda function.
- Column aliases that can be reused by subsequent columns in the same SELECT statement, as well as in WHERE and ORDER BY clauses.
- The ability to start SELECT statements with the FROM clause, allowing you to improve the readability of queries by frontloading the data source. Additionally, omitting the SELECT clause entirely is interpreted as having specified SELECT *, making this common query pattern more concise.
- Function-call chaining within SQL queries, similar to familiar data processing APIs such as that of pandas, which is designed around method-call chaining.
- Trailing commas in SQL queries.
A few of these enhancements are illustrated in the short example that follows. We'll cover some of these features in Chapter 10. For a more comprehensive treatment of the friendly SQL enhancements that DuckDB's SQL dialect provides, see the DuckDB documentation: https://duckdb.org/docs/guides/sql_features/friendly_sql.
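Here is a quick taste of a few of these enhancements in action. The trips table and its columns are hypothetical stand-ins:

-- FROM-first syntax; omitting SELECT is interpreted as SELECT *
FROM trips;

-- Wildcard selection while excluding one column
SELECT * EXCLUDE (store_and_fwd_flag) FROM trips;

-- Reusable column aliases, also usable in WHERE, plus a trailing comma
SELECT
    fare_amount + tip_amount AS total_paid,
    total_paid / trip_distance AS cost_per_mile,
FROM trips
WHERE total_paid > 10;

-- COLUMNS() with a regular expression: take the max of every *_amount column
SELECT max(COLUMNS('.*_amount')) FROM trips;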
DuckDB comes enabled with a range of affordances that make it an incredibly versatile data processing and management tool. In terms of data sources, DuckDB can read and write the data formats that are the mainstays of the data world: Parquet, CSV, and JSON (including newline-delimited JSON). In addition to reading these files from disk, DuckDB can read from remotely hosted files, and it can both read from and write to cloud object storage services using the Simple Storage Service (S3) API. DuckDB is also able to directly connect to and query from multiple databases at a time, including PostgreSQL, MySQL, and SQLite databases.
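A few hedged examples give a flavor of what this looks like in practice (the file names, URLs, and database names here are hypothetical):

-- Query a local CSV file directly; the schema is inferred automatically
SELECT * FROM 'events.csv';

-- Aggregate over a glob of Parquet files in S3 object storage
SELECT count(*) FROM read_parquet('s3://my-bucket/events/*.parquet');

-- Read newline-delimited JSON from a remote URL
SELECT * FROM read_json_auto('https://example.com/data/events.ndjson');

-- Attach a PostgreSQL database and query its tables in place
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=appdb' AS pg (TYPE postgres);
SELECT * FROM pg.public.users LIMIT 10;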
DuckDB also has tight integrations with in-memory data formats that are commonly used in the analytical data ecosystem, including pandas and Polars dataframes and R dataframes, as well as Apache Arrow tables. DuckDB provides the ability to query directly from these in-memory data structures, as well as export to them.
DuckDB’s rich set of official clients also means that it can be used in a range of contexts beyond Python- and R-based workflows. Of particular note here is the DuckDB WebAssembly (Wasm) client, which enables developers to readily integrate DuckDB into web apps that can be published to the internet for anyone to access with a browser.
Together, all these capabilities make DuckDB a veritable data Swiss Army knife for working with analytical data, opening up many interesting applications that we have only just scratched the surface of.
DuckDB offers extensive support for composing complex queries through SQL, with a particular emphasis on features commonly used in analytical workloads. DuckDB has support for optimized aggregation and join operations, column indexes, window functions, and full-text search. DuckDB supports a wide range of functions for working with different types of data, including numeric operations, date and time operations, date formatting, text operations, bit strings, and string pattern matching, including regular expressions. Analytical workloads are further enabled by SQL commands such as SAMPLE, which provides convenient probabilistic sampling of datasets, the PIVOT command for creating pivot tables, ASOF joins for interpolating missing values when joining time series data, and the QUALIFY clause, for convenient filtering of window function results.
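To give a feel for a couple of these, here's a short sketch, again using a hypothetical trips table:

-- Probabilistic sampling: examine roughly 1% of a large table
SELECT * FROM trips USING SAMPLE 1%;

-- QUALIFY: filter directly on a window function result, keeping
-- only the three most expensive trips per vendor
SELECT
    vendor_id,
    fare_amount,
    row_number() OVER (PARTITION BY vendor_id ORDER BY fare_amount DESC) AS fare_rank
FROM trips
QUALIFY fare_rank <= 3;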
DuckDB also has a rich array of data types, which, in addition to those you’d expect, such as numeric, date, and text types, also includes handy types such as INTERVAL for date ranges, ENUM for enumerations, and powerful nested data types for holding multiple values, which include ARRAY, LIST, STRUCT, MAP, and UNION. DuckDB also offers support for analytical applications that involve working with geospatial data, using its spatial extension.
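The nested types in particular are easy to construct and pick apart inline. A small illustrative sketch, with made-up values:

SELECT
    ['mallard', 'mandarin', 'wood'] AS ducks,         -- LIST literal
    ducks[1] AS first_duck,                           -- 1-based indexing, reusing the alias
    [upper(d) FOR d IN ducks] AS loud_ducks,          -- list comprehension
    {'species': 'mallard', 'mass_g': 1100} AS duck,   -- STRUCT literal
    duck.species AS species;                          -- dot-notation attribute access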
As this book was nearing completion, DuckDB released a vector similarity search extension, which enables using DuckDB’s ARRAY data type for machine learning and data science applications involving vector embeddings and semantic similarity search.
DuckDB is open sourced under the permissive MIT license, making it readily adoptable and extensible for a wide range of commercial and non-commercial applications. The community that has formed around DuckDB has generated many valuable open source extensions, integrations, and alternative DuckDB clients for different languages and runtimes. This rich ecosystem of DuckDB-oriented projects is indicative of and has been a powerful catalyst for the enthusiasm behind DuckDB.
As we have already discussed, when it comes to databases, there is no one-size-fits-all solution. This means that DuckDB won’t always be the right tool for the job.
Since it is optimized for analytical workloads running on a single machine, DuckDB has been intentionally designed to perform well under a specific set of access patterns, which you should confirm are acceptable for your use case. If your workloads correspond more to the OLTP paradigm, with many individual record transactions, including frequent writes, then DuckDB, which is optimized for OLAP workloads consisting mostly of read access and column-oriented queries, is likely not the best tool for you. If you're after an in-process DBMS that's optimized for OLTP workloads, then SQLite is hard to beat. Another specific consideration is concurrency: only one process at a time can open a DuckDB database with write access. If you need multiple processes to be able to read from a DuckDB database concurrently, each must open it in read-only mode, as in the sketch below.
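For example (with a hypothetical database file and table name):

-- Attach an existing database file in read-only mode, allowing
-- multiple processes to read from it concurrently
ATTACH 'analytics.duckdb' AS analytics (READ_ONLY);
SELECT count(*) FROM analytics.trips;

The DuckDB CLI offers equivalent behavior via its -readonly flag when opening a database file.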
DuckDB’s performance is truly impressive, allowing you to push the limits of what a single machine can do, arguably redefining what actually constitutes big data for analytical workloads. However, it is ultimately a database that operates in-process on a single machine, and so there are limits to how far it can be pushed. If your target workloads involve regularly processing petabyte-scale datasets, then you will likely need to use some form of distributed data processing-enabled platform.
Now that we’ve established what DuckDB is and when it makes sense to use it, we’ll start to get more practical by looking at DuckDB’s deployment options and how you can get started working with it.
In this section, we’ll look at ways you can use and integrate DuckDB into your analytical workflows, operational infrastructure, and data products. We’ll start by outlining the different clients available for DuckDB, and then we’ll go through how to get started working with the DuckDB command-line interface (CLI) on your own machine.
The data technology landscape is a big place, full of a diverse range of data practitioners with different skill sets and a wide range of tools built on a heterogeneous set of technologies. To cater to this diversity, DuckDB is made available via a number of different official client APIs, covering users of Python, R, JavaScript, Swift, Julia, Java, C, and C++. In the case of JavaScript, there are two clients: one for Node.js, oriented around backend applications, and one for Wasm, which allows DuckDB to run entirely within any modern web browser. Additionally, DuckDB is also made available as a cross-platform CLI, which is distributed as an executable that can be run virtually anywhere you have access to a command line. At the time of writing, there are also community-supported clients for Rust, Go, C#, Ruby, Common Lisp, Zig, and Crystal. As this list is ever-growing, we encourage you to consult the DuckDB documentation for an up-to-date list of official DuckDB clients and third-party contributed clients: https://duckdb.org/docs/api/overview.
This large selection of clients and integrations provides good coverage over languages and runtimes that meet the needs of a wide range of data practitioners. Here are the clients we’ll cover in this book:
- The DuckDB CLI is a cross-platform executable that gives you a CLI for interacting with DuckDB databases via SQL. It's compiled without any external dependencies, meaning that you can run it virtually anywhere you have access to a terminal. Its ease of installation and portability make it a great way to get started with DuckDB, which is why we will be making use of it for many of the hands-on examples in this book. We'll walk through how to install it and get started using it shortly.
- The DuckDB Python client enables Python processes to readily communicate with DuckDB databases. It offers a number of distinct APIs for querying and interacting with DuckDB, making it suitable for a range of applications, spanning the spectrum of interactive data analysis to building data infrastructure and analytical data products. In Chapter 7, we introduce the DuckDB Python client, focusing in particular on how to connect to DuckDB databases. In Chapter 8, we go on a deeper dive into the DuckDB Python client, focusing on two of the APIs that it exposes: the DuckDB-specific Relational API and the Python Database API (DB-API). In Chapter 11, we'll use the Python client again, this time to perform EDA of a dataset in a Jupyter Notebook. Finally, in Chapter 12's Alternative DuckDB query interfaces section, we touch on DuckDB's Spark API—another component of the DuckDB Python client—which enables interacting with DuckDB databases using PySpark queries.
- The DuckDB R client enables R sessions to connect to and work with DuckDB databases. The client provides support for connecting to DuckDB via the R database interface, as well as dbplyr integration, meaning that R users can query DuckDB databases using the powerful and popular dplyr interface, as an alternative to SQL. Along with DuckDB's core features, this makes DuckDB a powerful component of an R analytical toolchain. We cover all this in Chapter 9, where we go on a tour of the DuckDB R client.
- The DuckDB Wasm client is a full DuckDB client that has been compiled to run on Wasm, which is a virtual machine (VM) that runs on all modern browsers. With DuckDB Wasm, you can build web apps in JavaScript that can interact with DuckDB databases, running completely on client devices. This opens up a range of compelling possibilities for building lean analytical data apps with low-latency response times. In Chapter 12's DuckDB-powered data apps section, we'll unpack these possibilities further, identifying contexts where you might want to consider adopting DuckDB Wasm for building data apps. We'll also cover using the DuckDB Web shell, a DuckDB CLI that runs completely within the browser, which you can try out online: https://shell.duckdb.org.
- DuckDB provides clients for both the Open Database Connectivity (ODBC) API and the Java Database Connectivity (JDBC) API. These are both important workhorses in the data ecosystem, being frequently used to connect analytical data applications, such as BI tools, to databases for querying. We discuss both these DuckDB integration targets in the DuckDB integration section in Chapter 12.
- The DuckDB Arrow Database Connectivity (ADBC) client provides an implementation of the ADBC API. This plays an analogous role to the JDBC and ODBC clients, enabling applications to connect to DuckDB databases as a data source, with the notable difference that the ADBC client makes use of Apache Arrow, an in-memory columnar data format. This is a much better fit for analytical applications, addressing the inefficiencies associated with the use of the OLTP-oriented JDBC and ODBC APIs. We'll discuss the DuckDB ADBC client in Chapter 12's DuckDB integration section.
Next, we'll get the DuckDB CLI installed so that we're ready to dive into some hands-on DuckDB examples.
The DuckDB CLI is made available for Windows, macOS, and Linux operating systems. For up-to-date installation options and instructions, go to the DuckDB installation page and ensure that you have the appropriate operating system for your machine selected: https://duckdb.org/docs/installation. You can choose between using a package manager to install DuckDB onto your system, or directly downloading a ZIP file that contains the DuckDB CLI executable. If you haven’t used the package manager indicated in the instructions for your operating system, we suggest you take the direct download approach. Please proceed now to either install DuckDB using the package manager or download the DuckDB CLI ZIP file. If downloading directly, once the file has downloaded, unzip it and extract the DuckDB executable file into the directory you plan to work in.
Which version of DuckDB?
In this book, the output you’ll see comes from the most recent version of DuckDB available at the time of writing (April 2024), which is 0.10.2. DuckDB Labs has indicated that the intention is for no new features to be added between this release and the 1.0.0 release, with the focus being on stability and robustness. We therefore recommend that you use the most recent version of DuckDB to work through these exercises. You may see some minor differences in the output of commands and error messages with later versions; however, this is unlikely to interfere with the exercises. If you do come across any unexpected behavior, as always, the DuckDB documentation should be considered the source of truth: https://duckdb.org/docs
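Once installed, a quick way to confirm that everything is working is to launch the CLI and check the version (this assumes the duckdb executable is on your PATH or in your current working directory):

$ duckdb
D SELECT version();  -- should report v0.10.2 or later
D .quit

Running duckdb with no arguments starts a transient in-memory database; passing a file name instead opens or creates a persistent database file.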
