Description

Apache Arrow is an open source, columnar in-memory data format designed for efficient data processing and analytics. This book harnesses the author’s 15 years of experience to show you a standardized way to work with tabular data across various programming languages and environments, enabling high-performance data processing and exchange.
This updated second edition gives you an overview of the Arrow format, highlighting its versatility and benefits through real-world use cases. It guides you through enhancing data science workflows, optimizing performance with Apache Parquet and Spark, and ensuring seamless data translation. You’ll explore data interchange and storage formats, and Arrow's relationships with Parquet, Protocol Buffers, FlatBuffers, JSON, and CSV. You’ll also discover Apache Arrow subprojects, including Flight, Flight SQL, Database Connectivity, and nanoarrow. You’ll learn to streamline machine learning workflows, use Arrow Dataset APIs, and integrate with popular analytical data systems such as Snowflake, Dremio, and DuckDB. The latter chapters provide real-world examples and case studies of products powered by Apache Arrow, offering practical insights into its applications.
By the end of this book, you’ll have all the building blocks to create efficient and powerful analytical services and utilities with Apache Arrow.




In-Memory Analytics with Apache Arrow

Accelerate data analytics for efficient processing of flat and hierarchical data structures

Matthew Topol

In-Memory Analytics with Apache Arrow

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Apeksha Shetty

Publishing Product Manager: Chayan Majumdar

Book Project Manager: Aparna Nair

Senior Content Development Editor: Shreya Moharir

Technical Editor: Sweety Pagaria

Copy Editor: Safis Editing

Proofreader: Shreya Moharir

Indexer: Pratik Shirodkar

Production Designer: Prafulla Nikalje

DevRel Marketing Executive: Nivedita Singh

First published: June 2022

Second edition: September 2024

Production reference: 1060924

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK

ISBN 978-1-83546-122-8

www.packtpub.com

For my family, Kat and Haley, who managed to tolerate me the entire time I was writing this.

Also for Logan and Penny, my fuzzy coding companions who got me through so much. Their memory is a blessing.

Foreword

Since launching as an open source project in 2016, Apache Arrow has rapidly become the de facto standard for interoperability and accelerated in-memory processing for tabular data. We have broadened support to a dozen programming languages while expanding substantially beyond the project’s initial goal of defining a standardized columnar data format to create a true multi-language developer toolbox for creating high-performance data applications. While Arrow has helped greatly with improving interoperability and performance in heterogeneous systems (such as across programming languages or different kinds of execution engines), it is also increasingly being chosen as the foundation for building new data processing systems and databases. With Dremio as the first true “Arrow-native” system, we hope that many more production systems will become “Arrow-compatible” or “Arrow-native” over the coming years.

Part of Arrow’s success and the rapid growth of its developer community comes from the passion and time investment of its early adopters and most prolific core contributors. Matt Topol has been a driving force in the Go libraries for Arrow, and with this new book, he has made a significant contribution to making the whole project a lot more accessible to newcomers. The book goes in depth into the details of how different pieces of Arrow work while highlighting the many different building blocks that could be employed by an Arrow user to accelerate or simplify their application.

I am thrilled to see this updated second edition of this book as the Arrow project and its open source ecosystem continue to expand in new, impactful directions, even more than eight years since the project started. This was the first true “Arrow book” since the project’s founding, and it is a valuable resource for developers who want to explore different areas in depth and to learn how to apply new tools in their projects. I’m always happy to recommend it to new users of Arrow as well as existing users who are looking to deepen their knowledge by learning from an expert like Matt.

– Wes McKinney

Co-founder of Voltron Data and Principal Architect at Posit

Co-creator and PMC for Apache Arrow

Contributors

About the author

Matthew Topol is a member of the Apache Arrow Project Management Committee (PMC) and a staff software engineer at Voltron Data, Inc. Matt has worked in infrastructure, application development, and large-scale distributed system analytical processing for financial data. At Voltron Data, Matt’s primary responsibilities have been working on and enhancing the Apache Arrow libraries and associated sub-projects. In his spare time, Matt likes to bash his head against a keyboard, develop and run delightfully demented fantasy games for his victims—er—friends, and share his knowledge and experience with anyone interested enough to listen.

A very special thanks go out to my friends Hope and Stan, whose encouragement is the only reason I wrote a book in the first place. Finally, thanks go to my parents, who beam with pride every time I talk about this book. Thank you for your support and for being there through everything.

About the reviewers

Weston Pace is a maintainer for the Apache Arrow project and a member of the Arrow PMC and Substrait SMC. He has worked closely with the C++, Python, and Rust implementations of Apache Arrow. He has developed components in several of the systems described in this book, such as datasets and Acero. Weston is currently employed at LanceDB, where he is working on new Arrow-compatible storage formats to enable even more Arrow-native technology.

Jacob Wujciak-Jens is an Apache Arrow committer and an elected member of the Apache Software Foundation. His work at Voltron Data as a senior software release engineer has included pivotal roles in the Apache Arrow and Velox projects. During his tenure, he has developed a deep knowledge of the release processes, build systems, and inner workings of these high-profile open source software projects. Jacob has a passion for open source and its use, both in the open source community and industry. Holding a Master of Education in computer science and public health, he loves to share his knowledge, enriching the community and enhancing collaborative projects.

Raúl Cumplido is a PMC member of the Apache Arrow project and has been the release manager for the project for more than 10 releases now. He has worked on several areas of the project and has always been involved with open source communities, contributing mainly to Python-related projects. He’s one of the cofounders of the Python Spanish Association and has also been involved in the organization of several EuroPython and PyCon ES conferences. He currently works as a senior software release engineer at Voltron Data, where he contributes to the Apache Arrow project.

Table of Contents

Preface

Part 1: Overview of What Arrow is, Its Capabilities, Benefits, and Goals

1

Getting Started with Apache Arrow

Technical requirements

Understanding the Arrow format and specifications

Why does Arrow use a columnar in-memory format?

Learning the terminology and physical memory layout

Quick summary of physical layouts, or TL;DR

How to speak Arrow

Arrow format versioning and stability

Would you download a library? Of course!

Setting up your shooting range

Using PyArrow for Python

C++ for the 1337 coders

Go, Arrow, go!

Summary

References

2

Working with Key Arrow Specifications

Technical requirements

Playing with data, wherever it might be!

Working with Arrow tables

Accessing data files with PyArrow

Accessing data files with Arrow in C++

Bears firing arrows

Putting pandas in your quiver

Making pandas run fast

Keeping pandas from running wild

Polar bears use Rust-y arrows

Sharing is caring… especially when it’s your memory

Diving into memory management

Managing buffers for performance

Crossing boundaries

Summary

3

Format and Memory Handling

Technical requirements

Storage versus runtime in-memory versus message-passing formats

Long-term storage formats

In-memory runtime formats

Message-passing formats

Summing up

Passing your Arrows around

What is this sorcery?!

Producing and consuming Arrows

Learning about memory cartography

The base case

Parquet versus CSV

Mapping data into memory

Too long; didn’t read (TL;DR) – computers are magic

Leaving the CPU – using device memory

Starting with a few pointers

Device-agnostic buffer handling

Summary

Part 2: Interoperability with Arrow: The Power of Open Standards

4

Crossing the Language Barrier with the Arrow C Data API

Technical requirements

Using the Arrow C data interface

The ArrowSchema structure

The ArrowArray structure

Example use cases

Using the C data API to export Arrow-formatted data

Importing Arrow data with Python

Exporting Arrow data with the C Data API from Python to Go

Streaming Arrow data between Python and Go

What about non-CPU device data?

The ArrowDeviceArray struct

Using ArrowDeviceArray

Other use cases

Some exercises

Summary

5

Acero: A Streaming Arrow Execution Engine

Technical requirements

Letting Acero do the work for you

Input shaping

Value casting

Types of functions in Acero

Invoking functions

Using the C++ compute library

Using the compute library in Python

Picking the right tools

Adding a constant value to an array

Compute Add function

A simple for loop

Using std::for_each and reserve space

Divide and conquer

Always have a plan

Where does Acero fit?

Acero’s core concepts

Let’s get streaming!

Simplifying complexity

Summary

6

Using the Arrow Datasets API

Technical requirements

Querying multifile datasets

Creating a sample dataset

Discovering dataset fragments

Filtering data programmatically

Expressing yourself – a quick detour

Using expressions for filtering data

Deriving and renaming columns (projecting)

Using the Datasets API in Python

Creating our sample dataset

Discovering the dataset

Using different file formats

Filtering and projecting columns with Python

Streaming results

Working with partitioned datasets

Writing partitioned data

Connecting everything together

Summary

7

Exploring Apache Arrow Flight RPC

Technical requirements

The basics and complications of gRPC

Building modern APIs for data

Efficiency and streaming are important

Arrow Flight’s building blocks

Horizontal scalability with Arrow Flight

Adding your business logic to Flight

Other bells and whistles

Understanding the Flight Protobuf definitions

Using Flight, choose your language!

Building a Python Flight server

Building a Go Flight server

What is Flight SQL?

Setting up a performance test

Everyone gets a containerized development environment!

Running the performance test

Flight SQL, the new kid on the block

Summary

8

Understanding Arrow Database Connectivity (ADBC)

Technical requirements

ODBC takes an Arrow to the knee

Lost in translation

Arrow adoption in ODBC drivers

The benefits of standards around connectivity

The ADBC specification

ADBC databases

ADBC connections

ADBC statements

ADBC error handling

Using ADBC for performance and adaptability

ADBC with C/C++

Using ADBC with Python

Using ADBC with Go

Summary

9

Using Arrow with Machine Learning Workflows

Technical requirements

SPARKing new ideas on Jupyter

Understanding the integration of Arrow in Spark

Containerization makes life easier

SPARKing joy with Arrow and PySpark

Facehuggers implanting data

Setting up your environment

Proving the benefits by checking resource usage

Using Arrow with the standard tools for ML

More GPU, more speed!

Summary

Part 3: Real-World Examples, Use Cases, and Future Development

10

Powered by Apache Arrow

Swimming in data with Dremio Sonar

Clarifying Dremio Sonar’s architecture

The library of the gods…of data analysis

Spicing up your data workflows

Arrow in the browser using JavaScript

Gaining a little perspective

Taking flight with Falcon

An Influx of connectivity

Summary

11

How to Leave Your Mark on Arrow

Technical requirements

Contributing to open source projects

Communication is key

You don’t necessarily have to contribute code

There are a lot of reasons why you should contribute!

Preparing your first pull request

Creating and navigating GitHub issues

Setting up Git

Orienting yourself in the code base

Building the Arrow libraries

Creating the pull request

Understanding Archery and the CI configuration

Find your interest and expand on it

Getting that sweet, sweet approval

Finishing up with style!

C++ code styling

Python code styling

Go code styling

Summary

12

Future Development and Plans

Globetrotting with data – GeoArrow and GeoParquet

Collaboration breeds success

Expanding ADBC adoption

Final words

Index

Other Books You May Enjoy

Preface

To quote a famous blue hedgehog, Gotta Go Fast! When it comes to data, speed is important. Whether you’re collecting data, analyzing it, or developing utilities for others to do so, performance and efficiency are going to be huge factors in your technology choices – not just in the efficiency of the software itself, but also in development time. You need the right tools and the right technology, or you’re dead in the water.

The Apache Arrow ecosystem is developer-centric, and this book is no different. Get started with understanding what Arrow is and how it works, then learn how to utilize it in your projects. You’ll find code examples, explanations, and diagrams here, all with the express purpose of helping you learn. You’ll integrate your data sources with Python DataFrame libraries such as pandas or NumPy and utilize Arrow Flight to create efficient data services.

With real-world datasets, you’ll learn how to leverage Apache Arrow with Apache Spark and other technologies. Apache Arrow’s format is language-independent and organized so that analytical operations are performed extremely quickly on modern CPU and GPU hardware. Join the industry adoption of this open source data format and save yourself valuable development time creating high-performance, memory-efficient analytical workflows.

This book has been a labor of love to share knowledge. I hope you learn a lot from it! I sure did when writing it.

Who this book is for

This book is for developers, data analysts, and data scientists looking to explore the capabilities of Apache Arrow from the ground up. This book will also be useful for any engineers who are working on building utilities for data analytics, query engines, or otherwise working with tabular data, regardless of the language they are programming in.

What this book covers

Chapter 1, Getting Started with Apache Arrow, introduces you to the basic concepts underpinning Apache Arrow. It introduces and explains the Arrow format and the data types it supports, along with how they are represented in memory. Afterward, you’ll set up your development environment and run some simple code examples showing the basic operation of Arrow libraries.

Chapter 2, Working with Key Arrow Specifications, continues your introduction to Apache Arrow by explaining how to read both local and remote data files using different formats. You’ll learn how to integrate Arrow with the Python pandas and Polars libraries and how to utilize the zero-copy aspects of Arrow to share memory for performance.

Chapter 3, Format and Memory Handling, discusses the relationships between Apache Arrow and Apache Parquet, Feather, Protocol Buffers, JSON, and CSV data, along with when and why to use these different formats. Following this, the Arrow IPC format is introduced and described, along with an explanation of using memory mapping to further improve performance. Finally, we wrap up with some basic leveraging of Arrow on a GPU.

Chapter 4, Crossing the Language Barrier with the Arrow C Data API, introduces the titular C Data API for efficiently passing Apache Arrow data between different language runtimes and devices. This chapter will cover the struct definitions utilized for this interface along with describing use cases that make it beneficial.

Chapter 5, Acero: A Streaming Arrow Execution Engine, describes how to utilize the reference implementation of an Arrow computation engine named Acero. You’ll learn when and why you should use the compute engine to perform analytics rather than implementing something yourself and why we’re seeing Arrow showing up in many popular execution engines.

Chapter 6, Using the Arrow Datasets API, demonstrates querying, filtering, and otherwise interacting with multi-file datasets that can potentially be across multiple sources. Partitioned datasets are also covered, along with utilizing Acero to perform streaming filtering and other operations on the data.

Chapter 7, Exploring Apache Arrow Flight RPC, examines the Flight RPC protocol and its benefits. You will be walked through building a simple Flight server and client in multiple languages to produce and consume tabular data.

Chapter 8, Understanding Arrow Database Connectivity (ADBC), introduces and explains an Apache Arrow-based alternative to ODBC/JDBC and why it matters for the ecosystem. You will be walked through several examples with sample code that interact with multiple database systems such as DuckDB and PostgreSQL.

Chapter 9, Using Arrow with Machine Learning Workflows, integrates multiple concepts that have been covered to explain the various ways that Apache Arrow can be utilized to improve parts of data pipelines and the performance of machine learning model training. It will describe how Arrow’s interoperability and defined standards make it ideal for use with Spark, GPU compute, and many other tools.

Chapter 10, Powered by Apache Arrow, provides a few examples of current real-world usage of Apache Arrow, such as Dremio, Spice.AI, and InfluxDB.

Chapter 11, How to Leave Your Mark on Arrow, provides a brief introduction to contributing to open source projects in general, but specifically how to contribute to the Arrow project itself. You will be walked through finding starter issues, setting up your first pull request to contribute, and what to expect when doing so. To that end, this chapter also contains various instructions on locally building Arrow C++, Python, and Go libraries from source to test your contributions.

Chapter 12, Future Development and Plans, wraps up the book by examining the features that are still in development at the time of writing. This includes geospatial integrations with GeoArrow and GeoParquet along with expanding Arrow Database Connectivity (ADBC) adoption. Finally, there are some parting words and a challenge from me to you.

To get the most out of this book

It is assumed that you have a basic understanding of writing code in at least one of C++, Python, or Go to benefit from and use the code snippets. You should know how to compile and run code in the desired language. Some familiarity with basic concepts of data analysis will help you get the most out of the scenarios and use cases explained in this book. Beyond this, concepts such as tabular data and installing software on your machine are assumed to be understood rather than explained.

Software/hardware covered in the book, with operating system requirements:

An internet-connected computer

Git: Windows, macOS, or Linux

C++ compiler capable of C++17 or higher: Windows, macOS, or Linux

Python 3.8 or higher: Windows, macOS, or Linux

conda/mamba (optional): Windows, macOS, or Linux

vcpkg (optional): Windows

MSYS2 (optional): Windows

CMake 3.16 or higher: Windows, macOS, or Linux

make or ninja: macOS or Linux

Docker: Windows, macOS, or Linux

Go 1.21 or higher: Windows, macOS, or Linux

The sample data is in the book’s GitHub repository. You’ll need to use Git Large File Storage (LFS) or a browser to download the large data files. There are also a couple of larger sample data files in publicly accessible AWS S3 buckets. The book will provide a link to download the files when necessary. Code examples are provided in C++, Python, and Go.

If you are using the digital version of this book, we advise you to download the complete code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Take your time, enjoy, and experiment in all kinds of ways, and please, have fun with the exercises!

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/In-Memory-Analytics-with-Apache-Arrow-Second-Edition. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “We’re using PyArrow in this example, but if you have the ArrowDeviceArray struct definition, you could create and populate the struct without ever needing to directly include or link against the Arrow libraries!”

A block of code is set as follows:

>>> import numba.cuda
>>> import pyarrow as pa
>>> from pyarrow import cuda
>>> import numpy as np
>>> from pyarrow.cffi import ffi

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

std::unique_ptr<arrow::ArrayBuilder> tmp;
// returns a status, handle the error case
arrow::MakeBuilder(arrow::default_memory_pool(),
                   st_type, &tmp);
std::shared_ptr<arrow::StructBuilder> builder;
builder.reset(static_cast<arrow::StructBuilder*>(
              tmp.release()));

Any command-line input or output is written as follows:

$ mkdir arrow_chapter1 && cd arrow_chapter1
$ go mod init arrow_chapter1
$ go get -u github.com/apache/arrow/go/v17/arrow@latest

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “You'll notice that for the Filter and Project nodes in the figure, since they each use a compute expression, there is a sub-tree of the execution graph representing the expression tree.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read In-Memory Analytics with Apache Arrow, Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/9781835461228

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly

Part 1: Overview of What Arrow is, Its Capabilities, Benefits, and Goals

This section is an introduction to Apache Arrow as a format specification and a project, the benefits it claims, and the goals it’s trying to achieve. You’ll also find a high-level overview of basic use cases and examples.

This part has the following chapters:

Chapter 1, Getting Started with Apache Arrow
Chapter 2, Working with Key Arrow Specifications
Chapter 3, Format and Memory Handling

1

Getting Started with Apache Arrow

Regardless of whether you’re a data scientist/engineer, a machine learning (ML) specialist, or a software engineer trying to build something to perform data analytics, you’ve probably heard of or read about something called Apache Arrow and either looked for more information or wondered what it was. Hopefully, this book can serve as a springboard in understanding what Apache Arrow is and isn’t, as well as a reference book to be continuously utilized so that you can supercharge your analytical capabilities.

For now, we’ll start by explaining what Apache Arrow is and what you will use it for. Following that, we will walk through the Arrow specifications, set up a development environment where you can play around with the various Apache Arrow libraries, and walk through a few simple exercises so that you can get a feel for how to use them.

In this chapter, we’re going to cover the following topics:

Understanding the Arrow format and specifications
Why does Arrow use a columnar in-memory format?
Learning the terminology and the physical memory layout
Arrow format versioning and stability
Setting up your shooting range

Technical requirements

For the portion of this chapter that describes how to set up a development environment for working with various Arrow libraries, you’ll need the following:

Your preferred integrated development environment (IDE) – for example, VS Code, Sublime, Emacs, or Vim
Plugins for your desired language (optional but highly recommended)
An interpreter or toolchain for your desired language(s):
Python 3.8+: pip and venv and/or pipenv
Go 1.21+
C++ compiler (capable of compiling C++17 or newer)

Understanding the Arrow format and specifications

The Apache Arrow documentation states the following [1]:

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

Well, that’s a lot of technical jargon! Let’s start from the top. Apache Arrow (just Arrow for brevity) is an open source project from the Apache Software Foundation (https://apache.org) that is released under the Apache License, version 2.0 [2]. It was co-created by Jacques Nadeau and Wes McKinney, the creator of pandas, and first released in 2016. Simply put, Arrow is a collection of libraries and specifications that make it easy to build high-performance software utilities for processing and transporting large datasets. It consists of a collection of libraries related to in-memory data processing, including specifications for memory layouts and protocols for sharing and efficiently transporting data between systems and processes. When we’re talking about in-memory data processing, we’re talking exclusively about processing data in RAM and eliminating slow data access (as well as redundantly copying and converting data) wherever possible to improve performance. This is where Arrow excels and provides libraries to support this with utilities for streaming and transportation to speed up data access.

When working with data, there are two primary situations to consider, and each has different needs: the in-memory format and the on-disk format. When data is stored on disk, the biggest concerns are the size of the data and the input/output (I/O) cost to read it into the main memory before you can operate on it. As a result, formats for data on disk tend to focus much more on increasing I/O throughput, such as compressing the data to make it smaller and faster to read into memory. One example of this might be the Apache Parquet format, which is a columnar on-disk file format. Arrow, by contrast, is not an on-disk format; its focus is the in-memory format, which targets processing efficiency with numerous tactics such as cache locality and vectorization of computation.

The primary goal of Arrow is to become the lingua franca of data analytics and processing – the One Format to Rule Them All, so to speak. Different databases, programming languages, and libraries tend to implement and use separate internal formats for managing data, which means that any time you’re moving data between these components for different uses, you’re paying a cost to serialize and deserialize that data every time. Not only that but lots of time and resources get spent reimplementing common algorithms and processing in those different data formats over and over. If we can standardize on an efficient, feature-rich internal data format that can be widely adopted and used instead, this excess computation and development time is no longer necessary. Figure 1.1 shows a simplified diagram of multiple systems, each with their own data formats, having to be copied and/or converted for the different components to work with each other:

Figure 1.1 – Copy and convert components

In many cases, the serialization and deserialization processes can end up taking nearly 90% of the processing time in such a system, preventing you from spending that CPU time on analytics. Alternatively, if every component is using Arrow’s in-memory format, you end up with a system similar to the one shown in Figure 1.2, where the data can be transferred between components at little-to-no cost. All the components can either share memory directly or send the data as-is without having to convert between different formats:

Figure 1.2 – Sharing Arrow memory between components

At this point, there’s no need for the different components and systems to implement custom connectors or re-implement common algorithms and utilities. The same libraries and connectors can be utilized, even across programming languages and process barriers, by sharing memory directly so that it refers to the same data rather than copying multiple times between them. An example of this idea will be covered in Chapter 8, Understanding Arrow Database Connectivity (ADBC), where we’ll consider a specification for leveraging common database drivers in a cross-platform way to enable efficient interactions using Arrow-formatted data.

Most data processing systems now use distributed processing by breaking the data into chunks and sending those chunks across the network to various workers. So, even if we can share memory across processes on a box, there’s still the cost to send it across the network. This brings us to the final piece of the puzzle: the format of raw Arrow data on the wire is the same as it is in memory. You can avoid having to deserialize that data before you can use it (skipping a copy) or reference the memory buffers you were operating on to send it across the network without having to serialize it first. Only a bit of metadata needs to be sent along with the raw data buffers, and interfaces that perform zero copies can be built on top of this to achieve performance benefits by reducing memory usage and improving throughput. We’ll cover this more directly in Chapter 3, Format and Memory Handling, so look forward to it!

Let’s quickly recap the features of the Arrow format we just described before moving on:

Using the same high-performance internal format across components allows for much more code reuse in libraries instead of the need to reimplement common workflows.
The Arrow libraries provide mechanisms to directly share memory buffers to reduce copying between processes by using the same internal representation, regardless of the language. This is what’s being referred to whenever you see the term zero-copy.
The wire format is the same as the in-memory format to eliminate serialization and deserialization costs when sending data across networks between components of a system.

Now, you might be thinking, “Well, this sounds too good to be true!” And of course, being skeptical of promises like this is always a good idea. The community around Arrow has done a ton of work over the years to bring these ideas and concepts to fruition. The project itself provides and distributes libraries in a variety of different programming languages so that projects that want to incorporate and/or support the Arrow format don’t need to implement it themselves. Above and beyond the interaction with Arrow-formatted data, the libraries provide a significant amount of utility in assisting with common processes such as data access and I/O-related optimizations. As a result, the Arrow libraries can be useful for projects, even if they don’t utilize the Arrow format themselves.

Here’s just a quick sample of use cases where using Arrow as the internal/intermediate data format can be very beneficial:

SQL execution engines (such as Dremio Sonar, InfluxDB, or Apache DataFusion)
Data analysis utilities and pipelines (such as pandas or Apache Spark)
Streaming and message queue systems (such as Apache Kafka or Storm)
Storage systems and formats (such as Apache Parquet, Cassandra, and Kudu)

As for how Arrow can help you, it depends on which piece of the data puzzle you work with. The following are a few different roles that work with data and show how using Arrow could potentially be beneficial; it’s by no means a complete list though:

If you’re a data scientist:
You can utilize Arrow via Polars or pandas and NumPy integration to significantly improve the performance of your data manipulations.
If the tools you use integrate Arrow support, you can gain significant speed-ups for your queries and computations by using Arrow directly to reduce copies and/or serialization costs.
If you’re a data engineer specializing in extract, transform, and load (ETL):
The higher adoption of Arrow as an internal and externally facing format can make it easier to integrate with many different utilities.
By using Arrow, data can be shared between processes and tools, with shared memory increasing the tools available to you for building pipelines, regardless of the language you’re operating in. You could take data from Python, use it in Spark, and then pass it directly to the Java virtual machine (JVM) without paying the cost of copying between them.
If you’re a software engineer or ML specialist building computation tools and utilities for data analysis:
Arrow, as an internal format, can be used to improve your memory usage and performance by reducing serialization and deserialization between components.
Understanding how to best utilize the data transfer protocols can improve your ability to parallelize queries and access your data, wherever it might be.
Because Arrow can be used for any sort of tabular data, it can be integrated into many different areas of data analysis and computation pipelines and is versatile enough to be beneficial as an internal and data transfer format, regardless of the shape of your data.

Now that you know what Arrow is, let’s dig into its design and how it delivers on the aforementioned promises of high-performance analytics, zero-copy sharing, and network communication without serialization costs. First, you’ll see why a column-oriented memory representation was chosen for Arrow’s internal format. In later chapters, we’ll cover specific integration points, explicit examples, and transfer protocols.

Why does Arrow use a columnar in-memory format?

There is often a lot of debate surrounding whether a database should be row-oriented or column-oriented, but this primarily refers to the on-disk format of the underlying storage files. Arrow’s data format is different from most cases discussed so far since it uses a columnar organization of data structures in memory directly. If you’re not familiar with columnar as a term, let’s take a look at what it means. First, imagine the following table of data:

Figure 1.3 – Sample data table

Traditionally, if you were to read this table into memory, you’d likely have some structure to represent a row and then read the data in one row at a time – maybe something like struct { string archer; string location; int year }. The result is that you have the memory grouped closely together for each row, which is great if you always want to read all the columns for every row or are always writing a full row every time. However, if this were a much bigger table, and you just wanted to find out the minimum and maximum years or any other column-wise analytics, such as unique locations, you would have to read the whole table into memory and then jump around in memory, skipping the fields you didn’t care about so that you could read the value for each row of one column.

Most operating systems, while reading data into main memory and CPU caches, will attempt to make predictions about what memory it is going to need next. In our example table of archers, consider how many memory pages (refer to Chapter 3 for more details) of data would have to be accessible and traversed to get a list of unique locations if the data were organized in row or column orientations:

Figure 1.4 – Row versus columnar memory buffers

As shown in Figure 1.4, the columnar format keeps the data organized by column instead of by row. As a result, operations such as grouping, filtering, or aggregations based on column values become much more efficient to perform since the entire column is already contiguous in memory. Considering memory pages again, it’s plain to see that for a large table, there would be significantly more pages that need to be traversed to get a list of unique locations from a row-oriented buffer than a columnar one. Fewer page faults and more cache hits mean increased performance and a happier CPU. Computational routines and query engines tend to operate on subsets of the columns for a dataset rather than needing every column for a given computation, making it significantly more efficient to operate on columnar data.

If you look closely at the construction of the column-oriented data buffer on the right-hand side of Figure 1.4, you’ll see how it benefits the queries I mentioned earlier. If we wanted all the archers that are in Europe, we could easily scan through just the location column and discover which rows are the ones we want, and then spin through just the archer block and grab only the rows that correspond to the row indexes we found. This will come into play again when we start looking at the physical memory layout of Arrow arrays; since the data is column-oriented, it makes it easier for the CPU to predict instructions to execute and maintain this memory locality between instructions.
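To make this concrete, here’s a minimal PyArrow sketch of that kind of column-wise filtering. The column names match Figure 1.3, but the sample rows and the "europe" stand-in value are purely illustrative:

import pyarrow as pa
import pyarrow.compute as pc

# each column is stored in its own contiguous buffer
table = pa.table({
    "archer": ["legolas", "oliver", "merida", "lara", "artemis"],
    "location": ["murkwood", "europe", "scotland", "europe", "greece"],
    "year": [1954, 1941, 2012, 1996, -600],
})

# scan only the location column to find the matching row indexes...
mask = pc.equal(table["location"], "europe")
# ...then pull just those rows from the archer column
print(table.filter(mask)["archer"])

# column-wise analytics such as min/max only touch a single column
print(pc.min_max(table["year"]))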

By keeping the column data contiguous in memory, it allows the computations to be vectorized. Most modern processors have single instruction, multiple data (SIMD) instructions available that can be taken advantage of for speeding up computations and require the data to be in a contiguous block of memory so that they can operate on it. This concept can be found heavily utilized by graphics cards, and Arrow provides libraries to take advantage of graphics processing units (GPUs) precisely because of this. Consider an example where you might want to multiply every element of a list by a static value, such as performing a currency conversion on a column of prices while using an exchange rate:

Figure 1.5 – SIMD/vectorized versus non-vectorized

In Figure 1.5, you can see the following:

The left-hand side of the figure shows that an ordinary CPU performing the computation in a non-vectorized fashion requires loading each value into a register, multiplying it with the exchange rate, and then saving the result back into RAM.
On the right-hand side of the figure, you can see that vectorized computation, such as using SIMD, performs the same operation on multiple different inputs at the same time, enabling a single load to multiply and save to get the result for the entire group of prices. Being able to vectorize a computation has various constraints; often, one of those constraints is requiring the data being operated on to be in a contiguous chunk of memory, which is why columnar data is much easier to do this with.
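If you want to try the example from Figure 1.5 yourself, the compute functions that ship with the Arrow libraries already dispatch to vectorized kernels where the hardware supports them. Here is a minimal PyArrow sketch; the prices and the exchange rate are made-up values:

import pyarrow as pa
import pyarrow.compute as pc

prices = pa.array([10.00, 23.50, 7.25, 115.00])  # a contiguous column of prices
exchange_rate = 0.92                             # illustrative conversion rate

# multiply runs a vectorized kernel over the whole values buffer rather
# than looping over the prices one element at a time in Python
converted = pc.multiply(prices, exchange_rate)
print(converted)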

SIMD versus multithreading

If you’re not familiar with SIMD, you might be wondering how it differs from another parallelization technique: multithreading. Multithreading operates at a higher conceptual level than SIMD. Each thread has its own set of registers and memory space representing its execution context. These contexts could be spread across separate CPU cores or possibly interleaved by a single CPU core switching whenever it needs to wait for I/O. SIMD is a processor-level concept that refers to the specific instructions being executed. Simply put, multithreading involves multitasking and SIMD involves doing less work to achieve the same result.

Another benefit of utilizing column-oriented data comes into play when considering compression techniques. At some point, your data will become large enough that sending it across the network could become a bottleneck, purely due to size and bandwidth. With the data being grouped in columns that are all the same type as contiguous memory, you end up with significantly better compression ratios than you would get with the same data in a row-oriented configuration, simply because data of the same type is easier to compress together than data of different types.

Learning the terminology and physical memory layout

As mentioned previously, the Arrow columnar format specification includes definitions of the in-memory data structures, metadata serialization, and protocols for data transportation. The format itself has a few key promises:

Data adjacency for sequential access
O(1) (constant time) random access
SIMD and vectorization-friendly
Relocatable, allowing for zero-copy access in shared memory

To ensure we’re all on the same page, here’s a quick glossary of terms that are used throughout the format specification and the rest of this book:

Array: A list of values with a known length of the same type.
Slot: The value in an array identified by a specific index.
Buffer/contiguous memory region: A single contiguous block of memory with a given length.
Physical layout: The underlying memory layout for an array without accounting for the interpretation of the logical value. For example, a 32-bit signed integer array and a 32-bit floating-point array are both laid out as contiguous chunks of memory where each value is made up of four contiguous bytes in the buffer.
Parent/child arrays: Terms used for the relationship between physical arrays when describing the structure of a nested type. For example, a struct parent array has a child array for each of its fields.
Primitive type: A type that has no child types and so consists of a single array, such as fixed-bit-width arrays (for example, int32) or variable-size types (for example, string arrays).
Nested type: A type that depends on one or more other child types. Nested types are only equal if their child types are also equal (for example, List<T> and List<U> are equal if T and U are equal).
Data type: A particular form of interpreting the values in an array that’s implemented using a specific physical layout. For example, the decimal data type stores values as 16 bytes per value in a fixed-size binary layout. Similarly, a timestamp data type stores values using a 64-bit fixed-size layout.

Now that we’ve got the fancy words out of the way, let’s have a look at how we lay out these arrays in memory. An array or vector is defined by the following information:

A data type (typically identified by an enum value and metadata)
A group of buffers
A length as a 64-bit signed integer
A null count as a 64-bit signed integer
Optionally, a dictionary for dictionary-encoded arrays (more on these later in this chapter)

To define a nested array type, there would also be one or more sets of this information that would then be the child arrays. Arrow defines a series of data types and each one has a well-defined physical layout in the specification. For the most part, the physical layout just affects the sequence of buffers that make up the raw data. Since there is a null count in the metadata, it comes as a given that any value in an array may be null data rather than having a value, regardless of the type. Apart from the union data types, all the arrays have a validity bitmap as one of their buffers, which can optionally be left out if there are no nulls in the array. As might be expected, 1 in the corresponding bit means it is a valid value in that index, and 0 means it’s null.

Quick summary of physical layouts, or TL;DR

When working with Arrow-formatted data, it’s important to understand how it is physically laid out in memory. Understanding these physical layouts can provide ideas for efficiently constructing (or deconstructing) Arrow data when developing applications. Here’s a quick summary:

Figure 1.6 – Table of physical layouts

Let’s consider the physical memory layouts that are used by the Arrow format. This will be primarily useful for when you’re either implementing the Arrow specification yourself (or contributing to the libraries) or if you simply want to know what’s going on under the hood and how it all works.

Primitive fixed-length value arrays

Let’s look at an example of a 32-bit integer array that looks like this: [1, null, 2, 4, 8]. What would the physical layout look like in memory based on the information you’ve been provided with so far (Figure 1.7)? Something to keep in mind is that all of the buffers should be padded to a multiple of 64 bytes for alignment, which matches the largest SIMD instructions available on widely deployed x86 architecture processors (Intel AVX-512), and that the values for null slots are marked UNF or undefined. Implementations are free to zero out the data in null slots if they desire, and many do. However, since the format specification does not define anything, the data in a null slot could technically be anything:

Figure 1.7 – Layout of a primitive int32 array

This same conceptual layout is the case for any fixed-size primitive type, with the only exception being that the validity buffer can be left out entirely if there are no nulls in the array. For any data type that is physically represented as simple fixed-bit-width values, such as integers, floating-point values, fixed-size binary arrays, or even timestamps, it will use this layout in memory. The padding for the buffers in the subsequent diagrams will be left out just to avoid cluttering them.
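If you’d like to poke at this layout from Python, PyArrow exposes an array’s underlying buffers directly. This is just an exploratory sketch; the exact buffer sizes you see will include the alignment padding described previously:

import pyarrow as pa

arr = pa.array([1, None, 2, 4, 8], type=pa.int32())
print(arr.null_count)            # 1

validity, data = arr.buffers()   # primitive layout: validity bitmap + values
# bits are packed least-significant first; bit 1 is 0 for the null at index 1
print(bin(validity.to_pybytes()[0]))
print(validity.size, data.size)  # sizes reflect allocation and padding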

Variable-length binary arrays

Things get slightly trickier when you’re dealing with variable-length value arrays, which are generally used for variable-size binary or string data. In this layout, every value can consist of 0 or more bytes, and, in addition to the data buffer, there will also be an offsets buffer. Using an offsets buffer allows the entirety of the data of the array to be held in a single contiguous memory buffer. The only lookup cost for finding the value of a given index is to look up the indexes in the offsets buffer to find the correct slice of the data. The offsets buffer will always contain length + 1 signed integers (either 32-bit or 64-bit, based on the data type being used) that indicate the starting position of each corresponding slot of the array. Consider an array of two strings: [ "Water", "Rising" ].

Figure 1.8 – Arrow string versus traditional string vector

This differs from a lot of standard ways of representing a list of strings in memory in most library models. Generally, a string is represented as a pointer to a memory location and an integer for the length, so a vector of strings is a vector of these pointers and lengths (Figure 1.8). For many use cases, this is very efficient since, typically, a single memory address is going to be much smaller than the size of the string data, so passing around this address and length is efficient for referencing individual strings:

Figure 1.9 – Viewing string index 1

But if your goal is operating on a large number of strings, it’s much more efficient to have a single buffer to scan through in memory. As you operate on each string, you can maintain the memory locality we mentioned previously, keeping the memory we need to look at physically close to the next chunk of memory we’re likely going to need. This way, we spend less time jumping around different pages of memory and can spend more CPU cycles performing the computations. It’s also extremely efficient to get a single string as you can simply take a view of the buffer by using the address indicated by the offset to create a string object without copying the data.
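Here’s a small PyArrow sketch that inspects the ["Water", "Rising"] example. Because this array has no nulls, the validity slot of the buffers list comes back as None, which is the optimization mentioned earlier:

import pyarrow as pa
import struct

arr = pa.array(["Water", "Rising"])
validity, offsets, data = arr.buffers()   # validity is None: no nulls here

# length + 1 offsets: (0, 5, 11)
print(struct.unpack("<3i", offsets.to_pybytes()[:12]))
# all of the character data sits in one contiguous buffer
print(data.to_pybytes()[:11])             # b'WaterRising'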

Variable-length binary view arrays

Arrow has a large community and development continues to evolve as more systems adopt compatibility. Researchers at the Technical University of Munich wrote a paper describing a system they called UmbraDB. The details are interesting and I highly recommend reading up on it (https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf). However, what’s relevant here is that this paper outlined a new, highly efficient in-memory string representation for columnar data. Given the source, this representation has also been popularly named German-style strings. The representation was later adopted by two extremely popular open source projects: Meta’s Velox engine and DuckDB. Finally, it was adapted as a data type for Arrow arrays, both to maintain zero-copy compatibility with these systems and to bring this representation to the entire ecosystem that relies on Arrow.

Instead of the simple approach taken by variable-length binary arrays – using an offsets buffer and data buffer – binary views are a bit more complex. Like most data types, the first buffer is a validity bitmap. As each value in the array consists of 0 or more bytes, the second buffer uses a fixed-size, 16-byte structure to represent the location and length of each view of bytes (called a view header). Finally, while every other data type has a fixed number of buffers, the binary view data type is the first case of a variable number of buffers occurring in an Arrow data type! Let’s take a look!

Figure 1.10 – Layout of a binary view header

Figure 1.10 depicts the binary view header structure that exists for each value in the array. The interpretation of the bytes differs slightly depending on one of two cases: a short string (length of 12 bytes or fewer) or a long string (length greater than 12 bytes). If you happen to be familiar with common compiler optimizations, this is inspired by what is commonly referred to as small string optimization or short string optimization. For strings that are small enough, we can avoid requiring extra memory and a level of indirection to store and access it by having the string stored in-line in the structure. If you’re working with a large enough number of small strings, this adds up to a significant performance improvement.

For large string values, the first four bytes are copied into the structure as the prefix (stored after the length). The inline storage of the prefix enables a fast path for string comparisons; if the prefixes don’t match, there is no need to look at the rest of the data. Following the prefix are two 4-byte integers indicating which buffer (after the validity and views buffers) the full string data is in, and the offset into that buffer where the string starts. This may sound a bit complicated, but Figure 1.11 shows what it looks like in practice while providing a possible layout for the array: ["Hello", "Penny the cat", "and welcome"]:

Figure 1.11 – Layout of a string view array

The reason why I call it a “possible” layout is because the binary view layout allows for easy reuse of buffers. There can be any number of data buffers after the validity bitmap and views buffers, and the offset indicated can be located anywhere within those buffers.

Important note!

All integers referenced in this layout (the length, buffer index, and offset) must be signed, 32-bit integers. The reason for this is that some languages – in particular, Java – make working with unsigned integers more difficult than with signed integers. So, to ensure easier interoperability within a multi-language environment, the Arrow specification prefers to utilize signed integers where applicable rather than unsigned ones.

In addition to benefitting from short string optimization and leveraging prefix comparisons, there’s one other significant benefit to the binary view data type: it can easily be constructed out-of-order and in parallel because the view headers are a fixed-size structure and can refer to any buffer and offset by index. This makes binary view arrays ideal for element-wise operations that can be trivially parallelized, such as constructing substrings, sorting, or conditional operations.

New addition!

Watch the Arrow versions when you’re using this data type. This type was added to the Arrow format with Columnar Format v1.4 and was released in several implementations, starting with the Arrow v14 libraries. For insight into which implementations support the new data types, you can look at the implementation status page in the documentation (https://arrow.apache.org/docs/status.html).
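If you want to experiment with this layout, recent PyArrow releases expose it as the string_view type. The following is a hedged sketch: it assumes you’re on a new enough version of pyarrow (roughly 16.0 or later) for building string view arrays from Python objects to be supported:

import pyarrow as pa

arr = pa.array(["Hello", "Penny the cat", "and welcome"],
               type=pa.string_view())
print(arr.type)   # string_view

# "Hello" (5 bytes) fits inline in its 16-byte view header, while the
# longer values store a 4-byte prefix plus a buffer index and offset
print(arr)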

List and fixed-size list arrays

What about nested formats? Well, they work similarly to the variable-length binary format. First up is the variable-size list layout. It’s defined by two buffers – a validity bitmap and an offsets buffer – along with a child array. The difference between this and the variable-length binary format is that instead of the offsets referencing a buffer, they are indexes into the child array (which could potentially be a nested type itself). The common denotation of list types is to specify them as List<T>, where T is any type at all. When using 64-bit offsets instead of 32-bit, it is denoted as LargeList<T>. Let’s represent the following List<Int8> array: [[12, -7, 25], null, [0, -127, 127, 50], []]:

Figure 1.12 – Layout of a list array

The first thing to notice in the preceding diagram is that the offsets buffer has exactly one more element than the List array it belongs to: our List<Int8> array has four elements, while the offsets buffer has five. Each value in the offsets buffer is the starting position in the child array for the list at the corresponding index, i. Looking closer at the offsets buffer, we notice that 3 and 7 each repeat, indicating that the corresponding lists are either null or empty (have a length of 0). To discover the length of a list at a given slot, you simply take the difference between the offset for that slot and the offset after it:

length(i) = offsets[i + 1] - offsets[i]

The same holds for the previous variable-length binary format; the number of bytes for a given slot is the difference in the offsets. Knowing this, what is the length of the list at index 2 of Figure 1.12?

Remember, 0-based indexes! The answer is 4 (offsets[3] - offsets[2] = 7 - 3). With this, we can also tell that the list at index 3 is empty because the bitmap has a 1, but the length is 0 (7 – 7). This also explains why we need that extra element in the offsets buffer! We need it to be able to calculate the length of the last element in the array.
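If you want to verify the offset arithmetic yourself, here’s a minimal sketch using pyarrow, building the same List<Int8> array from Figure 1.12 and reading back its offsets buffer:

import pyarrow as pa

# The List<Int8> array from Figure 1.12
arr = pa.array([[12, -7, 25], None, [0, -127, 127, 50], []],
               type=pa.list_(pa.int8()))

# The offsets buffer has len(arr) + 1 entries, e.g. [0, 3, 3, 7, 7]
print(arr.offsets)

# Length of the list at slot i is offsets[i + 1] - offsets[i]
i = 2
print(arr.offsets[i + 1].as_py() - arr.offsets[i].as_py())   # 4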

Given that example, what would a List<List<Int8>> array look like? I’ll leave that as an exercise for you to figure out.

There’s also a FixedSizeList<T>[N] type, which works nearly the same as the variable-sized list, except there’s no need for an offsets buffer. The child array of a fixed-size list type is the values array, complete with its own validity buffer. The value in slot i of a fixed-size list array is stored in an N-long slice of the values array, starting at offset i * N. Figure 1.13 shows what this looks like:

Figure 1.13 – Layout of a fixed-size list array

What’s the benefit of FixedSizeList versus List? Look back at the preceding two diagrams again! Determining the values for a given slot of FixedSizeList doesn’t require any lookups into a separate offsets buffer, making it more efficient if you know that your lists will always be a specific size. As a result, you also save space by not needing the extra memory for an offsets buffer!
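As a quick sketch of the difference, the fixed-size list type can be created in pyarrow through the same pa.list_ factory by passing a list size (the values here are just arbitrary example data):

import pyarrow as pa

# FixedSizeList<Int8>[3]: every slot holds exactly three values, so no
# offsets buffer is needed; slot i starts at offset i * 3 in the values
fixed = pa.array([[1, 2, 3], None, [4, 5, 6]],
                 type=pa.list_(pa.int8(), 3))

print(fixed.type)     # fixed_size_list<item: int8>[3]
print(fixed.values)   # the flat child array of values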

Important note

One thing to keep in mind is the semantic difference between a null value and an empty list. Using JSON notation, the difference is equivalent to the difference between null and []. The meaning of such a difference would be up to a particular application to decide, but it’s important to note that a null list is not identical to an empty list, even though the only difference in the physical representation is the bit in the validity bitmap.
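Here’s a minimal pyarrow sketch of that distinction, using the same List<Int8> type as before:

import pyarrow as pa

arr = pa.array([None, []], type=pa.list_(pa.int8()))

print(arr[0].as_py())   # None -> a null list (validity bit is 0)
print(arr[1].as_py())   # []   -> an empty list (valid, with length 0)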

Phew! That was a lot. We’re almost done!

ListView arrays

Similar to the relationship between variable-length binary arrays and their view counterpart, variable-length list arrays also have a view counterpart type. The ListView<T> data type contains a validity bitmap, offsets buffer, and a child array just like a List<T> array but adds an additional buffer to represent the sizes of the list views instead of the sizes being inferred from the offsets. The benefit of the sizes buffer is that it allows for efficient out-of-order processing. As usual, let’s see what it looks like by representing the following ListView<Int8> array: [[12, -7, 25], null, [0, -127, 127, 50], [], [50, 12]]:

Figure 1.14 – Layout of a ListView array

Right off the bat in Figure 1.14, you can see that the offsets are not guaranteed to be monotonically increasing as they are in the standard variable-length list layout. The elements that make up the first list in the array are located at the end of the child array, while the elements that make up the third list element are at the beginning of the child array.

Not only are the offsets out of order, but values in the child array can even be shared between more than one list element. The final list element utilizes the values 50 and 12, which are also part of the other two lists.

Every list-view value, including null and empty lists, must guarantee the following:

0 ≤ offsets[i] ≤ len(child)
0 ≤ offsets[i] + sizes[i] ≤ len(child)

Additionally, just like the variable-length list layout, there is both a 32-bit and 64-bit variant of the list view array, with the 64-bit version being denoted as LargeListView<T>. In this case, the bit-width refers to the size of both the offset and the size values (that is, ListView<T> contains two buffers full of 32-bit integers and LargeListView<T> contains two buffers full of 64-bit integers).

As you might have guessed, the list-view layout allows for more efficient representations of lists that contain overlapping elements, along with easier parallelization by facilitating out-of-order creation and processing. The only drawback, compared to the standard list representation, is the extra memory needed for the sizes buffer. As a result, for list arrays that are neither constructed out of order nor share values between list elements, this representation is less efficient than just using the standard list data type. As with the variable-length binary view array layout, this one was also inspired by representations utilized by the open source Velox and DuckDB engines.

New addition!

Watch the Arrow versions when you’re using this data type. This type was added to the Arrow format with Columnar Format v1.4 and was released in several implementations, starting with the Arrow v14 libraries. For insight into which implementations support the new data types, you can look at the implementation status page in the documentation (https://arrow.apache.org/docs/status.html).
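If your pyarrow build is new enough to expose the list-view bindings, a minimal sketch looks like the following. Note that both the availability of the pa.list_view() type factory and the offsets/sizes accessors shown here are assumptions about the Python bindings, so check them against your library version:

import pyarrow as pa

# The ListView<Int8> array from Figure 1.14. When built from Python
# lists like this, the resulting offsets will simply be in order; the
# layout merely *allows* out-of-order offsets and shared values.
lv = pa.array([[12, -7, 25], None, [0, -127, 127, 50], [], [50, 12]],
              type=pa.list_view(pa.int8()))

print(lv.type)      # list_view<item: int8>
print(lv.offsets)   # the offsets buffer
print(lv.sizes)     # the sizes buffer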

Struct arrays

The next type on our tour of the Arrow format is the struct type’s layout. A struct is a nested type that has an ordered sequence of fields that can all have distinct types. It’s semantically very similar to a simple object with attributes that you might find in a variety of programming languages. Each field must have its own UTF-8 encoded name, and these field names are part of the metadata for defining a struct type. Instead of having any physical storage allocated for its values, a struct array has one child array for each of its fields. All of these child arrays are independent and don’t need to be adjacent to each other in memory; remember, our goal is column (or field)-oriented, not row-oriented. However, a struct array must have a validity bitmap if it contains one or more null struct values. It can still contain a validity bitmap if there are no null values; it’s just optional in that case.

Let’s use the example of a struct with the following structure: Struct<name: VarBinary, age: Int32>. An array of this type would have two child arrays: one VarBinary array (a variable-sized binary layout) and one 4-byte primitive value array with Int32 as its data type. With this definition, we can map out a representation of the [{"joe", 1}, {null, 2}, null, {"mark", 4}] array:

Figure 1.15 – Layout of a struct array

When an entire slot of the struct array is set to null, the null value is represented in the parent’s validity bitmap, which is different from a particular value in a child array being null. In Figure 1.15, the child arrays each have a slot for the null struct in which they could have any value at all. However, they would be hidden by the struct array’s validity bitmap, thus marking the corresponding struct slot as null and taking priority over the children.
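Here’s a minimal pyarrow sketch of the Struct<name: VarBinary, age: Int32> example, with the struct values passed in as Python dictionaries:

import pyarrow as pa

struct_type = pa.struct([("name", pa.binary()), ("age", pa.int32())])

arr = pa.array(
    [{"name": b"joe", "age": 1},
     {"name": None, "age": 2},
     None,                        # the entire struct slot is null
     {"name": b"mark", "age": 4}],
    type=struct_type)

print(arr.is_valid())      # top-level validity: [true, true, false, true]
print(arr.field("name"))   # the child VarBinary array
print(arr.field("age"))    # the child Int32 array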

Union arrays – sparse and dense

For cases when a single column could have multiple types, the Union type array exists. Whereas the struct array is an ordered sequence of fields, a union type is an ordered sequence of types. The value in each slot of the array could be of any of these types, which are named like struct fields and included in the metadata of the type. Unlike other layouts, the union type does not have its own validity bitmap. Instead, each slot’s validity is determined by the children, which are composed to create the union array itself. There are two distinct union layouts that can be used when creating an array: dense and sparse. Each can be optimized for a different use case.

A dense union represents a mixed-type array with 5 bytes of overhead for each value. It contains the following structures:

One child array for each type.
A types buffer