This pioneering guide to the new ISO-standardized Graph Query Language (GQL) will help you modify, query, and analyze graph data using foundational and advanced concepts, as well as practice with sample code on the GQL Playground.
Getting Started with the Graph Query Language (GQL)
A complete guide to designing, querying, and managing graph databases with GQL
Ricky Sun
Jason Zhang
Yuri Simione
Getting Started with the Graph Query Language (GQL)
Copyright © 2025 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Portfolio Director: Sunith Shetty
Relationship Lead: Teny Thomas
Project Manager: Shashank Desai
Content Engineer: Saba Umme Salma
Technical Editors: Seemanjay Ameriya and Aniket Shetty
Copy Editor: Safis Editing
Indexer: Rekha Nair
Production Designer: Shantanu Zagade
First published: August 2025
Production reference: 1280725
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-83620-401-5
www.packtpub.com
Ricky Sun is a serial entrepreneur and an expert in high-performance storage and computing systems. He is also the author of several technology books. Ricky began his career in Silicon Valley, working with his professor a year before graduating from Santa Clara University. Over the past 20+ years, he has experienced three mergers and acquisitions. Ultipa is his fourth venture. Ricky previously served as CTO of EMC Asia R&D Center, managing director of EMC Labs China, and chief architect of Splashtop, a pre-IPO unicorn start-up. He was also the CEO of Allhistory, a knowledge graph-powered causality search engine, now part of Human, a publicly traded online education services company. Ricky is also the author of The Essential Criteria of Graph Databases.
Jason Zhang holds a master’s degree in computer science from SUPINFO, Paris, and has over 10 years of IT experience. He has worked for start-ups and is currently the director of engineering at Ultipa and CTO at Ultipa HK. In his role, Jason has contributed to the design and implementation of Ultipa’s property graph database, recognized as one of the most performant and innovative in the market. He leads the implementation of Ultipa Graph and adds support for the GQL standard.
Yuri Simione holds a master’s degree in computer science from the University of Pisa, Italy. He has nearly 30 years of IT experience. He has worked with large enterprises such as Xerox and EMC (now Dell EMC), specializing in managing unstructured information projects with products such as OpenText Documentum and Adobe Experience Manager. In 2014, Yuri recognized the potential of graph databases for handling unstructured data and shifted his focus to this technology and the semantic knowledge graphs market. He currently serves as VP of partnerships and alliances at Ultipa, a leading graph database and graph analytics vendor.
Keith Hare has worked with JCC Consulting, Inc. since 1985, first as a senior consultant and then, since 2019, as president. At JCC Consulting, he has worked on high-performance, high-availability database applications in multiple companies and multiple industries, focusing on database administration, performance tuning, and data replication.
Keith has participated in both US and international standards processes for the database languages SQL and GQL since 1988 and has served as the convenor of the international committee ISO/IEC JTC1 SC32 WG3 Database Languages since 2005.
Pearl Cao brings over a decade of experience in the IT industry, having worked across a wide range of sectors. Currently serving as senior content strategist at Ultipa, she leads all technical writing initiatives for the company. Pearl spearheaded the development of Ultipa GQL’s official documentation, training programs, and certification systems, playing a central role in shaping how users and partners engage with Ultipa’s technology. Her work bridges the gap between complex graph database concepts and clear, actionable content for both technical and business audiences.
Preface
Who this book is for
What this book covers
Get in touch
Your Book Comes with Exclusive Perks – Here’s How to Unlock Them
Unlock Your Book’s Exclusive Benefits
How to unlock these benefits in three easy steps
Step 1
Step 2
Step 3
Need help?
Evolution Towards Graph Databases
History of database query languages
Early computer-based data management
Navigational data modeling
Relational data modeling
Not-only-SQL data modeling
The rise of SQL
The rise of the relational model
Evolution of NoSQL and new query paradigms
The emergence of NoSQL and big data
Graphs and graph models
Graph theory and graph data models
Property graphs and semantic knowledge graphs
Property Graphs
Semantic Knowledge Graphs (SKGs)
Current and future trends in graph database technology
Hybrid Transactional and Analytical Processing (HTAP)
Handling large-scale graph data
GQL compliance
Why is GQL the new standard?
The genesis of GQL
Evolution pathways
Personal reflections on GQL’s evolution
Core features and capabilities
Flexibility and expressiveness
Usability and developer experience
Advantages of GQL over traditional query languages
Intuitive representation of graph data
Simplified and expressive querying
Enhanced performance for relationship queries
Flexibility in schema design
Advanced pattern matching and analysis
Streamlined integration with graph algorithms
Future-proofing and standardization
Enhanced support for real-time applications
Better alignment with use cases
Summary
Key Concepts of GQL
Graph terms and definitions
Graph element
Directed and undirected edge
Label
Property
Property graph
Simple and multigraph
Directed, undirected, and mixed graph
Empty graph
Path
GQL catalog system
GQL objects
Binding tables
GQL values
NULL and material values
Reference value
Comparable value
GQL value types
Dynamic types
Static types
Constructed types
Graph patterns
Element patterns
Path patterns
Graph patterns
Other topics
GQL status code
Summary
Getting Started with GQL
Introduction to Ultipa Graph
Guide to GQL Playground
Setting up your Ultipa account
Accessing GQL Playground
Hands-on GQL
Scenario script
Creating the graph and building the graph model
Inserting nodes and edges
Querying the graph
Reading data
Nodes
Edges
Paths
Updating nodes and edges
Deleting nodes and edges
Dropping graph
Summary
GQL Basics
Technical requirements
GQL statements
Linear query statements
Modifying catalogs
CREATE and DROP GQL schema
Creating a GQL schema
Dropping a GQL schema
CREATE and DROP graph
Creating a graph
Creating a new graph with an existing graph or graph type
Dropping a graph
The CREATE and DROP graph types
Creating a graph type
Dropping a graph type
Retrieving graph data
Matching with node patterns
Matching with edge patterns
Matching with path patterns
Edge pattern abbreviation
Connected and disconnected path matching
Matching with connected path patterns
Matching with disconnected path patterns
Matching with quantified path patterns
Repeated edge patterns
Repeated path patterns
Simplified path patterns and full path patterns
Returning results
Filtering records
Differences between FILTER and WHERE
Sorting and paging results
Sorting (ORDER BY)
Limiting records
Skipping records
Paging records
Grouping and aggregating results
Grouping records
Aggregating records
Modifying graph data
Inserting graph data
Inserting nodes
Inserting edges
Inserting paths
Updating data
Deleting data
Removing properties and labels
Composite query statements
Retrieving union records
Excluding records from another query
Retrieving intersected records
Other clauses
The YIELD clause
The AT SCHEMA clause
The USE graph clause
Summary
Exploring Expressions and Operators
Technical requirements
Understanding GQL operators
Comparison operators
General comparisons
Comparing values with different types
Mathematical operators
Boolean operators
Assignment and other operators
Assignment operator
Other predicate operators
Predicating value type
Predicating source and destination nodes
Predicating labels
Using value expressions
Boolean value expressions
Nested value expressions
Common value expressions
String expressions
Numeric expressions
Temporal expressions
Datetime expressions
List expressions
Path expressions
Other expressions
Value query expressions
LET value expressions
CASE expressions
Summary
Working With GQL Functions
Technical requirements
Numeric functions
Mathematical functions
Rounding, absolute value, and modulus functions
Logarithmic and exponential functions
Length of values
Trigonometric functions
Radians and degrees
Basic trigonometric functions
Inverse trigonometric functions
Hyperbolic trigonometric functions
String functions
Substring functions
Uppercase and lowercase
Trimming strings
Single-character trim
Single-character trim for byte strings
Multiple-character trim
Normalizing strings
Temporal functions
Aggregating values
Set quantifiers: DISTINCT and ALL
Counting records
Numeric aggregation functions
MAX, MIN, AVG, and SUM
Standard deviation
PERCENTILE_CONT
PERCENTILE_DISC
Collect list
Other functions
Trimming a list
Converting a path to a list
Converting a data type with CAST
Extra functions
Example: The reduce function
Summary
Delve into Advanced Clauses
Technical requirements
Traversing modes
Path modes
WALK mode
TRAIL mode
ACYCLIC mode
SIMPLE mode
MATCH modes
REPEATABLE ELEMENTS
DIFFERENT EDGES
Restrictive path search
ALL path prefix
ANY path prefix
Searching shortest paths
Specifying a shortest-path search
ALL SHORTEST
ANY SHORTEST
Counted shortest search
Counted shortest group search
Counting K-hop neighbors by shortest group
Using the TEMP variable
CALL procedures
Inline procedures
Example: Using CALL to count grouped neighbors
Example: Aggregating values in paths
Named procedures
Using OPTIONAL
OPTIONAL MATCH
OPTIONAL CALL
Summary
Configuring Sessions
Technical requirements
What is a GQL session?
Explicit session creation
Implicit session creation
Session management
Setting sessions
Setting a GQL schema
Setting the current graph
Setting the time zone
Configuring session parameters
Setting a graph parameter
Setting a binding table parameter
Setting a value parameter
Setting all parameters
Resetting a session
Resetting all session settings
Resetting a single setting
Closing a session
Summary
Graph Transactions
Technical requirements
What is a transaction?
Types of transactions in databases
ACID rules
Initializing a transaction
Implicit transaction initialization
Explicit transaction initialization
Creating a transaction
Specifying the transaction mode
Customized modes
Committing the transaction
Rolling back a transaction
Summary
Conformance to the GQL Standard
Minimum conformance
Requirements of the data model
Mandatory features
Optional features
Implementation-defined elements
Implementation-dependent elements
Future of GQL conformance
Summary
Beyond GQL
Technical requirements
Graph operations
Showing the list of graphs
Options for creating a graph
Altering a graph
Renaming a graph
Updating comments on a graph
Node and edge schema operations
Showing node and edge schemas
Modifying the node or edge schema
Adding a new schema
Dropping a schema
Renaming a schema and updating its comment
Property operations
Listing properties
Adding and dropping a property
Renaming a property
Constraining properties
Creating constraints
Deleting constraints
Managing EDGE KEY
Improving query performance
Managing the property index
Creating an index
Showing the index
Dropping an index
Ultipa’s HDC graph
HDC graph
Managing an HDC graph
Access controls
User management
Listing users
Creating users
Deleting users
Updating users
Granting and revoking privileges
Understanding the privilege system
Managing privileges
Role management
Creating and dropping roles
Granting and revoking privileges
Assigning roles
Other operations
Checking background jobs
Listing job records
Stopping and clearing jobs
Truncating a graph
Truncating an entire graph
Truncating a node and an edge
Summary
A Case Study – Anti-Fraud
Technical requirements
Case introduction
Understanding transaction fraud
Data preparation
Establishing the transaction graph
Creating the graph
Inserting data
Querying the graph
Anti-fraud graph model
Summary
The Evolving Landscape of GQL
Emerging features and capabilities
Advanced graph traversals
Named stored queries and functions
Integration with machine learning and AI
Enhanced performance and scalability
Transition for SQL users
Intuitive graph data modeling
Challenges and opportunities
Technical challenges
Performance optimization
Security concerns
Marketing and adoption challenges
Educating the developer community
Demonstrating practical advantages
Opportunities for growth
Industry applications
Academic and research opportunities
Collaboration and partnerships
The future of GQL
Integration with emerging technologies
Standardization and interoperability
Continuous innovation
The missing protocol: How GQL could expand its reach through standardized access
Final reflections: A unified future with GQL
Glossary and Resources
Glossary
Resources
Optional features – GQL conformance
Implementation-defined elements – GQL conformance
Implementation-dependent elements – GQL conformance
Other Books You May Enjoy
Index
Download a Free PDF Copy of This Book
Over the past several decades, the world of data has evolved dramatically—from the structured era of relational databases to the expansive realms of big data and fast data. Today, we are entering a new phase: the age of deep and connected data. As data volumes grow and analytics become increasingly interdependent, traditional database systems are being reimagined. Graph technology has emerged as a powerful solution, offering new possibilities for modeling and querying complex relationships.
Before the standardization of Graph Query Language (GQL), the graph database landscape was fragmented. Popular query languages such as Cypher (Neo4j), Gremlin (Apache TinkerPop), GSQL (TigerGraph), UQL (Ultipa), and AQL (ArangoDB) each introduced unique features tailored to specific platforms. While these innovations advanced the field, they also created challenges for users—requiring time and effort to learn multiple proprietary syntaxes.
The introduction of GQL (ISO/IEC 39075) marks a pivotal moment in database history. As the second standardized database query language—following SQL’s release in 1986 (ANSI) and 1987 (ISO)—GQL provides a unified, vendor-neutral syntax for querying graph databases. This standardization fosters interoperability, reduces learning curves, and accelerates adoption across industries.
This book begins with the evolution of graph databases and query languages, setting the stage for a comprehensive understanding of GQL. You’ll explore its syntax, structure, data types, and clauses, and gain hands-on experience through practical examples. As you progress, you’ll learn how to write efficient queries, optimize performance, and apply GQL to real-world scenarios such as fraud detection.
By the end of this journey, you’ll have a solid grasp of GQL, be equipped to implement a graph-based solution with GQL, and gain insight into the future direction of graph technology and its growing role in data ecosystems.
As GQL emerges as a new standard for querying graph databases, its relevance is expanding across nearly every industry. This book is designed for a wide range of professionals who work with data and seek to harness the power of graph-based systems. Whether you’re a developer, engineer, data analyst, database administrator (DBA), data engineer, or data scientist, you’ll find valuable insights and practical guidance in these pages. GQL opens new possibilities for modeling and analyzing complex, interconnected data. As such, this book serves as both an introduction and a deep dive into the language, helping readers of all backgrounds understand and apply GQL effectively in real-world scenarios.
Note: Some features covered in this book may not work as expected with the current versions of GQL Playground and the cloud. These features are planned for future releases. The book includes them to provide a comprehensive guide to GQL and its evolving capabilities.
Chapter 1, Evolution Towards Graph Databases, traces the journey from relational databases to NoSQL, and ultimately to the emergence of GQL, which promises to redefine how we query and manage complex, interconnected data in the digital age.
Chapter 2, Key Concepts of GQL, introduces the key concepts of GQL and graph theory. The foundational knowledge covered here will enhance your understanding of the remaining sections of the book.
Chapter 3, Getting Started with GQL, takes you on a journey to acquiring practical experience in interacting with graph data using GQL. You will learn how to formulate and execute GQL queries against a graph database, which is essential for querying, manipulating, and analyzing graph-structured data.
Chapter 4, GQL Basics, explores the fundamentals of GQL, uncovering the power of GQL statements and learning how to match data and return results tailored to your needs.
Chapter 5, Exploring Expressions and Operators, covers the expressions and operators that let you filter nodes and relationships, compute metrics over graph structures, construct dynamic labels, and transform properties on the fly.
Chapter 6, Working with GQL Functions, introduces a variety of essential functions for effective data manipulation and analysis.
Chapter 7, Delve into Advanced Clauses, delves into more advanced usages of GQL that allow for more sophisticated graph queries and operations.
Chapter 8, Configuring Sessions, delves into session management, exploring the creation, modification, and termination of sessions. This chapter presents a detailed overview of the session context, commands for setting session parameters, and resetting and closing sessions.
Chapter 9, Graph Transactions, delves into the specifics of initiating transactions using the TRANSACTION commands, detailing the syntax, usage, and conditions.
Chapter 10, Conformance to the GQL Standard, overviews conformance to the GQL standard, including required capabilities, optional features, and implementation-defined and implementation-dependent elements.
Chapter 11, Beyond GQL, explores GQL extensions provided by Ultipa Graph Database, including operations such as additional options to create a graph, constraints, and index operations, as well as access controls.
Chapter 12, A Case Study – Anti-Fraud, provides hands-on practice by tackling a common issue with GQL, identifying suspicious transactions in bank accounts.
Chapter 13, The Evolving Landscape of GQL, looks at GQL’s emerging features and capabilities, the challenges and opportunities surrounding its adoption, and the future direction of the language.
Chapter 14, Glossary and Resources, provides definitions of key terms and a comprehensive list of required and optional GQL features, along with additional resources for further learning.
The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/Getting-Started-with-the-Graph-Query-Language-GQL. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781836204015.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter/X handles. For example: “TRAIL mode excludes paths that contain duplicate edges, such as the a->b->a->b path, where the edge a->b is traversed more than once.”
A block of code is set as follows:
GQL:

INSERT (a:Node {_id: 'a'}), (b:Node {_id: 'b'}), (c:Node {_id: 'c'}), (i:Node {_id: 'i'}), (j:Node {_id: 'j'}), (b)-[:Edge]->(a), (a)-[:Edge]->(c), (c)-[:Edge]->(i), (i)-[:Edge]->(j)

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
GQL:

FOR id IN ["a", "b", "z"]
OPTIONAL CALL {
  MATCH (start {_id: id})
  MATCH (start)-(end)
  RETURN COLLECT_LIST(end._id) AS neighbours
}
LET neighbours = COALESCE(neighbours, [])
RETURN id, neighbours

Bold: Indicates a new term, an important word, or words that you see on the screen. For instance, words in menus or dialog boxes appear in the text like this. For example: “In this case, the results are generated by computing the Cartesian product of the result sets from the individual patterns.”
Warnings or important notes appear like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book or have any general feedback, please email us at [email protected] and mention the book’s title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you reported this to us. Please visit http://www.packt.com/submit-errata, click Submit Errata, and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packt.com/.
Once you’ve read Getting Started with the Graph Query Language (GQL), we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there, you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:

https://packt.link/free-ebook/9781836204015
Submit your proof of purchase. That’s it! We’ll send your free PDF and other benefits to your email directly.

Scan this QR code or go to packtpub.com/unlock, then search this book by name. Ensure it’s the correct edition.
Note: Keep your purchase invoice ready before you start.
Enhanced reading experience with our Next-gen Reader:
Multi-device progress sync: Learn from any device with seamless progress sync.
Highlighting and notetaking: Turn your reading into lasting knowledge.
Bookmarking: Revisit your most important learnings anytime.
Dark mode: Focus with minimal eye strain by switching to dark or sepia mode.
Learn smarter using our AI assistant (Beta):
Summarize it: Summarize key sections or an entire chapter.
AI code explainers: In the next-gen Packt Reader, click the Explain button above each code block for AI-powered code explanations.
Note: The AI assistant is part of next-gen Packt Reader and is still in beta.
Learn anytime, anywhere:
Access your content offline with DRM-free PDF and ePub versions—compatible with your favorite e-readers.
Your copy of this book comes with the following exclusive benefits:
Next-gen Packt Reader
AI assistant (beta)
DRM-free PDF/ePub downloads
Use the following guide to unlock them if you haven’t already. The process takes just a few minutes and needs to be done only once.
Keep your purchase invoice for this book ready, as you’ll need it in Step 3. If you received a physical invoice, scan it on your phone and have it ready as either a PDF, JPG, or PNG.
For more help on finding your invoice, visit https://www.packtpub.com/unlock-benefits/help.
Note: Did you buy this book directly from Packt? You don’t need an invoice. After completing Step 2, you can jump straight to your exclusive content.
Scan this QR code or go to packtpub.com/unlock.
On the page that opens (which will look similar to Figure 0.1 if you’re on desktop), search for this book by name. Make sure you select the correct edition.
Figure 0.1: Packt unlock landing page on desktop
Sign in to your Packt account or create a new one for free. Once you’re logged in, upload your invoice. It can be in PDF, PNG, or JPG format and must be no larger than 10 MB. Follow the rest of the instructions on the screen to complete the process.
If you get stuck and need help, visit https://www.packtpub.com/unlock-benefits/help for a detailed FAQ on how to find your invoices and more. The following QR code will take you to the help page directly:
Note: If you are still facing issues, reach out to [email protected].
In today’s data-driven world, the way we store, manage, and query data has evolved significantly. As businesses and organizations handle more complex, interconnected datasets, traditional database models are being stretched to their limits. Graph databases, in particular, have gained traction due to their ability to model relationships in ways that relational databases cannot. According to a recent report by Gartner, the graph database market is expected to grow at a compound annual growth rate (CAGR) of 28.1%, reflecting its increasing adoption across industries such as finance, healthcare, and social media.
This chapter explores the evolution of database query languages, right from the early innovations that laid the foundation for modern database systems. We’ll trace the journey from relational databases to NoSQL, and ultimately to the emergence of Graph Query Language (GQL), which promises to redefine how we query and manage complex, interconnected data in the digital age.
Before electronic databases, data management was manual. Records were maintained in physical forms such as ledgers, filing cabinets, and card catalogs. Until the mid-20th century, this was the primary approach to data management. This method, while systematic, was labor-intensive and prone to human errors.
Discussions on database technology often begin with the 1950s and 1960s, particularly with the introduction of magnetic tapes and disks. These developments paved the way for navigational data models and, eventually, relational models. While these discussions are valuable, they sometimes overlook deeper historical perspectives.
Before magnetic tapes, punched cards were widely used, particularly for the 1890 U.S. Census. The company behind these tabulating systems later evolved into IBM Corporation, one of the first major technological conglomerates. I vividly recall my father attending college courses on modern computing, where key experiments involved operating IBM punch-card computers—decades before personal computers emerged in the 1980s.
Examining punched card systems reveals a connection to the operation of looms, one of humanity’s earliest sophisticated machines. Looms, which possibly originated in China and spread globally, have been found in various forms, including in remote African and South American villages.
Across the 3,000 to 5,000 years of recorded history, there have been many inventions for aiding memory, sending messages, scheduling, and recording data, ranging from tally sticks to quipu (khipu). While tally sticks were once thought to be a European invention, Marco Polo, after his extensive travels in China, reported that they were widely used there to track daily transactions.
On the other hand, when quipu was first discovered by Spanish colonists, it was believed to be an Inca invention. However, if the colonists had paid more attention to the pronunciation of khipu, they would have noticed that it means recording book in ancient Chinese. This suggests that quipu was a popular method for recording data and information long before written languages were developed.
Why focus on these pre-database inventions? Understanding these historical innovations through a graph-thinking lens helps illustrate how interconnected these concepts are and underscores the importance of recognizing these connections. Embracing this perspective allows us to better understand and master modern technologies, such as graph databases and graph query languages.
The advent of electronic computers marked the beginning of computerized data storage. World Wars I and II drove major advancements in computing technology, notably the German Enigma machine and the Polish and Allied forces deciphering its encrypted messages, which contained top-secret information from Nazi Germany. When mechanical machines proved inadequate for the required computing power—such as in brute-force decryption—electronic and much more powerful alternatives were invented. Consequently, the earliest computers were developed during and before the end of World War II.
Early computers such as the ENIAC (1946) and UNIVAC (1951) were used for calculations and data processing. The Bureau of the Census and military and defense departments quickly adopted them to optimize troop deployment and arrange the most cost-effective logistics. These efforts laid the foundation for modern global supply chains, network analytics, and social behavior network studies.
The concept of systematic data management, or databases, became feasible with the rapid advancement of electronic computers and storage media, such as magnetic disks. Initially, most of these computers operated in isolation; the development of computer networks lagged significantly behind telecommunication networks for over a century.
The development of database technologies is centered around how data modeling is conducted, and the general perception is that there have been three phases so far:
Phase 1: Navigational data modeling
Phase 2: Relational (or SQL) data modeling
Phase 3: Not-only-SQL (or post-relational, or GQL) data modeling

Let’s briefly examine the three development phases so that we have a clear understanding of why GQL, or the graphical way of data modeling and processing, was invented.
Before navigational data modeling (or navigational databases), data on punched cards or magnetic tapes could only be accessed sequentially, which was highly inefficient. To improve speed, systems introduced references, similar to pointers, that allowed users to navigate data more efficiently. This led to the development of two data navigation models:
Hierarchical model (or tree-like model)
Network model

The hierarchical model was first developed by IBM in the 1960s on top of their mainframe computers, while the network model, though conceptually more comprehensive, was never widely adopted beyond the mainframe era. Both models were quickly displaced by the relational model in the 1970s.
One key reason for this shift was that navigational database programming is intrinsically procedural, focusing on instructing the computer systems with steps on how to access the desired data record. This approach had two major drawbacks:
Strong data dependency
Low usability due to programming complexity

The relational model, unlike the navigational model, is intrinsically declarative: the programmer tells the system what data to retrieve rather than how to retrieve it, which results in better data independence and program usability.
Another key reason for the shift from navigational databases/models was their limited search capabilities, as data records were stored using linked lists. This limitation led Edgar F. Codd, while working at IBM’s San Jose, California Labs, to invent tables as a replacement for linked lists. His groundbreaking work culminated in the highly influential 1970 paper titled A Relational Model of Data for Large Shared Data Banks. This seminal paper inspired a host of relational databases, including IBM’s System R (1974), UC Berkeley’s INGRES (1974, which spawned several well-known products such as PostgreSQL, Sybase, and Microsoft SQL Server), and Larry Ellison’s Oracle (1977).
Today, there are approximately 500 known and active database management systems (DBMS) worldwide (as shown in Figure 1.1). While over one-third are relational DBMS, the past two decades have seen a rise in hundreds of non-relational (NoSQL) databases. This growth is driven by increasing data volumes, which have given rise to many big data processing frameworks that utilize both data modeling and processing techniques beyond the relational model. Additionally, evolving business demands have led to more sophisticated architectural designs, requiring more streamlined data processing.
The entry of major players into the database market has further propelled this transformation, with large technology companies spearheading the development of new database systems tailored to handle diverse and increasingly complex data structures. These companies have helped define and redefine database paradigms, providing a foundation for a variety of solutions in different industries.
As the landscape has continued to evolve, OpenAI, among other cutting-edge companies, has contributed to this revolution with diverse database systems to optimize data processing in machine learning models. In OpenAI’s system architecture, a variety of databases (both commercial and open source) are used, including PostgreSQL (RDBMS), Redis (key-value), Elasticsearch (full-text), MongoDB (document), and possibly Rockset (a derivative of the popular KV-library RocksDB, ideal for real-time data analytics). This heterogeneous approach is typical in large-scale, especially highly distributed, data processing environments. Often, multiple types of databases are leveraged to meet diverse data processing needs, reflecting the difficulty—if not impossibility—of a single database type performing all functions optimally.
Figure 1.1: Changes in Database popularity per category (August 2024, DB-Engines)
Despite the wide range of database genres, large language models still struggle with questions requiring “deep knowledge.” Figure 1.2 illustrates how a large language model encounters challenges with queries necessitating extensive traversal.
Figure 1.2: Hallucination with LLM
The question in Figure 1.2 involves finding causal paths (simply the shortest path) between different entities. While large language models are trained on extensive datasets, including Wikipedia, they may struggle to calculate and retrieve hidden paths between entities if they are not directly connected.
Figure 1.3 demonstrates how Wikipedia articles—represented as nodes (titles or hyperlinks) and their relationships as predicates—can be ingested into the Ultipa graph database. By performing a real-time six-hop-deep shortest path query, the results yield causal paths that are self-explanatory:

Genghis Khan launched the Mongol invasions of West Asia and Europe.
These invasions triggered the spread of the Black Death.
The last major outbreak of the Black Death was the Great Plague of London.
Isaac Newton fled the plague while attending Trinity College.

Figure 1.3: The shortest paths between entities using a graph database
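To give a flavor of what such a query can look like, the following is a minimal, hypothetical GQL sketch. It assumes the Wikipedia pages are stored as Article nodes with a title property and are connected by Link edges; it is not the exact query behind Figure 1.3, and shortest-path syntax is covered properly in Chapter 7.

// Find one shortest chain of links, up to six hops, between the two articles
MATCH p = ANY SHORTEST (a:Article {title: 'Genghis Khan'})-[:Link]-{1,6}(b:Article {title: 'Isaac Newton'})
RETURN p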
The key takeaway from this section is the importance of looking beyond the surface when addressing complex issues or scenarios. The ability to connect the dots and delve deeper into the underlying details allows for identifying root causes, which in turn fosters better decision-making and a more comprehensive understanding of the world’s intricate dynamics.
The introduction of the relational model revolutionized database management by offering a more structured and flexible way to organize and retrieve data. With the relational model as its foundation, SQL emerged as the standard query language, enabling users to interact with relational databases in a more efficient and intuitive manner. This section will explore how SQL’s development, built upon the relational model, became central to modern database systems and continues to influence their evolution today.
Edgar F. Codd’s 1970 paper, A Relational Model of Data for Large Shared Data Banks (https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf) laid the foundation for the relational model. Codd proposed a table-based structure for organizing data, introducing the key concepts of relations (tables), columns (attributes), rows (tuples), primary keys, and foreign keys. When compared to the navigational model (hierarchical model), such a structure provided a more intuitive and flexible way to handle data.
While we state that the relational model is more intuitive and flexible, this holds within the context of the data processing scenarios of the 1970s to the 1990s. Things have gradually and constantly changed, and the relational model has faced growing challenges and criticism with the rise of NoSQL and, eventually, the standardization of GQL. We will expand on the limitations of SQL and the promise of GQL in the last section of this chapter.
The relational model rests on just five key components: table, schema, key, relationship, and transaction. Let’s break them down one by one.
A table, or relation, in the relational model is a structured collection of related data organized into rows and columns. Each table is defined by its schema, which specifies the structure and constraints of the table. Here’s a breakdown of its components:
Table Name: Each table has a unique name that describes its purpose. For instance, a table named Employees would likely store employee-related information.
Columns (Attributes): Each table consists of a set of columns, also known as attributes or fields. Columns represent the specific characteristics or properties of the entities described by the table. Each column has a name and a data type, which defines the kind of data it can hold. For example, in an Employees table, columns might include EmployeeID, FirstName, LastName, HireDate, and Department. The data type for each column could be integer, varchar (variable character), date, etc.
Rows (Tuples): Rows, or tuples, represent individual records within a table. Each row contains a set of values corresponding to the columns defined in the table. For example, a row in the Employees table might include 101, John, Doe, 2023-06-15, Marketing. Each row is a unique instance of the data described by the table.

Another key concept tied to tables (sometimes even tied to the entire RDBMS) is the schema. The schema of a table is a blueprint that outlines the table’s structure. It includes the following:
Column Definitions: For each column, the schema specifies its name, data type, and any constraints. Constraints might include NOT NULL (indicating that a column cannot have null values), UNIQUE (ensuring all values in a column are distinct), or DEFAULT (providing a default value if none is specified).
Primary Key: A primary key is a column or a set of columns that uniquely identifies each row in the table. It ensures that no two rows can have the same value for the primary key columns. This uniqueness constraint is crucial for maintaining data integrity and enabling efficient data retrieval. For example, EmployeeID in the Employees table could serve as the primary key.
Foreign Keys: Foreign keys are columns that create relationships between tables. They refer to the primary key of another table, establishing a link between the two tables. This mechanism supports referential integrity, ensuring that relationships between data in different tables are consistent.

Here we need to talk about normalization, which is a process applied to table design to reduce redundancy and improve data integrity. It involves decomposing tables into smaller, related tables and defining relationships between them. The goal is to minimize duplicate data and ensure that each piece of information is stored in only one place.
For example, rather than storing employee department information repeatedly in the Employees table, a separate Departments table can be created, with a foreign key in the Employees table linking to it.
The concept of normalization sounds wonderful, but only on the surface. In many large data warehouses, it has produced an unwieldy number of tables. What was once seen as intuitive and flexible in the relational model can become a huge limitation and burden from a data governance perspective.
Entity Relationship (ER) modeling is a foundational technique for designing databases using the relational model. Developed by Peter Chen in 1976, ER modeling provides a graphical framework for representing data and its relationships. It is crucial for understanding and organizing the data within a relational database. The keyword of ER modeling is graphical. The core concepts include entities, relationships, and attributes:
Entities: In ER modeling, an entity represents a distinct object or concept within the database. For example, in a university database, entities might include Student, Course, and Professor. Each entity is represented as a table in the relational model.
Attributes: Attributes describe the properties of entities. For instance, the Student entity might have attributes such as Student_ID, Name, Date_Of_Birth, and Major. Attributes become columns within the corresponding table.
Relationships: Relationships in ER modeling illustrate how entities are associated with one another. Relationships represent the connections between entities and are essential for understanding how data is interrelated. For example, a Student might be enrolled in a Course, creating a relationship between these two entities.

The caveat is that relationships come in several types:
One-to-One: In this type of relationship, each instance of entity A is associated with exactly one instance of entity B, and vice versa. For example, each Student might have one Student_ID, and each Student_ID corresponds to exactly one student.
One-to-Many: This relationship type occurs when a single instance of entity A is associated with multiple instances of entity B, but each instance of entity B is associated with only one instance of entity A. For example, a Professor might teach multiple Courses, but each Course is taught by only one Professor. If we pause here, we can immediately sense a problem with enforcing such a rigid relationship: if a course is to be taught by two or three professors (a rare scenario, but it does happen), the schema and table design would need to change. The more exceptions you can think of, the more redesigns you would face.
Many-to-Many: This relationship occurs when multiple instances of entity A can be associated with multiple instances of entity B. For example, a Student can enroll in multiple Courses, and each Course can have multiple Students enrolled. To model many-to-many relationships, a junction table (or associative entity) is used, which holds foreign keys referencing both entities.

ER diagrams offer a clear and structured way to represent entities, their attributes, and the relationships between them:
Entities are represented by rectangles
Attributes are shown as ovals, each connected to its corresponding entity
Relationships are illustrated as diamonds, linking the relevant entities

This visual framework provides a comprehensive way to design database schemas and better understand how different data elements interact within a system.
The ER diagram is essentially the graph data model we will be discussing throughout the book. The only difference between SQL and GQL in terms of ER diagrams is that GQL and graph databases natively organize and represent entities and their relationships, whereas SQL and RDBMS use ER diagrams only as metadata, with the real data records stored in lower-dimensional tables. It is tempting to attribute the prevalence of the relational model to the limited computing power available at the time it was invented; exponentially greater computing power would eventually demand something more advanced, and more intuitive and flexible as well.
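As a brief preview of what this native representation looks like (the syntax is introduced properly in Chapters 3 and 4), the university example above could be expressed directly in GQL. This is only an illustrative sketch: the Student and Course labels, the ENROLLED_IN edge type, and the property names are assumptions, not a prescribed schema.

// Insert two entities and the relationship between them as one pattern
INSERT (s:Student {_id: 's1', name: 'Alice', major: 'Mathematics'}),
       (c:Course {_id: 'c1', title: 'Graph Theory'}),
       (s)-[:ENROLLED_IN {term: '2025 Fall'}]->(c)

// Retrieve students together with the courses they are enrolled in
MATCH (s:Student)-[:ENROLLED_IN]->(c:Course)
RETURN s.name, c.title

Note that no junction table or foreign key is required: a many-to-many enrollment is simply a set of edges between Student and Course nodes.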
Transactions are a crucial aspect of relational databases, ensuring that operations are performed reliably and consistently. To better understand how these principles work in practice, let’s explore ACID properties.
The ACID properties – Atomicity, Consistency, Isolation, and Durability – define the key attributes of a transaction. Let’s explore them in detail:
Atomicity: Atomicity ensures that a transaction is treated as a single, indivisible unit of work. This is crucial for maintaining data integrity, especially in scenarios where multiple operations are performed as part of a single transaction. This means that either all operations within the transaction are completed successfully, or none are applied. If any operation fails, the entire transaction is rolled back, leaving the database in its previous state. It prevents partial updates that could lead to inconsistent data states.
Consistency: Consistency ensures that a transaction takes the database from one valid state to another valid state, preserving the integrity constraints defined in the schema. All business rules, data constraints, and relationships must be maintained throughout the transaction. Consistency guarantees that database rules are enforced and that the database remains in a valid state before and after the transaction.
Isolation: Isolation ensures that the operations of a transaction are isolated from other concurrent transactions. Even if multiple transactions are executed simultaneously, each transaction operates as if it were the only one interacting with the database. Isolation prevents interference between transactions, avoiding issues such as dirty reads, non-repeatable reads, and phantom reads. It ensures that each transaction’s operations are independent and not affected by others.
Durability: Durability guarantees that once a transaction is committed, its changes are permanent and persist even in the event of a system failure or crash. The committed data is stored in non-volatile memory, ensuring its longevity. Durability ensures that completed transactions are preserved and that changes are not lost due to unforeseen failures. This property provides reliability and trustworthiness in the database system.

These attributes are best illustrated by linking them to a real-world application. Consider a financial institution’s transaction processing system where a transaction involves transferring funds from one account to another: the transaction must ensure that both the debit and credit operations are completed successfully (atomicity), the account balances remain consistent (consistency), other transactions do not see intermediate states (isolation), and the changes persist even if the system fails (durability). These properties are essential for the accuracy and reliability of financial transactions.
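As a preview of the GQL transaction statements covered in Chapter 9, the fund transfer above could be wrapped in an explicit transaction roughly as follows. This is a sketch only: the Account label, balance property, and account IDs are hypothetical, and the transaction options available vary by implementation.

START TRANSACTION
// Debit one account and credit the other within a single atomic unit of work
MATCH (a:Account {_id: 'acc-001'}), (b:Account {_id: 'acc-002'})
SET a.balance = a.balance - 100,
    b.balance = b.balance + 100
// Make the changes durable; ROLLBACK would undo them instead
COMMIT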
The ACID properties were introduced in 1976 by Jim Gray and laid the foundation for reliable database transaction management. These properties were gradually incorporated into the SQL standard, beginning with SQL-86, and have since remained integral to relational database systems. For nearly fifty years, the principles of ACID have been continuously adopted and refined by most relational database vendors, ensuring robust transaction management and data integrity. When comparing relational database management systems (RDBMS) with NoSQL and graph databases, the needs and implementation priorities of ACID properties vary, influencing how these systems handle transaction management and consistency.
Modern RDBMS include robust transaction management mechanisms to handle ACID properties. These systems use techniques such as logging, locking, and recovery to ensure transactions are executed correctly and data integrity is maintained. Managing concurrent transactions is essential for ensuring isolation and consistency. Techniques such as locking (both exclusive and shared) and multi-version concurrency control (MVCC) are used to handle concurrent access to data and prevent conflicts.
Today, big data is ubiquitous, influencing nearly every industry across the globe. As data grows in complexity and scale, traditional relational databases show limitations in addressing these new challenges. Unlike the structured, table-based model of relational databases, the real world is rich, high-dimensional, and interconnected, requiring new approaches to data management. The evolution of big data and NoSQL technologies demonstrates how traditional models struggled to meet the needs of complex, multi-faceted datasets. In this context, graph databases have emerged as a powerful and flexible solution, capable of modeling and querying intricate relationships in ways that were previously difficult to achieve. As industries continue to generate and rely on interconnected data, graph databases are positioning themselves as a transformative force, offering significant advantages in managing and leveraging complex data relationships.
The advent of big data marked a significant turning point in data management and analytics. While we often date the onset of the big data era to around 2012, the groundwork for this revolution was laid much earlier. A key milestone was the release of Hadoop by Yahoo! in 2006, which was subsequently donated to the Apache Foundation. Hadoop’s design was heavily inspired by Google’s seminal papers on the Google File System (GFS) and MapReduce.
GFS, introduced in 2003, and MapReduce, which followed in 2004, provided a new way of handling vast amounts of data across distributed systems. These innovations stemmed from the need to process and analyze the enormous data generated by Google’s search engine. At the core of Google’s search engine technology was PageRank, a graph algorithm for ranking web pages based on their link structures, intentionally named as a pun on Google co-founder Larry Page’s surname. This historical context illustrates that big data technologies have deep roots in graph theory, evolving towards increasingly sophisticated and large-scale systems.
Figure 1.4: From data to big data to fast data and deep data
Examining the trajectory of data processing technologies over the past 50 years reveals a clear evolution through distinct stages:
The Era of Relational Databases (1970s-present): This era is defined by the dominance of relational databases, which organize data into structured tables and use SQL for data manipulation and retrieval.
The Era of Non-Relational Databases and Big Data Frameworks (2000s-present): The rise of NoSQL databases and big data frameworks marked a departure from traditional relational models. These technologies address the limitations of relational databases in handling unstructured data and massive data volumes.
The Post-Relational Database Era (2020s and beyond): Emerging technologies signal a shift towards post-relational databases, including NewSQL and Graph Query Language (GQL). These advancements seek to overcome the constraints of previous models and offer enhanced capabilities for managing complex, interconnected data.

Each of these stages has been accompanied by the development of corresponding query languages:
Relational Database—SQL: Standardized in 1986 (ANSI) and 1987 (ISO), SQL became the cornerstone of relational databases, providing a powerful and versatile language for managing structured data.
Non-Relational Database—NoSQL: The NoSQL movement introduced alternative models for data storage and retrieval, focusing on scalability and flexibility. NoSQL databases extend beyond SQL’s capabilities but lack formal standardization.
Post-Relational Database—NewSQL and GQL