Graph Data Science with Neo4j (E-Book)

Estelle Scifo

Description
Neo4j, along with its Graph Data Science (GDS) library, is a complete solution to store, query, and analyze graph data. As graph databases grow in popularity among developers, data scientists are increasingly likely to encounter them in their careers, which makes working with graph algorithms to extract contextual information and improve overall model prediction performance an indispensable skill.
Data scientists working with Python will be able to put their knowledge to work with this practical guide to Neo4j and the GDS library, which offers step-by-step explanations of essential concepts and practical instructions for implementing data science techniques on graph data, using the latest Neo4j version 5 and its associated libraries. You’ll start by querying Neo4j with Cypher and learning how to characterize graph datasets. As you get the hang of running graph algorithms on graph data stored in Neo4j, you’ll understand the new and advanced capabilities of the GDS library that enable you to make predictions and write data science pipelines. Using the newly released GDS Python client, you’ll be able to integrate graph algorithms into your machine learning pipeline.
By the end of this book, you’ll be able to take advantage of the relationships in your dataset to improve your current model and make other types of elaborate predictions.

The e-book can be read in Legimi apps or any app that supports the following formats:

EPUB
MOBI

Page count: 319

Publication year: 2023




Graph Data Science with Neo4j

Learn how to use Neo4j 5 with Graph Data Science library 2.0 and its Python driver for your project

Estelle Scifo

BIRMINGHAM—MUMBAI

Graph Data Science with Neo4j

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Ali Abidi

Senior Editor: Nathanya Dias

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Hemangini Bari

Production Designer: Shankar Kalbhor

Marketing Coordinator: Vinishka Kalra

First published: January 2023

Production reference: 1310123

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80461-274-3

www.packtpub.com

Contributors

About the author

Estelle Scifo is a Neo4j Certified Professional and Neo4j Graph Data Science certified user. She is currently a machine learning engineer at GraphAware where she builds Neo4j-related solutions to make customers happy with graphs.

Before that, she worked in several fields, starting out with research in particle physics, during which she worked at CERN on uncovering Higgs boson properties. She received her PhD in 2014 from the Laboratoire de l’Accélérateur Linéaire (Orsay, France). Continuing her career in industry, she worked in real estate, mobility, and logistics for almost 10 years. In the Neo4j community, she is known as the creator of neomap, a map visualization application for data stored in Neo4j. She also regularly gives talks at conferences such as NODES and PyCon. Her domain expertise and deep insight into the perspective of a beginner’s needs make her an excellent teacher.

There is only one name on the cover, but a book is not the work of one person. I would like to thank everyone involved in making this book a reality. Beyond everyone at Packt, the reviewers did an incredible job of suggesting some very relevant improvements. Thank you, all!

I hope this book will inspire you as much as other books of this genre have inspired me.

About the reviewers

Dr. David Gurzick is the founding chair of the George B. Delaplaine Jr. School of Business and an associate professor of management science at Hood College. He has a BS in computer science from Frostburg State University, an MS in computer science from Hood College, and a PhD in information systems from the University of Maryland, Baltimore County, and is a graduate of Harvard’s business analytics program. As a child of the internet, he grew up on AOL and programmed his way through the dot-com era. He now helps merge technology and business strategy to enable innovation and accelerate commercial success as the lead data scientist at Genitive.ai and as a director of the Frederick Innovative Technology Center, Inc. (FITCI).

Sean William Grant is a product and analytics professional with over 20 years of experience in technology and data analysis. His experience ranges from geospatial intelligence with the United States Marine Corps and product management within the aviation and autonomy space to implementing advanced analytics and data science within organizations. He is a graph data science and network analytics enthusiast who frequently gives presentations and workshops on connected data. He has also been a technical advisor to several early-stage start-ups. Sean is passionate about data and technology, and how they can elevate our understanding of ourselves.

Jose Ernesto Echeverria has worked with all kinds of databases, from relational databases in the 1990s to non-SQL databases in the 2010s. He considers graph databases to be the best fit for solving real-world problems, given their strong capability for modeling and adaptability to change. As a polyglot programmer, he has used languages such as Java, Ruby, and R and tools such as Jupyter with Neo4j in order to solve data management problems for multinational corporations. A long-time advocate of data science, he expects this long-awaited book to cover the proper techniques and approach the intersections of this discipline, as well as help readers to discover the possibilities of graph databases. When not working, he enjoys spending time with friends and family.

Table of Contents

Preface

Part 1 – Creating Graph Data in Neo4j

1

Introducing and Installing Neo4j

Technical requirements

What is a graph database?

Databases

Graph database

Finding or creating a graph database

A note about the graph dataset’s format

Modeling your data as a graph

Neo4j in the graph databases landscape

Neo4j ecosystem

Setting up Neo4j

Downloading and starting Neo4j Desktop

Creating our first Neo4j database

Creating a database in the cloud – Neo4j Aura

Inserting data into Neo4j with Cypher, the Neo4j query language

Extracting data from Neo4j with Cypher pattern matching

Summary

Further reading

Exercises

2

Importing Data into Neo4j to Build a Knowledge Graph

Technical requirements

Importing CSV data into Neo4j with Cypher

Discovering the Netflix dataset

Defining the graph schema

Importing data

Introducing the APOC library to deal with JSON data

Browsing the dataset

Getting to know and installing the APOC plugin

Loading data

Dealing with temporal data

Discovering the Wikidata public knowledge graph

Data format

Query language – SPARQL

Enriching our graph with Wikidata information

Loading data into Neo4j for one person

Importing data for all people

Dealing with spatial data in Neo4j

Importing data in the cloud

Summary

Further reading

Exercises

Part 2 – Exploring and Characterizing Graph Data with Neo4j

3

Characterizing a Graph Dataset

Technical requirements

Characterizing a graph from its node and edge properties

Link direction

Link weight

Node type

Computing the graph degree distribution

Definition of a node’s degree

Computing the node degree with Cypher

Visualizing the degree distribution with NeoDash

Installing and using the Neo4j Python driver

Counting node labels and relationship types in Python

Building the degree distribution of a graph

Improved degree distribution

Learning about other characterizing metrics

Triangle count

Clustering coefficient

Summary

Further reading

Exercises

4

Using Graph Algorithms to Characterize a Graph Dataset

Technical requirements

Digging into the Neo4j GDS library

GDS content

Installing the GDS library with Neo4j Desktop

GDS project workflow

Projecting a graph for use by GDS

Native projections

Cypher projections

Computing a node’s degree with GDS

stream mode

The YIELD keyword

write mode

mutate mode

Algorithm configuration

Other centrality metrics

Understanding a graph’s structure by looking for communities

Number of components

Modularity and the Louvain algorithm

Summary

Further reading

5

Visualizing Graph Data

Technical requirements

The complexity of graph data visualization

Physical networks

General case

Visualizing a small graph with networkx and matplotlib

Visualizing a graph with known coordinates

Visualizing a graph with unknown coordinates

Configuring object display

Discovering the Neo4j Bloom graph application

What is Bloom?

Bloom installation

Selecting data with Neo4j Bloom

Configuring the scene in Bloom

Visualizing large graphs with Gephi

Installing Gephi and its required plugin

Using APOC Extended to synchronize Neo4j and Gephi

Configuring the view in Gephi

Summary

Further reading

Exercises

Part 3 – Making Predictions on a Graph

6

Building a Machine Learning Model with Graph Features

Technical requirements

Introducing the GDS Python client

GDS Python principles

Input and output types

Creating a projected graph from Python

Running GDS algorithms from Python and extracting data in a dataframe

write mode

stream mode

Dropping the projected graph

Using features from graph algorithms in a scikit-learn pipeline

Machine learning tasks with graphs

Our task

Computing features

Extracting and visualizing data

Building the model

Summary

Further reading

Exercise

7

Automatically Extracting Features with Graph Embeddings for Machine Learning

Technical requirements

Introducing graph embedding algorithms

Defining embeddings

Graph embedding classification

Using a transductive graph embedding algorithm

Understanding the Node2Vec algorithm

Using Node2Vec with GDS

Training an inductive embedding algorithm

Understanding GraphSAGE

Introducing the GDS model catalog

Training GraphSAGE with GDS

Computing new node representations

Summary

Further reading

Exercises

8

Building a GDS Pipeline for Node Classification Model Training

Technical requirements

The GDS pipelines

What is a pipeline?

Building and training a pipeline

Creating the pipeline and choosing the features

Setting the pipeline configuration

Training the pipeline

Making predictions

Computing the confusion matrix

Using embedding features

Choosing the graph embedding algorithm to use

Training using Node2Vec

Training using GraphSAGE

Summary

Further reading

Exercise

9

Predicting Future Edges

Technical requirements

Introducing the LP problem

LP examples

LP with the Netflix dataset

Framing an LP problem

LP features

Topological features

Features based on node properties

Building an LP pipeline with the GDS

Creating and configuring the pipeline

Pipeline training and testing

Summary

Further reading

10

Writing Your Custom Graph Algorithms with the Pregel API in Java

Technical requirements

Introducing the Pregel API

GDS’s features

The Pregel API

Implementing the PageRank algorithm

The PageRank algorithm

Simple Python implementation

Pregel Java implementation

Implementing the tolerance-stopping criteria

Testing our code

Test for the PageRank class

Test for the PageRankTol class

Using our algorithm from Cypher

Adding annotations

Building the JAR file

Updating the Neo4j configuration

Testing our procedure

Summary

Further reading

Exercises

Index

Other Books You May Enjoy

Preface

Data science today is a core component of many companies and organizations taking advantage of its predictive power to improve their products or better understand their customers. It is an ever-evolving field, still undergoing intense research. One of the most trending research areas is graph data science (GDS), or how representing data as a connected network can improve models.

Among the different tools on the market to work with graphs, Neo4j, a graph database, is popular among developers for its ability to build simple and evolving data models and query data easily with Cypher. For a few years now, it has also stood out as a leader in graph analytics, especially since the release of the first version of its GDS library, allowing you to run graph algorithms from data stored in Neo4j, even at a large scale.

This book is designed to guide you through the field of GDS, always using Neo4j and its GDS library as the main tool. By the end of this book, you will be able to run your own GDS model on a graph dataset you created; you will even be able to pass the Neo4j Data Science certification to prove your new skills to the world.

Who this book is for

This book is for people who are curious about graphs and how this data structure can be useful in data science. It can serve both data scientists who are learning about graphs and Neo4j developers who want to get into data science.

The book assumes minimal data science knowledge (classification, training sets, confusion matrices) and some experience with Python and its related data science toolkit (pandas, matplotlib, and scikit-learn).

What this book covers

Chapter 1, Introducing and Installing Neo4j, introduces the basic principles of graph databases and gives instructions on how to set up Neo4j locally, create your first graph, and write your first Cypher queries.

Chapter 2, Using Existing Data to Build a Knowledge Graph, guides you through loading data into Neo4j from different formats (CSV, JSON, and an HTTP API). This is where you will build the dataset that will be used throughout this book.

Chapter 3, Characterizing a Graph Dataset, introduces some key metrics to differentiate one graph dataset from another.

Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset, goes deeper into understanding a graph dataset by using graph algorithms. This is the chapter where you will start to use the Neo4j GDS plugin.

Chapter 5, Visualizing Graph Data, delves into graph data visualization by drawing nodes and edges, starting from static representations and moving on to dynamic ones.

Chapter 6, Building a Machine Learning Model with Graph Features, talks about machine learning model training using scikit-learn. This is where we will first use the GDS Python client.

Chapter 7, Automating Feature Extraction with Graph Embeddings for Machine Learning, introduces the concept of node embedding, with practical examples using the Neo4j GDS library.

Chapter 8, Building a GDS Pipeline for Node Classification Model Training, introduces the topic of node classification within GDS without involving a third-party tool.

Chapter 9, Predicting Future Edges, gives a short introduction to the topic of link prediction, a graph-specific machine learning task.

Chapter 10, Writing Your Custom Graph Algorithms with the Pregel API in Java, covers the exciting topic of building an extension for the GDS plugin.

To get the most out of this book

You will need access to a Neo4j instance. Options and installation instructions are given in Chapter 1, Introducing and Installing Neo4j. We will also make intensive use of Python and the following packages: pandas, scikit-learn, networkx, and graphdatascience. The code was tested with Python 3.10 but should work with newer versions, assuming no breaking changes are made in its dependencies. Python code is provided as Jupyter notebooks, so you’ll need Jupyter Server installed and running to go through it.

For the very last chapter, a Java JDK will also be required. The code was tested with OpenJDK 11.

Software/hardware covered in the book    Operating system requirements

Neo4j 5.x                                Windows, macOS, or Linux
Python 3.10                              Windows, macOS, or Linux
Jupyter                                  Windows, macOS, or Linux
OpenJDK 11                               Windows, macOS, or Linux

You will also need to install Neo4j plugins: APOC and GDS. Installation instructions for Neo4j Desktop are given in the relevant chapters. However, if you are not using a local Neo4j instance, please refer to the following pages for installation instructions, especially regarding version compatibilities:

APOC: https://neo4j.com/docs/apoc/current/installation/
GDS: https://neo4j.com/docs/graph-data-science/current/installation/

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Graph-Data-Science-with-Neo4j. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”

A block of code is set as follows:

CREATE (:Movie {
    id: line.show_id,
    title: line.title,
    releaseYear: line.release_year
})

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

LOAD CSV WITH HEADERS FROM 'file:///netflix/netflix_titles.csv' AS line
WITH split(line.director, ",") AS directors_list
UNWIND directors_list AS director_name
CREATE (:Person {name: trim(director_name)})

Any command-line input or output is written as follows:

$ mkdir css
$ cd css

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Select System info from the Administration panel.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read Graph Data Science with Neo4j, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application. 

The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/9781804612743

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly

Part 1 – Creating Graph Data in Neo4j

In this first part, you will learn about Neo4j and set up your first graph database. You will also build a graph dataset in Neo4j using Cypher, the APOC library, and public knowledge graphs.

This part includes the following chapters:

Chapter 1, Introducing and Installing Neo4j
Chapter 2, Using Existing Data to Build a Knowledge Graph

1

Introducing and Installing Neo4j

Graph databases in general, and Neo4j in particular, have gained increasing interest in the past few years. They provide a natural way of modeling entities and relationships and take into account the observation context, which is often crucial to extract the most out of your data. Among the different graph database vendors, Neo4j has become one of the most popular for both data storage and analytics. A lot of tools have been developed, by the company itself or by the community, to make the whole ecosystem consistent and easy to use: from storage to querying, visualization, and graph data science. As you will see throughout this book, there is a well-integrated application or plugin for each of these topics.

In this chapter, you will get to know what Neo4j is, positioning it in the broad context of databases. We will also introduce the aforementioned plugins that are used for graph data science.

Finally, you will set up your first Neo4j instance locally if you haven’t done so already and run your first Cypher queries to populate the database with some data and retrieve it.

In this chapter, we’re going to cover the following main topics:

What is a graph database?
Finding or creating a graph database
Neo4j in the graph databases landscape
Setting up Neo4j
Inserting data into Neo4j with Cypher, the Neo4j query language
Extracting data from Neo4j with Cypher pattern matching

Technical requirements

To follow this chapter well, you will need access to the following resources:

You’ll need a computer that can run Neo4j locally; Windows, macOS, and Linux are all supported. Please refer to the Neo4j website for more details about system requirements: https://neo4j.com/docs/operations-manual/current/installation/requirements/.
Any code listed in the book will be available in the associated GitHub repository – that is, https://github.com/PacktPublishing/Graph-Data-Science-with-Neo4j – in the corresponding chapter folder.

What is a graph database?

Before we get our hands dirty and start playing with Neo4j, it is important to understand what Neo4j is and how different it is from the data storage engine you are used to. In this section, we are going to discuss (quickly) the different types of databases you can find today, and why graph databases are so interesting and popular both for developers and data professionals.

Databases

Databases make up an important part of computer science. Discussing the evolution and state of the art of the different implementations in detail would require several books like this one – fortunately, this is not a requirement to use such systems effectively. However, it is important to be aware of the existing tools related to data storage and how they differ from each other, to be able to choose the right tool for the right task. The fact that, after reading this book, you’ll be able to use graph databases and Neo4j in your data science projects doesn’t mean you will have to use them every single time you start a new project, whatever the context. Sometimes, they won’t be suitable; this introduction will explain why.

A database, in the context of computing, is a system that allows you to store and query data on a computer, phone, or, more generally, any electronic device.

As developers or data scientists of the 2020s, we have mainly faced two kinds of databases:

Relational databases (SQL) such as MySQL or PostgreSQL. These store data as records in tables whose columns are attributes or fields and whose rows represent each entity. They have a predefined schema, defining how data is organized and the type of each field. Relationships between entities in this representation are modeled by foreign keys (requiring unique identifiers). When the relationship is more complex, such as when attributes are required or when we can have many relationships between the same objects, an intermediate junction (join) table is required.

NoSQL databases, which contain many different types of databases:

Key-value stores such as Redis or Riak. A key-value (KV) store, as the name suggests, is a simple lookup database where the key is usually a string and the value can be a more complex object that can’t be used to filter the query; it can only be retrieved. They are known to be very efficient for caching in a web context, where the key is the page URL and the value is the dynamically generated HTML content of the page. KV stores can also be used to model graphs when building a native graph engine is not an option. You can see KV stores in action in the following project:

IndraDB: This is a graph database written in Rust that relies on different types of KV stores: https://github.com/indradb/indradb

Document-oriented databases such as MongoDB or CouchDB. These are useful for storing schema-less documents (usually JSON objects or a derivative). They are much more flexible compared to relational databases, since each document may have different fields. However, relationships are harder to model, and such databases rely a lot on nested JSON and information duplication instead of joining multiple tables.

The preceding list is non-exhaustive; other types of data stores have been created and abandoned over time, or have only emerged in the past few years, so we will have to wait to see how useful they can be. We can mention, for instance, vector databases, such as Weaviate, which store data together with its vector representation to ease searching in the vector space, with many applications in machine learning once a vector representation (embedding) of an observation has been computed.

Graph databases can also be classified as NoSQL databases. They bring another approach to the data storage landscape, especially in the data model phase.

Graph database

In the previous section, we talked about databases. Before discussing graph databases, let’s introduce the concept of graphs.

A graph is a mathematical object defined by the following:

A set of vertices or nodes (the dots)
A set of edges (the connections between these dots)

The following figure shows several examples of graphs, big and small:

Figure 1.1 – Representations of some graphs

As you can see, there’s a Road network (in Europe), a Computer network, and a Social network. But in practice, far more objects can be seen as graphs:

Time series: Each observation is connected to the next one
Images: Each pixel is linked to its eight neighbors (see the bottom-right picture in Figure 1.1)
Texts: Here, each word is connected to its surrounding words or a more complex mapping, depending on its meaning (see the following figure):

Figure 1.2 – Figure generated with the spacy Python library, which was able to identify the relationships between words in a sentence using NLP techniques

A graph can be seen as a generalization of these static representations, where links can be created with fewer constraints.

Another advantage of graphs is that they can be easily traversed, going from one node to another by following edges. They have been used for representing networks for a long time – road networks or communication infrastructure, for instance. The concept of a path, especially the shortest path in a graph, is a long-studied field. But the analysis of graphs doesn’t stop here – much more information can be extracted from carefully analyzing a network, such as its structure (are there groups of nodes disconnected from the rest of the graph? Are groups of nodes more tied to each other than to other groups?) and node ranking (node importance). We will discuss these algorithms in more detail in Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset.
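To make the traversal idea concrete, here is a minimal sketch of shortest-path finding with a breadth-first search over an adjacency list. It uses plain Python and an invented toy road network; it is an illustration of the concept, not how Neo4j implements its shortest-path procedures.

```python
from collections import deque

def shortest_path(adjacency, start, goal):
    """Breadth-first search for a shortest path in an unweighted graph.

    adjacency maps each node to the list of its neighbors.
    Returns the list of nodes along a shortest path, or None if unreachable.
    """
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

# A tiny, hypothetical road network as an adjacency list
roads = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

print(shortest_path(roads, "A", "E"))  # ['A', 'B', 'D', 'E']
```

Because BFS explores nodes in order of increasing distance from the start, the first path reaching the goal is guaranteed to be a shortest one (in number of hops).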

So, we know what a database is and what a graph is. Now comes the natural question: what is a graph database? The answer is quite simple: in a graph database, data is saved into nodes, which can be connected through edges to model relationships between them.
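As an illustration only (plain Python dictionaries with invented data, not Neo4j's actual storage format), the following sketch contrasts the relational style, where a relationship lives in a join table, with the graph style, where nodes carry properties and relationships are first-class objects that can carry properties of their own:

```python
# Relational style: rows in tables, linked through a join table of foreign keys.
people = {1: {"name": "Jane"}}
movies = {10: {"title": "Example Movie"}}
directed = [(1, 10)]  # join table: (person_id, movie_id)

# Graph style: nodes with labels and properties, connected by typed edges
# that can carry their own properties (here, a hypothetical year).
nodes = {
    "p1": {"labels": ["Person"], "props": {"name": "Jane"}},
    "m1": {"labels": ["Movie"], "props": {"title": "Example Movie"}},
}
edges = [{"type": "DIRECTED", "from": "p1", "to": "m1", "props": {"year": 2020}}]

# "Which movies did Jane direct?" is a join in the first model
# and a simple edge traversal in the second.
titles = [
    nodes[e["to"]]["props"]["title"]
    for e in edges
    if e["type"] == "DIRECTED" and nodes[e["from"]]["props"]["name"] == "Jane"
]
print(titles)  # ['Example Movie']
```

In a real graph database, this traversal does not require scanning a join table; each node directly references its relationships, which is what makes multi-hop queries cheap.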

At this stage, you may be wondering: OK, but where can I find graph data? While we are used to CSV or JSON datasets, graph formats are not yet common and may be unfamiliar to some of you. If you do not have graph data, why would you need a graph database? There are two possible answers to this question, both of which we are going to discuss.

Finding or creating a graph database

Data scientists know how to find or generate datasets that fit their needs. Randomly generating a variable distribution while following some probabilistic law is one of the first things you’ll learn in a statistics course. Similarly, graph datasets can be randomly generated, following some rules. However, this book is not a graph theory book, so we are not going to dig into these details here. Just be aware that this can be done. Please refer to the references in the Further reading section to learn more.

Regarding existing datasets, some of them are very popular and data scientists know about them because they have used them while learning data science and/or because they are the topic of well-known Kaggle competitions. Think, for instance, about the Titanic or house price datasets. Other datasets are also used for model benchmarking, such as the MNIST or ImageNet datasets in computer vision tasks.

The same holds for graph data science, where some datasets are very common for teaching or benchmarking purposes. If you investigate graph theory, you will read about the Zachary’s karate club (ZKC) dataset, which is probably one of the most famous graph datasets out there (side note: there is even a ZKC trophy, which is awarded to the first person at a graph conference who mentions this dataset). The ZKC dataset is very simple (34 nodes; see Chapter 3, Characterizing a Graph Dataset, and Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset, on how to characterize a graph dataset), but bigger and more complex datasets are also available.

There are websites referencing graph datasets, which can be used for benchmarking in a research context or educational purpose, such as this book. Two of the most popular ones are the following:

The Stanford Network Analysis Project (SNAP) (https://snap.stanford.edu/data/index.html) lists different types of networks in different categories (social networks, citation networks, and so on)
The Network Repository Project, via its website at https://networkrepository.com/index.php, provides hundreds of graph datasets from real-world examples, classified into categories (for example, biology, economics, recommendations, road, and so on)

If you browse these websites and start downloading some of the files, you’ll notice the data comes in unfamiliar formats. We’re going to list some of them next.

A note about the graph dataset’s format

The datasets we are used to are mainly exchanged as CSV or JSON files. To represent a graph, with nodes on one side and edges on the other, several dedicated formats have been devised.

The main data formats that are used to save graph data as text files are the following:

- Edge list: This is a text file where each row contains an edge definition. For instance, a graph with three nodes (A, B, C) and two edges (A-B and C-A) is defined by the following edgelist file:

```
A B
C A
```

- Matrix Market (with the .mtx extension): This format is an extension of the previous one. It is quite frequent on the Network Repository website.
- Adjacency matrix: The adjacency matrix is an NxN matrix (where N is the number of nodes in the graph) where the ij element is 1 if nodes i and j are connected through an edge and 0 otherwise. The adjacency matrix of the simple graph with three nodes and two edges is a 3x3 matrix, as shown in the following code block. I have explicitly displayed the row and column names only for convenience, to help you identify what i and j are:

```
  A B C
A 0 1 0
B 0 0 0
C 1 0 0
```
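To make the correspondence between the two representations concrete, here is a small Python sketch (illustrative only, not part of the book’s code bundle) that turns the edge list above into the same directed adjacency matrix:

```python
def edge_list_to_adjacency(edges, nodes):
    """Build a directed adjacency matrix (list of lists) from an edge list."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix

# The two edges from the edgelist file above: A -> B and C -> A
edges = [("A", "B"), ("C", "A")]
matrix = edge_list_to_adjacency(edges, nodes=["A", "B", "C"])
print(matrix)  # [[0, 1, 0], [0, 0, 0], [1, 0, 0]]
```

Note that for an undirected graph, you would set both `matrix[i][j]` and `matrix[j][i]`, making the matrix symmetric.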

Note

The adjacency matrix is one way to vectorize a graph. We’ll come back to this topic in Chapter 7, Automatically Extracting Features with Graph Embeddings for Machine Learning.

- GraphML: Derived from XML, the GraphML format is much more verbose but lets us define more complex graphs, especially those where nodes and/or edges carry properties. The following example uses the preceding graph but adds a name property to nodes and a length property to edges:

```
<?xml version='1.0' encoding='utf-8'?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd" >
    <!-- DEFINING PROPERTY NAME WITH TYPE AND ID -->
    <key attr.name="name" attr.type="string" for="node" id="d1"/>
    <key attr.name="length" attr.type="double" for="edge" id="d2"/>
    <graph edgedefault="directed">
        <!-- DEFINING NODES -->
        <node id="A">
            <!-- SETTING NODE PROPERTY -->
            <data key="d1">"Point A"</data>
        </node>
        <node id="B">
            <data key="d1">"Point B"</data>
        </node>
        <node id="C">
            <data key="d1">"Point C"</data>
        </node>
        <!-- DEFINING EDGES
            with source and target nodes and properties
        -->
        <edge id="AB" source="A" target="B">
            <data key="d2">123.45</data>
        </edge>
        <edge id="CA" source="C" target="A">
            <data key="d2">56.78</data>
        </edge>
    </graph>
</graphml>
```
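Because GraphML is plain XML, you can inspect such a file with Python’s standard library alone, before involving any graph tooling. The following sketch (illustrative; the embedded document is a trimmed version of the example above) extracts node names and edge lengths:

```python
import xml.etree.ElementTree as ET

GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
    <key attr.name="name" attr.type="string" for="node" id="d1"/>
    <key attr.name="length" attr.type="double" for="edge" id="d2"/>
    <graph edgedefault="directed">
        <node id="A"><data key="d1">Point A</data></node>
        <node id="B"><data key="d1">Point B</data></node>
        <node id="C"><data key="d1">Point C</data></node>
        <edge id="AB" source="A" target="B"><data key="d2">123.45</data></edge>
        <edge id="CA" source="C" target="A"><data key="d2">56.78</data></edge>
    </graph>
</graphml>"""

NS = {"g": "http://graphml.graphdrawing.org/xmlns"}
root = ET.fromstring(GRAPHML)

# Map each node id to its "name" property (declared under key d1)
names = {
    node.get("id"): node.find("g:data[@key='d1']", NS).text
    for node in root.iter("{http://graphml.graphdrawing.org/xmlns}node")
}

# Collect (source, target, length) triples from the edges (key d2)
edges = [
    (e.get("source"), e.get("target"), float(e.find("g:data[@key='d2']", NS).text))
    for e in root.iter("{http://graphml.graphdrawing.org/xmlns}edge")
]

print(names)  # {'A': 'Point A', 'B': 'Point B', 'C': 'Point C'}
print(edges)  # [('A', 'B', 123.45), ('C', 'A', 56.78)]
```

In practice, graph libraries ship their own GraphML readers, but seeing the raw structure helps when a download from one of the dataset websites does not load as expected.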

If you find a dataset already formatted as a graph, it is likely to be using one of the preceding formats. However, most of the time, you will want to use your own data, which is not yet in graph format – it might be stored in the previously described databases or CSV or JSON files. If that is the case, then the next section is for you! There, you will learn how to transform your data into a graph.

Modeling your data as a graph

The second answer to the main question in this section is: your data is probably a graph, without you being aware of it yet. We will elaborate on this topic in the next chapter (Chapter 2, Using Existing Data to Build a Knowledge Graph), but let me give you a quick overview.

Let’s take the example of an e-commerce website, which has customers (users) and products. As in every e-commerce website, users can place orders to buy some products. In the relational world, the data schema that’s traditionally used to represent such a scenario is represented on the left-hand side of the following screenshot:

Figure 1.3 – Modeling e-commerce data as a graph

The relational data model works as follows:

- A table is created to store users, with a unique identifier (id) and a username (apart from security and personal information required for such a website, you can easily imagine how to add columns to this table).
- Another table contains the data about the available products.
- Each time a customer places an order, a new row is added to an order table, referencing the user by its ID (a foreign key with a one-to-many relationship, where a user can place many orders).
- To remember which products were part of which orders, a many-to-many relationship is created (an order contains many products and a product is part of many orders). We usually create a relationship table, linking orders to products (the order product table, in our example).

Note

Please refer to the colored version of the preceding figure, which can be found in the graphics bundle link provided in the Preface, for a better understanding of the correspondence between the two sides of the figure.

In a graph database, all the _id columns are replaced by actual relationships, which are first-class entities in graph databases, not just conceptual ones as in the relational model. You can also get rid of the order product table, since information specific to a product in a given order, such as the ordered quantity, can be stored directly on the relationship between the order and the product nodes. The data model is much more natural and easier to document and present to other people on your team.
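To see why the join table disappears, consider this minimal in-memory sketch of the e-commerce graph in plain Python (all identifiers and property names here are made up for illustration; this is a data-structure analogy, not any database’s API). The quantity lives directly on the relationship connecting an order to a product:

```python
# Nodes with a label and properties, mirroring a labeled-property graph.
nodes = {
    "user1": {"label": "User", "name": "Alice"},
    "order1": {"label": "Order", "placed_on": "2023-01-15"},
    "prod1": {"label": "Product", "name": "Keyboard"},
    "prod2": {"label": "Product", "name": "Mouse"},
}

# Relationships as (start, type, end, properties): the CONTAINS
# relationships carry the ordered quantity themselves, so no
# separate "order product" join table is needed.
relationships = [
    ("user1", "PLACED", "order1", {}),
    ("order1", "CONTAINS", "prod1", {"quantity": 1}),
    ("order1", "CONTAINS", "prod2", {"quantity": 2}),
]

def products_in_order(order_id):
    """Follow CONTAINS relationships from an order to its products."""
    return [
        (nodes[end]["name"], props["quantity"])
        for start, rel_type, end, props in relationships
        if start == order_id and rel_type == "CONTAINS"
    ]

print(products_in_order("order1"))  # [('Keyboard', 1), ('Mouse', 2)]
```

Traversing from the order to its products is a single hop over the relationships, where the relational model would need two joins through the order product table.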

Now that we have a better understanding of what a graph database is, let’s explore the different implementations out there. Like the other types of databases, there is no single implementation for graph databases, and several projects provide graph database functionalities.

In the next section, we are going to discuss some of the differences between them, and where Neo4j is positioned in this technology landscape.

Neo4j in the graph databases landscape

Even when restricting the scope to graph databases, there are still different ways to envision such data stores:

Resource Description Framework (RDF): Each record is a triplet of the Subject Predicate Object type. This is a complex vocabulary that expresses a relationship of a certain type (the predicate) between a subject and an object; for instance:

Alice(Subject) KNOWS(Predicate) Bob(Object)

Very famous knowledge bases such as DBpedia and Wikidata use the RDF format. We will talk about this a bit more in the next chapter (Chapter 2, Using Existing Data to Build a Knowledge Graph).

Labeled-property graph (LPG): A labeled-property graph contains nodes and relationships. Both of these entities can be labeled (for instance, Alice and Bob are nodes with the Person label, and the relationship between them has the KNOWS label) and have properties (people have names; an acquaintance relationship can contain the date when both people first met as a property).
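The contrast between the two models can be sketched with plain Python data structures (purely illustrative, not any particular library’s API): RDF stores everything as bare triples, while an LPG attaches labels and property maps to nodes and relationships:

```python
# RDF view: the whole graph is a flat list of
# (subject, predicate, object) triples.
rdf_triples = [
    ("Alice", "KNOWS", "Bob"),
    ("Alice", "name", "Alice Smith"),  # even properties become triples
]

# LPG view: nodes and relationships are distinct entities,
# each with a label and a property map.
lpg = {
    "nodes": {
        "n1": {"label": "Person", "properties": {"name": "Alice Smith"}},
        "n2": {"label": "Person", "properties": {"name": "Bob Jones"}},
    },
    "relationships": [
        {"start": "n1", "end": "n2", "label": "KNOWS",
         "properties": {"since": "2015-06-01"}},
    ],
}

# In an LPG, a property of the relationship itself (such as when two
# people first met) lives directly on the relationship.
knows = lpg["relationships"][0]
print(knows["label"], knows["properties"]["since"])  # KNOWS 2015-06-01
```

Attaching data to a relationship is awkward in plain RDF (it requires extra triples about the triple), which is one reason the LPG model feels more natural for the kind of data this book works with.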

Neo4j is a labeled-property graph. And even there, just as MySQL, PostgreSQL, and Microsoft SQL Server are all relational databases, you will find different vendors offering LPG graph databases. They differ in many aspects:

- Whether they use a native graph engine or not: As we discussed earlier, it is possible to use a KV store or even a SQL database to store graph data. In this case, we’re talking about non-native storage engines, since the storage does not reflect the graphical nature of the data.
- The query language: Unlike SQL, the query language to deal with graph data has not yet been standardized, even if there is an ongoing effort being led by the GQL group (see, for instance, https://gql.today/). Neo4j uses Cypher, a declarative query language developed by the company in 2011 and then open-sourced in the openCypher project, allowing other databases to use the same language (see, for instance, RedisGraph or Amazon Neptune). Other vendors have created their own languages (AQL for ArangoDB or GSQL for TigerGraph, for instance). To me, this is a key point to take into account, since the learning curve can be very different from one language to another. Cypher has the advantage of being very intuitive: a few minutes are enough to start writing your own queries without much effort.
- Their (integrated or not) support for graph analytics and data science.

A note about performances

Almost every vendor claims to be the best one, at least in some respects. This book won’t add to that debate. If performance is crucial for your application, the best option is to test the candidate databases with a scenario close to your final use case in terms of data volume and the type of queries/analysis.

Neo4j ecosystem

The Neo4j database is already very helpful by itself, but the number of extensions, libraries, and applications related to it makes it the most complete solution. In addition, it has a very active community of members always keen to help each other, which is one of the reasons to choose it.

The core Neo4j database capabilities can be extended thanks to some plugins. Awesome Procedures on Cypher (APOC), a common Neo4j extension, contains some procedures that can extend the database and Cypher capabilities. We will use it later in this book to load JSON data.

The main plugin we will explore in this book is the Graph Data Science Library. Its predecessor, the Graph Algorithms Library, was first released in 2018 by the Neo4j Labs team. It was quickly replaced by the Graph Data Science Library, a fully production-ready plugin with improved performance. Algorithms are improved and added regularly. Version 2.0, released in 2022, takes graph data science even further, allowing us to train models and build analysis pipelines directly from the library. It also comes with a handy Python client, which makes it very convenient to include graph algorithms in your usual machine learning processes, whether you use scikit-learn or other machine learning libraries such as TensorFlow or PyTorch.

Besides the plugins, there are also lots of applications out there to help us deal with Neo4j and explore the data it contains. The first application we will use is Neo4j Desktop, which lets us manage several Neo4j databases. Continue reading to learn how to use it. Neo4j Desktop also lets you manage your installed plugins and applications.

Applications installed into Neo4j Desktop are granted access to your active database. While reading this book, you will use the following:

Neo4j Browser: A simple but powerful application that lets you write Cypher queries and visualize the result as a graph, table, or JSON:

Figure 1.4 – Neo4j Browser

Neo4j Bloom: A graph visualization application in which you can customize node styles (size, color, and so on) based on their labels and/or properties: