Data is the new oil and Web3 is generating it at an unprecedented rate. Complete with practical examples, detailed explanations, and ideas for portfolio development, this comprehensive book serves as a step-by-step guide covering the industry best practices, tools, and resources needed to easily navigate the world of data in Web3.
You’ll begin by acquiring a solid understanding of key blockchain concepts and the fundamental data science tools essential for Web3 projects. The subsequent chapters will help you explore the main data sources that can help address industry challenges, decode smart contracts, and build DeFi- and NFT-specific datasets. You’ll then tackle the complexities of feature engineering specific to blockchain data and familiarize yourself with diverse machine learning use cases that leverage Web3 data.
The book includes interviews with industry leaders, providing insights into their professional journeys and how they drive innovation in the Web3 environment. Equipped with experience in handling crypto data, you’ll be able to demonstrate your skills in job interviews, academic pursuits, or when engaging potential clients.
By the end of this book, you’ll have the essential tools to undertake end-to-end data science projects utilizing blockchain data, empowering you to help shape the next-generation internet.
Page count: 440
Year of publication: 2023
Data Science for Web3
A comprehensive guide to decoding blockchain data with data analysis basics and machine learning cases
Gabriela Castillo Areco
BIRMINGHAM—MUMBAI
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
The author of this book is the proprietor of the Artsy Monke #9937 NFT, featured on the cover, and possesses the commercial rights to this NFT.
Group Product Manager: Niranjan Naikwadi
Publishing Product Managers: Dinesh Chaudhary and Sanjana Gupta
Book Project Manager: Hemangi Lotlikar
Content Development Editor: Shreya Moharir
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Proofreader: Safis Editing
Indexer: Hemangini Bari
Production Designer: Prafulla Nikalje
DevRel Marketing Executive: Vinishka Kalra
First published: December 2023
Production reference: 1261223
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK
ISBN 978-1-83763-754-6
www.packtpub.com
To my family, who gifted me the wings to soar.
And to my loving husband, my co-pilot in life – thank you for sharing this incredible flight.
In the ever-evolving landscape of data science, Gabriela's Data Science for Web3 stands as a formidable guide, navigating the intricate intersection of blockchain, data science, and new decentralized businesses.
This book, meticulously crafted, provides a practical roadmap for those versed in economics, programming, blockchain, and data science. Gabriela's clear and independent approach demystifies complex concepts, offering both a glossary for the initiated and a launching pad for those embarking on new projects.
As you delve into these pages, you will discover not only theoretical foundations but also tangible insights that can be applied to real-world scenarios. Data Science for Web3 is not just an exploration of possibilities; it is a functional tool for individuals seeking to enhance their skills and contribute meaningfully to the evolving landscape of Web3 and fintech.
– José Dahlquist
Rootstock Engineering Director at IOVLabs
Gabriela Castillo Areco holds an MSc in big data science from the Tecnun School of Engineering, University of Navarra. With extensive experience in both the business and data facets of blockchain technology, Gabriela has undertaken roles as a data scientist, machine learning analyst, and blockchain consultant in both large corporations and small ventures. She served as a professor of new crypto businesses at Torcuato di Tella University and is currently a member of the BizOps data team at IOVLabs.
Special thanks to the data leaders I had the chance to interview for Chapter 14, who generously shared their time and vision.
In addition, I extend my gratitude to Shreya Moharir and Hemangi Lotlikar for their invaluable help and support throughout the writing process.
My appreciation also goes to my colleagues, clients, and mentors who have provided me with the opportunity to learn and work alongside them.
Mayukh Mukhopadhyay is a full-time technology consultant with more than 13 years of experience in designing business continuity solutions using complex digital transformation programs. He did his MEng at Jadavpur University and his MBA at IIT Kharagpur. He is currently pursuing a PhD (industry track) at IIM Indore. He has authored the book Ethereum Smart Contract Development for Packt Publishing and Scopus-indexed teaching cases on generative AI and blockchain for Sage Publications. He is an active professional member of the Association for Computing Machinery (ACM), a Professional Scrum Master I, and an Azure Certified Solution Architect.
I am forever indebted to my spouse, Mrittika, and my daughter, Abriti, for their sacrifices in every walk of life to support my academic endeavors.
In this part of the book, we will explore what Web3 data looks like and identify reliable sources for its extraction. Throughout the chapters, we will examine the business flow of the most important protocols and learn how to gain actionable insights from the data generated within their ecosystems.
This part includes the following chapters:
Chapter 1, Where Data and Web3 Meet
Chapter 2, Working with On-Chain Data
Chapter 3, Working with Off-Chain Data
Chapter 4, Exploring the Digital Uniqueness of NFTs – Games, Art, and Identity
Chapter 5, Exploring Analytics on DeFi

As we assume no prior knowledge of data or blockchain, this chapter introduces the basic concepts of both topics. A good understanding of these concepts is essential for tackling Web3 data science projects, as we will refer to them throughout. A Web3 data science project tries to solve a business problem or unlock new value with data; it is an example of applied science. It has two main components, the data science ingredients and the blockchain ingredients, both of which we will cover in this chapter.
In the Exploring the data ingredients section, we will analyze the concept of data science, available data tools, and the general steps we will follow, and provide a gentle practical introduction to Python. In the Understanding the blockchain ingredients section, we will cover what blockchain is, its main characteristics, and why it is called the internet of value.
In the final part of this chapter, we will dive into some industry concepts and how to use them. We will also analyze challenges related to the quality and standardization of data and concepts, respectively. Lastly, we will briefly review the concept of APIs and describe the ones that we will be using throughout the book.
In this chapter, we will cover the following topics:
What is a business data science project?
What are data ingredients?
Introducing the blockchain ingredients
Approaching relevant industry metrics
The challenges with data quality and standards
Classifying the APIs

We will utilize Web3.py, a library designed for interacting with EVM-based blockchains such as Ethereum. Originally derived from the web3.js JavaScript library, it has since evolved “towards the needs and creature comforts of Python developers,” as stated in its documentation. Web3.py facilitates connection and interaction with the blockchain.
If you have not worked with Web3.py before, it can be installed with the following code:
pip install web3

Blockchain data is accessed by connecting to a copy of the blockchain hosted on a node. Infura serves as a cloud-based infrastructure provider that streamlines our connection to the Ethereum network, eliminating the need to run our own nodes. Detailed steps for creating an account and obtaining an API key are provided in Appendix 1.
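The connection pattern can be sketched as follows. The project ID here is a placeholder you would replace with your own key from the Infura dashboard, and the Web3.py calls are shown as comments so the snippet runs without a live node:

```python
# Sketch of connecting to Ethereum mainnet through Infura.
# INFURA_API_KEY is a placeholder - substitute your own key (see Appendix 1).
INFURA_API_KEY = "YOUR_API_KEY"
endpoint = f"https://mainnet.infura.io/v3/{INFURA_API_KEY}"
print(endpoint)  # https://mainnet.infura.io/v3/YOUR_API_KEY

# With Web3.py installed and a valid key, the connection would then be:
# from web3 import Web3
# web3 = Web3(Web3.HTTPProvider(endpoint))
# web3.isConnected()  # True if the node is reachable
```

The same provider pattern works for test networks by swapping the subdomain in the endpoint URL.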
Additionally, we will employ Ganache, a program that creates a local blockchain on our computer for testing and development purposes. This tool allows us to simulate the behavior of a real blockchain without the need to spend real cryptocurrency on interactions. For a comprehensive guide on how to utilize Ganache, please refer to Appendix 1.
You can find all the data and code files for this chapter in the book’s GitHub repository at https://github.com/PacktPublishing/Data-Science-for-Web3/tree/main/Chapter01. We recommend that you read through the code files in the Chapter01 folder to follow along.
Important note
If you have a background in data science, you may skip this section.
However, if you do not, this basic introduction is essential to understand the concepts and tools discussed throughout the book.
Data science is an interdisciplinary field that combines mathematics, statistics, programming, and machine learning with specific subject matter knowledge to extract meaningful insights.
Imagine you work at a top-tier bank that is considering making its first investment in a blockchain protocol, and they have asked you to present a shortlist of protocols to invest in based on relevant metrics. You may have some ideas about what metrics to consider, but how do you know which metric and value is the most relevant to determine which protocol should make it on your list? And once you know the metric, how do you find the data and calculate it?
This is where data science comes in. By analyzing transaction data (on-chain data) and data that is not on-chain (off-chain data), we can identify patterns and insights that will help us make informed decisions. For example, we might find that certain protocols are more active during business hours in a time zone different from where the bank is located. In this case, the bank can decide whether they are ready to make an investment in a product serving clients in a different time zone. We may also check the value locked in the protocol to assess the general investors’ trust in that smart contract, among many other metrics.
But data science is not just about analyzing past data. We can also use predictive modeling to forecast future trends and add those trends to our assessment. For instance, we could use machine learning algorithms to predict the price range of the token issued by the protocol based on its price history.
For this data analysis, we require the right tools, skills, and business knowledge. We’ll need to know how to collect and clean our data, how to analyze it using statistical techniques, how to separate what is business-relevant from what is not, and how to visualize our findings so we can communicate them effectively. Making data-driven decisions is the most effective way to improve all the relevant metrics for a business, which is more valuable than ever in this competitive world.
Due to the fast pace of data creation and the shortage of data scientists in the market, data scientist has been referred to as “the sexiest job of the 21st century” by the Harvard Business Review. The data economy has opened the door to multiple roles, such as data analyst, data scientist, data engineer, data architect, Business Intelligence (BI) analyst, and machine learning engineer. Depending on the complexity of the problem and the size of the data, each of these roles can come into play in a typical data science project.
A typical Web3 data science project involves the following steps:
1. Problem definition: At this stage, we try to answer the question of whether the problem can be solved with data, and if so, what data would be useful to answer it. Collaboration between data scientists and business users is crucial in defining the problem, as the latter are the specialists and those who will use what the data scientist produces. BI tools such as Tableau, Looker, and Power BI, or Python data visualization libraries such as Seaborn and Matplotlib, are useful in meetings with business stakeholders. It is worth noting that while many BI tools currently provide optimization packages for commonly used data sources, such as Facebook Ads or HubSpot, as of the time of writing, I have not seen any optimization for on-chain data. Therefore, it is preferable to choose highly flexible data visualization tools that can adapt to any visualization needs.

2. Investigation and data ingestion: At this stage, we try to answer the question: where can we find the necessary data for this project? Throughout this book, especially in Chapters 2 and 3, we will list multiple data sources related to Web3 that will help answer this question. Once we find where the data is, we need to build an ingestion pipeline so the data can be consumed by the data scientist. This process is called ETL, which stands for extract, transform, and load. These steps are necessary to make clean and organized data available to the data analyst or data scientist:

Extraction, or data collection, is the first step of the ETL process and can include manual entry, web scraping, live streaming from devices, or a connection to an API. Data can be presented in a structured format, meaning that it is stored in a predefined way, or an unstructured format, meaning that it has no predefined storage format and is simply stored in its native way.

Transformation consists of modifying the raw data to be stored or analyzed. Some of the activities that data transformation can involve include data normalization, data deduplication, and data cleaning.

Loading is the act of moving the transformed data into data storage and making it available. There are a few additional aspects to consider when referring to data availability, such as storing data in the correct format, including all related metadata, providing the correct access privileges to the right team members, and ensuring the data is up to date, accurate, and sufficient to fulfill the data scientist’s needs.

The ETL process is generally led by the data engineer, but the business owner and the data scientist will have a say when identifying the data source.
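As a toy illustration of extract, transform, and load — not the pipelines used later in the book — here is a minimal ETL pass over hypothetical raw transfer records, where the field names and values are invented for the example:

```python
# Extract: raw records as they might arrive from an API (duplicates included).
raw_records = [
    {"tx_hash": "0xabc", "value_wei": "1000000000000000000"},
    {"tx_hash": "0xabc", "value_wei": "1000000000000000000"},  # duplicate row
    {"tx_hash": "0xdef", "value_wei": "2500000000000000000"},
]

# Transform: deduplicate on tx_hash and normalize wei strings to ether floats.
seen, transformed = set(), []
for rec in raw_records:
    if rec["tx_hash"] in seen:
        continue  # drop the duplicate
    seen.add(rec["tx_hash"])
    transformed.append({"tx_hash": rec["tx_hash"],
                        "value_eth": int(rec["value_wei"]) / 10**18})

# Load: move the clean rows into a queryable store (a dict keyed by hash here).
store = {row["tx_hash"]: row for row in transformed}
print(len(store), store["0xdef"]["value_eth"])  # 2 2.5
```

In a real pipeline, the load target would be a database or data warehouse rather than an in-memory dictionary.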
3. Analysis/modeling: In this stage, we analyze the data to extract conclusions and may need to model it to try to predict future outcomes. Once the data is available, we can perform the following:

Descriptive analysis: This uses data analysis methods to describe what the data shows, gaining insights into trends, composition, distribution, and more. For example, a descriptive analysis of a Decentralized Finance (DeFi) protocol can reveal when its clients are most active, the Total Value Locked (TVL), and how the locked value has evolved over time.

Diagnostic analysis: This uses data analysis to explain the reasons behind the occurrence of certain matters. Techniques such as data composition, correlations, and drill-down are used in these types of analyses. For example, a blockchain analyst may try to understand the correlation between a peak in new addresses and the activity of certain addresses to identify what these users do on the chain.

Predictive analysis: This uses historical data to make forecasts about trends or events in the future. Techniques can include machine learning, cluster analysis, and time series forecasting. For example, a trader may try to predict the evolution of a certain cryptocurrency based on its historical performance.

Prescriptive analysis: This uses the result of predictive analysis as an input to suggest the optimum response or best course of action. For example, a bot can suggest whether to sell or buy a certain cryptocurrency.

Generative AI: This uses machine learning techniques and huge amounts of data to learn patterns and generate new and original outputs. Artificial intelligence can create images, videos, audio, text, and more. Applications of generative models include ChatGPT, Leonardo AI, and Midjourney.

4. Evaluation: In this stage, the result of our analysis or modeling is evaluated and tested to confirm it meets the project goals and provides value to the business. Any bias or weakness in our models is identified, and if necessary, the process starts again to address those errors.

5. Presentation/deployment: The final stage of the process depends on the problem. If it is an analysis from which the company will make a decision, our job will probably conclude with a presentation and explanation of our findings. Alternatively, if we are working as part of a larger software pipeline, our model will most likely be deployed or integrated into the data pipeline.

This is an iterative process, meaning that many times, especially in step 4, we will receive valuable feedback from the business team about our analysis, and we will change the initial conclusions accordingly. What is true for traditional data science is reinforced for the Web3 industry, as this is one of the industries where data plays a key role in building trust, leading investments, and, in general, unlocking new value.
Although data science is not a programming career, it heavily relies on programming because of the large amount of data available. In this book, we will work with the Python language and some SQL to query databases. Python is a general-purpose programming language commonly used by the data science community, and it is easy to learn due to its simple syntax. An alternative to Python is R, which is a statistical programming language commonly used for data analysis, machine learning, scientific research, and data visualization. A simple way to access Python or R and their associated libraries and tools is to install the Anaconda distribution. It includes popular data science libraries (such as NumPy, pandas, and Matplotlib for Python) and simplifies the process of setting up an environment to start working on data analysis and machine learning projects.
The activities in this book will be carried out in three work environments:
Notebooks: For example, Anaconda Jupyter notebooks or Google Colaboratory (also frequently referred to as Colab). These files are saved in .ipynb format and are very useful for data analysis or training models. We will use Colab notebooks in the machine learning chapters due to the access they provide to GPU resources in the free tier.

IDEs: PyCharm, Visual Studio Code, or any other IDE that supports Python. Their files are saved in .py format and are very useful for building applications. Most IDEs allow the user to download extensions to work with notebook files.

Query platforms: In Chapter 2, we will access on-chain data platforms that have built-in query systems. Examples of those platforms are Dune Analytics, Flipside, Footprint Analytics, and Increment.

Anaconda Jupyter notebooks and IDEs use our computer’s resources (e.g., RAM), while Google Colaboratory uses cloud services (more on resources can be found in Appendix 1).
Please refer to Appendix 1 to install any of the environments mentioned previously.
Once we have a clean notebook, we will warm up our Python skills with the Chapter01/Python_warm_up notebook, which follows a tutorial by https://learnxinyminutes.com/docs/python/. For a more thorough study of Python, we encourage you to check out Data Science with Python, by Packt Publishing, or Python Data Science Handbook, both of which are listed in the Further reading section of this chapter.
Once we have completed the warm-up exercise, we will initiate the Web3 client using the Web3.py library. Let’s learn about these concepts in the following section.
If you have a background in blockchain development, you may skip this section.

Web3 represents a new generation of the World Wide Web that is based on decentralized databases, permissionless and trustless interactions, and native payments. This new concept of the internet opens up various business possibilities, some of which are still in their early stages.
Figure 1.1 – Evolution of the web
Currently, we are in the Web2 stage, where centralized companies store significant amounts of data sourced from our interactions with apps. The promise of Web3 is that we will interact with Decentralized Apps (dApps) that store only the relevant information on the blockchain, accessible to everyone.
As of the time of writing, Web3 has some limitations recognized by the Ethereum organization:
Velocity: The speed at which the blockchain is updated poses a scalability challenge. Multiple initiatives are being tested to try to solve this issue.

Intuition: Interacting with Web3 is still difficult to understand. The logic and user experience are not as intuitive as in Web2, and a lot of education will be necessary before users can start utilizing it on a massive scale.

Cost: Recording an entire business process on the chain is expensive. Having multiple smart contracts as part of a dApp costs a lot for the developer and the user.

Blockchain technology is a foundational technology that underpins Web3. It is based on Distributed Ledger Technology (DLT), which stores information once it is cryptographically verified. Once reflected on the ledger, each transaction cannot be modified, and multiple parties hold a complete copy of the ledger.
Two structural characteristics of the technology are the following:
It is structured as a set of blocks, where each block contains information (cryptographically hashed – we will learn more about this later in this chapter) about the previous block, making it impossible to alter it at a later stage. Each block is chained to the previous one by this cryptographic hashing mechanism.

Figure 1.2 – Representation of a set of blocks
It is decentralized. A copy of the entire ledger is distributed among several servers, which we will call nodes. Each node has a complete copy of the ledger and verifies consistency every time it adds a new block on top of the blockchain.

This structure provides the solution to the double-spending problem, enabling for the first time the decentralized transfer of value through the internet. This is why Web3 is known as the internet of value.
Since the complete version of the ledger is distributed among all the participants of the blockchain, any new transaction that contradicts previously stored information will not be successfully processed (there will be no consensus to add it). This characteristic facilitates transactions among parties that do not know each other without the need for an intermediary acting as a guarantor between them, which is why this technology is known as trustless.
The decentralized storage also takes control away from each server and, thus, there is no sole authority with sufficient power to change any data point once the transaction is added to the blockchain. Since taking down one node will not affect the network, if a hacker wants to attack the database, they would require such high computing power that the attempt would be economically unfeasible. This adds a security level that centralized servers do not have.
The first-generation blockchain is Bitcoin, which is based on Satoshi Nakamoto’s paper Bitcoin: A Peer-to-Peer Electronic Cash System. The primary use case of this blockchain is financial. Although the technology was initially seen as a way to bypass intermediaries such as banks, currently, traditional financial systems and the crypto world are starting to work together, especially with regard to Bitcoin because it is now considered a digital store of value, a sort of digital gold. Notwithstanding the preceding, there are still many regulatory and practical barriers to the integration of the two systems.
The second-generation blockchain added the concept of smart contracts to the database structure described previously, and Ethereum was the first to introduce this. With Ethereum, users can agree on terms and conditions before a transaction is carried out. This chain started the smart contracts era, and as Nick Szabo describes it, the smart contract logic is that of a vending machine that can execute code autonomously, including the management of digital assets, which is a real revolution. To achieve this, the network has an Ethereum Virtual Machine (EVM) that can execute arbitrary code.
Lastly, the third-generation blockchain builds upon the previous generations and aims to solve scalability and interoperability problems. When referring to on-chain data in this book, we will be talking about data generated by the second- and third-generation blockchains that are EVM compatible, as this is where most development is being carried out at the time of writing (e.g., Ethereum, BSC, or Rootstock). Consequently, Bitcoin data and non-EVM structures are not covered.
Now, let’s understand some important additional concepts regarding blockchain.
In order to make a car move forward, we use gas as fuel. This enables us to reach our desired destination, but it comes at a cost. The price of gas fluctuates based on various factors, such as oil prices and transportation costs. The same concept applies to blockchain technology. To save a transaction on a chain, it is necessary to pay for gas. In short, gas is the instruction cost paid to the network to carry out our transactions. The purpose of establishing a cost is twofold: the proceeds of the gas payment go to the miners/validators as payment for their services and as an incentive to keep participating in the blockchain; it also sets a price for users to be mindful of when consuming resources, encouraging them to record on the blockchain only what is worth more than the gas paid. This concept is universal to all the networks we will study in this book.
Gas has several cost implications. As the price of gas is paid in the network’s native coin, if the price increases, the cost of using the network can become excessively expensive, discouraging adoption. This is what happened with Ethereum, which led to multiple changes to its internal rules to solve this issue.
As mentioned earlier, each interaction with the blockchain incurs a cost. Therefore, not everything needs to be stored on it, and the decision to adopt a database such as a blockchain needs to be validated against business requirements.
Cryptocurrencies can be divided into smaller units, just like a dollar can be divided into cents. The smallest unit of a Bitcoin is a satoshi, and the smallest denomination of an Ether is a wei. The following is a chart with the denominations, which will be useful for tracking gas costs.
Unit name    | Wei                       | Ether
Wei          | 1                         | 10^-18
Kwei         | 1,000                     | 10^-15
Mwei         | 1,000,000                 | 10^-12
Gwei         | 1,000,000,000             | 10^-9
Microether   | 1,000,000,000,000         | 10^-6
Milliether   | 1,000,000,000,000,000     | 10^-3
Ether        | 1,000,000,000,000,000,000 | 1
Table 1.1 – Unit denominations and their values
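The table can be put to work with a quick fee calculation. As an illustrative sketch (the 30 gwei gas price is made up; 21,000 is the standard gas cost of a plain ETH transfer):

```python
from decimal import Decimal

GWEI_TO_WEI = 10**9     # from the table: 1 gwei = 10^9 wei
WEI_PER_ETHER = 10**18  # from the table: 1 ether = 10^18 wei

gas_used = 21_000       # gas consumed by a simple ETH transfer
gas_price_gwei = 30     # hypothetical gas price

# Total fee = gas used x gas price, expressed first in wei, then in ether.
fee_wei = gas_used * gas_price_gwei * GWEI_TO_WEI
fee_eth = Decimal(fee_wei) / WEI_PER_ETHER
print(fee_wei, fee_eth)  # 630000000000000 0.00063
```

Using Decimal (rather than floats) avoids rounding surprises when working with 18-decimal token amounts.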
When we use a payment method other than cash, we transmit a sequence of letters, numbers, or a combination of both so that the entity holding our funds can identify the recipient’s country, bank, and account. Similarly, an address performs a comparable function and serves as an identification number on the blockchain. It is a string of letters and numbers that can send or receive cryptocurrency. For example, Ethereum addresses consist of 42 hexadecimal characters. An address is the hash of the public key of an asymmetric key pair, and it is all the information a third party requires to transfer cryptocurrency to us. The public key is derived from a private key, but the reverse process (deriving the private key from the public key) cannot be performed. The private key is required to authorize/sign transactions or access the funds stored in the account.
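A quick, purely illustrative check of that 42-character shape can be written in a few lines (this validates only the format — not the EIP-55 checksum casing, and not whether the address exists on chain):

```python
import re

# "0x" prefix followed by exactly 40 hex characters: 42 characters in total.
ADDRESS_RE = re.compile(r"^0x[0-9a-fA-F]{40}$")

def looks_like_eth_address(value: str) -> bool:
    """Return True if the string has the shape of an Ethereum address."""
    return bool(ADDRESS_RE.match(value))

print(looks_like_eth_address("0xd5eAc5e5f45ddFCC698b0aD27C52Ba55b98F5653"))  # True
print(looks_like_eth_address("0x1234"))                                      # False
```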
Addresses can be classified into two categories: Externally Owned Addresses (EOAs) and contract accounts. Both of them can receive, hold, and send funds and interact with smart contracts. EOAs are owned by users who hold the private key, and users can create as many as they need. Contract accounts are those where smart contracts are deployed and are controlled by their contract code. Another difference between them is the cost of creation: creating an EOA costs no gas, but deploying a smart contract to create a contract account does. Only EOAs can initiate transactions.
There is another product in the market known as smart accounts, which leverage the account abstraction concept. The idea behind this development is to make it easier for users to program more security and better user experiences into their accounts, such as setting rules on daily spending limits or selecting the token used to pay for gas. These accounts are programmable smart contracts.
Although the terms “wallet” and “address” are often used interchangeably, there is a technical distinction between them. As mentioned before, an address is the public key hash of an asymmetric key pair. On the other hand, a wallet is the abstract location where the public and private keys are stored together. It is a software interface or application that simplifies interacting with the network and facilitates querying our accounts, transaction signing, and more.
When multiple parties work together, especially if they do not know each other, it is necessary to agree on a set of rules to work sustainably. In the blockchain case, it is necessary to determine how to add transactions to a block and alter its state. This is where the consensus protocol comes into play. Consensus refers to the agreement reached by all nodes of the blockchain to change the state of the chain by adding a new block to it. The protocol comprises a set of rules for participation, rewards/penalties to align incentives, and more. The more nodes participate, the more decentralized the network becomes, making it more secure.
Consensus can be reached in several ways, but two main concepts exist in open networks.
Proof of Work (PoW) is the consensus protocol used by Bitcoin. It involves solving a cryptographic puzzle whose difficulty the network periodically adjusts to keep the rate of block production roughly constant.
Solving these puzzles consumes a lot of energy, resulting in a hardware-intensive competition. Parties trying to solve the puzzle are known as miners.
The winning party finds an integer (a nonce) that satisfies the puzzle’s rules and informs the other nodes of the answer. The other parties verify that the answer is correct and add the block to their copy of the blockchain. The winning party gets the reward for solving the puzzle, which is a predefined amount of cryptocurrency. The transaction through which the system issues this never-before-spent Bitcoin is known as a coinbase transaction.
In the Bitcoin protocol, the reward is halved every 210,000 blocks.
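The puzzle can be illustrated with a toy proof-of-work loop. Real networks use different hash functions and vastly higher difficulty; this sketch simply searches for a nonce whose SHA-256 hash starts with a given number of zero hex digits:

```python
import hashlib

def mine(block_data, difficulty):
    """Search for a nonce whose hash meets the difficulty target."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest  # the "winning" answer other nodes can check
        nonce += 1

nonce, digest = mine("block payload", difficulty=3)
print(nonce, digest)

# Verification is cheap: any node recomputes a single hash to check the claim.
assert hashlib.sha256(f"block payload{nonce}".encode()).hexdigest().startswith("000")
```

Note the asymmetry that makes PoW work: finding the nonce takes many attempts, while verifying it takes one hash.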
Proof of Stake (PoS) is the current protocol used by the Ethereum blockchain (until September 15, 2022, its consensus protocol was PoW) and many others, such as Cardano.
The rationale behind PoS is that parties become validators in the blockchain by staking their own cryptocurrency in exchange for the chance to validate transactions, update the blockchain, and earn rewards. Generally, there is a minimum amount of cryptocurrency that must be staked to become a validator. It is “at stake” because the rules include potential penalties, or “slashing,” of the deposited cryptocurrency if the validator (the node that processes transactions and adds new blocks to the chain) goes offline or behaves poorly. Slashing means losing a percentage of the deposited cryptocurrency.
As we can see, there are rewards and penalties to align the incentives of all participants toward a single version of the blockchain.
The list of consensus protocols is continuously evolving, reflecting the ongoing search to solve some of the limitations identified in Web3, such as speed or cost. Some alternative consensus protocols include proof of authority – where a small number of nodes have the power to validate transactions and add blocks to the chain – and proof of space – which uses disk space to validate transactions.
With these concepts in mind, we will now carry out a transaction in a local environment, using the local Ethereum blockchain provided by Ganache.
To get started, let’s open a local Jupyter notebook and a quick-start version of Ganache.
Here is the information we need:
Figure 1.3 – Ganache main page and relevant information to connect
Let’s look at the code:
1. Import the Web3.py library:

from web3 import Web3

2. Connect to the blockchain running on the port described in our Ganache page (item 1):

ganache_url = "http://127.0.0.1:8545"
web3 = Web3(Web3.HTTPProvider(ganache_url))

3. Define the receiving and sending addresses (item 2):

from_account = "0xd5eAc5e5f45ddFCC698b0aD27C52Ba55b98F5653"
to_account = "0xFfd597CE52103B287Efa55a6e6e0933dff314C63"

4. Define and send the transaction. In this case, we are transferring 30 ether between the accounts defined previously:

web3.eth.sendTransaction({
    'from': from_account,
    'to': to_account,
    'value': web3.toWei(30, 'ether')
})

5. We can review the account balances before and after the transaction with the following code snippet:

print(web3.fromWei(web3.eth.getBalance(from_account), 'ether'))
print(web3.fromWei(web3.eth.getBalance(to_account), 'ether'))

Congratulations! If you have never before transferred value on a blockchain, you have achieved your first milestone. The complete code can be found in Chapter01/First_transaction.
A word on CBDC
What is CBDC? The acronym stands for Central Bank Digital Currency. It is a new form of electronic money issued by the central banks of countries.
Many countries are at different stages of this roadmap. On January 20, 2022, the Federal Reserve Board issued a discussion paper on CBDCs, and prior to the COVID-19 pandemic it had already reported ongoing research into the benefits a CBDC could bring to its system. As of July 2022, there were 100 CBDCs in research or development. Countries are looking for the best infrastructure, studying the impact on their communities, and remaining mindful of the new range of risks that this new way of transferring value will pose to financial systems that may be reluctant to change.
Some of the concepts that we have covered in this chapter will be useful for the CBDC era, but depending on the project and its characteristics, not all of them will be present. It will be especially interesting to see how they solve centralization issues. A very informative tracker on the status of the projects is available at the following link: https://cbdctracker.org/.
In this section, we analyzed the fundamentals of blockchain technology, including key concepts such as gas, addresses, and consensus protocols, and explored the evolution of Web3. We also executed a transaction using Ganache and Web3.py.
With this basic understanding of the transaction flow, we will now shift our focus toward analyzing initial metrics and gaining a better understanding of the data challenges in this industry.
In this section, we review some metrics that are fairly standard on every Web3 dashboard. However, these form just a basic layer, and each player in the industry will add further metrics relevant to its own needs.
To extract information from the Ethereum blockchain, we need to establish a connection to the blockchain through a node that holds a copy of it. There are multiple ways to connect to the blockchain, which we will explore in more detail in Chapter 2. For the following metrics, we will make use of Infura. For a step-by-step guide on connecting to Infura, refer to Appendix 1.
The block height refers to the number of the latest block on the blockchain. The Genesis block is commonly referred to as block 0, and subsequent blocks are numbered accordingly. To check the block height, use the following code snippet:
web3.eth.blockNumber

The block number can be used as the ID of the block. Tracking it can be useful to determine the number of confirmations a transaction has, which is equivalent to the number of additional blocks that were mined or added after the block of interest. The deeper a transaction is in the chain, the safer and more irreversible it becomes.
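Confirmations, as defined above, can be computed from two block numbers. Here is a minimal sketch; the sample heights are made up, and with a live connection they would come from web3.eth.blockNumber and the transaction's receipt:

```python
def confirmations(latest_block: int, tx_block: int) -> int:
    # Number of additional blocks added after the block containing the
    # transaction, per the definition above.
    return latest_block - tx_block

# Hypothetical values: the transaction was included at height 17_500_000
# and the chain tip is now at 17_500_012, so it has 12 confirmations.
print(confirmations(17_500_012, 17_500_000))  # → 12
```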
When discussing time in the context of blockchain, two concepts need to be taken into account. The first is the time between blocks, which varies depending on the blockchain. In Ethereum, after the recent protocol change, there are 12-second slots. Each validator is given a slot to propose a block during that time, and if all validators are online, there will be no empty slots, resulting in a new block being added every 12 seconds. The second concept is the timestamp for when a block was added to the blockchain, which is typically stored in Unix timestamp format. The Unix timestamp is a way of tracking the time elapsed as a running total of seconds from January 1, 1970, in UTC.
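Because block timestamps are plain Unix seconds, they can be converted to a human-readable UTC datetime with Python's standard library. A quick sketch, using a hypothetical timestamp:

```python
from datetime import datetime, timezone

# Hypothetical block timestamp: seconds elapsed since January 1, 1970, UTC.
ts = 1663224179
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt.isoformat())  # 2022-09-15T06:42:59+00:00
```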
To extract the block timestamps, use the following code snippet:
web3.eth.get_block('latest').timestamp

Tokenomics

Tokenomics refers to the characteristics of the internal economy of token projects on the blockchain, including supply, demand, and inflation. This involves determining how many digital assets will be issued, whether there is a cap on the total supply, the use cases of the token, and the burning schema that controls the number of assets in circulation.
The token white paper typically contains the official explanation for basic tokenomics questions.
Bitcoin tokenomics
The Bitcoin supply is capped at 21 million Bitcoins, and this amount cannot be exceeded. New Bitcoin enters circulation through mining, and miners are rewarded each time they successfully add a block to the chain.
Each block is mined approximately every 10 minutes, so all 21 million Bitcoins will be in circulation by 2140.
The number of Bitcoins rewarded is halved every time 210,000 blocks are mined, resulting in a halving approximately every four years. Once all 21 million Bitcoins have been mined, miners will no longer receive block rewards and will rely solely on transaction fees for revenue.
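The halving schedule above can be sanity-checked with a few lines of arithmetic. This sketch ignores the satoshi-level rounding the real protocol applies, so it only approximates the final supply:

```python
def total_btc_issued(initial_subsidy: float = 50.0,
                     blocks_per_halving: int = 210_000) -> float:
    # Sum the block rewards of every halving era until the subsidy
    # effectively reaches zero (below one satoshi, i.e., 1e-8 BTC).
    total, subsidy = 0.0, initial_subsidy
    while subsidy >= 1e-8:
        total += subsidy * blocks_per_halving
        subsidy /= 2
    return total

print(round(total_btc_issued()))  # → 21000000
```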
One of the industries most impacted by this technology is the financial industry, giving birth to a new concept known as Decentralized Finance, or DeFi. DeFi consists of peer-to-peer financial solutions built on public blockchains. These initiatives offer services similar to those offered by banks and other financial institutions, such as earning interest on deposits, lending, and trading assets, without the intervention of banks or other centralized financial institutions. This is achieved through a set of smart contracts (or protocols) that are open to anyone with an address. Tokens, and therefore their tokenomics, play a fundamental role in the functioning and sustainability of DeFi platforms.
One concrete example of DeFi is Aave, a lending and borrowing platform that allows users to lend and borrow various cryptocurrencies without intermediaries such as banks. For instance, if Jane wants to borrow 10 ETH, she can go to Aave, create a borrowing request, and wait for the smart contract to match her request with available lenders who are willing to lend ETH. The borrowed ETH is lent with an interest rate percentage that reflects supply and demand levels. The money lent comes from a liquidity pool where lenders deposit their cryptocurrencies to earn interest on them. With Aave’s decentralized platform, borrowers and lenders can transact directly with each other without needing to go through a traditional financial institution.
We will dive deep into DeFi in Chapter 5.
Total value locked (TVL) refers to the total value of assets currently locked in a specific DeFi protocol. It measures the health of a protocol by the amount of money users have secured in it. The TVL increases when users deposit more assets into the protocol and, conversely, decreases when users withdraw them. It is calculated by summing the quantity of each asset locked in the protocol multiplied by that asset’s current price.
Different DeFi protocols may have specific ways of measuring their TVL, and accurately calculating it requires an understanding of how each protocol works. A website that specializes in measuring TVL is DefiLlama (available at https://defillama.com/).
TVL also helps traders determine whether a certain token is undervalued by dividing the TVL by the market cap (or circulating supply multiplied by price) of the token issued by said protocol.
This metric helps compare DeFi protocols with each other.
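As a toy example (all figures below are made up), TVL and the ratio described above can be computed as follows:

```python
# Hypothetical amounts locked in a protocol and current prices in USD.
locked_amounts = {"ETH": 12_000, "USDC": 5_000_000}
prices_usd = {"ETH": 2_000.0, "USDC": 1.0}

# TVL: sum of each locked asset quantity multiplied by its current price.
tvl = sum(amount * prices_usd[asset] for asset, amount in locked_amounts.items())
print(tvl)  # → 29000000.0

# Ratio described above: TVL divided by the token's market cap.
market_cap = 14_500_000.0  # hypothetical market cap of the protocol's token
print(tvl / market_cap)  # → 2.0
```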
Market capitalization represents the size of the market for a certain token and is closely related to the traditional financial concept of the same name. It is calculated by multiplying the circulating supply of coins or tokens by their current market price. The circulating supply is the sum of tokens currently held by public holders. To estimate it, we sum the balances of all addresses other than the minting and burning addresses, and subtract the balances held by addresses that we know are controlled by the protocol or allocated to the development team, early investors, and so on.
The max supply or total supply of tokens is the total number of tokens that will ever be issued by a certain smart contract. Multiplying the max supply by the current price results in the fully diluted market cap. There are two ways to get the total supply: with state data and with transactional data.
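To make the distinction concrete, here is a sketch with hypothetical figures comparing the circulating market cap with the fully diluted market cap:

```python
price = 1.25                     # hypothetical current token price in USD
circulating_supply = 40_000_000  # tokens held by public holders
max_supply = 100_000_000         # total tokens the contract will ever issue

market_cap = circulating_supply * price  # based on circulating supply
fully_diluted_cap = max_supply * price   # based on max supply

print(market_cap)         # → 50000000.0
print(fully_diluted_cap)  # → 125000000.0
```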
In Chapter 2, we will learn how to access state data; since tokens are smart contracts, they have a function that can be queried with Web3.py. To do this, we will need the Application Binary Interface (ABI) of