Data Wrangling with SQL - Raghav Kandarpa - E-Book

Data Wrangling with SQL E-Book

Raghav Kandarpa

Description

The amount of data generated continues to grow rapidly, making it increasingly important for businesses to be able to wrangle this data and understand it quickly and efficiently. Although data wrangling can be challenging, with the right tools and techniques you can efficiently handle enormous amounts of unstructured data.
The book starts by introducing you to the basics of SQL, focusing on the core principles and techniques of data wrangling. You’ll then explore advanced SQL concepts like aggregate functions, window functions, CTEs, and subqueries that are very popular in the business world. The next set of chapters will walk you through the different functions within a SQL query that cause delays in data transformation and help you figure out the difference between a good query and a bad one. You’ll also learn how data wrangling and data science go hand in hand. The book is filled with datasets and practical examples to help you understand the concepts thoroughly, along with best practices to guide you at every stage of data wrangling.
By the end of this book, you’ll be equipped with essential techniques and best practices for data wrangling, and will predominantly learn how to use clean and standardized data models to make informed decisions, helping businesses avoid costly mistakes.

You can read this e-book in Legimi apps or in any app that supports the following format:

EPUB

Page count: 441

Year of publication: 2023




Data Wrangling with SQL

A hands-on guide to manipulating, wrangling, and engineering data using SQL

Raghav Kandarpa

Shivangi Saxena

BIRMINGHAM—MUMBAI

Data Wrangling with SQL

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kaustubh Manglurkar

Publishing Product Manager: Arindam Majumder

Senior Editor: Sushma Reddy

Technical Editor: Kavyashree K. S.

Copy Editor: Safis Editing

Project Coordinator: Hemangi Lotlikar

Proofreader: Safis Editing

Indexer: Hemangini Bari

Production Designer: Shyam Sundar Korumilli

Marketing Coordinator: Nivedita Singh

First published: July 2023

Production reference: 1280723

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB

ISBN 978-1-83763-002-8

www.packtpub.com

To our adorable nephew, who filled our writing journey with laughter, joy, and endless inspiration. Though you may be just two years old, your presence has brought an abundance of love and light into our lives. This book is dedicated to you, a reminder of the cherished moments we shared as we crafted its pages together. May it serve as a symbol of our bond, and may it inspire you to embrace your own passions and embark on remarkable adventures.

Acknowledgements

Writing a book is a journey of dedication and perseverance, and we are immensely grateful for the support and contributions of numerous individuals who have made this endeavor possible.

First and foremost, we would like to express our deepest gratitude to each other. The process of writing this book was not only a professional collaboration but also a personal journey for us. Amid the demands of deadlines and research, we took a significant step in our lives and got married. Our love and partnership have been the driving force behind this book, and we are grateful for the incredible bond that sustains us through both personal and professional challenges.

We would like to extend our heartfelt appreciation to our families, whose unwavering love and support have been a constant source of strength. Your belief in us and the encouragement you provided throughout this journey have been instrumental in our success. We are grateful for your understanding during the moments when our focus was fully dedicated to the completion of this project.

To our editor and the entire publishing team, we extend our sincere thanks. Your guidance, patience, and professionalism have been invaluable throughout the book-writing process. We appreciate your belief in our work and the opportunity to collaborate with such a talented group of individuals.

We would also like to acknowledge the reviewers and technical experts who generously shared their knowledge and insights. Your feedback and suggestions have played a vital role in refining the content of this book and ensuring its accuracy and relevance. We are grateful for the time and expertise you dedicated to reviewing our work.

Lastly, we want to express our deepest appreciation to the readers who embark on this journey with us. Your curiosity and enthusiasm for data wrangling using SQL inspire us, and we hope that this book serves as a valuable resource in your own endeavors. Thank you for being part of our journey and for your support and engagement.

Writing this book has been an extraordinary experience, blending personal and professional milestones. We are humbled by the love and support we have received from our loved ones and the publishing community. Thank you all for standing by our side, celebrating our marriage, and contributing to the creation of this book.

With heartfelt gratitude,

Raghav and Shivangi

Dedication

This book is dedicated to our families, whose unwavering love and support have been the foundation of our lives. Your belief in us and the encouragement you have given throughout this journey have been instrumental in our success.

To our parents, who instilled in us the values of hard work, perseverance, and the pursuit of knowledge; we are forever grateful. Your guidance and sacrifices have shaped us into the individuals we are today, and this book is a testament to your unwavering belief in our abilities.

To our siblings, whose unwavering support and camaraderie have brought us joy and inspiration; thank you for always being there, cheering us on, and offering a listening ear when we needed it most.

To our extended families, relatives, and friends who have celebrated our achievements, offered words of encouragement, and lent a helping hand along the way; your presence in our lives has brought warmth and happiness. Your love and support have meant the world to us.

Above all, this dedication is for each other. Our love and partnership have been the driving force behind this book. Through the late nights, the endless revisions, and the challenges we faced, we found strength and solace in each other. Your unwavering belief in our shared vision and the constant support you provide are the cornerstones of our journey.

May this book serve as a tribute to the love, support, and dedication of our families and the power of partnership. We dedicate this work to all those who have played a part, big or small, in shaping our lives and making this book possible.

With all our love and gratitude,

Raghav and Shivangi

Contributors

About the authors

Raghav Kandarpa is a skilled author and data analyst passionate about extracting insights from complex datasets. Originally from India, he pursued his academic and professional goals in data analysis in the United States. With a master’s degree in business analysis, specializing in business intelligence and analytics, he possesses expertise in data-driven decision-making and advanced analytical techniques. Currently a lead data science analyst at a top Fortune 500 finance company, Raghav leverages data to drive strategic initiatives, delivering actionable recommendations and optimizing business operations. Alongside his professional achievements, he is dedicated to empowering aspiring data analysts through practical writing that bridges the gap between theory and practice. Raghav’s commitment to innovation, data-driven decision-making, and knowledge sharing continues to shape the future of data analysis in finance and beyond.

I want to thank the people who have been close to me and supported me, especially my parents and my brother.

Shivangi Saxena, a highly accomplished data analyst and Business Intelligence Engineer with over six years of experience, relocated from India to the United States to pursue a master’s degree in information technology and management, specializing in business intelligence and analytics. She quickly became a valuable asset in the industry, leveraging data to drive strategic business decisions with her strong analytical mindset and passion for technology. Currently working with a top Fortune 500 company, Shivangi excels in data analysis, visualization, and business intelligence, making her a trusted advisor. Her ability to derive meaningful insights from complex datasets and translate them into actionable strategies sets her apart. Committed to empowering others, Shivangi has written this book to help aspiring data analysts enhance their skills. Her writing combines theory and practical application to offer valuable guidance. Additionally, Shivangi actively mentors aspiring data professionals, particularly women, encouraging them to pursue their dreams in the field. Her journey reflects her resilience, determination, and passion for data analysis, as she continues to inspire others and shape the future of the industry with her expertise and thought leadership.

I would like to thank my loving and patient parents and siblings for their continued support, patience, and encouragement throughout the long process of writing this book.

About the reviewers

Charles Mendelson is a Seattle-based software engineer who has worked at companies at virtually every stage of growth, from early-stage start-ups with no infrastructure to century-old operations full of legacy systems. He has a master’s degree from Harvard University, where he studied psychometrics (how you measure things in psychology) at the Harvard Extension School. In 2022, he was named one of the top 25 data-engineering influencers on LinkedIn by Databand.

He is also on the instructional staff in the Python Certificate Program at the University of Washington’s School of Professional and Continuing Education, where he helps adult learners develop coding skills.

At his primary employer, PitchBook Data, he has built a training program called Women in Engineering (WinE) to help women early in their careers develop technical skills, particularly in Python, SQL, and Excel.

In 2017, he was published in the Harvard Business Review, and since 2020, he has been a contributing writer for Towards Data Science, where he primarily writes articles on structuring and organizing data.

Jessica Ginesta is an accomplished professional with a diverse background and extensive experience in pre-sales and customer success roles. With over seven years of expertise in the data space, she possesses a deep understanding of SQL, analytics, and AI technologies. Her skill set includes proficiency in enterprise software, big data/analytics, data engineering, and data science.

During her tenure at Databricks, Jessica has excelled in her role as a senior solutions architect. Jessica’s exceptional proficiency in SQL has allowed her to leverage data effectively, providing valuable insights to clients. Her expertise in analytics and AI enables her to develop and deliver innovative solutions that drive business growth and address critical challenges. Jessica’s strategic mindset and ability to communicate complex solutions in a clear and concise manner have made her a trusted advisor to clients and an asset to her team.

Prior to joining Databricks, Jessica held positions in renowned organizations where she played a pivotal role in utilizing SQL, analytics, and AI to deliver impactful results. Her passion for leveraging data to drive decision-making and her strong track record of success have earned her recognition as a leader in the field.

Mohammed Kamil Khan is currently a graduate student at the University of Houston-Downtown (UHD), majoring in data analytics. He has accumulated 1.5 years of experience in various analytics-related positions. He is now working as a research assistant at UHD, engaged in a study funded by a grant from the National Institutes of Health (NIH). With an unwavering passion for democratizing knowledge, Kamil strives to make complex concepts accessible to all. Moreover, Kamil’s commitment to sharing his expertise led him to publish tutorial-based articles on platforms including DigitalOcean, Open Source For You magazine, and Red Hat’s opensource.com. These articles explore a diverse range of topics, such as pandas DataFrames, API data extraction, SQL queries, and the Django REST framework for API development.

Preface

Welcome to Data Wrangling with SQL, a comprehensive guide that equips you with essential skills to efficiently manipulate and prepare data for analysis using SQL. In today’s digital age, harnessing the power of data has become crucial for individuals and organizations across industries. Data wrangling, also known as data preprocessing or data munging, involves transforming raw data into a clean, structured format for easy analysis. It encompasses handling missing values, removing outliers, merging datasets, and reshaping data to meet analysis requirements. Effective data wrangling lays the foundation for accurate and meaningful insights, enabling informed decision-making.

This book aims to demystify the art of data wrangling using SQL, a powerful language for data manipulation. Whether you are a data analyst, business intelligence professional, or data enthusiast, this guide provides the knowledge and skills necessary to navigate data preparation complexities. Throughout this book, we explore SQL techniques and best practices for data wrangling. Starting with basics such as selecting, filtering, and sorting data, we delve into advanced topics such as data aggregation, joins, subqueries, and data transformation functions.

While SQL’s versatility can be overwhelming, this book balances theoretical concepts with practical examples and hands-on exercises. Real-world scenarios and datasets reinforce your understanding, fostering confidence in tackling data-wrangling challenges. Beyond technical skills, we discuss strategies for handling common data quality issues, ensuring integrity, and improving performance. Techniques for handling missing data, outliers, and inconsistencies are explored, alongside common pitfalls and tips for optimizing SQL queries.

Approach each chapter with curiosity and enthusiasm, embracing challenges to deepen understanding and enhance problem-solving abilities. Data wrangling is not just a process but also an art form, requiring creativity, logical thinking, and attention to detail.

This book serves as a valuable resource in your journey to becoming a proficient data wrangler. May it empower you to unlock data’s full potential, uncover hidden insights, and drive meaningful impact.

Happy wrangling!

Who this book is for

Data Wrangling with SQL is intended for individuals who are interested in data analysis, data manipulation, and data preparation using the SQL language. The book caters to a diverse audience, including the following:

- Aspiring data analysts: Individuals who want to enter the field of data analysis and gain a solid foundation in data-wrangling techniques using SQL
- Business intelligence professionals: Professionals working in business intelligence roles who wish to enhance their SQL skills for effective data preparation and analysis
- Data enthusiasts: Individuals passionate about data and eager to acquire practical skills in data wrangling using SQL

What this book covers

Chapter 1, Database Introduction, is where you will discover the fundamentals of databases and their role in data wrangling, equipping you with a solid foundation to leverage SQL for efficient data manipulation and analysis.

Chapter 2, Data Profiling and Preparation before Data Wrangling, is where you will master the art of data profiling and preparation, empowering you to assess data quality, handle missing values, address outliers, and ensure data integrity before diving into the data-wrangling process using SQL.

Chapter 3, Data Wrangling on String Data Types, explores the ins and outs of manipulating and transforming string data using SQL, enabling you to clean, format, extract, and combine textual information efficiently in your data-wrangling workflows.

Chapter 4, Data Wrangling on the DATE Data Type, unlocks the power of SQL to handle date data effectively, covering techniques for date formatting, extraction, manipulation, and calculations, allowing you to wrangle temporal data with precision and accuracy in your analysis.

Chapter 5, Handling NULL Values, navigates the complexities of NULL values in datasets and teaches you SQL techniques to identify, handle, and manage null values effectively, ensuring data integrity and enabling seamless data wrangling for accurate analysis.

Chapter 6, Pivoting Data Using SQL, will help you master the art of transforming row-based data into a structured columnar format using SQL, enabling you to pivot and reshape data for enhanced analysis and reporting capabilities in your data-wrangling endeavors.

Chapter 7, Subqueries and CTEs, dives into the world of subqueries and Common Table Expressions (CTEs) in SQL, mastering the art of structuring complex queries, enhancing data-wrangling capabilities, and simplifying your data analysis workflows for optimal efficiency and clarity.

Chapter 8, Aggregate Functions, unleashes the power of aggregate functions in SQL, empowering you to perform powerful calculations and summarizations on your data, enabling effective data wrangling for extracting insightful statistics and metrics in your analysis workflows.

Chapter 9, SQL Window Functions, unlocks the advanced capabilities of SQL window functions, enabling you to perform complex calculations and analyses over customized subsets of data, revolutionizing your data-wrangling techniques for insightful data partitions, rankings, and aggregations.

Chapter 10, Optimizing Query Performance, helps you master the art of optimizing SQL queries, exploring techniques and strategies to enhance query performance, minimize execution time, and maximize efficiency in your data-wrangling workflows, ensuring faster and more effective data analysis.

Chapter 11, Descriptive Statistics with SQL, shows you how to harness the power of SQL to perform descriptive statistical analysis on your data, exploring SQL functions and techniques to extract key insights, summarize data distributions, and uncover patterns, enabling data wrangling for robust exploratory data analysis.

Chapter 12, Time Series with SQL, unleashes the potential of SQL for time-series analysis, exploring techniques for manipulating, aggregating, and extracting valuable insights from temporal data, empowering you to conduct effective data wrangling and uncover trends and patterns in your time series datasets.

Chapter 13, Outlier Detection, helps you master the art of identifying and handling outliers in your data using SQL, equipping you with techniques and strategies to detect, analyze, and manage outliers effectively in your data-wrangling workflows, ensuring data integrity and accurate analysis.

To get the most out of this book

To make the most of this book, readers should have a basic understanding of SQL fundamentals, including database concepts, querying, and manipulating data using SQL statements. Familiarity with relational databases and SQL syntax is beneficial.

While prior experience with data analysis or business intelligence is not mandatory, a general understanding of data analysis concepts and the importance of data preparation will be advantageous.

The book’s progressive approach, starting from foundational concepts and gradually advancing to more complex topics, allows readers with varying levels of SQL expertise to benefit from the content.

Regardless of their background, readers should come with an eagerness to learn, a willingness to explore practical examples, and a desire to develop their data-wrangling skills using SQL.

By catering to both beginners and those with some SQL experience, this book aims to provide valuable insights and techniques for individuals at different stages of their data analysis journey.

Software/hardware requirements: a SQL database, namely MySQL (SQL Server for some examples)

Operating system requirements: Windows or macOS

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Wrangling-with-SQL. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “The syntax for creating a database is – CREATE DATABASE Database_name.”

A block of code is set as follows:

CREATE TABLE walmart.customer_info (
    CustomerID INT,
    Name VARCHAR(255),
    Address VARCHAR(255),
    Email VARCHAR(255),
    Phone VARCHAR(255)
);

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

SELECT LOWER(Address) FROM customers;

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “In the preceding code, the customerID column is defined as an integer data type.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read Data Wrangling with SQL, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application. 

The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

1. Scan the QR code or visit the link below:

https://packt.link/free-ebook/9781837630028

2. Submit your proof of purchase.

That’s it! We’ll send your free PDF and other benefits to your email directly.

Part 1:Data Wrangling Introduction

This part includes the following chapters:

Chapter 1, Database Introduction

Chapter 2, Data Profiling and Preparation before Data Wrangling

1

Database Introduction

Welcome to the exciting world of data-driven decision-making! In this fast-paced landscape, the ability to extract, transform, and analyze data efficiently is essential. At the heart of this process lies something incredibly powerful: databases. These structured repositories are key to organizing and managing vast amounts of information. If you want to make the most of your data-wrangling endeavors, understanding databases and the Structured Query Language (SQL) that brings them to life is crucial. That’s where this book, Data Wrangling with SQL, comes in. It’s a comprehensive guide, designed to empower you with the knowledge and tools you need to unlock the full potential of databases. By diving into the fundamentals of databases and SQL, you’ll gain a deep appreciation for how crucial they are in the data-wrangling journey.

Getting started

Before delving into the fascinating world of data wrangling using SQL, it is essential to grasp the fundamental concepts of databases. This introductory chapter serves as the foundation for your data-wrangling journey, setting the stage for understanding why databases play a pivotal role in efficiently extracting insights from data.

Establishing the foundation

The study of databases forms the foundation upon which the entire data-wrangling process is built. Understanding the core concepts and principles of databases will enable you to navigate the intricacies of data management effectively. By familiarizing yourself with key terms such as tables, rows, and columns, you’ll develop a solid foundation upon which you can confidently build your data-wrangling skills.

Efficient data organization

Databases provide a structured and organized approach to storing and retrieving data. They offer a systematic way to manage vast amounts of information, making it easier to store, update, and retrieve data when needed. By learning about database design principles, data modeling techniques, and normalization, you will be equipped with the knowledge to create efficient and optimized database structures, ensuring smooth data-wrangling processes.
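As a toy illustration of the normalization principle mentioned above (the table and column names here are hypothetical, not drawn from the book’s datasets): rather than repeating a customer’s details on every order row, a normalized design stores each customer once and lets each order refer to that customer by ID.

```sql
-- Normalized sketch: customer details live in one table,
-- and each order row carries only the customer's ID.
CREATE TABLE customers (
    Customer_ID INT,
    Name        VARCHAR(255),
    Address     VARCHAR(255)
);

CREATE TABLE orders (
    Order_ID    INT,
    Customer_ID INT,            -- refers back to customers.Customer_ID
    Amount      DECIMAL(10, 2)
);
```

Updating a customer’s address now touches a single row in customers instead of every one of their orders, which is exactly the kind of efficiency and consistency gain normalization aims for.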

Data integrity and consistency

In the field of data wrangling, maintaining data integrity and consistency is of utmost importance. Databases provide various mechanisms, such as constraints and relationships, to enforce data integrity rules and ensure the accuracy and reliability of the data. Having a clear understanding of how databases maintain data consistency can help you trust the quality of the data you are working with, leading to more reliable and meaningful insights during the data-wrangling process.
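As one concrete illustration of these mechanisms (a sketch with hypothetical table names, using syntax typical of MySQL), constraints and relationships can be declared right in the table definitions:

```sql
CREATE TABLE customers (
    Customer_ID INT PRIMARY KEY,             -- no duplicate or NULL IDs
    Email       VARCHAR(255) NOT NULL UNIQUE -- every customer has a distinct email
);

CREATE TABLE orders (
    Order_ID    INT PRIMARY KEY,
    Customer_ID INT,
    -- the relationship: every order must point at an existing customer
    FOREIGN KEY (Customer_ID) REFERENCES customers (Customer_ID)
);
```

With rules like these in place, the database itself rejects an order for a non-existent customer, so inconsistent data never reaches your data-wrangling queries.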

By understanding the essential concepts discussed in this introductory chapter, you will be well equipped to begin your data-wrangling journey using SQL. A solid understanding of databases will give you the confidence to tackle real-world data problems, ensuring that your data-wrangling efforts are accurate, reliable, and efficient. Therefore, let’s delve into the material and establish a solid foundation for a productive and satisfying data-wrangling experience!

Technical requirements

Please follow these step-by-step instructions to install MySQL on your machine and create a database on it.

To install MySQL Workbench on your computer, follow these steps:

1. Visit the official MySQL website: Go to the MySQL website at https://dev.mysql.com/downloads/workbench/.
2. Select your operating system: On the MySQL Workbench downloads page, you will see a list of available operating systems. Choose the appropriate option for your operating system. For example, if you are using Windows, click on the Windows (x86, 64-bit), MSI Installer option.
3. Download the installer: Click on the Download button next to the selected operating system version. This will start the download of the MySQL Workbench installer file.
4. Run the installer: Once the download is complete, locate the downloaded installer file on your computer and double-click on it to run the installer.
5. Choose installation options: The installer will guide you through the installation process. You can choose the installation type (Typical, Complete, or Custom) and specify the installation location if desired. It is recommended to choose the Typical installation type for most users.
6. Accept the license agreement: Read through the license agreement and click on the checkbox to accept it. Then, click on the Next button to proceed.
7. Install MySQL Workbench: Click on the Install button to start the installation process. The installer will copy the necessary files and install MySQL Workbench on your computer.
8. Complete the installation: Once the installation is finished, you will see an Installation Complete message. Click on the Next button to proceed.
9. Launch MySQL Workbench: By default, the installer will offer to launch MySQL Workbench immediately after installation. If the option is selected, MySQL Workbench will open automatically. Otherwise, you can manually launch it from the Start menu or desktop shortcut.

That’s it! MySQL Workbench is now installed on your computer. You can now launch it and start using it to connect to MySQL servers, manage databases, and perform various database-related tasks.

Note

For a step-by-step pictorial representation of setting up MySQL Workbench, please follow this link: https://www.dataquest.io/blog/install-mysql-windows/.

Decoding database structures – relational and non-relational

Before we delve into the details of relational and non-relational databases, let us first understand the meaning of the term database and why it is important to know about databases.

What is a database?

Most of us have heard of a database, right? To put it simply, it is a collection of information that is stored in an organized and logical manner. This helps people keep track of things and find information quickly. For example, imagine you are walking into a superstore and looking for a specific item, such as a phone charger. To find it, you would use logical categorization. First, you would go to the electronics section, but this section would have all sorts of electronics the superstore had to offer. So, you would then look for a section called phones and accessories and search for the specific phone charger that was compatible with your phone.

By using logical reasoning, you can determine the location of the object and purchase the charger successfully. If we consider the process from the perspective of the superstore, we can see that they have divided the entire area into sections such as electronics, merchandise, and groceries, and further subdivided it into rows and columns known as aisles. They store each object according to its category in an organized manner, which can be accessed through the store’s database.

The business definition of a database is that it is a collection of information stored on a server that is accessed regularly for analysis and decision-making. The information is organized into tables, which are similar to spreadsheets, with rows and columns. A database can contain multiple tables, and a server can have multiple databases for different categories or clients. For example, a university database may contain information on students, teachers, and subjects, while a superstore database may contain data on products, orders, store locations, and customers. Each row in the database represents a specific occurrence or transaction. The database stores information and its relationships.

Types of databases

Database Management Systems (DBMSs) are used to store and manage data in a database. The most commonly used language to extract information from a database is SQL. The history of databases dates back several decades, specifically to the 1970s. Since then, databases have evolved into two broad categories, known as relational and non-relational.

Relational databases

A relational database, or relational DBMS, stores data in the form of tables or entities that we want to track, such as customers and orders. The data about these entities is stored in relations, which are 2D tables of rows and columns, similar to a spreadsheet. Each row contains data, and each column contains different attributes about that entity.

Figure 1.1 – Relational database

For instance, in a table/entity that contains information about customers, the attributes or columns could include Name, Phone Number, Address, and Gender. The rows would then represent specific information for each customer in a separate row.

For example, we could have a customers table as follows:

Customer_ID | Name | Address   | Phone    | Gender | Email
1           | Joey | Texas     | 834-2345 | M      | [email protected]
2           | Ron  | Tennessee | 987-6543 | M      | [email protected]
3           | Fred | New York  | 876-5678 | M      | [email protected]
4           | Tom  | LA        | 765-7654 | M      | [email protected]
5           | Mary | Georgia   | 124-0987 | F      | [email protected]

Figure 1.2 – Customers table

Every row in a relational database should have a unique key, which we call the primary key (discussed later in the chapter). This key can be used as a foreign key in a different table to build logical referential relations between the two tables. The relations between the fields and tables are known as schemas. To extract data from databases, we use SQL queries.

These are some of the advantages of relational databases:

- Highly efficient
- High readability as data is sorted and unique
- High data integrity
- Normalized data

Non-relational databases

A non-relational database stores data in a non-tabular format, meaning it does not have a structure of tables and relations. Instead, this type of database stores information in various ways, such as key-value and document-based databases. In a key-value database, data is stored in two parts: a key and its corresponding value. Each key is unique and can only connect to one value in the collection. In contrast, a document-oriented database pairs a key with a document that contains a complex combination of several key-value pairs. Non-relational databases, also known as NoSQL databases, are more flexible than traditional relational databases. Some commonly used non-relational databases include MongoDB, Cassandra, Amazon DynamoDB, and Apache HBase.

Figure 1.3 – NoSQL databases

Key: 2022
Document:
{
  Customer ID: 1234,
  Customer Name: Joey,
  Customer address: XYZTX,
  Order details: {
    Order 1: {product 1, product description},
    Order 2: {Product 1, product description}
  }
}

Key: 2023
Document:
{
  Customer ID: 5667,
  Customer Name: Ron,
  Customer address: LKJHNTN,
  Order details: {
    Order 1: {product 1, product description},
    Order 2: {Product 1, product description}
  }
}

Figure 1.4 – Non-relational database example
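The two storage models above can be sketched in a few lines of Python. The dictionaries here are hypothetical in-memory stand-ins for a real NoSQL engine, just to show the shape of the data: a key-value store maps one unique key to one value, while a document store maps a key to a whole nested document.

```python
# Minimal in-memory sketches of the two NoSQL storage models.
# These dicts are illustrative stand-ins, not a real database engine.

# Key-value store: each unique key maps to exactly one value.
kv_store = {
    "session:joey": "logged_in",
    "session:ron": "logged_out",
}

# Document store: each key maps to a nested document of key-value pairs.
doc_store = {
    2022: {
        "customer_id": 1234,
        "customer_name": "Joey",
        "orders": {
            "order_1": {"product": "product 1"},
            "order_2": {"product": "product 2"},
        },
    },
}

# Retrieval is by key in both models; the document store returns the
# whole nested document in one lookup, with no joins required.
value = kv_store["session:joey"]
document = doc_store[2022]
order_count = len(document["orders"])
```

Notice that reading one document retrieves the customer and all their orders together, which is the "higher readability of a particular document" advantage discussed next.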

These are some of the advantages of non-relational databases:

- Simple management – no sorting needed, so data can be directly dumped into the database without any preprocessing
- Higher readability of a particular document, especially when the dataset contains big data, avoiding the need to parse through multiple tables and write complex queries
- Can be scaled to a huge level by splitting the servers into multiple clusters and managing the CPU utilization on each of these clusters

Let’s understand that last point in detail. Multiple clusters refers to a distributed computing architecture that consists of multiple servers or nodes, each serving a different purpose, but working together to achieve a common goal. In this context, a cluster typically consists of a group of interconnected computers that work together to provide a more powerful and scalable computing environment. Each cluster may have its own dedicated resources, such as CPU, memory, and storage, and can be managed independently. By splitting servers into multiple clusters, the workload can be distributed more efficiently, allowing for greater scalability and flexibility. For example, suppose you have a large-scale application that requires a lot of processing power and storage. In that case, you might split your servers into multiple clusters and distribute the workload across those clusters. This way, you can achieve better performance, as well as reduce the risk of a single point of failure. Overall, multiple clusters offer several advantages, including increased scalability, improved performance, better fault tolerance, and the ability to handle large workloads more efficiently.

Tables and relationships

In the realm of database management, tables and relationships form the backbone of organizing and connecting data in SQL Server. With the power of structured data organization and connectivity, SQL Server enables businesses and organizations to efficiently store, retrieve, and analyze vast amounts of information.

The SQL CREATE DATABASE statement

All tables in SQL Server are stored in a repository known as a database. A database is a collection of tables related to a specific entity. For instance, we can have separate databases for insurance company providers such as Blue Cross Blue Shield and Cigna. Having one database for each entity helps maintain and scale the database and its dataset for the future. The owner of the database is known as a database administrator who holds admin privileges to the database. Only a database administrator can provide or revoke access to a database.

- The syntax for creating a database is CREATE DATABASE Database_name; the statement creates a database with the name Database_name
- The syntax for deleting a database is DROP DATABASE Database_name
- The syntax for listing all databases on the server is SHOW DATABASES

The SQL CREATE TABLE statement

The CREATE TABLE statement is used to create a new table, which is a combination of rows and columns, in a specified database:

CREATE TABLE database_name.table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype
);

For example, we can create a walmart.customer_info table as follows:

CREATE TABLE walmart.customer_info (
    Customer_ID int,
    Name varchar(255),
    Address varchar(255),
    Email varchar(255),
    Phone varchar(255)
);

In the preceding code, the Customer_ID column is defined as an integer data type, while the Name column is defined as a varchar data type. This means that the Name column can hold both letters and numbers, up to a maximum length of 255 characters. The code will create an empty table with these columns.

Customer_ID | Name | Address | Phone | Email

Figure 1.5 – customer_info table
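As a hands-on sketch, the same table can be created from Python’s built-in sqlite3 module. SQLite is used here (instead of MySQL) only so the example runs without a server, and the walmart database prefix is dropped because SQLite scopes each database to a single file:

```python
import sqlite3

# Create the customer_info table in an in-memory SQLite database.
# SQLite maps varchar(255) to TEXT, but accepts the same DDL shape.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_info (
        Customer_ID int,
        Name varchar(255),
        Address varchar(255),
        Email varchar(255),
        Phone varchar(255)
    )
""")

# The table now exists but is empty: five columns, zero rows.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer_info)")]
row_count = conn.execute("SELECT COUNT(*) FROM customer_info").fetchone()[0]
```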

Note

Post-creation of this table, the database administrator has to provide read access to all the table users so that they can access the data within it.

SQL DROP TABLE versus TRUNCATE TABLE

If the requirement is to delete the entire table along with its schema, the SQL syntax used is DROP TABLE table_name. However, if the need is to delete only the content from within the table and retain the table structure, the SQL syntax used is TRUNCATE TABLE table_name. After truncation, the table will still exist but with 0 rows within it.
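The difference is easy to observe in code. The sketch below again uses sqlite3 so it runs anywhere; note that SQLite has no TRUNCATE statement, so DELETE FROM plays that role here, whereas in MySQL you would write TRUNCATE TABLE t:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id int)")
conn.execute("INSERT INTO t VALUES (1), (2), (3)")

# "Truncate": remove every row but keep the table and its schema.
# (MySQL: TRUNCATE TABLE t; SQLite substitutes DELETE FROM t.)
conn.execute("DELETE FROM t")
rows_after_truncate = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# DROP: remove the table itself; querying it afterward raises an error.
conn.execute("DROP TABLE t")
try:
    conn.execute("SELECT COUNT(*) FROM t")
    table_exists = True
except sqlite3.OperationalError:
    table_exists = False
```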

SQL ALTER TABLE

The SQL syntax of ALTER TABLE is used to insert, update, or delete data or columns in a pre-created table:

ALTER TABLE Walmart.customer_info
DROP COLUMN Email;

The following table will be created after the execution of the drop statement:

Customer_ID | Name | Address | Phone

Figure 1.6 – Email column dropped
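ALTER TABLE can be tried out the same way. This sketch uses ADD COLUMN because SQLite only supports DROP COLUMN from version 3.35 onward; in MySQL, the DROP COLUMN statement shown above works directly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_info (
        Customer_ID int, Name varchar(255), Address varchar(255),
        Phone varchar(255), Email varchar(255)
    )
""")

# ALTER TABLE modifies a pre-created table; here we add a column.
# (DROP COLUMN, as shown above for MySQL, needs SQLite 3.35+.)
conn.execute("ALTER TABLE customer_info ADD COLUMN Gender varchar(1)")

columns = [row[1] for row in conn.execute("PRAGMA table_info(customer_info)")]
```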

SQL constraints

Constraints are a set of predefined rules that a database administrator or table creator defines to ensure that the table and its data are unique and clean. They can be defined at the table or column level. Some commonly used constraints include the following:

- The UNIQUE constraint: Ensures that all values in a column are distinct
- The NOT NULL constraint: Ensures that a column cannot contain NULL values
- PRIMARY KEY: A unique, non-null value that identifies each row
- FOREIGN KEY: A relational key, which is a copy of a primary key column from a different table within the same database. A foreign key is used to ensure that the communication between two tables is maintained and never destroyed by accidentally dropping the foreign key column. More on this will be discussed in the following sections.
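Constraints are enforced by the database itself, not by application code. The sketch below (using sqlite3 and a hypothetical users table) shows rows that violate UNIQUE and NOT NULL being rejected at insert time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Constraints declared at the column level.
conn.execute("""
    CREATE TABLE users (
        id int PRIMARY KEY,
        email varchar(255) UNIQUE,
        name varchar(255) NOT NULL
    )
""")
conn.execute("INSERT INTO users VALUES (1, 'a@x.com', 'Adam')")

# A duplicate email violates the UNIQUE constraint.
try:
    conn.execute("INSERT INTO users VALUES (2, 'a@x.com', 'James')")
    unique_violated = False
except sqlite3.IntegrityError:
    unique_violated = True

# A NULL name violates the NOT NULL constraint.
try:
    conn.execute("INSERT INTO users VALUES (3, 'b@x.com', NULL)")
    not_null_violated = False
except sqlite3.IntegrityError:
    not_null_violated = True
```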

SQL keys

In RDBMSes, the term “relational” refers to the relationship between tables in a database that allows for the retrieval of necessary data. This relationship is established through the use of keys such as primary keys, foreign keys, and candidate keys.

Customer_Id | Name  | Passport_Number | DOB
1           | Adam  | L08790          | 7/11/1990
2           | James | L08791          | 8/6/1992
3           | Paul  | L08792          | 3/4/1993

Figure 1.7 – Candidate keys

Candidate keys

A candidate key is a set of one or more columns that can uniquely identify a record in a table. It can be used as a primary key as it cannot be null and must be unique. A candidate key is a super key with no repeating attributes. Out of all the candidate keys that are possible for a table, only one key can be used to retrieve unique rows from the table. This key is called the primary key. It is important to remember that the candidate key used as a primary key should be unique and have non-null attributes.

In the customer table shown in Figure 1.8, we have one candidate key – Passport_Number – which is unique, whereas the Customer_ID column is a primary key.

Customer_ID | Name  | Passport_Number | DOB
1           | Adam  | L08790          | 7/11/1990
2           | James | L08791          | 8/6/1992
3           | Paul  | L08792          | 3/4/1993

Figure 1.8 – Candidate key

Primary keys

A primary key is an attribute used to uniquely identify a row in a table. In the table mentioned previously, Customer_ID would be the primary key, while Passport_Number would not be a primary key as it contains confidential information.

Customer_ID (Primary Key) | Name  | Passport_Number | DOB
1                         | Adam  | L08790          | 7/11/1990
2                         | James | L08791          | 8/6/1992
3                         | Paul  | L08792          | 3/4/1993

Figure 1.9 – Primary key

Alternate keys

A candidate key that hasn’t already been assigned as a primary key is known as an alternate key that can uniquely identify a row. In the following table, Customer_ID is the primary key and Passport_Number is a candidate key; therefore, License_Number can be an alternate key as it can also uniquely identify a customer.

Customer_ID | Name  | Passport_Number | DOB       | License_Number (Alternate Key)
1           | Adam  | L08790          | 7/11/1990 | L01YZ
2           | James | L08791          | 8/6/1992  | L02ER
3           | Paul  | L08792          | 3/4/1993  | L03PX

Figure 1.10 – Alternate key

Super keys

If more than one attribute is assigned as the primary key and it still uniquely identifies a row within a table, then it becomes a super key.

For example, Customer_ID + Name is a super key, as the name of a customer may not be unique, but when combined with Customer_ID, then it becomes unique.

Customer_ID | Name  | Passport_Number | DOB
1           | Adam  | L08790          | 7/11/1990
2           | James | L08791          | 8/6/1992
3           | Paul  | L08792          | 3/4/1993
(Customer_ID + Name together form the super key)

Figure 1.11 – Super key

Composite keys

If the table does not have an individual attribute that can qualify as a candidate key, then we need to select two or more columns to create a unique key, which is known as a composite key.

For example, if we do not have a customer ID or passport number, we can use a composite primary key consisting of the full name and date of birth.

There is still the possibility of duplicate rows in this scenario if both the name and date of birth have the same value.

Customer_ID | Name  | Passport_Number | DOB
1           | Adam  | L08790          | 7/11/1990
2           | James | L08791          | 8/6/1992
3           | Paul  | L08792          | 3/4/1993
(Name + DOB together form the composite key)

Figure 1.12 – Composite key
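A composite key is declared by listing several columns in one PRIMARY KEY clause. In this sqlite3 sketch, Name and DOB together form the key, so repeating the pair is rejected while repeating either column alone is allowed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Name and DOB together form the composite primary key.
conn.execute("""
    CREATE TABLE customers (
        Name varchar(255),
        DOB varchar(10),
        PRIMARY KEY (Name, DOB)
    )
""")
conn.execute("INSERT INTO customers VALUES ('Adam', '7/11/1990')")
# The same name with a different DOB is accepted...
conn.execute("INSERT INTO customers VALUES ('Adam', '8/6/1992')")
# ...but repeating the same (Name, DOB) pair is a duplicate key.
try:
    conn.execute("INSERT INTO customers VALUES ('Adam', '7/11/1990')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```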

Surrogate key

A surrogate key is a key that is generated by the system and has no business meaning. The values generated for the keys are sequential and act as a primary key. When we don’t have a proper primary key for the table, a surrogate key is generated to uniquely identify the data. In such scenarios, the surrogate key becomes the primary key.

For example, let’s consider that we are creating a database for addresses.

Address_ID | Street_Name  | City    | State     | Zipcode
1          | Jefferson St | Dallas  | Texas     | 38256
2          | Thomas St    | Memphis | Tennessee | 38257
3          | James St     | Chicago | Illinois  | 33189
4          | Perkins St   | Miami   | Florida   | 23487

Figure 1.13 – Address table with Address_ID as a surrogate key

Here Address_ID is the surrogate key as it is generated systematically and is used to uniquely identify the rows. It holds no business value.
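A surrogate key is typically generated by the database itself. In MySQL this is done with AUTO_INCREMENT; the equivalent sqlite3 sketch below uses INTEGER PRIMARY KEY, which auto-assigns sequential IDs, so the application never supplies Address_ID:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Address_ID is system-generated: INTEGER PRIMARY KEY in SQLite
# auto-assigns sequential row IDs (MySQL uses AUTO_INCREMENT).
conn.execute("""
    CREATE TABLE address (
        Address_ID INTEGER PRIMARY KEY,
        Street_Name varchar(255),
        City varchar(255)
    )
""")
conn.execute(
    "INSERT INTO address (Street_Name, City) VALUES ('Jefferson St', 'Dallas')")
conn.execute(
    "INSERT INTO address (Street_Name, City) VALUES ('Thomas St', 'Memphis')")

ids = [row[0] for row in
       conn.execute("SELECT Address_ID FROM address ORDER BY Address_ID")]
```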

Primary keys in detail

A primary key is a specially created attribute to uniquely identify each record in a table and has the following features:

- It holds unique values for each record/row
- It can’t have null values

The primary key and foreign key are core principles used to establish relationships between tables in a relational database.

A few examples of primary keys include Social Security Number (SSN), passport number, and driver’s license number. These are used to uniquely identify a person.

To ensure unique identification, a composite primary key is sometimes created using a combination of columns.

Foreign keys in detail

A foreign key is a column or combination of columns that creates a link between data in one table (the referencing table) and another table that holds the primary key values (the referenced table). It creates cross-references between tables by referencing the primary key (unique values) in another table.

The table that has the primary key is called the parent table or referenced table, and the table that has a foreign key is called the referencing table or child table.

The column that has a foreign key must have a corresponding value in its related table. This ensures referential integrity.

The FOREIGN KEY constraint is used to prevent actions that would destroy links between tables. Essentially, it ensures that if a column value A refers to a column value B, then column value B must exist.

For example, let’s consider the Orders table and the Customers table. In this case, the customer_ID column in the Orders table refers to the Customer_ID column in the Customers table. Here are some key points regarding the relationship between these tables:

- Any value that is updated or inserted in the customer_id attribute (foreign key) of the Orders table must exactly match a value in the Customer_ID attribute (primary key) of the Customers table or be NULL to ensure the relation is maintained.
- The values in the ID attribute (the Customer_ID primary key) of the Customers table that are referencing the customer_ID attribute (foreign key) of the Orders table cannot be updated or deleted unless cascading rules are applied (cascading actions will be discussed shortly). However, the values of ID in the Customers table that are not present in the Customer_ID attribute of the Orders table can be deleted or updated.

Figure 1.14 – Foreign key illustration
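The FOREIGN KEY constraint can be seen rejecting an orphan row in this sqlite3 sketch (SQLite only enforces foreign keys once the foreign_keys pragma is switched on; MySQL’s InnoDB enforces them by default):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (Customer_ID int PRIMARY KEY)")
conn.execute("""
    CREATE TABLE orders (
        Order_ID varchar(10) PRIMARY KEY,
        Customer_ID int REFERENCES customers(Customer_ID)
    )
""")
conn.execute("INSERT INTO customers VALUES (1)")

# An order for an existing customer is accepted.
conn.execute("INSERT INTO orders VALUES ('O1', 1)")

# An order for a nonexistent customer violates referential integrity.
try:
    conn.execute("INSERT INTO orders VALUES ('O2', 99)")
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False
```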

Cascading actions

Cascading actions in SQL refer to the automatic propagation of changes in a parent table to related child tables through foreign key constraints. It enables actions such as deletion or modification in the parent table to automatically affect corresponding records in the child tables. This ensures data integrity and simplifies data management by reducing manual updates.

DELETE CASCADE

This ensures that when a row with a primary key is deleted from a parent table, the corresponding row in the child table is also deleted.

UPDATE CASCADE

This ensures that when a referencing row that is a primary key is updated in a parent table, then the same is updated in the child table as well.
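DELETE CASCADE can be demonstrated with the same two tables; declaring ON DELETE CASCADE on the foreign key makes the child rows disappear automatically when the parent row is deleted:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (Customer_ID int PRIMARY KEY)")
conn.execute("""
    CREATE TABLE orders (
        Order_ID varchar(10) PRIMARY KEY,
        Customer_ID int REFERENCES customers(Customer_ID) ON DELETE CASCADE
    )
""")
conn.execute("INSERT INTO customers VALUES (1)")
conn.execute("INSERT INTO orders VALUES ('O1', 1)")
conn.execute("INSERT INTO orders VALUES ('O2', 1)")

# Deleting the parent row cascades: the customer's orders go with it.
conn.execute("DELETE FROM customers WHERE Customer_ID = 1")
remaining_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```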

Using a foreign key eliminates the need to store data repeatedly. Since we can directly reference primary keys in another table, we don’t have to store that data again in every table.

Take the following example:

Customer_Id | Name | Address   | Phone    | Gender | Email
1           | Joey | Texas     | 834-2345 | M      | [email protected]
2           | Ron  | Tennessee | 987-6543 | M      | [email protected]
3           | Fred | New York  | 876-5678 | M      | [email protected]
4           | Tom  | LA        | 765-7654 | M      | [email protected]
5           | Mary | Georgia   | 124-0987 | F      | [email protected]

Figure 1.15 – Customers table

Order_ID | Customer_ID | OrderDate  | ShippingDate | ShippingStatus
O1       | 1           | 1/1/2022   | 1/7/2022     | Delivered
O2       | 1           | 9/1/2022   | 9/4/2022     | In Progress
O3       | 2           | 12/20/2022 | 12/31/2022   | Not Started
O4       | 3           | 8/15/2022  | 8/20/2022    | Delivered
O5       | 4           | 5/31/2022  | 5/31/2022    | Delivered

Figure 1.16 – Orders table

As you can see, Customer_ID in the Orders table is a foreign key that can be used to establish a connection between the Orders and Customers tables. This allows us to retrieve customer details from the Orders table and vice versa.
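Retrieving customer details alongside each order is a JOIN on the foreign key. The sqlite3 sketch below uses simplified versions of the two tables (only the columns needed for the join):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (Customer_ID int PRIMARY KEY, Name varchar(255))")
conn.execute("""
    CREATE TABLE orders (
        Order_ID varchar(10) PRIMARY KEY,
        Customer_ID int,
        ShippingStatus varchar(20)
    )
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Joey"), (2, "Ron")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [("O1", 1, "Delivered"), ("O2", 1, "In Progress"),
                  ("O3", 2, "Not Started")])

# The foreign key lets us join the tables and read customer details
# for every order without duplicating customer data in Orders.
rows = conn.execute("""
    SELECT o.Order_ID, c.Name, o.ShippingStatus
    FROM orders o
    JOIN customers c ON o.Customer_ID = c.Customer_ID
    ORDER BY o.Order_ID
""").fetchall()
```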

Database relationships

Database relationships are used to build well-defined table structures and establish relationships between different tables. With the correct relationships, it helps to standardize data quality and eliminate data redundancy.

Different types of database relationships include the following:

- One-to-one relationships
- One-to-many relationships
- Many-to-many relationships

A database entity can be a customer, product, order, unit, object, or any other item that has data stored in the database. Typically, entities are represented by tables in the database.

An entity relationship diagram, commonly known as an ER diagram or ERD, is a flowchart that illustrates how different entities/tables are related to each other.

One-to-many relationships

One-to-many is the most common type of relationship, in which one record in one entity/table can be associated with multiple records in another entity/table.

Figure 1.17 – One-to-many relationship

Example

Let’s consider the Customers and Orders tables. In this case, one customer can have multiple orders, establishing a one-to-many relationship between them. Customer ID in the Customers table serves as the primary key and is associated with unique values, representing each customer uniquely. On the other hand, Customer ID in the Orders table acts as the foreign key and can have multiple instances, indicating that multiple orders can be associated with the same customer.

In this case, a single customer ordered multiple products, creating a one-to-many relationship where each product was associated with one customer.

One-to-one relationships

A one-to-one relationship is a type of relationship between two tables in which one record in one table is associated with only one record in another table.

Figure 1.18 – One-to-one relationship

Example

In a school database, each student is assigned a unique student_ID, and each student_ID is linked to only one student.

In a country database, each country is associated with one capital city, and each capital city is associated with only one country.

Many-to-many relationships

A many-to-many relationship is one in which multiple records in one table are associated with multiple records in another table.

Figure 1.19 – Many-to-many relationship

Example

Let’s consider the Customers and Products tables. Customers can purchase multiple products, and each product can be purchased by different customers.

In a relational database, a direct many-to-many relationship is typically not permitted between two tables. For example, in a bank transaction database with multiple invoices having the same number, it can be difficult to map the correct invoice and retrieve the necessary information when a customer makes an inquiry. To address this issue, many-to-many relation tables are often broken down into two one-to-one relationship tables by introducing a third table known as a join table. The join table holds the primary key of both tables as a foreign key and may also contain other necessary attributes.

For example, let’s consider the Products and Distributor tables.

Here we have the following attributes in each table.

The attributes for the Distributor table are as follows:

- id: The distributor’s ID is the primary key used to identify the distributor
- distributors_name: The distributor’s name
- distributors_address: The distributor’s address
- distributors_city: The city where the distributor is located
- distributors_state: The state where the distributor is located

The attributes for the Products table are as follows:

- id: The product’s ID is the primary key used to identify the product
- product_name: The product’s name
- product_description: The product’s description
- price: The product’s price per unit

Multiple products can be ordered by multiple distributors, and each distributor can order different products. Therefore, this information needs to be transformed into a relational database model, which would look something like this:

Figure 1.20 – Relational database model

This is known as a join table, and it contains two attributes that serve as foreign keys referencing the primary keys of the original tables.

id references the distributor ID from the distributor table.

product_id references the product_id column in the product table.

These two together serve as the primary key for this table.

However, this information is not sufficient. It would be better to add more attributes to this table.

Figure 1.21 – Adding more attributes to the relational database model

So, we have now converted it into a relational database model and have the data in a cleaner form. Additionally, I have changed the table name to make it more aligned with the data and named it Orders. We have also added the following additional attributes to the table:

- order_id: Contains the unique order ID for each order placed by the customer
- distributor_id: The unique identifier for each distributor
- product_id: The unique identifier for each product, which can be ordered
- order_date: The order date
- quantity: The number of units ordered
- total_price: The total price of the order
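The join-table pattern described above can be sketched end to end with sqlite3. A simplified Orders table holds the two foreign keys as its composite primary key, turning the many-to-many link between distributors and products into two one-to-many links:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE distributors (id int PRIMARY KEY)")
conn.execute("CREATE TABLE products (id int PRIMARY KEY)")
# The join table: its primary key is the pair of foreign keys, so each
# (distributor, product) combination appears at most once.
conn.execute("""
    CREATE TABLE orders (
        distributor_id int REFERENCES distributors(id),
        product_id int REFERENCES products(id),
        quantity int,
        PRIMARY KEY (distributor_id, product_id)
    )
""")
conn.executemany("INSERT INTO distributors VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO products VALUES (?)", [(10,), (20,)])
# One distributor orders many products, and one product is ordered
# by many distributors -- the many-to-many relationship.
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10, 5), (1, 20, 3), (2, 10, 7)])

products_for_d1 = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE distributor_id = 1").fetchone()[0]
distributors_for_p10 = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE product_id = 10").fetchone()[0]
```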

Note

There are two ways that you can create a join table.