Learn the most important SQL skills and apply them in your job--quickly and efficiently! SQL (Structured Query Language) is the modern language that almost every relational database system supports for adding data, retrieving data, and modifying data in a database. Although basic visual tools are available to help end-users input common commands, data scientists, business intelligence analysts, Cloud engineers, Machine Learning programmers, and other professionals routinely need to query a database using SQL. Job Ready SQL provides you with the foundational skills necessary to work with data of any kind. Offering a straightforward 'learn-by-doing' approach, this concise and highly practical guide teaches you all the basics of SQL so you can apply your knowledge in real-world environments immediately. Throughout the book, each lesson includes clear explanations of key concepts and hands-on exercises that mirror real-world SQL tasks. * Teaches the basics of SQL database creation and management using easy-to-understand language * Helps readers develop an understanding of fundamental concepts and more advanced applications such as data engineering and data science * Discusses the key types of SQL commands, including Data Definition Language (DDL) commands and Data Manipulation Language (DML) commands * Includes useful reference information on querying SQL-based databases Job Ready SQL is a must-have resource for students and working professionals looking to quickly get up to speed with SQL and take their relational database skills to the next level.
Cover
Title Page
Introduction
WHAT DOES THIS BOOK COVER?
READER SUPPORT FOR THIS BOOK
Part I: Introduction to Database Concepts
Lesson 1: Exploring Relational Databases and SQL
SAVING DATA
WHAT IS A DATABASE?
RELATIONAL DATABASE CONCEPTS
ACID COMPLIANCE
ENTITY INTEGRITY
BACKUP STRATEGIES
SUMMARY
EXERCISES
Lesson 2: Applying Normalization
WHAT IS NORMALIZATION?
NORMALIZING DATA
FIRST NORMAL FORM
SECOND NORMAL FORM
THIRD NORMAL FORM
DENORMALIZATION
SUMMARY
EXERCISES
Lesson 3: Creating Entity‐Relationship Diagrams
USING ERDs
ERD COMPONENTS
ERD OF DATABASE
WHAT ABOUT MANY‐TO‐MANY RELATIONSHIPS?
SUMMARY
EXERCISES
Lesson 4: Pulling It All Together: Normalizing a Vinyl Record Shop Database
THE VINYL RECORD SHOP DATA OVERVIEW
STEP 1: IDENTIFY THE ENTITIES AND ATTRIBUTES
STEP 2: FIRST NORMAL FORM
STEP 3: SECOND NORMAL FORM
STEP 4: THIRD NORMAL FORM
STEP 5: FINALIZE THE STRUCTURE
FINAL STEPS
SUMMARY
Part II: Applying SQL
Lesson 5: Working with MySQL Server
MySQL INSTALLATION
MySQL NOTIFIER
COMMAND‐LINE INTERFACE
GETTING STARTED WITH MySQL WORKBENCH
SUMMARY
EXERCISES
Lesson 6: Diving into SQL
INTRODUCTION TO SQL
SQL SYNTAX
WORKING WITH NULL VALUES
WORKING WITH INDEXES
SUMMARY
EXERCISES
Lesson 7: Database Management Using DDL
DATABASE MANAGEMENT
MySQL DATA TYPES
MANAGING TABLES IN MYSQL
MANAGING RELATIONSHIPS IN MySQL
SUMMARY
EXERCISES
Lesson 8: Pulling It All Together: Building the Vinyl Record Shop Database
STEP 1: EXAMINE THE STRUCTURE
STEP 2: CREATE THE DATABASE
STEP 3: CREATE THE PRIMARY TABLES
STEP 4: CREATE THE RELATED TABLES
STEP 5: FINALIZE THE SCRIPT
SUMMARY
Part III: Data Management and Manipulation
Lesson 9: Applying CRUD: Basic Data Management and Manipulation
DATA MANIPULATION LANGUAGE
CREATE A DATABASE
INSERT DATA
UPDATE DATA
DELETE DATA
SUMMARY
EXERCISES
Lesson 10: Working with SELECT Queries
SETTING UP A DATABASE
USING THE SELECT KEYWORD
USING THE WHERE CLAUSE
PERFORMING CALCULATIONS
SUMMARY
EXERCISES
Lesson 11: Adding JOIN Queries
STARTING WITH A SCHEMA
GET DATA FROM MULTIPLE TABLES
USE THE JOIN CLAUSE
INNER JOIN
OUTER JOIN: LEFT, RIGHT, AND FULL
SELF‐JOIN AND ALIASES
CROSS JOIN
SUMMARY
EXERCISES
Lesson 12: Sorting and Limiting Query Results
USING ORDER BY
USING LIMIT
USING DISTINCT
SUMMARY
EXERCISES
Lesson 13: Grouping and Aggregates
AGGREGATE FUNCTIONS
USING GROUP BY
USING HAVING
SELECT EVALUATION ORDER
OTHER EXAMPLES
SUMMARY
EXERCISES
Lesson 14: Pulling It All Together: Adding Data to the Vinyl Record Shop Database
INSERTING DATA
IMPORT CSV DATA
ADD DATA TO THE SCRIPT
TEST THE SCRIPT
WRAP UP THE VINYL MUSIC SHOP SCRIPT
SUMMARY
Lesson 15: Diving into Advanced SQL Topics
ADDING SUBQUERIES
WORKING WITH VIEWS
UNDERSTANDING TRANSACTIONS
SCHEMA OPTIMIZATION
SUMMARY
EXERCISES
Appendix A: Bonus Lesson on Applying SQL with Python
DATABASE OPERATIONS
TABLE OPERATIONS
DATA OPERATIONS: CRUD
SUMMARY
EXERCISES
Appendix B: SQL Quick Reference
WORKING WITH DATABASES
DEFINING TABLES, COLUMNS, AND ROWS
PERFORMING TABLE QUERIES
BASIC SQL DATA TYPES
Index
Copyright
Acknowledgments
About the Authors
About the Technical Writer
About the Technical Editor
End User License Agreement
Chapter 2
Table 2.1 Client Account Information
Table 2.2 A Simple Contact List
Table 2.3 Modified Contacts Table with a Key
Table 2.4 Contact Table
Table 2.5 Telephone Table
Table 2.6 Customer Orders with Ordered Products
Table 2.7 A Basic Product Table
Table 2.8 Basic Product Table with a Primary Key
Table 2.9 Customer Table
Table 2.10 Order Table
Table 2.11 OrderID Functional Dependencies
Table 2.12 OrderProduct Table
Table 2.13 Customer Orders with Ordered Products
Table 2.14 OrderProduct Table with a Nonkey Field Added for Price Paid
Table 2.15 Customer
Table 2.16 Product
Table 2.17 Order
Table 2.18 OrderProduct
Table 2.19 Contact Table
Table 2.20 Contact Table
Table 2.21 Telephone Table
Table 2.22 PhoneType Table
Table 2.23 Updated Telephone Table
Table 2.24 Employee Database
Table 2.25 Students and Course List
Chapter 3
Table 3.1 ERD Relationship Notations
Chapter 6
Table 6.1 Product Table for Candy Vendor's Database
Table 6.2 Product Table for Candy and Stickers
Table 6.3 New Product Table
Table 6.4 Sticker Table
Table 6.5 Candy Table
Table 6.6 Contacts Table
Chapter 7
Table 7.1 Integer Data Types
Table 7.2 Common Decimal Types
Table 7.3 Common String Types
Table 7.4 SQL Date and Time Data Types
Chapter 9
Table 9.1 The Results of Running SHOW TABLES;
Chapter 10
Table 10.1 Common WHERE Operators
Table 10.2 Common Numeric Comparison Operators
Table 10.3 Date Comparison Operators
Table 10.4 Examples of Pattern Matching
Chapter 11
Table 11.1 Visual Representation of OUTER JOINs
Chapter 14
Table 14.1 The Album Table Definition
Table 14.2 Sample Dataset for the Album Table
Table 14.3 The artist Table in MySQL
Table 14.4 First Few Rows of Data for artist Table
Table 14.5 The artist Table Column Descriptions for SQL
Appendix A
Table A.1 The artist Table Fields and Properties
Chapter 1
Figure 1.1 Customer orders
Chapter 3
Figure 3.1 Draw.io
Figure 3.2 ERDPlus
Figure 3.3 The ERD Contact table
Figure 3.4 The updated Contact table with field information
Figure 3.5 The updated Contact table with a primary key
Figure 3.6 The PhoneType table
Figure 3.7 The Phone table
Figure 3.8 Showing the relationship that contacts can have phones
Figure 3.9 Updated relationship notation
Figure 3.10 Relationship between Phone and PhoneType tables
Figure 3.11 The full ERD of the database
Figure 3.12 Showing the many‐to‐many relationship between students and class...
Figure 3.13 A many‐to‐many relationship
Chapter 4
Figure 4.1 ERD for the band and artist entities as well as the bridge table...
Figure 4.2 ERD for the song and album entities as well as the bridge table
Figure 4.3 The complete ERD diagram at 1NF
Figure 4.4 The Vinyl Record Store data in 3NF
Figure 4.5 Adding a trackNumber to the songAlbum entity
Figure 4.6 The final normalized ERD diagram
Chapter 5
Figure 5.1 The MySQL download page
Figure 5.2 Skipping the account sign‐up
Figure 5.3 Selecting the setup type
Figure 5.4 Selecting MySQL tools
Figure 5.5 The Product Configuration screen
Figure 5.6 Selecting the High Availability option
Figure 5.7 Selecting the Type And Networking options
Figure 5.8 Selecting the Authentication Method option
Figure 5.9 Setting the root account password
Figure 5.10 Setting the Windows Service options
Figure 5.11 Applying the configuration options
Figure 5.12 Product configuration of MySQL Server complete
Figure 5.13 MySQL Router Configuration options
Figure 5.14 Connecting to a server
Figure 5.15 “Connection succeeded” message
Figure 5.16 The Apply Configuration window
Figure 5.17 The updated Product Configuration window
Figure 5.18 Installation complete
Figure 5.19 The SQL Notifier menu
Figure 5.20 The MySQL Command Line Client prompt
Figure 5.21 List of all existing databases currently managed
Figure 5.22 The MySQL Workbench
Figure 5.23 The MySQL Workbench environment
Figure 5.24 The MySQL Workbench Schema tab
Figure 5.25 The expanded city table
Figure 5.26 The executed query
Chapter 7
Figure 7.1 Executing SHOW DATABASES in MySQL Workbench
Figure 7.2 Executing SHOW TABLES
Figure 7.3 The primary‐foreign key relationship
Figure 7.4 The books database ERD
Chapter 8
Figure 8.1 The Vinyl Record Shop database ERD
Chapter 9
Figure 9.1 The TrackIt database ERD
Figure 9.2 A simplified Books database
Chapter 10
Figure 10.1 The Complaint table columns
Chapter 11
Figure 11.1 ERD for the TrackIt database
Figure 11.2 An INNER JOIN
Figure 11.3 The Personal Trainer schema
Chapter 12
Figure 12.1 The world database schema
Chapter 13
Figure 13.1 The PersonalTrainer database schema
Chapter 14
Figure 14.1 ERD for the VinylRecordShop database
Figure 14.2 The Stop option
Figure 14.3 The artist.csv file in Notepad
Figure 14.4 Opening the artist.csv file in Excel
Figure 14.5 artist.csv file with header row removed
Figure 14.6 The artist.csv file with the names reversed
Figure 14.7 Selecting a column in Excel
Figure 14.8 The Find and Replace dialog in Excel
Figure 14.9 Updated table with 0s and 1s
Figure 14.10 The band.csv file
Figure 14.11 Starting the Table Data Import Wizard in MySQL Workbench
Figure 14.12 Entering the filename to be imported
Figure 14.13 Selecting the table to use
Figure 14.14 The import settings
Figure 14.15 Confirming the import
Figure 14.16 Confirmation that the import occurred
Figure 14.17 Using SELECT to verify band data is imported
Figure 14.18 Viewing the .sql file from the mysqldump process
Chapter 15
Figure 15.1 Flowchart showing possible transaction states
KIMBERLY A. WEISS
HAYTHEM BALTI
Job Ready SQL provides readers with an understanding of relational databases and Structured Query Language (SQL). SQL is a domain‐specific language designed for managing data held in a relational database management system.
This book contains a full‐fledged SQL course that is used by the Wiley Edge Global Academy and the Software Guild to train our alumni in SQL and other topics, such as data analysis and data science.
As you read through this book, enter the code listings and play with the code. If you also take a hands‐on approach to doing the exercises, you will be better able to take what you learned to the next level.
Most important, this book (as well as the Job Ready series) goes beyond what many books provide by including lessons that help you pull together everything you are learning in a way that is more like what you would find in the professional world. This includes building a more comprehensive example than what you get in the standard short listings provided in most books. If you work through the “Pulling It All Together” lessons, then you will be better prepared for many of those jobs that require SQL.
As mentioned, this book is a complete SQL programming course. It is broken into several parts, each containing a number of lessons.
Part 1: Introduction to Database Concepts
The first part of this book focuses on introducing database concepts, including structured and unstructured data, as well as relational database concepts.
Part 2: Applying SQL
The second part focuses on getting you set up to use MySQL. This includes help for installing MySQL and setting up the tools you will need to work through this book. Additionally, this part dives into the basics of MySQL, including query design and development and the basics of database management.
Part 3: Data Management and Manipulation
The third part goes beyond the basics of MySQL, covering the concepts you'll need to design and develop complex databases and to write advanced queries against the data stored in MySQL. This includes CRUD operations, joins, SELECT queries, sorting, and aggregation. Finally, this part also includes lessons on how to leverage Python to query SQL data.
There are several ways to get the help you need for this book.
As you work through the examples in this book, you should type in all the code manually. This will help you learn and better understand what the code does.
However, in some lessons, download files are referenced. You can download the files from www.wiley.com/go/jobreadysql.
If you believe you have found a mistake in this book, please bring it to our attention. At John Wiley & Sons, we understand how important it is to provide our customers with accurate content, but even with our best efforts an error may occur.
To submit your possible errata, please email it to our Customer Service Team at [email protected] with the subject line “Possible Book Errata Submission.”
Lesson 1: Exploring Relational Databases and SQL
Lesson 2: Applying Normalization
Lesson 3: Creating Entity‐Relationship Diagrams
Lesson 4: Pulling It All Together: Normalizing a Vinyl Record Shop Database
SQL is used to access data. Before jumping into SQL and how it is used, it is important to step back and consider how the information you will access has been stored. In this lesson, you'll dive into the topic of data and databases to set the foundation for accessing that information. You will get a high-level look at databases in general and at relational databases specifically.
By the end of this lesson, you will be able to:
Describe what a relational database is, how it works, and how it differs from a database management system (DBMS)
Define database tables, relations, columns, attributes, rows, records, tuples, and data types
Identify the ACID properties
Know about entity integrity and uniqueness using keys
Discuss database backup strategies
To be useful, software systems must remember. If your character started at the beginning (level 0) every time you fired up a video game, or your online banking app reset your balance to $0 when you logged off, or your phone forgot your contacts when it rebooted, you wouldn't use them. To remember, applications must save data in a way that allows ready access to that data when needed.
There are a few options for saving data.
Write text or bytes directly into a file
Store data in a relational database
Store data in a nonrelational database
The first option of writing directly into a file can be cumbersome. In this case, the file typically is expected to be local (on the same computer as the program that is accessing the file). This means there is a high risk of losing data if something happens to that computer.
In a software environment, databases (relational or not) are preferable to files because they can store data separately from the application itself, often on a completely separate server. While this might slow down access to the data to a small degree, the fact that they are separate means that multiple applications can access the same database and that changes to the data in the database are immediately accessible to any application that uses that data.
Nonrelational databases are becoming more common today, but relational databases are standard across many industries as a way of storing data in a predictable and reliable way that allows applications to easily retrieve that data as needed.
In real life, most people work with databases every day, often without realizing it. Most computer applications depend on some kind of data access to work correctly. Any advanced system with a goal of identifying objects and performing specific actions on those objects (such as an employee list, an inventory, or a course roster) depends on a database. A database is a structured representation of data that can be read from and written to, and a database is often stored separately from any application that uses the data.
Modern computer applications rely heavily on databases, even when the program in question isn't designed to help users manage data. Computer games rely on databases to keep track of characters, character attributes, items that each character can use during gameplay, and even locations within the game. A learning management system (LMS) uses databases to keep track of learners, instructors, content, grades, attendance, and communication between users.
As a concrete example, consider a modern smartphone. The phone itself has a database that stores connection information, OS version, model number, serial number, and similar data about the device itself.
A smartphone also has a variety of applications on it, many of which have their own databases. Common examples include a contact list, a calendar, email apps, photo galleries, social networking apps, and shopping apps. While these apps store data for internal use, the user can grant permission for some apps to access data stored in other apps. For example, a calendar app may be connected to the contact app's database so that the user can easily add appointments with specific people to their calendar, while Facebook can access photos stored on the phone so that the user can share the photos with friends.
In none of these cases, though, does the user have direct access to the database itself. Instead, the application's front end (the part the user interacts with) includes tools that allow the user to create and retrieve data, update existing data, and even delete data that the user no longer needs. The software developer must incorporate the database into the application in a way that allows the application to access and manage the data.
When talking about databases, data and information are invariably mentioned. The terms data and information are often used interchangeably in casual speech, but from a software perspective, there is a very clear difference between the two. Specifically:
The term data refers to individual, raw facts. In many cases, individual pieces of data are meaningless on their own.
When we process data, the result is information. Unlike the raw data, information is useful and normally corresponds to the end user's specific needs.
As an example, consider a piece of data like smith. On its own, this is meaningless. While you might first think it is someone's last name, there isn't enough information here to tell you exactly whose name it is. It could also be an occupation rather than a name.
In a specific context, however, this piece of data can be combined with other data to give you useful information. As part of a course roster, for example, it could be combined with a first name to reference a specific student. In a job application or online profile, it could instead describe work experience as a smith, attached to a person with a completely different name.
A database can contain data that is structured or unstructured. Modern database software programs hosting databases can usually handle both structured and unstructured data, but it is still good to understand the difference.
In a database with structured data, which we will call a structured database, the data is organized in a specific pattern. This makes it easy to control what data is available and where to find specific pieces of data. In a structured database, the developer can limit what kinds of data are stored in the database to improve data integrity and reduce the amount of redundant data. This comes as a trade‐off in that creating new data and accessing stored data are relatively slow compared to creating and accessing data in an unstructured database. Structured databases are best for datasets that contain predictable types of data, such as bank accounts, personnel records, and inventories.
NOTE Data integrity refers to the reliability and accuracy of the data. A dataset is a collection of related information that is composed of separate elements but can be manipulated as a group.
A database with unstructured data typically does contain some amount of structure, but without the strict controls inherent to a structured database. Unstructured databases are typically a little faster than structured databases, but they are also prone to duplicate or redundant data. Unstructured databases are often found in applications that have unpredictable or irregular kinds of data, such as social media posts, online product reviews, and similar user‐generated content.
NOTE In this lesson, we will look only at structured databases, specifically relational databases.
As mentioned earlier, a database is a representation of data that can be read from and written to and is often stored separately from any application that uses the data. The fact that the database is separate from an application means that it can be made available to multiple applications, such as allowing access to a contact database from a calendar app or posting photos from an image database using a social media app. While any application that uses a specific database must know how to access the data, the data itself is simply a pool that any authorized app can pull from.
A database management system (DBMS) is a software system that manages databases. The DBMS executes commands, provides security, enables network access, and provides admin tools for database administrators (DBAs) to work with database files.
A subset of DBMSs includes relational database management systems (RDBMSs) that are designed specifically to work with relational databases. There are many options for RDBMSs, including MySQL, PostgreSQL, Microsoft SQL Server, Oracle Database, and DB2. The choice of DBMS determines some factors of how the data itself is organized, but in large part, all RDBMSs do the same thing.
While a specific RDBMS will be used in this book, keep in mind that all RDBMSs do essentially the same things in the same way. If you understand how one RDBMS works, you can easily transfer that knowledge to a different RDBMS.
Relational databases are highly structured in that they organize data into one or more tables or relations, where each table represents a logical group of data. In day‐to‐day professional work, we usually say table. Relation is the formal, academic term, which you may run into if you read about databases in other contexts. Relation is also the basis for the term relational database. At a more abstract level, the term entity is also used to refer to a table, especially during the design phase of a database and before the database is built on a server.
NOTE The relational database model was first proposed by Edgar F. Codd in 1970, with the main goal of reducing duplicate data in a database and thereby making it easier to retrieve and manage specific data.
A table can be imagined as a two‐dimensional grid of cells.
A row in a table is a horizontal group of cells, one cell high. It holds facts about one discrete thing represented by the table. That thing could be anything such as a person, a credit card transaction, or a professional sports mascot.
A column in the table is a vertical strip of cells, one cell wide. Every cell in the column holds the same type of data, but each cell is a fact about a different thing. For example, if the table includes data about people, then the table could include separate columns for name, phone number, and address.
A cell represents the intersection of a row and a column. Each cell contains a single piece of data.
The following is a concrete example:
Name        Abbr  Capital      Established    Population
Alabama     AL    Montgomery   Dec 14, 1819    4,874,747
Alaska      AK    Juneau       Jan 3, 1959       739,795
Arizona     AZ    Phoenix      Feb 14, 1912    7,016,270
Arkansas    AR    Little Rock  Jun 15, 1836    3,004,279
California  CA    Sacramento   Sep 9, 1850    39,536,653
In the data in this table, each row represents one state. Rows are also known as records or tuples. The term record is common, while tuple is an academic term. A database record is a single row in a table in the database, and each row in a table is considered a separate object from the other rows in the same table.
Each column represents a fact about a state: its name, abbreviation, capital, date established, and current population. There's a subtle nuance here. The term column refers to a strip of vertical cells, but it also refers to a definition: an overall name (Abbr, Capital, Population) and restrictions on the size, shape, and type of data allowed as values in the column. In fact, when a developer says column, they're usually talking about the definition, rather than the cells themselves. To reduce confusion, we may call the value of a record's column a field. Academically, column definitions are also called attributes.
NOTE The size, shape, and type of data allowed in a column will be discussed in more detail in Part 2, “Applying SQL.” As an example, a Population field might be defined as one that can hold a whole number (the type) that is between one and four billion (the size and shape).
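To make the idea of column definitions concrete, here is a minimal sketch of how the state table shown earlier might be declared in MySQL. The table name, column names, and data types are illustrative assumptions rather than definitions used elsewhere in this book.

-- Hypothetical definition of the state table shown above.
-- Names and types are illustrative choices, not the book's sample schema.
CREATE TABLE state (
    name        VARCHAR(50)  NOT NULL,   -- full state name
    abbr        CHAR(2)      NOT NULL,   -- two-letter abbreviation
    capital     VARCHAR(50)  NOT NULL,   -- capital city
    established DATE         NOT NULL,   -- date of statehood
    population  INT UNSIGNED NOT NULL    -- a whole number, as described in the note
);

Each column definition names the fact being tracked and restricts the type and size of the values that the matching field in every record can hold.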
Relational databases provide rich and powerful ways to model our data, and it doesn't stop there. A relational database's data structures and algorithms also provide behavior guarantees. They can't guarantee an action will always work, but they can guarantee the state of a database after an action succeeds or fails. They also guarantee predictable behavior when multiple users are interacting with a database. There are many types of guarantees.
The ACID properties are four of the most important guarantees. ACID is an acronym for the following:
Atomicity
Consistency
Isolation
Durability
Before jumping into ACID, you need to understand database transactions. A relational database allows the following actions:
Read existing data
Insert new data
Update existing data
Delete existing data
Add or alter schema (tables and relationships)
A transaction is a set of one or more actions that represents a single, logical unit of work. Say you want to reserve three rooms at a hotel, and those room reservations are stored in at least three rows in a database. In most circumstances, you don't want to book any rooms if one or more room reservations fail—it's all or nothing. That makes your three‐room reservation a transaction. It's a single unit of work that should succeed or fail as a unit.
If you purchase a concert ticket for an in‐demand concert, the software system must first find an available ticket and then put it on hold until you can provide payment. That's a transaction. If it wasn't, the system might find an available ticket only to have another person purchase it before you. Without a transaction, the ticket could be purchased twice, or the system might waste time presenting tickets that are no longer for sale. Imagine selling tickets to a show that's predicted to sell out in 10 minutes without software that can handle transactions.
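As a sketch of how a transaction is expressed in SQL, the statements below group the three hypothetical room reservations into a single unit of work. The reservation table and its columns are assumptions made for this illustration.

START TRANSACTION;

-- All three reservations succeed together or not at all.
INSERT INTO reservation (guest_id, room_number, night) VALUES (42, 101, '2024-06-01');
INSERT INTO reservation (guest_id, room_number, night) VALUES (42, 102, '2024-06-01');
INSERT INTO reservation (guest_id, room_number, night) VALUES (42, 103, '2024-06-01');

COMMIT;   -- make all three rows permanent

-- If any insert had failed, the application would issue ROLLBACK instead,
-- and none of the rows would be saved.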
As mentioned, ACID is an acronym for atomicity, consistency, isolation, and durability. The ACID properties are not required; you can run a database without them. However, in some situations, running without ACID is risky. In situations involving banking, medical records, real-time decision-making, and the like, bad things can and will happen to your application if you ignore ACID.
Additionally, you never know what will happen while software is running. The network can fail, the operating system can crash, or another user can alter data that you're using. Given enough time, something will fail.
ACID‐compliant databases are designed to withstand unexpected failures without corrupting your data. Let's look at each of the elements of ACID.
A transaction is atomic if it follows the “all‐or‐nothing” rule. If one action in the transaction fails, then the entire transaction fails. An atomic transaction never partially succeeds.
Imagine a scenario where you write a new row of data to a table with 10 columns. On the eighth column write, a power failure occurs, and your server immediately shuts down. If the database supports atomicity, it will notice the unfinished transaction and restore the data to its pre‐transaction state when it comes back online.
A transaction is consistent if it can move the database from only one valid state to another valid state. This means that the data processed during the operation is in a consistent state when the transaction starts and when the transaction ends as well. For example, when we transfer money from one account to another, consistency means that the sum of both accounts before the transfer and after it will be the same. If Account A has $200 and Account B has $100, the sum of both accounts would be $300. If $100 is transferred from Account A to Account B, then Account A would have $100, and Account B would have $200. The sum of both accounts is still $300. Consistency has been maintained.
A consistent database also enforces constraints on the types and sizes of data that are allowed. For example, the balance of an account is expected to be a numeric value. A birthdate is expected to be a date value. The database will maintain consistency by ensuring that the type of the data and size of the data are maintained.
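A minimal sketch of the transfer described above, assuming a hypothetical account table, might look like the following. The DECIMAL and DATE types also show how a consistent database constrains the kind and size of data each column accepts.

-- Hypothetical table: the data types constrain what each column may hold.
CREATE TABLE account (
    account_num INT PRIMARY KEY,
    balance     DECIMAL(12,2) NOT NULL,   -- monetary values must be numeric
    opened_on   DATE          NOT NULL    -- dates must be valid dates
);

-- Transfer $100 from Account A (1) to Account B (2) as one transaction,
-- so the combined balance is the same before and after the transfer.
START TRANSACTION;
UPDATE account SET balance = balance - 100.00 WHERE account_num = 1;
UPDATE account SET balance = balance + 100.00 WHERE account_num = 2;
COMMIT;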
Consistency also enforces primary and foreign key relationships. A primary key is a unique value assigned to each row of a table. For example, in a table containing bank account information, the account number would likely be unique and thus could be a primary key. In regard to consistency, the system will never allow a duplicate primary key in a table to occur, and it will require that each record have a primary key value.
A foreign key is a value within a row of one table that is used to connect that row to a related row in another table. For example, a store can have a table of customers, and each customer can have multiple orders. An order ID can be stored in the customer row that can then connect to a table of orders, as shown in Figure 1.1.
Figure 1.1 Customer orders
For foreign keys, most DBMS systems by default will not allow you to orphan a row, where the value used as a foreign key does not correspond to a value in the primary key of the related table. Using our example of customers and orders, there could be a Customer and Order relationship where a Customer can have one or more Orders. If you were to try to delete only a Customer row without first deleting its Orders, then the Orders associated with that Customer would have a foreign key pointing to a record that no longer existed.
NOTE A properly configured relational schema will prevent this from happening by either rejecting the delete transaction outright or automatically deleting all the orders associated with the customer being deleted first. (This automation is called a cascade delete, and because it can lead to the deletion of millions of records without warning, it is usually not the preferred solution to resolving orphan keys.)
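The sketch below shows how a primary key-foreign key relationship like the Customer and Order example might be declared in MySQL. The table and column names are assumptions, the foreign key is placed on the order side (the usual arrangement for a one-to-many relationship), and ON DELETE RESTRICT is shown as the cautious alternative to a cascade delete.

CREATE TABLE customer (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100) NOT NULL
);

CREATE TABLE customer_order (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL,
    order_date  DATE NOT NULL,
    -- The foreign key prevents orphaned rows: deleting a customer that
    -- still has orders is rejected (RESTRICT) rather than cascaded.
    FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
        ON DELETE RESTRICT
);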
A transaction is isolated if its effects are not visible to other transactions until it is complete. This is often referred to as concurrency control. A large database application may have hundreds or thousands of users making changes to it at the same time, so if transactions are not isolated, this could cause inconsistent data.
Imagine two users, John and Sally, accessing the database at once. John is updating data in the orders table. At the same time, Sally is reading data from the orders table, including records being edited by John. A DBMS has various levels of isolation it could apply. As a beginner, you need to know only two levels.
Serializable: Sally will not receive her data until John's changes are committed. When John begins a transaction to change data, the data is locked until his transaction is complete.
Read Uncommitted: Sally will get her data right away, including whatever changes John has made that haven't been committed yet. This is called a dirty read because it is possible that John's transaction could fail and be rolled back.
The default isolation in most DBMS systems is serializable because it does a better job of avoiding errors or corrupting data.
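In MySQL, a session can set its isolation level explicitly before starting a transaction. This is a minimal sketch; the orders table and its values are assumed for illustration.

-- Run Sally's session at the strictest level: her query waits until
-- John's transaction commits rather than reading his in-progress changes.
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;
SELECT * FROM orders WHERE order_id = 1001;
COMMIT;

-- Alternatively, allow dirty reads (generally discouraged):
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;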
A transaction is durable if once it is committed (saved to the database), it will remain so, even in the event of catastrophic failures. Even if you kick the server's power cord out of the wall after a transaction, it will stay committed.
This means a transaction is not fully committed until it is written to permanent storage, such as a storage drive.
In most ACID databases, a transaction log (sometimes referred to as a journal or audit trail) is a history of executed actions. The upshot of this is that even if the system crashes or the hardware fails, the log file has a durable list of each change made to the database.
The log file is physically separate from the actual database data. This is important to ensure a database remains consistent. For example, when you insert a new row into a table, a few things happen.
The DBMS validates the incoming command.
A record is added to the log file specifying what changes are about to be made.
The DBMS attempts to make the changes to the actual data in the table or tables.
If successful, the log record is marked as committed.
If a failure occurs between steps 2 and 4, like a server reboot, the DBMS will scan the log file for uncommitted transactions when it comes back online. If it finds them, it will examine the actions performed and undo them, effectively restoring the database to its former, consistent state.
One of the keystones of relational database design is entity integrity, which guarantees that each record in a table is unique within that table. All RDBMSs enforce entity integrity automatically, but the database creator has to appropriately define a primary key within each table for this to work. As data is added to a table, the RDBMS will check two properties to ensure that the new record is unique.
That no other existing record in the table has the same primary key value as the new entry
That there is a value entered for each field of the primary key
If a new record fails to meet both criteria, then the RDBMS will reject the record and prevent it from being added to the table.
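To see entity integrity in action, consider this hedged sketch: a hypothetical contact table with a primary key, followed by two inserts that the RDBMS rejects because they violate the rules just described.

CREATE TABLE contact (
    contact_id INT PRIMARY KEY,        -- must be present and unique in every row
    first_name VARCHAR(50),
    last_name  VARCHAR(50)
);

INSERT INTO contact (contact_id, first_name, last_name) VALUES (1, 'John', 'Johnson');

-- Both of the following are rejected by the RDBMS:
INSERT INTO contact (contact_id, first_name, last_name) VALUES (1, 'Jane', 'Smith');    -- duplicate key value
INSERT INTO contact (contact_id, first_name, last_name) VALUES (NULL, 'Sam', 'Jones');  -- missing key value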
Remember from our definitions for a relational database that a record is the collection of values for a single item in a table and that each record is independent of other records in a table. In this case, the term unique has its original definition of “one of a kind,” so under the guidelines of entity integrity, the set of values in each row must be different from the set of values in all other rows of the table. This uniqueness serves two specific purposes.
Reducing (but not necessarily eliminating) duplicate data
Allowing the database to easily find specific records within a table
The relational design approach to meeting the requirements of entity integrity is to include one or more fields in each table whose sole purpose is to identify each individual record. In some cases, we can use an existing field to be the primary key. For example, we could use the entry date as a primary key in a table that tracks newspaper issues, on the grounds that each newspaper in the database issues only one paper per day. This is called a natural key because it happens to be a piece of data we want to track anyway, so we don't need to create a separate field just to ensure uniqueness. Other examples of potential natural keys include a phone number or email address to identify people in a Contacts table. We typically want both of those values in such a dataset, and if each person has their own unique phone number or email address, either could work as a natural key.
A much more common approach is to use a surrogate key, a field (or collection of fields) that is created specifically to identify each record in a database, but which has no other purpose or meaning in the table. Because we are surrounded by databases, we are used to using surrogate keys for this purpose. For example, if you contact your bank, chances are good that they will want to use your account number to identify your account, rather than simply your name; your account number is unique, but your name probably isn't. In fact, we regularly use surrogate keys to identify things, such as Social Security numbers, bank account numbers, vehicle identification numbers, and product barcodes. Even when the key value includes some kind of meaning (such as where the person lived when they applied for a Social Security number or the product manufacturer code in a UPC), these values are only loosely connected to the object they identify, and the assignment itself is mostly arbitrary.
On their own, surrogate keys are completely meaningless, but they can serve the very useful purpose of uniquely identifying each object in a table. In a Contact table, a field named ContactID could be defined. Each person could then be assigned a different ContactID value as they are added to the table. While this doesn't prevent us from adding the same person to the contact list more than one time (and assigning a different ContactID value to each instance), it does allow the database to easily distinguish between John Johnson and his son John Johnson, Jr.
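In MySQL, a surrogate key like the ContactID described above is commonly generated automatically with AUTO_INCREMENT. This sketch assumes a hypothetical contact_list table; it shows how the two John Johnsons remain distinct records even though their names match.

CREATE TABLE contact_list (
    contact_id INT AUTO_INCREMENT PRIMARY KEY,   -- surrogate key with no other meaning
    first_name VARCHAR(50) NOT NULL,
    last_name  VARCHAR(50) NOT NULL
);

-- The database assigns contact_id values 1 and 2 automatically,
-- so father and son are stored as separate, distinguishable records.
INSERT INTO contact_list (first_name, last_name) VALUES ('John', 'Johnson');
INSERT INTO contact_list (first_name, last_name) VALUES ('John', 'Johnson');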
The term candidate key is used to refer to a field that is inherently unique to each record but that may not act as the primary key of the table. For example, in an employee table, it is highly likely that each employee's Social Security number will be included, and each employee has a unique value for that field. However, for security reasons, it will not be selected as the primary key, and the database designer will select either a different candidate key or a surrogate key to act as the primary key for that table.
The most important role of a primary key is to allow the database to find specific records in a database quickly and accurately. In a relational database, to reduce redundancy and improve efficiency, data is typically stored across many different tables, where each table focuses on one kind of data. In a product inventory database, for example, there will likely be separate tables for products, vendors, warehouse locations, and even categories. As a result, when all the details about a specific item must be retrieved, the database will have to search across tables to find the required information about that item.
For many databases, primary keys are indexed by default. Indexes are normally applied to columns (or groups of columns), and they are stored as a separate object from the rest of the table the index applies to. Each index acts as a pointer to records stored in tables in the database, and the RDBMS applies a default sort to the indexes, regardless of the order in which the records were added to the relevant table. These sorted primary keys include a connection (an index) back to the full records associated with each key. The fact that the indexes are both unique and sorted means that the database does not have to search through more records than necessary to find the required key values. Once the requested value is found in the index, the database knows it can stop looking for more instances of that value. Because the primary key is connected to the rest of the data in the associated record, the database can simply pull data from that record and safely ignore all other records in the table.
This is actually similar in concept to an index in a book. A book's index normally appears at the end of the book, where it is easy to locate. In addition, the terms in the index are sorted in alphabetic order, so the reader can easily find the term they are looking for. The index also tells the reader where to find the term in the book so that the reader can go straight to the correct location in the book.
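Primary keys are indexed automatically, but an index can also be added to other columns that are searched frequently. The following is a hedged sketch using an assumed product table.

CREATE TABLE product (
    product_id   INT AUTO_INCREMENT PRIMARY KEY,   -- indexed automatically
    product_name VARCHAR(100) NOT NULL,
    category_id  INT NOT NULL
);

-- A secondary index so lookups by name do not scan the whole table.
CREATE INDEX idx_product_name ON product (product_name);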
This use of primary keys does have weaknesses, however. The biggest weakness is that the primary key index structure must be reindexed each time a new record is added or an existing record is deleted, which can slow down update processes within the database. In essence, this is similar to having to rewrite the index for a book when a chapter is deleted or added to the book. It is also not possible to change the value of a primary key, meaning that if a value is assigned incorrectly, you cannot change the value later to correct it. However, even with the weaknesses inherent to primary keys, they are an integral component in a relational database. While they can guarantee uniqueness only at the record level, they do allow the database to distinguish and find individual records in a table.
Other options for primary keys will be considered as we work through designing a database later in this book. For now, understand that the role of a primary key is to ensure that each record in a table is unique within that table.
A lot of time and effort is put into the backup and recovery of database systems. In some businesses, losing access to the database can cost thousands of dollars per minute as orders can no longer be taken or customer data could be lost or compromised.
For this reason, it is important that the database administrator has backup and recovery options for both data and log files. Because logs contain all transaction information, they provide point‐in‐time restoration information. Full data backups tend to be very large and are done only periodically. Log backups tend to be much smaller.
As an example, a full data backup might be performed nightly, while a log backup could occur every 10 minutes. If the data backup occurs at midnight and the server fails at 2:55 p.m., the last data backup would be restored, followed by restoring all the logged transactions through 2:50 p.m. We would lose changes only between 2:50 p.m. and 2:55 p.m. While this 5-minute loss is not great, it is better than the alternative if there were no log backups at all.
To further reduce losses, we could use multiple database servers and execute transactions on each. If one server fails, another can take its place. As you approach a true lossless solution, the cost of servers and software increases exponentially. An experienced database administrator has the job of matching budget to loss tolerance for a business.
This lesson presented an overview of databases and relational database concepts, as well as information on ACID and the use of keys. This information is foundational for learning and working with Structured Query Language. Many of the terms presented in this lesson are used frequently when talking about data and databases, so it's good to learn what they mean early in the game. Most will also be revisited in subsequent chapters when they are directly applied to SQL. This includes the following:
Table/relation (academic)/entity (abstract): A logical grouping of data in a relational database. A table defines valid facts and contains facts about one type of thing.
Row/record/tuple (academic): A single, logical item in a table, comprised of one or more values or fields.
Column/field/attribute (academic): Names a fact to be tracked in a table and restricts the size and type of the fact's data.
ACID was also covered in the lesson. When reliable data is essential, using an ACID‐compliant database is a requirement. Only databases that are atomic, consistent, isolated, and durable are going to handle all the errors and failures that can occur while protecting the quality of the data.
Keys were also mentioned as they are a core part of organizing data in a way that can help you create uniqueness as well as help your records to be accessed quickly. As you begin to organize your data into a database that can be accessed with SQL starting in the next chapter, you'll see how primary and foreign keys are used.
Finally, no business should be without a backup and recovery plan that suits their budget and risk tolerance.
The following exercises are provided to allow you to experiment with concepts presented in this lesson:
Exercise 1.1: Customers and Orders
Exercise 1.2: Libraries and the Books Within
Exercise 1.3: Your Scenario
NOTE The exercises are for your benefit and help you apply what you learn in the lessons. Please note that these are to be done on your own, and thus solutions are not provided.
In Figure 1.1, tables were shown for Customers and Orders. Create a list of at least three additional fields that could be added to each of the tables.
Assume that you will be creating a program that accesses a database containing information on libraries as well as information on the books each library contains. Do the following:
List the tables you could include.
Create the list of fields you might include within each row of each table you identified.
Create three rows of sample data for each of your tables.
Identify the fields in your tables (if any) that would be primary keys.
Identify the fields in your tables (if any) that would be foreign keys.
Come up with your own scenario for using a database. This can be a banking example that tracks accounts, a media database, a restaurant menu, or a store inventory system. Do the following for your scenario:
List the tables you could include.
Create the list of fields you might include within each row of each table you identified.
Create three rows of sample data for each of your tables.
Identify the fields in your tables (if any) that would be primary keys.
Identify the fields in your tables (if any) that would be foreign keys.
Chapter 1 mentioned that to reduce redundancy and improve efficiency in a relational database, data is typically stored across many different tables, where each table focuses on one kind of data. In this lesson, you will learn the standard process for organizing your data.
By the end of this lesson, you will be able to:
Understand the purpose of database normalization
Describe and apply the first through third normalization forms
Know when to apply denormalization to your tables
Relational database design has the goal of organizing data in a way that reduces redundant (or duplicate) data while also streamlining ways in which we can access that data.
The relational database model was proposed by E. F. Codd in his article “A Relational Model of Data for Large Shared Data Banks” in the June 1970 issue of Communications of the ACM (Association for Computing Machinery). Being an academic proposal, it is a bit dry, but the point of this paper was to propose a database model based on relations (tables), accessible via a universal language, and having a defined set of rules for splitting a dataset into relations. In discussing normal form, he notes that relations with simple domains lend themselves to storage in a simple, two‐dimensional array, while other relations are more complex and require correspondingly more complex structures to represent them.
NOTE You can find Codd's article at www.seas.upenn.edu/~zives/03f/cis550/codd.pdf.
Normalization is the process of breaking down these complex relationships into simpler structures. A properly normalized design improves performance and reduces the complexity of relationships by minimizing data duplication (redundancy). Codd showed that it is possible for all data domains to be reduced to simple table relationships. A database where all relations are reduced in this manner, following the process of normalization, is said to be normalized.
Data redundancy is the act of storing the same piece of data multiple times in the database. Table 2.1 presents an example of client account information.
Table 2.1 Client Account Information
FirstName  LastName  AccountNum  AccountType  Street                City             State        Zip
Eduino     Bayly     512663484   Checking     07755 Marquette Park  Spring           Texas        77386
Missie     Cavee     374078993   Savings      557 Roxbury Street    Peoria           Illinois     61656
Missie     Cavee     647794666   Checking     557 Roxbury Street    Peoria           Illinois     61656
Geordie    Eirwin    450433555   Savings      8 Cottonwood Terrace  Zephyrhills      Florida      33543
Davy       Louis     317202667   Credit Card  38 Commercial Hill    Columbus         Ohio         43204
Dorri      McNair    192333561   Checking     8208 Stuart Center    Fort Lauderdale  Florida      33355
Dorri      McNair    166808336   Savings      8208 Stuart Center    Fort Lauderdale  Florida      33355
Dorri      McNair    666343073   Investment   8208 Stuart Center    Fort Lauderdale  Florida      33355
Michael    McNair    439224678   Credit Card  8208 Stuart Center    Fort Lauderdale  Florida      33355
Annmarie   Rubenov   396112179   Credit Card  4 Loftsgordon Place   Jackson          Mississippi  39210
Looking at Table 2.1, you can see how the same address information is duplicated across many rows. Aside from the inefficiency of storing many copies of the same data (disk space), having the data duplicated like this can lead to data anomalies (incorrect data). Suppose Dorri McNair moves to a different state. Because this client appears in three separate rows, you would have to be sure to update the address correctly in all three rows. Possible resulting anomalies include the following:
Updating only one or two rows and leaving the others unchanged.
Mistyping the address in one row (such as Stewart instead of Stuart).
If Dorri McNair chooses to open another account, the address associated with the new account would have to match the existing address.
As a result of each of these cases, Dorri McNair could end up with at least two different addresses in the system, which means they may not receive mailed documents correctly or the ZIP code wouldn't match the account for validation. This type of error is called an update anomaly.
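As a sketch of how an update anomaly can arise, assume the data in Table 2.1 is stored in a single, hypothetical client_account table with columns matching the table headings. Updating the address for only one of Dorri McNair's accounts leaves the other two rows holding the old address.

-- Made-up new address; the table and column names are assumptions.
UPDATE client_account
SET street = '12 Harbor Road',
    city   = 'Tampa',
    state  = 'Florida',
    zip    = '33601'
WHERE account_num = 192333561;   -- only one of this client's three rows is changed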
Another potential problem is deleting an address that is still in use. For example, if Michael McNair chooses to close their credit card account and the associated address is removed from the system, that could potentially delete the address associated with Dorri McNair, creating a delete anomaly.