Description

The CompTIA Data+ certification exam not only helps validate a skill set required to enter one of the fastest-growing fields in the world, but is also starting to standardize the language and concepts within the field. However, there is a lot of conflicting information and a lack of existing resources about the topics covered in this exam, and even professionals working in data analytics may need a study guide to help them pass on their first attempt.
The CompTIA Data+ (DAO-001) Certification Guide will give you a solid understanding of how to prepare, analyze, and report data for better insights.

You'll get an introduction to the Data+ certification exam format to begin with, and then quickly dive into preparing data. You'll learn about collecting, cleaning, and processing data, along with data wrangling and manipulation. As you progress, you'll cover data analysis topics such as types of analysis, common techniques, hypothesis testing, and statistical analysis, before tackling data reporting, common visualizations, and data governance. All the knowledge you've gained throughout the book will be tested with the mock tests that appear in the final chapters.

By the end of this book, you’ll be ready to pass the Data+ exam with confidence and take the next step in your career.




CompTIA Data+: DAO-001 Certification Guide

Complete coverage of the new CompTIA Data+ (DAO-001) exam to help you pass on the first attempt

Cameron Dodd

BIRMINGHAM—MUMBAI

CompTIA Data+: DAO-001 Certification Guide

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Heramb Bhavsar

Content Development Editor: Joseph Sunil

Technical Editor: Sweety Pagaria

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Sejal Dsilva

Production Designer: Shankar Kalbhor

Marketing Coordinator: Priyanka Mhatre and Nivedita Singh

First published: December 2022

Production reference: 1301122

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80461-608-6

www.packt.com

This book is dedicated to everyone out there who is working hard to improve themselves, fighting for a better life. I hope this book helps guide you on your journey.

Contributors

About the author

Cameron Dodd is a data professional who specializes in instruction and writing clean, simple curricula that can be understood without an advanced degree. He has helped thousands of people around the world find their place in the field of data analytics. Recently, he wrote and taught the CompTIA Train-The-Trainer course for Data+, where he taught instructors how to teach the information covered in this book.

I would like to thank the great team at Packt, who made this possible. Heramb, Kirti, Nazia, Joseph, Priyanka, Sweety, and everyone working behind the scenes came together to make this book what it is today. Finally, I would like to thank my wife, without whom I could not have done this. You are my support and my inspiration, now and always.

About the reviewer

Devin Moya is a Data Science mentor from Jupiter, Florida. Since 2018, he has developed curricula at top-rated Data Science bootcamps and mentored 450+ aspiring data professionals in more than a dozen countries. He is a data generalist who specializes in helping people from different backgrounds transition into data-related roles. He has a genuine love for the cutting edge, and when not working or spending time with his wife and kids, he is researching and building applications that combine AI, art, and human creativity.

Renganathan Palanisamy is a Data and AI Specialist and an avid trainer with 20+ years' experience in both academic and technical fields. He serves as a Director with skills in Microsoft, AWS, Alibaba Cloud, and IBM-based Data and AI solutions. Renganathan also serves as a Microsoft MCT Regional Lead, providing guidance to MCTs, and is a pioneering CompTIA Data+ trainer for the APAC region. His involvement as a mentor in technical communities is well known.

Table of Contents

Preface

Part 1: Preparing Data

1

Introduction to CompTIA Data+

Understanding Data+

CompTIA Data+: DAO-001

Data science

Introducing the exam domains

Data Concepts and Environments

Exam format

Who should take the exam?

Summary

2

Data Structures, Types, and Formats

Understanding structured and unstructured data

Structured databases

Unstructured databases

Relational and non-relational databases

Going through a data schema and its types

Star schema

Snowflake schema

Understanding the concept of warehouses and lakes

Data warehouses

Data marts

Data lakes

Updating stored data

Updating a record with an up-to-date value

Changing the number of variables being recorded

Going through data types and file types

Data types

Variable types

File types

Summary

Practice questions and their answers

Questions

Answers

3

Collecting Data

Utilizing public sources of data

Public databases

Open sources

Application programming interfaces and web services

Collecting your own data

Web scraping

Surveying

Observing

Differentiating ETL and ELT

ETL

ELT

Delta load

Understanding OLTP and OLAP

OLTP

OLAP

Optimizing query structure

Filtering and subsets

Indexing and sorting

Parameterization

Temporary tables and subqueries

Execution plan

Summary

Practice questions and their answers

Questions

Answers

4

Cleaning and Processing Data

Managing duplicate and redundant data

Duplicate data

Redundant data

Dealing with missing data

Types of missing data

Deletion

Imputation

Interpolation

Dealing with MNAR

Understanding invalid data, specification mismatch, and data type validation

Invalid data

Specification mismatch

Data type validation

Understanding non-parametric data

Finding outliers

Summary

Practice questions

Questions

Answers

5

Data Wrangling and Manipulation

Merging data

Key variables

Joining

Blending

Concatenation and appending

Calculating derived and reduced variables

Derived variables

Reduction variables

Parsing your data

Recoding variables

Recoding numbers into categories

Recoding categories into numbers

Shaping data with common functions

Working with dates

Conditional operators

Transposing data

System functions

Summary

Practice questions

Questions

Answers

Part 2: Analyzing Data

6

Types of Analytics

Technical requirements

Exploring your data

Common types of EDA

EDA example

Checking on performance

KPIs

Project management

Process analytics

Discovering trends

Finding links

Choosing the correct analysis

Why is choosing an analysis difficult?

Assumptions

Making a list

Finally choosing the analysis type

Summary

Practice questions

Questions

Answers

7

Measures of Central Tendency and Dispersion

Discovering distributions

Normal distribution

Uniform distribution

Poisson distribution

Exponential distribution

Bernoulli distribution

Binomial distribution

Skew and kurtosis

Understanding measures of central tendency

Mean

Median

Mode

When to use which

Calculating ranges and quartiles

Ranges

Quartiles

Interquartile range

Finding variance and standard deviation

Variance

Standard deviation

Summary

Practice questions

Questions

Answers

8

Common Techniques in Descriptive Statistics

Understanding frequencies and percentages

Frequencies

Percentages

Calculating percent change and percent difference

Percent change

Percent difference

Discovering confidence intervals

Understanding z-scores

Summary

Practice questions

Questions

Answers

9

Hypothesis Testing

Understanding hypothesis testing

Why use hypothesis testing

Hypothesis testing process

Differentiating null hypothesis and alternative hypothesis

Null hypothesis (H0)

Alternative hypothesis (H1)

Null hypothesis versus alternative hypothesis

Learning about p-value and alpha

p-value

Alpha

Alpha and tails

Understanding type I and type II errors

Type I error

Type II error

How type I and type II errors interact with alpha

Writing the right questions

The parts of a good question

Qualities of a good question

What to do about bad questions

Summary

Practice questions

Questions

Answers

10

Introduction to Inferential Statistics

Technical requirements

Understanding t-tests

What you need to know about t-tests

T-test practice

Knowing chi-square

What you need to know about chi-square

Chi-square practice

Calculating correlations

Correlation

Correlation practice

Understanding simple linear regression

What you need to know about simple linear regression

Simple linear regression practice

Summary

Practice questions

Questions

Answers

Part 3: Reporting Data

11

Types of Reports

Distinguishing between static and dynamic reports

Point-in-time reports

Real-time reports

Static versus dynamic reports

Understanding ad hoc and research reports

Ad hoc reports

Research reports

Knowing about self-service reports

Understanding recurring reports

Compliance reports

Risk and regulatory reports

Operational reports (KPI reports)

Knowing important analytical tools

Query tools

Spreadsheet tools

Programming language tools

Visualization tools

Business services

All-purpose tools

Which tools you should learn to use

Summary

Practice questions

Questions

Answers

12

Reporting Process

Understanding the report development process

Creating a plan

Getting the plan approved

Creating the report

Delivering the report

Knowing what to consider when making a report

Business requirements

Dashboard-specific requirements

Understanding report elements

Understanding report delivery

Designing reports

Branding

Fonts, layouts, and chart elements

Color theory

Summary

Practice questions

Questions

Answers

13

Common Visualizations

Understanding infographics and word clouds

Infographics

Word clouds

Comprehending bar charts

Bar charts

Stacked charts

Histograms

Waterfall charts

Charting lines, circles, and dots

Line charts

Pareto charts

Pie charts

Scatter plots

Bubble charts

Understanding heat maps, tree maps, and geographic maps

Heat maps

Tree maps

Geographic maps

Summary

Practice questions

Questions

Answers

14

Data Governance

Understanding data security

Access requirements

Security requirements

Knowing use requirements

Acceptable use policy

Data processing

Data deletion

Data retention

Understanding data classifications

Personally identifiable information

Personal health information

Payment Card Industry

Handling entity relationship requirements

Summary

Practice questions

Questions

Answers

15

Data Quality and Management

Understanding quality control

When to check for quality

Data quality dimensions

Data quality rules and metrics

Validating quality

Cross-validation

Sample/spot check

Reasonable expectations

Data profiling

Data audits

Automated checks

Understanding master data management

When to use MDM

Processes of MDM

Summary

Practice questions

Questions

Answers

Part 4: Mock Exams

16

Practice Exam One

Practice exam one

Congratulations!

Practice exam one answers

17

Practice Exam Two

Practice exam two

Congratulations!

Practice exam two answers

Index

Other Books You May Enjoy

Part 1: Preparing Data

This part provides an overview of the CompTIA Data+ certification exam, before jumping right into the process of preparing data for analysis. Starting with how data is stored, you will walk through the process step by step for collecting, processing, and shaping your data into something you can use.

This part covers the following chapters:

Chapter 1, Introduction to CompTIA Data+
Chapter 2, Data Structures, Types, and Formats
Chapter 3, Collecting Data
Chapter 4, Cleaning and Processing Data
Chapter 5, Data Wrangling and Manipulation

1

Introduction to CompTIA Data+

Welcome! This book is designed specifically to teach you everything you need to know to pass the CompTIA Data+ (DAO-001) certification exam. Here, you will learn the ins and outs of each domain covered by the exam, before going over practice questions at the end of every chapter to solidify those lessons. The final chapters of the book contain two mock exams that will test your knowledge and show whether there are any areas you should review before taking the exam. While CompTIA Data+ is a certification worth having, testing for it is not cheap, so we are going to do everything we can to empower you to pass on your first try. Ready? Let's begin.

This chapter is an introduction to the certification itself. Together, we will talk about what Data+ is and the benefits of achieving it. Then, we will briefly go over what sorts of content will be covered by the exam. Finally, we will discuss the format of the exam itself, going over things such as how long you have to take the exam, how many questions there will be, and how they will be distributed across the different exam domains. Our goal here is to understand what Data+: DAO-001 is, so we can start preparing you for it.

In this chapter, we’re going to cover the following main topics:

Understanding Data+
Introducing the exam domains
Going through the exam format

Understanding Data+

Before we dive into specific domains covered by the test and how it is structured, we should briefly discuss what the Data+ certification is and why you should obtain it, as well as why data science is a good field to enter right now. It is important to have a clear understanding of your motivations before you begin.

CompTIA Data+: DAO-001

CompTIA is an organization known for professional and technical certifications, particularly in the field of cybersecurity. The certifications they provide indicate that a person has the knowledge and skill set to perform a specific task. Earning one of these certifications means that you have more credibility in the field and are more likely to be able to enter or move up in the field of your choice. Data+: DAO-001 is the first certification CompTIA has created for data science. It is also one of the first data science certifications by any of the major professional certification companies. This is a major step in standardizing terminology and roles within the field.

Because data science is still a developing field, people are coming in from all sorts of educational and professional backgrounds. What this means is that employers are often not sure what specifically they need to look for and so certifications are becoming more important. As we have seen with some other technical fields, such as cybersecurity, it is difficult to get employed no matter what your background is, unless you have a certification from a recognized organization such as CompTIA.

Important note

Data+: DAO-001 is a vendor-neutral exam. Because data science is so varied, no one tool or software is used in every role. To reflect this, no specific programming language is required to pass this exam. That said, this book uses popular tools or programming languages as examples to help explain certain concepts. These examples are only one way to reach a solution and are meant to reflect broader concepts. You should practice using the tools with which you are most comfortable.

Okay, we’ve established that the CompTIA Data+: DAO-001 certification will help your data science career, but why should you enter the field of data science anyway?

Data science

Data science as a field is simply analyzing data to answer specific questions to help people make better decisions. It involves a little bit of statistics, a moderate amount of programming, and a lot of critical thinking. The concept has been around for a long time; the term data science has even been around since the 1960s. But it is only in the last few years that companies have gotten access to more data than they know what to do with. Large companies aren’t the only ones with a lot of data anymore. Now, medium and even small companies are quickly gathering or purchasing large datasets. All these companies, large or small, now need professionals to store, access, analyze, and report on all this information.

Almost every list you find of the fastest-growing careers in the world will mention data scientist, or one of the other specialties in the data science domain, such as data analyst, data engineer, database administrator, research analyst, statistician, or half a dozen other job titles that end in analyst. It has been gaining in popularity this past decade and is only expected to continue growing in the next decade.

Data science is an exciting field that is quickly changing and developing, with new programs and techniques coming out every year. As the field continues to grow, there will be more and more jobs and opportunities for those who have the right skills and the right certifications. Now that we know why the test is important, let’s talk about what this exam covers.

Introducing the exam domains

The exam was designed by a group of subject matter experts with different specialties in the field of data science. Together, they decided on common ground that any early career data analyst should know. They then categorized that knowledge into the following five domains:

Data Concepts and Environments
Data Mining
Data Analysis
Visualization
Data Governance, Quality, and Control

Data Concepts and Environments

The domains move through the data pipeline chronologically. The first domain, Data Concepts and Environments, is largely about how data is stored. This covers multiple levels, from different database types, structures, and schemas, through file types for specific kinds of data, and even into different variable types. This domain is a broad view of storage concepts mixed with the ability to identify what type of data you can expect from different storage solutions.

Data Mining

This domain is a bit of a misnomer. Data mining is when you already have a huge dataset and you comb through it for any insights that might be of interest, rather than answering specific questions. The concepts in this domain do come up while data mining, but they come up just as much in regular data analysis. What this domain is actually about is every step after storing your data but before you run an analysis: collecting, querying, cleaning, and wrangling data. Effectively, these are the steps you need to take to get your data into a useful shape so you can analyze it.

Data Analysis

You have stored your data, you have pulled your data and made it pretty, and now it is time to do something with it. This domain is all about analyses. You will be expected to perform descriptive statistical analyses, understand the concepts behind inferential statistics, be able to pick appropriate types of analysis, and even know some common tools used in the field. Because the test is vendor-neutral, you don't need to be able to use any of these tools; you just need to be able to identify them.

Visualization

It doesn’t matter how perfect your analyses are if you can’t communicate the results. What’s the point in coming up with an equation that solves world hunger if you can’t explain it to anyone else? To that end, the next domain is all about visualizations and reporting. This covers what information a report should include, what type of report is most appropriate, who should get a report, when reports should be delivered, the basics of report design, types of visualizations, and even the process of developing a dashboard.

Data Governance, Quality, and Control

The final domain is made up of larger concepts that span the entire life cycle of data analytics. A large part of this is made up of policies. Some of the policies focus on protected data and how it can be handled legally, while other policies are more about how you can ensure the quality of your data. If your data has low quality, you can’t trust anything it says, and if you are mishandling protected information, you could face legal penalties, so these are important factors to know. This domain also includes a short section on the concept of master data management, as an example of an ideal state.

Now that you know what domains will be covered on the certification exam, let’s talk about how the exam is structured.

Exam format

CompTIA Data+: DAO-001 was launched on February 28, 2022. You have 90 minutes to answer 90 questions. Most of the questions will be multiple choice, but there will be some performance-based questions, in which you will be presented with an example of something you would encounter on the job and you must make appropriate decisions based on it. Scoring is on a scale of 100 to 900, with 675 being considered a passing score. For the most up-to-date information on the exam, you can check out the exam website at https://www.comptia.org/certifications/data. The testing provider is Pearson VUE, and you can take the test online or at one of their approved testing centers. The following table breaks down roughly how many questions will fall into each domain:

Data Concepts and Environments: 15%
Data Mining: 25%
Data Analysis: 23%
Visualization: 23%
Data Governance, Quality, and Control: 14%

Table 1.1 – Percentage breakdown of each domain

Who should take the exam?

The test is aimed at those early in their data analytics career – specifically, those who have been working in a data analytics position for 1.5 to 2 years. This is not a requirement but a suggestion, because people with less experience than this may not have encountered all the concepts covered by the exam. At the end of the day, anyone who is eager to get further in the exciting field of data science should take this exam. Rest assured that even if you don’t have personal experience with any of the concepts on the exam, they are covered here in this book!

Summary

In this chapter, we learned that CompTIA Data+: DAO-001 is a new professional certification in the rapidly growing field of data science, which is the study of gaining insights from data to inform decisions. This certification will confirm your skills to current or potential employers, helping you progress in your career in this field. The exam gives you 90 minutes to answer 90 questions that cover the domains that we discussed in the Introducing the exam domains section of this chapter.

Most importantly, you should know that, with the help of this book, you can pass this exam and take an important first step into the big wide world of data science.

In the next chapter, we will dive right into the first domain and look at data structures, types, and formats.

2

Data Structures, Types, and Formats

This chapter is all about data storage. Occasionally, a data analyst will have to collect their own data, but for many, the company already has data stored and ready for use. During their career, a data analyst may encounter data stored in several different formats, and each format requires a different approach. While there are hundreds, if not thousands, of different formats, the exam only covers the most common, and this chapter will go over what you need to know about them.

Here, we will discuss things such as how data is actually stored in a database, which includes whether or not it is structured, deciding what kind of data is stored in it, and whether the database is organized to follow a specific data schema to make it more efficient. We will also cover common database archetypes, or data storage solutions that are arranged in a specific way for a specific purpose, such as a data warehouse or a data lake. After that, we will discuss why updating data that is already stored can be problematic and a few approaches you can take depending on your goals. Finally, we will wrap up this chapter by discussing data formats so that you can tell what kind of data to expect from each format.

In this chapter, we’re going to cover the following main topics:

Understanding structured and unstructured data
Going through a data schema and its types
Understanding the concept of warehouses and lakes
Updating stored data
Going through data types and file types

Understanding structured and unstructured data

More often than not, data is stored in a database. Databases are simple places where data can be electronically stored and accessed. They can be stored on a single computer, a cluster, or even in the cloud. Databases come in every shape and size, but all of them fall into one of two categories:

Structured
Unstructured

Structured databases

If a database is structured, it follows a standardized format that allows you to set rules as to what kind of data can be expected and where. That is to say that there is, hopefully, a clear and logical structure to how the data is organized. There are two main database archetypes that the exam considers structured:

Defined rows/columns
Key-value pairs

Defined rows and columns refer to tables or spreadsheets. This is the format that most people are familiar with and is, by far, the most common in the field of data analytics. Usually, every column in the grid represents a variable or type of data collected, and every row represents a single entry or data point. The following screenshot shows a simple table that counts beans:

Figure 2.1 – Structured database: defined rows and columns

In the preceding screenshot, the columns are vertical and represent the different variables, while the rows are horizontal and represent a different entry or data point. The cells are specific values or counts that make up a data point.

Key-value pairs store each data point as a data object; every object in a series has the same set of keys, but the values for them can be different. In the following code snippet, each key represents a variable, and each value is the data collected for it, while each object is a data point:

"Beans" : [      {           "Total" : 10,           "Red" : 3,           "Blue" : 4,           "Yellow" : 3      },      {           "Total" : 12,           "Red" : 2,           "Blue" : 6,           "Yellow" : 4      },      {           "Total" : 11,           "Red" : 3,           "Blue" : 5,           "Yellow" : 3      } ]

This is the same information as shown in Figure 2.1, but now it is stored with each row being a data object defined by {}. The column names ("Total", "Red", "Blue", and "Yellow") are the keys. The keys are unique in that they cannot be repeated within a data object. For example, you cannot have two keys named "Yellow". However, all objects within a set usually have the same keys. The numbers represent the values or are the same as individual cells within a table.
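To see how the two structured formats line up, here is a minimal Python sketch that reads the bean data objects and arranges them into the same rows-and-columns layout as Figure 2.1. Python and the pandas library are used here purely as an example (the exam is vendor-neutral), and the enclosing braces are added so the snippet is a valid JSON document on its own:

import json
import pandas as pd

# The same key-value data shown above, wrapped in braces to make it valid JSON
raw = """
{"Beans": [
    {"Total": 10, "Red": 3, "Blue": 4, "Yellow": 3},
    {"Total": 12, "Red": 2, "Blue": 6, "Yellow": 4},
    {"Total": 11, "Red": 3, "Blue": 5, "Yellow": 3}
]}
"""

records = json.loads(raw)["Beans"]   # each data object becomes one row
table = pd.DataFrame(records)        # each key becomes one column
print(table)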

Unstructured databases

Unstructured data has, more or less, no attempt at organization. You can think of these as big buckets of data, often in the form of a folder, where individual files or random data objects can be dropped. The exam breaks these down into two groups:

Undefined fields
Machine data

In this case, an undefined field is sort of a catch-all for file types that do not fit into a structured database nicely. These include several different file types, such as the following:

Text files
Audio files
Video files
Images
Social media data
Emails

These are all data types that, by default, store every data point as a separate file, which makes it difficult to keep them in a conventional structured database.

The other type of unstructured data recognized by the exam is machine data. When data is automatically generated by software without human intervention, it is often considered machine data. This includes automated logging programs used by websites, servers, and other applications, but it also includes sensor data. If a smart refrigerator logs measurements of temperature and electricity used at a regular time interval, this is considered to be machine data.

Relational and non-relational databases

Another way to categorize databases is whether they are relational or non-relational. Relational databases store information and how it relates to other pieces of information, while non-relational databases only store information.

This gets a little confusing when we look at query languages for pulling information out of databases. The majority of query languages are variations of Structured Query Language (SQL). Anything that isn’t based on SQL is considered NoSQL. Now, in common usage, everything that uses SQL is structured and relational, and everything that is NoSQL is unstructured and non-relational. However, this is not quite true. This can get a little confusing, so we will break it down into the following four main points:

All SQL databases are structured and relational
All non-relational databases are unstructured
Some NoSQL databases are structured and relational
Some NoSQL databases are unstructured, but still relational

Let’s start from the top. All SQL databases are structured and relational. SQL databases are broken down into tables. Tables are inherently structured and show how one point connects to another, so they are relational. For example, if you look at a table, you know how every cell in a column is related (they are all different data points that record the same measurement or classification) and how every cell in a row is related (they are all different measurements of the same data point).

All non-relational databases are unstructured. This means that all databases that do not show how things are related inherently have no structure. Any structural organization would show relationships. For example, if you have a database that is nothing but a folder for audio files, there is nothing that shows the relationships between the files. Each audio file is its own discrete unit, and there is no structure between them.

Some NoSQL databases are structured and relational. Just because something is not based on SQL does not mean that it has no structure. For example, the exam considers key-value pairs to be structured and relational, but these are most often used in JSON files, which are considered NoSQL.

Some NoSQL databases are unstructured, but still relational. There exist databases called graph databases. These are NoSQL databases, where the nodes are stored separately in an unstructured manner, but they also store the relationships between every node. If a database stores the relationships between each node, it is relational.

This seems a little jumbled, but the most important things to remember are that tables and key-value pairs are structured and relational, while undefined fields and machine data are unstructured and, unless otherwise stated, non-relational. This can be summarized in Figure 2.2:

Figure 2.2 – Structure and relationships

Now, whether something is relational or structured is not the only consideration you need to think about when you are storing data in a database. How the tables themselves are arranged makes a huge difference in how useful and efficient a database is, so it’s time to learn about schemas.

Going through a data schema and its types

An SQL database—structured and relational—is often made up of more than one table. In fact, more complicated databases may have dozens or even hundreds of tables. Every table should have a key. A key is a variable that is shared with another table so that the tables can be joined together. We will discuss specifics on joins later in this book. In this way, all the tables can be connected to one another, even if it would take several joins to do it. As you can imagine, these databases can be confusing, inefficient, and impractical. To make databases cleaner and easier to use, how tables are organized and interact with each other often follows a few common patterns. These patterns are called data schemas. There are several different popular schemas, and each meets specific needs, but this exam only covers two of the most basic schemas:

Star
Snowflake

The schemas are named after the shapes the tables make when you graph out how they are related.

Star schema

A star schema is one of the simplest schemas. At the center is a key table (sometimes called a fact table) that holds metrics and also has a key variable for every other table in the database. Around the key table are tables called dimension tables. Each dimension table has one key variable to connect to the key table and several other variables for storing information. Because all the dimension tables are connected directly to the central key table, the shape looks like a star. In Figure 2.3, you can see a basic example of this schema, with the key table in the center and the dimension tables attached:

Figure 2.3 – Star schema

As you can imagine, there are pros and cons to this type of schema.

Pros:

Simple (there are generally fewer tables)
Fewer joins are required
Easier to understand how the tables relate to one another

Cons:

High redundancy (a lot of data is repeated)
Denormalized (because of the high redundancy)

This schema is more user-friendly, but not the most efficient for larger databases. In Figure 2.4, you will see a diagram detailing a join with a star schema:

Figure 2.4 – Joining with a star schema

Imagine you wanted information from Dimension_Date_List and Dimension_Client_List together. You only need to join Dimension_Date_List to Key_List and then Key_List to Dimension_Client_List. That is only two joins, and you will never need more than two joins to connect any two tables in a star schema. That said, because the information is condensed to fit around a single table, information is often repeated, and it is not the most efficient approach. For example, in the preceding diagram, Dimension_Date_List has a date variable, but then it also has quarter and month variables in the same table. The variables are repeated because they are used for different things, but are grouped together in a table in a star schema.
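To make the two joins concrete, here is a minimal sketch using pandas (again, only one possible tool). The table names follow Figure 2.4, but the key columns (date_id and client_id) and all of the values are invented for the example:

import pandas as pd

# Tiny stand-ins for the tables in Figure 2.4
key_list = pd.DataFrame({"date_id": [1, 2], "client_id": [10, 20], "sales": [100, 250]})
dimension_date_list = pd.DataFrame({"date_id": [1, 2],
                                    "date": ["2022-01-03", "2022-04-04"],
                                    "quarter": ["Q1", "Q2"]})
dimension_client_list = pd.DataFrame({"client_id": [10, 20],
                                      "client_name": ["Acme", "Globex"]})

# Join 1: the first dimension table to the central key table
joined = dimension_date_list.merge(key_list, on="date_id")
# Join 2: the key table out to the second dimension table
joined = joined.merge(dimension_client_list, on="client_id")
print(joined[["date", "quarter", "client_name", "sales"]])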

Snowflake schema

A snowflake schema is similar to a star schema with one main difference: there are two levels of dimension tables. There is still a key table in the middle with dimension tables connected directly to it, but there is now a second set of dimension tables that connect to the first. This doesn’t necessarily mean that there is more data than we saw in the star schema, but the data is spread out more. Figure 2.5 is a simplified example of what a snowflake schema may look like:

Figure 2.5 – Snowflake schema

Because of the branching lines coming out from the center, it is considered to look like a snowflake. Just as with a star schema, a snowflake schema has strengths and weaknesses.

Pros:

Low redundancy (very few, if any, metrics are repeated)
Normalized (follows a set of rules introduced by Edgar F. Codd to optimize databases)

Cons:

More complicated (understanding how the tables relate to one another is more difficult)
More joins are required (joining two tables may require several intermediate joins)

These are generally more efficient than star schemas, but they are less user-friendly. Because they are more complicated, navigating one requires a greater understanding of how that specific database is structured. Figure 2.6 is an example of trying to join two random tables within a snowflake schema:

Figure 2.6 – Joining within a snowflake schema

Here, connecting any two tables may require anywhere from two to four joins. For example, if we wanted to connect Dimension_Month_List to Dimension_CLast_List, we would have to connect the Dimension_Month_List table to Dimension_Date_List, Dimension_Date_List to Key_List, Key_List to Dimension_First_List, and the Dimension_First_List table to Dimension_CLast_List. A snowflake schema only has one more level of tables than a star schema, but you can probably see how it can already be more complicated. However, the data is more spread out and does not repeat itself as much, making it more efficient. In the preceding simplified example, we see that Date has now been spread out among three tables, so there is no repeated information in any one of them.

Next, we will talk about a way to classify databases based on how they are used.

Understanding the concept of warehouses and lakes

Not all databases are used for the same purpose—they often become specialized based on how they are used. Each specialized database has a specific name, is used for different things, has different kinds of data, and is used by different people. A few of these specialized databases include:

Data warehouses
Data marts
Data lakes

There are more, but these are some common types of database archetypes that you are likely to encounter. More to the point, these are the ones that will be covered in the exam.

Data warehouses

Data warehouses are most often used for structured relational tables. Usually, they hold large amounts of processed transactional data. Data warehouses are often more complicated and are used by data engineers or database administrators. Because these store large amounts of data and efficiency is more important, they are more likely to follow a snowflake schema.

Data marts

Data marts are a specialized subset of a data warehouse. They are smaller, only hold processed information on a specific topic, and have a simpler structure. Data marts usually contain customer-facing data and are considered self-service because they are designed to be simple enough for analysts or customer support employees to access by themselves. Because these databases prioritize ease of use, they often follow a star schema.

Data lakes

Data lakes store large amounts of raw, unprocessed data. They can contain structured data, unstructured data, or some combination of both. Data lakes often collect and pool data from several different sources and can include different data or file types. Because of the nature of raw information, data lakes are often used by data scientists and do not follow any specific schema.

Important note

At this stage in your career, you will not be asked to create a data warehouse or lake. These are usually made by very specialized data engineers, and many companies only have one. That said, the majority of data warehouses and lakes are created through third-party services or software. Some of the more popular approaches include Snowflake, Hevo, Amazon Web Services (AWS) data warehouse tools, Microsoft Azure data warehouse tools, and Google data warehouse tools. If you want to try practicing with these tools, I suggest you find one (I like Snowflake), and look up tutorials on that software specifically.

Enough about storing data—let’s talk about how to update data that has already been stored.

Updating stored data

Sometimes information changes, and you must update your dataset. In such cases, there are decisions that must be made. Each decision has pros and cons based on the reason you are collecting the data. For slowly changing dimensions, two cases often come up:

Updating a current value
Changing the number of variables being recorded

Updating a record with an up-to-date value

In most cases, you will simply add data points to the end of a table, but sometimes there is a specific value that is calculated or recorded that you need to keep as up to date as possible. Now, you have two options:

Overwrite historical values: If you just change the value in the cell, this keeps your dataset much smaller and simpler. However, because the old value has been overwritten, you no longer have access to historical data. Historical data has many uses and is required for trend analysis. If you don't care about predicting future values of this number and just want to keep your dataset as simple as possible, this is the appropriate path.

Keep historical values: Keeping historical values usually requires the addition of extra columns so that you can keep track of what the current value is, as well as when other values were active.

To do this, you add the following columns:

Active Record
Active Start
Active End

These columns work in a direct manner. Active Record simply states whether the specified value is the most current value, and is either Yes or No. Active Start describes the date a value became active, and Active End describes the date it stopped being active. This is shown in Figure 2.7:

Figure 2.7 – Active Record

If a number is currently active, Active Record will be Yes and it will have an Active Start date, but no Active End date:

Figure 2.8 – Updated Active Record

As shown in Figure 2.8, when you are updating this value, you change Active Record to No and add an Active End date for the old value. The new value will set Active Record to Yes and receive an Active Start date. In this way, you can keep a record of all historical values. If you are more interested in being able to perform trend analyses and predict a future value using historical values than you are in keeping your dataset small and simple, this is the appropriate approach.
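A minimal pandas sketch of that update (the column names follow Figures 2.7 and 2.8; the values and dates are made up for illustration):

import pandas as pd

history = pd.DataFrame([
    {"Value": 10, "Active Record": "Yes",
     "Active Start": "2022-01-01", "Active End": None},
])

# Expire the old value: flip Active Record to No and stamp an Active End date
current = history["Active Record"] == "Yes"
history.loc[current, ["Active Record", "Active End"]] = ["No", "2022-06-01"]

# Append the new value as the active record, with its own Active Start date
new_row = {"Value": 12, "Active Record": "Yes",
           "Active Start": "2022-06-01", "Active End": None}
history = pd.concat([history, pd.DataFrame([new_row])], ignore_index=True)
print(history)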

Changing the number of variables being recorded

Occasionally, you will need to change the number of variables being recorded, which means adding or removing columns from a table or dataset. Whether you are adding or removing variables, you will have to decide whether or not to delete the historical data. It may seem odd, but there is a specific reason: null values, or empty cells where a value should be. It doesn't matter whether you are adding columns or removing them—either way, you will create null values.

In Figure 2.9, we are adding columns that track the color of beans, which was not tracked until the third data point. This means that all data points before you added the columns will have null values:

Figure 2.9 – Adding variables

In Figure 2.10, we are no longer tracking the color of the beans, so we are removing those variables; normally, we would just delete the columns. However, if we keep the columns because the historical data (the first two data points) still uses them, every data point recorded after we stop tracking will have null values for these variables:

Figure 2.10 – Removing variables

The only way to completely avoid these null values is to dump the historical data for those columns. That said, sometimes you can’t or just don’t want to get rid of so much data, so you will have to address the null values by other means. Figure 2.11 is an example of a table that has had the null values removed:

Figure 2.11 – Deleting historical values
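Here is a quick pandas sketch of the same effect using the bean-counting columns (the values are made up). Adding the color columns partway through leaves the earlier rows with NaN, which is how pandas represents a null value, and dropping the columns is the blunt way to remove those nulls along with the history:

import pandas as pd

# Two early data points collected before color was tracked
beans = pd.DataFrame({"Total": [10, 12]})

# A third data point recorded after color tracking was added
new_point = pd.DataFrame({"Total": [11], "Red": [3], "Blue": [5], "Yellow": [3]})
beans = pd.concat([beans, new_point], ignore_index=True)
print(beans)   # the first two rows show NaN under Red, Blue, and Yellow

# Dropping the color columns removes the nulls, but also the color history
print(beans.drop(columns=["Red", "Blue", "Yellow"]))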

Okay—now you know about updating stored records. Next, we will jump into different data types and file types so that you can know what to expect from each.

Going through data types and file types

Data comes in countless formats, each requiring different treatment and capable of different things. While each programming language has its own data types, these will not be tested because the exam is vendor-neutral and does not require knowledge of any specific programming language. However, there are some generic data types that everyone working with data should know that are covered in the exam.

Data types

When discussing data types, we are talking about the format of specific variables. While there are some commonalities between programming languages, these data types may have different names or be subdivided into different groups. However, all data processing programs should have the following data types:

Date
Numeric
Alphanumeric
Currency

Date is a data type that records a point in time by year, month, and day. This data type can also include hours, minutes, and seconds. There are many different ways to format a date variable, but the International Organization for Standardization (ISO) recommends ordering your dates from the biggest unit of time to the smallest, like so:

YYYY-MM-DD
YYYY-MM-DD HH:MI:SS

That said, it is more important to be consistent within a dataset than to have any one particular format. When merging two datasets from different sources, check to make sure they are both using the same format for any date variables.
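For instance, here is a minimal Python sketch of bringing two differently formatted date strings into the ISO layout before merging; the incoming formats are assumptions for the example:

from datetime import datetime

# The same day recorded by two sources in different (assumed) layouts
source_a = "03/27/2022"    # MM/DD/YYYY
source_b = "27.03.2022"    # DD.MM.YYYY

# Parse each with its own format string, then write both out as YYYY-MM-DD
iso_a = datetime.strptime(source_a, "%m/%d/%Y").strftime("%Y-%m-%d")
iso_b = datetime.strptime(source_b, "%d.%m.%Y").strftime("%Y-%m-%d")
print(iso_a, iso_b)        # 2022-03-27 2022-03-27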

Numeric data is made of numbers. Different programs break this up into multiple different subtypes, but for the exam, all you need to know is that a value that is a number, no matter whether it is a decimal or a whole number, is considered numeric data.

Alphanumeric data includes numbers and letters. Just as with numeric data, these go by many different names, based on the program used, but include any value that has letters in it. The only exception is if the value is a number formatted in scientific notation.

Currency data includes monetary values. This one is pretty simple. Just remember that if the numbers show a dollar sign, it is counting money and is probably formatted as currency.

Note

The exam does not cover Boolean values or values that can only be TRUE or FALSE. Not every program recognizes Booleans as their own data type, and even if the program does, these values are often translated into a different data type for use. For example, a Boolean might be recoded as 1 and 0 instead of TRUE and FALSE, so it will be recognized and processed by a machine learning (ML) algorithm, most of which require specific data types.
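A short example of the kind of recoding the note describes, sketched in pandas with a hypothetical column name:

import pandas as pd

flags = pd.DataFrame({"passed_inspection": [True, False, True]})
# Recode the Boolean into 1/0 so tools that expect numeric input can use it
flags["passed_inspection"] = flags["passed_inspection"].astype(int)
print(flags)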

Variable types

When discussing variable types in the context of this book, we are really looking at different kinds of statistical variables. What that means is if the exam indicates a specific column in a spreadsheet, you will have to be able to tell whether it is discrete, continuous, categorical, independent, or dependent. These are the types of things you will need to know when working as a data analyst to figure out whether you can run an analysis or not because every analysis has specific data requirements.

Discrete versus continuous

Discrete and continuous are two different kinds of numeric or currency data. Discrete variables are counts and usually describe whole numbers or integers. There are limited possibilities that a discrete variable can have.

Discrete examples:

23
7
$1.23

Important note

Sometimes a number can be a decimal and still be discrete. For example, when counting currency, $1.23 can still be considered discrete, because the values after the period represent cents that can be counted individually. However, $1.235 would no longer be considered discrete because there is no way to count half of a cent.

Continuous variables are not limited to whole numbers and can represent an infinite number of values between two points. Often, these values are measured or calculated and are represented as decimals.

Continuous examples:

3.47
7.00
$1.235

Categorical

Categorical variables, sometimes called dimensions, represent classifications or groups. Often, these are formatted as alphanumeric. There are three main types of categorical variables:

Binary
Nominal
Ordinal

Binary variables are categorical variables that only have two possible states, such as TRUE and FALSE, 1 and 0, Success and Failure, or Yes and No. All Booleans are binary variables.

Nominal variables are categorical variables that contain more than two groups and have no intrinsic order. The majority of categorical variables fall into this classification. Nominal variables can include things such as color, breed, city, product, or name.

Ordinal variables are categorical variables that have an intrinsic order. These are most often represented as scales. Ordinal variables can include things such as Small, Medium, and Large or Low Priority, Medium Priority, and High Priority.
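A short pandas sketch that contrasts a nominal and an ordinal variable (the category values are hypothetical). Declaring the order explicitly is what makes the second variable ordinal rather than nominal:

import pandas as pd

# Nominal: groups with no intrinsic order
colour = pd.Series(["Red", "Blue", "Yellow"], dtype="category")

# Ordinal: groups with an intrinsic order, declared explicitly
priority = pd.Series(pd.Categorical(
    ["Low", "High", "Medium"],
    categories=["Low", "Medium", "High"], ordered=True))

print(colour)
print(priority.sort_values())   # sorts Low, Medium, High rather than alphabetically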

Independent versus dependent

Independence is one of the most important distinctions in statistics and will be featured heavily in the Data Analysis domain of the exam. You can consider this the purpose of a variable in a study.

Independent variables are the variables in a study that you are manipulating directly. These variables are independent because they are not influenced by anything besides you. Independent variables cause changes in other variables (or don’t).

Dependent variables are the variables in a study that you are measuring. You do not manipulate these variables at all. If these variables change, it is because of the independent variables, so the values of these variables are dependent upon the values of the independent variables.

Let's look at an example. You run a simple study where you want to find out whether being able to see impacts the accuracy of dart throwing. You gather twenty people, blindfold ten of them, have each of them throw three darts, then measure the distance those darts landed from the center of the target. In this example, the variable of sight, or whether the person was blindfolded or not, is your independent variable. You are directly manipulating this variable by choosing who to blindfold. Your dependent variable is what you are measuring. In this case, you are measuring the distance of the darts from the center of the target. After all of this is done, you will run an analysis to see whether changing your independent variable had an impact on your dependent variable.
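To make the two roles concrete, here is a hedged sketch of how that dart study might be checked in Python. The distances are invented, and the t-test is only a preview of the inferential techniques covered in Chapter 10:

from scipy import stats

# Dependent variable: distance from the center of the target, in centimeters (made-up data)
blindfolded = [24.1, 30.5, 27.8, 22.0, 35.2, 29.9, 31.4, 26.7, 28.3, 33.0]
sighted = [8.2, 12.5, 9.9, 14.1, 7.6, 11.3, 10.8, 9.0, 13.4, 8.8]

# The independent variable (blindfolded or not) defines the two groups being compared
t_stat, p_value = stats.ttest_ind(blindfolded, sighted)
print(t_stat, p_value)   # a small p-value suggests sight affects accuracy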

File types

Often, data is saved on a computer as a file. Different types of data are saved as different types of files. A data analyst may be expected to deal with any number of file types, so the exam tests to see whether you can identify what kind of information can be found in the most common file types. The exam includes the following:

Text (such as TXT)
Image (such as JPEG)
Audio (such as MP3)
Video (such as MP4)
Flat (such as CSV)
Website (such as HTML)

Important note

While the exam is vendor-neutral and tries to avoid file types that require you to know a specific software, some of the file types are associated with a particular operating system (OS) because they are common enough that you are likely to encounter them if you are dealing with the associated data type. Also, these file types, while more common on some OSes than others, can be played on any OS. For example, a WMA file type is short for Windows Media Audio, so it is inherently associated with the Windows OS. That said, you do not need to be an expert in this OS to remember that a WMA file is an audio file.

Text

Files that only contain text may be common, depending on the specific data analytics position. Many word processing programs have their own file type. However, since the exam is vendor-neutral, we need only discuss ones that are not inherently associated with any particular program. For text files, that leaves a plain text file:

Text (TXT)

Image

Images have several different file types that store the image in different ways. The most common include the following:

Joint Photographic Experts Group (JPG/JPEG)
Portable Network Graphics (PNG)
Graphics Interchange Format (GIF)
Bitmap (BMP)
Raw Unprocessed Image (RAW)

Audio

Audio files include the following popular formats:

MPEG-1 Audio Layer III (MP3)
Waveform Audio (WAV)
Windows Media Audio (WMA)
Advanced Audio Coding (AAC)
Apple Lossless Audio Codec (ALAC)

Video

Video files include the following popular formats:

MPEG-4 Part 14 (MP4)
Windows Media Video (WMV)
QuickTime Video Format (MOV)
Flash Video (FLV)
Audio Video Interleave (AVI)

Flat

Flat files contain a simple two-dimensional dataset or spreadsheet. Again, every spreadsheet software has its own unique file type, but for the purpose of this exam, there are only two generic file types you need to know:

Tab-Separated Values (TSV)
Comma-Separated Values (CSV)

The difference between these two is how they separate values, or which delimiter they use.

TSV values are separated by tabs. Here’s an example:

Column1     Column2     Column3

CSV values are separated by commas. Here’s an example:

Column1,Column2,Column3
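Reading either file in Python only changes the delimiter handed to the standard csv module, as in this minimal sketch (the file names are hypothetical):

import csv

# Comma-separated: the default delimiter
with open("beans.csv", newline="") as f:
    for row in csv.reader(f):                 # delimiter="," is the default
        print(row)

# Tab-separated: same reader, different delimiter
with open("beans.tsv", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        print(row)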

Website

When discussing website file types, we are talking about file types that can be used by a website to store or convey information to be used by a data analyst and not specifically file types used to create or manage websites. It should also be noted that these file types do represent specific languages, but you do not need to know these languages to extract information from them. For example, you do not need to understand how to structure a website with HTML to extract useful information from an HTML file; there are parsers that can do this for you. The types of website files recognized by the exam are as follows:

Hypertext Markup Language (HTML)
Extensible Markup Language (XML)
JavaScript Object Notation (JSON)

HTML is a common file type that focuses on website structure, and occasionally passing information. Information is stored between tags. The tags create elements that all have specific pre-determined meanings and act in specific ways when used. Here’s an example:

<div>
     <h1>
          Store Data Here
     </h1>
     <p>
          Or Here
     </p>
</div>

XML is similar to HTML, but the tags have no pre-determined meanings and do not act in a specific way when used. You can use whichever tags have meaning to you. For this reason, it can be difficult to parse information from an XML file that came from a new source. Here's an example of XML:

<Dataset>
     <Data>
          Store Data Here
     </Data>
     <AlsoData>
          Or Here
     </AlsoData>
</Dataset>

JSON files are not used to structure websites, unlike the other two. JSON specializes in storing and passing information. A JSON file contains a list of data objects and gives values to those objects using key-value pairs. Here’s an example:

"Dataset" : [      {           "Data" : "Store Data Here"      },      {           "Data" : "Or Here"      } ]

In the end, you do not need to know how to use any of these languages. Make sure you understand that while all three can pass information, JSON is the one that specializes in it. Also, know that JSON is the only one that does not contribute to website structure and is based on key-value pairs.
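As an example of the parsers mentioned above, here is a minimal Python sketch that pulls the stored values out of the XML and JSON snippets with the standard library; the tag and key names match the examples, and no knowledge of either language beyond that is needed:

import json
import xml.etree.ElementTree as ET

xml_text = "<Dataset><Data>Store Data Here</Data><AlsoData>Or Here</AlsoData></Dataset>"
root = ET.fromstring(xml_text)
print([child.text for child in root])   # ['Store Data Here', 'Or Here']

json_text = '{"Dataset": [{"Data": "Store Data Here"}, {"Data": "Or Here"}]}'
print([obj["Data"] for obj in json.loads(json_text)["Dataset"]])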

Summary

We covered a lot of information in this chapter. First, we covered structured and unstructured databases, and what types of data can be expected in each. Also, we talked about relational and non-relational databases, and how they relate to structured and unstructured databases. Next, we covered database schemas such as star and snowflake schemas. Then, we covered data warehouses, data marts, and data lakes. Briefly, we touched on how to update stored data. Finally, we wrapped things up with different data types and file types. This is everything you need to know about the storage of data.

In the next chapter, we will go over how this data is collected in the first place!

Practice questions and their answers

Let’s try to practice the material in this chapter with a few example questions.

Questions

1. A smart thermometer collects information about the temperature outside every 30 minutes, creates a log, and sends the data to a local database. What can you tell about the database?

It is structured
It is unstructured
It is relational
There is not enough information

2. Client-facing agents at banks are not technical experts, but require the ability to query client information. Which data schema is most appropriate for their database?

Star schema
Snowflake schema
Galaxy schema
Avalanche schema

3. You are working as a data scientist for a video-streaming website. You require access to raw, unprocessed data and video files. What is the most appropriate database style to use?

Data warehouse
Data mart
Data lake
Data mine