Build a solid foundation in data analysis skills and pursue a coveted Data+ certification with this intuitive study guide.

CompTIA Data+ Study Guide: Exam DA0-001 delivers easily accessible and actionable instruction for achieving the data analysis competencies required on the job and on the CompTIA Data+ certification exam. You'll learn to collect, analyze, and report on various types of commonly used data, transforming raw data into usable information for stakeholders and decision makers. With comprehensive coverage of data concepts and environments, data mining, data analysis, visualization, and data governance, quality, and controls, this Study Guide offers:

* All the information necessary to succeed on the exam for a widely accepted, entry-level credential that unlocks lucrative new data analytics and data science career opportunities
* 100% coverage of the objectives for the new CompTIA Data+ exam
* Access to the Sybex online learning resources, with review questions, a full-length practice exam, hundreds of electronic flashcards, and a glossary of key terms

Whether you are seeking a new career in data analysis, looking to improve your current data science skills, or hoping to achieve the coveted CompTIA Data+ certification credential, CompTIA Data+ Study Guide: Exam DA0-001 provides an invaluable head start to beginning or accelerating a career as an in-demand data analyst.
Page count: 491
Year of publication: 2022
Cover
Title Page
Copyright
Dedication
Acknowledgments
About the Authors
About the Technical Editor
Introduction
The Data+ Exam
What Does This Book Cover?
Exam DA0-001 Exam Objectives
DA0-001 Certification Exam Objective Map
Assessment Test
Answers to Assessment Test
Chapter 1: Today's Data Analyst
Welcome to the World of Analytics
Careers in Analytics
The Analytics Process
Analytics Techniques
Data Governance
Analytics Tools
Summary
Chapter 2: Understanding Data
Exploring Data Types
Common Data Structures
Common File Formats
Summary
Exam Essentials
Review Questions
Chapter 3: Databases and Data Acquisition
Exploring Databases
Database Use Cases
Data Acquisition Concepts
Working with Data
Summary
Exam Essentials
Review Questions
Chapter 4: Data Quality
Data Quality Challenges
Data Manipulation Techniques
Managing Data Quality
Summary
Exam Essentials
Review Questions
Chapter 5: Data Analysis and Statistics
Fundamentals of Statistics
Descriptive Statistics
Inferential Statistics
Analysis Techniques
Summary
Exam Essentials
Review Questions
Chapter 6: Data Analytics Tools
Spreadsheets
Programming Languages
Statistics Packages
Machine Learning
Analytics Suites
Summary
Exam Essentials
Review Questions
Chapter 7: Data Visualization with Reports and Dashboards
Understanding Business Requirements
Understanding Report Design Elements
Understanding Dashboard Development Methods
Exploring Visualization Types
Comparing Report Types
Summary
Exam Essentials
Review Questions
Chapter 8: Data Governance
Data Governance Concepts
Understanding Master Data Management
Summary
Exam Essentials
Review Questions
Appendix: Answers to the Review Questions
Chapter 2: Understanding Data
Chapter 3: Databases and Data Acquisition
Chapter 4: Data Quality
Chapter 5: Data Analysis and Statistics
Chapter 6: Data Analytics Tools
Chapter 7: Data Visualization with Reports and Dashboards
Chapter 8: Data Governance
Index
End User License Agreement
Chapter 1
TABLE 1.1 Gigabyte storage costs over time
TABLE 1.2 Highest-demand occupations
Chapter 2
TABLE 2.1 Pet data
TABLE 2.2 Selected character data types and maximum size
TABLE 2.3 Selected integer data types and value range
TABLE 2.4 Selected integer data types and value range
TABLE 2.5 Selected date and time data types
TABLE 2.6 Currency data types in Microsoft SQL Server
TABLE 2.7 Selected binary data types and maximum sizes
TABLE 2.8 Approximate storage needs
TABLE 2.9 Selected binary data types and maximum sizes
Chapter 3
TABLE 3.1 Animal data
TABLE 3.2 Person Data
TABLE 3.3 AnimalPerson table
TABLE 3.4 Appointment booking
TABLE 3.5 Product dimension
TABLE 3.6 Current flag
TABLE 3.7 Effective date
TABLE 3.8 Data manipulation in SQL
TABLE 3.9 Augmented animal data
TABLE 3.10 Sex Lookup Table
TABLE 3.11 Desired Query Results
TABLE 3.12 Modified IIF query results
TABLE 3.13 Common SQL aggregate functions
Chapter 5
TABLE 5.1 Common symbols in statistics
TABLE 5.2 Selected implementations of count
TABLE 5.3 Sample data
TABLE 5.4 Exploring percentages
TABLE 5.5 Institutional control percentage
TABLE 5.6 Standard deviation performance levels
TABLE 5.7 Illustrating degrees of freedom
TABLE 5.8 Illustrating interquartile range
TABLE 5.9 Illustrating lower and upper limits for interquartiles
TABLE 5.10 Confidence level and critical value for normally distributed data...
TABLE 5.11 Calculating Z-score
TABLE 5.12 Potential hypothesis test outcomes
TABLE 5.13 Categorical variables
TABLE 5.14 Medical research correlation coefficient guidelines
Chapter 7
TABLE 7.1 Color codes for General Electric Blue
TABLE 7.2 Sample daily sales report version history
Chapter 8
TABLE 8.1 Sample data classification matrix
TABLE 8.2 Breach notification requirements
Chapter 1
FIGURE 1.1 Analytics is made possible by modern data, storage, and computing...
FIGURE 1.2 Storage costs have decreased over time.
FIGURE 1.3 The analytics process
FIGURE 1.4 Table of college tuition data
FIGURE 1.5 Visualization of college tuition data
FIGURE 1.6 The relationship between artificial intelligence, machine learnin...
FIGURE 1.7 Data analysis in Microsoft Excel
FIGURE 1.8 Data analysis in RStudio
Chapter 2
FIGURE 2.1 Person Data
FIGURE 2.2 SKU example
FIGURE 2.3 U.S. QWERTY keyboard layout
FIGURE 2.4 Excel text-only formula
FIGURE 2.5 Text-only data validation restriction
FIGURE 2.6 Numeric data formatted as different currencies
FIGURE 2.7 Currency formats in Google Sheets
FIGURE 2.8 Incorrect percentage due to money data type
FIGURE 2.9 Correct percentage as calculated by a spreadsheet
FIGURE 2.10 Unrounded calculation in a calculator
FIGURE 2.11 Correct percentage when using a numeric data type
FIGURE 2.12 Images in Excel
FIGURE 2.13 Binary data in a Google spreadsheet
FIGURE 2.14 Photograph of racing motorcycles
FIGURE 2.15 Motorcycle search results
FIGURE 2.16 Discrete data example
FIGURE 2.17 Continuous data example
FIGURE 2.18 Dimension illustration
FIGURE 2.19 Data entry error
FIGURE 2.20 Data entry error identified in a summary
FIGURE 2.21 Pet ID as a key
FIGURE 2.22 Unstructured data: log entry
FIGURE 2.23 Files in object storage
FIGURE 2.24 Keys and values in object storage
FIGURE 2.25 Exporting as CSV or TSV
FIGURE 2.26 Contents of a CSV file
FIGURE 2.27 Semi-structured CSV
FIGURE 2.28 Fixed-width file
FIGURE 2.29 Pet data JSON example
FIGURE 2.30 Reading JSON in Python
FIGURE 2.31 Reading JSON in R
FIGURE 2.32 Representing a single animal in XML
FIGURE 2.33 Representing a single animal in HTML
FIGURE 2.34 HTML table in a browser
FIGURE 2.35 Displaying an image in an HTML table
Chapter 3
FIGURE 3.1 The (a) Animal entity and (b) Person entity in a veterinary datab...
FIGURE 3.2 Relationship connecting Animal and Person
FIGURE 3.3 ERD line endings
FIGURE 3.4 Unary relationship
FIGURE 3.5 Entity relationship diagram
FIGURE 3.6 Database schema
FIGURE 3.7 Email template
FIGURE 3.8 Reminder appointment email
FIGURE 3.9 Referential integrity illustration
FIGURE 3.10 Customer ERD
FIGURE 3.11 Foreign key data constraint
FIGURE 3.12 JSON person data
FIGURE 3.13 Data in a graph
FIGURE 3.14 Vet clinic transactional schema
FIGURE 3.15 Data in first normal form
FIGURE 3.16 Data in second normal form
FIGURE 3.17 Data in third normal form
FIGURE 3.18 Yearly spend by animal
FIGURE 3.19 CostSummary table
FIGURE 3.20 Star schema example
FIGURE 3.21 OLTP and OLAP query example
FIGURE 3.22 Snowflake example
FIGURE 3.23 Delta load example
FIGURE 3.24 Internal API example
FIGURE 3.25 Single question survey
FIGURE 3.26 Qualtrics survey build
FIGURE 3.27 SQL SELECT statement
FIGURE 3.28 Hard-coded SQL query
FIGURE 3.29 Parameterized SQL query
Chapter 4
FIGURE 4.1 Duplicate data resolution process
FIGURE 4.2 Illustration of multiple data sources
FIGURE 4.3 Resolving redundancy with an integrated ETL process
FIGURE 4.4 Redundant data
FIGURE 4.5 Name change issue
FIGURE 4.6 Transactional design
FIGURE 4.7 Missing temperature value
FIGURE 4.8 Invalid temperature value
FIGURE 4.9 Pain rating scale
FIGURE 4.10 Real estate sales outlier
FIGURE 4.11 Automotive schema excerpt
FIGURE 4.12 List of automotive manufacturers
FIGURE 4.13 Patient pain data
FIGURE 4.14 Recoded patient pain data
FIGURE 4.15 Patient data
FIGURE 4.16 Deriving age
FIGURE 4.17 Merging disparate data
FIGURE 4.18 ETL and the data merge approach
FIGURE 4.19 Extract, transform, and load process
FIGURE 4.20 Data blending
FIGURE 4.21 Creating a date variable with concatenation
FIGURE 4.22 Appending weather data
FIGURE 4.23 Weight log with missing values
FIGURE 4.24 Various imputation techniques
FIGURE 4.25 Dimensionality reduction example
FIGURE 4.26 Numerosity reduction with histograms
FIGURE 4.27 Sampling and histograms
FIGURE 4.28 Summary statistics
FIGURE 4.29 Sales data
FIGURE 4.30 Transposed sales data
FIGURE 4.31 Combining transposition with aggregation
FIGURE 4.32 Raw athlete data
FIGURE 4.33 Scaled athlete data
FIGURE 4.34 Splitting a composite column
FIGURE 4.35 Combining multiple columns
FIGURE 4.36 Improving data quality with string manipulation
FIGURE 4.37 Data source description
FIGURE 4.38 Three opportunities to impact data quality
FIGURE 4.39 Address standardization
FIGURE 4.40 Data conversion issue
FIGURE 4.41 Data manipulation issue
FIGURE 4.42 Ensuring consistency
FIGURE 4.43 Using uniqueness to improve consistency
FIGURE 4.44 Referential integrity and data validity
FIGURE 4.45 Multiple source systems and the potential for nonconformity
FIGURE 4.46 Reprocessing of bad data
Chapter 5
FIGURE 5.1 Weight log
FIGURE 5.2 Histogram of average SAT score for U.S. institutions of higher ed...
FIGURE 5.3 Histograms of SAT averages and institutional control
FIGURE 5.4A Mean salary data
FIGURE 5.4B Effect of an outlier on the mean
FIGURE 5.5A Calculating median salary data
FIGURE 5.5B Effect of an outlier on the median
FIGURE 5.6 Categorical question
FIGURE 5.7 Normal distribution
FIGURE 5.8 Right skewed distribution
FIGURE 5.9 Left skewed distribution
FIGURE 5.10 Bimodal distribution
FIGURE 5.11 Temperature variance
FIGURE 5.12 Standard normal distribution and t-distribution
FIGURE 5.13 Sales prices for 30,000 homes
FIGURE 5.14 Samples in a population
FIGURE 5.15 Samples at a 95 percent confidence level and population mean
FIGURE 5.16 Confidence levels and the normal distribution
FIGURE 5.17 Visualizing alpha
FIGURE 5.18 One-tailed test
FIGURE 5.19 Two-tailed test
FIGURE 5.20 Sample statistic and p-value
FIGURE 5.21 Hypothesis testing flowchart
FIGURE 5.22 Insignificant sample mean
FIGURE 5.23 Simple linear regression of age and BMI
FIGURE 5.24 Linear model output of age and BMI
FIGURE 5.25 Simple linear regression of speed and distance
FIGURE 5.26 Highly correlated data
Chapter 6
FIGURE 6.1 Table of data in Microsoft Excel
FIGURE 6.2 Data visualization in Microsoft Excel
FIGURE 6.3 Data analysis using R and RStudio
FIGURE 6.4 Data analysis using Python and pandas
FIGURE 6.5 SQL query using Azure Data Studio
FIGURE 6.6 Calculating correlations in SPSS
FIGURE 6.7 Analyzing data in SAS
FIGURE 6.8 Building a simple model in Stata
FIGURE 6.9 Evaluating a regression model in Minitab
FIGURE 6.10 Designing a machine learning task in SPSS Modeler
FIGURE 6.11 Exploring a data model in SPSS Modeler
FIGURE 6.12 Designing a machine learning task in RapidMiner
FIGURE 6.13 Exploring a data model in RapidMiner
FIGURE 6.14 Exploring data in IBM Cognos
FIGURE 6.15 Communicating a story with data in Microsoft Power BI
FIGURE 6.16 Building a dashboard in MicroStrategy
FIGURE 6.17 Analyzing a dataset in Domo
FIGURE 6.18 Communicating a data story in Datorama
FIGURE 6.19 Building a dashboard in AWS QuickSight
FIGURE 6.20 Visualizing data in Tableau
FIGURE 6.21 Visualizing data in Qlik Sense
Chapter 7
FIGURE 7.1 Historical reporting
FIGURE 7.2 Real-time reporting
FIGURE 7.3 Poor font choice
FIGURE 7.4 Porsche AG Annual Report cover page
FIGURE 7.5 Porsche AG Annual Report executive summary
FIGURE 7.6 Black monochromatic color palette
FIGURE 7.7 Nonparallel construction
FIGURE 7.8 Parallel construction
FIGURE 7.9 Serif and sans serif fonts
FIGURE 7.10 Sample layout elements and font sizes
FIGURE 7.11 Sample chart
FIGURE 7.12 Sample chart with legend
FIGURE 7.13 General Electric logo
FIGURE 7.14 Sample chart with watermark
FIGURE 7.15 Sample public-facing dashboard
FIGURE 7.16 Data sourcing flowchart
FIGURE 7.17 Sales funnel
FIGURE 7.18 Line chart
FIGURE 7.19 Pie chart
FIGURE 7.20 Bar chart by genre
FIGURE 7.21 Bar chart by count
FIGURE 7.22 Stacked bar chart
FIGURE 7.23 Average duration scatterplot
FIGURE 7.24 Budget and box office gross scatterplot
FIGURE 7.25 Budget and box office gross scatter and line chart
FIGURE 7.26 Budget and average box office gross scatter and line chart
FIGURE 7.27 Budget and average box office gross bubble and line chart
FIGURE 7.28 Histogram of box office gross
FIGURE 7.29 Histogram of movie duration
FIGURE 7.30 Large U.S. colleges and universities
FIGURE 7.31 Risk heat map
FIGURE 7.32 Correlation heat map
FIGURE 7.33 Tree map of movie genres
FIGURE 7.34 Tree map of the comedy genre
FIGURE 7.35 Headcount waterfall chart
FIGURE 7.36 Infographic
FIGURE 7.37 Positive word cloud
FIGURE 7.38 Negative word cloud
Chapter 8
FIGURE 8.1 Organizational example
FIGURE 8.2 Access roles over time
FIGURE 8.3 Danger of user-based access
FIGURE 8.4 Sample organization chart
FIGURE 8.5 Sample user group–based roles
FIGURE 8.6 Encrypted network connection
FIGURE 8.7 HTTPS padlock
FIGURE 8.8 Encrypted ETL process
FIGURE 8.9 Data masking ETL process
FIGURE 8.10 Reidentification by combining datasets
FIGURE 8.11 Local encryption process flow
FIGURE 8.12 Record linkage example
FIGURE 8.13 Duplicate resolution process
FIGURE 8.14 MDM for person data
Mike Chapple
Sharif Nijim
Copyright © 2022 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
ISBN: 978-1-119-84525-6
ISBN: 978-1-119-84527-0 (ebk.)
ISBN: 978-1-119-84526-3 (ebk.)
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware the Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Control Number: 2022930191
Trademarks: WILEY, the Wiley logo, Sybex and the Sybex logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. CompTIA and Data+ are registered trademarks of CompTIA, Inc. Dr. Ing. h. c. F. Porsche AG is the owner of numerous trademarks, both registered and unregistered, including without limitation the Porsche Crest, Porsche, Boxster, Carrera, Cayenne, Cayman, Panamera, Taycan, 911, 718, and the model numbers and distinctive shapes of Porsche automobiles such as the 911 and Boxster automobiles in the United States; these are used with permission of Porsche Cars North America, Inc. and Dr. Ing. h. c. F. Porsche AG. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Cover images: © Jeremy Woodhouse/Getty Images
Cover design: Wiley
To my aspiring engineer, Chris. Your mother and I are so proud of you and can't wait to see all of the incredible things that you accomplish!
—Mike
To my parents, Basheer and Germana. Thank you for your constant love and support, and for showing me how to live.
—Sharif
Books like this involve work from many people, and as authors, we truly appreciate the hard work and dedication that the team at Wiley shows. We would especially like to thank senior acquisitions editor Kenyon Brown. We have worked with Ken on multiple projects and consistently enjoy our work with him.
We also greatly appreciated the editing and production team for the book. First and foremost, we'd like to thank our friend and colleague, Dr. Jen Waddell. Jen provided us with invaluable insight as we worked our way through the many challenges inherent in putting out a book covering a brand-new certification. Jen's a whiz at statistics and analytics, and we couldn't have completed this book without her support. We also benefited greatly from the work of two of our students at Notre Dame. Ricky Chapple did a final read-through of this book to ensure that it was ready to go, and Matthew Howard helped create the instructor materials that accompany this book.
We'd also like to thank the many people who helped us make this project successful, including Adaobi Obi Tulton, our project editor, who brought years of experience and great talent to the project, and Barath Kumar Rajasekaran, our content refinement specialist, who guided us through layouts, formatting, and final cleanup to produce a great book. We would also like to thank the many behind-the-scenes contributors, including the graphics, production, and technical teams who make the book and companion materials into a finished product.
Our agent, Carole Jelen of Waterside Productions, continues to provide us with wonderful opportunities, advice, and assistance throughout our writing careers.
Finally, we would like to thank our families who support us through the late evenings, busy weekends, and long hours that a book like this requires to write, edit, and get to press.
Mike Chapple, Ph.D., CySA+, is author of the best-selling CISSP (ISC)2 Certified Information Systems Security Professional Official Study Guide (Sybex, 2021) and the CISSP (ISC)2 Official Practice Tests (Sybex, 2021). He is an information technology professional with two decades of experience in higher education, the private sector, and government.
Mike currently serves as a teaching professor in the IT, Analytics, and Operations department at the University of Notre Dame's Mendoza College of Business, where he teaches undergraduate and graduate courses on cybersecurity, data management, and business analytics.
Before returning to Notre Dame, Mike served as executive vice president and chief information officer of the Brand Institute, a Miami-based marketing consultancy. Mike also spent four years in the information security research group at the National Security Agency and served as an active duty intelligence officer in the U.S. Air Force.
Mike has written more than 30 books. He earned both his B.S. and Ph.D. degrees from Notre Dame in computer science and engineering. Mike also holds an M.S. in computer science from the University of Idaho and an MBA from Auburn University.
Learn more about Mike and his other certification materials at CertMike.com.
Sharif Nijim is an assistant teaching professor in the IT, Analytics, and Operations department at the Mendoza College of Business at the University of Notre Dame, where he teaches undergraduate and graduate courses in business analytics and information technology.
Prior to Notre Dame, Sharif co-founded and served on the board of a customer data integration company serving the airline industry. Sharif also spent more than a decade building and optimizing enterprise-class transactional and decision support systems for clients in the energy, healthcare, hospitality, insurance, logistics, manufacturing, real estate, telecommunications, and travel and transportation sectors.
Sharif earned both his B.B.A. and M.S. from the University of Notre Dame.
Jennifer Waddell is a teaching professor, assistant department chair, and the director of undergraduate studies in the IT, Analytics, and Operations department at the University of Notre Dame, specializing in the areas of statistical methodology and analytics. Over the last 20 years, she has educated students at the undergraduate, graduate, and executive levels in these disciplines, focusing on their theoretical understanding as well as technical skill implementation. In addition to her time in the classroom, she has worked as a statistical consultant on research projects in healthcare systems and educational services.
If you're preparing to take the Data+ exam, you'll undoubtedly want to find as much information as you can about data and analytics. The more information you have at your disposal and the more hands-on experience you gain, the better off you'll be when attempting the exam. This study guide was written with that in mind. The goal was to provide enough information to prepare you for the test, but not so much that you'll be overloaded with information that's outside the scope of the exam.
We've included review questions at the end of each chapter to give you a taste of what it's like to take the exam. If you're already working in the data field, we recommend that you check out these questions first to gauge your level of expertise. You can then use the book mainly to fill in the gaps in your current knowledge. This study guide will help you round out your knowledge base before tackling the exam.
If you can answer 90 percent or more of the review questions correctly for a given chapter, you can feel safe moving on to the next chapter. If you're unable to answer that many correctly, reread the chapter and try the questions again. Your score should improve.
Don't just study the questions and answers! The questions on the actual exam will be different from the practice questions included in this book. The exam is designed to test your knowledge of a concept or objective, so use this book to learn the objectives behind the questions.
The Data+ exam is designed to be a vendor-neutral certification for data professionals and those seeking to enter the field. CompTIA recommends this certification for those currently working, or aspiring to work, in data analyst and business intelligence reporting roles.
The exam covers five major domains:
Data Concepts and Environments
Data Mining
Data Analysis
Visualization
Data Governance, Quality, and Controls
These five areas include a range of topics, from data types to statistical analysis and from data visualization to tools and techniques, while focusing heavily on scenario-based learning. That's why CompTIA recommends that those attempting the exam have 18–24 months of hands-on work experience, although many individuals pass the exam before moving into their first data analysis role.
The Data+ exam is conducted in a format that CompTIA calls “performance-based assessment.” This means that the exam combines standard multiple-choice questions with other, interactive question formats. Your exam may include several types of questions such as multiple-choice, fill-in-the-blank, multiple-response, drag-and-drop, and image-based problems. More details about the Data+ exam and how to take it can be found here:
http://www.comptia.org/certifications/data
You'll have 90 minutes to take the exam and will be asked to answer 90 questions during that time period. Your exam will be scored on a scale ranging from 100 to 900, with a passing score of 675.
You should also know that CompTIA is notorious for including vague questions on all of its exams. You might see a question for which two of the possible four answers are correct—but you can choose only one. Use your knowledge, logic, and intuition to choose the best answer and then move on. Sometimes, the questions are worded in ways that would make English majors cringe—a typo here, an incorrect verb there. Don't let this frustrate you; answer the question and move on to the next one.
CompTIA frequently does what is called item seeding, which is the practice of including unscored questions on exams. It does so to gather psychometric data, which is then used when developing new versions of the exam. Before you take the exam, you will be told that your exam may include these unscored questions. So, if you come across a question that does not appear to map to any of the exam objectives—or for that matter, does not appear to belong in the exam—it is likely a seeded question. You never really know whether or not a question is seeded, however, so always make your best effort to answer every question.
Once you are fully prepared to take the exam, you can visit the CompTIA website to purchase your exam voucher:
https://store.comptia.org/Certification-Vouchers/c/11293
Currently, CompTIA offers two options for taking the exam: an in-person exam at a testing center and an at-home exam that you take on your own computer.
This book includes a coupon that you may use to save 10 percent on your CompTIA exam registration.
CompTIA partners with Pearson VUE's testing centers, so your next step will be to locate a testing center near you. In the United States, you can do this based on your address or your ZIP code, while non-U.S. test takers may find it easier to enter their city and country. You can search for a test center near you at the Pearson VUE website, where you will need to navigate to “Find a test center.”
http://www.pearsonvue.com/comptia
Now that you know where you'd like to take the exam, simply set up a Pearson VUE testing account and schedule an exam on their site.
On the day of the test, take two forms of identification, and make sure to show up with plenty of time before the exam starts. Remember that you will not be able to take your notes, electronic devices (including smartphones and watches), or other materials in with you.
CompTIA began offering online exam proctoring in response to the coronavirus pandemic. As of the time this book went to press, the at-home testing option was still available and appears likely to continue. Candidates using this approach will take the exam at their home or office and be proctored over a webcam by a remote proctor.
Due to the rapidly changing nature of the at-home testing experience, candidates wishing to pursue this option should check the CompTIA website for the latest details.
Once you have taken the exam, you will be notified of your score immediately, so you'll know if you passed the test right away. You should keep track of your score report with your exam registration records and the email address you used to register for the exam.
This book covers everything you need to know to pass the Data+ exam.
Chapter 1: Today's Data Analyst
Chapter 2: Understanding Data
Chapter 3: Databases and Data Acquisition
Chapter 4: Data Quality
Chapter 5: Data Analysis and Statistics
Chapter 6: Data Analytics Tools
Chapter 7: Data Visualization with Reports and Dashboards
Chapter 8: Data Governance
Practice Exam 1
Practice Exam 2
Appendix: Answers to the Review Questions
This study guide uses a number of common elements to help you prepare. These include the following:
Summaries
The summary section of each chapter briefly explains the chapter, allowing you to easily understand what it covers.
Exam Essentials
The exam essentials focus on major exam topics and critical knowledge that you should take into the test. The exam essentials focus on the exam objectives provided by CompTIA.
Chapter Review Questions
A set of questions at the end of each chapter will help you assess your knowledge and whether you are ready to take the exam based on your knowledge of that chapter's topics.
This book comes with a number of additional study tools to help you prepare for the exam. They include the following.
Go to https://www.wiley.com/go/sybextestprep to register and gain access to this interactive online learning environment and test bank with study tools.
Sybex's test preparation software lets you prepare with electronic test versions of the review questions from each chapter, the practice exam, and the bonus exam that are included in this book. You can build and take tests on specific domains, by chapter, or cover the entire set of Data+ exam objectives using randomized tests.
Our electronic flashcards are designed to help you prepare for the exam. Over 100 flashcards will ensure that you know critical terms and concepts.
Sybex provides a full glossary of terms in PDF format, allowing quick searches and easy reference to materials in this book.
In addition to the practice questions for each chapter, this book includes two full 90-question practice exams. We recommend that you use them both to test your preparedness for the certification exam.
Like all exams, the Data+ certification from CompTIA is updated periodically and may eventually be retired or replaced. At some point after CompTIA is no longer offering this exam, the old editions of our books and online tools will be retired. If you have purchased this book after the exam was retired or are attempting to register in the Sybex online learning environment after the exam was retired, please know that we make no guarantees that this exam's online Sybex tools will be available once the exam is no longer available.
CompTIA goes to great lengths to ensure that its certification programs accurately reflect the IT industry's best practices. It does this by establishing committees for each of its exam programs. Each committee consists of a small group of IT professionals, training providers, and publishers who are responsible for establishing the exam's baseline competency level and who determine the appropriate target-audience level.
Once these factors are determined, CompTIA shares this information with a group of hand-selected subject matter experts (SMEs). These folks are the true brainpower behind the certification program. The SMEs review the committee's findings, refine them, and shape them into the objectives that follow this section. CompTIA calls this process a job-task analysis (JTA).
Finally, CompTIA conducts a survey to ensure that the objectives and weightings truly reflect job requirements. Only then can the SMEs go to work writing the hundreds of questions needed for the exam. Even so, they have to go back to the drawing board for further refinements in many cases before the exam is ready to go live in its final state. Rest assured that the content you're about to learn will serve you long after you take the exam.
CompTIA also publishes relative weightings for each of the exam's objectives. The following table lists the five Data+ objective domains and the extent to which they are represented on the exam.
Domain | % of Exam
1.0 Data Concepts and Environments | 15%
2.0 Data Mining | 25%
3.0 Data Analysis | 23%
4.0 Visualization | 23%
5.0 Data Governance, Quality, and Controls | 14%
Objective | Chapter
1.0 Data Concepts and Environments
1.1 Identify basic concepts of data schemas and dimensions | Chapter 3
1.2 Compare and contrast different data types | Chapter 2
1.3 Compare and contrast common data structures and file formats | Chapter 2
2.0 Data Mining
2.1 Explain data acquisition concepts | Chapter 3
2.2 Identify common reasons for cleansing and profiling datasets | Chapter 4
2.3 Given a scenario, execute data manipulation techniques | Chapter 4
2.4 Explain common techniques for data manipulation and query optimization | Chapter 3
3.0 Data Analysis
3.1 Given a scenario, apply the appropriate descriptive statistical methods | Chapter 5
3.2 Explain the purpose of inferential statistical methods | Chapter 5
3.3 Summarize types of analysis and key analysis techniques | Chapter 5
3.4 Identify common data analytics tools | Chapter 6
4.0 Visualization
4.1 Given a scenario, translate business requirements to form a report | Chapter 7
4.2 Given a scenario, use appropriate design components for reports and dashboards | Chapter 7
4.3 Given a scenario, use appropriate methods for dashboard development | Chapter 7
4.4 Given a scenario, apply the appropriate type of visualization | Chapter 7
4.5 Compare and contrast types of reports | Chapter 7
5.0 Data Governance, Quality, and Controls
5.1 Summarize important data governance concepts | Chapter 8
5.2 Given a scenario, apply data quality control concepts | Chapter 4
5.3 Explain master data management (MDM) concepts | Chapter 8
Exam objectives are subject to change at any time without prior notice and at CompTIA's discretion. Please visit CompTIA's website (www.comptia.org) for the most current listing of exam objectives.
Lila is aggregating data from a CRM system with data from an employee system. While performing an initial quality check, she realizes that her employee ID is not associated with her identifier in the CRM system. What kind of issue is Lila facing? Choose the best answer.
ETL process
Record linkage
ELT process
System integration
Rob is a pricing analyst for a retailer. Using a hypothesis test, he wants to assess whether people who receive electronic coupons spend more on average. What should Rob's null hypothesis be?
People who receive electronic coupons spend more on average.
People who receive electronic coupons spend less on average.
People who receive electronic coupons do not spend more on average.
People who do not receive electronic coupons spend more on average.
Tonya needs to create a dashboard that will draw information from many other data sources and present it to business leaders. Which one of the following tools is least likely to meet her needs?
QuickSight
Tableau
Power BI
SPSS Modeler
Ryan is using the Structured Query Language to work with data stored in a relational database. He would like to add several new rows to a database table. What command should he use?
SELECT
ALTER
INSERT
UPDATE
Daniel is working on an ELT process that sources data from six different source systems. Looking at the source data, he finds that data about the same people exists in two of the six systems. What must he make sure to check for in his ELT process? Choose the best answer.
Duplicate data
Redundant data
Invalid data
Missing data
Samantha needs to share a list of her organization's top 50 customers with the VP of Sales. She would like to include the name of the customer, the business they represent, their contact information, and their total sales over the past year. The VP does not have any specialized analytics skills or software but would like to make some personal notes on the dataset. What would be the best tool for Samantha to use to share this information?
Power BI
Microsoft Excel
Minitab
SAS
Alexander wants to use data from his corporate sales, CRM, and shipping systems to try to predict future sales. Which of the following systems is most appropriate? Choose the best answer.
Data mart
OLAP
Data warehouse
OLTP
Jackie is working in a data warehouse and finds a finance fact table links to an organization dimension, which in turn links to a currency dimension that is not linked to the fact table. What type of design pattern is the data warehouse using?
Star
Sun
Snowflake
Comet
Encryption is a mechanism for protecting data. When should encryption be applied to data? Choose the best answer.
When data is at rest
When data is at rest or in transit
When data is in transit
When data is at rest, unless you are using local storage
What subset of the Structured Query Language (SQL) is used to add, remove, modify, or retrieve the information stored within a relational database?
DDL
DSL
DQL
DML
Which of the following roles is responsible for ensuring an organization's data quality, security, privacy, and regulatory compliance?
Data owner
Data steward
Data custodian
Data processor
Jen wants to study the academic performance of undergraduate sophomores and wants to determine the average grade point average at different points during an academic year. What best describes the data set she needs?
Sample
Observation
Variable
Population
Mauro works with a group of R programmers tasked with copying data from an accounting system into a data warehouse. In what phase are the group's R skills most relevant?
Extract
Load
Transform
Purge
Which one of the following tools would not be considered a fully featured analytics suite?
Minitab
MicroStrategy
Domo
Power BI
Omar is conducting a study and wants to capture eye color. What kind of data is eye color? Choose the best response.
Discrete
Categorical
Continuous
Alphanumeric
Lars is looking at home sales prices in a single zip code and notices that one home sold for $938,294 when the average selling price of similar homes is $209,383. What type of data does the $938,294 sales price represent? Choose the best answer.
Duplicate data
Data outlier
Redundant data
Invalid data
Trianna wants to explore central tendency in her dataset. Which statistic best matches her need?
Interquartile range
Range
Median
Standard deviation
Shakira has 15 people on her data analytics team. Her team's charter requires that all team members have read access to the finance, human resources, sales, and customer service areas of the corporate data warehouse. What is the best way to provision access to her team? Choose the best answer.
Since there are 15 people on her team, create a role for each person to improve security.
Since there are four discrete data subjects, create one role for each subject area.
Enable multifactor authentication (MFA) to protect the data.
Create a single role that includes finance, human resources, sales, and customer service data.
What is the median of the following numbers?
13, 2, 65, 3, 5, 4, 7, 3, 4, 7, 8, 2, 4, 4, 60, 23, 43, 2
4
4.5
63
18
Lewis is designing an ETL process to copy sales data into a data warehouse on an hourly basis. What approach should Lewis choose that would be most efficient and minimize the chance of losing historical data?
Bulk load
Purge and load
Use ELT instead of ETL
Delta load
Carlos wants to analyze profit based on sales of five different product categories. His source data set consists of 5.8 million rows with columns including region, product category, product name, and sales price. How should he manipulate the data to facilitate his analysis? Choose the best answer.
Transpose by region and summarize.
Transpose by product category and summarize.
Transpose by product name and summarize.
Transpose by sales price and summarize.
According to the empirical rule, what percent of the values in a sample fall within three standard deviations of the mean in a normal distribution?
99.70%
95%
90%
68%
Martin is building a database to store prices for items on a restaurant menu. What data type is most appropriate for this field?
Numeric
Date
Text
Alphanumeric
Harrison is conducting a survey. He intends to distribute the survey via email and wants to optionally follow up with respondents based on their answers. What quality dimension is most vital to the success of Harrison's survey? Choose the best answer.
Completeness
Accuracy
Consistency
Validity
Mary is developing a script that will perform some common analytics tasks. In order to improve the efficiency of her workflow, she is using a package called the tidyverse. What programming language is she using?
Python
R
Ruby
C++
B. While this scenario describes a system integration challenge that can be solved with either ETL or ELT, Lila is facing a record linkage issue. See Chapter 8 for more information on this topic.
C. The null hypothesis presumes the status quo. Rob is testing whether people who receive an electronic coupon spend more on average, so the null hypothesis states that people who receive the coupon do not spend more on average. See Chapter 5 for more information on this topic.
D. QuickSight, Tableau, and Power BI are all powerful analytics and reporting tools that can pull data from a variety of sources. SPSS Modeler is a machine learning package that would not be used to create a dashboard. See Chapter 6 for more information on this topic.
C. The INSERT command is used to add new records to a database table. The SELECT command is used to retrieve information from a database; it is the most commonly used command in SQL because it poses queries to the database and retrieves the data you are interested in working with. The UPDATE command is used to modify rows in the database. The ALTER command is used to change the structure of an existing table. See Chapter 6 for more information on this topic.
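To make these commands concrete, here is a minimal sketch using Python's built-in sqlite3 module; the pets table and its columns are invented for illustration and do not come from the book.

```python
import sqlite3

# In-memory database for illustration; the table and data are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: CREATE defines the table structure.
cur.execute("CREATE TABLE pets (name TEXT, species TEXT)")

# DML: INSERT adds new rows (the answer to this question).
cur.execute("INSERT INTO pets VALUES ('Bella', 'dog')")
cur.execute("INSERT INTO pets VALUES ('Max', 'cat')")

# DML: UPDATE modifies existing rows.
cur.execute("UPDATE pets SET species = 'dog' WHERE name = 'Max'")

# DML: SELECT retrieves rows.
print(cur.execute("SELECT * FROM pets").fetchall())
# [('Bella', 'dog'), ('Max', 'dog')]

conn.close()
```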
A. While invalid, redundant, and missing data are all valid concerns, data about the same people exists in two of the six systems. As such, Daniel needs to account for duplicate data issues. See Chapter 4 for more information on this topic.
B. This scenario presents a very simple use case where the business leader needs a dataset in an easy-to-access form and will not be performing any detailed analysis. A simple spreadsheet, such as Microsoft Excel, would be the best tool for this job. There is no need to use a statistical analysis package, such as SAS or Minitab, as this would likely confuse the VP without adding any value. The same is true of an integrated analytics suite, such as Power BI. See Chapter 6 for more information on this topic.
C. Data warehouses bring together data from multiple systems used by an organization. A data mart is too narrow, as Alexander needs data from across multiple divisions. OLAP is a broad term for analytical processing, and OLTP systems are transactional and not ideal for this task. See Chapter 3 for more information on this topic.
C. Since the dimension links to a dimension that isn't connected to the fact table, it must be a snowflake. With a star, all dimensions link directly to the fact table. Sun and Comet are not data warehouse design patterns. See Chapter 3 for more information on this topic.
B. To provide maximum protection, encrypt data both in transit and at rest. See Chapter 8 for more information on this topic.
D. The Data Manipulation Language (DML) is used to work with the data stored in a database. DML includes the SELECT, INSERT, UPDATE, and DELETE commands. The Data Definition Language (DDL) contains the commands used to create and structure a relational database. It includes the CREATE, ALTER, and DROP commands. DDL and DML are the only two sublanguages of SQL. See Chapter 6 for more information on this topic.
B. A data steward is responsible for leading an organization's data governance activities, which include data quality, security, privacy, and regulatory compliance. See Chapter 8 for more information on this topic.
A. Jen does not have data for the entire population of all undergraduate sophomores. While a specific grade point average is an observation of a variable, Jen needs sample data. See Chapter 5 for more information on this topic.
C. The R programming language is used to manipulate and model data. In the ETL process, this activity normally takes place during the Transform phase. The extract and load phases typically use database-centric tools. Purging data from a database is typically done using SQL. See Chapter 3 for more information on this topic.
A. Power BI, Domo, and MicroStrategy are all analytics suites offering features that fill many different needs within the analytics process. Minitab is a statistical analysis package that lacks many of these capabilities. See Chapter 6 for more information on this topic.
B. Eye color can take on only a limited set of values; as such, it is categorical. See Chapter 2 for more information on this topic.
B. Since the value is more than four times the average, the $938,294 value is an outlier. See Chapter 4 for more information on this topic.
C. The median is the middle observation of a variable and is, therefore, a measure of central tendency. Interquartile range is a measure of position. Range and standard deviation are both measures of dispersion. See Chapter 5 for more information on this topic.
D. While MFA is a good security practice, it doesn't govern access to data. Creating a single role for her team and assigning that role to the individuals on the team is the best approach. See Chapter 8 for more information on this topic.
B. To find the median, sort the numbers in your dataset and find the one located in the middle. In this case, there is an even number of observations, so we take the two middle numbers (4 and 5) and use their average as the median, making the median value 4.5. The mode is 4, the range is 63, and the number of observations is 18. See Chapter 5 for more information on this topic.
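As a quick sanity check (our own addition, not part of the original answer), Python's standard library statistics module reproduces each figure:

```python
import statistics

values = [13, 2, 65, 3, 5, 4, 7, 3, 4, 7, 8, 2, 4, 4, 60, 23, 43, 2]

# With an even number of observations, the median is the average of the
# two middle values after sorting (here, 4 and 5).
print(statistics.median(values))   # 4.5
print(statistics.mode(values))     # 4 (most frequent value)
print(max(values) - min(values))   # 63 (range)
print(len(values))                 # 18 (number of observations)
```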
D. Since Lewis needs to migrate changes every hour, a delta load is the best approach. See Chapter 3 for more information on this topic.
B. Transposing the data by product category allows Carlos to perform his analysis broken out by product category. Transposing by sales price, region, or product name will not further his stated analytical goal. See Chapter 4 for more information on this topic.
A. According to the empirical rule, 68% of values are within one standard deviation, 95% are within two standard deviations, and 99.7% are within three standard deviations. See Chapter 5 for more information on this topic.
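These percentages can be verified with Python's statistics.NormalDist (a check we've added, assuming a standard normal distribution):

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

for k in (1, 2, 3):
    # Probability of falling within k standard deviations of the mean
    within = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"Within {k} standard deviation(s): {within:.2%}")

# Within 1 standard deviation(s): 68.27%
# Within 2 standard deviation(s): 95.45%
# Within 3 standard deviation(s): 99.73%
```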
A. Prices are numbers stored in dollars and cents; as such, the data type needs to be capable of storing numbers. See Chapter 2 for more information on this topic.
A. Accuracy measures how well an attribute's value matches its intended use. Consistency measures whether an attribute's value agrees across systems. Validity ensures an attribute's value falls within an expected range. While all of these dimensions are important, completeness is foundational to Harrison's purpose: he cannot follow up with respondents whose answers are missing. See Chapter 4 for more information on this topic.
B. The tidyverse is a collection of packages for the R programming language designed to facilitate the analytics workflow. The tidyverse is not available for Python, Ruby, or C++, all of which are general-purpose programming languages. See Chapter 6 for more information on this topic.
Analytics is at the heart of modern business. Virtually every organization collects large quantities of data about its customers, products, employees, and service offerings. Managers naturally seek to analyze that data and harness the information it contains to improve the efficiency, effectiveness, and profitability of their work.
Data analysts are the professionals who possess the skills and knowledge required to perform this vital work. They understand how the organization can acquire, clean, and transform data to meet the organization's needs. They are able to take that collected information and analyze it using the techniques of statistics and machine learning. They may then create powerful visualizations that display this data to business leaders, managers, and other stakeholders.
We are fortunate to live in the Golden Age of Analytics. Businesses around the world recognize the vital nature of data to their work and are investing heavily in analytics programs designed to give them a competitive advantage. Organizations have been collecting this data for years, and many of the statistical tools and techniques used in analytics work date back decades. But if that's the case, why are we just now in the early years of this Golden Age? Figure 1.1 shows the three major pillars that have come together at this moment to allow analytics programs to thrive: data, storage, and computing power.
The amount of data the modern world generates on a daily basis is staggering. From the organized tables of spreadsheets to the storage of photos, video, and audio recordings, modern businesses create an almost overwhelming avalanche of data that is ripe for use in analytics programs.
Let's try to quantify the amount of data that exists in the world. We'll begin with an estimate made by Google's then-CEO Eric Schmidt in 2010. At a technology conference, Schmidt estimated that the sum total of all of the stored knowledge created by the world at that point in time was approximately 5 exabytes. To give that a little perspective, the file containing the text of this chapter is around 100 kilobytes. So, Schmidt's estimate is that the world in 2010 had total knowledge that is about the size of 50,000,000,000,000 (that's 50 trillion!) copies of this book chapter. That's a staggering number, but it's only the beginning of our journey.
FIGURE 1.1 Analytics is made possible by modern data, storage, and computing capabilities.
Now fast-forward just two years to 2012. In that year, researchers estimated that the total amount of stored data in the world had grown to 1,000 exabytes (or one zettabyte). Remember, Schmidt's estimate of 5 exabytes was made only two years earlier. In just two years, the total amount of stored data in the world grew by a factor of 200! But we're still not finished!
In the year 2020, IDC estimates that the world created 59 zettabytes (or 59,000 exabytes) of new information. Compare that to Schmidt's estimate of the world having a total of 5 exabytes of stored information in 2010. If you do the math, you'll discover that this means that on any given day in the modern era, the world generates an amount of brand-new data that is approximately 32 times the sum total of all information created from the dawn of civilization until 2010! Now, that is a staggering amount of data!
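To make the arithmetic behind that claim concrete, here is a back-of-the-envelope check in Python; the 5 EB and 59 ZB figures come from the estimates above, while the daily-rate calculation is our own illustration:

```python
# Back-of-the-envelope check of the data-growth claim above.
EXABYTE = 10**18    # bytes
ZETTABYTE = 10**21  # bytes

knowledge_2010 = 5 * EXABYTE    # Schmidt's estimate of all stored knowledge, 2010
created_2020 = 59 * ZETTABYTE   # IDC's estimate of new data created in 2020

per_day_2020 = created_2020 / 365      # average new data per day in 2020
ratio = per_day_2020 / knowledge_2010  # daily 2020 output vs. all pre-2010 knowledge

print(f"New data per day in 2020: {per_day_2020 / EXABYTE:.1f} EB")  # ~161.6 EB
print(f"Multiple of all pre-2010 knowledge: {ratio:.0f}x")           # ~32x
```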
From an analytics perspective, this trove of data is a gold mine of untapped potential.
The second key trend driving the growth of analytics programs is the increased availability of storage at rapidly decreasing costs. Table 1.1 shows the cost of storing a gigabyte of data in different years using magnetic hard drives.
TABLE 1.1 Gigabyte storage costs over time
Year | Cost per GB
1985 | $169,900
1990 | $53,940
1995 | $799
2000 | $17.50
2005 | $0.62
2010 | $0.19
2015 | $0.03
2020 | $0.01
Figure 1.2 shows the same data plotted as a line graph on a logarithmic scale. This visualization clearly demonstrates the fact that storage costs have plummeted to the point where storage is almost free and businesses can afford to retain data for analysis in ways that they never have before.
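As a sketch of how this kind of visualization might be produced (matplotlib is our assumption; the book does not prescribe a tool), the Table 1.1 data can be plotted on a logarithmic y-axis:

```python
import matplotlib.pyplot as plt

# Data from Table 1.1: cost (USD) of one gigabyte of magnetic storage, by year
years = [1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020]
cost_per_gb = [169_900, 53_940, 799, 17.50, 0.62, 0.19, 0.03, 0.01]

fig, ax = plt.subplots()
ax.plot(years, cost_per_gb, marker="o")
ax.set_yscale("log")  # a log scale keeps a seven-order-of-magnitude drop readable
ax.set_xlabel("Year")
ax.set_ylabel("Cost per GB (USD, log scale)")
ax.set_title("Gigabyte storage costs over time")
plt.show()
```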
In 1975, Gordon Moore, one of the co-founders of Intel Corporation, made a prediction that computing technology would continue to advance so quickly that manufacturers would be able to double the number of components placed on an integrated circuit every two years. Remarkably, that prediction has stood the test of time and remains accurate today.
Commonly referred to as Moore's Law, this prediction is often loosely interpreted to mean that we will double the amount of computing power on a single device every two years. That trend has benefited many different technology-enabled fields, among them the world of analytics.
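A quick illustration of what doubling every two years implies (our own arithmetic, not the book's):

```python
# Moore's Law, loosely interpreted: computing power doubles every two years.
def moores_law_multiplier(start_year: int, end_year: int,
                          doubling_years: float = 2.0) -> float:
    """Growth factor implied by one doubling every `doubling_years` years."""
    return 2 ** ((end_year - start_year) / doubling_years)

# From the 1975 prediction to 2022 is 47 years, about 23.5 doublings:
print(f"{moores_law_multiplier(1975, 2022):,.0f}x")  # roughly 11,900,000x
```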
In the early days of analytics, computing power was costly and difficult to come by. Organizations with advanced analytics needs purchased massive supercomputers to analyze their data, but those supercomputers were scarce resources. Analysts fortunate enough to work in an organization that possessed a supercomputer had to justify their requests for small slices of time when they could use the powerful machines.
Today, the effects of Moore's Law have democratized computing. Most employees in an organization now have enough computing power sitting on their desks to perform a wide variety of analytic tasks. If they require more powerful computing resources, cloud services allow them to rent massive banks of computers at very low cost. Even better, those resources are charged at hourly rates, and analysts pay only for the computing time that they actually use.
FIGURE 1.2 Storage costs have decreased over time.