CompTIA Data+ Study Guide - Mike Chapple - E-Book

Description

Build a solid foundation in data analysis skills and pursue a coveted Data+ certification with this intuitive study guide.

CompTIA Data+ Study Guide: Exam DA0-001 delivers easily accessible and actionable instruction for achieving the data analysis competencies required on the job and on the CompTIA Data+ certification exam. You'll learn to collect, analyze, and report on various types of commonly used data, transforming raw data into usable information for stakeholders and decision makers. With comprehensive coverage of data concepts and environments, data mining, data analysis, visualization, and data governance, quality, and controls, this Study Guide offers:

* All the information necessary to succeed on the exam for a widely accepted, entry-level credential that unlocks lucrative new data analytics and data science career opportunities
* 100% coverage of the objectives for the NEW CompTIA Data+ exam
* Access to the Sybex online learning resources, with review questions, a full-length practice exam, hundreds of electronic flashcards, and a glossary of key terms

Ideal for anyone seeking a new career in data analysis, looking to improve their current data science skills, or hoping to achieve the coveted CompTIA Data+ certification credential, CompTIA Data+ Study Guide: Exam DA0-001 provides an invaluable head start toward beginning or accelerating a career as an in-demand data analyst.

Page count: 491

Year of publication: 2022




Table of Contents

Cover

Title Page

Copyright

Dedication

Acknowledgments

About the Authors

About the Technical Editor

Introduction

The Data+ Exam

What Does This Book Cover?

Exam DA0-001 Exam Objectives

DA0-001 Certification Exam Objective Map

Assessment Test

Answers to Assessment Test

Chapter 1: Today's Data Analyst

Welcome to the World of Analytics

Careers in Analytics

The Analytics Process

Analytics Techniques

Data Governance

Analytics Tools

Summary

Chapter 2: Understanding Data

Exploring Data Types

Common Data Structures

Common File Formats

Summary

Exam Essentials

Review Questions

Chapter 3: Databases and Data Acquisition

Exploring Databases

Database Use Cases

Data Acquisition Concepts

Working with Data

Summary

Exam Essentials

Review Questions

Chapter 4: Data Quality

Data Quality Challenges

Data Manipulation Techniques

Managing Data Quality

Summary

Exam Essentials

Review Questions

Chapter 5: Data Analysis and Statistics

Fundamentals of Statistics

Descriptive Statistics

Inferential Statistics

Analysis Techniques

Summary

Exam Essentials

Review Questions

Chapter 6: Data Analytics Tools

Spreadsheets

Programming Languages

Statistics Packages

Machine Learning

Analytics Suites

Summary

Exam Essentials

Review Questions

Chapter 7: Data Visualization with Reports and Dashboards

Understanding Business Requirements

Understanding Report Design Elements

Understanding Dashboard Development Methods

Exploring Visualization Types

Comparing Report Types

Summary

Exam Essentials

Review Questions

Chapter 8: Data Governance

Data Governance Concepts

Understanding Master Data Management

Summary

Exam Essentials

Review Questions

Appendix: Answers to the Review Questions

Chapter 2: Understanding Data

Chapter 3: Databases and Data Acquisition

Chapter 4: Data Quality

Chapter 5: Data Analysis and Statistics

Chapter 6: Data Analytics Tools

Chapter 7: Data Visualization with Reports and Dashboards

Chapter 8: Data Governance

Index

End User License Agreement

List of Tables

Chapter 1

TABLE 1.1 Gigabyte storage costs over time

TABLE 1.2 Highest-demand occupations

Chapter 2

TABLE 2.1 Pet data

TABLE 2.2 Selected character data types and maximum size

TABLE 2.3 Selected integer data types and value range

TABLE 2.4 Selected integer data types and value range

TABLE 2.5 Selected date and time data types

TABLE 2.6 Currency data types in Microsoft SQL Server

TABLE 2.7 Selected binary data types and maximum sizes

TABLE 2.8 Approximate storage needs

TABLE 2.9 Selected binary data types and maximum sizes

Chapter 3

TABLE 3.1 Animal data

TABLE 3.2 Person Data

TABLE 3.3 AnimalPerson table

TABLE 3.4 Appointment booking

TABLE 3.5 Product dimension

TABLE 3.6 Current flag

TABLE 3.7 Effective date

TABLE 3.8 Data manipulation in SQL

TABLE 3.9 Augmented animal data

TABLE 3.10 Sex Lookup Table

TABLE 3.11 Desired Query Results

TABLE 3.12 Modified IFF query results

TABLE 3.13 Common SQL aggregate functions

Chapter 5

TABLE 5.1 Common symbols in statistics

TABLE 5.2 Selected implementations of count

TABLE 5.3 Sample data

TABLE 5.4 Exploring percentages

TABLE 5.5 Institutional control percentage

TABLE 5.6 Standard deviation performance levels

TABLE 5.7 Illustrating degrees of freedom

TABLE 5.8 Illustrating interquartile range

TABLE 5.9 Illustrating lower and upper limits for interquartiles

TABLE 5.10 Confidence level and critical value for normally distributed data...

TABLE 5.11 Calculating Z-score

TABLE 5.12 Potential hypothesis test outcomes

TABLE 5.13 Categorical variables

TABLE 5.14 Medical research correlation coefficient guidelines

Chapter 7

TABLE 7.1 Color codes for General Electric Blue

TABLE 7.2 Sample daily sales report version history

Chapter 8

TABLE 8.1 Sample data classification matrix

TABLE 8.2 Breach notification requirements

List of Illustrations

Chapter 1

FIGURE 1.1 Analytics is made possible by modern data, storage, and computing...

FIGURE 1.2 Storage costs have decreased over time.

FIGURE 1.3 The analytics process

FIGURE 1.4 Table of college tuition data

FIGURE 1.5 Visualization of college tuition data

FIGURE 1.6 The relationship between artificial intelligence, machine learnin...

FIGURE 1.7 Data analysis in Microsoft Excel

FIGURE 1.8 Data analysis in RStudio

Chapter 2

FIGURE 2.1 Person Data

FIGURE 2.2 SKU example

FIGURE 2.3 U.S. QWERTY keyboard layout

FIGURE 2.4 Excel text-only formula

FIGURE 2.5 Text-only data validation restriction

FIGURE 2.6 Numeric data formatted as different currencies

FIGURE 2.7 Currency formats in Google Sheets

FIGURE 2.8 Incorrect percentage due to money data type

FIGURE 2.9 Correct percentage as calculated by a spreadsheet

FIGURE 2.10 Unrounded calculation in a calculator

FIGURE 2.11 Correct percentage when using a numeric data type

FIGURE 2.12 Images in Excel

FIGURE 2.13 Binary data in a Google spreadsheet

FIGURE 2.14 Photograph of racing motorcycles

FIGURE 2.15 Motorcycle search results

FIGURE 2.16 Discrete data example

FIGURE 2.17 Continuous data example

FIGURE 2.18 Dimension illustration

FIGURE 2.19 Data entry error

FIGURE 2.20 Data entry error identified in a summary

FIGURE 2.21 Pet ID as a key

FIGURE 2.22 Unstructured data: log entry

FIGURE 2.23 Files in object storage

FIGURE 2.24 Keys and values in object storage

FIGURE 2.25 Exporting as CSV or TSV

FIGURE 2.26 Contents of a CSV file

FIGURE 2.27 Semi-structured CSV

FIGURE 2.28 Fixed-width file

FIGURE 2.29 Pet data JSON example

FIGURE 2.30 Reading JSON in Python

FIGURE 2.31 Reading JSON in R

FIGURE 2.32 Representing a single animal in XML

FIGURE 2.33 Representing a single animal in HTML

FIGURE 2.34 HTML table in a browser

FIGURE 2.35 Displaying an image in an HTML table

Chapter 3

FIGURE 3.1 The (a) Animal entity and (b) Person entity in a veterinary datab...

FIGURE 3.2 Relationship connecting Animal and Person

FIGURE 3.3 ERD line endings

FIGURE 3.4 Unary relationship

FIGURE 3.5 Entity relationship diagram

FIGURE 3.6 Database schema

FIGURE 3.7 Email template

FIGURE 3.8 Reminder appointment email

FIGURE 3.9 Referential integrity illustration

FIGURE 3.10 Customer ERD

FIGURE 3.11 Foreign key data constraint

FIGURE 3.12 JSON person data

FIGURE 3.13 Data in a graph

FIGURE 3.14 Vet clinic transactional schema

FIGURE 3.15 Data in first normal form

FIGURE 3.16 Data in second normal form

FIGURE 3.17 Data in third normal form

FIGURE 3.18 Yearly spend by animal

FIGURE 3.19 CostSummary table

FIGURE 3.20 Star schema example

FIGURE 3.21 OLTP and OLAP query example

FIGURE 3.22 Snowflake example

FIGURE 3.23 Delta load example

FIGURE 3.24 Internal API example

FIGURE 3.25 Single question survey

FIGURE 3.26 Qualtrics survey build

FIGURE 3.27 SQL SELECT statement

FIGURE 3.28 Hard-coded SQL query

FIGURE 3.29 Parameterized SQL query

Chapter 4

FIGURE 4.1 Duplicate data resolution process

FIGURE 4.2 Illustration of multiple data sources

FIGURE 4.3 Resolving redundancy with an integrated ETL process

FIGURE 4.4 Redundant data

FIGURE 4.5 Name change issue

FIGURE 4.6 Transactional design

FIGURE 4.7 Missing temperature value

FIGURE 4.8 Invalid temperature value

FIGURE 4.9 Pain rating scale

FIGURE 4.10 Real estate sales outlier

FIGURE 4.11 Automotive schema excerpt

FIGURE 4.12 List of automotive manufacturers

FIGURE 4.13 Patient pain data

FIGURE 4.14 Recoded patient pain data

FIGURE 4.15 Patient data

FIGURE 4.16 Deriving age

FIGURE 4.17 Merging disparate data

FIGURE 4.18 ETL and the data merge approach

FIGURE 4.19 Extract, transform, and load process

FIGURE 4.20 Data blending

FIGURE 4.21 Creating a date variable with concatenation

FIGURE 4.22 Appending weather data

FIGURE 4.23 Weight log with missing values

FIGURE 4.24 Various imputation techniques

FIGURE 4.25 Dimensionality reduction example

FIGURE 4.26 Numerosity reduction with histograms

FIGURE 4.27 Sampling and histograms

FIGURE 4.28 Summary statistics

FIGURE 4.29 Sales data

FIGURE 4.30 Transposed sales data

FIGURE 4.31 Combining transposition with aggregation

FIGURE 4.32 Raw athlete data

FIGURE 4.33 Scaled athlete data

FIGURE 4.34 Splitting a composite column

FIGURE 4.35 Combining multiple columns

FIGURE 4.36 Improving data quality with string manipulation

FIGURE 4.37 Data source description

FIGURE 4.38 Three opportunities to impact data quality

FIGURE 4.39 Address standardization

FIGURE 4.40 Data conversion issue

FIGURE 4.41 Data manipulation issue

FIGURE 4.42 Ensuring consistency

FIGURE 4.43 Using uniqueness to improve consistency

FIGURE 4.44 Referential integrity and data validity

FIGURE 4.45 Multiple source systems and the potential for nonconformity

FIGURE 4.46 Reprocessing of bad data

Chapter 5

FIGURE 5.1 Weight log

FIGURE 5.2 Histogram of average SAT score for U.S. institutions of higher ed...

FIGURE 5.3 Histograms of SAT averages and institutional control

FIGURE 5.4A Mean salary data

FIGURE 5.4B Effect of an outlier on the mean

FIGURE 5.5A Calculating median salary data

FIGURE 5.5B Effect of an outlier on the median

FIGURE 5.6 Categorical question

FIGURE 5.7 Normal distribution

FIGURE 5.8 Right skewed distribution

FIGURE 5.9 Left skewed distribution

FIGURE 5.10 Bimodal distribution

FIGURE 5.11 Temperature variance

FIGURE 5.12 Standard normal distribution and t-distribution

FIGURE 5.13 Sales prices for 30,000 homes

FIGURE 5.14 Samples in a population

FIGURE 5.15 Samples at a 95 percent confidence level and population mean

FIGURE 5.16 Confidence levels and the normal distribution

FIGURE 5.17 Visualizing alpha

FIGURE 5.18 One-tailed test

FIGURE 5.19 Two-tailed test

FIGURE 5.20 Sample statistic and p-value

FIGURE 5.21 Hypothesis testing flowchart

FIGURE 5.22 Insignificant sample mean

FIGURE 5.23 Simple linear regression of age and BMI

FIGURE 5.24 Linear model output of age and BMI

FIGURE 5.25 Simple linear regression of speed and distance

FIGURE 5.26 Highly correlated data

Chapter 6

FIGURE 6.1 Table of data in Microsoft Excel

FIGURE 6.2 Data visualization in Microsoft Excel

FIGURE 6.3 Data analysis using R and RStudio

FIGURE 6.4 Data analysis using Python and pandas

FIGURE 6.5 SQL query using Azure Data Studio

FIGURE 6.6 Calculating correlations in SPSS

FIGURE 6.7 Analyzing data in SAS

FIGURE 6.8 Building a simple model in Stata

FIGURE 6.9 Evaluating a regression model in Minitab

FIGURE 6.10 Designing a machine learning task in SPSS Modeler

FIGURE 6.11 Exploring a data model in SPSS Modeler

FIGURE 6.12 Designing a machine learning task in RapidMiner

FIGURE 6.13 Exploring a data model in RapidMiner

FIGURE 6.14 Exploring data in IBM Cognos

FIGURE 6.15 Communicating a story with data in Microsoft Power BI

FIGURE 6.16 Building a dashboard in MicroStrategy

FIGURE 6.17 Analyzing a dataset in Domo

FIGURE 6.18 Communicating a data story in Datorama

FIGURE 6.19 Building a dashboard in AWS QuickSight

FIGURE 6.20 Visualizing data in Tableau

FIGURE 6.21 Visualizing data in Qlik Sense

Chapter 7

FIGURE 7.1 Historical reporting

FIGURE 7.2 Real-time reporting

FIGURE 7.3 Poor font choice

FIGURE 7.4 Porsche AG Annual Report cover page

FIGURE 7.5 Porsche AG Annual Report executive summary

FIGURE 7.6 Black monochromatic color palette

FIGURE 7.7 Nonparallel construction

FIGURE 7.8 Parallel construction

FIGURE 7.9 Serif and sans serif fonts

FIGURE 7.10 Sample layout elements and font sizes

FIGURE 7.11 Sample chart

FIGURE 7.12 Sample chart with legend

FIGURE 7.13 General Electric logo

FIGURE 7.14 Sample chart with watermark

FIGURE 7.15 Sample public-facing dashboard

FIGURE 7.16 Data sourcing flowchart

FIGURE 7.17 Sales funnel

FIGURE 7.18 Line chart

FIGURE 7.19 Pie chart

FIGURE 7.20 Bar chart by genre

FIGURE 7.21 Bar chart by count

FIGURE 7.22 Stacked bar chart

FIGURE 7.23 Average duration scatterplot

FIGURE 7.24 Budget and box office gross scatterplot

FIGURE 7.25 Budget and box office gross scatter and line chart

FIGURE 7.26 Budget and average box office gross scatter and line chart

FIGURE 7.27 Budget and average box office gross bubble and line chart

FIGURE 7.28 Histogram of box office gross

FIGURE 7.29 Histogram of movie duration

FIGURE 7.30 Large U.S. colleges and universities

FIGURE 7.31 Risk heat map

FIGURE 7.32 Correlation heat map

FIGURE 7.33 Tree map of movie genres

FIGURE 7.34 Tree map of the comedy genre

FIGURE 7.35 Headcount waterfall chart

FIGURE 7.36 Infographic

FIGURE 7.37 Positive word cloud

FIGURE 7.38 Negative word cloud

Chapter 8

FIGURE 8.1 Organizational example

FIGURE 8.2 Access roles over time

FIGURE 8.3 Danger of user-based access

FIGURE 8.4 Sample organization chart

FIGURE 8.5 Sample user group–based roles

FIGURE 8.6 Encrypted network connection

FIGURE 8.7 HTTPS padlock

FIGURE 8.8 Encrypted ETL process

FIGURE 8.9 Data masking ETL process

FIGURE 8.10 Reidentification by combining datasets

FIGURE 8.11 Local encryption process flow

FIGURE 8.12 Record linkage example

FIGURE 8.13 Duplicate resolution process

FIGURE 8.14 MDM for person data



CompTIA® Data+® Study Guide

Exam DA0-001

 

 

Mike Chapple

Sharif Nijim

 

 

Copyright © 2022 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

ISBN: 978-1-119-84525-6
ISBN: 978-1-119-84527-0 (ebk.)
ISBN: 978-1-119-84526-3 (ebk.)

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware the Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Control Number: 2022930191

Trademarks: WILEY, the Wiley logo, Sybex and the Sybex logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. CompTIA and Data+ are registered trademarks of CompTIA, Inc. Dr. Ing. h. c. F. Porsche AG is the owner of numerous trademarks, both registered and unregistered, including without limitation the Porsche Crest, Porsche, Boxster, Carrera, Cayenne, Cayman, Panamera, Taycan, 911, 718, and the model numbers and distinctive shapes of Porsche automobiles such as the 911 and Boxster automobiles in the United States; these are used with permission of Porsche Cars North America, Inc. and Dr. Ing. h. c. F. Porsche AG. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Cover images: © Jeremy Woodhouse/Getty Images

Cover design: Wiley

To my aspiring engineer, Chris. Your mother and I are so proud of you and can't wait to see all of the incredible things that you accomplish!

—Mike

To my parents, Basheer and Germana. Thank you for your constant love and support, and for showing me how to live.

—Sharif

Acknowledgments

Books like this involve work from many people, and as authors, we truly appreciate the hard work and dedication that the team at Wiley shows. We would especially like to thank senior acquisitions editor Kenyon Brown. We have worked with Ken on multiple projects and consistently enjoy our work with him.

We also greatly appreciated the editing and production team for the book. First and foremost, we'd like to thank our friend and colleague, Dr. Jen Waddell. Jen provided us with invaluable insight as we worked our way through the many challenges inherent in putting out a book covering a brand-new certification. Jen's a whiz at statistics and analytics, and we couldn't have completed this book without her support. We also benefited greatly from the work of two of our students at Notre Dame. Ricky Chapple did a final read-through of this book to ensure that it was ready to go, and Matthew Howard helped create the instructor materials that accompany this book.

We'd also like to thank the many people who helped us make this project successful, including Adaobi Obi Tulton, our project editor, who brought years of experience and great talent to the project, and Barath Kumar Rajasekaran, our content refinement specialist, who guided us through layouts, formatting, and final cleanup to produce a great book. We would also like to thank the many behind-the-scenes contributors, including the graphics, production, and technical teams who make the book and companion materials into a finished product.

Our agent, Carole Jelen of Waterside Productions, continues to provide us with wonderful opportunities, advice, and assistance throughout our writing careers.

Finally, we would like to thank our families who support us through the late evenings, busy weekends, and long hours that a book like this requires to write, edit, and get to press.

About the Authors

Mike Chapple, Ph.D., CySA+, is author of the best-selling CISSP (ISC)² Certified Information Systems Security Professional Official Study Guide (Sybex, 2021) and the CISSP (ISC)² Official Practice Tests (Sybex, 2021). He is an information technology professional with two decades of experience in higher education, the private sector, and government.

Mike currently serves as a teaching professor in the IT, Analytics, and Operations department at the University of Notre Dame's Mendoza College of Business, where he teaches undergraduate and graduate courses on cybersecurity, data management, and business analytics.

Before returning to Notre Dame, Mike served as executive vice president and chief information officer of the Brand Institute, a Miami-based marketing consultancy. Mike also spent four years in the information security research group at the National Security Agency and served as an active duty intelligence officer in the U.S. Air Force.

Mike has written more than 30 books. He earned both his B.S. and Ph.D. degrees from Notre Dame in computer science and engineering. Mike also holds an M.S. in computer science from the University of Idaho and an MBA from Auburn University.

Learn more about Mike and his other certification materials at CertMike.com.

Sharif Nijim is an assistant teaching professor in the IT, Analytics, and Operations department at the Mendoza College of Business at the University of Notre Dame, where he teaches undergraduate and graduate courses in business analytics and information technology.

Prior to Notre Dame, Sharif co-founded and served on the board of a customer data integration company serving the airline industry. Sharif also spent more than a decade building and optimizing enterprise-class transactional and decision support systems for clients in the energy, healthcare, hospitality, insurance, logistics, manufacturing, real estate, telecommunications, and travel and transportation sectors.

Sharif earned both his B.B.A. and M.S. from the University of Notre Dame.

About the Technical Editor

Jennifer Waddell is a teaching professor, assistant department chair, and the director of undergraduate studies in the IT, Analytics, and Operations department at the University of Notre Dame, specializing in the areas of statistical methodology and analytics. Over the last 20 years, she has educated students at the undergraduate, graduate, and executive levels in these disciplines, focusing on their theoretical understanding as well as technical skill implementation. In addition to her time in the classroom, she has worked as a statistical consultant on research projects in healthcare systems and educational services.

Introduction

If you're preparing to take the Data+ exam, you'll undoubtedly want to find as much information as you can about data and analytics. The more information you have at your disposal and the more hands-on experience you gain, the better off you'll be when attempting the exam. This study guide was written with that in mind. The goal was to provide enough information to prepare you for the test, but not so much that you'll be overloaded with information that's outside the scope of the exam.

We've included review questions at the end of each chapter to give you a taste of what it's like to take the exam. If you're already working in the data field, we recommend that you check out these questions first to gauge your level of expertise. You can then use the book mainly to fill in the gaps in your current knowledge. This study guide will help you round out your knowledge base before tackling the exam.

If you can answer 90 percent or more of the review questions correctly for a given chapter, you can feel safe moving on to the next chapter. If you're unable to answer that many correctly, reread the chapter and try the questions again. Your score should improve.

Don't just study the questions and answers! The questions on the actual exam will be different from the practice questions included in this book. The exam is designed to test your knowledge of a concept or objective, so use this book to learn the objectives behind the questions.

The Data+ Exam

The Data+ exam is designed to be a vendor-neutral certification for data professionals and those seeking to enter the field. CompTIA recommends this certification for those currently working, or aspiring to work, in data analyst and business intelligence reporting roles.

The exam covers five major domains:

Data Concepts and Environments

Data Mining

Data Analysis

Visualization

Data Governance, Quality, and Controls

These five areas include a range of topics, from data types to statistical analysis and from data visualization to tools and techniques, while focusing heavily on scenario-based learning. That's why CompTIA recommends that those attempting the exam have 18–24 months of hands-on work experience, although many individuals pass the exam before moving into their first data analysis role.

The Data+ exam is conducted in a format that CompTIA calls “performance-based assessment.” This means that the exam combines standard multiple-choice questions with other, interactive question formats. Your exam may include several types of questions such as multiple-choice, fill-in-the-blank, multiple-response, drag-and-drop, and image-based problems. More details about the Data+ exam and how to take it can be found here:

http://www.comptia.org/certifications/data

You'll have 90 minutes to take the exam and will be asked to answer 90 questions during that time period. Your exam will be scored on a scale ranging from 100 to 900, with a passing score of 675.

You should also know that CompTIA is notorious for including vague questions on all of its exams. You might see a question for which two of the possible four answers are correct—but you can choose only one. Use your knowledge, logic, and intuition to choose the best answer and then move on. Sometimes, the questions are worded in ways that would make English majors cringe—a typo here, an incorrect verb there. Don't let this frustrate you; answer the question and move on to the next one.

CompTIA frequently does what is called item seeding, which is the practice of including unscored questions on exams. It does so to gather psychometric data, which is then used when developing new versions of the exam. Before you take the exam, you will be told that your exam may include these unscored questions. So, if you come across a question that does not appear to map to any of the exam objectives—or for that matter, does not appear to belong in the exam—it is likely a seeded question. You never really know whether or not a question is seeded, however, so always make your best effort to answer every question.

Taking the Exam

Once you are fully prepared to take the exam, you can visit the CompTIA website to purchase your exam voucher:

https://store.comptia.org/Certification-Vouchers/c/11293

Currently, CompTIA offers two options for taking the exam: an in-person exam at a testing center and an at-home exam that you take on your own computer.

This book includes a coupon that you may use to save 10 percent on your CompTIA exam registration.

In-Person Exams

CompTIA partners with Pearson VUE's testing centers, so your next step will be to locate a testing center near you. In the United States, you can do this based on your address or your ZIP code, while non-U.S. test takers may find it easier to enter their city and country. You can search for a test center near you at the Pearson VUE website, where you will need to navigate to "Find a test center."

http://www.pearsonvue.com/comptia

Now that you know where you'd like to take the exam, simply set up a Pearson VUE testing account and schedule an exam on their site.

On the day of the test, take two forms of identification, and make sure to show up with plenty of time before the exam starts. Remember that you will not be able to take your notes, electronic devices (including smartphones and watches), or other materials in with you.

At-Home Exams

CompTIA began offering online exam proctoring in response to the coronavirus pandemic. As of the time this book went to press, the at-home testing option was still available and appears likely to continue. Candidates using this approach will take the exam at their home or office and be proctored over a webcam by a remote proctor.

Due to the rapidly changing nature of the at-home testing experience, candidates wishing to pursue this option should check the CompTIA website for the latest details.

After the Data+ Exam

Once you have taken the exam, you will be notified of your score immediately, so you'll know if you passed the test right away. You should keep track of your score report with your exam registration records and the email address you used to register for the exam.

What Does This Book Cover?

This book covers everything you need to know to pass the Data+ exam.

Chapter 1: Today's Data Analyst

Chapter 2: Understanding Data

Chapter 3: Databases and Data Acquisition

Chapter 4: Data Quality

Chapter 5: Data Analysis and Statistics

Chapter 6: Data Analytics Tools

Chapter 7: Data Visualization with Reports and Dashboards

Chapter 8: Data Governance

Practice Exam 1

Practice Exam 2

Appendix: Answers to the Review Questions

Study Guide Elements

This study guide uses a number of common elements to help you prepare. These include the following:

Summaries

 The summary section of each chapter briefly recaps the chapter's content, allowing you to quickly review what it covers.

Exam Essentials

 The exam essentials focus on major exam topics and critical knowledge that you should take into the test, keyed to the exam objectives provided by CompTIA.

Chapter Review Questions

 A set of questions at the end of each chapter will help you assess your knowledge of that chapter's topics and determine whether you are ready to take the exam.

Interactive Online Learning Environment and Test Bank

This book comes with a number of additional study tools to help you prepare for the exam. They include the following.

Go to https://www.wiley.com/go/sybextestprep to register and gain access to this interactive online learning environment and test bank with study tools.

Sybex Test Preparation Software

Sybex's test preparation software lets you prepare with electronic test versions of the review questions from each chapter, the practice exam, and the bonus exam that are included in this book. You can build and take tests on specific domains, by chapter, or cover the entire set of Data+ exam objectives using randomized tests.

Electronic Flashcards

Our electronic flashcards are designed to help you prepare for the exam. Over 100 flashcards will ensure that you know critical terms and concepts.

Glossary of Terms

Sybex provides a full glossary of terms in PDF format, allowing quick searches and easy reference to materials in this book.

Bonus Practice Exams

In addition to the practice questions for each chapter, this book includes two full 90-question practice exams. We recommend that you use them both to test your preparedness for the certification exam.

Like all exams, the Data+ certification from CompTIA is updated periodically and may eventually be retired or replaced. At some point after CompTIA is no longer offering this exam, the old editions of our books and online tools will be retired. If you have purchased this book after the exam was retired or are attempting to register in the Sybex online learning environment after the exam was retired, please know that we make no guarantees that this exam's online Sybex tools will be available once the exam is no longer available.

Exam DA0-001 Exam Objectives

CompTIA goes to great lengths to ensure that its certification programs accurately reflect the IT industry's best practices. It does this by establishing committees for each of its exam programs. Each committee consists of a small group of IT professionals, training providers, and publishers who are responsible for establishing the exam's baseline competency level and who determine the appropriate target-audience level.

Once these factors are determined, CompTIA shares this information with a group of hand-selected subject matter experts (SMEs). These folks are the true brainpower behind the certification program. The SMEs review the committee's findings, refine them, and shape them into the objectives that follow this section. CompTIA calls this process a job-task analysis (JTA).

Finally, CompTIA conducts a survey to ensure that the objectives and weightings truly reflect job requirements. Only then can the SMEs go to work writing the hundreds of questions needed for the exam. Even so, they have to go back to the drawing board for further refinements in many cases before the exam is ready to go live in its final state. Rest assured that the content you're about to learn will serve you long after you take the exam.

CompTIA also publishes relative weightings for each of the exam's objectives. The following table lists the five Data+ objective domains and the extent to which they are represented on the exam.

Domain   % of Exam

1.0 Data Concepts and Environments   15%
2.0 Data Mining   25%
3.0 Data Analysis   23%
4.0 Visualization   23%
5.0 Data Governance, Quality, and Controls   14%

DA0-001 Certification Exam Objective Map

Objective (Chapter)

1.0 Data Concepts and Environments
1.1 Identify basic concepts of data schemas and dimensions (Chapter 3)
1.2 Compare and contrast different data types (Chapter 2)
1.3 Compare and contrast common data structures and file formats (Chapter 2)

2.0 Data Mining
2.1 Explain data acquisition concepts (Chapter 3)
2.2 Identify common reasons for cleansing and profiling datasets (Chapter 4)
2.3 Given a scenario, execute data manipulation techniques (Chapter 4)
2.4 Explain common techniques for data manipulation and query optimization (Chapter 3)

3.0 Data Analysis
3.1 Given a scenario, apply the appropriate descriptive statistical methods (Chapter 5)
3.2 Explain the purpose of inferential statistical methods (Chapter 5)
3.3 Summarize types of analysis and key analysis techniques (Chapter 5)
3.4 Identify common data analytics tools (Chapter 6)

4.0 Visualization
4.1 Given a scenario, translate business requirements to form a report (Chapter 7)
4.2 Given a scenario, use appropriate design components for reports and dashboards (Chapter 7)
4.3 Given a scenario, use appropriate methods for dashboard development (Chapter 7)
4.4 Given a scenario, apply the appropriate type of visualization (Chapter 7)
4.5 Compare and contrast types of reports (Chapter 7)

5.0 Data Governance, Quality, and Controls
5.1 Summarize important data governance concepts (Chapter 8)
5.2 Given a scenario, apply data quality control concepts (Chapter 4)
5.3 Explain master data management (MDM) concepts (Chapter 8)

Exam objectives are subject to change at any time without prior notice and at CompTIA's discretion. Please visit CompTIA's website (www.comptia.org) for the most current listing of exam objectives.

Assessment Test

Lila is aggregating data from a CRM system with data from an employee system. While performing an initial quality check, she realizes that her employee ID is not associated with her identifier in the CRM system. What kind of issue is Lila facing? Choose the best answer.

ETL process

Record linkage

ELT process

System integration

Rob is a pricing analyst for a retailer. Using a hypothesis test, he wants to assess whether people who receive electronic coupons spend more on average. What should Rob's null hypothesis be?

People who receive electronic coupons spend more on average.

People who receive electronic coupons spend less on average.

People who receive electronic coupons do not spend more on average.

People who do not receive electronic coupons spend more on average.

Tonya needs to create a dashboard that will draw information from many other data sources and present it to business leaders. Which one of the following tools is least likely to meet her needs?

QuickSight

Tableau

Power BI

SPSS Modeler

Ryan is using the Structured Query Language to work with data stored in a relational database. He would like to add several new rows to a database table. What command should he use?

SELECT

ALTER

INSERT

UPDATE

Daniel is working on an ELT process that sources data from six different source systems. Looking at the source data, he finds that data about the same people exists in two of the six systems. What does he have to make sure he checks for in his ELT process? Choose the best answer.

Duplicate data

Redundant data

Invalid data

Missing data

Samantha needs to share a list of her organization's top 50 customers with the VP of Sales. She would like to include the name of the customer, the business they represent, their contact information, and their total sales over the past year. The VP does not have any specialized analytics skills or software but would like to make some personal notes on the dataset. What would be the best tool for Samantha to use to share this information?

Power BI

Microsoft Excel

Minitab

SAS

Alexander wants to use data from his corporate sales, CRM, and shipping systems to try and predict future sales. Which of the following systems is most appropriate? Choose the best answer.

Data mart

OLAP

Data warehouse

OLTP

Jackie is working in a data warehouse and finds that a finance fact table links to an organization dimension, which in turn links to a currency dimension that is not linked to the fact table. What type of design pattern is the data warehouse using?

Star

Sun

Snowflake

Comet

Encryption is a mechanism for protecting data. When should encryption be applied to data? Choose the best answer.

When data is at rest

When data is at rest or in transit

When data is in transit

When data is at rest, unless you are using local storage

What subset of the Structured Query Language (SQL) is used to add, remove, modify, or retrieve the information stored within a relational database?

DDL

DSL

DQL

DML

Which of the following roles is responsible for ensuring an organization's data quality, security, privacy, and regulatory compliance?

Data owner

Data steward

Data custodian

Data processor

Jen wants to study the academic performance of undergraduate sophomores and wants to determine the average grade point average at different points during an academic year. What best describes the data set she needs?

Sample

Observation

Variable

Population

Mauro works with a group of R programmers tasked with copying data from an accounting system into a data warehouse. In what phase are the group's R skills most relevant?

Extract

Load

Transform

Purge

Which one of the following tools would not be considered a fully featured analytics suite?

Minitab

MicroStrategy

Domo

Power BI

Omar is conducting a study and wants to capture eye color. What kind of data is eye color? Choose the best response.

Discrete

Categorical

Continuous

Alphanumeric

Lars is looking at home sales prices in a single zip code and notices that one home sold for $938,294 when the average selling price of similar homes is $209,383. What type of data does the $938,294 sales price represent? Choose the best answer.

Duplicate data

Data outlier

Redundant data

Invalid data

Trianna wants to explore central tendency in her dataset. Which statistic best matches her need?

Interquartile range

Range

Median

Standard deviation

Shakira has 15 people on her data analytics team. Her team's charter requires that all team members have read access to the finance, human resources, sales, and customer service areas of the corporate data warehouse. What is the best way to provision access to her team? Choose the best answer.

Since there are 15 people on her team, create a role for each person to improve security.

Since there are four discrete data subjects, create one role for each subject area.

Enable multifactor authentication (MFA) to protect the data.

Create a single role that includes finance, human resources, sales, and customer service data.

What is the median of the following numbers?

13, 2, 65, 3, 5, 4, 7, 3, 4, 7, 8, 2, 4, 4, 60, 23, 43, 2

4

4.5

63

18

Lewis is designing an ETL process to copy sales data into a data warehouse on an hourly basis. What approach should Lewis choose that would be most efficient and minimize the chance of losing historical data?

Bulk load

Purge and load

Use ELT instead of ETL

Delta load

Carlos wants to analyze profit based on sales of five different product categories. His source data set consists of 5.8 million rows with columns including region, product category, product name, and sales price. How should he manipulate the data to facilitate his analysis? Choose the best answer.

Transpose by region and summarize.

Transpose by product category and summarize.

Transpose by product name and summarize.

Transpose by sales price and summarize.

According to the empirical rule, what percent of the values in a sample fall within three standard deviations of the mean in a normal distribution?

99.70%

95%

90%

68%

Martin is building a database to store prices for items on a restaurant menu. What data type is most appropriate for this field?

Numeric

Date

Text

Alphanumeric

Harrison is conducting a survey. He intends to distribute the survey via email and wants to optionally follow up with respondents based on their answers. What quality dimension is most vital to the success of Harrison's survey? Choose the best answer.

Completeness

Accuracy

Consistency

Validity

Mary is developing a script that will perform some common analytics tasks. In order to improve the efficiency of her workflow, she is using a package called the tidyverse. What programming language is she using?

Python

R

Ruby

C++

Answers to Assessment Test

B. While this scenario describes a system integration challenge that can be solved with either ETL or ELT, Lila is facing a record linkage issue. See Chapter 8 for more information on this topic.

C. The null hypothesis presumes the status quo. Rob is testing whether people who receive an electronic coupon spend more on average, so the null hypothesis states that people who receive the coupon do not spend more on average. See Chapter 5 for more information on this topic.
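Rob's test can be sketched as a simple one-sided permutation test. The spending figures below are invented for illustration; a real analysis would use his actual transaction data:

```python
import random
import statistics

# Hypothetical spending amounts (invented for illustration)
coupon = [54.2, 61.8, 49.9, 70.3, 58.1, 66.4, 62.0, 57.5]
no_coupon = [48.7, 52.3, 45.1, 55.0, 50.2, 47.8, 53.6, 49.4]

# Observed difference in mean spending (coupon minus no-coupon)
observed = statistics.mean(coupon) - statistics.mean(no_coupon)

# Under the null hypothesis ("coupon recipients do NOT spend more"),
# the group labels are interchangeable, so shuffle them repeatedly and
# count how often a difference at least this large arises by chance.
random.seed(42)
pooled = coupon + no_coupon
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:8]) - statistics.mean(pooled[8:])
    if diff >= observed:
        extreme += 1

p_value = extreme / trials
print(observed, p_value)
```

A small p-value would lead Rob to reject the null hypothesis in favor of the claim that coupon recipients do spend more on average.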

D. QuickSight, Tableau, and Power BI are all powerful analytics and reporting tools that can pull data from a variety of sources. SPSS Modeler is a machine learning package that would not be used to create a dashboard. See Chapter 6 for more information on this topic.

C. The INSERT command is used to add new records to a database table. The SELECT command is used to retrieve information from a database; it's the most commonly used command in SQL because it is used to pose queries to the database and retrieve the data that you're interested in working with. The UPDATE command is used to modify rows in the database. The ALTER command is used to modify the structure of an existing table rather than its contents. See Chapter 6 for more information on this topic.
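These SQL commands can be tried out with Python's built-in sqlite3 module; the table and data below are hypothetical:

```python
import sqlite3

# In-memory database with a simple, hypothetical table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# INSERT adds new rows to a table -- the command Ryan needs
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Lila"), (2, "Rob"), (3, "Tonya")],
)

# UPDATE modifies existing rows; SELECT retrieves rows
conn.execute("UPDATE customers SET name = 'Robert' WHERE id = 2")
rows = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(rows)  # [(1, 'Lila'), (2, 'Robert'), (3, 'Tonya')]
```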

A. While invalid, redundant, or missing data are all valid concerns, data about people exists in two of the six systems. As such, Daniel needs to account for duplicate data issues. See Chapter 4 for more information on this topic.

B. This scenario presents a very simple use case where the business leader needs a dataset in an easy-to-access form and will not be performing any detailed analysis. A simple spreadsheet, such as Microsoft Excel, would be the best tool for this job. There is no need to use a statistical analysis package, such as SAS or Minitab, as this would likely confuse the VP without adding any value. The same is true of an integrated analytics suite, such as Power BI. See Chapter 6 for more information on this topic.

C. Data warehouses bring together data from multiple systems used by an organization. A data mart is too narrow, as Alexander needs data from across multiple divisions. OLAP is a broad term for analytical processing, and OLTP systems are transactional and not ideal for this task. See Chapter 3 for more information on this topic.

C. Since the dimension links to a dimension that isn't connected to the fact table, it must be a snowflake. With a star, all dimensions link directly to the fact table. Sun and Comet are not data warehouse design patterns. See Chapter 3 for more information on this topic.

B. To provide maximum protection, encrypt data both in transit and at rest. See Chapter 8 for more information on this topic.

D. The Data Manipulation Language (DML) is used to work with the data stored in a database. DML includes the SELECT, INSERT, UPDATE, and DELETE commands. The Data Definition Language (DDL) contains the commands used to create and structure a relational database. It includes the CREATE, ALTER, and DROP commands. DDL and DML are the only two sublanguages of SQL. See Chapter 6 for more information on this topic.
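The DDL/DML split can also be demonstrated with sqlite3; the schema and rows below are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL defines structure: CREATE, ALTER, DROP
conn.execute("CREATE TABLE products (id INTEGER, name TEXT)")
conn.execute("ALTER TABLE products ADD COLUMN price REAL")

# DML works with the data itself: INSERT, SELECT, UPDATE, DELETE
conn.execute("INSERT INTO products VALUES (1, 'Menu item', 9.99)")
conn.execute("UPDATE products SET price = 10.49 WHERE id = 1")
conn.execute("DELETE FROM products WHERE id = 99")  # deletes nothing here
price = conn.execute("SELECT price FROM products WHERE id = 1").fetchone()[0]
print(price)  # 10.49

# DDL again: remove the table entirely
conn.execute("DROP TABLE products")
```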

B. A data steward is responsible for leading an organization's data governance activities, which include data quality, security, privacy, and regulatory compliance. See Chapter 8 for more information on this topic.

A. Jen does not have data for the entire population of all undergraduate sophomores. While a specific grade point average is an observation of a variable, Jen needs sample data. See Chapter 5 for more information on this topic.

C. The R programming language is used to manipulate and model data. In the ETL process, this activity normally takes place during the Transform phase. The extract and load phases typically use database-centric tools. Purging data from a database is typically done using SQL. See Chapter 3 for more information on this topic.

A. Power BI, Domo, and MicroStrategy are all analytics suites offering features that fill many different needs within the analytics process. Minitab is a statistical analysis package that lacks many of these capabilities. See Chapter 6 for more information on this topic.

B. Eye color can take on only a limited set of values; as such, it is categorical. See Chapter 2 for more information on this topic.

B. Since the value is more than four times the average, the $938,294 value is an outlier. See Chapter 4 for more information on this topic.
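A common way to flag such a value is the 1.5 × IQR rule. Apart from the two figures from the question, the prices below are hypothetical:

```python
import statistics

# Selling prices for similar homes in one zip code (mostly hypothetical,
# plus the two figures from the question)
prices = [198500, 203900, 209383, 211250, 215800, 221000, 938294]

# Flag values beyond 1.5x the interquartile range (IQR) from the quartiles
q1, _, q3 = statistics.quantiles(prices, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [p for p in prices if p < low or p > high]
print(outliers)  # [938294]
```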

C. The median is the middle observation of a variable and is, therefore, a measure of central tendency. Interquartile range is a measure of position, and range and standard deviation are both measures of dispersion; none of these measures central tendency. See Chapter 5 for more information on this topic.

D. While MFA is a good security practice, it doesn't govern access to data. Creating a single role for her team and assigning that role to the individuals on the team is the best approach. See Chapter 8 for more information on this topic.

B. To find the median, sort the numbers in your dataset and find the one located in the middle. In this case, there are an even number of observations, so we take the two middle numbers (4 and 5) and use their average as the median, making the median value 4.5. The mode is 4, the range is 63, and the number of observations is 18. See Chapter 5 for more information on this topic.
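Python's statistics module confirms this arithmetic on the question's dataset:

```python
import statistics

data = [13, 2, 65, 3, 5, 4, 7, 3, 4, 7, 8, 2, 4, 4, 60, 23, 43, 2]

# With 18 observations, the median averages the 9th and 10th
# values of the sorted list (4 and 5)
median = statistics.median(data)
mode = statistics.mode(data)     # most frequent value
spread = max(data) - min(data)   # range
print(median, mode, spread)  # 4.5 4 63
```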

D. Since Lewis needs to migrate changes every hour, a delta load is the best approach. See Chapter 3 for more information on this topic.

B. Transposing the data by product category lets Carlos summarize sales for each of the five categories he wants to analyze. Transposing by sales price, region, or product name will not further his stated analytical goal. See Chapter 4 for more information on this topic.
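The reshaping Carlos needs can be sketched with a small group-and-sum in pure Python. The rows below are hypothetical; in practice his 5.8 million rows would live in a database or a tool like pandas, but the logic is the same:

```python
from collections import defaultdict

# Hypothetical sales rows: (region, product category, product name, price)
rows = [
    ("East", "Apparel", "T-shirt", 19.99),
    ("West", "Apparel", "Jacket", 89.00),
    ("East", "Electronics", "Headphones", 149.50),
    ("West", "Electronics", "Speaker", 99.00),
    ("East", "Apparel", "Socks", 4.99),
]

# Summarize sales by product category -- the kind of reshaping that
# turns millions of detail rows into a handful of category-level totals
totals = defaultdict(float)
for region, category, name, price in rows:
    totals[category] += price

print(dict(totals))
```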

A. According to the empirical rule, 68% of values are within one standard deviation, 95% are within two standard deviations, and 99.7% are within three standard deviations. See Chapter 5 for more information on this topic.
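The empirical rule can be checked with Python's built-in NormalDist:

```python
from statistics import NormalDist

# Fraction of a normal distribution within k standard deviations of the mean
nd = NormalDist()  # standard normal: mean 0, standard deviation 1
for k in (1, 2, 3):
    frac = nd.cdf(k) - nd.cdf(-k)
    print(k, round(frac, 4))  # 0.6827, 0.9545, 0.9973
```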

A. Prices are numbers stored in dollars and cents; as such, the data type needs to be capable of storing numbers. See Chapter 2 for more information on this topic.

A. Accuracy measures how well an attribute matches its intended use. Consistency measures an attribute's value across systems. Validity ensures an attribute's value falls within an expected range. While all of these dimensions are important, completeness is foundational to Harrison's purpose: without complete email addresses and responses, he cannot follow up with respondents at all. See Chapter 4 for more information on this topic.

B. The tidyverse is a collection of packages for the R programming language designed to facilitate the analytics workflow. The tidyverse is not available for Python, Ruby, or C++, all of which are general-purpose programming languages. See Chapter 6 for more information on this topic.

Chapter 1: Today's Data Analyst

Analytics is at the heart of modern business. Virtually every organization collects large quantities of data about its customers, products, employees, and service offerings. Managers naturally seek to analyze that data and harness the information it contains to improve the efficiency, effectiveness, and profitability of their work.

Data analysts are the professionals who possess the skills and knowledge required to perform this vital work. They understand how the organization can acquire, clean, and transform data to meet the organization's needs. They are able to take that collected information and analyze it using the techniques of statistics and machine learning. They may then create powerful visualizations that display this data to business leaders, managers, and other stakeholders.

Welcome to the World of Analytics

We are fortunate to live in the Golden Age of Analytics. Businesses around the world recognize the vital nature of data to their work and are investing heavily in analytics programs designed to give them a competitive advantage. Organizations have been collecting this data for years, and many of the statistical tools and techniques used in analytics work date back decades. But if that's the case, why are we just now in the early years of this Golden Age? Figure 1.1 shows the three major pillars that have come together at this moment to allow analytics programs to thrive: data, storage, and computing power.

Data

The amount of data the modern world generates on a daily basis is staggering. From the organized tables of spreadsheets to the storage of photos, video, and audio recordings, modern businesses create an almost overwhelming avalanche of data that is ripe for use in analytics programs.

Let's try to quantify the amount of data that exists in the world. We'll begin with an estimate made by Google's then-CEO Eric Schmidt in 2010. At a technology conference, Schmidt estimated that the sum total of all of the stored knowledge created by the world at that point in time was approximately 5 exabytes. To give that a little perspective, the file containing the text of this chapter is around 100 kilobytes. So, Schmidt's estimate is that the world in 2010 had total knowledge that is about the size of 50,000,000,000,000 (that's 50 trillion!) copies of this book chapter. That's a staggering number, but it's only the beginning of our journey.

FIGURE 1.1 Analytics is made possible by modern data, storage, and computing capabilities.

Now fast-forward just two years to 2012. In that year, researchers estimated that the total amount of stored data in the world had grown to 1,000 exabytes (or one zettabyte). Remember, Schmidt's estimate of 5 exabytes was made only two years earlier. In just two years, the total amount of stored data in the world grew by a factor of 200! But we're still not finished!

In 2020, IDC estimated that the world created 59 zettabytes (or 59,000 exabytes) of new information. Compare that to Schmidt's estimate of the world having a total of 5 exabytes of stored information in 2010. If you do the math, you'll discover that on any given day in the modern era, the world generates an amount of brand-new data that is approximately 32 times the sum total of all information created from the dawn of civilization until 2010! Now, that is a staggering amount of data!
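The chapter's arithmetic checks out; here is the back-of-the-envelope version as a quick sketch:

```python
# Schmidt's 2010 estimate vs. IDC's 2020 estimate, in exabytes
total_2010_eb = 5
created_2020_eb = 59 * 1000  # 59 zettabytes

# The text-size comparison: 5 EB divided by a ~100 KB chapter file
chapter_kb = 100
copies = (total_2010_eb * 10**15) / chapter_kb  # 1 EB = 10**15 KB
print(f"{copies:.0e}")  # 5e+13, i.e., 50 trillion copies

# New data per day in 2020, as a multiple of all 2010 knowledge
per_day_eb = created_2020_eb / 365
print(round(per_day_eb / total_2010_eb))  # roughly 32
```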

From an analytics perspective, this trove of data is a gold mine of untapped potential.

Storage

The second key trend driving the growth of analytics programs is the increased availability of storage at rapidly decreasing costs. Table 1.1 shows the cost of storing a gigabyte of data in different years using magnetic hard drives.

TABLE 1.1 Gigabyte storage costs over time

Year   Cost per GB
1985   $169,900
1990   $53,940
1995   $799
2000   $17.50
2005   $0.62
2010   $0.19
2015   $0.03
2020   $0.01

Figure 1.2 shows the same data plotted as a line graph on a logarithmic scale. This visualization clearly demonstrates the fact that storage costs have plummeted to the point where storage is almost free and businesses can afford to retain data for analysis in ways that they never have before.
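The decline in Table 1.1 can be quantified directly; the constant-rate interpretation below is an illustrative simplification of the roughly straight line on the log-scale plot:

```python
# Cost per gigabyte from Table 1.1, by year
costs = {
    1985: 169_900, 1990: 53_940, 1995: 799, 2000: 17.50,
    2005: 0.62, 2010: 0.19, 2015: 0.03, 2020: 0.01,
}

# Overall decline factor across the 35-year span
factor = costs[1985] / costs[2020]
print(f"{factor:,.0f}")  # storage became roughly 17 million times cheaper

# A roughly straight line on a log scale implies a roughly constant
# percentage drop per year
annual_drop = 1 - (costs[2020] / costs[1985]) ** (1 / 35)
print(round(annual_drop, 2))  # about 0.38, i.e., a ~38% drop each year
```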

Computing Power

In 1975, Gordon Moore, one of the co-founders of Intel Corporation, made a prediction that computing technology would continue to advance so quickly that manufacturers would be able to double the number of components placed on an integrated circuit every two years. Remarkably, that prediction has stood the test of time and remains accurate today.

Commonly referred to as Moore's Law, this prediction is often loosely interpreted to mean that we will double the amount of computing power on a single device every two years. That trend has benefited many different technology-enabled fields, among them the world of analytics.
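Under that loose interpretation, the compounding is easy to sketch; the 2025 endpoint is an arbitrary choice for illustration:

```python
# Moore's Law, loosely: computing power doubles roughly every two years
def doublings(start_year: int, end_year: int) -> int:
    return (end_year - start_year) // 2

# From the 1975 prediction to 2025: 25 doublings
growth = 2 ** doublings(1975, 2025)
print(f"{growth:,}")  # 33,554,432 -- a roughly 33-million-fold increase
```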

In the early days of analytics, computing power was costly and difficult to come by. Organizations with advanced analytics needs purchased massive supercomputers to analyze their data, but those supercomputers were scarce resources. Analysts fortunate enough to work in an organization that possessed a supercomputer had to justify their requests for small slices of time when they could use the powerful machines.

Today, the effects of Moore's Law have democratized computing. Most employees in an organization now have enough computing power sitting on their desks to perform a wide variety of analytic tasks. If they require more powerful computing resources, cloud services allow them to rent massive banks of computers at very low cost. Even better, those resources are charged at hourly rates, and analysts pay only for the computing time that they actually use.

FIGURE 1.2 Storage costs have decreased over time.