The fast and easy way to make sense of statistics for big data

Does the subject of data analysis make you dizzy? You've come to the right place! Statistics For Big Data For Dummies breaks this often-overwhelming subject down into easily digestible parts, offering new and aspiring data analysts the foundation they need to be successful in the field. Inside, you'll find an easy-to-follow introduction to exploratory data analysis, the lowdown on collecting, cleaning, and organizing data, everything you need to know about interpreting data using common software and programming languages, plain-English explanations of how to make sense of data in the real world, and much more.

Data has never been easier to come by, and the tools students and professionals need to enter the world of big data are based on applied statistics. While the word "statistics" alone can evoke feelings of anxiety in even the most confident student or professional, it doesn't have to. Written in the familiar and friendly tone that has defined the For Dummies brand for more than twenty years, Statistics For Big Data For Dummies takes the intimidation out of the subject, offering clear explanations and tons of step-by-step instruction to help you make sense of data mining--without losing your cool.

* Helps you to identify valid, useful, and understandable patterns in data
* Provides guidance on extracting previously unknown information from large databases
* Shows you how to discover patterns available in big data
* Gives you access to the latest tools and techniques for working in big data

If you're a student enrolled in a related Applied Statistics course or a professional looking to expand your skillset, Statistics For Big Data For Dummies gives you access to everything you need to succeed.
Page count: 444
Year of publication: 2015
Statistics For Big Data For Dummies®
Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com
Copyright © 2015 by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc., and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc., is not associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: WHILE THE PUBLISHER AND AUTHOR HAVE USED THEIR BEST EFFORTS IN PREPARING THIS BOOK, THEY MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS BOOK AND SPECIFICALLY DISCLAIM ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES REPRESENTATIVES OR WRITTEN SALES MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR YOUR SITUATION. YOU SHOULD CONSULT WITH A PROFESSIONAL WHERE APPROPRIATE. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM.
For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit www.wiley.com/techsupport.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2015943222
ISBN 978-1-118-94001-3 (pbk); ISBN 978-1-118-94002-0 (ePub); ISBN 978-1-118-94003-7 (ePDF)
Table of Contents
Cover
Introduction
About This Book
Foolish Assumptions
Icons Used in This Book
Beyond the Book
Where to Go From Here
Part I: Introducing Big Data Statistics
Chapter 1: What Is Big Data and What Do You Do with It?
Characteristics of Big Data
Exploratory Data Analysis (EDA)
Statistical Analysis of Big Data
Chapter 2: Characteristics of Big Data: The Three Vs
Characteristics of Big Data
Traditional Database Management Systems (DBMS)
Chapter 3: Using Big Data: The Hot Applications
Big Data and Weather Forecasting
Big Data and Healthcare Services
Big Data and Insurance
Big Data and Finance
Big Data and Electric Utilities
Big Data and Higher Education
Big Data and Retailers
Big Data and Search Engines
Big Data and Social Media
Chapter 4: Understanding Probabilities
The Core Structure: Probability Spaces
Discrete Probability Distributions
Continuous Probability Distributions
Introducing Multivariate Probability Distributions
Chapter 5: Basic Statistical Ideas
Some Preliminaries Regarding Data
Summary Statistical Measures
Overview of Hypothesis Testing
Higher-Order Measures
Part II: Preparing and Cleaning Data
Chapter 6: Dirty Work: Preparing Your Data for Analysis
Passing the Eye Test: Does Your Data Look Correct?
Being Careful with Dates
Does the Data Make Sense?
Frequently Encountered Data Headaches
Other Common Data Transformations
Chapter 7: Figuring the Format: Important Computer File Formats
Spreadsheet Formats
Database Formats
Chapter 8: Checking Assumptions: Testing for Normality
Goodness of fit test
Jarque-Bera test
Chapter 9: Dealing with Missing or Incomplete Data
Missing Data: What’s the Problem?
Techniques for Dealing with Missing Data
Chapter 10: Sending Out a Posse: Searching for Outliers
Testing for Outliers
Robust Statistics
Dealing with Outliers
Part III: Exploratory Data Analysis (EDA)
Chapter 11: An Overview of Exploratory Data Analysis (EDA)
Graphical EDA Techniques
EDA Techniques for Testing Assumptions
Quantitative EDA Techniques
Chapter 12: A Plot to Get Graphical: Graphical Techniques
Stem-and-Leaf Plots
Scatter Plots
Box Plots
Histograms
Quantile-Quantile (QQ) Plots
Autocorrelation Plots
Chapter 13: You’re the Only Variable for Me: Univariate Statistical Techniques
Counting Events Over a Time Interval: The Poisson Distribution
Continuous Probability Distributions
Chapter 14: To All the Variables We’ve Encountered: Multivariate Statistical Techniques
Testing Hypotheses about Two Population Means
Using Analysis of Variance (ANOVA) to Test Hypotheses about Population Means
The F-Distribution
F-Test for the Equality of Two Population Variances
Correlation
Chapter 15: Regression Analysis
The Fundamental Assumption: Variables Have a Linear Relationship
Defining the Population Regression Equation
Estimating the Population Regression Equation
Testing the Estimated Regression Equation
Using Statistical Software
Assumptions of Simple Linear Regression
Multiple Regression Analysis
Multicollinearity
Chapter 16: When You’ve Got the Time: Time Series Analysis
Key Properties of a Time Series
Forecasting with Decomposition Methods
Smoothing Techniques
Seasonal Components
Modeling a Time Series with Regression Analysis
Comparing Different Models: MAD and MSE
Part IV: Big Data Applications
Chapter 17: Using Your Crystal Ball: Forecasting with Big Data
ARIMA Modeling
Simulation Techniques
Chapter 18: Crunching Numbers: Performing Statistical Analysis on Your Computer
Excelling at Excel
Programming with Visual Basic for Applications (VBA)
R, Matey!
Chapter 19: Seeking Free Sources of Financial Data
Yahoo! Finance
Federal Reserve Economic Data (FRED)
Board of Governors of the Federal Reserve System
U.S. Department of the Treasury
Other Useful Financial Websites
Part V: The Part of Tens
Chapter 20: Ten (or So) Best Practices in Data Preparation
Check Data Formats
Verify Data Types
Graph Your Data
Verify Data Accuracy
Identify Outliers
Deal with Missing Values
Check Your Assumptions about How the Data Is Distributed
Back Up and Document Everything You Do
Chapter 21: Ten (or So) Questions Answered by Exploratory Data Analysis (EDA)
What Are the Key Properties of a Dataset?
What’s the Center of the Data?
How Much Spread Is There in the Data?
Is the Data Skewed?
What Distribution Does the Data Follow?
Are the Elements in the Dataset Uncorrelated?
Does the Center of the Dataset Change Over Time?
Does the Spread of the Dataset Change Over Time?
Are There Outliers in the Data?
Does the Data Conform to Our Assumptions?
About the Authors
Cheat Sheet
Advertisement Page
Connect with Dummies
End User License Agreement
Welcome to Statistics For Big Data For Dummies! Every day, what has come to be known as big data is making its influence felt in our lives. Some of the most useful innovations of the past 20 years have been made possible by the advent of massive data-gathering capabilities combined with rapidly improving computer technology.
For example, we have become accustomed to finding almost any information we need through the Internet. You can locate nearly anything under the sun immediately by using a search engine such as Google or DuckDuckGo. Finding information this way has become so commonplace that Google has become a verb, as in “I don’t know where to find that restaurant; I’ll just Google it.” Just think how much more efficient our lives have become as a result of search engines. But how does Google work? Google couldn’t exist without the ability to process massive quantities of information at an extremely rapid speed, and its software has to be extremely efficient.
Another area that has changed our lives forever is e-commerce, of which the classic example is Amazon.com. People can buy virtually every product they use in their daily lives online (and have it delivered promptly, too). Often online prices are lower than in traditional “brick-and-mortar” stores, and the range of choices is wider. Online shopping also lets people find the best available items at the lowest possible prices.
Another huge advantage to online shopping is the ability of the sellers to provide reviews of products and recommendations for future purchases. Reviews from other shoppers can give extremely important information that isn’t available from a simple product description provided by manufacturers. And recommendations for future purchases are a great way for consumers to find new products that they might not otherwise have known about. Recommendations are enabled by one application of big data — the use of highly sophisticated programs that analyze shopping data and identify items that tend to be purchased by the same consumers.
Although online shopping is now second nature for many consumers, the reality is that e-commerce has only come into its own in the last 15–20 years, largely thanks to the rise of big data. A website such as Amazon.com must process quantities of information that would have been unthinkably gigantic just a few years ago, and that processing must be done quickly and efficiently. Thanks to rapidly improving technology, many traditional retailers now also offer the option of making purchases online; failure to do so would put a retailer at a huge competitive disadvantage.
In addition to search engines and e-commerce, big data is making a major impact in a surprising number of other areas that affect our daily lives:
Social media
Online auction sites
Insurance
Healthcare
Energy
Political polling
Weather forecasting
Education
Travel
Finance
This book is intended as an overview of the field of big data, with a focus on the statistical methods used. It also provides a look at several key applications of big data. Big data is a broad topic; it includes quantitative subjects such as math, statistics, computer science, and data science. Big data also covers many applications, such as weather forecasting, financial modeling, political polling methods, and so forth.
Our intentions for this book specifically include the following:
Provide an overview of the field of big data.
Introduce many useful applications of big data.
Show how data may be organized and checked for bad or missing information.
Show how to handle outliers in a dataset.
Explain how to identify assumptions that are made when analyzing data.
Provide a detailed explanation of how data may be analyzed with graphical techniques.
Cover several key univariate (involving only one variable) statistical techniques for analyzing data.
Explain widely used multivariate (involving more than one variable) statistical techniques.
Provide an overview of modeling techniques such as regression analysis.
Explain the techniques that are commonly used to analyze time series data.
Cover techniques used to forecast the future values of a dataset.
Provide a brief overview of software packages and how they can be used to analyze statistical data.
Because this is a For Dummies book, the chapters are written so you can pick and choose whichever topics interest you the most and dive right in. There’s no need to read the chapters in sequential order, although you certainly could. We do suggest, though, that you make sure you’re comfortable with the ideas developed in Chapters 4 and 5 before proceeding to the later chapters in the book. Each chapter also contains several tips, reminders, and other tidbits, and in several cases there are links to websites you can use to further pursue the subject. There’s also an online Cheat Sheet that includes a summary of key equations for ease of reference.
As mentioned, this is a big topic and a fairly new field. Space constraints make possible only an introduction to the statistical concepts that underlie big data. But we hope it is enough to get you started in the right direction.
We make some assumptions about you, the reader. Hopefully, one of the following descriptions fits you:
You’ve heard about big data and would like to learn more about it.
You’d like to use big data in an application but don’t have sufficient background in statistical modeling.
You don’t know how to implement statistical models in a software package.
Possibly all of these are true. This book should give you a good starting point for advancing your interest in this field. Clearly, you are already motivated.
This book does not assume any particularly advanced knowledge of mathematics and statistics. The ideas are developed from fairly mundane mathematical operations. But it may, in many places, require you to take a deep breath and not get intimidated by the formulas.
Throughout the book, we include several icons designed to point out specific kinds of information. Keep an eye out for them:
A Tip points out especially helpful or practical information about a topic. It may be hard-won advice on the best way to do something or a useful insight that may not have been obvious at first glance.
A Warning is used when information must be treated carefully. These icons point out potential problems or trouble you may encounter. They also highlight mistaken assumptions that could lead to difficulties.
Technical Stuff points out stuff that may be interesting if you’re really curious about something, but which is not essential. You can safely skip these if you’re in a hurry or just looking for the basics.
Remember is used to indicate stuff that may have been previously encountered in the book or that you will do well to stash somewhere in your memory for future benefit.
Besides the pages or pixels you’re presently perusing, this book comes with even more goodies online. You can check out the Cheat Sheet at www.dummies.com/cheatsheet/statisticsforbigdata.
We’ve also written some additional material that wouldn’t quite fit in the book. If this book were a DVD, these would be on the Bonus Content disc. This handful of extra articles on various mini-topics related to big data is available at www.dummies.com/extras/statisticsforbigdata.
You can approach this book from several different angles. You can, of course, start with Chapter 1 and read straight through to the end. But you may not have time for that, or maybe you are already familiar with some of the basics. We suggest checking out the table of contents to see a map of what’s covered in the book and then flipping to any particular chapter that catches your eye. Or if you’ve got a specific big data issue or topic you’re burning to know more about, try looking it up in the index.
Once you’re done with the book, you can further your big data adventure (where else?) on the Internet. Instructional videos are available on websites such as YouTube. Online courses, many of them free, are also becoming available. Some are produced by private companies such as Coursera; others are offered by major universities such as Yale and M.I.T. Of course, many new books are being written in the field of big data due to its increasing importance.
If you’re even more ambitious, you will find specialized courses at the college undergraduate and graduate levels in subject areas such as statistics, computer science, information technology, and so forth. In order to satisfy the expected future demand for big data specialists, several schools are now offering a concentration or a full degree in Data Science.
The resources are there; you should be able to take yourself as far as you want to go in the field of big data. Good luck!
Part I
Visit www.dummies.com for Great Dummies content online.
In this part …
Introducing big data and stuff it’s used for
Exploring the three Vs of big data
Checking out the hot big data applications
Discovering probabilities and other basic statistical ideas
Chapter 1
In This Chapter
Understanding what big data is all about
Seeing how data may be analyzed using Exploratory Data Analysis (EDA)
Gaining insight into some of the key statistical techniques used to analyze big data
Big data refers to sets of data that are far too massive to be handled with traditional hardware. Big data is also problematic for software such as database systems, statistical packages, and so forth. In recent years, data-gathering capabilities have experienced explosive growth, so that storing and analyzing the resulting data has become progressively more challenging.
Many fields have been affected by the increasing availability of data, including finance, marketing, and e-commerce. Big data has also revolutionized more traditional fields such as law and medicine. Of course, big data is gathered on a massive scale by search engines such as Google and social media sites such as Facebook. These developments have led to the evolution of an entirely new profession: the data scientist, someone who can combine the fields of statistics, math, computer science, and engineering with knowledge of a specific application.
This chapter introduces several key concepts that are discussed throughout the book. These include the characteristics of big data, applications of big data, key statistical tools for analyzing big data, and forecasting techniques.
The three factors that distinguish big data from other types of data are volume, velocity, and variety.
Clearly, with big data, the volume is massive. In fact, new terminology must be used to describe the size of these datasets. For example, one petabyte of data consists of $10^{15}$ bytes of data. That’s 1,000 trillion bytes!
A byte is a single unit of storage in a computer’s memory. A byte is used to represent a single number, character, or symbol. A byte consists of eight bits, each consisting of either a 0 or a 1.
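The arithmetic behind these units can be checked with a few lines of Python. This is only an illustrative sketch: it assumes the decimal convention in which a petabyte is 10^15 bytes (1,000 trillion, as stated above), and the names `PETABYTE_BYTES` and `bits_in` are made up for this example.

```python
# Illustrative constants; the decimal (SI) petabyte of 10**15 bytes is assumed.
PETABYTE_BYTES = 10**15
BITS_PER_BYTE = 8


def bits_in(num_bytes: int) -> int:
    """Convert a number of bytes to the equivalent number of bits."""
    return num_bytes * BITS_PER_BYTE


# Eight bits, each 0 or 1, give 2**8 distinct patterns per byte.
values_per_byte = 2 ** BITS_PER_BYTE

# The character "A" stored in one byte, shown as its eight bits.
letter_a_bits = format(ord("A"), "08b")

print(f"{PETABYTE_BYTES:,} bytes in one petabyte")
print(f"{bits_in(PETABYTE_BYTES):,} bits in one petabyte")
print(f"{values_per_byte} distinct values fit in one byte")
print(f"'A' as eight bits: {letter_a_bits}")
```

Running the sketch prints the size of a petabyte in bytes and bits, and shows how a single character maps onto the eight bits of one byte.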
Velocity refers to the speed at which data is gathered. Big datasets consist of data that’s continuously gathered at very high speeds. For example, it has been estimated that Twitter users generate more than a quarter of a million tweets every minute. This requires a massive amount of storage space as well as real-time processing of the data.
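To get a feel for what that rate implies, the quoted figure can be scaled up to a full day. The sketch below is a rough back-of-the-envelope calculation only: the figure of 250,000 tweets per minute is the lower bound quoted above, while `ASSUMED_BYTES_PER_TWEET` is a purely hypothetical storage size chosen for illustration, not a measured value.

```python
TWEETS_PER_MINUTE = 250_000      # lower bound quoted in the text
ASSUMED_BYTES_PER_TWEET = 200    # hypothetical value, for illustration only


def tweets_per_day(per_minute: int) -> int:
    """Scale a per-minute rate to a 24-hour day."""
    return per_minute * 60 * 24


def tweets_per_second(per_minute: int) -> float:
    """Scale a per-minute rate down to a per-second rate."""
    return per_minute / 60


daily_tweets = tweets_per_day(TWEETS_PER_MINUTE)
daily_bytes = daily_tweets * ASSUMED_BYTES_PER_TWEET

print(f"{tweets_per_second(TWEETS_PER_MINUTE):,.0f} tweets per second")
print(f"{daily_tweets:,} tweets per day")
print(f"{daily_bytes / 10**9:,.0f} GB per day under the assumed tweet size")
```

Changing the assumed size per tweet changes the storage estimate proportionally, which is why real capacity planning would need measured values rather than this placeholder.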
Variety refers to the fact that the contents of a big dataset may consist of a number of different formats, including spreadsheets, videos, music clips, email messages, and so on. Storing a huge quantity of these incompatible types is one of the major challenges of big data.
Chapter 2 covers these characteristics in more detail.
Before you apply statistical techniques to a dataset, it’s important to examine the data to understand its basic properties. You can use a series of techniques that are collectively known as Exploratory Data Analysis (EDA) to analyze a dataset. EDA helps ensure that you choose the correct statistical techniques to analyze and forecast the data. The two basic types of EDA techniques are graphical techniques and quantitative techniques.
Graphical EDA techniques show the key properties of a dataset in a convenient format. It’s often easier to understand the properties of a variable and the relationships between variables by looking at graphs rather than looking at the raw data. You can use several graphical techniques, depending on the type of data being analyzed. Chapters 11 and 12 explain how to create and use the following:
Box plots
Histograms
Normal probability plots
Scatter plots
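Before drawing anything, it can help to see the numbers that two of these plots are built from. The following is a minimal sketch using only Python's standard library, not the book's own procedure: the dataset is made up, the 1.5 × IQR rule for flagging outliers is one common box plot convention, and the function names are hypothetical.

```python
from collections import Counter
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max): the values a box plot displays."""
    ordered = sorted(values)
    # quantiles() with n=4 returns the three quartile cut points
    # (default "exclusive" method).
    q1, median, q3 = quantiles(ordered, n=4)
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(values):
    """Flag points more than 1.5 * IQR beyond the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


def histogram_counts(values, bin_width):
    """Count how many values fall in each bin, keyed by the bin's lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)


sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 30]  # made-up data

print("Five-number summary:", five_number_summary(sample))
print("Possible outliers:", iqr_outliers(sample))
print("Histogram counts:", sorted(histogram_counts(sample, 10).items()))
```

A plotting library would turn these same numbers into the box and the bars; computing them first makes it easier to see what each graph is actually summarizing.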
Quantitative EDA techniques provide a more rigorous method of determining the key properties of a dataset. Two of the most important of these techniques are
Interval estimation (discussed in Chapter 11).
Hypothesis testing (introduced in Chapter 5).
Interval estimates are used to create a range of values within which a variable is likely to fall. Hypothesis testing is used to test various propositions about a dataset, such as
The mean value of the dataset.
The standard deviation of the dataset.
The probability distribution the dataset follows.
Hypothesis testing is a core technique in statistics and is used throughout the chapters in Part III of this book.
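The two quantitative techniques can be sketched for the simplest case, the mean of a dataset. This example is an illustration under stated assumptions rather than the method developed later in the book: it uses a normal (z) approximation, which is reasonable only for fairly large samples (a t-distribution is the usual choice for small ones), the data are made up, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev


def confidence_interval(sample, level=0.95):
    """Interval estimate for the mean, using a normal approximation."""
    mean = fmean(sample)
    std_error = stdev(sample) / sqrt(len(sample))
    # Two-sided critical value, e.g. about 1.96 for a 95% level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * std_error, mean + z * std_error


def z_test(sample, hypothesized_mean):
    """Two-sided test of H0: population mean equals hypothesized_mean.

    Returns the test statistic and its p-value.
    """
    std_error = stdev(sample) / sqrt(len(sample))
    z = (fmean(sample) - hypothesized_mean) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up data: the values 48..52, each repeated ten times (mean 50).
sample = [48 + (i % 5) for i in range(50)]

low, high = confidence_interval(sample)
print(f"95% interval for the mean: {low:.2f} to {high:.2f}")

for mu0 in (50.0, 51.0):
    z, p = z_test(sample, mu0)
    verdict = "reject" if p < 0.05 else "fail to reject"
    print(f"H0: mean = {mu0}: z = {z:.2f}, p = {p:.4f} -> {verdict} H0")
```

The interval and the test tell a consistent story: a hypothesized mean that falls inside the 95% interval is not rejected at the 5% level, while one that falls well outside it is.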
Chapter 2
In This Chapter
Understanding the characteristics of big data and how it can be classified
Checking out the features of the latest methods for storing and analyzing big data