Statistics for Big Data For Dummies - Alan Anderson - E-Book

Description

The fast and easy way to make sense of statistics for big data

Does the subject of data analysis make you dizzy? You've come to the right place! Statistics For Big Data For Dummies breaks this often-overwhelming subject down into easily digestible parts, offering new and aspiring data analysts the foundation they need to be successful in the field. Inside, you'll find an easy-to-follow introduction to exploratory data analysis, the lowdown on collecting, cleaning, and organizing data, everything you need to know about interpreting data using common software and programming languages, plain-English explanations of how to make sense of data in the real world, and much more.

Data has never been easier to come by, and the tools students and professionals need to enter the world of big data are based on applied statistics. While the word "statistics" alone can evoke feelings of anxiety in even the most confident student or professional, it doesn't have to. Written in the familiar and friendly tone that has defined the For Dummies brand for more than twenty years, Statistics For Big Data For Dummies takes the intimidation out of the subject, offering clear explanations and tons of step-by-step instruction to help you make sense of data mining, without losing your cool.

* Helps you to identify valid, useful, and understandable patterns in data
* Provides guidance on extracting previously unknown information from large databases
* Shows you how to discover patterns available in big data
* Gives you access to the latest tools and techniques for working in big data

If you're a student enrolled in a related Applied Statistics course or a professional looking to expand your skillset, Statistics For Big Data For Dummies gives you access to everything you need to succeed.

Page count: 444

Year of publication: 2015




Statistics For Big Data For Dummies®

Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com

Copyright © 2015 by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc., and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc., is not associated with any product or vendor mentioned in this book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: WHILE THE PUBLISHER AND AUTHOR HAVE USED THEIR BEST EFFORTS IN PREPARING THIS BOOK, THEY MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS BOOK AND SPECIFICALLY DISCLAIM ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES REPRESENTATIVES OR WRITTEN SALES MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR YOUR SITUATION. YOU SHOULD CONSULT WITH A PROFESSIONAL WHERE APPROPRIATE. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM.

For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit www.wiley.com/techsupport.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2015943222

ISBN 978-1-118-94001-3 (pbk); ISBN 978-1-118-94002-0 (ePub); ISBN 978-1-118-94003-7 (ePDF)

Statistics For Big Data For Dummies

Visit http://www.dummies.com/cheatsheet/statisticsforbigdata to view this book's cheat sheet.

Table of Contents

Cover

Introduction

About This Book

Foolish Assumptions

Icons Used in This Book

Beyond the Book

Where to Go From Here

Part I: Introducing Big Data Statistics

Chapter 1: What Is Big Data and What Do You Do with It?

Characteristics of Big Data

Exploratory Data Analysis (EDA)

Statistical Analysis of Big Data

Chapter 2: Characteristics of Big Data: The Three Vs

Characteristics of Big Data

Traditional Database Management Systems (DBMS)

Chapter 3: Using Big Data: The Hot Applications

Big Data and Weather Forecasting

Big Data and Healthcare Services

Big Data and Insurance

Big Data and Finance

Big Data and Electric Utilities

Big Data and Higher Education

Big Data and Retailers

Big Data and Search Engines

Big Data and Social Media

Chapter 4: Understanding Probabilities

The Core Structure: Probability Spaces

Discrete Probability Distributions

Continuous Probability Distributions

Introducing Multivariate Probability Distributions

Chapter 5: Basic Statistical Ideas

Some Preliminaries Regarding Data

Summary Statistical Measures

Overview of Hypothesis Testing

Higher-Order Measures

Part II: Preparing and Cleaning Data

Chapter 6: Dirty Work: Preparing Your Data for Analysis

Passing the Eye Test: Does Your Data Look Correct?

Being Careful with Dates

Does the Data Make Sense?

Frequently Encountered Data Headaches

Other Common Data Transformations

Chapter 7: Figuring the Format: Important Computer File Formats

Spreadsheet Formats

Database Formats

Chapter 8: Checking Assumptions: Testing for Normality

Goodness of fit test

Jarque-Bera test

Chapter 9: Dealing with Missing or Incomplete Data

Missing Data: What’s the Problem?

Techniques for Dealing with Missing Data

Chapter 10: Sending Out a Posse: Searching for Outliers

Testing for Outliers

Robust Statistics

Dealing with Outliers

Part III: Exploratory Data Analysis (EDA)

Chapter 11: An Overview of Exploratory Data Analysis (EDA)

Graphical EDA Techniques

EDA Techniques for Testing Assumptions

Quantitative EDA Techniques

Chapter 12: A Plot to Get Graphical: Graphical Techniques

Stem-and-Leaf Plots

Scatter Plots

Box Plots

Histograms

Quantile-Quantile (QQ) Plots

Autocorrelation Plots

Chapter 13: You’re the Only Variable for Me: Univariate Statistical Techniques

Counting Events Over a Time Interval: The Poisson Distribution

Continuous Probability Distributions

Chapter 14: To All the Variables We’ve Encountered: Multivariate Statistical Techniques

Testing Hypotheses about Two Population Means

Using Analysis of Variance (ANOVA) to Test Hypotheses about Population Means

The F-Distribution

F-Test for the Equality of Two Population Variances

Correlation

Chapter 15: Regression Analysis

The Fundamental Assumption: Variables Have a Linear Relationship

Defining the Population Regression Equation

Estimating the Population Regression Equation

Testing the Estimated Regression Equation

Using Statistical Software

Assumptions of Simple Linear Regression

Multiple Regression Analysis

Multicollinearity

Chapter 16: When You’ve Got the Time: Time Series Analysis

Key Properties of a Time Series

Forecasting with Decomposition Methods

Smoothing Techniques

Seasonal Components

Modeling a Time Series with Regression Analysis

Comparing Different Models: MAD and MSE

Part IV: Big Data Applications

Chapter 17: Using Your Crystal Ball: Forecasting with Big Data

ARIMA Modeling

Simulation Techniques

Chapter 18: Crunching Numbers: Performing Statistical Analysis on Your Computer

Excelling at Excel

Programming with Visual Basic for Applications (VBA)

R, Matey!

Chapter 19: Seeking Free Sources of Financial Data

Yahoo! Finance

Federal Reserve Economic Data (FRED)

Board of Governors of the Federal Reserve System

U.S. Department of the Treasury

Other Useful Financial Websites

Part V: The Part of Tens

Chapter 20: Ten (or So) Best Practices in Data Preparation

Check Data Formats

Verify Data Types

Graph Your Data

Verify Data Accuracy

Identify Outliers

Deal with Missing Values

Check Your Assumptions about How the Data Is Distributed

Back Up and Document Everything You Do

Chapter 21: Ten (or So) Questions Answered by Exploratory Data Analysis (EDA)

What Are the Key Properties of a Dataset?

What’s the Center of the Data?

How Much Spread Is There in the Data?

Is the Data Skewed?

What Distribution Does the Data Follow?

Are the Elements in the Dataset Uncorrelated?

Does the Center of the Dataset Change Over Time?

Does the Spread of the Dataset Change Over Time?

Are There Outliers in the Data?

Does the Data Conform to Our Assumptions?

About the Authors

Cheat Sheet

Advertisement Page

Connect with Dummies

End User License Agreement


Introduction

Welcome to Statistics For Big Data For Dummies! Every day, what has come to be known as big data is making its influence felt in our lives. Some of the most useful innovations of the past 20 years have been made possible by the advent of massive data-gathering capabilities combined with rapidly improving computer technology.

For example, we have become accustomed to finding almost any information we need through the Internet. You can locate nearly anything under the sun immediately by using a search engine such as Google or DuckDuckGo. Finding information this way has become so commonplace that Google has become a verb, as in “I don’t know where to find that restaurant. I’ll just Google it.” Just think how much more efficient our lives have become as a result of search engines. But how does Google work? Google couldn’t exist without the ability to process massive quantities of information at an extremely rapid speed, and its software has to be extremely efficient.

Another area that has changed our lives forever is e-commerce, of which the classic example is Amazon.com. People can buy virtually every product they use in their daily lives online (and have it delivered promptly, too). Often online prices are lower than in traditional “brick-and-mortar” stores, and the range of choices is wider. Online shopping also lets people find the best available items at the lowest possible prices.

Another huge advantage to online shopping is the ability of the sellers to provide reviews of products and recommendations for future purchases. Reviews from other shoppers can give extremely important information that isn’t available from a simple product description provided by manufacturers. And recommendations for future purchases are a great way for consumers to find new products that they might not otherwise have known about. Recommendations are enabled by one application of big data — the use of highly sophisticated programs that analyze shopping data and identify items that tend to be purchased by the same consumers.

Although online shopping is now second nature for many consumers, the reality is that e-commerce has only come into its own in the last 15–20 years, largely thanks to the rise of big data. A website such as Amazon.com must process quantities of information that would have been unthinkably gigantic just a few years ago, and that processing must be done quickly and efficiently. Thanks to rapidly improving technology, many traditional retailers now also offer the option of making purchases online; failure to do so would put a retailer at a huge competitive disadvantage.

In addition to search engines and e-commerce, big data is making a major impact in a surprising number of other areas that affect our daily lives:

Social media

Online auction sites

Insurance

Healthcare

Energy

Political polling

Weather forecasting

Education

Travel

Finance

About This Book

This book is intended as an overview of the field of big data, with a focus on the statistical methods used. It also provides a look at several key applications of big data. Big data is a broad topic; it includes quantitative subjects such as math, statistics, computer science, and data science. Big data also covers many applications, such as weather forecasting, financial modeling, political polling methods, and so forth.

Our intentions for this book specifically include the following:

Provide an overview of the field of big data.

Introduce many useful applications of big data.

Show how data may be organized and checked for bad or missing information.

Show how to handle outliers in a dataset.

Explain how to identify assumptions that are made when analyzing data.

Provide a detailed explanation of how data may be analyzed with graphical techniques.

Cover several key univariate (involving only one variable) statistical techniques for analyzing data.

Explain widely used multivariate (involving more than one variable) statistical techniques.

Provide an overview of modeling techniques such as regression analysis.

Explain the techniques that are commonly used to analyze time series data.

Cover techniques used to forecast the future values of a dataset.

Provide a brief overview of software packages and how they can be used to analyze statistical data.

Because this is a For Dummies book, the chapters are written so you can pick and choose whichever topics interest you the most and dive right in. There’s no need to read the chapters in sequential order, although you certainly could. We do suggest, though, that you make sure you’re comfortable with the ideas developed in Chapters 4 and 5 before proceeding to the later chapters in the book. Each chapter also contains several tips, reminders, and other tidbits, and in several cases there are links to websites you can use to further pursue the subject. There’s also an online Cheat Sheet that includes a summary of key equations for ease of reference.

As mentioned, this is a big topic and a fairly new field. Space constraints allow only an introduction to the statistical concepts that underlie big data. But we hope it is enough to get you started in the right direction.

Foolish Assumptions

We make some assumptions about you, the reader. Hopefully, one of the following descriptions fits you:

You’ve heard about big data and would like to learn more about it.

You’d like to use big data in an application but don’t have sufficient background in statistical modeling.

You don’t know how to implement statistical models in a software package.

Possibly all of these are true. This book should give you a good starting point for advancing your interest in this field. Clearly, you are already motivated.

This book does not assume any particularly advanced knowledge of mathematics and statistics. The ideas are developed from fairly mundane mathematical operations. But it may, in many places, require you to take a deep breath and not get intimidated by the formulas.

Icons Used in This Book

Throughout the book, we include several icons designed to point out specific kinds of information. Keep an eye out for them:

A Tip points out especially helpful or practical information about a topic. It may be hard-won advice on the best way to do something or a useful insight that may not have been obvious at first glance.

A Warning is used when information must be treated carefully. These icons point out potential problems or trouble you may encounter. They also highlight mistaken assumptions that could lead to difficulties.

Technical Stuff points out stuff that may be interesting if you’re really curious about something, but which is not essential. You can safely skip these if you’re in a hurry or just looking for the basics.

Remember is used to indicate stuff that may have been previously encountered in the book or that you will do well to stash somewhere in your memory for future benefit.

Beyond the Book

Besides the pages or pixels you’re presently perusing, this book comes with even more goodies online. You can check out the Cheat Sheet at www.dummies.com/cheatsheet/statisticsforbigdata.

We’ve also written some additional material that wouldn’t quite fit in the book. If this book were a DVD, these would be on the Bonus Content disc. This handful of extra articles on various mini-topics related to big data is available at www.dummies.com/extras/statisticsforbigdata.

Where to Go From Here

You can approach this book from several different angles. You can, of course, start with Chapter 1 and read straight through to the end. But you may not have time for that, or maybe you are already familiar with some of the basics. We suggest checking out the table of contents to see a map of what’s covered in the book and then flipping to any particular chapter that catches your eye. Or if you’ve got a specific big data issue or topic you’re burning to know more about, try looking it up in the index.

Once you’re done with the book, you can further your big data adventure (where else?) on the Internet. Instructional videos are available on websites such as YouTube. Online courses, many of them free, are also becoming available. Some are produced by private companies such as Coursera; others are offered by major universities such as Yale and M.I.T. Of course, many new books are being written in the field of big data due to its increasing importance.

If you’re even more ambitious, you will find specialized courses at the college undergraduate and graduate levels in subject areas such as statistics, computer science, information technology, and so forth. In order to satisfy the expected future demand for big data specialists, several schools are now offering a concentration or a full degree in Data Science.

The resources are there; you should be able to take yourself as far as you want to go in the field of big data. Good luck!

Part I

Introducing Big Data Statistics

Visit www.dummies.com for Great Dummies content online.

In this part …

Introducing big data and stuff it’s used for

Exploring the three Vs of big data

Checking out the hot big data applications

Discovering probabilities and other basic statistical ideas

Chapter 1

What Is Big Data and What Do You Do with It?

In This Chapter

Understanding what big data is all about

Seeing how data may be analyzed using Exploratory Data Analysis (EDA)

Gaining insight into some of the key statistical techniques used to analyze big data

Big data refers to sets of data that are far too massive to be handled with traditional hardware. Big data is also problematic for software such as database systems, statistical packages, and so forth. In recent years, data-gathering capabilities have experienced explosive growth, so that storing and analyzing the resulting data has become progressively more challenging.

Many fields have been affected by the increasing availability of data, including finance, marketing, and e-commerce. Big data has also revolutionized more traditional fields such as law and medicine. Of course, big data is gathered on a massive scale by search engines such as Google and social media sites such as Facebook. These developments have led to the evolution of an entirely new profession: the data scientist, someone who can combine the fields of statistics, math, computer science, and engineering with knowledge of a specific application.

This chapter introduces several key concepts that are discussed throughout the book. These include the characteristics of big data, applications of big data, key statistical tools for analyzing big data, and forecasting techniques.

Characteristics of Big Data

The three factors that distinguish big data from other types of data are volume, velocity, and variety.

Clearly, with big data, the volume is massive. In fact, new terminology must be used to describe the size of these datasets. For example, one petabyte of data consists of $10^{15}$ bytes of data. That’s 1,000 trillion bytes!

A byte is a single unit of storage in a computer’s memory. A byte is used to represent a single number, character, or symbol. A byte consists of eight bits, each consisting of either a 0 or a 1.
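To make these sizes concrete, here is a minimal Python sketch of the arithmetic. The variable names are our own choices for illustration, and the petabyte is taken in its decimal sense (1,000 trillion bytes), as in the text above.

```python
# Illustrative only: names are assumptions made for this sketch.

PETABYTE = 10 ** 15          # bytes in one petabyte (decimal definition)
TRILLION = 10 ** 12

# One petabyte expressed in trillions of bytes.
petabyte_in_trillions = PETABYTE // TRILLION

# A byte is eight bits, so it can hold 2**8 distinct values (0 through 255).
BITS_PER_BYTE = 8
values_per_byte = 2 ** BITS_PER_BYTE

# The character "A" stored as one byte, written out as its eight bits.
letter_bits = format(ord("A"), "08b")

print(petabyte_in_trillions)  # 1000
print(values_per_byte)        # 256
print(letter_bits)            # 01000001
```

Each of the eight positions in the last line is one bit, either a 0 or a 1.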

Velocity refers to the speed at which data is gathered. Big datasets consist of data that’s continuously gathered at very high speeds. For example, it has been estimated that Twitter users generate more than a quarter of a million tweets every minute. This requires a massive amount of storage space as well as real-time processing of the data.

Variety refers to the fact that the contents of a big dataset may consist of a number of different formats, including spreadsheets, videos, music clips, email messages, and so on. Storing a huge quantity of these incompatible types is one of the major challenges of big data.

Chapter 2 covers these characteristics in more detail.

Exploratory Data Analysis (EDA)

Before you apply statistical techniques to a dataset, it’s important to examine the data to understand its basic properties. You can use a series of techniques that are collectively known as Exploratory Data Analysis (EDA) to analyze a dataset. EDA helps ensure that you choose the correct statistical techniques to analyze and forecast the data. The two basic types of EDA techniques are graphical techniques and quantitative techniques.

Graphical EDA techniques

Graphical EDA techniques show the key properties of a dataset in a convenient format. It’s often easier to understand the properties of a variable and the relationships between variables by looking at graphs rather than looking at the raw data. You can use several graphical techniques, depending on the type of data being analyzed. Chapters 11 and 12 explain how to create and use the following:

Box plots

Histograms

Normal probability plots

Scatter plots
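As a rough illustration of what two of these graphs summarize, the following Python sketch computes the numbers behind a box plot (the five-number summary) and the bin counts behind a histogram. The data values and the bin width are made up for this example, and the quartiles use one of several common conventions (the "inclusive" method of Python's statistics module), so other software may report slightly different quartiles.

```python
import statistics
from collections import Counter

# Hypothetical sample data, made up for illustration.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 30]

# Five-number summary: minimum, first quartile, median, third quartile,
# maximum. These are the values a box plot is built from.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
summary = (min(data), q1, median, q3, max(data))

# Histogram counts: how many values land in each bin of width 10.
BIN_WIDTH = 10
bins = Counter((x // BIN_WIDTH) * BIN_WIDTH for x in data)

print(summary)
for start in sorted(bins):
    # Draw one '#' per observation in the bin, as a text-only histogram.
    print(f"{start:>2}-{start + BIN_WIDTH - 1:<2} {'#' * bins[start]}")
```

Even in this text form, the single value far to the right of the others stands out, which is the kind of property graphical EDA is meant to reveal.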

Quantitative EDA techniques

Quantitative EDA techniques provide a more rigorous method of determining the key properties of a dataset. Two of the most important of these techniques are

Interval estimation (discussed in Chapter 11).

Hypothesis testing (introduced in Chapter 5).

Interval estimates are used to create a range of values within which a variable is likely to fall. Hypothesis testing is used to test various propositions about a dataset, such as

The mean value of the dataset.

The standard deviation of the dataset.

The probability distribution the dataset follows.

Hypothesis testing is a core technique in statistics and is used throughout the chapters in Part III of this book.
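The two quantitative techniques can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method developed later in the book: the sample is invented, the hypothesized mean of 50 and the 5 percent significance level are arbitrary choices, and the calculation relies on a large-sample normal approximation rather than the t-distribution.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample (made up for illustration): 40 measurements
# taking the values 48 through 57.
sample = [48 + (i * 7) % 10 for i in range(40)]

n = len(sample)
mean = statistics.mean(sample)
std_err = statistics.stdev(sample) / math.sqrt(n)

# 95% interval estimate for the population mean (normal approximation).
z_crit = NormalDist().inv_cdf(0.975)
interval = (mean - z_crit * std_err, mean + z_crit * std_err)

# Two-sided test of the null hypothesis "the population mean equals 50".
HYPOTHESIZED_MEAN = 50
z_stat = (mean - HYPOTHESIZED_MEAN) / std_err
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
reject_null = p_value < 0.05

print(f"sample mean: {mean}")
print(f"95% interval: ({interval[0]:.2f}, {interval[1]:.2f})")
print(f"reject null hypothesis: {reject_null}")
```

The two results agree, as they should: the hypothesized mean lies outside the 95 percent interval, and the test rejects the hypothesis at the 5 percent level.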

Chapter 2

Characteristics of Big Data: The Three Vs

In This Chapter

Understanding the characteristics of big data and how it can be classified

Checking out the features of the latest methods for storing and analyzing big data
