Big Data, Data Mining, and Machine Learning

Jared Dean

Description

With big data analytics come big insights into profitability. Big data is big business. But having the data and the computational power to process it isn't nearly enough to produce meaningful results. Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners is a complete resource for technology and marketing executives looking to cut through the hype and produce real results that hit the bottom line. Providing an engaging, thorough overview of the current state of big data analytics and the growing trend toward high-performance computing architectures, the book is a detail-driven look into how big data analytics can be leveraged to foster positive change and drive efficiency. With continued exponential growth in data and ever more competitive markets, businesses must adapt quickly to gain every competitive advantage available. Big data analytics can serve as the linchpin for initiatives that drive business, but only if the underlying technology and analysis are fully understood and appreciated by engaged stakeholders. This book provides the view into the topic that executives, managers, and practitioners require, and includes:

* A complete overview of big data and its notable characteristics
* Details on high-performance computing architectures for analytics, massively parallel processing (MPP), and in-memory databases
* Comprehensive coverage of data mining, text analytics, and machine learning algorithms
* A discussion of explanatory and predictive modeling, and how they can be applied to decision-making processes

Big Data, Data Mining, and Machine Learning provides technology and marketing executives with the complete resource that has been notably absent from the veritable libraries of published books on the topic. Take control of your organization's big data analytics to produce real results with a resource that is comprehensive in scope and light on hyperbole.

Page count: 378

Year of publication: 2014




Additional praise for Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners

“Jared's book is a great introduction to the area of High Powered Analytics. It will be useful for those who have experience in predictive analytics but who need to become more versed in how technology is changing the capabilities of existing methods and creating new possibilities. It will also be helpful for business executives and IT professionals who'll need to make the case for building the environments for, and reaping the benefits of, the next generation of advanced analytics.”

—Jonathan Levine, Senior Director, Consumer Insight Analysis at Marriott International

“The ideas that Jared describes are the same ideas that are being used by our Kaggle contest winners. This book is a great overview for those who want to learn more and gain a complete understanding of the many facets of data mining, knowledge discovery, and extracting value from data.”

—Anthony Goldbloom, Founder and CEO of Kaggle

“The concepts that Jared presents in this book are extremely valuable for the students that I teach and will help them to more fully understand the power that can be unlocked when an organization begins to take advantage of its data. The examples and case studies are particularly useful for helping students to get a vision for what is possible. Jared's passion for analytics comes through in his writing, and he has done a great job of making complicated ideas approachable to multiple audiences.”

—Tonya Etchison Balan, Ph.D., Professor of Practice, Statistics, Poole College of Management, North Carolina State University

Wiley & SAS Business Series

The Wiley & SAS Business Series presents books that help senior-level managers with their critical management decisions.

Titles in the Wiley & SAS Business Series include:

Activity-Based Management for Financial Institutions: Driving Bottom-Line Results

by Brent Bahnub

Analytics in a Big Data World: The Essential Guide to Data Science and its Applications

by Bart Baesens

Bank Fraud: Using Technology to Combat Losses

by Revathi Subramanian

Big Data Analytics: Turning Big Data into Big Money

by Frank Ohlhorst

Branded! How Retailers Engage Consumers with Social Media and Mobility

by Bernie Brennan and Lori Schafer

Business Analytics for Customer Intelligence

by Gert Laursen

Business Analytics for Managers: Taking Business Intelligence beyond Reporting

by Gert Laursen and Jesper Thorlund

The Business Forecasting Deal: Exposing Bad Practices and Providing Practical Solutions

by Michael Gilliland

Business Intelligence Applied: Implementing an Effective Information and Communications Technology Infrastructure

by Michael Gendron

Business Intelligence and the Cloud: Strategic Implementation Guide

by Michael S. Gendron

Business Intelligence Success Factors: Tools for Aligning Your Business in the Global Economy

by Olivia Parr Rud

Business Transformation: A Roadmap for Maximizing Organizational Insights

by Aiman Zeid

CIO Best Practices: Enabling Strategic Value with Information Technology, second edition

by Joe Stenzel

Connecting Organizational Silos: Taking Knowledge Flow Management to the Next Level with Social Media

by Frank Leistner

Credit Risk Assessment: The New Lending System for Borrowers, Lenders, and Investors

by Clark Abrahams and Mingyuan Zhang

Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring

by Naeem Siddiqi

The Data Asset: How Smart Companies Govern Their Data for Business Success

by Tony Fisher

Delivering Business Analytics: Practical Guidelines for Best Practice

by Evan Stubbs

Demand-Driven Forecasting: A Structured Approach to Forecasting, second edition

by Charles Chase

Demand-Driven Inventory Optimization and Replenishment: Creating a More Efficient Supply Chain

by Robert A. Davis

Developing Human Capital: Using Analytics to Plan and Optimize Your Learning and Development Investments

by Gene Pease, Barbara Beresford, and Lew Walker

The Executive's Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business

by David Thomas and Mike Barlow

Economic and Business Forecasting: Analyzing and Interpreting Econometric Results

by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah Watt, and Sam Bullard

Executive's Guide to Solvency II

by David Buckham, Jason Wahl, and Stuart Rose

Fair Lending Compliance: Intelligence and Implications for Credit Risk Management

by Clark R. Abrahams and Mingyuan Zhang

Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fundamental Concepts and Practical Applications

by Robert Rowan

Harness Oil and Gas Big Data with Analytics: Optimize Exploration and Production with Data Driven Models

by Keith Holdaway

Health Analytics: Gaining the Insights to Transform Health Care

by Jason Burke

Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World

by Carlos Andre Reis Pinheiro and Fiona McNeill

Human Capital Analytics: How to Harness the Potential of Your Organization's Greatest Asset

by Gene Pease, Boyce Byerly, and Jac Fitz-enz

Implement, Improve and Expand Your Statewide Longitudinal Data System: Creating a Culture of Data in Education

by Jamie McQuiggan and Armistead Sapp

Information Revolution: Using the Information Evolution Model to Grow Your Business

by Jim Davis, Gloria J. Miller, and Allan Russell

Killer Analytics: Top 20 Metrics Missing from your Balance Sheet

by Mark Brown

Manufacturing Best Practices: Optimizing Productivity and Product Quality

by Bobby Hull

Marketing Automation: Practical Steps to More Effective Direct Marketing

by Jeff LeSueur

Mastering Organizational Knowledge Flow: How to Make Knowledge Sharing Work

by Frank Leistner

The New Know: Innovation Powered by Analytics

by Thornton May

Performance Management: Integrating Strategy Execution, Methodologies, Risk, and Analytics

by Gary Cokins

Predictive Business Analytics: Forward-Looking Capabilities to Improve Business Performance

by Lawrence Maisel and Gary Cokins

Retail Analytics: The Secret Weapon

by Emmett Cox

Social Network Analysis in Telecommunications

by Carlos Andre Reis Pinheiro

Statistical Thinking: Improving Business Performance, second edition

by Roger W. Hoerl and Ronald D. Snee

Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics

by Bill Franks

Too Big to Ignore: The Business Case for Big Data

by Phil Simon

The Value of Business Analytics: Identifying the Path to Profitability

by Evan Stubbs

The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions

by Phil Simon

Visual Six Sigma: Making Data Analysis Lean

by Ian Cox, Marie A. Gaudard, Philip J. Ramsey, Mia L. Stephens, and Leo Wright

Win with Advanced Business Analytics: Creating Business Value from Your Data

by Jean Paul Isson and Jesse Harriott

For more information on any of the above titles, please visit www.wiley.com.

Big Data, Data Mining, and Machine Learning

Value Creation for Business Leaders and Practitioners

Jared Dean

Cover Design: Wiley
Cover Image: © iStockphoto / elly99

Copyright © 2014 by SAS Institute Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Dean, Jared, 1978-
    Big data, data mining, and machine learning : value creation for business leaders and practitioners / Jared Dean.
    1 online resource.—(Wiley & SAS business series)
    Includes index.
    ISBN 978-1-118-92069-5 (ebk); ISBN 978-1-118-92070-1 (ebk); ISBN 978-1-118-61804-2 (hardback)
    1. Management—Data processing. 2. Data mining. 3. Big data. 4. Database management. 5. Information technology—Management. I. Title.
    HD30.2
    658′.05631—dc23

2014009116

To my wife, without whose help, love, and devotion, this book would not exist. Thank you, Katie!

For Geoffrey, Ava, Mason, and Chase: Remember that the quickest path to easy is through hard.

Contents

Foreword

Preface

Notes

Acknowledgments

Introduction

Big Data Timeline

Why This Topic Is Relevant Now

Is Big Data a Fad?

Where Using Big Data Makes a Big Difference

Notes

Part One The Computing Environment

Chapter 1 Hardware

Storage (Disk)

Central Processing Unit

Memory

Network

Notes

Chapter 2 Distributed Systems

Database Computing

File System Computing

Considerations

Notes

Chapter 3 Analytical Tools

Weka

Java and JVM Languages

R

Python

SAS

Notes

Part Two Turning Data into Business Value

Chapter 4 Predictive Modeling

A Methodology for Building Models

sEMMA

Binary Classification

Multilevel Classification

Interval Prediction

Assessment of Predictive Models

Notes

Chapter 5 Common Predictive Modeling Techniques

RFM

Regression

Generalized Linear Models

Neural Networks

Decision and Regression Trees

Support Vector Machines

Bayesian Methods Network Classification

Ensemble Methods

Notes

Chapter 6 Segmentation

Cluster Analysis

Distance Measures (Metrics)

Evaluating Clustering

Number of Clusters

K-means Algorithm

Hierarchical Clustering

Profiling Clusters

Notes

Chapter 7 Incremental Response Modeling

Building the Response Model

Measuring the Incremental Response

Chapter 8 Time Series Data Mining

Reducing Dimensionality

Detecting Patterns

Time Series Data Mining in Action: Nike+ FuelBand

Notes

Chapter 9 Recommendation Systems

What Are Recommendation Systems?

Where Are They Used?

How Do They Work?

Assessing Recommendation Quality

Recommendations in Action: SAS Library

Notes

Chapter 10 Text Analytics

Information Retrieval

Content Categorization

Text Mining

Text Analytics in Action: Let’s Play Jeopardy!

Notes

Part Three Success Stories of Putting It All Together

Qualities of Successful Projects

Chapter 11 Case Study of a Large U.S.-Based Financial Services Company

Traditional Marketing Campaign Process

High-Performance Marketing Solution

Value Proposition for Change

Chapter 12 Case Study of a Major Health Care Provider

CAHPS

HEDIS

HOS

IRE

Chapter 13 Case Study of a Technology Manufacturer

Finding Defective Devices

How They Reduced Cost

Chapter 14 Case Study of Online Brand Management

Chapter 15 Case Study of Mobile Application Recommendations

Chapter 16 Case Study of a High-Tech Product Manufacturer

Handling the Missing Data

Application beyond Manufacturing

Chapter 17 Looking to the Future

Reproducible Research

Privacy with Public Data Sets

The Internet of Things

Software Development in the Future

Future Development of Algorithms

In Conclusion

Notes

About the Author

Appendix Nike+ FuelBand Script to Retrieve Information

References

Introduction

Chapter 2

Chapter 3

Chapter 4

Chapter 5

Chapter 6

Chapter 7

Chapter 8

Chapter 9

Chapter 10

Chapter 17

Index

End User License Agreement

List of Tables

Chapter 4

Table 4.1

Table 4.2

Chapter 5

Table 5.1

Table 5.2

Table 5.3

Table 5.4

Table 5.5

Table 5.6

Table 5.7

Table 5.8

Table 5.9

Table 5.10

Table 5.11

Chapter 7

Table 7.1

Chapter 8

Table 8.1

Chapter 9

Table 9.1

Table 9.2

Table 9.3

Chapter 10

Table 10.1

Table 10.2

Table 10.3

Table 10.4

Table 10.5

Table 10.6

Table 10.7

Chapter 15

Table 15.1

List of Illustrations

Introduction

Figure I.1

Trend of Google Searches of “Big Data” over Time Showing the Popularity of the Term

Chapter 2

Figure 2.1

Graphical illustration of a Hadoop System

Figure 2.2

Comparison of Big Data Technologies

Chapter 4

Figure 4.1

Multidisciplinary Nature of Data Mining

Chapter 5

Figure 5.1

Scatter Plot for Height and Weight for Class

Figure 5.2

Fit Plot for Weight

Figure 5.3

Residuals for Weight

Figure 5.4

Plot Residuals in a Funnel Shape

Figure 5.5

Residuals in an Arc

Figure 5.6

Fit Diagnostics for Weight

Figure 5.7

Effect of Outlier on Regression Line

Figure 5.8

Proportion of Response by Dose

Figure 5.9

Predicted Probabilities for Response

Figure 5.10

Unstandardized DFBETA Plots for Response

Figure 5.11

Original Drawing of a Neural Network

Figure 5.12

Simplified Perceptron Diagram

Figure 5.13

Data Properly Separated by a Hyperplane

Figure 5.14

Standard Three Layer Neural Network

Figure 5.15

Two Output Functions

Figure 5.16

Neural Network Diagram for Iris Data

Figure 5.17

Improvement of Digit Recognition Using Deep Learning

Figure 5.18

Basic Decision Tree

Figure 5.19

Basic Decision Tree Shown as a Partition Region

Figure 5.20

Data with Two Groups for Classification

Figure 5.21

Two of Infinite Solutions to Separate into Correct Groups

Figure 5.22

Best Separating Line

Figure 5.23

Best Separating Line with Perpendicular Reference

Figure 5.24

Best Separating Line with Margin Lines

Figure 5.25

Nonlinear Class Separation

Figure 5.26

Naive Bayes Model for Pain Medication Study

Figure 5.27

House Alarm Example

Chapter 6

Figure 6.1

Example of Valid Sets in Rummikub

Figure 6.2

Plot of Seven Different Sets of Generated Data

Figure 6.3

Diagnostic Plot to Assess the Best Number of Clusters

Chapter 7

Figure 7.1

Control versus Treatment

Figure 7.2

Outlier Identification of Control Group

Figure 7.3

Separation of Responders and Nonresponders in the Control Group

Figure 7.4

Projection of Control Group to Treatment Group

Figure 7.5

Simulation of 1,300 Responders to Coupon Offer

Chapter 8

Figure 8.1

Seasonal Index of Author’s Activity Levels

Figure 8.2

Minute-to-Minute Activity Levels Over 13 Months

Figure 8.3

Average Pattern of Calories Burned over Two-Month Period

Figure 8.4

Comparison of Baseline Series and 01May2013

Figure 8.5

Comparison of Baseline Series and 29April2013

Figure 8.6

Two Most Similar Days to the Baseline

Chapter 9

Figure 9.1

Factor Matrix

Figure 9.2

Restricted Boltzmann Machine for One User

Figure 9.3

Root Mean Square Error on a Held-Out Test Set from SAS Library Data

Chapter 10

Figure 10.1

Jeopardy! Game Board

Chapter 16

Figure 16.1

Time Series Similarity



Foreword

I love the field of predictive analytics and have lived in this world for my entire career. The mathematics are fun (at least for me), but turning what the algorithms uncover into solutions that a company uses and generates profit from makes the mathematics worthwhile. In some ways, Jared Dean and I are unusual in this regard; we really do love seeing these solutions work for organizations we work with. What amazes us, though, is that this field, which we used to practice in the back office, a niche of a niche, has now become one of the sexiest jobs of the twenty-first century. How did this happen?

We live in a world where data is collected in ever-increasing amounts, summarizing more of what people and machines do, and capturing finer granularity of their behavior. These three ways to characterize data are sometimes described as volume, variety, and velocity—the definition of big data. They are collected because of the perceived value in the data even if we don’t know exactly what we will do with it. Initially, many organizations collect it and report summaries, often using approaches from business intelligence that have become commonplace.

But in recent years, a paradigm shift has taken place. Organizations have found that predictive analytics transforms the way they make decisions. The algorithms and approaches to predictive modeling described in this book are not new for the most part; Jared himself describes the big-data problem as nothing new. The algorithms he describes are all at least 15 years old, a testimony to their effectiveness that fundamentally new algorithms are not needed. Nevertheless, predictive modeling is in fact new to many organizations as they try to improve decisions with data. These organizations need to gain an understanding not only of the science and principles of predictive modeling but how to apply the principles to problems that defy the standard approaches and answers.

But there is much more to predictive modeling than just building predictive models. The operational aspects of predictive modeling projects are often overlooked and are rarely covered in books and courses. First, this includes specifying the hardware and software needed for a predictive modeling project. As Jared describes, this depends on the organization, the data, and the analysts working on the project. Without setting up analysts with the proper resources, projects flounder and often fail. I’ve personally witnessed this on projects I have worked on, where hardware was improperly specified, causing me to spend a considerable amount of time working around limitations in RAM and processing speed.

Ultimately, the success of predictive modeling projects is measured by the metric that matters to the organization using it, whether it be increased efficiency, ROI, customer lifetime value, or soft metrics like company reputation. I love the case studies in this book that address these issues, and you have a half-dozen here to whet your appetite. This is especially important for managers who are trying to understand how predictive modeling will impact their bottom line.

Predictive modeling is science, but successful implementation of predictive modeling solutions requires connecting the models to the business. Experience is essential to recognize these connections, and there is a wealth of experience here to draw from to propel you in your predictive modeling journey.

Dean Abbott
Abbott Analytics, Inc.
March 2014

Preface

This book project was first presented to me during my first week in my current role of managing the data mining development at SAS. Writing a book has always been a bucket-list item, and I was very excited to be involved. I’ve come to realize why so many people want to write books, but why so few get the chance to see their thoughts and ideas bound and published.

I’ve had the opportunity during my studies and professional career to be front and center to some great developments in the area of data mining and to study under some brilliant minds. This experience helped position me with the skills and experience I needed to create this work.

Data mining is a field I love. Ever since childhood, I’ve wanted to explain how things work and understand how systems function, both in the “average” case and at the extremes. From elementary school through high school, I thought engineering would be the job that would couple my curiosity with my desire to explain the world around me. However, before my last year as an undergraduate student, I found statistics and information systems, and I was hooked.

In Part One of the book, I explore the foundations of hardware and system architecture. This is a love that my parents were kind enough to indulge me in, in a day when computers cost much, much more than $299. The first computer in my home was an Apple IIc, with two 5.25" floppy disk drives and no hard drive. A few years later I built an Intel 386 PC from a kit, and I vividly remember playing computer games and hitting the turbo button to move the CPU clock speed from 8 MHz to 16 MHz. I’ve seen Moore’s Law firsthand, and it still amazes me that my smartphone holds more computing power than the computers used in the Mercury space program, the Apollo space program, and the Orbiter space shuttle program combined.

After I finished my undergraduate degree in statistics, I began to work for the federal government at the U.S. Bureau of the Census. This is where I got my first exposure to big data. Prior to joining the Census Bureau, I had never written a computer program that took more than a minute to run (unless the point was to make the program run for more than a minute). One of my first projects was working with the Master Address File (MAF),1 which is an address list maintained by the Census Bureau. This address list is also the primary survey frame for current surveys that the Census Bureau administers (yes, there is lots of work to do the other nine years). The list has more than 300 million records, and combining all the address information, longitudinal information, and geographic information, there are hundreds of attributes associated with each housing unit. Working with such a large data set was where I first learned about programming efficiency, scalability, and hardware optimization. I’m grateful to my patient manager, Maryann, who gave me the time to learn and provided me with interesting, valuable projects that gave me practical experience and the opportunity to innovate. It was a great position because I got to try new techniques and approaches that had not been studied before in that department. As with any new project, some ideas worked great and others failed. One specific project I was involved in was trying to identify which blocks (the Census Bureau has the United States divided up into unique geographic areas; the hierarchy is state, county, tract, block group, and block, and there are about 8.2 million blocks in the United States) from Census 2000 had been overcounted or undercounted. With the available data, we did not have a way to verify that our model for predicting the deviation of actual housing unit count from reported housing unit count was accurate. The program was fortunate to have funding from Congress to conduct field studies to provide feedback and validation of the models. This was the first time I had heard the term “data mining,” and I was first exposed to SAS™ Enterprise Miner® and CART® by Salford Systems. After a period of time working for the Census Bureau, I realized that I needed more education to achieve my career goals, and so I enrolled in the statistics department at George Mason University in Fairfax, VA.

During graduate school, I learned in more detail about the algorithms common to the fields of data mining, machine learning, and statistics; these included survival analysis, survey sampling, and computational statistics. Through my graduate studies, I was able to merge the lessons taught in the classroom to the practical data analysis and innovations required in the office. I acquired an understanding of the theory and the relative strengths and weaknesses of different approaches for data analysis and predictive analytics.

After graduate school, I changed direction in my career, moving from a data analysis2 role to becoming a software developer. I went to work for SAS Institute Inc., where I participated in the creation of the software that I had previously used. I had moved from using the software to building it. This presented new challenges and opportunities for growth as I learned about the rigorous numerical validation that SAS imposes on its software, along with its thorough documentation and tireless effort to make new software enhancements consistent with existing software and to consistently deliver new software features that customers need.

During my years at SAS, I’ve come to thoroughly understand how the software is made and how our customers use it. I often get the chance to visit with customers, listen to their business challenges, and recommend methods or processes that help lead them to success, creating value for their organizations.

It is from this collection of experience that I wrote this book, along with the help of the wonderful staff and my colleagues both inside and outside of SAS Institute.

NOTES

1

The MAF is created during decennial census operations for every housing unit, or potential housing unit, in the United States.

2

I was a data scientist before the term was invented.

Acknowledgments

I would like to thank all those who helped me to make this book a reality. It was a long journey and a wonderful learning and growing experience.

Patrick Hall, thank you for your validation of my ideas and contributing many of your own. I appreciate that I could discuss ideas and trends with you and get thoughtful, timely, and useful feedback.

Joseph Pingenot, Ilknur Kabul, Jorge Silva, Larry Lewis, Susan Haller, and Wendy Czika, thank you for sharing your domain knowledge and passion for analytics.

Michael Wallis, thank you for your help in the text analytics area and developing the Jeopardy! example.

Udo Sglavo and Taiyeong Lee, thank you for reviewing and offering significant contributions in the analysis of times series data mining.

Barbara Walters and Vicki Jones, thank you for all the conversations about reads and feeds in understanding how the hardware impacted the software.

Jared Peterson, thank you for your help in downloading the data from my Nike+ FuelBand.

Franklin So, thank you for your excellent description of a customer’s core business problem.

Thank you Grandma Catherine Coyne, who sacrificed many hours to help a fellow author in editing the manuscript to greatly improve its readability. I am very grateful for your help and hope that when I am 80-something I can be half as active as you are.

I would also like to thank the staff of SAS Press and John Wiley & Sons for the feedback and support through all phases of this project, including some major detours along the way.

Finally, I need to acknowledge my wife, Katie, for shouldering many burdens as I researched, wrote, edited, and wrote more. Meeting you was the best thing that has happened to me in my whole life.

Introduction

Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world.

—Atul Butte, Stanford University

“Cancer” is the term given to a class of diseases in which abnormal cells divide in an uncontrolled fashion and invade body tissues. There are more than 100 unique types of cancer. Most are named after the location (usually an organ) where they begin. Cancer begins in the cells of the body. Under normal circumstances, the human body controls the production of new cells to replace cells that are old or have become damaged. Cancer is not normal. In patients with cancer, cells do not die when they are supposed to, and new cells form when they are not needed (like when I ask my kids to use the copy machine and I get back ten copies instead of the one I asked for). The extra cells may form a mass of tissue; this is referred to as a tumor. Tumors come in two varieties: benign tumors, which are not cancerous, and malignant tumors, which are cancerous. Malignant tumors spread through the body and invade the tissue. My family, like most I know, has lost a family member to the disease. There were an estimated 1.6 million new cases of cancer in the United States in 2013 and more than 580,000 deaths as a result of the disease.

An estimated 235,000 people in the United States were diagnosed with breast cancer in 2014, and about 40,000 died as a result of the disease. The most common type of breast cancer is ductal carcinoma, which begins in the lining of the milk ducts. The next most common type is lobular carcinoma. There are a number of treatment options for breast cancer, including surgery, chemotherapy, radiation therapy, immunotherapy, and vaccine therapy. Often one or more of these options is combined to help ensure the best outcome for the patient. About 60 different drugs are approved by the Food and Drug Administration (FDA) for the treatment of breast cancer. The course of treatment and the drug protocols to be used are decided in consultation between doctor and patient, and a number of factors go into those decisions.

One of the FDA-approved drug treatments for breast cancer is tamoxifen citrate. It is sold under the brand name Nolvadex; it was first prescribed in England in 1969 but not approved by the FDA until 1998. Tamoxifen is normally taken as a daily tablet in doses of 10 mg, 20 mg, or 40 mg. It carries a number of side effects, including nausea, indigestion, and leg cramps. Tamoxifen has been used to treat millions of women and men diagnosed with hormone-receptor-positive breast cancer. It is often one of the first drugs prescribed for treating breast cancer because it has a high success rate of around 80%.

Learning that a drug is 80% successful gives us hope that tamoxifen will provide good patient outcomes, but there is one important detail about the drug that was not known until the big data era: tamoxifen is not 80% effective in all patients but 100% effective in 80% of patients and ineffective in the rest. That is a life-changing finding for thousands of people each year. Using techniques and ideas discussed in this book, scientists were able to identify genetic markers that indicate, in advance, whether tamoxifen will effectively treat a person diagnosed with breast cancer. This type of analysis was not possible before the era of big data. Why not? Because the volume and granularity of the data were missing: volume came from pooling patient results, and granularity came from DNA sequencing. In addition to the data, the computational resources needed to solve a problem like this were not readily available to most scientists outside of a supercomputing lab. Finally, the third component, the algorithms and modeling techniques needed to understand such relationships, has matured greatly in recent years.

The story of tamoxifen highlights the exciting opportunities available to us as we gain more and more data along with the computing resources and algorithms that aid in classification and prediction. With knowledge like that gained by the scientists studying tamoxifen, we can begin to reshape the treatment of disease and positively disrupt many other areas of our lives. With these advances we can avoid giving the average treatment to everyone and instead determine which people will be helped by a particular drug. No longer will a drug be 5% effective; now we can identify which 5% of patients the drug will help. The concept of personalized medicine has been discussed for many years, and with advances in working with big data and improved predictive analytics, it is more of a reality than ever. A drug with a 2% success rate will never be pursued by a drug manufacturer or approved by the FDA unless it can be determined which patients it will help. If that information exists, then lives can be saved. Tamoxifen is one of many examples that show the potential that exists if we can take advantage of the computational resources available and are patient enough to find the value in the data that surrounds us.

We are currently living in the big data era. The term “big data” was coined around the time the era began. While I consider the big data era to have begun in 2001, the date is the source of some debate and impassioned discussion on blogs, and even in the New York Times. The term “big data” appears to have been first used, in its currently understood context, in the late 1990s. The first academic paper was presented in 2000, and published in 2003, by Francis X. Diebold, “Big Data Dynamic Factor Models for Macroeconomic Measurement and Forecasting,” but credit is largely given to John Mashey, the chief scientist for SGI, as the first person to use the term “big data.” In the late 1990s, Mashey gave a series of talks to small groups about the big data tidal wave that was coming. The big data era is characterized by rapidly expanding data volumes, far beyond what most people imagined would ever occur.

The large data volume does not by itself define this as the big data era, because data volumes larger than our ability to work with them effectively have always existed. What sets the current time apart is that companies, governments, and nonprofit organizations have experienced a shift in behavior. In this era, they want to use all the data they can possibly collect, for current or future unknown purposes, to improve their business. It is widely believed, with significant support from research and case studies, that organizations that use data to make decisions do, over time, make better decisions, which leads to a stronger, more viable business. With the velocity at which data is created increasing at such a rapid rate, companies have responded by keeping every piece of data they can possibly capture and valuing the future potential of that data more highly than they did in the past.

How much personal data do we generate? The first question is: What is personal data? In 1995, the European Union, in privacy legislation, defined it as any information that could identify a person, directly or indirectly. International Data Corporation (IDC) estimated that 2.8 zettabytes of data were created in 2012 and that the amount of data generated each year will double by 2015. With such a large figure, it is hard to understand how much of that data is actually about you. It breaks down to about 5 gigabytes of data per day for the average American office worker. This data consists of email, downloaded movies, streamed audio, Excel spreadsheets, and so on. It also includes the data generated as information moves throughout the Internet. Much of this generated data is not seen directly by you or me but is stored about us. Some examples of such indirect data are traffic camera footage, GPS coordinates from our phones, and toll transactions as we speed through automated E-ZPass lanes.

Before the big data era began, businesses assigned relatively low value to the data they were collecting that did not have immediate value. When the big data era began, this investment in collecting and storing data for its potential future value changed, and organizations made a conscious effort to keep every potential bit of data. This shift in behavior created a virtuous circle in which data was stored and then, because data was available, people were assigned to find value in it for the organization. The success in finding value led to more data being gathered, and so on. Some of the stored data was a dead end, but many times the results confirmed that the more data you have, the better off you are likely to be. The other major change at the beginning of the big data era was the rapid development, creation, and maturation of technologies to store, manipulate, and analyze this data in new and efficient ways.

Now that we are in the big data era, our challenge is not getting data but getting the right data and using computers to augment our domain knowledge and identify patterns that we did not see or could not find previously.

Some key technologies and market disruptions have led us to this point in time where the amount of data being collected, stored, and considered in analytical activities has grown at a tremendous rate. This is due to many factors including Internet Protocol version 6 (IPv6), improved telecommunications equipment, technologies like RFID, telematics sensors, the reduced per unit cost of manufacturing electronics, social media, and the Internet.

Here is a timeline that highlights some of the key events leading up to the big data era and events that continue to shape the usage of big data and the future of analytics.

BIG DATA TIMELINE

Here are a number of items that show influential events that prepared the way for the big data era and significant milestones during the era.

1991

The Internet, or World Wide Web as we know it, is born. The protocol Hypertext Transfer Protocol (HTTP) becomes the standard means for sharing information in this new medium.

1995

Sun releases the Java platform. Java, invented in 1991, has become the second most popular language behind C. It dominates the Web applications space and is the de facto standard for middle-tier applications. These applications are the source for recording and storing web traffic.

Global Positioning System (GPS) becomes fully operational. GPS was originally developed by DARPA (Defense Advanced Research Projects Agency) for military applications in the early 1970s. This technology has become omnipresent in applications for car and airline navigation and finding a missing iPhone.

1998

Carlo Strozzi develops an open-source relational database and calls it NoSQL. Ten years later, a movement to develop NoSQL databases to work with large, unstructured data sets gains momentum.

Google is founded by Larry Page and Sergey Brin, who worked for about a year on a Stanford search engine project called BackRub.

1999

Kevin Ashton, cofounder of the Auto-ID Center at the Massachusetts Institute of Technology (MIT), invents the term “the Internet of Things.”

2001

Wikipedia is launched. The crowd-sourced encyclopedia revolutionized the way people reference information.

2002

Version 1.1 of the Bluetooth specification is released by the Institute of Electrical and Electronics Engineers (IEEE). Bluetooth is a wireless technology standard for the transfer of data over short distances. The advancement of this specification and its adoption lead to a whole host of wearable devices that communicate between the device and another computer. Today nearly every portable device has a Bluetooth receiver.

2003

According to studies by IDC and EMC, the amount of data created in 2003 surpasses the amount of data created in all of human history before then. It is estimated that 1.8 zettabytes (ZB) was created in 2011 alone (1.8 ZB is the equivalent of 200 billion high-definition movies, each two hours long, or 47 million years of footage with no bathroom breaks).
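The movie equivalence above can be sanity-checked with simple unit arithmetic. The sketch below assumes the decimal definition of a zettabyte (10^21 bytes); the implied file size of about 9 GB per two-hour high-definition movie is the derived figure, not one stated in the source.

```python
ZB = 10 ** 21  # one zettabyte in bytes (decimal definition)
GB = 10 ** 9   # one gigabyte in bytes

data_2011 = 1.8 * ZB   # total data created in 2011, per the IDC/EMC estimate
movies = 200e9         # 200 billion two-hour high-definition movies

# Dividing total bytes by the movie count recovers the implied size per movie.
bytes_per_movie = data_2011 / movies
print(bytes_per_movie / GB)  # about 9 GB per movie, a plausible HD file size
```

Working the numbers backward like this is a quick way to judge whether a headline data-volume claim is internally consistent.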

LinkedIn, the popular social networking website for professionals, launches. In 2013, the site had about 260 million users.

2004

Wikipedia reaches 500,000 articles in February; seven months later it tops 1 million articles.

Facebook, the social networking service, is founded by Mark Zuckerberg and others in Cambridge, Massachusetts. In 2013, the site had more than 1.15 billion users.

2005

The Apache Hadoop project is created by Doug Cutting and Mike Cafarella. The name for the project came from the toy elephant of Cutting’s young son. The now-famous yellow elephant becomes a household word just a few years later and a foundational part of almost all big data strategies.

The National Science Board recommends that the National Science Foundation (NSF) create a career path for “a sufficient number of high-quality data scientists” to manage the growing collection of digital information.

2007

Apple releases the iPhone and creates a strong consumer market for smartphones.

2008

The number of devices connected to the Internet exceeds the world’s population.

2011

IBM’s Watson computer scans and analyzes 4 terabytes (200 million pages) of data in seconds to defeat two human players on the television show Jeopardy! (There is more about the show in Part Two.)

Work begins on UnQL, a query language for NoSQL databases.

The available pools in the IPv4 address space have all been assigned. IPv4 is a standard for assigning an Internet protocol (IP) address. The IPv4 protocol is based on a 32-bit number, meaning there are 2^32, or roughly 4.3 billion, unique addresses available. This event shows the real demand for, and quantity of, Internet-connected devices.
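The exhaustion of the IPv4 space follows directly from the arithmetic. A minimal sketch, using a rounded 2011 world population of 7 billion as an assumption:

```python
# IPv4 addresses are 32-bit numbers, so the total address space is 2**32.
total_ipv4 = 2 ** 32
print(total_ipv4)  # 4294967296 -> roughly 4.3 billion addresses

# With a world population of about 7 billion (an approximation),
# even one address per person would already exhaust the IPv4 space,
# before counting servers, routers, and multi-device households.
world_population_2011 = 7_000_000_000
print(total_ipv4 < world_population_2011)  # True
```

IPv6, mentioned earlier as one of the enabling technologies, addresses exactly this limit by moving to 128-bit addresses.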

2012

The Obama administration announces the Big Data Research and Development Initiative, consisting of 84 programs in six departments. The NSF publishes “Core Techniques and Technologies for Advancing Big Data Science & Engineering.”

IDC and EMC estimate that 2.8 ZB of data will be created in 2012, but that only 3% of what could be usable for big data is tagged, and even less is analyzed. The report predicts that by 2020 the digital world will hold 40 ZB, 57 times the number of grains of sand on all the beaches in the world.

The Harvard Business Review calls the job of data scientist “the sexiest job of the 21st century.”

2013

The democratization of data begins. With smartphones, tablets, and Wi-Fi, everyone generates data at prodigious rates. More individuals access large volumes of public data and put data to creative use.

The events of the last 20 years have fundamentally changed the way data is treated. We create more of it each day; it is not a waste product but a buried treasure waiting to be discovered by curious, motivated researchers and practitioners who see these trends and are reaching out to meet the current challenges.

WHY THIS TOPIC IS RELEVANT NOW