Big Data, Big Analytics - Michael Minelli - E-Book

Description

A unique perspective on the big data analytics phenomenon for both business and IT professionals.

The availability of Big Data, low-cost commodity hardware, and new information management and analytics software has produced a unique moment in the history of business. The convergence of these trends means that we have the capabilities required to analyze astonishing data sets quickly and cost-effectively for the first time in history. These capabilities are neither theoretical nor trivial. They represent a genuine leap forward and a clear opportunity to realize enormous gains in terms of efficiency, productivity, revenue, and profitability. The Age of Big Data is here, and these are truly revolutionary times. This timely book looks at cutting-edge companies supporting an exciting new generation of business analytics.

* Covers the trends in big data and how they are impacting the business world (risk, marketing, health care, financial services, etc.)
* Explains this new technology and how companies can use it effectively to gather the data they need and glean critical insights
* Explores relevant topics such as data privacy, data visualization, unstructured data, crowdsourcing data scientists, cloud computing for big data, and much more

Page count: 336

Year of publication: 2012

CONTENTS

Foreword

Preface

Acknowledgments

Chapter 1: What Is Big Data and Why Is It Important?

A Flood of Mythic “Start-Up” Proportions

Big Data Is More Than Merely Big

Why Now?

A Convergence of Key Trends

Relatively Speaking . . .

A Wider Variety of Data

The Expanding Universe of Unstructured Data

Setting the Tone at the Top

Notes

Chapter 2: Industry Examples of Big Data

Digital Marketing and the Non-line World

Database Marketers, Pioneers of Big Data

Big Data and the New School of Marketing

Fraud and Big Data

Risk and Big Data

Credit Risk Management

Big Data and Algorithmic Trading

Big Data and Advances in Health Care

Pioneering New Frontiers in Medicine

Advertising and Big Data: From Papyrus to Seeing Somebody

Using Consumer Products as a Doorway

Notes

Chapter 3: Big Data Technology

The Elephant in the Room: Hadoop’s Parallel World

Old vs. New Approaches

Data Discovery: Work the Way People’s Minds Work

Open-Source Technology for Big Data Analytics

The Cloud and Big Data

Predictive Analytics Moves into the Limelight

Software as a Service BI

Mobile Business Intelligence Is Going Mainstream

Crowdsourcing Analytics

Inter- and Trans-Firewall Analytics

R&D Approach Helps Adopt New Technology

Big Data Technology Terms

Data Size 101

Notes

Chapter 4: Information Management

The Big Data Foundation

Big Data Computing Platforms (or Computing Platforms That Handle the Big Data Analytics Tsunami)

Big Data Computation

More on Big Data Storage

Big Data Computational Limitations

Big Data Emerging Technologies

Chapter 5: Business Analytics

The Last Mile in Data Analysis

Geospatial Intelligence Will Make Your Life Better

Listening: Is It Signal or Noise?

Consumption of Analytics

From Creation to Consumption

Visualizing: How to Make It Consumable?

Organizations Are Using Data Visualization as a Way to Take Immediate Action

Moving from Sampling to Using All the Data

Thinking Outside the Box

360° Modeling

Need for Speed

Let’s Get Scrappy

What Technology Is Available?

Moving from Beyond the Tools to Analytic Applications

Notes

Chapter 6: The People Part of the Equation

Rise of the Data Scientist

Using Deep Math, Science, and Computer Science

The 90/10 Rule and Critical Thinking

Analytic Talent and Executive Buy-in

Developing Decision Sciences Talent

Holistic View of Analytics

Creating Talent for Decision Sciences

Creating a Culture That Nurtures Decision Sciences Talent

Setting Up the Right Organizational Structure for Institutionalizing Analytics

Chapter 7: Data Privacy and Ethics

The Privacy Landscape

The Great Data Grab Isn’t New

Preferences, Personalization, and Relationships

Rights and Responsibility

Conscientious and Conscious Responsibility

Privacy May Be the Wrong Focus

Can Data Be Anonymized?

Balancing for Counterintelligence

Now What?

Notes

Conclusion

Recommended Resources

About the Authors

Index

WILEY CIO SERIES

Founded in 1807, John Wiley & Sons is the oldest independent publishing company in the United States. With offices in North America, Europe, Asia, and Australia, Wiley is globally committed to developing and marketing print and electronic products and services for our customers’ professional and personal knowledge and understanding.

The Wiley CIO series provides information, tools, and insights to IT executives and managers. The products in this series cover a wide range of topics that supply strategic and implementation guidance on the latest technology trends, leadership, and emerging best practices.

Titles in the Wiley CIO series include:

The Agile Architecture Revolution: How Cloud Computing, REST-Based SOA, and Mobile Computing Are Changing Enterprise IT by Jason Bloomberg
Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses by Michele Chambers, Ambiga Dhiraj, and Michael Minelli
The Chief Information Officer’s Body of Knowledge: People, Process, and Technology by Dean Lane
CIO Best Practices: Enabling Strategic Value with Information Technology by Joe Stenzel, Randy Betancourt, Gary Cokins, Alyssa Farrell, Bill Flemming, Michael H. Hugos, Jonathan Hujsak, and Karl D. Schubert
The CIO Playbook: Strategies and Best Practices for IT Leaders to Deliver Value by Nicholas R. Colisto
Enterprise IT Strategy, + Website: An Executive Guide for Generating Optimal ROI from Critical IT Investments by Gregory J. Fell
Executive’s Guide to Virtual Worlds: How Avatars Are Transforming Your Business and Your Brand by Lonnie Benson
Innovating for Growth and Value: How CIOs Lead Continuous Transformation in the Modern Enterprise by Hunter Muller
IT Leadership Manual: Roadmap to Becoming a Trusted Business Partner by Alan R. Guibord
Managing Electronic Records: Methods, Best Practices, and Technologies by Robert F. Smallwood
On Top of the Cloud: How CIOs Leverage New Technologies to Drive Change and Build Value Across the Enterprise by Hunter Muller
Straight to the Top: CIO Leadership in a Mobile, Social, and Cloud-based World (Second Edition) by Gregory S. Smith
Strategic IT: Best Practices for IT Managers and Executives by Arthur M. Langer
Strategic IT Management: Transforming Business in Turbulent Times by Robert J. Benson
Transforming IT Culture: How to Use Social Intelligence, Human Factors and Collaboration to Create an IT Department That Outperforms by Frank Wander
Unleashing the Power of IT: Bringing People, Business, and Technology Together by Dan Roberts
The U.S. Technology Skills Gap: What Every Technology Executive Must Know to Save America’s Future by Gary Beach

Cover image: © nobeastsofierce/Alamy

Cover design: John Wiley & Sons, Inc.

Copyright © 2013 by Michael Minelli, Michele Chambers, and Ambiga Dhiraj. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Cataloging-in-Publication Data

Minelli, Michael, 1974-

Big data, big analytics : emerging business intelligence and analytic trends for today’s businesses / Michael Minelli, Michele Chambers, Ambiga Dhiraj.

pages cm

Includes bibliographical references and index.

ISBN 978-1-118-14760-3 (cloth); ISBN 978-1-118-22583-7 (ebk); ISBN 978-1-118-23915-5 (ebk); ISBN 978-1-118-26381-5 (ebk)

1. Business intelligence. 2. Information technology. 3. Data processing. 4. Data mining. 5. Strategic planning. I. Chambers, Michele. II. Dhiraj, Ambiga, 1975- III. Title.

HD38.7.M565 2013

658.4′72—dc23

2012044882

To my wife Jenny and our three incredible children, Jack, Madeline, and Max. Also to my parents, who have always been there for me.

—Mike

To my son Cole, who is the light of my life and the person who taught me empathy. Also to my adopted family and support system, Lisa Patrick, Pei Yee Cheng, and Patrick Thean. Finally, to my colleagues Bill Zannine, Brian Hess, Jon Niess, Matt Rollender, Kevin Kostuik, Krishnan Parasuraman, Mario Inchiosa, Thomas Baeck, Thomas Dinsmore, and Usama Fayyad, for their generous support.

—Michele

To Mu Sigmans all around the world for their passion toward building the decision sciences industry.

—Ambiga

FOREWORD: BIG DATA AND CORPORATE EVOLUTION

When my friend Mike Minelli asked me to write this foreword I wasn’t sure at first what I should put on paper. Forewords are often one part book summary and one part overview of the field. But when I read the draft Mike sent me I realized that this is a really good book, and it doesn’t need either of those. Without any additional help from me it will give you plenty of insight into what is happening and why it’s happening now, and it will help you see the possibilities for your industry in this transition to a data-centric age. Also, the book is just full of practical suggestions for what you can do about them. But perhaps there’s an opportunity to establish a wider context. To explore what Big Data means across a broad arc of technological advancement. So rather than bore you with a summary of a book you’re going to read anyway, I’ll try to daub a bit of paint onto the big picture of what it all might mean.

This foreword is based on the thesis that Big Data isn’t merely another technology. It isn’t just another gift box en route to the world’s systems integrators via the conveyor belt of Gartner hype cycles. I believe Big Data will follow digital computing and internetworking to take its place as the third epoch of the information age, and in doing so it will fundamentally alter the trajectory of corporate evolution. The corporation is about to undergo a change analogous to the rise of consciousness in humans.

So let’s start at the beginning. The Industrial Age was an era of vast changes in society. We harnessed first steam and then electricity as prime movers to unleash astonishing increases in productivity. The result was the first sustained growth of wealth in human history.

Those early industrial concerns required vast pools of labor that gradually grew more specialized. To coordinate the efforts of all of those people, management developed systems of rules and hierarchies of authority. At massive scale the corporation was no longer the direct exercise of an owner’s will; it was a kind of organism.

It was an organism whose systems of control were born out of the Napoleonic bureaucracy of the French State and its emphasis on specialized function, fixed rules, and rigid hierarchy. The “bureau” in bureaucracy literally means desk, and paper was both the storage mechanism in them and the signaling mechanism between them.

The bureaucracy was a form of organization that could process stimuli at scale and coordinate masses of participants, but it was, and remains today, severely limited in its evolutionary progress. Bureaucracy is the nematode of human industrial organization.

With over 24,000 species the nematode is a plentiful and adaptable round worm whose nervous system typically consists of 302 neurons. A mere 20 of those neurons are in its pharyngeal nervous system, the part that serves as a rudimentary brain. Yet it is able to maintain homeostasis, direct movement, detect information in its environment, create complex responses, and even manage some basic learning. So, it’s a nice approximation for the bureaucratic corporation.

Despite its display of complex behaviors the nematode is of course completely unaware of them in any conscious sense. Its actions, like those of a bureaucracy, are reactive and dispositional. A worm bumps into something and is stimulated. Neurons fire. Worm reacts. It moves away or maybe eats what it bumped. Likewise shelves go empty and an order is placed. Papers move between desks. Trucks arrive. Shelves get replenished.

Worms and corporations are both complex event-processing engines, but they are largely deterministic. The corporation is evolving though, becoming more aware of its surroundings and emergent in its reactions. The information age, or the second industrial age, has been a major part of that.

In 1954 Joe Glickauf of Arthur Andersen implemented a payroll system for the General Electric Corporation on a UNIVAC 1 digital electronic computer. He thus introduced the computational epoch of the information age to the American corporation. (Incidentally, also creating the IT consulting industry.) Throughout the 1950s other corporations rapidly adopted systems like it to serve a wide spectrum of corporate processes. The corporation was still a nematode but we were wiring the worm and aggressively digitizing its nervous system.

Yet it remained basically the same worm. Sure, it became more efficient and could react faster but with basically the same dispositions, because as we automated those existing systems with computers we mimicked the paper. Invoices, accounts, and customer master files all simply migrated into the machine as we dumped file cabinets into database tables. We were wiring the worm, but we weren’t re-wiring it.

So it remained a bureaucracy, just a more efficient, responsive, and scalable one. Yet this was the beginning of a symbiotic evolution between corporation and information age technology and it became a departure point in the corporation’s further evolutionary history. This digital foundation is the substrate on which further evolutionary processes would occur.

Then about thirty years ago, Leonard Kleinrock, Lawrence Roberts, Robert Kahn, and Vint Cerf invented the Internet and ushered in the second epoch of the information age, the network era.

Suddenly our little worm was connected to its peers and surrounding ecosystem in ways that it hadn’t been before. Messaging between companies became as natural as messaging between desks and with later pushes by Jack Welch and others who understood the revolution that was at hand, those messages finally succumbed to the pull of digitization. The era of the paper purchase order and invoice finally died. The first 35 years of digitization had focused on internal processes; now the focus was more on interactions with the outside world. (I say more, because EDI had been around for a while. But it was with the cost structure of the Internet that it really took off.) For the worm it was like the evolution of a sixth sense. It could see further, predict deeper into the future, and respond faster.

But those new networks didn’t just affect the way our corporations interacted with the outside world. They also began to erode the very foundation of bureaucracy: its hierarchy.

While the strict hierarchy of bureaucracy had been a force multiplier for labor during the industrial age, in practice it meant that a company could never be smarter than the smartest person at its head. Restrained by hierarchy, rigid rules, and specialized functions, the sum total of a corporation’s intelligence was always much less than the sum of the intelligence of its participants.

With globalization, complex connections, and faster market cycle times the complexity of the corporation’s environment has increased rapidly and has long since exceeded the complexity that any single person can understand. There has after all only been one Steve Jobs. Something had to give.

So corporations have (slowly) begun the journey toward more agile, network-enabled, learning organizations that can crowdsource intelligence both within their ranks and from inside their customer bases. They are beginning to exhibit locally emergent behaviors in response to that learning. This is what is behind corporate mottos like Facebook’s “Move fast and break things.” It’s just another way of saying that initiative is local and that the head can’t know everything.

Of course companies in the network era still have organization charts. But they don’t tell the whole story anymore. These days we need to analyze email patterns, phone records, instant messaging and other evidence of actual human connection to determine the real organizational model that emerges like an interstitial lattice within the official org chart.

So corporate evolution is no longer just incremental improvement along an efficiency and productivity vector. The very form of the corporation is changing, enabled by technology and spurred by the necessity of complexity and cycle times. The corporation is growing external sensors and the necessary neurons to deal with what it discovers. It is changing from dispositional and reactive to complex and emergent in order to better impedance match with the post-industrial world it occupies.

So here we are, at the doorstep of the Information Age’s Big Data epoch. The corporation has already taken advantage of the computing and internetworking epochs to evolve significantly and adapt to a more complex world. But even bigger changes are ahead.

This book will take you through the entire Big Data story, so I’m not going to expound much on the meaning of Big Data here. I’ll just describe enough to set the stage for the next phase of corporate evolution. And this is a key point: Big Data isn’t Business Intelligence (BI) with bigger data.

We are no longer limited to the structured transactional world that has been the domain of corporate information technology for the last 55 years. Big Data represents a transition-in-kind for both storage and analysis. It isn’t just about size.

The data your corporation does “BI” with today is mostly internally generated highly-structured transactional data. It’s like a record of the neurons that fired. All too often the role of the business intelligence analyst really boils down to corporate kinesthesis. Reports are generated to tell the head of a hierarchy what its limbs are doing, or did.

But Big Data has the potential to be different. For one, often the data being analyzed will come from somewhere else, and in its original unstructured form. And two, we won’t just be analyzing what we did; we’ll be analyzing what is happening in the world around us, with all of the richness and detail of the original sensation.

Now we can think of web logs, video clips, voice response unit recordings, every document in every SharePoint repository, social data, open government data, partner data sets, and many more as part of our analytical corpus. No longer limited to mere introspection, analysis can be about more deeply detailed external sensing. What do my customers do? Who do they know? Were they happy or angry when they called? What are their network neighbors like and when and how much will they be influenced by them? Which of my customers are most similar? What are they saying about our competitors? What are they buying from our competitors? Are my competitors’ parking lots full? And on and on. . .

Perhaps more importantly, how can this mass of data be turned directly into product, or at least an attribute of our products? Can we close the loop: from what we sense in our environment, to what we know, and to what we do?

The term data science speaks to the notion that we are now using data to apply the scientific method to our businesses. We create (or discover) hypotheses, run experiments, see if our customers react the way we predict and then build new products or interactions based on the results. Forward thinking companies are closing the loops so that the entire process runs without human intervention and products are updated in real time based on customer behavior or other inputs.

Put another way, the corporation’s OODA loop (Observe, Orient, Decide, Act; the work of USAF Col. John Boyd, it describes a model for action in the face of uncertainty) is being implemented, at least on the tactical time scale, directly in the machinery of the corporation. Humans design the algorithms, but their participation isn’t necessary beyond that. And unlike traditional BI, which focused on the OO of the OODA loop, the modern corporation has to directly integrate the Decide and Act phases to keep up with the dynamics of the modern market. It’s not enough to be more analytical; future corporations will require greater product and organizational agility to act in real time.

As an analogy, we humans experience our world in real time via internally rendered maps of our sensory perceptions, and we store those maps as memory. Maps are the scaffolding on which mind and our processes of self unfold. They are the evolutionary portal through which we passed from disposition to reasoning, when along the way we evolved from reactive worm to reasoning human.

By storing rich complex interactions, the corporation is beginning to create and store map-like structures as well. Instead of reducing complex interactions into the cartoonish renderings of summarized transactions, we are beginning to store the whole map, the pure bits from every sensor and touch point. And with the network and relationship data we are capturing now, corporate memories are beginning to look like the associative model of the human brain. The corporation isn’t becoming a person, but it is becoming more than a worm. (I realize that as of this writing the Supreme Court disagrees with my assessment.) It’s becoming intelligent.

The big data epoch will be one of a major transition. For the past 55 years the focus of information technology has been on wiring the worm for automation, efficiency, and productivity. Now I think we’ll see that shift to support of the very intelligence of the corporation.

Until now we measured projects mostly on the ROI inherent in their potential cost savings. But we’ll soon begin to think in terms of intelligentization—a made up word that means making something smarter. Our goal in business and IT will be the application of data and analytics to increasing corporate intelligence. Something like IQcorp = f(data, algorithms). That’s an altogether different framing goal for technology, and it will mean new ways of organizing and conceptualizing how it is funded and delivered.

How does the data we capture and the algorithms we develop increase the intelligence of our organization? Can we begin to think in terms of something like an IQ for our companies—a combination of its sensory perception, recall, reasoning, and ability to act? Will we go from return on investment to acquisition of intelligence? Regardless, we will be building companies that are smarter and faster-reacting than the humans that run them.

Of course, this isn’t the end of transactional IT. The corporation will have “vestigial IT” too just like the human brain still has regions remaining from our dispositional evolutionary past. After all, we still pull our hands away from a hot stove without thinking about it first, and companies will continue to automatically resupply empty shelves. But an intelligent corporation will be one with a seamlessly integrated post-dispositional reasoning mind wired for action. One that is more intelligent as a collection of people and as a set of systems than any member of its management, and one whose OODA loop often runs without human intervention.

Big Data is an epoch in the information age, and on the other side of this discontinuity in corporate evolution the companies you work for are going to be smarter.

Jim Stogdill

General Manager, Radar, O’Reilly Media

PREFACE

Big Data, Big Analytics is written for business managers and executives who want to understand more about “Big Data.” In researching this book, we realized that there were many texts about high-level strategy and some that went deep into the weeds with sample code. We have attempted to create a balance between the two, making the topic accessible through stories, metaphors, and analogies even though it’s a technical subject area.

We’ve started out the book defining Big Data and discussing why Big Data is important. We illustrate the value of Big Data through industry examples in Chapter 2 and then move into describing the enabling technology in Chapters 3 through 5. While we introduce the people working with Big Data earlier in the book, in Chapter 6 we dive deeper into the organization and the roles it takes to make Big Data successful in an organization. We wrap up the book with a thorough summary of the ethical and privacy issues surrounding Big Data in Chapter 7. Big Data, Big Analytics concludes with an entertaining lecture by Avinash Kaushik of Google.

We welcome feedback. If you have ideas on how we can make this book better, or topics you’d like covered in a new edition, we’d love to hear from you. Please visit us at www.BigDataBigAnalytics.com.

ACKNOWLEDGMENTS

We’d like to offer special thanks to our extended team that helped us along the way: Stokes Adams, Mike Barlow, Sheck Cho, Stacey Rivera, and Paula Thornton.

We’d like to acknowledge the people and their organizations that have made helpful contributions to this book.

Chuck Alvarez, Morgan Stanley
Tasso Argyros, Teradata
Amr Awadallah, Cloudera
Ravi Bandaru, Nokia
Mike Barlow, Cumulus Partners
Randall Beard, Nielsen
David Botkin, Playdom
Nate Burns, State University of New York at Buffalo
David Champagne, Revolution Analytics
Drew Conway, IA Ventures
Joe Cunningham, Visa
Yves de Montcheiul, Talend
Anthony Deighton, QlikTech
Deepinder Dhingra, Mu Sigma
Zubin Dowlaty, Mu Sigma
Shaun Doyle, Cognitive Box
Michael Driscoll, Dataspora
Edd Dumbill, O’Reilly
John Elder, Elder Research
Usama Fayyad, Blue Kangaroo
Financial Services Team, CapGemini
Elissa Fink, Tableau Software
Chris Gage, John Wiley & Sons
Misha Ghosh, MasterCard Worldwide
Anthony Goldbloom, Kaggle
James Golden, Accenture
Pat Hanrahan, Tableau Software
Colin Hill, GNS Healthcare
Ben Hosken, FLINKLABS
Curtis Hougland, Attention
Josh James, Domo
Jeff Jonas, IBM
Avinash Kaushik, Google
Paul Kent, SAS
Dan Kerzner, Microstrategy
James Kobelius, IBM
Jared Lander, JP Lander Consulting
Steve Lucas, SAP
Creve Maples, Event Horizon
Jojy Matthew, Capgemini
Abhishek Mehta, Tresata
John Meister, MasterCard Worldwide
Jake Porway, DataKind
Ori Peled, MasterCard Worldwide
Murali Ramanathan, State University of New York at Buffalo
Andrew Reiskind, MasterCard Worldwide
Partha Sen, Fuzzy Logix
Giovanni Seni, Intuit
Niv Singer, Tracx
David Smith, Revolution Analytics
Dan Springer, Responsys
Jim Stogdill, O’Reilly
Marcia Tal, Tal Consulting
Ian Thomson, Ocean Crusaders
Paula Thornton, Independent Writer
Jer Thorp, New York Times
Nathan Yau, Student at UCLA
Michael Zeitlin, Aqumin

Two men operating a mainframe computer, circa 1960. It’s amazing how today’s smartphone holds so much more data than this huge 1960s relic. (Photo by Pictorial Parade/Archive Photos)

Chapter 1

What Is Big Data and Why Is It Important?

Big Data is the next generation of data warehousing and business analytics and is poised to deliver top-line revenues cost-efficiently for enterprises. The greatest part about this phenomenon is the rapid pace of innovation and change; where we are today is not where we’ll be in just two years and definitely not where we’ll be in a decade.

Just think about all the great stories you will tell your grandchildren about the early days of the twenty-first century, when the Age of Big Data Analytics was in its infancy.

This new age didn’t suddenly emerge. It’s not an overnight phenomenon. It’s been coming for a while. It has many deep roots and many branches. In fact, most data industry veterans will tell you that Big Data has been around for decades at firms that have been handling tons of transactional data, dating back even to the mainframe era. The reasons for this new age are varied and complex, so let’s reduce them to a handful that will be easy to remember in case someone corners you at a cocktail party and demands a quick explanation of what’s really going on. Here’s our standard answer in three parts:

1. Computing perfect storm. Big Data analytics are the natural result of four major global trends: Moore’s Law (the observation that the number of transistors on a chip doubles roughly every two years, which in practice has made computing power steadily cheaper), mobile computing (that smartphone or mobile tablet in your hand), social networking (Facebook, Foursquare, Pinterest, etc.), and cloud computing (you don’t even have to own hardware or software anymore; you can rent or lease someone else’s).
2. Data perfect storm. Volumes of transactional data have been around for decades at most big firms, but the floodgates have now opened: data is arriving with greater volume, velocity, and variety (the three Vs) than ever before. This perfect storm of the three Vs makes data extremely complex and cumbersome to handle with current data management and analytics technology and practices.
3. Convergence perfect storm. Another perfect storm is happening, too. Traditional data management and analytics software and hardware technologies, open-source technology, and commodity hardware are merging to create new alternatives for IT and business executives to address Big Data analytics.
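
To put a number on the first trend: Moore’s Law is usually stated as a doubling of transistor counts about every two years. The short Python sketch below is only an illustration of that compounding, not a model of any real chip or price history; the function name and the two-year doubling period are assumptions made for the example.

```python
def moores_law_growth(years: float, doubling_period_years: float = 2.0) -> float:
    """Return the growth factor after `years`, assuming one doubling per period.

    The two-year default is the commonly quoted figure, used here as an assumption.
    """
    return 2.0 ** (years / doubling_period_years)


# Show how quickly repeated doubling compounds over one, five, and ten periods.
for years in (2, 10, 20):
    print(f"After {years:>2} years: about {moores_law_growth(years):,.0f}x")
```

Under that assumption, capacity grows 32-fold in a decade and roughly a thousand-fold in two, which is why storing and analyzing data that was once unaffordable has become routine.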

Let’s make one thing clear. For some industry veterans, “Big Data” isn’t new. There are companies that have dealt with billions of transactions for many years. For example, John Meister, group executive of Data Warehouse Technologies at MasterCard Worldwide, deals with a billion transactions on a strong holiday weekend. However, even the most seasoned IT veterans are awestruck by recent innovations that give their teams the ability to leverage new technology and approaches, which make it affordable to handle more data and to take advantage of the variety of data that lives outside the typical transactional world, such as unstructured data.

Paul Kent, vice president of Big Data at SAS, is an R&D professional who has developed big data crunching software for over two decades. At the SAS Global Forum 2012, Kent explained that the ability to store data in an affordable way has changed the game for his customers:

People are able to store that much data now and more than they ever [could] before. We have reached this tipping point where they don’t have to make decisions about which half to keep or how much history to keep. It’s now economically feasible to keep all of your history and all of your variables and go back later when you have a new question and start looking for an answer. That hadn’t been practical up until just recently. Certainly the advances in blade technology and the idea that Google brought to market of you take lots and lots of small Intel servers and you gang them together and use their potential in aggregate. That is the super computer of the future.

Let’s now introduce Misha Ghosh, who is known to be an innovator with several patents under his belt. Ghosh is currently an executive at MasterCard Advisors and before that he spent 11 years at Bank of America solving business issues by using data. Ghosh explains, “Aside from the changes in the actual hardware and software technology, there has also been a massive change in the actual evolution of data systems. I compare it to the stages of learning: dependent, independent, and interdependent.”

Using Misha’s analogy, let’s break down the three principal stages in the evolution of data systems:

Dependent (Early Days). Data systems were fairly new and users didn’t quite know what they wanted. IT’s attitude was “Build it and they shall come.”

Independent (Recent Years). Users understood what an analytical platform was and worked together with IT to define the business needs and approach for deriving insights for their firm.

Interdependent (Big Data Era). Companies interact with one another, creating more social collaboration beyond the firm’s walls.

Moving from independent (Recent Years) to interdependent (Big Data Era) is sort of like comparing Starbucks to a hip independent neighborhood coffee shop with wizard baristas who can tell you when the next local environmental advisory council meet-up is taking place. Both shops have similar basic product ingredients, but the independent neighborhood coffee shop provides an approach and atmosphere that caters to social collaboration within a given community. The customers share their artwork and tips about the best picks at Saturday’s farmers market as they stand by the giant corkboard with a sea of personal flyers with tear-off tabs . . . “Web Designer Available for Hire, 555-1302.”

One Big Data parallel to the coffee shop is the New York City data meet-ups with data scientists like Drew Conway, Jared Lander, and Jake Porway. These bright minds organize meet-ups after work at places like Columbia University and NYU to share their latest analytic applications [including a review of their actual code], followed by a trip to the local pub for a few pints and more data chatter. Their use cases are a blend of Big Data corporate applications and other applications that actually turn their data skills into a helping hand for humanity.

For example, during the day Jared Lander helps a large healthcare organization solve Big Data problems related to patient data. By night, he helps a disaster recovery organization with optimization analytics that direct the correct supplies to the areas where they are needed most. Does a village need bottled water or boats, rice or wheat, shelter or toilets? Follow-up surveys 6, 12, 18, and 24 months after the disaster help track the recovery and direct further relief efforts.

Another great example is Jake Porway, who decided to devote himself full time to using Big Data to help humanity at DataKind, the company he co-founded with Craig Barowsky and Drew Conway. From weekend events to long-term projects, DataKind supports a data-driven social sector through services, tools, and educational resources to help with the entire data pipeline.

In the service of humanity, they were able to secure funding from several corporations and foundations such as EMC, O’Reilly Media, Pop Tech, National Geographic, and the Alfred P. Sloan Foundation. Porway described DataKind to us as a group of data superheroes:

I love superheroes, because they’re ordinary people who find themselves with extraordinary powers that they use to make the world a better place. As data and technology become more ubiquitous and the need for insights more pressing, ordinary data scientists are finding themselves with extraordinary powers. The world is changing and those who are stepping up to use data for the greater good have a real opportunity to change it for the better.

In summary, the Big Data world is being fueled by an abundance mentality: a rising tide lifts all boats. This new mentality is sustained by a gigantic global corkboard that includes data scientists, crowdsourcing, and open source methodologies.

A Flood of Mythic “Start-Up” Proportions

Thanks to the three converging “perfect storms,” those trends discussed in the previous section, the global economy now generates unprecedented quantities of data. People who compare the amount of data produced daily to a deluge of mythic proportions are entirely correct. This flood of data represents something we’ve never seen before. It’s new, it’s powerful, and yes, it’s scary but extremely exciting.

The best way to predict the future is to create it!

—Peter F. Drucker

The influential writer and management consultant Drucker reminds us that the future is up to us to create. This is something that every entrepreneur takes to heart as they evangelize their start-up’s big idea that they know will impact the world! This is also true with Big Data and the new technology and approaches that have arrived at our doorstep.

Over the past decade companies like Facebook, Google, LinkedIn, and eBay have created treasured firms that rely on the skills of new data scientists, who are breaking the traditional barriers by leveraging new technology and approaches to capture and analyze data that drives their business. Time is flying and we have to remember that these firms were once start-ups. In fact, most of today’s start-ups are applying similar Big Data methods and technologies while they’re growing their businesses. The question is how.

This is why it is critical that organizations have a mechanism to change with the times and not get caught up appeasing the ghosts of data warehousing and business intelligence (BI) analytics past! That said, legacy data warehousing and BI analytics are not going away anytime soon. It’s all about finding the right home for the new approaches and making them work for you!

According to a recent study by the McKinsey Global Institute, organizations capture trillions of bytes of information about their customers, suppliers, and operations through digital systems. Millions of networked sensors embedded in mobile phones, automobiles, and other products are continually sensing, creating, and communicating data. The result is a 40 percent projected annual growth in the volume of data generated. As the study notes, 15 out of 17 sectors in the U.S. economy already “have more data stored per company than the U.S. Library of Congress.”1 The Library of Congress itself has collected more than 235 terabytes of data. That’s Big Data.
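To get a feel for what a 40 percent annual growth rate implies, the arithmetic can be sketched in a few lines of Python. This is only a rough illustration: it assumes the rate stays constant every year (the study projects a rate, it does not promise one), it starts from a baseline of 1.0 in arbitrary units, and the function names are ours, not the study’s.

```python
import math


def projected_volume(initial, annual_growth, years):
    """Compound a starting data volume forward at a fixed annual growth rate."""
    return initial * (1 + annual_growth) ** years


def doubling_time(annual_growth):
    """Years needed for the volume to double at a fixed annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


# Assumption: a constant 40 percent annual growth rate, baseline of 1.0 unit.
GROWTH = 0.40

for years in (1, 2, 5, 10):
    factor = projected_volume(1.0, GROWTH, years)
    print(f"After {years:2d} years: {factor:.2f}x the starting volume")

print(f"Doubling time: about {doubling_time(GROWTH):.1f} years")
```

Under that assumption the volume roughly doubles every two years and grows almost 29-fold in a decade, which is why storage and analytics approaches sized for today’s data can fall behind so quickly.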

Big Data Is More Than Merely Big

What makes Big Data different from “regular” data? It really all depends on when you ask the question.

Edd Dumbill, founding chair of O’Reilly’s Strata Conference and chair of the O’Reilly Open Source Convention, defines Big Data as “data that becomes large enough that it cannot be processed using conventional methods.”

Here is how the McKinsey study defines Big Data:

Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective. . . . We assume that, as technology advances over time, the size of datasets that qualify as big data will also increase. Also note that the definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those caveats, big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes).2

Big Data isn’t just a description of raw volume. “The real issue is usability,” according to industry-renowned blogger David Smith. From his perspective, big datasets aren’t even the problem. The real challenge is identifying or developing the most cost-effective and reliable methods for extracting value from all the terabytes and petabytes of data now available. That’s where Big Data analytics become necessary.

Comparing traditional analytics to Big Data analytics is like comparing a horse-drawn cart to a tractor–trailer rig. The differences in speed, scale, and complexity are tremendous.

Why Now?

On some level, we all understand that history has no narrative and no particular direction. But that doesn’t stop us from inventing narratives and writing timelines complete with “important milestones.” Keeping those thoughts in mind, Figure 1.1 shows a timeline of recent technology developments.

Figure 1.1 Timeline of Recent Technology Developments

If you believe that it’s possible to learn from past mistakes, then one mistake we certainly do not want to repeat is investing in new technologies that don’t fit into existing business frameworks. During the customer relationship management (CRM) era of the 1990s, many companies made substantial investments in customer-facing technologies that subsequently failed to deliver expected value. The reason for most of those failures was fairly straightforward: Management either forgot or just didn’t know that big projects require a synchronized transformation of people, process, and technology. All three must be marching in step or the project is doomed.

We can avoid those kinds of mistakes if we keep our attention focused on the outcomes we want to achieve. The technology of Big Data is the easy part; the hard part is figuring out what you are going to do with the output generated by your Big Data analytics. As the old maxim goes, “Action is character.” It’s what you do that counts. Putting it bluntly, make sure that you have the people and process pieces ready before you commit to buying the technology.

A Convergence of Key Trends

Our friend, Steve Lucas, is the Global Executive Vice President and General Manager, SAP Database & Technology at SAP. He’s an experienced player in the Big Data analytics space, and we’re delighted that he agreed to share some of his insights with us. First of all, according to Lucas, it’s important to remember that big companies have been collecting and storing large amounts of data for a long time. From his perspective, the difference between “Old Big Data” and “New Big Data” is accessibility. Here’s a brief summary of our interview: