Big Data Analytics - Frank J. Ohlhorst - E-Book

Description

Unique insights to implement big data analytics and reap big returns to your bottom line

Focusing on the business and financial value of big data analytics, respected technology journalist Frank J. Ohlhorst shares his insights on the newly emerging field of big data analytics in Big Data Analytics. This breakthrough book demonstrates the importance of analytics, defines the processes, highlights the tangible and intangible values and discusses how you can turn a business liability into actionable material that can be used to redefine markets, improve profits and identify new business opportunities.

* Reveals big data analytics as the next wave for businesses looking for competitive advantage
* Takes an in-depth look at the financial value of big data analytics
* Offers tools and best practices for working with big data

Once the domain of large on-line retailers such as eBay and Amazon, big data is now accessible by businesses of all sizes and across industries. From how to mine the data your company collects, to the data that is available on the outside, Big Data Analytics shows how you can leverage big data into a key component in your business's growth strategy.


Page count: 212

Publication year: 2012




Contents

Preface

Acknowledgments

Chapter 1: What is Big Data?

The Arrival of Analytics

Where is the Value?

More to Big Data Than Meets the Eye

Dealing with the Nuances of Big Data

An Open Source Brings Forth Tools

Caution: Obstacles Ahead

Chapter 2: Why Big Data Matters

Big Data Reaches Deep

Obstacles Remain

Data Continue to Evolve

Data and Data Analysis are Getting More Complex

The Future is Now

Chapter 3: Big Data and the Business Case

Realizing Value

The Case for Big Data

The Rise of Big Data Options

Beyond Hadoop

With Choice Come Decisions

Chapter 4: Building the Big Data Team

The Data Scientist

The Team Challenge

Different Teams, Different Goals

Don’t Forget the Data

Challenges Remain

Teams versus Culture

Gauging Success

Chapter 5: Big Data Sources

Hunting for Data

Setting the Goal

Big Data Sources Growing

Diving Deeper into Big Data Sources

A Wealth of Public Information

Getting Started with Big Data Acquisition

Ongoing Growth, No End in Sight

Chapter 6: The Nuts and Bolts of Big Data

The Storage Dilemma

Building a Platform

Bringing Structure to Unstructured Data

Processing Power

Choosing among In-house, Outsourced, or Hybrid Approaches

Chapter 7: Security, Compliance, Auditing, and Protection

Pragmatic Steps to Securing Big Data

Classifying Data

Protecting Big Data Analytics

Big Data and Compliance

The Intellectual Property Challenge

Chapter 8: The Evolution of Big Data

Big Data: The Modern Era

Today, Tomorrow, and the Next Day

Changing Algorithms

Chapter 9: Best Practices for Big Data Analytics

Start Small with Big Data

Thinking Big

Avoiding Worst Practices

Baby Steps

The Value of Anomalies

Expediency versus Accuracy

In-Memory Processing

Chapter 10: Bringing it All Together

The Path to Big Data

The Realities of Thinking Big Data

Hands-on Big Data

The Big Data Pipeline in Depth

Big Data Visualization

Big Data Privacy

Appendix: Supporting Data

“The MapR Distribution for Apache Hadoop”

“High Availability: No Single Points of Failure”

About the Author

Index

WILEY & SAS BUSINESS SERIES

The Wiley & SAS Business Series presents books that help senior-level managers with their critical management decisions.

Titles in the Wiley and SAS Business Series include:

Activity-Based Management for Financial Institutions: Driving Bottom-Line Results by Brent Bahnub
Advanced Business Analytics: Creating Business Value from Your Data by Jean Paul Isson and Jesse Harriott
Branded! How Retailers Engage Consumers with Social Media and Mobility by Bernie Brennan and Lori Schafer
Business Analytics for Customer Intelligence by Gert Laursen
Business Analytics for Managers: Taking Business Intelligence beyond Reporting by Gert Laursen and Jesper Thorlund
The Business Forecasting Deal: Exposing Bad Practices and Providing Practical Solutions by Michael Gilliland
Business Intelligence Success Factors: Tools for Aligning Your Business in the Global Economy by Olivia Parr Rud
CIO Best Practices: Enabling Strategic Value with Information Technology, Second Edition by Joe Stenzel
Connecting Organizational Silos: Taking Knowledge Flow Management to the Next Level with Social Media by Frank Leistner
Credit Risk Assessment: The New Lending System for Borrowers, Lenders, and Investors by Clark Abrahams and Mingyuan Zhang
Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring by Naeem Siddiqi
The Data Asset: How Smart Companies Govern Their Data for Business Success by Tony Fisher
Demand-Driven Forecasting: A Structured Approach to Forecasting by Charles Chase
Executive’s Guide to Solvency II by David Buckham, Jason Wahl, and Stuart Rose
The Executive’s Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business by David Thomas and Mike Barlow
Fair Lending Compliance: Intelligence and Implications for Credit Risk Management by Clark R. Abrahams and Mingyuan Zhang
Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fundamental Concepts and Practical Applications by Robert Rowan
Human Capital Analytics: How to Harness the Potential of Your Organization’s Greatest Asset by Gene Pease, Boyce Byerly, and Jac Fitz-enz
Information Revolution: Using the Information Evolution Model to Grow Your Business by Jim Davis, Gloria J. Miller, and Allan Russell
Manufacturing Best Practices: Optimizing Productivity and Product Quality by Bobby Hull
Marketing Automation: Practical Steps to More Effective Direct Marketing by Jeff LeSueur
Mastering Organizational Knowledge Flow: How to Make Knowledge Sharing Work by Frank Leistner
The New Know: Innovation Powered by Analytics by Thornton May
Performance Management: Integrating Strategy Execution, Methodologies, Risk, and Analytics by Gary Cokins
Retail Analytics: The Secret Weapon by Emmett Cox
Social Network Analysis in Telecommunications by Carlos Andre Reis Pinheiro
Statistical Thinking: Improving Business Performance, Second Edition by Roger W. Hoerl and Ronald D. Snee
Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics by Bill Franks
The Value of Business Analytics: Identifying the Path to Profitability by Evan Stubbs
Visual Six Sigma: Making Data Analysis Lean by Ian Cox, Marie A. Gaudard, Philip J. Ramsey, Mia L. Stephens, and Leo Wright
For more information on any of the above titles, please visit www.wiley.com.

Cover image: @liangpv/iStockphoto

Cover design: Michael Rutkowski

Copyright © 2013 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Ohlhorst, Frank, 1964–

Big data analytics : turning big data into big money / Frank Ohlhorst.

p. cm. — (Wiley & SAS business series)

Includes index.

ISBN 978-1-118-14759-7 (cloth) — ISBN 978-1-118-22582-0 (ePDF) — ISBN 978-1-118-26380-8 (Mobi) — ISBN 978-1-118-23904-9 (ePub)

1. Business intelligence. 2. Data mining. I. Title.

HD38.7.O36 2013

658.4'72—dc23

2012030191

Preface

What are data? This seems like a simple enough question; however, depending on the interpretation, the definition of data can be anything from “something recorded” to “everything under the sun.” Data can be summed up as everything that is experienced, whether it is a machine recording information from sensors, an individual taking pictures, or a cosmic event recorded by a scientist. In other words, everything is data. However, recording and preserving that data has always been the challenge, and technology has limited the ability to capture and preserve data.

The human brain’s memory storage capacity is supposed to be around 2.5 petabytes (or 2.5 million gigabytes). Think of it this way: If your brain worked like a digital video recorder in a television, 2.5 petabytes would be enough to hold 3 million hours of TV shows. You would have to leave the TV running continuously for more than 300 years to use up all of that storage space. The available technology for storing data pales in comparison, creating a technology segment called Big Data that is growing exponentially.
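The figures in this comparison can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 gigabyte = 10^9 bytes, 1 petabyte = 10^15 bytes) and a 365-day year; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the storage comparison (decimal units assumed).
BYTES_PER_GB = 10**9
BYTES_PER_PB = 10**15

brain_bytes = 2.5 * BYTES_PER_PB   # estimated capacity quoted in the text
tv_hours = 3_000_000               # hours of TV quoted in the text

gigabytes = brain_bytes / BYTES_PER_GB   # the same capacity expressed in gigabytes
years = tv_hours / (24 * 365)            # continuous playback, ignoring leap days
gb_per_hour = gigabytes / tv_hours       # recording rate implied by the comparison

print(f"{gigabytes:,.0f} GB")
print(f"{years:.0f} years of continuous viewing")
print(f"{gb_per_hour:.2f} GB per recorded hour")
```

Under these assumptions, 2.5 petabytes is 2.5 million gigabytes, 3 million hours is roughly 342 years, and the comparison implies a little under 1 gigabyte per recorded hour.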

Today, businesses are recording more and more information, and that information (or data) is growing, consuming more and more storage space and becoming harder to manage, thus creating Big Data. The reasons for recording such massive amounts of information vary. Sometimes the reason is adherence to compliance regulations, at other times it is the need to preserve transactions, and in many cases it is simply part of a backup strategy.

Nevertheless, it costs time and money to save data, even if it’s only for posterity. Therein lies the biggest challenge: How can businesses continue to afford to save massive amounts of data? Fortunately, those who have come up with the technologies to mitigate these storage concerns have also come up with a way to derive value from what many see as a burden. It is a process called Big Data analytics.

The concepts behind Big Data analytics are actually nothing new. Businesses have been using business intelligence tools for many decades, and scientists have been studying data sets to uncover the secrets of the universe for many years. However, the scale of data collection is changing, and the more data you have available, the more information you can extrapolate from them.

The challenge today is to find the value of the data and to explore data sources in more interesting and applicable ways to develop intelligence that can drive decisions, find relationships, solve problems, and increase profits, productivity, and even the quality of life.

The key is to think big, and that means Big Data analytics.

This book will explore the concepts behind Big Data, how to analyze that data, and the payoff from interpreting the analyzed data.

Chapter 1 deals with the origins of Big Data analytics, explores the evolution of the associated technology, and explains the basic concepts behind deriving value.
Chapter 2 delves into the different types of data sources and explains why those sources are important to businesses that are seeking to find value in data sets.
Chapter 3 helps those who are looking to leverage data analytics to build a business case to spur investment in the technologies and to develop the skill sets needed to successfully extract intelligence and value out of data sets.
Chapter 4 brings the concepts of the analytics team together, describes the necessary skill sets, and explains how to integrate Big Data into a corporate culture.
Chapter 5 assists in the hunt for data sources to feed Big Data analytics, covers the various public and private sources for data, and identifies the different types of data usable for analytics.
Chapter 6 deals with storage, processing power, and platforms by describing the elements that make up a Big Data analytics system.
Chapter 7 describes the importance of security, compliance, and auditing—the tools and techniques that keep large data sources secure yet available for analytics.
Chapter 8 delves into the evolution of Big Data and discusses the short-term and long-term changes that will materialize as Big Data evolves and is adopted by more and more organizations.
Chapter 9 discusses best practices for data analysis, covers some of the key concepts that make Big Data analytics easier to deliver, and warns of the potential pitfalls and how to avoid them.
Chapter 10 explores the concept of the data pipeline and how Big Data moves through the analysis process and is then transformed into usable information that delivers value.

Sometimes the best information on a particular technology comes from those who are promoting that technology for profit and growth, hence the birth of the white paper. White papers are meant to educate and inform potential customers about a particular technology segment while gently goading those potential customers toward the vendor’s product.

That said, it is always best to take white papers with a grain of salt. Nevertheless, white papers prove to be an excellent source for researching technology and have significant educational value. With that in mind, I have included the following white papers in the appendix of this book, and each offers additional knowledge for those who are looking to leverage Big Data solutions: “The MapR Distribution for Apache Hadoop” and “High Availability: No Single Points of Failure,” both from MapR Technologies.

Acknowledgments

Take it from me, writing a book takes time, patience, and motivation in equal measures. At times the challenges can be overwhelming, and it becomes very easy to lose focus. However, analytics, patterns, and uncovering the hidden meaning behind data have always attracted me. When one considers the possibilities offered by comprehensive analytics and the inclusion of what may seem to be unrelated data sets, the effort involved seems almost inconsequential.

The idea for this book came from a brief conversation with John Wiley & Sons editor Timothy Burgard, who contacted me out of the blue with a proposition to build on some articles I had written on Big Data. Tim explained that comprehensive information that could be consumed by C-level executives and those entering the data analytics arena was sorely lacking, and he thought that I was up to the challenge of creating that information. So it was with Tim’s encouragement that I started down the path to create a book on Big Data.

I would be remiss if I didn’t mention the excellent advice and additional motivation that I received from John Wiley & Sons development editor Stacey Rivera, who was faced with the challenge of keeping me on track and moving me along in the process—a chore that I would not wish on anyone!

Putting together a book like this is a long journey that introduced me to many experts, mentors, and acquaintances who helped me to shape my ideology on how large data sets can be brought together for processing to uncover trends and other valuable bits of information.

I also have to acknowledge the many vendors in the Big Data arena who inadvertently helped me along my journey to expose the value contained in data. Those vendors, who number in the dozens, have made concentrated efforts to educate the public about the value behind Big Data, and the events they have sponsored as well as the information they have disseminated have helped to further define the market and give rise to conversations that encouraged me to pursue my ultimate goal of writing a book.

Writing takes a great deal of energy and can quickly consume all of the hours in a day. With that in mind, I have to thank the numerous editors whom I have worked with on freelance projects while concurrently writing this book. Without their understanding and flexibility, I could never have written this book, or any other. Special thanks go out to Mike Vizard, Ed Scannell, Mike Fratto, Mark Fontecchio, James Allen Miller, and Cameron Sturdevant.

When it comes to providing the ultimate in encouragement and support, no one can compare with my wife, Carol, who understood the toll that writing a book would take on family time and was still willing to provide me with whatever I needed to successfully complete this book. I also have to thank my children, Connor, Tyler, Sarah, and Katelyn, for understanding that Daddy had to work and was not always available. I am very thankful to have such a wonderful and supportive family.

Chapter 1

What Is Big Data?

What exactly is Big Data? At first glance, the term seems rather vague, referring to something that is large and full of information. That description does indeed fit the bill, yet it provides no information on what Big Data really is.

Big Data is often described as extremely large data sets that have grown beyond the ability to manage and analyze them with traditional data processing tools. Searching the Web for clues reveals an almost universal definition, shared by the majority of those promoting the ideology of Big Data, that can be condensed into something like this: Big Data defines a situation in which data sets have grown to such enormous sizes that conventional information technologies can no longer effectively handle either the size of the data set or the scale and growth of the data set. In other words, the data set has grown so large that it is difficult to manage and even harder to garner value out of it. The primary difficulties are the acquisition, storage, searching, sharing, analytics, and visualization of data.

There is much more to be said about what Big Data actually is. The concept has evolved to include not only the size of the data set but also the processes involved in leveraging the data. Big Data has even become synonymous with other business concepts, such as business intelligence, analytics, and data mining.

Paradoxically, Big Data is not that new. Although massive data sets have been created in just the last two years, Big Data has its roots in the scientific and medical communities, where the complex analysis of massive amounts of data has been done for drug development, physics modeling, and other forms of research, all of which involve large data sets. Yet it is these very roots of the concept that have changed what Big Data has come to be.

THE ARRIVAL OF ANALYTICS

As analytics and research were applied to large data sets, scientists came to the conclusion that more is better—in this case, more data, more analysis, and more results. Researchers started to incorporate related data sets, unstructured data, archival data, and real-time data into the process, which in turn gave birth to what we now call Big Data.

In the business world, Big Data is all about opportunity. According to IBM, every day we create 2.5 quintillion (2.5 × 10¹⁸) bytes of data, so much that 90 percent of the data in the world today has been created in the last two years. These data come from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos posted online, transaction records of online purchases, and cell phone GPS signals, to name just a few. That is the catalyst for Big Data, along with the more important fact that all of these data have intrinsic value that can be extrapolated using analytics, algorithms, and other techniques.
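The two IBM figures quoted above can be turned into rough rates. The sketch below assumes decimal units, that no data are ever deleted, and that growth is steady over the two-year window. These are simplifying assumptions made here for illustration, not claims attributed to IBM.

```python
import math

# Daily volume quoted in the text: 2.5 quintillion bytes.
bytes_per_day = 2.5e18
exabytes_per_day = bytes_per_day / 1e18
terabytes_per_second = bytes_per_day / 86_400 / 1e12

# "90 percent created in the last two years" means the older data
# make up the remaining 10 percent of today's total.
share_created_recently = 0.90
window_years = 2
growth_over_window = 1 / (1 - share_created_recently)

# Assumption: steady compound growth across the window.
annual_growth = growth_over_window ** (1 / window_years)
doubling_months = 12 * window_years * math.log(2) / math.log(growth_over_window)

print(f"{exabytes_per_day:.1f} EB per day")
print(f"{terabytes_per_second:.1f} TB per second")
print(f"{growth_over_window:.0f}x over {window_years} years")
print(f"{annual_growth:.2f}x per year")
print(f"doubling roughly every {doubling_months:.1f} months")
```

Under those assumptions, the quoted figures correspond to about 2.5 exabytes per day (roughly 29 terabytes per second), a tenfold increase in two years, and a doubling time of a little over seven months.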

Big Data has already proved its importance and value in several areas. Organizations such as the National Oceanic and Atmospheric Administration (NOAA), the National Aeronautics and Space Administration (NASA), several pharmaceutical companies, and numerous energy companies have amassed huge amounts of data and now leverage Big Data technologies on a daily basis to extract value from them.

NOAA uses Big Data approaches to aid in climate, ecosystem, weather, and commercial research, while NASA uses Big Data for aeronautical and other research. Pharmaceutical companies and energy companies have leveraged Big Data for more tangible results, such as drug testing and geophysical analysis. The New York Times has used Big Data tools for text analysis and Web mining, while the Walt Disney Company uses them to correlate and understand customer behavior in all of its stores, theme parks, and Web properties.

Big Data plays another role in today’s businesses: Large organizations increasingly face the need to maintain massive amounts of structured and unstructured data (from transaction information in data warehouses to employee tweets, from supplier records to regulatory filings) to comply with government regulations. That need has been driven even more by recent court cases that have encouraged companies to keep large quantities of documents, e-mail messages, and other electronic communications, such as instant messaging and Internet Protocol (IP) telephony, that may be required for e-discovery if they face litigation.

WHERE IS THE VALUE?

Extracting value is much more easily said than done. Big Data is full of challenges, ranging from the technical to the conceptual to the operational, any of which can derail the ability to discover value and leverage what Big Data is all about.

Perhaps it is best to think of Big Data in multidimensional terms, in which four dimensions relate to the primary aspects of Big Data. These dimensions can be defined as follows:

1. Volume. Big Data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
2. Variety. Big Data extends beyond structured data to include unstructured data of all varieties: text, audio, video, click streams, log files, and more.
3. Veracity. The massive amounts of data collected for Big Data purposes can lead to statistical errors and misinterpretation of the collected information. Purity of the information is critical for value.
4. Velocity. Often time sensitive, Big Data must be used as it is streaming into the enterprise in order to maximize its value to the business, but it must also remain available from archival sources.

These 4Vs of Big Data lay out the path to analytics, with each having intrinsic value in the process of discovering value. Nevertheless, the complexity of Big Data does not end with just four dimensions. There are other factors at work as well: the processes that Big Data drives. These processes are a conglomeration of technologies and analytics that are used to define the value of data sources, which translates to actionable elements that move businesses forward.

Many of those technologies or concepts are not new but have come to fall under the umbrella of Big Data. Best defined as analysis categories, these technologies and concepts include the following:

Traditional business intelligence (BI).

This consists of a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data. BI delivers actionable information, which helps enterprise users make better business decisions using fact-based support systems. BI works by using an in-depth analysis of detailed business data, provided by databases, application data, and other tangible data sources. In some circles, BI can provide historical, current, and predictive views of business operations.

Data mining.

This is a process in which data are analyzed from different perspectives and then turned into summary data that are deemed useful. Data mining is normally used with data at rest or with archival data. Data mining techniques focus on modeling and knowledge discovery for predictive, rather than purely descriptive, purposes—an ideal process for uncovering new patterns from large data sets.

Statistical applications.

These look at data using algorithms based on statistical principles and normally concentrate on data sets related to polls, census, and other static data sets. Statistical applications ideally deliver sample observations that can be used to study populated data sets for the purpose of estimating, testing, and predictive analysis. Empirical data, such as surveys and experimental reporting, are the primary sources for analyzable information.

Predictive analysis.

This is a subset of statistical applications in which data sets are examined to come up with predictions, based on trends and information gleaned from databases. Predictive analysis tends to be big in the financial and scientific worlds, where trending tends to drive predictions, once external elements are added to the data set. One of the main goals of predictive analysis is to identify the risks and opportunities for business processes, markets, and manufacturing.

Data modeling.

This is a conceptual application of analytics in which multiple “what-if” scenarios can be applied via algorithms to multiple data sets. Ideally, the modeled information changes based on the information made available to the algorithms, which then provide insight to the effects of the change on the data sets. Data modeling works hand in hand with data visualization, in which uncovering information can help with a particular business endeavor.
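A few of the categories above can be made concrete with a toy example. The sketch below uses made-up monthly sales figures (every name and number is hypothetical) to show a descriptive summary in the spirit of BI, a least-squares trend line as a simple stand-in for predictive analysis, and a "what-if" scenario of the kind data modeling applies. Real tools in each category are far more sophisticated; this only illustrates how the categories differ.

```python
from statistics import mean

# Hypothetical monthly unit sales; the numbers are invented for illustration.
monthly_sales = [100, 110, 120, 130, 140, 150]
months = list(range(1, len(monthly_sales) + 1))


def summarize(values):
    """Descriptive view (BI-style): what has happened so far."""
    return {"total": sum(values), "average": mean(values), "best": max(values)}


def fit_trend(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def forecast(xs, ys, future_x):
    """Predictive view: extend the fitted trend to a future month."""
    slope, intercept = fit_trend(xs, ys)
    return slope * future_x + intercept


def what_if(values, price, price_change, elasticity):
    """Data-modeling view: revenue under an assumed price change.

    `elasticity` is an assumed fractional change in units sold per
    unit fractional change in price (negative means demand falls).
    """
    unit_factor = 1 + elasticity * price_change
    return sum(values) * unit_factor * price * (1 + price_change)


summary = summarize(monthly_sales)
next_month = forecast(months, monthly_sales, 7)
baseline = what_if(monthly_sales, price=10.0, price_change=0.0, elasticity=-0.5)
scenario = what_if(monthly_sales, price=10.0, price_change=0.10, elasticity=-0.5)

print(summary)
print(f"forecast for month 7: {next_month:.1f}")
print(f"revenue baseline: {baseline:.2f}, with 10% price rise: {scenario:.2f}")
```

The same data set feeds all three functions; what changes is the question being asked: what happened, what is likely to happen next, and what would happen under a different assumption.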

The preceding analysis categories constitute only a portion of where Big Data is headed and why it has intrinsic value to business. That value is driven by the never-ending quest for a competitive advantage, encouraging organizations to turn to large repositories of corporate and external data to uncover trends, statistics, and other actionable information to help them decide on their next move. This has helped the concept of Big Data to gain popularity with technologists and executives alike, along with its associated tools, platforms, and analytics.

MORE TO BIG DATA THAN MEETS THE EYE

The volume and overall size of the data set is only one portion of the Big Data equation. There is a growing consensus that both semistructured and unstructured data sources contain business-critical information and must therefore be made accessible for both BI and operational needs. It is also clear that the amount of relevant unstructured business data is not only growing but will continue to grow for the foreseeable future.

Data can be classified under several categories: structured data, semistructured data, and unstructured data. Structured data are normally found in traditional databases (SQL or others) where data are organized into tables based on defined business rules. Structured data usually prove to be the easiest type of data to work with, simply because the data are defined and indexed, making access and filtering easier.
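As a minimal sketch of why structured data are easy to work with, the example below uses Python's built-in sqlite3 module with a hypothetical orders table (the table, column names, and rows are invented for illustration). The schema defines each field up front, a constraint stands in for a business rule, and an index on one column supports direct filtering.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")

# The schema fixes the structure: every row has the same typed fields,
# and the CHECK constraint encodes a simple (assumed) business rule.
conn.execute(
    """
    CREATE TABLE orders (
        order_id  INTEGER PRIMARY KEY,
        customer  TEXT NOT NULL,
        region    TEXT NOT NULL,
        amount    REAL NOT NULL CHECK (amount >= 0)
    )
    """
)

# An index on the column used for filtering.
conn.execute("CREATE INDEX idx_orders_region ON orders (region)")

rows = [
    (1, "Acme", "East", 250.0),
    (2, "Globex", "West", 990.5),
    (3, "Initech", "East", 75.25),
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

# Because the fields are defined, filtering and aggregation are one-line queries.
east_orders = conn.execute(
    "SELECT customer, amount FROM orders WHERE region = ? ORDER BY amount DESC",
    ("East",),
).fetchall()
east_total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE region = ?", ("East",)
).fetchone()[0]

print(east_orders)
print(east_total)
conn.close()
```

Nothing comparable is available out of the box for a folder of e-mail messages or video files, which is the practical difference between structured data and the other categories.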