Big Data Visualization

James D. Miller

Description

Gain valuable insight into big data analytics with this book. Covering the tools you need to analyse data, together with IBM certified expert James Miller's insight, this book is the key to data visualization success.

Learn the tools & techniques to process big data for efficient data visualization
Packed with insightful real-world use cases
Addresses the difficulties faced by professionals in the field of big data analytics




Table of Contents

Big Data Visualization
Credits
About the Author
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Introduction to Big Data Visualization
An explanation of data visualization
Conventional data visualization concepts
Training options
Challenges of big data visualization
Big data
Using Excel to gauge your data
Pushing big data higher
The 3Vs
Volume
Velocity
Variety
Categorization
Such are the 3Vs
Data quality
Dealing with outliers
Meaningful displays
Adding a fourth V
Visualization philosophies
More on variety
Velocity
Volume
All is not lost
Approaches to big data visualization
Access, speed, and storage
Entering Hadoop
Context
Quality
Displaying results
Not a new concept
Instant gratifications
Data-driven documents
Dashboards
Outliers
Investigation and adjudication
Operational intelligence
Summary
2. Access, Speed, and Storage with Hadoop
About Hadoop
What else but Hadoop?
IBM too!
Log files and Excel
An R scripting example
Points to consider
Hadoop and big data
Entering Hadoop
AWS for Hadoop projects
Example 1
Defining the environment
Getting started
Uploading the data
Manipulating the data
A specific example
Conclusion
Example 2
Sorting
Parsing the IP
Summary
3. Understanding Your Data Using R
Definitions and explanations
Comparisons
Contrasts
Tendencies
Dispersion
Adding context
About R
R and big data
Example 1
Digging in with R
Example 2
Definitions and explanations
No looping
Comparisons
Contrasts
Tendencies
Dispersion
Summary
4. Addressing Big Data Quality
Data quality categorized
DataManager
DataManager and big data
Some examples
Some reformatting
A little setup
Selecting nodes
Connecting the nodes
The work node
Adding the script code
Executing the scene
Other data quality exercises
What else is missing?
Status and relevance
Naming your nodes
More examples
Consistency
Reliability
Appropriateness
Accessibility
Other Output nodes
Summary
5. Displaying Results Using D3
About D3
D3 and big data
Some basic examples
Getting started with D3
A little down time
Visual transitions
Multiple donuts
More examples
Another twist on bar chart visualizations
One more example
Adopting the sample
Summary
6. Dashboards for Big Data - Tableau
About Tableau
Tableau and big data
Example 1 - Sales transactions
Adding more context
Wrangling the data
Moving on
A Tableau dashboard
Saving the workbook
Presenting our work
More tools
Example 2
What's the goal? - purpose and audience
Sales and spend
Sales v Spend and Spend as % of Sales Trend
Tables and indicators
All together now
Summary
7. Dealing with Outliers Using Python
About Python
Python and big data
Outliers
Options for outliers
Delete
Transform
Outliers identified
Some basic examples
Testing slot machines for profitability
Into the outliers
Handling excessive values
Establishing the value
Big data note
Setting outliers
Removing Specific Records
Redundancy and risk
Another point
If Type
Reused
Changing specific values
Setting the Age
Another note
Dropping fields entirely
More to drop
More examples
A themed population
A focused philosophy
Summary
8. Big Data Operational Intelligence with Splunk
About Splunk
Splunk and big data
Splunk visualization - real-time log analysis
IBM Cognos
Pointing Splunk
Setting rows and columns
Finishing with errors
Splunk and processing errors
Splunk visualization - deeper into the logs
New fields
Editing the dashboard
More about dashboards
Summary

Big Data Visualization

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2017

Production reference: 1230217

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham 

B3 2PB, UK.

ISBN 978-1-78528-194-5

www.packtpub.com

Credits

Author

James D. Miller

Copy Editor

Laxmi Subramanian

Reviewers

Dave Wentzel

Project Coordinator

Shweta H Birwatkar 

Commissioning Editor

Veena Pagare

Proofreader

Safis Editing

Acquisition Editor

Tushar Gupta

Indexer

Aishwarya Gangawane 

Content Development Editor

Sumeet Sawant

Graphics

Tania Dutta

Technical Editor

Sneha Hanchate

Production Coordinator

Arvindkumar Gupta

About the Author

James D. Miller is an IBM certified expert, creative innovator, and accomplished Director, Sr. Project Leader, and Application/System Architect with more than 35 years of extensive applications, system design, and development experience across multiple platforms and technologies.

His experiences and specialties include introducing customers to new and sometimes disruptive technologies and platforms; integration with IBM Watson Analytics; cloud migrations; Cognos BI, TM1, and web architecture design; systems analysis; GUI design and testing; data and database modeling; and the design and development of OLAP, client/server, web, and mainframe applications and systems utilizing IBM Watson Analytics, IBM Cognos BI and TM1 (TM1 rules, TI, TM1Web, and Planning Manager), Cognos Framework Manager, dynaSight/ArcPlan, ASP, DHTML, XML, IIS, MS Visual Basic and VBA, Visual Studio, Perl, Splunk, WebSuite, MS SQL Server, Oracle, Sybase Server, and more.

His responsibilities have also included all aspects of Windows and SQL solution development and design, including analysis; GUI (and Web site) design; data modeling; table, screen/form and script development; SQL (and remote stored procedures and triggers) development/testing; test preparation; and the management and training of programming staff. His other experience includes the development of ETL infrastructure such as data transfer automation between mainframe (DB2, Lawson, Great Plains, and so on) systems and client/server SQL server and web-based applications and integration of enterprise applications and data sources.

Mr. James D. Miller has acted as an Internet Applications Development manager responsible for the design, development, QA, and delivery of multiple websites, including online trading applications, warehouse process control, scheduling systems, and administrative and control applications. He was also responsible for the design, development, and administration of a web-based financial reporting system for a 450-million-dollar organization, reporting directly to the CFO and his executive team.

Mr. Miller has also been responsible for managing and directing multiple resources in various management roles, including project and team leader, lead developer, and applications development director.

Mr. Miller has authored Cognos TM1 Developers Certification Guide, Mastering Splunk, Learning IBM Watson Analytics, and a number of whitepapers on best practices, such as Establishing a Center of Excellence, and he continues to post blogs on a number of relevant topics based upon personal experiences and industry best practices. Jim is a perpetual learner who continues to pursue experiences and certifications, and currently holds the following technical certifications:

IBM Certified Business Analyst - Cognos TM1
IBM Cognos TM1 Master 385 Certification (perfect score of 100% on exam)
IBM Certified Advanced Solution Expert - Cognos TM1
IBM Cognos TM1 10.1 Administrator Certification C2020-703 (perfect score of 100% on exam)
IBM OpenPages Developer Fundamentals C2020-001-ENU (98% on exam)
IBM Cognos 10 BI Administrator C2020-622 (98% on exam)
IBM Cognos 10 BI Professional C2020-180

His specialties include the evaluation and introduction of innovative and disruptive technologies, cloud migration, big data, IBM Watson Analytics, Cognos BI and TM1 application Design and Development, OLAP, Visual Basic, SQL Server, Forecasting and Planning, International Application Development, Business Intelligence, Project Development and Delivery, and process improvement.

About the Reviewer

Dave Wentzel is a Data Solutions Architect for Microsoft. He helps customers with their Azure Digital Transformation, focusing on data science, big data, and SQL Server. After working with customers, he provides feedback and lessons learned to the product groups at Microsoft to help them build better solutions. Dave has been working with SQL Server for many years, and with MDX and SSAS since they were in their infancy. Dave shares his experiences at http://davewentzel.com. He's always looking for new customers. Would you like to engage?

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Customer Feedback

Thank you for purchasing this Packt book. We take our commitment to improving our content and products to meet your needs seriously; that's why your feedback is so valuable. Whatever your feelings about your purchase, please consider leaving a review on this book's Amazon page. Not only will this help us, but, more importantly, it will also help others in the community make an informed decision about the resources they invest in to learn.

You can also review for us on a regular basis by joining our reviewers' club. If you're interested in joining, or would like to learn more about the benefits we offer, please contact us: [email protected].

Preface

The concepts and models necessary to efficiently and effectively visualize big data can be daunting, but they are not unattainable. Unfortunately, when it comes to big data, many of the available data visualization tools, with their rudimentary functions and features, are somewhat ineffective.

Using basic analytical concepts (reviewed in this book), you’ll learn to use some of the most popular open source tools (and others) to meet these challenges and approach the task of big data visualization to support better decision making.

What this book covers

Chapter 1, Introduction to Big Data Visualization, starts out by providing a simple explanation of just what data visualization is and then provides a quick overview of various generally accepted data visualization concepts.

Chapter 2, Access, Speed, and Storage with Hadoop, targets the challenge of storing and accessing large volumes and varieties (structured or unstructured) of data, offering working examples that demonstrate solutions for effectively addressing these issues.

Chapter 3, Understanding Your Data Using R, explores the idea of adding context to the big data you are working on with R.

Chapter 4, Addressing Big Data Quality, discusses the categories of data quality and the challenges that big data brings to each. In addition, it covers examples demonstrating concepts for effectively addressing these areas.

Chapter 5, Displaying Results Using D3, explores the process of visualizing data using a web browser and Data-Driven Documents (D3) to present results from your big data analysis projects.

Chapter 6, Dashboards for Big Data - Tableau, introduces Tableau as a data visualization tool that can be used to construct dashboards and provides working examples demonstrating solutions for effectively presenting results from your big data analysis in a real-time dashboard format.

Chapter 7, Dealing with Outliers Using Python, focuses on the topic of dealing with outliers and other anomalies as they relate to big data visualization, and introduces the Python language with working examples of effectively dealing with data.

Chapter 8, Big Data Operational Intelligence with Splunk, offers working examples demonstrating solutions for valuing big data by gaining operational intelligence (using Splunk).

What you need for this book

Most of the tools and technologies used in this book are open source and available for no charge. All of the others offer free trials for evaluation. With this book and some basic exposure to data analysis (or basic programming concepts), the reader will be able to gain valuable insights into the world of big data visualization!

Who this book is for

The target audience of this book is data analysts and those with at least a basic knowledge of big data analysis who now want to learn interesting approaches to big data visualization in order to make their analysis more valuable. Readers who possess adequate knowledge of big data platform tools such as Hadoop, or who have exposure to programming languages such as R, can use this book to learn additional approaches (using various technologies) for addressing the inherent challenges of visualizing big data.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Big-Data-Visualization. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/BigDataVisualization_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or the code, we would be grateful if you could report it to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to the list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Chapter 1.  Introduction to Big Data Visualization

Since this is the first chapter, it seems prudent to start out by providing a simple explanation of just what data visualization is and then a quick overview of various generally accepted data visualization concepts.

From there, we will proceed by pointing out the specific challenges that big data brings to the practice of visualizing data, and then finally we will tee up a number of approaches for successfully creating valuable visualizations using big data sources.

After completing this chapter, the reader will be ready to start on the practical big data visualization examples covered in this book's subsequent chapters. Each chapter focuses on a specific big data visualization topic, using a specific trending tool or technology thought to be well suited to that particular topic or challenge (note that other tools or technologies may be available).

We'll break down this first chapter into:

An explanation of data visualization
Conventional data visualization concepts
Challenges of big data visualization
Approaches to big data visualization

An explanation of data visualization

So what is data visualization? Simply put, one can think of the two words: data, meaning information or numbers, and visualization, meaning picturing; together, picturing the information, as shown in the following figure:

Perhaps a simplistic example that can be used to define data visualization is the practice of striking lines between stars in the night sky to create an image.

Imagine certain stars as the data points you are interested in (among the billions of other stars visible in the sky), and imagine connecting them in a certain order to create a picture that helps one visualize the constellation.

Voila! Data visualization!
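To make the analogy concrete, here is a minimal, hypothetical Python sketch (the star coordinates are invented for illustration, and matplotlib is an assumed dependency, not something this chapter otherwise requires):

```python
import matplotlib.pyplot as plt

# Hypothetical "stars of interest", listed in the order we want to connect them
stars_x = [1.0, 2.5, 3.2, 4.8, 5.5]
stars_y = [2.0, 3.5, 1.8, 2.9, 4.1]

# Strike lines between the stars in order, then draw the stars themselves;
# the lines turn isolated data points into a recognizable picture
plt.plot(stars_x, stars_y, linestyle="--", color="gray", zorder=1)
plt.scatter(stars_x, stars_y, color="black", zorder=2)
plt.title("Connecting data points to form a picture")
plt.show()
```

Connected in a different order, the same points would form a different picture, which is precisely the sense in which visualization choices shape what an audience sees.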

Nowadays, data visualization is widely regarded within the industry, and by many disciplines, as the modern equivalent of visual communication.

Okay, so then what is the point, or chief objective, of visual communication, or of visualizing your data?

The main point (although there are other goals and objectives) when leveraging data visualization is to make something complex appear simple (or in our star example earlier, perhaps to make a data pattern more visible to a somewhat untrained eye).

Communicating a particular point or simplifying the complexities of mountains of data does not strictly require data visualization, but today's world all but demands it. That is, the majority of the readers of this book would most likely agree that scanning numerous worksheets, spreadsheets, or reports is mundane and tedious at best, while looking at charts and graphs is typically much easier on the eyes. Additionally, we humans are able to process even very large amounts of data much more quickly when the data is presented graphically. Therefore, data visualization is a way to convey concepts in a universal manner, allowing your audience or target to quickly get your point.

Other motives for using data visualization include:

To explain the data or put the data in context (that is, highlighting demographic statistics)
To solve a specific problem (for example, identifying problem areas within a particular business model)
To explore the data to reach a better understanding or add clarity (that is, what period of time does this data span?)
To highlight or illustrate otherwise invisible data (such as isolating outliers residing in the data)
To predict, for example, potential sales volumes (perhaps based upon seasonal sales statistics)

With computers, technology, and the corporate business landscape changing so rapidly today (and all indications are that it will continue to change at an even faster pace in the future), what can be considered the future of the art of data visualization?

As per The future of data visualization (Towler, 2015):

"Data visualization is entering a new era. Emerging sources of intelligence, theoretical developments, and advances in multidimensional imaging are reshaping the potential value that analytics and insights can provide, with visualization playing a key role."

With big data getting bigger (and bigger!), it is safe to say that the use of data visualization will only continue to grow, to evolve, and to be of outstanding value. In addition, how one approaches the process and practice of data visualization will need to grow and evolve as well.

Conventional data visualization concepts

Let's start out this section by clarifying what we mean when we say conventional.

In the context of this book, when I say conventional, I am referring to the ideas and methods for data visualization that have been used with some level of success within the industry over time.

Although it seems that every day new technologies and practices are being discovered, developed, and deployed, providing new and different options for performing ever more ingenious real-time (or near real-time) data visualization, understanding the basic concepts for visualizing data is still essential.

To that point, gaining an understanding of just how to go about choosing the correct or most effective visualization method is essential.

To make that choice, one typically needs to establish:

The size and volume of the data to be visualized.
The data's cardinality and context.
What is it you are trying to communicate? What is the point that you want to communicate?
Who is your audience? Who will consume this information?
What kind or type of visual might best convey your message to your audience?

Note

We should also be realistic: sometimes the approach taken or method used is based solely upon your time and budget.

Based on the earlier particulars, and perhaps others (you most likely are already familiar with these), the most common visualization methods/types include the following (a short code sketch of two of them follows the list):

Table
Histogram
Scatter plot
Line, bar, pie, area, flow, and bubble charts
Data series or a combination of charts
Timeline
Venn diagrams, data flow diagrams, and entity relationship (ER) diagrams
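As an illustration, the following short sketch renders two of the types just listed, a bar chart and a histogram, from invented sample data (the figures and the matplotlib usage are assumptions for demonstration, not examples from the book's later chapters):

```python
import random
import matplotlib.pyplot as plt

# Hypothetical data: monthly sales totals and a sample of individual order amounts
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 95, 140, 110]
order_amounts = [random.gauss(50, 15) for _ in range(500)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# A bar chart compares discrete categories (sales by month)
ax1.bar(months, sales)
ax1.set_title("Bar chart: sales by month")

# A histogram shows how a single measure is distributed (order amounts)
ax2.hist(order_amounts, bins=20)
ax2.set_title("Histogram: order amounts")

plt.tight_layout()
plt.show()
```

The same data could also feed a table, a pie chart, or a scatter plot; which form works best depends on the point being communicated and the audience, as discussed earlier.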

As I've mentioned earlier, as and when needs arise, newer or lesser-known options are becoming more mainstream.

These include the following:

Word/Text/Tag clouds
Network diagrams
Parallel coordinates
Tree mapping
Cone trees
Semantic networks

Each of the aforementioned data visualization types/methods speaks to a particular scenario or target audience better than others; it all depends. Learning to make the appropriate choice comes from experience, as well as (sometimes) a bit of trial and error.

Training options

Due to the popularity of data visualization, there exist many formal training options (classroom and online), and new and unique training curricula are becoming available every day.

Coursework includes topics such as:

Channeling an audience
Understanding data
Determining informational hierarchies
Sketching and wireframing
Defining a narrative

Challenges of big data visualization

We're assuming that you have some background in the topic of data visualization, and therefore the earlier deliberations were just enough to refresh your memory and whet your appetite for the real purpose of this book.

Big data

Let's take a pause here to define big data.

Big data has been defined in several ways: as a large assemblage of data, as datasets so large or complex that traditional data processing applications are inadequate, and as data about every aspect of our lives. All of these have been used to define or refer to big data.

In 2001, then Gartner analyst Doug Laney introduced the 3Vs concept (refer to http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf). The 3Vs, according to Doug Laney, are volume, variety, and velocity, and they make up the dimensionality of big data: volume (the measurable amount of data), variety (the number of types of data), and velocity (the speed of processing or dealing with that data).

With this concept in mind, all aspects of big data become increasingly challenging, and as these dimensions increase or expand, they also encumber the ability to effectively visualize the data.

Using Excel to gauge your data

Look at the following figure and remember that Excel is not a tool to determine whether your data qualifies as big data:

Even if your data is too big for Microsoft Excel, it still doesn't necessarily qualify as big data. In fact, gigabytes of data are still manageable with various techniques and with enterprise and even open source tools, especially given the lower cost of storage today. It is important to be able to realistically size the data that you will be using in an analytic or visualization project before selecting an approach or technology (keeping in mind expected data growth rates).
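As a practical first step, you might gauge a file before selecting an approach. The sketch below is one hedged way to do this in Python: the CSV file name is hypothetical, and the 1,048,576-row constant is Excel's documented per-worksheet limit (Excel 2007 and later):

```python
import os

# Excel worksheets cap out at 1,048,576 rows, so a simple row count is a quick
# first gauge; exceeding it means "too big for Excel", not necessarily big data
EXCEL_MAX_ROWS = 1_048_576

def gauge_csv(path):
    size_gb = os.path.getsize(path) / (1024 ** 3)
    with open(path, encoding="utf-8", errors="ignore") as f:
        row_count = sum(1 for _ in f)
    print(f"{path}: {row_count:,} rows, {size_gb:.2f} GB")
    if row_count > EXCEL_MAX_ROWS:
        print("Too large for a single Excel worksheet, but not necessarily big data.")

gauge_csv("sales_transactions.csv")  # hypothetical file name
```

Running a check like this (alongside an expected growth rate) gives you a realistic size estimate before you commit to a tool or technology.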

Pushing big data higher

As the following figure illustrates, the aforementioned Volume, Variety, and Velocity have lifted, and will continue to lift, big data into the future:

The 3Vs

Let's take a moment to further examine the Vs.

Volume

Volume involves determining or calculating how much of something there is, or in the case of big data, how much of something there will be. Here is a thought-provoking example:

How fast does moon dust pile up?

As written by Megan Gannon in 2014 (http://www.space.com/23694-moon-dust-mystery-apollo-data.html), a revisited trove of data from NASA's Apollo missions more than 40 years ago is helping scientists answer a lingering lunar question: how fast does moon dust build up? The answer: it would take 1,000 years for a layer of moon dust about a millimeter (0.04 inches) thick to accumulate (big data accumulates much more quickly than moon dust!).

With every click of a mouse, big data grows, reaching petabytes (1 petabyte = 1,024 terabytes) or even exabytes (1 exabyte = 1,024 petabytes), consisting of billions to trillions of records generated by millions of people and machines.
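To make those units tangible, here is a small worked example using the binary conversions quoted above (a sketch for illustration only):

```python
# Binary storage units as quoted in the text:
# 1 petabyte = 1,024 terabytes and 1 exabyte = 1,024 petabytes
TB = 1024 ** 4   # bytes in one terabyte
PB = 1024 * TB   # bytes in one petabyte
EB = 1024 * PB   # bytes in one exabyte

print(f"1 PB = {PB / TB:,.0f} TB = {PB:,} bytes")
print(f"1 EB = {EB / PB:,.0f} PB = {EB:,} bytes")
print(f"One exabyte would fill {EB // TB:,} one-terabyte drives")
```

That last line prints 1,048,576, that is, over a million one-terabyte drives for a single exabyte, which gives a sense of why volume alone strains conventional tooling.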

Although it's been reported (for example, refer to http://blog.sqlauthority.com/2013/07/21/sql-server-what-is-the-maximum-relational-database-size-supported-by-single-instance/) that structured or relational database technology could support applications capable of scaling up to 1 petabyte of storage, it doesn't take a lot of thought to see that data of that volume won't be easy to handle capably, and the accumulation rate of big data isn't slowing any time soon.

It's a case of big, bigger, and biggest, and we haven't even approached the biggest yet!

Velocity

Velocity is the rate or pace at which something occurs. The velocity we measure or experience can, and usually does, change over time. Velocities directly affect outcomes.

Previously, we lived and worked in a batch environment, meaning we formulate a question (perhaps, what is our most popular product?), submit the question (to the information technology group), and wait, perhaps until after the nightly sales are processed (maybe 24 hours later), when we finally receive an answer. This is a business model that no longer holds up given the many new sources of data (such as social media or mobile applications) that record and capture data in real time, all of the time. The answers to the questions asked may actually change within a 24-hour period (as is the case with the trending now information that you may have observed when online).

Given the industry hot topics such as Internet of Things (IoT), it is safe to say that these pace expectations will only quicken.

Variety

Thinking back to our previous mention of relational databases, it is generally accepted that relational databases are considered to be highly structured, although they may contain text in VARCHAR, CLOB, or BLOB fields.

Data today (and especially when we talk about big data) comes from many kinds of data sources, and the level to which that data is structured varies greatly from data source to data source. In fact, the growing trend is for data to continue to lose structure and to add hundreds (or more?) of new formats and structures all of the time, formats that go beyond pure text, photo, audio, video, web, GPS data, sensor data, relational databases, documents, SMS, PDF, Flash, and so on.

Categorization

The process of categorization helps us to gain an understanding of the data source.

The industry commonly categorizes big data this way, into two groups (structured and unstructured), but the categorization doesn't stop there.