Practical Big Data Analytics

Nataraj Dasgupta

Description

Big Data analytics relates to the strategies used by organizations to collect, organize, and analyze large amounts of data to uncover valuable business insights that cannot be extracted through traditional systems. Crafting an enterprise-scale, cost-efficient Big Data and machine learning solution to uncover insights and value from your organization’s data is a challenge. Today, with hundreds of new Big Data systems, machine learning packages, and BI tools, selecting the right combination of technologies is an even greater challenge.

This book will help you do that. With the help of this guide, you will be able to bridge the gap between the theoretical world of technology and the practical reality of building corporate Big Data and data science platforms. You will get hands-on exposure to Hadoop and Spark, build machine learning dashboards using R and R Shiny, create web-based apps using NoSQL databases such as MongoDB, and even learn how to write R code for neural networks.

By the end of the book, you will have a very clear and concrete understanding of what Big Data analytics means, how it drives revenues for organizations, and how you can develop your own Big Data analytics solution using the different tools and methods articulated in this book.

You can read this e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 373

Year of publication: 2018




Practical Big Data Analytics


Hands-on techniques to implement enterprise analytics and machine learning using Hadoop, Spark, NoSQL and R


Nataraj Dasgupta


BIRMINGHAM - MUMBAI

Practical Big Data Analytics

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Veena Pagare
Acquisition Editor: Vinay Argekar
Content Development Editor: Tejas Limkar
Technical Editor: Dinesh Chaudhary
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Graphics: Tania Dutta
Production Coordinator: Aparna Bhagat

First published: January 2018

Production reference: 1120118

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78355-439-3

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Nataraj Dasgupta is the vice president of Advanced Analytics at RxDataScience Inc. Nataraj has been in the IT industry for more than 19 years and has worked in the technical and analytics divisions of Philip Morris, IBM, UBS Investment Bank, and Purdue Pharma. He led the data science division at Purdue Pharma L.P., where he developed the company’s award-winning big data and machine learning platform. Prior to Purdue, at UBS, he held the role of associate director, working with high-frequency and algorithmic trading technologies in the foreign exchange trading division of the bank.

I'd like to thank my wife, Suraiya, for her care, support, and understanding as I worked during long weekends and evening hours, and my parents, in-laws, sister, and grandmother for all the support, guidance, tutelage, and encouragement over the years. I'd also like to thank Packt, especially the editors, Tejas, Dinesh, Vinay, and the team, whose persistence and attention to detail have been exemplary.

About the reviewer

Giancarlo Zaccone has more than 10 years' experience in managing research projects in both scientific and industrial areas. He worked as a researcher at the C.N.R., the National Research Council, where he was involved in projects on parallel numerical computing and scientific visualization. He is a senior software engineer at a consulting company, developing and testing software systems for space and defense applications. He holds a master's degree in physics from the Federico II of Naples and a second-level postgraduate master's course in scientific computing from La Sapienza of Rome.


Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Too Big or Not Too Big

What is big data?

A brief history of data

Dawn of the information age

Dr. Alan Turing and modern computing

The advent of the stored-program computer

From magnetic devices to SSDs

Why we are talking about big data now if data has always existed

Definition of big data

Building blocks of big data analytics

Types of Big Data

Structured

Unstructured

Semi-structured

Sources of big data

The 4Vs of big data

When do you know you have a big data problem and where do you start your search for the big data solution?

Summary

Big Data Mining for the Masses

What is big data mining?

Big data mining in the enterprise

Building the case for a Big Data strategy

Implementation life cycle

Stakeholders of the solution

Implementing the solution

Technical elements of the big data platform

Selection of the hardware stack

Selection of the software stack

Summary

The Analytics Toolkit

Components of the Analytics Toolkit

System recommendations

Installing on a laptop or workstation

Installing on the cloud

Installing Hadoop

Installing Oracle VirtualBox

Installing CDH in other environments

Installing Packt Data Science Box

Installing Spark

Installing R

Steps for downloading and installing Microsoft R Open

Installing RStudio

Installing Python

Summary

Big Data With Hadoop

The fundamentals of Hadoop

The fundamental premise of Hadoop

The core modules of Hadoop

Hadoop Distributed File System - HDFS

Data storage process in HDFS

Hadoop MapReduce

An intuitive introduction to MapReduce

A technical understanding of MapReduce

Block size and number of mappers and reducers

Hadoop YARN

Job scheduling in YARN

Other topics in Hadoop

Encryption

User authentication

Hadoop data storage formats

New features expected in Hadoop 3

The Hadoop ecosystem

Hands-on with CDH

WordCount using Hadoop MapReduce

Analyzing oil import prices with Hive

Joining tables in Hive

Summary

Big Data Mining with NoSQL

Why NoSQL?

The ACID, BASE, and CAP properties

ACID and SQL

The BASE property of NoSQL

The CAP theorem

The need for NoSQL technologies

Google Bigtable

Amazon Dynamo

NoSQL databases

In-memory databases

Columnar databases

Document-oriented databases

Key-value databases

Graph databases

Other NoSQL types and summary of other types of databases 

Analyzing Nobel Laureates data with MongoDB

JSON format

Installing and using MongoDB

Tracking physician payments with real-world data

Installing kdb+, R, and RStudio

Installing kdb+

Installing R

Installing RStudio

The CMS Open Payments Portal

Downloading the CMS Open Payments data

Creating the Q application

Loading the data

The backend code

Creating the frontend web portal

R Shiny platform for developers

Putting it all together - The CMS Open Payments application

Applications

Summary

Spark for Big Data Analytics

The advent of Spark

Limitations of Hadoop

Overcoming the limitations of Hadoop

Theoretical concepts in Spark

Resilient distributed datasets

Directed acyclic graphs

SparkContext

Spark DataFrames

Actions and transformations

Spark deployment options

Spark APIs

Core components in Spark

Spark Core

Spark SQL

Spark Streaming

GraphX

MLlib

The architecture of Spark

Spark solutions

Spark practicals

Signing up for Databricks Community Edition

Spark exercise - hands-on with Spark (Databricks)

Summary

An Introduction to Machine Learning Concepts

What is machine learning?

The evolution of machine learning

Factors that led to the success of machine learning

Machine learning, statistics, and AI

Categories of machine learning

Supervised and unsupervised machine learning

Supervised machine learning

Vehicle Mileage, Number Recognition and other examples

Unsupervised machine learning

Subdividing supervised machine learning

Common terminologies in machine learning

The core concepts in machine learning

Data management steps in machine learning

Pre-processing and feature selection techniques

Centering and scaling

The near-zero variance function

Removing correlated variables

Other common data transformations

Data sampling

Data imputation

The importance of variables

The train, test splits, and cross-validation concepts

Splitting the data into train and test sets

The cross-validation parameter

Creating the model

Leveraging multicore processing in the model

Summary

Machine Learning Deep Dive

The bias, variance, and regularization properties

The gradient descent and VC Dimension theories

Popular machine learning algorithms

Regression models

Association rules

Confidence

Support

Lift

Decision trees

The Random forest extension

Boosting algorithms

Support vector machines

The K-Means machine learning technique

The neural networks related algorithms

Tutorial - associative rules mining with CMS data

Downloading the data

Writing the R code for Apriori

Shiny (R Code)

Using custom CSS and fonts for the application

Running the application

Summary

Enterprise Data Science

Enterprise data science overview

A roadmap to enterprise analytics success

Data science solutions in the enterprise

Enterprise data warehouse and data mining

Traditional data warehouse systems

Oracle Exadata, Exalytics, and TimesTen

HP Vertica

Teradata

IBM data warehouse systems (formerly Netezza appliances)

PostgreSQL

Greenplum

SAP Hana

Enterprise and open source NoSQL Databases

Kdb+

MongoDB

Cassandra

Neo4j

Cloud databases

Amazon Redshift, Redshift Spectrum, and Athena databases

Google BigQuery and other cloud services

Azure CosmosDB

GPU databases

Brytlyt

MapD

Other common databases

Enterprise data science – machine learning and AI

The R programming language

Python

OpenCV, Caffe, and others

Spark

Deep learning

H2O and Driverless AI

Datarobot

Command-line tools

Apache MADlib

Machine learning as a service

Enterprise infrastructure solutions

Cloud computing

Virtualization

Containers – Docker, Kubernetes, and Mesos

On-premises hardware

Enterprise Big Data

Tutorial – using RStudio in the cloud

Summary

Closing Thoughts on Big Data

Corporate big data and data science strategy

Ethical considerations

Silicon Valley and data science

The human factor

Characteristics of successful projects

Summary

External Data Science Resources

Big data resources

NoSQL products

Languages and tools

Creating dashboards

Notebooks

Visualization libraries

Courses on R

Courses on machine learning

Machine learning and deep learning links

Web-based machine learning services

Movies

Machine learning books from Packt

Books for leisure reading

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

This book introduces the reader to a broad spectrum of topics related to big data as used in the enterprise. Big data is a vast area that encompasses elements of technology, statistics, visualization, business intelligence, and many other related disciplines. To get true value from data that oftentimes remains inaccessible, whether due to volume or technical limitations, companies must leverage the proper tools at both the software and the hardware level.

To that end, the book not only covers the theoretical and practical aspects of big data, but also supplements the information with high-level topics such as the use of big data in the enterprise, big data and data science initiatives, and key considerations such as resources and the hardware/software stack. Such discussions will be useful for IT departments in organizations that are planning to implement or upgrade their big data and/or data science platforms.

The book focuses on three primary areas:

1. Data mining on large-scale datasets

Big data is ubiquitous today, just as the term data warehouse was omnipresent not too long ago. There are a myriad of solutions in the industry. In particular, Hadoop and products in the Hadoop ecosystem have become both popular and increasingly common in the enterprise. Further, more recent innovations such as Apache Spark have also found a permanent presence in the enterprise: Hadoop clients, realizing that they may not need the complexity of the Hadoop framework, have shifted to Spark in large numbers. Finally, NoSQL solutions such as MongoDB, Redis, and Cassandra, as well as commercial solutions such as Teradata, Vertica, and kdb+, have taken the place of more conventional database systems.

This book will cover these areas with a fair degree of depth. Hadoop and related products such as Hive, HBase, Pig Latin, and others have been covered. We have also covered Spark and explained key concepts in Spark, such as Actions and Transformations. NoSQL solutions such as MongoDB and kdb+ have also been covered to a fair extent, and hands-on tutorials have been provided.

2. Machine learning and predictive analytics

The second topic that has been covered is machine learning, also known by various other names, such as predictive analytics, statistical learning, and others. Detailed explanations with corresponding machine learning code, written using R and machine learning packages in R, have been provided. Algorithms such as random forest, support vector machines, neural networks, stochastic gradient boosting, and decision trees have been discussed. Further, key concepts in machine learning such as bias and variance, regularization, feature selection, and data pre-processing have also been covered.

3. Data mining in the enterprise

In general, books that cover theoretical topics seldom discuss the more high-level aspects of big data, such as the key requirements for a successful big data initiative. The book includes survey results from IT executives and highlights the shared needs that are common across the industry. The book also includes a step-by-step guide on how to select the right use cases, whether for big data or for machine learning, based on lessons learned from deploying production solutions in large IT departments.

We believe that with a strong foundational knowledge of these three areas, any practitioner can deliver successful big data and/or data science projects. That is the primary intention behind the overall structure and content of the book.

Who this book is for

The book is intended for a diverse range of audiences. In particular, readers who are keen on understanding the concepts of big data, data science, and/or machine learning at a holistic level, namely, how they are all interrelated, will gain the most benefit from the book.

Technical audience: For technically minded readers, the book contains detailed explanations of the key industry tools for big data and machine learning. Hands-on exercises using Hadoop, developing machine learning use cases using the R programming language, and building comprehensive production-grade dashboards with R Shiny have all been covered. Other tutorials in Spark and NoSQL have also been included. Besides the practical aspects, the theoretical underpinnings of these key technologies have also been explained.

Business audience: The extensive theoretical and practical treatment of big data has been supplemented with high-level topics around the nuances of deploying and implementing robust big data solutions in the workplace. IT management, CIO organizations, business analytics teams, and other groups who are tasked with defining the corporate strategy around data will find this information very useful and directly applicable.

What this book covers

Chapter 1, A Gentle Primer on Big Data, covers the basic concepts of big data and machine learning and the tools used, and gives a general understanding of what big data analytics pertains to.

Chapter 2, Getting started with Big Data Mining, introduces concepts of big data mining in an enterprise and provides an introduction to the software and hardware architecture stack for enterprise big data.

Chapter 3, The Analytics Toolkit, discusses the various tools used for big data and machine Learning and provides step-by-step instructions on where users can download and install tools such as R, Python, and Hadoop. 

Chapter 4, Big Data with Hadoop, looks at the fundamental concepts of Hadoop and delves into the detailed technical aspects of the Hadoop ecosystem. Core components of Hadoop, such as the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce, and concepts in Hadoop 2, such as ResourceManager, NodeManager, and ApplicationMaster, have been explained in this chapter. A step-by-step tutorial on using Hive via the Cloudera Distribution of Hadoop (CDH) has also been included in the chapter.

Chapter 5, Big Data Analytics with NoSQL, looks at the various emerging and unique database solutions popularly known as NoSQL, which have upended the traditional model of relational databases. We will discuss the core concepts and technical aspects of NoSQL. The various types of NoSQL systems, such as in-memory, columnar, document-based, key-value, graph, and others, have been covered in this section. A tutorial related to MongoDB and the MongoDB Compass interface, as well as an extremely comprehensive tutorial on creating a production-grade R Shiny dashboard with kdb+, have been included.

Chapter 6, Spark for Big Data Analytics, looks at how to use Spark for big data analytics. Both high-level concepts as well as technical topics have been covered. Key concepts such as SparkContext, Directed Acyclic Graphs, and Actions and Transformations have been covered. There is also a complete tutorial on using Spark on Databricks, a platform via which users can leverage Spark.

Chapter 7, A Gentle Introduction to Machine Learning Concepts, speaks about the fundamental concepts in machine learning. Further, core concepts such as supervised vs unsupervised learning, classification, regression, feature engineering, data preprocessing and cross-validation have been discussed. The chapter ends with a brief tutorial on using an R library for Neural Networks.

Chapter 8, Machine Learning Deep Dive, delves into some of the more involved aspects of machine learning. Algorithms, bias, variance, regularization, and various other concepts in Machine Learning have been discussed in depth. The chapter also includes explanations of algorithms such as random forest, support vector machines, decision trees. The chapter ends with a comprehensive tutorial on creating a web-based machine learning application.

Chapter 9, Enterprise Data Science, discusses the technical considerations for deploying enterprise-scale data science and big data solutions. We will also discuss the various ways enterprises across the world are implementing their big data strategies, including cloud-based solutions. A step-by-step tutorial on using AWS - Amazon Web Services has also been provided in the chapter.

Chapter 10, Closing Thoughts on Big Data, discusses corporate big data and data science strategies and concludes with some pointers on how to make big data-related projects successful.

Appendix A, Further Reading on Big Data, contains links for a wider understanding of big data.

To get the most out of this book

A general knowledge of Unix would be very helpful, although it isn't mandatory

Access to a computer with an internet connection will be needed in order to download the necessary tools and software used in the exercises

No prior knowledge of the subject area has been assumed as such

Installation instructions for all the software and tools have been provided in Chapter 3, The Analytics Toolkit.

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packtpub.com.

2. Select the SUPPORT tab.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Big-Data-Analytics. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/PracticalBigDataAnalytics_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The results are stored in HDFS under the /user/cloudera/output directory."

A block of code is set as follows:

"_id" : ObjectId("597cdbb193acc5c362e7ae97"), "firstName" : "Nina", "age" : 53, "frequentFlyer" : [ "Delta", "JetBlue", "Delta"

Any command-line input or output is written as follows:

$ cd Downloads/ # cd to the folder where you have downloaded the zip file

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "This sort of additional overhead can easily be alleviated by using virtual machines (VMs)."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their books. Thank you!

For more information about Packt, please visit packtpub.com.

Too Big or Not Too Big

Big data analytics constitutes a wide range of functions related to mining, analysis, and predictive modeling on large-scale datasets. The rapid growth of information and technological developments has provided a unique opportunity for individuals and enterprises across the world to derive profits and develop new capabilities, redefining traditional business models using large-scale analytics. This chapter aims to provide a gentle overview of the salient characteristics of big data to form a foundation for subsequent chapters that will delve deeper into the various aspects of big data analytics.

In general, this book will provide both theoretical as well as practical hands-on experience with big data analytics systems used across the industry. The book begins with a discussion of big data and big data-related platforms such as Hadoop, Spark, and NoSQL systems, followed by machine learning, where both practical and theoretical topics will be covered, and concludes with a thorough analysis of the use of big data, and more generally data science, in the industry. The book covers the following topics:

Big data platforms: the Hadoop ecosystem and Spark, NoSQL databases such as Cassandra, and advanced platforms such as kdb+

Machine learning: basic algorithms and concepts, using R and scikit-learn in Python, advanced tools in C/C++ and Unix, and real-world machine learning with neural networks

Big data infrastructure: enterprise cloud architecture with AWS (Amazon Web Services), on-premises enterprise architectures, high-performance computing for advanced analytics, business and enterprise use cases for big data analytics and machine learning, and building a world-class big data analytics solution

To take the discussion forward, we will cover the following concepts in this chapter:

Definition of big data

Why are we talking about big data now if data has always existed?

A brief history of big data

Types of big data

Where should you start your search for the big data solution?

What is big data?

The term big is relative and can often take on different meanings, both in terms of magnitude and applications, for different situations. A simple, although naïve, definition of big data is a large collection of information, whether it is data stored in your personal laptop or on a large corporate server, that is non-trivial to analyze using existing or traditional tools.

Today, the industry generally treats data in the order of terabytes or petabytes and beyond as big data. In this chapter, we will discuss what led to the emergence of the big data paradigm and its broad characteristics. Later on, we will delve into the distinct areas in detail.

A brief history of data

The history of computing is a fascinating tale of how, starting with Charles Babbage’s Analytical Engine in the mid-1830s and continuing to the present-day supercomputers, computing technologies have led global transformations. Due to space limitations, it would be infeasible to cover all the areas, but a high-level introduction to data and the storage of data is provided for historical background.

Dawn of the information age

Big data has always existed. The US Library of Congress, the largest library in the world, houses 164 million items in its collection, including 24 million books and 125 million items in its non-classified collection. [Source: https://www.loc.gov/about/general-information/].

Mechanical data storage arguably first started with punch cards, invented by Herman Hollerith in 1880. Based loosely on prior work by Basile Bouchon, who, in 1725, invented punch bands to control looms, Hollerith's punch cards provided an interface to perform tabulations and even printing of aggregates.

IBM pioneered the industrialization of punch cards, and they soon became the de facto choice for storing information.

Dr. Alan Turing and modern computing

Punch cards established a formidable presence, but there was still a missing element--these machines, although complex in design, could not be considered computational devices. A formal general-purpose machine that could be versatile enough to solve a diverse set of problems was yet to be invented.

In 1936, after graduating from King’s College, Cambridge, Turing published a seminal paper titled On Computable Numbers, with an Application to the Entscheidungsproblem, where he built on Kurt Gödel's Incompleteness Theorem to formalize the notion of our present-day digital computing.

The advent of the stored-program computer

The first implementation of a stored-program computer, a device that can hold programs in memory, was the Manchester Small-Scale Experimental Machine (SSEM), developed at the Victoria University of Manchester in 1948 [Source: https://en.wikipedia.org/wiki/Manchester_Small-Scale_Experimental_Machine]. This introduced the concept of RAM (Random Access Memory), or more generally, memory, in computers today. Prior to the SSEM, computers had fixed storage; namely, all functions had to be prewired into the system. The ability to store data dynamically in a temporary storage device such as RAM meant that machines were no longer bound by the fixed capacity of the storage device, but could hold an arbitrary volume of information.

From magnetic devices to SSDs

In the early 1950s, IBM introduced magnetic tape, which essentially used magnetization on a metallic tape to store data. This was followed in quick succession by hard-disk drives in 1956, which, instead of tapes, used magnetic disk platters to store data.

The first models of hard drives had a capacity of less than 4 MB, occupied the space of approximately two medium-sized refrigerators, and cost in excess of $36,000--a factor of 300 million times more expensive relative to today’s hard drives. Magnetized surfaces soon became the standard in secondary storage, and to date, variations of them have been implemented across various removable devices, such as floppy disks in the late 90s, CDs, and DVDs.

Solid-state drives (SSDs), the successor to hard drives, were first invented by IBM in the mid-1950s. In contrast to hard drives, SSDs store data using non-volatile memory on a charged silicon substrate. As there are no mechanical moving parts, the time to retrieve data stored in an SSD (the seek time) is an order of magnitude faster relative to devices such as hard drives.

Why we are talking about big data now if data has always existed

By the early 2000s, rapid advances in computing and related technologies, such as storage, allowed users to collect and store data with unprecedented levels of efficiency. The internet further added impetus to this drive by providing a platform with an unlimited capacity to exchange information on a global scale. Technology advanced at a breathtaking pace and led to major paradigm shifts powered by tools such as social media, connected devices such as smartphones, the availability of broadband connections, and, by extension, user participation, even in remote parts of the world.

By and large, the majority of this data consists of information generated by web-based sources, such as social networks like Facebook and video sharing sites like YouTube. In big data parlance, this is also known as unstructured data; namely, data that is not in a fixed format such as a spreadsheet or the kind that can be easily stored in a traditional database system.

The simultaneous advances in computing capabilities meant that, although the rate at which data was being generated was very high, it was still computationally feasible to analyze it. Machine learning algorithms that were once considered intractable, due to both data volume and algorithmic complexity, could now be run using new paradigms such as cluster or multinode processing, in a much simpler manner than before, when they would have necessitated special-purpose machines.
Chart of data generated per minute. Credit: DOMO Inc.

Definition of big data

Collectively, the volume of data being generated has come to be termed big data, and analytics on such data, encompassing a wide range of faculties from basic data mining to advanced machine learning, is known as big data analytics. There isn't, as such, an exact definition, due to the relative nature of quantifying what is large enough for any specific use case to be classified as big data analytics. Rather, in a generic sense, performing analysis on large-scale datasets, in the order of tens or hundreds of gigabytes to petabytes, can be termed big data analytics. This can range from something as simple as finding the number of rows in a large dataset to applying a machine learning algorithm to it.
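Even the simpler end of that spectrum requires tools built for scale. As a purely illustrative sketch, assuming the sparklyr R package and a local Spark installation (Spark is covered in detail later in this book), counting the rows of a dataset too large to load into a plain R session might look like the following; the file path is hypothetical:

# A minimal sketch, assuming sparklyr and a local Spark installation.
# The CSV path below is hypothetical; substitute your own large dataset.
library(sparklyr)

sc <- spark_connect(master = "local")   # connect to a local Spark instance

# Read the file into Spark rather than into R's own memory
sales <- spark_read_csv(sc, name = "sales", path = "/data/sales_records.csv")

sdf_nrow(sales)        # the row count is computed by Spark, not by R

spark_disconnect(sc)

The point of the sketch is that the computation is pushed to the engine that holds the data; only the final count travels back to the analyst's session.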

Building blocks of big data analytics

At a fundamental level, big data systems can be considered to have four major layers, each of which is indispensable. Various textbooks and papers outline such layers differently and, as such, the definitions can be ambiguous. Nevertheless, at a high level, the layers defined here are both intuitive and simple:

Big Data Analytics Layers

The levels are broken down as follows:

Hardware: Servers that provide the computing backbone, storage devices that store the data, and network connectivity across different server components are some of the elements that define the hardware stack. In essence, the systems that provide the computational and storage capabilities, and the systems that support the interoperability of these devices, form the foundational layer of the building blocks.

Software: Software resources that facilitate analytics on the datasets hosted in the hardware layer, such as Hadoop and NoSQL systems, represent the next level in the big data stack. Analytics software can be classified into various subdivisions. Two of the primary high-level classifications for analytics software are tools that facilitate:

Data mining: Software that provides facilities for aggregations, joins across datasets, and pivot tables on large datasets falls into this category. Standard NoSQL platforms such as Cassandra, Redis, and others are high-level data mining tools for big data analytics.

Statistical analytics: Platforms that provide analytics capabilities beyond simple data mining, such as running algorithms that can range from simple regressions to advanced neural networks, on platforms such as Google TensorFlow or R, fall into this category (a minimal regression sketch follows this list).

Data management: Data encryption, governance, access, compliance, and other features salient to any enterprise and production environment to manage and, in some ways, reduce operational complexity form the next basic layer. Although they are less tangible than hardware or software, data management tools provide a defined framework, using which organizations can fulfill their obligations such as security and compliance.

End user: The end user of the analytics software forms the final aspect of a big data analytics engagement. A data platform, after all, is only as good as the extent to which it can be leveraged efficiently and addresses business-specific use cases. This is where the role of the practitioner who makes use of the analytics platform to derive value comes into play. The term data scientist is often used to denote individuals who implement the underlying big data analytics capabilities, while business users reap the benefits of faster access and analytics capabilities not available in traditional systems.
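To make the statistical analytics layer concrete, here is a minimal, self-contained R sketch of the "simple regression" end of that spectrum, using the built-in mtcars dataset rather than any big data system:

# A minimal sketch: ordinary least squares regression in base R.
# mtcars ships with R; wt is vehicle weight in units of 1,000 lbs.
model <- lm(mpg ~ wt, data = mtcars)   # miles per gallon as a function of weight
summary(model)                         # coefficients, R-squared, and p-values
predict(model, newdata = data.frame(wt = 3.0))  # predicted mpg for a 3,000 lb car

The same conceptual workflow, fitting a model, inspecting it, and predicting, carries over to the advanced platforms mentioned above; only the scale and the engine change.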

Types of Big Data

Data can be broadly classified as being structured, unstructured, or semi-structured. Although these distinctions have always existed, the classification of data into these categories has become more prominent with the advent of big data.

Structured

Structured data, as the name implies, indicates datasets that have a defined organizational structure, such as Microsoft Excel or CSV files. In pure database terms, the data should be representable using a schema. As an example, the following table, representing the top five happiest countries in the world as published by the United Nations in its 2017 World Happiness Index ranking, would be a typical representation of structured data.

We can clearly define the data types of the columns--Rank, Score, GDP per capita, Social support, Healthy life expectancy, Trust, Generosity, and Dystopia are numerical columns, whereas Country is represented using letters, or more specifically, strings.

Refer to the following table for a little more clarity:

Rank  Country      Score  GDP per capita  Social support  Healthy life expectancy  Generosity  Trust  Dystopia
1     Norway       7.537  1.616           1.534           0.797                    0.362       0.316  2.277
2     Denmark      7.522  1.482           1.551           0.793                    0.355       0.401  2.314
3     Iceland      7.504  1.481           1.611           0.834                    0.476       0.154  2.323
4     Switzerland  7.494  1.565           1.517           0.858                    0.291       0.367  2.277
5     Finland      7.469  1.444           1.540           0.809                    0.245       0.383  2.430

World Happiness Report, 2017 [Source: https://en.wikipedia.org/wiki/World_Happiness_Report#cite_note-4]
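To make the notion of a schema concrete, the following hedged R sketch reads a CSV version of the table above while declaring a type for every column up front; the filename is hypothetical:

# A minimal sketch: loading structured data with an explicit schema.
# The filename is hypothetical; column types are declared positionally:
# Rank is an integer, Country is a string, and the remaining seven
# columns (Score through Dystopia) are numeric.
happiness <- read.csv("world_happiness_2017.csv",
                      colClasses = c("integer", "character", rep("numeric", 7)))
str(happiness)   # confirms that each column carries its declared type

Being able to state such a schema in a single line is precisely what makes this data structured.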

Commercial databases such as Teradata and Greenplum, as well as Redis, Cassandra, and Hive in the open source domain, are examples of technologies that provide the ability to manage and query structured data.

Unstructured

Unstructured data consists of any dataset that does not have a predefined organizational schema as in the table in the prior section. Spoken words, music, videos, and even books, including this one, would be considered unstructured. This by no means implies that the content doesn’t have organization. Indeed, a book has a table of contents, chapters, subchapters, and an index--in that sense, it follows a definite organization.

However, it would be futile to represent every word and sentence as being part of a strict set of rules. A sentence can consist of words, numbers, punctuation marks, and so on and does not have a predefined data type as spreadsheets do. To be structured, the book would need to have an exact set of characteristics in every sentence, which would be both unreasonable and impractical.

Data from social media, such as posts on Twitter, messages from friends on Facebook, and photos on Instagram, are all examples of unstructured data.

Unstructured data can be stored in various formats. It can be stored as blobs or, in the case of textual data, as freeform text held in a data storage medium. For textual data, technologies such as Lucene/Solr, Elasticsearch, and others are generally used for querying, indexing, and other operations.

Semi-structured

Semi-structured data refers to data that has both the elements of an organizational schema as well as aspects that are arbitrary. A personal phone diary (increasingly rare these days!) with columns for name, address, phone number, and notes could be considered a semi-structured dataset. The user might not be aware of the addresses of all individuals and hence some of the entries may have just a phone number and vice versa.

Similarly, the column for notes may contain additional descriptive information (such as a facsimile number, name of a relative associated with the individual, and so on). It is an arbitrary field that allows the user to add complementary information. The columns for name, address, and phone number can thus be considered structured in the sense that they can be presented in a tabular format, whereas the notes section is unstructured in the sense that it may contain an arbitrary set of descriptive information that cannot be represented in the other columns in the diary.

In computing, semi-structured data is usually represented by formats, such as JSON, that can encapsulate both structured as well as schemaless or arbitrary associations, generally using key-value pairs. A more common example is email messages, which have a structured part, such as the name of the sender, the time when the message was received, and so on, that is common to all email messages, and an unstructured portion represented by the body or content of the email.
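As a small illustrative sketch, assuming the jsonlite R package, the following snippet parses a hypothetical contact entry of the kind described above: the first three keys follow a fixed schema, while notes holds arbitrary freeform detail:

# A minimal sketch, assuming the jsonlite package.
# The contact entry is hypothetical; "notes" is the schemaless part.
library(jsonlite)

entry <- fromJSON('{
  "name": "A. Smith",
  "address": "12 Elm Street",
  "phone": "555-0100",
  "notes": "Fax 555-0101; brother of J. Smith; prefers evening calls"
}')

entry$name    # structured part: always present, with a predictable type
entry$notes   # unstructured part: arbitrary descriptive text

The same document could gain or drop keys without breaking the parser, which is exactly the flexibility that semi-structured formats provide.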

Platforms such as MongoDB and CouchDB are generally used to store and query semi-structured datasets.

Sources of big data

Technology today allows us to collect data at an astounding rate--both in terms of volume and variety. There are various sources that generate data, but in the context of big data, the primary sources are as follows:

Social networks: Arguably, the primary source of all big data that we know of today is the social networks that have proliferated over the past 5-10 years. This is by and large unstructured data that is represented by millions of social media postings and other data that is generated on a second-by-second basis through user interactions on the web across the world. Increased access to the internet across the world has been a self-fulfilling act for the growth of data in social networks.

Media: Largely a result of the growth of social networks, media represents the millions, if not billions, of audio and visual uploads that take place on a daily basis. Videos uploaded on YouTube, music recordings on SoundCloud, and pictures posted on Instagram are prime examples of media, whose volume continues to grow in an unrestrained manner.

Data warehouses: Companies have long invested in specialized data storage facilities commonly known as data warehouses. A data warehouse (DW) is essentially a collection of historical data that companies wish to maintain and catalog for easy retrieval, whether for internal use or regulatory purposes. As industries gradually shift toward the practice of storing data in platforms such as Hadoop and NoSQL, more and more companies are moving data from their pre-existing data warehouses to some of the newer technologies. Company emails, accounting records, databases, and internal documents are some examples of DW data that is now being offloaded onto Hadoop or Hadoop-like platforms that leverage multiple nodes to provide a highly-available and fault-tolerant platform.

Sensors: A more recent phenomenon in the space of big data has been the collection of data from sensor devices. While sensors have always existed, and industries such as oil and gas have been using drilling sensors for measurements at oil rigs for many decades, the advent of wearable devices, part of the Internet of Things (IoT), such as Fitbit and the Apple Watch, meant that each individual could now stream data at the same rate at which a few oil rigs used to just 10 years back.

Wearable devices can collect hundreds of measurements from an individual at any given point in time. While not yet a big data problem as such, as the industry keeps evolving, sensor-related data is likely to become more akin to the kind of spontaneous data that is generated on the web through social network activities.

The 4Vs of big data

The topic of the 4Vs has become overused in the context of big data, to the point where it has started to lose some of its initial charm. Nevertheless, it helps to bear in mind what these Vs indicate, for the sake of being aware of the background context needed to carry on a conversation.

Broadly, the 4Vs indicate the following:

Volume: The amount of data that is being generated

Variety: The different types of data, such as textual, media, and sensor or streaming data

Velocity: The speed at which data is being generated, such as millions of messages being exchanged at any given time across social networks

Veracity: This has been a more recent addition to the 3Vs and indicates the noise inherent in data, such as inconsistencies in recorded information that require additional validation

When do you know you have a big data problem and where do you start your search for the big data solution?

Finally, big data analytics refers to the practice of putting the data to work--in other words, the process of extracting useful information from large volumes of data through the use of appropriate technologies. There is no exact definition for many of the terms used to denote different types of analytics, as they can be interpreted in different ways, and their meaning can hence be subjective.

Nevertheless, some are provided here to act as references or starting points to help you in forming an initial impression:

Data mining: Data mining refers to the process of extracting information from datasets through running queries or basic summarization methods such as aggregations. Finding the top 10 products by the number of sales from a dataset containing all the sales records of one million products at an online website would be an example of mining: that is, extracting useful information from a dataset. NoSQL databases such as Cassandra, Redis, and MongoDB are prime examples of tools that have strong data mining capabilities; a short sketch of such a query follows.
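As a hedged sketch of that top-10 query, assuming the mongolite R package and a local MongoDB instance holding a hypothetical sales collection with one document per sale:

# A minimal sketch, assuming mongolite and a local MongoDB instance.
# The "shop" database and "sales" collection are hypothetical.
library(mongolite)

m <- mongo(collection = "sales", db = "shop", url = "mongodb://localhost")

# Group the sale documents by product, count them, and keep the top 10
top10 <- m$aggregate('[
  {"$group": {"_id": "$product", "numSales": {"$sum": 1}}},
  {"$sort":  {"numSales": -1}},
  {"$limit": 10}
]')
top10

The aggregation pipeline runs inside the database; R receives only the ten-row result.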

Business intelligence: Business intelligence refers to tools such as Tableau, Spotfire, QlikView, and others that provide frontend dashboards to enable users to query data using a graphical interface. Dashboard products have gained in prominence in step with the growth of data, as users seek to extract information. Easy-to-use interfaces with querying and visualization features that could be used universally by both technical and non-technical users set the groundwork to democratize analytical access to data.

Visualization