Big Data analytics relates to the strategies used by organizations to collect, organize, and analyze large amounts of data to uncover valuable business insights that cannot be analyzed through traditional systems. Crafting an enterprise-scale cost-efficient Big Data and machine learning solution to uncover insights and value from your organization’s data is a challenge. Today, with hundreds of new Big Data systems, machine learning packages, and BI tools, selecting the right combination of technologies is an even greater challenge.
This book will help you do that. With the help of this guide, you will be able to bridge the gap between the theoretical world of technology and the practical reality of building corporate Big Data and data science platforms. You will get hands-on exposure to Hadoop and Spark, build machine learning dashboards using R and R Shiny, create web-based apps using NoSQL databases such as MongoDB, and even learn how to write R code for neural networks.
By the end of the book, you will have a very clear and concrete understanding of what Big Data analytics means, how it drives revenues for organizations, and how you can develop your own Big Data analytics solution using the different tools and methods articulated in this book.
Page count: 373
Year of publication: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Veena Pagare
Acquisition Editor: Vinay Argekar
Content Development Editor: Tejas Limkar
Technical Editor: Dinesh Chaudhary
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Graphics: Tania Dutta
Production Coordinator: Aparna Bhagat
First published: January 2018
Production reference: 1120118
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78355-439-3
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Nataraj Dasgupta is the vice president of Advanced Analytics at RxDataScience Inc. Nataraj has been in the IT industry for more than 19 years and has worked in the technical and analytics divisions of Philip Morris, IBM, UBS Investment Bank and Purdue Pharma. He led the data science division at Purdue Pharma L.P. where he developed the company’s award-winning big data and machine learning platform. Prior to Purdue, at UBS, he held the role of associate director working with high frequency and algorithmic trading technologies in the Foreign Exchange trading division of the bank.
Giancarlo Zaccone has more than 10 years of experience in managing research projects in both scientific and industrial areas. He worked as a researcher at the C.N.R, the National Research Council, where he was involved in projects on parallel numerical computing and scientific visualization. He is a senior software engineer at a consulting company, developing and testing software systems for space and defense applications. He holds a master's degree in physics from the Federico II of Naples and a second level postgraduate master course in scientific computing from La Sapienza of Rome.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Too Big or Not Too Big
What is big data?
A brief history of data
Dawn of the information age
Dr. Alan Turing and modern computing
The advent of the stored-program computer
From magnetic devices to SSDs
Why we are talking about big data now if data has always existed
Definition of big data
Building blocks of big data analytics
Types of Big Data
Structured
Unstructured
Semi-structured
Sources of big data
The 4Vs of big data
When do you know you have a big data problem and where do you start your search for the big data solution?
Summary
Big Data Mining for the Masses
What is big data mining?
Big data mining in the enterprise
Building the case for a Big Data strategy
Implementation life cycle
Stakeholders of the solution
Implementing the solution
Technical elements of the big data platform
Selection of the hardware stack
Selection of the software stack
Summary
The Analytics Toolkit
Components of the Analytics Toolkit
System recommendations
Installing on a laptop or workstation
Installing on the cloud
Installing Hadoop
Installing Oracle VirtualBox
Installing CDH in other environments
Installing Packt Data Science Box
Installing Spark
Installing R
Steps for downloading and installing Microsoft R Open
Installing RStudio
Installing Python
Summary
Big Data With Hadoop
The fundamentals of Hadoop
The fundamental premise of Hadoop
The core modules of Hadoop
Hadoop Distributed File System - HDFS
Data storage process in HDFS
Hadoop MapReduce
An intuitive introduction to MapReduce
A technical understanding of MapReduce
Block size and number of mappers and reducers
Hadoop YARN
Job scheduling in YARN
Other topics in Hadoop
Encryption
User authentication
Hadoop data storage formats
New features expected in Hadoop 3
The Hadoop ecosystem
Hands-on with CDH
WordCount using Hadoop MapReduce
Analyzing oil import prices with Hive
Joining tables in Hive
Summary
Big Data Mining with NoSQL
Why NoSQL?
The ACID, BASE, and CAP properties
ACID and SQL
The BASE property of NoSQL
The CAP theorem
The need for NoSQL technologies
Google Bigtable
Amazon Dynamo
NoSQL databases
In-memory databases
Columnar databases
Document-oriented databases
Key-value databases
Graph databases
Other NoSQL types and summary of other types of databases
Analyzing Nobel Laureates data with MongoDB
JSON format
Installing and using MongoDB
Tracking physician payments with real-world data
Installing kdb+, R, and RStudio
Installing kdb+
Installing R
Installing RStudio
The CMS Open Payments Portal
Downloading the CMS Open Payments data
Creating the Q application
Loading the data
The backend code
Creating the frontend web portal
R Shiny platform for developers
Putting it all together - The CMS Open Payments application
Applications
Summary
Spark for Big Data Analytics
The advent of Spark
Limitations of Hadoop
Overcoming the limitations of Hadoop
Theoretical concepts in Spark
Resilient distributed datasets
Directed acyclic graphs
SparkContext
Spark DataFrames
Actions and transformations
Spark deployment options
Spark APIs
Core components in Spark
Spark Core
Spark SQL
Spark Streaming
GraphX
MLlib
The architecture of Spark
Spark solutions
Spark practicals
Signing up for Databricks Community Edition
Spark exercise - hands-on with Spark (Databricks)
Summary
An Introduction to Machine Learning Concepts
What is machine learning?
The evolution of machine learning
Factors that led to the success of machine learning
Machine learning, statistics, and AI
Categories of machine learning
Supervised and unsupervised machine learning
Supervised machine learning
Vehicle Mileage, Number Recognition and other examples
Unsupervised machine learning
Subdividing supervised machine learning
Common terminologies in machine learning
The core concepts in machine learning
Data management steps in machine learning
Pre-processing and feature selection techniques
Centering and scaling
The near-zero variance function
Removing correlated variables
Other common data transformations
Data sampling
Data imputation
The importance of variables
The train, test splits, and cross-validation concepts
Splitting the data into train and test sets
The cross-validation parameter
Creating the model
Leveraging multicore processing in the model
Summary
Machine Learning Deep Dive
The bias, variance, and regularization properties
The gradient descent and VC Dimension theories
Popular machine learning algorithms
Regression models
Association rules
Confidence
Support
Lift
Decision trees
The Random forest extension
Boosting algorithms
Support vector machines
The K-Means machine learning technique
The neural networks related algorithms
Tutorial - associative rules mining with CMS data
Downloading the data
Writing the R code for Apriori
Shiny (R Code)
Using custom CSS and fonts for the application
Running the application
Summary
Enterprise Data Science
Enterprise data science overview
A roadmap to enterprise analytics success
Data science solutions in the enterprise
Enterprise data warehouse and data mining
Traditional data warehouse systems
Oracle Exadata, Exalytics, and TimesTen
HP Vertica
Teradata
IBM data warehouse systems (formerly Netezza appliances)
PostgreSQL
Greenplum
SAP Hana
Enterprise and open source NoSQL Databases
Kdb+
MongoDB
Cassandra
Neo4j
Cloud databases
Amazon Redshift, Redshift Spectrum, and Athena databases
Google BigQuery and other cloud services
Azure CosmosDB
GPU databases
Brytlyt
MapD
Other common databases
Enterprise data science – machine learning and AI
The R programming language
Python
OpenCV, Caffe, and others
Spark
Deep learning
H2O and Driverless AI
Datarobot
Command-line tools
Apache MADlib
Machine learning as a service
Enterprise infrastructure solutions
Cloud computing
Virtualization
Containers – Docker, Kubernetes, and Mesos
On-premises hardware
Enterprise Big Data
Tutorial – using RStudio in the cloud
Summary
Closing Thoughts on Big Data
Corporate big data and data science strategy
Ethical considerations
Silicon Valley and data science
The human factor
Characteristics of successful projects
Summary
External Data Science Resources
Big data resources
NoSQL products
Languages and tools
Creating dashboards
Notebooks
Visualization libraries
Courses on R
Courses on machine learning
Machine learning and deep learning links
Web-based machine learning services
Movies
Machine learning books from Packt
Books for leisure reading
Other Books You May Enjoy
Leave a review - let other readers know what you think
This book introduces the reader to a broad spectrum of topics related to big data as used in the enterprise. Big data is a vast area that encompasses elements of technology, statistics, visualization, business intelligence, and many other related disciplines. To get true value from data that oftentimes remains inaccessible, either due to volume or technical limitations, companies must leverage proper tools both at the software as well as the hardware level.
To that end, the book not only covers the theoretical and practical aspects of big data, but also supplements the information with high-level topics such as the use of big data in the enterprise, big data and data science initiatives and key considerations such as resources, hardware/software stack and other related topics. Such discussions would be useful for IT departments in organizations that are planning to implement or upgrade the organizational big data and/or data science platform.
The book focuses on three primary areas:
1. Data mining on large-scale datasets
Big data is ubiquitous today, just as the term data warehouse was omnipresent not too long ago. There is a myriad of solutions in the industry. In particular, Hadoop and products in the Hadoop ecosystem have become both popular and increasingly common in the enterprise. Further, more recent innovations such as Apache Spark have also found a permanent presence in the enterprise: Hadoop clients, realizing that they may not need the complexity of the Hadoop framework, have shifted to Spark in large numbers. Finally, NoSQL solutions such as MongoDB, Redis, and Cassandra, and commercial solutions such as Teradata, Vertica, and kdb+, have taken the place of more conventional database systems.
This book will cover these areas with a fair degree of depth. Hadoop and related products such as Hive, HBase, Pig Latin, and others have been covered. We have also covered Spark and explained key concepts in Spark such as Actions and Transformations. NoSQL solutions such as MongoDB and kdb+ have also been covered to a fair extent, and hands-on tutorials have been provided.
2. Machine learning and predictive analytics
The second topic that has been covered is machine learning, also known by various other names, such as predictive analytics and statistical learning. Detailed explanations with corresponding machine learning code written using R and machine learning packages in R have been provided. Algorithms such as random forest, support vector machines, neural networks, stochastic gradient boosting, and decision trees have been discussed. Further, key concepts in machine learning such as bias and variance, regularization, feature selection, and data pre-processing have also been covered.
3. Data mining in the enterprise
In general, books that cover theoretical topics seldom discuss the more high-level aspects of big data - such as the key requirements for a successful big data initiative. The book includes survey results from IT executives and highlights the shared needs that are common across the industry. The book also includes a step-by-step guide on how to select the right use cases, whether it is for big data or for machine learning based on lessons learned from deploying production solutions in large IT departments.
We believe that with a strong foundational knowledge of these three areas, any practitioner can deliver successful big data and/or data science projects. That is the primary intention behind the overall structure and content of the book.
The book is intended for a diverse audience. In particular, readers who are keen on understanding the concepts of big data, data science, and/or machine learning at a holistic level, namely, how they are all interrelated, will gain the most benefit from the book.
Technical audience: For technically minded readers, the book contains detailed explanations of the key industry tools for big data and machine learning. Hands-on exercises using Hadoop, developing machine learning use cases using the R programming language, and building comprehensive production-grade dashboards with R Shiny have been covered. Other tutorials in Spark and NoSQL have also been included. Besides the practical aspects, the theoretical underpinnings of these key technologies have also been explained.
Business audience: The extensive theoretical and practical treatment of big data has been supplemented with high level topics around the nuances of deploying and implementing robust big data solutions in the workplace. IT management, CIO organizations, business analytics and other groups who are tasked with defining the corporate strategy around data will find such information very useful and directly applicable.
Chapter 1, A Gentle Primer on Big Data, covers the basic concepts of big data and machine learning and the tools used, and gives a general understanding of what big data analytics pertains to.
Chapter 2, Getting started with Big Data Mining, introduces concepts of big data mining in an enterprise and provides an introduction to the software and hardware architecture stack for enterprise big data.
Chapter 3, The Analytics Toolkit, discusses the various tools used for big data and machine learning and provides step-by-step instructions on where users can download and install tools such as R, Python, and Hadoop.
Chapter 4, Big Data with Hadoop, looks at the fundamental concepts of Hadoop and delves into the detailed technical aspects of the Hadoop ecosystem. Core components of Hadoop such as Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce, and concepts in Hadoop 2 such as ResourceManager, NodeManager, and Application Master have been explained in this chapter. A step-by-step tutorial on using Hive via the Cloudera Distribution of Hadoop (CDH) has also been included in the chapter.
Chapter 5, Big Data Analytics with NoSQL, looks at the various emerging and unique database solutions popularly known as NoSQL, which have upended the traditional model of relational databases. We will discuss the core concepts and technical aspects of NoSQL. The various types of NoSQL systems such as In-Memory, Columnar, Document-based, Key-Value, Graph, and others have been covered in this section. A tutorial related to MongoDB and the MongoDB Compass interface as well as an extremely comprehensive tutorial on creating a production-grade R Shiny Dashboard with kdb+ have been included.
Chapter 6, Spark for Big Data Analytics, looks at how to use Spark for big data analytics. Both high-level concepts and technical topics have been covered. Key concepts such as SparkContext, Directed Acyclic Graphs, and Actions & Transformations have been covered. There is also a complete tutorial on using Spark on Databricks, a platform via which users can leverage Spark.
Chapter 7, A Gentle Introduction to Machine Learning Concepts, speaks about the fundamental concepts in machine learning. Further, core concepts such as supervised vs unsupervised learning, classification, regression, feature engineering, data preprocessing, and cross-validation have been discussed. The chapter ends with a brief tutorial on using an R library for Neural Networks.
Chapter 8, Machine Learning Deep Dive, delves into some of the more involved aspects of machine learning. Algorithms, bias, variance, regularization, and various other concepts in machine learning have been discussed in depth. The chapter also includes explanations of algorithms such as random forest, support vector machines, and decision trees. The chapter ends with a comprehensive tutorial on creating a web-based machine learning application.
Chapter 9, Enterprise Data Science, discusses the technical considerations for deploying enterprise-scale data science and big data solutions. We will also discuss the various ways enterprises across the world are implementing their big data strategies, including cloud-based solutions. A step-by-step tutorial on using Amazon Web Services (AWS) has also been provided in the chapter.
Chapter 10, Closing Thoughts on Big Data, discusses corporate big data and data science strategies and concludes with some pointers on how to make big data related projects successful.
Appendix A, Further Reading on Big Data, contains links for a wider understanding of big data.
A general knowledge of Unix would be very helpful, although it isn't mandatory
Access to a computer with an internet connection will be needed in order to download the necessary tools and software used in the exercises
No prior knowledge of the subject area has been assumed as such
Installation instructions for all the software and tools have been provided in Chapter 3, The Analytics Toolkit.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Log in or register at www.packtpub.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Big-Data-Analytics. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/PracticalBigDataAnalytics_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The results are stored in HDFS under the /user/cloudera/output."
A block of code is set as follows:
"_id" : ObjectId("597cdbb193acc5c362e7ae97"), "firstName" : "Nina", "age" : 53, "frequentFlyer" : [ "Delta", "JetBlue", "Delta"
Any command-line input or output is written as follows:
$ cd Downloads/ # cd to the folder where you have downloaded the zip file
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "This sort of additional overhead can easily be alleviated by using virtual machines (VMs)"
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
Big data analytics constitutes a wide range of functions related to mining, analysis, and predictive modeling on large-scale datasets. The rapid growth of information and technological developments has provided a unique opportunity for individuals and enterprises across the world to derive profits and develop new capabilities redefining traditional business models using large-scale analytics. This chapter aims at providing a gentle overview of the salient characteristics of big data to form a foundation for subsequent chapters that will delve deeper into the various aspects of big data analytics.
In general, this book will provide both a theoretical grounding and practical hands-on experience with big data analytics systems used across the industry. The book begins with a discussion of Big Data and Big Data related platforms such as Hadoop, Spark, and NoSQL systems, followed by machine learning, where both practical and theoretical topics will be covered, and concludes with a thorough analysis of the use of Big Data and, more generally, data science in the industry. The book will include the following topics:
Big data platforms: Hadoop ecosystem and Spark; NoSQL databases such as Cassandra; advanced platforms such as kdb+
Machine learning: Basic algorithms and concepts; using R and scikit-learn in Python; advanced tools in C/C++ and Unix; real-world machine learning with neural networks
Big data infrastructure: Enterprise cloud architecture with AWS (Amazon Web Services); on-premises enterprise architectures; high-performance computing for advanced analytics
Business and enterprise use cases for big data analytics and machine learning
Building a world-class big data analytics solution
To take the discussion forward, we will cover the following concepts in this chapter:
Definition of Big Data
Why are we talking about Big Data now if data has always existed?
A brief history of Big Data
Types of Big Data
Where should you start your search for the Big Data solution?
The term big is relative and can often take on different meanings, both in terms of magnitude and applications for different situations. A simple, although naïve, definition of big data is a large collection of information, whether it is data stored in your personal laptop or a large corporate server that is non-trivial to analyze using existing or traditional tools.
Today, the industry generally treats data in the order of terabytes or petabytes and beyond as big data. In this chapter, we will discuss what led to the emergence of the big data paradigm and its broad characteristics. Later on, we will delve into the distinct areas in detail.
The history of computing is a fascinating tale of how, starting with Charles Babbage’s Analytical Engine in the mid 1830s to the present-day supercomputers, computing technologies have led global transformations. Due to space limitations, it would be infeasible to cover all the areas, but a high-level introduction to data and storage of data is provided for historical background.
Big data has always existed. The US Library of Congress, the largest library in the world, houses 164 million items in its collection, including 24 million books and 125 million items in its non-classified collection. [Source: https://www.loc.gov/about/general-information/].
Mechanical data storage arguably first started with punch cards, developed by Herman Hollerith in the 1880s. Based loosely on prior work by Basile Bouchon, who in 1725 invented punch bands to control looms, Hollerith's punch cards provided an interface to perform tabulations and even printing of aggregates.
IBM pioneered the industrialization of punch cards, and they soon became the de facto choice for storing information.
Punch cards established a formidable presence but there was still a missing element--these machines, although complex in design, could not be considered computational devices. A formal general-purpose machine that could be versatile enough to solve a diverse set of problems was yet to be invented.
In 1936, after graduating from King’s College, Cambridge, Alan Turing published a seminal paper titled On Computable Numbers, with an Application to the Entscheidungsproblem, in which he built on Kurt Gödel's Incompleteness Theorem to formalize the notion of present-day digital computing.
The first implementation of a stored-program computer, a device that can hold programs in memory, was the Manchester Small-Scale Experimental Machine (SSEM), developed at the Victoria University of Manchester in 1948 [Source: https://en.wikipedia.org/wiki/Manchester_Small-Scale_Experimental_Machine]. This introduced the concept of RAM, Random Access Memory (or more generally, memory) in computers today. Prior to the SSEM, computers had fixed-storage; namely, all functions had to be prewired into the system. The ability to store data dynamically in a temporary storage device such as RAM meant that machines were no longer bound by the capacity of the storage device, but could hold an arbitrary volume of information.
In the early 1950s, IBM introduced magnetic tape, which used magnetization on a metallic tape to store data. This was followed in quick succession by hard-disk drives in 1956, which, instead of tapes, used magnetic disk platters to store data.
The first models of hard drives had a capacity of less than 4 MB, occupied the space of approximately two medium-sized refrigerators, and cost in excess of $36,000, a factor of 300 million times more expensive relative to today’s hard drives. Magnetized surfaces soon became the standard in secondary storage and, to date, variations of them have been implemented across various removable devices such as floppy disks, which were later joined by optical media such as CDs and DVDs.
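The price comparison above can be sanity-checked with a little arithmetic. The sketch below takes the quoted figures at face value and assumes, since the text does not say so explicitly, that the 300 million factor compares cost per megabyte; the variable names are illustrative only.

```python
# Figures quoted in the text (approximate, rounded).
first_drive_cost_usd = 36_000     # "in excess of $36,000"
first_drive_capacity_mb = 4       # "less than 4 MB"
price_ratio = 300_000_000         # "300 million times more expensive"

# Assumption: the ratio refers to cost per megabyte, not cost per device.
cost_per_mb_then = first_drive_cost_usd / first_drive_capacity_mb
implied_cost_per_mb_now = cost_per_mb_then / price_ratio

# Decimal units: 1 TB = 1,000,000 MB.
implied_cost_per_tb_now = implied_cost_per_mb_now * 1_000_000

print(f"Cost per MB then:        ${cost_per_mb_then:,.0f}")
print(f"Implied cost per TB now: ${implied_cost_per_tb_now:,.2f}")
```

Under these assumptions the first drives cost about $9,000 per megabyte, and the quoted ratio implies a present-day price of roughly $30 per terabyte.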
Solid-state drives (SSDs), the successor to hard drives, were first invented in the mid-1950s by IBM. In contrast to hard drives, SSDs store data using non-volatile memory, which holds data using a charged silicon substrate. As there are no mechanical moving parts, the time to retrieve data stored in an SSD (seek time) is an order of magnitude faster relative to devices such as hard drives.
By the early 2000s, rapid advances in computing and technologies, such as storage, allowed users to collect and store data with unprecedented levels of efficiency. The internet further added impetus to this drive by providing a platform that had an unlimited capacity to exchange information at a global scale. Technology advanced at a breathtaking pace and led to major paradigm shifts powered by tools such as social media, connected devices such as smartphones, and the availability of broadband connections, and by extension, user participation, even in remote parts of the world.
By and large, the majority of this data consists of information generated by web-based sources, such as social networks like Facebook and video sharing sites like YouTube. In big data parlance, this is also known as unstructured data; namely, data that is not in a fixed format such as a spreadsheet or the kind that can be easily stored in a traditional database system.
Collectively, the volume of data being generated has come to be termed big data, and analytics on it, spanning a wide range of faculties from basic data mining to advanced machine learning, is known as big data analytics. There isn't an exact definition as such, because what counts as large enough to classify a specific use case as big data analytics is relative. Rather, in a generic sense, performing analysis on large-scale datasets, in the order of tens or hundreds of gigabytes to petabytes, can be termed big data analytics. This can range from something as simple as finding the number of rows in a large dataset to applying a machine learning algorithm to it.
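To make the simplest end of that range concrete, here is a minimal sketch of counting rows without loading the whole dataset into memory. It assumes a CSV-style input with a header row; the function name count_rows and the sample data are hypothetical, and a real big data platform would distribute this work across machines rather than stream it on one.

```python
import csv
import io

def count_rows(lines, has_header=True):
    """Count data rows by streaming through the input one record at a time."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row, if there is one
    # Only one record is held in memory at any moment.
    return sum(1 for _ in reader)

# Small in-memory stand-in for a large file; in practice, pass an open file handle.
sample = io.StringIO("id,amount\n1,10.5\n2,7.25\n3,3.0\n")
row_count = count_rows(sample)
print(row_count)
```

Because the reader is consumed lazily, memory use stays flat regardless of file size; the cost is a full sequential pass over the data, which is what makes even this trivial question non-trivial at large scale.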
At a fundamental level, big data systems can be considered to have four major layers, each of which is indispensable. Various textbooks and other literature outline many such layers, so the terminology can be ambiguous. Nevertheless, at a high level, the layers defined here are both intuitive and simple:
The levels are broken down as follows:
Hardware: Servers that provide the computing backbone, storage devices that store the data, and network connectivity across different server components are some of the elements that define the hardware stack. In essence, the systems that provide the computational and storage capabilities and systems that support the interoperability of these devices form the foundational layer of the building blocks.
Software: Software resources that facilitate analytics on the datasets hosted in the hardware layer, such as Hadoop and NoSQL systems, represent the next level in the big data stack. Analytics software can be classified into various subdivisions. Two of the primary high-level classifications for analytics software are tools that facilitate the following:
Data mining: Software that provides facilities for aggregations, joins across datasets, and pivot tables on large datasets falls into this category. Standard NoSQL platforms such as Cassandra, Redis, and others are high-level data mining tools for big data analytics.
Statistical analytics: Platforms such as Google TensorFlow or R that provide analytics capabilities beyond simple data mining, such as running algorithms that range from simple regressions to advanced neural networks, fall into this category.
Data management: Data encryption, governance, access, compliance, and other features essential to any enterprise and production environment, which help manage and, in some ways, reduce operational complexity, form the next basic layer. Although they are less tangible than hardware or software, data management tools provide a defined framework that organizations can use to fulfill obligations such as security and compliance.
End user: The end user of the analytics software forms the final aspect of a big data analytics engagement. A data platform, after all, is only as good as the extent to which it can be leveraged efficiently and addresses business-specific use cases. This is where the role of the practitioner who makes use of the analytics platform to derive value comes into play. The term data scientist is often used to denote individuals who implement the underlying big data analytics capabilities while business users reap the benefits of faster access and analytics capabilities not available in traditional systems.
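To make the data mining category above more concrete, the following is a minimal Python sketch of an aggregation: summing units sold per product and returning the best sellers. The `sales_records` data and the `top_products` helper are hypothetical and exist only for illustration. A production system would run the equivalent query inside a platform such as those named above rather than in a single Python process.

```python
from collections import Counter

# Hypothetical sales records: (product, units_sold) pairs.
sales_records = [
    ("widget", 3),
    ("gadget", 5),
    ("widget", 6),
    ("gizmo", 1),
    ("gadget", 2),
]


def top_products(records, n):
    """Aggregate units sold per product and return the n best sellers."""
    totals = Counter()
    for product, units in records:
        totals[product] += units
    # most_common sorts by total units, highest first.
    return totals.most_common(n)


top_two = top_products(sales_records, 2)
print(top_two)
```

The same pattern (group by a key, aggregate, sort, and keep the top entries) is what a "top N" query expresses, regardless of the tool that executes it.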
Data can be broadly classified as being structured, unstructured, or semi-structured. Although these distinctions have always existed, the classification of data into these categories has become more prominent with the advent of big data.
Structured data, as the name implies, refers to datasets that have a defined organizational structure, such as Microsoft Excel spreadsheets or CSV files. In pure database terms, the data should be representable using a schema. As an example, the following table, which lists the top five happiest countries in the world as published by the United Nations in its 2017 World Happiness Report, would be a typical representation of structured data.
We can clearly define the data types of the columns: Rank, Score, GDP per capita, Social support, Healthy life expectancy, Generosity, Trust, and Dystopia are numerical columns, whereas Country is represented using letters or, more specifically, strings.
Refer to the following table for a little more clarity:
| Rank | Country | Score | GDP per capita | Social support | Healthy life expectancy | Generosity | Trust | Dystopia |
|---|---|---|---|---|---|---|---|---|
| 1 | Norway | 7.537 | 1.616 | 1.534 | 0.797 | 0.362 | 0.316 | 2.277 |
| 2 | Denmark | 7.522 | 1.482 | 1.551 | 0.793 | 0.355 | 0.401 | 2.314 |
| 3 | Iceland | 7.504 | 1.481 | 1.611 | 0.834 | 0.476 | 0.154 | 2.323 |
| 4 | Switzerland | 7.494 | 1.565 | 1.517 | 0.858 | 0.291 | 0.367 | 2.277 |
| 5 | Finland | 7.469 | 1.444 | 1.54 | 0.809 | 0.245 | 0.383 | 2.43 |
World Happiness Report, 2017 [Source: https://en.wikipedia.org/wiki/World_Happiness_Report#cite_note-4]
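As a rough illustration of what "representable using a schema" means, the sketch below declares a type for each column and parses two rows of the table into typed records. The `HappinessRow` class, the `parse_rows` helper, and the choice to include only the first four columns are assumptions made for brevity, not part of any particular database system.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class HappinessRow:
    """Schema for a subset of the table's columns."""
    rank: int
    country: str
    score: float
    gdp_per_capita: float


# First two rows of the table, written as CSV text.
CSV_TEXT = """Rank,Country,Score,GDP per capita
1,Norway,7.537,1.616
2,Denmark,7.522,1.482
"""


def parse_rows(text):
    """Convert CSV text into typed rows; a bad value raises ValueError."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        HappinessRow(
            rank=int(record["Rank"]),
            country=record["Country"],
            score=float(record["Score"]),
            gdp_per_capita=float(record["GDP per capita"]),
        )
        for record in reader
    ]


rows = parse_rows(CSV_TEXT)
print(rows[0])
```

Because every row must fit the same declared types, a value such as `"n/a"` in the Score column would be rejected at parse time, which is the practical meaning of structure here.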
Commercial databases such as Teradata and Greenplum, as well as Redis, Cassandra, and Hive in the open source domain, are examples of technologies that provide the ability to manage and query structured data.
Unstructured data consists of any dataset that does not have a predefined organizational schema like that of the table in the prior section. Spoken words, music, videos, and even books, including this one, would be considered unstructured. This by no means implies that the content doesn't have organization. Indeed, a book has a table of contents, chapters, subchapters, and an index, and in that sense, it follows a definite organization.
However, it would be futile to represent every word and sentence as being part of a strict set of rules. A sentence can consist of words, numbers, punctuation marks, and so on and does not have a predefined data type as spreadsheets do. To be structured, the book would need to have an exact set of characteristics in every sentence, which would be both unreasonable and impractical.
Unstructured data can be stored in various formats. It can be held as blobs (binary large objects) or, in the case of textual data, as freeform text in a data storage medium. For textual data, technologies such as Lucene/Solr, Elasticsearch, and others are generally used for indexing, querying, and other operations.
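The general idea behind indexing freeform text can be sketched with a small inverted index, which maps each word to the documents that contain it. This is a simplified illustration under stated assumptions, not how Lucene/Solr or Elasticsearch are implemented; the sample `documents` and the `tokenize`, `build_index`, and `search` helpers are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical freeform documents keyed by an id.
documents = {
    1: "Big data systems store large volumes of data.",
    2: "Unstructured data includes text, music, and video.",
    3: "Search engines index text so it can be queried quickly.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


index = build_index(documents)
print(sorted(search(index, "data")))
print(sorted(search(index, "text index")))
```

The index is built once, so a query only looks up the sets for its own words instead of scanning every document.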
Semi-structured data refers to data that has both the elements of an organizational schema as well as aspects that are arbitrary. A personal phone diary (increasingly rare these days!) with columns for name, address, phone number, and notes could be considered a semi-structured dataset. The user might not be aware of the addresses of all individuals and hence some of the entries may have just a phone number and vice versa.
Similarly, the column for notes may contain additional descriptive information (such as a facsimile number, name of a relative associated with the individual, and so on). It is an arbitrary field that allows the user to add complementary information. The columns for name, address, and phone number can thus be considered structured in the sense that they can be presented in a tabular format, whereas the notes section is unstructured in the sense that it may contain an arbitrary set of descriptive information that cannot be represented in the other columns in the diary.
In computing, semi-structured data is usually represented by formats, such as JSON, that can encapsulate both structured as well as schemaless or arbitrary associations, generally using key-value pairs. A more common example could be email messages, which have both a structured part, such as name of the sender, time when the message was received, and so on, that is common to all email messages and an unstructured portion represented by the body or content of the email.
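The email example can be sketched in Python using JSON. The messages, addresses, and field names below are made up for illustration. The point is that some keys (`sender`, `received`) appear in every record, one key (`attachments`) is optional, and `body` holds freeform text.

```python
import json

# Hypothetical email messages encoded as JSON.
raw_messages = """[
  {"sender": "alice@example.com", "received": "2018-01-15T09:30:00",
   "body": "Lunch at noon? Bring the quarterly report."},
  {"sender": "bob@example.com", "received": "2018-01-15T10:05:00",
   "body": "Server migration is done.", "attachments": ["log.txt"]}
]"""

messages = json.loads(raw_messages)

# Structured part: every message has a sender, so it can be read directly.
senders = [message["sender"] for message in messages]

# Schemaless part: "attachments" is optional, so supply a default.
attachment_counts = [len(message.get("attachments", [])) for message in messages]

# Unstructured part: the body is freeform text, so it is searched, not typed.
mentions_report = [
    message["sender"]
    for message in messages
    if "report" in message["body"].lower()
]

print(senders, attachment_counts, mentions_report)
```

Reading the optional key with `get` and a default is one simple way to handle records that do not all share the same set of fields.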
Platforms such as MongoDB and CouchDB are generally used to store and query semi-structured datasets.
Technology today allows us to collect data at an astounding rate, both in terms of volume and variety. There are various sources that generate data, but in the context of big data, the primary sources are as follows:
Social networks: Arguably, the primary source of the big data that we know of today is the social networks that have proliferated over the past 5-10 years. This is by and large unstructured data, represented by millions of social media postings and other data generated on a second-by-second basis through user interactions on the web across the world. Increasing access to the internet across the world has been a self-reinforcing driver of the growth of data in social networks.
Media: Largely a result of the growth of social networks, media represents the millions, if not billions, of audio and visual uploads that take place on a daily basis. Videos uploaded on YouTube, music recordings on SoundCloud, and pictures posted on Instagram are prime examples of media, whose volume continues to grow in an unrestrained manner.
Data warehouses: Companies have long invested in specialized data storage facilities commonly known as data warehouses. A data warehouse (DW) is essentially a collection of historical data that a company wishes to maintain and catalog for easy retrieval, whether for internal use or regulatory purposes. As industries gradually shift toward the practice of storing data in platforms such as Hadoop and NoSQL, more and more companies are moving data from their pre-existing data warehouses to some of the newer technologies. Company emails, accounting records, databases, and internal documents are some examples of DW data that is now being offloaded onto Hadoop or Hadoop-like platforms that leverage multiple nodes to provide a highly available and fault-tolerant platform.
Sensors: A more recent phenomenon in the space of big data has been the collection of data from sensor devices. Sensors have always existed, and industries such as oil and gas have been using drilling sensors for measurements at oil rigs for many decades. However, the advent of wearable devices such as Fitbit and Apple Watch, part of what is known as the Internet of Things (IoT), has meant that each individual can now stream data at the same rate as a few oil rigs did just 10 years ago.
Wearable devices can collect hundreds of measurements from an individual at any given point in time. While not yet a big data problem as such, as the industry keeps evolving, sensor-related data is likely to become more akin to the kind of spontaneous data that is generated on the web through social network activities.
The topic of the 4Vs has become overused in the context of big data and has started to lose some of its initial charm. Nevertheless, it helps to bear in mind what these Vs indicate, so as to have the background context needed to carry on a conversation.
Broadly, the 4Vs indicate the following:
Volume: The amount of data that is being generated
Variety: The different types of data, such as textual, media, and sensor or streaming data
Velocity: The speed at which data is being generated, such as millions of messages being exchanged at any given time across social networks
Veracity: This is a more recent addition to the original 3Vs and indicates the noise inherent in data, such as inconsistencies in recorded information that require additional validation
Finally, big data analytics refers to the practice of putting the data to work, in other words, the process of extracting useful information from large volumes of data through the use of appropriate technologies. There is no exact definition for many of the terms used to denote different types of analytics, as they can be interpreted in different ways and their meaning can hence be subjective.
Nevertheless, some are provided here to act as references or starting points to help you in forming an initial impression:
Data mining: Data mining refers to the process of extracting information from datasets by running queries or basic summarization methods such as aggregations. Finding the top 10 products by number of sales from a dataset containing all the sales records of one million products on a website would be an example of mining, that is, extracting useful information from a dataset. NoSQL databases such as Cassandra, Redis, and MongoDB are prime examples of tools that have strong data mining capabilities.
Business intelligence: Business intelligence refers to tools such as Tableau, Spotfire, QlikView, and others that provide frontend dashboards to enable users to query data using a graphical interface. Dashboard products have gained in prominence in step with the growth of data as users seek to extract information. Easy-to-use interfaces with querying and visualization features that could be used universally by both technical and non-technical users set the groundwork to democratize analytical access to data.
Visualization