Data Lake Development with Big Data

Pradeep Pasupuleti
Description

A Data Lake is a highly scalable platform for storing huge volumes of multistructured data from disparate sources with centralized data management services. This book explores the potential of Data Lakes and the architectural approaches to building them so that they ingest, index, manage, and analyze massive amounts of data using batch and real-time processing frameworks. It guides you through building a Data Lake that is managed by Hadoop and accessed as required by other Big Data applications.
This book will guide readers, using best practices, in developing a Data Lake's capabilities. It focuses on architecting data governance, security, data quality, data lineage tracking, metadata management, and semantic data tagging. By the end of this book, you will have a good understanding of how to build a Data Lake for Big Data.

You can read this e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 229

Year of publication: 2015




Table of Contents

Data Lake Development with Big Data
Credits
About the Authors
Acknowledgement
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. The Need for Data Lake
Before the Data Lake
Need for Data Lake
Defining Data Lake
Key benefits of Data Lake
Challenges in implementing a Data Lake
When to go for a Data Lake implementation
Data Lake architecture
Architectural considerations
Architectural composition
Architectural details
Understanding Data Lake layers
The Data Governance and Security Layer
The Information Lifecycle Management layer
The Metadata Layer
Understanding Data Lake tiers
The Data Intake tier
The Source System Zone
The Transient Zone
The Raw Zone
Batch Raw Storage
The Real-time Raw Storage
The Data Management tier
The Integration Zone
The Enrichment Zone
The Data Hub Zone
The Data Consumption tier
The Data Discovery Zone
The Data Provisioning Zone
Summary
2. Data Intake
Understanding Intake tier zones
Source System Zone functionalities
Understanding connectivity processing
Understanding Intake Processing for data variety
Structured data
The need for integrating Structured Data in the Data Lake
Structured data loading approaches
Semi-structured data
The need for integrating semi-structured data in the Data Lake
Semi-structured data loading approaches
Unstructured data
The need for integrating Unstructured data in the Data Lake
Unstructured data loading approaches
Transient Landing Zone functionalities
File validation checks
File duplication checks
File integrity checks
File size checks
File periodicity checks
Data Integrity checks
Checking record counts
Checking for column counts
Schema validation checks
Raw Storage Zone functionalities
Data lineage processes
Watermarking process
Metadata capture
Deep Integrity checks
Bit Level Integrity checks
Periodic checksum checks
Security and governance
Information Lifecycle Management
Practical Data Ingestion scenarios
Architectural guidance
Structured data use cases
Semi-structured and unstructured data use cases
Big Data tools and technologies
Ingestion of structured data
Sqoop
Use case scenarios for Sqoop
WebHDFS
Use case scenarios for WebHDFS
Ingestion of streaming data
Apache Flume
Use case scenarios for Flume
Fluentd
Use case scenarios for Fluentd
Kafka
Use case scenarios for Kafka
Amazon Kinesis
Use case scenarios for Kinesis
Apache Storm
Use case scenarios for Storm
Summary
3. Data Integration, Quality, and Enrichment
Introduction to the Data Management Tier
Understanding Data Integration
Introduction to Data Integration
Prominent features of Data Integration
Loosely coupled Integration
Ease of use
Secure access
High-quality data
Lineage tracking
Practical Data Integration scenarios
The workings of Data Integration
Raw data discovery
Data quality assessment
Profiling the data
Data cleansing
Deletion of missing, null, or invalid values
Imputation of missing, null, or invalid values
Data transformations
Unstructured text transformation techniques
Structured data transformations
Data enrichment
Collect metadata and track data lineage
Traditional Data Integration versus Data Lake
Data pipelines
Addressing the limitations using Data Lake
Data partitioning
Addressing the limitations using Data Lake
Scale on demand
Addressing the limitations using Data Lake
Data ingest parallelism
Addressing the limitations using Data Lake
Extensibility
Addressing the limitations using Data Lake
Big Data tools and technologies
Syncsort
Use case scenarios for Syncsort
Talend
Use case scenarios for Talend
Pentaho
Use case scenarios for Pentaho
Summary
4. Data Discovery and Consumption
Understanding the Data Consumption tier
Data Consumption – Traditional versus Data Lake
An introduction to Data Consumption
Practical Data Consumption scenarios
Data Discovery and metadata
Enabling Data Discovery
Data classification
Classifying unstructured data
Named entity recognition
Topic modeling
Text clustering
Applications of data classification
Relation extraction
Extracting relationships from unstructured data
Feature-based methods
Understanding how feature-based methods work
Implementation
Semantic technologies
Understanding how semantic technologies work
Implementation
Extracting Relationships from structured data
Applications of relation extraction
Indexing data
Inverted index
Understanding how inverted index works
Implementation
Applications of Indexing
Performing Data Discovery
Semantic search
Word sense disambiguation
Latent Semantic Analysis
Faceted search
Fuzzy search
Edit distance
Wildcard and regular expressions
Data Provisioning and metadata
Data publication
Data subscription
Data Provisioning functionalities
Data formatting
Data selection
Data Provisioning approaches
Post-provisioning processes
Architectural guidance
Data Discovery
Big Data tools and technologies
Elasticsearch
Use case scenarios for Elasticsearch
IBM InfoSphere Data Explorer
Use case scenarios for IBM InfoSphere Data Explorer
Tableau
Use case scenarios for Tableau
Splunk
Use case scenarios for Splunk
Data Provisioning
Big Data tools and technologies
Data Dispatch
Use case scenarios for Data Dispatch
Summary
5. Data Governance
Understanding Data Governance
Introduction to Data Governance
The need for Data Governance
Governing Big Data in the Data Lake
Data Governance – Traditional versus Data Lake
Practical Data Governance scenarios
Data Governance components
Metadata management and lineage tracking
Data security and privacy
Big Data implications for security and privacy
Security issues in the Data Lake tiers
The Intake Tier
The Management Tier
The Consumption Tier
Information Lifecycle Management
Big Data implications for ILM
Implementing ILM using Data Lake
The Intake Tier
The Management Tier
The Consumption Tier
Architectural guidance
Big Data tools and technologies
Apache Falcon
Understanding how Falcon works
Use case scenarios for Falcon
Apache Atlas
Understanding how Atlas works
Use case scenarios for Atlas
IBM Big Data platform
Understanding how governance is provided in IBM Big Data platform
Use case scenarios for IBM Big Data platform
The current and future trends
Data Lake and future enterprise trajectories
Future Data Lake technologies
Summary
Index

Data Lake Development with Big Data

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: November 2015

Production reference: 1241115

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78588-808-3

www.packtpub.com

Credits

Authors

Pradeep Pasupuleti

Beulah Salome Purra

Reviewer

Dr. Kornel Amadeusz Skałkowski

Commissioning Editor

Priya Singh

Acquisition Editor

Ruchita Bhansali

Content Development Editor

Rohit Kumar Singh

Technical Editor

Saurabh Malhotra

Copy Editor

Trishya Hajare

Project Coordinator

Izzat Contractor

Proofreader

Safis Editing

Indexer

Hemangini Bari

Graphics

Jason Monteiro

Kirk D'Penha

Production Coordinator

Shantanu N. Zagade

Cover Work

Shantanu N. Zagade

About the Authors

Pradeep Pasupuleti has 18 years of experience in architecting and developing distributed and real-time data-driven systems. He constantly explores ways to use the power and promise of advanced analytics-driven platforms to solve the problems of the common man. He founded Datatma, a consulting firm, with a mission to humanize Big Data analytics, putting it to use to solve simple problems that serve a higher purpose.

He architected robust Big Data-enabled automated learning engines that enterprises regularly use in production in order to save time, money, and the lives of humans.

He built solid interdisciplinary data science teams that bridged the gap between theory and practice, thus creating compelling data products. His primary focus is always to ensure his customers are delighted, by helping to address their business problems through data products that use Big Data technologies and algorithms. He has consistently demonstrated thought leadership by solving high-dimensional data problems and achieving phenomenal results.

He has held strategic leadership roles in technology consulting, advising Fortune 100 companies on Big Data strategy and creating Big Data Centers of Excellence.

He has worked on use cases such as enterprise Data Lakes, fraud detection, patient readmission prediction, student performance prediction, claims optimization, sentiment mining, cloud infrastructure SLA violation prediction, data leakage prevention, and mainframe-offloaded ETL on Hadoop.

In his book Pig Design Patterns, published by Packt Publishing, he compiled his learnings and experiences from the challenges involved in building Hadoop-driven data products, covering data ingestion, data cleansing and validation, data transformation, dimensionality reduction, and many other interesting Big Data war stories.

Outside office hours, he enjoys running marathons, exploring archeological sites, finding patterns in unrelated data sources, and mentoring start-ups and budding researchers.

He can be reached at <[email protected]> and https://in.linkedin.com/in/pradeeppasupuleti.

Acknowledgement

This book is dedicated to the loving memory of my mother, Smt. Sumathy; without her never-failing encouragement and everlasting love I would have never been half as good.

First and foremost, I have to thank my father, Sri. Prabhakar Pasupuleti, who never ceases to be a constant source of inspiration, a ray of hope, humility and strength, and whose support and guidance have given me the courage to chase my dreams.

I should also express my deep sense of gratitude to each of my family members, Sushma, Sresht, and Samvruth, who stood by me at every moment through very tough times and enabled me to complete this book.

I would like to sincerely thank all my teachers who were instrumental in shaping me. Among them, I would like to thank Usha Madam, Vittal Rao Sir, Gopal Krishna Sir, and Brindavan Sir for their stellar role in improving me.

I would also like to thank all my friends for their understanding in many ways. Their friendship makes my life a wonderful experience. I cannot list all the names here, but you are always on my mind.

Special thanks to the team at Packt for their contribution to this book.

Finally, I would like to thank my team, Salome et al., who have placed immense faith in the power of Big Data analytics and built cutting-edge data products.

Thank you, Lord, for always being there for me.

Beulah Salome Purra has over 11 years of experience and specializes in building highly scalable distributed systems. She has worked extensively on architecting multiple large-scale Big Data solutions for Fortune 100 companies, and her core expertise lies in Big Data analytics. In her current role at ATMECS, her focus is on building robust and scalable data products that extract value from huge data assets.

She can be reached at https://www.linkedin.com/in/beulahsalomep.

I am grateful to my parents, Rathnam and Padma, who have constantly encouraged and supported me throughout. I would like to thank my husband, Pratap, for his help on this book, his patience, love, and support; my brothers, Joel and Michael, for all their support.

I would like to profusely thank Pradeep Pasupuleti for mentoring me; working with him has been an enriching experience. I can't thank him enough for his constant encouragement, guidance, support, and for providing me an opportunity to work with him on this book.

Special thanks to David Hawke, Sanjay Singh, and Ravi Velagapudi—the leadership team at ATMECS—for their encouragement and support while I was writing this book.

Thanks to the editors and reviewers at Packt for all their effort in making this book better.

About the Reviewer

Dr. Kornel Amadeusz Skałkowski has a solid academic and industrial background. For more than 5 years, he worked as an assistant at the AGH University of Science and Technology in Krakow. In 2015, he obtained his PhD on machine learning-based adaptation of SOA systems. He has cooperated with several companies on various projects concerning intelligent systems, machine learning, and Big Data. Currently, he works as a Big Data developer for SAP SE.

He is the co-author of 19 papers concerning software engineering, SOA systems, and machine learning. He also works as a reviewer for the American Journal of Software Engineering and Applications. He has participated in numerous European and national scientific projects. His research interests include machine learning, Big Data, and software engineering.

I would like to kindly thank my family, relatives, and friends, for their endless patience and support during the reviewing of this book.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Preface

The book Data Lake Development with Big Data is a practical guide to help you learn the essential architectural approaches to designing and building Data Lakes. It walks you through the various components of Data Lakes, such as data intake, management, consumption, and governance, with a specific focus on practical implementation scenarios.

A Data Lake is a highly scalable data platform for better search, analytical processing, and cheaper storage of huge volumes of multistructured data acquired from disparate sources.

Traditional Data Management systems are constrained by data silos, upfront data modeling, rigid data structures, and schema-on-write approaches to storing and processing data. This hampers the holistic analysis of data residing in multiple silos and excludes unstructured data sources from analysis. The data is generally modeled to answer known business questions.

With a Data Lake, there are no more data silos; all the data can be utilized to get a coherent view that can power a new generation of data-aware analytics applications. With a Data Lake, you don't have to know all the business questions in advance: the data can be modeled later using the schema-less (schema-on-read) approach, and it is possible to ask complex, far-reaching questions on all the data at any time to uncover hidden patterns and complex relationships in the data.
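
To make the schema-on-read idea concrete, here is a minimal PySpark sketch. It is our own illustration rather than an example from the book, and the path and field names are hypothetical; it loads raw JSON exactly as it landed in the lake and applies structure only at query time:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# No upfront modeling: Spark infers the schema when the data is read,
# not when it is written into the lake.
events = spark.read.json("hdfs:///datalake/raw/clickstream/")  # hypothetical path

# A question nobody anticipated at ingestion time can still be answered
(events
    .filter(F.col("event_type") == "purchase")   # hypothetical field names
    .groupBy("product_category")
    .count()
    .show())

In a schema-on-write warehouse, the same question would have required the table design, and therefore the question itself, to be known before the data was loaded.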

After reading this book, you will be able to address the shortcomings of traditional data systems through the best practices highlighted here for building a Data Lake. You will understand the complete lifecycle of architecting and building a Data Lake with Big Data technologies such as Hadoop, Storm, Spark, and Splunk. You will gain comprehensive knowledge of the various stages in a Data Lake, such as data intake, data management, and data consumption, with a focus on the practical use cases at each stage. You will benefit from the book's detailed coverage of data governance, data security, data lineage tracking, metadata management, data provisioning, and consumption.

As the Data Lake is such an advanced and complex topic, we are honored and excited to author the first book of its kind. At the same time, because the topic is so vast and there is no one-size-fits-all Data Lake architecture, it is very challenging to appeal to a wide audience. As this is a mini-series book with a limited page count, it is extremely difficult to cover every topic in detail. Given these constraints, we have taken a reader-centric approach in writing this book, because a broad understanding of the overall concept of a Data Lake is far more important than an in-depth understanding of all the technologies and architectural possibilities that go into building one.

Using this guiding principle, we refrained from in-depth coverage of any single topic, because we could not possibly do justice to it. At the same time, we organized the chapters to mimic the sequential flow of data in a typical organization, so that the reader can intuitively and quickly grasp the concepts of the Data Lake from an organizational data flow perspective. To make the abstract concepts relatable to the real world, we have followed a use case-based approach in which practical implementation scenarios of each key Data Lake component are explained. This, we believe, will help the reader quickly understand the architectural implications of the various Big Data technologies that are used for building these components.

What this book covers

Chapter 1, The Need for Data Lake, helps you understand what a Data Lake is, its architecture and key components, and the business contexts in which a Data Lake can be successfully deployed. You will also learn the limitations of traditional data architectures and how a Data Lake addresses some of these inadequacies and provides significant benefits.

Chapter 2, Data Intake, helps you understand the Intake Tier in detail, where we explore the process of bringing huge volumes of data into the Data Lake. You will learn the technology perspective of the various external data sources and the Hadoop-based data transfer mechanisms that pull or push data into the Data Lake.

Chapter 3, Data Integration, Quality, and Enrichment, explores the processes that are performed on vast quantities of data in the Management Tier. You will get a deeper understanding of the key technology aspects and components such as profiling, validation, integration, cleansing, standardization, and enrichment using Hadoop ecosystem components.

Chapter 4, Data Discovery and Consumption, helps you understand how data can be discovered, packaged, and provisioned for consumption by downstream systems. You will learn the key technology aspects, architectural guidance, and tools for data discovery and data provisioning functionalities.

Chapter 5, Data Governance, explores the details, need, and utility of data governance in a Data Lake environment. You will learn how metadata management, lineage tracking, and data lifecycle management govern the usability, security, integrity, and availability of the data through the data governance processes applied in the Data Lake. This chapter also explores how the current Data Lake can evolve in a futuristic setting.

What you need for this book

As this book covers only the architectural details and acts as a guide for decision-making, we have not provided any code examples. Hence, there is no explicit software prerequisite.

Who this book is for

Data Lake Development with Big Data is intended for architects and senior managers who are responsible for building a strategy around their current data architecture, helping them identify the need for Data Lake implementation in an organizational business context.

Good knowledge of master data management, information lifecycle management, data governance, data product design, data engineering, and systems architecture, as well as experience with Big Data technologies such as Hadoop, Spark, Splunk, and Storm, is necessary.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive."

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Clicking the Next button moves you to the next screen."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. The Need for Data Lake

In this chapter, we will understand the rationale behind building a Data Lake in an organization that has huge data assets. The following topics will be covered in this chapter:

Explore the emerging need for a Data Lake by understanding the limitations of traditional architectures
Decipher how a Data Lake addresses the inadequacies of traditional architectures and provides significant benefits in terms of time and cost
Understand what a Data Lake is, along with its architecture
Get practical guidance on the key points to consider before deciding to build a Data Lake
Understand the key components that could be a part of a Data Lake and comprehend how crucial each of these components is to building a successful Data Lake

Before the Data Lake

In this section, let us quickly look at how the Data Lake has evolved from a historical perspective.

From the time data-intensive applications were used to solve business problems, we have seen many evolutionary steps in the way data has been stored, managed, analyzed, and visualized.

The earlier systems were designed to answer questions about the past. Questions such as "What were my total sales in the last year?" were answered by machines built around monolithic processors that ran COBOL, accessing data from tapes and disks. Since the dawn of faster processors and better storage, businesses have been able to slice and dice data to find fine-grained answers from subsets of data; these questions resembled: "What was the sales performance of x unit in y geography in z timeframe?"

If we extract one common pattern, it is that all the earlier systems were developed for business users, in order to help them make decisions for their businesses. The current breed of data systems empowers people like you and me to make decisions and improve the way we live. This is the ultimate paradigm shift, brought about by advances in myriad technologies.

For many of us, the technologies that run in the background are transparent, while we consult applications that help us make decisions that alter our immediate future profoundly. We use applications to help us navigate to an address (mapping), decide on our holidays (weather and holiday planning sites), get a summary of product reviews (review sites), get similar products (recommendation engines), connect and grow professionally (professional social networks), and the list goes on.

All these applications use enabling technologies that understand natural languages, process humongous amounts of data, store and effortlessly process our personal data such as images and audio, and even extract intelligence from them by tagging our faces and finding relationships. Each of us, in a way, contributes to the flooding of these application servers with our personal data in the form of our preferences, likes, affiliations, networks, hobbies, friends, images, and videos.

If we can attribute one fundamental cause for today's explosion of data, it should be the proliferation of ubiquitous internet connectivity and the smartphone; with them comes an exponential number of applications that transmit and store a variety of data.

Juxtaposing the growth of smartphones and the internet with the rapid decline of storage costs and the rise of cloud computing, which also brings down processing costs, we can immediately comprehend that the traditional data architectures do not scale to handle this volume and variety of data, and thus cannot answer the questions that you and I want answered. They work well, extremely well, for business users, but not directly for us.

In order to democratize the value hidden in data, and thus empower common customers to use data for day-to-day decision making, organizations should first store and extract value from the different types of data being collected in such huge quantities. For all this to happen, the following two key developments have had a revolutionary impact:

The development of distributed computing architectures that can scale linearly and perform computations at an unbelievable pace
The development of new-age algorithms that can analyze natural languages, comprehend the semantics of spoken words and special types, run Neural Nets, perform deep learning, graph social network interactions, perform constraint-based stochastic optimization, and so on

Earlier systems were simply not architected to scale linearly and store or analyze these many types of data. They are good for the purpose they were initially built for. They excelled as historical data stores that could offload structured data from Online Transaction Processing (OLTP) systems, perform transformations, cleanse it, slice, dice, and summarize it, and then feed it to Online Analytical Processing (OLAP) systems. Business Intelligence tools consume the exhaust of the OLAP systems and religiously spew out good-looking reports at regular intervals so that business users can make decisions.

We can immediately grasp the glaring differences between the earlier systems and the new-age systems by looking at these major aspects:

The storage and processing differ in the way they scale (distributed versus monolithic)
In earlier systems, data is managed in relational systems, versus NoSQL, MPP, and CEP systems in the new-age Big Data systems
Traditional systems cannot handle the high-velocity data that is efficiently ingested and processed by Big Data applications
Structured data is predominantly used in earlier systems, versus unstructured data being used in Big Data systems alongside structured data
Traditional systems have limitations on the scale of data that they can handle; Big Data systems are scalable and can handle humongous amounts of data
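
As a rough illustration of the last two differences, a single distributed engine can process structured records and unstructured text side by side. The following PySpark sketch is our own hypothetical example (the paths and column names are assumptions, not taken from the book):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mixed-data").getOrCreate()

# Structured data: CSV sales records, the kind a warehouse handles well
sales = (spark.read.option("header", True)
         .csv("hdfs:///datalake/raw/sales/")              # hypothetical path
         .withColumn("amount", F.col("amount").cast("double")))

# Unstructured data: free-text support tickets, stored in the same cluster
tickets = spark.read.text("hdfs:///datalake/raw/tickets/")  # hypothetical path

# Both computations scale out across nodes rather than up on one machine
sales.groupBy("region").agg(F.sum("amount").alias("total_sales")).show()
print(tickets.filter(F.col("value").contains("refund")).count())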