Solve all big data problems by learning how to create efficient data models
Key Features
Book Description
Modeling and managing data is a central focus of all big data projects. In fact, a database is considered to be effective only if you have a logical and sophisticated data model. This book will help you develop practical skills in modeling your own big data projects and improve the performance of analytical queries for your specific business requirements.
To start with, you'll get a quick introduction to big data and understand the different data modeling and data management platforms for big data. Then you'll work with structured and semi-structured data with the help of real-life examples. Once you've got to grips with the basics, you'll use SQL Developer Data Modeler to create your own data models for different file types such as CSV, XML, and JSON. You'll also learn to create graph data models and explore data modeling with streaming data using real-world datasets.
By the end of this book, you'll be able to design and develop efficient data models for data of varying sizes with ease.
What you will learn
Who this book is for
This book is for data modelers, data architects, ETL developers, business intelligence professionals, and anyone who wants to design sophisticated and powerful database models. Basic programming skills in Python, R, or any other programming language will be beneficial.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Namrata Patil
Content Development Editor: Ishita Vora
Technical Editor: Snehal Dalmet
Copy Editor: Safis Editing
Project Coordinator: Namrata Swetta
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Jisha Chirayil
Production Coordinator: Deepika Naik
First published: November 2018
Production reference: 1301118
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78862-090-1
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
James Lee is a passionate software wizard working at one of the top Silicon Valley-based start-ups specializing in big data analysis. In the past, he has worked at big companies such as Google and Amazon. In his day job, he works with big data technologies, including Cassandra and Elasticsearch, and is an absolute Docker technology geek and IntelliJ IDEA lover with a strong focus on efficiency and simplicity. Apart from his career as a software engineer, he is keen on sharing his knowledge with others and guiding them, especially in relation to start-ups and programming. He has been teaching courses and conducting workshops on Java programming / IntelliJ IDEA since he was 21. James holds an MS degree in computer science from McGill University and has many years' experience as a teaching assistant in a variety of computer science classes. He also enjoys skiing and swimming, and is a passionate traveler.
Tao Wei is a passionate software engineer who works in a leading Silicon Valley-based big data analysis company. Previously, Tao worked in big IT companies, such as IBM and Cisco. He has intensive experience in designing and building distributed, large-scale systems with proven high availability and reliability. Tao has an MS degree in computer science from McGill University and many years of experience as a teaching assistant in various computer science classes. When not working, he enjoys reading and swimming, and is a passionate photographer.
Suresh Kumar Mukhiya is a PhD candidate currently associated with Western Norway University of Applied Sciences (HVL). He is also a web application developer and big data enthusiast specializing in information systems, model-driven software engineering, big data analysis, and artificial intelligence. He completed a master's degree in information systems at the Norwegian University of Science and Technology, with a thesis on process mining. He also holds a bachelor's degree in computer science and information technology (BSc.CSIT).
David Y Aiello is an experienced DevOps engineer, having spent almost a decade implementing continuous integration and continuous delivery along with other software and system engineering projects. He has a background in Java, Python, Ruby, and C++, and has worked with AWS (including EC2, S3, and VPC), Azure DevOps, and GCP. He is currently working toward a degree in computer science and mathematics, with a focus on bioinformatics, and looks forward to furthering his career in the industry. Specifically, he is looking to become more involved in the fields of artificial intelligence, deep (machine) learning, and neural networks, and what they represent in terms of the future of computer science.
Devora Aiello forges her own path, like many other native New Yorkers. Since 2014, she has been studying physics, math, and software engineering, while simultaneously traveling abroad and immersing herself in other cultures. A coding enthusiast, Ms. Aiello has designed and implemented websites using Angular, HTML/CSS, and Bootstrap. She enjoys the challenges inherent in collaborating with users to create appealing and full-featured websites and keeps abreast of the latest tools and trends in IT. Additionally, Ms. Aiello is an active participant and assistant editor for the IEEE P2675 Working Group, which is developing the industry standard for DevOps.
Zaur Fataliyev is a machine learning engineer, currently residing and working in Seoul, South Korea. He received his BSc. in electrical engineering from KAIST, and an MSc. in computer engineering from Korea University, focusing on machine learning and computer vision. He has worked on various ML and CV problems over the years. He is interested in solving problems and making sense of data with learning algorithms. He loves open research, collaboration, contributing, and competitions. He also regularly participates in Kaggle competitions.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Hands-On Big Data Modeling
About Packt
Why subscribe?
Packt.com
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to Big Data and Data Management
The concept of big data 
Interesting insights regarding big data
Characteristics of big data
Sources and types of big data
Challenges of big data
Introduction to big data modeling
Uses of models
Introduction to managing big data
Importance and implications of big data modeling and management
Benefits of big data management
Challenges in big data management 
Setting up big data modeling platforms
Getting started on Windows
Getting started on macOS
Summary
Further reading
Data Modeling and Management Platforms
Big data management
Data ingestion
Data storage
Data quality
Data operations
Data scalability and security
Big data management services
Data cleansing
Data integration
Big data management vendors
Big data storage and data models
Storage models
Block-based storage
File-based storage 
Object-based storage
Data models
Relational stores (SQLs)
Scalable relational systems
Database as a Service (DaaS)
NoSQL stores
Document stores
Key-value stores
Extensible-record stores
Big data programming models
MapReduce
MapReduce functionality
Hadoop
Features of Hadoop frameworks
Yet Another Resource Negotiator 
Functional programming
Spark
Reasons to choose Apache Spark
Flink
Advantages of Flink
SQL data models
Hive Query Language (HQL)
Cassandra Query Language (CQL)
Spark SQL
Apache Drill
Getting started with Python and R
Python on macOS
Python on Windows
R on macOS
R on Windows
Summary
Further reading
Defining Data Models
Data model structures
Structured data
Unstructured data
Sources of unstructured data
Comparing structured and unstructured data
Data operations
Subsetting
Union
Projection
Join
Data constraints
Types of constraints
Value constraints
Uniqueness constraints
Cardinality constraints
Type constraints
Domain constraints
Structural constraints
A unified approach to big data modeling and data management
Summary
Further reading
Categorizing Data Models
Levels of data modeling
Conceptual data modeling
Logical data modeling
Benefits of constructing LDMs
Physical data modeling
Features of the physical data model
Types of data model
Hierarchical database models
Relational models
Advantages of the relational data model
Network models
Object-oriented database model
Entity-relationship models
Object-relational models
Summary
Further reading
Structures of Data Models
Semi-structured data models
Exploring the semi-structured data model of JSON data
Installing Python and the Tweepy library
Getting authorization credentials to access the Twitter API
VSM with Lucene
Lucene
Graph-data models
Graph-data models with Gephi
Summary 
Further reading
Modeling Structured Data
Getting started with structured data
NumPy
Operations using NumPy
Pandas
Matplotlib
Seaborn
IPython
Modeling structured data using Python
Visualizing the location of houses based on latitude and longitude
Factors that affect the price of houses
Visualizing more than one parameter
Gradient-boosting regression
Summary
Further reading
Modeling with Unstructured Data
Getting started with unstructured data
Tools for intelligent analysis
New methods of data processing
Tools for analyzing unstructured data
Weka
KNIME
Characteristics of KNIME
The R language
Unstructured text analysis using R
Data ingestion
Data cleaning and transformations
Data visualization
Improving the model
Summary
Further reading
Modeling with Streaming Data
Data stream and data model versus data format
Why is streaming data different?
Use cases of stream processing
What is a data stream?
Data streaming systems
How streaming works
Data harvesting
Data processing
Data analytics
Importance and implications of streaming data
Needs for stream processing
Challenges with streaming data
Streaming data solutions
Exploring streaming sensor data from the Twitter API
Analyzing the streaming data
Summary
Further reading
Streaming Sensor Data
Sensor data
Data lakes
Differences between data lakes and data warehouses
How a data lake works
Exploring streaming sensor data from a weather station
Summary
Further study
Concept and Approaches of Big Data Management
Non-DBMS-based approach to big data
Filesystems
Problems with processing files
DBMS-based approach to big data
Advantages of the DBMS
Declarative Query Language (DQL)
Data independence
Controlling data redundancy
Centralized data management and concurrent access
Data integrity
Data availability
Efficient access through optimization
Parallel and distributed DBMS
Parallel DBMS
Motivations for parallel DBMS
Architectures for parallel databases
Distributed DBMS
Features of a distributed DBMS
Merits of a distributed DBMS
DBMS and MapReduce-style systems
Summary
Further reading
DBMS to BDMS
Characteristics of BDMS
BASE properties
Exploring data management with Redis
Getting started with Redis on macOS
Advanced key-value stores
Redis and Hadoop
Aerospike
Aerospike technology
AsterixDB
Data models
The Asterix query language
Getting started with AsterixDB
Unstructured data in AsterixDB
Inserting into datasets
Querying in AsterixDB
Summary
Further reading
Modeling Bitcoin Data Points with Python
Introduction to Bitcoin data
Theory
Importing Bitcoin data into IPython
Importing required libraries
Preprocessing and model creation
Predicting Bitcoin price using Recurrent Neural Network
Importing packages
Importing datasets
Preprocessing
Constructing the RNN model
Prediction
Summary
Further reading
Modeling Twitter Feeds Using Python
Importing Twitter feed data
Modeling Twitter feeds
The frequency of the tweets
Sentiment analysis
Installing TextBlob
Parts of speech
Noun-phrase extraction
Tokenization
Bag of words
Summary
Further reading
Modeling Weather Data Points with Python
Introduction to weather data
Importing data
Forecasting Nepal's temperature change
Modeling with data
Persistence model forecast
Weather statistics by country
Linear regression to predict the temperature of a city
Summary
Further reading
Modeling IMDb Data Points with Python
Introduction to IMDb data
Episode data
Rating data
Theory 
Modeling with the IMDb dataset
Starting the platform
Importing the required libraries
Importing a file
Data cleansing
Clustering
Summary
Further reading
Other Books You May Enjoy
Leave a review - let other readers know what you think
The Hands-On Big Data Modeling series explores the methodology required to model big data using open source platforms in real-world contexts. The rapid growth of big data and people's interest in extracting business intelligence from data have created an opportunity to explore various technologies and methods that can be applied to modeling, mining, and analytics. In this book, we are going to use open source tools such as Python, R, Gephi, Lucene, and Weka to explore how big data modeling can be facilitated. The main objectives of this book are as follows:
To understand the concept of big data, the sources of big data, and the importance and implications of big data and big data management
To understand state-of-the-art big data modeling, the importance of big data modeling, big data applications, and programming platforms for big data analysis
To encourage a range of discussion of concepts, from Database Management Systems (DBMSes) to Big Data Management Systems (BDMSes)
To facilitate the planning, analysis, and construction of data models through an actual database for small to enterprise-level database environments
To understand the concept of unified data models for structured, semi-structured, and unstructured data, including finding classes, adding attributes, and simplifying the data structures, followed by advanced data modeling techniques and performance scaling of models
To facilitate working with streaming data with the help of examples on Twitter feeds and weather data points
To understand how we can model open-access data, such as Bitcoin, IMDb, Twitter, and weather data, using Python
Hands-On Big Data Modeling is for you if you are a data modeler, data architect, ETL developer, business intelligence professional, or anyone who wants to design sophisticated and powerful database models. Basic programming skills using Python, R, or any other programming language will be beneficial.
Chapter 1, Introduction to Big Data and Data Management, covers the concept of big data, its sources, and its types. In addition to this, the chapter focuses on providing a theoretical foundation in data modeling and data management, including data ingestion, data storage, data quality, data operations, data scalability, and security, as well as the importance and implications of big data modeling and data management. Readers will get their hands dirty with real big data and its sources.
Chapter 2, Data Modeling and Management Platforms, provides an in-depth theoretical background to data modeling and data management. Users will learn about big data applications, state-of-the-art modeling techniques, and programming platforms for big data analysis involving use case examples. Readers will be using big data from various sources to perform data ingestion, storage, data quality, and various data operations. This chapter also focuses on various real big data applications and big data programming models. In addition to this, it discusses various programming platforms used for big data analysis, including Python, R, Scala, and many more.
Chapter 3, Defining Data Models, walks users through various structures of data, including structured, semi-structured, and unstructured data, and how to apply modeling techniques to them. In addition, users will become familiar with various operations on data models and various data model constraints. Moreover, the chapter gives a brief introduction to a unified approach to data modeling and data management. Hands-on exercises concerning structured Comma-Separated Value (CSV) data will help users to get a better insight into these terms and processes.
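As a brief preview of the kind of hands-on CSV exercise this chapter works toward, here is a minimal pandas sketch of the data operations covered there: subsetting, union, projection, and join. The column names and sample rows are illustrative only, not taken from the book's datasets:

```python
import pandas as pd
from io import StringIO

# Illustrative CSV content; the columns and rows are made up for this sketch,
# not taken from the book's datasets
csv_data = StringIO("""id,city,population
1,Oslo,700000
2,Bergen,280000
3,Kathmandu,1000000""")

df = pd.read_csv(csv_data)

# Subsetting: select the rows that match a condition
large = df[df["population"] > 500000]

# Projection: keep only selected columns
names = df[["id", "city"]]

# Union: stack two row sets that share the same columns
more = pd.DataFrame({"id": [4], "city": ["Lima"], "population": [9000000]})
union = pd.concat([df, more], ignore_index=True)

# Join: combine with a second table on a shared key
regions = pd.DataFrame({"id": [1, 2, 3], "region": ["Europe", "Europe", "Asia"]})
joined = df.merge(regions, on="id")
print(joined)
```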
Chapter 4, Categorizing Data Models, focuses on providing both theoretical and practical guidelines regarding different types of data models, including a conceptual data model, a logical data model, a physical data model, a traditional data model, and a big data model. In addition to this, users will get to know different real-life examples of these models and how they differ from the big data model.
Chapter 5, Structures of Data Models, continues to shed light on big data modeling through specific approaches, including vector space models, graph data models, and more. Users will become acquainted with different data model structures through hands-on exercises, including exploring graph data models with Gephi and utilizing the semi-structured data model of JSON files.
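As a small taste of the semi-structured JSON work in that chapter, the sketch below parses a tweet-like record into nested Python structures. The field names here are hypothetical, not the actual Twitter API payload:

```python
import json

# A hypothetical, simplified tweet-like record (real Twitter API payloads differ)
raw = '''{"user": {"name": "alice", "followers": 42},
          "text": "hello big data",
          "hashtags": ["bigdata", "modeling"]}'''

tweet = json.loads(raw)       # JSON objects become dicts, arrays become lists
print(tweet["user"]["name"])  # navigate the nested structure by key
print(tweet["hashtags"])
```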
Chapter 6, Modeling Structured Data, provides real-life examples of structured data found in everyday business through to the enterprise level, and how modeling can be applied to this data. Users will get their hands dirty using Python or the R programming language.
Chapter 7, Modeling with Unstructured Data, provides real-life examples of unstructured data found in everyday business through to the enterprise level, and how modeling can be applied to this data. Users will get their hands dirty using Python or the R programming language.
Chapter 8, Modeling with Streaming Data, provides users with the opportunity to explore data models and data formats, the concept of data streaming and why streaming data is different, as well as the importance and implications of streaming data.
Chapter 9, Streaming Sensor Data, provides users with the opportunity to acquire practical hands-on experience working with different forms of streaming data, including weather data and Twitter feeds.
Chapter 10, Concept and Approaches of Big Data Management, explores various DBMS- and non-DBMS-based approaches to big data. It also focuses on the advantages of using a DBMS over the traditional filesystem, parallel and distributed DBMSes, and DBMS versus MapReduce-style systems.
Chapter 11, DBMS to BDMS, introduces users to some of the applications available to help with big data management and provides insights into how and when they might be appropriate for the big data management challenges we face.
Chapter 12, Modeling Bitcoin Data Points with Python, covers the different types of models that can be constructed from Bitcoin data. We will use the resulting models to try to predict the price of Bitcoin. In addition to this, we will learn how to use IPython in detail and experiment with Python libraries, including pandas and NumPy.
Chapter 13, Modeling Twitter Feeds Using Python, uses Twitter feeds as big data and utilizes them in Python to produce models based on the tips and tricks learned throughout the book. The chapter consumes the data in raw format, transforms it into the correct format, models it using Python, and interprets the resulting model.
Chapter 14, Modeling Weather Data Points with Python, uses weather data points as big data and utilizes them in Python to produce models based on the tips and tricks learned throughout the book. The chapter consumes the data in raw format, transforms it into the correct format, models it using Python, and interprets the resulting model.
Chapter 15, Modeling IMDb Data Points with Python, uses IMDb data points as big data and utilizes them in Python to produce models based on the tips and tricks learned throughout the book. The chapter consumes the data in raw format, transforms it into the correct format, models it using Python, and interprets the resulting model.
To get the most out of this book, we assume that readers have the following prerequisite knowledge:
An understanding of DBMSes, data modeling, and UML
An understanding of requirement analysis and conceptual data modeling
An understanding of the concepts of data warehousing, data mining, and data mining tools
An understanding of the basic concepts of data management, data storage, data retrieval, and data processing
Basic programming skills in Python, R, or any other programming language
We also expect readers to follow the resources highlighted as further reading at the end of each chapter. In addition, the code shared on GitHub is not the only solution; there can be multiple ways of modeling big data. What we present in this book is just one of these ways, involving open source technologies.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Big-Data-Modeling. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781788620901_ColorImages.pdf.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
This chapter addresses the concept of big data, its sources, and its types. In addition to this, the chapter focuses on providing a theoretical foundation in data modeling and data management. Readers will get their hands dirty setting up a platform on which big data can be utilized. The major topics discussed in this chapter are summarized as follows:
Discover the concept of big data and its origins
Learn about the various characteristics of big data
Discuss and explore various challenges in big data mining
Get familiar with big data modeling and its uses
Understand what big data management is and its importance and implications
Set up a big data platform on a local machine
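Before the platform-specific setup sections later in this chapter, the following minimal sketch (our own suggestion, not a step prescribed by the book) can confirm that a Python 3 environment with the libraries used throughout the book is ready:

```python
# Quick environment check (assumes Python 3; install any missing libraries
# with pip, e.g. pip install numpy pandas matplotlib)
import sys

print("Python", sys.version.split()[0])

for lib in ("numpy", "pandas", "matplotlib"):
    try:
        __import__(lib)
        print(lib, "is available")
    except ImportError:
        print(lib, "is missing")
```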
Digital systems are progressively intertwined with real-world activities. As a consequence, vast amounts of data are recorded and reported by information systems. Over the last 50 years, the growth in information systems and their capabilities to capture, curate, store, share, transfer, analyze, and visualize data has been exponential. Alongside these incredible technological advances, people and organizations depend more and more on computerized devices and information sources on the internet. The IDC Digital Universe Study of May 2010 illustrates this spectacular growth. The study estimated that the amount of digital information stored (on personal computers, digital cameras, servers, and sensors) already exceeded 1 zettabyte, and predicted that the digital universe would grow to 35 zettabytes by 2020. The IDC study characterizes 35 zettabytes as a stack of DVDs reaching halfway to Mars. This is what we refer to as the data explosion.
Most of the data stored in the digital universe is unstructured, and organizations face challenges in capturing, curating, and analyzing it. One of the most challenging tasks for today's organizations is to extract information and value from the data stored in their information systems. This data, which is highly complex and too voluminous to be handled by a traditional DBMS, is called big data.
Whether it is day-to-day data, business data, or any other kind of data, if it represents a massive volume, structured or unstructured, it is relevant to the organization. However, it is not only the dimensions of the data that matter; what counts is how the organization uses it to extract the deeper insights that drive better business and strategic decisions. This voluminous data can be used to improve the quality of research, enhance process flow in an organization, prevent a particular disease, link legal citations, or combat crime. Big data is everywhere, and with the right tools it can be harnessed for business analytics.
Some interesting facts related to big data and its management and analysis are presented here; others appear in the Further reading section. These facts are taken from the sources listed there.
Almost 91% of the world's marketing leaders use customer data to make business decisions.
Interestingly, 90% of the world's total data has been generated within the last two years.
87% of people agree that recording and distributing the right data is important for effectively measuring Return on Investment (ROI) in their own company.
86% of people are willing to pay more for a great customer experience with a brand.
75% of companies claim they will expand investments in big data within the next year.
About 70% of big data is created by individuals, but enterprises are responsible for storing and managing 80% of it.
70% of businesses acknowledge that their marketing efforts are under greater scrutiny.