Solve all big data problems by learning how to create efficient data models
Key Features
Book Description
Modeling and managing data is a central focus of all big data projects. In fact, a database is considered to be effective only if you have a logical and sophisticated data model. This book will help you develop practical skills in modeling your own big data projects and improve the performance of analytical queries for your specific business requirements.
To start with, you'll get a quick introduction to big data and understand the different data modeling and data management platforms for big data. Then you'll work with structured and semi-structured data with the help of real-life examples. Once you've got to grips with the basics, you'll use SQL Developer Data Modeler to create your own data models for different file types such as CSV, XML, and JSON. You'll also learn to create graph data models and explore data modeling with streaming data using real-world datasets.
By the end of this book, you'll be able to design and develop efficient data models for data of varying sizes with ease.
What you will learn
Who this book is for
This book is for data modelers, data architects, ETL developers, business intelligence professionals, and anyone who wants to design sophisticated and powerful database models. Basic programming skills in Python, R, or any other programming language will be beneficial.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Namrata Patil
Content Development Editor: Ishita Vora
Technical Editor: Snehal Dalmet
Copy Editor: Safis Editing
Project Coordinator: Namrata Swetta
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Jisha Chirayil
Production Coordinator: Deepika Naik
First published: November 2018
Production reference: 1301118
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78862-090-1
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
James Lee is a passionate software wizard working at one of the top Silicon Valley-based start-ups specializing in big data analysis. In the past, he has worked at big companies such as Google and Amazon. In his day job, he works with big data technologies, including Cassandra and Elasticsearch, and is an absolute Docker technology geek and IntelliJ IDEA lover with a strong focus on efficiency and simplicity. Apart from his career as a software engineer, he is keen on sharing his knowledge with others and guiding them, especially in relation to start-ups and programming. He has been teaching courses and conducting workshops on Java programming / IntelliJ IDEA since he was 21. James holds an MS degree in computer science from McGill University and has many years' experience as a teaching assistant in a variety of computer science classes. He also enjoys skiing and swimming, and is a passionate traveler.
Tao Wei is a passionate software engineer who works in a leading Silicon Valley-based big data analysis company. Previously, Tao worked in big IT companies, such as IBM and Cisco. He has intensive experience in designing and building distributed, large-scale systems with proven high availability and reliability. Tao has an MS degree in computer science from McGill University and many years of experience as a teaching assistant in various computer science classes. When not working, he enjoys reading and swimming, and is a passionate photographer.
Suresh Kumar Mukhiya is a PhD candidate currently associated with Western Norway University of Applied Sciences (HVL). He is also a web application developer and big data enthusiast specializing in information systems, model-driven software engineering, big data analysis, and artificial intelligence. He completed a master's degree in information systems at the Norwegian University of Science and Technology, with a thesis on process mining. He also holds a bachelor's degree in computer science and information technology (BSc.CSIT).
David Y Aiello is an experienced DevOps engineer, having spent almost a decade implementing continuous integration and continuous delivery along with other software and system engineering projects. He has a background in Java, Python, Ruby, and C++, and has worked with AWS (including EC2, S3, and VPC), Azure DevOps, and GCP. He is currently working toward a degree in computer science and mathematics, with a focus on bioinformatics, and looks forward to furthering his career in the industry. Specifically, he is looking to become more involved in the fields of artificial intelligence, deep (machine) learning, and neural networks, and what they represent in terms of the future of computer science.
Devora Aiello forges her own path, like many other native New Yorkers. Since 2014, she has been studying physics, math, and software engineering, while simultaneously traveling abroad and immersing herself in other cultures. A coding enthusiast, Ms. Aiello has designed and implemented websites using Angular, HTML/CSS, and Bootstrap. She enjoys the challenges inherent in collaborating with users to create appealing and full-featured websites and keeps abreast of the latest tools and trends in IT. Additionally, Ms. Aiello is an active participant and assistant editor for the IEEE P2675 Working Group, which is developing the industry standard for DevOps.
Zaur Fataliyev is a machine learning engineer, currently residing and working in Seoul, South Korea. He received his BSc. in electrical engineering from KAIST, and an MSc. in computer engineering from Korea University, focusing on machine learning and computer vision. He has worked on various ML and CV problems over the years. He is interested in solving problems and making sense of data with learning algorithms. He loves open research, collaboration, contributing, and competitions. He also regularly participates in Kaggle competitions.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Hands-On Big Data Modeling
About Packt
Why subscribe?
Packt.com
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to Big Data and Data Management
The concept of big data 
Interesting insights regarding big data
Characteristics of big data
Sources and types of big data
Challenges of big data
Introduction to big data modeling
Uses of models
Introduction to managing big data
Importance and implications of big data modeling and management
Benefits of big data management
Challenges in big data management 
Setting up big data modeling platforms
Getting started on Windows
Getting started on macOS
Summary
Further reading
Data Modeling and Management Platforms
Big data management
Data ingestion
Data storage
Data quality
Data operations
Data scalability and security
Big data management services
Data cleansing
Data integration
Big data management vendors
Big data storage and data models
Storage models
Block-based storage
File-based storage 
Object-based storage
Data models
Relational stores (SQLs)
Scalable relational systems
Database as a Service (DaaS)
NoSQL stores
Document stores
Key-value stores
Extensible-record stores
Big data programming models
MapReduce
MapReduce functionality
Hadoop
Features of Hadoop frameworks
Yet Another Resource Negotiator 
Functional programming
Spark
Reasons to choose Apache Spark
Flink
Advantages of Flink
SQL data models
Hive Query Language (HQL)
Cassandra Query Language (CQL)
Spark SQL
Apache Drill
Getting started with Python and R
Python on macOS
Python on Windows
R on macOS
R on Windows
Summary
Further reading
Defining Data Models
Data model structures
Structured data
Unstructured data
Sources of unstructured data
Comparing structured and unstructured data
Data operations
Subsetting
Union
Projection
Join
Data constraints
Types of constraints
Value constraints
Uniqueness constraints
Cardinality constraints
Type constraints
Domain constraints
Structural constraints
A unified approach to big data modeling and data management
Summary
Further reading
Categorizing Data Models
Levels of data modeling
Conceptual data modeling
Logical data modeling
Benefits of constructing LDMs
Physical data modeling
Features of the physical data model
Types of data model
Hierarchical database models
Relational models
Advantages of the relational data model
Network models
Object-oriented database model
Entity-relationship models
Object-relational models
Summary
Further reading
Structures of Data Models
Semi-structured data models
Exploring the semi-structured data model of JSON data
Installing Python and the Tweepy library
Getting authorization credentials to access the Twitter API
VSM with Lucene
Lucene
Graph-data models
Graph-data models with Gephi
Summary 
Further reading
Modeling Structured Data
Getting started with structured data
NumPy
Operations using NumPy
Pandas
Matplotlib
Seaborn
IPython
Modeling structured data using Python
Visualizing the location of houses based on latitude and longitude
Factors that affect the price of houses
Visualizing more than one parameter
Gradient-boosting regression
Summary
Further reading
Modeling with Unstructured Data
Getting started with unstructured data
Tools for intelligent analysis
New methods of data processing
Tools for analyzing unstructured data
Weka
KNIME
Characteristics of KNIME
The R language
Unstructured text analysis using R
Data ingestion
Data cleaning and transformations
Data visualization
Improving the model
Summary
Further reading
Modeling with Streaming Data
Data stream and data model versus data format
Why is streaming data different?
Use cases of stream processing
What is a data stream?
Data streaming systems
How streaming works
Data harvesting
Data processing
Data analytics
Importance and implications of streaming data
Needs for stream processing
Challenges with streaming data
Streaming data solutions
Exploring streaming sensor data from the Twitter API
Analyzing the streaming data
Summary
Further reading
Streaming Sensor Data
Sensor data
Data lakes
Differences between data lakes and data warehouses
How a data lake works
Exploring streaming sensor data from a weather station
Summary
Further study
Concept and Approaches of Big Data Management
Non-DBMS-based approach to big data
Filesystems
Problems with processing files
DBMS-based approach to big data
Advantages of the DBMS
Declarative Query Language (DQL)
Data independence
Controlling data redundancy
Centralized data management and concurrent access
Data integrity
Data availability
Efficient access through optimization
Parallel and distributed DBMS
Parallel DBMS
Motivations for parallel DBMS
Architectures for parallel databases
Distributed DBMS
Features of a distributed DBMS
Merits of a distributed DBMS
DBMS and MapReduce-style systems
Summary
Further reading
DBMS to BDMS
Characteristics of BDMS
BASE properties
Exploring data management with Redis
Getting started with Redis on macOS
Advanced key-value stores
Redis and Hadoop
Aerospike
Aerospike technology
AsterixDB
Data models
The Asterix query language
Getting started with AsterixDB
Unstructured data in AsterixDB
Inserting into datasets
Querying in AsterixDB
Summary
Further reading
Modeling Bitcoin Data Points with Python
Introduction to Bitcoin data
Theory
Importing Bitcoin data into IPython
Importing required libraries
Preprocessing and model creation
Predicting Bitcoin price using Recurrent Neural Network
Importing packages
Importing datasets
Preprocessing
Constructing the RNN model
Prediction
Summary
Further reading
Modeling Twitter Feeds Using Python
Importing Twitter feed data
Modeling Twitter feeds
The frequency of the tweets
Sentiment analysis
Installing TextBlob
Parts of speech
Noun-phrase extraction
Tokenization
Bag of words
Summary
Further reading
Modeling Weather Data Points with Python
Introduction to weather data
Importing data
Forecasting Nepal's temperature change
Modeling with data
Persistence model forecast
Weather statistics by country
Linear regression to predict the temperature of a city
Summary
Further reading
Modeling IMDb Data Points with Python
Introduction to IMDb data
Episode data
Rating data
Theory 
Modeling with the IMDb dataset
Starting the platform
Importing the required libraries
Importing a file
Data cleansing
Clustering
Summary
Further reading
Other Books You May Enjoy
Leave a review - let other readers know what you think
The Hands-On Big Data Modeling series explores the methodology required to model big data using open source platforms in real-world contexts. The rapid growth of big data and people's interest in extracting business intelligence from data have created an opportunity to explore various technologies and methods that can be applied to modeling, mining, and analytics. In this book, we are going to use open source tools such as Python, R, Gephi, Lucene, and Weka to explore how big data modeling can be facilitated. The main objectives of this book are as follows:
To understand the concept of big data, the sources of big data, and the importance and implications of big data and big data management
To understand state-of-the-art big data modeling, the importance of big data modeling, big data applications, and programming platforms for big data analysis
To encourage a range of discussion of concepts, from Database Management Systems (DBMSes) to Big Data Management Systems (BDMSes)
To facilitate the planning, analysis, and construction of data models through an actual database for small to enterprise-level database environments
To understand the concept of unified data models for structured, semi-structured, and unstructured data, including finding classes, adding attributes, and simplifying the data structures, followed by advanced data modeling techniques and performance scaling of models
To facilitate working with streaming data with the help of examples on Twitter feeds and weather data points
To understand how we can model open-access data, such as Bitcoin, IMDb, Twitter, and weather data, using Python
Hands-On Big Data Modeling is for you if you are a data modeler, data architect, ETL developer, business intelligence professional, or anyone who wants to design sophisticated and powerful database models. Basic programming skills using Python, R, or any other programming language will be beneficial.
Chapter 1, Introduction to Big Data and Data Management, covers the concept of big data, its sources, and its types. In addition to this, the chapter focuses on providing a theoretical foundation in data modeling and data management, including data ingestion, data storage, data quality, data operations, data scalability, and security, as well as the importance and implications of big data modeling and data management. Readers will get their hands dirty with real big data and its sources.
Chapter 2, Data Modeling and Management Platforms, provides an in-depth theoretical background to data modeling and data management. Users will learn about big data applications, state-of-the-art modeling techniques, and programming platforms for big data analysis involving use case examples. Readers will be using big data from various sources to perform data ingestion, storage, data quality, and various data operations. This chapter also focuses on various real big data applications and big data programming models. In addition to this, it discusses various programming platforms used for big data analysis, including Python, R, Scala, and many more.
Chapter 3, Defining Data Models, walks users through various structures of data, including structured, semi-structured, and unstructured data, and how to apply modeling techniques to them. In addition, users will become familiar with various operations on data models and various data model constraints. Moreover, the chapter gives a brief introduction to a unified approach to data modeling and data management. Hands-on exercises concerning structured Comma-Separated Value (CSV) data will help users to get a better insight into these terms and processes.
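As a brief preview of the kind of hands-on CSV exercise this chapter works toward, here is a minimal pandas sketch of the data operations covered there: subsetting, union, projection, and join. The column names and sample rows are illustrative only, not taken from the book's datasets:

```python
import pandas as pd
from io import StringIO

# Illustrative CSV content; the columns and rows are made up for this sketch,
# not taken from the book's datasets
csv_data = StringIO("""id,city,population
1,Oslo,700000
2,Bergen,280000
3,Kathmandu,1000000""")

df = pd.read_csv(csv_data)

# Subsetting: select the rows that match a condition
large = df[df["population"] > 500000]

# Projection: keep only selected columns
names = df[["id", "city"]]

# Union: stack two row sets that share the same columns
more = pd.DataFrame({"id": [4], "city": ["Lima"], "population": [9000000]})
union = pd.concat([df, more], ignore_index=True)

# Join: combine with a second table on a shared key
regions = pd.DataFrame({"id": [1, 2, 3], "region": ["Europe", "Europe", "Asia"]})
joined = df.merge(regions, on="id")
print(joined)
```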
Chapter 4, Categorizing Data Models, focuses on providing both theoretical and practical guidelines regarding different types of data models, including a conceptual data model, a logical data model, a physical data model, a traditional data model, and a big data model. In addition to this, users will get to know different real-life examples of these models and how they differ from the big data model.
Chapter 5, Structures of Data Models, continues to shed light on big data modeling through specific approaches, including vector space models, graph data models, and more. Users will become acquainted with different data model structures through hands-on exercises, including exploring graph data models with Gephi and utilizing the semi-structured data model of JSON files.
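As a small taste of the semi-structured JSON work in that chapter, the sketch below parses a tweet-like record into nested Python structures. The field names here are hypothetical, not the actual Twitter API payload:

```python
import json

# A hypothetical, simplified tweet-like record (real Twitter API payloads differ)
raw = '''{"user": {"name": "alice", "followers": 42},
          "text": "hello big data",
          "hashtags": ["bigdata", "modeling"]}'''

tweet = json.loads(raw)       # JSON objects become dicts, arrays become lists
print(tweet["user"]["name"])  # navigate the nested structure by key
print(tweet["hashtags"])
```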
Chapter 6, Modeling Structured Data, provides real-life examples of structured data found in everyday business through to the enterprise level, and how modeling can be applied to this data. Users will get their hands dirty using Python or the R programming language.
Chapter 7, Modeling with Unstructured Data, provides real-life examples of unstructured data found in everyday business through to the enterprise level, and how modeling can be applied to this data. Users will get their hands dirty using Python or the R programming language.
Chapter 8, Modeling with Streaming Data, provides users with the opportunity to explore data models and data formats, the concept of data streaming and why streaming data is different, as well as the importance and implications of streaming data.
Chapter 9, Streaming Sensor Data, provides users with the opportunity to acquire practical hands-on experience working with different forms of streaming data, including weather data and Twitter feeds.
Chapter 10, Concept and Approaches of Big Data Management, explores various DBMS- and non-DBMS-based approaches to big data. It also focuses on the advantages of using a DBMS over the traditional filesystem, parallel and distributed DBMSes, and DBMS versus MapReduce-style systems.
Chapter 11, DBMS to BDMS, introduces users to some of the applications available to help with big data management and provides insights into how and when they might be appropriate for the big data management challenges we face.
Chapter 12, Modeling Bitcoin Data Points with Python, covers the different types of models that can be constructed from Bitcoin data. We will use the resulting models to try to predict the price of Bitcoin. In addition to this, we will learn how to use IPython in detail and experiment with Python libraries, including pandas and NumPy.
Chapter 13, Modeling Twitter Feeds Using Python, uses Twitter feeds as big data and utilizes them in Python to produce models based on the tips and tricks learned throughout the book. The chapter consumes the data in raw format, transforms it into the correct format, models it using Python, and interprets the resulting model.
Chapter 14, Modeling Weather Data Points with Python, uses weather data points as big data and utilizes them in Python to produce models based on the tips and tricks learned throughout the book. The chapter consumes the data in raw format, transforms it into the correct format, models it using Python, and interprets the resulting model.
Chapter 15, Modeling IMDb Data Points with Python, uses IMDb data points as big data and utilizes them in Python to produce models based on the tips and tricks learned throughout the book. The chapter consumes the data in raw format, transforms it into the correct format, models it using Python, and interprets the resulting model.
To get the most out of this book, we assume that readers have the following prerequisite knowledge:
An understanding of DBMSes, data modeling, and UML
An understanding of requirement analysis and conceptual data modeling
An understanding of the concepts of data warehousing, data mining, and data mining tools
An understanding of the basic concepts of data management, data storage, data retrieval, and data processing
Basic programming skills in Python, R, or any other programming language
We also expect readers to follow the resources highlighted as further reading at the end of each chapter. In addition, the code shared on GitHub is not the only solution; there can be multiple ways of modeling big data. What we present in this book is just one of these ways, involving open source technologies.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Big-Data-Modeling. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781788620901_ColorImages.pdf.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
This chapter addresses the concept of big data, its sources, and its types. In addition to this, the chapter focuses on providing a theoretical foundation in data modeling and data management. Readers will get their hands dirty setting up a platform on which big data can be utilized. The major topics discussed in this chapter are summarized as follows:
Discover the concept of big data and its origins
Learn about the various characteristics of big data
Discuss and explore various challenges in big data mining
Get familiar with big data modeling and its uses
Understand what big data management is and its importance and implications
Set up a big data platform on a local machine
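Before the platform-specific setup sections later in this chapter, the following minimal sketch (our own suggestion, not a step prescribed by the book) can confirm that a Python 3 environment with the libraries used throughout the book is ready:

```python
# Quick environment check (assumes Python 3; install any missing libraries
# with pip, e.g. pip install numpy pandas matplotlib)
import sys

print("Python", sys.version.split()[0])

for lib in ("numpy", "pandas", "matplotlib"):
    try:
        __import__(lib)
        print(lib, "is available")
    except ImportError:
        print(lib, "is missing")
```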
Digital systems are progressively intertwined with real-world activities. As a consequence, vast amounts of data are recorded and reported by information systems. Over the last 50 years, the growth in information systems and their capabilities to capture, curate, store, share, transfer, analyze, and visualize data has been exponential. Alongside these incredible technological advances, people and organizations depend more and more on computerized devices and information sources on the internet. The IDC Digital Universe Study of May 2010 illustrates this spectacular growth. The study estimated that the amount of digital information stored (on personal computers, digital cameras, servers, and sensors) already exceeded 1 zettabyte, and predicted that the digital universe would grow to 35 zettabytes by 2020. The IDC study characterizes 35 zettabytes as a stack of DVDs reaching halfway to Mars. This is what we refer to as the data explosion.
Most of the data stored in the digital universe is unstructured, and organizations face challenges in capturing, curating, and analyzing it. One of the most challenging tasks for today's organizations is to extract information and value from the data stored in their information systems. This data, which is highly complex and too voluminous to be handled by a traditional DBMS, is called big data.
Whether it is day-to-day data, business data, or any other kind of data, if it represents a massive volume, structured or unstructured, it is relevant to the organization. However, it is not only the dimensions of the data that matter; what counts is how the organization uses it to extract the deeper insights that drive better business and strategic decisions. This voluminous data can be used to improve the quality of research, enhance process flow in an organization, prevent a particular disease, link legal citations, or combat crime. Big data is everywhere, and with the right tools it can be harnessed for business analytics.
Some interesting facts related to big data and its management and analysis are presented here; others appear in the Further reading section. These facts are taken from the sources listed there.
Almost 91% of the world's marketing leaders use customer data to make business decisions.
Interestingly, 90% of the world's total data has been generated within the last two years.
87% of people agree that recording and distributing the right data is important for effectively measuring Return on Investment (ROI) in their own company.
86% of people are willing to pay more for a great customer experience with a brand.
75% of companies claim they will expand investments in big data within the next year.
About 70% of big data is created by individuals, but enterprises are responsible for storing and managing 80% of it.
70% of businesses acknowledge that their marketing efforts are under greater scrutiny.