Big Data Analytics with Java E-Book

Rajat Mehta


Publisher: Packt Publishing
Category: Science and new technologies
Language: English

Description

Learn the basics of analytics on big data using Java, machine learning and other big data tools

About This Book

Acquire a real-world set of tools for building enterprise-level data science applications
Overcome the barrier of other languages in data science and learn to create useful object-oriented code
Make extensive use of Java-compliant big data tools such as Apache Spark and Hadoop

Who This Book Is For

This book is for Java developers who want to perform data analysis in a production environment. Those who wish to implement data analysis in their big data applications will find this book helpful.

What You Will Learn

Start from simple analytic tasks on big data
Get into more complex tasks with predictive analytics on big data using machine learning
Learn real time analytic tasks
Understand the concepts with examples and case studies
Prepare and refine data for analysis
Create charts in order to understand the data
See various real-world datasets

In Detail

This book covers case studies such as sentiment analysis on a tweet dataset, recommendations on a MovieLens dataset, customer segmentation on an e-commerce dataset, and graph analysis on an actual flights dataset.

This book is an end-to-end guide to implementing analytics on big data with Java. Java is the de facto language for major big data environments, including Hadoop. This book will teach you how to perform analytics on big data with production-friendly Java. The book is divided into two sections. The first part is an introduction that helps readers get acquainted with big data environments, whereas the second part contains an in-depth discussion of the concepts in analytics on big data. It will take you from data analysis and data visualization to the core concepts and advantages of machine learning, real-life usage of regression and classification using Naive Bayes, a deep discussion on the concepts of clustering, and a review of simple neural networks on big data using Deeplearning4j or plain Java Spark code. This book is a must-have for Java developers who want to start learning big data analytics and use it in the real world.

Style and approach

The approach of the book is to deliver practical learning modules in manageable portions. Each chapter is a self-contained unit covering one concept in big data analytics. Step by step, the book builds competency in the area of big data analytics. The examples use real-world case studies to give an idea of real applications and of how to use the techniques mentioned. The examples and case studies are presented using both theory and code.

Details


Page count: 549

Year of publication: 2017


Sample

Big Data Analytics with Java

Credits

About the Author

About the Reviewers

www.PacktPub.com

eBooks, discount offers, and more

Why subscribe?

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. Big Data Analytics with Java

Why data analytics on big data?

Big data for analytics

Big data – a bigger pay package for Java developers

Basics of Hadoop – a Java sub-project

Distributed computing on Hadoop

HDFS concepts

Design and architecture of HDFS

Main components of HDFS

HDFS simple commands

Apache Spark

Concepts

Transformations

Actions

Spark Java API

Spark samples using Java 8

Loading data

Data operations – cleansing and munging

Analyzing data – count, projection, grouping, aggregation, and max/min

Actions on RDDs

Paired RDDs

Transformations on paired RDDs

Saving data

Collecting and printing results

Executing Spark programs on Hadoop

Apache Spark sub-projects

Spark machine learning modules

MLlib Java API

Other machine learning libraries

Mahout – a popular Java ML library

Deeplearning4j – a deep learning library

Compressing data

Avro and Parquet

Summary

2. First Steps in Data Analysis

Datasets

Data cleaning and munging

Basic analysis of data with Spark SQL

Building SparkConf and context

Dataframe and datasets

Load and parse data

Analyzing data – the Spark-SQL way

Spark SQL for data exploration and analytics

Market basket analysis – Apriori algorithm

Full Apriori algorithm

Implementation of the Apriori algorithm in Apache Spark

Efficient market basket analysis using FP-Growth algorithm

Running FP-Growth on Apache Spark

Summary

3. Data Visualization

Data visualization with Java JFreeChart

Using charts in big data analytics

Time Series chart

All India seasonal and annual average temperature series dataset

Simple single Time Series chart

Multiple Time Series on a single chart window

Bar charts

Histograms

When would you use a histogram?

How to make histograms using JFreeChart?

Line charts

Scatter plots

Box plots

Advanced visualization technique

Prefuse

IVTK Graph toolkit

Other libraries

Summary

4. Basics of Machine Learning

What is machine learning?

Real-life examples of machine learning

Type of machine learning

A small sample case study of supervised and unsupervised learning

Steps for machine learning problems

Choosing the machine learning model

What are the feature types that can be extracted from the datasets?

How do you select the best features to train your models?

How do you run machine learning analytics on big data?

Getting and preparing data in Hadoop

Preparing the data

Formatting the data

Storing the data

Training and storing models on big data

Apache Spark machine learning API

The new Spark ML API

Summary

5. Regression on Big Data

Linear regression

What is simple linear regression?

Where is linear regression used?

Predicting house prices using linear regression

Dataset

Data cleaning and munging

Exploring the dataset

Running and testing the linear regression model

Logistic regression

Which mathematical functions does logistic regression use?

Where is logistic regression used?

Predicting heart disease using logistic regression

Dataset

Data cleaning and munging

Data exploration

Running and testing the logistic regression model

Summary

6. Naive Bayes and Sentiment Analysis

Conditional probability

Bayes theorem

Naive Bayes algorithm

Advantages of Naive Bayes

Disadvantages of Naive Bayes

Sentimental analysis

Concepts for sentimental analysis

Tokenization

Stop words removal

Stemming

N-grams

Term presence and Term Frequency

TF-IDF

Bag of words

Dataset

Data exploration of text data

Sentimental analysis on this dataset

SVM or Support Vector Machine

Summary

7. Decision Trees

What is a decision tree?

Building a decision tree

Choosing the best features for splitting the datasets

Advantages of using decision trees

Disadvantages of using decision trees

Dataset

Data exploration

Cleaning and munging the data

Training and testing the model

Summary

8. Ensembling on Big Data

Ensembling

Types of ensembling

Bagging

Boosting

Advantages and disadvantages of ensembling

Random forests

Gradient boosted trees (GBTs)

Classification problem and dataset used

Data exploration

Training and testing our random forest model

Training and testing our gradient boosted tree model

Summary

9. Recommendation Systems

Recommendation systems and their types

Content-based recommendation systems

Dataset

Content-based recommender on MovieLens dataset

Collaborative recommendation systems

Advantages

Disadvantages

Alternating least square – collaborative filtering

Summary

10. Clustering and Customer Segmentation on Big Data

Clustering

Types of clustering

Hierarchical clustering

K-means clustering

Bisecting k-means clustering

Customer segmentation

Dataset

Data exploration

Clustering for customer segmentation

Changing the clustering algorithm

Summary

11. Massive Graphs on Big Data

Refresher on graphs

Representing graphs

Common terminology on graphs

Common algorithms on graphs

Plotting graphs

Massive graphs on big data

Graph analytics

GraphFrames

Building a graph using GraphFrames

Graph analytics on airports and their flights

Datasets

Graph analytics on flights data

Summary

12. Real-Time Analytics on Big Data

Real-time analytics

Big data stack for real-time analytics

Real-time SQL queries on big data

Real-time data ingestion and storage

Real-time data processing

Real-time SQL queries using Impala

Flight delay analysis using Impala

Apache Kafka

Spark Streaming

Typical uses of Spark Streaming

Base project setup

Big Data Analytics with Java

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2017

Production reference: 1270717

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78728-898-0

www.packtpub.com

Credits

Author

Rajat Mehta

Reviewers

Dave Wentzel

Roberto Casati

Commissioning Editor

Veena Pagare

Acquisition Editor

Chandan Kumar

Content Development Editor

Deepti Thore

Technical Editors

Jovita Alva

Sneha Hanchate

Copy Editors

Safis Editing

Laxmi Subramanian

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Pratik Shirodkar

Graphics

Tania Dutta

Production Coordinator

Shantanu N. Zagade

Cover Work

Shantanu N. Zagade

About the Author

Rajat Mehta is a VP (technical architect) in technology at JP Morgan Chase in New York. He is a Sun certified Java developer and has worked on Java-related technologies for more than 16 years. For the past few years, his role has heavily involved the use of a big data stack and running analytics on it. He also contributes to various open source projects that are available on his GitHub repository, and is a frequent writer for dev magazines.

About the Reviewers

Dave Wentzel is the CTO of Capax Global, a data consultancy specializing in SQL Server, cloud, IoT, data science, and Hadoop technologies. Dave helps customers with data modernization projects. For years, Dave worked at big independent software vendors, dealing with the scalability limitations of traditional relational databases. With the advent of Hadoop and big data technologies everything changed. Things that were impossible to do with data were suddenly within reach.

Before joining Capax, Dave worked at Microsoft, assisting customers with big data solutions on Azure. Success for Dave is solving challenging problems at companies he respects, with talented people who he admires.

Roberto Casati is a certified enterprise architect working in the financial services market. Roberto lives in Milan, Italy, with his wife, their daughter, and a dog.

In a former life, after graduating in engineering, he worked as a Java developer, Java architect, and presales architect for the most important telecommunications, travel, and financial services companies.

His interests and passions include data science, artificial intelligence, technology, and food.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book’s Amazon page at https://www.amazon.com/dp/1787288986.

If you’d like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

This book is dedicated to my mother Kanchan, my wife Harpreet, my daughter Meher, my father Ashwini and my son Vivaan.

Preface

Even as you read this content, there is a revolution happening behind the scenes in the field of big data. From every coffee that you pick up from a coffee store to everything you click or purchase online, almost every transaction, click, or choice of yours is getting analyzed. From this analysis, a lot of deductions are now being made to offer you new stuff and better choices according to your likes. These techniques and associated technologies are picking up so fast that as developers we all should be a part of this new wave in the field of software. This would allow us better prospects in our careers, as well as enhance our skill set to directly impact the business we work for.

Earlier, technologies such as machine learning and artificial intelligence used to sit in the labs of many PhD students. But with the rise of big data, these technologies have now gone mainstream. Using them, you can predict which advertisement a user is going to click on next, or which product they would like to buy, or determine whether the image of a tumor is cancerous or not. The opportunities here are vast. Big data itself comprises a whole range of technologies: cluster computing frameworks such as Apache Spark or Tez, distributed filesystems such as HDFS and Amazon S3, and real-time SQL on the underlying data using Impala or Spark SQL.

This book provides a lot of information on big data technologies, including machine learning, graph analytics, real-time analytics, and an introductory chapter on deep learning. I have tried to cover both the technical and the conceptual aspects of these technologies. In doing so, I have used many real-world case studies to depict how these technologies can be used in real life. So this book will teach you how to run a fast algorithm on the transactional data available on an e-commerce site to figure out which items sell together, or how to run the PageRank algorithm on a flight dataset to figure out the most important airports in a country based on air traffic. There are many content gems like these in the book for readers.

What this book covers

Chapter 1, Big Data Analytics with Java, starts with an introduction to the core concepts of Hadoop and provides information on its key components. In easy-to-understand explanations, it shows how the components fit together and gives simple examples of the usage of the core components HDFS and Apache Spark. This chapter also talks about the different sources that can put their data inside Hadoop, the compression formats used, and the systems that are used to analyze that data.

Chapter 2, First Steps in Data Analysis, takes the first steps into the field of analytics on big data. We start with a simple example covering basic statistical analytic steps, followed by two popular algorithms for building association rules: the Apriori algorithm and the FP-Growth algorithm. For all case studies, we have used realistic examples from an online e-commerce store to give users insight into how these algorithms can be used in the real world.
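
To give a feel for the "which items sell together" idea behind association rules, here is a minimal sketch in plain Java. It only counts how many baskets contain each pair of items and keeps the pairs that reach a minimum support count, which is roughly the first step of an Apriori-style analysis. It is not the book's Spark implementation; the class name PairSupport, the sample baskets, and the threshold of 2 are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class PairSupport {

    // Counts how many transactions contain each unordered pair of items
    // and returns only the pairs seen in at least minSupport transactions.
    static Map<String, Integer> frequentPairs(List<List<String>> transactions, int minSupport) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> transaction : transactions) {
            // Sort and de-duplicate so each pair has one canonical key per transaction.
            List<String> items = new ArrayList<>(new TreeSet<>(transaction));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + "+" + items.get(j), 1, Integer::sum);
                }
            }
        }
        // Drop the pairs that do not reach the minimum support count.
        counts.values().removeIf(count -> count < minSupport);
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical baskets, invented for this illustration.
        List<List<String>> baskets = Arrays.asList(
                Arrays.asList("bread", "milk"),
                Arrays.asList("bread", "butter", "milk"),
                Arrays.asList("butter", "milk"),
                Arrays.asList("bread", "milk", "eggs"));
        System.out.println(frequentPairs(baskets, 2));
    }
}
```

A full Apriori run would go on to build larger itemsets from the frequent pairs and derive rules from them; on big data, that work would be distributed rather than done in a single in-memory loop like this one.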

Chapter 3, Data Visualization, helps you to understand what different types of charts there are for data analysis, how to use them, and why. With this understanding, we can make better decisions when exploring our data. This chapter also contains lots of code samples to show the different types of charts built using Apache Spark and the JFreeChart library.

Chapter 4, Basics of Machine Learning, helps you to understand the basic theoretical concepts behind machine learning, such as what exactly machine learning is, how it is used, examples of its use in real life, and the different forms of machine learning. If you are new to the field of machine learning, or want to brush up your existing knowledge of it, this chapter is for you. Here I will also show how, as a developer, you should approach a machine learning problem, including topics on feature extraction, feature selection, model testing, model selection, and more.

Chapter 5, Regression on Big Data, explains how you can use linear regression to predict continuous values and how you can do binary classification using logistic regression. A real-world case study of house price evaluation based on the different features of the house is used to explain the concepts of linear regression. To explain the key concepts of logistic regression, a real-life case study of detecting heart disease in a patient based on different features is used.

Chapter 6, Naive Bayes and Sentiment Analysis, explains a probabilistic machine learning model called Naive Bayes and also briefly explains another popular model called the support vector machine. The chapter starts with basic concepts such as Bayes' theorem and then explains how these concepts are used in Naive Bayes. I then use the model to predict the sentiment, whether positive or negative, in a set of tweets from Twitter. The same case study is then re-run using the support vector machine model.

Chapter 7, Decision Trees, explains that decision trees are like flowcharts and can be programmatically built using concepts such as Entropy or Gini Impurity. The golden egg in this chapter is a case study that shows how we can predict whether a person's loan application will be approved or not using decision trees.
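
The two split measures named above can be sketched in a few lines of plain Java. This is only an illustration of the standard formulas, Gini impurity as 1 minus the sum of squared class proportions and entropy as the negative sum of p times log2(p); it is not code from the book, and the class name Impurity and the approved/rejected counts are assumptions.

```java
public class Impurity {

    // Gini impurity: 1 - sum(p_i^2) over the class proportions p_i.
    static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double sumSquares = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    // Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping empty classes.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) {
            total += c;
        }
        double entropy = 0.0;
        for (int c : classCounts) {
            if (c > 0) {
                double p = (double) c / total;
                entropy -= p * (Math.log(p) / Math.log(2));
            }
        }
        return entropy;
    }

    public static void main(String[] args) {
        // Hypothetical node holding 5 approved and 5 rejected loan applications.
        int[] mixed = {5, 5};
        // Hypothetical node holding only approved applications.
        int[] pure = {10, 0};
        System.out.println("mixed: gini=" + gini(mixed) + ", entropy=" + entropy(mixed));
        System.out.println("pure:  gini=" + gini(pure) + ", entropy=" + entropy(pure));
    }
}
```

Both measures are zero for a node that contains a single class and largest for an evenly mixed node, which is why a tree builder prefers the split that lowers them the most.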

Chapter 8, Ensembling on Big Data, explains how ensembling plays a major role in improving predictive performance. I cover different concepts related to ensembling in this chapter, including how multiple models can be combined using bagging or boosting to enhance the predictive output. We also cover two highly popular and accurate ensemble models, random forests and gradient-boosted trees. Finally, we use these models to predict loan defaults on a real-world dataset from Lending Club (an online lending company).

Chapter 9, Recommendation Systems, covers a concept that has done much to make machine learning popular and that directly impacts business as well. In this chapter, we show what recommendation systems are, what they can do, and how they are built using machine learning. We cover both types of recommendation systems, content-based and collaborative, along with their good and bad points. Finally, we cover two case studies using the MovieLens dataset to recommend movies that users might like to see.

Chapter 10, Clustering and Customer Segmentation on Big Data, speaks about clustering and how it can be used by a real-world e-commerce store to segment their customers based on how valuable they are. I have covered both k-Means clustering and bisecting k-Means clustering, and used both of them in the corresponding case study on customer segmentation.

Chapter 11, Massive Graphs on Big Data, covers an interesting topic: graph analytics. We start with a refresher on graphs and their basic concepts, and later go on to explore the different forms of analytics that can be run on graphs, whether path-based analytics involving algorithms such as breadth-first search, or connectivity analytics involving degrees of connection. A real-world flight dataset is then used to explore the different forms of graph analytics, showing analytical concepts such as finding the top airports using the PageRank algorithm.
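
As a rough picture of how ranking nodes by their incoming links works, the following sketch runs a basic PageRank power iteration over a tiny directed graph in plain Java. It does not use GraphFrames and is not the book's code; the class name SimplePageRank, the three-node route graph, the damping factor of 0.85, and the 50 iterations are assumptions chosen for illustration.

```java
import java.util.Arrays;

public class SimplePageRank {

    // links[i] lists the nodes that node i points to.
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            // Every node receives the same small baseline share.
            Arrays.fill(next, (1.0 - damping) / n);
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // Dangling node: spread its rank evenly over all nodes.
                    for (int j = 0; j < n; j++) {
                        next[j] += damping * rank[i] / n;
                    }
                } else {
                    // Split this node's rank equally among its outgoing links.
                    for (int target : links[i]) {
                        next[target] += damping * rank[i] / links[i].length;
                    }
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical routes: airport 0 -> 2, airport 1 -> 2, airport 2 -> 0.
        int[][] routes = {{2}, {2}, {0}};
        System.out.println(Arrays.toString(pageRank(routes, 0.85, 50)));
    }
}
```

In this toy graph, airport 2 has the most incoming routes and ends up with the highest score, while airport 1, which nothing flies into, keeps only the baseline share.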

Chapter 12, Real-Time Analytics on Big Data, introduces real-time analytics by first looking at a few real-world examples. We also learn about the products that are used to build real-time analytics systems on top of big data. We particularly cover the concepts of Impala, Spark Streaming, and Apache Kafka. Finally, we cover two real-life case studies: building a list of trending videos from data that is generated in real time, and performing sentiment analysis on tweets by simulating a Twitter-like scenario using Apache Kafka and Spark Streaming.

Chapter 13, Deep Learning Using Big Data, speaks about the wide range of applications that deep learning has in real life, whether it's self-driving cars, disease detection, or speech recognition software. We start with the very basics of what a biological neural network is and how it is mimicked in an artificial neural network. We also cover a lot of the theory behind artificial neurons and then work through a simple case study of flower species detection using a multi-layer perceptron. We conclude the chapter with a brief introduction to the Deeplearning4j library and a case study on handwritten digit classification using convolutional neural networks.

What you need for this book

There are a few things you will require to follow the examples in this book: a text editor (I use Sublime Text), internet access, admin rights to your machine to install applications and download sample code, and an IDE (I use Eclipse and IntelliJ).

You will also need other software such as Java, Maven, Apache Spark, Spark modules, the GraphFrames library, and the JFreeChart library. We mention the required software in the respective chapters.

You also need a computer with plenty of RAM; alternatively, you can run the samples on Amazon AWS.

Who this book is for

If you already know some Java and understand the principles of big data, this book is for you. This book can be used by a developer who has mostly worked on web programming or any other field to switch into the world of analytics using machine learning on big data.

A good understanding of Java and SQL is required. Some understanding of technologies such as Apache Spark, basic graphs, and messaging will also be beneficial.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

If you have any questions, don't hesitate to look me up on LinkedIn via my profile https://www.linkedin.com/in/rajatm/. I will be more than glad to help a fellow software professional.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Big-Data-Analytics-with-Java. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from www.packtpub.com/sites/default/files/downloads/BigDataAnalyticswithJava_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.