Learn the basics of analytics on big data using Java, machine learning and other big data tools
This book is for Java developers who want to perform data analysis in a production environment. Those who wish to implement data analysis in their big data applications will also find this book helpful.
This book covers case studies such as sentiment analysis on a tweet dataset, recommendations on a MovieLens dataset, customer segmentation on an e-commerce dataset, and graph analysis on an actual flights dataset.
This book is an end-to-end guide to implementing analytics on big data with Java. Java is the de facto language for major big data environments, including Hadoop. This book will teach you how to perform analytics on big data with production-friendly Java. The book is divided into two sections. The first part is an introduction that helps readers get acquainted with big data environments, whereas the second part contains an in-depth discussion of the concepts in analytics on big data. It will take you from data analysis and data visualization to the core concepts and advantages of machine learning, real-life usage of regression and classification using Naive Bayes, a deep discussion of the concepts of clustering, and a review of simple neural networks on big data using Deeplearning4j or plain Java Spark code. This is a must-have book for Java developers who want to start learning big data analytics and use it in the real world.
The approach of this book is to deliver practical learning modules in manageable chunks. Each chapter is a self-contained unit covering one concept in big data analytics, and the book builds competency in the area step by step. Examples based on real-world case studies give an idea of real applications and show how to use the techniques mentioned. The examples and case studies are presented using both theory and code.
Page count: 549
Year of publication: 2017
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2017
Production reference: 1270717
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78728-898-0
www.packtpub.com
Author
Rajat Mehta
Reviewers
Dave Wentzel
Roberto Casati
Commissioning Editor
Veena Pagare
Acquisition Editor
Chandan Kumar
Content Development Editor
Deepti Thore
Technical Editors
Jovita Alva
Sneha Hanchate
Copy Editors
Safis Editing
Laxmi Subramanian
Project Coordinator
Shweta H Birwatkar
Proofreader
Safis Editing
Indexer
Pratik Shirodkar
Graphics
Tania Dutta
Production Coordinator
Shantanu N. Zagade
Cover Work
Shantanu N. Zagade
Rajat Mehta is a VP (technical architect) in technology at JP Morgan Chase in New York. He is a Sun certified Java developer and has worked on Java-related technologies for more than 16 years. For the past few years, his role has heavily involved using a big data stack and running analytics on it. He also contributes to various open source projects that are available on his GitHub repository, and writes frequently for dev magazines.
Dave Wentzel is the CTO of Capax Global, a data consultancy specializing in SQL Server, cloud, IoT, data science, and Hadoop technologies. Dave helps customers with data modernization projects. For years, Dave worked at big independent software vendors, dealing with the scalability limitations of traditional relational databases. With the advent of Hadoop and big data technologies everything changed. Things that were impossible to do with data were suddenly within reach.
Before joining Capax, Dave worked at Microsoft, assisting customers with big data solutions on Azure. Success for Dave is solving challenging problems at companies he respects, with talented people whom he admires.
Roberto Casati is a certified enterprise architect working in the financial services market. Roberto lives in Milan, Italy, with his wife, their daughter, and a dog.
In a former life, after graduating in engineering, he worked as a Java developer, Java architect, and presales architect for the most important telecommunications, travel, and financial services companies.
His interests and passions include data science, artificial intelligence, technology, and food.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book’s Amazon page at https://www.amazon.com/dp/1787288986.
If you’d like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
This book is dedicated to my mother Kanchan, my wife Harpreet, my daughter Meher, my father Ashwini and my son Vivaan.
Even as you read this content, there is a revolution happening behind the scenes in the field of big data. From every coffee that you pick up from a coffee store to everything you click or purchase online, almost every transaction, click, or choice of yours is being analyzed. From this analysis, a lot of deductions are now being made to offer you new products and better choices according to your likes. These techniques and associated technologies are picking up so fast that, as developers, we should all be a part of this new wave in the field of software. This would give us better prospects in our careers, as well as enhance our skill set so that we can directly impact the business we work for.
Earlier, technologies such as machine learning and artificial intelligence used to sit in the labs of many PhD students. But with the rise of big data, these technologies have now gone mainstream. Using them, you can predict which advertisement a user is going to click on next, which product they would like to buy, or whether the image of a tumor is cancerous or not. The opportunities here are vast. Big data in itself consists of a whole range of technologies, whether cluster computing frameworks such as Apache Spark or Tez, distributed filesystems such as HDFS and Amazon S3, or real-time SQL on underlying data using Impala or Spark SQL.
This book provides a lot of information on big data technologies, including machine learning, graph analytics, and real-time analytics, as well as an introductory chapter on deep learning. I have tried to cover both the technical and conceptual aspects of these technologies. In doing so, I have used many real-world case studies to depict how these technologies can be used in real life. So this book will teach you how to run a fast algorithm on the transactional data available on an e-commerce site to figure out which items sell together, or how to run the PageRank algorithm on a flight dataset to figure out the most important airports in a country based on air traffic. There are many content gems like these in the book for readers.
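To make the "which items sell together" idea concrete, here is a minimal plain-Java sketch of support and confidence, the two measures that association rule mining typically relies on. It is a brute-force count over a handful of made-up baskets, not Apriori or FP-Growth and not the Spark-based code from the book; the class name ItemPairStats and its methods are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class for illustration; not taken from the book's code bundle.
final class ItemPairStats {

    // Builds one shopping basket from the given item names.
    static Set<String> basket(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Support: fraction of baskets that contain every item in `items`.
    static double support(List<Set<String>> baskets, Set<String> items) {
        long hits = baskets.stream().filter(b -> b.containsAll(items)).count();
        return (double) hits / baskets.size();
    }

    // Confidence of the rule "antecedent -> consequent":
    // support(antecedent and consequent together) / support(antecedent).
    static double confidence(List<Set<String>> baskets,
                             Set<String> antecedent,
                             Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        return support(baskets, both) / support(baskets, antecedent);
    }

    public static void main(String[] args) {
        // Made-up transactions, for illustration only.
        List<Set<String>> baskets = Arrays.asList(
                basket("bread", "milk"),
                basket("bread", "butter"),
                basket("bread", "milk", "butter"),
                basket("milk"));

        System.out.println("support(bread, milk) = "
                + support(baskets, basket("bread", "milk")));
        System.out.println("confidence(bread -> milk) = "
                + confidence(baskets, basket("bread"), basket("milk")));
    }
}
```

In this toy data, bread and milk appear together in 2 of 4 baskets, so the support is 0.5, and 2 of the 3 baskets containing bread also contain milk, so the confidence of "bread -> milk" is about 0.67. On real transactional data the number of candidate item sets grows very quickly, which is why dedicated algorithms are used instead of counting every combination.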
Chapter 1, Big Data Analytics with Java, starts with an introduction to the core concepts of Hadoop and provides information on its key components. In easy-to-understand explanations, it shows how the components fit together and gives simple examples of the usage of the core components HDFS and Apache Spark. This chapter also talks about the different data sources that can feed data into Hadoop, their compression formats, and the systems that are used to analyze that data.
Chapter 2, First Steps in Data Analysis, takes the first steps towards the field of analytics on big data. We start with a simple example covering basic statistical analytic steps, followed by two popular algorithms for building association rules: the Apriori algorithm and the FP-Growth algorithm. For all case studies, we have used realistic examples from an online e-commerce store to give users insight into how these algorithms can be used in the real world.
Chapter 3, Data Visualization, helps you to understand what different types of charts there are for data analysis, how to use them, and why. With this understanding, we can make better decisions when exploring our data. This chapter also contains lots of code samples to show the different types of charts built using Apache Spark and the JFreeChart library.
Chapter 4, Basics of Machine Learning, helps you to understand the basic theoretical concepts behind machine learning, such as what exactly machine learning is, how it is used, examples of its use in real life, and the different forms of machine learning. If you are new to the field of machine learning, or want to brush up your existing knowledge of it, this chapter is for you. Here I will also show how, as a developer, you should approach a machine learning problem, including topics on feature extraction, feature selection, model testing, model selection, and more.
Chapter 5, Regression on Big Data, explains how you can use linear regression to predict continuous values and how you can do binary classification using logistic regression. A real-world case study of house price evaluation based on the different features of the house is used to explain the concepts of linear regression. To explain the key concepts of logistic regression, a real-life case study of detecting heart disease in a patient based on different features is used.
Chapter 6, Naive Bayes and Sentiment Analysis, explains a probabilistic machine learning model called Naive Bayes and also briefly explains another popular model called the support vector machine. The chapter starts with basic concepts such as Bayes' theorem and then explains how these concepts are used in Naive Bayes. I then use the model to predict the sentiment (positive or negative) of a set of tweets from Twitter. The same case study is then re-run using the support vector machine model.
Chapter 7, Decision Trees, explains that decision trees are like flowcharts and can be programmatically built using concepts such as Entropy or Gini Impurity. The golden egg in this chapter is a case study that shows how we can predict whether a person's loan application will be approved or not using decision trees.
Chapter 8, Ensembling on Big Data, explains how ensembling plays a major role in improving the quality of predictive results. I cover different concepts related to ensembling in this chapter, including how multiple models can be joined together using bagging or boosting, thereby enhancing the predictive outputs. We also cover two highly popular and accurate ensembles of models, random forests and gradient-boosted trees. Finally, we use these models to predict loan default by users in a real-world dataset from Lending Club (an online lending company).
Chapter 9, Recommendation Systems, covers the concept that has made machine learning so popular and that directly impacts business as well. In this chapter, we show what recommendation systems are, what they can do, and how they are built using machine learning. We cover both types of recommendation systems, content-based and collaborative, along with their good and bad points. Finally, we cover two case studies using the MovieLens dataset to recommend movies that users might like to see.
Chapter 10, Clustering and Customer Segmentation on Big Data, speaks about clustering and how it can be used by a real-world e-commerce store to segment their customers based on how valuable they are. I have covered both k-Means clustering and bisecting k-Means clustering, and used both of them in the corresponding case study on customer segmentation.
Chapter 11, Massive Graphs on Big Data, covers an interesting topic, graph analytics. We start with a refresher on graphs, with basic concepts, and later go on to explore the different forms of analytics that can be run on graphs, whether path-based analytics involving algorithms such as breadth-first search, or connectivity analytics involving degrees of connection. A real-world flight dataset is then used to explore the different forms of graph analytics, showing analytical concepts such as finding the top airports using the PageRank algorithm.
Chapter 12, Real-Time Analytics on Big Data, speaks about real-time analytics, starting with a few examples of real-time analytics in the real world. We also learn about the products that are used to build real-time analytics systems on top of big data. We particularly cover the concepts of Impala, Spark Streaming, and Apache Kafka. Finally, we cover two real-life case studies: how to build a list of trending videos from data that is generated in real time, and how to do sentiment analysis on tweets by depicting a Twitter-like scenario using Apache Kafka and Spark Streaming.
Chapter 13, Deep Learning Using Big Data, speaks about the wide range of applications that deep learning has in real life, whether it's self-driving cars, disease detection, or speech recognition software. We start with the very basics of what a biological neural network is and how it is mimicked in an artificial neural network. We also cover a lot of the theory behind artificial neurons and then cover a simple case study of flower species detection using a multi-layer perceptron. We conclude the chapter with a brief introduction to the Deeplearning4j library and a case study on handwritten digit classification using convolutional neural networks.
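As a small taste of the concepts listed above, the Chapter 7 summary mentions Entropy and Gini Impurity as the measures used when building decision trees. The following is a minimal plain-Java sketch of the two textbook formulas, computed from per-class sample counts at a single tree node. The class name Impurity and its methods are hypothetical, the counts are made up, and the code assumes at least one sample is present; it is not the implementation used in the book or in Apache Spark.

```java
// Hypothetical helper class for illustration; not taken from the book's code bundle.
final class Impurity {

    // Shannon entropy (in bits) of a node, given the number of samples per class.
    // Assumes the counts are non-negative and sum to more than zero.
    static double entropy(int... counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) {
                continue; // a class with no samples contributes nothing
            }
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Gini impurity of a node: 1 minus the sum of squared class proportions.
    // Assumes the counts are non-negative and sum to more than zero.
    static double gini(int... counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        double sumOfSquares = 0.0;
        for (int c : counts) {
            double p = (double) c / total;
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    public static void main(String[] args) {
        // Made-up node: 5 approved and 5 rejected loan applications.
        System.out.println("entropy(5, 5)  = " + entropy(5, 5));
        System.out.println("gini(5, 5)     = " + gini(5, 5));
        // Made-up pure node: all 10 applications belong to one class.
        System.out.println("entropy(10, 0) = " + entropy(10, 0));
        System.out.println("gini(10, 0)    = " + gini(10, 0));
    }
}
```

An evenly mixed two-class node gives the highest impurity (entropy of 1 bit, Gini impurity of 0.5), while a pure node gives 0 under both measures. A tree-building algorithm prefers the splits that reduce this impurity the most.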
There are a few things you will require to follow the examples in this book: a text editor (I use Sublime Text), internet access, admin rights to your machine to install applications and download sample code, and an IDE (I use Eclipse and IntelliJ).
You will also need other software such as Java, Maven, Apache Spark, Spark modules, the GraphFrames library, and the JFreeChart library. We mention the required software in the respective chapters.
You also need a computer with a generous amount of RAM; alternatively, you can run the samples on Amazon Web Services (AWS).
If you already know some Java and understand the principles of big data, this book is for you. This book can be used by a developer who has mostly worked on web programming or any other field to switch into the world of analytics using machine learning on big data.
A good understanding of Java and SQL is required. Some understanding of technologies such as Apache Spark, basic graphs, and messaging will also be beneficial.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
If you have any questions, don't hesitate to look me up on LinkedIn via my profile at https://www.linkedin.com/in/rajatm/. I will be more than glad to help a fellow software professional.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Big-Data-Analytics-with-Java. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from www.packtpub.com/sites/default/files/downloads/BigDataAnalyticswithJava_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
