86,39 €
Develop, Implement and Tuneup your Machine Learning applications using the power of Java programming
This course is the right resource for anyone with some knowledge of Java programming who wants to get started with Data Science and Machine learning as quickly as possible. If you want to gain meaningful insights from big data and develop intelligent applications using Java, this course is also a must-have.
Machine Learning is one of the core area of Artificial Intelligence where computers are trained to self-learn, grow, change, and develop on their own without being explicitly programmed. In this course, we cover how Java is employed to build powerful machine learning models to address the problems being faced in the world of Data Science. The course demonstrates complex data extraction and statistical analysis techniques supported by Java, applying various machine learning methods, exploring machine learning sub-domains, and exploring real-world use cases such as recommendation systems, fraud detection, natural language processing, and more, using Java programming. The course begins with an introduction to data science and basic data science tasks such as data collection, data cleaning, data analysis, and data visualization. The next section has a detailed overview of statistical techniques, covering machine learning, neural networks, and deep learning. The next couple of sections cover applying machine learning methods using Java to a variety of chores including classifying, predicting, forecasting, market basket analysis, clustering stream learning, active learning, semi-supervised learning, probabilistic graph modeling, text mining, and deep learning.
The last section highlights real-world test cases such as performing activity recognition, developing image recognition, text classification, and anomaly detection. The course includes premium content from three of our most popular books:
On completion of this course, you will understand various machine learning techniques, different machine learning java algorithms you can use to gain data insights, building data models to analyze larger complex data sets, and incubating applications using Java and machine learning algorithms in the field of artificial intelligence.
This comprehensive course proceeds from being a tutorial to a practical guide, providing an introduction to machine learning and different machine learning techniques, exploring machine learning with Java libraries, and demonstrating real-world machine learning use cases using the Java platform.
Sie lesen das E-Book in den Legimi-Apps auf:
Seitenzahl: 1334
Veröffentlichungsjahr: 2017
Copyright © 2017 Packt Publishing
All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Published on: September 2017
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78862-221-9
www.packtpub.com
Authors
Richard M. Reese
Jennifer L. Reese
Boštjan Kaluža
Dr. Uday Kamath
Krishna Choppella
Reviewers
Walter Molina
Shilpi Saxena
Abhik Banerjee
Wei Di
Manjunath Narayana
Ravi Sharma
Samir Sahli
Prashant Verma
Content Development Editor
Aishwarya Pandere
Production Coordinator
Arvindkumar Gupta
Machine learning is a subfield of artificial intelligence. It helps computers to learn and act like human beings with the help of algorithms and data. With a given set of data, an ML algorithm learns different properties of the data and infers the properties of the data that it may encounter in future.
Module 1, Java for Data Science, investigates the support provided for low-level math operations and how they can be supported in a multiple processor environment. Data analysis, at its heart, necessitates the ability to manipulate and analyze large quantities of numeric data.
Module 2, Machine Learning in Java, reviews the various Java libraries and platforms dedicated to machine learning, what each library brings to the table, and what kind of problems it is able to solve. The review includes Weka, Java-ML, Apache Mahout, Apache Spark, deeplearning4j, and Mallet.
Module 3, Mastering Java Machine Learning, presents many advanced methods in clustering and outlier techniques, with applications. Topics covered are feature selection and reduction in unsupervised data, clustering algorithms, evaluation methods in clustering, and anomaly detection using statistical, distance, and distribution techniques. At the end of the chapter, we perform a case study for both clustering and outlier detection using a real-world image dataset, MNIST. We use the Smile API to do feature reduction and ELKI for learning.
Module 1:
Many of the examples in this module use Java 8 features. There are a number of Java APIs demonstrated, each of which is introduced before it is applied. An IDE is not required but is desirable.
Module 2:
To follow the examples throughout the module, you'll need a personal computer with the JDK installed. All the examples and source code that you can download assume Eclipse IDE with support for Maven, a dependency management and build automation tool; and Git, a version control system. Examples in the chapters rely on various libraries, including Weka, deeplearning4j, Mallet, and Apache Mahout. Instructions on how to get and install the libraries are provided in the chapter where the library will be first used.
The module has a dedicated web site, http://machine-learning-in-java.com, where you can find all the example code, errata, and additional materials that will help you to get started.
Module 3:
This book assumes you have some experience of programming in Java and a basic understanding of machine learning concepts. If that doesn't apply to you, but you are curious nonetheless and self-motivated, fret not, and read on! For those who do have some background, it means that you are familiar with simple statistical analysis of data and concepts involved in supervised and unsupervised learning. Those who may not have the requisite math or must poke the far reaches of their memory to shake loose the odd formula or funny symbol, do not be disheartened. If you are the sort that loves a challenge, the short primer in the appendices may be all you need to kick-start your engines—a bit of tenacity will see you through the rest! For those who have never been introduced to machine learning, the first chapter was equally written for you as for those needing a refresher—it is your starter-kit to jump in feet first and find out what it's all about. You can augment your basics with any number of online resources. Finally, for those innocent of Java, here's a secret: many of the tools featured in the book have powerful GUIs. Some include wizard-like interfaces, making them quite easy to use, and do not require any knowledge of Java. So if you are new to Java, just skip the examples that need coding and learn to use the GUI-based tools instead!
This course is the right resource for anyone with some knowledge of Java programming who wants to get started with Data Science and Machine learning as quickly as possible. If you want to gain meaningful insights from big data and develop intelligent applications using Java, this course is also a must-have.
Feedback from our readers is always welcome. Let us know what you think about this course—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the course's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/repository-name. We also have other code bundles from our rich catalog of books, videos, and courses available at https://github.com/PacktPublishing/. Check them out!
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our courses—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this course. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the course in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this course, you can contact us at <[email protected]>, and we will do our best to address the problem.
Java for Data Science
Examine the techniques and Java tools supporting the growing field of data science
Data science is not a single science as much as it is a collection of various scientific disciplines integrated for the purpose of analyzing data. These disciplines include various statistical and mathematical techniques, including:
With the advent of cheaper storage technology, more and more data has been collected and stored permitting previously unfeasible processing and analysis of data. With this analysis came the need for various techniques to make sense of the data. These large sets of data, when used to analyze data and identify trends and patterns, become known as big data.
This in turn gave rise to cloud computing and concurrent techniques such as map-reduce, which distributed the analysis process across a large number of processors, taking advantage of the power of parallel processing.
The process of analyzing big data is not simple and evolves to the specialization of developers who were known as data scientists. Drawing upon a myriad of technologies and expertise, they are able to analyze data to solve problems that previously were either not envisioned or were too difficult to solve.
Early big data applications were typified by the emergence of search engines capable of more powerful and accurate searches than their predecessors. For example, AltaVista was an early popular search engine that was eventually superseded by Google. While big data applications were not limited to these search engine functionalities, these applications laid the groundwork for future work in big data.
The term, data science, has been used since 1974 and evolved over time to include statistical analysis of data. The concepts of data mining and data analytics have been associated with data science. Around 2008, the term data scientist appeared and was used to describe a person who performs data analysis. A more in-depth discussion of the history of data science can be found at http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#3d9ea08369fd.
This book aims to take a broad look at data science using Java and will briefly touch on many topics. It is likely that the reader may find topics of interest and pursue these at greater depth independently. The purpose of this book, however, is simply to introduce the reader to the significant data science topics and to illustrate how they can be addressed using Java.
There are many algorithms used in data science. In this book, we do not attempt to explain how they work except at an introductory level. Rather, we are more interested in explaining how they can be used to solve problems. Specifically, we are interested in knowing how they can be used with Java.
The various data science techniques that we will illustrate have been used to solve a variety of problems. Many of these techniques are motivated to achieve some economic gain, but they have also been used to solve many pressing social and environmental problems. Problem domains where these techniques have been used include finance, optimizing business processes, understanding customer needs, performing DNA analysis, foiling terrorist plots, and finding relationships between transactions to detect fraud, among many other data-intensive problems.
Data mining is a popular application area for data science. In this activity, large quantities of data are processed and analyzed to glean information about the dataset, to provide meaningful insights, and to develop meaningful conclusions and predictions. It has been used to analyze customer behavior, detecting relationships between what may appear to be unrelated events, and to make predictions about future behavior.
Machine learning is an important aspect of data science. This technique allows the computer to solve various problems without needing to be explicitly programmed. It has been used in self-driving cars, speech recognition, and in web searches. In data mining, the data is extracted and processed. With machine learning, computers use the data to take some sort of action.
Data science is concerned with the processing and analysis of large quantities of data to create models that can be used to make predictions or otherwise support a specific goal. This process often involves the building and training of models. The specific approach to solve a problem is dependent on the nature of the problem. However, in general, the following are the high-level tasks that are used in the analysis process:
Complementing this set of tasks is the need to develop applications that are efficient. The introduction of machines with multiple processors and GPUs contributes significantly to the end result.
While the exact steps used will vary by application, understanding these basic steps provides the basis for constructing solutions to many data science problems.
Java and its associated third-party libraries provide a range of support for the development of data science applications. There are numerous core Java capabilities that can be used, such as the basic string processing methods. The introduction of lambda expressions in Java 8 helps enable more powerful and expressive means of building applications. In many of the examples that follow in subsequent chapters, we will show alternative techniques using lambda expressions.
There is ample support provided for the basic data science tasks. These include multiple ways of acquiring data, libraries for cleaning data, and a wide variety of analysis approaches for tasks such as natural language processing and statistical analysis. There are also myriad of libraries supporting neural network types of analysis.
Java can be a very good choice for data science problems. The language provides both object-oriented and functional support for solving problems. There is a large developer community to draw upon and there exist multiple APIs that support data science tasks. These are but a few reasons as to why Java should be used.
The remainder of this chapter will provide an overview of the data science tasks and Java support demonstrated in the book. Each section is only able to present a brief introduction to the topics and the available support. The subsequent chapter will go into considerably more depth regarding these topics.
Data acquisition is an important step in the data analysis process. When data is acquired, it is often in a specialized form and its contents may be inconsistent or different from an application's need. There are many sources of data, which are found on the Internet. Several examples will be demonstrated in Chapter 2, Data Acquisition.
Data may be stored in a variety of formats. Popular formats for text data include HTML, Comma Separated Values (CSV), JavaScript Object Notation (JSON), and XML. Image and audio data are stored in a number of formats. However, it is frequently necessary to convert one data format into another format, typically plain text.
For example, JSON (http://www.JSON.org/) is stored using blocks of curly braces containing key-value pairs. In the following example, parts of a YouTube result is shown:
{ "kind": "youtube#searchResult", "etag": etag, "id": { "kind": string, "videoId": string, "channelId": string, "playlistId": string }, ... }Data is acquired using techniques such as processing live streams, downloading compressed files, and through screen scraping, where the information on a web page is extracted. Web crawling is a technique where a program examines a series of web pages, moving from one page to another, acquiring the data that it needs.
With many popular media sites, it is necessary to acquire a user ID and password to access data. A commonly used technique is OAuth, which is an open standard used to authenticate users to many different websites. The technique delegates access to a server resource and works over HTTPS. Several companies use OAuth 2.0, including PayPal, Facebook, Twitter, and Yelp.
The analysis of data often results in a series of numbers representing the results of the analysis. However, for most people, this way of expressing results is not always intuitive. A better way to understand the results is to create graphs and charts to depict the results and the relationship between the elements of the result.
The human mind is often good at seeing patterns, trends, and outliers in visual representation. The large amount of data present in many data science problems can be analyzed using visualization techniques. Visualization is appropriate for a wide range of audiences ranging from analysts to upper-level management to clientele. In this chapter, we present various visualization techniques and demonstrate how they are supported in Java.
In Chapter 4, Data Visualization, we illustrate how to create different types of graphs, plots, and charts. These examples use JavaFX using a free library called GRAL(http://trac.erichseifert.de/gral/).
Visualization allows users to examine large datasets in ways that provide insights that are not present in the mass of the data. Visualization tools helps us identify potential problems or unexpected data results and develop meaningful interpretations of the data.
For example, outliers, which are values that lie outside of the normal range of values, can be hard to spot from a sea of numbers. Creating a graph based on the data allows users to quickly see outliers. It can also help spot errors quickly and more easily classify data.
For example, the following chart might suggest that the upper two values should be outliers that need to be dealt with:
Machine learning has become increasingly important for data science analysis as it has been for a multitude of other fields. A defining characteristic of machine learning is the ability of a model to be trained on a set of representative data and then later used to solve similar problems. There is no need to explicitly program an application to solve the problem. A model is a representation of the real-world object.
For example, customer purchases can be used to train a model. Subsequently, predictions can be made about the types of purchases a customer might subsequently make. This allows an organization to tailor ads and coupons for a customer and potentially providing a better customer experience.
Training can be performed in one of several different approaches:
There are several approaches that support machine learning. In Chapter 6, Machine Learning, we will illustrate three techniques:
A Support Vector Machine (SVM) is used primarily for classification type problems. The approach creates a hyperplane to categorize data, which can be envisioned as a geometric plane that separates two regions. In a two-dimensional space, it will be a line. In a three-dimensional space, it will be a two-dimensional plane. In Chapter 6, Machine Learning, we will demonstrate how to use the approach using a set of data relating to the propensity of individuals to camp. We will use the Weka class, SMO, to demonstrate this type of analysis.
The following figure depicts a hyperplane using a distribution of two types of data points. The lines represent possible hyperplanes that separate these points. The lines clearly separate the data points except for one outlier.
Once the model has been trained, the possible hyperplanes are considered and predictions can then be made using similar data.
