Java for Data Science

Richard M. Reese

Description

Get the lowdown on Java and explore big data analytics with Java for Data Science. Packed with examples and data science principles, this book uncovers the techniques and Java tools supporting data science and machine learning.

The stability and power of Java combine with key data science concepts for effective exploration of data. By working with Java APIs and techniques, this data science book allows you to build applications and use analysis techniques centred on machine learning.

Java for Data Science gives you the understanding you need to examine the techniques and Java tools supporting big data analytics. These Java-based approaches allow you to tackle data mining and statistical analysis in detail. Deep learning and Java data mining are also featured, so you can explore and analyse data effectively and build intelligent applications using machine learning.

What's Inside:
- Understand data science principles with Java support
- Discover machine learning and deep learning essentials
- Explore data science problems with Java-based solutions




Table of Contents

Java for Data Science
Credits
About the Authors
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for 
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Data Science
Problems solved using data science
Understanding the data science problem-solving approach
Using Java to support data science
Acquiring data for an application
The importance and process of cleaning data
Visualizing data to enhance understanding
The use of statistical methods in data science
Machine learning applied to data science
Using neural networks in data science
Deep learning approaches
Performing text analysis
Visual and audio analysis
Improving application performance using parallel techniques
Assembling the pieces
Summary
2. Data Acquisition
Understanding the data formats used in data science applications
Overview of CSV data
Overview of spreadsheets
Overview of databases
Overview of PDF files
Overview of JSON
Overview of XML
Overview of streaming data
Overview of audio/video/images in Java
Data acquisition techniques
Using the HttpURLConnection class
Web crawlers in Java
Creating your own web crawler
Using the crawler4j web crawler
Web scraping in Java
Using API calls to access common social media sites
Using OAuth to authenticate users
Handling Twitter
Handling Wikipedia
Handling Flickr
Handling YouTube
Searching by keyword
Summary
3. Data Cleaning
Handling data formats
Handling CSV data
Handling spreadsheets
Handling Excel spreadsheets
Handling PDF files
Handling JSON
Using JSON streaming API
Using the JSON tree API
The nitty gritty of cleaning text
Using Java tokenizers to extract words
Java core tokenizers
Third-party tokenizers and libraries
Transforming data into a usable form
Simple text cleaning
Removing stop words
Finding words in text
Finding and replacing text
Data imputation
Subsetting data
Sorting text
Data validation
Validating data types
Validating dates
Validating e-mail addresses
Validating ZIP codes
Validating names
Cleaning images
Changing the contrast of an image
Smoothing an image
Brightening an image
Resizing an image
Converting images to different formats
Summary
4. Data Visualization
Understanding plots and graphs
Visual analysis goals
Creating index charts
Creating bar charts
Using country as the category
Using decade as the category
Creating stacked graphs
Creating pie charts
Creating scatter charts
Creating histograms
Creating donut charts
Creating bubble charts
Summary
5. Statistical Data Analysis Techniques
Working with mean, mode, and median
Calculating the mean
Using simple Java techniques to find mean
Using Java 8 techniques to find mean
Using Google Guava to find mean
Using Apache Commons to find mean
Calculating the median
Using simple Java techniques to find median
Using Apache Commons to find the median
Calculating the mode
Using ArrayLists to find multiple modes
Using a HashMap to find multiple modes
Using Apache Commons to find multiple modes
Standard deviation
Sample size determination
Hypothesis testing
Regression analysis
Using simple linear regression
Using multiple regression
Summary
6. Machine Learning
Supervised learning techniques
Decision trees
Decision tree types
Decision tree libraries
Using a decision tree with a book dataset
Testing the book decision tree
Support vector machines
Using an SVM for camping data
Testing individual instances
Bayesian networks
Using a Bayesian network
Unsupervised machine learning
Association rule learning
Using association rule learning to find buying relationships
Reinforcement learning
Summary
7. Neural Networks
Training a neural network
Getting started with neural network architectures
Understanding static neural networks
A basic Java example
Understanding dynamic neural networks
Multilayer perceptron networks
Building the model
Evaluating the model
Predicting other values
Saving and retrieving the model
Learning vector quantization
Self-Organizing Maps
Using a SOM
Displaying the SOM results
Additional network architectures and algorithms
The k-Nearest Neighbors algorithm
Instantaneously trained networks
Spiking neural networks
Cascading neural networks
Holographic associative memory
Backpropagation and neural networks
Summary
8. Deep Learning
Deeplearning4j architecture
Acquiring and manipulating data
Reading in a CSV file
Configuring and building a model
Using hyperparameters in ND4J
Instantiating the network model
Training a model
Testing a model
Deep learning and regression analysis
Preparing the data
Setting up the class
Reading and preparing the data
Building the model
Evaluating the model
Restricted Boltzmann Machines
Reconstruction in an RBM
Configuring an RBM
Deep autoencoders
Building an autoencoder in DL4J
Configuring the network
Building and training the network
Saving and retrieving a network
Specialized autoencoders
Convolutional networks
Building the model
Evaluating the model
Recurrent Neural Networks
Summary
9. Text Analysis
Implementing named entity recognition
Using OpenNLP to perform NER
Identifying location entities
Classifying text
Word2Vec and Doc2Vec
Classifying text by labels
Classifying text by similarity
Understanding tagging and POS
Using OpenNLP to identify POS
Understanding POS tags
Extracting relationships from sentences
Using OpenNLP to extract relationships
Sentiment analysis
Downloading and extracting the Word2Vec model
Building our model and classifying text
Summary
10. Visual and Audio Analysis
Text-to-speech
Using FreeTTS
Getting information about voices
Gathering voice information
Understanding speech recognition
Using CMUSphinx to convert speech to text
Obtaining more detail about the words
Extracting text from an image
Using Tess4j to extract text
Identifying faces
Using OpenCV to detect faces
Classifying visual data
Creating a Neuroph Studio project for classifying visual images
Training the model
Summary
11. Mathematical and Parallel Techniques for Data Analysis
Implementing basic matrix operations
Using GPUs with DeepLearning4j
Using map-reduce
Using Apache's Hadoop to perform map-reduce
Writing the map method
Writing the reduce method
Creating and executing a new Hadoop job
Various mathematical libraries
Using the jblas API
Using the Apache Commons math API
Using the ND4J API
Using OpenCL
Using Aparapi
Creating an Aparapi application
Using Aparapi for matrix multiplication
Using Java 8 streams
Understanding Java 8 lambda expressions and streams
Using Java 8 to perform matrix multiplication
Using Java 8 to perform map-reduce
Summary
12. Bringing It All Together
Defining the purpose and scope of our application
Understanding the application's architecture
Data acquisition using Twitter
Understanding the TweetHandler class
Extracting data for a sentiment analysis model
Building the sentiment model
Processing the JSON input
Cleaning data to improve our results
Removing stop words
Performing sentiment analysis
Analyzing the results
Other optional enhancements
Summary

Java for Data Science

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: January 2017

Production reference: 1050117

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham 

B3 2PB, UK.

ISBN 978-1-78528-011-5

www.packtpub.com

Credits

Authors

Richard M. Reese

Jennifer L. Reese

Copy Editors

Vikrant Phadkay

Safis Editing

Reviewers

Walter Molina

Shilpi Saxena

Project Coordinator

 Nidhi Joshi

Commissioning Editor

Veena Pagare

Proofreader

Safis Editing

Acquisition Editor

Tushar Gupta

Indexer

Aishwarya Gangawane

Content Development Editor

Aishwarya Pandere

Graphics

Disha Haria

Technical Editor

Suwarna Patil

Production Coordinator

Nilesh Mohite

About the Authors

Richard M. Reese has worked in both industry and academics. For 17 years, he worked in the telephone and aerospace industries, serving in several capacities, including research and development, software development, supervision, and training. He currently teaches at Tarleton State University, where he has the opportunity to apply his years of industry experience to enhance his teaching.

Richard has written several Java books and a book on C pointers. He uses a concise and easy-to-follow approach to the topics at hand. His Java books have addressed EJB 3.1, updates to Java 7 and 8, certification, jMonkeyEngine, natural language processing, functional programming, and networks.

Richard would like to thank his wife, Karla, for her continued support, and the staff of Packt Publishing for their work in making this a better book.

Jennifer L. Reese studied computer science at Tarleton State University. She also earned her M.Ed. from Tarleton in December 2016. She currently teaches computer science to high-school students. Her research interests include the integration of computer science concepts with other academic disciplines, increasing diversity in computer science courses, and the application of data science to the field of education.

She previously worked as a software engineer developing software for county- and district-level government offices throughout Texas. In her free time she enjoys reading, cooking, and traveling—especially to any destination with a beach. She is a musician and appreciates a variety of musical genres.

I would like to thank Dad for his inspiration and guidance, Mom for her patience and perspective, and Jace for his support and always believing in me.

About the Reviewers

Walter Molina is a UI and UX developer from Villa Mercedes, San Luis, Argentina. His skills include, but are not limited to, HTML5, CSS3, and JavaScript. He uses these technologies at a Jedi/ninja level (along with a plethora of JavaScript libraries) in his daily work as a frontend developer at Tachuso, a creative content agency. He holds a bachelor's degree in computer science and is a member of the School of Engineering at the local National University, where he teaches programming skills to second- and third-year students. His LinkedIn profile is https://ar.linkedin.com/in/waltermolina.

Shilpi Saxena is an IT professional and also a technology evangelist. She is an engineer who has had exposure to various domains (the IoT and cloud computing space, healthcare, telecom, hiring, and manufacturing). She has experience in all aspects of the conception and execution of enterprise solutions. She has been architecting, managing, and delivering solutions in the big data space for the last 3 years; she also handles a high-performance and geographically distributed team of elite engineers.

Shilpi has more than 14 years (3 years in the big data space) of experience in the development and execution of various facets of enterprise solutions both in the products and services dimensions of the software industry. An engineer by degree and profession, she has worn various hats, such as developer, technical leader, product owner, tech manager, and so on, and has seen all the flavors that the industry has to offer. She has architected and worked through some of the pioneers' production implementations in big data on Storm and Impala with autoscaling in AWS.

Shilpi has also authored Real-time Analytics with Storm and Cassandra (https://www.packtpub.com/big-data-and-business-intelligence/learning-real-time-analytics-storm-and-cassandra) and Real-Time Big Data Analytics (https://www.packtpub.com/big-data-and-business-intelligence/real-time-big-data-analytics) with Packt Publishing.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

- Fully searchable across every book published by Packt
- Copy and paste, print, and bookmark content
- On demand and accessible via a web browser

Customer Feedback

Thank you for purchasing this Packt book. We take our commitment to improving our content and products to meet your needs seriously—that's why your feedback is so valuable. Whatever your feelings about your purchase, please consider leaving a review on this book's Amazon page. Not only will this help us, but, more importantly, it will also help others in the community make an informed decision about the resources they invest in to learn.

You can also review for us on a regular basis by joining our reviewers' club. If you're interested in joining, or would like to learn more about the benefits we offer, please contact us: [email protected].

Preface

In this book, we examine Java-based approaches to the field of data science. Data science is a broad topic and includes such subtopics as data mining, statistical analysis, audio and video analysis, and text analysis. A number of Java APIs provide support for these topics. The ability to apply these specific techniques allows for the creation of new, innovative applications able to handle the vast amounts of data available for analysis.

This book takes an expansive yet cursory approach to various aspects of data science. A brief introduction to the field is presented in the first chapter. Subsequent chapters cover significant aspects of data science, such as data cleaning and the application of neural networks. The last chapter combines topics discussed throughout the book to create a comprehensive data science application.

What this book covers

Chapter 1, Getting Started with Data Science, provides an introduction to the technologies covered by the book. A brief explanation of each technology is given, followed by a short overview and demonstration of the support Java provides.

Chapter 2, Data Acquisition, demonstrates how to acquire data from a number of sources, including Twitter, Wikipedia, and YouTube. The first step of a data science application is to acquire data.

Chapter 3, Data Cleaning, explains that once data has been acquired, it needs to be cleaned. This can involve such activities as removing stop words, validating the data, and converting the data.

Chapter 4, Data Visualization, shows that while numerical processing is a critical step in many data science tasks, people often prefer visual depictions of the results of analysis. This chapter demonstrates various Java approaches to this task.

Chapter 5, Statistical Data Analysis Techniques, reviews basic statistical techniques, including regression analysis, and demonstrates how various Java APIs provide statistical support. Statistical analysis is key to many data analysis tasks.

Chapter 6, Machine Learning, covers several machine learning algorithms, including decision trees and support vector machines. The abundance of available data provides an opportunity to apply machine learning techniques.

Chapter 7, Neural Networks, explains that neural networks can be applied to solve a variety of data science problems. In this chapter, we explain how they work and demonstrate the use of several different types of neural networks.

Chapter 8, Deep Learning, shows that deep learning algorithms are often described as multilevel neural networks. Java provides significant support in this area, and we will illustrate the use of this approach.

Chapter 9, Text Analysis, explains that significant portions of available datasets exist in textual formats. The field of natural language processing has advanced considerably and is frequently used in data science applications. We demonstrate various Java APIs used to support this type of analysis.

Chapter 10, Visual and Audio Analysis, shows that data science is not restricted to text processing. Many social media sites use visual data extensively. This chapter illustrates the Java support available for this type of analysis.

Chapter 11, Mathematical and Parallel Techniques for Data Analysis, investigates the support provided for low-level mathematical operations and how they can be performed in a multi-processor environment. Data analysis, at its heart, necessitates the ability to manipulate and analyze large quantities of numeric data.

Chapter 12, Bringing It All Together, examines how the integration of the various technologies introduced in this book can be used to create a data science application. This chapter begins with data acquisition and incorporates many of the techniques used in subsequent chapters to build a complete application.

What you need for this book

Many of the examples in the book use Java 8 features. There are a number of Java APIs demonstrated, each of which is introduced before it is applied. An IDE is not required but is desirable.

Who this book is for 

This book is aimed at experienced Java programmers who are interested in gaining a better understanding of the field of data science and how Java supports the underlying techniques. No prior experience in the field is needed.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

- WinRAR / 7-Zip for Windows
- Zipeg / iZip / UnRarX for Mac
- 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Java-for-Data-Science. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Chapter 1. Getting Started with Data Science

Data science is not a single science so much as it is a collection of various scientific disciplines integrated for the purpose of analyzing data. In addition to various statistical and mathematical techniques, these disciplines include:

- Computer science
- Data engineering
- Visualization
- Domain-specific knowledge and approaches

With the advent of cheaper storage technology, more and more data has been collected and stored, permitting previously unfeasible processing and analysis. With this analysis came the need for various techniques to make sense of the data. These large sets of data, when analyzed to identify trends and patterns, have become known as big data.

This, in turn, gave rise to cloud computing and concurrent techniques such as map-reduce, which distribute the analysis process across a large number of processors, taking advantage of the power of parallel processing.

The process of analyzing big data is not simple and has led to the emergence of a class of specialized developers known as data scientists. Drawing upon a myriad of technologies and expertise, they are able to analyze data to solve problems that previously were either not envisioned or were too difficult to solve.

Early big data applications were typified by the emergence of search engines capable of more powerful and accurate searches than their predecessors. For example, AltaVista was an early popular search engine that was eventually superseded by Google. While big data applications were not limited to these search engine functionalities, these applications laid the groundwork for future work in big data.

The term data science has been used since 1974 and has evolved over time to include the statistical analysis of data. The concepts of data mining and data analytics have been associated with data science. Around 2008, the term data scientist appeared and was used to describe a person who performs data analysis. A more in-depth discussion of the history of data science can be found at http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#3d9ea08369fd.

This book aims to take a broad look at data science using Java and will briefly touch on many topics. It is likely that the reader may find topics of interest and pursue these at greater depth independently. The purpose of this book, however, is simply to introduce the reader to the significant data science topics and to illustrate how they can be addressed using Java.

There are many algorithms used in data science. In this book, we do not attempt to explain how they work except at an introductory level. Rather, we are more interested in explaining how they can be used to solve problems. Specifically, we are interested in knowing how they can be used with Java.

Problems solved using data science

The various data science techniques that we will illustrate have been used to solve a variety of problems. Many of these techniques are motivated by the prospect of economic gain, but they have also been used to solve many pressing social and environmental problems. Problem domains where these techniques have been applied include finance, optimizing business processes, understanding customer needs, performing DNA analysis, foiling terrorist plots, and finding relationships between transactions to detect fraud, among many other data-intensive problems.

Data mining is a popular application area for data science. In this activity, large quantities of data are processed and analyzed to glean information about the dataset, provide meaningful insights, and develop conclusions and predictions. It has been used to analyze customer behavior, to detect relationships between what may appear to be unrelated events, and to make predictions about future behavior.

Machine learning is an important aspect of data science. This technique allows the computer to solve various problems without needing to be explicitly programmed. It has been used in self-driving cars, speech recognition, and web searches. In data mining, the data is extracted and processed. With machine learning, computers use the data to take some sort of action.

Understanding the data science problem-solving approach

Data science is concerned with the processing and analysis of large quantities of data to create models that can be used to make predictions or otherwise support a specific goal. This process often involves the building and training of models. The specific approach used to solve a problem depends on the nature of the problem. However, in general, the following are the high-level tasks used in the analysis process:

- Acquiring the data: Before we can process the data, it must be acquired. The data is frequently stored in a variety of formats and will come from a wide range of data sources.
- Cleaning the data: Once the data has been acquired, it often needs to be converted to a different format before it can be used. In addition, the data needs to be processed, or cleaned, so as to remove errors, resolve inconsistencies, and otherwise put it in a form ready for analysis.
- Analyzing the data: This can be performed using a number of techniques, including:
  - Statistical analysis: This uses a multitude of statistical approaches to provide insight into data. It includes simple techniques and more advanced techniques such as regression analysis.
  - AI analysis: These can be grouped as machine learning, neural network, and deep learning techniques:
    - Machine learning approaches are characterized by programs that can learn without being specifically programmed to complete a specific task
    - Neural networks are built around models patterned after the neural connections of the brain
    - Deep learning attempts to identify higher levels of abstraction within a set of data
  - Text analysis: This is a common form of analysis, which works with natural languages to identify features such as the names of people and places, the relationships between parts of the text, and the implied meaning of the text.
  - Data visualization: This is an important analysis tool. By displaying the data in a visual form, a hard-to-understand set of numbers can be more readily understood.
  - Video, image, and audio processing and analysis: This is a more specialized form of analysis, which is becoming more common as better analysis techniques are discovered and faster processors become available. This is in contrast to the more common text processing and analysis tasks.

Complementing this set of tasks is the need to develop applications that are efficient. The introduction of machines with multiple processors and GPUs contributes significantly to the end result.

While the exact steps used will vary by application, understanding these basic steps provides the basis for constructing solutions to many data science problems.

Using Java to support data science

Java and its associated third-party libraries provide a range of support for the development of data science applications. There are numerous core Java capabilities that can be used, such as the basic string processing methods. The introduction of lambda expressions in Java 8 enables more powerful and expressive ways of building applications. In many of the examples in subsequent chapters, we will show alternative techniques using lambda expressions.
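As a small, self-contained sketch of what this looks like in practice (the word list below is invented purely for illustration), a stream pipeline can filter and normalize a collection of strings in a few lines:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LambdaExample {
    public static void main(String[] args) {
        // A small, illustrative list of raw tokens (assumed data)
        List<String> words = Arrays.asList("Data", "science", "", "with", "JAVA");

        // Filter out empty strings and normalize case using a stream pipeline
        List<String> cleaned = words.stream()
                .filter(word -> !word.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.toList());

        System.out.println(cleaned);   // [data, science, with, java]
    }
}

The same processing written with explicit loops and conditionals would be noticeably longer and harder to read, which is why lambda-based alternatives appear throughout the later chapters.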

There is ample support provided for the basic data science tasks. These include multiple ways of acquiring data, libraries for cleaning data, and a wide variety of analysis approaches for tasks such as natural language processing and statistical analysis. There are also myriad libraries supporting neural network types of analysis.

Java can be a very good choice for data science problems. The language provides both object-oriented and functional support for solving problems. There is a large developer community to draw upon and there exist multiple APIs that support data science tasks. These are but a few reasons as to why Java should be used.

The remainder of this chapter will provide an overview of the data science tasks and Java support demonstrated in the book. Each section is only able to present a brief introduction to a topic and the available support. Subsequent chapters will go into considerably more depth regarding these topics.

Acquiring data for an application

Data acquisition is an important step in the data analysis process. When data is acquired, it is often in a specialized form and its contents may be inconsistent with, or different from, an application's needs. Many sources of data are found on the Internet. Several examples will be demonstrated in Chapter 2, Data Acquisition.

Data may be stored in a variety of formats. Popular formats for text data include HTML, Comma Separated Values (CSV), JavaScript Object Notation (JSON), and XML. Image and audio data are stored in a number of formats. However, it is frequently necessary to convert one data format into another format, typically plain text.

For example, JSON (http://www.JSON.org/) is stored using blocks of curly braces containing key-value pairs. In the following example, part of a YouTube search result is shown:

{
  "kind": "youtube#searchResult",
  "etag": etag,
  "id": {
    "kind": string,
    "videoId": string,
    "channelId": string,
    "playlistId": string
  },
  ...
}
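Core Java does not parse JSON on its own, but several libraries do. As one possible sketch (the choice of the Jackson library and the literal field values are assumptions made only for this example), a result of this shape can be read into a tree and navigated by key:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonSketch {
    public static void main(String[] args) throws Exception {
        // A fragment of a YouTube-style search result (illustrative values)
        String json = "{\"kind\": \"youtube#searchResult\", "
                + "\"id\": {\"kind\": \"youtube#video\", \"videoId\": \"abc123\"}}";

        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(json);

        // Navigate the tree using the key names from the result above
        System.out.println(root.get("kind").asText());                // youtube#searchResult
        System.out.println(root.path("id").path("videoId").asText()); // abc123
    }
}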

Data is acquired using techniques such as processing live streams, downloading compressed files, and screen scraping, where the information on a web page is extracted. Web crawling is a technique where a program examines a series of web pages, moving from one page to another and acquiring the data that it needs.
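At the lowest level, much of this comes down to fetching the contents of a URL. The following is a minimal sketch using the core HttpURLConnection class; the Wikipedia URL is only an illustrative target and could be replaced by any publicly reachable page:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class PageDownloader {
    public static void main(String[] args) throws Exception {
        // Illustrative URL; any publicly reachable page could be used
        URL url = new URL("https://en.wikipedia.org/wiki/Data_science");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");

        // Read the response body line by line
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        connection.disconnect();
    }
}

The downloaded HTML would then be parsed or scraped for the values of interest, which is the subject of Chapter 2.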

With many popular media sites, it is necessary to acquire a user ID and password to access data. A commonly used technique is OAuth, which is an open standard used to authenticate users to many different websites. The technique delegates access to a server resource and works over HTTPS. Several companies use OAuth 2.0, including PayPal, Facebook, Twitter, and Yelp.

Visualizing data to enhance understanding

The analysis of data often results in a series of numbers representing the results of the analysis. However, for most people, this way of expressing results is not always intuitive. A better way to understand the results is to create graphs and charts to depict the results and the relationship between the elements of the result.

The human mind is often good at seeing patterns, trends, and outliers in visual representations. The large amount of data present in many data science problems can be analyzed using visualization techniques. Visualization is appropriate for a wide range of audiences, from analysts to upper-level management to clientele. In this chapter, we present various visualization techniques and demonstrate how they are supported in Java.

In Chapter 4, Data Visualization, we illustrate how to create different types of graphs, plots, and charts. These examples use JavaFX and a free library called GRAL (http://trac.erichseifert.de/gral/).

Visualization allows users to examine large datasets in ways that provide insights that are not present in the mass of the data. Visualization tools help us identify potential problems or unexpected data results and develop meaningful interpretations of the data.

For example, outliers, values that lie outside the normal range, can be hard to spot in a sea of numbers. Creating a graph based on the data allows users to see outliers quickly. It can also help us spot errors and classify data more easily.

For example, a chart such as the one produced below might suggest that the two highest values are outliers that need to be dealt with.
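As a rough illustration, a chart of this kind can be built with JavaFX's charting API; in the following sketch, the data values, including the two unusually high points, are invented purely for demonstration:

import javafx.application.Application;
import javafx.scene.Scene;
import javafx.scene.chart.NumberAxis;
import javafx.scene.chart.ScatterChart;
import javafx.scene.chart.XYChart;
import javafx.stage.Stage;

public class OutlierChart extends Application {
    @Override
    public void start(Stage stage) {
        NumberAxis xAxis = new NumberAxis();
        NumberAxis yAxis = new NumberAxis();
        ScatterChart<Number, Number> chart = new ScatterChart<>(xAxis, yAxis);

        // Illustrative data: most values cluster near 10, two lie well above
        XYChart.Series<Number, Number> series = new XYChart.Series<>();
        double[] values = {9.2, 10.1, 9.8, 10.4, 9.5, 42.0, 10.0, 47.5};
        for (int i = 0; i < values.length; i++) {
            series.getData().add(new XYChart.Data<>(i + 1, values[i]));
        }
        chart.getData().add(series);

        stage.setScene(new Scene(chart, 600, 400));
        stage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}

When the chart is displayed, the two large values stand out immediately from the cluster of points near 10, which is exactly the kind of insight that is difficult to obtain from the raw numbers alone.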

Machine learning applied to data science

Machine learning has become increasingly important for data science analysis, as it has for a multitude of other fields. A defining characteristic of machine learning is the ability of a model to be trained on a set of representative data and then later used to solve similar problems. There is no need to explicitly program an application to solve the problem. A model is a representation of a real-world object.

For example, customer purchases can be used to train a model. Predictions can then be made about the types of purchases a customer might make in the future. This allows an organization to tailor ads and coupons for a customer, potentially providing a better customer experience.

Training can be performed using one of several different approaches:

- Supervised learning: The model is trained with annotated, labeled data showing corresponding correct results
- Unsupervised learning: The data does not contain results, but the model is expected to find relationships on its own
- Semi-supervised learning: A small amount of labeled data is combined with a larger amount of unlabeled data
- Reinforcement learning: This is similar to supervised learning, but a reward is provided for good results

There are several approaches that support machine learning. In Chapter 6, Machine Learning, we will illustrate three techniques:

- Decision trees: A tree is constructed using features of the problem as internal nodes and the results as leaves
- Support vector machines: These are used for classification by creating a hyperplane that partitions the dataset and then making predictions
- Bayesian networks: These are used to depict probabilistic relationships between events

A Support Vector Machine (SVM) is used primarily for classification-type problems. The approach creates a hyperplane to categorize data, which can be envisioned as a geometric plane that separates two regions. In a two-dimensional space, it will be a line. In a three-dimensional space, it will be a two-dimensional plane. In Chapter 6, Machine Learning, we will demonstrate the approach using a set of data relating to the propensity of individuals to camp. We will use the Weka SMO class to perform this type of analysis.

The following figure depicts a hyperplane using a distribution of two types of data points. The lines represent possible hyperplanes that separate these points. The lines clearly separate the data points except for one outlier.

Once the model has been trained, the possible hyperplanes are considered and predictions can then be made using similar data.
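To give a flavor of what that Weka-based workflow looks like, here is a rough sketch; the camping.arff file name and the assumption that the class attribute is the last column are placeholders standing in for the dataset used in Chapter 6:

import weka.classifiers.functions.SMO;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CampingSvm {
    public static void main(String[] args) throws Exception {
        // Load the camping dataset (file name is an assumption; ARFF format)
        DataSource source = new DataSource("camping.arff");
        Instances data = source.getDataSet();
        // Treat the last attribute as the class to predict (assumed layout)
        data.setClassIndex(data.numAttributes() - 1);

        // Train an SVM classifier using Weka's SMO implementation
        SMO smo = new SMO();
        smo.buildClassifier(data);

        // Classify one instance from the dataset as a simple sanity check
        Instance first = data.instance(0);
        double predicted = smo.classifyInstance(first);
        System.out.println("Predicted class: "
                + data.classAttribute().value((int) predicted));
    }
}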