Java for Data Science

Richard M. Reese

Description

Get the lowdown on Java and explore big data analytics with Java for Data Science. Packed with examples and data science principles, this book uncovers the techniques and Java tools supporting data science and machine learning.

The stability and power of Java combine with key data science concepts for effective exploration of data. By working with Java APIs and techniques, this data science book allows you to build applications and use analysis techniques centred on machine learning.

Java for Data Science gives you the understanding you need to examine the techniques and Java tools supporting big data analytics. These Java-based approaches allow you to tackle data mining and statistical analysis in detail. Deep learning and Java data mining are also featured, so you can explore and analyse data effectively and build intelligent applications using machine learning.

What's Inside:
- Understand data science principles with Java support
- Discover machine learning and deep learning essentials
- Explore data science problems with Java-based solutions




Table of Contents

Java for Data Science
Credits
About the Authors
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for 
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Data Science
Problems solved using data science
Understanding the data science problem-solving approach
Using Java to support data science
Acquiring data for an application
The importance and process of cleaning data
Visualizing data to enhance understanding
The use of statistical methods in data science
Machine learning applied to data science
Using neural networks in data science
Deep learning approaches
Performing text analysis
Visual and audio analysis
Improving application performance using parallel techniques
Assembling the pieces
Summary
2. Data Acquisition
Understanding the data formats used in data science applications
Overview of CSV data
Overview of spreadsheets
Overview of databases
Overview of PDF files
Overview of JSON
Overview of XML
Overview of streaming data
Overview of audio/video/images in Java
Data acquisition techniques
Using the HttpURLConnection class
Web crawlers in Java
Creating your own web crawler
Using the crawler4j web crawler
Web scraping in Java
Using API calls to access common social media sites
Using OAuth to authenticate users
Handling Twitter
Handling Wikipedia
Handling Flickr
Handling YouTube
Searching by keyword
Summary
3. Data Cleaning
Handling data formats
Handling CSV data
Handling spreadsheets
Handling Excel spreadsheets
Handling PDF files
Handling JSON
Using JSON streaming API
Using the JSON tree API
The nitty gritty of cleaning text
Using Java tokenizers to extract words
Java core tokenizers
Third-party tokenizers and libraries
Transforming data into a usable form
Simple text cleaning
Removing stop words
Finding words in text
Finding and replacing text
Data imputation
Subsetting data
Sorting text
Data validation
Validating data types
Validating dates
Validating e-mail addresses
Validating ZIP codes
Validating names
Cleaning images
Changing the contrast of an image
Smoothing an image
Brightening an image
Resizing an image
Converting images to different formats
Summary
4. Data Visualization
Understanding plots and graphs
Visual analysis goals
Creating index charts
Creating bar charts
Using country as the category
Using decade as the category
Creating stacked graphs
Creating pie charts
Creating scatter charts
Creating histograms
Creating donut charts
Creating bubble charts
Summary
5. Statistical Data Analysis Techniques
Working with mean, mode, and median
Calculating the mean
Using simple Java techniques to find mean
Using Java 8 techniques to find mean
Using Google Guava to find mean
Using Apache Commons to find mean
Calculating the median
Using simple Java techniques to find median
Using Apache Commons to find the median
Calculating the mode
Using ArrayLists to find multiple modes
Using a HashMap to find multiple modes
Using Apache Commons to find multiple modes
Standard deviation
Sample size determination
Hypothesis testing
Regression analysis
Using simple linear regression
Using multiple regression
Summary
6. Machine Learning
Supervised learning techniques
Decision trees
Decision tree types
Decision tree libraries
Using a decision tree with a book dataset
Testing the book decision tree
Support vector machines
Using an SVM for camping data
Testing individual instances
Bayesian networks
Using a Bayesian network
Unsupervised machine learning
Association rule learning
Using association rule learning to find buying relationships
Reinforcement learning
Summary
7. Neural Networks
Training a neural network
Getting started with neural network architectures
Understanding static neural networks
A basic Java example
Understanding dynamic neural networks
Multilayer perceptron networks
Building the model
Evaluating the model
Predicting other values
Saving and retrieving the model
Learning vector quantization
Self-Organizing Maps
Using a SOM
Displaying the SOM results
Additional network architectures and algorithms
The k-Nearest Neighbors algorithm
Instantaneously trained networks
Spiking neural networks
Cascading neural networks
Holographic associative memory
Backpropagation and neural networks
Summary
8. Deep Learning
Deeplearning4j architecture
Acquiring and manipulating data
Reading in a CSV file
Configuring and building a model
Using hyperparameters in ND4J
Instantiating the network model
Training a model
Testing a model
Deep learning and regression analysis
Preparing the data
Setting up the class
Reading and preparing the data
Building the model
Evaluating the model
Restricted Boltzmann Machines
Reconstruction in an RBM
Configuring an RBM
Deep autoencoders
Building an autoencoder in DL4J
Configuring the network
Building and training the network
Saving and retrieving a network
Specialized autoencoders
Convolutional networks
Building the model
Evaluating the model
Recurrent Neural Networks
Summary
9. Text Analysis
Implementing named entity recognition
Using OpenNLP to perform NER
Identifying location entities
Classifying text
Word2Vec and Doc2Vec
Classifying text by labels
Classifying text by similarity
Understanding tagging and POS
Using OpenNLP to identify POS
Understanding POS tags
Extracting relationships from sentences
Using OpenNLP to extract relationships
Sentiment analysis
Downloading and extracting the Word2Vec model
Building our model and classifying text
Summary
10. Visual and Audio Analysis
Text-to-speech
Using FreeTTS
Getting information about voices
Gathering voice information
Understanding speech recognition
Using CMUSphinx to convert speech to text
Obtaining more detail about the words
Extracting text from an image
Using Tess4j to extract text
Identifying faces
Using OpenCV to detect faces
Classifying visual data
Creating a Neuroph Studio project for classifying visual images
Training the model
Summary
11. Mathematical and Parallel Techniques for Data Analysis
Implementing basic matrix operations
Using GPUs with DeepLearning4j
Using map-reduce
Using Apache's Hadoop to perform map-reduce
Writing the map method
Writing the reduce method
Creating and executing a new Hadoop job
Various mathematical libraries
Using the jblas API
Using the Apache Commons math API
Using the ND4J API
Using OpenCL
Using Aparapi
Creating an Aparapi application
Using Aparapi for matrix multiplication
Using Java 8 streams
Understanding Java 8 lambda expressions and streams
Using Java 8 to perform matrix multiplication
Using Java 8 to perform map-reduce
Summary
12. Bringing It All Together
Defining the purpose and scope of our application
Understanding the application's architecture
Data acquisition using Twitter
Understanding the TweetHandler class
Extracting data for a sentiment analysis model
Building the sentiment model
Processing the JSON input
Cleaning data to improve our results
Removing stop words
Performing sentiment analysis
Analyzing the results
Other optional enhancements
Summary

Java for Data Science

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: January 2017

Production reference: 1050117

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham 

B3 2PB, UK.

ISBN 978-1-78528-011-5

www.packtpub.com

Credits

Authors

Richard M. Reese

Jennifer L. Reese

Copy Editors

Vikrant Phadkay

Safis Editing

Reviewers

Walter Molina

Shilpi Saxena

Project Coordinator

 Nidhi Joshi

Commissioning Editor

Veena Pagare

Proofreader

Safis Editing

Acquisition Editor

Tushar Gupta

Indexer

Aishwarya Gangawane

Content Development Editor

Aishwarya Pandere

Graphics

Disha Haria

Technical Editor

Suwarna Patil

Production Coordinator

Nilesh Mohite

About the Authors

Richard M. Reese has worked in both industry and academics. For 17 years, he worked in the telephone and aerospace industries, serving in several capacities, including research and development, software development, supervision, and training. He currently teaches at Tarleton State University, where he has the opportunity to apply his years of industry experience to enhance his teaching.

Richard has written several Java books and a book on C pointers. He uses a concise and easy-to-follow approach to the topics at hand. His Java books have addressed EJB 3.1, updates to Java 7 and 8, certification, jMonkeyEngine, natural language processing, functional programming, and networks.

Richard would like to thank his wife, Karla, for her continued support, and the staff of Packt Publishing for their work in making this a better book.

Jennifer L. Reese studied computer science at Tarleton State University. She also earned her M.Ed. from Tarleton in December 2016. She currently teaches computer science to high-school students. Her research interests include the integration of computer science concepts with other academic disciplines, increasing diversity in computer science courses, and the application of data science to the field of education.

She previously worked as a software engineer developing software for county- and district-level government offices throughout Texas. In her free time she enjoys reading, cooking, and traveling—especially to any destination with a beach. She is a musician and appreciates a variety of musical genres.

I would like to thank Dad for his inspiration and guidance, Mom for her patience and perspective, and Jace for his support and always believing in me.

About the Reviewers

Walter Molina is a UI and UX developer from Villa Mercedes, San Luis, Argentina. His skills include, but are not limited to, HTML5, CSS3, and JavaScript. He uses these technologies at a Jedi/ninja level (along with a plethora of JavaScript libraries) in his daily work as a frontend developer at Tachuso, a creative content agency. He holds a bachelor's degree in computer science and is a member of the School of Engineering at the local National University, where he teaches programming skills to second- and third-year students. His LinkedIn profile is https://ar.linkedin.com/in/waltermolina.

Shilpi Saxena is an IT professional and also a technology evangelist. She is an engineer who has had exposure to various domains (the IoT and cloud computing space, healthcare, telecom, hiring, and manufacturing). She has experience in all aspects of the conception and execution of enterprise solutions. She has been architecting, managing, and delivering solutions in the big data space for the last 3 years; she also handles a high-performance and geographically distributed team of elite engineers.

Shilpi has more than 14 years (3 years in the big data space) of experience in the development and execution of various facets of enterprise solutions both in the products and services dimensions of the software industry. An engineer by degree and profession, she has worn various hats, such as developer, technical leader, product owner, tech manager, and so on, and has seen all the flavors that the industry has to offer. She has architected and worked through some of the pioneers' production implementations in big data on Storm and Impala with autoscaling in AWS.

Shilpi has also authored Real-time Analytics with Storm and Cassandra (https://www.packtpub.com/big-data-and-business-intelligence/learning-real-time-analytics-storm-and-cassandra) and Real-Time Big Data Analytics (https://www.packtpub.com/big-data-and-business-intelligence/real-time-big-data-analytics) with Packt Publishing.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

- Fully searchable across every book published by Packt
- Copy and paste, print, and bookmark content
- On demand and accessible via a web browser

Customer Feedback

Thank you for purchasing this Packt book. We take our commitment to improving our content and products to meet your needs seriously—that's why your feedback is so valuable. Whatever your feelings about your purchase, please consider leaving a review on this book's Amazon page. Not only will this help us, but, more importantly, it will also help others in the community make an informed decision about the resources they invest in to learn.

You can also review for us on a regular basis by joining our reviewers' club. If you're interested in joining, or would like to learn more about the benefits we offer, please contact us: [email protected].

Preface

In this book, we examine Java-based approaches to the field of data science. Data science is a broad topic and includes such subtopics as data mining, statistical analysis, audio and video analysis, and text analysis. A number of Java APIs provide support for these topics. The ability to apply these specific techniques allows for the creation of new, innovative applications able to handle the vast amounts of data available for analysis.

This book takes an expansive yet cursory approach to various aspects of data science. A brief introduction to the field is presented in the first chapter. Subsequent chapters cover significant aspects of data science, such as data cleaning and the application of neural networks. The last chapter combines topics discussed throughout the book to create a comprehensive data science application.

What this book covers

Chapter 1, Getting Started with Data Science, provides an introduction to the technologies covered by the book. A brief explanation of each technology is given, followed by a short overview and demonstration of the support Java provides.

Chapter 2, Data Acquisition, demonstrates how to acquire data from a number of sources, including Twitter, Wikipedia, and YouTube. The first step of a data science application is to acquire data.

Chapter 3, Data Cleaning, explains that once data has been acquired, it needs to be cleaned. This can involve such activities as removing stop words, validating the data, and converting the data.

Chapter 4, Data Visualization, shows that while numerical processing is a critical step in many data science tasks, people often prefer visual depictions of the results of analysis. This chapter demonstrates various Java approaches to this task.

Chapter 5, Statistical Data Analysis Techniques, reviews basic statistical techniques, including regression analysis, and demonstrates how various Java APIs provide statistical support. Statistical analysis is key to many data analysis tasks.

Chapter 6, Machine Learning, covers several machine learning algorithms, including decision trees and support vector machines. The abundance of available data provides an opportunity to apply machine learning techniques.

Chapter 7, Neural Networks, explains that neural networks can be applied to solve a variety of data science problems. In this chapter, we explain how they work and demonstrate the use of several different types of neural networks.

Chapter 8, Deep Learning, shows that deep learning algorithms are often described as multilevel neural networks. Java provides significant support in this area, and we will illustrate the use of this approach.

Chapter 9, Text Analysis, explains that significant portions of available datasets exist in textual formats. The field of natural language processing has advanced considerably and is frequently used in data science applications. We demonstrate various Java APIs used to support this type of analysis.

Chapter 10, Visual and Audio Analysis, shows that data science is not restricted to text processing. Many social media sites use visual data extensively. This chapter illustrates the Java support available for this type of analysis.

Chapter 11, Mathematical and Parallel Techniques for Data Analysis, investigates the support provided for low-level mathematical operations and how they can be performed in a multi-processor environment. Data analysis, at its heart, necessitates the ability to manipulate and analyze large quantities of numeric data.

Chapter 12, Bringing It All Together, examines how the integration of the various technologies introduced in this book can be used to create a data science application. This chapter begins with data acquisition and incorporates many of the techniques used in subsequent chapters to build a complete application.

What you need for this book

Many of the examples in the book use Java 8 features. There are a number of Java APIs demonstrated, each of which is introduced before it is applied. An IDE is not required but is desirable.

Who this book is for 

This book is aimed at experienced Java programmers who are interested in gaining a better understanding of the field of data science and how Java supports the underlying techniques. No prior experience in the field is needed.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

- WinRAR / 7-Zip for Windows
- Zipeg / iZip / UnRarX for Mac
- 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Java-for-Data-Science. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Chapter 1. Getting Started with Data Science

Data science is not a single science so much as it is a collection of various scientific disciplines integrated for the purpose of analyzing data. In addition to various statistical and mathematical techniques, these disciplines include:

- Computer science
- Data engineering
- Visualization
- Domain-specific knowledge and approaches

With the advent of cheaper storage technology, more and more data has been collected and stored, permitting previously unfeasible processing and analysis. With this analysis came the need for various techniques to make sense of the data. These large sets of data, when analyzed to identify trends and patterns, have become known as big data.

This, in turn, gave rise to cloud computing and concurrent techniques such as map-reduce, which distribute the analysis process across a large number of processors, taking advantage of the power of parallel processing.

The process of analyzing big data is not simple and has led to the emergence of a class of specialized developers known as data scientists. Drawing upon a myriad of technologies and expertise, they are able to analyze data to solve problems that previously were either not envisioned or were too difficult to solve.

Early big data applications were typified by the emergence of search engines capable of more powerful and accurate searches than their predecessors. For example, AltaVista was an early popular search engine that was eventually superseded by Google. While big data applications were not limited to these search engine functionalities, these applications laid the groundwork for future work in big data.

The term data science has been used since 1974 and has evolved over time to include the statistical analysis of data. The concepts of data mining and data analytics have been associated with data science. Around 2008, the term data scientist appeared and was used to describe a person who performs data analysis. A more in-depth discussion of the history of data science can be found at http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#3d9ea08369fd.

This book aims to take a broad look at data science using Java and will briefly touch on many topics. It is likely that the reader may find topics of interest and pursue these at greater depth independently. The purpose of this book, however, is simply to introduce the reader to the significant data science topics and to illustrate how they can be addressed using Java.

There are many algorithms used in data science. In this book, we do not attempt to explain how they work except at an introductory level. Rather, we are more interested in explaining how they can be used to solve problems. Specifically, we are interested in knowing how they can be used with Java.

Problems solved using data science

The various data science techniques that we will illustrate have been used to solve a variety of problems. Many of these techniques are motivated by the prospect of economic gain, but they have also been used to solve many pressing social and environmental problems. Problem domains where these techniques have been applied include finance, optimizing business processes, understanding customer needs, performing DNA analysis, foiling terrorist plots, and finding relationships between transactions to detect fraud, among many other data-intensive problems.

Data mining is a popular application area for data science. In this activity, large quantities of data are processed and analyzed to glean information about the dataset, provide meaningful insights, and develop conclusions and predictions. It has been used to analyze customer behavior, to detect relationships between what may appear to be unrelated events, and to make predictions about future behavior.

Machine learning is an important aspect of data science. This technique allows the computer to solve various problems without needing to be explicitly programmed. It has been used in self-driving cars, speech recognition, and web searches. In data mining, the data is extracted and processed. With machine learning, computers use the data to take some sort of action.

Understanding the data science problem-solving approach

Data science is concerned with the processing and analysis of large quantities of data to create models that can be used to make predictions or otherwise support a specific goal. This process often involves the building and training of models. The specific approach used to solve a problem depends on the nature of the problem. However, in general, the following are the high-level tasks used in the analysis process:

- Acquiring the data: Before we can process the data, it must be acquired. The data is frequently stored in a variety of formats and will come from a wide range of data sources.
- Cleaning the data: Once the data has been acquired, it often needs to be converted to a different format before it can be used. In addition, the data needs to be processed, or cleaned, so as to remove errors, resolve inconsistencies, and otherwise put it in a form ready for analysis.
- Analyzing the data: This can be performed using a number of techniques, including:
  - Statistical analysis: This uses a multitude of statistical approaches to provide insight into data. It includes simple techniques and more advanced techniques such as regression analysis.
  - AI analysis: These can be grouped as machine learning, neural network, and deep learning techniques:
    - Machine learning approaches are characterized by programs that can learn without being specifically programmed to complete a specific task
    - Neural networks are built around models patterned after the neural connections of the brain
    - Deep learning attempts to identify higher levels of abstraction within a set of data
  - Text analysis: This is a common form of analysis, which works with natural languages to identify features such as the names of people and places, the relationships between parts of the text, and the implied meaning of the text.
  - Data visualization: This is an important analysis tool. By displaying the data in a visual form, a hard-to-understand set of numbers can be more readily understood.
  - Video, image, and audio processing and analysis: This is a more specialized form of analysis, which is becoming more common as better analysis techniques are discovered and faster processors become available. This is in contrast to the more common text processing and analysis tasks.

Complementing this set of tasks is the need to develop applications that are efficient. The introduction of machines with multiple processors and GPUs contributes significantly to the end result.

While the exact steps used will vary by application, understanding these basic steps provides the basis for constructing solutions to many data science problems.

Using Java to support data science

Java and its associated third-party libraries provide a range of support for the development of data science applications. There are numerous core Java capabilities that can be used, such as the basic string processing methods. The introduction of lambda expressions in Java 8 enables more powerful and expressive ways of building applications. In many of the examples in subsequent chapters, we will show alternative techniques using lambda expressions.
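As a small, self-contained sketch of what this looks like in practice (the word list below is invented purely for illustration), a stream pipeline can filter and normalize a collection of strings in a few lines:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LambdaExample {
    public static void main(String[] args) {
        // A small, illustrative list of raw tokens (assumed data)
        List<String> words = Arrays.asList("Data", "science", "", "with", "JAVA");

        // Filter out empty strings and normalize case using a stream pipeline
        List<String> cleaned = words.stream()
                .filter(word -> !word.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.toList());

        System.out.println(cleaned);   // [data, science, with, java]
    }
}

The same processing written with explicit loops and conditionals would be noticeably longer and harder to read, which is why lambda-based alternatives appear throughout the later chapters.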

There is ample support provided for the basic data science tasks. These include multiple ways of acquiring data, libraries for cleaning data, and a wide variety of analysis approaches for tasks such as natural language processing and statistical analysis. There are also myriad libraries supporting neural network types of analysis.

Java can be a very good choice for data science problems. The language provides both object-oriented and functional support for solving problems. There is a large developer community to draw upon and there exist multiple APIs that support data science tasks. These are but a few reasons as to why Java should be used.

The remainder of this chapter will provide an overview of the data science tasks and Java support demonstrated in the book. Each section is only able to present a brief introduction to a topic and the available support. Subsequent chapters will go into considerably more depth regarding these topics.

Acquiring data for an application

Data acquisition is an important step in the data analysis process. When data is acquired, it is often in a specialized form and its contents may be inconsistent with, or different from, an application's needs. Many sources of data are found on the Internet. Several examples will be demonstrated in Chapter 2, Data Acquisition.

Data may be stored in a variety of formats. Popular formats for text data include HTML, Comma Separated Values (CSV), JavaScript Object Notation (JSON), and XML. Image and audio data are stored in a number of formats. However, it is frequently necessary to convert one data format into another format, typically plain text.

For example, JSON (http://www.JSON.org/) is stored using blocks of curly braces containing key-value pairs. In the following example, part of a YouTube search result is shown:

{
  "kind": "youtube#searchResult",
  "etag": etag,
  "id": {
    "kind": string,
    "videoId": string,
    "channelId": string,
    "playlistId": string
  },
  ...
}
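Core Java does not parse JSON on its own, but several libraries do. As one possible sketch (the choice of the Jackson library and the literal field values are assumptions made only for this example), a result of this shape can be read into a tree and navigated by key:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonSketch {
    public static void main(String[] args) throws Exception {
        // A fragment of a YouTube-style search result (illustrative values)
        String json = "{\"kind\": \"youtube#searchResult\", "
                + "\"id\": {\"kind\": \"youtube#video\", \"videoId\": \"abc123\"}}";

        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(json);

        // Navigate the tree using the key names from the result above
        System.out.println(root.get("kind").asText());                // youtube#searchResult
        System.out.println(root.path("id").path("videoId").asText()); // abc123
    }
}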

Data is acquired using techniques such as processing live streams, downloading compressed files, and screen scraping, where the information on a web page is extracted. Web crawling is a technique where a program examines a series of web pages, moving from one page to another and acquiring the data that it needs.
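At the lowest level, much of this comes down to fetching the contents of a URL. The following is a minimal sketch using the core HttpURLConnection class; the Wikipedia URL is only an illustrative target and could be replaced by any publicly reachable page:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class PageDownloader {
    public static void main(String[] args) throws Exception {
        // Illustrative URL; any publicly reachable page could be used
        URL url = new URL("https://en.wikipedia.org/wiki/Data_science");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");

        // Read the response body line by line
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        connection.disconnect();
    }
}

The downloaded HTML would then be parsed or scraped for the values of interest, which is the subject of Chapter 2.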

With many popular media sites, it is necessary to acquire a user ID and password to access data. A commonly used technique is OAuth, which is an open standard used to authenticate users to many different websites. The technique delegates access to a server resource and works over HTTPS. Several companies use OAuth 2.0, including PayPal, Facebook, Twitter, and Yelp.

Visualizing data to enhance understanding

The analysis of data often results in a series of numbers representing the results of the analysis. However, for most people, this way of expressing results is not always intuitive. A better way to understand the results is to create graphs and charts to depict the results and the relationship between the elements of the result.

The human mind is often good at seeing patterns, trends, and outliers in visual representations. The large amount of data present in many data science problems can be analyzed using visualization techniques. Visualization is appropriate for a wide range of audiences, from analysts to upper-level management to clientele. In this chapter, we present various visualization techniques and demonstrate how they are supported in Java.

In Chapter 4, Data Visualization, we illustrate how to create different types of graphs, plots, and charts. These examples use JavaFX and a free library called GRAL (http://trac.erichseifert.de/gral/).

Visualization allows users to examine large datasets in ways that provide insights that are not present in the mass of the data. Visualization tools help us identify potential problems or unexpected data results and develop meaningful interpretations of the data.

For example, outliers, values that lie outside the normal range, can be hard to spot in a sea of numbers. Creating a graph based on the data allows users to see outliers quickly. It can also help us spot errors and classify data more easily.

For example, a chart such as the one produced below might suggest that the two highest values are outliers that need to be dealt with.
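As a rough illustration, a chart of this kind can be built with JavaFX's charting API; in the following sketch, the data values, including the two unusually high points, are invented purely for demonstration:

import javafx.application.Application;
import javafx.scene.Scene;
import javafx.scene.chart.NumberAxis;
import javafx.scene.chart.ScatterChart;
import javafx.scene.chart.XYChart;
import javafx.stage.Stage;

public class OutlierChart extends Application {
    @Override
    public void start(Stage stage) {
        NumberAxis xAxis = new NumberAxis();
        NumberAxis yAxis = new NumberAxis();
        ScatterChart<Number, Number> chart = new ScatterChart<>(xAxis, yAxis);

        // Illustrative data: most values cluster near 10, two lie well above
        XYChart.Series<Number, Number> series = new XYChart.Series<>();
        double[] values = {9.2, 10.1, 9.8, 10.4, 9.5, 42.0, 10.0, 47.5};
        for (int i = 0; i < values.length; i++) {
            series.getData().add(new XYChart.Data<>(i + 1, values[i]));
        }
        chart.getData().add(series);

        stage.setScene(new Scene(chart, 600, 400));
        stage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}

When the chart is displayed, the two large values stand out immediately from the cluster of points near 10, which is exactly the kind of insight that is difficult to obtain from the raw numbers alone.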

Machine learning applied to data science

Machine learning has become increasingly important for data science analysis, as it has for a multitude of other fields. A defining characteristic of machine learning is the ability of a model to be trained on a set of representative data and then later used to solve similar problems. There is no need to explicitly program an application to solve the problem. A model is a representation of a real-world object.

For example, customer purchases can be used to train a model. Predictions can then be made about the types of purchases a customer might make in the future. This allows an organization to tailor ads and coupons for a customer, potentially providing a better customer experience.

Training can be performed using one of several different approaches:

- Supervised learning: The model is trained with annotated, labeled data showing corresponding correct results
- Unsupervised learning: The data does not contain results, but the model is expected to find relationships on its own
- Semi-supervised learning: A small amount of labeled data is combined with a larger amount of unlabeled data
- Reinforcement learning: This is similar to supervised learning, but a reward is provided for good results

There are several approaches that support machine learning. In Chapter 6, Machine Learning, we will illustrate three techniques:

- Decision trees: A tree is constructed using features of the problem as internal nodes and the results as leaves
- Support vector machines: These are used for classification by creating a hyperplane that partitions the dataset and then making predictions
- Bayesian networks: These are used to depict probabilistic relationships between events

A Support Vector Machine (SVM) is used primarily for classification-type problems. The approach creates a hyperplane to categorize data, which can be envisioned as a geometric plane that separates two regions. In a two-dimensional space, it will be a line. In a three-dimensional space, it will be a two-dimensional plane. In Chapter 6, Machine Learning, we will demonstrate the approach using a set of data relating to the propensity of individuals to camp. We will use the Weka SMO class to perform this type of analysis.

The following figure depicts a hyperplane using a distribution of two types of data points. The lines represent possible hyperplanes that separate these points. The lines clearly separate the data points except for one outlier.

Once the model has been trained, the possible hyperplanes are considered and predictions can then be made using similar data.
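To give a flavor of what that Weka-based workflow looks like, here is a rough sketch; the camping.arff file name and the assumption that the class attribute is the last column are placeholders standing in for the dataset used in Chapter 6:

import weka.classifiers.functions.SMO;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CampingSvm {
    public static void main(String[] args) throws Exception {
        // Load the camping dataset (file name is an assumption; ARFF format)
        DataSource source = new DataSource("camping.arff");
        Instances data = source.getDataSet();
        // Treat the last attribute as the class to predict (assumed layout)
        data.setClassIndex(data.numAttributes() - 1);

        // Train an SVM classifier using Weka's SMO implementation
        SMO smo = new SMO();
        smo.buildClassifier(data);

        // Classify one instance from the dataset as a simple sanity check
        Instance first = data.instance(0);
        double predicted = smo.classifyInstance(first);
        System.out.println("Predicted class: "
                + data.classAttribute().value((int) predicted));
    }
}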