E-Book
86,39 €

Machine Learning: End-to-End guide for Java developers E-Book

Richard M. Reese

Publisher: Packt Publishing
Category: Science and New Technologies
Language: English

Description

Develop, implement, and tune up your machine learning applications using the power of Java programming

About This Book

Detailed coverage of key machine learning topics with an emphasis on both theoretical and practical aspects
Address predictive modeling problems using the most popular machine learning Java libraries
A comprehensive course covering a wide spectrum of topics such as machine learning and natural language processing through practical use cases

Who This Book Is For

This course is the right resource for anyone with some knowledge of Java programming who wants to get started with Data Science and Machine learning as quickly as possible. If you want to gain meaningful insights from big data and develop intelligent applications using Java, this course is also a must-have.

What You Will Learn

Understand key data analysis techniques centered around machine learning
Implement Java APIs and various techniques such as classification, clustering, anomaly detection, and more
Master key Java machine learning libraries, their functionality, and various kinds of problems that can be addressed using each of them
Apply machine learning to real-world data for fraud detection, recommendation engines, text classification, and human activity recognition
Experiment with semi-supervised learning and stream-based data mining, building high-performing and real-time predictive models
Develop intelligent systems centered around various domains such as security, Internet of Things, social networking, and more

In Detail

Machine learning is one of the core areas of artificial intelligence, in which computers are trained to learn, grow, change, and develop on their own without being explicitly programmed. In this course, we cover how Java is used to build powerful machine learning models that address problems in data science. The course demonstrates complex data extraction and statistical analysis techniques supported by Java, applies various machine learning methods, explores machine learning sub-domains, and works through real-world use cases such as recommendation systems, fraud detection, natural language processing, and more, all using Java. The course begins with an introduction to data science and basic data science tasks such as data collection, data cleaning, data analysis, and data visualization. The next section gives a detailed overview of statistical techniques, covering machine learning, neural networks, and deep learning. The following sections apply machine learning methods in Java to a variety of tasks, including classification, prediction, forecasting, market basket analysis, clustering, stream learning, active learning, semi-supervised learning, probabilistic graph modeling, text mining, and deep learning.

The last section highlights real-world test cases such as performing activity recognition, developing image recognition, text classification, and anomaly detection. The course includes premium content from three of our most popular books:

Java for Data Science
Machine Learning in Java
Mastering Java Machine Learning

On completion of this course, you will understand various machine learning techniques and the Java machine learning algorithms you can use to gain insights from data. You will also be able to build data models to analyze larger, complex data sets and to develop applications using Java and machine learning algorithms in the field of artificial intelligence.

Style and approach

This comprehensive course proceeds from being a tutorial to a practical guide, providing an introduction to machine learning and different machine learning techniques, exploring machine learning with Java libraries, and demonstrating real-world machine learning use cases using the Java platform.

Details

Page count: 1334

Year of publication: 2017

Excerpt

Machine Learning: End-to-End guide for Java developers

Credits

Preface

What this learning path covers

What you need for this learning path

Who this learning path is for

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Module 1

1. Getting Started with Data Science

Problems solved using data science

Understanding the data science problem-solving approach

Using Java to support data science

Acquiring data for an application

The importance and process of cleaning data

Visualizing data to enhance understanding

The use of statistical methods in data science

Machine learning applied to data science

Using neural networks in data science

Deep learning approaches

Performing text analysis

Visual and audio analysis

Improving application performance using parallel techniques

Assembling the pieces

Summary

2. Data Acquisition

Understanding the data formats used in data science applications

Overview of CSV data

Overview of spreadsheets

Overview of databases

Overview of PDF files

Overview of JSON

Overview of XML

Overview of streaming data

Overview of audio/video/images in Java

Data acquisition techniques

Using the HttpURLConnection class

Web crawlers in Java

Creating your own web crawler

Using the crawler4j web crawler

Web scraping in Java

Using API calls to access common social media sites

Using OAuth to authenticate users

Handling Twitter

Handling Wikipedia

Handling Flickr

Handling YouTube

Searching by keyword

Summary

3. Data Cleaning

Handling data formats

Handling CSV data

Handling spreadsheets

Handling Excel spreadsheets

Handling PDF files

Handling JSON

Using JSON streaming API

Using the JSON tree API

The nitty gritty of cleaning text

Using Java tokenizers to extract words

Java core tokenizers

Third-party tokenizers and libraries

Transforming data into a usable form

Simple text cleaning

Removing stop words

Finding words in text

Finding and replacing text

Data imputation

Subsetting data

Sorting text

Data validation

Validating data types

Validating dates

Validating e-mail addresses

Validating ZIP codes

Validating names

Cleaning images

Changing the contrast of an image

Smoothing an image

Brightening an image

Resizing an image

Converting images to different formats

Summary

4. Data Visualization

Understanding plots and graphs

Visual analysis goals

Creating index charts

Creating bar charts

Using country as the category

Using decade as the category

Creating stacked graphs

Creating pie charts

Creating scatter charts

Creating histograms

Creating donut charts

Creating bubble charts

Summary

5. Statistical Data Analysis Techniques

Working with mean, mode, and median

Calculating the mean

Using simple Java techniques to find mean

Using Java 8 techniques to find mean

Using Google Guava to find mean

Using Apache Commons to find mean

Calculating the median

Using simple Java techniques to find median

Using Apache Commons to find the median

Calculating the mode

Using ArrayLists to find multiple modes

Using a HashMap to find multiple modes

Using Apache Commons to find multiple modes

Standard deviation

Sample size determination

Hypothesis testing

Regression analysis

Using simple linear regression

Using multiple regression

Summary

6. Machine Learning

Supervised learning techniques

Decision trees

Decision tree types

Decision tree libraries

Using a decision tree with a book dataset

Testing the book decision tree

Support vector machines

Using an SVM for camping data

Testing individual instances

Bayesian networks

Using a Bayesian network

Unsupervised machine learning

Association rule learning

Using association rule learning to find buying relationships

Reinforcement learning

Summary

7. Neural Networks

Training a neural network

Getting started with neural network architectures

Understanding static neural networks

A basic Java example

Understanding dynamic neural networks

Multilayer perceptron networks

Building the model

Evaluating the model

Predicting other values

Saving and retrieving the model

Learning vector quantization

Self-Organizing Maps

Using a SOM

Displaying the SOM results

Additional network architectures and algorithms

The k-Nearest Neighbors algorithm

Instantaneously trained networks

Spiking neural networks

Cascading neural networks

Holographic associative memory

Backpropagation and neural networks

Summary

8. Deep Learning

Deeplearning4j architecture

Acquiring and manipulating data

Reading in a CSV file

Configuring and building a model

Using hyperparameters in ND4J

Instantiating the network model

Training a model

Testing a model

Deep learning and regression analysis

Preparing the data

Setting up the class

Reading and preparing the data

Building the model

Evaluating the model

Restricted Boltzmann Machines

Reconstruction in an RBM

Configuring an RBM

Deep autoencoders

Building an autoencoder in DL4J

Configuring the network

Building and training the network

Saving and retrieving a network

Specialized autoencoders

Convolutional networks

Building the model

Evaluating the model

Recurrent Neural Networks

Summary

9. Text Analysis

Implementing named entity recognition

Using OpenNLP to perform NER

Identifying location entities

Classifying text

Word2Vec and Doc2Vec

Classifying text by labels

Classifying text by similarity

Understanding tagging and POS

Using OpenNLP to identify POS

Understanding POS tags

Extracting relationships from sentences

Using OpenNLP to extract relationships

Sentiment analysis

Downloading and extracting the Word2Vec model

Building our model and classifying text

Summary

10. Visual and Audio Analysis

Text-to-speech

Using FreeTTS

Getting information about voices

Gathering voice information

Understanding speech recognition

Using CMUSphinx to convert speech to text

Obtaining more detail about the words

Extracting text from an image

Using Tess4j to extract text

Identifying faces

Using OpenCV to detect faces

Classifying visual data

Creating a Neuroph Studio project for classifying visual images

Training the model

Summary

11. Mathematical and Parallel Techniques for Data Analysis

Implementing basic matrix operations

Using GPUs with DeepLearning4j

Using map-reduce

Using Apache's Hadoop to perform map-reduce

Writing the map method

Writing the reduce method

Creating and executing a new Hadoop job

Various mathematical libraries

Using the jblas API

Using the Apache Commons math API

Using the ND4J API

Using OpenCL

Using Aparapi

Creating an Aparapi application

Using Aparapi for matrix multiplication

Using Java 8 streams

Understanding Java 8 lambda expressions and streams

Using Java 8 to perform matrix multiplication

Using Java 8 to perform map-reduce

Summary

12. Bringing It All Together

Defining the purpose and scope of our application

Understanding the application's architecture

Data acquisition using Twitter

Understanding the TweetHandler class

Extracting data for a sentiment analysis model

Building the sentiment model

Processing the JSON input

Cleaning data to improve our results

Removing stop words

Performing sentiment analysis

Analysing the results

Other optional enhancements

Summary

2. Module 2

1. Applied Machine Learning Quick Start

Machine learning and data science

What kind of problems can machine learning solve?

Applied machine learning workflow

Data and problem definition

Measurement scales

Data collection

Find or observe data

Generate data

Sampling traps

Data pre-processing

Data cleaning

Fill missing values

Remove outliers

Data transformation

Data reduction

Unsupervised learning

Find similar items

Euclidean distances

Non-Euclidean distances

The curse of dimensionality

Clustering

Supervised learning

Classification

Decision tree learning

Probabilistic classifiers

Kernel methods

Artificial neural networks

Ensemble learning

Evaluating classification

Precision and recall

ROC curves

Regression

Linear regression

Evaluating regression

Mean squared error

Mean absolute error

Correlation coefficient

Generalization and evaluation

Underfitting and overfitting

Train and test sets

Cross-validation

Leave-one-out validation

Stratification

Summary

2. Java Libraries and Platforms for Machine Learning

The need for Java

Machine learning libraries

Weka

Java machine learning

Apache Mahout

Apache Spark

Deeplearning4j

MALLET

Comparing libraries

Building a machine learning application

Traditional machine learning architecture

Dealing with big data

Big data application architecture

Summary

3. Basic Algorithms – Classification, Regression, and Clustering

Before you start

Classification

Data

Loading data

Feature selection

Learning algorithms

Classify new data

Evaluation and prediction error metrics

Confusion matrix

Choosing a classification algorithm

Regression

Loading the data

Analyzing attributes

Building and evaluating regression model

Linear regression

Regression trees

Tips to avoid common regression problems

Clustering

Clustering algorithms

Evaluation

Summary

4. Customer Relationship Prediction with Ensembles

Customer relationship database

Challenge

Dataset

Evaluation

Basic naive Bayes classifier baseline

Getting the data

Loading the data

Basic modeling

Evaluating models

Implementing naive Bayes baseline

Advanced modeling with ensembles

Before we start

Data pre-processing

Attribute selection

Model selection

Performance evaluation

Summary

5. Affinity Analysis

Market basket analysis

Affinity analysis

Association rule learning

Basic concepts

Database of transactions

Itemset and rule

Support

Confidence

Apriori algorithm

FP-growth algorithm

The supermarket dataset

Discover patterns

Apriori

FP-growth

Other applications in various areas

Medical diagnosis

Protein sequences

Census data

Customer relationship management

IT Operations Analytics

Summary

6. Recommendation Engine with Apache Mahout

Basic concepts

Key concepts

User-based and item-based analysis

Approaches to calculate similarity

Collaborative filtering

Content-based filtering

Hybrid approach

Exploitation versus exploration

Getting Apache Mahout

Configuring Mahout in Eclipse with the Maven plugin

Building a recommendation engine

Book ratings dataset

Loading the data

Loading data from file

Loading data from database

In-memory database

Collaborative filtering

User-based filtering

Item-based filtering

Adding custom rules to recommendations

Evaluation

Online learning engine

Content-based filtering

Summary

7. Fraud and Anomaly Detection

Suspicious and anomalous behavior detection

Unknown-unknowns

Suspicious pattern detection

Anomalous pattern detection

Analysis types

Pattern analysis

Transaction analysis

Plan recognition

Fraud detection of insurance claims

Dataset

Modeling suspicious patterns

Vanilla approach

Dataset rebalancing

Anomaly detection in website traffic

Dataset

Anomaly detection in time series data

Histogram-based anomaly detection

Loading the data

Creating histograms

Density based k-nearest neighbors

Summary

8. Image Recognition with Deeplearning4j

Introducing image recognition

Neural networks

Perceptron

Feedforward neural networks

Autoencoder

Restricted Boltzmann machine

Deep convolutional networks

Image classification

Deeplearning4j

Getting DL4J

MNIST dataset

Loading the data

Building models

Building a single-layer regression model

Building a deep belief network

Build a Multilayer Convolutional Network

Summary

9. Activity Recognition with Mobile Phone Sensors

Introducing activity recognition

Mobile phone sensors

Activity recognition pipeline

The plan

Collecting data from a mobile phone

Installing Android Studio

Loading the data collector

Feature extraction

Collecting training data

Building a classifier

Reducing spurious transitions

Plugging the classifier into a mobile app

Summary

10. Text Mining with Mallet – Topic Modeling and Spam Detection

Introducing text mining

Topic modeling

Text classification

Installing Mallet

Working with text data

Importing data

Importing from directory

Importing from file

Pre-processing text data

Topic modeling for BBC news

BBC dataset

Modeling

Evaluating a model

Reusing a model

Saving a model

Restoring a model

E-mail spam detection

E-mail spam dataset

Feature generation

Training and testing

Model performance

Summary

11. What is Next?

Machine learning in real life

Noisy data

Class unbalance

Feature selection is hard

Model chaining

Importance of evaluation

Getting models into production

Model maintenance

Standards and markup languages

CRISP-DM

SEMMA methodology

Predictive Model Markup Language

Machine learning in the cloud

Machine learning as a service

Web resources and competitions

Datasets

Online courses

Competitions

Websites and blogs

Venues and conferences

Summary

A. References

3. Module 3

1. Machine Learning Review

Machine learning – history and definition

What is not machine learning?

Machine learning – concepts and terminology

Machine learning – types and subtypes

Datasets used in machine learning

Machine learning applications

Practical issues in machine learning

Machine learning – roles and process

Roles

Process

Machine learning – tools and datasets

Datasets

Summary

2. Practical Approach to Real-World Supervised Learning

Formal description and notation

Data quality analysis

Descriptive data analysis

Basic label analysis

Basic feature analysis

Visualization analysis

Univariate feature analysis

Categorical features

Continuous features

Multivariate feature analysis

Data transformation and preprocessing

Feature construction

Handling missing values

Outliers

Discretization

Data sampling

Is sampling needed?

Undersampling and oversampling

Stratified sampling

Training, validation, and test set

Feature relevance analysis and dimensionality reduction

Feature search techniques

Feature evaluation techniques

Filter approach

Univariate feature selection

Information theoretic approach

Statistical approach

Multivariate feature selection

Minimal redundancy maximal relevance (mRMR)

Correlation-based feature selection (CFS)

Wrapper approach

Embedded approach

Model building

Linear models

Linear Regression

Algorithm input and output

How does it work?

Advantages and limitations

Naïve Bayes

Algorithm input and output

How does it work?

Advantages and limitations

Logistic Regression

Algorithm input and output

How does it work?

Advantages and limitations

Non-linear models

Decision Trees

Algorithm inputs and outputs

How does it work?

Advantages and limitations

K-Nearest Neighbors (KNN)

Algorithm inputs and outputs

How does it work?

Advantages and limitations

Support vector machines (SVM)

Algorithm inputs and outputs

How does it work?

Advantages and limitations

Ensemble learning and meta learners

Bootstrap aggregating or bagging

Algorithm inputs and outputs

How does it work?

Random Forest

Advantages and limitations

Boosting

Algorithm inputs and outputs

How does it work?

Advantages and limitations

Model assessment, evaluation, and comparisons

Model assessment

Model evaluation metrics

Confusion matrix and related metrics

ROC and PRC curves

Gain charts and lift curves

Model comparisons

Comparing two algorithms

McNemar's Test

Paired-t test

Wilcoxon signed-rank test

Comparing multiple algorithms

ANOVA test

Friedman's test

Case Study – Horse Colic Classification

Business problem

Machine learning mapping

Data analysis

Label analysis

Features analysis

Supervised learning experiments

Weka experiments

Sample end-to-end process in Java

Weka experimenter and model selection

RapidMiner experiments

Visualization analysis

Feature selection

Model process flow

Model evaluation metrics

Evaluation on Confusion Metrics

ROC Curves, Lift Curves, and Gain Charts

Results, observations, and analysis

Summary

References

3. Unsupervised Machine Learning Techniques

Issues in common with supervised learning

Issues specific to unsupervised learning

Feature analysis and dimensionality reduction

Notation

Linear methods

Principal component analysis (PCA)

Inputs and outputs

How does it work?

Advantages and limitations

Random projections (RP)

Inputs and outputs

How does it work?

Advantages and limitations

Multidimensional Scaling (MDS)

Inputs and outputs

How does it work?

Advantages and limitations

Nonlinear methods

Kernel Principal Component Analysis (KPCA)

Inputs and outputs

How does it work?

Advantages and limitations

Manifold learning

Inputs and outputs

How does it work?

Advantages and limitations

Clustering

Clustering algorithms

k-Means

Inputs and outputs

How does it work?

Advantages and limitations

DBSCAN

Inputs and outputs

How does it work?

Advantages and limitations

Mean shift

Inputs and outputs

How does it work?

Advantages and limitations

Expectation maximization (EM) or Gaussian mixture modeling (GMM)

Input and output

How does it work?

Advantages and limitations

Hierarchical clustering

Input and output

How does it work?

Advantages and limitations

Self-organizing maps (SOM)

Inputs and outputs

How does it work?

Advantages and limitations

Spectral clustering

Inputs and outputs

How does it work?

Advantages and limitations

Affinity propagation

Inputs and outputs

How does it work?

Advantages and limitations

Clustering validation and evaluation

Internal evaluation measures

Notation

R-Squared

Dunn's Indices

Davies-Bouldin index

Silhouette's index

External evaluation measures

Rand index

F-Measure

Normalized mutual information index

Outlier or anomaly detection

Outlier algorithms

Statistical-based

Inputs and outputs

How does it work?

Advantages and limitations

Distance-based methods

Inputs and outputs

How does it work?

Advantages and limitations

Density-based methods

Inputs and outputs

How does it work?

Advantages and limitations

Clustering-based methods

Inputs and outputs

How does it work?

Advantages and limitations

High-dimensional-based methods

Inputs and outputs

How does it work?

Advantages and limitations

One-class SVM

Inputs and outputs

How does it work?

Advantages and limitations

Outlier evaluation techniques

Supervised evaluation

Unsupervised evaluation

Real-world case study

Tools and software

Business problem

Machine learning mapping

Data collection

Data quality analysis

Data sampling and transformation

Feature analysis and dimensionality reduction

PCA

Random projections

ISOMAP

Observations on feature analysis and dimensionality reduction

Clustering models, results, and evaluation

Observations and clustering analysis

Outlier models, results, and evaluation

Observations and analysis

Summary

References

4. Semi-Supervised and Active Learning

Semi-supervised learning

Representation, notation, and assumptions

Semi-supervised learning techniques

Self-training SSL

Inputs and outputs

How does it work?

Advantages and limitations

Co-training SSL or multi-view SSL

Inputs and outputs

How does it work?

Advantages and limitations

Cluster and label SSL

Inputs and outputs

How does it work?

Advantages and limitations

Transductive graph label propagation

Inputs and outputs

How does it work?

Advantages and limitations

Transductive SVM (TSVM)

Inputs and outputs

How does it work?

Advantages and limitations

Case study in semi-supervised learning

Tools and software

Business problem

Machine learning mapping

Data collection

Data quality analysis

Data sampling and transformation

Datasets and analysis

Feature analysis results

Experiments and results

Analysis of semi-supervised learning

Active learning

Representation and notation

Active learning scenarios

Active learning approaches

Uncertainty sampling

How does it work?

Least confident sampling

Smallest margin sampling

Label entropy sampling

Advantages and limitations

Version space sampling

Query by disagreement (QBD)

How does it work?

Query by Committee (QBC)

How does it work?

Advantages and limitations

Data distribution sampling

How does it work?

Expected model change

Expected error reduction

Variance reduction

Density weighted methods

Advantages and limitations

Case study in active learning

Tools and software

Business problem

Machine learning mapping

Data Collection

Data sampling and transformation

Feature analysis and dimensionality reduction

Models, results, and evaluation

Pool-based scenarios

Stream-based scenarios

Analysis of active learning results

Summary

References

5. Real-Time Stream Machine Learning

Assumptions and mathematical notations

Basic stream processing and computational techniques

Stream computations

Sliding windows

Sampling

Concept drift and drift detection

Data management

Partial memory

Full memory

Detection methods

Monitoring model evolution

Widmer and Kubat

Drift Detection Method or DDM

Early Drift Detection Method or EDDM

Monitoring distribution changes

Welch's t test

Kolmogorov-Smirnov's test

CUSUM and Page-Hinckley test

Adaptation methods

Explicit adaptation

Implicit adaptation

Incremental supervised learning

Modeling techniques

Linear algorithms

Online linear models with loss functions

Inputs and outputs

How does it work?

Advantages and limitations

Online Naïve Bayes

Inputs and outputs

How does it work?

Advantages and limitations

Non-linear algorithms

Hoeffding trees or very fast decision trees (VFDT)

Inputs and outputs

How does it work?

Advantages and limitations

Ensemble algorithms

Weighted majority algorithm

Inputs and outputs

How does it work?

Advantages and limitations

Online Bagging algorithm

Inputs and outputs

How does it work?

Advantages and limitations

Online Boosting algorithm

Inputs and outputs

How does it work?

Advantages and limitations

Validation, evaluation, and comparisons in online setting

Model validation techniques

Prequential evaluation

Holdout evaluation

Controlled permutations

Evaluation criteria

Comparing algorithms and metrics

Incremental unsupervised learning using clustering

Modeling techniques

Partition based

Online k-Means

Inputs and outputs

How does it work?

Advantages and limitations

Hierarchical based and micro clustering

Inputs and outputs

How does it work?

Advantages and limitations

Inputs and outputs

How does it work?

Advantages and limitations

Density based

Inputs and outputs

How does it work?

Advantages and limitations

Grid based

Inputs and outputs

How does it work?

Advantages and limitations

Validation and evaluation techniques

Key issues in stream cluster evaluation

Evaluation measures

Cluster Mapping Measures (CMM)

V-Measure

Other external measures

Unsupervised learning using outlier detection

Partition-based clustering for outlier detection

Inputs and outputs

How does it work?

Advantages and limitations

Distance-based clustering for outlier detection

Inputs and outputs

How does it work?

Exact Storm

Abstract-C

Direct Update of Events (DUE)

Micro Clustering based Algorithm (MCOD)

Approx Storm

Advantages and limitations

Validation and evaluation techniques

Case study in stream learning

Tools and software

Business problem

Machine learning mapping

Data collection

Data sampling and transformation

Feature analysis and dimensionality reduction

Models, results, and evaluation

Supervised learning experiments

Concept drift experiments

Clustering experiments

Outlier detection experiments

Analysis of stream learning results

Summary

References

6. Probabilistic Graph Modeling

Probability revisited

Concepts in probability

Conditional probability

Chain rule and Bayes' theorem

Random variables, joint, and marginal distributions

Marginal independence and conditional independence

Factors

Factor types

Distribution queries

Probabilistic queries

MAP queries and marginal MAP queries

Graph concepts

Graph structure and properties

Subgraphs and cliques

Path, trail, and cycles

Bayesian networks

Representation

Definition

Reasoning patterns

Causal or predictive reasoning

Evidential or diagnostic reasoning

Intercausal reasoning

Combined reasoning

Independencies, flow of influence, D-Separation, I-Map

Flow of influence

D-Separation

I-Map

Inference

Elimination-based inference

Variable elimination algorithm

Input and output

How does it work?

Advantages and limitations

Clique tree or junction tree algorithm

Input and output

How does it work?

Advantages and limitations

Propagation-based techniques

Belief propagation

Factor graph

Messaging in factor graph

Input and output

How does it work?

Advantages and limitations

Sampling-based techniques

Forward sampling with rejection

Input and output

How does it work?

Advantages and limitations

Learning

Learning parameters

Maximum likelihood estimation for Bayesian networks

Bayesian parameter estimation for Bayesian network

Prior and posterior using the Dirichlet distribution

Learning structures

Measures to evaluate structures

Methods for learning structures

Constraint-based techniques

Inputs and outputs

How does it work?

Advantages and limitations

Search and score-based techniques

Inputs and outputs

How does it work?

Advantages and limitations

Markov networks and conditional random fields

Representation

Parameterization

Gibbs parameterization

Factor graphs

Log-linear models

Independencies

Global

Pairwise Markov

Markov blanket

Inference

Learning

Conditional random fields

Specialized networks

Tree augmented network

Input and output

How does it work?

Advantages and limitations

Markov chains

Hidden Markov models

Most probable path in HMM

Posterior decoding in HMM

Tools and usage

OpenMarkov

Weka Bayesian Network GUI

Case study

Business problem

Machine learning mapping

Data sampling and transformation

Feature analysis

Models, results, and evaluation

Analysis of results

Summary

References

7. Deep Learning

Multi-layer feed-forward neural network

Inputs, neurons, activation function, and mathematical notation

Multi-layered neural network

Structure and mathematical notations

Activation functions in NN

Sigmoid function

Hyperbolic tangent ("tanh") function

Training neural network

Empirical risk minimization

Parameter initialization

Loss function

Gradients

Gradient at the output layer

Gradient at the Hidden Layer

Parameter gradient

Feed forward and backpropagation

How does it work?

Regularization

L2 regularization

L1 regularization

Limitations of neural networks

Vanishing gradients, local optimum, and slow training

Deep learning

Building blocks for deep learning

Rectified linear activation function

Restricted Boltzmann Machines

Definition and mathematical notation

Conditional distribution

Free energy in RBM

Training the RBM

Sampling in RBM

Contrastive divergence

Inputs and outputs

How does it work?

Persistent contrastive divergence

Autoencoders

Definition and mathematical notations

Loss function

Limitations of Autoencoders

Denoising Autoencoder

Unsupervised pre-training and supervised fine-tuning

Deep feed-forward NN

Input and outputs

How does it work?

Deep Autoencoders

Deep Belief Networks

Inputs and outputs

How does it work?

Deep learning with dropouts

Definition and mathematical notation

Inputs and outputs

How does it work?

Learning: training and testing with dropouts

Sparse coding

Convolutional Neural Network

Local connectivity

Parameter sharing

Discrete convolution

Pooling or subsampling

Normalization using ReLU

CNN Layers

Recurrent Neural Networks

Structure of Recurrent Neural Networks

Learning and associated problems in RNNs

Long Short Term Memory

Gated Recurrent Units

Case study

Tools and software

Business problem

Machine learning mapping

Data sampling and transformation

Feature analysis

Models, results, and evaluation

Basic data handling

Multi-layer perceptron

Parameters used for MLP

Code for MLP

Convolutional Network

Parameters used for ConvNet

Code for CNN

Variational Autoencoder

Parameters used for the Variational Autoencoder

Code for Variational Autoencoder

DBN

Parameter search using Arbiter

Results and analysis

Summary

References

8. Text Mining and Natural Language Processing

NLP, subfields, and tasks

Text categorization

Part-of-speech tagging (POS tagging)

Text clustering

Information extraction and named entity recognition

Sentiment analysis and opinion mining

Coreference resolution

Word sense disambiguation

Machine translation

Semantic reasoning and inferencing

Text summarization

Automating question and answers

Issues with mining unstructured data

Text processing components and transformations

Document collection and standardization

Inputs and outputs

How does it work?

Tokenization

Inputs and outputs

How does it work?

Stop words removal

Inputs and outputs

How does it work?

Stemming or lemmatization

Inputs and outputs

How does it work?

Local/global dictionary or vocabulary?

Feature extraction/generation

Lexical features

Character-based features

Word-based features

Part-of-speech tagging features

Taxonomy features

Syntactic features

Semantic features

Feature representation and similarity

Vector space model

Binary

Term frequency (TF)

Inverse document frequency (IDF)

Term frequency-inverse document frequency (TF-IDF)

Similarity measures

Euclidean distance

Cosine distance

Pairwise-adaptive similarity

Extended Jaccard coefficient

Dice coefficient

Feature selection and dimensionality reduction

Feature selection

Information theoretic techniques

Statistical-based techniques

Frequency-based techniques

Dimensionality reduction

Topics in text mining

Text categorization/classification

Topic modeling

Probabilistic latent semantic analysis (PLSA)

Input and output

How does it work?

Advantages and limitations

Text clustering

Feature transformation, selection, and reduction

Clustering techniques

Generative probabilistic models

Input and output

How does it work?

Advantages and limitations

Distance-based text clustering

Non-negative matrix factorization (NMF)

Input and output

How does it work?

Advantages and limitations

Evaluation of text clustering

Named entity recognition

Hidden Markov models for NER

Input and output

How does it work?

Advantages and limitations

Maximum entropy Markov models for NER

Input and output

How does it work?

Advantages and limitations

Deep learning and NLP

Tools and usage

Mallet

KNIME

Topic modeling with mallet

Business problem

Machine Learning mapping

Data collection

Data sampling and transformation

Feature analysis and dimensionality reduction

Models, results, and evaluation

Analysis of text processing results

Summary

References

9. Big Data Machine Learning – The Final Frontier

What are the characteristics of Big Data?

Big Data Machine Learning

General Big Data framework

Big Data cluster deployment frameworks

Hortonworks Data Platform

Cloudera CDH

Amazon Elastic MapReduce

Microsoft Azure HDInsight

Data acquisition

Publish-subscribe frameworks

Source-sink frameworks

SQL frameworks

Message queueing frameworks

Custom frameworks

Data storage

HDFS

NoSQL

Key-value databases

Document databases

Columnar databases

Graph databases

Data processing and preparation

Hive and HQL

Spark SQL

Amazon Redshift

Real-time stream processing

Machine Learning

Visualization and analysis

Batch Big Data Machine Learning

H2O as Big Data Machine Learning platform

H2O architecture

Machine learning in H2O

Tools and usage

Case study

Business problem

Machine Learning mapping

Data collection

Data sampling and transformation

Experiments, results, and analysis

Feature relevance and analysis

Evaluation on test data

Analysis of results

Spark MLlib as Big Data Machine Learning platform

Spark architecture

Machine Learning in MLlib

Tools and usage

Experiments, results, and analysis

k-Means

k-Means with PCA

Bisecting k-Means (with PCA)

Gaussian Mixture Model

Random Forest

Analysis of results

Real-time Big Data Machine Learning

SAMOA as a real-time Big Data Machine Learning framework

SAMOA architecture

Machine Learning algorithms

Tools and usage

Experiments, results, and analysis

Analysis of results

The future of Machine Learning

Summary

References

A. Linear Algebra

Vector

Scalar product of vectors

Matrix

Transpose of a matrix

Matrix addition

Scalar multiplication

Matrix multiplication

Properties of matrix product

Linear transformation

Matrix inverse

Eigendecomposition

Positive definite matrix

Singular value decomposition (SVD)

B. Probability

Axioms of probability

Bayes' theorem

Density estimation

Mean

Variance

Standard deviation

Gaussian standard deviation

Covariance

Correlation coefficient

Binomial distribution

Poisson distribution

Gaussian distribution

Central limit theorem

Error propagation

D. Bibliography

Index

Machine Learning: End-to-End guide for Java developers

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Published on: September 2017

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78862-221-9

www.packtpub.com

Credits

Authors

Richard M. Reese

Jennifer L. Reese

Boštjan Kaluža

Dr. Uday Kamath

Krishna Choppella

Reviewers

Walter Molina

Shilpi Saxena

Abhik Banerjee

Wei Di

Manjunath Narayana

Ravi Sharma

Samir Sahli

Prashant Verma

Content Development Editor

Aishwarya Pandere

Production Coordinator

Arvindkumar Gupta

Preface

Machine learning is a subfield of artificial intelligence. It helps computers to learn and act like human beings with the help of algorithms and data. With a given set of data, an ML algorithm learns different properties of the data and infers the properties of the data that it may encounter in future.

What this learning path covers

Module 1, Java for Data Science, investigates the support provided for low-level math operations and how they can be supported in a multiple processor environment. Data analysis, at its heart, necessitates the ability to manipulate and analyze large quantities of numeric data.

Module 2, Machine Learning in Java, reviews the various Java libraries and platforms dedicated to machine learning, what each library brings to the table, and what kind of problems it is able to solve. The review includes Weka, Java-ML, Apache Mahout, Apache Spark, deeplearning4j, and Mallet.

Module 3, Mastering Java Machine Learning, presents many advanced methods in clustering and outlier techniques, with applications. Topics covered are feature selection and reduction in unsupervised data, clustering algorithms, evaluation methods in clustering, and anomaly detection using statistical, distance, and distribution techniques. At the end of the chapter, we perform a case study for both clustering and outlier detection using a real-world image dataset, MNIST. We use the Smile API to do feature reduction and ELKI for learning.

What you need for this learning path

Module 1:

Many of the examples in this module use Java 8 features. There are a number of Java APIs demonstrated, each of which is introduced before it is applied. An IDE is not required but is desirable.

Module 2:

To follow the examples throughout the module, you'll need a personal computer with the JDK installed. All the examples and source code that you can download assume Eclipse IDE with support for Maven, a dependency management and build automation tool; and Git, a version control system. Examples in the chapters rely on various libraries, including Weka, deeplearning4j, Mallet, and Apache Mahout. Instructions on how to get and install the libraries are provided in the chapter where the library will be first used.

The module has a dedicated web site, http://machine-learning-in-java.com, where you can find all the example code, errata, and additional materials that will help you to get started.

Module 3:

This book assumes you have some experience of programming in Java and a basic understanding of machine learning concepts. If that doesn't apply to you, but you are curious nonetheless and self-motivated, fret not, and read on! For those who do have some background, it means that you are familiar with simple statistical analysis of data and concepts involved in supervised and unsupervised learning. Those who may not have the requisite math or must poke the far reaches of their memory to shake loose the odd formula or funny symbol, do not be disheartened. If you are the sort that loves a challenge, the short primer in the appendices may be all you need to kick-start your engines—a bit of tenacity will see you through the rest! For those who have never been introduced to machine learning, the first chapter was equally written for you as for those needing a refresher—it is your starter-kit to jump in feet first and find out what it's all about. You can augment your basics with any number of online resources. Finally, for those innocent of Java, here's a secret: many of the tools featured in the book have powerful GUIs. Some include wizard-like interfaces, making them quite easy to use, and do not require any knowledge of Java. So if you are new to Java, just skip the examples that need coding and learn to use the GUI-based tools instead!

Who this learning path is for

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this course—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the course's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

Log in or register to our website using your e-mail address and password.

Hover the mouse pointer on the SUPPORT tab at the top.

Click on Code Downloads & Errata.

Enter the name of the course in the Search box.

Select the course for which you're looking to download the code files.

Choose from the drop-down menu where you purchased this course from.

Click on Code Download.

You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/repository-name. We also have other code bundles from our rich catalog of books, videos, and courses available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our courses—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this course. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the course in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this course, you can contact us at <[email protected]>, and we will do our best to address the problem.

Part 1. Module 1

Java for Data Science

Examine the techniques and Java tools supporting the growing field of data science

Chapter 1. Getting Started with Data Science

Data science is not a single science as much as it is a collection of various scientific disciplines integrated for the purpose of analyzing data. These disciplines draw on various statistical and mathematical techniques, and include:

Computer science

Data engineering

Visualization

Domain-specific knowledge and approaches

With the advent of cheaper storage technology, more and more data has been collected and stored, permitting previously unfeasible processing and analysis of data. With this analysis came the need for various techniques to make sense of the data. These large sets of data, when analyzed to identify trends and patterns, became known as big data.

This in turn gave rise to cloud computing and concurrent techniques such as map-reduce, which distributed the analysis process across a large number of processors, taking advantage of the power of parallel processing.
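The map-reduce idea can be illustrated on a single machine with Java 8 parallel streams. The following is only a minimal sketch under stated assumptions: the class name MapReduceSketch, the method wordCount, and the sample sentences are hypothetical, and a parallel stream spreads work across local processor cores rather than across a cluster as a real map-reduce framework would.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MapReduceSketch {

    // Map step: split every line into lowercase words.
    // Reduce step: group identical words and count the occurrences.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical input standing in for a much larger dataset.
        List<String> lines = Arrays.asList(
                "big data needs parallel processing",
                "parallel processing splits big jobs");
        System.out.println(wordCount(lines));
    }
}
```

The same pipeline runs sequentially if parallelStream() is replaced with stream(), which makes it easy to compare the two approaches on larger inputs.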

The process of analyzing big data is not simple, and it has led to the specialization of developers known as data scientists. Drawing upon a myriad of technologies and expertise, they are able to analyze data to solve problems that previously were either not envisioned or were too difficult to solve.

Early big data applications were typified by the emergence of search engines capable of more powerful and accurate searches than their predecessors. For example, AltaVista was an early popular search engine that was eventually superseded by Google. While big data applications were not limited to these search engine functionalities, these applications laid the groundwork for future work in big data.

The term data science has been in use since 1974 and has evolved over time to include statistical analysis of data. The concepts of data mining and data analytics have been associated with data science. Around 2008, the term data scientist appeared and was used to describe a person who performs data analysis. A more in-depth discussion of the history of data science can be found at http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#3d9ea08369fd.

This book aims to take a broad look at data science using Java and will briefly touch on many topics. It is likely that the reader may find topics of interest and pursue these at greater depth independently. The purpose of this book, however, is simply to introduce the reader to the significant data science topics and to illustrate how they can be addressed using Java.

There are many algorithms used in data science. In this book, we do not attempt to explain how they work except at an introductory level. Rather, we are more interested in explaining how they can be used to solve problems. Specifically, we are interested in knowing how they can be used with Java.

Problems solved using data science

The various data science techniques that we will illustrate have been used to solve a variety of problems. Many of these techniques are motivated to achieve some economic gain, but they have also been used to solve many pressing social and environmental problems. Problem domains where these techniques have been used include finance, optimizing business processes, understanding customer needs, performing DNA analysis, foiling terrorist plots, and finding relationships between transactions to detect fraud, among many other data-intensive problems.

Data mining is a popular application area for data science. In this activity, large quantities of data are processed and analyzed to glean information about the dataset, to provide meaningful insights, and to develop meaningful conclusions and predictions. It has been used to analyze customer behavior, to detect relationships between what may appear to be unrelated events, and to make predictions about future behavior.

Machine learning is an important aspect of data science. This technique allows the computer to solve various problems without needing to be explicitly programmed. It has been used in self-driving cars, speech recognition, and in web searches. In data mining, the data is extracted and processed. With machine learning, computers use the data to take some sort of action.

Understanding the data science problem-solving approach

Data science is concerned with the processing and analysis of large quantities of data to create models that can be used to make predictions or otherwise support a specific goal. This process often involves the building and training of models. The specific approach to solve a problem is dependent on the nature of the problem. However, in general, the following are the high-level tasks that are used in the analysis process:

Acquiring the data: Before we can process the data, it must be acquired. The data is frequently stored in a variety of formats and will come from a wide range of data sources.

Cleaning the data: Once the data has been acquired, it often needs to be converted to a different format before it can be used. In addition, the data needs to be processed, or cleaned, so as to remove errors, resolve inconsistencies, and otherwise put it in a form ready for analysis.

Analyzing the data: This can be performed using a number of techniques including:

    Statistical analysis: This uses a multitude of statistical approaches to provide insight into data. It includes simple techniques and more advanced techniques such as regression analysis.

    AI analysis: These can be grouped as machine learning, neural networks, and deep learning techniques:

        Machine learning approaches are characterized by programs that can learn without being specifically programmed to complete a specific task

        Neural networks are built around models patterned after the neural connection of the brain

        Deep learning attempts to identify higher levels of abstraction within a set of data

    Text analysis: This is a common form of analysis, which works with natural languages to identify features such as the names of people and places, the relationship between parts of text, and the implied meaning of text.

    Data visualization: This is an important analysis tool. By displaying the data in a visual form, a hard-to-understand set of numbers can be more readily understood.

    Video, image, and audio processing and analysis: This is a more specialized form of analysis, which is becoming more common as better analysis techniques are discovered and faster processors become available. This is in contrast to the more common text processing and analysis tasks.

Complementing this set of tasks is the need to develop applications that are efficient. The introduction of machines with multiple processors and GPUs contributes significantly to the end result.

While the exact steps used will vary by application, understanding these basic steps provides the basis for constructing solutions to many data science problems.
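The acquire, clean, and analyze steps described above can be sketched in a few lines of Java. This is an illustrative sketch only: the class PipelineSketch, its methods, and the hard-coded values are hypothetical, acquisition is simulated with an in-memory list, and the analysis is limited to a simple arithmetic mean.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PipelineSketch {

    // Cleaning: trim whitespace and drop entries that are not plain decimal numbers.
    static List<Double> clean(List<String> raw) {
        return raw.stream()
                .map(String::trim)
                .filter(s -> s.matches("-?\\d+(\\.\\d+)?"))
                .map(Double::valueOf)
                .collect(Collectors.toList());
    }

    // Analysis: arithmetic mean of the cleaned values (NaN if nothing survived cleaning).
    static double mean(List<Double> values) {
        return values.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Acquisition is simulated here; real data would come from a file or service.
        List<String> raw = Arrays.asList(" 12.5", "n/a", "7", "", "10.5 ");
        List<Double> cleaned = clean(raw);
        System.out.println("Cleaned values: " + cleaned);
        System.out.println("Mean: " + mean(cleaned));
    }
}
```

Keeping each step in its own method mirrors the separation of tasks in the list and lets one stage be replaced, for example with a different cleaning rule, without touching the others.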

Using Java to support data science

Java and its associated third-party libraries provide a range of support for the development of data science applications. There are numerous core Java capabilities that can be used, such as the basic string processing methods. The introduction of lambda expressions in Java 8 helps enable more powerful and expressive means of building applications. In many of the examples that follow in subsequent chapters, we will show alternative techniques using lambda expressions.
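To give a sense of what lambda expressions add, the following sketch performs the same filtering and transformation twice, once with an explicit loop and once with a Java 8 stream and lambdas. The class LambdaSketch, the method names, and the sample words are hypothetical and chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LambdaSketch {

    // Pre-Java 8 style: an explicit loop with a conditional.
    static List<String> upperCaseLoop(List<String> words) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (word.length() > 3) {
                result.add(word.toUpperCase());
            }
        }
        return result;
    }

    // Java 8 style: the same logic expressed with lambda expressions.
    static List<String> upperCaseLambda(List<String> words) {
        return words.stream()
                .filter(word -> word.length() > 3)
                .map(word -> word.toUpperCase())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("data", "is", "science");
        System.out.println(upperCaseLoop(words));
        System.out.println(upperCaseLambda(words));
    }
}
```

Both methods return the same list; the stream version states what is being computed, while the loop spells out how.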

There is ample support provided for the basic data science tasks. These include multiple ways of acquiring data, libraries for cleaning data, and a wide variety of analysis approaches for tasks such as natural language processing and statistical analysis. There are also a myriad of libraries supporting neural network types of analysis.

Java can be a very good choice for data science problems. The language provides both object-oriented and functional support for solving problems. There is a large developer community to draw upon and there exist multiple APIs that support data science tasks. These are but a few reasons as to why Java should be used.

The remainder of this chapter will provide an overview of the data science tasks and Java support demonstrated in the book. Each section is only able to present a brief introduction to the topics and the available support. Subsequent chapters will go into considerably more depth regarding these topics.

Acquiring data for an application

Data acquisition is an important step in the data analysis process. When data is acquired, it is often in a specialized form and its contents may be inconsistent or different from an application's need. Many sources of data can be found on the Internet. Several examples will be demonstrated in Chapter 2, Data Acquisition.

Data may be stored in a variety of formats. Popular formats for text data include HTML, Comma Separated Values (CSV), JavaScript Object Notation (JSON), and XML. Image and audio data are stored in a number of formats. However, it is frequently necessary to convert one data format into another format, typically plain text.

For example, JSON (http://www.JSON.org/) is stored using blocks of curly braces containing key-value pairs. In the following example, part of a YouTube result is shown:

{
  "kind": "youtube#searchResult",
  "etag": etag,
  "id": {
    "kind": string,
    "videoId": string,
    "channelId": string,
    "playlistId": string
  },
  ...
}

Data is acquired using techniques such as processing live streams, downloading compressed files, and screen scraping, where the information on a web page is extracted. Web crawling is a technique where a program examines a series of web pages, moving from one page to another, acquiring the data that it needs.

With many popular media sites, it is necessary to acquire a user ID and password to access data. A commonly used technique is OAuth, which is an open standard for delegated authorization that many websites also rely on when signing users in. The technique delegates access to a server resource and works over HTTPS. Several companies use OAuth 2.0, including PayPal, Facebook, Twitter, and Yelp.

Visualizing data to enhance understanding

The analysis of data often results in a series of numbers representing the results of the analysis. However, for most people, this way of expressing results is not always intuitive. A better way to understand the results is to create graphs and charts to depict the results and the relationship between the elements of the result.

The human mind is often good at seeing patterns, trends, and outliers in visual representation. The large amount of data present in many data science problems can be analyzed using visualization techniques. Visualization is appropriate for a wide range of audiences ranging from analysts to upper-level management to clientele. In this chapter, we present various visualization techniques and demonstrate how they are supported in Java.

In Chapter 4, Data Visualization, we illustrate how to create different types of graphs, plots, and charts. These examples use JavaFX along with a free library called GRAL (http://trac.erichseifert.de/gral/).

Visualization allows users to examine large datasets in ways that provide insights that are not apparent in the mass of the data. Visualization tools help us identify potential problems or unexpected data results and develop meaningful interpretations of the data.

For example, outliers, which are values that lie outside of the normal range of values, can be hard to spot from a sea of numbers. Creating a graph based on the data allows users to quickly see outliers. It can also help spot errors quickly and more easily classify data.
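Outliers can also be flagged numerically before a chart is drawn. The sketch below marks any value that lies more than a chosen number of standard deviations from the mean. It is a rough illustration under stated assumptions: the class OutlierSketch, the method findOutliers, the sample values, and the threshold of 2.0 are all hypothetical, and this simple rule is only one of several ways to define an outlier.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlierSketch {

    // Returns the values whose distance from the mean exceeds
    // threshold * (population standard deviation).
    static List<Double> findOutliers(double[] values, double threshold) {
        double mean = 0.0;
        for (double v : values) {
            mean += v;
        }
        mean /= values.length;

        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += (v - mean) * (v - mean);
        }
        double stdDev = Math.sqrt(sumOfSquares / values.length);

        List<Double> outliers = new ArrayList<>();
        if (stdDev == 0.0) {
            return outliers; // all values identical, nothing to flag
        }
        for (double v : values) {
            if (Math.abs(v - mean) / stdDev > threshold) {
                outliers.add(v);
            }
        }
        return outliers;
    }

    public static void main(String[] args) {
        // Hypothetical measurements with one suspicious value.
        double[] values = {10, 12, 11, 13, 12, 11, 95};
        System.out.println(findOutliers(values, 2.0));
    }
}
```

With very small samples a single extreme value inflates the standard deviation, so the threshold usually needs to be tuned to the dataset at hand.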

For example, the following chart might suggest that the upper two values should be outliers that need to be dealt with:

Machine learning applied to data science

Machine learning has become increasingly important for data science analysis, as it has for a multitude of other fields. A defining characteristic of machine learning is the ability of a model to be trained on a set of representative data and then later used to solve similar problems. There is no need to explicitly program an application to solve the problem. A model is a representation of a real-world object.

For example, customer purchases can be used to train a model. Predictions can then be made about the types of purchases a customer might subsequently make. This allows an organization to tailor ads and coupons for a customer and potentially provide a better customer experience.

Training can be performed in one of several different approaches:

Supervised learning: The model is trained with annotated, labeled data showing corresponding correct results

Unsupervised learning: The data does not contain results, but the model is expected to find relationships on its own

Semi-supervised: A small amount of labeled data is combined with a larger amount of unlabeled data

Reinforcement learning: This is similar to supervised learning, but a reward is provided for good results

There are several approaches that support machine learning. In Chapter 6, Machine Learning, we will illustrate three techniques:

Decision trees: A tree is constructed using features of the problem as internal nodes and the results as leaves

Support vector machines: This is used for classification by creating a hyperplane that partitions the dataset and then makes predictions

Bayesian networks: This is used to depict probabilistic relationships between events

A Support Vector Machine (SVM) is used primarily for classification type problems. The approach creates a hyperplane to categorize data, which can be envisioned as a geometric plane that separates two regions. In a two-dimensional space, it will be a line. In a three-dimensional space, it will be a two-dimensional plane. In Chapter 6, Machine Learning, we will demonstrate how to use the approach using a set of data relating to the propensity of individuals to camp. We will use the Weka class, SMO, to demonstrate this type of analysis.
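The role of the hyperplane in prediction can be shown with a small sketch. It does not train an SVM and does not use the Weka SMO class; it assumes that a hyperplane, given by hypothetical weights and a bias, has already been found, and only shows how a point is assigned to one side or the other. The class HyperplaneSketch and the method classify are illustrative names.

```java
public class HyperplaneSketch {

    // Returns +1 or -1 depending on which side of the hyperplane
    // weights . point + bias = 0 the point lies (points on the plane map to +1).
    static int classify(double[] weights, double bias, double[] point) {
        double score = bias;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * point[i];
        }
        return score >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        // Hypothetical two-dimensional hyperplane: the line x + y - 5 = 0.
        double[] weights = {1.0, 1.0};
        double bias = -5.0;

        System.out.println(classify(weights, bias, new double[]{4.0, 4.0})); // above the line
        System.out.println(classify(weights, bias, new double[]{1.0, 2.0})); // below the line
    }
}
```

In two dimensions the weights and bias describe a line, in three dimensions a plane, which matches the geometric picture given above; training an SVM amounts to choosing these numbers from labeled data.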

The following figure depicts a hyperplane using a distribution of two types of data points. The lines represent possible hyperplanes that separate these points. The lines clearly separate the data points except for one outlier.

Once the model has been trained, the possible hyperplanes are considered and predictions can then be made using similar data.