Machine Learning: End-to-End guide for Java developers - Richard M. Reese - E-Book

Machine Learning: End-to-End guide for Java developers E-Book

Richard M. Reese

Description

Develop, implement, and tune up your machine learning applications using the power of Java programming

About This Book

  • Detailed coverage of key machine learning topics with an emphasis on both theoretical and practical aspects
  • Address predictive modeling problems using the most popular machine learning Java libraries
  • A comprehensive course covering a wide spectrum of topics such as machine learning and natural language processing through practical use cases

Who This Book Is For

This course is the right resource for anyone with some knowledge of Java programming who wants to get started with data science and machine learning as quickly as possible. If you want to gain meaningful insights from big data and develop intelligent applications using Java, this course is also a must-have.

What You Will Learn

  • Understand key data analysis techniques centered around machine learning
  • Implement Java APIs and various techniques such as classification, clustering, anomaly detection, and more
  • Master key Java machine learning libraries, their functionality, and various kinds of problems that can be addressed using each of them
  • Apply machine learning to real-world data for fraud detection, recommendation engines, text classification, and human activity recognition
  • Experiment with semi-supervised learning and stream-based data mining, building high-performing and real-time predictive models
  • Develop intelligent systems centered around various domains such as security, Internet of Things, social networking, and more

In Detail

Machine learning is one of the core areas of artificial intelligence, in which computers are trained to learn, grow, change, and develop on their own without being explicitly programmed. In this course, we cover how Java is used to build powerful machine learning models that address problems faced in the world of data science. The course demonstrates complex data extraction and statistical analysis techniques supported by Java, applies various machine learning methods, and explores machine learning sub-domains and real-world use cases such as recommendation systems, fraud detection, natural language processing, and more, using Java programming. The course begins with an introduction to data science and basic data science tasks such as data collection, data cleaning, data analysis, and data visualization. The next section gives a detailed overview of statistical techniques, covering machine learning, neural networks, and deep learning. The next couple of sections cover applying machine learning methods using Java to a variety of tasks, including classification, prediction, forecasting, market basket analysis, clustering, stream learning, active learning, semi-supervised learning, probabilistic graph modeling, text mining, and deep learning.

The last section highlights real-world use cases such as activity recognition, image recognition, text classification, and anomaly detection. The course includes premium content from three of our most popular books:

  • Java for Data Science
  • Machine Learning in Java
  • Mastering Java Machine Learning

On completion of this course, you will understand various machine learning techniques and the different machine learning algorithms in Java that you can use to gain data insights. You will also be able to build data models to analyze larger, complex data sets and to develop applications using Java and machine learning algorithms in the field of artificial intelligence.

Style and approach

This comprehensive course proceeds from being a tutorial to a practical guide, providing an introduction to machine learning and different machine learning techniques, exploring machine learning with Java libraries, and demonstrating real-world machine learning use cases using the Java platform.


Page count: 1334

Year of publication: 2017




Table of Contents

Machine Learning: End-to-End guide for Java developers
Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Module 1
1. Getting Started with Data Science
Problems solved using data science
Understanding the data science problem-solving approach
Using Java to support data science
Acquiring data for an application
The importance and process of cleaning data
Visualizing data to enhance understanding
The use of statistical methods in data science
Machine learning applied to data science
Using neural networks in data science
Deep learning approaches
Performing text analysis
Visual and audio analysis
Improving application performance using parallel techniques
Assembling the pieces
Summary
2. Data Acquisition
Understanding the data formats used in data science applications
Overview of CSV data
Overview of spreadsheets
Overview of databases
Overview of PDF files
Overview of JSON
Overview of XML
Overview of streaming data
Overview of audio/video/images in Java
Data acquisition techniques
Using the HttpUrlConnection class
Web crawlers in Java
Creating your own web crawler
Using the crawler4j web crawler
Web scraping in Java
Using API calls to access common social media sites
Using OAuth to authenticate users
Handling Twitter
Handling Wikipedia
Handling Flickr
Handling YouTube
Searching by keyword
Summary
3. Data Cleaning
Handling data formats
Handling CSV data
Handling spreadsheets
Handling Excel spreadsheets
Handling PDF files
Handling JSON
Using JSON streaming API
Using the JSON tree API
The nitty gritty of cleaning text
Using Java tokenizers to extract words
Java core tokenizers
Third-party tokenizers and libraries
Transforming data into a usable form
Simple text cleaning
Removing stop words
Finding words in text
Finding and replacing text
Data imputation
Subsetting data
Sorting text
Data validation
Validating data types
Validating dates
Validating e-mail addresses
Validating ZIP codes
Validating names
Cleaning images
Changing the contrast of an image
Smoothing an image
Brightening an image
Resizing an image
Converting images to different formats
Summary
4. Data Visualization
Understanding plots and graphs
Visual analysis goals
Creating index charts
Creating bar charts
Using country as the category
Using decade as the category
Creating stacked graphs
Creating pie charts
Creating scatter charts
Creating histograms
Creating donut charts
Creating bubble charts
Summary
5. Statistical Data Analysis Techniques
Working with mean, mode, and median
Calculating the mean
Using simple Java techniques to find mean
Using Java 8 techniques to find mean
Using Google Guava to find mean
Using Apache Commons to find mean
Calculating the median
Using simple Java techniques to find median
Using Apache Commons to find the median
Calculating the mode
Using ArrayLists to find multiple modes
Using a HashMap to find multiple modes
Using Apache Commons to find multiple modes
Standard deviation
Sample size determination
Hypothesis testing
Regression analysis
Using simple linear regression
Using multiple regression
Summary
6. Machine Learning
Supervised learning techniques
Decision trees
Decision tree types
Decision tree libraries
Using a decision tree with a book dataset
Testing the book decision tree
Support vector machines
Using an SVM for camping data
Testing individual instances
Bayesian networks
Using a Bayesian network
Unsupervised machine learning
Association rule learning
Using association rule learning to find buying relationships
Reinforcement learning
Summary
7. Neural Networks
Training a neural network
Getting started with neural network architectures
Understanding static neural networks
A basic Java example
Understanding dynamic neural networks
Multilayer perceptron networks
Building the model
Evaluating the model
Predicting other values
Saving and retrieving the model
Learning vector quantization
Self-Organizing Maps
Using a SOM
Displaying the SOM results
Additional network architectures and algorithms
The k-Nearest Neighbors algorithm
Instantaneously trained networks
Spiking neural networks
Cascading neural networks
Holographic associative memory
Backpropagation and neural networks
Summary
8. Deep Learning
Deeplearning4j architecture
Acquiring and manipulating data
Reading in a CSV file
Configuring and building a model
Using hyperparameters in ND4J
Instantiating the network model
Training a model
Testing a model
Deep learning and regression analysis
Preparing the data
Setting up the class
Reading and preparing the data
Building the model
Evaluating the model
Restricted Boltzmann Machines
Reconstruction in an RBM
Configuring an RBM
Deep autoencoders
Building an autoencoder in DL4J
Configuring the network
Building and training the network
Saving and retrieving a network
Specialized autoencoders
Convolutional networks
Building the model
Evaluating the model
Recurrent Neural Networks
Summary
9. Text Analysis
Implementing named entity recognition
Using OpenNLP to perform NER
Identifying location entities
Classifying text
Word2Vec and Doc2Vec
Classifying text by labels
Classifying text by similarity
Understanding tagging and POS
Using OpenNLP to identify POS
Understanding POS tags
Extracting relationships from sentences
Using OpenNLP to extract relationships
Sentiment analysis
Downloading and extracting the Word2Vec model
Building our model and classifying text
Summary
10. Visual and Audio Analysis
Text-to-speech
Using FreeTTS
Getting information about voices
Gathering voice information
Understanding speech recognition
Using CMUSphinx to convert speech to text
Obtaining more detail about the words
Extracting text from an image
Using Tess4j to extract text
Identifying faces
Using OpenCV to detect faces
Classifying visual data
Creating a Neuroph Studio project for classifying visual images
Training the model
Summary
11. Mathematical and Parallel Techniques for Data Analysis
Implementing basic matrix operations
Using GPUs with DeepLearning4j
Using map-reduce
Using Apache's Hadoop to perform map-reduce
Writing the map method
Writing the reduce method
Creating and executing a new Hadoop job
Various mathematical libraries
Using the jblas API
Using the Apache Commons math API
Using the ND4J API
Using OpenCL
Using Aparapi
Creating an Aparapi application
Using Aparapi for matrix multiplication
Using Java 8 streams
Understanding Java 8 lambda expressions and streams
Using Java 8 to perform matrix multiplication
Using Java 8 to perform map-reduce
Summary
12. Bringing It All Together
Defining the purpose and scope of our application
Understanding the application's architecture
Data acquisition using Twitter
Understanding the TweetHandler class
Extracting data for a sentiment analysis model
Building the sentiment model
Processing the JSON input
Cleaning data to improve our results
Removing stop words
Performing sentiment analysis
Analysing the results
Other optional enhancements
Summary
2. Module 2
1. Applied Machine Learning Quick Start
Machine learning and data science
What kind of problems can machine learning solve?
Applied machine learning workflow
Data and problem definition
Measurement scales
Data collection
Find or observe data
Generate data
Sampling traps
Data pre-processing
Data cleaning
Fill missing values
Remove outliers
Data transformation
Data reduction
Unsupervised learning
Find similar items
Euclidean distances
Non-Euclidean distances
The curse of dimensionality
Clustering
Supervised learning
Classification
Decision tree learning
Probabilistic classifiers
Kernel methods
Artificial neural networks
Ensemble learning
Evaluating classification
Precision and recall
ROC curves
Regression
Linear regression
Evaluating regression
Mean squared error
Mean absolute error
Correlation coefficient
Generalization and evaluation
Underfitting and overfitting
Train and test sets
Cross-validation
Leave-one-out validation
Stratification
Summary
2. Java Libraries and Platforms for Machine Learning
The need for Java
Machine learning libraries
Weka
Java machine learning
Apache Mahout
Apache Spark
Deeplearning4j
MALLET
Comparing libraries
Building a machine learning application
Traditional machine learning architecture
Dealing with big data
Big data application architecture
Summary
3. Basic Algorithms – Classification, Regression, and Clustering
Before you start
Classification
Data
Loading data
Feature selection
Learning algorithms
Classify new data
Evaluation and prediction error metrics
Confusion matrix
Choosing a classification algorithm
Regression
Loading the data
Analyzing attributes
Building and evaluating regression model
Linear regression
Regression trees
Tips to avoid common regression problems
Clustering
Clustering algorithms
Evaluation
Summary
4. Customer Relationship Prediction with Ensembles
Customer relationship database
Challenge
Dataset
Evaluation
Basic naive Bayes classifier baseline
Getting the data
Loading the data
Basic modeling
Evaluating models
Implementing naive Bayes baseline
Advanced modeling with ensembles
Before we start
Data pre-processing
Attribute selection
Model selection
Performance evaluation
Summary
5. Affinity Analysis
Market basket analysis
Affinity analysis
Association rule learning
Basic concepts
Database of transactions
Itemset and rule
Support
Confidence
Apriori algorithm
FP-growth algorithm
The supermarket dataset
Discover patterns
Apriori
FP-growth
Other applications in various areas
Medical diagnosis
Protein sequences
Census data
Customer relationship management
IT Operations Analytics
Summary
6. Recommendation Engine with Apache Mahout
Basic concepts
Key concepts
User-based and item-based analysis
Approaches to calculate similarity
Collaborative filtering
Content-based filtering
Hybrid approach
Exploitation versus exploration
Getting Apache Mahout
Configuring Mahout in Eclipse with the Maven plugin
Building a recommendation engine
Book ratings dataset
Loading the data
Loading data from file
Loading data from database
In-memory database
Collaborative filtering
User-based filtering
Item-based filtering
Adding custom rules to recommendations
Evaluation
Online learning engine
Content-based filtering
Summary
7. Fraud and Anomaly Detection
Suspicious and anomalous behavior detection
Unknown-unknowns
Suspicious pattern detection
Anomalous pattern detection
Analysis types
Pattern analysis
Transaction analysis
Plan recognition
Fraud detection of insurance claims
Dataset
Modeling suspicious patterns
Vanilla approach
Dataset rebalancing
Anomaly detection in website traffic
Dataset
Anomaly detection in time series data
Histogram-based anomaly detection
Loading the data
Creating histograms
Density based k-nearest neighbors
Summary
8. Image Recognition with Deeplearning4j
Introducing image recognition
Neural networks
Perceptron
Feedforward neural networks
Autoencoder
Restricted Boltzmann machine
Deep convolutional networks
Image classification
Deeplearning4j
Getting DL4J
MNIST dataset
Loading the data
Building models
Building a single-layer regression model
Building a deep belief network
Build a Multilayer Convolutional Network
Summary
9. Activity Recognition with Mobile Phone Sensors
Introducing activity recognition
Mobile phone sensors
Activity recognition pipeline
The plan
Collecting data from a mobile phone
Installing Android Studio
Loading the data collector
Feature extraction
Collecting training data
Building a classifier
Reducing spurious transitions
Plugging the classifier into a mobile app
Summary
10. Text Mining with Mallet – Topic Modeling and Spam Detection
Introducing text mining
Topic modeling
Text classification
Installing Mallet
Working with text data
Importing data
Importing from directory
Importing from file
Pre-processing text data
Topic modeling for BBC news
BBC dataset
Modeling
Evaluating a model
Reusing a model
Saving a model
Restoring a model
E-mail spam detection
E-mail spam dataset
Feature generation
Training and testing
Model performance
Summary
11. What is Next?
Machine learning in real life
Noisy data
Class unbalance
Feature selection is hard
Model chaining
Importance of evaluation
Getting models into production
Model maintenance
Standards and markup languages
CRISP-DM
SEMMA methodology
Predictive Model Markup Language
Machine learning in the cloud
Machine learning as a service
Web resources and competitions
Datasets
Online courses
Competitions
Websites and blogs
Venues and conferences
Summary
A. References
3. Module 3
1. Machine Learning Review
Machine learning – history and definition
What is not machine learning?
Machine learning – concepts and terminology
Machine learning – types and subtypes
Datasets used in machine learning
Machine learning applications
Practical issues in machine learning
Machine learning – roles and process
Roles
Process
Machine learning – tools and datasets
Datasets
Summary
2. Practical Approach to Real-World Supervised Learning
Formal description and notation
Data quality analysis
Descriptive data analysis
Basic label analysis
Basic feature analysis
Visualization analysis
Univariate feature analysis
Categorical features
Continuous features
Multivariate feature analysis
Data transformation and preprocessing
Feature construction
Handling missing values
Outliers
Discretization
Data sampling
Is sampling needed?
Undersampling and oversampling
Stratified sampling
Training, validation, and test set
Feature relevance analysis and dimensionality reduction
Feature search techniques
Feature evaluation techniques
Filter approach
Univariate feature selection
Information theoretic approach
Statistical approach
Multivariate feature selection
Minimal redundancy maximal relevance (mRMR)
Correlation-based feature selection (CFS)
Wrapper approach
Embedded approach
Model building
Linear models
Linear Regression
Algorithm input and output
How does it work?
Advantages and limitations
Naïve Bayes
Algorithm input and output
How does it work?
Advantages and limitations
Logistic Regression
Algorithm input and output
How does it work?
Advantages and limitations
Non-linear models
Decision Trees
Algorithm inputs and outputs
How does it work?
Advantages and limitations
K-Nearest Neighbors (KNN)
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Support vector machines (SVM)
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Ensemble learning and meta learners
Bootstrap aggregating or bagging
Algorithm inputs and outputs
How does it work?
Random Forest
Advantages and limitations
Boosting
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Model assessment, evaluation, and comparisons
Model assessment
Model evaluation metrics
Confusion matrix and related metrics
ROC and PRC curves
Gain charts and lift curves
Model comparisons
Comparing two algorithms
McNemar's Test
Paired-t test
Wilcoxon signed-rank test
Comparing multiple algorithms
ANOVA test
Friedman's test
Case Study – Horse Colic Classification
Business problem
Machine learning mapping
Data analysis
Label analysis
Features analysis
Supervised learning experiments
Weka experiments
Sample end-to-end process in Java
Weka experimenter and model selection
RapidMiner experiments
Visualization analysis
Feature selection
Model process flow
Model evaluation metrics
Evaluation on Confusion Metrics
ROC Curves, Lift Curves, and Gain Charts
Results, observations, and analysis
Summary
References
3. Unsupervised Machine Learning Techniques
Issues in common with supervised learning
Issues specific to unsupervised learning
Feature analysis and dimensionality reduction
Notation
Linear methods
Principal component analysis (PCA)
Inputs and outputs
How does it work?
Advantages and limitations
Random projections (RP)
Inputs and outputs
How does it work?
Advantages and limitations
Multidimensional Scaling (MDS)
Inputs and outputs
How does it work?
Advantages and limitations
Nonlinear methods
Kernel Principal Component Analysis (KPCA)
Inputs and outputs
How does it work?
Advantages and limitations
Manifold learning
Inputs and outputs
How does it work?
Advantages and limitations
Clustering
Clustering algorithms
k-Means
Inputs and outputs
How does it work?
Advantages and limitations
DBSCAN
Inputs and outputs
How does it work?
Advantages and limitations
Mean shift
Inputs and outputs
How does it work?
Advantages and limitations
Expectation maximization (EM) or Gaussian mixture modeling (GMM)
Input and output
How does it work?
Advantages and limitations
Hierarchical clustering
Input and output
How does it work?
Advantages and limitations
Self-organizing maps (SOM)
Inputs and outputs
How does it work?
Advantages and limitations
Spectral clustering
Inputs and outputs
How does it work?
Advantages and limitations
Affinity propagation
Inputs and outputs
How does it work?
Advantages and limitations
Clustering validation and evaluation
Internal evaluation measures
Notation
R-Squared
Dunn's Indices
Davies-Bouldin index
Silhouette's index
External evaluation measures
Rand index
F-Measure
Normalized mutual information index
Outlier or anomaly detection
Outlier algorithms
Statistical-based
Inputs and outputs
How does it work?
Advantages and limitations
Distance-based methods
Inputs and outputs
How does it work?
Advantages and limitations
Density-based methods
Inputs and outputs
How does it work?
Advantages and limitations
Clustering-based methods
Inputs and outputs
How does it work?
Advantages and limitations
High-dimensional-based methods
Inputs and outputs
How does it work?
Advantages and limitations
One-class SVM
Inputs and outputs
How does it work?
Advantages and limitations
Outlier evaluation techniques
Supervised evaluation
Unsupervised evaluation
Real-world case study
Tools and software
Business problem
Machine learning mapping
Data collection
Data quality analysis
Data sampling and transformation
Feature analysis and dimensionality reduction
PCA
Random projections
ISOMAP
Observations on feature analysis and dimensionality reduction
Clustering models, results, and evaluation
Observations and clustering analysis
Outlier models, results, and evaluation
Observations and analysis
Summary
References
4. Semi-Supervised and Active Learning
Semi-supervised learning
Representation, notation, and assumptions
Semi-supervised learning techniques
Self-training SSL
Inputs and outputs
How does it work?
Advantages and limitations
Co-training SSL or multi-view SSL
Inputs and outputs
How does it work?
Advantages and limitations
Cluster and label SSL
Inputs and outputs
How does it work?
Advantages and limitations
Transductive graph label propagation
Inputs and outputs
How does it work?
Advantages and limitations
Transductive SVM (TSVM)
Inputs and outputs
How does it work?
Advantages and limitations
Case study in semi-supervised learning
Tools and software
Business problem
Machine learning mapping
Data collection
Data quality analysis
Data sampling and transformation
Datasets and analysis
Feature analysis results
Experiments and results
Analysis of semi-supervised learning
Active learning
Representation and notation
Active learning scenarios
Active learning approaches
Uncertainty sampling
How does it work?
Least confident sampling
Smallest margin sampling
Label entropy sampling
Advantages and limitations
Version space sampling
Query by disagreement (QBD)
How does it work?
Query by Committee (QBC)
How does it work?
Advantages and limitations
Data distribution sampling
How does it work?
Expected model change
Expected error reduction
Variance reduction
Density weighted methods
Advantages and limitations
Case study in active learning
Tools and software
Business problem
Machine learning mapping
Data Collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Pool-based scenarios
Stream-based scenarios
Analysis of active learning results
Summary
References
5. Real-Time Stream Machine Learning
Assumptions and mathematical notations
Basic stream processing and computational techniques
Stream computations
Sliding windows
Sampling
Concept drift and drift detection
Data management
Partial memory
Full memory
Detection methods
Monitoring model evolution
Widmer and Kubat
Drift Detection Method or DDM
Early Drift Detection Method or EDDM
Monitoring distribution changes
Welch's t test
Kolmogorov-Smirnov's test
CUSUM and Page-Hinckley test
Adaptation methods
Explicit adaptation
Implicit adaptation
Incremental supervised learning
Modeling techniques
Linear algorithms
Online linear models with loss functions
Inputs and outputs
How does it work?
Advantages and limitations
Online Naïve Bayes
Inputs and outputs
How does it work?
Advantages and limitations
Non-linear algorithms
Hoeffding trees or very fast decision trees (VFDT)
Inputs and outputs
How does it work?
Advantages and limitations
Ensemble algorithms
Weighted majority algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Online Bagging algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Online Boosting algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Validation, evaluation, and comparisons in online setting
Model validation techniques
Prequential evaluation
Holdout evaluation
Controlled permutations
Evaluation criteria
Comparing algorithms and metrics
Incremental unsupervised learning using clustering
Modeling techniques
Partition based
Online k-Means
Inputs and outputs
How does it work?
Advantages and limitations
Hierarchical based and micro clustering
Inputs and outputs
How does it work?
Advantages and limitations
Inputs and outputs
How does it work?
Advantages and limitations
Density based
Inputs and outputs
How does it work?
Advantages and limitations
Grid based
Inputs and outputs
How does it work?
Advantages and limitations
Validation and evaluation techniques
Key issues in stream cluster evaluation
Evaluation measures
Cluster Mapping Measures (CMM)
V-Measure
Other external measures
Unsupervised learning using outlier detection
Partition-based clustering for outlier detection
Inputs and outputs
How does it work?
Advantages and limitations
Distance-based clustering for outlier detection
Inputs and outputs
How does it work?
Exact Storm
Abstract-C
Direct Update of Events (DUE)
Micro Clustering based Algorithm (MCOD)
Approx Storm
Advantages and limitations
Validation and evaluation techniques
Case study in stream learning
Tools and software
Business problem
Machine learning mapping
Data collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Supervised learning experiments
Concept drift experiments
Clustering experiments
Outlier detection experiments
Analysis of stream learning results
Summary
References
6. Probabilistic Graph Modeling
Probability revisited
Concepts in probability
Conditional probability
Chain rule and Bayes' theorem
Random variables, joint, and marginal distributions
Marginal independence and conditional independence
Factors
Factor types
Distribution queries
Probabilistic queries
MAP queries and marginal MAP queries
Graph concepts
Graph structure and properties
Subgraphs and cliques
Path, trail, and cycles
Bayesian networks
Representation
Definition
Reasoning patterns
Causal or predictive reasoning
Evidential or diagnostic reasoning
Intercausal reasoning
Combined reasoning
Independencies, flow of influence, D-Separation, I-Map
Flow of influence
D-Separation
I-Map
Inference
Elimination-based inference
Variable elimination algorithm
Input and output
How does it work?
Advantages and limitations
Clique tree or junction tree algorithm
Input and output
How does it work?
Advantages and limitations
Propagation-based techniques
Belief propagation
Factor graph
Messaging in factor graph
Input and output
How does it work?
Advantages and limitations
Sampling-based techniques
Forward sampling with rejection
Input and output
How does it work?
Advantages and limitations
Learning
Learning parameters
Maximum likelihood estimation for Bayesian networks
Bayesian parameter estimation for Bayesian network
Prior and posterior using the Dirichlet distribution
Learning structures
Measures to evaluate structures
Methods for learning structures
Constraint-based techniques
Inputs and outputs
How does it work?
Advantages and limitations
Search and score-based techniques
Inputs and outputs
How does it work?
Advantages and limitations
Markov networks and conditional random fields
Representation
Parameterization
Gibbs parameterization
Factor graphs
Log-linear models
Independencies
Global
Pairwise Markov
Markov blanket
Inference
Learning
Conditional random fields
Specialized networks
Tree augmented network
Input and output
How does it work?
Advantages and limitations
Markov chains
Hidden Markov models
Most probable path in HMM
Posterior decoding in HMM
Tools and usage
OpenMarkov
Weka Bayesian Network GUI
Case study
Business problem
Machine learning mapping
Data sampling and transformation
Feature analysis
Models, results, and evaluation
Analysis of results
Summary
References
7. Deep Learning
Multi-layer feed-forward neural network
Inputs, neurons, activation function, and mathematical notation
Multi-layered neural network
Structure and mathematical notations
Activation functions in NN
Sigmoid function
Hyperbolic tangent ("tanh") function
Training neural network
Empirical risk minimization
Parameter initialization
Loss function
Gradients
Gradient at the output layer
Gradient at the Hidden Layer
Parameter gradient
Feed forward and backpropagation
How does it work?
Regularization
L2 regularization
L1 regularization
Limitations of neural networks
Vanishing gradients, local optimum, and slow training
Deep learning
Building blocks for deep learning
Rectified linear activation function
Restricted Boltzmann Machines
Definition and mathematical notation
Conditional distribution
Free energy in RBM
Training the RBM
Sampling in RBM
Contrastive divergence
Inputs and outputs
How does it work?
Persistent contrastive divergence
Autoencoders
Definition and mathematical notations
Loss function
Limitations of Autoencoders
Denoising Autoencoder
Unsupervised pre-training and supervised fine-tuning
Deep feed-forward NN
Input and outputs
How does it work?
Deep Autoencoders
Deep Belief Networks
Inputs and outputs
How does it work?
Deep learning with dropouts
Definition and mathematical notation
Inputs and outputs
How does it work?
Training and testing with dropouts
Sparse coding
Convolutional Neural Network
Local connectivity
Parameter sharing
Discrete convolution
Pooling or subsampling
Normalization using ReLU
CNN Layers
Recurrent Neural Networks
Structure of Recurrent Neural Networks
Learning and associated problems in RNNs
Long Short Term Memory
Gated Recurrent Units
Case study
Tools and software
Business problem
Machine learning mapping
Data sampling and transformation
Feature analysis
Models, results, and evaluation
Basic data handling
Multi-layer perceptron
Parameters used for MLP
Code for MLP
Convolutional Network
Parameters used for ConvNet
Code for CNN
Variational Autoencoder
Parameters used for the Variational Autoencoder
Code for Variational Autoencoder
DBN
Parameter search using Arbiter
Results and analysis
Summary
References
8. Text Mining and Natural Language Processing
NLP, subfields, and tasks
Text categorization
Part-of-speech tagging (POS tagging)
Text clustering
Information extraction and named entity recognition
Sentiment analysis and opinion mining
Coreference resolution
Word sense disambiguation
Machine translation
Semantic reasoning and inferencing
Text summarization
Automating question and answers
Issues with mining unstructured data
Text processing components and transformations
Document collection and standardization
Inputs and outputs
How does it work?
Tokenization
Inputs and outputs
How does it work?
Stop words removal
Inputs and outputs
How does it work?
Stemming or lemmatization
Inputs and outputs
How does it work?
Local/global dictionary or vocabulary?
Feature extraction/generation
Lexical features
Character-based features
Word-based features
Part-of-speech tagging features
Taxonomy features
Syntactic features
Semantic features
Feature representation and similarity
Vector space model
Binary
Term frequency (TF)
Inverse document frequency (IDF)
Term frequency-inverse document frequency (TF-IDF)
Similarity measures
Euclidean distance
Cosine distance
Pairwise-adaptive similarity
Extended Jaccard coefficient
Dice coefficient
Feature selection and dimensionality reduction
Feature selection
Information theoretic techniques
Statistical-based techniques
Frequency-based techniques
Dimensionality reduction
Topics in text mining
Text categorization/classification
Topic modeling
Probabilistic latent semantic analysis (PLSA)
Input and output
How does it work?
Advantages and limitations
Text clustering
Feature transformation, selection, and reduction
Clustering techniques
Generative probabilistic models
Input and output
How does it work?
Advantages and limitations
Distance-based text clustering
Non-negative matrix factorization (NMF)
Input and output
How does it work?
Advantages and limitations
Evaluation of text clustering
Named entity recognition
Hidden Markov models for NER
Input and output
How does it work?
Advantages and limitations
Maximum entropy Markov models for NER
Input and output
How does it work?
Advantages and limitations
Deep learning and NLP
Tools and usage
Mallet
KNIME
Topic modeling with mallet
Business problem
Machine Learning mapping
Data collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Analysis of text processing results
Summary
References
9. Big Data Machine Learning – The Final Frontier
What are the characteristics of Big Data?
Big Data Machine Learning
General Big Data framework
Big Data cluster deployment frameworks
Hortonworks Data Platform
Cloudera CDH
Amazon Elastic MapReduce
Microsoft Azure HDInsight
Data acquisition
Publish-subscribe frameworks
Source-sink frameworks
SQL frameworks
Message queueing frameworks
Custom frameworks
Data storage
HDFS
NoSQL
Key-value databases
Document databases
Columnar databases
Graph databases
Data processing and preparation
Hive and HQL
Spark SQL
Amazon Redshift
Real-time stream processing
Machine Learning
Visualization and analysis
Batch Big Data Machine Learning
H2O as Big Data Machine Learning platform
H2O architecture
Machine learning in H2O
Tools and usage
Case study
Business problem
Machine Learning mapping
Data collection
Data sampling and transformation
Experiments, results, and analysis
Feature relevance and analysis
Evaluation on test data
Analysis of results
Spark MLlib as Big Data Machine Learning platform
Spark architecture
Machine Learning in MLlib
Tools and usage
Experiments, results, and analysis
k-Means
k-Means with PCA
Bisecting k-Means (with PCA)
Gaussian Mixture Model
Random Forest
Analysis of results
Real-time Big Data Machine Learning
SAMOA as a real-time Big Data Machine Learning framework
SAMOA architecture
Machine Learning algorithms
Tools and usage
Experiments, results, and analysis
Analysis of results
The future of Machine Learning
Summary
References
A. Linear Algebra
Vector
Scalar product of vectors
Matrix
Transpose of a matrix
Matrix addition
Scalar multiplication
Matrix multiplication
Properties of matrix product
Linear transformation
Matrix inverse
Eigendecomposition
Positive definite matrix
Singular value decomposition (SVD)
B. Probability
Axioms of probability
Bayes' theorem
Density estimation
Mean
Variance
Standard deviation
Gaussian standard deviation
Covariance
Correlation coefficient
Binomial distribution
Poisson distribution
Gaussian distribution
Central limit theorem
Error propagation
D. Bibliography
Index

Machine Learning: End-to-End guide for Java developers

Copyright © 2017 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Published on: September 2017

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78862-221-9

www.packtpub.com

Credits

Authors

Richard M. Reese

Jennifer L. Reese

Boštjan Kaluža

Dr. Uday Kamath

Krishna Choppella

Reviewers

Walter Molina

Shilpi Saxena

Abhik Banerjee

Wei Di

Manjunath Narayana

Ravi Sharma

Samir Sahli

Prashant Verma

Content Development Editor

Aishwarya Pandere

Production Coordinator

Arvindkumar Gupta

Preface

Machine learning is a subfield of artificial intelligence. It helps computers to learn and act like human beings with the help of algorithms and data. Given a set of data, an ML algorithm learns different properties of the data and infers the properties of data that it may encounter in the future.

What this learning path covers

Module 1, Java for Data Science, investigates the support provided for low-level math operations and how they can be supported in a multiple processor environment. Data analysis, at its heart, necessitates the ability to manipulate and analyze large quantities of numeric data.

Module 2, Machine Learning in Java, reviews the various Java libraries and platforms dedicated to machine learning, what each library brings to the table, and what kind of problems it is able to solve. The review includes Weka, Java-ML, Apache Mahout, Apache Spark, deeplearning4j, and Mallet.

Module 3, Mastering Java Machine Learning, presents many advanced methods in clustering and outlier techniques, with applications. Topics covered are feature selection and reduction in unsupervised data, clustering algorithms, evaluation methods in clustering, and anomaly detection using statistical, distance, and distribution techniques. At the end of the chapter, we perform a case study for both clustering and outlier detection using a real-world image dataset, MNIST. We use the Smile API to do feature reduction and ELKI for learning.

What you need for this learning path

Module 1:

Many of the examples in this module use Java 8 features. There are a number of Java APIs demonstrated, each of which is introduced before it is applied. An IDE is not required but is desirable.

Module 2:

To follow the examples throughout the module, you'll need a personal computer with the JDK installed. All the examples and source code that you can download assume Eclipse IDE with support for Maven, a dependency management and build automation tool; and Git, a version control system. Examples in the chapters rely on various libraries, including Weka, deeplearning4j, Mallet, and Apache Mahout. Instructions on how to get and install the libraries are provided in the chapter where the library will be first used.

The module has a dedicated web site, http://machine-learning-in-java.com, where you can find all the example code, errata, and additional materials that will help you to get started.

Module 3:

This book assumes you have some experience of programming in Java and a basic understanding of machine learning concepts. If that doesn't apply to you, but you are curious nonetheless and self-motivated, fret not, and read on! For those who do have some background, it means that you are familiar with simple statistical analysis of data and concepts involved in supervised and unsupervised learning. Those who may not have the requisite math or must poke the far reaches of their memory to shake loose the odd formula or funny symbol, do not be disheartened. If you are the sort that loves a challenge, the short primer in the appendices may be all you need to kick-start your engines—a bit of tenacity will see you through the rest! For those who have never been introduced to machine learning, the first chapter was equally written for you as for those needing a refresher—it is your starter-kit to jump in feet first and find out what it's all about. You can augment your basics with any number of online resources. Finally, for those innocent of Java, here's a secret: many of the tools featured in the book have powerful GUIs. Some include wizard-like interfaces, making them quite easy to use, and do not require any knowledge of Java. So if you are new to Java, just skip the examples that need coding and learn to use the GUI-based tools instead!

Who this learning path is for

This course is the right resource for anyone with some knowledge of Java programming who wants to get started with Data Science and Machine learning as quickly as possible. If you want to gain meaningful insights from big data and develop intelligent applications using Java, this course is also a must-have.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this course—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the course's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.
  2. Hover the mouse pointer on the SUPPORT tab at the top.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the course in the Search box.
  5. Select the course for which you're looking to download the code files.
  6. Choose from the drop-down menu where you purchased this course from.
  7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/repository-name. We also have other code bundles from our rich catalog of books, videos, and courses available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our courses—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this course. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the course in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this course, you can contact us at <[email protected]>, and we will do our best to address the problem.

Part 1. Module 1

Java for Data Science

Examine the techniques and Java tools supporting the growing field of data science

Chapter 1. Getting Started with Data Science

Data science is not a single science as much as it is a collection of various scientific disciplines integrated for the purpose of analyzing data. These disciplines draw on various statistical and mathematical techniques and include:

  • Computer science
  • Data engineering
  • Visualization
  • Domain-specific knowledge and approaches

With the advent of cheaper storage technology, more and more data has been collected and stored, permitting previously unfeasible processing and analysis of data. With this analysis came the need for various techniques to make sense of the data. These large datasets, analyzed to identify trends and patterns, became known as big data.

This in turn gave rise to cloud computing and concurrent techniques such as map-reduce, which distributed the analysis process across a large number of processors, taking advantage of the power of parallel processing.

The process of analyzing big data is not simple, and it led to the emergence of specialized developers known as data scientists. Drawing upon a myriad of technologies and expertise, they are able to analyze data to solve problems that previously were either not envisioned or were too difficult to solve.

Early big data applications were typified by the emergence of search engines capable of more powerful and accurate searches than their predecessors. For example, AltaVista was an early popular search engine that was eventually superseded by Google. While big data applications were not limited to these search engine functionalities, these applications laid the groundwork for future work in big data.

The term data science has been in use since 1974 and has evolved over time to include statistical analysis of data. The concepts of data mining and data analytics have been associated with data science. Around 2008, the term data scientist appeared and was used to describe a person who performs data analysis. A more in-depth discussion of the history of data science can be found at http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#3d9ea08369fd.

This book aims to take a broad look at data science using Java and will briefly touch on many topics. It is likely that the reader may find topics of interest and pursue these at greater depth independently. The purpose of this book, however, is simply to introduce the reader to the significant data science topics and to illustrate how they can be addressed using Java.

There are many algorithms used in data science. In this book, we do not attempt to explain how they work except at an introductory level. Rather, we are more interested in explaining how they can be used to solve problems. Specifically, we are interested in knowing how they can be used with Java.

Problems solved using data science

The various data science techniques that we will illustrate have been used to solve a variety of problems. Many applications of these techniques are motivated by economic gain, but the techniques have also been used to solve many pressing social and environmental problems. Problem domains where these techniques have been used include finance, optimizing business processes, understanding customer needs, performing DNA analysis, foiling terrorist plots, and finding relationships between transactions to detect fraud, among many other data-intensive problems.

Data mining is a popular application area for data science. In this activity, large quantities of data are processed and analyzed to glean information about the dataset, to provide meaningful insights, and to develop meaningful conclusions and predictions. It has been used to analyze customer behavior, to detect relationships between what may appear to be unrelated events, and to make predictions about future behavior.

Machine learning is an important aspect of data science. This technique allows the computer to solve various problems without needing to be explicitly programmed. It has been used in self-driving cars, speech recognition, and in web searches. In data mining, the data is extracted and processed. With machine learning, computers use the data to take some sort of action.

Understanding the data science problem-solving approach

Data science is concerned with the processing and analysis of large quantities of data to create models that can be used to make predictions or otherwise support a specific goal. This process often involves the building and training of models. The specific approach to solve a problem is dependent on the nature of the problem. However, in general, the following are the high-level tasks that are used in the analysis process:

  • Acquiring the data: Before we can process the data, it must be acquired. The data is frequently stored in a variety of formats and will come from a wide range of data sources.
  • Cleaning the data: Once the data has been acquired, it often needs to be converted to a different format before it can be used. In addition, the data needs to be processed, or cleaned, so as to remove errors, resolve inconsistencies, and otherwise put it in a form ready for analysis.
  • Analyzing the data: This can be performed using a number of techniques including:
      • Statistical analysis: This uses a multitude of statistical approaches to provide insight into data. It includes simple techniques and more advanced techniques such as regression analysis.
      • AI analysis: These can be grouped as machine learning, neural networks, and deep learning techniques:
          • Machine learning approaches are characterized by programs that can learn without being specifically programmed to complete a specific task
          • Neural networks are built around models patterned after the neural connection of the brain
          • Deep learning attempts to identify higher levels of abstraction within a set of data
      • Text analysis: This is a common form of analysis, which works with natural languages to identify features such as the names of people and places, the relationship between parts of text, and the implied meaning of text.
      • Data visualization: This is an important analysis tool. By displaying the data in a visual form, a hard-to-understand set of numbers can be more readily understood.
      • Video, image, and audio processing and analysis: This is a more specialized form of analysis, which is becoming more common as better analysis techniques are discovered and faster processors become available. This is in contrast to the more common text processing and analysis tasks.
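The acquire, clean, and analyze tasks listed above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class name `PipelineSketch`, its method names, and the hard-coded sample values are hypothetical and do not come from the book, and a real application would acquire its data from a file or a web service.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PipelineSketch {

    // "Acquire": a hard-coded stand-in for reading a file or calling a service.
    public static List<String> acquire() {
        return Arrays.asList("12.5", " 14.0 ", "", "n/a", "13.5");
    }

    // "Clean": trim whitespace and drop entries that are not plain decimal numbers.
    public static List<Double> clean(List<String> raw) {
        return raw.stream()
                .map(String::trim)
                .filter(s -> s.matches("-?\\d+(\\.\\d+)?"))
                .map(Double::valueOf)
                .collect(Collectors.toList());
    }

    // "Analyze": compute one simple statistic, the arithmetic mean.
    public static double mean(List<Double> values) {
        return values.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        List<Double> cleaned = clean(acquire());
        System.out.println("Cleaned values: " + cleaned);
        System.out.println("Mean: " + mean(cleaned));
    }
}
```

Keeping each task in its own method mirrors the list above and makes it easy to swap in a different data source or a different analysis later.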

Complementing this set of tasks is the need to develop applications that are efficient. The introduction of machines with multiple processors and GPUs contributes significantly to the end result.

While the exact steps used will vary by application, understanding these basic steps provides the basis for constructing solutions to many data science problems.

Using Java to support data science

Java and its associated third-party libraries provide a range of support for the development of data science applications. There are numerous core Java capabilities that can be used, such as the basic string processing methods. The introduction of lambda expressions in Java 8 helps enable more powerful and expressive means of building applications. In many of the examples that follow in subsequent chapters, we will show alternative techniques using lambda expressions.
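As a small illustration of the difference lambda expressions make, the following sketch performs the same string-processing task twice: once with an explicit loop and once with a Java 8 stream. The class name `LambdaSketch`, the method names, and the sample words are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LambdaSketch {

    // Pre-Java 8 style: keep words longer than three characters, upper-cased.
    public static List<String> upperCaseLoop(List<String> words) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (word.length() > 3) {
                result.add(word.toUpperCase());
            }
        }
        return result;
    }

    // Java 8 style: the same logic expressed with a lambda and a method reference.
    public static List<String> upperCaseStream(List<String> words) {
        return words.stream()
                .filter(word -> word.length() > 3)
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("data", "ai", "science", "ml");
        System.out.println(upperCaseLoop(words));
        System.out.println(upperCaseStream(words));
    }
}
```

Both methods return the same list; the stream version states what is being computed rather than how to iterate.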

There is ample support provided for the basic data science tasks. These include multiple ways of acquiring data, libraries for cleaning data, and a wide variety of analysis approaches for tasks such as natural language processing and statistical analysis. There is also a myriad of libraries supporting neural network types of analysis.

Java can be a very good choice for data science problems. The language provides both object-oriented and functional support for solving problems. There is a large developer community to draw upon and there exist multiple APIs that support data science tasks. These are but a few reasons as to why Java should be used.

The remainder of this chapter will provide an overview of the data science tasks and Java support demonstrated in the book. Each section is only able to present a brief introduction to the topics and the available support. Subsequent chapters will go into considerably more depth regarding these topics.

Acquiring data for an application

Data acquisition is an important step in the data analysis process. When data is acquired, it is often in a specialized form and its contents may be inconsistent or different from an application's need. Many sources of data can be found on the Internet. Several examples will be demonstrated in Chapter 2, Data Acquisition.

Data may be stored in a variety of formats. Popular formats for text data include HTML, Comma Separated Values (CSV), JavaScript Object Notation (JSON), and XML. Image and audio data are stored in a number of formats. However, it is frequently necessary to convert one data format into another format, typically plain text.

For example, JSON (http://www.JSON.org/) is stored using blocks of curly braces containing key-value pairs. In the following example, part of a YouTube result is shown:

{
  "kind": "youtube#searchResult",
  "etag": etag,
  "id": {
    "kind": string,
    "videoId": string,
    "channelId": string,
    "playlistId": string
  },
  ...
}

Data is acquired using techniques such as processing live streams, downloading compressed files, and through screen scraping, where the information on a web page is extracted. Web crawling is a technique where a program examines a series of web pages, moving from one page to another, acquiring the data that it needs.
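To make the idea of screen scraping concrete, the sketch below pulls the link targets out of a small HTML string, which is the step a crawler repeats as it moves from page to page. It is a simplified illustration only: the class `LinkExtractorSketch`, its method, and the example.com URLs are hypothetical, the page is a hard-coded string rather than a downloaded document, and a regular expression is used in place of the dedicated HTML parser that a real application would normally rely on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractorSketch {

    // Matches href="..." attributes; deliberately simple and not a full HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the link targets in the order they appear in the page.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"http://example.com/a\">A</a>"
                + "<a href=\"http://example.com/b\">B</a>"
                + "</body></html>";
        System.out.println(extractLinks(page));
    }
}
```

A crawler would fetch each extracted link in turn and apply the same extraction to the pages it retrieves.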

With many popular media sites, it is necessary to acquire a user ID and password to access data. A commonly used technique is OAuth, an open standard for delegated authorization that many websites use to grant an application access on a user's behalf without sharing the user's password. The technique delegates access to a server resource and works over HTTPS. Several companies use OAuth 2.0, including PayPal, Facebook, Twitter, and Yelp.

Visualizing data to enhance understanding

The analysis of data often results in a series of numbers representing the results of the analysis. However, for most people, this way of expressing results is not always intuitive. A better way to understand the results is to create graphs and charts to depict the results and the relationship between the elements of the result.

The human mind is often good at seeing patterns, trends, and outliers in visual representation. The large amount of data present in many data science problems can be analyzed using visualization techniques. Visualization is appropriate for a wide range of audiences ranging from analysts to upper-level management to clientele. In this chapter, we present various visualization techniques and demonstrate how they are supported in Java.

In Chapter 4, Data Visualization, we illustrate how to create different types of graphs, plots, and charts. These examples use JavaFX and a free library called GRAL (http://trac.erichseifert.de/gral/).

Visualization allows users to examine large datasets in ways that provide insights that are not apparent in the mass of the data. Visualization tools help us identify potential problems or unexpected data results and develop meaningful interpretations of the data.

For example, outliers, which are values that lie outside of the normal range of values, can be hard to spot from a sea of numbers. Creating a graph based on the data allows users to quickly see outliers. It can also help spot errors quickly and more easily classify data.
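Outliers can also be flagged numerically before or alongside plotting. The sketch below marks values that lie more than a chosen number of standard deviations from the mean. The class name `OutlierSketch`, the sample values, and the threshold of 2.0 are assumptions made for illustration; the right rule and threshold depend on the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutlierSketch {

    // Returns the values whose distance from the mean exceeds
    // `threshold` population standard deviations.
    public static List<Double> findOutliers(double[] values, double threshold) {
        double mean = Arrays.stream(values).average().orElse(0.0);
        double variance = Arrays.stream(values)
                .map(v -> (v - mean) * (v - mean))
                .average()
                .orElse(0.0);
        double stdDev = Math.sqrt(variance);

        List<Double> outliers = new ArrayList<>();
        if (stdDev == 0.0) {
            return outliers; // all values identical, so nothing stands out
        }
        for (double v : values) {
            if (Math.abs(v - mean) / stdDev > threshold) {
                outliers.add(v);
            }
        }
        return outliers;
    }

    public static void main(String[] args) {
        double[] readings = {10, 12, 11, 13, 12, 11, 95};
        System.out.println("Outliers: " + findOutliers(readings, 2.0));
    }
}
```

With these sample readings only the value 95 is flagged. A single extreme value inflates the standard deviation, so on small datasets a plot remains a useful cross-check.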

For example, the following chart might suggest that the upper two values should be outliers that need to be dealt with:

Machine learning applied to data science

Machine learning has become increasingly important for data science analysis, as it has for a multitude of other fields. A defining characteristic of machine learning is the ability of a model to be trained on a set of representative data and then later used to solve similar problems. There is no need to explicitly program an application to solve the problem. A model is a representation of a real-world object.

For example, customer purchases can be used to train a model. Subsequently, predictions can be made about the types of purchases a customer might make next. This allows an organization to tailor ads and coupons for a customer and potentially provide a better customer experience.

Training can be performed in one of several different approaches:

  • Supervised learning: The model is trained with annotated (labeled) data showing the corresponding correct results
  • Unsupervised learning: The data does not contain results, but the model is expected to find relationships on its own
  • Semi-supervised: A small amount of labeled data is combined with a larger amount of unlabeled data
  • Reinforcement learning: This is similar to supervised learning, but a reward is provided for good results

There are several approaches that support machine learning. In Chapter 6, Machine Learning, we will illustrate three techniques:

  • Decision trees: A tree is constructed using features of the problem as internal nodes and the results as leaves
  • Support vector machines: This is used for classification by creating a hyperplane that partitions the dataset and then makes predictions
  • Bayesian networks: This is used to depict probabilistic relationships between events
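The decision tree structure described above, with features at the internal nodes and results at the leaves, can be sketched directly in Java. The tree here is built by hand rather than learned from data, and the class `DecisionTreeSketch`, the feature names `ownsTent` and `likesOutdoors`, and the result labels are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class DecisionTreeSketch {

    // A node is either an internal node testing one feature or a leaf holding a result.
    public static final class Node {
        final String feature;  // feature tested here; null for a leaf
        final Node ifTrue;     // branch taken when the feature is true
        final Node ifFalse;    // branch taken when the feature is false
        final String result;   // result stored at a leaf; null for an internal node

        Node(String feature, Node ifTrue, Node ifFalse) {
            this.feature = feature;
            this.ifTrue = ifTrue;
            this.ifFalse = ifFalse;
            this.result = null;
        }

        Node(String result) {
            this.feature = null;
            this.ifTrue = null;
            this.ifFalse = null;
            this.result = result;
        }
    }

    // Walks from the root to a leaf, following the branch chosen by each feature value.
    public static String predict(Node root, Map<String, Boolean> sample) {
        Node current = root;
        while (current.result == null) {
            boolean value = sample.getOrDefault(current.feature, false);
            current = value ? current.ifTrue : current.ifFalse;
        }
        return current.result;
    }

    // A hand-built tree with two feature tests and three leaves.
    public static Node exampleTree() {
        Node outdoors = new Node("likesOutdoors", new Node("camps"), new Node("does not camp"));
        return new Node("ownsTent", outdoors, new Node("does not camp"));
    }

    // Convenience helper for building a sample with the two hypothetical features.
    public static Map<String, Boolean> sample(boolean ownsTent, boolean likesOutdoors) {
        Map<String, Boolean> sample = new HashMap<>();
        sample.put("ownsTent", ownsTent);
        sample.put("likesOutdoors", likesOutdoors);
        return sample;
    }

    public static void main(String[] args) {
        System.out.println(predict(exampleTree(), sample(true, true)));
        System.out.println(predict(exampleTree(), sample(false, true)));
    }
}
```

A learning algorithm would choose which feature to test at each node from training data; only the prediction step is shown here.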

A Support Vector Machine (SVM) is used primarily for classification type problems. The approach creates a hyperplane to categorize data, which can be envisioned as a geometric plane that separates two regions. In a two-dimensional space, it will be a line. In a three-dimensional space, it will be a two-dimensional plane. In Chapter 6, Machine Learning, we will demonstrate how to use the approach using a set of data relating to the propensity of individuals to camp. We will use the Weka class, SMO, to demonstrate this type of analysis.

The following figure depicts a hyperplane using a distribution of two types of data points. The lines represent possible hyperplanes that separate these points. The lines clearly separate the data points except for one outlier.

Once the model has been trained, the possible hyperplanes are considered and predictions can then be made using similar data.
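The prediction step can be illustrated without any library. In two dimensions a hyperplane is a line, and a point is classified by the side of the line it falls on, that is, by the sign of w · x + b. The sketch below is not Weka's SMO class and performs no training: the class `HyperplaneSketch`, the weights, and the bias are hand-picked, hypothetical values standing in for what a trained model would supply.

```java
public class HyperplaneSketch {

    private final double[] weights; // normal vector w of the hyperplane
    private final double bias;      // offset b

    public HyperplaneSketch(double[] weights, double bias) {
        this.weights = weights.clone();
        this.bias = bias;
    }

    // Computes w . x + b; the sign tells which side of the hyperplane x lies on.
    public double score(double[] point) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * point[i];
        }
        return sum;
    }

    // Returns +1 for points on or above the hyperplane and -1 for points below it.
    public int classify(double[] point) {
        return score(point) >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        // Hypothetical separating line x1 + x2 - 5 = 0 in two dimensions.
        HyperplaneSketch model = new HyperplaneSketch(new double[]{1.0, 1.0}, -5.0);
        System.out.println(model.classify(new double[]{1.0, 1.0}));
        System.out.println(model.classify(new double[]{4.0, 4.0}));
    }
}
```

Training an SVM amounts to choosing the weights and bias so that the line separates the two classes with the widest possible margin; the classification rule itself stays this simple.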