Mastering Java Machine Learning

Uday Kamath
Description

Java is one of the main languages used by practicing data scientists; much of the Hadoop ecosystem is Java-based, and it is certainly the language that most production systems in data science are written in. If you know Java, Mastering Java Machine Learning is your next step on the path to becoming an advanced practitioner in data science.
This book aims to introduce you to an array of advanced techniques in machine learning, including classification, clustering, anomaly detection, stream learning, active learning, semi-supervised learning, probabilistic graph modeling, text mining, deep learning, and big data batch and stream machine learning. Accompanying each chapter are illustrative examples and real-world case studies that show how to apply the newly learned techniques using sound methodologies and the best Java-based tools available today.
On completing this book, you will have an understanding of the tools and techniques for building powerful machine learning models to solve data science problems in just about any domain.




Table of Contents

Mastering Java Machine Learning
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Machine Learning Review
Machine learning – history and definition
What is not machine learning?
Machine learning – concepts and terminology
Machine learning – types and subtypes
Datasets used in machine learning
Machine learning applications
Practical issues in machine learning
Machine learning – roles and process
Roles
Process
Machine learning – tools and datasets
Datasets
Summary
2. Practical Approach to Real-World Supervised Learning
Formal description and notation
Data quality analysis
Descriptive data analysis
Basic label analysis
Basic feature analysis
Visualization analysis
Univariate feature analysis
Categorical features
Continuous features
Multivariate feature analysis
Data transformation and preprocessing
Feature construction
Handling missing values
Outliers
Discretization
Data sampling
Is sampling needed?
Undersampling and oversampling
Stratified sampling
Training, validation, and test set
Feature relevance analysis and dimensionality reduction
Feature search techniques
Feature evaluation techniques
Filter approach
Univariate feature selection
Information theoretic approach
Statistical approach
Multivariate feature selection
Minimal redundancy maximal relevance (mRMR)
Correlation-based feature selection (CFS)
Wrapper approach
Embedded approach
Model building
Linear models
Linear Regression
Algorithm input and output
How does it work?
Advantages and limitations
Naïve Bayes
Algorithm input and output
How does it work?
Advantages and limitations
Logistic Regression
Algorithm input and output
How does it work?
Advantages and limitations
Non-linear models
Decision Trees
Algorithm inputs and outputs
How does it work?
Advantages and limitations
K-Nearest Neighbors (KNN)
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Support vector machines (SVM)
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Ensemble learning and meta learners
Bootstrap aggregating or bagging
Algorithm inputs and outputs
How does it work?
Random Forest
Advantages and limitations
Boosting
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Model assessment, evaluation, and comparisons
Model assessment
Model evaluation metrics
Confusion matrix and related metrics
ROC and PRC curves
Gain charts and lift curves
Model comparisons
Comparing two algorithms
McNemar's Test
Paired-t test
Wilcoxon signed-rank test
Comparing multiple algorithms
ANOVA test
Friedman's test
Case Study – Horse Colic Classification
Business problem
Machine learning mapping
Data analysis
Label analysis
Features analysis
Supervised learning experiments
Weka experiments
Sample end-to-end process in Java
Weka experimenter and model selection
RapidMiner experiments
Visualization analysis
Feature selection
Model process flow
Model evaluation metrics
Evaluation on Confusion Metrics
ROC Curves, Lift Curves, and Gain Charts
Results, observations, and analysis
Summary
References
3. Unsupervised Machine Learning Techniques
Issues in common with supervised learning
Issues specific to unsupervised learning
Feature analysis and dimensionality reduction
Notation
Linear methods
Principal component analysis (PCA)
Inputs and outputs
How does it work?
Advantages and limitations
Random projections (RP)
Inputs and outputs
How does it work?
Advantages and limitations
Multidimensional Scaling (MDS)
Inputs and outputs
How does it work?
Advantages and limitations
Nonlinear methods
Kernel Principal Component Analysis (KPCA)
Inputs and outputs
How does it work?
Advantages and limitations
Manifold learning
Inputs and outputs
How does it work?
Advantages and limitations
Clustering
Clustering algorithms
k-Means
Inputs and outputs
How does it work?
Advantages and limitations
DBSCAN
Inputs and outputs
How does it work?
Advantages and limitations
Mean shift
Inputs and outputs
How does it work?
Advantages and limitations
Expectation maximization (EM) or Gaussian mixture modeling (GMM)
Input and output
How does it work?
Advantages and limitations
Hierarchical clustering
Input and output
How does it work?
Advantages and limitations
Self-organizing maps (SOM)
Inputs and outputs
How does it work?
Advantages and limitations
Spectral clustering
Inputs and outputs
How does it work?
Advantages and limitations
Affinity propagation
Inputs and outputs
How does it work?
Advantages and limitations
Clustering validation and evaluation
Internal evaluation measures
Notation
R-Squared
Dunn's Indices
Davies-Bouldin index
Silhouette's index
External evaluation measures
Rand index
F-Measure
Normalized mutual information index
Outlier or anomaly detection
Outlier algorithms
Statistical-based
Inputs and outputs
How does it work?
Advantages and limitations
Distance-based methods
Inputs and outputs
How does it work?
Advantages and limitations
Density-based methods
Inputs and outputs
How does it work?
Advantages and limitations
Clustering-based methods
Inputs and outputs
How does it work?
Advantages and limitations
High-dimensional-based methods
Inputs and outputs
How does it work?
Advantages and limitations
One-class SVM
Inputs and outputs
How does it work?
Advantages and limitations
Outlier evaluation techniques
Supervised evaluation
Unsupervised evaluation
Real-world case study
Tools and software
Business problem
Machine learning mapping
Data collection
Data quality analysis
Data sampling and transformation
Feature analysis and dimensionality reduction
PCA
Random projections
ISOMAP
Observations on feature analysis and dimensionality reduction
Clustering models, results, and evaluation
Observations and clustering analysis
Outlier models, results, and evaluation
Observations and analysis
Summary
References
4. Semi-Supervised and Active Learning
Semi-supervised learning
Representation, notation, and assumptions
Semi-supervised learning techniques
Self-training SSL
Inputs and outputs
How does it work?
Advantages and limitations
Co-training SSL or multi-view SSL
Inputs and outputs
How does it work?
Advantages and limitations
Cluster and label SSL
Inputs and outputs
How does it work?
Advantages and limitations
Transductive graph label propagation
Inputs and outputs
How does it work?
Advantages and limitations
Transductive SVM (TSVM)
Inputs and outputs
How does it work?
Advantages and limitations
Case study in semi-supervised learning
Tools and software
Business problem
Machine learning mapping
Data collection
Data quality analysis
Data sampling and transformation
Datasets and analysis
Feature analysis results
Experiments and results
Analysis of semi-supervised learning
Active learning
Representation and notation
Active learning scenarios
Active learning approaches
Uncertainty sampling
How does it work?
Least confident sampling
Smallest margin sampling
Label entropy sampling
Advantages and limitations
Version space sampling
Query by disagreement (QBD)
How does it work?
Query by Committee (QBC)
How does it work?
Advantages and limitations
Data distribution sampling
How does it work?
Expected model change
Expected error reduction
Variance reduction
Density weighted methods
Advantages and limitations
Case study in active learning
Tools and software
Business problem
Machine learning mapping
Data Collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Pool-based scenarios
Stream-based scenarios
Analysis of active learning results
Summary
References
5. Real-Time Stream Machine Learning
Assumptions and mathematical notations
Basic stream processing and computational techniques
Stream computations
Sliding windows
Sampling
Concept drift and drift detection
Data management
Partial memory
Full memory
Detection methods
Monitoring model evolution
Widmer and Kubat
Drift Detection Method or DDM
Early Drift Detection Method or EDDM
Monitoring distribution changes
Welch's t test
Kolmogorov-Smirnov's test
CUSUM and Page-Hinckley test
Adaptation methods
Explicit adaptation
Implicit adaptation
Incremental supervised learning
Modeling techniques
Linear algorithms
Online linear models with loss functions
Inputs and outputs
How does it work?
Advantages and limitations
Online Naïve Bayes
Inputs and outputs
How does it work?
Advantages and limitations
Non-linear algorithms
Hoeffding trees or very fast decision trees (VFDT)
Inputs and outputs
How does it work?
Advantages and limitations
Ensemble algorithms
Weighted majority algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Online Bagging algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Online Boosting algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Validation, evaluation, and comparisons in online setting
Model validation techniques
Prequential evaluation
Holdout evaluation
Controlled permutations
Evaluation criteria
Comparing algorithms and metrics
Incremental unsupervised learning using clustering
Modeling techniques
Partition based
Online k-Means
Inputs and outputs
How does it work?
Advantages and limitations
Hierarchical based and micro clustering
Inputs and outputs
How does it work?
Advantages and limitations
Inputs and outputs
How does it work?
Advantages and limitations
Density based
Inputs and outputs
How does it work?
Advantages and limitations
Grid based
Inputs and outputs
How does it work?
Advantages and limitations
Validation and evaluation techniques
Key issues in stream cluster evaluation
Evaluation measures
Cluster Mapping Measures (CMM)
V-Measure
Other external measures
Unsupervised learning using outlier detection
Partition-based clustering for outlier detection
Inputs and outputs
How does it work?
Advantages and limitations
Distance-based clustering for outlier detection
Inputs and outputs
How does it work?
Exact Storm
Abstract-C
Direct Update of Events (DUE)
Micro Clustering based Algorithm (MCOD)
Approx Storm
Advantages and limitations
Validation and evaluation techniques
Case study in stream learning
Tools and software
Business problem
Machine learning mapping
Data collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Supervised learning experiments
Concept drift experiments
Clustering experiments
Outlier detection experiments
Analysis of stream learning results
Summary
References
6. Probabilistic Graph Modeling
Probability revisited
Concepts in probability
Conditional probability
Chain rule and Bayes' theorem
Random variables, joint, and marginal distributions
Marginal independence and conditional independence
Factors
Factor types
Distribution queries
Probabilistic queries
MAP queries and marginal MAP queries
Graph concepts
Graph structure and properties
Subgraphs and cliques
Path, trail, and cycles
Bayesian networks
Representation
Definition
Reasoning patterns
Causal or predictive reasoning
Evidential or diagnostic reasoning
Intercausal reasoning
Combined reasoning
Independencies, flow of influence, D-Separation, I-Map
Flow of influence
D-Separation
I-Map
Inference
Elimination-based inference
Variable elimination algorithm
Input and output
How does it work?
Advantages and limitations
Clique tree or junction tree algorithm
Input and output
How does it work?
Advantages and limitations
Propagation-based techniques
Belief propagation
Factor graph
Messaging in factor graph
Input and output
How does it work?
Advantages and limitations
Sampling-based techniques
Forward sampling with rejection
Input and output
How does it work?
Advantages and limitations
Learning
Learning parameters
Maximum likelihood estimation for Bayesian networks
Bayesian parameter estimation for Bayesian network
Prior and posterior using the Dirichlet distribution
Learning structures
Measures to evaluate structures
Methods for learning structures
Constraint-based techniques
Inputs and outputs
How does it work?
Advantages and limitations
Search and score-based techniques
Inputs and outputs
How does it work?
Advantages and limitations
Markov networks and conditional random fields
Representation
Parameterization
Gibbs parameterization
Factor graphs
Log-linear models
Independencies
Global
Pairwise Markov
Markov blanket
Inference
Learning
Conditional random fields
Specialized networks
Tree augmented network
Input and output
How does it work?
Advantages and limitations
Markov chains
Hidden Markov models
Most probable path in HMM
Posterior decoding in HMM
Tools and usage
OpenMarkov
Weka Bayesian Network GUI
Case study
Business problem
Machine learning mapping
Data sampling and transformation
Feature analysis
Models, results, and evaluation
Analysis of results
Summary
References
7. Deep Learning
Multi-layer feed-forward neural network
Inputs, neurons, activation function, and mathematical notation
Multi-layered neural network
Structure and mathematical notations
Activation functions in NN
Sigmoid function
Hyperbolic tangent ("tanh") function
Training neural network
Empirical risk minimization
Parameter initialization
Loss function
Gradients
Gradient at the output layer
Gradient at the Hidden Layer
Parameter gradient
Feed forward and backpropagation
How does it work?
Regularization
L2 regularization
L1 regularization
Limitations of neural networks
Vanishing gradients, local optimum, and slow training
Deep learning
Building blocks for deep learning
Rectified linear activation function
Restricted Boltzmann Machines
Definition and mathematical notation
Conditional distribution
Free energy in RBM
Training the RBM
Sampling in RBM
Contrastive divergence
Inputs and outputs
How does it work?
Persistent contrastive divergence
Autoencoders
Definition and mathematical notations
Loss function
Limitations of Autoencoders
Denoising Autoencoder
Unsupervised pre-training and supervised fine-tuning
Deep feed-forward NN
Input and outputs
How does it work?
Deep Autoencoders
Deep Belief Networks
Inputs and outputs
How does it work?
Deep learning with dropouts
Definition and mathematical notation
Inputs and outputs
How does it work?
Learning, training, and testing with dropouts
Sparse coding
Convolutional Neural Network
Local connectivity
Parameter sharing
Discrete convolution
Pooling or subsampling
Normalization using ReLU
CNN Layers
Recurrent Neural Networks
Structure of Recurrent Neural Networks
Learning and associated problems in RNNs
Long Short Term Memory
Gated Recurrent Units
Case study
Tools and software
Business problem
Machine learning mapping
Data sampling and transformation
Feature analysis
Models, results, and evaluation
Basic data handling
Multi-layer perceptron
Parameters used for MLP
Code for MLP
Convolutional Network
Parameters used for ConvNet
Code for CNN
Variational Autoencoder
Parameters used for the Variational Autoencoder
Code for Variational Autoencoder
DBN
Parameter search using Arbiter
Results and analysis
Summary
References
8. Text Mining and Natural Language Processing
NLP, subfields, and tasks
Text categorization
Part-of-speech tagging (POS tagging)
Text clustering
Information extraction and named entity recognition
Sentiment analysis and opinion mining
Coreference resolution
Word sense disambiguation
Machine translation
Semantic reasoning and inferencing
Text summarization
Automating question and answers
Issues with mining unstructured data
Text processing components and transformations
Document collection and standardization
Inputs and outputs
How does it work?
Tokenization
Inputs and outputs
How does it work?
Stop words removal
Inputs and outputs
How does it work?
Stemming or lemmatization
Inputs and outputs
How does it work?
Local/global dictionary or vocabulary?
Feature extraction/generation
Lexical features
Character-based features
Word-based features
Part-of-speech tagging features
Taxonomy features
Syntactic features
Semantic features
Feature representation and similarity
Vector space model
Binary
Term frequency (TF)
Inverse document frequency (IDF)
Term frequency-inverse document frequency (TF-IDF)
Similarity measures
Euclidean distance
Cosine distance
Pairwise-adaptive similarity
Extended Jaccard coefficient
Dice coefficient
Feature selection and dimensionality reduction
Feature selection
Information theoretic techniques
Statistical-based techniques
Frequency-based techniques
Dimensionality reduction
Topics in text mining
Text categorization/classification
Topic modeling
Probabilistic latent semantic analysis (PLSA)
Input and output
How does it work?
Advantages and limitations
Text clustering
Feature transformation, selection, and reduction
Clustering techniques
Generative probabilistic models
Input and output
How does it work?
Advantages and limitations
Distance-based text clustering
Non-negative matrix factorization (NMF)
Input and output
How does it work?
Advantages and limitations
Evaluation of text clustering
Named entity recognition
Hidden Markov models for NER
Input and output
How does it work?
Advantages and limitations
Maximum entropy Markov models for NER
Input and output
How does it work?
Advantages and limitations
Deep learning and NLP
Tools and usage
Mallet
KNIME
Topic modeling with mallet
Business problem
Machine Learning mapping
Data collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Analysis of text processing results
Summary
References
9. Big Data Machine Learning – The Final Frontier
What are the characteristics of Big Data?
Big Data Machine Learning
General Big Data framework
Big Data cluster deployment frameworks
Hortonworks Data Platform
Cloudera CDH
Amazon Elastic MapReduce
Microsoft Azure HDInsight
Data acquisition
Publish-subscribe frameworks
Source-sink frameworks
SQL frameworks
Message queueing frameworks
Custom frameworks
Data storage
HDFS
NoSQL
Key-value databases
Document databases
Columnar databases
Graph databases
Data processing and preparation
Hive and HQL
Spark SQL
Amazon Redshift
Real-time stream processing
Machine Learning
Visualization and analysis
Batch Big Data Machine Learning
H2O as Big Data Machine Learning platform
H2O architecture
Machine learning in H2O
Tools and usage
Case study
Business problem
Machine Learning mapping
Data collection
Data sampling and transformation
Experiments, results, and analysis
Feature relevance and analysis
Evaluation on test data
Analysis of results
Spark MLlib as Big Data Machine Learning platform
Spark architecture
Machine Learning in MLlib
Tools and usage
Experiments, results, and analysis
k-Means
k-Means with PCA
Bisecting k-Means (with PCA)
Gaussian Mixture Model
Random Forest
Analysis of results
Real-time Big Data Machine Learning
SAMOA as a real-time Big Data Machine Learning framework
SAMOA architecture
Machine Learning algorithms
Tools and usage
Experiments, results, and analysis
Analysis of results
The future of Machine Learning
Summary
References
A. Linear Algebra
Vector
Scalar product of vectors
Matrix
Transpose of a matrix
Matrix addition
Scalar multiplication
Matrix multiplication
Properties of matrix product
Linear transformation
Matrix inverse
Eigendecomposition
Positive definite matrix
Singular value decomposition (SVD)
B. Probability
Axioms of probability
Bayes' theorem
Density estimation
Mean
Variance
Standard deviation
Gaussian standard deviation
Covariance
Correlation coefficient
Binomial distribution
Poisson distribution
Gaussian distribution
Central limit theorem
Error propagation
Index

Mastering Java Machine Learning

Mastering Java Machine Learning

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: June 2017

Production reference: 1290617

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78588-051-3

www.packtpub.com

Credits

Authors

Uday Kamath

Krishna Choppella

Reviewers

Samir Sahli

Prashant Verma

Commissioning Editor

Veena Pagare

Acquisition Editor

Divya Poojari

Content Development Editor

Mayur Pawanikar

Technical Editor

Vivek Arora

Copy Editors

Vikrant Phadkay

Safis Editing

Project Coordinator

Nidhi Joshi

Proofreaders

Safis Editing

Indexer

Francy Puthiry

Graphics

Tania Dutta

Production Coordinator

Arvindkumar Gupta

Cover Work

Arvindkumar Gupta

Foreword

Dr. Uday Kamath is a volcano of ideas. Every time he walked into my office, we had fruitful and animated discussions. I have been a professor of computer science at George Mason University (GMU) for 15 years, specializing in machine learning and data mining. I have known Uday for five years, first as a student in my data mining class, then as a colleague and co-author of papers and projects on large-scale machine learning. While a chief data scientist at BAE Systems Applied Intelligence, Uday earned his PhD in evolutionary computation and machine learning. As if having two high-demand jobs was not enough, Uday was unusually prolific, publishing extensively with four different people in the computer science faculty during his tenure at GMU, something you don't see very often. Given this pedigree, I am not surprised that less than four years since Uday's graduation with a PhD, I am writing the foreword for his book on mastering advanced machine learning techniques with Java. Uday's thirst for new stimulating challenges has struck again, resulting in this terrific book you now have in your hands.

This book is the product of his deep interest and knowledge in sound and well-grounded theory, and at the same time his keen grasp of the practical feasibility of proposed methodologies. Several books on machine learning and data analytics exist, but Uday's book closes a substantial gap—the one between theory and practice. It offers a comprehensive and systematic analysis of classic and advanced learning techniques, with a focus on their advantages and limitations, practical use and implementations. This book is a precious resource for practitioners of data science and analytics, as well as for undergraduate and graduate students keen to master practical and efficient implementations of machine learning techniques.

The book covers the classic techniques of machine learning, such as classification, clustering, dimensionality reduction, anomaly detection, semi-supervised learning, and active learning. It also covers advanced and recent topics, including learning with stream data, deep learning, and the challenges of learning with big data. Each chapter is dedicated to a topic and includes an illustrative case study, which covers state-of-the-art Java-based tools and software, and the entire knowledge discovery cycle: data collection, experimental design, modeling, results, and evaluation. Each chapter is self-contained, providing great flexibility of usage. The accompanying website provides the source code and data. This is truly a gem for both students and data analytics practitioners, who can experiment first-hand with the methods just learned or deepen their understanding of the methods by applying them to real-world scenarios.

As I was reading the various chapters of the book, I was reminded of the enthusiasm Uday has for learning and knowledge. He communicates the concepts described in the book with clarity and with the same passion. I am positive that you, as a reader, will feel the same. I will certainly keep this book as a personal resource for the courses I teach, and strongly recommend it to my students.

Dr. Carlotta Domeniconi

Associate Professor of Computer Science, George Mason University

About the Authors

Dr. Uday Kamath is the chief data scientist at BAE Systems Applied Intelligence. He specializes in scalable machine learning and has spent 20 years in domains such as anti-money laundering (AML), fraud detection in financial crime, cyber security, and bioinformatics. Dr. Kamath is responsible for key products in areas focusing on the behavioral, social networking, and big data machine learning aspects of analytics at BAE AI. He received his PhD at George Mason University, under the able guidance of Dr. Kenneth De Jong, where his dissertation research focused on machine learning for big data and automated sequence mining.

I would like to thank my friend, Krishna Choppella, for accepting the offer to co-author this book and being an able partner on this long but satisfying journey.

Heartfelt thanks to our reviewers, especially Dr. Samir Sahli for his valuable comments, suggestions, and in-depth review of the chapters. I would like to thank Professor Carlotta Domeniconi for her suggestions and comments that helped us shape various chapters in the book. I would also like to thank all the Packt staff, especially Divya Poojari, Mayur Pawanikar, and Vivek Arora, for helping us complete the tasks in time. This book required making a lot of sacrifices on the personal front and I would like to thank my wife, Pratibha, and our nanny, Evelyn, for their unconditional support. Finally, thanks to all my lovely teachers and professors for not only teaching the subjects, but also instilling the joy of learning.

Krishna Choppella builds tools and client solutions in his role as a solutions architect for analytics at BAE Systems Applied Intelligence. He has been programming in Java for 20 years. His interests are data science, functional programming, and distributed computing.

About the Reviewers

Samir Sahli was awarded a BSc degree in applied mathematics and information sciences from the University of Nice Sophia-Antipolis, France, in 2004. He received MSc and PhD degrees in physics (specializing in optics/photonics/image science) from University Laval, Quebec, Canada, in 2008 and 2013, respectively. During his graduate studies, he worked with Defence Research and Development Canada (DRDC) on the automatic detection and recognition of targets in aerial imagery, especially in the context of uncontrolled environments and sub-optimal acquisition conditions. He has worked since 2009 as a consultant for several companies based in Europe and North America specializing in the area of Intelligence, Surveillance, and Reconnaissance (ISR) and in remote sensing.

Dr. Sahli joined McMaster Biophotonics in 2013 as a postdoctoral fellow. His research was in the field of optics, image processing, and machine learning. He was involved in several projects, such as the development of a novel generation of gastrointestinal tract imaging devices, hyperspectral imaging of skin erythema for individualized radiotherapy treatment, and automatic detection of precancerous Barrett's esophagus cells using fluorescence lifetime imaging microscopy and multiphoton microscopy.

Dr. Sahli joined BAE Systems Applied Intelligence in 2015. He has since worked as a data scientist to develop analytics models to detect complex fraud patterns and money laundering schemes for insurance, banking, and governmental clients using machine learning, statistics, and social network analysis tools.

Prashant Verma started his IT career in 2011 as a Java developer in Ericsson, working in the telecom domain. After a couple of years of Java EE experience, he moved into the big data domain and has worked on almost all of the popular big data technologies, such as Hadoop, Spark, Flume, Mongo, and Cassandra. He has also played with Scala. Currently, he works with QA Infotech as a lead data engineer, working on solving e-learning problems with analytics and machine learning.

Prashant has worked for many companies, such as Ericsson and QA Infotech, with domain knowledge of telecom and e-learning. He has also worked as a freelance consultant in his free time.

I want to thank Packt Publishing for giving me the chance to review the book, as well as my employer and my family for their patience while I was busy working on this book.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1785880519.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We reward our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

 

Dedicated to my parents, Krishna Kamath and Bharathi Kamath, my wife, Pratibha Shenoy, and the kids, Aaroh and Brandy

  --Dr. Uday Kamath.
 

To my parents

  --Krishna Choppella

Preface

There are many notable books on machine learning, from pedagogical tracts on the theory of learning from data; to standard references on specializations in the field, such as clustering and outlier detection or probabilistic graph modeling; to cookbooks that offer practical advice on the use of tools and libraries in a particular language. The books that tend to be broad in coverage are often short on theoretical detail, while those with a focus on one topic or tool may not, for example, have much to say about the difference in approach in a streaming as opposed to a batch environment. Besides, for non-novices with a preference for tools in Java who wish to reach for a single volume that will extend their knowledge on all the essential aspects at once, there are precious few options.

Finding all of the following in one place is not possible today, as far as we know:

The pros and cons of different techniques given any data availability scenario—when data is labeled or unlabeled, streaming or batch, local or distributed, structured or unstructured
A ready reference for the most important mathematical results related to those very techniques, for a better appreciation of the underlying theory
An introduction to the most mature Java-based frameworks, libraries, and visualization tools, with descriptions and illustrations of how to put these techniques into practice

The core idea of this book, therefore, is to address this gap while maintaining a balance between the treatment of theory and practice: probability, statistics, basic linear algebra, and rudimentary calculus in the service of the one; methodology, case studies, tools, and code in support of the other.

According to the KDnuggets 2016 software poll, Java, at 16.8%, has the second highest share in popularity among languages used in machine learning, after Python. What's more, this marks a 19% increase over the year before! Clearly, Java remains an important and effective vehicle to build and deploy systems involving machine learning, despite claims of its decline in some quarters. With this book, we aim to reach professionals and motivated enthusiasts with some experience in Java and a beginner's knowledge of machine learning. Our goal is to make Mastering Java Machine Learning the next step on their path to becoming advanced practitioners in data science. To guide them on this path, the book covers a veritable arsenal of techniques in machine learning—some of which they may already be familiar with, others perhaps not as much, or only superficially—including methods of data analysis, learning algorithms, and evaluation of model performance in supervised and unsupervised learning, clustering and anomaly detection, and semi-supervised and active learning. It also presents special topics such as probabilistic graph modeling, text mining, and deep learning. Not forgetting the increasingly important topics in enterprise-scale systems today, the book also covers the unique challenges of learning from evolving data streams and the tools and techniques applicable to real-time systems, as well as the imperatives of the world of Big Data:

How does machine learning work in large-scale distributed environments?
What are the trade-offs?
How must algorithms be adapted?
How can these systems interoperate with other technologies in the dominant Hadoop ecosystem?

This book explains how to apply machine learning to real-world data and real-world domains with the right methodology, processes, applications, and analysis. Accompanying each chapter are case studies and examples of how to apply the newly learned techniques using some of the best available open source tools written in Java. This book covers more than 15 open source Java tools supporting a wide range of techniques between them, with code and practical usage. The code, data, and configurations are available for readers to download and experiment with. We present more than ten real-world case studies in Machine Learning that illustrate the data scientist's process. Each case study details the steps undertaken in the experiments: data ingestion, data analysis, data cleansing, feature reduction/selection, mapping to machine learning, model training, model selection, model evaluation, and analysis of results. This gives the reader a practical guide to using the tools and methods presented in each chapter for solving the business problem at hand.

What this book covers

Chapter 1, Machine Learning Review, is a refresher of basic concepts and techniques that the reader would have learned from Packt's Machine Learning in Java or a similar text. This chapter is a review of concepts such as data, data transformation, sampling and bias, features and their importance, supervised learning, unsupervised learning, big data learning, stream and real-time learning, probabilistic graph models, and semi-supervised learning.

Chapter 2, Practical Approach to Real-World Supervised Learning, cobwebs dusted, dives straight into the vast field of supervised learning and the full spectrum of associated techniques. We cover the topics of feature selection and reduction, linear modeling, logistic models, non-linear models, SVM and kernels, ensemble learning techniques such as bagging and boosting, validation techniques and evaluation metrics, and model selection. Using WEKA and RapidMiner, we carry out a detailed case study, going through all the steps from data analysis to analysis of model performance. As in each of the other chapters, the case study is presented as an example to help the reader understand how the techniques introduced in the chapter are applied in real life. The dataset used in the case study is UCI HorseColic.

Chapter 3, Unsupervised Machine Learning Techniques, presents many advanced methods in clustering and outlier techniques, with applications. Topics covered are feature selection and reduction in unsupervised data, clustering algorithms, evaluation methods in clustering, and anomaly detection using statistical, distance, and distribution techniques. At the end of the chapter, we perform a case study for both clustering and outlier detection using a real-world image dataset, MNIST. We use the Smile API to do feature reduction and ELKI for learning.

Chapter 4, Semi-supervised Learning and Active Learning, gives details of algorithms and techniques for learning when only a small amount of labeled data is present. Topics covered are self-training, generative models, transductive SVMs, co-training, active learning, and multi-view learning. The case study involves both types of learning systems and is performed on the real-world UCI Breast Cancer Wisconsin dataset. The tools introduced are JKernelMachines, KEEL, and JCLAL.

Chapter 5, Real-Time Stream Machine Learning, shows how data streams in real time present unique challenges for the problem of learning from data. This chapter broadly covers the need for stream machine learning and its applications, supervised stream learning, unsupervised cluster stream learning, unsupervised outlier learning, evaluation techniques in stream learning, and metrics used for evaluation. A detailed case study is given at the end of the chapter to illustrate the use of the MOA framework. The dataset used is Electricity (ELEC).

Chapter 6, Probabilistic Graph Modeling, shows that many real-world problems can be effectively represented by encoding complex joint probability distributions over multi-dimensional spaces. Probabilistic graph models provide a framework to represent, draw inferences, and learn effectively in such situations. The chapter broadly covers probability concepts, PGMs, Bayesian networks, Markov networks, Graph Structure Learning, Hidden Markov Models, and Inferencing. A detailed case study on a real-world dataset is performed at the end of the chapter. The tools used in this case study are OpenMarkov and WEKA's Bayes network. The dataset is UCI Adult (Census Income).

Chapter 7, Deep Learning, covers the superstar of machine learning in the popular imagination today: deep learning, which has attained a dominance among techniques used to solve the most complex AI problems. Topics broadly covered are neural networks, issues in neural networks, deep belief networks, restricted Boltzmann machines, convolutional networks, long short-term memory units, denoising autoencoders, recurrent networks, and others. We present a detailed case study showing how to implement deep learning networks, tune the parameters, and perform learning. We use DeepLearning4J with the MNIST image dataset.

Chapter 8, Text Mining and Natural Language Processing, details the techniques, algorithms, and tools for performing various analyses in the field of text mining. Topics broadly covered are areas of text mining, components needed for text mining, representation of text data, dimensionality reduction techniques, topic modeling, text clustering, named entity recognition, and deep learning. The case study uses real-world unstructured text data (the Reuters-21578 dataset) highlighting topic modeling and text classification; the tools used are MALLET and KNIME.

Chapter 9, Big Data Machine Learning – the Final Frontier, discusses some of the most important challenges of today. What learning options are available when data is either big or available at a very high velocity? How is scalability handled? Topics covered are big data cluster deployment frameworks, big data storage options, batch data processing, batch data machine learning, real-time machine learning frameworks, and real-time stream learning. In the detailed case study, covering both batch and real-time big data learning, we select the UCI Covertype dataset and the machine learning libraries H2O, Spark MLlib, and SAMOA.

Appendix A, Linear Algebra, covers concepts from linear algebra, and is meant as a brief refresher. It is by no means complete in its coverage, but contains a whirlwind tour of some important concepts relevant to the machine learning techniques featured in the book. It includes vectors, matrices, basic matrix operations and properties, linear transformations, matrix inverse, eigendecomposition, positive definite matrix, and singular value decomposition.

Appendix B, Probability, provides a brief primer on probability. It includes the axioms of probability, Bayes' theorem, density estimation, mean, variance, standard deviation, Gaussian standard deviation, covariance, correlation coefficient, binomial distribution, Poisson distribution, Gaussian distribution, central limit theorem, and error propagation.

What you need for this book

This book assumes you have some experience of programming in Java and a basic understanding of machine learning concepts. If that doesn't apply to you, but you are curious nonetheless and self-motivated, fret not, and read on! For those who do have some background, it means that you are familiar with simple statistical analysis of data and concepts involved in supervised and unsupervised learning. Those who may not have the requisite math or must poke the far reaches of their memory to shake loose the odd formula or funny symbol, do not be disheartened. If you are the sort that loves a challenge, the short primer in the appendices may be all you need to kick-start your engines—a bit of tenacity will see you through the rest! For those who have never been introduced to machine learning, the first chapter was equally written for you as for those needing a refresher—it is your starter-kit to jump in feet first and find out what it's all about. You can augment your basics with any number of online resources. Finally, for those innocent of Java, here's a secret: many of the tools featured in the book have powerful GUIs. Some include wizard-like interfaces, making them quite easy to use, and do not require any knowledge of Java. So if you are new to Java, just skip the examples that need coding and learn to use the GUI-based tools instead!

Who this book is for

The primary audience of this book is professionals who work with data and whose responsibilities may include data analysis, data visualization or transformation, and the training, validation, testing, and evaluation of machine learning models—presumably to perform predictive, descriptive, or prescriptive analytics using Java or Java-based tools. The choice of Java may imply a personal preference and therefore some prior experience programming in Java. On the other hand, perhaps circumstances in the work environment or company policies limit the use of third-party tools to only those written in Java and a few others. In the second case, the prospective reader may have no programming experience in Java. This book is aimed at this reader just as squarely as it is at their colleague, the Java expert (who came up with the policy in the first place).

A secondary audience can be defined by a profile with two attributes alone: an intellectual curiosity about machine learning and the desire for a single comprehensive treatment of the concepts, the practical techniques, and the tools. A specimen of this type of reader can opt to skip the math and the tools and focus on learning the most common supervised and unsupervised learning algorithms alone. Another might skim over Chapters 1, 2, 3, and 7, skip the others entirely, and jump headlong into the tools—a perfectly reasonable strategy if you want to quickly make yourself useful analyzing that dataset the client said would be here any day now. Importantly, too, some practice reproducing the experiments from the book will get you asking the right questions of the gurus! Alternatively, you might want to use this book as a reference to quickly look up the details of the algorithm for affinity propagation (Chapter 3, Unsupervised Machine Learning Techniques), or remind yourself of an LSTM architecture with a brief review of the schematic (Chapter 7, Deep Learning), or dog-ear the page with the list of pros and cons of distance-based clustering methods for outlier detection in stream-based learning (Chapter 5, Real-Time Stream Machine Learning). All specimens are welcome and each will find plenty to sink their teeth into.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/mjmlbook/mastering-java-machine-learning. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Machine Learning Review

Recent years have seen a revival of artificial intelligence (AI), and machine learning in particular, both in academic circles and in industry. In the last decade, AI has seen dramatic successes that eluded practitioners in the intervening years, after the original promise of the field gave way to relative decline until its re-emergence in the last few years.

What made these successes possible, in large part, was the impetus provided by the need to process the prodigious amounts of ever-growing data, key algorithmic advances by dogged researchers in deep learning, and the inexorable increase in raw computational power driven by Moore's Law. Among the areas of AI leading the resurgence, machine learning has seen spectacular developments, and continues to find the widest applicability in an array of domains. The use of machine learning to help in complex decision making at the highest levels of business and, at the same time, its enormous success in improving the accuracy of what are now everyday applications, such as searches, speech recognition, and personal assistants on mobile phones, have made its effects commonplace in the family room and the board room alike. Articles breathlessly extolling the power of deep learning can be found today not only in the popular science and technology press but also in mainstream outlets such as The New York Times and The Huffington Post. Machine learning has indeed become ubiquitous in a relatively short time.

An ordinary user encounters machine learning in many ways in their day-to-day activities. Most e-mail providers, including Yahoo and Gmail, give the user automated sorting and categorization of e-mails into headings such as Spam, Junk, Promotions, and so on, which is made possible using text mining, a branch of machine learning. When shopping online for products on e-commerce websites, such as https://www.amazon.com/, or watching movies from content providers, such as Netflix, one is offered recommendations for other products and content by so-called recommender systems, another branch of machine learning, as an effective way to retain customers.

Forecasting the weather, estimating real estate prices, predicting voter turnout, and even election results—all use some form of machine learning to see into the future, as it were.

The ever-growing availability of data and the promise of systems that can enrich our lives by learning from that data place a growing demand on the skills of the limited workforce of professionals in the field of data science. This demand is particularly acute for well-trained experts who know their way around the landscape of machine learning techniques in the more popular languages, such as Java, Python, R, and increasingly, Scala. Fortunately, thanks to the thousands of contributors in the open source community, each of these languages has a rich and rapidly growing set of libraries, frameworks, and tutorials that make state-of-the-art techniques accessible to anyone with an internet connection and a computer, for the most part. Java is an important vehicle for this spread of tools and technology, especially in large-scale machine learning projects, owing to its maturity and stability in enterprise-level deployments and the portable JVM platform, not to mention the legions of professional programmers who have adopted it over the years. Consequently, mastery of the skills so lacking in the workforce today will put any aspiring professional with a desire to enter the field at a distinct advantage in the marketplace.

Perhaps you already apply machine learning techniques in your professional work, or maybe you simply have a hobbyist's interest in the subject. If you're reading this, it's likely you can already bend Java to your will, no problem, but now you feel you're ready to dig deeper and learn how to use the best of breed open source ML Java frameworks in your next data science project. If that is indeed you, how fortuitous is it that the chapters in this book are designed to do all that and more!

Mastery of a subject, especially one that has such obvious applicability as machine learning, requires more than an understanding of its core concepts and familiarity with its mathematical underpinnings. Unlike an introductory treatment of the subject, a book that purports to help you master the subject must be heavily focused on practical aspects in addition to introducing more advanced topics that would have stretched the scope of the introductory material. To warm up before we embark on sharpening our skills, we will devote this chapter to a quick review of what we already know. For the ambitious novice with little or no prior exposure to the subject (who is nevertheless determined to get the fullest benefit from this book), here's our advice: make sure you do not skip the rest of this chapter; instead, use it as a springboard to explore unfamiliar concepts in more depth. Seek out external resources as necessary. Wikipedia them. Then jump right back in.

For the rest of this chapter, we will review the following:

History and definitions
What is not machine learning?
Concepts and terminology
Important branches of machine learning
Different data types in machine learning
Applications of machine learning
Issues faced in machine learning
The meta-process used in most machine learning projects
Information on some well-known tools, APIs, and resources that we will employ in this book

Machine learning – history and definition

It is difficult to give an exact history, but the idea behind the definition of machine learning we use today can be found as early as the 1600s. In René Descartes' Discourse on the Method (1637), he refers to automata and says:

For we can easily understand a machine's being constituted so that it can utter words, and even emit some responses to action on it of a corporeal kind, which brings about a change in its organs; for instance, if touched in a particular part it may ask what we wish to say to it; if in another part it may exclaim that it is being hurt, and so on.

Note

http://www.earlymoderntexts.com/assets/pdfs/descartes1637.pdf

https://www.marxists.org/reference/archive/descartes/1635/discourse-method.htm

Alan Turing, in his famous publication Computing Machinery and Intelligence, gives basic insights into the goals of machine learning by asking the question "Can machines think?".

Note

http://csmt.uchicago.edu/annotations/turing.htm

http://www.csee.umbc.edu/courses/471/papers/turing.pdf

In 1959, Arthur Samuel wrote: "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed."

Tom Mitchell, in more recent times, gave a more exact definition of machine learning: "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." For example, in spam filtering, the task T is classifying incoming e-mails, the experience E is a corpus of e-mails labeled as spam or not spam, and the performance measure P could be the fraction of e-mails classified correctly.

Machine learning has a relationship with several areas:

Statistics: It uses the elements of data sampling, estimation, hypothesis testing, learning theory, and statistical-based modeling, to name a few
Algorithms and computation: It uses the basic concepts of search, traversal, parallelization, distributed computing, and so on from basic computer science
Database and knowledge discovery: For its ability to store, retrieve, and access information in various formats
Pattern recognition: For its ability to find interesting patterns from the data to explore, visualize, and predict
Artificial intelligence: Though it is considered a branch of artificial intelligence, it also has relationships with other branches, such as heuristics, optimization, evolutionary computing, and so on

What is not machine learning?

It is important to recognize areas that share a connection with machine learning but cannot themselves be considered part of it. Some of these disciplines may overlap with machine learning to a greater or lesser extent, yet the principles underlying machine learning are quite distinct:

Business intelligence (BI) and reporting: Reporting key performance indicators (KPIs), querying OLAP for slicing, dicing, and drilling into the data, dashboards, and so on, which form the central components of BI, are not machine learning.

Storage and ETL: Data storage and ETL are key elements in any machine learning process, but, by themselves, they don't qualify as machine learning.

Information retrieval, search, and queries: The ability to retrieve data or documents based on search criteria or indexes, which forms the basis of information retrieval, is not really machine learning. Many forms of machine learning, such as semi-supervised learning, can rely on searching for similar data for modeling, but that doesn't qualify searching as machine learning.

Knowledge representation and reasoning: Representing knowledge for performing complex tasks, such as ontologies, expert systems, and the semantic web, does not qualify as machine learning.

Machine learning – concepts and terminology

In this section, we will describe the different concepts and terms normally used in machine learning:

Data or dataset: The basics of machine learning rely on understanding the data. The data or dataset normally refers to content available in structured or unstructured format for use in machine learning. Structured datasets have specific formats, and an unstructured dataset is normally in the form of some free-flowing text. Data can be available in various storage types or formats. In structured data, every element, known as an instance, example, or row, follows a predefined structure. Data can also be categorized by size: small or medium datasets have from a few hundred to a few thousand instances, whereas big data refers to a large volume, mostly in millions or billions, that cannot be stored or accessed using common devices or fit in the memory of such devices.

Features, attributes, variables, or dimensions: In structured datasets, as mentioned before, there are predefined elements with their own semantics and data type, which are known variously as features, attributes, metrics, indicators, variables, or dimensions.

Data types: The features defined earlier need some form of typing in many machine learning algorithms or techniques. The most commonly used data types are as follows (a short Java illustration of these types appears at the end of this list):
Categorical or nominal: This indicates well-defined categories or values present in the dataset. For example, eye color—black, blue, brown, green, grey; document content type—text, image, video.

Continuous or numeric: This indicates a numeric nature of the data field. For example, a person's weight measured by a bathroom scale, the temperature reading from a sensor, or the monthly balance in dollars on a credit card account.

Ordinal: This denotes data that can be ordered in some way. For example, garment size—small, medium, large; boxing weight classes—heavyweight, light heavyweight, middleweight, lightweight, and bantamweight.
Target or label: A feature or set of features in the dataset, which is used for learning from training data and predicting in an unseen dataset, is known as a target or a label. The term "ground truth" is also used in some domains. A label can have any form as specified before, that is, categorical, continuous, or ordinal.

Machine learning model: Each machine learning algorithm, based on what it learned from the dataset, maintains the state of its learning for predicting or giving insights into future or unseen data. This is referred to as the machine learning model.

Sampling: Data sampling is an essential step in machine learning. Sampling means choosing a subset of examples from a population with the intent of treating the behavior seen in the (smaller) sample as being representative of the behavior of the (larger) population. In order for the sample to be representative of the population, care must be taken in the way the sample is chosen. Generally, a population consists of every object sharing the properties of interest in the problem domain, for example, all people eligible to vote in the general election, or all potential automobile owners in the next four years. Since it is usually prohibitive (or impossible) to collect data for all the objects in a population, a well-chosen subset is selected for the purpose of analysis. A crucial consideration in the sampling process is that the sample is unbiased with respect to the population. The following are types of probability-based sampling (a code sketch of two of these methods also appears at the end of this list):
Uniform random sampling: This refers to sampling that is done over a uniformly distributed population, that is, each object has an equal probability of being chosen.

Stratified random sampling: This refers to the sampling method used when the data can be categorized into multiple classes. In such cases, in order to ensure all categories are represented in the sample, the population is divided into distinct strata based on these classifications, and each stratum is sampled in proportion to the fraction of its class in the overall population. Stratified sampling is common when the population density varies across categories, and it is important to compare these categories with the same statistical power. Political polling often involves stratified sampling when it is known that different demographic groups vote in significantly different ways. Disproportionate representation of each group in a random sample can lead to large errors in the outcomes of the polls. When we control for demographics, we can avoid oversampling the majority over the other groups.

Cluster sampling: Sometimes there are natural groups among the population being studied, and each group is representative of the whole population. An example is data that spans many geographical regions. In cluster sampling, you take a random subset of the groups followed by a random sample from within each of those groups to construct the full data sample. This kind of sampling can reduce the cost of data collection without compromising the fidelity of distribution in the population.

Systematic sampling: Systematic or interval sampling is used when there is a certain ordering present in the sampling frame (a finite set of objects treated as the population and taken to be the source of data for sampling, for example, the corpus of Wikipedia articles, arranged lexicographically by title). If the sample is selected by starting at a random object and skipping a constant k number of objects before selecting the next one, that is called systematic sampling. The value of k is calculated as the ratio of the population size to the desired sample size.
Model evaluation metrics: Evaluating models for performance is generally based on different evaluation metrics for different types of learning. In classification, evaluation is generally based on accuracy, receiver operating characteristic (ROC) curves, training speed, memory requirements, false positive rate, and so on (see Chapter 2, Practical Approach to Real-World Supervised Learning). In clustering, the number of clusters found, cohesion, separation, and so on form the general metrics (see Chapter 3, Unsupervised Machine Learning Techniques). In stream-based learning, apart from the standard metrics mentioned earlier, adaptability, speed of learning, and robustness to sudden changes are some of the conventional metrics for evaluating the performance of the learner (see Chapter 5, Real-Time Stream Machine Learning).
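To make the data type distinctions concrete for a Java programmer, the following is a minimal sketch; the class and field names are our own illustration, not taken from any library. Categorical values map naturally to enums, continuous values to numeric primitives, and ordinal values to enums whose declaration order encodes the ranking.

// Illustrative only: Java representations of the three common data types.
public class WeatherRecord {

    // Categorical (nominal): a fixed set of unordered values.
    enum Outlook { SUNNY, OVERCAST, RAINY }

    // Ordinal: constants declared in rank order, so that compareTo()
    // reflects the ordering (SMALL < MEDIUM < LARGE).
    enum GarmentSize { SMALL, MEDIUM, LARGE }

    Outlook outlook;    // categorical feature
    double temperature; // continuous feature
    double humidity;    // continuous feature
    boolean windy;      // categorical feature with exactly two values
    boolean play;       // the target or label
}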
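The following is a minimal sketch of two of the sampling methods in plain Java; all class and method names are our own illustration, and production code would more likely use library support (for example, Weka's supervised Resample filter).

import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative sketch of uniform random and stratified random sampling.
public class SamplingSketch {

    // Uniform random sampling: every object has an equal chance of selection.
    static <T> List<T> uniformRandomSample(List<T> population, int sampleSize, Random rng) {
        List<T> copy = new ArrayList<>(population);
        Collections.shuffle(copy, rng);
        return new ArrayList<>(copy.subList(0, Math.min(sampleSize, copy.size())));
    }

    // Stratified random sampling: divide the population into strata and draw
    // from each stratum in proportion to its share of the population.
    static <T> List<T> stratifiedSample(List<T> population, Function<T, String> stratumOf,
                                        int sampleSize, Random rng) {
        Map<String, List<T>> strata = population.stream().collect(Collectors.groupingBy(stratumOf));
        List<T> sample = new ArrayList<>();
        for (List<T> stratum : strata.values()) {
            // Proportional quota; rounding may make the total differ by one or two.
            int quota = (int) Math.round((double) stratum.size() * sampleSize / population.size());
            sample.addAll(uniformRandomSample(stratum, quota, rng));
        }
        return sample;
    }

    public static void main(String[] args) {
        // Stratifying on the label itself preserves the class ratio in the sample.
        List<String> labels = Arrays.asList("yes", "yes", "yes", "yes", "yes", "yes", "no", "no", "no");
        System.out.println(stratifiedSample(labels, s -> s, 6, new Random(42)));
        // Expected shape: roughly 4 "yes" and 2 "no", mirroring the 2:1 ratio.
    }
}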

To illustrate these concepts, a concrete example in the form of a commonly used sample weather dataset is given. The data gives a set of weather conditions and a label that indicates whether the subject decided to play a game of tennis on the day or not:

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

The dataset is in ARFF (Attribute-Relation File Format), the native format of the Weka toolkit. It consists of a header giving information about the features or attributes with their data types, followed by the actual comma-separated data after the @data tag. The dataset has five features, namely outlook, temperature, humidity, windy, and play. The features outlook and windy are categorical features, while humidity and temperature are continuous. The feature play is the target and is categorical.
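As a quick illustration of working with this format programmatically, the following is a minimal sketch that loads the dataset through the Weka API and prints some basic information; it assumes the snippet above has been saved as weather.arff and that Weka is on the classpath.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadWeather {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file; DataSource infers the loader from the extension.
        Instances data = new DataSource("weather.arff").getDataSet();

        // Mark the last attribute (play) as the target for later learning steps.
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Instances:  " + data.numInstances());          // 14
        System.out.println("Attributes: " + data.numAttributes());         // 5
        System.out.println("Target:     " + data.classAttribute().name()); // play
    }
}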

Machine learning – types and subtypes

We will now explore different subtypes or branches of machine learning. Though the following list is not comprehensive, it covers the most well-known types:

Supervised learning: This is the most popular branch of machine learning, which is about learning from labeled