Mastering Java Machine Learning

Uday Kamath
Description

Java is one of the main languages used by practicing data scientists; much of the Hadoop ecosystem is Java-based, and it is certainly the language that most production systems in data science are written in. If you know Java, Mastering Java Machine Learning is your next step on the path to becoming an advanced practitioner in data science.
This book aims to introduce you to an array of advanced techniques in machine learning, including classification, clustering, anomaly detection, stream learning, active learning, semi-supervised learning, probabilistic graph modeling, text mining, deep learning, and big data batch and stream machine learning. Accompanying each chapter are illustrative examples and real-world case studies that show how to apply the newly learned techniques using sound methodologies and the best Java-based tools available today.
On completing this book, you will have an understanding of the tools and techniques for building powerful machine learning models to solve data science problems in just about any domain.




Table of Contents

Mastering Java Machine Learning
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Machine Learning Review
Machine learning – history and definition
What is not machine learning?
Machine learning – concepts and terminology
Machine learning – types and subtypes
Datasets used in machine learning
Machine learning applications
Practical issues in machine learning
Machine learning – roles and process
Roles
Process
Machine learning – tools and datasets
Datasets
Summary
2. Practical Approach to Real-World Supervised Learning
Formal description and notation
Data quality analysis
Descriptive data analysis
Basic label analysis
Basic feature analysis
Visualization analysis
Univariate feature analysis
Categorical features
Continuous features
Multivariate feature analysis
Data transformation and preprocessing
Feature construction
Handling missing values
Outliers
Discretization
Data sampling
Is sampling needed?
Undersampling and oversampling
Stratified sampling
Training, validation, and test set
Feature relevance analysis and dimensionality reduction
Feature search techniques
Feature evaluation techniques
Filter approach
Univariate feature selection
Information theoretic approach
Statistical approach
Multivariate feature selection
Minimal redundancy maximal relevance (mRMR)
Correlation-based feature selection (CFS)
Wrapper approach
Embedded approach
Model building
Linear models
Linear Regression
Algorithm input and output
How does it work?
Advantages and limitations
Naïve Bayes
Algorithm input and output
How does it work?
Advantages and limitations
Logistic Regression
Algorithm input and output
How does it work?
Advantages and limitations
Non-linear models
Decision Trees
Algorithm inputs and outputs
How does it work?
Advantages and limitations
K-Nearest Neighbors (KNN)
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Support vector machines (SVM)
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Ensemble learning and meta learners
Bootstrap aggregating or bagging
Algorithm inputs and outputs
How does it work?
Random Forest
Advantages and limitations
Boosting
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Model assessment, evaluation, and comparisons
Model assessment
Model evaluation metrics
Confusion matrix and related metrics
ROC and PRC curves
Gain charts and lift curves
Model comparisons
Comparing two algorithms
McNemar's Test
Paired-t test
Wilcoxon signed-rank test
Comparing multiple algorithms
ANOVA test
Friedman's test
Case Study – Horse Colic Classification
Business problem
Machine learning mapping
Data analysis
Label analysis
Features analysis
Supervised learning experiments
Weka experiments
Sample end-to-end process in Java
Weka experimenter and model selection
RapidMiner experiments
Visualization analysis
Feature selection
Model process flow
Model evaluation metrics
Evaluation on Confusion Metrics
ROC Curves, Lift Curves, and Gain Charts
Results, observations, and analysis
Summary
References
3. Unsupervised Machine Learning Techniques
Issues in common with supervised learning
Issues specific to unsupervised learning
Feature analysis and dimensionality reduction
Notation
Linear methods
Principal component analysis (PCA)
Inputs and outputs
How does it work?
Advantages and limitations
Random projections (RP)
Inputs and outputs
How does it work?
Advantages and limitations
Multidimensional Scaling (MDS)
Inputs and outputs
How does it work?
Advantages and limitations
Nonlinear methods
Kernel Principal Component Analysis (KPCA)
Inputs and outputs
How does it work?
Advantages and limitations
Manifold learning
Inputs and outputs
How does it work?
Advantages and limitations
Clustering
Clustering algorithms
k-Means
Inputs and outputs
How does it work?
Advantages and limitations
DBSCAN
Inputs and outputs
How does it work?
Advantages and limitations
Mean shift
Inputs and outputs
How does it work?
Advantages and limitations
Expectation maximization (EM) or Gaussian mixture modeling (GMM)
Input and output
How does it work?
Advantages and limitations
Hierarchical clustering
Input and output
How does it work?
Advantages and limitations
Self-organizing maps (SOM)
Inputs and outputs
How does it work?
Advantages and limitations
Spectral clustering
Inputs and outputs
How does it work?
Advantages and limitations
Affinity propagation
Inputs and outputs
How does it work?
Advantages and limitations
Clustering validation and evaluation
Internal evaluation measures
Notation
R-Squared
Dunn's Indices
Davies-Bouldin index
Silhouette's index
External evaluation measures
Rand index
F-Measure
Normalized mutual information index
Outlier or anomaly detection
Outlier algorithms
Statistical-based
Inputs and outputs
How does it work?
Advantages and limitations
Distance-based methods
Inputs and outputs
How does it work?
Advantages and limitations
Density-based methods
Inputs and outputs
How does it work?
Advantages and limitations
Clustering-based methods
Inputs and outputs
How does it work?
Advantages and limitations
High-dimensional-based methods
Inputs and outputs
How does it work?
Advantages and limitations
One-class SVM
Inputs and outputs
How does it work?
Advantages and limitations
Outlier evaluation techniques
Supervised evaluation
Unsupervised evaluation
Real-world case study
Tools and software
Business problem
Machine learning mapping
Data collection
Data quality analysis
Data sampling and transformation
Feature analysis and dimensionality reduction
PCA
Random projections
ISOMAP
Observations on feature analysis and dimensionality reduction
Clustering models, results, and evaluation
Observations and clustering analysis
Outlier models, results, and evaluation
Observations and analysis
Summary
References
4. Semi-Supervised and Active Learning
Semi-supervised learning
Representation, notation, and assumptions
Semi-supervised learning techniques
Self-training SSL
Inputs and outputs
How does it work?
Advantages and limitations
Co-training SSL or multi-view SSL
Inputs and outputs
How does it work?
Advantages and limitations
Cluster and label SSL
Inputs and outputs
How does it work?
Advantages and limitations
Transductive graph label propagation
Inputs and outputs
How does it work?
Advantages and limitations
Transductive SVM (TSVM)
Inputs and outputs
How does it work?
Advantages and limitations
Case study in semi-supervised learning
Tools and software
Business problem
Machine learning mapping
Data collection
Data quality analysis
Data sampling and transformation
Datasets and analysis
Feature analysis results
Experiments and results
Analysis of semi-supervised learning
Active learning
Representation and notation
Active learning scenarios
Active learning approaches
Uncertainty sampling
How does it work?
Least confident sampling
Smallest margin sampling
Label entropy sampling
Advantages and limitations
Version space sampling
Query by disagreement (QBD)
How does it work?
Query by Committee (QBC)
How does it work?
Advantages and limitations
Data distribution sampling
How does it work?
Expected model change
Expected error reduction
Variance reduction
Density weighted methods
Advantages and limitations
Case study in active learning
Tools and software
Business problem
Machine learning mapping
Data Collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Pool-based scenarios
Stream-based scenarios
Analysis of active learning results
Summary
References
5. Real-Time Stream Machine Learning
Assumptions and mathematical notations
Basic stream processing and computational techniques
Stream computations
Sliding windows
Sampling
Concept drift and drift detection
Data management
Partial memory
Full memory
Detection methods
Monitoring model evolution
Widmer and Kubat
Drift Detection Method or DDM
Early Drift Detection Method or EDDM
Monitoring distribution changes
Welch's t test
Kolmogorov-Smirnov's test
CUSUM and Page-Hinckley test
Adaptation methods
Explicit adaptation
Implicit adaptation
Incremental supervised learning
Modeling techniques
Linear algorithms
Online linear models with loss functions
Inputs and outputs
How does it work?
Advantages and limitations
Online Naïve Bayes
Inputs and outputs
How does it work?
Advantages and limitations
Non-linear algorithms
Hoeffding trees or very fast decision trees (VFDT)
Inputs and outputs
How does it work?
Advantages and limitations
Ensemble algorithms
Weighted majority algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Online Bagging algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Online Boosting algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Validation, evaluation, and comparisons in online setting
Model validation techniques
Prequential evaluation
Holdout evaluation
Controlled permutations
Evaluation criteria
Comparing algorithms and metrics
Incremental unsupervised learning using clustering
Modeling techniques
Partition based
Online k-Means
Inputs and outputs
How does it work?
Advantages and limitations
Hierarchical based and micro clustering
Inputs and outputs
How does it work?
Advantages and limitations
Inputs and outputs
How does it work?
Advantages and limitations
Density based
Inputs and outputs
How does it work?
Advantages and limitations
Grid based
Inputs and outputs
How does it work?
Advantages and limitations
Validation and evaluation techniques
Key issues in stream cluster evaluation
Evaluation measures
Cluster Mapping Measures (CMM)
V-Measure
Other external measures
Unsupervised learning using outlier detection
Partition-based clustering for outlier detection
Inputs and outputs
How does it work?
Advantages and limitations
Distance-based clustering for outlier detection
Inputs and outputs
How does it work?
Exact Storm
Abstract-C
Direct Update of Events (DUE)
Micro Clustering based Algorithm (MCOD)
Approx Storm
Advantages and limitations
Validation and evaluation techniques
Case study in stream learning
Tools and software
Business problem
Machine learning mapping
Data collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Supervised learning experiments
Concept drift experiments
Clustering experiments
Outlier detection experiments
Analysis of stream learning results
Summary
References
6. Probabilistic Graph Modeling
Probability revisited
Concepts in probability
Conditional probability
Chain rule and Bayes' theorem
Random variables, joint, and marginal distributions
Marginal independence and conditional independence
Factors
Factor types
Distribution queries
Probabilistic queries
MAP queries and marginal MAP queries
Graph concepts
Graph structure and properties
Subgraphs and cliques
Path, trail, and cycles
Bayesian networks
Representation
Definition
Reasoning patterns
Causal or predictive reasoning
Evidential or diagnostic reasoning
Intercausal reasoning
Combined reasoning
Independencies, flow of influence, D-Separation, I-Map
Flow of influence
D-Separation
I-Map
Inference
Elimination-based inference
Variable elimination algorithm
Input and output
How does it work?
Advantages and limitations
Clique tree or junction tree algorithm
Input and output
How does it work?
Advantages and limitations
Propagation-based techniques
Belief propagation
Factor graph
Messaging in factor graph
Input and output
How does it work?
Advantages and limitations
Sampling-based techniques
Forward sampling with rejection
Input and output
How does it work?
Advantages and limitations
Learning
Learning parameters
Maximum likelihood estimation for Bayesian networks
Bayesian parameter estimation for Bayesian network
Prior and posterior using the Dirichlet distribution
Learning structures
Measures to evaluate structures
Methods for learning structures
Constraint-based techniques
Inputs and outputs
How does it work?
Advantages and limitations
Search and score-based techniques
Inputs and outputs
How does it work?
Advantages and limitations
Markov networks and conditional random fields
Representation
Parameterization
Gibbs parameterization
Factor graphs
Log-linear models
Independencies
Global
Pairwise Markov
Markov blanket
Inference
Learning
Conditional random fields
Specialized networks
Tree augmented network
Input and output
How does it work?
Advantages and limitations
Markov chains
Hidden Markov models
Most probable path in HMM
Posterior decoding in HMM
Tools and usage
OpenMarkov
Weka Bayesian Network GUI
Case study
Business problem
Machine learning mapping
Data sampling and transformation
Feature analysis
Models, results, and evaluation
Analysis of results
Summary
References
7. Deep Learning
Multi-layer feed-forward neural network
Inputs, neurons, activation function, and mathematical notation
Multi-layered neural network
Structure and mathematical notations
Activation functions in NN
Sigmoid function
Hyperbolic tangent ("tanh") function
Training neural network
Empirical risk minimization
Parameter initialization
Loss function
Gradients
Gradient at the output layer
Gradient at the Hidden Layer
Parameter gradient
Feed forward and backpropagation
How does it work?
Regularization
L2 regularization
L1 regularization
Limitations of neural networks
Vanishing gradients, local optimum, and slow training
Deep learning
Building blocks for deep learning
Rectified linear activation function
Restricted Boltzmann Machines
Definition and mathematical notation
Conditional distribution
Free energy in RBM
Training the RBM
Sampling in RBM
Contrastive divergence
Inputs and outputs
How does it work?
Persistent contrastive divergence
Autoencoders
Definition and mathematical notations
Loss function
Limitations of Autoencoders
Denoising Autoencoder
Unsupervised pre-training and supervised fine-tuning
Deep feed-forward NN
Input and outputs
How does it work?
Deep Autoencoders
Deep Belief Networks
Inputs and outputs
How does it work?
Deep learning with dropouts
Definition and mathematical notation
Inputs and outputs
How does it work?
Learning, training, and testing with dropouts
Sparse coding
Convolutional Neural Network
Local connectivity
Parameter sharing
Discrete convolution
Pooling or subsampling
Normalization using ReLU
CNN Layers
Recurrent Neural Networks
Structure of Recurrent Neural Networks
Learning and associated problems in RNNs
Long Short Term Memory
Gated Recurrent Units
Case study
Tools and software
Business problem
Machine learning mapping
Data sampling and transformation
Feature analysis
Models, results, and evaluation
Basic data handling
Multi-layer perceptron
Parameters used for MLP
Code for MLP
Convolutional Network
Parameters used for ConvNet
Code for CNN
Variational Autoencoder
Parameters used for the Variational Autoencoder
Code for Variational Autoencoder
DBN
Parameter search using Arbiter
Results and analysis
Summary
References
8. Text Mining and Natural Language Processing
NLP, subfields, and tasks
Text categorization
Part-of-speech tagging (POS tagging)
Text clustering
Information extraction and named entity recognition
Sentiment analysis and opinion mining
Coreference resolution
Word sense disambiguation
Machine translation
Semantic reasoning and inferencing
Text summarization
Automating question and answers
Issues with mining unstructured data
Text processing components and transformations
Document collection and standardization
Inputs and outputs
How does it work?
Tokenization
Inputs and outputs
How does it work?
Stop words removal
Inputs and outputs
How does it work?
Stemming or lemmatization
Inputs and outputs
How does it work?
Local/global dictionary or vocabulary?
Feature extraction/generation
Lexical features
Character-based features
Word-based features
Part-of-speech tagging features
Taxonomy features
Syntactic features
Semantic features
Feature representation and similarity
Vector space model
Binary
Term frequency (TF)
Inverse document frequency (IDF)
Term frequency-inverse document frequency (TF-IDF)
Similarity measures
Euclidean distance
Cosine distance
Pairwise-adaptive similarity
Extended Jaccard coefficient
Dice coefficient
Feature selection and dimensionality reduction
Feature selection
Information theoretic techniques
Statistical-based techniques
Frequency-based techniques
Dimensionality reduction
Topics in text mining
Text categorization/classification
Topic modeling
Probabilistic latent semantic analysis (PLSA)
Input and output
How does it work?
Advantages and limitations
Text clustering
Feature transformation, selection, and reduction
Clustering techniques
Generative probabilistic models
Input and output
How does it work?
Advantages and limitations
Distance-based text clustering
Non-negative matrix factorization (NMF)
Input and output
How does it work?
Advantages and limitations
Evaluation of text clustering
Named entity recognition
Hidden Markov models for NER
Input and output
How does it work?
Advantages and limitations
Maximum entropy Markov models for NER
Input and output
How does it work?
Advantages and limitations
Deep learning and NLP
Tools and usage
Mallet
KNIME
Topic modeling with mallet
Business problem
Machine Learning mapping
Data collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Analysis of text processing results
Summary
References
9. Big Data Machine Learning – The Final Frontier
What are the characteristics of Big Data?
Big Data Machine Learning
General Big Data framework
Big Data cluster deployment frameworks
Hortonworks Data Platform
Cloudera CDH
Amazon Elastic MapReduce
Microsoft Azure HDInsight
Data acquisition
Publish-subscribe frameworks
Source-sink frameworks
SQL frameworks
Message queueing frameworks
Custom frameworks
Data storage
HDFS
NoSQL
Key-value databases
Document databases
Columnar databases
Graph databases
Data processing and preparation
Hive and HQL
Spark SQL
Amazon Redshift
Real-time stream processing
Machine Learning
Visualization and analysis
Batch Big Data Machine Learning
H2O as Big Data Machine Learning platform
H2O architecture
Machine learning in H2O
Tools and usage
Case study
Business problem
Machine Learning mapping
Data collection
Data sampling and transformation
Experiments, results, and analysis
Feature relevance and analysis
Evaluation on test data
Analysis of results
Spark MLlib as Big Data Machine Learning platform
Spark architecture
Machine Learning in MLlib
Tools and usage
Experiments, results, and analysis
k-Means
k-Means with PCA
Bisecting k-Means (with PCA)
Gaussian Mixture Model
Random Forest
Analysis of results
Real-time Big Data Machine Learning
SAMOA as a real-time Big Data Machine Learning framework
SAMOA architecture
Machine Learning algorithms
Tools and usage
Experiments, results, and analysis
Analysis of results
The future of Machine Learning
Summary
References
A. Linear Algebra
Vector
Scalar product of vectors
Matrix
Transpose of a matrix
Matrix addition
Scalar multiplication
Matrix multiplication
Properties of matrix product
Linear transformation
Matrix inverse
Eigendecomposition
Positive definite matrix
Singular value decomposition (SVD)
B. Probability
Axioms of probability
Bayes' theorem
Density estimation
Mean
Variance
Standard deviation
Gaussian standard deviation
Covariance
Correlation coefficient
Binomial distribution
Poisson distribution
Gaussian distribution
Central limit theorem
Error propagation
Index

Mastering Java Machine Learning

Mastering Java Machine Learning

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: June 2017

Production reference: 1290617

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78588-051-3

www.packtpub.com

Credits

Authors

Uday Kamath

Krishna Choppella

Reviewers

Samir Sahli

Prashant Verma

Commissioning Editor

Veena Pagare

Acquisition Editor

Divya Poojari

Content Development Editor

Mayur Pawanikar

Technical Editor

Vivek Arora

Copy Editors

Vikrant Phadkay

Safis Editing

Project Coordinator

Nidhi Joshi

Proofreaders

Safis Editing

Indexer

Francy Puthiry

Graphics

Tania Dutta

Production Coordinator

Arvindkumar Gupta

Cover Work

Arvindkumar Gupta

Foreword

Dr. Uday Kamath is a volcano of ideas. Every time he walked into my office, we had fruitful and animated discussions. I have been a professor of computer science at George Mason University (GMU) for 15 years, specializing in machine learning and data mining. I have known Uday for five years, first as a student in my data mining class, then as a colleague and co-author of papers and projects on large-scale machine learning. While a chief data scientist at BAE Systems Applied Intelligence, Uday earned his PhD in evolutionary computation and machine learning. As if having two high-demand jobs was not enough, Uday was unusually prolific, publishing extensively with four different people in the computer science faculty during his tenure at GMU, something you don't see very often. Given this pedigree, I am not surprised that less than four years since Uday's graduation with a PhD, I am writing the foreword for his book on mastering advanced machine learning techniques with Java. Uday's thirst for new stimulating challenges has struck again, resulting in this terrific book you now have in your hands.

This book is the product of his deep interest and knowledge in sound and well-grounded theory, and at the same time his keen grasp of the practical feasibility of proposed methodologies. Several books on machine learning and data analytics exist, but Uday's book closes a substantial gap—the one between theory and practice. It offers a comprehensive and systematic analysis of classic and advanced learning techniques, with a focus on their advantages and limitations, practical use and implementations. This book is a precious resource for practitioners of data science and analytics, as well as for undergraduate and graduate students keen to master practical and efficient implementations of machine learning techniques.

The book covers the classic techniques of machine learning, such as classification, clustering, dimensionality reduction, anomaly detection, semi-supervised learning, and active learning. It also covers advanced and recent topics, including learning with stream data, deep learning, and the challenges of learning with big data. Each chapter is dedicated to a topic and includes an illustrative case study, which covers state-of-the-art Java-based tools and software, and the entire knowledge discovery cycle: data collection, experimental design, modeling, results, and evaluation. Each chapter is self-contained, providing great flexibility of usage. The accompanying website provides the source code and data. This is truly a gem for both students and data analytics practitioners, who can experiment first-hand with the methods just learned or deepen their understanding of the methods by applying them to real-world scenarios.

As I was reading the various chapters of the book, I was reminded of the enthusiasm Uday has for learning and knowledge. He communicates the concepts described in the book with clarity and with the same passion. I am positive that you, as a reader, will feel the same. I will certainly keep this book as a personal resource for the courses I teach, and strongly recommend it to my students.

Dr. Carlotta Domeniconi

Associate Professor of Computer Science, George Mason University

About the Authors

Dr. Uday Kamath is the chief data scientist at BAE Systems Applied Intelligence. He specializes in scalable machine learning and has spent 20 years in domains such as anti-money laundering (AML), fraud detection in financial crime, cyber security, and bioinformatics. Dr. Kamath is responsible for key products in areas focusing on the behavioral, social networking, and big data machine learning aspects of analytics at BAE AI. He received his PhD at George Mason University, under the able guidance of Dr. Kenneth De Jong, where his dissertation research focused on machine learning for big data and automated sequence mining.

I would like to thank my friend, Krishna Choppella, for accepting the offer to co-author this book and being an able partner on this long but satisfying journey.

Heartfelt thanks to our reviewers, especially Dr. Samir Sahli for his valuable comments, suggestions, and in-depth review of the chapters. I would like to thank Professor Carlotta Domeniconi for her suggestions and comments that helped us shape various chapters in the book. I would also like to thank all the Packt staff, especially Divya Poojari, Mayur Pawanikar, and Vivek Arora, for helping us complete the tasks in time. This book required making a lot of sacrifices on the personal front and I would like to thank my wife, Pratibha, and our nanny, Evelyn, for their unconditional support. Finally, thanks to all my lovely teachers and professors for not only teaching the subjects, but also instilling the joy of learning.

Krishna Choppella builds tools and client solutions in his role as a solutions architect for analytics at BAE Systems Applied Intelligence. He has been programming in Java for 20 years. His interests are data science, functional programming, and distributed computing.

About the Reviewers

Samir Sahli was awarded a BSc degree in applied mathematics and information sciences from the University of Nice Sophia-Antipolis, France, in 2004. He received MSc and PhD degrees in physics (specializing in optics/photonics/image science) from University Laval, Quebec, Canada, in 2008 and 2013, respectively. During his graduate studies, he worked with Defence Research and Development Canada (DRDC) on the automatic detection and recognition of targets in aerial imagery, especially in the context of uncontrolled environments and sub-optimal acquisition conditions. He has worked since 2009 as a consultant for several companies based in Europe and North America specializing in the area of Intelligence, Surveillance, and Reconnaissance (ISR) and in remote sensing.

Dr. Sahli joined McMaster Biophotonics in 2013 as a postdoctoral fellow. His research was in the field of optics, image processing, and machine learning. He was involved in several projects, such as the development of a novel generation of gastrointestinal tract imaging devices, hyperspectral imaging of skin erythema for individualized radiotherapy treatment, and automatic detection of precancerous Barrett's esophagus cells using fluorescence lifetime imaging microscopy and multiphoton microscopy.

Dr. Sahli joined BAE Systems Applied Intelligence in 2015. He has since worked as a data scientist to develop analytics models to detect complex fraud patterns and money laundering schemes for insurance, banking, and governmental clients using machine learning, statistics, and social network analysis tools.

Prashant Verma started his IT career in 2011 as a Java developer in Ericsson, working in the telecom domain. After a couple of years of Java EE experience, he moved into the big data domain and has worked on almost all of the popular big data technologies, such as Hadoop, Spark, Flume, Mongo, and Cassandra. He has also played with Scala. Currently, he works with QA Infotech as a lead data engineer, working on solving e-learning problems with analytics and machine learning.

Prashant has worked for many companies, such as Ericsson and QA Infotech, with domain knowledge of telecom and e-learning. He has also worked as a freelance consultant in his free time.

I want to thank Packt Publishing for giving me the chance to review the book, as well as my employer and my family for their patience while I was busy working on this book.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1785880519.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We reward our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

 

Dedicated to my parents, Krishna Kamath and Bharathi Kamath, my wife, Pratibha Shenoy, and the kids, Aaroh and Brandy

  --Dr. Uday Kamath.
 

To my parents

  --Krishna Choppella

Preface

There are many notable books on machine learning, from pedagogical tracts on the theory of learning from data; to standard references on specializations in the field, such as clustering and outlier detection or probabilistic graph modeling; to cookbooks that offer practical advice on the use of tools and libraries in a particular language. The books that tend to be broad in coverage are often short on theoretical detail, while those with a focus on one topic or tool may not, for example, have much to say about the difference in approach in a streaming as opposed to a batch environment. Besides, for non-novices with a preference for tools in Java who wish to reach for a single volume that will extend their knowledge on all the essential aspects at once, there are precious few options.

Finding all of the following in one place is not possible today, as far as we know:

The pros and cons of different techniques given any data availability scenario—when data is labeled or unlabeled, streaming or batch, local or distributed, structured or unstructured
A ready reference for the most important mathematical results related to those very techniques, for a better appreciation of the underlying theory
An introduction to the most mature Java-based frameworks, libraries, and visualization tools, with descriptions and illustrations of how to put these techniques into practice

The core idea of this book, therefore, is to address this gap while maintaining a balance between the treatment of theory and practice: probability, statistics, basic linear algebra, and rudimentary calculus in the service of the one; methodology, case studies, tools, and code in support of the other.

According to the KDnuggets 2016 software poll, Java, at 16.8%, has the second highest share in popularity among languages used in machine learning, after Python. What's more, this marks a 19% increase over the year before! Clearly, Java remains an important and effective vehicle to build and deploy systems involving machine learning, despite claims of its decline in some quarters. With this book, we aim to reach professionals and motivated enthusiasts with some experience in Java and a beginner's knowledge of machine learning. Our goal is to make Mastering Java Machine Learning the next step on their path to becoming advanced practitioners in data science. To guide them on this path, the book covers a veritable arsenal of techniques in machine learning—some of which they may already be familiar with, others perhaps not as much, or only superficially—including methods of data analysis, learning algorithms, and evaluation of model performance in supervised and unsupervised learning, clustering and anomaly detection, and semi-supervised and active learning. It also presents special topics such as probabilistic graph modeling, text mining, and deep learning. Not forgetting the increasingly important topics in enterprise-scale systems today, the book also covers the unique challenges of learning from evolving data streams and the tools and techniques applicable to real-time systems, as well as the imperatives of the world of Big Data:

How does machine learning work in large-scale distributed environments?
What are the trade-offs?
How must algorithms be adapted?
How can these systems interoperate with other technologies in the dominant Hadoop ecosystem?

This book explains how to apply machine learning to real-world data and real-world domains with the right methodology, processes, applications, and analysis. Accompanying each chapter are case studies and examples of how to apply the newly learned techniques using some of the best available open source tools written in Java. This book covers more than 15 open source Java tools supporting a wide range of techniques between them, with code and practical usage. The code, data, and configurations are available for readers to download and experiment with. We present more than ten real-world case studies in Machine Learning that illustrate the data scientist's process. Each case study details the steps undertaken in the experiments: data ingestion, data analysis, data cleansing, feature reduction/selection, mapping to machine learning, model training, model selection, model evaluation, and analysis of results. This gives the reader a practical guide to using the tools and methods presented in each chapter for solving the business problem at hand.

What this book covers

Chapter 1, Machine Learning Review, is a refresher of basic concepts and techniques that the reader would have learned from Packt's Machine Learning in Java or a similar text. This chapter is a review of concepts such as data, data transformation, sampling and bias, features and their importance, supervised learning, unsupervised learning, big data learning, stream and real-time learning, probabilistic graph models, and semi-supervised learning.

Chapter 2, Practical Approach to Real-World Supervised Learning, cobwebs dusted, dives straight into the vast field of supervised learning and the full spectrum of associated techniques. We cover the topics of feature selection and reduction, linear modeling, logistic models, non-linear models, SVM and kernels, ensemble learning techniques such as bagging and boosting, validation techniques and evaluation metrics, and model selection. Using WEKA and RapidMiner, we carry out a detailed case study, going through all the steps from data analysis to analysis of model performance. As in each of the other chapters, the case study is presented as an example to help the reader understand how the techniques introduced in the chapter are applied in real life. The dataset used in the case study is UCI HorseColic.

Chapter 3, Unsupervised Machine Learning Techniques, presents many advanced methods in clustering and outlier techniques, with applications. Topics covered are feature selection and reduction in unsupervised data, clustering algorithms, evaluation methods in clustering, and anomaly detection using statistical, distance, and distribution techniques. At the end of the chapter, we perform a case study for both clustering and outlier detection using a real-world image dataset, MNIST. We use the Smile API to do feature reduction and ELKI for learning.

Chapter 4, Semi-supervised Learning and Active Learning, gives details of algorithms and techniques for learning when only a small amount of labeled data is present. Topics covered are self-training, generative models, transductive SVMs, co-training, active learning, and multi-view learning. The case study involves both types of learning systems and is performed on the real-world UCI Breast Cancer Wisconsin dataset. The tools introduced are JKernelMachines, KEEL, and JCLAL.

Chapter 5, Real-Time Stream Machine Learning, shows how data streams in real time present unique challenges for the problem of learning from data. This chapter broadly covers the need for stream machine learning and its applications, supervised stream learning, unsupervised cluster stream learning, unsupervised outlier learning, evaluation techniques in stream learning, and metrics used for evaluation. A detailed case study is given at the end of the chapter to illustrate the use of the MOA framework. The dataset used is Electricity (ELEC).

Chapter 6, Probabilistic Graph Modeling, shows that many real-world problems can be effectively represented by encoding complex joint probability distributions over multi-dimensional spaces. Probabilistic graph models provide a framework to represent, draw inferences, and learn effectively in such situations. The chapter broadly covers probability concepts, PGMs, Bayesian networks, Markov networks, Graph Structure Learning, Hidden Markov Models, and Inferencing. A detailed case study on a real-world dataset is performed at the end of the chapter. The tools used in this case study are OpenMarkov and WEKA's Bayes network. The dataset is UCI Adult (Census Income).

Chapter 7, Deep Learning, covers the superstar of machine learning in the popular imagination today: deep learning, which has attained a dominance among techniques used to solve the most complex AI problems. Topics broadly covered are neural networks, issues in neural networks, deep belief networks, restricted Boltzmann machines, convolutional networks, long short-term memory units, denoising autoencoders, recurrent networks, and others. We present a detailed case study showing how to implement deep learning networks, tune the parameters, and perform learning. We use DeepLearning4J with the MNIST image dataset.

Chapter 8, Text Mining and Natural Language Processing, details the techniques, algorithms, and tools for performing various analyses in the field of text mining. Topics broadly covered are areas of text mining, components needed for text mining, representation of text data, dimensionality reduction techniques, topic modeling, text clustering, named entity recognition, and deep learning. The case study uses real-world unstructured text data (the Reuters-21578 dataset) highlighting topic modeling and text classification; the tools used are MALLET and KNIME.

Chapter 9, Big Data Machine Learning – the Final Frontier, discusses some of the most important challenges of today. What learning options are available when data is either big or available at a very high velocity? How is scalability handled? Topics covered are big data cluster deployment frameworks, big data storage options, batch data processing, batch data machine learning, real-time machine learning frameworks, and real-time stream learning. In the detailed case study, covering both batch and real-time big data learning, we select the UCI Covertype dataset and the machine learning libraries H2O, Spark MLlib, and SAMOA.

Appendix A, Linear Algebra, covers concepts from linear algebra, and is meant as a brief refresher. It is by no means complete in its coverage, but contains a whirlwind tour of some important concepts relevant to the machine learning techniques featured in the book. It includes vectors, matrices, basic matrix operations and properties, linear transformations, matrix inverse, eigendecomposition, positive definite matrix, and singular value decomposition.

Appendix B, Probability, provides a brief primer on probability. It includes the axioms of probability, Bayes' theorem, density estimation, mean, variance, standard deviation, Gaussian standard deviation, covariance, correlation coefficient, binomial distribution, Poisson distribution, Gaussian distribution, central limit theorem, and error propagation.

What you need for this book

This book assumes you have some experience of programming in Java and a basic understanding of machine learning concepts. If that doesn't apply to you, but you are curious nonetheless and self-motivated, fret not, and read on! For those who do have some background, it means that you are familiar with simple statistical analysis of data and concepts involved in supervised and unsupervised learning. Those who may not have the requisite math or must poke the far reaches of their memory to shake loose the odd formula or funny symbol, do not be disheartened. If you are the sort that loves a challenge, the short primer in the appendices may be all you need to kick-start your engines—a bit of tenacity will see you through the rest! For those who have never been introduced to machine learning, the first chapter was equally written for you as for those needing a refresher—it is your starter-kit to jump in feet first and find out what it's all about. You can augment your basics with any number of online resources. Finally, for those innocent of Java, here's a secret: many of the tools featured in the book have powerful GUIs. Some include wizard-like interfaces, making them quite easy to use, and do not require any knowledge of Java. So if you are new to Java, just skip the examples that need coding and learn to use the GUI-based tools instead!

Who this book is for

The primary audience of this book is professionals who work with data and whose responsibilities may include data analysis, data visualization or transformation, and the training, validation, testing, and evaluation of machine learning models—presumably to perform predictive, descriptive, or prescriptive analytics using Java or Java-based tools. The choice of Java may imply a personal preference and therefore some prior experience programming in Java. On the other hand, perhaps circumstances in the work environment or company policies limit the use of third-party tools to only those written in Java and a few others. In the second case, the prospective reader may have no programming experience in Java. This book is aimed at this reader just as squarely as it is at their colleague, the Java expert (who came up with the policy in the first place).

A secondary audience can be defined by a profile with two attributes alone: an intellectual curiosity about machine learning and the desire for a single comprehensive treatment of the concepts, the practical techniques, and the tools. A specimen of this type of reader can opt to skip the math and the tools and focus on learning the most common supervised and unsupervised learning algorithms alone. Another might skim over Chapters 1, 2, 3, and 7, skip the others entirely, and jump headlong into the tools—a perfectly reasonable strategy if you want to quickly make yourself useful analyzing that dataset the client said would be here any day now. Importantly, too, some practice reproducing the experiments from the book will get you asking the right questions of the gurus! Alternatively, you might want to use this book as a reference to quickly look up the details of the algorithm for affinity propagation (Chapter 3, Unsupervised Machine Learning Techniques), or remind yourself of an LSTM architecture with a brief review of the schematic (Chapter 7, Deep Learning), or dog-ear the page with the list of pros and cons of distance-based clustering methods for outlier detection in stream-based learning (Chapter 5, Real-Time Stream Machine Learning). All specimens are welcome and each will find plenty to sink their teeth into.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/mjmlbook/mastering-java-machine-learning. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Machine Learning Review

Recent years have seen a revival of artificial intelligence (AI), and machine learning in particular, both in academic circles and in industry. In the last decade, AI has seen dramatic successes that eluded practitioners in the intervening years, after the original promise of the field gave way to relative decline until its re-emergence in the last few years.

What made these successes possible, in large part, was the impetus provided by the need to process the prodigious amounts of ever-growing data, key algorithmic advances by dogged researchers in deep learning, and the inexorable increase in raw computational power driven by Moore's Law. Among the areas of AI leading the resurgence, machine learning has seen spectacular developments, and continues to find the widest applicability in an array of domains. The use of machine learning to help in complex decision making at the highest levels of business and, at the same time, its enormous success in improving the accuracy of what are now everyday applications, such as searches, speech recognition, and personal assistants on mobile phones, have made its effects commonplace in the family room and the board room alike. Articles breathlessly extolling the power of deep learning can be found today not only in the popular science and technology press but also in mainstream outlets such as The New York Times and The Huffington Post. Machine learning has indeed become ubiquitous in a relatively short time.

An ordinary user encounters machine learning in many ways in their day-to-day activities. Most e-mail providers, including Yahoo and Gmail, give the user automated sorting and categorization of e-mails into headings such as Spam, Junk, Promotions, and so on, which is made possible using text mining, a branch of machine learning. When shopping online for products on e-commerce websites, such as https://www.amazon.com/, or watching movies from content providers, such as Netflix, one is offered recommendations for other products and content by so-called recommender systems, another branch of machine learning, as an effective way to retain customers.

Forecasting the weather, estimating real estate prices, predicting voter turnout, and even election results—all use some form of machine learning to see into the future, as it were.

The ever-growing availability of data and the promise of systems that can enrich our lives by learning from that data place a growing demand on the skills of the limited workforce of professionals in the field of data science. This demand is particularly acute for well-trained experts who know their way around the landscape of machine learning techniques in the more popular languages, such as Java, Python, R, and increasingly, Scala. Fortunately, thanks to the thousands of contributors in the open source community, each of these languages has a rich and rapidly growing set of libraries, frameworks, and tutorials that make state-of-the-art techniques accessible to anyone with an internet connection and a computer, for the most part. Java is an important vehicle for this spread of tools and technology, especially in large-scale machine learning projects, owing to its maturity and stability in enterprise-level deployments and the portable JVM platform, not to mention the legions of professional programmers who have adopted it over the years. Consequently, mastery of the skills so lacking in the workforce today will put any aspiring professional with a desire to enter the field at a distinct advantage in the marketplace.

Perhaps you already apply machine learning techniques in your professional work, or maybe you simply have a hobbyist's interest in the subject. If you're reading this, it's likely you can already bend Java to your will, no problem, but now you feel you're ready to dig deeper and learn how to use the best of breed open source ML Java frameworks in your next data science project. If that is indeed you, how fortuitous is it that the chapters in this book are designed to do all that and more!

Mastery of a subject, especially one that has such obvious applicability as machine learning, requires more than an understanding of its core concepts and familiarity with its mathematical underpinnings. Unlike an introductory treatment of the subject, a book that purports to help you master the subject must be heavily focused on practical aspects in addition to introducing more advanced topics that would have stretched the scope of the introductory material. To warm up before we embark on sharpening our skills, we will devote this chapter to a quick review of what we already know. For the ambitious novice with little or no prior exposure to the subject (who is nevertheless determined to get the fullest benefit from this book), here's our advice: make sure you do not skip the rest of this chapter; instead, use it as a springboard to explore unfamiliar concepts in more depth. Seek out external resources as necessary. Wikipedia them. Then jump right back in.

For the rest of this chapter, we will review the following:

History and definitions
What is not machine learning?
Concepts and terminology
Important branches of machine learning
Different data types in machine learning
Applications of machine learning
Issues faced in machine learning
The meta-process used in most machine learning projects
Information on some well-known tools, APIs, and resources that we will employ in this book

Machine learning – history and definition

It is difficult to give an exact history, but the idea behind the definition of machine learning we use today can be found as early as the 1600s. In René Descartes' Discourse on the Method (1637), he refers to automata and says:

For we can easily understand a machine's being constituted so that it can utter words, and even emit some responses to action on it of a corporeal kind, which brings about a change in its organs; for instance, if touched in a particular part it may ask what we wish to say to it; if in another part it may exclaim that it is being hurt, and so on.

Note

http://www.earlymoderntexts.com/assets/pdfs/descartes1637.pdf

https://www.marxists.org/reference/archive/descartes/1635/discourse-method.htm

Alan Turing, in his famous publication Computing Machinery and Intelligence, gives basic insights into the goals of machine learning by asking the question "Can machines think?".

Note

http://csmt.uchicago.edu/annotations/turing.htm

http://www.csee.umbc.edu/courses/471/papers/turing.pdf

In 1959, Arthur Samuel wrote: "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed."

Tom Mitchell, in more recent times, gave a more exact definition of machine learning: "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." For example, in spam filtering, the task T is classifying incoming e-mails, the experience E is a corpus of e-mails labeled as spam or not spam, and the performance measure P could be the fraction of e-mails classified correctly.

Machine learning has a relationship with several areas:

Statistics: It uses the elements of data sampling, estimation, hypothesis testing, learning theory, and statistical-based modeling, to name a few
Algorithms and computation: It uses the basic concepts of search, traversal, parallelization, distributed computing, and so on from basic computer science
Database and knowledge discovery: For its ability to store, retrieve, and access information in various formats
Pattern recognition: For its ability to find interesting patterns from the data to explore, visualize, and predict
Artificial intelligence: Though it is considered a branch of artificial intelligence, it also has relationships with other branches, such as heuristics, optimization, evolutionary computing, and so on

What is not machine learning?

It is important to recognize areas that share a connection with machine learning but cannot themselves be considered part of it. Some of these disciplines may overlap with machine learning to a greater or lesser extent, yet the principles underlying machine learning are quite distinct:

Business intelligence (BI) and reporting: Reporting key performance indicators (KPIs), querying OLAP for slicing, dicing, and drilling into the data, dashboards, and so on, which form the central components of BI, are not machine learning.

Storage and ETL: Data storage and ETL are key elements in any machine learning process, but, by themselves, they don't qualify as machine learning.

Information retrieval, search, and queries: The ability to retrieve data or documents based on search criteria or indexes, which forms the basis of information retrieval, is not really machine learning. Many forms of machine learning, such as semi-supervised learning, can rely on searching for similar data for modeling, but that doesn't qualify searching as machine learning.

Knowledge representation and reasoning: Representing knowledge for performing complex tasks, such as ontologies, expert systems, and the semantic web, does not qualify as machine learning.

Machine learning – concepts and terminology

In this section, we will describe the different concepts and terms normally used in machine learning:

Data or dataset: The basics of machine learning rely on understanding the data. The data or dataset normally refers to content available in structured or unstructured format for use in machine learning. Structured datasets have specific formats, and an unstructured dataset is normally in the form of some free-flowing text. Data can be available in various storage types or formats. In structured data, every element, known as an instance, example, or row, follows a predefined structure. Data can also be categorized by size: small or medium datasets have from a few hundred to a few thousand instances, whereas big data refers to a large volume, mostly in millions or billions, that cannot be stored or accessed using common devices or fit in the memory of such devices.

Features, attributes, variables, or dimensions: In structured datasets, as mentioned before, there are predefined elements with their own semantics and data type, which are known variously as features, attributes, metrics, indicators, variables, or dimensions.

Data types: The features defined earlier need some form of typing in many machine learning algorithms or techniques. The most commonly used data types are as follows (a short Java illustration of these types appears at the end of this list):
Categorical or nominal: This indicates well-defined categories or values present in the dataset. For example, eye color—black, blue, brown, green, grey; document content type—text, image, video.

Continuous or numeric: This indicates a numeric nature of the data field. For example, a person's weight measured by a bathroom scale, the temperature reading from a sensor, or the monthly balance in dollars on a credit card account.

Ordinal: This denotes data that can be ordered in some way. For example, garment size—small, medium, large; boxing weight classes—heavyweight, light heavyweight, middleweight, lightweight, and bantamweight.
Target or label: A feature or set of features in the dataset, which is used for learning from training data and predicting in an unseen dataset, is known as a target or a label. The term "ground truth" is also used in some domains. A label can have any form as specified before, that is, categorical, continuous, or ordinal.

Machine learning model: Each machine learning algorithm, based on what it learned from the dataset, maintains the state of its learning for predicting or giving insights into future or unseen data. This is referred to as the machine learning model.

Sampling: Data sampling is an essential step in machine learning. Sampling means choosing a subset of examples from a population with the intent of treating the behavior seen in the (smaller) sample as being representative of the behavior of the (larger) population. In order for the sample to be representative of the population, care must be taken in the way the sample is chosen. Generally, a population consists of every object sharing the properties of interest in the problem domain, for example, all people eligible to vote in the general election, or all potential automobile owners in the next four years. Since it is usually prohibitive (or impossible) to collect data for all the objects in a population, a well-chosen subset is selected for the purpose of analysis. A crucial consideration in the sampling process is that the sample is unbiased with respect to the population. The following are types of probability-based sampling (a code sketch of two of these methods also appears at the end of this list):
Uniform random sampling: This refers to sampling that is done over a uniformly distributed population, that is, each object has an equal probability of being chosen.

Stratified random sampling: This refers to the sampling method used when the data can be categorized into multiple classes. In such cases, in order to ensure all categories are represented in the sample, the population is divided into distinct strata based on these classifications, and each stratum is sampled in proportion to the fraction of its class in the overall population. Stratified sampling is common when the population density varies across categories, and it is important to compare these categories with the same statistical power. Political polling often involves stratified sampling when it is known that different demographic groups vote in significantly different ways. Disproportionate representation of each group in a random sample can lead to large errors in the outcomes of the polls. When we control for demographics, we can avoid oversampling the majority over the other groups.

Cluster sampling: Sometimes there are natural groups among the population being studied, and each group is representative of the whole population. An example is data that spans many geographical regions. In cluster sampling, you take a random subset of the groups followed by a random sample from within each of those groups to construct the full data sample. This kind of sampling can reduce the cost of data collection without compromising the fidelity of distribution in the population.

Systematic sampling: Systematic or interval sampling is used when there is a certain ordering present in the sampling frame (a finite set of objects treated as the population and taken to be the source of data for sampling, for example, the corpus of Wikipedia articles, arranged lexicographically by title). If the sample is selected by starting at a random object and skipping a constant k number of objects before selecting the next one, that is called systematic sampling. The value of k is calculated as the ratio of the population size to the desired sample size.
Model evaluation metrics: Evaluating models for performance is generally based on different evaluation metrics for different types of learning. In classification, evaluation is generally based on accuracy, receiver operating characteristic (ROC) curves, training speed, memory requirements, false positive rate, and so on (see Chapter 2, Practical Approach to Real-World Supervised Learning). In clustering, the number of clusters found, cohesion, separation, and so on form the general metrics (see Chapter 3, Unsupervised Machine Learning Techniques). In stream-based learning, apart from the standard metrics mentioned earlier, adaptability, speed of learning, and robustness to sudden changes are some of the conventional metrics for evaluating the performance of the learner (see Chapter 5, Real-Time Stream Machine Learning).
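To make the data type distinctions concrete for a Java programmer, the following is a minimal sketch; the class and field names are our own illustration, not taken from any library. Categorical values map naturally to enums, continuous values to numeric primitives, and ordinal values to enums whose declaration order encodes the ranking.

// Illustrative only: Java representations of the three common data types.
public class WeatherRecord {

    // Categorical (nominal): a fixed set of unordered values.
    enum Outlook { SUNNY, OVERCAST, RAINY }

    // Ordinal: constants declared in rank order, so that compareTo()
    // reflects the ordering (SMALL < MEDIUM < LARGE).
    enum GarmentSize { SMALL, MEDIUM, LARGE }

    Outlook outlook;    // categorical feature
    double temperature; // continuous feature
    double humidity;    // continuous feature
    boolean windy;      // categorical feature with exactly two values
    boolean play;       // the target or label
}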
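The following is a minimal sketch of two of the sampling methods in plain Java; all class and method names are our own illustration, and production code would more likely use library support (for example, Weka's supervised Resample filter).

import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative sketch of uniform random and stratified random sampling.
public class SamplingSketch {

    // Uniform random sampling: every object has an equal chance of selection.
    static <T> List<T> uniformRandomSample(List<T> population, int sampleSize, Random rng) {
        List<T> copy = new ArrayList<>(population);
        Collections.shuffle(copy, rng);
        return new ArrayList<>(copy.subList(0, Math.min(sampleSize, copy.size())));
    }

    // Stratified random sampling: divide the population into strata and draw
    // from each stratum in proportion to its share of the population.
    static <T> List<T> stratifiedSample(List<T> population, Function<T, String> stratumOf,
                                        int sampleSize, Random rng) {
        Map<String, List<T>> strata = population.stream().collect(Collectors.groupingBy(stratumOf));
        List<T> sample = new ArrayList<>();
        for (List<T> stratum : strata.values()) {
            // Proportional quota; rounding may make the total differ by one or two.
            int quota = (int) Math.round((double) stratum.size() * sampleSize / population.size());
            sample.addAll(uniformRandomSample(stratum, quota, rng));
        }
        return sample;
    }

    public static void main(String[] args) {
        // Stratifying on the label itself preserves the class ratio in the sample.
        List<String> labels = Arrays.asList("yes", "yes", "yes", "yes", "yes", "yes", "no", "no", "no");
        System.out.println(stratifiedSample(labels, s -> s, 6, new Random(42)));
        // Expected shape: roughly 4 "yes" and 2 "no", mirroring the 2:1 ratio.
    }
}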

To illustrate these concepts, a concrete example in the form of a commonly used sample weather dataset is given. The data gives a set of weather conditions and a label that indicates whether the subject decided to play a game of tennis on the day or not:

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

The dataset is in ARFF (Attribute-Relation File Format), the native format of the Weka toolkit. It consists of a header giving information about the features or attributes with their data types, followed by the actual comma-separated data after the @data tag. The dataset has five features, namely outlook, temperature, humidity, windy, and play. The features outlook and windy are categorical features, while humidity and temperature are continuous. The feature play is the target and is categorical.
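As a quick illustration of working with this format programmatically, the following is a minimal sketch that loads the dataset through the Weka API and prints some basic information; it assumes the snippet above has been saved as weather.arff and that Weka is on the classpath.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadWeather {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file; DataSource infers the loader from the extension.
        Instances data = new DataSource("weather.arff").getDataSet();

        // Mark the last attribute (play) as the target for later learning steps.
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Instances:  " + data.numInstances());          // 14
        System.out.println("Attributes: " + data.numAttributes());         // 5
        System.out.println("Target:     " + data.classAttribute().name()); // play
    }
}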

Machine learning – types and subtypes

We will now explore different subtypes or branches of machine learning. Though the following list is not comprehensive, it covers the most well-known types:

Supervised learning: This is the most popular branch of machine learning, which is about learning from labeled