An effective guide to using ensemble techniques to enhance machine learning models
This book is for data scientists, machine learning practitioners, and deep learning enthusiasts who want to implement ensemble techniques and dive deep into the world of machine learning algorithms. You are expected to understand Python code and have a basic knowledge of probability theory, statistics, and linear algebra.
Ensembling is the technique of combining two or more similar or dissimilar machine learning algorithms to create a model with superior predictive power. This book will show you how to use many weak algorithms to build a strong predictive model. It contains Python code for different machine learning algorithms so that you can easily understand and implement them in your own systems.
This book covers machine learning algorithms that are widely used in practice to make predictions and classifications. It addresses the different aspects of a prediction framework, such as data pre-processing, model training, model validation, and more. You will gain knowledge of ensemble techniques such as bagging (decision trees and random forests), boosting (AdaBoost), and stacking (combining the predictions of multiple models with a meta-learner).
Then you'll learn how to implement these techniques by building ensemble models using TensorFlow and Python libraries such as scikit-learn and NumPy. As machine learning touches almost every field of the digital world, you'll see how these algorithms can be used in different applications such as computer vision, speech recognition, making recommendations, grouping and classifying documents, fitting regression models to data, and more.
By the end of this book, you'll understand how to combine machine learning algorithms so that they work together behind the scenes, reducing the challenges and common problems of individual models.
This comprehensive guide offers the perfect blend of theory, examples, and implementations of real-world use cases.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2017
Production reference: 1191217
ISBN 978-1-78829-775-2
www.packtpub.com
Author: Ankit Dixit
Copy Editor: Vikrant Phadkay
Reviewers: Apeksha Jain, Radovan Kavicky
Project Coordinator: Nidhi Joshi
Commissioning Editor: Sunith Shetty
Proofreader: Safis Editing
Acquisition Editor: Viraj Madhav
Indexer: Tejal Daruwale Soni
Content Development Editor: Aishwarya Pandere
Graphics: Tania Dutta
Technical Editor: Suwarna Patil
Production Coordinator: Nilesh Mohite
Ankit Dixit is a data scientist and computer vision engineer from Mumbai. He holds a BTech in biomedical engineering and a master's degree with a specialization in computer vision. He has worked in the field of computer vision and machine learning for the past 6 years, using various software and hardware platforms for the design and development of machine vision algorithms, and has experience with a wide variety of machine learning algorithms. Currently, he is focusing on designing computer vision and machine learning algorithms for medical imaging data, using advanced technologies such as ensemble methods and deep learning-based models.
Apeksha Jain is a data scientist and computer vision engineer from Mumbai, India. She holds a BTech in biomedical engineering and a master's degree with a specialization in computer vision. She has been working in the field of computer vision and machine learning for more than 6 years, using various software and hardware platforms for the design and development of machine vision algorithms, and has experience with various machine learning algorithms, including deep learning. Currently, she is designing computer vision and machine learning algorithms for medical imaging data at Aditya Imaging and Information Technologies (part of the Sun Pharmaceutical advanced research center), Mumbai, using advanced technologies such as ensemble methods and deep learning-based models.
Radovan Kavicky is the principal data scientist and president at GapData Institute, based in Bratislava, Slovakia, where he harnesses the power of data and the wisdom of economics for the public good. A macroeconomist by education and a consultant and analyst by profession (with 8+ years of experience consulting for clients from the public and private sectors), he has strong mathematical and analytical skills that enable him to deliver top-level research and analytical work. From MATLAB, SAS, and Stata, he switched to Python, R, and Tableau.
Radovan is an evangelist of open data and a member of the Slovak Economic Association (SEA), Open Budget Initiative, Open Government Partnership, and the global Tableau #DataLeader network (2017). He is the founder of PyData Bratislava, R <- Slovakia, and the SK/CZ Tableau User Group (skczTUG). He has been a speaker at @TechSummit (Bratislava, 2017) and @PyData (Berlin, 2017).
You can follow him on Twitter at @radovankavicky, @GapDataInst, or @PyDataBA. His full profile and experience are available at https://www.linkedin.com/in/radovankavicky/ and https://github.com/radovankavicky.
GapData Institute: https://www.gapdata.org.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/178829775X.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
Introduction to Ensemble Learning
What is ensemble machine learning?
The purpose of ensemble machine learning
How to create an ensemble system
Quantification of performance
Bias and variance errors
Methods to create ensemble systems
Bagging
Boosting
Stacking
Summary
Decision Trees
How do decision trees work?
ID3 algorithm for decision tree building
Root node
Salary
The Sex attribute
Marital status
Parent node
Choosing between the Sex and Marital attributes for the Low salary group
Choosing between the Sex and Marital attributes for the Med salary group
Marital status
Case study – car evaluation problem
Summary
Random Forest
Classification and regression trees
Gini index for impurity check
Node selection
Creating a split
Tree building
At depth – 1 (root node)
At depth – 2 (left branch)
At depth – 2 (right branch)
Case study – breast cancer type prediction
Decision tree bagging
From bagging to random forest
Summary
Random Subspace and KNN Bagging
Subspace bagging
Case study – subspace bagging
More information about the dataset
KNN classification
KNN for spam filtering
Dataset
Dataset information
Attribute information
KNN bagging with random subspaces
Summary
AdaBoost Classifier
Boosting
AdaBoost in a nutshell
Weak classifier
AdaBoost in action
Application of the AdaBoost classifier in face detection
Face detection using Haar cascades
Integral image
Implementation using OpenCV
Summary
Gradient Boosting Machines
Gradient Boosting Machines
What is the difference?
Create split
Node selection
Build tree
Regression tree as a classifier
GBM implementation
Algorithm
Improvements to basic gradient boosting
Tree constraints
Weighted updates
Stochastic gradient boosting
Penalized gradient boosting
Summary
XGBoost – eXtreme Gradient Boosting
XGBoost – supervised learning
Models and parameters
Objective function – training loss + regularization
Why introduce the general principle?
XGBoost features
Model features
System features
Algorithm features
Why use XGBoost?
XGBoost execution speed
Model performance
How to install
Building the shared library
Building on Ubuntu/Debian
Building on Windows
A trick for easy installation on a Windows machine
XGBoost in action
Dataset information
Attribute information
XGBoost parameters
General parameters
Booster parameters
Learning task parameters
Parameter tuning – number and size of decision trees
Problem description – Otto dataset
Tune the number of decision trees in XGBoost
Tuning the size of decision trees in XGBoost
Tuning the number of trees and max depth in XGBoost
Summary
Stacked Generalization
Stacked generalization
Submodel training
KNN classification
Distance calculation (Euclidean)
Estimating the neighbors
Making predictions using voting
Perceptron
Training the perceptron
Gradient descent
Stochastic gradient descent
Implementation of perceptron
Logistic regression
The logistic function
Representation of logistic regression
Modeling probability using logistic regression
Learning the model
Prediction using logistic regression
Implementation of algorithm
Stacked generalization implementation
Practical application – Sonar dataset (Mine and Rock prediction)
More information about the dataset
Summary
Stacked Generalization – Part 2
Feature selection
Why feature selection?
Simplification of models
Dataset information
Predicted attribute
Attribute information
Shorter training time
To avoid the curse of dimensionality
Enhanced generalization by reducing overfitting
Feature selection for machine learning
Univariate selection
Recursive Feature Elimination
Principal Component Analysis
Choosing important features (feature importance)
Understanding the SVM
How does SVM work?
Hyperplane – separation between the data points
Implementation of an SVM
Objective function
Function optimization
Handling a nonlinear dataset
Stacking of nonlinear algorithms
Spam classification with stacking
Dataset information
Attribute information
How to choose classifiers?
Summary
Modern Day Machine Learning
Artificial Neural Networks (feed-forward)
How does ANN work?
Training of ANNs
Learning by backpropagation
ANN implementation using Keras and TensorFlow
TensorFlow for machine learning
Keras for machine learning
Digit classification using Keras and TensorFlow
Deep learning
Convolutional Neural Networks
Local receptive fields
Shared weights and biases
Pooling layers
Combining all the layers
Implementation of CNN in Python
Recurrent Neural Networks
How RNN works (unrolling RNN)
Unrolling the forward pass
Unrolling the backward pass
Backpropagation Through Time
Backpropagation training algorithm
Backpropagation Through Time
Long Short-Term Memory networks
The idea behind LSTMs
Step-by-step LSTM walkthrough
Text generation using LSTM
Problem description – Project Gutenberg
LSTM model
Generating text with an LSTM Network
Summary
Troubleshooting
Full code of the implemented algorithm ID3
Code of the CART algorithm
Code for random forest
Code for KNN and subspace bagging
KNN subspace bagging code
Code of the AdaBoost classifier
Code of GBMs
Full code of implementation
Full code of LSTM implementation
Science has given us one of its greatest gifts: computers. This invention is as significant as fire; it has changed the history of mankind. Try to name a field of work where computers are not being used; I bet you cannot. Computers are a special kind of species that feeds only on electricity and on the one precious thing the whole world is interested in: information, a.k.a. data. Yes, without data, there is no use for a computer; it is just a television-like screen and nothing more. So the next question arises: what do we do with this data? Believe me, every chapter of this book will give you a perspective on how to utilize your data and extract useful results from it.
Chapter 1, Introduction to Ensemble Learning, introduces the world of ensembles. We will see how ensembles can help classifiers achieve high accuracy, and how to quantify a classifier's performance by analyzing its variance and bias errors. We will discuss three important families of ensemble algorithms: bagging, boosting, and stacking. We will look at decision tree bagging in this chapter, see how boosting works and how to use it, and, at the end, discuss what stacking is and how to implement stacked generalization.
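To make the bagging idea concrete before diving into Chapter 1, here is a minimal sketch using scikit-learn, which the book relies on throughout. The synthetic dataset (from make_classification) and all hyperparameters are illustrative assumptions chosen for this sketch, not values taken from the book:

```python
# A minimal bagging sketch: train many decision trees on bootstrap samples
# of the training data and combine their votes into a single prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# An ensemble of 50 trees, each fit on a bootstrap sample of the training set
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=7)
bagging.fit(X_train, y_train)
print("Bagged trees accuracy:", bagging.score(X_test, y_test))

# A single tree for comparison; bagging typically reduces its variance error
tree = DecisionTreeClassifier(random_state=7).fit(X_train, y_train)
print("Single tree accuracy:", tree.score(X_test, y_test))
```

A single deep tree tends to overfit (a high-variance model); averaging the votes of many trees trained on different bootstrap samples smooths this out, which is exactly the effect the bias and variance discussion in Chapter 1 quantifies.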
