Hands-On Ensemble Learning with R - Prabhanjan Narayanachar Tattar - E-Book

Prabhanjan Narayanachar Tattar

Description

Ensemble techniques are used to combine two or more similar or dissimilar machine learning algorithms into a stronger model. Such a model delivers superior predictive power and can boost the accuracy you achieve on your datasets.

Hands-On Ensemble Learning with R begins with the important statistical resampling methods. You will then walk through the central trilogy of ensemble techniques – bagging, random forest, and boosting – then you'll learn how they can be used to provide greater accuracy on large datasets using popular R packages. You will learn how to combine model predictions using different machine learning algorithms to build ensemble models. In addition to this, you will explore how to improve the performance of your ensemble models.

By the end of this book, you will have learned how machine learning algorithms can be combined to reduce common problems and build simple efficient ensemble models with the help of real-world examples.

The e-book can be read in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 415

Publication year: 2018




Table of Contents

Hands-On Ensemble Learning with R
Why subscribe?
PacktPub.com
Contributors
About the author
About the reviewer
Packt is Searching for Authors Like You
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
1. Introduction to Ensemble Techniques
Datasets
Hypothyroid
Waveform
German Credit
Iris
Pima Indians Diabetes
US Crime
Overseas visitors
Primary Biliary Cirrhosis
Multishapes
Board Stiffness
Statistical/machine learning models
Logistic regression model
Logistic regression for hypothyroid classification
Neural networks
Neural network for hypothyroid classification
Naïve Bayes classifier
Naïve Bayes for hypothyroid classification
Decision tree
Decision tree for hypothyroid classification
Support vector machines
SVM for hypothyroid classification
The right model dilemma!
An ensemble purview
Complementary statistical tests
Permutation test
Chi-square and McNemar test
ROC test
Summary
2. Bootstrapping
Technical requirements
The jackknife technique
The jackknife method for mean and variance
Pseudovalues method for survival data
Bootstrap – a statistical method
The standard error of correlation coefficient
The parametric bootstrap
Eigen values
Rule of thumb
The boot package
Bootstrap and testing hypotheses
Bootstrapping regression models
Bootstrapping survival models*
Bootstrapping time series models*
Summary
3. Bagging
Technical requirements
Classification trees and pruning
Bagging
k-NN classifier
Analyzing waveform data
k-NN bagging
Summary
4. Random Forests
Technical requirements
Random Forests
Variable importance
Proximity plots
Random Forest nuances
Comparisons with bagging
Missing data imputation
Clustering with Random Forest
Summary
5. The Bare Bones Boosting Algorithms
Technical requirements
The general boosting algorithm
Adaptive boosting
Gradient boosting
Building it from scratch
Squared-error loss function
Using the adabag and gbm packages
Variable importance
Comparing bagging, random forests, and boosting
Summary
6. Boosting Refinements
Technical requirements
Why does boosting work?
The gbm package
Boosting for count data
Boosting for survival data
The xgboost package
The h2o package
Summary
7. The General Ensemble Technique
Technical requirements
Why does ensembling work?
Ensembling by voting
Majority voting
Weighted voting
Ensembling by averaging
Simple averaging
Weight averaging
Stack ensembling
Summary
8. Ensemble Diagnostics
Technical requirements
What is ensemble diagnostics?
Ensemble diversity
Numeric prediction
Class prediction
Pairwise measure
Disagreement measure
Yule's or Q-statistic
Correlation coefficient measure
Cohen's statistic
Double-fault measure
Interrating agreement
Entropy measure
Kohavi-Wolpert measure
Disagreement measure for ensemble
Measurement of interrater agreement
Summary
9. Ensembling Regression Models
Technical requirements
Pre-processing the housing data
Visualization and variable reduction
Variable clustering
Regression models
Linear regression model
Neural networks
Regression tree
Prediction for regression models
Bagging and Random Forests
Boosting regression models
Stacking methods for regression models
Summary
10. Ensembling Survival Models
Core concepts of survival analysis
Nonparametric inference
Regression models – parametric and Cox proportional hazards models
Survival tree
Ensemble survival models
Summary
11. Ensembling Time Series Models
Technical requirements
Time series datasets
AirPassengers
co2
uspop
gas
Car Sales
austres
WWWusage
Time series visualization
Core concepts and metrics
Essential time series models
Naïve forecasting
Seasonal, trend, and loess fitting
Exponential smoothing state space model
Auto-regressive Integrated Moving Average (ARIMA) models
Auto-regressive neural networks
Messing it all up
Bagging and time series
Ensemble time series models
Summary
12. What's Next?
A. Bibliography
References
R package references
Index

Hands-On Ensemble Learning with R

Hands-On Ensemble Learning with R

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty

Acquisition Editor: Tushar Gupta

Content Development Editor: Aaryaman Singh

Technical Editor: Dinesh Chaudhary

Copy Editors: Safis Editing

Project Coordinator: Manthan Patel

Proofreader: Safis Editing

Indexer: Mariammal Chettiyar

Graphics: Jisha Chirayil

Production Coordinator: Nilesh Mohite

First published: July 2018

Production reference: 1250718

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78862-414-5

www.packtpub.com

On the personal front, I continue to benefit from the support of my family: my daughter, Pranathi; my wife, Chandrika; and my parents, Lakshmi and Narayanachar. The difference from the acknowledgements in my earlier books is that I am now in Chennai, and they support me from Bengaluru. It takes a lot of sacrifice to allow a writer his private time for writing. I also thank my managers at Ford Motor Company, K. Sridharan, Anirban Singha, and Madhu Rao, for their support. Anirban went through some of the draft chapters and expressed confidence in the treatment of topics in the book.

My association with Packt now spans six years and four books! This is the third title I have done with Tushar Gupta, and needless to say, I enjoy working with him. Menka Bohra and Aaryaman Singh have put a lot of faith in my work and strived to accommodate the delays, so special thanks to both of them. Manthan Patel and Snehal Kolte have also extended their support. Finally, it is a great pleasure to thank Storm Mann for improving the language of the book. If you still come across a few mistakes, the blame is completely mine.

It is a pleasure to dedicate this book to them for all their support.

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and videos from over 4,000 industry professionals
Learn better with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Prabhanjan Narayanachar Tattar is a lead statistician and manager at the Global Data Insights & Analytics division of Ford Motor Company, Chennai. He received the IBS(IR)-GK Shukla Young Biometrician Award (2005) and Dr. U.S. Nair Award for Young Statistician (2007). He held SRF of CSIR-UGC during his PhD. He has authored books such as Statistical Application Development with R and Python, 2nd Edition, Packt; Practical Data Science Cookbook, 2nd Edition, Packt; and A Course in Statistics with R, Wiley. He has created many R packages.

The statistics and machine learning community, powered by software engineers, is striving to make the world a better, safer, and more efficient place. I would like to thank these societies on behalf of the reader.

About the reviewer

Antonio L. Amadeu is a data science consultant who is passionate about artificial intelligence and neural networks. He uses machine learning and deep learning algorithms in his daily challenges, solving all types of issues in any business field. He has worked for Unilever, Lloyds Bank, TE Connectivity, Microsoft, and Samsung. As an aspiring astrophysicist, he does research with the Virtual Observatory group at São Paulo University in Brazil, which is a member of the International Virtual Observatory Alliance (IVOA).

Packt is Searching for Authors Like You

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Preface

Ensemble learning! This specialized topic of machine learning broadly deals with putting together multiple models with the aim of providing higher accuracy and more stable model performance. The ensemble methodology is based on sound theory, and it has seen successful application in complex data science scenarios. This book takes up this important topic in a hands-on manner.

Moderately sized datasets are used throughout the book. All the concepts—well, most of them—have been illustrated using software, and R packages have been used liberally to drive home the point. While care has been taken to ensure that all the code is error-free, please feel free to write to us about any bugs or errors. The approach has been validated through two mini-courses based on earlier drafts. The material was well received by my colleagues, and that gave me enough confidence to complete the book.

The Packt editorial team has helped a lot with the technical review, and the manuscript reaches you after a lot of refinement. The bugs and shortcomings belong to the author.

Who this book is for

This book is for anyone who wants to master machine learning by building ensemble models with the power of R. Basic knowledge of machine learning techniques and programming knowledge of R are expected in order to get the most out of the book.

What this book covers

Chapter 1, Introduction to Ensemble Techniques, will give an exposition to the need for ensemble learning, important datasets, essential statistical and machine learning models, and important statistical tests. This chapter displays the spirit of the book.

Chapter 2, Bootstrapping, will introduce the two important concepts of jackknife and bootstrap. The chapter will help you carry out statistical inference related to unknown complex parameters. Bootstrapping of essential statistical models, such as linear regression, survival, and time series, is illustrated through R programs. More importantly, it lays the basis for resampling techniques that forms the core of ensemble methods.
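As a taste of what the chapter covers, the standard error of a correlation coefficient can be bootstrapped with the boot package along the following lines. The mtcars columns, the seed, and the number of replicates are illustrative choices for this sketch, not the book's exact code:

```r
# Bootstrap the correlation between two variables with the boot package.
library(boot)

# Statistic function: boot() passes the data and a vector of resampled indices.
corr_fn <- function(data, indices) {
  d <- data[indices, ]
  cor(d$mpg, d$wt)
}

set.seed(123)
boot_out <- boot(data = mtcars[, c("mpg", "wt")], statistic = corr_fn, R = 999)
boot_out                          # original estimate, bias, and standard error
boot.ci(boot_out, type = "perc")  # percentile bootstrap confidence interval
```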

Chapter 3, Bagging, will propose the first ensemble method, using a decision tree as the base model. The term bagging is derived from bootstrap aggregation. Pruning of decision trees is illustrated, laying the foundation required for later chapters. The bagging of decision trees and k-NN classifiers is illustrated in this chapter.
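A bagged tree classifier of the kind described above can be sketched with the ipred package; the iris data and the 70/30 split are assumptions made purely for illustration:

```r
# Bagging decision trees with ipred: fit 25 trees on bootstrap samples
# and aggregate their predictions by majority vote.
library(ipred)

set.seed(123)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
bag_fit <- bagging(Species ~ ., data = iris[train_idx, ], nbagg = 25)
preds <- predict(bag_fit, newdata = iris[-train_idx, ])
mean(preds == iris$Species[-train_idx])  # hold-out accuracy
```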

Chapter 4, Random Forests, will discuss the important ensemble extension of decision trees. Variable importance and proximity plots are two important components of random forests, and we carry out the related computations about them. The nuances of random forests are explained in depth. Comparison with the bagging method, missing data imputation, and clustering with random forests are also dealt with in this chapter.
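A minimal random forest fit with variable importance, of the kind the chapter computes, might look as follows; the randomForest package is the standard choice here, but the iris data and settings are illustrative assumptions:

```r
# Fit a random forest and extract variable importance measures.
library(randomForest)

set.seed(123)
rf_fit <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
importance(rf_fit)   # mean decrease in accuracy and in Gini index
# varImpPlot(rf_fit) # the corresponding variable importance plot
```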

Chapter 5, The Bare-Bones Boosting Algorithms, will first state the boosting algorithm. Using toy data, the chapter will then explain the detailed computations of the adaptive boosting algorithm. The gradient boosting algorithm is then illustrated for the regression problem. The use of the gbm and adabag packages shows implementations of other boosting algorithms. The chapter concludes with a comparison of the bagging, random forest, and boosting methods.
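For a flavor of the gradient boosting illustration, a hedged gbm sketch for a regression problem is shown below; the dataset and hyperparameters are illustrative, not the book's tuned values:

```r
# Gradient boosting for regression with the gbm package.
library(gbm)

set.seed(123)
gbm_fit <- gbm(mpg ~ ., data = mtcars, distribution = "gaussian",
               n.trees = 500, interaction.depth = 2, shrinkage = 0.05)
summary(gbm_fit, plotit = FALSE)  # relative influence of each predictor
```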

Chapter 6, Boosting Refinements, will begin with an explanation of why the boosting technique works. The gradient boosting algorithm is then extended to count and survival datasets. The details of extreme gradient boosting, an efficient implementation of the popular gradient boosting algorithm, are exhibited with clear programs. The chapter concludes with an outline of the important h2o package.

Chapter 7, The General Ensemble Technique, will study the probabilistic reasons for the success of the ensemble technique. The success of the ensemble is explained for classification and regression problems.

Chapter 8, Ensemble Diagnostics, will examine the conditions for the diversity of an ensemble. Pairwise comparisons of classifiers and overall interrater agreement measures are illustrated here.

Chapter 9, Ensembling Regression Models, will discuss in detail the use of ensemble methods in regression problems. A complex housing dataset from Kaggle is used here. The regression data is modeled with multiple base learners. Bagging, random forests, boosting, and stacking are all illustrated for the regression data.

Chapter 10, Ensembling Survival Models, is where survival data is taken up. Survival analysis concepts are developed in considerable detail, and the traditional techniques are illustrated. The machine learning method of a survival tree is introduced, and then we build the ensemble method of random survival forests for this data structure.

Chapter 11, Ensembling Time Series Models, deals with another specialized data structure in which observations are dependent on each other. The core concepts of time series and the essential related models are developed. Bagging of a specialized time series model is presented, and we conclude the chapter with an ensemble of heterogeneous time series models.

Chapter 12, What's Next?, will discuss some of the unresolved topics in ensemble learning and the scope for future work.

To get the most out of this book

The official website of R is the Comprehensive R Archive Network (CRAN) at www.cran.r-project.org. At the time of writing this book, the most recent version of R is 3.5.1. The software is available for three platforms: Linux, macOS, and Windows. The reader can also download a convenient frontend, such as RStudio.

Every chapter has a section titled Technical requirements, which gives a list of the R packages required to run the code in that chapter. For example, the requirements for Chapter 3, Bagging, are as follows:

class
FNN
ipred
mlbench
rpart

The reader then needs to install all of these packages by running the following lines in the R console:

install.packages("class")
install.packages("mlbench")
install.packages("FNN")
install.packages("rpart")
install.packages("ipred")

Download the example code files

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at http://www.packtpub.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the on-screen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Ensemble-Learning-with-R. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/HandsOnEnsembleLearningwithR_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example: "The values of the density functions are computed using the dexp function."

A block of code is set as follows:

> Events_Prob <- apply(Elements_Prob,1,prod)
> Majority_Events <- (rowSums(APC)>NT/2)
> sum(Events_Prob*Majority_Events)
[1] 0.9112646

Bold: Indicates a new term, an important word, or words that you see on the screen, for example, in menus or dialog boxes. For example: "Select System info from the Administration panel."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email <[email protected]>, and mention the book's title in the subject of your message. If you have questions about any aspect of this book, please email us at <[email protected]>.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit http://www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at <[email protected]> with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.