Machine Learning: a Concise Introduction

Steven W. Knox

Description

AN INTRODUCTION TO MACHINE LEARNING THAT INCLUDES THE FUNDAMENTAL TECHNIQUES, METHODS, AND APPLICATIONS

PROSE Award Finalist 2019
Association of American Publishers Award for Professional and Scholarly Excellence

Machine Learning: a Concise Introduction offers a comprehensive introduction to the core concepts, approaches, and applications of machine learning. The author—an expert in the field—presents fundamental ideas, terminology, and techniques for solving applied problems in classification, regression, clustering, density estimation, and dimension reduction. The design principles behind the techniques are emphasized, including the bias–variance trade-off and its influence on the design of ensemble methods. Understanding these principles leads to more flexible and successful applications. Machine Learning: a Concise Introduction also includes methods for optimization, risk estimation, and model selection—essential elements of most applied projects. This important resource:

  • Illustrates many classification methods with a single, running example, highlighting similarities and differences between methods
  • Presents R source code which shows how to apply and interpret many of the techniques covered
  • Includes many thoughtful exercises as an integral part of the text, with an appendix of selected solutions
  • Contains useful information for effectively communicating with clients

A volume in the popular Wiley Series in Probability and Statistics, Machine Learning: a Concise Introduction offers the practical information needed for an understanding of the methods and application of machine learning.

STEVEN W. KNOX holds a Ph.D. in Mathematics from the University of Illinois and an M.S. in Statistics from Carnegie Mellon University. He has over twenty years’ experience in using Machine Learning, Statistics, and Mathematics to solve real-world problems. He currently serves as Technical Director of Mathematics Research and Senior Advocate for Data Science at the National Security Agency.

Page count: 518

Publication year: 2018




WILEY SERIES IN PROBABILITY AND STATISTICS

Established by Walter A. Shewhart and Samuel S. Wilks

Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay

Editors Emeriti: J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, Jozef L. Teugels

The Wiley Series in Probability and Statistics is well established and authoritative. It covers many topics of current research interest in both pure and applied statistics and probability theory. Written by leading statisticians and institutions, the titles span both state-of-the-art developments in the field and classical methods. Reflecting the wide range of current research in statistics, the series encompasses applied, methodological and theoretical statistics, ranging from applications and new techniques made possible by advances in computerized practice to rigorous treatment of theoretical approaches. This series provides essential and invaluable reading for all statisticians, whether in academia, industry, government, or research.

A complete list of titles in this series can be found at http://www.wiley.com/go/wsps

Machine Learning: a Concise Introduction

Steven W. Knox

This edition first published 2018

This work is a U.S. Government work and is in the public domain in the U.S.A.

Published 2018 by John Wiley & Sons, Inc

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Steven W. Knox to be identified as the author of this work has been asserted in accordance with law.

Registered Office(s)

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office

111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Knox, Steven W., author.

Title: Machine learning : a concise introduction / by Steven W. Knox.

Description: Hoboken, New Jersey : John Wiley & Sons, 2018. | Series: Wiley series in probability and statistics |

Identifiers: LCCN 2017058505 (print) | LCCN 2018004509 (ebook) | ISBN 9781119439073 (pdf) | ISBN 9781119438984 (epub) | ISBN 9781119439196 (cloth)

Subjects: LCSH: Machine learning.

Classification: LCC Q325.5 (ebook) | LCC Q325.5 .K568 2018 (print) | DDC 006.3/1--dc23

LC record available at https://lccn.loc.gov/2017058505

Cover image: © Verticalarray/Shutterstock

Cover design by Wiley

CONTENTS

Preface

Organization—How to Use This Book

Acknowledgments

About the Companion Website

Chapter 1: Introduction—Examples from Real Life

Chapter 2: The Problem of Learning

2.1 Domain

2.2 Range

2.3 Data

2.4 Loss

2.5 Risk

2.6 The Reality of the Unknown Function

2.7 Training and Selection of Models, and Purposes of Learning

2.8 Notation

Chapter 3: Regression

3.1 General Framework

3.2 Loss

3.3 Estimating the Model Parameters

3.4 Properties of Fitted Values

3.5 Estimating the Variance

3.6 A Normality Assumption

3.7 Computation

3.8 Categorical Features

3.9 Feature Transformations, Expansions, and Interactions

3.10 Variations in Linear Regression

3.11 Nonparametric Regression

Chapter 4: Survey of Classification Techniques

4.1 The Bayes Classifier

4.2 Introduction to Classifiers

4.3 A Running Example

4.4 Likelihood Methods

4.5 Prototype Methods

4.6 Logistic Regression

4.7 Neural Networks

4.8 Classification Trees

4.9 Support Vector Machines

4.10 Postscript: Example Problem Revisited

Chapter 5: Bias–Variance Trade-off

5.1 Squared-Error Loss

5.2 Arbitrary Loss

Chapter 6: Combining Classifiers

6.1 Ensembles

6.2 Ensemble Design

6.3 Bootstrap Aggregation (Bagging)

6.4 Bumping

6.5 Random Forests

6.6 Boosting

6.7 Arcing

6.8 Stacking and Mixture of Experts

Chapter 7: Risk Estimation and Model Selection

7.1 Risk Estimation via Training Data

7.2 Risk Estimation via Validation or Test Data

7.3 Cross-Validation

7.4 Improvements on Cross-Validation

7.5 Out-of-Bag Risk Estimation

7.6 Akaike’s Information Criterion

7.7 Schwarz’s Bayesian Information Criterion

7.8 Rissanen’s Minimum Description Length Criterion

7.9 R² and Adjusted R²

7.10 Stepwise Model Selection

7.11 Occam’s Razor

Chapter 8: Consistency

8.1 Convergence of Sequences of Random Variables

8.2 Consistency for Parameter Estimation

8.3 Consistency for Prediction

8.4 There Are Consistent and Universally Consistent Classifiers

8.5 Convergence to Asymptopia Is Not Uniform and May Be Slow

Chapter 9: Clustering

9.1 Gaussian Mixture Models

9.2 k-Means

9.3 Clustering by Mode-Hunting in a Density Estimate

9.4 Using Classifiers to Cluster

9.5 Dissimilarity

9.6 k-Medoids

9.7 Agglomerative Hierarchical Clustering

9.8 Divisive Hierarchical Clustering

9.9 How Many Clusters Are There? Interpretation of Clustering

9.10 An Impossibility Theorem

Chapter 10: Optimization

10.1 Quasi-Newton Methods

10.2 The Nelder–Mead Algorithm

10.3 Simulated Annealing

10.4 Genetic Algorithms

10.5 Particle Swarm Optimization

10.6 General Remarks on Optimization

10.7 The Expectation-Maximization Algorithm

Chapter 11: High-Dimensional Data

11.1 The Curse of Dimensionality

11.2 Two Running Examples

11.3 Reducing Dimension While Preserving Information

11.4 Model Regularization

Chapter 12: Communication with Clients

12.1 Binary Classification and Hypothesis Testing

12.2 Terminology for Binary Decisions

12.3 ROC Curves

12.4 One-Dimensional Measures of Performance

12.5 Confusion Matrices

12.6 Multiple Testing

12.7 Expert Systems

Chapter 13: Current Challenges in Machine Learning

13.1 Streaming Data

13.2 Distributed Data

13.3 Semi-supervised Learning

13.4 Active Learning

13.5 Feature Construction via Deep Neural Networks

13.6 Transfer Learning

13.7 Interpretability of Complex Models

Chapter 14: R Source Code

14.1 Author’s Biases

14.2 Libraries

14.3 The Running Example (Section 4.3)

14.4 The Bayes Classifier (Section 4.1)

14.5 Quadratic Discriminant Analysis (Section 4.4.1)

14.6 Linear Discriminant Analysis (Section 4.4.2)

14.7 Gaussian Mixture Models (Section 4.4.3)

14.8 Kernel Density Estimation (Section 4.4.4)

14.9 Histograms (Section 4.4.5)

14.10 The Naive Bayes Classifier (Section 4.4.6)

14.11 k-Nearest-Neighbor (Section 4.5.1)

14.12 Learning Vector Quantization (Section 4.5.4)

14.13 Logistic Regression (Section 4.6)

14.14 Neural Networks (Section 4.7)

14.15 Classification Trees (Section 4.8)

14.16 Support Vector Machines (Section 4.9)

14.17 Bootstrap Aggregation (Section 6.3)

14.18 Boosting (Section 6.6)

14.19 Arcing (Section 6.7)

14.20 Random Forests (Section 6.5)

Appendix A: List of Symbols

Appendix B: Solutions to Selected Exercises

Appendix C: Converting Between Normal Parameters and Level-Curve Ellipsoids

Appendix D: Training Data and Fitted Parameters

References

Index

EULA

List of Tables

Chapter 4

Table 4.1

Table 4.2

Chapter 5

Table 5.1

Chapter 12

Table 12.1

Table 12.2


Preface

The goal of statistical data analysis is to extract the maximum information from the data, and to present a product that is as accurate and as useful as possible.

—David Scott, Multivariate Density Estimation: Theory, Practice and Visualization, 1992

My purpose in writing this book is to introduce the mathematically sophisticated reader to a large number of topics and techniques in the field variously known as machine learning, statistical learning, or predictive modeling. I believe that a deeper understanding of the subject as a whole will be obtained from reflection on an intuitive understanding of many techniques rather than a very detailed understanding of only one or two, and the book is structured accordingly. I have omitted many details while focusing on what I think shows “what is really going on.” For details, the reader will be directed to the relevant literature, or to the exercises, which form an integral part of the text.

No work this small on a subject this large can be self-contained. Some undergraduate-level calculus, linear algebra, and probability is assumed without reference, as are a few basic ideas from statistics. All of the techniques discussed here can, I hope, be implemented using this book and a mid-level programming language (such as C),1 and explicit implementation of many techniques using R is presented in the last chapter.

The reader may detect a coverage bias in favor of classification over regression. This is deliberate. The existing literature on the theory and practice of linear regression and many of its variants is so strong that it does not need any contribution from me. Classification, I believe, is not yet so well documented. In keeping with what has been important in my experience, loss functions are completely general and predictive modeling is stressed more than explanatory modeling.

The intended audience for these notes has an extremely diverse background in probability, ranging from one introductory undergraduate course to extensive graduate work and published research.2 In seeking a probability notation which will create the least confusion for all concerned, I arrived at the non-standard use of P(x) for both the probability of an event x and a probability mass or density function, with respect to some measure which is never stated, evaluated at a point x. My hope, which I believe has been borne out in practice, is that anyone with sufficient knowledge to find this notation confusing will have sufficient knowledge to work through that confusion.
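For example (to illustrate the convention; this worked example is added here and is not drawn from the text), all three of the following are written with the same symbol P, and the reader infers from context which is meant:

\[
P(X > 0), \qquad P(x) = \binom{n}{x}\, p^x (1-p)^{n-x}, \qquad P(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.
\]

The first is the probability of an event, the second a binomial probability mass function evaluated at x, and the third a standard normal density evaluated at x.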

Notes

1

There is one exception: the convex programming needed to implement a support vector machine is omitted.

2

This book is designed for two kinds of people: those who know what P(Y | X) means but do not know the Radon–Nikodym theorem, and those who know what P(Y | X) means only because they know the Radon–Nikodym theorem. I have tried to communicate with the first group, at the cost of possibly irritating the second.

Organization—How to Use This Book

The core material of this book is laid out in Chapters 1 through 7. These chapters are intended to be read in order, as each chapter builds on the previous ones. Chapters 1 and 2 introduce problems which machine learning methods can solve, and also introduce the fundamental ideas and terminology used to describe these problems and their solutions. Chapter 3 gives a brief introduction to regression.1

Chapter 4 presents many methods for classification, grouped according to how they approach the problem. Chapter 5 discusses bias–variance trade-off, a topic essential to understanding the design principles behind ensemble methods. Chapter 6 presents various ensemble methods, focusing on how each method can be understood as trading off bias for variance, or vice versa. Chapter 7 concludes the core material with methods for risk estimation and model selection. By the end of Chapter 7, the reader will have encountered many useful methods for approaching classification and regression problems, and (I hope) will appreciate how each method works and why each method approaches classification or regression the way it does.

After Chapter 7, the reader can select from among the remaining seven chapters in any order, best suiting the reader’s goals or needs. For example, some readers might wish to follow a path toward machine learning practice or consultancy:

  • Chapter 9, Clustering, and then Chapter 11, High-Dimensional Data
  • Chapter 10, Optimization
  • Chapter 12, Communication with Clients
  • Chapter 14, R Source Code (optional—recommended if the reader plans to use R)

Other readers might wish to follow a path toward machine learning research:

  • Chapter 8, Consistency
  • Chapter 11, High-Dimensional Data (optionally preceded by Chapter 9, Clustering)
  • Chapter 10, Optimization (optional)
  • Chapter 13, Current Challenges in Machine Learning

Other readers might wish to develop insight empirically, through hands-on experience working with the methods covered in the core material of Chapters 1 through 7, before progressing on to later chapters. Such readers can proceed directly to Chapter 14, R Source Code, which illustrates how to apply and interpret many of the classification methods from Chapters 4 and 6. Finally, some readers might wish to learn specifically about deep neural networks. Feed-forward deep neural networks2 are a combination of several ideas, and these ideas are presented separately because each is individually useful: feed-forward neural networks in Section 4.7, stochastic gradient descent optimization in Sections 10.1 and 10.6, autoencoders in Section 11.3, and parameter regularization in Section 11.4.
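The stochastic gradient descent ingredient, in particular, is simple enough to sketch in a few lines of R. The following is a minimal added illustration, not the book’s Chapter 14 code: the data are simulated, the learning rate is a fixed value chosen for this example, and a plain linear model stands in for a network. Each step nudges the parameters against the gradient of the squared-error loss at one randomly chosen observation.

set.seed(1)
x <- runif(1000)
y <- 2 + 3 * x + rnorm(1000, sd = 0.5)        # simulated data, true theta = (2, 3)
theta <- c(0, 0)                              # initial parameter guess
rate <- 0.05                                  # fixed learning rate (an assumption)
for (step in 1:20000) {
  i <- sample(1000, 1)                        # pick one observation at random
  resid <- y[i] - theta[1] - theta[2] * x[i]  # residual at that observation
  theta <- theta + rate * resid * c(1, x[i])  # step along the negative gradient
}
theta                                         # should be near (2, 3)

With the linear model replaced by a neural network and the gradient computed layer by layer, the same loop underlies the stochastic gradient descent training referred to above.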

Three of the later chapters—8 (Consistency), 10 (Optimization), and 12 (Communication with Clients)—may seem unusual in an introductory book on machine learning. Chapter 8 is about consistency, which addresses whether or not a given supervised learning method converges to an optimal solution as the number of data, n, increases without bound, either for a given problem or for any problem. This statistical topic is present because some readers of Chapters 4, 5, and 6 may, upon being presented with approximately 20 methods to solve classification problems, conclude that the richness and diversity of methods is a sign that no one actually knows yet how to deal with classification. Such readers might feel that there ought to be one method which solves all classification or regression problems optimally, at least as n → ∞. Such readers may be comforted, or saved reinventing the wheel, by learning of the large body of theory which addresses this issue.
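The flavor of such results can be previewed empirically. Here is a small added simulation, not the book’s code: the data are simulated from a known distribution, and letting k grow like √n is one standard choice. Take X uniform on (0, 1) and P(Y = 1 | X = x) = x, so the Bayes classifier predicts class 1 exactly when x > 1/2, and the Bayes risk is E[min(X, 1 − X)] = 1/4.

library(class)                             # provides knn()
set.seed(1)
test_x <- matrix(runif(10000), ncol = 1)   # X ~ Uniform(0, 1)
test_y <- factor(rbinom(10000, 1, test_x)) # P(Y = 1 | X = x) = x
for (n in c(100, 1000, 10000)) {
  train_x <- matrix(runif(n), ncol = 1)
  train_y <- factor(rbinom(n, 1, train_x))
  pred <- knn(train_x, test_x, train_y, k = ceiling(sqrt(n)))
  cat("n =", n, " estimated risk =", mean(pred != test_y), "\n")
}

The estimated risks settle down toward the floor of 1/4 as n grows, which is consistency in action on one particular problem; Chapter 8 makes precise when and how this happens in general.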

Chapter 10 presents useful techniques in optimization, a topic which is required in order to apply most methods in machine learning.3 Indeed, a common theme running through Chapters 3, 4, 6, 7, 9, and 11 (most of the practical content of the book) is the transformation of a classification, clustering, regression, density estimation, or dimension-reduction problem into an optimization problem: find the best estimate θ̂ of a vector of model parameters θ, find the best trained model in a class of trained models, find the best features to use or the best low-dimensional representation of features, etc. Thus, in order to solve real-world problems, the machine learning practitioner needs to be able to solve various kinds of optimization problems. Even when relying on existing software for optimization, a machine learning practitioner can be more effective if he or she understands a variety of optimization methods, and their strengths and weaknesses.
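To make the recasting concrete, here is a minimal added sketch in R (simulated data; it is not the book’s code): simple linear regression posed as “minimize the empirical squared-error risk over θ” and handed to the general-purpose optimizer optim(), whose default method is the Nelder–Mead algorithm of Section 10.2.

set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)          # simulated data, true theta = (2, 3)
risk <- function(theta) mean((y - theta[1] - theta[2] * x)^2)  # empirical risk
fit <- optim(c(0, 0), risk)                    # Nelder-Mead from starting point (0, 0)
fit$par                                        # approximately (2, 3)

Here the minimizer happens to be available in closed form (Chapter 3), but the pattern, write down an empirical risk and minimize it numerically, carries over unchanged to models for which no closed form exists.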

Chapter 12 is about communication with clients. In the author’s experience, machine learning practitioners do not generally work for themselves, in the sense that they are not the ultimate beneficiary of a machine learning solution they create for some real-world application. Instead, a machine learning practitioner’s work is typically performed for a client, or customer, or mission stakeholder, or scientific colleague. It is the client’s subjective judgment which matters in determining what is a good solution to a problem, and what kind of trade-offs are acceptable in error, computation, and model interpretability. A machine learning practitioner—no matter how smart, how well trained, or how gifted in computational resources—is doomed to failure if he or she cannot elicit from a client what is really needed, or cannot communicate to a client what the machine learning practitioner needs and can provide.

Notes

1

The reader who has a thorough background in regression may choose to skip

Chapter 3

, or skim it for notation.

2

Recurrent neural networks are not covered in this book.

3

Despite its importance, optimization is a topic which is often either taken for granted or ignored.

Acknowledgments

This book was written to train mathematicians, computer scientists, data scientists, and others at the National Security Agency (NSA), where it has been used, in various editions, since 2005. This book is respectfully dedicated to the silent professionals—past, current, and future—who use this material in the intelligence services of the United States and its allies.

This book has been improved through many discussions with teachers, students, and colleagues. Most of these people cannot be listed here by name, so I will simply say I am grateful to them all.

Steven W. Knox

About the Companion Website

This book is accompanied by a companion website:

(URL: www.wiley.com/go/Knox/MachineLearning)

The website includes:

Solutions to some of the exercises in the book.

1 Introduction—Examples from Real Life

To call in a statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.

—R. A. Fisher, Presidential Address, 1938

The following examples will be used to illustrate the ideas of the next chapter.

Problem 1 (“Shuttle”).