Modern analysis of HEP data requires advanced statistical tools to separate signal from background. This is the first book in the field to focus on machine learning techniques. It will be of interest to almost every high energy physicist and, given its coverage, is also suitable for students.
Page count: 826
Year of publication: 2013
Contents
Acknowledgements
Notation and Vocabulary
1 Why We Wrote This Book and How You Should Read It
2 Parametric Likelihood Fits
2.1 Preliminaries
2.2 Parametric Likelihood Fits
2.3 Fits for Small Statistics
2.4 Results Near the Boundary of a Physical Region
2.5 Likelihood Ratio Test for Presence of Signal
2.6 sPlots
2.7 Exercises
References
3 Goodness of Fit
3.1 Binned Goodness of Fit Tests
3.2 Statistics Converging to Chi-Square
3.3 Univariate Unbinned Goodness of Fit Tests
3.4 Multivariate Tests
3.5 Exercises
References
4 Resampling Techniques
4.1 Permutation Sampling
4.2 Bootstrap
4.3 Jackknife
4.4 BCa Confidence Intervals
4.5 Cross-Validation
4.6 Resampling Weighted Observations
4.7 Exercises
References
5 Density Estimation
5.1 Empirical Density Estimate
5.2 Histograms
5.3 Kernel Estimation
5.4 Ideogram
5.5 Parametric vs. Nonparametric Density Estimation
5.6 Optimization
5.7 Estimating Errors
5.8 The Curse of Dimensionality
5.9 Adaptive Kernel Estimation
5.10 Naive Bayes Classification
5.11 Multivariate Kernel Estimation
5.12 Estimation Using Orthogonal Series
5.13 Using Monte Carlo Models
5.14 Unfolding
5.15 Exercises
References
6 Basic Concepts and Definitions of Machine Learning
6.1 Supervised, Unsupervised, and Semi-Supervised
6.2 Tall and Wide Data
6.3 Batch and Online Learning
6.4 Parallel Learning
6.5 Classification and Regression
References
7 Data Preprocessing
7.1 Categorical Variables
7.2 Missing Values
7.3 Outliers
7.4 Exercises
References
8 Linear Transformations and Dimensionality Reduction
8.1 Centering, Scaling, Reflection and Rotation
8.2 Rotation and Dimensionality Reduction
8.3 Principal Component Analysis (PCA)
8.4 Independent Component Analysis (ICA)
8.5 Exercises
References
9 Introduction to Classification
9.1 Loss Functions: Hard Labels and Soft Scores
9.2 Bias, Variance, and Noise
9.3 Training, Validating and Testing: The Optimal Splitting Rule
9.4 Resampling Techniques: Cross-Validation and Bootstrap
9.5 Data with Unbalanced Classes
9.6 Learning with Cost
9.7 Exercises
References
10 Assessing Classifier Performance
10.1 Classification Error and Other Measures of Predictive Power
10.2 Receiver Operating Characteristic (ROC) and Other Curves
10.3 Testing Equivalence of Two Classification Models
10.4 Comparing Several Classifiers
10.5 Exercises
References
11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression
11.1 Discriminant Analysis
11.2 Logistic Regression
11.3 Classification by Linear Regression
11.4 Partial Least Squares Regression
11.5 Example: Linear Models for MAGIC Telescope Data
11.6 Choosing a Linear Classifier for Your Analysis
11.7 Exercises
References
12 Neural Networks
12.1 Perceptrons
12.2 The Feed-Forward Neural Network
12.3 Backpropagation
12.4 Bayes Neural Networks
12.5 Genetic Algorithms
12.6 Exercises
References
13 Local Learning and Kernel Expansion
13.1 From Input Variables to the Feature Space
13.2 Regularization
13.3 Making and Choosing Kernels
13.4 Radial Basis Functions
13.5 Support Vector Machines (SVM)
13.6 Empirical Local Methods
13.7 Kernel Methods: The Good, the Bad and the Curse of Dimensionality
13.8 Exercises
References
14 Decision Trees
14.1 Growing Trees
14.2 Predicting by Decision Trees
14.3 Stopping Rules
14.4 Pruning Trees
14.5 Trees for Multiple Classes
14.6 Splits on Categorical Variables
14.7 Surrogate Splits
14.8 Missing Values
14.9 Variable Importance
14.10 Why Are Decision Trees Good (or Bad)?
14.11 Exercises
References
15 Ensemble Learning
15.1 Boosting
15.2 Diversifying the Weak Learner: Bagging, Random Subspace and Random Forest
15.3 Choosing an Ensemble for Your Analysis
15.4 Exercises
References
16 Reducing Multiclass to Binary
16.1 Encoding
16.2 Decoding
16.3 Summary: Choosing the Right Design
References
17 How to Choose the Right Classifier for Your Analysis and Apply It Correctly
17.1 Predictive Performance and Interpretability
17.2 Matching Classifiers and Variables
17.3 Using Classifier Predictions
17.4 Optimizing Accuracy
17.5 CPU and Memory Requirements
18 Methods for Variable Ranking and Selection
18.1 Definitions
18.2 Variable Ranking
18.3 Variable Selection
18.4 Exercises
References
19 Bump Hunting in Multivariate Data
19.1 Voronoi Tessellation and SLEUTH Algorithm
19.2 Identifying Box Regions by PRIM and Other Algorithms
19.3 Bump Hunting Through Supervised Learning
References
20 Software Packages for Machine Learning
20.1 Tools Developed in HEP
20.2 R
20.3 MATLAB
20.4 Tools for Java and Python
20.5 What Software Tool Is Right for You?
References
Appendix A: Optimization Algorithms
A.1 Line Search
A.2 Linear Programming (LP)
Index
Related Titles
Halzen, F., Martin, A.D.
Quarks and Leptons
An Introductory Course in Modern Particle Physics
1984
ISBN: 978-0-471-88741-6
Barlow, R.J.
Statistics
A Guide to the Use of Statistical Methods in the Physical Sciences
1989
ISBN: 978-0-471-92295-7, e-publication available
Talman, R.
Accelerator X-Ray Sources
2006
ISBN: 978-3-527-40590-9, e-publication available, ISBN: 978-3-527-61030-3
Griffiths, D.
Introduction to Elementary Particles
2008
ISBN: 978-3-527-40601-2
Wangler, T.P.
RF Linear Accelerators
2008
ISBN: 978-3-527-40680-7, e-publication available
Reiser, M.
Theory and Design of Charged Particle Beams
2008
ISBN: 978-3-527-40741-5, e-publication available
Russenschuck, S.
Field Computation for Accelerator Magnets
Analytical and Numerical Methods for Electromagnetic Design and Optimization
2010
ISBN: 978-3-527-40769-9, e-publication available
Padamsee, H., Knobloch, J., Hays, T.
RF Superconductivity for Accelerators
2008
ISBN: 978-3-527-40842-9
Brock, I., Schörner-Sadenius, T. (eds.)
Physics at the Terascale
2011
ISBN: 978-3-527-41001-9, e-publication available
Behnke, O., Kröninger, K., Schott, G., Schörner-Sadenius, T. (eds.)
Data Analysis in High Energy Physics
A Practical Guide to Statistical Methods
2013
ISBN: 978-3-527-41058-3, e-publication available
Authors
Dr. Ilya Narsky
MathWorks
Natick
United States of America
Prof. Frank C. Porter
California Inst. of Technology
Physics Department
Pasadena
United States of America
Cover
CMS Experiment at LHC, CERN
Data recorded Mon Nov 8, 11:30:43 2010 CEST
Run/Event 150431/630470
Lumi section 173
All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.
Library of Congress Card No.:
applied for
British Library Cataloguing-in-Publication Data:
A catalogue record for this book is available from the British Library.
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.
© 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany
All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.
Print ISBN 978-3-527-41086-6
ePDF ISBN 978-3-527-67731-3
ePub ISBN 978-3-527-67729-0
Mobi ISBN 978-3-527-67730-6
oBook ISBN 978-3-527-67732-0
Acknowledgements
We are grateful to our colleagues in high energy physics, astrophysics, and at MathWorks (I.N.) for many stimulating discussions and interesting problems. These interactions are ultimately the motivation for this book. While writing the book, we solicited and received comments from several colleagues who were willing to slog through unpolished material. We are certain that their suggestions have made this a better exposition. Specifically, we are indebted to physicists Roger Barlow, Tingting Cao, Bertrand Echenard, Louis Lyons, Piti Ongmongkolkul, and Steve Sekula; and statisticians Tom Lane, Crystal Linkletter, Gautam Pendse, and Ting Su.
We would like to thank Lars Sonnenschein for encouraging us to submit the book proposal to Wiley and coordinating the submission. We are grateful to Vera Palmer, Ulrike Werner, and Wiley for their technical guidance and support.
We thank Caltech High Energy Physics for hosting the web site for the book.
Notation and Vocabulary
We adopt several conventions in this book, which are gathered here for reference.
Random variables (RVs) are denoted with upper case Roman letters, for example, X. A value of such a random variable is denoted with the corresponding lower case letter, for example, x. Parameters needed to completely specify a distribution are usually denoted with Greek letters, for example, θ. Estimators for parameters consist of the same symbol, with an added “hat”, for example, θ̂. Estimators are also RVs, but not distinguished with upper case from their values. Where desired for clarity, an argument is added, which provides this information, that is, θ̂(X) is an RV and θ̂(x) is a value of this RV at x.
Distributions are usually specified with an upper case Roman letter, such as F, for a cdf (cumulative distribution function), and the corresponding lower case letter, f, for the corresponding pdf (probability density function). Sometimes we use P to specify a pdf, where P stands for “probability”. An estimator for a distribution is denoted with the hat notation, for example, F̂. The notation F_X(x; θ) refers to the cdf of random variable X with distribution F, evaluated at a sampled value x, where the distribution depends on the value of parameter θ. Similarly, f_X(x; θ) refers to the respective pdf. We will generally use various simplifications of this notation where the meaning should be clear, for example, F(x; θ).
A joint distribution of two or more RVs separates arguments by commas, for example, f_{X,Y}(X, Y). A conditional distribution uses the “|” notation. For example, the pdf of Y conditioned on X would be denoted by f_{Y|X}(Y|X), and its value at specific x and y would be denoted by f_{Y|X}(y|x). For brevity, we often drop the subscript Y|X.
Integrals are to be interpreted in the Lebesgue sense. In particular, dF(x) and f(x)dx are to be interpreted with the appropriate Dirac measure for discrete distributions. In other words, these are pmfs (probability mass functions) in the case of discrete distributions, with the integral sign becoming summation.
Vector RVs are set in bold Roman letters, for example, X, and vector parameters are set in bold Greek, θ. A vector of observed values drawn from a vector RV is set in bold lowercase: x. A vector consisting of i.i.d. (independent and identically distributed) observations of a scalar RV is set in bold lowercase with a distinguishing font: y. A matrix of observed values is set in upper case bold with the same distinguishing font: X. Note that in this case we abandon the case distinction between an RV and an observation, in preference to emphasizing the matrix aspect.
We use the “E” notation to denote an expectation value. Thus E(X) is the mean of the sampling distribution for RV X. Where necessary to make the sampling distribution clear, a subscript is added: E_F(X) is the mean of X in distribution F. A subscript can be added to indicate what RV is used for taking an expectation of a function of several RVs. For example, if G(X, Y) is a function of two RVs, E_X(G) is used to denote a function of Y averaged over X. The variance of an RV is denoted by Var(X). If the RV is multivariate, then Var(X) is the full covariance matrix.
Response of a specific classification or regression model at point x is usually denoted by f(x), and the respective RV is denoted by F(x). We thus overload the letters f and F using them both for probability distributions and model response. It should be easy to avoid confusion because we never mix the notation in the same chapter. If f and F are used to specify response, P is used to specify a distribution.
In particle physics, an “event” represents a set of coincident measurements. Adopting a vocabulary from the statistics literature, we use the word “observation” instead. An event or an observation is represented by a vector of recorded values, for example, x. Unlike events, observations do not need to be ordered in time.
The notation g : x → 𝒜 specifies a function g mapping argument x onto domain or set 𝒜.
The notation lim inf_N refers to the limit of the greatest lower bound as N → ∞, that is, lim inf_N f(N) ≡ lim_{N→∞} [inf_{M≥N} f(M)].
Various special symbols are summarized in Table 1.
Table 1 Special symbols used in the book.
Dimension: dim
Dirac delta function: δ(x)
If and only if: iff
Gradient, or nabla operator: ∇
k-dimensional real coordinate space: ℝ^k
Kronecker symbol: δ_ij
Laplace operator: ∇²
Normal (Gaussian) distribution: N
1 Why We Wrote This Book and How You Should Read It
The advent of high-speed computing has enabled a transformation in practical statistical methodology. We have entered the age of “machine learning”. Roughly, this means that we replace assumptions and approximations by computing power in order to derive statements about characteristics of a dataset. Statisticians and computer scientists have produced an enormous literature on this subject. The jargon includes bootstrap, cross-validation, learning curves, receiver operating characteristic, decision trees, neural nets, boosting, bagging, and so on. These algorithms are becoming increasingly popular in analysis of particle and astrophysics data. The idea of this book is to provide an introduction to these tools and methods in language and context appropriate to the physicist.
Machine learning may be divided into two broad types, “supervised learning” and “unsupervised learning”, with due caution toward oversimplifying. Supervised learning can be thought of as fitting a function in a multivariate domain over a set of measured values. Unsupervised learning is exploratory analysis, used when you want to discover interesting features of a dataset. This book is focused on the supervised learning side.
Supervised learning comes in two flavors: classification and regression. Classification aims at separating observations of different kinds, such as signal and background. The fitted function in this case takes categorical values. Fitting a function with continuous values is addressed by regression. Fitting a scalar function of a scalar argument by least squares is a well-known regression tool. In the case of a vector argument, classification appears to be used more often in modern physics analysis. This is why we focus on classification.
This book is not an introductory probability and statistics text. We assume our readers have been exposed to basic probability theory and to basic methods in parameter estimation such as maximum likelihood. We do not ignore these basic tools, but aim our discussion past the elementary development. Solid understanding of linear algebra, familiarity with multivariate calculus and some exposure to set theory are required as well.
Chapter 2 reviews techniques for parametric likelihood, a subject familiar to most physicists. We include discussion of practical issues such as fits for small statistics and fits near the boundary of a physical region, as well as advanced topics such as sPlots. Goodness of fit measures for univariate and multivariate data are reviewed in Chapter 3. These measures can be applied to distribution fitting by parametric likelihood described in Chapter 2 and nonparametric density estimation described in Chapter 5. Chapter 4 introduces resampling techniques such as the bootstrap, in the context of parameter estimation. Chapter 5 overviews techniques for nonparametric density estimation by histograms and kernel smoothing. The subjects reviewed in these chapters are part of the traditional statistician’s toolkit.
In Chapter 6 we turn attention to topics in machine learning. Chapter 6 introduces basic concepts and gives a cursory survey of the material that largely remains beyond the scope of this book. Before learning can begin, data need to be cleaned up and organized in a suitable way. These basic processing steps, in particular treatment of categorical variables, missing values, and outliers, are discussed in Chapter 7. Other important steps, optionally taken before supervised learning starts, are standardizing variable distributions and reducing the data dimensionality. Chapter 8 reviews simple techniques for univariate transformations and advanced techniques such as principal and independent component analysis.
In Chapter 9 we shift the focus to classification. Chapters 9 and 10 are essential for understanding the material in Chapters 11–18. Chapter 9 formalizes the problem of classification and lays out the common workflow for solving this problem. Resampling techniques are revisited here as well, with an emphasis on their application to supervised learning. Chapter 10 explains how the quality of a classification model can be judged and how two (or more) classification models can be compared by a formal statistical test. Some topics in these chapters can be skipped at first reading. In particular, analysis of data with class imbalance, although an important practical issue, is not required to enjoy the rest of the book.
Chapters 11–15 review specific classification techniques. Although many of them can be used for two-class (binary) and multiclass learning, binary classification has been studied more actively than multiclass learning. Chapter 16 describes a framework for reducing multiclass learning to a set of binary problems.
Chapter 17 provides a summary of the material learned in Chapters 11–16. Summaries oversimplify and should be interpreted with caution. Bearing this in mind, use this chapter as a practical guide for choosing a classifier appropriate for your analysis.
Chapter 18 reviews methods for selecting the most important variables from all inputs in the data. The importance of a variable is measured by its effect on the predictive power of a classification model.
Bump hunting in multivariate data may be an important component of searches for new physics processes at the Large Hadron Collider, as well as other experiments. We discuss appropriate techniques in Chapter 19. This discussion is focused on multivariate nonparametric searches, in a setting more complex than univariate likelihood fits.
Throughout the book, we illustrate the application of various algorithms to data, either simulated or borrowed from a real physics analysis, using examples of MATLAB code. These examples could be coded in another language. We have chosen MATLAB for two reasons. First, one of the authors, employed by MathWorks, has been involved in design and implementation of the MATLAB utilities supporting various algorithms described in this book. We thus have intimate knowledge of how these utilities work. Second, MATLAB is a good scripting language. If we used tools developed in the particle physics community, these code snippets would be considerably longer and less transparent.
Besides MATLAB, there are many software suites that provide the algorithms described here. In Chapter 20 we review several software toolkits, whether developed by physicists or not.
We hope that this book can serve both pedagogically and as a reference. If your goal is to learn the broad arsenal of statistical tools for physics analysis, read Chapters 2–9. If you are interested primarily in classification, read Chapters 9 and 10; then choose one or more chapters from Chapters 11–15 for in-depth reading. If you wish to learn a specific classification technique, read Chapters 9 and 10 before digging into the respective chapter or section.
Sections labeled with a special symbol present advanced material and can be skipped at first reading.
In various places throughout the book we use datasets, either simulated or measured. Some of these datasets can be downloaded from the Machine Learning Repository maintained by the University of California, Irvine: http://www.ics.uci.edu/~mlearn.
A fraction of this book is posted at the Caltech High Energy Physics site, http://www.hep.caltech.edu/~NarskyPorter. The posted material includes code for all MATLAB examples, without comments or graphs. Sections not included in the book are posted there as well.
2 Parametric Likelihood Fits
The likelihood function is a pervasive object in physics analyses. In this chapter we review several basic concepts involving the likelihood function, with some simple applications to confidence intervals and hypothesis tests in the context of parametric statistics. We end with the technique of sPlots, providing an optimized method for background subtraction in multivariate settings.
2.1 Preliminaries
Suppose we observe a sample x = (x_1, ..., x_N) drawn from a distribution with pdf f(x; θ), where θ denotes the parameter(s) of the distribution. Regarded as a function of θ at the observed x, this pdf defines the likelihood function:
L(θ; x) = f(x; θ)   (2.1)
The normalization of the likelihood function (of θ) is typically unimportant in practice, in contrast to the pdf f(X; θ), which must be normalized to one when integrated over the sample space. The likelihood function is a widely used tool in making inferences concerning θ. It is of fundamental importance in Bayesian statistics. In frequency statistics, the likelihood function has no profound interpretation,1) but nonetheless provides an important concept leading to algorithms with many useful properties.
There are some further notions that will be useful in discussing the use of the likelihood function and related topics. First, we shall typically suppose that the Xn are independent and identically distributed (i.i.d.) random variables. The dimension N is then called the sample size.
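As a minimal MATLAB sketch (with invented sample values, not one of the book's worked examples), the log-likelihood of an i.i.d. sample under a Gaussian model with θ = (μ, σ) can be evaluated and scanned directly:
% Log-likelihood of an i.i.d. sample under a Gaussian model; the data
% values below are hypothetical and serve only to illustrate Eq. (2.1).
x = [1.2 0.7 1.9 1.4 0.8];                   % observed sample x_1, ..., x_N
gpdf = @(x,mu,sig) exp(-(x-mu).^2./(2*sig.^2)) ./ (sqrt(2*pi)*sig);
logL = @(mu,sig) sum(log(gpdf(x,mu,sig)));   % log L(theta; x) for i.i.d. data
mu   = linspace(0,2,201);                    % scan mu at fixed sigma = 0.5
vals = arrayfun(@(m) logL(m,0.5), mu);
[~,i] = max(vals);
fprintf('log L is maximized near mu = %.2f\n', mu(i));
The maximum lies near the sample mean, as expected for a Gaussian model.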
The likelihood function may be used to formulate a statistical measure for information, relevant to parameters of interest:
Definition 2.1. If L(θ; X) is a likelihood function depending on parameter θ, the Fisher information number (or Fisher information, or simply information) corresponding to θ is:
I(θ) = E{[∂ log L(θ; X)/∂θ]²}   (2.2)
This generalizes to the R × R Fisher information matrix in the case of a multidimensional parameter space:
I_jk(θ) = E[(∂ log L(θ; X)/∂θ_j)(∂ log L(θ; X)/∂θ_k)],  j, k = 1, ..., R   (2.3)
We assume this matrix is positive definite for any θ in the parameter space.
An intuitive view is that if L varies rapidly with θ, the experimental sampling distribution will be very sensitive to θ. Hence, a measurement will contain a lot of “information” relevant to θ. It is handy to note that (exercise for the reader):
I(θ) = −E[∂² log L(θ; X)/∂θ²]   (2.4)
In a multidimensional parameter space, this becomes the negative expectation value of the Hessian for log L.
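As a simple illustration of these formulas, consider N i.i.d. observations from a Gaussian with unknown mean μ and known width σ:
log L(μ; x) = −Σ_{n=1}^{N} (x_n − μ)²/(2σ²) + const,   ∂² log L/∂μ² = −N/σ²,
so that I(μ) = −E[∂² log L/∂μ²] = N/σ². The same result follows directly from (2.2), since E{[Σ_n (X_n − μ)/σ²]²} = N Var(X)/σ⁴ = N/σ². The information grows linearly with the sample size, and a high-precision measurement (small σ) carries more information about μ.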
The quantity ∂ log L/∂θ is known as the score function:
S(θ; X) ≡ ∂ log L(θ; X)/∂θ   (2.5)
A property of the score function is that its expectation is zero:
E[S(θ; X)] = 0   (2.6)
Certain conditions must be satisfied for this to hold, notably that the sample space of X should be independent of θ.
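The short argument, sketched here for completeness, makes the role of this condition explicit:
E[S(θ; X)] = ∫ [∂ log f(x; θ)/∂θ] f(x; θ) dx = ∫ [∂f(x; θ)/∂θ] dx = (∂/∂θ) ∫ f(x; θ) dx = (∂/∂θ) 1 = 0,
where f is the joint pdf of the sample, so that L(θ; x) = f(x; θ), and the derivative may be pulled outside the integral only because the integration region (the sample space) does not depend on θ.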
When we estimate a population parameter given some sample X, we strive for an estimate which is as close as possible to the true parameter value. That is, we wish to find an estimate with a small error. There are many ways one could measure error, but perhaps the most common measure is the mean squared error, or MSE. Given an estimator θ̂ (a random variable) for population parameter θ, the MSE is defined by:
MSE(θ̂) = E[(θ̂ − θ)²]
        = Var(θ̂) + [E(θ̂) − θ]²   (2.7)
We recognize the second term in the second line as the square of the bias of the estimator. For an unbiased estimator, the MSE is simply the variance of the estimator. There is an obvious generalization of this definition to the multivariate case.
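As a minimal numerical check, with arbitrary settings chosen only for illustration, the MATLAB sketch below estimates the bias, variance, and MSE of the maximum-likelihood variance estimator of a Gaussian sample, which divides by N rather than N − 1 and is therefore biased:
% Monte Carlo check of MSE = Var + bias^2 for the ML variance estimator
% of a Gaussian sample; all settings are arbitrary illustration values.
rng(1);
N = 10; sigma2 = 4; B = 1e5;
est = zeros(B,1);
for b = 1:B
    x = sqrt(sigma2)*randn(N,1);
    est(b) = sum((x - mean(x)).^2)/N;    % ML estimator of sigma^2
end
bias = mean(est) - sigma2;               % analytically, bias = -sigma2/N
fprintf('bias^2 + Var = %.4f,  MSE = %.4f\n', ...
        bias^2 + var(est,1), mean((est - sigma2).^2));
With these settings the two printed numbers coincide, and the estimated bias approaches the analytic value −σ²/N.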
Using the information number, we may derive a lower bound on the variance of a parameter estimate for a given bias. Suppose that we have an estimator θ̂ for a parameter θ, with bias function b(θ) ≡ E(θ̂) − θ. We have the theorem:
Theorem 2.2. Rao–Cramér–Fréchet (RCF)
Assume that the sample space of X does not depend on θ, that b(θ) and L(θ; X) are differentiable with respect to θ, and that differentiation with respect to θ may be interchanged with integration over the sample space.
Then the variance Var(θ̂) of the estimator θ̂ obeys the inequality:
Var(θ̂) ≥ [1 + db(θ)/dθ]² / I(θ)   (2.8)
The proof is left as an exercise, but here is a sketch. First, show that
Cov[θ̂, S(θ; X)] = 1 + db(θ)/dθ   (2.9)
Next, find the linear correlation parameter:
ρ = Cov[θ̂, S(θ; X)] / √{Var(θ̂) Var[S(θ; X)]}   (2.10)
between the score function and θ̂. Finally, note that ρ² ≤ 1 and that Var[S(θ; X)] = I(θ) by (2.2) and (2.6). For an unbiased estimator, the bound reduces to:
Var(θ̂) ≥ 1/I(θ)   (2.11)
An efficient unbiased estimator is one that achieves this bound, which we shall refer to as the RCF bound. The RCF bound also provides a handy tool to estimate uncertainties in planning an experiment, under the assumption that one can get close to the bound with a good estimator (and enough statistics).
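For instance (a minimal sketch, not one of the book's examples), the sample mean of N Gaussian observations with known σ is unbiased with variance σ²/N, which equals the RCF bound 1/I(μ) = σ²/N; a quick Monte Carlo comparison in MATLAB:
% Compare the variance of the sample mean with the RCF bound sigma^2/N.
% All settings below are hypothetical and chosen only for illustration.
rng(2);
N = 25; mu = 3; sigma = 2; B = 2e4;          % B replicated experiments
muhat = mean(mu + sigma*randn(B,N), 2);      % sample mean of each experiment
fprintf('Var(muhat) = %.4f,  RCF bound sigma^2/N = %.4f\n', ...
        var(muhat), sigma^2/N);
The two numbers agree within the Monte Carlo uncertainty, confirming that the sample mean is an efficient estimator of μ in this model.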
Continue reading in the full edition.