Modern analysis of HEP data requires advanced statistical tools to separate signal from background. This is the first book in the field to focus on machine learning techniques. It will be of interest to almost every high energy physicist and, given its coverage, is also suitable for students.
Page count: 826
Year of publication: 2013
Contents
Acknowledgements
Notation and Vocabulary
1 Why We Wrote This Book and How You Should Read It
2 Parametric Likelihood Fits
2.1 Preliminaries
2.2 Parametric Likelihood Fits
2.3 Fits for Small Statistics
2.4 Results Near the Boundary of a Physical Region
2.5 Likelihood Ratio Test for Presence of Signal
2.6 sPlots
2.7 Exercises
References
3 Goodness of Fit
3.1 Binned Goodness of Fit Tests
3.2 Statistics Converging to Chi-Square
3.3 Univariate Unbinned Goodness of Fit Tests
3.4 Multivariate Tests
3.5 Exercises
References
4 Resampling Techniques
4.1 Permutation Sampling
4.2 Bootstrap
4.3 Jackknife
4.4 BCa Confidence Intervals
4.5 Cross-Validation
4.6 Resampling Weighted Observations
4.7 Exercises
References
5 Density Estimation
5.1 Empirical Density Estimate
5.2 Histograms
5.3 Kernel Estimation
5.4 Ideogram
5.5 Parametric vs. Nonparametric Density Estimation
5.6 Optimization
5.7 Estimating Errors
5.8 The Curse of Dimensionality
5.9 Adaptive Kernel Estimation
5.10 Naive Bayes Classification
5.11 Multivariate Kernel Estimation
5.12 Estimation Using Orthogonal Series
5.13 Using Monte Carlo Models
5.14 Unfolding
5.15 Exercises
References
6 Basic Concepts and Definitions of Machine Learning
6.1 Supervised, Unsupervised, and Semi-Supervised
6.2 Tall and Wide Data
6.3 Batch and Online Learning
6.4 Parallel Learning
6.5 Classification and Regression
References
7 Data Preprocessing
7.1 Categorical Variables
7.2 Missing Values
7.3 Outliers
7.4 Exercises
References
8 Linear Transformations and Dimensionality Reduction
8.1 Centering, Scaling, Reflection and Rotation
8.2 Rotation and Dimensionality Reduction
8.3 Principal Component Analysis (PCA)
8.4 Independent Component Analysis (ICA)
8.5 Exercises
References
9 Introduction to Classification
9.1 Loss Functions: Hard Labels and Soft Scores
9.2 Bias, Variance, and Noise
9.3 Training, Validating and Testing: The Optimal Splitting Rule
9.4 Resampling Techniques: Cross-Validation and Bootstrap
9.5 Data with Unbalanced Classes
9.6 Learning with Cost
9.7 Exercises
References
10 Assessing Classifier Performance
10.1 Classification Error and Other Measures of Predictive Power
10.2 Receiver Operating Characteristic (ROC) and Other Curves
10.3 Testing Equivalence of Two Classification Models
10.4 Comparing Several Classifiers
10.5 Exercises
References
11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression
11.1 Discriminant Analysis
11.2 Logistic Regression
11.3 Classification by Linear Regression
11.4 Partial Least Squares Regression
11.5 Example: Linear Models for MAGIC Telescope Data
11.6 Choosing a Linear Classifier for Your Analysis
11.7 Exercises
References
12 Neural Networks
12.1 Perceptrons
12.2 The Feed-Forward Neural Network
12.3 Backpropagation
12.4 Bayes Neural Networks
12.5 Genetic Algorithms
12.6 Exercises
References
13 Local Learning and Kernel Expansion
13.1 From Input Variables to the Feature Space
13.2 Regularization
13.3 Making and Choosing Kernels
13.4 Radial Basis Functions
13.5 Support Vector Machines (SVM)
13.6 Empirical Local Methods
13.7 Kernel Methods: The Good, the Bad and the Curse of Dimensionality
13.8 Exercises
References
14 Decision Trees
14.1 Growing Trees
14.2 Predicting by Decision Trees
14.3 Stopping Rules
14.4 Pruning Trees
14.5 Trees for Multiple Classes
14.6 Splits on Categorical Variables
14.7 Surrogate Splits
14.8 Missing Values
14.9 Variable Importance
14.10 Why Are Decision Trees Good (or Bad)?
14.11 Exercises
References
15 Ensemble Learning
15.1 Boosting
15.2 Diversifying the Weak Learner: Bagging, Random Subspace and Random Forest
15.3 Choosing an Ensemble for Your Analysis
15.4 Exercises
References
16 Reducing Multiclass to Binary
16.1 Encoding
16.2 Decoding
16.3 Summary: Choosing the Right Design
References
17 How to Choose the Right Classifier for Your Analysis and Apply It Correctly
17.1 Predictive Performance and Interpretability
17.2 Matching Classifiers and Variables
17.3 Using Classifier Predictions
17.4 Optimizing Accuracy
17.5 CPU and Memory Requirements
18 Methods for Variable Ranking and Selection
18.1 Definitions
18.2 Variable Ranking
18.3 Variable Selection
18.4 Exercises
References
19 Bump Hunting in Multivariate Data
19.1 Voronoi Tessellation and SLEUTH Algorithm
19.2 Identifying Box Regions by PRIM and Other Algorithms
19.3 Bump Hunting Through Supervised Learning
References
20 Software Packages for Machine Learning
20.1 Tools Developed in HEP
20.2 R
20.3 MATLAB
20.4 Tools for Java and Python
20.5 What Software Tool Is Right for You?
References
Appendix A: Optimization Algorithms
A.1 Line Search
A.2 Linear Programming (LP)
Index
Related Titles
Halzen, F., Martin, A.D.
Quarks and Leptons
An Introductory Course in Modern Particle Physics
1984
ISBN: 978-0-471-88741-6
Barlow, R.J.
Statistics
A Guide to the Use of Statistical Methods in the Physical Sciences
1989
ISBN: 978-0-471-92295-7, e-publication available
Talman, R.
Accelerator X-Ray Sources
2006
ISBN: 978-3-527-40590-9, e-publication available, ISBN: 978-3-527-61030-3
Griffiths, D.
Introduction to Elementary Particles
2008
ISBN: 978-3-527-40601-2
Wangler, T.P.
RF Linear Accelerators
2008
ISBN: 978-3-527-40680-7, e-publication available
Reiser, M.
Theory and Design of Charged Particle Beams
2008
ISBN: 978-3-527-40741-5, e-publication available
Russenschuck, S.
Field Computation for Accelerator Magnets
Analytical and Numerical Methods for Electromagnetic Design and Optimization
2010
ISBN: 978-3-527-40769-9, e-publication available
Padamsee, H., Knobloch, J., Hays, T.
RF Superconductivity for Accelerators
2008
ISBN: 978-3-527-40842-9
Brock, I., Schörner-Sadenius, T. (eds.)
Physics at the Terascale
2011
ISBN: 978-3-527-41001-9, e-publication available
Behnke, O., Kröninger, K., Schott, G., Schörner-Sadenius, T. (eds.)
Data Analysis in High Energy Physics
A Practical Guide to Statistical Methods
2013
ISBN: 978-3-527-41058-3, e-publication available
Authors
Dr. Ilya Narsky
MathWorks
Natick
United States of America
Prof. Frank C. Porter
California Inst. of Technology
Physics Department
Pasadena
United States of America
Cover
CMS Experiment at LHC, CERN
Data recorded Mon Nov 8, 11:30:43 2010 CEST
Run/Event 150431/630470
Lumi section 173
All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.
Library of Congress Card No.:
applied for
British Library Cataloguing-in-Publication Data:
A catalogue record for this book is available from the British Library.
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.
© 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany
All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.
Print ISBN 978-3-527-41086-6
ePDF ISBN 978-3-527-67731-3
ePub ISBN 978-3-527-67729-0
Mobi ISBN 978-3-527-67730-6
oBook ISBN 978-3-527-67732-0
Acknowledgements
We are grateful to our colleagues in high energy physics, astrophysics, and at MathWorks (I.N.) for many stimulating discussions and interesting problems. These interactions are ultimately the motivation for this book. While writing the book, we solicited and received comments from several colleagues who were willing to slog through unpolished material. We are certain that their suggestions have made this a better exposition. Specifically, we are indebted to physicists Roger Barlow, Tingting Cao, Bertrand Echenard, Louis Lyons, Piti Ongmongkolkul, and Steve Sekula; and statisticians Tom Lane, Crystal Linkletter, Gautam Pendse, and Ting Su.
We would like to thank Lars Sonnenschein for encouraging us to submit the book proposal to Wiley and coordinating the submission. We are grateful to Vera Palmer, Ulrike Werner, and Wiley for their technical guidance and support.
We thank Caltech High Energy Physics for hosting the web site for the book.
Notation and Vocabulary
We adopt several conventions in this book, which are gathered here for reference.
Random variables (RVs) are denoted with upper case Roman letters, for example, X. A value of such a random variable is denoted with the corresponding lower case letter, for example, x. Parameters needed to completely specify a distribution are usually denoted with Greek letters, for example, θ. Estimators for parameters consist of the same symbol, with an added “hat”, for example, θ̂. Estimators are also RVs, but not distinguished with upper case from their values. Where desired for clarity, an argument is added, which provides this information, that is, θ̂(X) is an RV and θ̂(x) is a value of this RV at x.
Distributions are usually specified with an upper case Roman letter, such as F, for a cdf (cumulative distribution function), and the corresponding lower case letter, f, for the corresponding pdf (probability density function). Sometimes we use P to specify a pdf, where P stands for “probability”. An estimator for a distribution is denoted with the hat notation, for example, F̂. The notation F_X(x; θ) refers to the cdf of random variable X with distribution F, evaluated at a sampled value x, where the distribution depends on the value of parameter θ. Similarly, f_X(x; θ) refers to the respective pdf. We will generally use various simplifications of this notation where the meaning should be clear, for example, F(x; θ).
A joint distribution of two or more RVs separates arguments by commas, for example, f_{X,Y}(X, Y). A conditional distribution uses the “|” notation. For example, the pdf of Y conditioned on X would be denoted by f_{Y|X}(Y|X), and its value at specific x and y would be denoted by f_{Y|X}(y|x). For brevity, we often drop the subscript Y|X.
Integrals are to be interpreted in the Lebesgue sense. In particular, dF(x) and f(x)dx are to be interpreted with the appropriate Dirac measure for discrete distributions. In other words, these are pmfs (probability mass functions) in the case of discrete distributions, with the integral sign becoming summation.
Vector RVs are set in bold Roman letters, for example, X, and vector parameters are set in bold Greek, θ. A vector of observed values drawn from a vector RV is set in bold lowercase: x. A vector consisting of i.i.d. (independent and identically distributed) observations of a scalar RV is set in bold lowercase with a distinguishing font: y. A matrix of observed values is set in upper case bold with the same distinguishing font: X. Note that in this case we abandon the case distinction between an RV and an observation, in preference to emphasizing the matrix aspect.
We use the “E” notation to denote an expectation value. Thus E(X) is the mean of the sampling distribution for RV X. Where necessary to make the sampling distribution clear, a subscript is added: E_F(X) is the mean of X in distribution F. A subscript can be added to indicate what RV is used for taking an expectation of a function of several RVs. For example, if G(X, Y) is a function of two RVs, E_X(G) is used to denote a function of Y averaged over X. The variance of an RV is denoted by Var(X). If the RV is multivariate, then Var(X) is the full covariance matrix.
Response of a specific classification or regression model at point x is usually denoted by f(x), and the respective RV is denoted by F(x). We thus overload the letters f and F using them both for probability distributions and model response. It should be easy to avoid confusion because we never mix the notation in the same chapter. If f and F are used to specify response, P is used to specify a distribution.
In particle physics, an “event” represents a set of coincident measurements. Adopting a vocabulary from the statistics literature, we use the word “observation” instead. An event or an observation is represented by a vector of recorded values, for example, x. Unlike events, observations do not need to be ordered in time.
The notation g : x → 𝒜 specifies a function g mapping argument x onto domain or set 𝒜.
The notation lim inf_N refers to the limit of the greatest lower bound as N → ∞, that is, lim inf_N f(N) ≡ lim_{N→∞} [inf_{M≥N} f(M)].
Various special symbols are summarized in Table 1.
Table 1 Special symbols used in the book.
Dimension: dim
Dirac delta function: δ(x)
If and only if: iff
Gradient, or nabla operator: ∇
k-dimensional real coordinate space: ℝ^k
Kronecker symbol: δ_ij
Laplace operator: ∇²
Normal (Gaussian) distribution: N
1 Why We Wrote This Book and How You Should Read It
The advent of high-speed computing has enabled a transformation in practical statistical methodology. We have entered the age of “machine learning”. Roughly, this means that we replace assumptions and approximations by computing power in order to derive statements about characteristics of a dataset. Statisticians and computer scientists have produced an enormous literature on this subject. The jargon includes bootstrap, cross-validation, learning curves, receiver operating characteristic, decision trees, neural nets, boosting, bagging, and so on. These algorithms are becoming increasingly popular in analysis of particle and astrophysics data. The idea of this book is to provide an introduction to these tools and methods in language and context appropriate to the physicist.
Machine learning may be divided into two broad types, “supervised learning” and “unsupervised learning”, with due caution toward oversimplifying. Supervised learning can be thought of as fitting a function in a multivariate domain over a set of measured values. Unsupervised learning is exploratory analysis, used when you want to discover interesting features of a dataset. This book is focused on the supervised learning side.
Supervised learning comes in two flavors: classification and regression. Classification aims at separating observations of different kinds, such as signal and background. The fitted function in this case takes categorical values. Fitting a function with continuous values is addressed by regression. Fitting a scalar function of a scalar argument by least squares is a well-known regression tool. In the case of a vector argument, classification appears to be used more often in modern physics analysis. This is why we focus on classification.
This book is not an introductory probability and statistics text. We assume our readers have been exposed to basic probability theory and to basic methods in parameter estimation such as maximum likelihood. We do not ignore these basic tools, but aim our discussion past the elementary development. Solid understanding of linear algebra, familiarity with multivariate calculus and some exposure to set theory are required as well.
Chapter 2 reviews techniques for parametric likelihood, a subject familiar to most physicists. We include discussion of practical issues such as fits for small statistics and fits near the boundary of a physical region, as well as advanced topics such as sPlots. Goodness of fit measures for univariate and multivariate data are reviewed in Chapter 3. These measures can be applied to distribution fitting by parametric likelihood described in Chapter 2 and nonparametric density estimation described in Chapter 5. Chapter 4 introduces resampling techniques such as the bootstrap, in the context of parameter estimation. Chapter 5 overviews techniques for nonparametric density estimation by histograms and kernel smoothing. The subjects reviewed in these chapters are part of the traditional statistician’s toolkit.
In Chapter 6 we turn attention to topics in machine learning. Chapter 6 introduces basic concepts and gives a cursory survey of the material that largely remains beyond the scope of this book. Before learning can begin, data need to be cleaned up and organized in a suitable way. These basic processing steps, in particular treatment of categorical variables, missing values, and outliers, are discussed in Chapter 7. Other important steps, optionally taken before supervised learning starts, are standardizing variable distributions and reducing the data dimensionality. Chapter 8 reviews simple techniques for univariate transformations and advanced techniques such as principal and independent component analysis.
In Chapter 9 we shift the focus to classification. Chapters 9 and 10 are essential for understanding the material in Chapters 11–18. Chapter 9 formalizes the problem of classification and lays out the common workflow for solving this problem. Resampling techniques are revisited here as well, with an emphasis on their application to supervised learning. Chapter 10 explains how the quality of a classification model can be judged and how two (or more) classification models can be compared by a formal statistical test. Some topics in these chapters can be skipped at first reading. In particular, analysis of data with class imbalance, although an important practical issue, is not required to enjoy the rest of the book.
Chapters 11–15 review specific classification techniques. Although many of them can be used for two-class (binary) and multiclass learning, binary classification has been studied more actively than multiclass learning. Chapter 16 describes a framework for reducing multiclass learning to a set of binary problems.
Chapter 17 provides a summary of the material learned in Chapters 11–16. Summaries oversimplify and should be interpreted with caution. Bearing this in mind, use this chapter as a practical guide for choosing a classifier appropriate for your analysis.
Chapter 18 reviews methods for selecting the most important variables from all inputs in the data. The importance of a variable is measured by its effect on the predictive power of a classification model.
Bump hunting in multivariate data may be an important component of searches for new physics processes at the Large Hadron Collider, as well as other experiments. We discuss appropriate techniques in Chapter 19. This discussion is focused on multivariate nonparametric searches, in a setting more complex than univariate likelihood fits.
Throughout the book, we illustrate the application of various algorithms to data, either simulated or borrowed from a real physics analysis, using examples of MATLAB code. These examples could be coded in another language. We have chosen MATLAB for two reasons. First, one of the authors, employed by MathWorks, has been involved in design and implementation of the MATLAB utilities supporting various algorithms described in this book. We thus have intimate knowledge of how these utilities work. Second, MATLAB is a good scripting language. If we used tools developed in the particle physics community, these code snippets would be considerably longer and less transparent.
Besides MATLAB, there are many software suites that provide the algorithms described here. In Chapter 20 we review several software toolkits, whether developed by physicists or not.
We hope that this book can serve both pedagogically and as a reference. If your goal is to learn the broad arsenal of statistical tools for physics analysis, read Chapters 2–9. If you are interested primarily in classification, read Chapters 9 and 10; then choose one or more chapters from Chapters 11–15 for in-depth reading. If you wish to learn a specific classification technique, read Chapters 9 and 10 before digging into the respective chapter or section.
Sections labeled with a special symbol present advanced material and can be skipped at first reading.
In various places throughout the book we use datasets, either simulated or measured. Some of these datasets can be downloaded from the Machine Learning Repository maintained by the University of California, Irvine: http://www.ics.uci.edu/~mlearn.
A fraction of this book is posted at the Caltech High Energy Physics site, http://www.hep.caltech.edu/~NarskyPorter. The posted material includes code for all MATLAB examples, without comments or graphs. Sections not included in the book are posted there as well.
2 Parametric Likelihood Fits
The likelihood function is a pervasive object in physics analyses. In this chapter we review several basic concepts involving the likelihood function, with some simple applications to confidence intervals and hypothesis tests in the context of parametric statistics. We end with the technique of sPlots, providing an optimized method for background subtraction in multivariate settings.
2.1 Preliminaries
Suppose we observe a sample x = (x_1, ..., x_N) drawn from a distribution with pdf f(x; θ), where θ denotes the parameter(s) of the distribution. Regarded as a function of θ at the observed x, this pdf defines the likelihood function:
L(θ; x) = f(x; θ)   (2.1)
The normalization of the likelihood function (of θ) is typically unimportant in practice, in contrast to the pdf f(X; θ), which must be normalized to one when integrated over the sample space. The likelihood function is a widely used tool in making inferences concerning θ. It is of fundamental importance in Bayesian statistics. In frequency statistics, the likelihood function has no profound interpretation,1) but nonetheless provides an important concept leading to algorithms with many useful properties.
There are some further notions that will be useful in discussing the use of the likelihood function and related topics. First, we shall typically suppose that the Xn are independent and identically distributed (i.i.d.) random variables. The dimension N is then called the sample size.
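As a minimal MATLAB sketch (with invented sample values, not one of the book's worked examples), the log-likelihood of an i.i.d. sample under a Gaussian model with θ = (μ, σ) can be evaluated and scanned directly:
% Log-likelihood of an i.i.d. sample under a Gaussian model; the data
% values below are hypothetical and serve only to illustrate Eq. (2.1).
x = [1.2 0.7 1.9 1.4 0.8];                   % observed sample x_1, ..., x_N
gpdf = @(x,mu,sig) exp(-(x-mu).^2./(2*sig.^2)) ./ (sqrt(2*pi)*sig);
logL = @(mu,sig) sum(log(gpdf(x,mu,sig)));   % log L(theta; x) for i.i.d. data
mu   = linspace(0,2,201);                    % scan mu at fixed sigma = 0.5
vals = arrayfun(@(m) logL(m,0.5), mu);
[~,i] = max(vals);
fprintf('log L is maximized near mu = %.2f\n', mu(i));
The maximum lies near the sample mean, as expected for a Gaussian model.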
The likelihood function may be used to formulate a statistical measure for information, relevant to parameters of interest:
Definition 2.1. If L(θ; X) is a likelihood function depending on parameter θ, the Fisher information number (or Fisher information, or simply information) corresponding to θ is:
I(θ) = E{[∂ log L(θ; X)/∂θ]²}   (2.2)
This generalizes to the R × R Fisher information matrix in the case of a multidimensional parameter space:
I_jk(θ) = E[(∂ log L(θ; X)/∂θ_j)(∂ log L(θ; X)/∂θ_k)],  j, k = 1, ..., R   (2.3)
We assume this matrix is positive definite for any θ in the parameter space.
An intuitive view is that if L varies rapidly with θ, the experimental sampling distribution will be very sensitive to θ. Hence, a measurement will contain a lot of “information” relevant to θ. It is handy to note that (exercise for the reader):
I(θ) = −E[∂² log L(θ; X)/∂θ²]   (2.4)
In a multidimensional parameter space, this becomes the negative expectation value of the Hessian for log L.
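As a simple illustration of these formulas, consider N i.i.d. observations from a Gaussian with unknown mean μ and known width σ:
log L(μ; x) = −Σ_{n=1}^{N} (x_n − μ)²/(2σ²) + const,   ∂² log L/∂μ² = −N/σ²,
so that I(μ) = −E[∂² log L/∂μ²] = N/σ². The same result follows directly from (2.2), since E{[Σ_n (X_n − μ)/σ²]²} = N Var(X)/σ⁴ = N/σ². The information grows linearly with the sample size, and a high-precision measurement (small σ) carries more information about μ.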
The quantity ∂ log L/∂θ is known as the score function:
S(θ; X) ≡ ∂ log L(θ; X)/∂θ   (2.5)
A property of the score function is that its expectation is zero:
E[S(θ; X)] = 0   (2.6)
Certain conditions must be satisfied for this to hold, notably that the sample space of X should be independent of θ.
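The short argument, sketched here for completeness, makes the role of this condition explicit:
E[S(θ; X)] = ∫ [∂ log f(x; θ)/∂θ] f(x; θ) dx = ∫ [∂f(x; θ)/∂θ] dx = (∂/∂θ) ∫ f(x; θ) dx = (∂/∂θ) 1 = 0,
where f is the joint pdf of the sample, so that L(θ; x) = f(x; θ), and the derivative may be pulled outside the integral only because the integration region (the sample space) does not depend on θ.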
When we estimate a population parameter given some sample X, we strive for an estimate which is as close as possible to the true parameter value. That is, we wish to find an estimate with a small error. There are many ways one could measure error, but perhaps the most common measure is the mean squared error, or MSE. Given an estimator θ̂ (a random variable) for population parameter θ, the MSE is defined by:
MSE(θ̂) = E[(θ̂ − θ)²]
        = Var(θ̂) + [E(θ̂) − θ]²   (2.7)
We recognize the second term in the second line as the square of the bias of the estimator. For an unbiased estimator, the MSE is simply the variance of the estimator. There is an obvious generalization of this definition to the multivariate case.
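As a minimal numerical check, with arbitrary settings chosen only for illustration, the MATLAB sketch below estimates the bias, variance, and MSE of the maximum-likelihood variance estimator of a Gaussian sample, which divides by N rather than N − 1 and is therefore biased:
% Monte Carlo check of MSE = Var + bias^2 for the ML variance estimator
% of a Gaussian sample; all settings are arbitrary illustration values.
rng(1);
N = 10; sigma2 = 4; B = 1e5;
est = zeros(B,1);
for b = 1:B
    x = sqrt(sigma2)*randn(N,1);
    est(b) = sum((x - mean(x)).^2)/N;    % ML estimator of sigma^2
end
bias = mean(est) - sigma2;               % analytically, bias = -sigma2/N
fprintf('bias^2 + Var = %.4f,  MSE = %.4f\n', ...
        bias^2 + var(est,1), mean((est - sigma2).^2));
With these settings the two printed numbers coincide, and the estimated bias approaches the analytic value −σ²/N.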
Using the information number, we may derive a lower bound on the variance of a parameter estimate for a given bias. Suppose that we have an estimator θ̂ for a parameter θ, with bias function b(θ) ≡ E(θ̂) − θ. We have the theorem:
Theorem 2.2. Rao–Cramér–Fréchet (RCF)
Assume that the sample space of X does not depend on θ, that b(θ) and L(θ; X) are differentiable with respect to θ, and that differentiation with respect to θ may be interchanged with integration over the sample space.
Then the variance Var(θ̂) of the estimator θ̂ obeys the inequality:
Var(θ̂) ≥ [1 + db(θ)/dθ]² / I(θ)   (2.8)
The proof is left as an exercise, but here is a sketch. First, show that
Cov[θ̂, S(θ; X)] = 1 + db(θ)/dθ   (2.9)
Next, find the linear correlation parameter:
ρ = Cov[θ̂, S(θ; X)] / √{Var(θ̂) Var[S(θ; X)]}   (2.10)
between the score function and θ̂. Finally, note that ρ² ≤ 1 and that Var[S(θ; X)] = I(θ) by (2.2) and (2.6). For an unbiased estimator, the bound reduces to:
Var(θ̂) ≥ 1/I(θ)   (2.11)
An efficient unbiased estimator is one that achieves this bound, which we shall refer to as the RCF bound. The RCF bound also provides a handy tool to estimate uncertainties in planning an experiment, under the assumption that one can get close to the bound with a good estimator (and enough statistics).
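For instance (a minimal sketch, not one of the book's examples), the sample mean of N Gaussian observations with known σ is unbiased with variance σ²/N, which equals the RCF bound 1/I(μ) = σ²/N; a quick Monte Carlo comparison in MATLAB:
% Compare the variance of the sample mean with the RCF bound sigma^2/N.
% All settings below are hypothetical and chosen only for illustration.
rng(2);
N = 25; mu = 3; sigma = 2; B = 2e4;          % B replicated experiments
muhat = mean(mu + sigma*randn(B,N), 2);      % sample mean of each experiment
fprintf('Var(muhat) = %.4f,  RCF bound sigma^2/N = %.4f\n', ...
        var(muhat), sigma^2/N);
The two numbers agree within the Monte Carlo uncertainty, confirming that the sample mean is an efficient estimator of μ in this model.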
Continue reading in the full edition.