Data Analysis in High Energy Physics
Description

This practical guide covers the essential tasks in statistical data analysis encountered in high energy physics and provides comprehensive advice for typical questions and problems. The basic methods for inferring results from data are presented as well as tools for advanced tasks such as improving the signal-to-background ratio, correcting detector effects, determining systematics and many others. Concrete applications are discussed in analysis walkthroughs. Each chapter is supplemented by numerous examples and exercises and by a list of literature and relevant links. The book targets a broad readership at all career levels - from students to senior researchers. An accompanying website provides more algorithms as well as up-to-date information and links.

* Free solutions manual available for lecturers at www.wiley-vch.de/supplements/

Page count: 809

Year of publication: 2013




Contents

Preface

List of Contributors

1 Fundamental Concepts

1.1 Introduction

1.2 Probability Density Functions

1.3 Theoretical Distributions

1.4 Probability

1.5 Inference and Measurement

1.6 Exercises

References

2 Parameter Estimation

2.1 Parameter Estimation in High Energy Physics: Introductory Words

2.2 Parameter Estimation: Definition and Properties

2.3 The Method of Maximum Likelihood

2.4 The Method of Least Squares

2.5 Maximum-Likelihood Fits: Unbinned, Binned, Standard and Extended Likelihood

2.6 Bayesian Parameter Estimation

2.7 Exercises

References

3 Hypothesis Testing

3.1 Basic Concepts

3.2 Choosing the Test Statistic

3.3 Choice of the Critical Region

3.4 Determining Test Statistic Distributions

3.5 p-Values

3.6 Inversion of Hypothesis Tests

3.7 Bayesian Approach to Hypothesis Testing

3.8 Goodness-of-Fit Tests

3.9 Conclusion

3.10 Exercises

References

4 Interval Estimation

4.1 Introduction

4.2 Characterisation of Interval Constructions

4.3 Frequentist Methods

4.4 Bayesian Methods

4.5 Graphical Comparison of Interval Constructions

4.6 The Role of Intervals in Search Procedures

4.7 Final Remarks and Recommendations

4.8 Exercises

References

5 Classification

5.1 Introduction to Multivariate Classification

5.2 Classification from a Statistical Perspective

5.3 Multivariate Classification Techniques

5.4 General Remarks

5.5 Dealing with Systematic Uncertainties

5.6 Exercises

References

6 Unfolding

6.1 Inverse Problems

6.2 Solution with Orthogonalisation

6.3 Regularisation Methods

6.4 The Discrete Cosine Transformation and Projection Methods

6.5 Iterative Unfolding

6.6 Unfolding Problems in Particle Physics

6.7 Programs Used for Unfolding in High Energy Physics

6.8 Exercise

References

7 Constrained Fits

7.1 Introduction

7.2 Solution by Elimination

7.3 The Method of Lagrange Multipliers

7.4 The Lagrange Multiplier Problem with Linear Constraints and Quadratic Objective Function

7.5 Iterative Solution of the Lagrange Multiplier Problem

7.6 Further Reading and Web Resources

7.7 Exercises

References

8 How to Deal with Systematic Uncertainties

8.1 Introduction

8.2 What Are Systematic Uncertainties?

8.3 Detection of Possible Systematic Uncertainties

8.4 Estimation of Systematic Uncertainties

8.5 How to Avoid Systematic Uncertainties

8.6 Conclusion

8.7 Exercise

References

9 Theory Uncertainties

9.1 Overview

9.2 Factorisation: A Cornerstone of Calculations in QCD

9.3 Power Corrections

9.4 The Final State

9.5 From Hadrons to Partons

9.6 Exercises

References

10 Statistical Methods Commonly Used in High Energy Physics

10.1 Introduction

10.2 Estimating Efficiencies

10.3 Estimating the Contributions of Processes to a Dataset: The Matrix Method

10.4 Estimating Parameters by Comparing Shapes of Distributions: The Template Method

10.5 Ensemble Tests

10.6 The Experimenter’s Role and Data Blinding

10.7 Exercises

References

11 Analysis Walk-Throughs

11.1 Introduction

11.2 Search for a Z′ Boson Decaying into Muons

11.3 Measurement

11.4 Exercises

References

12 Applications in Astronomy

12.1 Introduction

12.2 A Survey of Applications

12.3 Nested Sampling

12.4 Outlook and Conclusions

12.5 Exercises

References

The Authors

Index

Related Titles

Brock, I., Schörner-Sadenius, T. (eds.)

Physics at the Terascale

2011

ISBN: 978-3-527-41001-9

Russenschuck, S.

Field Computation for Accelerator Magnets

Analytical and Numerical Methods for Electromagnetic Design and Optimization

2010

ISBN: 978-3-527-40769-9

Halpern, P.

Collider

The Search for the World's Smallest Particles

2009

ISBN: 978-0-470-28620-3

Martin, B., Shaw, G.

Particle Physics

2008

ISBN: 978-0-470-03294-7

Griffiths, D.

Introduction to Elementary Particles

2008

ISBN: 978-3-527-40601-2

Reiser, M.

Theory and Design of Charged Particle Beams

2008

ISBN: 978-3-527-40741-5

Wangler, T.P.

RF Linear Accelerators

2008

ISBN: 978-3-527-40680-7

Padamsee, H., Knobloch, J., Hays, T.

RF Superconductivity for Accelerators

2008

ISBN: 978-3-527-40842-9

Talman, R.

Accelerator X-Ray Sources

2006

ISBN: 978-3-527-40590-9

The Editors

Dr. Olaf Behnke

DESY

Hamburg

Germany

[email protected]

Dr. Kevin Kröninger

Universität Göttingen

II. Physikalisches Institut

Göttingen, Germany

[email protected]

Dr. Gregory Schott

Karlsruher Institut für Technologie

Institut für Experimentelle Kernphysik

Karlsruhe, Germany

[email protected]

Dr. Thomas Schörner-Sadenius

DESY

Hamburg, Germany

[email protected]

The Cover Picture

The inset shows the negative logarithm of the likelihood function used to identify a resonance in the mass spectrum.

All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.

Library of Congress Card No.:

applied for

British Library Cataloguing-in-Publication Data:

A catalogue record for this book is available from the British Library.

Bibliographic information published by the Deutsche Nationalbibliothek

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.

© 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany

All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers.

Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.

Print ISBN 978-3-527-41058-3

ePDF ISBN 978-3-527-65344-7

ePub ISBN 978-3-527-65343-0

mobi ISBN 978-3-527-65342-3

oBook ISBN 978-3-527-65341-6

Cover Design Grafik-Design Schulz, Fußgönheim

Preface

Statistical inference plays a crucial role in the exact sciences. In fact, many results can only be obtained with the help of sophisticated statistical methods. In our field of experimental particle physics, statistical reasoning enters into basically every step of our data analysis work.

Recent years have seen the development of many new statistical techniques and of complex software packages implementing these. Consequently, the requirements on the statistics knowledge for scientists in high energy physics have increased dramatically, as have the needs for education and documentation in this field. This book aims at contributing to this purpose. It targets a broad readership at all career levels, from students to senior researchers, and is intended to provide comprehensive and practical advice for the various statistical analysis tasks typically encountered in high energy physics. To achieve this, the book is split into 12 chapters, all written by a different expert author or team of two authors and focusing on a well-defined topic:

Fundamental Concepts

introduces the basics of statistical data analyses, such as probability density functions and their properties, theoretical distributions (Gaussian, Poisson and many others) and concepts of probability (frequentist and Bayesian reasoning).

The next chapters elucidate the basic tools used to infer results from data:

Parameter Estimation

illustrates how to determine the best parameter values of a model from fitting data, for example how to estimate the strength of a signal.

Hypothesis Testing

lays out the framework that can be used to decide on hypotheses such as ‘the data can be explained by fluctuations of the known background sources alone’ or ‘the model describes the data reasonably well’.

Interval Estimation

discusses how to determine confidence or credibility intervals for parameter values, for example upper limits on the strength of a signal.

The following chapters deal with more advanced tasks encountered frequently:

Classification

presents various methods to optimally discriminate different event classes, for example signal from background, using multivariate data input. These methods can be very useful to enhance the sensitivity of a measurement, for example to find and measure a signal in the data that is otherwise drowned in background.

Unfolding

describes strategies and methods for correcting data for the usually inevitable effects of detector bias, acceptance, and resolution, which in particular can be applied in measurements of differential distributions.

Constrained Fits

discusses how to exploit physical constraints, such as energy–momentum conservation, to improve measurements or to determine unknown parameters.

The determination of systematic uncertainties is a key task for any measurement that is often performed as the very last step of a data analysis. We feel that it is worthwhile to discuss this – often neglected – topic in two chapters:

How to Deal with Systematic Uncertainties

elucidates how to detect and avoid sources of systematic uncertainties and how to estimate their impact.

Theory Uncertainties

illuminates various aspects of theoretical uncertainties, in particular for the strong interaction.

The following three chapters complete the book:

Statistical Methods Commonly Used in High Energy Physics

introduces various practical analysis tools and methods, such as the template and matrix methods for the estimation of sample compositions, or the determination of biases of analysis procedures by means of ensemble tests.

Analysis Walk-Throughs

provides a synopsis of the book by going through two complete analysis examples – a search for a new particle and a measurement of the properties of this hypothetical new particle.

Applications in Astronomy

takes us on a journey to the field of astronomy and illustrates, with several examples, the sophisticated data analysis techniques used in this research area.

In all chapters, care has been taken to be as practical and concrete as the material allows – for this purpose many specifically designed examples have been inserted into the text body of the chapters. A further deepening of the understanding of the book material can be achieved with the dedicated exercises at the end of all chapters. Hints and solutions to the exercises, together with some necessary software, are available from a webpage provided by the publisher. Here, we will also collect feedback, corrections and other information related to this volume; please check www.wiley.com for the details.

Many people have contributed to this book, and we would like to thank all of them. First of all, we thank the authors of the individual chapters for the high-quality material they provided.

Besides the authors, a number of people are needed to successfully conclude a book project like this one: numerous colleagues contributed by means of discussion, by providing expert advice and answers to our questions. We cannot name them all.

Katarina Brock spent many hours editing and polishing all the figures and providing a unified layout for them. Konrad Kieling from Wiley provided valuable support in typesetting the book. Vera Palmer and Ulrike Werner from Wiley provided constant support in all questions related to this book. We thank Tatsuya Nakada for his permission to use his exercise material.

Our last and very heartfelt thanks goes to our friends, partners and families who endured, over a considerable period, the very time- and also nerve-consuming genesis of this book. Without their support and tolerance this book would not exist today.

All comments, criticisms and questions you might have on the book are welcome – please send them to the authors via email:

[email protected],

[email protected],

[email protected],

[email protected].

Hamburg, Göttingen, Karlsruhe, November 2012

Olaf Behnke

Kevin Kröninger

Thomas Schörner-Sadenius and

Grégory Schott

List of Contributors

Roger Barlow

University of Huddersfield

Huddersfield

United Kingdom

Olaf Behnke

DESY

Hamburg

Germany

Volker Blobel

Universität Hamburg

Hamburg

Germany

Luc Demortier

The Rockefeller University

New York, New York

United States of America

Markus Diehl

DESY

Hamburg

Germany

Aart Heijboer

Nikhef

Amsterdam

Netherlands

Carsten Hensel

Universität Göttingen

II. Physikalisches Institut

Göttingen

Germany

Kevin Kröninger

Universität Göttingen

II. Physikalisches Institut

Göttingen

Germany

Benno List

DESY

Hamburg

Germany

Lorenzo Moneta

CERN

Geneva

Switzerland

Harrison B. Prosper

Florida State University

Tallahassee, Florida

United States of America

Grégory Schott

Karlsruher Institut für Technologie

Institut für Experimentelle Kernphysik

Karlsruhe

Germany

Helge Voss

Max-Planck-Institut für Kernphysik

Heidelberg

Germany

Ivo van Vulpen

Nikhef

Amsterdam

Netherlands

Rainer Wanke

Institut für Physik

Universität Mainz

Mainz

Germany

1

Fundamental Concepts

Roger Barlow

1.1 Introduction

Particle physics is all about random behaviour. When two particles collide, or even when a single particle decays, we can’t predict with certainty what will happen; we can only give probabilities of the various different outcomes. Although we measure the lifetimes of unstable particles and quote them to high precision – for the τ lepton, for example, it is 0.290±0.001 ps – we cannot say exactly when a particular τ will decay: it may well be shorter or longer. Although we know the probabilities (called, in this context, branching ratios) for the different decay channels, we can’t predict how any particular τ will decay – to an electron, or a muon, or various hadrons.

Then, when particles travel through a detector system they excite electrons in random ways, in the gas molecules of a drift chamber or the valence band of semiconducting silicon, and these electrons will be collected and amplified in further random processes. Photons and phototubes are random at the most basic quantum level. The experiments with which we study the properties of the basic particles are random through and through, and a thorough knowledge of that fundamental randomness is essential for machine builders, for analysts, and for the understanding of the results they give.

It was not always like this. Classical physics was deterministic and predictable. Laplace could suggest a hypothetical demon who, aware of all the coordinates and velocities of all the particles in the Universe, could then predict all future events. But in today’s physics the demon is handicapped not only by the uncertainties of quantum mechanics – the impossibility of knowing both coordinates and velocities – but also by the greater understanding we now have of chaotic systems. For predicting the flight of cannonballs or the trajectories of comets it was assumed, as a matter of common sense, that although our imperfect information about the initial conditions gave rise to increasing inaccuracy in the predicted motion, better information would give rise to more accurate predictions, and that this process could continue without limit, getting as close as one needed (and could afford) to perfect prediction. We now know that this is not true even for some quite simple systems, such as the compound pendulum.

That is only one of the two ways that probability comes into our experiments. When a muon passes through a detector it may, with some probability, produce a signal in a drift chamber: the corresponding calculation is a prediction. Conversely a drift chamber signal may, with some probability, have been produced by a muon, or by some other particle, or just by random noise. To interpret such a signal is a process called inference. Prediction works forwards in time and inference works backwards. We use the same mathematical tool – probability – to cover both processes, and this causes occasional confusion. But the statistical processes of inference are, though less visibly dramatic, of vital concern for the analysis of experiments. Which is what this book is about.

1.2 Probability Density Functions

The outcomes of random processes may be described by a variable (or variables) which can be discrete or continuous, and a discrete variable can be quantitative or qualitative. For example, when a τ lepton decays it can produce a muon, an electron, or hadrons: that’s a qualitative difference. It may produce one, three or five charged particles: that’s quantitative and discrete. The visible energy (i.e. not counting neutrinos) may be between 0 and 1777 MeV: that’s quantitative and continuous.

The probability prediction for a variable x is given by a function: we can call it f(x). If x is discrete then f(x) is itself a probability. If x is continuous then f(x) has the dimensions of the inverse of x: it is ∫ f(x)dx that is the dimensionless probability, and f(x) is called a probability density function or pdf.1) There are clearly an infinite number of different pdfs and it is often convenient to summarise the properties of a particular pdf in a few numbers.

1.2.1 Expectation Values

If the variable x is quantitative then for any function g(x) one can form the average

(1.1)

where the integral (for continuous x) or the sum (for discrete x) covers the whole range of possible values. This is called the expectation value. It is also sometimes written g, as in quantum mechanics. It gives the mean, or average, value of g, which is not necessarily the most likely one – particularly if x is discrete.
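The expectation value defined above is easy to evaluate numerically. The following sketch (an illustration, not book material; it assumes NumPy and SciPy are available) computes ⟨g⟩ for g(x) = x² under a unit Gaussian pdf:

```python
import numpy as np
from scipy.integrate import quad

def f(x):
    # pdf of the unit Gaussian, used here as an example distribution
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def g(x):
    # any quantitative function of x; here g(x) = x^2
    return x**2

# <g> = integral of g(x) f(x) dx over the whole range of x
expectation, _ = quad(lambda x: g(x) * f(x), -np.inf, np.inf)
```

For the unit Gaussian ⟨x²⟩ is the variance, so the result is 1.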

1.2.2 Moments

For any pdf f(x), the integer powers of x have expectation values. These are called the (algebraic) moments and are defined as

(1.2)

The first moment, α1, is called the mean or, more properly, arithmetic mean of the distribution; it is usually called µ and often written ⟨x⟩. It acts as a key measure of location, in cases where the variable x is distributed with some known shape about a particular point.

Conversely there are cases where the shape is what matters, and the absolute location of the distribution is of little interest. For these it is useful to use the central moments

(1.3)

1.2.2.1 Variance

The second central moment is also known as the variance, and its square root as the standard deviation:

(1.4)

The variance is a measure of the width of a distribution. It is often easier to deal with algebraically whereas the standard deviation σ has the same dimensions as the variable x; which to use is a matter of personal choice. Broadly speaking, statisticians tend to use the variance whereas physicists tend to use the standard deviation.
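For a finite sample the moments are estimated by averages. A quick numerical illustration (a sketch, not from the book) that the sample standard deviation recovers the scale of the generating distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=100_000)  # toy data with mu=5, sigma=2

mean = x.mean()                       # estimate of the first moment, alpha_1
variance = ((x - mean) ** 2).mean()   # second central moment
sigma = np.sqrt(variance)             # standard deviation, same units as x
```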

1.2.2.2 Skew and Kurtosis

The third and fourth central moments are used to build shape-describing quantities known as skew and kurtosis (or curtosis):

(1.5)

(1.6)

Division by the appropriate power of σ makes these quantities dimensionless and thus independent of the scale of the distribution, as well as of its location. Any symmetric distribution has zero skew: distributions with positive skew have a tail towards higher values, and conversely negative skew distributions have a tail towards lower values. The Poisson distribution has a positive skew, the energy recorded by a calorimeter has a negative skew. A Gaussian has a kurtosis of zero – by definition, that’s why there is a ‘3’ in the formula. Distributions with positive kurtosis (which are called leptokurtic) have a wider tail than the equivalent Gaussian; more centralised, or platykurtic, distributions have negative kurtosis. The Breit–Wigner distribution is leptokurtic, as is Student’s t. The uniform distribution is platykurtic.
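These shape measures are available in SciPy; note that scipy.stats.kurtosis already subtracts the 3 (the Fisher convention), matching the definition above. A small check, assuming SciPy:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
gaussian = rng.normal(size=200_000)
uniform = rng.uniform(-1.0, 1.0, size=200_000)

g_skew = skew(gaussian)      # ~0: symmetric distribution
g_kurt = kurtosis(gaussian)  # ~0: the Gaussian is the reference point
u_kurt = kurtosis(uniform)   # ~-1.2: platykurtic, no tails at all
```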

1.2.2.3 Covariance and Correlation

Suppose you have a pdf f(x, y) which is a function of two random variables, x and y. You can not only form moments for both x and y, but also for combinations, particularly the covariance

(1.7)

A dimensionless version of the covariance is the correlation ρ:

(1.8)

The magnitude of the correlation lies between 0 (uncorrelated) and 1 (completely correlated). The sign can be positive or negative: amongst a sample of students there will probably be a positive correlation between height and weight, and a negative correlation between academic performance and alcohol consumption.

If there are several (i.e. more than two) variables, x1, x2,…, xN, one can form the covariance and correlation matrices:

(1.9)

(1.10)

and the diagonal element Vii is just the variance σi².
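NumPy provides both matrices directly. A toy check (not from the book), with two variables constructed so that ρ = 0.8:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50_000)
y = 0.8 * x + 0.6 * rng.normal(size=50_000)  # built so that rho = 0.8, var(y) = 1

V = np.cov(x, y)         # covariance matrix; V[0, 0] and V[1, 1] are the variances
rho = np.corrcoef(x, y)  # correlation matrix; diagonal elements are exactly 1
```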

1.2.2.4 Marginalisation and Projection

Mathematically, any pdf f(x, y) is a function of two variables x and y. They can be similar in nature, for example the energies of the two electrons produced by a converting high energy photon, or they can be different, for example the position and direction of particles undergoing scattering in material.

Projections can be useful for illustration; otherwise, to be meaningful, you have to have a good reason for choosing that specific value of y. Marginalisation requires that the distribution in y, like that of x, is properly normalised.

1.2.2.5 Other Properties

There are many other properties that can be quoted, depending on the point we want to bring out, and on the established usage of the field.

The mean is not always the most helpful measure of location. The mode is the value of x at which the pdf f(x) is maximum, and if you want a typical value to quote it serves well. The median is the midway point, in the sense that half the data lie above and half below. It is useful in describing very skewed distributions (particularly financial income) in which fluctuations in a small tail would give a big change in the mean.

We can also specify dispersion in ways that are particularly useful for non-Gaussian distributions by using quantiles: the upper and lower quartiles give the values above which, and below which, 25% of the data lie. Deciles and percentiles are also used.

1.2.3 Associated Functions

The cumulative distribution function

(1.11)

The characteristic function

(1.12)

which is just (up to factors of 2π) the Fourier transform of the pdf, is also met with sometimes as it has useful properties.

1.3 Theoretical Distributions

A pdf is a mathematical function. It involves a variable (or variables) describing the random quantity concerned. This may be a discrete integer or a continuous real number. It also involves one or more parameters. In what follows we will denote a random variable by x for a real number and r for an integer. Parameters generally have their traditional symbols for particular pdfs: where we refer to a generic parameter we will call it θ. It is often helpful to write a function as f(x; θ) or f(x|θ), separating this way more clearly the random variable(s) from the adjustable parameter(s). The semicolon is preferred by some; the vertical bar has the advantage that it matches the notation used for conditional probabilities, described in Section 1.4.4.1.

There are many pdfs in use to model the results of random processes. Some are based on physical motivations, some on mathematics, and some are just empirical forms that happen to work well in particular cases.

The overwhelmingly most useful form is the Gaussian or normal distribution. The Poisson distribution is also encountered very often, and the binomial distribution is not uncommon. So we describe these in some detail, and then some other distributions rather more briefly.

1.3.1 The Gaussian Distribution

The Gaussian, or normal, distribution for a continuous random variable x is given by

(1.13)

It has two parameters; the function is manifestly symmetrical about the location parameter µ, which is the mean (and mode, and median) of the distribution. The scale parameter σ is also the standard deviation of the distribution. So there is, in a sense, only one Gaussian, the unit Gaussian or standard normal distribution f(x; 0, 1) shown in Figure 1.1. Any other Gaussian can be obtained from this by scaling by a factor σ and translating by an amount µ. The Gaussian distribution is sometimes denoted N(µ, σ²).

The Gaussian is ubiquitous (hence the name ‘normal’) because of the central limit theorem, which states that if any distribution is convoluted with itself a large number of times, the resulting distribution tends to a Gaussian form. For a proof, see for example Appendix 2 in [1].
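The central limit theorem is easy to demonstrate numerically. In this sketch (an illustration, not a proof) twelve uniform variates are summed; each has variance 1/12, so the sum, shifted by its mean of 6, is close to a unit Gaussian:

```python
import numpy as np

rng = np.random.default_rng(7)
# 100 000 experiments, each summing 12 uniform variates on [0, 1)
sums = rng.uniform(size=(100_000, 12)).sum(axis=1) - 6.0

clt_mean = sums.mean()                     # ~0
clt_std = sums.std()                       # ~1
frac_1sigma = np.mean(np.abs(sums) < 1.0)  # ~0.683, as for a unit Gaussian
```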

Figure 1.1 The unit Gaussian or standard normal distribution.

The product of two independent Gaussians gives a two-dimensional function

(1.14)

but the most general quadratic form in the exponent must include the cross term and can be written as

(1.15)

where the parameter ρ is the correlation between x and y. For N variables, for which we will use the vector x, the full form of the multivariate Gaussian can be compactly written using matrix notation:

(1.16)

Here, V is the covariance matrix described in Section 1.2.2.3.

The error function and the complementary error function are closely related to the cumulative Gaussian

(1.17)

(1.18)

Table 1.1 Two-sided Gaussian p-values for 1σ to 5σ deviations.

Deviation    p-value (%)
1σ           31.7
2σ           4.56
3σ           0.270
4σ           0.00633
5σ           0.0000573
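The entries of Table 1.1 follow from the complementary error function: the two-sided p-value for an nσ deviation is erfc(n/√2). A one-line check in Python (not part of the book):

```python
import math

# two-sided Gaussian p-values for 1-sigma to 5-sigma deviations
pvals = [math.erfc(n / math.sqrt(2.0)) for n in range(1, 6)]
for n, p in zip(range(1, 6), pvals):
    print(f"{n} sigma: {100 * p:.3g} %")
```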

1.3.2 The Poisson Distribution

The Poisson distribution

(1.19)

describes the probability of n events occurring when the mean expected number is ν; n is discrete and ν is continuous. Typical examples are the number of clicks produced by a Geiger counter in an interval of time, or, famously, the number of Prussian cavalrymen killed by horse-kicks [4]. Some examples are shown in Figure 1.2.

The Poisson distribution has a mean of ν and a standard deviation √ν. This property – that the standard deviation is the square root of the mean – is a key fact about distributions generated by a Poisson process, which is important as this includes most cases where a number of samples is taken, including the contents of the bin of a histogram.

Example 1.1 Counting cosmic muons
In an experiment built to measure cosmic muons, the expected number of muons in one run of the experiment is 0.45 events. This means that you have a 64% probability of observing no muons, a 29% probability of a single muon, a 6% chance of seeing two and less than 1% of seeing three.
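The numbers in Example 1.1 can be verified directly from the Poisson formula; a quick check (not part of the book):

```python
import math

nu = 0.45  # expected number of muons in one run
# P(n) = exp(-nu) * nu^n / n!
probs = [math.exp(-nu) * nu**n / math.factorial(n) for n in range(4)]
for n, p in zip(range(4), probs):
    print(f"P({n} muons) = {p:.3f}")
```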

1.3.3 The Binomial Distribution

The binomial distribution describes a generalisation of the simple problem of the numbers of heads and tails that can arise from spinning a coin several times. The probability for getting r ‘successes’ from N ‘trials’ given an intrinsic probability of success p is

(1.20)

Sometimes one writes q instead of 1 – p, which makes the algebra prettier. The distribution has a mean of Np and a standard deviation √(Npq). The factor N!/[r!(N – r)!] is the number of ways that r objects may be chosen from N, and is often written as the binomial coefficient (N over r).

If p is small then the distribution can be approximated by a Poisson distribution3) of mean Np. This is often used implicitly when analysing Monte Carlo samples: if you generate 1 000 000 Monte Carlo events, of which 100 end up in some particular histogram bin, then strictly speaking this is described by a binomial process rather than a Poisson. In practice you can take the error as the Poisson √100 = 10 rather than the binomial √(Np(1 – p)), which is numerically almost identical. This doesn’t work if p is large. If 9 out of 10 events are accepted by the trigger, the error on the trigger efficiency of 90% is not √9/10 = 30% but √(Np(1 – p))/N = √(10 × 0.9 × 0.1)/10 ≈ 9.5% (in such a case the shortcut is to take the one lost event as approximately Poisson, giving the error as 10%, which is close).
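The two regimes can be made concrete with numbers (a sketch, not from the book): for the Monte Carlo bin p is tiny and the Poisson error is excellent, while for the trigger efficiency p = 0.9 and the Poisson shortcut fails badly.

```python
import math

# Monte Carlo bin: 100 entries out of 1 000 000 generated events
N, r = 1_000_000, 100
p = r / N
binomial_err = math.sqrt(N * p * (1 - p))  # ~9.9995
poisson_err = math.sqrt(r)                 # 10, almost identical

# Trigger: 9 accepted out of 10 -> p = 0.9, Poisson would badly overestimate
N_t, r_t = 10, 9
p_t = r_t / N_t
eff_err = math.sqrt(N_t * p_t * (1 - p_t)) / N_t  # ~0.095, i.e. 9.5%
```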

If N is large and p is not small then the distribution is approximately a Gaussian.

If there are not just two possible outcomes but n, with probabilities {p1, p2, …, pn}, then the total probability of getting r1 of the first outcome, r2 of the second, and so on, is

(1.21)

This is the multinomial distribution.

1.3.4 Other Distributions

There are many, many other possible distribution functions, and it is worth listing some of those more often met with.

1.3.4.1 The Uniform Distribution

The uniform distribution, also known as the rectangular or top-hat distribution, is constant inside some range – call this range –a/2 to a/2, so the width is a; if the range is not central about zero but about some other value this is easily handled by a translation. The mean, clearly, is zero, and the standard deviation is a/√12. This can be used in position measurements by a hodoscope: if a rectangular slab of scintillator gives a signal, you know that a track went through it but you do not know where. It is reasonable to assume a uniform distribution for the pdf of the hit position.

This can be relevant in considering some systematic uncertainties on the total result, as is also discussed in Section 8.4.1.2. For example, if you set up an experiment to run overnight, counting events with some efficiency E1, and when you arrive in the morning you find a component has tripped so the efficiency is E2, with no information about when this happened, your efficiency has to be quoted as (E1 + E2)/2 ± |E1 – E2|/√12. It can also be applied to theoretical models: when two models give different predictions you are justified in using their mean as your prediction, with a (systematic) error which is the difference divided by √12, if (and only if) these two models represent absolute extremes and you really have no feeling as to where between the two extremes the truth may lie.
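The √12 can be checked by simulation. In this sketch the efficiencies E1 and E2 are made-up numbers; the changeover time is assumed uniform over the night, so the effective efficiency is uniform between E2 and E1:

```python
import numpy as np

E1, E2 = 0.90, 0.70  # hypothetical efficiencies before and after the trip

best = (E1 + E2) / 2                 # quoted central value
syst = abs(E1 - E2) / np.sqrt(12.0)  # uniform-distribution systematic error

# cross-check: the std of a uniform of width a really is a / sqrt(12)
samples = np.random.default_rng(3).uniform(E2, E1, size=1_000_000)
```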

1.3.4.2 The Cauchy, or Breit–Wigner, or Lorentzian Distribution

In nuclear and particle physics the function

(1.22)

gives the variation with the energy E of a cross section produced by the formation of a state with mass M and width Γ. It can be written more neatly in dimensionless form as

(1.23)

This distribution is used in fitting resonance peaks (provided the width is much larger than the measurement error on E). It also has an empirical use in fitting a set of data which is almost Gaussian but has wider tails. This often arises in cases where a fraction of the data is not so well measured as the rest. A double Gaussian may give a good fit, but it often turns out that this form does an adequate job without the need to invoke extra parameters.

1.3.4.3 The Landau Distribution

When a charged particle passes an atom, its electrons experience a changing electromagnetic field and acquire energy. The amount of energy may be large; on rare occasions it will be large enough to create a delta ray. The probability distribution for the energy loss was computed by Landau [5] and is given by

(1.24)

Figure 1.3 The Landau distribution.

The Landau distribution has very unpleasant mathematical properties. Some of its integrals diverge, for example it has no variance (like the Cauchy distribution), and, worse than that, it does not even have a mean. The ensuing complications can be avoided on a case-by-case basis by imposing an upper limit on the energy loss, as a particle cannot lose more than 100% of its energy.
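As a toy illustration of how a cut-off tames such divergences, one can use the Cauchy distribution (which, as noted above, also lacks a mean) rather than the Landau itself; the sample size and the cut value below are arbitrary choices of this sketch:

```python
import math
import random

# Toy model, NOT the Landau itself: sample a standard Cauchy, whose mean is
# undefined, then impose a cut-off analogous to "a particle cannot lose more
# than 100% of its energy". The truncated sample has a well-behaved mean.
random.seed(1)

def cauchy_sample():
    # Inverse-CDF sampling of a standard Cauchy
    return math.tan(math.pi * (random.random() - 0.5))

raw = [cauchy_sample() for _ in range(100_000)]
cut = 50.0
truncated = [x for x in raw if abs(x) < cut]

trunc_mean = sum(truncated) / len(truncated)
print(f"truncated mean: {trunc_mean:+.3f} (stable, near the peak position)")
print(f"fraction removed by the cut: {1 - len(truncated) / len(raw):.4f}")
```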

There is a function which is described in some places as ‘the Landau distribution’. It is not. It is an approximation to the Landau distribution [6], and not a very good one at that.

1.3.4.4 The Negative Binomial Distribution

(1.25) P(r) = C(k + r − 1, r) p^r (1 − p)^k

It is the probability, for success probability p, of r successes and k − 1 failures in any permutation, followed by a final kth failure. The combinatorial factor can also be written (−1)^r C(−k, r), hence the name ‘negative binomial’. This can readily be extended to non-integer values of k by writing it as

(1.26) P(r) = [Γ(k + r)/(Γ(k) r!)] p^r (1 − p)^k

although it is not clear what physical meaning this may have. Γ is the Gamma function, defined as

(1.27) Γ(z) = ∫₀^∞ t^(z−1) e^(−t) dt
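A quick numerical check of these formulae (k = 3 and p = 0.4 are arbitrary illustrative values): the probabilities sum to one, and the Gamma-function form reproduces the binomial-coefficient form at integer k:

```python
import math

def negbin_pmf(r, k, p):
    # P(r) = C(k+r-1, r) p^r (1-p)^k : r successes before the k-th failure
    return math.comb(k + r - 1, r) * p**r * (1 - p)**k

def negbin_pmf_gamma(r, k, p):
    # The same probability written with Gamma functions, valid for non-integer k
    return (math.gamma(k + r) / (math.gamma(k) * math.factorial(r))
            * p**r * (1 - p)**k)

k, p = 3, 0.4
total = sum(negbin_pmf(r, k, p) for r in range(200))
print(f"sum over r: {total:.6f}")   # normalises to 1
print(f"binomial form vs Gamma form at r=5: "
      f"{negbin_pmf(5, k, p):.6f} vs {negbin_pmf_gamma(5, k, p):.6f}")
```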

1.3.4.5 Student’s t Distribution

If the standard deviation σ is also unknown, then you can use instead the estimate σ̂ = √[(1/n) Σᵢ (xᵢ − µ)²] if µ is known, or s = √[(1/(n − 1)) Σᵢ (xᵢ − x̄)²] if it is not. Now, for small n especially, this is not a very good estimator, and because you are dividing the differences from the mean by this bad estimate, the distribution for

(1.28) t = (x̄ − µ)/(s/√n)

is not given by a Gaussian, but by Student’s t distribution for n – 1 degrees of freedom, where Student’s t distribution is given by

(1.29) f(t; n) = [Γ((n + 1)/2)/(√(nπ) Γ(n/2))] (1 + t²/n)^(−(n+1)/2)

This tends to a unit Gaussian as n becomes large, but for small n it has tails which are significantly wider (see Figure 1.4): large t values can result if is an underestimate of the true value. The mean is clearly zero; the variance is not one, as it would be for a unit Gaussian, but n/(n – 2).

Example 1.3 Light yields in scintillators
You have five samples of scintillator from a manufacturer with light yields measured (in some units) as 1.23, 1.42, 1.35, 1.29 and 1.40. A second, cheaper, manufacturer provides a sample whose yield is 1.19. Does this give reason to believe that the cheaper sample has an inferior light yield?
The sample mean is 1.338 and the estimated standard deviation is 0.079, so the cheaper sample is 1.90 standard deviations below the mean. If this were a Gaussian distribution then the probability of a value lying this far below the mean is only 2.9% – so you would take this as strong evidence that the cheaper process was not so good. But for Student’s t with four degrees of freedom the probability is (consulting the tables or evaluating a function) only 6.5%, so your evidence would be weaker (the calculations were done using the R function pt(x,ndf)).
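The numbers in this example can be reproduced with a short script. The Python standard library has no Student's t CDF, so the p-value is obtained here by direct numerical integration of the t pdf of Eq. (1.29):

```python
import math

# Reproduce Example 1.3: five light yields, one cheaper sample
yields = [1.23, 1.42, 1.35, 1.29, 1.40]
cheap = 1.19

n = len(yields)
mean = sum(yields) / n
s = math.sqrt(sum((y - mean) ** 2 for y in yields) / (n - 1))
t = (cheap - mean) / s
print(f"mean = {mean:.3f}, s = {s:.3f}, t = {t:.2f}")

def t_pdf(x, ndf):
    # Student's t probability density for ndf degrees of freedom
    c = math.gamma((ndf + 1) / 2) / (math.sqrt(ndf * math.pi) * math.gamma(ndf / 2))
    return c * (1 + x * x / ndf) ** (-(ndf + 1) / 2)

def t_cdf_lower(x, ndf, lo=-60.0, steps=50_000):
    # Trapezoidal integration; the power-law tail makes -60 an ample lower limit
    h = (x - lo) / steps
    area = 0.5 * (t_pdf(lo, ndf) + t_pdf(x, ndf))
    area += sum(t_pdf(lo + i * h, ndf) for i in range(1, steps))
    return area * h

p_t = t_cdf_lower(t, n - 1)
p_gauss = 0.5 * (1 + math.erf(t / math.sqrt(2)))
print(f"one-sided p-value, Student's t (4 dof): {p_t:.3f}")
print(f"one-sided p-value, Gaussian:            {p_gauss:.3f}")
```

In R the p-value would simply be `pt(t, 4)`, as the example notes.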

1.3.4.6 The χ2 Distribution

In describing the agreement between a predictive function g(x) and a set of n measurements {(xi, yi)}, it is useful to form the total squared deviation

(1.30) χ² = Σᵢ (yᵢ − g(xᵢ))²/σᵢ²

where σi is the Gaussian error on measurement i: if these errors are the same for all measurements then the factor can, of course, be taken outside the summation.

If the measurements are unbiased with Gaussian errors, this quantity follows the χ² distribution for n degrees of freedom:

(1.31) f(χ²; n) = (χ²)^(n/2 − 1) e^(−χ²/2) / [2^(n/2) Γ(n/2)]

Some examples for different n are shown in Figure 1.5.

The χ² distribution is used a great deal in considering the question of whether a particular set of measurements (with their errors) and a particular model are compatible. This is addressed through the cumulative χ² distribution. For a given value of χ², the complement of the cumulative distribution gives the p-value, the probability that, given that the model is indeed correct, a measurement would give a result with a χ² this large, or larger. If the value of χ² obtained is large compared to n then the p-value is small, that is the probability that a set of measurements truly described by this model would give such a large disagreement is small, and doubt is cast on the model, or the data (or both). The mean of f(χ², n) is just n, and the standard deviation is √(2n). For large n the distribution converges to the Gaussian, as it must by the central limit theorem. However, the convergence is actually rather slow, and this approximation is not often used. Instead the p-value should be obtained accurately from functions such as TMath::Prob in ROOT or pchisq in R.

If the model has free parameters θ which are not given, but were found by fitting the data, then the same χ2 test can be used, but for n one takes the number of data points minus the number of fitted parameters. This is called the number of degrees of freedom. Strictly speaking this is only true if the model is a linear one (i.e. linear in the parameters). This is often the case, either exactly or to a good approximation, but there are some instances where this condition does not hold, leading to the computation of deceptively small and inaccurate p-values.
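For even numbers of degrees of freedom the p-value has a simple closed form (an upper-tail sum of Poisson terms), which makes a compact illustration; χ² = 20 with n = 10 is an arbitrary example, and for general n one would use the library functions mentioned above:

```python
import math

def chi2_pvalue_even(chi2, ndf):
    # Upper-tail probability of the chi-squared distribution. This closed
    # form holds only for even ndf: p = exp(-x) * sum_{i<ndf/2} x^i / i!
    # with x = chi2/2. (For general ndf use pchisq in R or TMath::Prob.)
    assert ndf % 2 == 0
    x = chi2 / 2.0
    term, total = 1.0, 1.0
    for i in range(1, ndf // 2):
        term *= x / i
        total += term
    return math.exp(-x) * total

# A fit with chi2 = 20 for 10 degrees of freedom:
print(f"chi2=20, ndf=10: p = {chi2_pvalue_even(20.0, 10):.4f}")

# Sanity checks: chi2 = ndf is unremarkable, chi2 >> ndf is damning
print(f"chi2=10, ndf=10: p = {chi2_pvalue_even(10.0, 10):.3f}")
print(f"chi2=40, ndf=10: p = {chi2_pvalue_even(40.0, 10):.2e}")
```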

You will occasionally obtain χ² values that seem very small: χ² ≪ n. There is no standard procedure for rejecting these, but you should treat them with some suspicion and consider whether the model may have been formulated after the data had been measured (‘retrospective prediction’), or whether perhaps the errors have been over-generously estimated.

1.3.4.7 The Log-Normal Distribution

If the logarithm of the variable is given by a Gaussian distribution f(ln x; µ, σ) then the distribution for x itself is the log-normal distribution

(1.32) f(x; µ, σ) = [1/(xσ√(2π))] exp(−(ln x − µ)²/(2σ²))

Just as the central limit theorem dictates that any variable which is the sum of a large number of random components is described by a Gaussian distribution, any variable which is the product of a large number of random factors, none of which dominates the behaviour, is described by the log-normal. For instance, the signal registered by an electron in a calorimeter may be described by a log-normal distribution, as a certain fraction of the energy may be lost to dead material, a fraction to lost photons, a fraction to neutron production, and so on. The mean is given by e^(µ + σ²/2), and the standard deviation is e^(µ + σ²/2) √(e^(σ²) − 1).
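A toy Monte Carlo (all numbers illustrative, not a real calorimeter model) showing a product of many random factors coming out log-normal, with the arithmetic mean matching e^(µ + σ²/2):

```python
import math
import random

# Toy model: a signal of 1000 units passes 50 stages, each keeping a random
# 95-105% of it. The product of many such factors is log-normal.
random.seed(7)

def product_signal():
    x = 1000.0
    for _ in range(50):
        x *= random.uniform(0.95, 1.05)
    return x

signals = [product_signal() for _ in range(20_000)]
logs = [math.log(v) for v in signals]
mu = sum(logs) / len(logs)
sig = math.sqrt(sum((l - mu) ** 2 for l in logs) / len(logs))

# Compare the sample mean with the log-normal prediction exp(mu + sig^2 / 2)
sample_mean = sum(signals) / len(signals)
predicted = math.exp(mu + sig * sig / 2)
print(f"sample mean {sample_mean:.1f} vs log-normal prediction {predicted:.1f}")
```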

1.3.4.8 The Weibull Distribution

The Weibull distribution is:

(1.33) f(x; λ, k) = (k/λ)(x/λ)^(k−1) e^(−(x/λ)^k), x ≥ 0

This gives a shape which (for shape parameter k > 1) rises from zero to a peak and then falls back to zero again. It was originally invented to describe the failure rates of ageing light bulbs: there are few failures at small times (because the bulbs are new and fresh) and few at long times (because most have already failed). It is a rather more realistic modelling of real-life ‘lifetime’ than the simple exponential decay law, for which the failure probability is constant.
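This lifetime behaviour is usually expressed through the failure rate, or ‘hazard’, f(x)/[1 − F(x)] = (k/λ)(x/λ)^(k−1). A sketch with an arbitrary characteristic lifetime λ: k = 1 recovers the constant rate of the exponential law, while k > 1 describes ageing (a rate that grows with time):

```python
import math

def weibull_pdf(x, lam, k):
    # f(x) = (k/lam) (x/lam)^(k-1) exp(-(x/lam)^k), x >= 0
    return (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))

def weibull_survival(x, lam, k):
    # 1 - F(x): probability of surviving beyond x
    return math.exp(-((x / lam) ** k))

def hazard(x, lam, k):
    # Instantaneous failure rate: pdf / survival = (k/lam) (x/lam)^(k-1)
    return weibull_pdf(x, lam, k) / weibull_survival(x, lam, k)

lam = 1000.0   # hypothetical characteristic lifetime, e.g. in hours
for t in (100.0, 500.0, 2000.0):
    print(f"t = {t:6.0f}   k=1: {hazard(t, lam, 1):.2e}   "
          f"k=3: {hazard(t, lam, 3):.2e}")
```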

1.4 Probability

We use probability every day, in both our work as physicists and our everyday lives. Sometimes this is a matter of precise calculation, when we buy an insurance policy or decide whether to publish a result, sometimes it is more intuitive, as when we decide to take an umbrella to work in the morning.

But although we are familiar with the concept of probability, on closer inspection it turns out that there are subtleties. When we get into technicalities there turn out to be different definitions of the concept which are not always compatible.

1.4.1 Mathematical Definition of Probability

Let A be an event. Then the probability P(A) is a number obeying three conditions, the Kolmogorov axioms [7]:

1. P(A) ≥ 0;

2. P(Ω) = 1, where Ω is the set of all possible outcomes;

3. P(A ∪ B) = P(A) + P(B) if A and B are exclusive.

From these axioms a whole system of theorems and properties can be derived. However, the theory contains no statement as to what the numbers actually mean. For mathematicians this is, of course, not a problem, but it does not help us to apply the results.

1.4.2 Classical Definition of Probability

The probability of a coin landing heads or tails is clearly 1/2. Symmetry dictates that it cannot be anything else. Likewise the chance of drawing a particular card from a pack has to be 1/52. The original development of probability by Laplace, Pascal and their contemporaries, to aid the gambling fraternity, was founded on this equally likely construction. ‘Probability’ could be defined by taking a fundamental symmetry in which all cases were equally likely (say, the six sides of a die), and extended to more complex cases (say, rolling two dice) by counting combinations.

Unfortunately this definition does not generalise to cases of continuous variables, where there is no fundamental symmetry: if you ‘draw a line at random’ from a given point, this could be done by taking the coordinates of the endpoint from a uniform distribution, or by taking an angle uniformly between 0 and 360°; the results are incompatibly different. This approach thus leads to a dead end.
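The incompatibility is easy to demonstrate numerically. Here is a toy version restricted to the first quadrant (the 30° threshold is an arbitrary choice of this sketch): a uniform angle gives P(angle < 30°) = 1/3, while a uniform endpoint gives tan(30°)/2 ≈ 0.289:

```python
import math
import random

# Two "random line" recipes that disagree
random.seed(5)
N = 200_000

# Recipe 1: draw the angle uniformly between 0 and 90 degrees
below_1 = sum(random.uniform(0, 90) < 30 for _ in range(N)) / N

# Recipe 2: draw the endpoint uniformly in the unit square
below_2 = sum(
    math.degrees(math.atan2(random.random(), random.random())) < 30
    for _ in range(N)
) / N

print(f"uniform angle:    P(angle < 30) = {below_1:.3f}")
print(f"uniform endpoint: P(angle < 30) = {below_2:.3f}")
```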

1.4.3 Frequentist Definition of Probability

Problems with the classical definition led to the alternative definition of probability as the limit of frequency by Venn, von Mises [8] and others. If a selection is made N times under identical circumstances, then the fraction of cases resulting in a particular outcome A tends to a limit, and this limit is what is meant by the probability:

(1.34) P(A) = lim (N→∞) N(A)/N

This is the generally adopted definition, taught in most elementary courses and textbooks. It satisfies, of course, the Kolmogorov axioms.

Where the classical definition is valid it leads to the same results. But there is an important philosophical difference. The probability P(A) is not some intrinsic property of A, it also depends on the way the sampling is done: on how the collective or ensemble of total possible outcomes has been constructed.

Thus, to use von Mises’ example: the life insurance companies determine that the probability of one of their (male) clients dying between the ages of 40 and 41 is 1.1%. This is a hard and verifiable number, essential for the correct adjustment of the premium paid. However, it is not an intrinsic probability of the person concerned: you cannot say that a particular client has this number attached to them as a property in the same way that their height and weight are. The client belongs not just to this ensemble (insured 40-yr-old males) but to many others: 40-yr-old males, non-smoking 40-yr-old males, non-smoking professional lion tamers – and for each of these ensembles there will be a different number.

So there are cases with several possible ensembles, and the value of P(A) is ambiguous until the ensemble is specified. There are also cases where there is no ensemble, as the event is unique. The Big Bang is an obvious example, but others can be found much nearer home. For example, what is the probability P(rain) that it will rain tomorrow? Now, there is only one tomorrow, and it will either rain or it will not, so P(rain) is either 0 or 1. Von Mises condemns any further discussion as ‘unscientific’ use of language. This is further discussed (and resolved) in Section 1.5.2.

1.4.4 Bayesian Definition of Probability

Another way of extending the unsatisfactory classical definition of probability was made by de Finetti [9] and others. De Finetti’s starting point is the provocative ‘Probability does not exist.’ It has no objective status: it is something the human mind has constructed.

He shows that one can consistently define a personal probability (or degree-of-belief) P(A) in A by establishing the odds of a bet whereby you lose €1 if A subsequently turns out to be false, and you receive €G if it turns out to be true. If P(A) > 1/(1 + G) you will accept the bet; if P(A) < 1/(1 + G) you will decline it.

Such personal probability is indeed something we use every day: when you decide whether or not to take an umbrella to work in the morning your decision is based on your personal probability of there being rain (and also the ‘costs’ involved in (a) getting wet and (b) having something extra to carry). However, there is no need for my personal probability to be the same as yours, or anyone else’s. It is thus often referred to as a subjective probability. Subjective probability is also generally known as Bayesian probability, because of the great use it makes of Bayes’ theorem [10]. This is a simple and fundamental result which is actually valid for any of the probability definitions being used.

1.4.4.1 Bayes’ Theorem

Suppose A and B are two events, and introduce the conditional probability P(A | B), the probability of event A given that B is true (for instance: the probability that a card is the six of spades, given that it is black, P(six of spades|black) is 1/26).

The probability of both A and B occurring, P(AB) is clearly P(A | B)P(B). But it is also P(B | A) P(A). Equating these two quantities gives

(1.35) P(A | B) = P(B | A) P(A)/P(B)

This is used in problems like the famous ‘taxi colour’ example.

Example 1.5 Taxi colour
In some city, 15% of taxi cabs are yellow, and 85% are green. A taxi is involved in a hit-and-run accident, and an eyewitness says it was a yellow cab. The police have established that such eyewitness statements get the colour correct in 80% of cases and wrong in 20%. What is the probability that the cab was yellow?
The arithmetic is simple: just plug the numbers into Bayes’ theorem. Note that the P(B) term in the denominator can be helpfully written as P(B | A)P(A) + P(B | Ā)P(Ā), where Ā denotes ‘not A’. If the cab’s true colour is denoted by Y or G, and the colour the witness says they saw by y or g, then

P(Y | y) = P(y | Y)P(Y) / [P(y | Y)P(Y) + P(y | G)P(G)] = (0.8 × 0.15)/(0.8 × 0.15 + 0.2 × 0.85) ≈ 0.41
So the cab was more likely (60% probability) to have been green – despite the witness saying exactly the opposite.
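The same arithmetic written out as a script, with the numbers from the example:

```python
# Bayes' theorem applied to the taxi-colour example
p_yellow = 0.15          # prior P(Y): fraction of yellow cabs
p_green = 0.85           # prior P(G)
p_say_y_given_y = 0.80   # witness gets the colour right
p_say_y_given_g = 0.20   # witness gets it wrong

# P(y) = P(y|Y) P(Y) + P(y|G) P(G)
p_say_y = p_say_y_given_y * p_yellow + p_say_y_given_g * p_green

# P(Y|y) = P(y|Y) P(Y) / P(y)
posterior_yellow = p_say_y_given_y * p_yellow / p_say_y
print(f"P(cab was yellow | witness said yellow) = {posterior_yellow:.2f}")
print(f"P(cab was green  | witness said yellow) = {1 - posterior_yellow:.2f}")
```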

The ‘Bayesian’ use of Bayes’ theorem uses the same algebra but applies it to cases where B represents some experimental result and A some theory. P(B |A) is the probability of the result occurring if the theory is true, and P(A) is the personal probability you ascribe to the theory being true before the experiment is done – the prior probability. P(A | B) is the probability you ascribe to the theory in the light of the experiment – the posterior probability (the prior P(A) and the posterior P(A | B) are meaningless in the frequentist definition).

1.5 Inference and Measurement

Standard probability calculations are all about getting from the theory to the data. They address questions like: under such-and-such conditions, what is the probability that a specified random event will happen?

Inference is the reverse process: getting from the data to the theory. There is a theoretical model, containing some parameter (or parameters) θ, which predicts the probability of getting a certain result (or set of results) x. What does the observation of a particular value of x tell you about θ ? A simple example would be a particle of true energy Etrue ≡ θ giving a measured energy of Emeas ≡ x in the calorimeter. A less simple example would be the existence of a Higgs particle with mass mH giving a set of events with particular characteristics in different channels.

1.5.1 Likelihood

When you make a measurement, then f(x; θ), the probability of obtaining a result x given the value of a model parameter θ, can also be written as the likelihood L(x; θ). This change is purely cosmetic: the actual algebra of the function is the same. Taking the Poisson as an example, and contemplating the observation of five events and a prediction of 3.4, one can write f(5; 3.4) ≡ L(5; 3.4) ≡ (3.4⁵/5!) e^(−3.4).
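The same number evaluated explicitly:

```python
import math

# Poisson probability/likelihood: five events observed, 3.4 predicted
mu, n = 3.4, 5
L = mu**n / math.factorial(n) * math.exp(-mu)
print(f"L(5; 3.4) = {L:.4f}")
```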

Given a result x, the value of L(x; θ) tells you the probability that θ would lead to x, which in turn tells you something about the plausibility that a particular value of θ is the true one. The latter statement is purposefully made vague: it will be considered in proper detail later.

The likelihood principle states that if you have a result x then the likelihood function L(x; θ) contains all the information relevant to your measurement of θ. This principle is regarded by some as an irrefutable axiom, and by others as an irrelevance. Bayesian inference generally satisfies this, whereas frequentist inference generally violates it as the frequentist also has to consider the ensemble of experimental results that might have been obtained.

1.5.2 Frequentist Inference

As von Mises points out, the probability of rain tomorrow is either 0 or 1, and no more can be said. However, you can construct an ensemble for something that looks very similar. Suppose that the pressure is falling and the clouds are gathering. A local weather forecast (perhaps made by a professional meteorologist, perhaps by the ache in your grandmother’s left elbow) predicts rain. If you consider the track record of this particular prediction and count the number of times it has proved correct, that gives a probability which is valid in the frequentist sense. So although you cannot say ‘It will probably rain tomorrow’, you can say ‘The statement “It will rain tomorrow.” is probably true.’

Indeed, if your weather prophet has been correct nine times out of ten, you can say ‘The statement “It will rain tomorrow.” has a 90% probability of being true.’ Again notice that the number is a property not just of the event (rain) but of the ensemble, in the form of the weather forecaster.

Now apply this approach to the interpretation of a measurement. Suppose your measurement process is known to give a result x which differs from the true value µ with a probability distribution which is Gaussian with some known σ. You quote the result, whether it is the mass of the top quark determined from years of collider data, or a measurement of a resistance on a lab bench, as

(1.36)

This seems to say that µ lies in the range [x – σ, x + σ] with 68% probability. But it can’t. The top mass, mt, for which we currently quote 173.2 ± 0.9 GeV, either lies in the range [172.3, 174.1] GeV or it does not. It is our measurement which is random, not the true value. So, as a frequentist, you make a statement about statements. ‘The statement “172.3 < mt [GeV] < 174.1” has a 68% chance of being true.’ Or, to put it another way, you make the statement ‘172.3 < mt [GeV] < 174.1’ with 68% confidence. There is a trade-off between the accuracy of the statement and the confidence you have in it. You could have played safer, and said with 95% confidence ‘171.4 < mt [GeV] < 175.0’. In other cases one-sided (upper or lower) limits may be appropriate.
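This ‘statement about statements’ can be checked with a toy Monte Carlo: simulate many experiments with a known, fixed true value (the numbers below are illustrative, borrowing the top-mass figures) and count how often the quoted ±1σ interval covers it:

```python
import random

# Frequentist coverage check: the interval [x - sigma, x + sigma] should
# cover the fixed true value in ~68.3% of repeated Gaussian measurements.
random.seed(3)
mu_true, sigma = 173.2, 0.9   # illustrative "true" value and resolution

n_exp = 50_000
covered = 0
for _ in range(n_exp):
    x = random.gauss(mu_true, sigma)        # one simulated measurement
    if x - sigma < mu_true < x + sigma:     # does the quoted interval cover?
        covered += 1

print(f"coverage: {covered / n_exp:.3f}")
```

It is the interval that varies from experiment to experiment; the true value never moves.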

1.5.3 Bayesian Inference

The Bayesian has no need of such mental gymnastics: the posterior pdf for the parameter directly encodes the degree of belief that it lies in any given range.

In particular, the Bayesian interpretation of a Gaussian measurement, assuming a flat prior, equates the likelihood with the posterior probability: P(µ | x) ∝ L(x; µ). This interpretation of the likelihood L(x; µ) as a pdf in the parameter µ looks especially plausible in the case of a Gaussian measurement; one has to remember that it is only valid for Bayesians and not for frequentists. Actually, the depiction of Bayesians and frequentists as different and rival schools of thought is not really correct. Yes, some statisticians can be fairly described as one or the other, but most of us adopt the approach most appropriate for a particular problem. But care must be taken not to use concepts that are inapplicable in the framework chosen.

Information from further measurements can be neatly incorporated into this framework. The posterior distribution from the first experiment is taken as the prior for the second, and its posterior forms the prior for the third (the order in which the combination is done is irrelevant).
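For Gaussian measurements with a flat starting prior this updating has a simple closed form: precisions (inverse variances) add, and the order of combination is indeed irrelevant. A sketch with made-up measurements of the same quantity:

```python
# Sequential Bayesian updating of Gaussian measurements (flat starting prior):
# each posterior is Gaussian, and precisions 1/sigma^2 simply add.
def combine(m1, s1, m2, s2):
    # Gaussian prior N(m1, s1) updated by a measurement N(m2, s2)
    w1, w2 = 1 / s1**2, 1 / s2**2
    return (w1 * m1 + w2 * m2) / (w1 + w2), (w1 + w2) ** -0.5

# Three hypothetical measurements (value, error)
measurements = [(10.2, 0.5), (9.8, 0.3), (10.5, 0.8)]

m, s = measurements[0]
for m2, s2 in measurements[1:]:
    m, s = combine(m, s, m2, s2)
m_fwd, s_fwd = m, s

m, s = measurements[-1]
for m2, s2 in reversed(measurements[:-1]):
    m, s = combine(m, s, m2, s2)
m_rev, s_rev = m, s

print(f"forward order: {m_fwd:.4f} +/- {s_fwd:.4f}")
print(f"reverse order: {m_rev:.4f} +/- {s_rev:.4f}")
```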

1.5.3.1 Use of Different Priors

Moreover, uniformity does not survive reparameterisation. If an angle has a uniform pdf in θ, then the distribution in cos θ is very non-uniform, and in sin θ and tan θ it is different again. You cannot claim to have an objective analysis through having a uniform prior, as the choice of which variable to make uniform will affect the result.

So a sound measurement does not depend (much) on the choice of prior. This is called a robust measurement. In presenting a Bayesian result you may justify it by any of the following:

showing that the result is robust: that the arbitrary choice of prior makes no great difference;

justifying the prior in some way as being correct – or, perhaps, showing that the uniform prior is uniform in the correct variable;

saying that you have chosen this prior and it represents your personal belief.

But just saying ‘We took a uniform prior.’ is not doing a proper job.

1.5.3.2 Jeffreys Priors

One attempt to systematise the choice of priors was made by Jeffreys [11]. His argument is based on the idea that an impartial prior should be ‘uninformative’ – it should not prefer any particular value or values.

Still speaking loosely: if the log-likelihood ln L(x; θ) has a nice sharp peak, then the data are telling you something about θ, and if it is just a broad spread then it’s not being much help. The ‘peakiness’ of a distribution can be expressed using the second differential (with a helpful minus sign). On, or near, a sharp peak, – (∂2 ln L)/(∂θ2) will be large and positive.

Now we take a step back and forget any measurements made, and ask: given some value of θ, what would we expect, on average, from a measurement? This quantity is called the Fisher information:

(1.37) I(θ) = −E[∂² ln L(x; θ)/∂θ²]

where the expectation value is evaluated by multiplying by f(x; θ) and integrating over all possible results x for this particular value of θ. A large value of I(θ) means that if you make a measurement it will (probably) provide useful knowledge about the true value of θ, and a small value of I(θ) tells you that the measurement will not tell you much and is hardly worth doing. It can easily be shown (see e.g. Eq. (5.8) in [1]) that

(1.38) I(θ) = E[(∂ ln L(x; θ)/∂θ)²]

Jeffreys answers the question ‘Should we use θ or ln θ or as our fundamental variable?’ by saying that we should choose a form such that no particular value will yield more informative results than any other. He prescribes a parameterisation θ′(θ) for which – (∂2 ln L)/(∂θ′2) is constant. This is a variable in which all values are (from the Fisher information viewpoint) equal, and if we make the prior in this variable flat we are clearly being fair and even-handed.

In practice one does not have to find θ′ explicitly. If π(θ′) is the prior for θ′ and is constant, and as I(θ′) is constant by construction, then

(1.39) π(θ) = π(θ′) |dθ′/dθ| ∝ √I(θ)

Jeffreys’ method rests on the idea that the prior should not prejudge the result: that it should be as ‘uninformative’ as possible. But it also provides a structure for giving a unique answer. Whichever form of the fundamental parameter you choose, your final quoted result does not change, thanks to the different priors you would have to use. This is why such priors are often termed ‘objective’, in that the dependence on your personal choice is removed.
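As a concrete check of this machinery, consider a Poisson measurement of a mean µ (this specific example is an illustration, not taken from the text): ln L = n ln µ − µ − ln n!, so −∂² ln L/∂µ² = n/µ², giving I(µ) = E[n]/µ² = 1/µ and a Jeffreys prior π(µ) ∝ 1/√µ. The expectation can be verified numerically:

```python
import math

def fisher_information(mu, n_max=300):
    # E[n / mu^2] over the Poisson distribution, built up with the
    # recurrence P(n) = P(n-1) * mu / n to avoid huge factorials
    p = math.exp(-mu)   # P(0)
    total = 0.0
    for n in range(1, n_max):
        p *= mu / n
        total += p * n / mu**2
    return total

# I(mu) should equal 1/mu for every mu, so pi(mu) ~ sqrt(I) = 1/sqrt(mu)
for mu in (0.5, 3.4, 20.0):
    print(f"mu = {mu:5.1f}: I(mu) = {fisher_information(mu):.5f}, "
          f"1/mu = {1 / mu:.5f}")
```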

Extension to more than one parameter is difficult but not impossible through a technique called reference priors [12].

Although the Jeffreys prior offers a way of getting unambiguous results, it has not been universally taken up. Partly because some are too lazy to consider anything other than a uniform prior in their favourite variable. Partly because of the difficulty of applying it to more than one parameter. Partly because it violates the likelihood principle. Partly because the prior does depend on the likelihood function and thus on the experimental technique, so you would invoke a different prior for (say) the Higgs mass as determined with ATLAS through H → γγ than you would for the Higgs mass as determined by CMS through H → W⁺W⁻.

1.5.3.3 The Correct Prior?

So what prior, or what collection of priors, should you use in a Bayesian analysis, if you are forced to do so? The answer, clearly, is: whatever happens to be your personal belief. But although a prior is subjective, it should not be arbitrary. Other data, measurements of this quantity or similar ones, can be used for guidance. Asking (theorist) colleagues can be useful, but if you do that be sure to ask for a wide range.