Features a broad introduction to recent research on Turing’s formula and presents applications in statistics, probability, information theory, and other areas of modern data science
Turing's formula is, perhaps, the only known method for estimating the underlying distributional characteristics beyond the range of observed data without making any parametric or semiparametric assumptions. This book presents a clear introduction to Turing’s formula and its connections to statistics. Topics relevant to a variety of fields of study are included, such as information theory, statistics, probability, computer science (including artificial intelligence and machine learning), big data, biology, ecology, and genetics. The author examines many core statistical issues within modern data science from Turing's perspective. A systematic approach to long-standing problems such as entropy and mutual information estimation, diversity index estimation, domains of attraction on general alphabets, and tail probability estimation is presented in light of the most up-to-date understanding of Turing's formula. Featuring numerous exercises and examples throughout, the author provides a summary of the known properties of Turing's formula and explains how and when it works well; discusses the approach derived from Turing's formula in order to estimate a variety of quantities, which mainly come from information theory but are also important for machine learning and for ecological applications; and uses Turing's formula to estimate certain heavy-tailed distributions.
In summary, this book:
• Features a unified and broad presentation of Turing’s formula, including its connections to statistics, probability, information theory, and other areas of modern data science
• Provides a presentation on the statistical estimation of information theoretic quantities
• Demonstrates the estimation problems of several statistical functions from Turing's perspective such as Simpson's indices, Shannon's entropy, general diversity indices, mutual information, and Kullback–Leibler divergence
• Includes numerous exercises and examples throughout with a fundamental perspective on the key results of Turing’s formula
Statistical Implications of Turing's Formula is an ideal reference for researchers and practitioners who need a review of the many critical statistical issues of modern data science. This book is also an appropriate learning resource for biologists, ecologists, and geneticists who are involved with the concept of diversity and its estimation, and it can be used as a textbook for graduate courses in mathematics, probability, statistics, computer science, artificial intelligence, machine learning, big data, and information theory.
Zhiyi Zhang, PhD, is Professor of Mathematics and Statistics at The University of North Carolina at Charlotte. He is an active consultant in both industry and government on a wide range of statistical issues, and his current research interests include Turing's formula and its statistical implications; probability and statistics on countable alphabets; nonparametric estimation of entropy and mutual information; tail probability and biodiversity indices; and applications involving extracting statistical information from low-frequency data space. He earned his PhD in Statistics from Rutgers University.
Page count: 302
Year of publication: 2016
Cover
Title Page
Copyright
Dedication
Preface
Chapter 1: Turing's Formula
1.1 Turing's Formula
1.2 Univariate Normal Laws
1.3 Multivariate Normal Laws
1.4 Turing's Formula Augmented
1.5 Goodness-of-Fit by Counting Zeros
1.6 Remarks
1.7 Exercises
Chapter 2: Estimation of Simpson's Indices
2.1 Generalized Simpson's Indices
2.2 Estimation of Simpson's Indices
2.3 Normal Laws
2.4 Illustrative Examples
2.5 Remarks
2.6 Exercises
Chapter 3: Estimation of Shannon's Entropy
3.1 A Brief Overview
3.2 The Plug-In Entropy Estimator
3.3 Entropy Estimator in Turing's Perspective
3.4 Appendix
3.5 Remarks
3.6 Exercises
Chapter 4: Estimation of Diversity Indices
4.1 A Unified Perspective on Diversity Indices
4.2 Estimation of Linear Diversity Indices
4.3 Estimation of Rényi's Entropy
4.4 Remarks
4.5 Exercises
Chapter 5: Estimation of Information
5.1 Introduction
5.2 Estimation of Mutual Information
5.3 Estimation of Kullback–Leibler Divergence
5.4 Tests of Hypotheses
5.5 Appendix
5.6 Exercises
Chapter 6: Domains of Attraction on Countable Alphabets
6.1 Introduction
6.2 Domains of Attraction
6.3 Examples and Remarks
6.4 Appendix
6.5 Exercises
Chapter 7: Estimation of Tail Probability
7.1 Introduction
7.2 Estimation of Pareto Tail
7.3 Statistical Properties of
7.4 Remarks
7.5 Appendix
7.6 Exercises
References
Author Index
Subject Index
End User License Agreement
Chapter 2: Estimation of Simpson's Indices
Figure 2.1 Estimated difference profile: Shakespeare versus Shakespeare.
Figure 2.2 Estimated difference profile: Shakespeare versus Sidney.
Chapter 5: Estimation of Information
Figure 5.1 Letters with positive probabilities: (a) Example 5.2, (b) Example 5.3, and (c) Example 5.4
Figure 5.2 Invariance of under respective permutations by letters in and
Figure 5.3 Points with positive probabilities in Example 5.7
Figure 5.4 Plot of of (5.30) for .
Chapter 6: Domains of Attraction on Countable Alphabets
Figure 6.1 of (upper) for and (lower) for .
Figure 6.2 of (upper) and (lower), . (a) from to , one scale. (b) from to , two scales.
Figure 6.3 of (upper) and (lower), .
Chapter 1: Turing's Formula
Table 1.1 Bird sample
Table 1.2 Rearranged bird sample
Table 1.3 Exact probabilities of (1.56) for
Table 1.4 Partial of based on (1.56) for
Chapter 2: Estimation of Simpson's Indices
Table 2.1 Early Cretaceous
Table 2.2 Late Cretaceous
Table 2.3 of early Cretaceous
Table 2.4 of late Cretaceous
Chapter 5: Estimation of Information
Table 5.1 Frequency data of ethnicity and color preference
Table 5.2 Relative frequency data of ethnicity and color preference
Table 5.3 for data in Table 5.2
Table 5.4 Estimated information indices with estimated variances
Zhiyi Zhang
Department of Mathematics and Statistics, University of North Carolina, Charlotte, NC, US
Copyright © 2017 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Names: Zhang, Zhiyi, 1960-
Title: Statistical implications of Turing's formula / Zhiyi Zhang.
Description: Hoboken, New Jersey : John Wiley & Sons, Inc., [2017] | Includes bibliographical references and index.
Identifiers: LCCN 2016027830| ISBN 9781119237068 (cloth) | ISBN 9781119237099 (epub)
Subjects: LCSH: Mathematical statistics–Textbooks. | Probabilities–Textbooks.
Classification: LCC QA273 .Z43 2017 | DDC 519.5–dc23 LC record available at https://lccn.loc.gov/2016027830
Cover image courtesy Author Zhiyi Zhang
To my family and all my teachers
This book introduces readers to Turing's formula and then re-examines several core statistical issues of modern data science from Turing's perspective. Turing's formula was a remarkable invention of Alan Turing during World War II, conceived as part of an early effort to break the German Enigma codes. The formula looks at the world of randomness through a unique and powerful binary perspective – unmistakably Turing's. However, Turing's formula was not well understood for many years. Research amassed during the last decade has brought to light profound statistical implications of the formula that were previously unknown. Recently, and only recently, a relatively clear and systematic description of Turing's formula, with its statistical properties and implications, has become possible. Hence this book.
Turing's formula is often perceived as having a mystical quality. I was awestruck when I first learned of the formula 10 years ago. Its counterintuitive implication was simply beyond my immediate grasp. However, I was not alone in this regard. After turning it over in my mind for a while, I mentioned to two of my colleagues, both seasoned mathematicians, that there might be a way to give a nonparametric characterization of the tail probability of a random variable beyond the data range. To that, their immediate reaction was, “tell us more when you have figured it out.” Some years later, a former doctoral student of mine said to me, “I used to refuse to think about counterintuitive mathematical statements, but after Turing's formula, I would think about a statement at least twice, however counterintuitive it may sound.” Still another colleague of mine recently said to me, “I read everything you wrote on the subject, including details of the proofs. But I still cannot see intuitively why the formula works.” To that, I responded with the following two points:
Our intuition is a bounded mental box within which we conduct intellectual exercises with relative ease and comfort, but we must admit that this box also reflects the limitations of our experience, knowledge, and ability to reason.
If a fact known to be true does not fit into one's current box of intuition, is it not time to expand the boundary of the box to accommodate the true fact?
My personal journey in learning about Turing's formula has proved to be a rewarding one. The experience of finding Turing's formula entirely outside of my box of intuition at first, and then watching it gradually settle well within the boundary of my new box of intuition, is one I wish to share.
Turing's formula itself, while extraordinary in many ways, is not the only reason for this book. Statistical science, since R.A. Fisher, has come a long way and continues to evolve. In fact, the frontier of Statistics has largely moved on to the realm of nonparametrics. The last few decades have witnessed great advances in the theory and practice of nonparametric statistics. However, in this realm a seemingly impenetrable wall exists: how could one possibly make inferences about the tail of a distribution beyond the data range? In front of this wall, many, if not most, are discouraged by their intuition from exploring further. Yet it is often said in Statistics that “it is all in the tail.” Statistics needs a trail to get to the other side of the wall. Turing's formula blazes such a trail, and this book attempts to mark it.
Turing's formula is relevant to many key issues in modern data science, for example, Big Data. Big Data, though not yet a field of study with a clearly defined boundary, unambiguously points to a data space that is a quantum leap away from what is imaginable in the realm of classical statistics in terms of data volume, data structure, and data complexity. Big Data, however defined, poses fundamental challenges to Statistics. To begin, the task of retrieving and analyzing data in a vastly complex data space must be in large part delegated to a machine (or software), hence the term Machine Learning. How does a machine learn and make judgments? At the very core, it all boils down to a general measure of association between two observable random elements (not necessarily random variables). At least two fundamental issues immediately present themselves:
High Dimensionality. The complexity of the data space suggests that a data observation can only be appropriately registered in a very high-dimensional space, so much so that the dimensionality could be essentially infinite. The usual statistical methodologies quickly run into fundamental conceptual problems.
Discrete and Non-ordinal Nature. The generality of the data space suggests that possible data values may not have a natural order among themselves: different gene types in the human genome, different words in text, and different species in an ecological population are all examples of general data spaces without a natural “neighborhood” concept.
Such issues would force a fundamental transition from the platform of random variables (on the real line) to the platform of random elements (on a general set, or an alphabet). On such an alphabet, many familiar and fundamental concepts of Statistics and Probability no longer exist, for example, moments, correlation, tail, and so on. It would seem that Statistics is in need of a rebirth to tackle these issues.
The rebirth has been taking place in Information Theory. Its founding father, Claude Shannon, defined two conceptual building blocks, entropy (in place of moments) and mutual information (in place of correlation), in his landmark paper (Shannon, 1948). Just as moments and correlation coefficients must be estimated for random variables, entropy and mutual information must be estimated for random elements in practice. However, estimation of entropy and estimation of mutual information are technically difficult problems due to the curse of “High Dimensionality” and the “Discrete and Non-ordinal Nature.” For about 50 years after Shannon (1948), advances in this arena were slow to come. In recent years, however, research interest, propelled by the rapidly increasing level of data complexity, has been reinvigorated and, at the same time, has splintered into many different perspectives. One in particular is Turing's perspective, which has brought significant and qualitative improvement to these difficult problems. This book presents an overview of the key results and updates the frontier in this research space.
The powerful utility of Turing's perspective can also be seen in many other areas. One increasingly important modern concept is Diversity. The questions of what diversity is and how to estimate it are rapidly moving into rigorous mathematical treatment; scientists have passionately argued about them for years, largely without consensus. Turing's perspective gives some very interesting answers to these questions. This book gives a unified discussion of diversity indices, making it good reading for those who are interested in diversity indices and their estimation. The final two chapters of the book speak to the issues of tail classification and, once a tail is classified, how to perform a refined analysis under a parametric tail model via Turing's perspective. These issues are scientifically relevant in many fields of study.
I intend this book to serve two groups of readers:
Textbook for graduate students. The material is suitable for a topic course at the graduate level for students in Mathematics, Probability, Statistics, Computer Science (Artificial Intelligence, Machine Learning, Big Data), and Information Theory.
Reference book for researchers and practitioners. This book offers an informative presentation of many of the critical statistical issues of modern data science, along with updated results. Both researchers and practitioners will find this book a good learning resource and will enjoy the many relevant methodologies and formulas given and explained under one cover.
For a better flow of the presentation, some of the lengthy but instructive proofs are placed at the end of each chapter.
The seven chapters of this book may be naturally organized into three groups. Group 1 includes Chapters 1 and 2. Chapter 1 gives an introduction to Turing's formula; and Chapter 2 translates Turing's formula into a particular perspective (referred to as Turing's perspective) as embodied in a class of indices (referred to as Generalized Simpson's Indices). Group 1 may be considered as the theoretical foundation of the whole book. Group 2 includes Chapters 3–5. Chapter 3 takes Turing's perspective into entropy estimation, Chapter 4 takes it into diversity estimation, and Chapter 5 takes it into estimation of various information indices. Group 2 may be thought of as consisting of applications of Turing's perspective. Chapters 6 and 7 make up Group 3. Chapter 6 discusses the notion of tail on alphabets and offers a classification of probability distributions. Chapter 7 offers an application of Turing's formula in estimating parametric tails of random variables. Group 3 may be considered as a pathway to further research.
The material in this book is relatively new. In writing the book, I have made an effort to let the book, as well as its chapters, be self-contained. On the one hand, I wanted the material of the book to flow in a linearly coherent manner for students learning it for the first time. In this regard, readers may experience a certain degree of repetitiveness in notation definitions, lemmas, and even proofs across chapters. On the other hand, I wanted the book to go beyond merely stating established results and referring to proofs published elsewhere. Many of the mathematical results in the book have instructive value, and their proofs indicate the depth of the results. For this reason, I have included many proofs that might be judged overly lengthy and technical in a conventional textbook, mostly in the appendices.
It is important to note that this book, as the title suggests, is essentially a monograph on Turing's formula. It is not meant to be a comprehensive learning resource on topics such as estimation of entropy, estimation of diversity, or estimation of information. Consequently, many worthy methodologies in these topics have not been included. By no means do I suggest that the methodologies discussed in this book are the only ones with scientific merit. Far from it: there are many wonderful ideas proposed in the existing literature but not mentioned among the pages of this book, and assuredly many more are yet to come.
I wish to extend my heartfelt gratitude to those who have so kindly allowed me to bend their ears over the years. In particular, I wish to thank Hongwei Huang, Stas Molchanov, and Michael Grabchak for countless discussions on Turing's formula and related topics; my students, Chen Chen, Li Liu, Ann Stewart, and Jialin Zhang for picking out numerous errors in an earlier draft of the book; and The University of North Carolina at Charlotte for granting me a sabbatical leave in Spring 2015, which allowed me to bring this book to a complete draft. Most importantly, I wish to thank my family, wife Carol, daughter Katherine, and son Derek, without whose love and unwavering support this book would not have been possible.
Zhiyi Zhang
Charlotte, North Carolina
May 2016
Consider the population of all birds in the world along with all its different species, say $K$ of them, and denote the corresponding proportion distribution by $\{p_k; k = 1, \dots, K\}$, where $p_k$ is the proportion of the $k$th bird species in the population. Suppose a random sample of size $n$ is to be taken from the population, and let the bird counts for the different species be denoted by $\{Y_k; k = 1, \dots, K\}$. If it is of interest to estimate the proportion of birds of species 1 in the population, then $\hat{p}_1 = Y_1/n$ is an excellent estimator; and similarly so is $\hat{p}_k = Y_k/n$ for $p_k$ for every particular $k$.
To illustrate, consider a hypothetical sample of size $n = 2000$ with bird counts given in Table 1.1, or a version rearranged in decreasing order of the observed frequencies as in Table 1.2.
Table 1.1 Bird sample

Species (k):    1    2    3    4    5    6    7    8    9   10
Count (y_k):  300  200  300  200  100  100  100  100    0  100

Species (k):   11   12   13   14   15   16   17   18   19   20
Count (y_k):  100   80   70    0   30   50    6    1    2    1

Species (k):   21   22   23   24   25   26   27   28   29   30
Count (y_k):    1    1    0    0    1    1    1    1    1    1

Species (k):   31   32   33   34   35   36   37   38   39    …
Count (y_k):   50  100    1    1    0    0    0    0    0    …
Table 1.2 Rearranged bird sample

Species (k):    1    3    2    4    5    6    7    8   10   11
Count (y_k):  300  300  200  200  100  100  100  100  100  100

Species (k):   32   12   13   16   31   15   17   19   18   20
Count (y_k):  100   80   70   50   50   30    6    2    1    1

Species (k):   21   22   25   26   27   28   29   30   33   34
Count (y_k):    1    1    1    1    1    1    1    1    1    1

Species (k):    9   14   23   24   35   36   37   38   39    …
Count (y_k):    0    0    0    0    0    0    0    0    0    …
With this sample, one would likely estimate $p_1$ by $\hat{p}_1 = 300/2000 = 0.15$ and $p_{12}$ by $\hat{p}_{12} = 80/2000 = 0.04$, and so on.
The total number of bird species observed in this sample is 30. Yet it is clear that the bird population must have more than just 30 different species. A natural follow-up question would then be as follows:
What is the total population proportion of birds belonging to species other than those observed in the sample?
The follow-up question implies a statistical problem of estimation with a target, or estimand, being the collective proportion of birds of the species not represented in the sample. For convenience, let this target be denoted as $\pi_0$. It is important to note that $\pi_0$ is a random quantity depending on the sample and therefore is not an estimand in the usual statistical sense. In the statistics literature, $1 - \pi_0$ is often referred to as the sample coverage of the population, or in short, sample coverage, or just coverage. Naturally $\pi_0$ may be referred to as the noncoverage.
The noncoverage $\pi_0$ defined with a random sample of size $n$ is an interesting quantity. It is sometimes interpreted as the “probability” of discovering a new species, because, in a loose sense, the chance that the “next” bird is of a new, or previously unobserved, species is $\pi_0$. This interpretation is, however, somewhat misleading. The main issue with such an interpretation is the lack of clarification of the underlying experiment (and its sample space). Words such as “probability” and “next” can only have meaning in a well-specified experiment. While it is quite remarkable that $\pi_0$ could be reasonably and nonparametrically estimated by Turing's formula, $\pi_0$ is not a probability associated with the sample space of the experiment where the sample of size $n$ is drawn. Further discussion of this point is given in Section 1.5.
Turing's formula, also sometimes known as the Good–Turing formula, is an estimator of $\pi_0$ introduced by Good (1953) but largely credited to Alan Turing. Let $N_1$ denote the number of species each of which is represented by exactly one observation in a random sample of size $n$. Turing's formula is given by $T = N_1/n$. For the bird example given in Table 1.2, $n = 2000$, $N_1 = 12$, and therefore $T = 12/2000$ or $0.006$.
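This computation is easily scripted. The following minimal Python sketch (an illustration added here, with names not taken from the book) recovers $n$, $N_1$, and $T$ from the counts of Table 1.1:

```python
# Turing's formula T = N_1/n computed from the bird counts of Table 1.1.
# The trailing zeros stand for the unobserved species 35, 36, ..., 50.
counts = [300, 200, 300, 200, 100, 100, 100, 100, 0, 100,
          100, 80, 70, 0, 30, 50, 6, 1, 2, 1,
          1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
          50, 100, 1, 1] + [0] * 16

n = sum(counts)                        # sample size: 2000
n1 = sum(1 for y in counts if y == 1)  # number of singleton species: 12
t = n1 / n                             # Turing's formula: 0.006
print(n, n1, t)
```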
Let $\mathscr{X} = \{\ell_k; k \geq 1\}$ be a countable alphabet with letters $\ell_k$, and let $\mathbf{p} = \{p_k; k \geq 1\}$ be a probability distribution on $\mathscr{X}$. Let $\{Y_k; k \geq 1\}$ be the observed frequencies of the letters in an identically and independently distributed (iid) random sample of size $n$. For any integer $r$, $0 \leq r \leq n$, let the number of letters in the alphabet that are each represented exactly $r$ times in the sample be denoted by
$$ N_r = \sum_{k \geq 1} 1[Y_k = r], \qquad (1.1) $$
where $1[\cdot]$ is the indicator function. Let the total probability associated with letters that are represented exactly $r$ times in the sample be denoted by
$$ \pi_r = \sum_{k \geq 1} p_k 1[Y_k = r]. \qquad (1.2) $$
Of special interest are the cases of $r = 1$ in (1.1) and $r = 0$ in (1.2), when
$$ N_1 = \sum_{k \geq 1} 1[Y_k = 1] \qquad (1.3) $$
and
$$ \pi_0 = \sum_{k \geq 1} p_k 1[Y_k = 0], \qquad (1.4) $$
representing, respectively,
$N_1$: the number of letters of $\mathscr{X}$ that each appears exactly once in the sample; and
$\pi_0$: the total probability associated with the unobserved letters of $\mathscr{X}$ in an iid sample of size $n$.
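In computational terms, $N_r$ of (1.1) is a simple tally over the observed frequencies. A one-line helper (an illustrative sketch, not part of the book's materials) makes this concrete:

```python
# N_r: the number of letters each observed exactly r times in the sample,
# computed from a list of observed frequencies {Y_k} (zeros included).
def n_r(counts, r):
    return sum(1 for y in counts if y == r)

# With the bird counts of Table 1.1 (see the earlier sketch):
# n_r(counts, 1) == 12, n_r(counts, 2) == 1, n_r(counts, 100) == 7.
```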
The following expression is known as Turing's formula:
$$ T = \frac{N_1}{n}. \qquad (1.5) $$
Turing's formula has been extensively studied during the many years following its introduction by Good (1953) and has been demonstrated, mostly through numerical simulations, to provide a satisfactory estimate of $\pi_0$ for a wide range of distributions. These studies put forth a remarkable yet puzzling implication: the total probability on the unobserved subset of $\mathscr{X}$ may be well estimated nonparametrically. No satisfactory interpretation was given in the literature until Robbins (1968).
Let $N_1$ and $\pi_0$ be defined with an iid random sample of size $n$ as in (1.3) and (1.4). Let Turing's formula be defined with an augmented sample of size $n + 1$, obtained by adding one new observation to the original sample of size $n$, and let the resulting estimator be denoted by $T'$. Then $T'$ is an unbiased estimator of $\pi_0$, in the sense of $E(T') = E(\pi_0)$.
Robbins' claim is easily verified. Let $N_1'$ be the number of letters represented exactly once in the augmented sample of size $n + 1$, with observed frequencies $\{Y_k'; k \geq 1\}$. Then
$$ E(T') = E\left(\frac{N_1'}{n+1}\right) = \frac{1}{n+1} \sum_{k \geq 1} P(Y_k' = 1) = \frac{1}{n+1} \sum_{k \geq 1} (n+1) p_k (1-p_k)^n = \sum_{k \geq 1} p_k (1-p_k)^n. $$
On the other hand, with the sample of size $n$,
$$ E(\pi_0) = \sum_{k \geq 1} p_k P(Y_k = 0) = \sum_{k \geq 1} p_k (1-p_k)^n. $$
Hence, $E(T') = E(\pi_0)$.
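Robbins' identity also lends itself to a quick numerical check. The following Monte Carlo sketch in Python uses an arbitrary five-letter distribution chosen purely for illustration (none of the specifics below come from the book); the average of $T'$ over samples of size $n + 1$ should agree with the average of $\pi_0$ over samples of size $n$:

```python
import random

# Monte Carlo check of Robbins' claim (an illustrative sketch; the
# five-letter distribution and all constants are arbitrary choices):
# E(T'), computed from augmented samples of size n + 1, should match
# E(pi_0), computed from the original samples of size n.
p = [0.4, 0.3, 0.15, 0.1, 0.05]
n, reps = 50, 20000
rng = random.Random(7)

sum_t_prime, sum_pi0 = 0.0, 0.0
for _ in range(reps):
    sample = rng.choices(range(len(p)), weights=p, k=n + 1)
    # T' = N_1'/(n + 1) from the augmented sample of size n + 1
    n1_prime = sum(1 for k in range(len(p)) if sample.count(k) == 1)
    sum_t_prime += n1_prime / (n + 1)
    # pi_0 from the original sample of size n (the first n observations)
    first_n = sample[:n]
    sum_pi0 += sum(p[k] for k in range(len(p)) if k not in first_n)

print(sum_t_prime / reps, sum_pi0 / reps)  # the two averages should agree
```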
Robbins' claim provides an intuitive interpretation in the sense that

1. $T'$ is an unbiased and, therefore, a good estimator of $\pi_0$;
2. the difference between $T'$ and $T$ should be small; and therefore
3. $T$ should be a good estimator of $\pi_0$.
However, Robbins' claim still leaves much to be desired. Even supposing for a moment that unbiasedness in the sense of $E(T') = E(\pi_0)$ is an acceptable notion of a good estimator for $\pi_0$, the claim does not directly address the performance of $T$ as an estimator of $\pi_0$, not even the bias $E(T - \pi_0)$, let alone other important statistical properties of $T$. In short, the question of whether Turing's formula works at all, and if it does, how well, is not entirely resolved by Robbins' claim.
In addition to the unusual characteristic that the estimand $\pi_0$ is a random variable, it is to be noted that both the estimator $T$ and the estimand $\pi_0$ converge to zero in probability (see Exercise 6). Therefore, it is not adequate to characterize the performance of $T$ by the bias $E(T - \pi_0)$ alone. Any reasonable characterization of its performance must take into consideration the vanishing rate of the estimand $\pi_0$.
To put the discussion in perspective, a notion of performance is needed, however minimal it may be. Let $\mathbf{Y}$ denote a data sample. Let $\hat{\theta} = \hat{\theta}(\mathbf{Y})$ be an estimator of a random estimand $\theta = \theta(\mathbf{Y})$. Let
$$ R_n = \frac{E(\hat{\theta}) - E(\theta)}{E(\theta)}, $$
whenever $E(\theta) \neq 0$, be referred to as the relative bias of $\hat{\theta}$ estimating $\theta$. $\hat{\theta}$ is said to be an asymptotically relatively unbiased estimator of $\theta$ if
$$ R_n \to 0 $$
as $n \to \infty$, provided that $E(\theta) \neq 0$ for all sufficiently large $n$. When $\theta$ does not depend on the sample data $\mathbf{Y}$, an asymptotically relatively unbiased estimator is an asymptotically unbiased estimator in the usual statistical sense.
Let $K = \#\{k : p_k > 0\}$ be referred to as the effective cardinality of $\mathscr{X}$, or just simply the cardinality whenever there is no risk of ambiguity. $K < \infty$ implies that there are only finitely many letters in $\mathscr{X}$ with positive probabilities.
Example 1.1. For any probability distribution $\mathbf{p}$ such that $K < \infty$, Turing's formula is not an asymptotically relatively unbiased estimator of $\pi_0$. This is so because, letting $p_{\wedge} = \min\{p_k : p_k > 0\}$,
$$ \frac{E(T) - E(\pi_0)}{E(\pi_0)} = \frac{\sum_{k \geq 1} p_k^2 (1-p_k)^{n-1}}{\sum_{k \geq 1} p_k (1-p_k)^{n}} \longrightarrow \frac{p_{\wedge}}{1 - p_{\wedge}} > 0 $$
as $n \to \infty$.
Example 1.1 demonstrates that there exist distributions on $\mathscr{X}$ under which Turing's formula is not an asymptotically relatively unbiased estimator. In retrospect, this is not so surprising. On a finite alphabet, the total probability associated with letters not covered by a large sample, $\pi_0$, should rapidly approach zero, at least on an intuitive level, as manifested by the fact that $E(\pi_0) = \sum_{k} p_k (1-p_k)^n \to 0$ (see Exercise 7). On the other hand, the number of singletons, that is, letters that are each observed exactly once in the sample ($N_1$), should also approach zero rapidly, as manifested by the fact that $E(T) = \sum_{k} p_k (1-p_k)^{n-1} \to 0$. The fact that $E(T)$ and $E(\pi_0)$ are vanishing at the same rate leads to the result of Example 1.1. This example also suggests that perhaps Turing's formula may not always be a good estimator of its target $\pi_0$, at least in its current form. However, Turing's formula fails to be an asymptotically relatively unbiased estimator only when $K < \infty$.
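The finite-alphabet phenomenon of Example 1.1 can be seen numerically from the exact expectations $E(T) = \sum_k p_k (1-p_k)^{n-1}$ and $E(\pi_0) = \sum_k p_k (1-p_k)^n$. The sketch below uses a uniform distribution on $K$ letters, an arbitrary illustrative choice (not an example from the book), for which the relative bias equals $p_{\wedge}/(1-p_{\wedge}) = 1/(K-1)$ exactly for every $n$:

```python
# Exact expectations on a finite alphabet (an illustrative sketch):
# E(T) = sum_k p_k (1 - p_k)**(n - 1) and E(pi_0) = sum_k p_k (1 - p_k)**n.
# For the uniform distribution on K letters the relative bias equals
# p/(1 - p) = 1/(K - 1) for every n, so it cannot vanish as n grows.
K = 10
p = [1.0 / K] * K
for n in (10, 100, 1000):
    e_t = sum(pk * (1 - pk) ** (n - 1) for pk in p)
    e_pi0 = sum(pk * (1 - pk) ** n for pk in p)
    print(n, (e_t - e_pi0) / e_pi0)  # 1/(K - 1) = 0.111... each time
```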
Theorem 1.1. Turing's formula $T$ is an asymptotically relatively unbiased estimator of $\pi_0$ if and only if $K = \infty$.
To prove Theorem 1.1, the following three lemmas are needed.
Lemma 1.1. Suppose $\{p_k; k \geq 1\}$ is a strictly decreasing infinite probability sequence with $p_k > p_{k+1} > 0$ for every $k$, $k \geq 1$. Then, as $n \to \infty$,
$$ \frac{\sum_{k \geq 1} p_k^2 (1-p_k)^{n}}{\sum_{k \geq 1} p_k (1-p_k)^{n}} \to 0. $$

Proof: Let $k_0$ be any specific index. The ratio may be re-expressed as follows:
$$ \frac{\sum_{k \geq 1} p_k^2 (1-p_k)^{n}}{\sum_{k \geq 1} p_k (1-p_k)^{n}} = \frac{\sum_{k \leq k_0} p_k^2 (1-p_k)^{n}}{\sum_{k \geq 1} p_k (1-p_k)^{n}} + \frac{\sum_{k > k_0} p_k^2 (1-p_k)^{n}}{\sum_{k \geq 1} p_k (1-p_k)^{n}} =: A_n + B_n, $$
where the symbol “$:=$” stands for the defining equality; for example, “$a := b$” and “$a =: b$” represent “$a$ is defined by $b$” and “$a$ defines $b$,” respectively.

For any given $\varepsilon > 0$, let $k_0$ be such that $p_{k_0} < \varepsilon$. First it is observed that
$$ B_n \leq p_{k_0} \, \frac{\sum_{k > k_0} p_k (1-p_k)^{n}}{\sum_{k \geq 1} p_k (1-p_k)^{n}} \leq p_{k_0} < \varepsilon. $$
Next it is observed that, bounding the denominator below by its $(k_0 + 1)$th term alone,
$$ A_n \leq \frac{\sum_{k \leq k_0} p_k^2 (1-p_k)^{n}}{p_{k_0+1} (1-p_{k_0+1})^{n}} = \frac{1}{p_{k_0+1}} \sum_{k \leq k_0} p_k^2 \left( \frac{1-p_k}{1-p_{k_0+1}} \right)^{n}. $$
In the last expression, $(1-p_k)/(1-p_{k_0+1}) < 1$ for every $k \leq k_0$, and therefore each term vanishes as $n \to \infty$. In addition, since there are only finitely many terms in the summation, $A_n \to 0$. That is to say, for any $\varepsilon$ as given earlier, there exists an integer $N$ such that for every $n > N$, $A_n < \varepsilon$, or $A_n + B_n < 2\varepsilon$.
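The vanishing ratio in Lemma 1.1 can also be observed numerically; the sketch below uses the geometric distribution $p_k = 2^{-k}$ as an arbitrary illustration (not an example from the book), truncating the infinite sums at 200 terms, where the omitted tail is negligible:

```python
# Numerical look at the ratio in Lemma 1.1 for the geometric distribution
# p_k = 2**(-k), truncated at 200 terms: the ratio
# sum_k p_k^2 (1-p_k)^n / sum_k p_k (1-p_k)^n vanishes as n grows.
def ratio(n, terms=200):
    p = [0.5 ** k for k in range(1, terms + 1)]
    num = sum(pk * pk * (1 - pk) ** n for pk in p)
    den = sum(pk * (1 - pk) ** n for pk in p)
    return num / den

for n in (10, 100, 1000, 10000):
    print(n, ratio(n))  # decreases toward 0, unlike the finite-K case
```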
Lemma 1.2. Suppose $\{p_k; k \geq 1\}$ is a strictly decreasing infinite probability sequence with $p_k > p_{k+1} > 0$ for every $k$, $k \geq 1$. Then, as $n \to \infty$,
$$ \frac{\sum_{k \geq 1} p_k^2 (1-p_k)^{n-1}}{\sum_{k \geq 1} p_k (1-p_k)^{n}} \to 0. $$

Proof: By Lemma 1.1 (applied with $n - 1$ in place of $n$),
$$ \frac{\sum_{k \geq 1} p_k^2 (1-p_k)^{n-1}}{\sum_{k \geq 1} p_k (1-p_k)^{n-1}} \to 0, $$
and therefore it suffices to show that
$$ \frac{\sum_{k \geq 1} p_k (1-p_k)^{n-1}}{\sum_{k \geq 1} p_k (1-p_k)^{n}} $$
remains bounded as $n \to \infty$.