A thorough treatment of the statistical methods used to analyze doubly truncated data

In The Statistical Analysis of Doubly Truncated Data, an expert team of statisticians delivers an up-to-date review of existing methods for dealing with randomly truncated data, with a focus on the challenging problem of random double truncation. The authors comprehensively introduce doubly truncated data before moving on to the latest developments in the field. The book offers readers examples with R code, along with real data from astronomy, engineering, and the biomedical sciences, to illustrate the methods described within. Linear regression models for doubly truncated responses are provided, and the influence of the bandwidth on the performance of kernel-type estimators, as well as guidelines for the selection of the smoothing parameter, are explored. Fully nonparametric and semiparametric estimators are presented and illustrated with real data, and R code for reproducing the data examples is also provided.

The book also offers:

* A thorough introduction to the existing methods that deal with randomly truncated data
* Comprehensive explorations of linear regression models for doubly truncated responses
* Practical discussions of the influence of the bandwidth on the performance of kernel-type estimators, with guidelines for the selection of the smoothing parameter
* In-depth examinations of nonparametric and semiparametric estimators

Perfect for statistical professionals with some background in mathematical statistics, biostatisticians, and mathematicians with an interest in survival analysis and epidemiology, The Statistical Analysis of Doubly Truncated Data is also an invaluable addition to the libraries of biomedical scientists and practitioners, as well as postgraduate students studying survival analysis.
Page count: 286
Year of publication: 2021
Cover
Title Page
Copyright
Dedication
Preface
List of Abbreviations
Notation
1 Introduction
1.1 Random Truncation
1.2 One‐sided Truncation
1.3 Double Truncation
1.4 Real Data Examples
References
2 One‐Sample Problems
2.1 Nonparametric Estimation of a Distribution Function
2.2 Semiparametric and Parametric Approaches
2.3 R Code for the Examples
References
3 Smoothing Methods
3.1 Some Background in Kernel Estimation
3.2 Estimating the Density Function
3.3 Asymptotic Properties
3.4 Data‐driven Bandwidth Selection
3.5 Further Issues in Kernel Density Estimation
3.6 Estimating the Hazard Function
3.7 R Code for the Examples
References
4 Regression Analysis
4.1 Observational Bias in Regression
4.2 Proportional Hazards Regression
4.3 Accelerated Failure Time Regression
4.4 Nonparametric Regression
4.5 R Code for the Examples
References
5 Further Topics
5.1 Two‐Sample Problems
5.2 Competing Risks
5.3 Testing for Quasi‐independence
5.4 Dependent Truncation
5.5 R Code for the Examples
References
A: Packages and Functions in R
A.1 Computing the NPMLE and Standard Errors
A.2 Assessing the Existence and Uniqueness of the NPMLE
A.3 Semiparametric and Parametric Estimation
A.4 Kernel Estimation
A.5 Regression Analysis
A.6 Competing Risks
A.7 Simulating Data
A.8 Testing Quasi‐independence
A.9 Dependent Truncation
References
Index
End User License Agreement
Chapter 1
Table 1.1 Descriptive statistics for Childhood Cancer Data: sample size and...
Table 1.2 Descriptive statistics for the AIDS Blood Transfusion Data: sample...
Table 1.3 Years to failure and number of failing units for the Equipment‐S R...
Table 1.4 Descriptive statistics for the Quasar Data. Luminosity in log‐scal...
Table 1.5 Parkinson's Disease Data: age of onset for genetic groups. Early o...
Table 1.6 Descriptive statistics for the ACS Data. Sample size and mean (an...
Chapter 2
Table 2.1 (Example 2.1.19). Bias and standard deviation (SD) of and of the ...
Table 2.2 Standard error estimation by bootstrap methods, for the simulation...
Table 2.3 Standard error estimation by bootstrap methods, for the simulation...
Table 2.4 Confidence interval estimation by bootstrap methods, for the simul...
Table 2.5 Confidence interval estimation by bootstrap methods, for the simul...
Table 2.6 Simulation results for Example 2.2.3: Monte Carlo variance of the ...
Table 2.7 Example 2.2.4. Analytical calculations for the asymptotic efficien...
Chapter 3
Table 3.1 (Simulated scenarios I and II, Example 3.3.3). Optimal bandwidths ...
Table 3.2 (Simulated scenario I with , Example 3.4.1). Median, interquartile...
Table 3.3 (Simulated scenario I with , Example 3.4.1). Median, interquartile ran...
Table 3.4 (Simulated scenario II with , Example 3.4.1). Median, interquartile ra...
Table 3.5 (Simulated scenario II with , Example 3.4.1). Median, interquartil...
Table 3.6 (Simulated scenarios I and II with several widths for the samplin...
Chapter 4
Table 4.1 (RD and RC models, Example 4.2.3). Results corresponding to the pr...
Table 4.2 (Parkinson's Disease Data, Example 4.2.4). Results of the proporti...
Table 4.3 (RD and RC models with simulated scenario II, Example 4.3.1). Resu...
Table 4.4 (RD and RC models with simulated scenario II, Example 4.3.1). Resu...
Table 4.5 (AIDS Blood Transfusion Data, Example 4.3.2.) Results of the semip...
Chapter 5
Table 5.1 Example 5.1.1. Performance of the Mann–Whitney test for for the s...
Table 5.2 Example 5.1.2. Performance of the Mann–Whitney test for for the s...
Table 5.3 Childhood Cancer Data. Number of cases (and %) and mean (and SD) f...
Table 5.4 (Simulated competing risks data, Example 5.2.5). Bias, standard de...
Table 5.5 Simulation results for the conditional Kendall's Tau in models a) ...
Table 5.6 (AIDS Blood Transfusion Data and Childhood Cancer Data, Example 5....
Chapter 2
Figure 2.1 (Parkinson's Disease Data, Example 2.1.8). NPMLE for the cdf of t...
Figure 2.2 (Parkinson's Disease Data, Example 2.1.8). Estimated sampling pro...
Figure 2.3 (Parkinson's Disease Data, Example 2.1.8). NPMLE for the cdf of t...
Figure 2.4 (Simulated scenario I: no sampling bias for , Example 2.1.11). L...
Figure 2.5 (Simulated scenario I: no sampling bias for , Example 2.1.13). L...
Figure 2.6 (Simulated scenario I: no sampling bias for , Example 2.1.13). L...
Figure 2.7 (Simulated scenario II: sampling bias for , Example 2.1.14). Lef...
Figure 2.8 (Simulated scenario II: sampling bias for , Example 2.1.14). Est...
Figure 2.9 (Equipment‐S Rounded Failure Time Data, Example 2.1.15). Left plo...
Figure 2.10 (Simulated scenario II with rounding, Example 2.1.16). Left: NPM...
Figure 2.11 (Parkinson's Disease Data, Example 2.1.22). Pointwise confidence...
Figure 2.12 (Equipment‐S Rounded Failure Time Data, Example 2.1.23). Histogr...
Figure 2.13 Left plot: NPMLE of for the Childhood Cancer Data in Section 1...
Figure 2.14 (Example 2.2.3). Left: members of the parametric family of trunc...
Figure 2.15 (Parkinson's Disease Data, Example 2.2.6). Left plot: pdf of the...
Figure 2.16 (Parkinson's Disease Data, Example 2.2.6). Left plot: NPMLE of
Figure 2.17 (Parkinson's Disease Data, Example 2.2.6). Left plot: pdf of the...
Figure 2.18 (Parkinson's Disease Data, Example 2.2.6). Left plot: NPMLE of
Figure 2.19 (Parkinson's Disease Data, Example 2.2.8). Left plot: log‐normal...
Figure 2.20 (Parkinson's Disease Data, Example 2.2.8). Left plot: log‐normal...
Chapter 3
Figure 3.1 (Childhood Cancer Data, Example 3.2.1). Left plot: nonparametric ...
Figure 3.2 (Childhood Cancer Data, Example 3.3.4). Nonparametric (black line...
Figure 3.3 (AIDS Blood Transfusion Data, Example 3.3.5). Nonparametric (blac...
Figure 3.4 (AIDS Blood Transfusion Data, Example 3.3.5). Left plot: nonparam...
Figure 3.5 (Simulated scenario I, Example 3.4.1). Density estimates of for...
Figure 3.6 (Simulated scenario II, Example 3.4.1). Density estimates of fo...
Figure 3.7 (Quasar Data, Example 3.4.2). Left plot: sampling bias for the qu...
Figure 3.8 (AIDS Blood Transfusion Data, Example 3.4.3). Left plot: nonparam...
Figure 3.9 (Childhood Cancer Data from Example 3.5.1 and AIDS Blood Transfus...
Figure 3.10 (Simulated scenarios I and II, Example 3.6.3). True biasing func...
Figure 3.11 (AIDS Blood Transfusion Data, Example 3.6.4). Kernel hazard esti...
Figure 3.12 (Acute Coronary Syndrome Data, Example 3.6.5). Kernel hazard est...
Figure 3.13 (Acute Coronary Syndrome Data, Example 3.6.5). Biasing function ...
Chapter 4
Figure 4.1 Conditional cdfs and in model RD (dashed lines), and ordinary...
Figure 4.2 Regression line for the transformed response in model RC (dashe...
Figure 4.3 (Parkinson's Disease Data, Example 4.1.1). Conditional cdfs for t...
Figure 4.4 (AIDS Blood Transfusion Data, Example 4.1.4). Consistent estimati...
Figure 4.5 (RD and RC model with simulated scenario II, Example 4.3.1). Left...
Figure 4.6 (AIDS Blood Transfusion Data, Example 4.3.2). Ordinary least squa...
Figure 4.7 (AIDS Blood Transfusion Data. Example 4.4.2). Nonparametric regre...
Chapter 5
Figure 5.1 Density functions for the two groups in Example 5.1.2, based on t...
Figure 5.2 (AIDS Blood Transfusion Data, Example 5.1.3). Left plot: NPMLE of...
Figure 5.3 (AIDS Blood Transfusion Data, Example 5.1.3). Empirical survival ...
Figure 5.4 (Childhood Cancer Data, Example 5.2.4). Left plot: sampling proba...
Figure 5.5 (Simulated competing risks data, Example 5.2.5). Left plot: true ...
Figure 5.6 (Simulated competing risks data, Example 5.2.5). Left plot: boxpl...
Figure 5.7 (Simulated competing risks data, Example 5.2.5). Left plot: boxpl...
Figure 5.8 (Childhood Cancer Data, Example 5.2.6). Conditional cumulative in...
Figure 5.9 Histograms for the conditional Kendall's Tau, corresponding to 10...
Figure 5.10 Scatterplot of observations simulated from the Clayton copul...
Figure 5.11 Scatterplot of observations simulated from the Clayton copul...
Figure 5.12 (Simulated scenario I with dependent truncation, Example 5.4.2)....
Figure 5.13 (AIDS Blood Transfusion Data and Childhood Cancer Data, Example ...
Figure 5.14 (AIDS Blood Transfusion Data, Example 5.4.3). Logarithm of the l...
Established by Walter A. Shewhart and Samuel S. Wilks
Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay
Editors Emeriti: Harvey Goldstein, J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, Jozef L. Teugels
The Wiley Series in Probability and Statistics is well established and authoritative. It covers many topics of current research interest in both pure and applied statistics and probability theory. Written by leading statisticians and institutions, the titles span both state‐of‐the‐art developments in the field and classical methods.
Reflecting the wide range of current research in statistics, the series encompasses applied, methodological and theoretical statistics, ranging from applications and new techniques made possible by advances in computerized practice to rigorous treatment of theoretical approaches. This series provides essential and invaluable reading for all statisticians, whether in academia, industry, government, or research.
A complete list of titles in this series can be found at http://www.wiley.com/go/wsps
Jacobo de Uña‐Álvarez, Carla Moreira and Rosa M. Crujeiras
This edition first published 2022
© 2022 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Jacobo de Uña‐Álvarez, Carla Moreira and Rosa M. Crujeiras to be identified as the authors of the editorial material in this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data applied for
[ISBN: 9781119951377]
Cover Design: Wiley
To María Soledad, Marcos, Paula and Miguel, for their love, support and inspiration.
To Teo, Sabela and Andrés, for their infinite patience.
This book is the result of a long‐standing collaboration among the three authors, which began when Carla Moreira was a PhD student under the supervision of Jacobo de Uña‐Álvarez. Carla successfully defended her thesis, entitled ‘The Statistical Analysis of Doubly Truncated Data: New Methods, Software Development, and Biomedical Applications’, at the Universidade de Vigo in July 2010. At that time, only a small number of people seemed to be aware of the importance of random double truncation. Research papers on this topic were scarce before 2010, with the contribution by Bradley Efron and Vahe Petrosian in 1999 as the most relevant one. And, of course, no software was available. So, for us, it was a risky and exciting research exercise to embrace such an initiative.
We launched version 1.1 of our R package DTDA in September 2009. To our knowledge, this was the first software library implementing the Efron–Petrosian estimator. The package included Efron and Petrosian's data on quasar luminosities, and we are very thankful to both scientists for sharing them. DTDA has been downloaded more than 45,000 times to date. We have taken the opportunity of writing this book to update and enhance DTDA, feeding it with new illustrative real datasets and adding new functions and capabilities. We are confident that the updated package and the guidance provided by this book will greatly increase the number of applications involving doubly truncated data, and also raise awareness of the implications of double truncation for inferential procedures.
Over these years, several researchers have collaborated with us in the fascinating adventure of investigating double truncation. Among them, we would like to mention Ingrid Van Keilegom, Micha Mandel, Rebecca Betensky, Luis Meira‐Machado and Roel Braekers. We have enjoyed co‐authoring a number of research papers with them. We also learned a lot about double truncation by studying real data problems posed by applied researchers; here we thank María José Bento, David Keith Simon, Zhi‐Sheng Ye, Ana Cristina Santos and Henrique Barros for fruitful discussions and cooperation.
Nowadays, there is a considerable statistical community doing research on exploratory and inferential methods for doubly truncated data, partly motivated by new emerging applications in Biomedicine, Economics and Engineering, among other fields. At the time of writing, the activity in this area of research is more intense than ever before, as is evident from the number of papers on the topic published in the last couple of years. And the interest in double truncation is growing faster and faster!
This book aims to serve as a companion for those interested in learning about doubly truncated data analysis and inference, presenting a wide range of tools for estimating distribution and regression models. All the methods presented in this book are accompanied by real‐data and simulated examples and, at the end of each chapter, the reader will find do‐it‐yourself code, mostly based on the DTDA package. This book is not written with the aim of being just read: its main purpose is to invite the reader to think, explore and experience.
This volume is also self‐contained, providing a general overview of the main results. Further technical details and some omitted proofs can be consulted in the original references. It is also our intention to leave several take‐home messages. First, that the correction of the potential sampling bias arising from double truncation may be critical in estimation and inference. Second, that, even though the Efron–Petrosian estimator is conceptually complicated and its asymptotic theory may be overwhelming, its practical application is relatively simple, thanks to the available software packages and the good performance of resampling algorithms. Third, that external information on the sampling bias should be used whenever available, since the Efron–Petrosian estimator may be very noisy or even non‐existent, particularly when the sample size is small to moderate.
We sincerely hope that the reader will enjoy (and experience!) the book, at least as much as we have enjoyed writing it! Comments and suggestions on this edition are welcome; please send them to [email protected] to help us improve the book.
Parts of this book were written while the authors were supported by the Grants MTM2017‐89422‐P (MINECO/AEI/FEDER, UE) (first author), UIDB/00013/2020 and UIDP/00013/2020 (second author), and MTM2016‐76969‐P (MINECO/AEI/FEDER, UE) (third author). This support is gratefully acknowledged.
May 2021
Jacobo de Uña‐Álvarez, Carla Moreira and Rosa M. CrujeirasVigo, V. N. Famalicão and Santiago de Compostela
AFT: accelerated failure time
AIDS: acquired immunodeficiency syndrome
AMISE: asymptotic mean integrated square error
AMSE: asymptotic mean square error
bcv: biased cross‐validation
BMISE: bootstrap mean integrated square error
Boot: bootstrap
cdf: cumulative distribution function
CIF: cumulative incidence function
cv: cross‐validation
ecdf: empirical cumulative distribution function
DNA: deoxyribonucleic acid
DPI: direct plug‐in
DP1: direct plug‐in in one stage
DP2: direct plug‐in in two stages
FGM: Farlie–Gumbel–Morgenstern
HIV: human immunodeficiency virus
ICCC: International Classification of Childhood Cancer
iid: independent and identically distributed
IPWE: inverse probability weighted estimator
IQR: interquartile range
ISE: integrated square error
LSCV: least‐squares cross‐validation
MISE: mean integrated square error
MLE: maximum‐likelihood estimator
MSE: mean square error
NP: nonparametric
NPMLE: nonparametric maximum‐likelihood estimator
NR: normal reference
pdf: probability density function
OB: obvious bootstrap
PB: percentile bootstrap
PD: Parkinson's disease
SB: simple bootstrap
SBoot: smoothed bootstrap
SD: standard deviation
SEF: special exponential family
SJ‐dpi: Sheather–Jones direct plug‐in
SJ‐ste: Sheather–Jones solve‐the‐equation plug‐in
SNPs: single nucleotide polymorphisms
SP: semiparametric
SPMLE: semiparametric maximum‐likelihood estimator
ucv: unbiased cross‐validation
: target variable, supported on ;
: truncating variables; the respective supports of and are and
: population triplet; , : observed independent triplets such that , where stands for the equality in distribution
: for interval sampling, width of the sampling interval, so ; it holds , where and are the dates that determine the sampling interval
, and : cumulative distribution function, probability density function and hazard function of , respectively
: variable follows the cumulative distribution function
: variable follows a uniform distribution on the interval
: support of the cumulative distribution function
: left‐continuous version of the cumulative distribution function ; for a continuous it holds that for each ,
: bivariate cumulative distribution function and bivariate probability density function of , respectively
, , , : marginal cumulative distribution functions and marginal probability density functions of and ,
: probability masses attached to and , respectively, appearing in the likelihood function
: indicator of the event
: bootstrap resample of
: weighting function, or biasing function, which reports the sampling probabilities for ; ; last equality holds if and are independent
: proportion of non‐truncated data
: normalized biasing function
: truncated cumulative distribution function of
: truncated bivariate cumulative distribution function of
, , , : truncated marginal cumulative distribution functions and truncated marginal probability density functions of and ,
: parametric family of distribution functions for the truncation couple, with parameter space ;
: true value of
: probability density function attached to
: weighting function, or biasing function, under the parametric truncation family
: probability of no truncation inherited from the parametric truncation family
: semiparametric version of , inherited from the parametric truncation family
, : parametric distribution family for
: true value of
: probability density function attached to
Random truncation generally refers to a situation in which a number of individuals of the target population cannot be sampled because a certain random event precludes their observation. When this random event is unrelated to the variables of interest, standard statistical methods apply, with the only inconvenience being a smaller sample size. In many practical cases, however, the truncation event is related to the variables under study, and specific methods to overcome the sampling bias must be considered.
This book is focused on random truncation phenomena that arise (usually, but not only) when sampling time‐to‐event data. That is, the variable of interest is the time elapsed from a well‐defined origin to another well‐defined end point. In this setting, a truncated sample of is a set of independent and identically distributed (iid) random variables with the conditional distribution of given , where is a random set. Since the truncation event is obviously related to , standard statistical methods applied to the truncated sample may be systematically biased. For example, the ordinary empirical cumulative distribution function (ecdf) of at point , , converges to rather than to the target cumulative distribution function (cdf) . This problem has received remarkable attention since the seminal paper by Turnbull (1976). Special forms of truncation when sampling time‐to‐event data are reviewed in Sections 1.2 and 1.3.
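The bias of the ordinary ecdf under truncation can be checked with a small simulation. The sketch below is illustrative only (the book's examples use R; here Python is used, and the exponential lifetimes and the uniform observation window are assumptions made for the illustration): iid lifetimes are drawn, only those falling inside a random observation window are retained, and the ecdf of the truncated sample is compared with the target cdf.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Target lifetimes ~ Exp(1); each subject carries a random observation
# window [u, u + 1] with u ~ Uniform(0, 3) (both illustrative choices).
x = rng.exponential(1.0, n)
u = rng.uniform(0.0, 3.0, n)
keep = (u <= x) & (x <= u + 1.0)   # sampled only if the lifetime falls in the window
x_obs = x[keep]

t = 1.0
ecdf_obs = np.mean(x_obs <= t)     # ordinary ecdf of the truncated sample at t
true_cdf = 1.0 - np.exp(-t)        # target cdf F(t)
print(ecdf_obs, true_cdf)          # the ecdf sits well away from the target
```

No matter how large n grows, the gap does not vanish: the ecdf converges to the conditional, not the target, distribution function.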
Time‐to‐event data are relevant in fields like Survival Analysis and Reliability Engineering, in which random truncation often occurs. Random truncation is found in Astronomy too, where represents the luminosity of a stellar object that is subject to observation limits. Examples from these areas will be introduced and analysed throughout this book.
Left‐truncation is a common feature when sampling time‐to‐event data. A left‐truncation time for the target is defined as a random variable such that is observed only when , determining the random set in the previous section.
Left‐truncation occurs, for example, with cross‐sectional sampling, where the sampled individuals are those between the origin and the end point at a certain calendar time, the cross‐section date (Wang, 1991). That is, the observer arrives at the process at a given date and is allowed to observe the time‐to‐event and the left‐truncation time for the individuals 'in progress' by that date. With cross‐sectional sampling, the variable is simply defined as the time from onset to the cross‐section date. This sampling procedure is often applied because it entails relatively little effort to reach a pre‐specified sample size. In medical research, such a design leads to the sampling of the so‐called prevalent cases: patients already diagnosed with a certain disease of interest who survived beyond the cross‐section date. Clearly, such a sampling design implies an observational bias, in the sense that individuals with longer survival (the value) will be observed with a relatively large probability. There exist well‐investigated proposals to overcome such a bias, based on the simple idea of taking the observed left‐truncation times into account to define suitable risk sets. For this purpose, independence between and has traditionally been assumed. This independence assumption states that the time‐to‐event distribution remains unchanged over time, being unrelated to the date of onset. A classical example of left‐truncation is given by the Channing House data, where the age at death is measured for people living in that retirement centre; in this case, the target variable is left‐truncated by the age when entering the residence (Klein and Moeschberger, 2003).
Another feature leading to left‐truncation is delayed entry into the study. This happens when individuals enter the study only at some random time after onset. For example, diagnosis of a certain disease may not be ascertained until the first visit to the hospital. If the 'end‐of‐disease' event occurs before the potential date of visit, the time‐to‐event of such a patient will never be known, with the resulting difficulty in observing relatively small event times. Beyersmann et al. (2012) provide an illustrative example of this issue in the investigation of abortion times.
In some particular settings, the target variable of ultimate interest is observed only for the individuals who experience the event before a certain calendar time . A typical example of such a situation is the investigation of the incubation (or induction) times for AIDS; see for example Klein and Moeschberger (2003). The incubation time is defined as the time elapsed between the date of HIV infection, say, and the development of AIDS. If stands for the incubation time and , then the incubation times of individuals developing AIDS prior follow the distribution of conditionally on . Here, is called the right‐truncation time. An immediate effect of right‐truncation is that large values of are sampled with a relatively small probability.
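The downweighting of large values under right‐truncation is just as easy to reproduce. In this Python sketch (again under distributions assumed purely for illustration), a time is recorded only when the event occurs before the subject's right‐truncation time, and the observed mean falls below the true one.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# True incubation times ~ Exp(1), so E[X] = 1; right-truncation times ~ Uniform(0, 3).
x = rng.exponential(1.0, n)
v = rng.uniform(0.0, 3.0, n)

# Only events occurring before the right-truncation time are recorded.
x_obs = x[x <= v]

# Large incubation times enter the sample with small probability,
# so the observed mean lies clearly below the true mean of 1.
print(x_obs.mean())
```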
At this point, the reader may be curious about the difference between truncation and censoring. Right‐censoring is a very well known phenomenon in Survival Analysis and reliability studies, among other fields. It happens when the follow‐up of a given individual stops before the event of interest has taken place. In such a case, the observer only knows that the target variable is larger than the registered value, which is referred to as the censoring time. A sample made up of real and censored values is typically analysed by the Kaplan–Meier estimator (Kaplan and Meier, 1958), which corrects for the fact that some of the recorded values for are smaller than the true ones. With truncated data, every value in the sample corresponds to a true observation of ; however, the distribution of the observed values may be shifted with respect to the true one due to the truncation event. This difference between truncation and censoring suggests that specific methods to estimate the target distribution under random truncation should be employed. Indeed, Woodroofe (1985) provides a deep analysis of one‐sided truncation, introducing the original idea of Lynden‐Bell (1971) as a nonparametric maximum likelihood estimator (NPMLE) of the probability distribution in that setting. The estimator in Woodroofe (1985) is a particular case of the estimator corresponding to doubly truncated data, on which this book is focused.
A variable of interest is said to be doubly truncated by a couple of random variables if the observation of is possible only when occurs. In such a case, and are called left‐ and right‐truncation variables respectively. Double truncation reduces to left‐truncation when degenerates at , while it corresponds to right‐truncation when . This book is focused on the problem of estimating the distribution of , and other related curves, from a set of iid triplets with the distribution of given .
There are several scenarios where double truncation appears in practice. One setting leading to double truncation is that of interval sampling, where the sample is restricted to the individuals with event between two specific dates and (Zhu and Wang, 2012). Then, the right‐truncation time is , where denotes the date of onset for the time‐to‐event, and the left‐truncation time is , where is the interval width. The Childhood Cancer Data in Section 1.4.1 is an example of data obtained through interval sampling.
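The interval‐sampling construction can be written out explicitly. In the sketch below (Python, with the dates and distributions chosen arbitrarily for illustration), onset dates and lifetimes are drawn, and the condition "event date between the two sampling dates" is re‐expressed through the induced left‐ and right‐truncation times, whose difference is the fixed interval width.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Sampling interval [d1, d2] on the calendar scale (arbitrary choice).
d1, d2 = 0.0, 5.0
tau = d2 - d1                  # interval width

e = rng.uniform(-10.0, d2, n)  # dates of onset (illustrative distribution)
x = rng.exponential(1.0, n)    # times from onset to the event

v = d2 - e                     # induced right-truncation time
u = v - tau                    # induced left-truncation time

# Observed if and only if the event date e + x lies in [d1, d2],
# which is exactly the double-truncation condition u <= x <= v.
keep = (u <= x) & (x <= v)
x_obs = x[keep]
print(len(x_obs), keep.mean())
```

Note that every observed pair satisfies v = u + tau, so the truncation couple is concentrated on a line, as discussed next.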
With interval sampling, the variable is degenerate at . This occurs in other sampling schemes too, in which and are certain subject‐specific event dates. An illustrative example is given by the Parkinson's Disease Data (see Section 1.4.5), where is the individual age at blood sampling. When is constant, the couple falls on a line, and its joint density does not exist, even though the truncating variables may be continuous.
In other situations, the truncating variables and are not linked through the linear equation . For example, and could represent random observation limits beyond which the variable of interest cannot be sampled or detected. Situations like this occur, for example, in Astronomy, as illustrated in Section 1.4.4.
With random double truncation, both large and small values of are observed in principle with a relatively small probability. However, the real observational bias for varies from application to application, depending on the joint distribution of . We will see, for example, that the probability of sampling a value , namely , may be roughly constant, inducing no observational bias; or that it may be roughly decreasing, indicating the dominance of the right‐truncation bias relative to the left‐truncation bias.
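The sampling probability just mentioned can be probed by simulation for any given truncation model. The Python sketch below uses one assumed model (left‐truncation time uniform, right‐truncation time equal to it plus a fixed width; a purely illustrative choice) and estimates the probability of observing each value on a small grid; in this model the probability rises and then falls, so neither the left‐ nor the right‐truncation bias dominates uniformly.

```python
import numpy as np

rng = np.random.default_rng(5)
m = 200_000

# Assumed truncation model: U ~ Uniform(0, 2) and V = U + 1.5.
u = rng.uniform(0.0, 2.0, m)
v = u + 1.5

# Monte Carlo estimate of the sampling probability P(U <= x <= V)
# on a grid of x values; it rises toward the middle and falls at the ends.
for x in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    g = np.mean((u <= x) & (x <= v))
    print(x, round(g, 3))
```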
Another issue of relevance is the identifiability of the distribution of . Intuitively, it is clear that with doubly truncated data it is only possible to estimate the distribution of conditional on , where and denote, respectively, the lower and upper endpoints of the supports of and (see Chapter 2 for details). This may have important practical consequences, as we will see. On the other hand, in applications with doubly truncated survival data the estimates correspond to the susceptible population, for which the terminal event of interest is certain to occur. This is in contrast to the standard analysis of survival times, where a portion of the individuals may belong to the so‐called cured fraction, or immunes. This should be taken into account when interpreting the results from the analysis.
