Reliability and survival analysis are important applications of stochastic mathematics (probability, statistics and stochastic processes) that are usually covered separately in spite of the similarity of the mathematical theory involved. This title aims to redress this situation. It includes 21 chapters divided into four parts: survival analysis, reliability, quality of life, and related topics. Many of these chapters were presented at the European Seminar on Mathematical Methods for Survival Analysis, Reliability and Quality of Life in 2006.
Number of pages: 492
Year of publication: 2013
Contents
Preface
PART I
Chapter 1. Model Selection for Additive Regression in the Presence of Right-Censoring
1.1. Introduction
1.2. Assumptions on the model and the collection of approximation spaces
1.3. The estimation method
1.4. Main result for the adaptive mean-square estimator
1.5. Practical implementation
1.6. Bibliography
Chapter 2. Non-parametric Estimation of Conditional Probabilities, Means and Quantiles under Bias Sampling
2.1. Introduction
2.2. Non-parametric estimation of p
2.3. Bias depending on the value of Y
2.4. Bias due to truncation on X
2.5. Truncation of a response variable in a non-parametric regression model
2.6. Double censoring of a response variable in a non-parametric model
2.7. Other truncation and censoring of Y in a non-parametric model
2.8. Observation by interval
2.9. Bibliography
Chapter 3. Inference in Transformation Models for Arbitrarily Censored and Truncated Data
3.1. Introduction
3.2. Non-parametric estimation of the survival function S
3.3. Semi-parametric estimation of the survival function S
3.4. Simulations
3.5. Bibliography
Chapter 4. Introduction of Within-area Risk Factor Distribution in Ecological Poisson Models
4.1. Introduction
4.2. Modeling framework
4.3. Simulation framework
4.4. Results
4.5. Discussion
4.6. Bibliography
Chapter 5. Semi-Markov Processes and Usefulness in Medicine
5.1. Introduction
5.2. Methods
5.3. An application to HIV control
5.4. An application to breast cancer
5.5. Discussion
5.6. Bibliography
Chapter 6. Bivariate Cox Models
6.1. Introduction
6.2. A dependence model for duration data
6.3. Some useful facts in bivariate dependence
6.4. Coherence
6.5. Covariates and estimation
6.6. Application: regression of Spearman’s rho on covariates
6.7. Bibliography
Chapter 7. Non-parametric Estimation of a Class of Survival Functionals
7.1. Introduction
7.2. Weighted local polynomial estimates
7.3. Consistency of local polynomial fitting estimators
7.4. Automatic selection of the smoothing parameter
7.5. Bibliography
Chapter 8. Approximate Likelihood in Survival Models
8.1. Introduction
8.2. Likelihood in proportional hazard models
8.3. Likelihood in parametric models
8.4. Profile likelihood
8.5. Statistical arguments
8.6. Bibliography
PART II
Chapter 9. Cox Regression with Missing Values of a Covariate having a Non-proportional Effect on Risk of Failure
9.1. Introduction
9.2. Estimation in the Cox model with missing covariate values: a short review
9.3. Estimation procedure in the stratified Cox model with missing stratum indicator values
9.4. Asymptotic theory
9.5. A simulation study
9.6. Discussion
9.7. Bibliography
Chapter 10. Exact Bayesian Variable Sampling Plans for Exponential Distribution under Type-I Censoring
10.1. Introduction
10.2. Proposed sampling plan and Bayes risk
10.3. Numerical examples and comparison
10.4. Bibliography
Chapter 11. Reliability of Stochastic Dynamical Systems Applied to Fatigue Crack Growth Modeling
11.1. Introduction
11.2. Stochastic dynamical systems with jump Markov process
11.3. Estimation
11.4. Numerical application
11.5. Conclusion
11.6. Bibliography
Chapter 12. Statistical Analysis of a Redundant System with One Standby Unit
12.1. Introduction
12.2. The models
12.3. The tests
12.4. Limit distribution of the test statistics
12.5. Bibliography
Chapter 13. A Modified Chi-squared Goodness-of-fit Test for the Three-parameter Weibull Distribution and its Applications in Reliability
13.1. Introduction
13.2. Parameter estimation and modified chi-squared tests
13.3. Power estimation
13.4. Neyman-Pearson classes
13.5. Discussion
13.6. Conclusion
13.7. Appendix
13.8. Bibliography
Chapter 14. Accelerated Life Testing when the Hazard Rate Function has Cup Shape
14.1. Introduction
14.2. Estimation in the AFT-GW model
14.3. Properties of estimators: simulation results for the AFT-GW model
14.4. Some remarks on the second plan of experiments
14.5. Conclusion
14.6. Appendix
14.7. Bibliography
Chapter 15. Point Processes in Software Reliability
15.1. Introduction
15.2. Basic concepts for repairable systems
15.3. Self-exciting point processes and black-box models
15.4. White-box models and Markovian arrival processes
15.5. Bibliography
PART III
Chapter 16. Likelihood Inference for the Latent Markov Rasch Model
16.1. Introduction
16.2. Latent class Rasch model
16.3. Latent Markov Rasch model
16.4. Likelihood inference for the latent Markov Rasch model
16.5. An application
16.6. Possible extensions
16.7. Conclusions
16.8. Bibliography
Chapter 17. Selection of Items Fitting a Rasch Model
17.1. Introduction
17.2. Notations and assumptions
17.3. The Rasch model and the multidimensional marginally sufficient Rasch model
17.4. The Raschfit procedure
17.5. A fast version of Raschfit
17.6. A small set of simulations to compare Raschfit and Raschfit-fast
17.7. A large set of simulations to compare Raschfit-fast, MSP and HCA/CCPROX
17.8. The Stata module “Raschfit”
17.9. Conclusion
17.10. Bibliography
Chapter 18. Analysis of Longitudinal HrQoL using Latent Regression in the Context of Rasch Modeling
18.1. Introduction
18.2. Global models for longitudinal data analysis
18.3. A latent regression Rasch model for longitudinal data analysis
18.4. Case study: longitudinal HrQoL of terminal cancer patients
18.5. Concluding remarks
18.6. Bibliography
Chapter 19. Empirical Internal Validation and Analysis of a Quality of Life Instrument in French Diabetic Patients during an Educational Intervention
19.1. Introduction
19.2. Material and methods
19.3. Results
19.4. Discussion
19.5. Conclusion
19.6. Bibliography
19.7. Appendices
PART IV
Chapter 20. Deterministic Modeling of the Size of the HIV/AIDS Epidemic in Cuba
20.1. Introduction
20.2. The models
20.3. The underreporting rate
20.4. Fitting the models to Cuban data
20.5. Discussion and concluding remarks
20.6. Bibliography
Chapter 21. Some Probabilistic Models Useful in Sport Sciences
21.1. Introduction
21.2. Sport jury analysis: the Gauss-Markov approach
21.3. Sport performance analysis: the fatigue and fitness approach
21.4. Sport equipment analysis: the fuzzy subset approach
21.5. Sport duel issue analysis: the logistic simulation approach
21.6. Sport epidemiology analysis: the accelerated degradation approach
21.7. Conclusion
21.8. Bibliography
Appendices
A. European Seminar: Some Figures
B. Contributors
Index
First published in Great Britain and the United States in 2008 by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd
John Wiley & Sons, Inc.
6 Fitzroy Square
111 River Street
London W1T 5DX
Hoboken, NJ 07030
UK
USA
www.iste.co.uk
www.wiley.com
© ISTE Ltd, 2008
The rights of Catherine Huber, Nikolaos Limnios, Mounir Mesbah and Mikhail Nikulin to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Cataloging-in-Publication Data
Mathematical methods in survival analysis, reliability and quality of life / edited by Catherine Huber … [et al.].
p. cm.
Includes bibliographical references and index.
ISBN: 978-1-84821-010-3
1. Failure time data analysis. 2. Survival analysis (Biometry) I. Huber, Catherine.
QA276.M342 2008
519.5′46--dc22
2007046232
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN: 978-1-84821-010-3
Preface
The European Seminar on Mathematical Methods for Survival Analysis, Reliability and Quality of Life was created in 1997 by C. HUBER, N. LIMNIOS, M. NIKULIN and M. MESBAH and, thanks to our respective laboratories and to the supporting universities (see list below), it has run regularly for the last 10 years. 2007 is a special year, as our European Seminar celebrates its 10th birthday. This seminar aims to review recent research in the fields of survival analysis, reliability, quality of life and related topics, from both statistical and probabilistic points of view. Three or four sessions take place every year at the participating universities.
Besides these regular annual sessions, the European seminar has supported many international conferences and workshops in France and abroad: for instance, in 2000, GOF2000 (Goodness Of Fit) in Paris and MMR2000 (Mathematical Methods in Reliability) in Bordeaux; in 2004, an international workshop on “Semi-parametric Models” organized at Mont Saint Michel; and, more recently, Biostat2006 in Cyprus. More than 14 international workshops, 26 seminar sessions and about 150 talks were organized during the last ten years (see Appendix A).
Reliability and survival analysis are important applications of stochastic mathematics (probability, statistics and stochastic processes) that have usually been treated separately in spite of the similarity of the mathematical theory involved. Reliability is oriented towards the study of technical systems (without excluding the human factor), while survival analysis and quality of life are oriented towards biological and medical studies (without excluding the technical factor). The lifetime T of a technical system, of a human patient or of a bacterium is a non-negative random variable, and the same function of the time t, Prob(T > t), is called the reliability function, denoted R(t), in reliability theory and the survival function, denoted S(t), in medical applications. Nevertheless, even if the function under investigation is the same, the objectives are not always identical. In the field of reliability, most of the time, systems are ergodic and large, which is not the case in survival analysis. Thus, techniques developed to evaluate or estimate the reliability/survival function are not always based on the same fundamental results. However, the two fields share several techniques: Cox models, degradation models, multi-state approaches (i.e., Markov and semi-Markov models), point processes, etc.
While it is recognized that quality of life is ultimately as important as quantity of life (survival time), efforts to implement quality of life measurements often fail. Statistical methods for analyzing time-to-event data are now well established and largely agreed upon by statisticians, clinicians and industrial professionals. Unfortunately, in the quality of life field, there is no standard instrument to measure it, no standard methodology to validate measurement instruments (questionnaires) and no standard statistical methodology to analyze the measurements obtained. Specific development and application of modern psychometric measurement models (latent variable models including the Rasch model) and the connection with utility theory are important issues. A more recent issue is the joint analysis of the latent quality of life and external variables such as treatment, evolution (longitudinal analysis) and/or survival time.
The present book includes 21 chapters divided into four parts:
I. Survival analysis
II. Reliability
III. Quality of life
IV. Related topics
We would like to especially thank Julien Chiquet for his very efficient technical support on this book, and ISTE Ltd for their ongoing support.
Catherine HUBER,
Nikolaos LIMNIOS,
Mounir MESBAH,
Mikhail NIKULIN.
Statistical tools for handling regression problems when the response is censored have been developed in the last two decades. The response was often assumed to be a linear function of the covariates, but non-parametric regression models provide a very flexible method when a general relationship between covariates and response is first to be explored. In recent years, a vast literature has been devoted to non-parametric regression estimators for completely observed data. However, few methods exist under random censoring. First, Buckley and James [BUC 79] and Koul, Susarla and Van Ryzin [KOU 81] among others introduced the original idea of transforming the data to take the censoring into account, for linear regression curves. Then, Zheng [ZHE 88] proposed various classes of unbiased transformations. Dabrowska [DAB 87] and Zheng [ZHE 88] applied non-parametric methods for estimating the univariate regression curve. Later, Fan and Gijbels [FAN 94] considered a local linear approximation for the data transformed in the same way by using a variable bandwidth adaptive to the sparsity of the design points. Györfi et al. [GYÖ 02] also studied the consistency of generalized Stone’s regression estimators in the censored case. Heuchenne and Van Keilegom [HEU 05] considered a nonlinear semi-parametric regression model with censored data. Park [PAR 04] extended a procedure suggested in Gross and Lai [GRO 96] to a general non-parametric model in the presence of left-truncation and right-censoring, by using B-spline developments. Recently, Kohler et al. [KOH 03] proposed an adaptive mean-square estimator built with polynomial splines.
(1.1)
The method consists of building projection estimators of the d components rT,1, …, rT,d on different projection spaces. The strategy is based on a standard mean-square contrast as in Baraud [BAR 00] together with an optimized version of the data transformation proposed by Fan and Gijbels [FAN 94]. The algorithm is explained and performed through several empirical trials and real data, and in particular improved in the bivariate setting. The model and the assumptions are presented in section 1.2, the method is described in section 1.3 and the main theoretical result is given in section 1.4. Finally, practical implementation is to be found in section 1.5 together with several examples of various dimensions.
We consider the following censoring mechanism. Let C1, C2, …, Cn be n censoring times independent of and consequently independent of as well. The and the couples (Zi, δi)s are observed where
δi indicates if the observed time Zi is a lifetime or a censoring time both occurring in the interval [0, T].
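As an illustration of this observation scheme, here is a minimal Python sketch. It assumes the usual right-censoring convention, Zi = min(Yi, Ci) and δi = 1 when Yi ≤ Ci, and it ignores the restriction to [0, T]. The function name `censor` and the simulated response and censoring distributions are hypothetical choices made for the example, not taken from the chapter.

```python
import numpy as np

def censor(y, c):
    """Return observed times Z = min(Y, C) and indicators delta = 1{Y <= C}."""
    y = np.asarray(y, dtype=float)
    c = np.asarray(c, dtype=float)
    z = np.minimum(y, c)
    delta = (y <= c).astype(int)   # 1: lifetime observed, 0: censoring time observed
    return z, delta

# Hypothetical simulated design: the censoring times are drawn independently
# of the covariate and of the response, as the censoring mechanism requires.
rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(0.0, 1.0, size=n)
y = np.exp(x + 0.2 * rng.standard_normal(n))   # positive response (assumed form)
c = rng.exponential(scale=5.0, size=n)          # censoring times
z, delta = censor(y, c)
censoring_rate = 1.0 - delta.mean()             # proportion of censored observations
```

Only `x`, `z` and `delta` would be available to the statistician; `y` and `c` are kept here solely to generate the data.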
Now, let G(·) be the cumulative distribution function (cdf) of the Cis and FY be the marginal cdf of the Yis with and being the corresponding survival functions. We suppose moreover that:
The distribution functions of the Yis and Cis are R+-supported.
The projection spaces used in our theoretical results are described hereafter. For the sake of simplicity, we focus on the polynomial spaces. Note that it is possible to use trigonometric or wavelet spaces. Moreover, in practice the collection among which the algorithm makes its choice is much more complicated: the degrees on each bin are selected by the algorithm and the bins do not necessarily have the same size. See Comte and Rozenholc [COM 04] for a description of the algorithm.
The spaces in collection [P] satisfy the following property:
(1.2)
where , for t in L2([0, 1]).
Moreover, for the results concerning the adaptive estimators, we need the following additional assumption:
is a collection of nested models, and we denote by the space belonging to the collection, such that . We denote by Nn the dimension of .
Assumption is satisfied with for collection [P]. Moreover, [DP] satisfies .
In order to estimate the additive regression function, the approximation spaces can be described as
where is chosen as a piecewise polynomial space with dimension As in the univariate case, this particular collection of multivariate spaces also satisfies and by taking in inequalities (1.2).
As usual in regression problems, a mean-square contrast can lead to an estimator of rT. However, we need first to transform the data to take the censoring mechanism into account.
We consider the following transformation of the censored data
(1.3)
The main interest of the transformation is the following property: . Indeed
(1.4)
with
(1.5)
In all cases, the transformed data are unobservable because they depend on the unknown G, so we need a relevant estimator of G. We propose taking the Kaplan-Meier [KAP 58] product-limit estimator , modified in the way suggested by Lo et al. [LO 89], and defined by
(1.6)
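A product-limit estimator of the censoring survival function P(C > t) can be sketched as follows. Here the censoring times play the role of the events, so the factors are driven by 1 − δ. The `modified=True` branch replaces the factor (n − j)/(n − j + 1) with (n − j + 1)/(n − j + 2), which keeps the estimate strictly positive. That this matches definition (1.6) exactly is an assumption, as are the function name and the no-ties simplification.

```python
import numpy as np

def km_censoring_survival(z, delta, modified=True):
    """Product-limit estimate of G_bar(t) = P(C > t); censorings (delta == 0) are the events.

    Returns a function evaluating the right-continuous step estimate at t.
    Assumes there are no ties among the observed times.
    """
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    order = np.argsort(z)
    z_sorted = z[order]
    cens = 1 - delta[order]            # 1 when the ordered observation is a censoring time
    n = len(z_sorted)
    at_risk = n - np.arange(n)         # n - j + 1 for rank j = 1, ..., n
    if modified:
        # assumed modified form: never reaches zero, lower bound 1 / (n + 1)
        factors = (at_risk / (at_risk + 1.0)) ** cens
    else:
        # classical Kaplan-Meier factor
        factors = 1.0 - cens / at_risk
    steps = np.concatenate(([1.0], np.cumprod(factors)))

    def g_bar(t):
        idx = np.searchsorted(z_sorted, np.asarray(t, dtype=float), side="right")
        return steps[idx]

    return g_bar
```

A strictly positive estimate matters because the transformed data divide by the estimated survival function of the censoring times.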
Finally, by substituting G by its estimator , we obtain the empirical version of the transformed data:
(1.7)
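To illustrate why such transformations are useful, the sketch below does not reproduce (1.7). It uses instead the simpler transformation of Koul, Susarla and Van Ryzin [KOU 81], δZ divided by the censoring survival function evaluated at Z, with that function assumed known. The simulated distributions and the name `ksv_transform` are hypothetical. The point is only that the transformed data have the same mean as the unobserved responses.

```python
import numpy as np

def ksv_transform(z, delta, g_bar):
    """Synthetic responses delta * Z / G_bar(Z) (Koul-Susarla-Van Ryzin type)."""
    z = np.asarray(z, dtype=float)
    return np.asarray(delta) * z / g_bar(z)

rng = np.random.default_rng(1)
n = 200_000
y = rng.uniform(0.0, 1.0, size=n)          # hypothetical bounded lifetimes
c = rng.exponential(scale=2.0, size=n)     # censoring with G_bar(t) = exp(-t / 2)
z = np.minimum(y, c)
delta = (y <= c).astype(int)

# G_bar is taken as known here; in practice it is replaced by an estimator.
y_star = ksv_transform(z, delta, lambda t: np.exp(-t / 2.0))
print(y.mean(), y_star.mean())   # the two sample means should be close
```

Censored observations are set to zero and uncensored ones are inflated, so the transformed sample is more dispersed than the original one. This is the price paid for unbiasedness, and it is the motivation for optimized versions of the transformation.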
The mean-square strategy leads us to study the following contrast:
(1.8)
In this context, it is useful to consider the empirical norm associated with the design
Here we define
(1.9)
The function may not be easy to define but the vector is always well defined since it is the orthogonal projection in Rn of vector onto the subspace of Rn defined by . This explains why the empirical norms are particularly suitable for the mean-square contrast.
Next, model selection is performed by selecting the model such that:
(1.10)
where we have to determine the relevant form of pen(·) for to be an adaptive estimator of r.
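The selection rule can be sketched in a deliberately simplified setting: regular piecewise-constant models on [0, 1], the ordinary residual mean square as empirical contrast, and a penalty proportional to the dimension. The penalty form κσ²D/n, the default value of κ and all function names are assumptions made for illustration; they are not the penalty (1.11) of the chapter.

```python
import numpy as np

def fit_piecewise_constant(x, y, n_bins):
    """Least-squares fit on regular bins of [0, 1]: the fitted value is the bin mean."""
    bins = np.minimum((x * n_bins).astype(int), n_bins - 1)
    sums = np.bincount(bins, weights=y, minlength=n_bins)
    counts = np.bincount(bins, minlength=n_bins)
    means = np.divide(sums, counts, out=np.zeros(n_bins), where=counts > 0)
    return means[bins]

def select_dimension(x, y, max_bins, kappa=2.0, sigma2=1.0):
    """Return the number of bins minimising empirical contrast + penalty."""
    n = len(y)
    best_d, best_crit = None, np.inf
    for d in range(1, max_bins + 1):
        fitted = fit_piecewise_constant(x, y, d)
        contrast = np.mean((y - fitted) ** 2)
        crit = contrast + kappa * sigma2 * d / n   # hypothetical penalty form
        if crit < best_crit:
            best_d, best_crit = d, crit
    return best_d

# Hypothetical example: a step function observed with noise.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=2000)
y = np.where(x < 0.5, 1.0, 3.0) + 0.5 * rng.standard_normal(2000)
d_hat = select_dimension(x, y, max_bins=10, kappa=2.0, sigma2=0.25)
```

Increasing the dimension always lowers the contrast, so the penalty is what prevents the procedure from choosing the largest model.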
The automatic selection of the projection space can be performed via penalization and the following theoretical result is proved in Brunel and Comte [BRU 06], for the particular choice of i.e. for a contrast γn defined by (1.8) with variables .
Theorem 1.1. Assume that the common density f of the covariate vector is such that and that the Yis admit moments of order 8. Consider the collection of models [DP] with ) for [DP] where Kϕ is a (known) constant depending on the basis. Let be the adaptive estimator defined by (1.8) with and (1.10) with
(1.11)
where κ is a numerical constant. Then
(1.12)
where rm is the orthogonal projection of rT onto Sm, and C and C′ are constants depending on , cG and .
The theoretical penalty pen(m) involves constants having different status. Let us recall that Φ0 is known ( where r is the degree of the piecewise polynomials). The unknown terms therein are and the expectation and they have to be replaced by estimators:
(1.13)
where is an estimator of the lower bound of f. Lastly, the constant κ is a numerical constant, independent of the data and for which a minimal suitable value exists. It has to be calibrated by simulation experiments: this has been done for regression problems in Comte and Rozenholc [COM 04]. It can be proved that the estimator obtained when substituting the random penalization (1.13) to the theoretical one (1.11) still satisfies inequality (1.12) under the assumptions of the theorem.
The right-hand side of inequality (1.12) shows that a non-asymptotic trade-off is automatically performed between an unavoidable squared bias term and a term of the order of the variance, pen(m). The non-asymptotic properties of the estimation algorithm can be appreciated when the selected model has a small dimension but still gives a good fit between the true function and the estimate.
1) We obtain data: .
2) We apply the transformation defined by (1.7) for defined by (1.4), to the data (Z, δ) and build new variables for the regression , the δis being unchanged (see Figure 1.1 with both initial and transformed data where the effect of the preliminary transformation is in evidence).
3) The model must be subject to the constraint , otherwise for , the estimated functions are and . In practice, a good strategy may be to perform the regression algorithm of the centered data where . The output of the regression of the on the is a vector of estimations on a space selected by contrast penalization.
4) The new variables for the regression are taken as the , and the output of the regression of the on the is vector .
The mean-square estimation algorithm used here is the one originally implemented by Comte and Rozenholc ([COM 02], [COM 04]), which allows in addition variable degrees of the piecewise polynomials on each bin.
We present in the following some univariate examples on which the procedure has first been tested.
Example 1. First we consider
for iid and . We tested different values of c, which imply different censoring rates. Figure 1.1 illustrates that whatever the proportion of censored data, the estimate is very good. The scatter plots are here to compare the original data to the transformed data by given in (1.7) with given by (1.4).
Example 2. Our second example studies another function r:
for iid Cis, εis and Uis with and . The results are illustrated in Figure 1.2, and the scatter plots allow here the comparison between the censored and uncensored data once the transformation has been performed.
Example 3. The third model is borrowed from Fan and Gijbels [FAN 94] and is described by
and with
Example 4. We also considered the classical “Stanford Heart Transplant Data”, collected from October 1967 to February 1980: of the 184 patients admitted to the heart transplant program, 157 had “complete tissue typing” and 55 were censored. The data were originally studied by Miller and Halpern [MIL 82] and later by Fan and Gijbels [FAN 94] among others. Figure 1.4 illustrates our results with this data set. The estimated function seems to show that there is an optimal age for a heart transplant, which is about 35 years.
In this section, we consider some bivariate examples, in order to illustrate that the sequential procedure works in this setting.
Example 5. The first bivariate example is inspired by the function r1 considered in Fan and Gijbels [FAN 94] associated with another one.
and . The censoring variables are iid with exponential distribution. Note that we cannot estimate r1 and r2 but and , for identifiability reasons. Note that . We keep the model in this form for comparison with the univariate case. We correct the means to calculate the errors for the plots.
We use a “sequential” method that consists of estimating with the regression on the and then use the residuals as new transformed data for the regression on the .
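The sequential idea can be sketched as follows, with plain polynomial least squares standing in for the adaptive piecewise-polynomial regression of the chapter. The polynomial degree, the simulated model and the function name are assumptions for illustration only. The responses are centred first, in line with the identifiability remark above.

```python
import numpy as np

def sequential_additive_fit(x1, x2, y, degree=3):
    """Two-step fit of y ~ mean + r1(x1) + r2(x2) using polynomial least squares."""
    mean_y = y.mean()
    # Step 1: regress the centred responses on the first covariate.
    coef1 = np.polyfit(x1, y - mean_y, degree)
    # Step 2: regress the residuals of step 1 on the second covariate.
    resid = y - mean_y - np.polyval(coef1, x1)
    coef2 = np.polyfit(x2, resid, degree)

    def predict(u1, u2):
        return mean_y + np.polyval(coef1, u1) + np.polyval(coef2, u2)

    return predict

# Hypothetical additive model with independent covariates and no censoring.
rng = np.random.default_rng(3)
n = 5000
x1 = rng.uniform(0.0, 1.0, size=n)
x2 = rng.uniform(0.0, 1.0, size=n)
truth = x1 ** 2 + 2.0 * x2
y = truth + 0.05 * rng.standard_normal(n)
predict = sequential_additive_fit(x1, x2, y)
mse = np.mean((predict(x1, x2) - truth) ** 2)
```

With censored data, `y` would be replaced by the transformed responses. A single pass is reasonable here because the two covariates are independent; with correlated covariates the component estimated first would absorb part of the second.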
We compare the results obtained for this model with the result of a univariate model , generated with the same function r1 and the same σ as in Example 5. The censoring variables are taken as iid exponential distributions, with parameters adjusted to give the same proportion of censored variables for comparison. Tables 1.1 and 1.2 summarize the results of the mean squared errors (MSE) calculated for 100 simulated samples with five different sizes. It seems that estimating two functions instead of one does not greatly deteriorate the estimation of the first one and gives good results for the second one. A visualization of the orders of the MSE is given in Figure 1.5.
Example 6. Primary Biliary Cirrhosis (PBC) data. This data set is described in detail in Fleming and Harrington ([FLE 91], p.2, Chapter 4) and is also studied by Fan and Gijbels [FAN 94]. The Mayo Clinic collected data on PBC, a rare but fatal chronic liver disease. From January 1974 to May 1984, 424 patients were registered, of whom 312 participated in the randomized trial. The response variable is the logarithm of the time (in days) between registration and death, liver transplantation or the time of the study analysis (July 1986). Among the 312 patients, 187 cases were censored. The covariates are, first, the age and, second, the logarithm of bilirubin, which is known to be a prognostic factor. The estimated curves are given in Figure 1.6.
. The model is thus
The functions estimated by the algorithm are where and .
[BAR 00] BARAUD Y., “Model selection for regression on a fixed design”, Probab. Theory Related Fields, vol. 117, num. 4, p. 467–493, 2000.
[BRU 06] BRUNEL E., COMTE F., “Adaptive nonparametric regression estimation in presence of right-censoring”, Math. Methods Statist., vol. 15, num. 3, p. 233–255, 2006.
[BUC 79] BUCKLEY J., JAMES I., “Linear regression with censored data”, Biometrika, vol. 66, num. 3, p. 429–436, 1979.
[COM 02] COMTE F., ROZENHOLC Y., “Adaptive estimation of mean and volatility functions in (auto-)regressive models”, Stochastic Process. Appl., vol. 97, num. 1, p. 111–145, 2002.
[COM 04] COMTE F., ROZENHOLC Y., “A new algorithm for fixed design regression and denoising”, Ann. Inst. Statist. Math., vol. 56, num. 3, p. 449–473, 2004.
[DAB 87] DABROWSKA D. M., “Nonparametric regression with censored survival time data”, Scand. J. Statist., vol. 14, num. 3, p. 181–197, 1987.
[FAN 94] FAN J., GIJBELS I., “Censored regression: local linear approximations and their applications”, J. Amer. Statist. Assoc., vol. 89, num. 426, p. 560–570, 1994.
[FLE 91] FLEMING T. R., HARRINGTON D. P., Counting Processes and Survival Analysis, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, John Wiley & Sons Inc., New York, 1991.
[GRO 96] GROSS S. T., LAI T. L., “Nonparametric estimation and regression analysis with left-truncated and right-censored data”, J. Amer. Statist. Assoc., vol. 91, num. 435, p. 1166–1180, 1996.
[GYÖ 02] GYÖRFI L., KOHLER M., KRZYŻAK A., WALK H., A Distribution-free Theory of Nonparametric Regression, Springer Series in Statistics, Springer-Verlag, New York, 2002.
[HEU 05] HEUCHENNE C., VAN KEILEGOM I., “Nonlinear regression with censored data”, Discussion paper, 2005, 0512, Institut de Statistique, Université catholique de Louvain.
[KAP 58] KAPLAN E. L., MEIER P., “Nonparametric estimation from incomplete observations”, J. Amer. Statist. Assoc., vol. 53, p. 457–481, 1958.
[KOH 03] KOHLER M., KUL S., MÁTHÉ K., “Least squares estimates for censored regression”, Preprint, http://www.mathematik.uni-stuttgart.de/mathA/lst3/kohler/hfm-pub-en.html, 2003.
[KOU 81] KOUL H., SUSARLA V., VAN RYZIN J., “Regression analysis with randomly right-censored data”, Ann. Statist., vol. 9, num. 6, p. 1276–1288, 1981.
[LEU 87] LEURGANS S., “Linear models, random censoring and synthetic data”, Biometrika, vol. 74, num. 2, p. 301–309, 1987.
[LO 89] LO S. H., MACK Y. P., WANG J. L., “Density and hazard rate estimation for censored data via strong representation of the Kaplan-Meier estimator”, Probab. Theory Related Fields, vol. 80, num. 3, p. 461–473, 1989.
[MIL 82] MILLER R., HALPERN J., “Regression with censored data”, Biometrika, vol. 69, num. 3, p. 521–531, 1982.
[PAR 04] PARK J., “Optimal global rate of convergence in nonparametric regression with left-truncated and right-censored data”, J. Multivariate Anal., vol. 89, num. 1, p. 70–86, 2004.
[STO 82] STONE C. J., “Optimal global rates of convergence for nonparametric regression”, Ann. Statist., vol. 10, num. 4, p. 1040–1053, 1982.
[ZHE 88] ZHENG Z. K., “Strong consistency of nonparametric regression estimates with censored data”, J. Math. Res. Exposition, vol. 8, num. 2, p. 307–313, 1988.
1 Chapter written by Elodie BRUNEL and Fabienne COMTE.
Models for estimating toxicity thresholds or performing diagnostic tests for the detection of diseases are still in development. Let be a sample of the variable set (X, Y), where Y is the indicator that an event occurs and X is an explanatory variable. Conditionally on X, Y follows a Bernoulli distribution with parameter . The function p is supposed to be strictly monotone on the support IX of X. A usual example is that of a response variable Y to a dose X or to an exposure time X. The variable X may be observed at fixed values xi, i ∈ {1,…, m} on a regular grid {1/m,…, 1} or at random values on a real interval corresponding to the values of a continuous process (Xt)t≤T at fixed or random times tj, j ≤ n.
The observations may be biased due to X or Y and the estimators must be corrected. We present here biased designs for this model and for a continuous bivariate set (X, Y), and discuss the identifiability of the models, non-parametric estimation of the distribution functions and means, and efficiency of the estimators. Linear models with a multidimensional variable have been widely studied and the results given here may be extended to that case.
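For unbiased observations with a randomly sampled X, a standard non-parametric estimate of the conditional probability of the event given X = x is a kernel-weighted average of the indicators Yi. The sketch below uses a Nadaraya-Watson form with a Gaussian kernel. The kernel, the bandwidth, the simulated dose-response curve p(x) = x and the function name are assumptions for illustration, and no bias correction is applied.

```python
import numpy as np

def nw_probability(x_obs, y_obs, x_eval, bandwidth):
    """Nadaraya-Watson estimate of p(x) = P(Y = 1 | X = x) with a Gaussian kernel."""
    x_obs = np.asarray(x_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    u = (x_eval[:, None] - x_obs[None, :]) / bandwidth
    w = np.exp(-0.5 * u ** 2)                 # kernel weights, one row per evaluation point
    return (w @ y_obs) / w.sum(axis=1)        # weighted proportion of events

# Hypothetical dose-response data with a strictly increasing p(x) = x.
rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=5000)
y = rng.binomial(1, x)
p_hat = nw_probability(x, y, [0.25, 0.5, 0.75], bandwidth=0.05)
```

Because the estimate is a weighted average of zeros and ones, it always lies in [0, 1], although it is not guaranteed to be monotone in x.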
In a discrete sampling design with several independent observations for each value xj of X, the likelihood is written
Let
The probability p(x) is deduced from θ and π(x) by the relation
and the bias sampling is
Let γ be the inverse of the proportion of cases in the population,
(2.1)
Under the bias sampling,
γ is modified by the scale parameter η: it becomes .
The product θγ may be directly estimated from the observed Bernoulli variables Yi by the maximization of the likelihood
hence,
In a discrete sampling design with several independent observations for fixed values xj of the variable X, the likelihood is
For random observations of variable X, or for fixed observations without replications, α(x) is estimated by
If θ is known, non-parametric estimators of p are deduced as
and the conditional probabilities of sampling, given the status value, are
Consider (X, Y), a two-dimensional variable in a left-truncated transformation model: let Y denote a response to a continuous exposure variable X, up to a variable of individual variations independent of X,
with distribution function . The distribution function of Y conditionally on X is defined by
(2.2)
and the function m is continuous. The joint and marginal distribution functions of X and Y are denoted FX,Y, with support IY,X, FX, with bounded support IX, and FY, such that and .
The observation of Y is assumed to be left-truncated by a variable T independent of (X, Y), with distribution function FT: Y and T are observed conditionally on Y ≥ T, and none of the variables is observed if Y < T. Denote for any distribution function F and, under left-truncation,
(2.3)
(2.4)
and .
(2.6)
By the same arguments, from the means in (2.3)–(2.4), is estimated by
the distribution function FT is simply estimated by the product-limit estimator for right-truncated variables [WOO 85]
and an estimator of is deduced from FY|X, FX and m as
The means of T and C are estimated by
The estimators and are known to be P-uniformly consistent and asymptotically Gaussian. For the further convergences restricted to the interval , assume the following condition:
Proposition 2.1,
If and converge in distribution to Gaussian processes with mean zero, variances κ2A(1 − A)(y; x)α−1 (x) and κ2B (1 − B)(y; x)α−1(x) respectively, then the covariances of the limiting processes are zero.
Proof. Let and , with
they satisfy
and a similar approximation for . The biases and variances are deduced from those of each term and the weak convergences are proved as in [PIN 06].
From proposition 2.1 and applying the results of the non-parametric regression,
Proposition 2.2. The estimators converge P-uniformly to and converge P-uniformly to EY and ET respectively.
The weak convergence of the estimated distribution function of truncated survival data was proved in several papers ([GIL 90, LAI 91]). As in [GIL 83] and by proposition 2.1, their proof extends to their weak convergence on (mini{Yi: Ti < Yi}, maxi{Yi: Ti < Yi}) under the conditions and on , which are simply satisfied if for every x in and .
Theorem 2.1. converges weakly to a centered Gaussian process W on IY,X. The variables , for every x in IX,n,h, and converge weakly to EW(Y; x) and .
If m is assumed to be monotone with inverse function r, X is written and the quantiles of X are defined by the inverse functions q1 and q2 of FY|X at fixed y and x, respectively:
where is the inverse of at u. Finally, if m is increasing, then FY|X(y; x) is decreasing in x and increasing in y, and it is the same for its estimator , up to a random set of small probability. The thresholds q1 and q2 are estimated by
As a consequence of Theorem 2.1 and generalizing known results on quantiles,
Theorem 2.2. For , converges P-uniformly to qk on . For every y and (respectively) x, and converge weakly to the centered Gaussian process and, respectively, .
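Estimating a threshold by inverting an estimated distribution function can be sketched as follows. This is only an illustration under stated assumptions: the estimate is available on a finite grid of y values at a fixed x, the quantile is taken as the generalized inverse (smallest grid point where the estimate reaches the level u), and the running maximum is used to restore monotonicity where the estimate fails to be increasing. The name `quantile_from_cdf` is hypothetical.

```python
import numpy as np

def quantile_from_cdf(grid, cdf_values, u):
    """Generalized inverse of an estimated distribution function on a grid.

    Assumptions: `grid` is sorted increasingly and `cdf_values[i]` estimates
    F(grid[i]; x) at a fixed x. Returns nan if the level u is never reached.
    """
    grid = np.asarray(grid, dtype=float)
    cdf_values = np.asarray(cdf_values, dtype=float)
    # Monotone correction: the raw estimate may decrease on a small random set.
    monotone = np.maximum.accumulate(cdf_values)
    # First index whose corrected value is at least u.
    idx = np.searchsorted(monotone, u, side="left")
    if idx == len(grid):
        return float("nan")
    return float(grid[idx])

print(quantile_from_cdf([0, 1, 2, 3], [0.1, 0.4, 0.35, 0.9], 0.5))
```

The same inversion applies in the x direction at fixed y after reversing the order of the grid, since the estimated function is decreasing in x when m is increasing.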
Let , where and ε is independent of X, be observed on an independent random interval [T, C] with T < C. The observations are X, with a positive density on a support IX
The distribution function of Y conditionally on X is still defined by (2.2). Let and , the upper bounds for Y and C. The conditional mean of the observed Y is now . The notations of section 2.5 become
which is the sum of A, B and C. The hazard function of the observed variable W is no longer equal to the hazard function of the variable Y as it is for independent censoring variables T and C, and estimation by self-consistency equations may be used. For a sample and . Let y in ,
(2.7)
and . The process converges to zero in probability for every x in IX,n,h.
The self-consistency equation for an estimator of is defined as a solution of
(2.8)
therefore,
under the constraint for every l and x. The last sum of the denominator cannot be directly calculated from the values of at . This expression provides an iterative algorithm: I(l) may be omitted at the first step, and the conditional Kaplan-Meier estimator for right-censored observations of Y given X is used as the initial estimator of , and is defined at every W(l). The iteration then proceeds from this initial estimator, with the constraint that the estimator remains in ]0, 1]. This algorithm converges to the solution of (2.8).
Theorem 2.3. The estimator solution of (2.8) is P-uniformly consistent and converges weakly to a centered Gaussian process.
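The initial step of the algorithm, a conditional Kaplan-Meier estimator for right-censored observations of Y given X, can be sketched as a kernel-weighted product-limit estimator. The sketch below is an assumption-laden illustration and covers only this initial step, not the iteration for (2.8): it uses an Epanechnikov kernel, a user-chosen bandwidth `h`, and assumes the observed times have no ties. The function name `conditional_km` and its arguments are hypothetical.

```python
import numpy as np

def conditional_km(x0, x, w, delta, h):
    """Kernel-weighted Kaplan-Meier estimate of the survival function of Y given X = x0.

    x: covariates; w: observed times min(Y, C); delta: 1 if Y is observed,
    0 if right-censored; h: bandwidth. Kernel choice and names are assumptions.
    Returns the sorted observed times and the estimate just after each of them.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    delta = np.asarray(delta, dtype=float)
    order = np.argsort(w, kind="stable")
    x, w, delta = x[order], w[order], delta[order]
    u = (x0 - x) / h
    kernel = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)  # Epanechnikov
    if kernel.sum() == 0:
        raise ValueError("no observation within one bandwidth of x0")
    weights = kernel / kernel.sum()
    # Weighted size of the risk set at each ordered time: sum of weights with W_j >= W_i
    at_risk = np.cumsum(weights[::-1])[::-1]
    hazard = np.divide(delta * weights, at_risk,
                       out=np.zeros_like(weights), where=at_risk > 0)
    return w, np.cumprod(1.0 - hazard)

times, surv = conditional_km(0.0, [0, 0, 0, 0], [1, 2, 3, 4], [1, 0, 1, 1], h=1.0)
print(times, surv)
```

When all covariates equal x0 the weights are uniform and the estimate coincides with the ordinary Kaplan-Meier estimator, which gives a quick sanity check.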
The weak convergence is proved by the same method as in [PON 07]. Then m(x) is estimated by which equals
and converges P-uniformly to m on IX,n,h.
The estimators are then written as
If Y is only right-truncated by C independent of (X, Y), with observations (X, Y) and C conditionally on Y ≤ C, the expressions α, A and B are then written as
The distribution functions FC and FY|X are both identifiable and their expressions differ from the previous ones,
The estimators are then
If Y is left and right-truncated by mutually independent variables T and C, independent of (X, Y), the observations are (X, Y), C and T, conditionally on T ≤ Y ≤ C,
The functions FC, FT and FY|X are identifiable and
with
Their estimators are
The other non-parametric estimators of section 2.2 and the results of section 2.5 generalize to all the estimators of this section.
and its derivatives with respect to m(x) and are
Let be the empirical estimator of FC and
an estimator of is deduced by deconvolution and
[FAN 96] FAN J., GIJBELS I., Local Polynomial Modelling and its Applications, Chapman and Hall, London, 1996.
[GIL 83] GILL R., “Large sample behaviour of the product-limit estimator on the whole line”, Ann. Statist., vol. 11, p. 49–58, 1983.
[GIL 90] GILL R., KEIDING N., “Random truncation model and Markov processes”, Ann. Statist., vol. 18, p. 582–60, 1990.
[LAI 91] LAI T., YING Z., “Estimating a distribution function with truncated and censored data”, Ann. Statist., vol. 19, p. 417–442, 1991.
[PIN 06] PINÇON C., PONS O., “Nonparametric estimator of a quantile function for the probability of event with repeated data”, Dependence in Probability and Statistics, Lecture Notes in Statistics, vol. 17, p. 475–489, Springer, New York, 2006.
[PON 06] PONS O., “Estimation for semi-Markov models with partial observations via self-consistency equations”, Statistics, vol. 40, p. 377–388, 2006.
[PON 07] PONS O., “Estimation for the distribution function of one and two-dimensional censored variables or sojourn times of Markov renewal processes”, Communications in Statistics – Theory and Methods, vol. 36, num. 14, 2007.
[WOO 85] WOODROOFE M., “Estimating a distribution function with truncated data”, Ann. Statist., vol. 13, p. 163–177, 1985.
1 Chapter written by Odile PONS.
In survival analysis we deal with data related to times of events (or end-points) in individual life-histories. Survival data are not amenable to the standard statistical procedures used in data analysis, for several reasons. One of them is that survival data are generally not symmetrically distributed, but the main reason is that survival times are frequently censored. This usually happens when the data from a study are to be analyzed at a point when some individuals have not yet experienced the event of interest (or not reached the end-point). Many failure time data in epidemiological studies are simultaneously truncated and interval-censored. Interval-censored data occur in grouped data or when the event of interest is assessed on repeated visits. Right and left-censored data are particular cases of interval-censored data. Right-truncated data occur in registers. For instance, an acquired immune deficiency syndrome (AIDS) register only contains AIDS cases which have been reported. This generates right-truncated samples of induction times. [TUR 76] proposed a method of estimating the survival function in the case of arbitrarily censored and truncated data by a non-parametric maximum likelihood estimator. [FRY 94] noted that this method needed to be corrected slightly. [ALI 96] extended previous work by fitting a proportional hazards model to arbitrarily censored and truncated data, and concentrated on hypothesis testing. [HUB 04] introduced frailty models for the analysis of arbitrarily censored and truncated data, and focused on the estimation of the parameter of interest as well as the nuisance parameter of their model.
The concept of frailty models was introduced by [VAU 79], who studied models with Gamma distributed frailties. Many frailty distributions could be considered, such as the Gamma, which corresponds to the well-known Clayton-Cuzick model [CLA 85, CLA 86], the inverse Gaussian or the positive stable (see [HOU 84] and [HOU 86] for many examples). The Gamma distributed frailty is the most popular choice in the literature, due to its mathematical convenience.
This work conducts a statistical analysis of interval-censored and truncated data using frailty models. Through this analysis, we intend to check the performance of the model proposed by [HUB 04]. In particular, we focus on hypothesis testing about the regression parameter of that model in different situations, such as the case of independent covariates and the misspecification of the truncated proportion of the population. Further research could be directed towards the case of dependent covariates and the case of misspecification of the frailty distribution producing the data.
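The mathematical convenience of the Gamma frailty can be made concrete with a small sketch. It assumes a multiplicative frailty Z acting on the hazard, with Z Gamma distributed with mean 1 and variance theta, so that the marginal survival function is the Laplace transform of Z evaluated at the cumulative hazard, namely (1 + theta Λ(t))^(−1/theta). The function names and the Monte Carlo check are illustrative and are not taken from [HUB 04].

```python
import numpy as np

def gamma_frailty_survival(cum_hazard, theta):
    """Marginal survival when the hazard is Z * lambda(t), Z ~ Gamma(mean 1, variance theta).

    cum_hazard: baseline cumulative hazard Lambda(t); theta > 0 (assumed parametrization).
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    cum_hazard = np.asarray(cum_hazard, dtype=float)
    return (1.0 + theta * cum_hazard) ** (-1.0 / theta)

def simulated_marginal_survival(cum_hazard, theta, n=200_000, seed=0):
    """Monte Carlo check: average the conditional survival exp(-Z * Lambda) over Z."""
    rng = np.random.default_rng(seed)
    z = rng.gamma(shape=1.0 / theta, scale=theta, size=n)  # mean 1, variance theta
    return float(np.mean(np.exp(-z * cum_hazard)))

print(gamma_frailty_survival(1.0, 1.0), simulated_marginal_survival(1.0, 1.0))
```

As theta tends to zero the frailty concentrates at 1 and the marginal survival tends to exp(−Λ(t)), the survival function without frailty.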
where for i
