Mathematical Methods in Survival Analysis, Reliability and Quality of Life

Description

Reliability and survival analysis are important applications of stochastic mathematics (probability, statistics and stochastic processes) that are usually covered separately in spite of the similarity of the involved mathematical theory. This title aims to redress this situation: it includes 21 chapters divided into four parts: Survival analysis, Reliability, Quality of life, and Related topics. Many of these chapters were presented at the European Seminar on Mathematical Methods for Survival Analysis, Reliability and Quality of Life in 2006.


Pages: 492

Publication year: 2013




Contents

Preface

PART I

Chapter 1. Model Selection for Additive Regression in the Presence of Right-Censoring

1.1. Introduction

1.2. Assumptions on the model and the collection of approximation spaces

1.3. The estimation method

1.4. Main result for the adaptive mean-square estimator

1.5. Practical implementation

1.6. Bibliography

Chapter 2. Non-parametric Estimation of Conditional Probabilities, Means and Quantiles under Bias Sampling

2.1. Introduction

2.2. Non-parametric estimation of p

2.3. Bias depending on the value of Y

2.4. Bias due to truncation on X

2.5. Truncation of a response variable in a non-parametric regression model

2.6. Double censoring of a response variable in a non-parametric model

2.7. Other truncation and censoring of Y in a non-parametric model

2.8. Observation by interval

2.9. Bibliography

Chapter 3. Inference in Transformation Models for Arbitrarily Censored and Truncated Data

3.1. Introduction

3.2. Non-parametric estimation of the survival function S

3.3. Semi-parametric estimation of the survival function S

3.4. Simulations

3.5. Bibliography

Chapter 4. Introduction of Within-area Risk Factor Distribution in Ecological Poisson Models

4.1. Introduction

4.2. Modeling framework

4.3. Simulation framework

4.4. Results

4.5. Discussion

4.6. Bibliography

Chapter 5. Semi-Markov Processes and Usefulness in Medicine

5.1. Introduction

5.2. Methods

5.3. An application to HIV control

5.4. An application to breast cancer

5.5. Discussion

5.6. Bibliography

Chapter 6. Bivariate Cox Models

6.1. Introduction

6.2. A dependence model for duration data

6.3. Some useful facts in bivariate dependence

6.4. Coherence

6.5. Covariates and estimation

6.6. Application: regression of Spearman’s rho on covariates

6.7. Bibliography

Chapter 7. Non-parametric Estimation of a Class of Survival Functionals

7.1. Introduction

7.2. Weighted local polynomial estimates

7.3. Consistency of local polynomial fitting estimators

7.4. Automatic selection of the smoothing parameter

7.5. Bibliography

Chapter 8. Approximate Likelihood in Survival Models

8.1. Introduction

8.2. Likelihood in proportional hazard models

8.3. Likelihood in parametric models

8.4. Profile likelihood

8.5. Statistical arguments

8.6. Bibliography

PART II

Chapter 9. Cox Regression with Missing Values of a Covariate having a Non-proportional Effect on Risk of Failure

9.1. Introduction

9.2. Estimation in the Cox model with missing covariate values: a short review

9.3. Estimation procedure in the stratified Cox model with missing stratum indicator values

9.4. Asymptotic theory

9.5. A simulation study

9.6. Discussion

9.7. Bibliography

Chapter 10. Exact Bayesian Variable Sampling Plans for Exponential Distribution under Type-I Censoring

10.1. Introduction

10.2. Proposed sampling plan and Bayes risk

10.3. Numerical examples and comparison

10.4. Bibliography

Chapter 11. Reliability of Stochastic Dynamical Systems Applied to Fatigue Crack Growth Modeling

11.1. Introduction

11.2. Stochastic dynamical systems with jump Markov process

11.3. Estimation

11.4. Numerical application

11.5. Conclusion

11.6. Bibliography

Chapter 12. Statistical Analysis of a Redundant System with One Standby Unit

12.1. Introduction

12.2. The models

12.3. The tests

12.4. Limit distribution of the test statistics

12.5. Bibliography

Chapter 13. A Modified Chi-squared Goodness-of-fit Test for the Three-parameter Weibull Distribution and its Applications in Reliability

13.1. Introduction

13.2. Parameter estimation and modified chi-squared tests

13.3. Power estimation

13.4. Neyman-Pearson classes

13.5. Discussion

13.6. Conclusion

13.7. Appendix

13.8. Bibliography

Chapter 14. Accelerated Life Testing when the Hazard Rate Function has Cup Shape

14.1. Introduction

14.2. Estimation in the AFT-GW model

14.3. Properties of estimators: simulation results for the AFT-GW model

14.4. Some remarks on the second plan of experiments

14.5. Conclusion

14.6. Appendix

14.7. Bibliography

Chapter 15. Point Processes in Software Reliability

15.1. Introduction

15.2. Basic concepts for repairable systems

15.3. Self-exciting point processes and black-box models

15.4. White-box models and Markovian arrival processes

15.5. Bibliography

PART III

Chapter 16. Likelihood Inference for the Latent Markov Rasch Model

16.1. Introduction

16.2. Latent class Rasch model

16.3. Latent Markov Rasch model

16.4. Likelihood inference for the latent Markov Rasch model

16.5. An application

16.6. Possible extensions

16.7. Conclusions

16.8. Bibliography

Chapter 17. Selection of Items Fitting a Rasch Model

17.1. Introduction

17.2. Notations and assumptions

17.3. The Rasch model and the multidimensional marginally sufficient Rasch model

17.4. The Raschfit procedure

17.5. A fast version of Raschfit

17.6. A small set of simulations to compare Raschfit and Raschfit-fast

17.7. A large set of simulations to compare Raschfit-fast, MSP and HCA/CCPROX

17.8. The Stata module “Raschfit”

17.9. Conclusion

17.10. Bibliography

Chapter 18. Analysis of Longitudinal HrQoL using Latent Regression in the Context of Rasch Modeling

18.1. Introduction

18.2. Global models for longitudinal data analysis

18.3. A latent regression Rasch model for longitudinal data analysis

18.4. Case study: longitudinal HrQoL of terminal cancer patients

18.5. Concluding remarks

18.6. Bibliography

Chapter 19. Empirical Internal Validation and Analysis of a Quality of Life Instrument in French Diabetic Patients during an Educational Intervention

19.1. Introduction

19.2. Material and methods

19.3. Results

19.4. Discussion

19.5. Conclusion

19.6. Bibliography

19.7. Appendices

PART IV

Chapter 20. Deterministic Modeling of the Size of the HIV/AIDS Epidemic in Cuba

20.1. Introduction

20.2. The models

20.3. The underreporting rate

20.4. Fitting the models to Cuban data

20.5. Discussion and concluding remarks

20.6. Bibliography

Chapter 21. Some Probabilistic Models Useful in Sport Sciences

21.1. Introduction

21.2. Sport jury analysis: the Gauss-Markov approach

21.3. Sport performance analysis: the fatigue and fitness approach

21.4. Sport equipment analysis: the fuzzy subset approach

21.5. Sport duel issue analysis: the logistic simulation approach

21.6. Sport epidemiology analysis: the accelerated degradation approach

21.7. Conclusion

21.8. Bibliography

Appendices

A. European Seminar: Some Figures

B. Contributors

Index

First published in Great Britain and the United States in 2008 by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd, 6 Fitzroy Square, London W1T 5DX, UK, www.iste.co.uk

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA, www.wiley.com

© ISTE Ltd, 2008

The rights of Catherine Huber, Nikolaos Limnios, Mounir Mesbah and Mikhail Nikulin to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Cataloging-in-Publication Data

Mathematical methods in survival analysis, reliability and quality of life / edited by Catherine Huber … [et al.].

p. cm.

Includes bibliographical references and index.

ISBN: 978-1-84821-010-3

1. Failure time data analysis. 2. Survival analysis (Biometry) I. Huber, Catherine.

QA276.M342 2008

519.5′46--dc22

2007046232

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISBN: 978-1-84821-010-3

Preface

The European Seminar on Mathematical Methods for Survival Analysis, Reliability and Quality of Life was created in 1997 by C. HUBER, N. LIMNIOS, M. NIKULIN and M. MESBAH, and, thanks to our respective laboratories and also to the supporting universities (see list below), it ran regularly during the last 10 years. 2007 is a special year as our European Seminar celebrates its 10th birthday. This seminar aims to give a review of recent research in the field of survival analysis, reliability, quality of life, and related topics, from both statistical and probabilistic points of view. Three or four sessions take place every year at the participating universities.

Besides these regular annual sessions, the European seminar supported many international conferences and workshops in France and abroad: for instance, in 2000, GOF2000 (Goodness Of Fit) in Paris and MMR2000 (Mathematical Methods in Reliability) in Bordeaux, in 2004, an international workshop on “Semi-parametric Models” organized at Mont Saint Michel and, more recently, Biostat2006 in Cyprus. More than 14 international workshops, 26 seminar sessions and about 150 talks were organized during the last ten years (see Appendix A).

Reliability and survival analysis are important applications of stochastic mathematics (probability, statistics and stochastic processes) that were usually treated separately in spite of the similarity of the mathematical theory involved. Reliability is oriented towards technical systems studies (without excluding the human factor), while survival analysis and quality of life are oriented towards biological and medical studies (without excluding the technical factor). The lifetime T of a technical system, of a human patient or of a bacterium is a non-negative random variable, and the same function of the time t, Prob(T > t), is called the reliability function, denoted R(t), in reliability theory, and the survival function, denoted S(t), in medical applications. Nevertheless, even if the function to investigate is the same, the objectives are not always identical. In the field of reliability, most of the time, systems are ergodic and large, which is not the case in survival analysis. Thus, techniques developed in order to evaluate or to estimate the reliability/survival function are not always based on the same fundamental results. However, they also include several common techniques: Cox models, degradation models, multi-state approaches (i.e., Markov and semi-Markov models), point processes, etc.

While it is recognized that quality of life is ultimately as important as quantity of life (survival time), efforts to implement quality of life measurements often fail. Statistical methods to analyze time of events are nowadays well established; opinions are largely agreed on between statisticians, clinicians and industrial professionals. Unfortunately, in the quality of life field, there is no standard instrument to measure it, no standard methodology to validate measurement instruments (questionnaires) and no standard statistical methodology to analyze obtained measurements. Specific development and application of modern psychometrical measurement models (latent variable models including the Rasch model) and connection with utility theory are important issues. A more recent issue is the joint analysis of the latent quality of life and external variables such as treatment, evolution (longitudinal analysis) and/or survival time.

The present book includes 21 chapters divided into four parts:

I. Survival analysis

II. Reliability

III. Quality of life

IV. Related topics

We would like to especially thank Julien Chiquet for his very efficient technical support on this book, and ISTE Ltd for their ongoing support.

Catherine HUBER,

Nikolaos LIMNIOS,

Mounir MESBAH,

Mikhail NIKULIN.

PART I

Survival Analysis

Chapter 1

Model Selection for Additive Regression in the Presence of Right-Censoring¹

1.1. Introduction

Statistical tools for handling regression problems when the response is censored have been developed in the last two decades. The response was often assumed to be a linear function of the covariates, but non-parametric regression models provide a very flexible method when a general relationship between covariates and response is first to be explored. In recent years, a vast literature has been devoted to non-parametric regression estimators for completely observed data. However, few methods exist under random censoring. First, Buckley and James [BUC 79] and Koul, Susarla and Van Ryzin [KOU 81] among others introduced the original idea of transforming the data to take the censoring into account, for linear regression curves. Then, Zheng [ZHE 88] proposed various classes of unbiased transformations. Dabrowska [DAB 87] and Zheng [ZHE 88] applied non-parametric methods for estimating the univariate regression curve. Later, Fan and Gijbels [FAN 94] considered a local linear approximation for the data transformed in the same way by using a variable bandwidth adaptive to the sparsity of the design points. Györfi et al. [GYÖ 02] also studied the consistency of generalized Stone’s regression estimators in the censored case. Heuchenne and Van Keilegom [HEU 05] considered a nonlinear semi-parametric regression model with censored data. Park [PAR 04] extended a procedure suggested in Gross and Lai [GRO 96] to a general non-parametric model in the presence of left-truncation and right-censoring, by using B-spline developments. Recently, Kohler et al. [KOH 03] proposed an adaptive mean-square estimator built with polynomial splines.

(1.1)

The method consists of building projection estimators of the d components rT,1, …, rT,d on different projection spaces. The strategy is based on a standard mean-square contrast as in Baraud [BAR 00] together with an optimized version of the data transformation proposed by Fan and Gijbels [FAN 94]. The algorithm is explained and performed through several empirical trials and real data, and in particular improved in the bivariate setting. The model and the assumptions are presented in section 1.2, the method is described in section 1.3 and the main theoretical result is given in section 1.4. Finally, practical implementation is to be found in section 1.5 together with several examples of various dimensions.

1.2. Assumptions on the model and the collection of approximation spaces

1.2.1. Non-parametric regression model with censored data

We consider the following censoring mechanism. Let C1, C2, …, Cn be n censoring times independent of and consequently independent of as well. The and the couples (Zi, δi)s are observed where

δi indicates if the observed time Zi is a lifetime or a censoring time both occurring in the interval [0, T].

Now, let G(·) be the cumulative distribution function (cdf) of the Cis and FY be the marginal cdf of the Yis with and being the corresponding survival functions. We suppose moreover that:

The distribution functions of the Yis and Cis are R+-supported.
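As a concrete illustration of this observation scheme, the following Python sketch simulates right-censored responses. It assumes the usual convention Zi = min(Yi, Ci) with δi = 1 when Yi ≤ Ci. The regression function, the noise level, the censoring law and all function names are hypothetical choices made only for the illustration.

```python
import math
import random


def simulate_censored_sample(n, censoring_rate=0.5, seed=0):
    """Draw (X_i, Z_i, delta_i) under random right-censoring.

    Assumptions (not taken from the chapter): X_i ~ Uniform(0, 1),
    Y_i = 1 + sin(2*pi*X_i)/2 + noise, kept non-negative, and
    C_i ~ Exponential(censoring_rate), independent of (X_i, Y_i).
    """
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = rng.random()
        y = max(0.0, 1.0 + 0.5 * math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.1))
        c = rng.expovariate(censoring_rate)
        z = min(y, c)               # observed time
        delta = 1 if y <= c else 0  # 1: lifetime observed, 0: censored
        sample.append((x, z, delta))
    return sample


sample = simulate_censored_sample(1000)
censored_share = 1 - sum(d for _, _, d in sample) / len(sample)
print(f"share of censored observations: {censored_share:.2f}")
```

Changing `censoring_rate` changes the proportion of censored observations, which is how different censoring rates could be explored in a simulation study.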

1.2.2. Description of the approximation spaces in the univariate case

The projection spaces used in our theoretical results are described hereafter. For the sake of simplicity, we focus on the polynomial spaces. Note that it is possible to use trigonometric or wavelet spaces. Moreover, in practice the collection among which the algorithm makes its choice is much more complicated: the degrees on each bin are selected by the algorithm and the bins have not necessarily the same size. See Comte and Rozenholc [COM 04] for a description of the algorithm.

The spaces in collection [P] satisfy the following property:

(1.2)

where , for t in L2([0, 1]).

Moreover, for the results concerning the adaptive estimators, we need the following additional assumption:

is a collection of nested models, and we denote by the space belonging to the collection, such that . We denote by Nn the dimension of .

Assumption is satisfied with for collection [P]. Moreover, [DP] satisfies .

1.2.3. The particular multivariate setting of additive models

In order to estimate the additive regression function, the approximation spaces can be described as

where is chosen as a piecewise polynomial space with dimension As in the univariate case, this particular collection of multivariate spaces also satisfies and by taking in inequalities (1.2).

1.3. The estimation method

As usual in regression problems, a mean-square contrast can lead to an estimator of rT. However, we need first to transform the data to take the censoring mechanism into account.

1.3.1. Transformation of the data

We consider the following transformation of the censored data

(1.3)

The main interest of the transformation is the following property: . Indeed

(1.4)

with

(1.5)
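The unbiasedness property can be checked numerically when G is known. The sketch below assumes the transformation has the Fan and Gijbels [FAN 94] form (1 + α) ∫ dt/(1 − G(t)) over [0, Z] minus α δ Z/(1 − G(Z)), which reduces to the Koul et al. [KOU 81] transformation for α = −1. This form, the exponential censoring law and the uniform response are assumptions of the example, and the function names are hypothetical.

```python
import math
import random


def transform(z, delta, alpha, rate):
    """Transformed response for Exponential(rate) censoring, G known.

    With survival function 1 - G(t) = exp(-rate * t), the integral of
    1 / (1 - G) over [0, z] equals (exp(rate * z) - 1) / rate.
    """
    integral = (math.exp(rate * z) - 1.0) / rate
    weighted = delta * z * math.exp(rate * z)  # delta * z / (1 - G(z))
    return (1.0 + alpha) * integral - alpha * weighted


def mean_transformed(alpha, n=100_000, rate=0.5, seed=1):
    """Monte Carlo mean of the transformed responses."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        y = rng.random()            # Y ~ Uniform(0, 1), so E(Y) = 0.5
        c = rng.expovariate(rate)   # censoring time, independent of Y
        z, delta = min(y, c), 1 if y <= c else 0
        total += transform(z, delta, alpha, rate)
    return total / n


for alpha in (-1.0, 0.0, 0.5):
    print(alpha, round(mean_transformed(alpha), 3))
```

Each printed mean should be close to E(Y) = 0.5 whatever the value of α, whereas the mean of the raw observed times is smaller because of censoring.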

In all cases, the transformed data are unobservable because G is unknown, so we need a relevant estimator of G. We propose taking the Kaplan-Meier [KAP 58] product-limit estimator, modified in the way suggested by Lo et al. [LO 89], and defined by

(1.6)

Finally, by substituting this estimator for G, we obtain the empirical version of the transformed data:

(1.7)
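A minimal sketch of the empirical transformation follows. It assumes that the modified product-limit estimator of the censoring survival function is the product, over ordered observations Z(i) ≤ y, of ((n − i + 1)/(n − i + 2)) raised to the power 1 − δ(i), a form that never vanishes. For simplicity it uses the α = −1 transformation δZ/(1 − G(Z)) rather than an optimized one. Both choices and all function names are assumptions of the example.

```python
def censoring_survival(z, delta):
    """Return y -> estimated P(C > y) from observed times z and indicators delta.

    Assumed modified Kaplan-Meier form: product over ordered Z_(i) <= y of
    ((n - i + 1) / (n - i + 2)) ** (1 - delta_(i)), which stays positive.
    """
    n = len(z)
    ordered = sorted(zip(z, delta))  # ties broken by delta, an arbitrary choice

    def g_bar(y):
        value = 1.0
        for i, (z_i, d_i) in enumerate(ordered, start=1):
            if z_i > y:
                break
            if d_i == 0:  # only censored observations change the estimate
                value *= (n - i + 1) / (n - i + 2)
        return value

    return g_bar


def transformed_responses(z, delta):
    """Empirical transformed data delta_i * Z_i / G_bar_hat(Z_i)."""
    g_bar = censoring_survival(z, delta)
    return [d_i * z_i / g_bar(z_i) for z_i, d_i in zip(z, delta)]


z = [0.8, 0.3, 1.2, 0.5]
delta = [1, 0, 1, 0]
print(transformed_responses(z, delta))
```

Censored observations are mapped to zero and uncensored ones are inflated by the inverse of the estimated censoring survival function, so the transformed values can then be fed to an ordinary regression routine.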

1.3.2. The mean-square contrast

The mean-square strategy leads us to study the following contrast:

(1.8)

In this context, it is useful to consider the empirical norm associated with the design

Here we define

(1.9)

The function may not be easy to define but the vector is always well defined since it is the orthogonal projection in Rn of vector onto the subspace of Rn defined by . This explains why the empirical norms are particularly suitable for the mean-square contrast.

Next, model selection is performed by selecting the model such that:

(1.10)

where we have to determine the relevant form of pen(·) for to be an adaptive estimator of r.
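The selection rule can be illustrated with a simplified sketch. It replaces the piecewise polynomial spaces by regular histograms (piecewise constants on D equal bins of [0, 1]) and uses a penalty of the assumed form κσ²D/n, where κ and σ² are supplied by the user. Neither simplification is taken from the chapter, and the function names are hypothetical.

```python
def fit_histogram(x, y, dim):
    """Least-squares fit on piecewise constants over `dim` equal bins of [0, 1]."""
    sums, counts = [0.0] * dim, [0] * dim
    for x_i, y_i in zip(x, y):
        k = min(int(x_i * dim), dim - 1)
        sums[k] += y_i
        counts[k] += 1
    # Empty bins get the value 0; they never enter the empirical contrast.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def contrast(x, y, levels):
    """Empirical mean-square contrast (1/n) * sum (y_i - t(x_i))**2."""
    dim = len(levels)
    return sum(
        (y_i - levels[min(int(x_i * dim), dim - 1)]) ** 2 for x_i, y_i in zip(x, y)
    ) / len(x)


def select_dimension(x, y, dims, kappa, sigma2):
    """Return the dimension minimising contrast + kappa * sigma2 * dim / n."""
    n = len(x)

    def criterion(dim):
        return contrast(x, y, fit_histogram(x, y, dim)) + kappa * sigma2 * dim / n

    return min(dims, key=criterion)


x = [(i + 0.5) / 40 for i in range(40)]
y = [1.0 if x_i < 0.5 else 3.0 for x_i in x]  # noise-free step function
print(select_dimension(x, y, dims=[1, 2, 4, 8], kappa=2.0, sigma2=0.1))
```

On this noise-free step function the contrast vanishes for every dimension of at least 2, so the penalty alone decides among those models and the smallest adequate dimension, D = 2, is retained.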

1.4. Main result for the adaptive mean-square estimator

The automatic selection of the projection space can be performed via penalization and the following theoretical result is proved in Brunel and Comte [BRU 06], for the particular choice of i.e. for a contrast γn defined by (1.8) with variables .

Theorem 1.1. Assume that the common density f of the covariate vector is such that and that the Yis admit moments of order 8. Consider the collection of models [DP] with ) for [DP] where Kϕ is a (known) constant depending on the basis. Let be the adaptive estimator defined by (1.8) with and (1.10) with

(1.11)

where κ is a numerical constant. Then

(1.12)

where rm is the orthogonal projection of rT onto Sm and C and C′ are constants depending on , cG and .

The theoretical penalty pen(m) involves constants having different status. Let us recall that Φ0 is known where r is the degree of the piecewise polynomials). The unknown terms therein are and the expectation and they have to be replaced by estimators:

(1.13)

where is an estimator of the lower bound of f. Lastly, the constant κ is a numerical constant, independent of the data and for which a minimal suitable value exists. It has to be calibrated by simulation experiments: this has been done for regression problems in Comte and Rozenholc [COM 04]. It can be proved that the estimator obtained when substituting the random penalization (1.13) to the theoretical one (1.11) still satisfies inequality (1.12) under the assumptions of the theorem.

The right-hand side of inequality (1.12) shows that a non-asymptotic trade-off is automatically performed between an unavoidable squared bias term and a term of variance order, pen(m). The non-asymptotic properties of the estimation algorithm can be appreciated when the selected model has a small dimension but still gives a good fit between the true function and the estimate.

1.5. Practical implementation

1.5.1. The algorithm

1) We obtain data: .

2) We apply the transformation defined by (1.7) for defined by (1.4), to the data (Z, δ) and build new variables for the regression , the δis being unchanged (see Figure 1.1 with both initial and transformed data where the effect of the preliminary transformation is in evidence).

3) The model must be subject to the constraint , otherwise for , the estimated functions are and . In practice, a good strategy may be to perform the regression algorithm of the centered data where . The output of the regression of the on the is a vector of estimations on a space selected by contrast penalization.

4) The new variables for the regression are taken as the , and the output of the regression of the on the is vector .

The mean-square estimation algorithm used here is the one originally implemented by Comte and Rozenholc ([COM 02], [COM 04]), which allows in addition variable degrees of the piecewise polynomials on each bin.

1.5.2. Univariate examples

We present in the following some univariate examples on which the procedure has first been tested.

Example 1. First we consider

for iid and . We tested different values of c, which imply different censoring rates. Figure 1.1 illustrates that whatever the proportion of censored data, the estimate is very good. The scatter plots are here to compare the original data to the transformed data by given in (1.7) with given by (1.4).

Example 2. Our second example studies another function r:

for iid Cis, εis and Uis with and . The results are illustrated in Figure 1.2, and the scatter plots allow here the comparison between the censored and uncensored data once the transformation has been performed.

Example 3. The third model is borrowed from Fan and Gijbels [FAN 94] and is described by

and with

Example 4. We also considered the classical “Stanford Heart Transplant Data”: from October 1967 to February 1980, 184 patients were admitted to a heart transplant program, 157 with “complete tissue typing” and 55 censored. These data were originally studied by Miller and Halpern [MIL 82] and later by Fan and Gijbels [FAN 94] among others. Figure 1.4 illustrates our results with this data set. The estimated function seems to show that there is an optimal age for a heart transplant, which is about 35 years.

1.5.3. Bivariate examples

In this section, we consider some bivariate examples, in order to illustrate that the sequential procedure works in this setting.

Example 5. The first bivariate example is inspired by the function r1 considered in Fan and Gijbels [FAN 94] associated with another one.

and . The censoring variables are iid with exponential distribution. Note that we cannot estimate r1 and r2 but and , for identifiability reasons. Note that . We keep the model in this form for comparison with the univariate case. We correct the means to calculate the errors for the plots.

We use a “sequential” method that consists of estimating with the regression on the and then use the residuals as new transformed data for the regression on the .
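The sequential idea can be sketched as follows, with piecewise-constant fits standing in for the adaptive estimator and with the responses taken as already transformed. The bin count, the toy design and the function names are hypothetical.

```python
def bin_means(x, y, dim):
    """Piecewise-constant least-squares fit of y on x over `dim` equal bins of [0, 1]."""
    sums, counts = [0.0] * dim, [0] * dim
    for x_i, y_i in zip(x, y):
        k = min(int(x_i * dim), dim - 1)
        sums[k] += y_i
        counts[k] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def predict(levels, x_i):
    """Value of the fitted step function at x_i."""
    dim = len(levels)
    return levels[min(int(x_i * dim), dim - 1)]


def sequential_additive_fit(x1, x2, y, dim=2):
    """Fit y on x1, then fit the residuals on x2."""
    fit1 = bin_means(x1, y, dim)
    residuals = [y_i - predict(fit1, a) for a, y_i in zip(x1, y)]
    fit2 = bin_means(x2, residuals, dim)
    return fit1, fit2


# Balanced design: every (x1, x2) combination appears once.
grid = [0.25, 0.75]
x1 = [a for a in grid for _ in grid]
x2 = [b for _ in grid for b in grid]
y = [(1.0 if a < 0.5 else 3.0) + (0.0 if b < 0.5 else 4.0) for a, b in zip(x1, x2)]
fit1, fit2 = sequential_additive_fit(x1, x2, y)
print(fit1, fit2)
```

In this toy design the first fit absorbs the mean of the second component and the second fit comes out centered, which reflects the identifiability remark made for Example 5: the components are recovered only up to additive constants.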

We compare the results obtained for this model with the result of a univariate model , generated with the same function r1 and the same σ as in Example 5. The censoring variables are taken as iid exponential distributions, with parameters adjusted to give the same proportion of censored variables for comparison. Tables 1.1 and 1.2 summarize the results of the mean squared errors (MSE) calculated for 100 simulated samples with five different sizes. It seems that estimating two functions instead of one does not greatly deteriorate the estimation of the first one and gives good results for the second one. A visualization of the orders of the MSE is given in Figure 1.5.

Example 6. Primary Biliary Cirrhosis (PBC) data. This data set is described in detail in Fleming and Harrington ([FLE 91], p.2, Chapter 4) and is also studied by Fan and Gijbels [FAN 94]. The Mayo Clinic collected data on PBC, a rare but fatal chronic liver disease. From January 1974 to May 1984, 424 patients were registered, of whom 312 participated in the randomized trial. The response variable is the logarithm of the time (in days) between registration and death, liver transplantation or time of the study analysis (July 1986). Among the 312 patients, 187 cases were censored. The covariates are first, the age, and second, the logarithm of bilirubin, which is known to be a prognostic factor. The estimated curves are given in Figure 1.6.

1.5.4. A trivariate example

. The model is thus

The functions estimated by the algorithm are where and .

1.6. Bibliography

[BAR 00] BARAUD Y., “Model selection for regression on a fixed design”, Probab. Theory Related Fields, vol. 117, num. 4, p. 467–493, 2000.

[BRU 06] BRUNEL E., COMTE F., “Adaptive nonparametric regression estimation in presence of right-censoring”, Math. Methods Statist., vol. 15, num. 3, p. 233–255, 2006.

[BUC 79] BUCKLEY J., JAMES I., “Linear regression with censored data”, Biometrika, vol. 66, num. 3, p. 429–436, 1979.

[COM 02] COMTE F., ROZENHOLC Y., “Adaptive estimation of mean and volatility functions in (auto-)regressive models”, Stochastic Process. Appl., vol. 97, num. 1, p. 111–145, 2002.

[COM 04] COMTE F., ROZENHOLC Y., “A new algorithm for fixed design regression and denoising”, Ann. Inst. Statist. Math., vol. 56, num. 3, p. 449–473, 2004.

[DAB 87] DABROWSKA D. M., “Nonparametric regression with censored survival time data”, Scand. J. Statist., vol. 14, num. 3, p. 181–197, 1987.

[FAN 94] FAN J., GIJBELS I., “Censored regression: local linear approximations and their applications”, J. Amer. Statist. Assoc., vol. 89, num. 426, p. 560–570, 1994.

[FLE 91] FLEMING T. R., HARRINGTON D. P., Counting Processes and Survival Analysis, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, John Wiley & Sons Inc., New York, 1991.

[GRO 96] GROSS S. T., LAI T. L., “Nonparametric estimation and regression analysis with left-truncated and right-censored data”, J. Amer. Statist. Assoc., vol. 91, num. 435, p. 1166–1180, 1996.

[GYÖ 02] GYÖRFI L., KOHLER M., KRZYŻAK A., WALK H., A Distribution-free Theory of Nonparametric Regression, Springer Series in Statistics, Springer-Verlag, New York, 2002.

[HEU 05] HEUCHENNE C., VAN KEILEGOM I., “Nonlinear regression with censored data”, Discussion paper, 2005, 0512, Institut de Statistique, Université catholique de Louvain.

[KAP 58] KAPLAN E. L., MEIER P., “Nonparametric estimation from incomplete observations”, J. Amer. Statist. Assoc., vol. 53, p. 457–481, 1958.

[KOH 03] KOHLER M., KUL S., MÁTHÉ K., “Least squares estimates for censored regression”, Preprint, http://www.mathematik.uni-stuttgart.de/mathA/lst3/kohler/hfm-pub-en.html, 2003.

[KOU 81] KOUL H., SUSARLA V., VAN RYZIN J., “Regression analysis with randomly right-censored data”, Ann. Statist., vol. 9, num. 6, p. 1276–1288, 1981.

[LEU 87] LEURGANS S., “Linear models, random censoring and synthetic data”, Biometrika, vol. 74, num. 2, p. 301–309, 1987.

[LO 89] LO S. H., MACK Y. P., WANG J. L., “Density and hazard rate estimation for censored data via strong representation of the Kaplan-Meier estimator”, Probab. Theory Related Fields, vol. 80, num. 3, p. 461–473, 1989.

[MIL 82] MILLER R., HALPERN J., “Regression with censored data”, Biometrika, vol. 69, num. 3, p. 521–531, 1982.

[PAR 04] PARK J., “Optimal global rate of convergence in nonparametric regression with left-truncated and right-censored data”, J. Multivariate Anal., vol. 89, num. 1, p. 70–86, 2004.

[STO 82] STONE C. J., “Optimal global rates of convergence for nonparametric regression”, Ann. Statist., vol. 10, num. 4, p. 1040–1053, 1982.

[ZHE 88] ZHENG Z. K., “Strong consistency of nonparametric regression estimates with censored data”, J. Math. Res. Exposition, vol. 8, num. 2, p. 307–313, 1988.

¹ Chapter written by Elodie BRUNEL and Fabienne COMTE.

Chapter 2

Non-parametric Estimation of Conditional Probabilities, Means and Quantiles under Bias Sampling¹

2.1. Introduction

Models for estimating toxicity thresholds or performing diagnostic tests for the detection of diseases are still in development. Let be a sample of the variable set (X, Y), where Y is the indicator that an event occurs and X is an explanatory variable. Conditionally on X, Y follows a Bernoulli distribution with parameter . The function p is supposed to be strictly monotone on the support IX of X. A usual example is that of a response variable Y to a dose X or to an exposure time X. The variable X may be observed at fixed values xi, i ∈ {1,…, m} on a regular grid {1/m,…, 1} or at random values on a real interval corresponding to the values of a continuous process (Xt)t≤T at fixed or random times tj, j ≤ n.

The observations may be biased due to X or Y and the estimators must be corrected. We present here biased designs for this model and for a continuous bivariate set (X, Y), and discuss the identifiability of the models, non-parametric estimation of the distribution functions and means, and efficiency of the estimators. Linear models with a multidimensional variable have been widely studied and the results given here may be extended to that case.

2.2. Non-parametric estimation of p

In a discrete sampling design with several independent observations for each value xj of X, the likelihood is written
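Under the usual binomial likelihood for replicated Bernoulli observations (an assumption made here), the maximizer at each design point xj is simply the observed proportion of events among the observations taken at xj. A minimal sketch, with hypothetical names and toy data:

```python
from collections import defaultdict


def estimate_p(xs, ys):
    """Proportion of events (Y = 1) at each distinct design value of X."""
    events, totals = defaultdict(int), defaultdict(int)
    for x_i, y_i in zip(xs, ys):
        events[x_i] += y_i
        totals[x_i] += 1
    return {x_j: events[x_j] / totals[x_j] for x_j in sorted(totals)}


# Toy data: four replications at each of three dose levels.
xs = [0.25] * 4 + [0.5] * 4 + [0.75] * 4
ys = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0]
p_hat = estimate_p(xs, ys)
print(p_hat)
```

The raw proportions need not be monotone in x, so when p is assumed strictly monotone a further monotonization step may be required; that step is not shown here.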

2.3. Bias depending on the value of Y

Let

The probability p(x) is deduced from θ and π(x) by the relation

and the bias sampling is

Let γ be the inverse of the proportion of cases in the population,

(2.1)

Under the bias sampling,

γ is modified by the scale parameter η: it becomes .

The product θγ may be directly estimated from the observed Bernoulli variables Yi by the maximization of the likelihood

hence,

In a discrete sampling design with several independent observations for fixed values xj of the variable X, the likelihood is

For random observations of variable X, or for fixed observations without replications, α(x) is estimated by

If θ is known, non-parametric estimators of p are deduced as

2.4. Bias due to truncation on X

and the conditional probabilities of sampling, given the status value, are

2.5. Truncation of a response variable in a non-parametric regression model

Consider (X, Y), a two-dimensional variable in a left-truncated transformation model: let Y denote a response to a continuous exposure variable X, up to a variable of individual variations independent of X,

with distribution function . The distribution function of Y conditionally on X is defined by

(2.2)

and the function m is continuous. The joint and marginal distribution functions of X and Y are denoted FX,Y, with support IY,X, FX, with bounded support IX, and FY, such that and .

The observation of Y is assumed to be left-truncated by a variable T independent of (X, Y), with distribution function FT; Y and T are observed conditionally on Y ≥ T, and none of the variables is observed if Y < T. Denote for any distribution function F and, under left-truncation,

(2.3)

(2.4)

and .

(2.6)

By the same arguments, from the means in (2.3)–(2.4), is estimated by

the distribution function FT is simply estimated by the product-limit estimator for right-truncated variables [WOO 85]
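A minimal sketch of a product-limit estimator of this type may help fix ideas. It assumes a sample of pairs (Ti, Yi) observed only when Ti ≤ Yi and uses the usual Lynden-Bell/Woodroofe form, in which the risk set at s contains the pairs with Tj ≤ s ≤ Yj. The function name `truncated_product_limit` is hypothetical, and the code is an illustration under these assumptions rather than the chapter's exact formula.

```python
def truncated_product_limit(t_obs, y_obs):
    """Return a function t -> estimate of F_T(t) from pairs observed when T <= Y.

    Assumed form: F_T(t) = product over distinct observed values s > t of
    (1 - d(s) / R(s)), where d(s) counts the T_i equal to s and
    R(s) counts the pairs with T_j <= s <= Y_j.
    """
    pairs = list(zip(t_obs, y_obs))
    if any(t > y for t, y in pairs):
        raise ValueError("every observed pair must satisfy T <= Y")

    def risk(s):
        # Size of the risk set at s; it is at least 1 when s is an observed T_i.
        return sum(1 for t, y in pairs if t <= s <= y)

    jump_points = sorted(set(t_obs))

    def distribution(t):
        prod = 1.0
        for s in jump_points:
            if s > t:
                d = sum(1 for u in t_obs if u == s)
                prod *= 1.0 - d / risk(s)
        return prod

    return distribution
```

When every Ti lies below every Yj, truncation has no effect and the sketch reduces to the empirical distribution function of the Ti. An interior risk set of size one sets a factor to zero, which is a known weakness of product-limit estimators under truncation.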

and an estimator of is deduced from FY|X, FX and m as

The means of T and C are estimated by

The estimators and are known to be P-uniformly consistent and asymptotically Gaussian. For the further convergences restricted to the interval , assume the following condition:

Proposition 2.1.

If and converge in distribution to Gaussian processes with mean zero, variances κ2A(1 − A)(y; x)α−1 (x) and κ2B (1 − B)(y; x)α−1(x) respectively, then the covariances of the limiting processes are zero.

Proof. Let and , with

they satisfy

and a similar approximation for . The biases and variances are deduced from those of each term and the weak convergences are proved as in [PIN 06].

From proposition 2.1 and applying the results of the non-parametric regression,

Proposition 2.2. The estimators converge P-uniformly to and converge P-uniformly to EY and ET respectively.

The weak convergence of the estimated distribution function of truncated survival data was proved in several papers ([GIL 90, LAI 91]). As in [GIL 83] and by proposition 2.1, their proof extends to their weak convergence on (mini{Yi: Ti < Yi}, maxi{Yi: Ti < Yi}) under the conditions and on , which are simply satisfied if for every x in and .

Theorem 2.1. converges weakly to a centered Gaussian process W on IY,X. The variables , for every x in IX,n,h, and converge weakly to EW(Y; x) and .

If m is assumed to be monotone with inverse function r, X is written and the quantiles of X are defined by the inverse functions q1 and q2 of FY|X at fixed y and x, respectively:

where is the inverse of at u. Finally, if m is increasing, then FY|X(y; x) is decreasing in x and increasing in y, and it is the same for its estimator , up to a random set of small probability. The thresholds q1 and q2 are estimated by
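The inversion step can be sketched numerically. The helpers below assume the estimated conditional distribution function has already been evaluated on a sorted grid, and they return a generalized inverse at level u. The function names and the "first grid point reaching the level" convention are assumptions for illustration, since the displayed definitions of the estimated thresholds are not reproduced here.

```python
def inverse_increasing(grid, values, u):
    """Smallest grid point where a nondecreasing sequence of values reaches u.

    Intended for inversion in y at fixed x, where the distribution
    function is increasing.
    """
    for point, value in zip(grid, values):
        if value >= u:
            return point
    return None  # level u is never reached on this grid


def inverse_decreasing(grid, values, u):
    """Smallest grid point where a nonincreasing sequence of values drops to u or below.

    Intended for inversion in x at fixed y when m is increasing, so that
    the conditional distribution function is decreasing in x.
    """
    for point, value in zip(grid, values):
        if value <= u:
            return point
    return None  # the values stay above u on this grid
```

For example, `inverse_increasing([0, 1, 2, 3], [0.1, 0.4, 0.7, 1.0], 0.5)` returns the grid point 2. Because the estimator is monotone only up to a random set of small probability, a practical version might first apply a monotone rearrangement to the values.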

As a consequence of Theorem 2.1 and generalizing known results on quantiles,

Theorem 2.2. For , converges P-uniformly to qk on . For every y and (respectively) x, and converge weakly to the centered Gaussian process and, respectively, .

2.6. Double censoring of a response variable in a non-parametric model

Let , where and ε is independent of X, be observed on an independent random interval [T, C] with T < C. The observations are X, with a positive density on a support IX

The distribution function of Y conditionally on X is still defined by (2.2). Let and , the upper bounds for Y and C. The conditional mean of the observed Y is now . The notations of section 2.5 become

which is the sum of A, B and C. The hazard function of the observed variable W is no longer equal to the hazard function of the variable Y as it is for independent censoring variables T and C, and estimation by self-consistency equations may be used. For a sample and . Let y in ,

(2.7)

and . The process converges to zero in probability for every x in IX,n,h.

The self-consistency equation for an estimator of is defined as a solution of

(2.8)

therefore,

under the constraint for every l and x. The last sum of the denominator, cannot be directly calculated from the values of at . This expression provides an iterative algorithm: I(l) may be omitted at the first step and the conditional Kaplan-Meier estimator for right-censored observations of Y given X is used as initial estimator of , and is defined at every W(l). This initial estimator is used for an iterative procedure with the constraint that the estimator remains in ]0, 1]. This algorithm converges to the solution of (2.8).
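The initial step of this algorithm can be illustrated with a kernel-weighted Kaplan-Meier estimator (often called the Beran estimator) of the conditional survival function of Y given X = x. The sketch below covers only this initial estimator for right-censored data, not the self-consistency iteration of (2.8). The Epanechnikov kernel, the bandwidth argument `h` and the function name `beran_survival` are assumptions made for illustration.

```python
def beran_survival(x0, y0, x_obs, w_obs, delta, h):
    """Kernel-weighted Kaplan-Meier estimate of P(Y > y0 | X = x0).

    w_obs are the observed (possibly right-censored) times and
    delta[i] is 1 if w_obs[i] is an uncensored value of Y, else 0.
    """
    def kernel(u):
        # Epanechnikov kernel (an assumed choice).
        return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

    weights = [kernel((x0 - x) / h) for x in x_obs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observation within bandwidth h of x0")
    weights = [w / total for w in weights]

    # Sort by time, uncensored observations first in case of ties.
    order = sorted(range(len(w_obs)), key=lambda i: (w_obs[i], 1 - delta[i]))
    surv = 1.0
    at_risk = 1.0  # total weight of observations still at risk
    for i in order:
        if w_obs[i] > y0:
            break
        if delta[i] and at_risk > 0.0:
            surv *= max(0.0, 1.0 - weights[i] / at_risk)
        at_risk -= weights[i]
    return surv
```

With equal kernel weights the sketch reduces to the ordinary Kaplan-Meier estimator. For instance, with times 1, 2, 3, 4, the second one censored, and all covariates equal to x0, the estimate at y0 = 3 is (3/4)(1/2) = 0.375.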

Theorem 2.3. The estimator solution of (2.8) is P-uniformly consistent and converges weakly to a centered Gaussian process.

The weak convergence is proved by the same method as in [PON 07]. Then m(x) is estimated by which equals

and converges P-uniformly to m on IX,n,h.

2.7. Other truncation and censoring of Y in a non-parametric model

The estimators are then written as

If Y is only right-truncated by C independent of (X, Y), with observations (X, Y) and C conditionally on Y ≤ C, the expressions α, A and B are then written as

The distribution functions FC and FY|X are both identifiable and their expressions differ from the previous ones,

The estimators are then

If Y is left and right-truncated by mutually independent variables T and C, independent of (X, Y), the observations are (X, Y), C and T, conditionally on T ≤ Y ≤ C,

The functions FC, FT and FY|X are identifiable and

with

Their estimators are

The other non-parametric estimators of section 2.2 and the results of section 2.5 generalize to all the estimators of this section.

2.8. Observation by interval

and its derivatives with respect to m(x) and are

Let be the empirical estimator of FC and

an estimator of is deduced by deconvolution and

2.9. Bibliography

[FAN 96] FAN J., GIJBELS I., Local Polynomial Modelling and its Applications, Chapman and Hall, London, 1996.

[GIL 83] GILL R., “Large sample behaviour of the product-limit estimator on the whole line”, Ann. Statist., vol. 11, p. 49–58, 1983.

[GIL 90] GILL R., KEIDING N., “Random truncation model and Markov processes”, Ann. Statist., vol. 18, p. 582–602, 1990.

[LAI 91] LAI T., YING Z., “Estimating a distribution function with truncated and censored data”, Ann. Statist., vol. 19, p. 417–442, 1991.

[PIN 06] PINÇON C., PONS O., “Nonparametric estimator of a quantile function for the probability of event with repeated data”, Dependence in Probability and Statistics, Lecture Notes in Statistics, vol. 17, p. 475–489, Springer, New York, 2006.

[PON 06] PONS O., “Estimation for semi-Markov models with partial observations via self-consistency equations”, Statistics, vol. 40, p. 377–388, 2006.

[PON 07] PONS O., “Estimation for the distribution function of one and two-dimensional censored variables or sojourn times of Markov renewal processes”, Communications in Statistics – Theory and Methods, vol. 36, num. 14, 2007.

[WOO 85] WOODROOFE M., “Estimating a distribution function with truncated data”, Ann. Statist., vol. 13, p. 163–177, 1985.

1 Chapter written by Odile PONS.

Chapter 3

Inference in Transformation Models for Arbitrarily Censored and Truncated Data1

3.1. Introduction

In survival analysis we deal with data related to times of events (or end-points) in individual life-histories. The survival data are not amenable to standard statistical procedures used in data analysis for several reasons. One of them is that survival data are not symmetrically distributed, but the main reason is that survival times are frequently censored. This usually happens when the data from a study are to be analyzed at a point when some individuals have not yet experienced the event of interest (or not reached the end-point). Many failure time data in epidemiological studies are simultaneously truncated and interval-censored. Interval-censored data occur in grouped data or when the event of interest is assessed on repeated visits. Right and left-censored data are particular cases of interval-censored data. Right-truncated data occur in registers. For instance, an acquired immune deficiency syndrome (AIDS) register only contains AIDS cases which have been reported. This generates right-truncated samples of induction times. [TUR 76] proposed a nice method of estimating the survival function in the case of arbitrarily censored and truncated data by a non-parametric maximum likelihood estimator. [FRY 94] noted that this method needed to be corrected slightly. [ALI 96] extended previous work by fitting a proportional hazards model to arbitrarily censored and truncated data, and concentrated on hypothesis testing. [HUB 04] introduced frailty models for the analysis of arbitrarily censored and truncated data, and focused on the estimation of the parameter of interest as well as the nuisance parameter of their model.

The concept of frailty models was introduced by [VAU 79] who studied models with Gamma-distributed frailties. There are many frailty distributions that could be considered, such as the Gamma which corresponds to the well-known Clayton-Cuzick model [CLA 85, CLA 86], the inverse Gaussian or the positive stable (see [HOU 84] and [HOU 86] for many examples). The choice of a Gamma-distributed frailty is the most popular in the literature, due to its mathematical convenience.

This work conducts a statistical analysis of interval-censored and truncated data using frailty models. We intend, using this analysis, to check the performance of the model proposed by [HUB 04]. In particular, we focus on hypothesis testing about the regression parameter of this model in different situations, such as the case of independent covariates and the misspecification of the truncated proportion of the population. Further research could be directed towards the case of dependent covariates and the case of misspecification of the frailty distribution producing the data.

3.2. Non-parametric estimation of the survival function S

where for i