Applications of Regression Models in Epidemiology E-Book

Erick Suarez

Publisher: John Wiley & Sons
Category: Science and New Technologies
Language: English

Description

A one-stop guide for public health students and practitioners learning the applications of classical regression models in epidemiology.

This book is written for public health professionals and students interested in applying regression models in the field of epidemiology. The academic material is usually covered in public health courses including (i) Applied Regression Analysis, (ii) Advanced Epidemiology, and (iii) Statistical Computing. The book is composed of 13 chapters, including an introductory chapter that covers basic concepts of statistics and probability. Among the topics covered are the linear regression model, the polynomial regression model, weighted least squares, methods for selecting the best regression equation, and generalized linear models and their applications to different epidemiological study designs. An example is provided in each chapter that applies the theoretical aspects presented in that chapter. In addition, exercises are included, and the final chapter is devoted to the solutions of these exercises, with answers in all of the major statistical software packages, including STATA, SAS, SPSS, and R. It is assumed that readers of this book have taken basic courses in biostatistics, epidemiology, and introductory calculus. The book will be of interest to anyone looking to understand the statistical fundamentals that support quantitative research in public health.

In addition, this book:

* Is based on the authors' course notes from 20 years teaching regression modeling in public health courses
* Provides exercises at the end of each chapter
* Contains a solutions chapter with answers in STATA, SAS, SPSS, and R
* Provides real-world public health applications of the theoretical aspects contained in the chapters

Applications of Regression Models in Epidemiology is a reference for graduate students in public health and public health practitioners.
ERICK SUÁREZ is a Professor of the Department of Biostatistics and Epidemiology at the University of Puerto Rico School of Public Health. He received a Ph.D. degree in Medical Statistics from the London School of Hygiene and Tropical Medicine. He has 29 years of experience teaching biostatistics.

CYNTHIA M. PÉREZ is a Professor of the Department of Biostatistics and Epidemiology at the University of Puerto Rico School of Public Health. She received an M.S. degree in Statistics and a Ph.D. degree in Epidemiology from Purdue University. She has 22 years of experience teaching epidemiology and biostatistics.

ROBERTO RIVERA is an Associate Professor at the College of Business at the University of Puerto Rico at Mayaguez. He received a Ph.D. degree in Statistics from the University of California in Santa Barbara. He has more than five years of experience teaching statistics courses at the undergraduate and graduate levels.

MELISSA N. MARTÍNEZ is an Account Supervisor at Havas Media International. She holds an MPH in Biostatistics from the University of Puerto Rico and an MSBA from the National University in San Diego, California. For the past seven years, she has been performing analyses for the biomedical research and media advertising fields.

Details

Number of pages: 325

Year of publication: 2017

Reading Sample

Cover

Title Page

Dedication

Preface

Acknowledgments

About the Authors

Chapter 1: Basic Concepts for Statistical Modeling

1.1 Introduction

1.2 Parameter Versus Statistic

1.3 Probability Definition

1.4 Conditional Probability

1.5 Concepts of Prevalence and Incidence

1.6 Random Variables

1.7 Probability Distributions

1.8 Centrality and Dispersion Parameters of a Random Variable

1.9 Independence and Dependence of Random Variables

1.10 Special Probability Distributions

1.11 Hypothesis Testing

1.12 Confidence Intervals

1.13 Clinical Significance Versus Statistical Significance

1.14 Data Management

1.15 Concept of Causality

References

Chapter 2: Introduction to Simple Linear Regression Models

2.1 Introduction

2.2 Specific Objectives

2.3 Model Definition

2.4 Model Assumptions

2.5 Graphic Representation

2.6 Geometry of the Simple Regression Model

2.7 Estimation of Parameters

2.8 Variance of Estimators

2.9 Hypothesis Testing About the Slope of the Regression Line

2.10 Coefficient of Determination R²

2.11 Pearson Correlation Coefficient

2.12 Estimation of Regression Line Values and Prediction

2.13 Example

2.14 Predictions

2.15 Conclusions

Practice Exercise

References

Chapter 3: Matrix Representation of the Linear Regression Model

3.1 Introduction

3.2 Specific Objectives

3.3 Definition

3.4 Matrix Representation of a SLRM

3.5 Matrix Arithmetic

3.6 Matrix Multiplication

3.7 Special Matrices

3.8 Linear Dependence

3.9 Rank of a Matrix

3.10 Inverse Matrix [A⁻¹]

3.11 Application of an Inverse Matrix in a SLRM

3.12 Estimation of β Parameters in a SLRM

3.13 Multiple Linear Regression Model (MLRM)

3.14 Interpretation of the Coefficients in a MLRM

3.15 ANOVA in a MLRM

3.16 Using Indicator Variables (Dummy Variables)

3.17 Polynomial Regression Models

3.18 Centering

3.19 Multicollinearity

3.20 Interaction Terms

3.21 Conclusion

Practice Exercise

References

Chapter 4: Evaluation of Partial Tests of Hypotheses in a MLRM

4.1 Introduction

4.2 Specific Objectives

4.3 Definition of Partial Hypothesis

4.4 Evaluation Process of Partial Hypotheses

4.5 Special Cases

4.6 Examples

4.7 Conclusion

Practice Exercise

References

Chapter 5: Selection of Variables in a Multiple Linear Regression Model

5.1 Introduction

5.2 Specific Objectives

5.3 Selection of Variables According to the Study Objectives

5.4 Criteria for Selecting the Best Regression Model

5.5 Stepwise Method in Regression

5.6 Limitations of Stepwise Methods

5.7 Conclusion

Practice Exercise

References

Chapter 6: Correlation Analysis

6.1 Introduction

6.2 Specific Objectives

6.3 Main Correlation Coefficients Based on SLRM

6.4 Major Correlation Coefficients Based on MLRM

6.5 Partial Correlation Coefficient

6.6 Significance Tests

6.7 Suggested Correlations

6.8 Example

6.9 Conclusion

Practice Exercise

References

Chapter 7: Strategies for Assessing the Adequacy of the Linear Regression Model

7.1 Introduction

7.2 Specific Objectives

7.3 Residual Definition

7.4 Initial Exploration

7.5 Initial Considerations

7.6 Standardized Residual

7.7 Jackknife Residuals (R-Student Residuals)

7.8 Normality of the Errors

7.9 Correlation of Errors

7.10 Criteria for Detecting Outliers, Leverage, and Influential Points

7.11 Leverage Values

7.12 Cook's Distance

7.13 COV RATIO

7.14 DFBETAS

7.15 DFFITS

7.16 Summary of the Results

7.17 Multicollinearity

7.18 Transformation of Variables

7.19 Conclusion

Practice Exercise

References

Chapter 8: Weighted Least-Squares Linear Regression

8.1 Introduction

8.2 Specific Objectives

8.3 Regression Model with Transformation into the Original Scale of Y

8.4 Matrix Notation of the Weighted Linear Regression Model

8.5 Application of the WLS Model with Unequal Number of Subjects

8.6 Applications of the WLS Model When Variance Increases

8.7 Conclusions

Practice Exercise

References

Chapter 9: Generalized Linear Models

9.1 Introduction

9.2 Specific Objectives

9.3 Exponential Family of Probability Distributions

9.4 Exponential Family of Probability Distributions with Dispersion

9.5 Mean and Variance in EF and EDF

9.6 Definition of a Generalized Linear Model

9.7 Estimation Methods

9.8 Deviance Calculation

9.9 Hypothesis Evaluation

9.10 Analysis of Residuals

9.11 Model Selection

9.12 Bayesian Models

9.13 Conclusions

References

Chapter 10: Poisson Regression Models for Cohort Studies

10.1 Introduction

10.2 Specific Objectives

10.3 Incidence Measures

10.4 Confounding Variable

10.5 Stratified Analysis

10.6 Poisson Regression Model

10.7 Definition of Adjusted Relative Risk

10.8 Interaction Assessment

10.9 Relative Risk Estimation

10.10 Implementation of the Poisson Regression Model

10.11 Conclusion

Practice Exercise

References

Chapter 11: Logistic Regression in Case–Control Studies

11.1 Introduction

11.2 Specific Objectives

11.3 Graphical Representation

11.4 Definition of the Odds Ratio

11.5 Confounding Assessment

11.6 Effect Modification

11.7 Stratified Analysis

11.8 Unconditional Logistic Regression Model

11.9 Types of Logistic Regression Models

11.10 Computing the ORcrude

11.11 Computing the Adjusted OR

11.12 Inference on OR

11.13 Example of the Application of ULR Model: Binomial Case

11.14 Conditional Logistic Regression Model

11.15 Conclusions

Practice Exercise

References

Chapter 12: Regression Models in a Cross-Sectional Study

12.1 Introduction

12.2 Specific Objectives

12.3 Prevalence Estimation Using the Normal Approach

12.4 Definition of the Magnitude of the Association

12.5 POR Estimation

12.6 Prevalence Ratio

12.7 Stratified Analysis

12.8 Logistic Regression Model

12.9 Conclusions

Practice Exercise

References

Chapter 13: Solutions to Practice Exercises

Chapter 2 Practice Exercise

Chapter 3 Practice Exercise

Chapter 4 Practice Exercise

Chapter 5 Practice Exercise

Chapter 6 Practice Exercise

Chapter 7 Practice Exercise

Chapter 8 Practice Exercise

Chapter 10 Practice Exercise

Chapter 11 Practice Exercise

Chapter 12 Practice Exercise

Index

End User License Agreement

Applications of Regression Models in Epidemiology

Erick Suárez, Cynthia M. Pérez, Roberto Rivera, and Melissa N. Martínez

Published by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Names: Suárez, Erick L., 1953-

Title: Applications of Regression Models in Epidemiology / Erick Suarez [and three others].

Description: Hoboken, New Jersey : John Wiley & Sons, Inc., [2017] | Includes index.

Identifiers: LCCN 2016042829 | ISBN 9781119212485 (cloth) | ISBN 9781119212508 (epub)

Subjects: LCSH: Medical statistics. | Regression analysis. | Public health.

Classification: LCC RA407 .A67 2017 | DDC 610.2/1—dc23 LC record available at https://lccn.loc.gov/2016042829

To our loved ones

To those who have a strong commitment to social justice, human rights, and public health.

Preface

This book is intended to serve as a guide for statistical modeling in epidemiologic research. Our motivation for writing this book lies in our years of experience teaching biostatistics and epidemiology for different academic and professional programs at the University of Puerto Rico Medical Sciences Campus. This subject matter is usually covered in biostatistics courses at the master's and doctoral levels at schools of public health. The main focus of this book is statistical models and their analytical foundations for data collected from basic epidemiological study designs. This 13-chapter book can serve equally well as a textbook or as a source for consultation. Readers will be exposed to the following topics: linear and multiple regression models, matrix notation in regression models, correlation analysis, strategies for selecting the best model, partial hypothesis testing, weighted least-squares linear regression, generalized linear models, conditional and unconditional logistic regression models, Poisson regression, and programming codes in STATA, SAS, R, and SPSS for different practice exercises. We have started with the assumption that the readers of this book have taken at least a basic course in biostatistics and epidemiology. However, the first chapter describes the basic concepts needed for the rest of the book.

Erick Suárez, University of Puerto Rico, Medical Sciences Campus

Cynthia M. Pérez, University of Puerto Rico, Medical Sciences Campus

Roberto Rivera, University of Puerto Rico, Mayagüez Campus

Melissa N. Martínez, Havas Media International Company

Acknowledgments

We wish to express our gratitude to our departmental colleagues for their continued support in the writing of this book. We are grateful to our colleagues and students for helping us to develop the programming for some of the examples and exercises: Heidi Venegas, Israel Almódovar, Oscar Castrillón, Marievelisse Soto, Linnette Rodríguez, José Rivera, Jorge Albarracín, and Glorimar Meléndez. We would also like to thank Sheila Ward for providing editorial advice. This book has been made possible by financial support received from grant CA096297/CA096300 from the National Cancer Institute and award number 2U54MD007587 from the National Institute on Minority Health and Health Disparities, both parts of the U.S. National Institutes of Health. Finally, we would like to thank our families for encouraging us throughout the development of this book.

About the Authors

Erick Suárez is a Professor of Biostatistics at the Department of Biostatistics and Epidemiology of the University of Puerto Rico Graduate School of Public Health. He received a Ph.D. degree in Medical Statistics from the London School of Hygiene and Tropical Medicine. With more than 29 years of experience teaching biostatistics at the graduate level, he has also directed mentoring and training efforts for public health students at the University of Puerto Rico. His research interests include HIV, HPV, cancer, diabetes, and genetical statistics.

Cynthia M. Pérez is a Professor of Epidemiology at the Department of Biostatistics and Epidemiology of the University of Puerto Rico Graduate School of Public Health. She received an M.S. degree in Statistics and a Ph.D. degree in Epidemiology from Purdue University. Since 1994, she has taught epidemiology and biostatistics. She has directed mentoring and training efforts for public health and medical students at the University of Puerto Rico. Her research interests include diabetes, cardiovascular disease, periodontal disease, viral hepatitis, and HPV infection.

Roberto Rivera is an Associate Professor at the College of Business of the University of Puerto Rico at Mayaguez. He received an M.A. and a Ph.D. degree in Statistics from the University of California in Santa Barbara. He has more than 5 years of experience teaching statistics courses at the undergraduate and graduate levels and his research interests include asthma, periodontal disease, marine sciences, and environmental statistics.

Melissa N. Martínez is a statistical analyst at the Havas Media International Company, located in Miami, FL. She has an MPH in Biostatistics from the University of Puerto Rico, Medical Sciences Campus, and recently graduated from the Master of Business Analytics program at National University, San Diego, CA. For the past 7 years, she has been performing statistical analyses in the biomedical research, healthcare, and media advertising fields. She has assisted with the design of clinical trials, performing sample size calculations and writing clinical trial reports.

1 Basic Concepts for Statistical Modeling

Aim: Upon completing this chapter, the reader should be able to understand the basic concepts for statistical modeling in public health.

1.1 Introduction

It is assumed that the reader has taken introductory classes in biostatistics and epidemiology. Nevertheless, in this chapter we review the basic concepts of probability and statistics and their application to the public health field. The importance of data quality is also addressed and a discussion on causality in the context of epidemiological studies is provided.

Statistics is defined as the science and art of collecting, organizing, presenting, summarizing, and interpreting data. There is strong theoretical evidence backing many of the statistical procedures that will be discussed. However, in practice, statistical methods require decisions on organizing the data, constructing plots, and using rules of thumb that make statistics an art as well as a science.

Biostatistics is the branch of statistics that applies statistical methods to health sciences. The goal is typically to understand and improve the health of a population. A population, sometimes referred to as the target population, can be defined as the group of interest in our analysis. In public health, the population can be composed of healthy individuals or those at risk of disease and death. For example, study populations may include healthy people, breast cancer patients, obese subjects residing in Puerto Rico, persons exposed to high levels of asbestos, or persons with high-risk behaviors. Among the objectives of epidemiological studies are to describe the burden of disease in populations and to identify the etiology of diseases, both of which provide essential information for planning health services.

It is convenient to frame our research questions about a population in terms of traits. A measurement that describes a population is known as a parameter. Examples are the prevalence of diabetes among Hispanics, the incidence of breast cancer in older women, and the average hospital stay of acute ischemic stroke patients in Puerto Rico. We cannot always obtain the parameter directly by counting or measuring the whole population of interest. Doing so might be too costly or time-consuming, the population may be too large, or it may be unfeasible for other reasons.

For example, if a health officer believes that the incidence of hepatitis C has increased in the last 5 years in a region, he or she cannot recommend a new preventive program without any data. If resources are limited, some information has to be collected from a sample of the population. Another example is the assessment of the effectiveness of a new breast cancer screening strategy. Since it is not practical to perform this assessment in all women at risk, an alternative is to select at least two samples of women, one that will receive the new screening strategy and another that will receive a different modality.

There are several ways to select samples from a population. We want the sample to be as representative of the population as possible so that we can make appropriate inferences about that population. However, there are other aspects to consider, such as convenience, cost, time, and availability of resources. The sample allows us to estimate the parameter of interest through what is known as a sample statistic, or statistic for short. Although the statistic estimates the parameter, there are key differences between the statistic and the parameter.

1.2 Parameter Versus Statistic

Let us take a look at the distinction between a parameter and a statistic. The classical concept of a parameter is a numerical value that, for our purposes, is constant, or fixed, at a given period of time; for example, the mean birth weight in grams of newborns to Chinese women in 2015. On the other hand, a statistic is a numerical value that is random; for example, the mean birth weight in grams of 1000 newborns selected randomly from the women who delivered in maternity units of hospitals in China in the last 2 years. Because it comes from a subset of the population, the value of the statistic depends on the subjects that fall in the sample, and this is what makes the statistic random. Sometimes, Greek symbols are used to denote parameters, to better distinguish between parameters and statistics. Sample statistics can provide reliable estimates of parameters as long as the population is carefully specified relative to the problem at hand and the sample is representative of that population. That the sample should be representative of the population may sound trivial, but it may be easier said than done. In clinical research, participants are often volunteers, a technique known as convenience sampling. The advantage of convenience sampling is that it is less expensive and time-consuming. The disadvantage is that results from volunteers may differ from those of people who do not volunteer, and hence the results may be biased. The process of reaching conclusions about the population based on a sample is known as statistical inference. As long as the data obtained from the sample are representative of the population, we can reach conclusions about the population by using the statistics gathered from the sample, while accounting for the uncertainty around these statistics through probability. Further discussion of sampling techniques in public health can be found in Korn and Graubard (1999) and Heeringa et al. (2010).
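The difference between a fixed parameter and a random statistic can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population of birth weights is simulated, and its mean (3300 g) and standard deviation (450 g) are hypothetical values chosen for the example, not data from the text.

```python
import random
import statistics

# Hypothetical population: 100,000 simulated birth weights in grams.
# The mean (3300 g) and SD (450 g) are illustrative assumptions only.
rng = random.Random(2017)  # fixed seed so the sketch is reproducible
population = [rng.gauss(3300, 450) for _ in range(100_000)]

# The parameter is computed from the whole population and stays fixed.
parameter = statistics.fmean(population)

# A statistic is computed from a sample; each new random sample of
# 1000 newborns gives a slightly different value.
sample_means = [
    statistics.fmean(rng.sample(population, 1000)) for _ in range(5)
]

print(f"population mean (parameter): {parameter:.1f} g")
for k, m in enumerate(sample_means, start=1):
    print(f"sample {k} mean (statistic): {m:.1f} g")
```

Each run of the loop draws a different subset of the population, so the five sample means differ from one another while remaining close to the single population mean.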

1.3 Probability Definition

Probability measures how likely it is that a specific event will occur. Simply put, probability is one of the main tools to quantify uncertainty. For any event A, we define P(A) as the probability of A. For any event A, 0 ≤ P(A) ≤ 1. When an event has a probability of 0.5, it means that it is equally likely that the event will or will not occur. As the probability approaches 1, an event becomes more likely to occur, and as the probability approaches 0, the event becomes less likely. Examples of events of interest in public health include exposure to secondhand smoke, diagnosis of type 2 diabetes, or death due to coronary heart disease. Events may be a combination of other events. For example, event “A,B” is the event when A and B occur simultaneously. We define P(A,B) as the probability of “A,B.” The probability of two or more events occurring is known as a joint probability; for example, assuming A = HIV positive and B = Female, then P(A,B) indicates the joint probability of a subject being HIV positive and female.

1.4 Conditional Probability

The probability of an event A given that B has occurred is known as a conditional probability and is expressed as P(A|B) = P(A,B)/P(B). That is, we can interpret conditional probability as the probability of A and B occurring simultaneously relative to the probability of B occurring. For example, if we define event B as intravenous drug use and event A as hepatitis C virus (HCV) seropositivity status, then P(A|B) indicates the probability of being HCV seropositive given the subject is an intravenous drug user. Beware: P(A|B) ≠ P(B|A). In the expression to the left of the inequality we find how likely A is given that B has occurred, while in the expression to the right of the inequality we find how likely B is given that A has occurred. Another interpretation of P(A|B) can be as follows: given some information (i.e., the occurrence of event B), what is the probability that an event (A) occurs? For example, what is the probability of a person developing lung cancer (A) given that he has been exposed to tobacco smoke carcinogens (B)? Conditional probabilities are regularly used to conduct statistical inference.
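As a rough numerical illustration of conditional probability, the following sketch computes both conditional probabilities for the intravenous drug use and HCV example. All counts are hypothetical and were invented for this illustration; they are not data from the text.

```python
# Hypothetical counts for 1000 subjects (illustrative only).
n_total = 1000
n_b = 200         # B: intravenous drug users
n_a = 150         # A: HCV seropositive
n_a_and_b = 120   # subjects who are both A and B

p_a = n_a / n_total
p_b = n_b / n_total
p_ab = n_a_and_b / n_total    # joint probability P(A,B)

p_a_given_b = p_ab / p_b      # P(A|B) = P(A,B) / P(B)
p_b_given_a = p_ab / p_a      # P(B|A) = P(A,B) / P(A)

print(f"P(A|B) = {p_a_given_b:.2f}")  # prints P(A|B) = 0.60
print(f"P(B|A) = {p_b_given_a:.2f}")  # prints P(B|A) = 0.80
```

With these assumed counts the two conditional probabilities differ, which is the point of the warning that the order of conditioning matters.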

Let us assume a woman is pregnant. G is the event that the baby is a girl, and H is the event that the expecting mother is a smoker. Can you guess what P(G|H) is without a calculation? Intuitively, we can guess that it is about 0.5, or 50%. In general, however, P(G) < 0.5, and the male/female ratio at birth varies by country. That is, the fact that the expecting mother is a smoker has no impact on the chances of giving birth to a girl, P(G|H) = P(G). Two events are independent when the occurrence of one event does not affect the probability of occurrence of the other. When events A and B are independent, then P(A|B) = P(A) and P(B|A) = P(B). Independence implies that P(A,B) = P(A)P(B). For example, the probability that a woman has diabetes (A) and she is a lawyer (B) can be found as the product of the probability that a woman has diabetes times the probability that a woman is a lawyer, if we assume that diabetes diagnosis is independent of professional occupation.
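The product rule for independent events can be checked numerically. The sketch below uses a hypothetical helper, `are_independent`, and made-up probabilities for diabetes and for being a lawyer; neither the helper nor the numbers come from the text.

```python
import math

def are_independent(p_a, p_b, p_ab, rel_tol=1e-9):
    """Return True when P(A,B) equals P(A) * P(B) up to rounding error."""
    return math.isclose(p_ab, p_a * p_b, rel_tol=rel_tol)

# Hypothetical marginal probabilities (illustrative assumptions).
p_diabetes = 0.10
p_lawyer = 0.02

# If the two events are assumed independent, the joint probability
# is simply the product of the marginal probabilities.
p_joint = p_diabetes * p_lawyer
print(f"P(diabetes, lawyer) under independence = {p_joint:.4f}")

# A joint probability that differs from the product signals dependence:
# here P(A) = 0.15, P(B) = 0.20, but P(A,B) = 0.12 rather than 0.03.
print(are_independent(0.15, 0.20, 0.12))
```

In practice the probabilities are estimated from data, so an exact equality check like this one would be replaced by a statistical test; the sketch only illustrates the definition.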

1.5 Concepts of Prevalence and Incidence

In public health there are two important concepts for measuring disease occurrence, prevalence and incidence. The prevalence of a disease is the probability of having the disease at a given point in time; for example, the probability of someone being diagnosed with diabetes in a medical visit. Incidence is the probability that a person with no prior disease will develop disease over some specified time period; for example, the probability of developing lung cancer after 10 years of heavy smoking exposure.
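A minimal sketch of how these two measures might be computed from counts follows. The function names and all counts are hypothetical, and incidence is computed here as a simple proportion of new cases among persons who were disease-free at the start of follow-up, in line with the definition above.

```python
def prevalence(existing_cases, population_size):
    """Proportion of the population that has the disease at one point in time."""
    return existing_cases / population_size

def cumulative_incidence(new_cases, at_risk_at_start):
    """Proportion of initially disease-free persons who develop disease
    during the follow-up period."""
    return new_cases / at_risk_at_start

# Hypothetical survey: 400 of 5000 persons have diabetes today.
print(f"prevalence = {prevalence(400, 5000):.3f}")

# Hypothetical cohort: the 4600 disease-free persons are followed for
# 5 years, and 230 of them develop diabetes during that time.
print(f"5-year incidence = {cumulative_incidence(230, 4600):.3f}")
```

Note that the denominators differ: prevalence uses everyone in the population, while incidence uses only those who did not have the disease at the start.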

1.6 Random Variables

A random variable, also known as a stochastic variable, has values derived from a function that turns outcomes from the sample space into numbers. Probabilities are assigned either to each value or to ranges of values of the random variable. If the random variable is counting something, then it is a discrete random variable. If the random variable is measuring something (e.g., length, weight, or duration), then it is a continuous random variable. Discrete random variables have integer values; examples are the number of hospitalizations, the number of smokers, or the number of HIV-infected patients. Within any interval, continuous random variables have an infinite number of possible values. Examples are the body mass index, blood pressure, or fasting plasma glucose levels of a person.

1.7 Probability Distributions

In epidemiological studies, usually the primary variable in a study, Y, is discrete (an integer number). For example, the number of hospital admissions for chest pain, the number of fractures or sprains seen in an emergency room, the number of incident cancer cases, or the number of people with moderate or severe periodontitis.

Other examples would be the specific result of a clinical evaluation, for example, positive versus negative results from a laboratory test, or presence versus absence of disease. In these cases, the study variable Y is dichotomous, where the variable is coded as follows:

Y = 1, to indicate the presence of disease (or testing positive).

Y = 0, to indicate the absence of disease (or testing negative).

The specific definition of the random variable Y depends on the epidemiologic study design that is used. In a case–control study, history of exposure is the random variable, where persons with the disease of interest (cases) and persons without the disease of interest (controls) are first selected and then we compare the prevalence of exposure in both groups. In a cohort study, the development of the disease is the random variable, where the exposure and nonexposure groups are first defined and then we compare the incidence of disease in each exposure group. These random variables cannot be determined in advance (their values are defined upon completion of the measurement), but their values or attributes can be determined in probabilistic terms. For example,

In a case–control study, the habit of smoking in the past cannot be defined until a subject undergoes an interview, but we could determine the probability of this habit based on previous data or under specific assumptions.

In a cohort study, the development of cervical cancer based on human papilloma virus (HPV) infection status is unknown until the study is completed; however, we could determine the probability of this cancer based on previous data or under specific assumptions.

Therefore, for each value of the random variable Y, we need to identify the corresponding probability:

$$P(Y = y_i) = p_i$$

where

$y_i$ = $i$th value of the random variable $Y$

$p_i$ = probability associated with $y_i$

Probability distribution functions are used to assign probabilities to values of random variables. Usually, a probability distribution is represented in the Cartesian plane, where the possible values of the random variable are plotted on the x-axis, while the corresponding probabilities are on the y-axis (see Figure 1.1).

Figure 1.1 Probability distribution of a discrete random variable.
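A discrete probability distribution can be sketched in code as a mapping from each possible value to its probability. In the example below the random variable (number of hospitalizations in one year), the probabilities, and the helper functions are all hypothetical and serve only to illustrate the idea.

```python
import math

# Hypothetical distribution of Y = number of hospitalizations in one year.
# The probabilities are invented for illustration.
pmf = {0: 0.50, 1: 0.30, 2: 0.15, 3: 0.05}

def is_valid_pmf(dist, abs_tol=1e-9):
    """Check that every probability lies in [0, 1] and that they sum to 1."""
    in_range = all(0.0 <= p <= 1.0 for p in dist.values())
    return in_range and math.isclose(sum(dist.values()), 1.0, abs_tol=abs_tol)

def prob_at_most(dist, y):
    """P(Y <= y): add the probabilities of all values not exceeding y."""
    return sum(p for value, p in dist.items() if value <= y)

assert is_valid_pmf(pmf)

# Text version of a plot: one row per value, bar length shows probability.
for value, p in sorted(pmf.items()):
    print(f"y = {value}  p = {p:.2f}  {'#' * round(p * 40)}")

print(f"P(Y <= 1) = {prob_at_most(pmf, 1):.2f}")  # prints P(Y <= 1) = 0.80
```

The validity check reflects the two basic requirements of a probability distribution: each probability lies between 0 and 1, and the probabilities of all possible values add up to 1.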

1.8 Centrality and Dispersion Parameters of a Random Variable

The expected value of a random variable is a number that tells us what will be a typical value for the random variable, usually represented as E(Y) or μ. It is not necessarily a possible value of the random variable (a discrete random variable with possible values 1, 2, and 3 might have an expected value of 2.3). In the case of a discrete random variable Y, the expected value of Y is defined as follows:

$$E(Y) = \mu = \sum_{i=1}^{n} y_i \, p_i \tag{1.1}$$

where $p_i$ indicates the probability of $y_i$ and $n$ is the number of possible values Y can have.

Another characteristic of a random variable is its dispersion, which quantifies how the possible values of Y are spread out around its expected value. Dispersion is usually measured by the variance, Var(Y), which is the expected squared distance between the possible values $y_i$ of Y and its expected value. For example, for the discrete random variable Y,

$$\mathrm{Var}(Y) = \sum_{i=1}^{n} (y_i - \mu)^2 \, p_i \tag{1.2}$$

where μ indicates the expected value of Y and $p_i$ indicates the probability of $y_i$. Usually, the variance is represented by σ². Sometimes the standard deviation (σ), the square root of the variance, is used as a measure of dispersion rather than the variance.
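As a brief illustration of equations (1.1) and (1.2), the following Python sketch computes the expected value, variance, and standard deviation of a discrete random variable. The probabilities 0.2, 0.3, and 0.5 are an assumption made only for this illustration; they were chosen so that a variable with possible values 1, 2, and 3 has the expected value 2.3 mentioned above.

```python
import math


def expected_value(values, probs):
    """Equation (1.1): sum of y_i * p_i over all possible values."""
    return sum(y * p for y, p in zip(values, probs))


def variance(values, probs):
    """Equation (1.2): sum of (y_i - mu)^2 * p_i over all possible values."""
    mu = expected_value(values, probs)
    return sum((y - mu) ** 2 * p for y, p in zip(values, probs))


# Hypothetical distribution (an assumption, not taken from real data)
values = [1, 2, 3]
probs = [0.2, 0.3, 0.5]

mu = expected_value(values, probs)
var = variance(values, probs)
sd = math.sqrt(var)

print(f"E(Y) = {mu:.2f}, Var(Y) = {var:.2f}, SD = {sd:.3f}")
```

Under these assumed probabilities, the expected value is 2.3 and the variance is 0.61, so the expected value is indeed not one of the possible values of Y.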

A probability distribution usually depends on one or more parameters that can be estimated with some measurements from a sample selected from the population of interest. For many distributions, a parameter represents the expected value of the measurements, or some function of the expected value. Other parameters may indicate the shape, scale, or width of the distribution (e.g., measures of variability or dispersion). These parameters are important in determining the form of the probability distribution (Jewell, 2004).

1.9 Independence and Dependence of Random Variables

The attribute of independent random variables will be employed frequently in this book. Often, this will be based on the argument that a sample of subjects was chosen randomly. Random selection means that what we obtain as the first observation (first value of a random variable) does not affect the probability of what we will get in the following observation (second value of a random variable). In contrast, when we randomly select households from a sampling design and interview all members of the family of a selected household, it is very likely that their responses will be highly correlated, particularly in dietary habits. For the most part, we will focus on a specific type of dependence between two variables: linear dependence (e.g., blood pressure (Y) and age (X)). This type of association will be modeled through conditional probabilities (and hence conditional expectations). However, keep in mind that absence of linear dependence does not automatically mean independence in general. The association may be nonlinear, and statistical tools designed to detect linear dependence may fail to detect it.
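The distinction between absence of linear dependence and independence can be checked on a small constructed example. The sketch below assumes a hypothetical variable X that takes the values −1, 0, and 1 with equal probability and defines Y = X²; this joint distribution is an illustration only and does not come from any study. The covariance, which measures linear dependence, is zero, yet knowing X changes the probability of Y.

```python
from fractions import Fraction

# Hypothetical joint distribution (an assumption for illustration):
# X takes -1, 0, 1 with probability 1/3 each, and Y = X**2.
third = Fraction(1, 3)
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}

# Expected values computed directly from the joint distribution
e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())

# Covariance: zero means no linear dependence
cov_xy = e_xy - e_x * e_y

# Marginal versus conditional probability of Y = 1
p_y1 = sum(p for (x, y), p in joint.items() if y == 1)
p_x1 = sum(p for (x, y), p in joint.items() if x == 1)
p_y1_given_x1 = joint[(1, 1)] / p_x1

print("Cov(X, Y) =", cov_xy)
print("Pr(Y = 1) =", p_y1)
print("Pr(Y = 1 | X = 1) =", p_y1_given_x1)
```

Here Cov(X, Y) = 0, but Pr(Y = 1) = 2/3 while Pr(Y = 1 | X = 1) = 1, so X and Y are dependent even though they show no linear association.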

1.10 Special Probability Distributions

Previously, the presence or absence of a disease was represented in terms of a random variable, with Y = 1 indicating presence of the disease and Y = 0 indicating absence of the disease. There is a wide class of situations that can be represented in terms of such a binary random variable. If we abstractly define $\Pr(Y = 1) = p$, then we can set up a family of probability distributions and use it to define general, simplified ways to find values for characteristics in the population of interest, such as probabilities, E(Y) or Var(Y). We will describe the families of probability distributions most widely used in the statistical analysis of data derived from basic epidemiologic study designs.

1.10.1 Binomial Distribution

A Bernoulli trial is an observation that has two possible outcomes, identified as success or failure (Rosner, 2010). For example, the result of a serological test for HIV represents a Bernoulli trial, since the results of this test can be classified as a random variable with two possible results: positive (success) or negative (failure). The binomial distribution can be used when the random variable represents the number of cases (successes) based on a fixed number of independent Bernoulli trials. The specific formula to obtain probabilities of a binomial random variable is as follows:

$$\Pr(Y = y) = \binom{n}{y} \, p^{y} (1 - p)^{n - y} \tag{1.3}$$

where

$n$ is the parameter that defines the number of independent Bernoulli trials.

$p$ is the parameter that defines the probability of a success for each Bernoulli trial.

$y$ indicates one of the possible values of the random variable $Y$, which vary from 0 to $n$.

It can be shown that for a binomial random variable, E(Y) = np, while Var(Y) = np(1 − p). For example, assume you want to determine the probability of observing exactly two HIV+ individuals in a hypothetical study of 20 randomly chosen injection drug users ($n = 20$). If it is known that the probability of being HIV+ is 0.10 ($p = 0.10$), then

$$\Pr(Y = 2) = \binom{20}{2} (0.10)^{2} (0.90)^{18} = 0.285$$

That is, there is a 28.5% probability of observing exactly 2 out of 20 HIV+ drug users in this hypothetical study, where the probability of any person being HIV+ is 0.1. Also, in this case E(Y) = 20(0.1) = 2. That is, across repeated samples of 20 injection drug users, we expect two to be HIV+ on average. Moreover, Var(Y) = 20(0.1)(0.9) = 1.8. With such a low spread, large values of Y are highly unlikely in this example (readers can double check this by finding probabilities of values of Y close to its largest possible value, 20).
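The calculation above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `binomial_pmf` is our own choice for this illustration. It evaluates the binomial formula (1.3) for the hypothetical study with n = 20 and p = 0.10, and also adds up the probabilities of the large values of Y (10 or more), as suggested in the text.

```python
from math import comb


def binomial_pmf(y, n, p):
    """Binomial formula (1.3): probability of exactly y successes in n trials."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)


# Values from the hypothetical HIV study in the text
n, p = 20, 0.10

prob_two = binomial_pmf(2, n, p)  # Pr(Y = 2)
mean = n * p                      # E(Y) = np
var = n * p * (1 - p)             # Var(Y) = np(1 - p)

# Probability of observing 10 or more HIV+ participants out of 20
prob_ten_or_more = sum(binomial_pmf(y, n, p) for y in range(10, n + 1))

print(f"Pr(Y = 2)   = {prob_two:.3f}")
print(f"E(Y)        = {mean:.1f}")
print(f"Var(Y)      = {var:.1f}")
print(f"Pr(Y >= 10) = {prob_ten_or_more:.2e}")
```

The first three lines of output agree with the values 0.285, 2, and 1.8 obtained above, and the last line confirms that values of Y far above the expected value have very small probability.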

1.10.2 Poisson Distribution

The Poisson distribution can be used when the random variable represents the number of cases (successes) in any of the following three settings:

In a very large number of independent Bernoulli trials when the probability of success is small.

For a unit of time (e.g., day, month, or year).

On a unit area (e.g., square meter, square kilometer, or square mile) or volume (e.g., cubic meter or cubic centimeter).

An example of a random variable that could be associated with a Poisson distribution is the number of cancer cases reported in one year in a specific community. Another example would be the number of car accidents that occur in a given week. The formula to find the probability of a specific value of a Poisson random variable is as follows:

$$\Pr(Y = y) = \frac{e^{-\lambda} \lambda^{y}}{y!} \tag{1.4}$$

where

$\lambda$ is the distribution parameter that indicates the number of cases expected per unit of time or space (area or volume).

$y$ is the value of the random variable. The possible values of a random variable with Poisson distribution range from 0 to infinity.

$e$ is Euler's number, whose value is approximately 2.7183.

Furthermore, E(Y) = λ and Var(Y) = λ. For example, assume that in a specific community there is an average of 10 car accidents per week ($\lambda = 10$) and you want to determine the probability of observing exactly 7 car accidents in a week ($y = 7$). Substituting these values in the Poisson formula, we get the following:

$$\Pr(Y = 7) = \frac{e^{-10} \, 10^{7}}{7!} = 0.09$$

That is, there is a 0.09 probability of observing exactly 7 car accidents in a week in the community, where on average there are 10 car accidents per week.
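The same result can be verified numerically. The Python sketch below evaluates the Poisson formula (1.4) for λ = 10 and y = 7. As an additional check on the first setting listed above (many independent Bernoulli trials with a small probability of success), it compares the Poisson probability with a binomial probability for n = 1000 and p = 0.01; these two values are assumptions chosen only so that np = 10, and the function names are our own.

```python
from math import comb, exp, factorial


def poisson_pmf(y, lam):
    """Poisson formula (1.4): probability of exactly y cases when lam are expected."""
    return exp(-lam) * lam**y / factorial(y)


def binomial_pmf(y, n, p):
    """Binomial formula (1.3), used here only for comparison."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)


# Car accident example from the text: 10 accidents expected per week
lam = 10
prob_seven = poisson_pmf(7, lam)

# Hypothetical binomial setting with many trials and a small probability,
# chosen (as an assumption) so that n * p equals lam
n, p = 1000, 0.01
prob_seven_binomial = binomial_pmf(7, n, p)

print(f"Poisson  Pr(Y = 7) = {prob_seven:.4f}")
print(f"Binomial Pr(Y = 7) = {prob_seven_binomial:.4f}")
```

The Poisson probability rounds to 0.09, as in the text, and the binomial probability under the assumed n and p is very close to it.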

1.10.3 Normal Distribution

The normal probability distribution is associated with continuous random variables and is used in various situations. One such application arises when it is desired to estimate the average of a random variable through a sample from a population, such as the average weight of newborn infants. Since continuous random variables have infinitely many possible values, they require the definition of a density function: a function with values ≥0 for all values of Y and whose area below the function curve totals 1. The density function specific to the normal distribution is as follows:

$$f(y) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(y - \mu)^{2}}{2\sigma^{2}}} \tag{1.5}$$

where

$\mu$ is the parameter that indicates the expected value of the random variable $Y$.

$\sigma^{2}$ is the parameter defining the variance of $Y$, that is, the expected value of $(Y - \mu)^{2}$.

$y$ indicates the value of the random variable.

$e$ indicates Euler's number, whose value is approximately 2.7183.

$\pi$ indicates the constant whose value is approximately 3.1416.
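To make the idea of a density function concrete, the following Python sketch evaluates the normal density (1.5) and approximates the area under the curve with a simple midpoint sum. The parameters (mean birth weight of 3300 g and standard deviation of 500 g) are hypothetical values assumed only for this illustration, and the helper `integrate` is our own basic numerical routine, not a library function.

```python
from math import exp, pi, sqrt


def normal_pdf(y, mu, sigma2):
    """Normal density (1.5) with expected value mu and variance sigma2."""
    return exp(-((y - mu) ** 2) / (2 * sigma2)) / sqrt(2 * pi * sigma2)


def integrate(f, a, b, steps=10_000):
    """Approximate the area under f between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# Hypothetical birth weight parameters in grams (assumptions for illustration)
mu = 3300.0
sigma = 500.0
sigma2 = sigma**2


def density(y):
    return normal_pdf(y, mu, sigma2)


# Area over a very wide interval: should be close to 1
total_area = integrate(density, mu - 8 * sigma, mu + 8 * sigma)

# Area within one standard deviation of the expected value
within_one_sd = integrate(density, mu - sigma, mu + sigma)

print(f"Total area under the curve  = {total_area:.4f}")
print(f"Pr(mu - sigma < Y < mu + sigma) = {within_one_sd:.4f}")
```

The total area is approximately 1, as required of any density function, and about 68% of the area lies within one standard deviation of the expected value, whatever values of μ and σ² are assumed.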

A normally distributed random variable takes values from minus infinity to plus infinity (−∞ < Y < +∞). The graphical presentation of this density function looks like a bell, that is, a symmetrical distribution such as the one presented in