Categorical Data Analysis - Alan Agresti - E-Book

Categorical Data Analysis E-Book

Alan Agresti


Description

Praise for the Second Edition

"A must-have book for anyone expecting to do research and/or applications in Categorical Data Analysis." --Statistics in Medicine

"It is a total delight reading this book." --Pharmaceutical Research

"If you do any analysis of categorical data, this is an essential desktop reference." --Technometrics

The use of statistical methods for analyzing categorical data has increased dramatically, particularly in the biomedical, social sciences, and financial industries. Responding to new developments, this book offers a comprehensive treatment of the most important methods for Categorical Data Analysis.

Categorical Data Analysis, Third Edition summarizes the latest methods for univariate and correlated multivariate categorical responses. Readers will find a unified generalized linear models approach that connects logistic regression and Poisson and negative binomial loglinear models for discrete data with normal regression for continuous data. This edition also features:

* An emphasis on logistic and probit regression methods for binary, ordinal, and nominal responses for independent observations and for clustered data with marginal models and random effects models
* Two new chapters on alternative methods for binary response data, including smoothing and regularization methods, classification methods such as linear discriminant analysis and classification trees, and cluster analysis
* New sections introducing the Bayesian approach for methods in that chapter
* More than 100 analyses of data sets and over 600 exercises
* Notes at the end of each chapter that provide references to recent research and topics not covered in the text, linked to a bibliography of more than 1,200 sources
* A supplementary website showing how to use R and SAS for all examples in the text, with information also about SPSS and Stata and with exercise solutions

Categorical Data Analysis, Third Edition is an invaluable tool for statisticians and methodologists, such as biostatisticians and researchers in the social and behavioral sciences, medicine and public health, marketing, education, finance, biological and agricultural sciences, and industrial quality control.


Page count: 1535


Contents

Cover

Half Title page

Title page

Copyright page

Dedication

Preface

Chapter 1: Introduction: Distributions and Inference for Categorical Data

1.1 Categorical Response Data

1.2 Distributions for Categorical Data

1.3 Statistical Inference for Categorical Data

1.4 Statistical Inference for Binomial Parameters

1.5 Statistical Inference for Multinomial Parameters

1.6 Bayesian Inference for Binomial and Multinomial Parameters

Notes

Exercises

Chapter 2: Describing Contingency Tables

2.1 Probability Structure for Contingency Tables

2.2 Comparing Two Proportions

2.3 Conditional Association in Stratified 2 × 2 Tables

2.4 Measuring Association in I × J Tables

Notes

Exercises

Chapter 3: Inference for Two-Way Contingency Tables

3.1 Confidence Intervals for Association Parameters

3.2 Testing Independence in Two-way Contingency Tables

3.3 Following-up Chi-Squared Tests

3.4 Two-Way Tables with Ordered Classifications

3.5 Small-Sample Inference for Contingency Tables

3.6 Bayesian Inference for Two-way Contingency Tables

3.7 Extensions for Multiway Tables and Nontabulated Responses

Notes

Exercises

Chapter 4: Introduction to Generalized Linear Models

4.1 The Generalized Linear Model

4.2 Generalized Linear Models for Binary Data

4.3 Generalized Linear Models for Counts and Rates

4.4 Moments and Likelihood for Generalized Linear Models

4.5 Inference and Model Checking for Generalized Linear Models

4.6 Fitting Generalized Linear Models

4.7 Quasi-Likelihood and Generalized Linear Models

Notes

Exercises

Chapter 5: Logistic Regression

5.1 Interpreting Parameters in Logistic Regression

5.2 Inference for Logistic Regression

5.3 Logistic Models with Categorical Predictors

5.4 Multiple Logistic Regression

5.5 Fitting Logistic Regression Models

Notes

Exercises

Chapter 6: Building, Checking, and Applying Logistic Regression Models

6.1 Strategies in Model Selection

6.2 Logistic Regression Diagnostics

6.3 Summarizing the Predictive Power of a Model

6.4 Mantel–Haenszel and Related Methods for Multiple 2 × 2 Tables

6.5 Detecting and Dealing with Infinite Estimates

6.6 Sample Size and Power Considerations

Notes

Exercises

Chapter 7: Alternative Modeling of Binary Response Data

7.1 Probit and Complementary Log–log Models

7.2 Bayesian Inference for Binary Regression

7.3 Conditional Logistic Regression

7.4 Smoothing: Kernels, Penalized Likelihood, Generalized Additive Models

7.5 Issues in Analyzing High-Dimensional Categorical Data

Notes

Exercises

Chapter 8: Models for Multinomial Responses

8.1 Nominal Responses: Baseline-Category Logit Models

8.2 Ordinal Responses: Cumulative Logit Models

8.3 Ordinal Responses: Alternative Models

8.4 Testing Conditional Independence in I × J × K Tables

8.5 Discrete-Choice Models

8.6 Bayesian Modeling of Multinomial Responses

Notes

Exercises

Chapter 9: Loglinear Models for Contingency Tables

9.1 Loglinear Models for Two-way Tables

9.2 Loglinear Models for Independence and Interaction in Three-way Tables

9.3 Inference for Loglinear Models

9.4 Loglinear Models for Higher Dimensions

9.5 Loglinear–Logistic Model Connection

9.6 Loglinear Model Fitting: Likelihood Equations and Asymptotic Distributions

9.7 Loglinear Model Fitting: Iterative Methods and Their Application

Notes

Exercises

Chapter 10: Building and Extending Loglinear Models

10.1 Conditional Independence Graphs and Collapsibility

10.2 Model Selection and Comparison

10.3 Residuals for Detecting Cell-Specific Lack of Fit

10.4 Modeling Ordinal Associations

10.5 Generalized Loglinear and Association Models, Correlation Models, and Correspondence Analysis

10.6 Empty Cells and Sparseness in Modeling Contingency Tables

10.7 Bayesian Loglinear Modeling

Notes

Exercises

Chapter 11: Models for Matched Pairs

11.1 Comparing Dependent Proportions

11.2 Conditional Logistic Regression for Binary Matched Pairs

11.3 Marginal Models for Square Contingency Tables

11.4 Symmetry, Quasi-Symmetry, and Quasi-Independence

11.5 Measuring Agreement Between Observers

11.6 Bradley–Terry Model for Paired Preferences

11.7 Marginal Models and Quasi-Symmetry Models for Matched Sets

Notes

Exercises

Chapter 12: Clustered Categorical Data: Marginal and Transitional Models

12.1 Marginal Modeling: Maximum Likelihood Approach

12.2 Marginal Modeling: Generalized Estimating Equations (GEEs) Approach

12.3 Quasi-Likelihood and Its GEE Multivariate Extension: Details

12.4 Transitional Models: Markov Chain and Time Series Models

Notes

Exercises

Chapter 13: Clustered Categorical Data: Random Effects Models

13.1 Random Effects Modeling of Clustered Categorical Data

13.2 Binary Responses: Logistic-Normal Model

13.3 Examples of Random Effects Models for Binary Data

13.4 Random Effects Models for Multinomial Data

13.5 Multilevel Modeling

13.6 GLMM Fitting, Inference, and Prediction

13.7 Bayesian Multivariate Categorical Modeling

Notes

Exercises

Chapter 14: Other Mixture Models for Discrete Data

14.1 Latent Class Models

14.2 Nonparametric Random Effects Models

14.3 Beta-Binomial Models

14.4 Negative Binomial Regression

14.5 Poisson Regression with Random Effects

Notes

Exercises

Chapter 15: Non-Model-Based Classification and Clustering

15.1 Classification: Linear Discriminant Analysis

15.2 Classification: Tree-Structured Prediction

15.3 Cluster Analysis for Categorical Data

Notes

Exercises

Chapter 16: Large- and Small-Sample Theory for Multinomial Models

16.1 Delta Method

16.2 Asymptotic Distributions of Estimators of Model Parameters and Cell Probabilities

16.3 Asymptotic Distributions of Residuals and Goodness-of-fit Statistics

16.4 Asymptotic Distributions for Logit/Loglinear Models

16.5 Small-Sample Significance Tests for Contingency Tables

16.6 Small-Sample Confidence Intervals for Categorical Data

16.7 Alternative Estimation Theory for Parametric Models

Notes

Exercises

Chapter 17: Historical Tour of Categorical Data Analysis

17.1 Pearson–Yule Association Controversy

17.2 R. A. Fisher’s Contributions

17.3 Logistic Regression

17.4 Multiway Contingency Tables and Loglinear Models

17.5 Bayesian Methods for Categorical Data

17.6 A Look Forward, and Backward

Appendix A: Statistical Software for Categorical Data Analysis

A.1 SAS

A.2 R and S-Plus

A.3 Stata

A.4 SPSS

A.5 StatXact and LogXact

A.6 Other Software

Appendix B: Chi-Squared Distribution Values

References

Author Index

Example Index

Subject Index

Categorical Data Analysis

WILEY SERIES IN PROBABILITY AND STATISTICS

ESTABLISHED BY WALTER A. SHEWHART AND SAMUEL S. WILKS

Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Harvey Goldstein, Iain M. Johnstone, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg

Editors Emeriti: Vic Barnett, J. Stuart Hunter, Joseph B. Kadane, Jozef L. Teugels

The Wiley Series in Probability and Statistics is well established and authoritative. It covers many topics of current research interest in both pure and applied statistics and probability theory. Written by leading statisticians and institutions, the titles span both state-of-the-art developments in the field and classical methods.

Reflecting the wide range of current research in statistics, the series encompasses applied, methodological and theoretical statistics, ranging from applications and new techniques made possible by advances in computerized practice to rigorous treatment of theoretical approaches.

This series provides essential and invaluable reading for all statisticians, whether in academia, industry, government, or research.

† ABRAHAM and LEDOLTER • Statistical Methods for Forecasting

AGRESTI • Analysis of Ordinal Categorical Data, Second Edition

AGRESTI • An Introduction to Categorical Data Analysis, Second Edition

AGRESTI • Categorical Data Analysis, Third Edition

ALTMAN, GILL, and McDONALD • Numerical Issues in Statistical Computing for the Social Scientist

AMARATUNGA and CABRERA • Exploration and Analysis of DNA Microarray and Protein Array Data

ANDĚL • Mathematics of Chance

ANDERSON • An Introduction to Multivariate Statistical Analysis, Third Edition

* ANDERSON • The Statistical Analysis of Time Series

ANDERSON, AUQUIER, HAUCK, OAKES, VANDAELE, and WEISBERG • Statistical Methods for Comparative Studies

ANDERSON and LOYNES • The Teaching of Practical Statistics

ARMITAGE and DAVID (editors) • Advances in Biometry

ARNOLD, BALAKRISHNAN, and NAGARAJA • Records

* ARTHANARI and DODGE • Mathematical Programming in Statistics

* BAILEY • The Elements of Stochastic Processes with Applications to the Natural Sciences

BAJORSKI • Statistics for Imaging, Optics, and Photonics

BALAKRISHNAN and KOUTRAS • Runs and Scans with Applications

BALAKRISHNAN and NG • Precedence-Type Tests and Applications

BARNETT • Comparative Statistical Inference, Third Edition

BARNETT • Environmental Statistics

BARNETT and LEWIS • Outliers in Statistical Data, Third Edition

BARTHOLOMEW, KNOTT, and MOUSTAKI • Latent Variable Models and Factor Analysis: A Unified Approach, Third Edition

BARTOSZYNSKI and NIEWIADOMSKA-BUGAJ • Probability and Statistical Inference, Second Edition

BASILEVSKY • Statistical Factor Analysis and Related Methods: Theory and Applications

BATES and WATTS • Nonlinear Regression Analysis and Its Applications

BECHHOFER, SANTNER, and GOLDSMAN • Design and Analysis of Experiments for Statistical Selection, Screening, and Multiple Comparisons

BEIRLANT, GOEGEBEUR, SEGERS, TEUGELS, and DE WAAL • Statistics of Extremes: Theory and Applications

BELSLEY • Conditioning Diagnostics: Collinearity and Weak Data in Regression

† BELSLEY, KUH, and WELSCH • Regression Diagnostics: Identifying Influential Data and Sources of Collinearity

BENDAT and PIERSOL • Random Data: Analysis and Measurement Procedures, Fourth Edition

BERNARDO and SMITH • Bayesian Theory

BERZUINI, DAWID, and BERNARDINELLI • Causality: Statistical Perspectives and Applications

BHAT and MILLER • Elements of Applied Stochastic Processes, Third Edition

BHATTACHARYA and WAYMIRE • Stochastic Processes with Applications

BIEMER, GROVES, LYBERG, MATHIOWETZ, and SUDMAN • Measurement Errors in Surveys

BILLINGSLEY • Convergence of Probability Measures, Second Edition

BILLINGSLEY • Probability and Measure, Anniversary Edition

BIRKES and DODGE • Alternative Methods of Regression

BISGAARD and KULAHCI • Time Series Analysis and Forecasting by Example

BISWAS, DATTA, FINE, and SEGAL • Statistical Advances in the Biomedical Sciences: Clinical Trials, Epidemiology, Survival Analysis, and Bioinformatics

BLISCHKE and MURTHY (editors) • Case Studies in Reliability and Maintenance

BLISCHKE and MURTHY • Reliability: Modeling, Prediction, and Optimization

BLOOMFIELD • Fourier Analysis of Time Series: An Introduction, Second Edition

BOLLEN • Structural Equations with Latent Variables

BOLLEN and CURRAN • Latent Curve Models: A Structural Equation Perspective

BOROVKOV • Ergodicity and Stability of Stochastic Processes

BOSQ and BLANKE • Inference and Prediction in Large Dimensions

BOULEAU • Numerical Methods for Stochastic Processes

* BOX and TIAO • Bayesian Inference in Statistical Analysis

BOX • Improving Almost Anything, Revised Edition

* BOX and DRAPER • Evolutionary Operation: A Statistical Method for Process Improvement

BOX and DRAPER • Response Surfaces, Mixtures, and Ridge Analyses, Second Edition

BOX, HUNTER, and HUNTER • Statistics for Experimenters: Design, Innovation, and Discovery, Second Edition

BOX, JENKINS, and REINSEL • Time Series Analysis: Forecasting and Control, Fourth Edition

BOX, LUCEÑO, and PANIAGUA-QUIÑONES • Statistical Control by Monitoring and Adjustment, Second Edition

* BROWN and HOLLANDER • Statistics: A Biomedical Introduction

CAIROLI and DALANG • Sequential Stochastic Optimization

CASTILLO, HADI, BALAKRISHNAN, and SARABIA • Extreme Value and Related Models with Applications in Engineering and Science

CHAN • Time Series: Applications to Finance with R and S-Plus, Second Edition

CHARALAMBIDES • Combinatorial Methods in Discrete Distributions

CHATTERJEE and HADI • Regression Analysis by Example, Fifth Edition

CHATTERJEE and HADI • Sensitivity Analysis in Linear Regression

CHERNICK • Bootstrap Methods: A Guide for Practitioners and Researchers, Second Edition

CHERNICK and FRIIS • Introductory Biostatistics for the Health Sciences

CHILÈS and DELFINER • Geostatistics: Modeling Spatial Uncertainty, Second Edition

CHOW and LIU • Design and Analysis of Clinical Trials: Concepts and Methodologies, Second Edition

CLARKE • Linear Models: The Theory and Application of Analysis of Variance

CLARKE and DISNEY • Probability and Random Processes: A First Course with Applications, Second Edition

* COCHRAN and COX • Experimental Designs, Second Edition

COLLINS and LANZA • Latent Class and Latent Transition Analysis: With Applications in the Social, Behavioral, and Health Sciences

CONGDON • Applied Bayesian Modelling

CONGDON • Bayesian Models for Categorical Data

CONGDON • Bayesian Statistical Modelling, Second Edition

CONOVER • Practical Nonparametric Statistics, Third Edition

COOK • Regression Graphics

COOK and WEISBERG • An Introduction to Regression Graphics

COOK and WEISBERG • Applied Regression Including Computing and Graphics

CORNELL • A Primer on Experiments with Mixtures

CORNELL • Experiments with Mixtures, Designs, Models, and the Analysis of Mixture Data, Third Edition

COX • A Handbook of Introductory Statistical Methods

CRESSIE • Statistics for Spatial Data, Revised Edition

CRESSIE and WIKLE • Statistics for Spatio-Temporal Data

CSÖRGŐ and HORVÁTH • Limit Theorems in Change Point Analysis

DAGPUNAR • Simulation and Monte Carlo: With Applications in Finance and MCMC

DANIEL • Applications of Statistics to Industrial Experimentation

DANIEL • Biostatistics: A Foundation for Analysis in the Health Sciences, Eighth Edition

* DANIEL • Fitting Equations to Data: Computer Analysis of Multifactor Data, Second Edition

DASU and JOHNSON • Exploratory Data Mining and Data Cleaning

DAVID and NAGARAJA • Order Statistics, Third Edition

* DEGROOT, FIENBERG, and KADANE • Statistics and the Law

DEL CASTILLO • Statistical Process Adjustment for Quality Control

DEMARIS • Regression with Social Data: Modeling Continuous and Limited Response Variables

DEMIDENKO • Mixed Models: Theory and Applications

DENISON, HOLMES, MALLICK and SMITH • Bayesian Methods for Nonlinear Classification and Regression

DETTE and STUDDEN • The Theory of Canonical Moments with Applications in Statistics, Probability, and Analysis

DEY and MUKERJEE • Fractional Factorial Plans

DILLON and GOLDSTEIN • Multivariate Analysis: Methods and Applications

* DODGE and ROMIG • Sampling Inspection Tables, Second Edition

* DOOB • Stochastic Processes

DOWDY, WEARDEN, and CHILKO • Statistics for Research, Third Edition

DRAPER and SMITH • Applied Regression Analysis, Third Edition

DRYDEN and MARDIA • Statistical Shape Analysis

DUDEWICZ and MISHRA • Modern Mathematical Statistics

DUNN and CLARK • Basic Statistics: A Primer for the Biomedical Sciences, Fourth Edition

DUPUIS and ELLIS • A Weak Convergence Approach to the Theory of Large Deviations

EDLER and KITSOS • Recent Advances in Quantitative Methods in Cancer and Human Health Risk Assessment

* ELANDT-JOHNSON and JOHNSON • Survival Models and Data Analysis

ENDERS • Applied Econometric Time Series, Third Edition

† ETHIER and KURTZ • Markov Processes: Characterization and Convergence

EVANS, HASTINGS, and PEACOCK • Statistical Distributions, Third Edition

EVERITT, LANDAU, LEESE, and STAHL • Cluster Analysis, Fifth Edition

FEDERER and KING • Variations on Split Plot and Split Block Experiment Designs

FELLER • An Introduction to Probability Theory and Its Applications, Volume I, Third Edition, Revised; Volume II, Second Edition

FITZMAURICE, LAIRD, and WARE • Applied Longitudinal Analysis, Second Edition

* FLEISS • The Design and Analysis of Clinical Experiments

FLEISS • Statistical Methods for Rates and Proportions, Third Edition

† FLEMING and HARRINGTON • Counting Processes and Survival Analysis

FUJIKOSHI, ULYANOV, and SHIMIZU • Multivariate Statistics: High-Dimensional and Large-Sample Approximations

FULLER • Introduction to Statistical Time Series, Second Edition

† FULLER • Measurement Error Models

GALLANT • Nonlinear Statistical Models

GEISSER • Modes of Parametric Statistical Inference

GELMAN and MENG • Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives

GEWEKE • Contemporary Bayesian Econometrics and Statistics

GHOSH, MUKHOPADHYAY, and SEN • Sequential Estimation

GIESBRECHT and GUMPERTZ • Planning, Construction, and Statistical Analysis of Comparative Experiments

GIFI • Nonlinear Multivariate Analysis

GIVENS and HOETING • Computational Statistics

GLASSERMAN and YAO • Monotone Structure in Discrete-Event Systems

GNANADESIKAN • Methods for Statistical Data Analysis of Multivariate Observations, Second Edition

GOLDSTEIN • Multilevel Statistical Models, Fourth Edition

GOLDSTEIN and LEWIS • Assessment: Problems, Development, and Statistical Issues

GOLDSTEIN and WOOFF • Bayes Linear Statistics

GREENWOOD and NIKULIN • A Guide to Chi-Squared Testing

GROSS, SHORTLE, THOMPSON, and HARRIS • Fundamentals of Queueing Theory, Fourth Edition

GROSS, SHORTLE, THOMPSON, and HARRIS • Solutions Manual to Accompany Fundamentals of Queueing Theory, Fourth Edition

* HAHN and SHAPIRO • Statistical Models in Engineering

HAHN and MEEKER • Statistical Intervals: A Guide for Practitioners

HALD • A History of Probability and Statistics and their Applications Before 1750

† HAMPEL • Robust Statistics: The Approach Based on Influence Functions

HARTUNG, KNAPP, and SINHA • Statistical Meta-Analysis with Applications

HEIBERGER • Computation for the Analysis of Designed Experiments

HEDAYAT and SINHA • Design and Inference in Finite Population Sampling

HEDEKER and GIBBONS • Longitudinal Data Analysis

HELLER • MACSYMA for Statisticians

HERITIER, CANTONI, COPT, and VICTORIA-FESER • Robust Methods in Biostatistics

HINKELMANN and KEMPTHORNE • Design and Analysis of Experiments, Volume 1: Introduction to Experimental Design, Second Edition

HINKELMANN and KEMPTHORNE • Design and Analysis of Experiments, Volume 2: Advanced Experimental Design

HINKELMANN (editor) • Design and Analysis of Experiments, Volume 3: Special Designs and Applications

HOAGLIN, MOSTELLER, and TUKEY • Fundamentals of Exploratory Analysis of Variance

* HOAGLIN, MOSTELLER, and TUKEY • Exploring Data Tables, Trends and Shapes

* HOAGLIN, MOSTELLER, and TUKEY • Understanding Robust and Exploratory Data Analysis

HOCHBERG and TAMHANE • Multiple Comparison Procedures

HOCKING • Methods and Applications of Linear Models: Regression and the Analysis of Variance, Second Edition

HOEL • Introduction to Mathematical Statistics, Fifth Edition

HOGG and KLUGMAN • Loss Distributions

HOLLANDER and WOLFE • Nonparametric Statistical Methods, Second Edition

HOSMER and LEMESHOW • Applied Logistic Regression, Second Edition

HOSMER, LEMESHOW, and MAY • Applied Survival Analysis: Regression Modeling of Time-to-Event Data, Second Edition

HUBER • Data Analysis: What Can Be Learned From the Past 50 Years

HUBER • Robust Statistics

† HUBER and RONCHETTI • Robust Statistics, Second Edition

HUBERTY • Applied Discriminant Analysis, Second Edition

HUBERTY and OLEJNIK • Applied MANOVA and Discriminant Analysis, Second Edition

HUITEMA • The Analysis of Covariance and Alternatives: Statistical Methods for Experiments, Quasi-Experiments, and Single-Case Studies, Second Edition

HUNT and KENNEDY • Financial Derivatives in Theory and Practice, Revised Edition

HURD and MIAMEE • Periodically Correlated Random Sequences: Spectral Theory and Practice

HUSKOVA, BERAN, and DUPAC • Collected Works of Jaroslav Hajek—with Commentary

HUZURBAZAR • Flowgraph Models for Multistate Time-to-Event Data

JACKMAN • Bayesian Analysis for the Social Sciences

† JACKSON • A User’s Guide to Principal Components

JOHN • Statistical Methods in Engineering and Quality Assurance

JOHNSON • Multivariate Statistical Simulation

JOHNSON and BALAKRISHNAN • Advances in the Theory and Practice of Statistics: A Volume in Honor of Samuel Kotz

JOHNSON, KEMP, and KOTZ • Univariate Discrete Distributions, Third Edition

JOHNSON and KOTZ (editors) • Leading Personalities in Statistical Sciences: From the Seventeenth Century to the Present

JOHNSON, KOTZ, and BALAKRISHNAN • Continuous Univariate Distributions, Volume 1, Second Edition

JOHNSON, KOTZ, and BALAKRISHNAN • Continuous Univariate Distributions, Volume 2, Second Edition

JOHNSON, KOTZ, and BALAKRISHNAN • Discrete Multivariate Distributions

JUDGE, GRIFFITHS, HILL, LÜTKEPOHL, and LEE • The Theory and Practice of Econometrics, Second Edition

JUREK and MASON • Operator-Limit Distributions in Probability Theory

KADANE • Bayesian Methods and Ethics in a Clinical Trial Design

KADANE and SCHUM • A Probabilistic Analysis of the Sacco and Vanzetti Evidence

KALBFLEISCH and PRENTICE • The Statistical Analysis of Failure Time Data, Second Edition

KARIYA and KURATA • Generalized Least Squares

KASS and VOS • Geometrical Foundations of Asymptotic Inference

† KAUFMAN and ROUSSEEUW • Finding Groups in Data: An Introduction to Cluster Analysis

KEDEM and FOKIANOS • Regression Models for Time Series Analysis

KENDALL, BARDEN, CARNE, and LE • Shape and Shape Theory

KHURI • Advanced Calculus with Applications in Statistics, Second Edition

KHURI, MATHEW, and SINHA • Statistical Tests for Mixed Linear Models

* KISH • Statistical Design for Research

KLEIBER and KOTZ • Statistical Size Distributions in Economics and Actuarial Sciences

KLEMELÄ • Smoothing of Multivariate Data: Density Estimation and Visualization

KLUGMAN, PANJER, and WILLMOT • Loss Models: From Data to Decisions, Fourth Edition

KLUGMAN, PANJER, and WILLMOT • Student Solutions Manual to Accompany Loss Models: From Data to Decisions, Fourth Edition

KOSKI and NOBLE • Bayesian Networks: An Introduction

KOTZ, BALAKRISHNAN, and JOHNSON • Continuous Multivariate Distributions, Volume 1, Second Edition

KOTZ and JOHNSON (editors) • Encyclopedia of Statistical Sciences: Volumes 1 to 9 with Index

KOTZ and JOHNSON (editors) • Encyclopedia of Statistical Sciences: Supplement Volume

KOTZ, READ, and BANKS (editors) • Encyclopedia of Statistical Sciences: Update Volume 1

KOTZ, READ, and BANKS (editors) • Encyclopedia of Statistical Sciences: Update Volume 2

KOWALSKI and TU • Modern Applied U-Statistics

KRISHNAMOORTHY and MATHEW • Statistical Tolerance Regions: Theory, Applications, and Computation

KROESE, TAIMRE, and BOTEV • Handbook of Monte Carlo Methods

KROONENBERG • Applied Multiway Data Analysis

KULINSKAYA, MORGENTHALER, and STAUDTE • Meta Analysis: A Guide to Calibrating and Combining Statistical Evidence

KULKARNI and HARMAN • An Elementary Introduction to Statistical Learning Theory

KUROWICKA and COOKE • Uncertainty Analysis with High Dimensional Dependence Modelling

KVAM and VIDAKOVIC • Nonparametric Statistics with Applications to Science and Engineering

LACHIN • Biostatistical Methods: The Assessment of Relative Risks, Second Edition

LAD • Operational Subjective Statistical Methods: A Mathematical, Philosophical, and Historical Introduction

LAMPERTI • Probability: A Survey of the Mathematical Theory, Second Edition

LAWLESS • Statistical Models and Methods for Lifetime Data, Second Edition

LAWSON • Statistical Methods in Spatial Epidemiology, Second Edition

LE • Applied Categorical Data Analysis, Second Edition

LE • Applied Survival Analysis

LEE • Structural Equation Modeling: A Bayesian Approach

LEE and WANG • Statistical Methods for Survival Data Analysis, Third Edition

LEPAGE and BILLARD • Exploring the Limits of Bootstrap

LESSLER and KALSBEEK • Nonsampling Errors in Surveys

LEYLAND and GOLDSTEIN (editors) • Multilevel Modelling of Health Statistics

LIAO • Statistical Group Comparison

LIN • Introductory Stochastic Analysis for Finance and Insurance

LITTLE and RUBIN • Statistical Analysis with Missing Data, Second Edition

LLOYD • The Statistical Analysis of Categorical Data

LOWEN and TEICH • Fractal-Based Point Processes

MAGNUS and NEUDECKER • Matrix Differential Calculus with Applications in Statistics and Econometrics, Revised Edition

MALLER and ZHOU • Survival Analysis with Long Term Survivors

MARCHETTE • Random Graphs for Statistical Pattern Recognition

MARDIA and JUPP • Directional Statistics

MARKOVICH • Nonparametric Analysis of Univariate Heavy-Tailed Data: Research and Practice

MARONNA, MARTIN and YOHAI • Robust Statistics: Theory and Methods

MASON, GUNST, and HESS • Statistical Design and Analysis of Experiments with Applications to Engineering and Science, Second Edition

McCOOL • Using the Weibull Distribution: Reliability, Modeling, and Inference

McCULLOCH, SEARLE, and NEUHAUS • Generalized, Linear, and Mixed Models, Second Edition

McFADDEN • Management of Data in Clinical Trials, Second Edition

* McLACHLAN • Discriminant Analysis and Statistical Pattern Recognition

McLACHLAN, DO, and AMBROISE • Analyzing Microarray Gene Expression Data

McLACHLAN and KRISHNAN • The EM Algorithm and Extensions, Second Edition

McLACHLAN and PEEL • Finite Mixture Models

McNEIL • Epidemiological Research Methods

MEEKER and ESCOBAR • Statistical Methods for Reliability Data

MEERSCHAERT and SCHEFFLER • Limit Distributions for Sums of Independent Random Vectors: Heavy Tails in Theory and Practice

MENGERSEN, ROBERT, and TITTERINGTON • Mixtures: Estimation and Applications

MICKEY, DUNN, and CLARK • Applied Statistics: Analysis of Variance and Regression, Third Edition

* MILLER • Survival Analysis, Second Edition

MONTGOMERY, JENNINGS, and KULAHCI • Introduction to Time Series Analysis and Forecasting

MONTGOMERY, PECK, and VINING • Introduction to Linear Regression Analysis, Fifth Edition

MORGENTHALER and TUKEY • Configural Polysampling: A Route to Practical Robustness

MUIRHEAD • Aspects of Multivariate Statistical Theory

MULLER and STOYAN • Comparison Methods for Stochastic Models and Risks

MURTHY, XIE, and JIANG • Weibull Models

MYERS, MONTGOMERY, and ANDERSON-COOK • Response Surface Methodology: Process and Product Optimization Using Designed Experiments, Third Edition

MYERS, MONTGOMERY, VINING, and ROBINSON • Generalized Linear Models. With Applications in Engineering and the Sciences, Second Edition

NATVIG • Multistate Systems Reliability Theory With Applications

† NELSON • Accelerated Testing, Statistical Models, Test Plans, and Data Analyses

† NELSON • Applied Life Data Analysis

NEWMAN • Biostatistical Methods in Epidemiology

NG, TIAN, and TANG • Dirichlet Theory: Theory, Methods and Applications

OKABE, BOOTS, SUGIHARA, and CHIU • Spatial Tessellations: Concepts and Applications of Voronoi Diagrams, Second Edition

OLIVER and SMITH • Influence Diagrams, Belief Nets and Decision Analysis

PALTA • Quantitative Methods in Population Health: Extensions of Ordinary Regressions

PANJER • Operational Risk: Modeling and Analytics

PANKRATZ • Forecasting with Dynamic Regression Models

PANKRATZ • Forecasting with Univariate Box-Jenkins Models: Concepts and Cases

PARDOUX • Markov Processes and Applications: Algorithms, Networks, Genome and Finance

PARMIGIANI and INOUE • Decision Theory: Principles and Approaches

* PARZEN • Modern Probability Theory and Its Applications

PEÑA, TIAO, and TSAY • A Course in Time Series Analysis

PESARIN and SALMASO • Permutation Tests for Complex Data: Applications and Software

PIANTADOSI • Clinical Trials: A Methodologic Perspective, Second Edition

POURAHMADI • Foundations of Time Series Analysis and Prediction Theory

POWELL • Approximate Dynamic Programming: Solving the Curses of Dimensionality, Second Edition

POWELL and RYZHOV • Optimal Learning

PRESS • Subjective and Objective Bayesian Statistics, Second Edition

PRESS and TANUR • The Subjectivity of Scientists and the Bayesian Approach

PURI, VILAPLANA, and WERTZ • New Perspectives in Theoretical and Applied Statistics

† PUTERMAN • Markov Decision Processes: Discrete Stochastic Dynamic Programming

QIU • Image Processing and Jump Regression Analysis

* RAO • Linear Statistical Inference and Its Applications, Second Edition

RAO • Statistical Inference for Fractional Diffusion Processes

RAUSAND and HØYLAND • System Reliability Theory: Models, Statistical Methods, and Applications, Second Edition

RAYNER, THAS, and BEST • Smooth Tests of Goodness of Fit: Using R, Second Edition

RENCHER and SCHAALJE • Linear Models in Statistics, Second Edition

RENCHER and CHRISTENSEN • Methods of Multivariate Analysis, Third Edition

RENCHER • Multivariate Statistical Inference with Applications

RIGDON and BASU • Statistical Methods for the Reliability of Repairable Systems

* RIPLEY • Spatial Statistics

* RIPLEY • Stochastic Simulation

ROHATGI and SALEH • An Introduction to Probability and Statistics, Second Edition

ROLSKI, SCHMIDLI, SCHMIDT, and TEUGELS • Stochastic Processes for Insurance and Finance

ROSENBERGER and LACHIN • Randomization in Clinical Trials: Theory and Practice

ROSSI, ALLENBY, and McCULLOCH • Bayesian Statistics and Marketing

† ROUSSEEUW and LEROY • Robust Regression and Outlier Detection

ROYSTON and SAUERBREI • Multivariate Model Building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modeling Continuous Variables

* RUBIN • Multiple Imputation for Nonresponse in Surveys

RUBINSTEIN and KROESE • Simulation and the Monte Carlo Method, Second Edition

RUBINSTEIN and MELAMED • Modern Simulation and Modeling

RYAN • Modern Engineering Statistics

RYAN • Modern Experimental Design

RYAN • Modern Regression Methods, Second Edition

RYAN • Statistical Methods for Quality Improvement, Third Edition

SALEH • Theory of Preliminary Test and Stein-Type Estimation with Applications

SALTELLI, CHAN, and SCOTT (editors) • Sensitivity Analysis

SCHERER • Batch Effects and Noise in Microarray Experiments: Sources and Solutions

* SCHEFFE • The Analysis of Variance

SCHIMEK • Smoothing and Regression: Approaches, Computation, and Application

SCHOTT • Matrix Analysis for Statistics, Second Edition

SCHOUTENS • Levy Processes in Finance: Pricing Financial Derivatives

SCOTT • Multivariate Density Estimation: Theory, Practice, and Visualization

* SEARLE • Linear Models

† SEARLE • Linear Models for Unbalanced Data

† SEARLE • Matrix Algebra Useful for Statistics

† SEARLE, CASELLA, and McCULLOCH • Variance Components

SEARLE and WILLETT • Matrix Algebra for Applied Economics

SEBER • A Matrix Handbook For Statisticians

† SEBER • Multivariate Observations

SEBER and LEE • Linear Regression Analysis, Second Edition

† SEBER and WILD • Nonlinear Regression

SENNOTT • Stochastic Dynamic Programming and the Control of Queueing Systems

* SERFLING • Approximation Theorems of Mathematical Statistics

SHAFER and VOVK • Probability and Finance: It’s Only a Game!

SHERMAN • Spatial Statistics and Spatio-Temporal Data: Covariance Functions and Directional Properties

SILVAPULLE and SEN • Constrained Statistical Inference: Inequality, Order, and Shape Restrictions

SINGPURWALLA • Reliability and Risk: A Bayesian Perspective

SMALL and MCLEISH • Hilbert Space Methods in Probability and Statistical Inference

SRIVASTAVA • Methods of Multivariate Statistics

STAPLETON • Linear Statistical Models, Second Edition

STAPLETON • Models for Probability and Statistical Inference: Theory and Applications

STAUDTE and SHEATHER • Robust Estimation and Testing

STOYAN • Counterexamples in Probability, Second Edition

STOYAN, KENDALL, and MECKE • Stochastic Geometry and Its Applications, Second Edition

STOYAN and STOYAN • Fractals, Random Shapes and Point Fields: Methods of Geometrical Statistics

STREET and BURGESS • The Construction of Optimal Stated Choice Experiments: Theory and Methods

STYAN • The Collected Papers of T. W. Anderson: 1943–1985

SUTTON, ABRAMS, JONES, SHELDON, and SONG • Methods for Meta-Analysis in Medical Research

TAKEZAWA • Introduction to Nonparametric Regression

TAMHANE • Statistical Analysis of Designed Experiments: Theory and Applications

TANAKA • Time Series Analysis: Nonstationary and Noninvertible Distribution Theory

THOMPSON • Empirical Model Building: Data, Models, and Reality, Second Edition

THOMPSON • Sampling, Third Edition

THOMPSON • Simulation: A Modeler’s Approach

THOMPSON and SEBER • Adaptive Sampling

THOMPSON, WILLIAMS, and FINDLAY • Models for Investors in Real World Markets

TIERNEY • LISP-STAT: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics

TSAY • Analysis of Financial Time Series, Third Edition

TSAY • An Introduction to Analysis of Financial Data with R

UPTON and FINGLETON • Spatial Data Analysis by Example, Volume II: Categorical and Directional Data

† VAN BELLE • Statistical Rules of Thumb, Second Edition

VAN BELLE, FISHER, HEAGERTY, and LUMLEY • Biostatistics: A Methodology for the Health Sciences, Second Edition

VESTRUP • The Theory of Measures and Integration

VIDAKOVIC • Statistical Modeling by Wavelets

VIERTL • Statistical Methods for Fuzzy Data

VINOD and REAGLE • Preparing for the Worst: Incorporating Downside Risk in Stock Market Investments

WALLER and GOTWAY • Applied Spatial Statistics for Public Health Data

WEISBERG • Applied Linear Regression, Third Edition

WEISBERG • Bias and Causation: Models and Judgment for Valid Comparisons

WELSH • Aspects of Statistical Inference

WESTFALL and YOUNG • Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment

* WHITTAKER • Graphical Models in Applied Multivariate Statistics

WINKER • Optimization Heuristics in Economics: Applications of Threshold Accepting

WOODWORTH • Biostatistics: A Bayesian Introduction

WOOLSON and CLARKE • Statistical Methods for the Analysis of Biomedical Data, Second Edition

WU and HAMADA • Experiments: Planning, Analysis, and Parameter Design Optimization, Second Edition

WU and ZHANG • Nonparametric Regression Methods for Longitudinal Data Analysis

YIN • Clinical Trial Design: Bayesian and Frequentist Adaptive Methods

YOUNG, VALERO-MORA, and FRIENDLY • Visual Statistics: Seeing Data with Dynamic Interactive Graphics

ZACKS • Stage-Wise Adaptive Designs

* ZELLNER • An Introduction to Bayesian Inference in Econometrics

ZELTERMAN • Discrete Distributions—Applications in the Health Sciences

ZHOU, OBUCHOWSKI, and MCCLISH • Statistical Methods in Diagnostic Medicine, Second Edition

*Now available in a lower priced paperback edition in the Wiley Classics Library.

†Now available in a lower priced paperback edition in the Wiley–Interscience Paperback Series.

Copyright 2013 by John Wiley & Sons. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax 978-750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, 201-748-6011, fax 201-748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at 800-762-2974, outside the United States at 317-572-3993 or fax 317-572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data

Agresti, Alan. Categorical data analysis / Alan Agresti. – 3rd ed. p. cm. – (Wiley series in probability and statistics; 792) Includes bibliographical references and index. ISBN 978-0-470-46363-5 (hardback) 1. Multivariate analysis. I. Title. QA278.A353 2013 519.5′35–dc23 2012009792

To Jacki

Preface

The explosion in the development of methods for analyzing categorical data that began in the 1960s has continued apace in recent years. This book provides an overview of these methods, as well as older, now standard, methods. It gives special emphasis to generalized linear modeling techniques, which extend linear model methods for continuous variables, and their extensions for multivariate responses.

OUTLINE OF TOPICS

Chapters 1–10 present the core methods for categorical response variables. Chapters 1–3 cover distributions for categorical responses and traditional methods for two-way contingency tables. Chapters 4–8 introduce logistic regression and related models such as the probit model for binary and multicategory response variables. Chapters 9 and 10 cover loglinear models for contingency tables.

In the past quarter century, a major area of new research has been the development of methods for repeated measurement and other forms of clustered categorical data. Chapters 11–14 present these methods, including marginal models and generalized linear mixed models with random effects. Chapter 15 introduces non-model-based methods for classification and clustering. Chapter 16 presents theoretical foundations as well as alternatives to the maximum likelihood paradigm that this text adopts. Chapter 17 is devoted to a historical overview of the development of the methods. It examines contributions of noted statisticians, such as Pearson and Fisher, whose pioneering efforts—and sometimes vocal debates—broke the ground for this evolution.

Appendices illustrate the use of statistical software for analyzing categorical data. The website for the text, www.stat.ufl.edu/~aa/cda/cda.html, contains an appendix with detailed examples of the use of software (especially R, SAS, and Stata) for performing the analyses in this book, solutions to many of the exercises, extra exercises, and corrections.

CHANGES IN THIS EDITION

Given the explosion of research in the past 50 years on categorical data methods, it is an increasing challenge to write a comprehensive book covering all the commonly used methods. The second edition of this book already exceeded 700 pages. In including much new material without letting the book grow much, I have necessarily had to make compromises in depth and use relatively simple examples. I try to present a broad overview, while presenting bibliographic notes with many references in which the reader can find more details. In attempting to make the book relatively comprehensive while presenting substantive new material, every chapter of the first two editions has been extensively rewritten. The major changes are:

A new Chapter 7 presents alternative methods for binary response data, including some regularization methods that are becoming popular in this age of massive data sets with enormous numbers of variables.

A new Chapter 15 introduces non-model-based methods of classification, such as linear discriminant analysis and classification trees, and cluster analysis.

Many chapters now include a section describing the Bayesian approach for the methods of that chapter. We also have added material (e.g., Sections 6.5 and 7.4) about ways that frequentist methods can deal with awkward situations such as infinite maximum likelihood estimates.

The use of various software for categorical data methods is discussed at a much expanded website for the text, www.stat.ufl.edu/~aa/cda/cda.html. Examples are shown of the use of R, SAS, and Stata for most of the examples in the text, and there is discussion also about SPSS, StatXact, and other software. That website also contains many of the text’s data sets, some of which have only excerpts shown in the text itself, as well as solutions for many exercises and corrections of errors found in early printings of the book. I recommend that you refer to this appendix (or specialized software manuals) while reading the text, perhaps printing the pages about the software you prefer, as an aid to implementing the methods. This material was placed at the website partly because the text is already so long without it and also because it is then easier to keep the presentation up-to-date.

In this text, I interpret categorical data analysis to refer to methods for categorical response variables. For most methods, explanatory variables can be categorical or quantitative, as in ordinary regression. Thus, the focus is intended to be more general than contingency table analysis, although for simplicity of data presentation, most examples use contingency tables. These examples are simplistic, but should help you focus on understanding the methods themselves and make it easier for you to replicate results with your favorite software.

Other special features of the text include:

More than 100 analyses of data sets.

About 600 exercises, some directed toward theory and methods and some toward applications and data analysis.

Notes at the end of each chapter that provide references for recent research and many topics not covered in the text, linked to a bibliography of more than 1200 sources.

INTENDED AUDIENCE AND USE AS A TEXTBOOK

I intend this book to be accessible to the diverse mix of students who take graduate-level courses in categorical data analysis. But I have also written it with practicing statisticians and biostatisticians in mind. I hope it enables them to catch up with recent advances and learn about methods that sometimes receive inadequate attention in the traditional statistics curriculum.

The development of new methods has influenced—and been influenced by—the increasing availability of data sets with categorical responses in the social, behavioral, and biomedical sciences, as well as in public health, genetics, ecology, education, marketing and the financial industry, and industrial quality control. And so, although this book is directed mainly to statisticians and biostatisticians, I also aim for it to be helpful to methodologists in these fields.

Readers should possess a background that includes regression and analysis of variance models, as well as maximum likelihood methods of statistical theory. Those not having much theory background should be able to follow most methodological discussions. Those with mainly applied interests can skip most of Chapter 4 on the theory of generalized linear models and proceed to other chapters. However, the book has a distinctly higher technical level and is more thorough and complete than my lower-level text, An Introduction to Categorical Data Analysis, Second Edition (Wiley, 2007).

Today, because of the ubiquity of categorical data in applications, most statistics and biostatistics departments offer courses on categorical data analysis or on generalized linear models with strong emphasis on methods for discrete data. This book can be used as a text for such courses. The material in Chapters 1–6 forms the heart of most courses. There is too much material in this book for a single course, but a one-term course can be based on the following outline:

Basic contingency table analysis, covering Chapters 1–3, perhaps skipping some tangential sections such as 1.5.7, 1.6, 2.4, 3.4–3.7.

Logistic regression and related methods for binary data, covering Chapters 4–6, perhaps skipping some tangential sections such as 4.4–4.7 and 6.4–6.6.

Multinomial response models, covering at least Sections 8.1 and 8.2.

Matched pairs and clustered data, covering at least Sections 11.1–11.2.

Courses with biostatistical orientation may want to include bits from Chapters 12 and 13 on marginal and random effects models. Courses with social science emphasis may want to include some topics on loglinear modeling from Chapters 9 and 10. Some courses may want to select specialized topics from Chapter 7, such as probit modeling, conditional logistic regression, Bayesian binary data modeling, smoothing, and issues in the analysis of high-dimensional data.

ACKNOWLEDGMENTS

I thank those who commented on parts of the manuscript or provided help of some type. Special thanks to Anna Gottard, David Hoaglin, Maria Kateri, Bernhard Klingenberg, Keli Liu, and Euijung Ryu, who gave insightful comments on some chapters and made many helpful suggestions, and Brett Presnell for his advice and resources about R software and his comments about some of the material. Thanks to people who made suggestions about new material for this edition, including Jonathan Bischof, James Booth, Brian Caffo, Tianxi Cai, Brent Coull, Nicholas Cox, Ralitza Gueorguieva, Debashis Ghosh, John Henretta, David Hitchcock, Galin Jones, Robert Kushler, Xihong Lin, Jun Liu, Gianfranco Lovison, Giovanni Marchetti, David Olive, Art Owen, Alessandra Petrucci, Michael Radelet, Gerard Scallan, Maura Stokes, Anestis Touloumis, and Ming Yang. Thanks to those who commented on aspects of the second edition, including pointing out errors or typos, such as Pat Altham, Roberto Bertolusso, Nicholas Cox, David Firth, Rene Gonin, David Hoaglin, Harry Khamis, Bernhard Klingenberg, Robert Kushler, Gianfranco Lovison, Theo Nijsse, Richard Reyment, Misha Salganik, William Santo, Laura Thompson, Michael Vock, and Zhongming Yang. Thanks also to Laura Thompson for preparing her very helpful manual on using R and S-Plus for examples in the second edition. Thanks to the many who reviewed material or suggested examples for the first two editions, mentioned in the Prefaces of those editions. Thanks also to Wiley Executive Editor Steve Quigley and Associate Editor Jacqueline Palmieri for their steadfast encouragement and facilitation of this project. Finally, thanks to my wife Jacki Levine for continuing support of all kinds.

ALAN AGRESTI

Gainesville, Florida and Brookline, Massachusetts

February 2012

CHAPTER 1

Introduction: Distributions and Inference for Categorical Data

From helping to assess the value of new medical treatments to evaluating the factors that affect our opinions and behaviors, analysts today are finding myriad uses for categorical data methods. In this book we introduce these methods and the theory behind them.

Statistical methods for categorical responses were late in gaining the level of sophistication achieved early in the twentieth century by methods for continuous responses. Despite influential work around 1900 by the British statistician Karl Pearson, relatively little development of models for categorical responses occurred until the 1960s. In this book we describe the early fundamental work that still has importance today but place primary emphasis on more recent modeling approaches.

1.1 CATEGORICAL RESPONSE DATA

A categorical variable has a measurement scale consisting of a set of categories. For instance, political philosophy is often measured as liberal, moderate, or conservative. Diagnoses regarding breast cancer based on a mammogram use the categories normal, benign, probably benign, suspicious, and malignant.

The development of methods for categorical variables was stimulated by the need to analyze data generated in research studies in both the social and biomedical sciences. Categorical scales are pervasive in the social sciences for measuring attitudes and opinions. Categorical scales in biomedical sciences measure outcomes such as whether a medical treatment is successful.

Categorical data are by no means restricted to the social and biomedical sciences. They frequently occur in the behavioral sciences (e.g., type of mental illness, with the categories schizophrenia, depression, neurosis), epidemiology and public health (e.g., contraceptive method at last sexual intercourse, with the categories none, condom, pill, IUD, other), genetics (type of allele inherited by an offspring), botany and zoology (e.g., whether or not a particular organism is observed in a sampled quadrat), education (e.g., whether a student response to an exam question is correct or incorrect), and marketing (e.g., consumer preference among the three leading brands of a product). They even occur in highly quantitative fields such as engineering sciences and industrial quality control. Examples are the classification of items according to whether they conform to certain standards, and subjective evaluation of some characteristic: how soft to the touch a certain fabric is, how good a particular food product tastes, or how easy a worker finds it to perform a certain task.

Categorical variables are of many types. In this section we provide ways of classifying them.

1.1.1 Response–Explanatory Variable Distinction

Statistical analyses distinguish between response (or dependent) variables and explanatory (or independent) variables. This book focuses on methods for categorical response variables. As in ordinary regression modeling, explanatory variables can be any type. For instance, a study might analyze how opinion about whether same-sex marriages should be legal (yes or no) changes according to values of explanatory variables, such as religious affiliation, political ideology, number of years of education, annual income, age, gender, and race.

1.1.2 Binary–Nominal–Ordinal Scale Distinction

Many categorical variables have only two categories. Such variables, for which the two categories are often given the generic labels “success” and “failure,” are called binary variables. A major topic of this book is the modeling of binary response variables.

When a categorical variable has more than two categories, we distinguish between two types of categorical scales. Variables having categories without a natural ordering are said to be measured on a nominal scale and are called nominal variables. Examples are mode of transportation to get to work (automobile, bicycle, bus, subway, walk), favorite type of music (classical, country, folk, jazz, rock), and choice of residence (apartment, condominium, house, other). For nominal variables, the order of listing the categories is irrelevant to the statistical analysis.

Many categorical variables do have ordered categories. Such variables are said to be measured on an ordinal scale and are called ordinal variables. Examples are social class (upper, middle, lower), political philosophy (very liberal, slightly liberal, moderate, slightly conservative, very conservative), patient condition (good, fair, serious, critical), and rating of a movie for Netflix (1 to 5 stars, representing hated it, didn’t like it, liked it, really liked it, loved it). For ordinal variables, distances between categories are unknown. Although a person categorized as very liberal is more liberal than a person categorized as slightly liberal, no numerical value describes how much more liberal that person is.

An interval variable is one that does have numerical distances between any two values. For example, systolic blood pressure level, length of prison term, and annual income are interval variables. For most such variables, it is also possible to compare two values by their ratio, in which case the variable is also called a ratio variable.

The way that a variable is measured determines its classification. For example, “education” is only nominal when measured as (public school, private school, home schooling); it is ordinal when measured by highest degree attained, using the categories (none, high school, bachelor’s, master’s, doctorate); it is interval when measured by number of years of education completed, using the integers 0, 1, 2, 3, ….

A variable’s measurement scale determines which statistical methods are appropriate. It is usually best to apply methods appropriate for the actual scale. In the measurement hierarchy, interval variables are highest, ordinal variables are next, and nominal variables are lowest. Statistical methods for variables of one type can also be used with variables at higher levels but not at lower levels. For instance, statistical methods for nominal variables can be used with ordinal variables by ignoring the ordering of categories. Methods for ordinal variables cannot, however, be used with nominal variables, since their categories have no meaningful ordering. The distinction between ordered and unordered categories is not important for binary variables, because ordinal methods and nominal methods then typically reduce to equivalent methods.

In this book, we present methods for the analysis of binary, nominal, and ordinal variables. The methods also apply to interval variables having a small number of distinct values (e.g., number of times married, number of distinct side effects experienced in taking some drug) or for which the values are grouped into ordered categories (e.g., education measured as ≤12 years, > 12 but < 16 years, ≥16 years).
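As a small illustration of grouping an interval variable into ordered categories, the following Python sketch applies the three education groups mentioned above. The function name, the category labels, and the sample values are hypothetical and chosen only for illustration; Python is used here for convenience and is not the software featured in this text.

```python
def education_category(years):
    """Map years of education to the three ordered groups used in the text:
    <=12 years, >12 but <16 years, >=16 years."""
    if years <= 12:
        return "<=12 years"
    if years < 16:
        return ">12 but <16 years"
    return ">=16 years"


# Hypothetical values of years of education completed.
years_completed = [10, 12, 13, 15, 16, 20]
grouped = [education_category(y) for y in years_completed]

for y, g in zip(years_completed, grouped):
    print(y, "->", g)
```

The grouped variable is ordinal: the three labels have a natural ordering, but the distances between them are no longer available to the analysis.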

1.1.3 Discrete–Continuous Variable Distinction

Variables are classified as discrete or continuous, according to whether the number of values they can take is countable. Actual measurement of all variables occurs in a discrete manner, due to precision limitations in measuring instruments. The discrete–continuous classification, in practice, distinguishes between variables that take few values and variables that take lots of values. For instance, statisticians often treat discrete interval variables having a large number of values (such as test scores) as continuous, using them in methods for continuous responses.

This book deals with certain types of discretely measured responses: (1) binary variables, (2) nominal variables, (3) ordinal variables, (4) discrete interval variables having relatively few values, and (5) continuous variables grouped into a small number of categories.

1.1.4 Quantitative–Qualitative Variable Distinction

Nominal variables are qualitative: distinct categories differ in quality, not in quantity. Interval variables are quantitative: distinct levels have differing amounts of the characteristic of interest. The position of ordinal variables in the qualitative–quantitative classification is fuzzy. Analysts often treat them as qualitative, using methods for nominal variables. But in many respects, ordinal variables more closely resemble interval variables than they resemble nominal variables. They possess important quantitative features: Each category has a greater or smaller magnitude of the characteristic than another category; and although not possible to measure, an underlying continuous variable is often present. The political ideology classification (very liberal, slightly liberal, moderate, slightly conservative, very conservative) crudely measures an inherently continuous characteristic.

Analysts often utilize the quantitative nature of ordinal variables by assigning numerical scores to the categories or assuming an underlying continuous distribution. This requires good judgment and guidance from researchers who use the scale, but it provides benefits in the variety of methods available for data analysis.
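To make the idea of assigning scores concrete, here is a minimal Python sketch that summarizes an ordinal variable by its mean score. The counts are hypothetical, and the equally spaced scores 1 through 5 are an assumption made only for illustration; a different choice of scores could lead to a different summary, which is why the choice calls for judgment.

```python
categories = [
    "very liberal",
    "slightly liberal",
    "moderate",
    "slightly conservative",
    "very conservative",
]
counts = [30, 25, 25, 15, 5]   # hypothetical sample counts, not real data
scores = [1, 2, 3, 4, 5]       # assumed equally spaced category scores


def mean_score(counts, scores):
    """Weighted mean of the category scores, weighting by the counts."""
    total = sum(counts)
    return sum(c * s for c, s in zip(counts, scores)) / total


print("mean score:", mean_score(counts, scores))
```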

1.1.5 Organization of Book and Online Computing Appendix

The models for categorical response variables discussed in this book resemble regression models for continuous response variables; however, they assume binomial or multinomial response distributions instead of normality. One type of model receives special attention—logistic regression. Ordinary logistic regression models apply with binary responses and assume a binomial distribution. Generalizations of logistic regression apply with multicategory responses and assume a multinomial distribution.

The book has four main units. In the first, Chapters 1 through 3, we summarize descriptive and inferential methods for univariate and bivariate categorical data. These chapters cover discrete distributions, methods of inference, and measures of association for contingency tables. They summarize the non-model-based methods developed prior to about 1960.

In the second and primary unit, Chapters 4 through 10, we introduce models for categorical responses. In Chapter 4 we describe a class of generalized linear models having models of this text as special cases. Chapters 5 and 6 cover the most important model for binary responses, logistic regression. Chapter 7 presents alternative methods for binary data, including the probit, Bayesian fitting, and smoothing methods. In Chapter 8 we present generalizations of the logistic regression model for nominal and ordinal multicategory response variables. In Chapters 9 and 10 we introduce the modeling of multivariate categorical response data, in terms of association and interaction patterns among the variables. The models, called loglinear models, apply to counts in the table that cross-classifies those responses.

In the third unit, Chapters 11 through 14, we discuss models for handling repeated measurement and other forms of clustered data. In Chapter 11 we present models for a categorical response with matched pairs; these apply, for instance, with a categorical response measured for the same subjects at two times. Chapter 12 covers models for more general types of repeated categorical data, such as longitudinal data from several times with explanatory variables. In Chapter 13 we present a broad class of models, generalized linear mixed models, that use random effects to account for dependence with such data. In Chapter 14 further extensions of the models from Chapters 11 through 13 are described, unified by treating the response as having a mixture distribution of some type.

The fourth and final unit has a different nature than the others. In Chapter 15 we consider non-model-based classification and clustering methods. In Chapter 16 we summarize large-sample and small-sample theory for categorical data models. This theory is the basis for behavior of model parameter estimators and goodness-of-fit statistics. Chapter 17 presents a historical overview of the development of categorical data methods.

Maximum likelihood methods receive primary attention throughout the book. Many chapters, however, contain a section presenting corresponding Bayesian methods.

In Appendix A we review software that can perform the analyses in this book. The website www.stat.ufl.edu/~aa/cda/cda.html for this book contains an appendix that gives more information about using R, SAS, Stata, and other software, with sample programs for text examples. In addition, that site has complete data sets for many text examples and exercises, solutions to some exercises, extra exercises, corrections, and links to other useful sites. For instance, a manual prepared by Dr. Laura Thompson provides examples of how to use R and S-Plus for all examples in the second edition of this text, many of which (or very similar ones) are also in this edition.

In the rest of this chapter, we provide background material. In Section 1.2 we review the key distributions for categorical data: the binomial and multinomial, as well as another that is important for discrete data, the Poisson. In Section 1.3 we review the primary mechanisms for statistical inference using maximum likelihood. In Sections 1.4 and 1.5 we illustrate these by presenting significance tests and confidence intervals for binomial and multinomial parameters. In Section 1.6 we introduce Bayesian inference for these parameters.

1.2 DISTRIBUTIONS FOR CATEGORICAL DATA

Inferential data analyses require assumptions about the random mechanism that generated the data. For regression models with continuous responses, the normal distribution plays the central role. In this section we review the three key distributions for categorical responses: binomial, multinomial, and Poisson.

1.2.1 Binomial Distribution

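For n independent and identical binary trials, each with success probability π, let Y denote the number of successes. Then Y has the binomial distribution with index n and parameter π. The probability mass function for the possible outcomes y for Y is

\[
p(y) = \binom{n}{y} \pi^{y} (1 - \pi)^{n - y}, \qquad y = 0, 1, 2, \ldots, n. \tag{1.1}
\]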
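As a numerical check on the binomial probability mass function, which in its standard form is C(n, y) π^y (1 − π)^(n − y) for y = 0, 1, ..., n, the following Python sketch evaluates it for one setting and confirms that the probabilities sum to 1, with mean nπ and variance nπ(1 − π). The values n = 10 and π = 0.2 are hypothetical, and the function name is ours.

```python
from math import comb


def binomial_pmf(y, n, pi):
    """P(Y = y) for Y ~ binomial(n, pi)."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)


# Hypothetical setting: n = 10 trials with success probability 0.2.
n, pi = 10, 0.2
probs = [binomial_pmf(y, n, pi) for y in range(n + 1)]

total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print(f"sum of probabilities = {total:.6f}")
print(f"mean = {mean:.4f}, n*pi = {n * pi:.4f}")
print(f"variance = {variance:.4f}, n*pi*(1-pi) = {n * pi * (1 - pi):.4f}")
```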

There is no guarantee that successive binary observations are independent or identical. Thus, occasionally, we will utilize other distributions. One such case is sampling binary outcomes without replacement from a finite population, such as observations on whether a homework assignment was completed for 10 students sampled from a class of size 20. The hypergeometric distribution, studied in Section 3.5.1, is then relevant. In Section 1.2.4 we discuss another case that violates the binomial assumptions.
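The homework example can be sketched numerically. The Python code below compares the hypergeometric distribution for 10 students sampled without replacement from a class of 20 with the binomial distribution that would apply to sampling with replacement. The number of students in the class who completed the assignment (12) is a hypothetical value that does not come from the text, and the variance expressions in the comments are the standard ones for these two distributions.

```python
from math import comb


def hypergeometric_pmf(y, N, K, n):
    """P(y successes in a sample of size n drawn without replacement
    from a population of size N that contains K successes)."""
    return comb(K, y) * comb(N - K, n - y) / comb(N, n)


N, n = 20, 10   # class size and sample size, as in the text
K = 12          # hypothetical number of students who completed the homework

# Possible numbers of successes in the sample.
lo, hi = max(0, n - (N - K)), min(n, K)
probs = {y: hypergeometric_pmf(y, N, K, n) for y in range(lo, hi + 1)}

mean = sum(y * p for y, p in probs.items())
variance = sum((y - mean) ** 2 * p for y, p in probs.items())

# Binomial variance for the same n and success proportion K/N.
pi = K / N
binomial_variance = n * pi * (1 - pi)

print(f"hypergeometric mean = {mean:.4f}, variance = {variance:.4f}")
print(f"binomial variance with pi = K/N: {binomial_variance:.4f}")
```

Under these assumptions the two distributions share the same mean, but sampling without replacement from a small population yields a smaller variance than the binomial, so the binomial would misstate the variability.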

1.2.2 Multinomial Distribution

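For n independent trials with c outcome categories, let π_j denote the probability of category j and let n_j denote the number of trials having outcome in category j, so that Σ_j n_j = n. The multinomial probability mass function is

$$p(n_1, n_2, \ldots, n_{c-1}) = \left(\frac{n!}{n_1!\, n_2! \cdots n_c!}\right) \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_c^{n_c}. \tag{1.2}$$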

For the multinomial distribution,

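$$E(n_j) = n\pi_j, \qquad \operatorname{var}(n_j) = n\pi_j(1-\pi_j), \qquad \operatorname{cov}(n_j, n_k) = -n\pi_j\pi_k. \tag{1.3}$$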

We derive the covariance in Section 16.1.4. The marginal distribution of each nj is binomial.
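The multinomial moments can be checked by brute-force enumeration for a small case. The sketch below uses n = 5 trials and category probabilities (0.5, 0.3, 0.2); these values are arbitrary illustrative choices. It confirms that E(n_j) = nπ_j, var(n_j) = nπ_j(1 − π_j), and cov(n_j, n_k) = −nπ_jπ_k, and that the marginal distribution of a single count is binomial.

```python
from math import comb, factorial

def multinomial_pmf(counts, probs):
    """Multinomial probability of the vector `counts` given category probabilities."""
    coef = factorial(sum(counts))
    for k in counts:
        coef //= factorial(k)          # each partial quotient is an integer
    p = float(coef)
    for k, pi in zip(counts, probs):
        p *= pi ** k
    return p

# Illustrative values (assumptions, not from the text).
n, probs = 5, (0.5, 0.3, 0.2)

# All count vectors (n1, n2, n3) with n1 + n2 + n3 = n.
outcomes = [(a, b, n - a - b) for a in range(n + 1) for b in range(n + 1 - a)]
pmf = {o: multinomial_pmf(o, probs) for o in outcomes}

mean = [sum(o[j] * p for o, p in pmf.items()) for j in range(3)]

def cov(j, k):
    """Covariance of counts j and k; cov(j, j) is the variance of count j."""
    return sum((o[j] - mean[j]) * (o[k] - mean[k]) * p for o, p in pmf.items())

# Marginal probability that the first count equals 2, summed over the other counts.
marginal_n1_is_2 = sum(p for o, p in pmf.items() if o[0] == 2)

print("means:", mean)
print("var(n1):", cov(0, 0), " cov(n1, n2):", cov(0, 1))
print("P(n1 = 2):", marginal_n1_is_2, " binomial:", comb(n, 2) * 0.5**2 * 0.5**3)
```

The negative covariance reflects the fixed total: a larger count in one category leaves fewer trials for the others.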

1.2.3 Poisson Distribution

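The Poisson probability mass function with mean μ > 0 is

$$p(y) = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots. \tag{1.4}$$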

A key feature of the Poisson distribution is that its variance equals its mean. Sample counts vary more when their mean is higher. When the mean number of daily fatal accidents equals 15, greater variability occurs from day to day than when the mean equals 2.
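The equality of mean and variance can be verified numerically from the Poisson probability mass function e^{−μ}μ^y/y!. The sketch below is a minimal check for the two means mentioned above, 2 and 15. The cutoff of 200 for the infinite sum is our assumption; the tail mass beyond it is negligible for these means.

```python
from math import exp, lgamma, log

def poisson_pmf(y, mu):
    """Poisson probability of the count y, computed on the log scale for stability."""
    return exp(-mu + y * log(mu) - lgamma(y + 1))

def mean_and_variance(mu, upper=200):
    """Mean and variance of the Poisson(mu) distribution, truncating the sum at `upper`."""
    probs = [poisson_pmf(y, mu) for y in range(upper + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

for mu in (2.0, 15.0):
    m, v = mean_and_variance(mu)
    print(f"mu = {mu}: mean = {m:.6f}, variance = {v:.6f}, std dev = {v ** 0.5:.3f}")
```

The standard deviation grows with the square root of the mean, which is why daily counts fluctuate more around 15 than around 2.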

1.2.4 Overdispersion

In practice, count observations often exhibit variability exceeding that predicted by the binomial or Poisson. This phenomenon is called overdispersion. We assumed above that each person has the same probability each day of dying in a fatal auto accident. More realistically, these probabilities vary from day to day according to the amount of road traffic and weather conditions and vary from person to person according to factors such as the amount of time spent in autos, whether the person wears a seat belt, how much of the driving is at high speeds, gender, and age. Such variation causes fatality counts to display more variation than predicted by the Poisson model.

Assuming a Poisson distribution for a count variable is often too simplistic, because of factors that cause overdispersion. The negative binomial is a related distribution for count data that has a second parameter and permits the variance to exceed the mean. We introduce it in Section 4.3.4.
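A small calculation shows how heterogeneity in the mean produces overdispersion. As a hypothetical scenario, suppose that the daily count is Poisson with mean 10 on half of the days and Poisson with mean 20 on the other half. The sketch below computes the mean and variance of the resulting mixture; the two means and the equal weights are illustrative assumptions.

```python
from math import exp, lgamma, log

def poisson_pmf(y, mu):
    """Poisson probability of the count y, computed on the log scale."""
    return exp(-mu + y * log(mu) - lgamma(y + 1))

# Hypothetical mixture: (weight, Poisson mean) for light-traffic and heavy-traffic days.
components = [(0.5, 10.0), (0.5, 20.0)]
upper = 200  # truncation point for the sums; the tail mass beyond it is negligible here

mixture = [sum(w * poisson_pmf(y, mu) for w, mu in components)
           for y in range(upper + 1)]

mix_mean = sum(y * p for y, p in enumerate(mixture))
mix_var = sum((y - mix_mean) ** 2 * p for y, p in enumerate(mixture))

print("mixture mean:    ", round(mix_mean, 6))
print("mixture variance:", round(mix_var, 6))
```

With these values the overall mean is 15, but the variance is E(μ) + var(μ) = 15 + 25 = 40, well above the value of 15 that a single Poisson distribution would imply.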

Analyses assuming binomial (or multinomial) distributions are also sometimes invalid because of overdispersion. This might happen because the true distribution is a mixture of different binomial distributions, with the parameter varying because of unmeasured variables. To illustrate, suppose that an experiment exposes pregnant mice to a toxin and then after a week observes the number of fetuses in each mouse’s litter that show signs of malformation. Let ni denote the number of fetuses in the litter for mouse i. The pregnant mice also vary according to other factors, such as their weight, overall health, and genetic makeup. Extra variation then occurs because of the variability from litter to litter in the probability π of malformation. The distribution of the number of fetuses per litter showing malformations might cluster near 0 and near ni, showing more dispersion than expected for binomial sampling with a single value of π. Overdispersion could also occur when π varies among fetuses in a litter according to some distribution (Exercise 1.17). In Chapters 4, 13, and 14 we introduce methods for data that are overdispersed relative to binomial and Poisson assumptions.
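The clustering near 0 and near ni can be seen in a simple two-point mixture. In the sketch below, every litter has 10 fetuses, and the malformation probability is 0.1 for half of the litters and 0.9 for the other half. All of these values are hypothetical and chosen only to make the effect visible.

```python
from math import comb

def binomial_pmf(y, n, pi):
    """Binomial probability of y successes in n trials with success probability pi."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)

# Hypothetical litter-to-litter heterogeneity: (weight, malformation probability).
litter_size = 10
components = [(0.5, 0.1), (0.5, 0.9)]

mixture = [sum(w * binomial_pmf(y, litter_size, pi) for w, pi in components)
           for y in range(litter_size + 1)]

mix_mean = sum(y * p for y, p in enumerate(mixture))
mix_var = sum((y - mix_mean) ** 2 * p for y, p in enumerate(mixture))

# Variance under binomial sampling with the single average probability.
pi_bar = sum(w * pi for w, pi in components)
binomial_var = litter_size * pi_bar * (1 - pi_bar)

print("mixture pmf:", [round(p, 3) for p in mixture])
print("mixture variance:", round(mix_var, 4), " binomial variance:", binomial_var)
```

Under these assumptions the mixture variance is E[nπ(1 − π)] + var(nπ) = 0.9 + 16 = 16.9, compared with 2.5 for a binomial distribution with the same mean, and nearly all of the probability falls in the lowest and highest counts.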

1.2.5 Connection Between Poisson and Multinomial Distributions

With Poisson sampling the total count n is random rather than fixed. If we assume a Poisson model but condition on n, {Yi} no longer have Poisson distributions, since each Yi cannot exceed n. Given n, {Yi} are also no longer independent, since the value of one affects the possible range for the others.

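For independent Poisson variates {Y_i} with means {μ_i}, the total n = Σ_j Y_j has a Poisson distribution with mean Σ_j μ_j. The conditional probability of a set of counts {n_i} satisfying Σ_i n_i = n is therefore

$$P\Big[(Y_1 = n_1, \ldots, Y_c = n_c) \,\Big|\, \sum_j Y_j = n\Big] = \frac{\prod_i \left[e^{-\mu_i}\mu_i^{n_i}/n_i!\right]}{e^{-\sum_j \mu_j}\big(\sum_j \mu_j\big)^{n}/n!} = \frac{n!}{\prod_i n_i!}\prod_i \pi_i^{n_i}, \tag{1.5}$$

where π_i = μ_i / Σ_j μ_j. This is the multinomial distribution with index n and probabilities {π_i}.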

Many categorical data analyses assume a multinomial distribution. Such analyses usually have the same inferential results as those of analyses assuming a Poisson distribution, because of the similarity in the likelihood functions.
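The connection can be checked numerically: for independent Poisson counts with means {μi}, the conditional probability of a particular set of counts given their total equals the multinomial probability with category probabilities μi/Σμj. The sketch below does this for three cells with means 2, 3, and 5 and observed counts (1, 2, 3); these numbers are illustrative assumptions.

```python
from math import exp, factorial

def poisson_pmf(y, mu):
    """Poisson probability of the count y when the mean is mu."""
    return exp(-mu) * mu**y / factorial(y)

def multinomial_pmf(counts, probs):
    """Multinomial probability of `counts` given category probabilities `probs`."""
    coef = factorial(sum(counts))
    for k in counts:
        coef //= factorial(k)
    p = float(coef)
    for k, pi in zip(counts, probs):
        p *= pi ** k
    return p

# Illustrative values (assumptions): Poisson means and one set of cell counts.
means = (2.0, 3.0, 5.0)
counts = (1, 2, 3)
n = sum(counts)

# Joint probability of the independent Poisson counts ...
joint = 1.0
for y, mu in zip(counts, means):
    joint *= poisson_pmf(y, mu)

# ... divided by the probability of the total, which is Poisson with mean sum(means).
conditional = joint / poisson_pmf(n, sum(means))

probs = [mu / sum(means) for mu in means]
multinomial = multinomial_pmf(counts, probs)

print("conditional Poisson probability:", conditional)
print("multinomial probability:        ", multinomial)
```

Both calculations give 0.135 for these values, because the factors involving e^{−μi} cancel between the numerator and the denominator.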

1.2.6 The Chi-Squared Distribution

Another distribution of fundamental importance for categorical data is the chi-squared, not as a distribution for the data but rather as a sampling distribution for many statistics. Because of its importance, we summarize here a few of its properties.

The chi-squared distribution with degrees of freedom denoted by df has mean df, variance 2(df), and skewness $(8/df)^{1/2}$. It converges (slowly) to normality as df increases, the approximation being reasonably good when df is at least about 50.
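These properties can be confirmed by integrating the chi-squared density numerically. The sketch below uses a midpoint rule. The integration limit and number of steps are our choices rather than anything prescribed in the text, and the routine is intended only for df of at least 2, where the density is bounded at zero.

```python
from math import exp, lgamma, log, sqrt

def chi2_pdf(x, df):
    """Chi-squared density at x > 0, evaluated on the log scale."""
    k = df / 2
    return exp((k - 1) * log(x) - x / 2 - k * log(2) - lgamma(k))

def chi2_moments(df, steps=100_000):
    """Mean, variance, and skewness by midpoint-rule integration (assumes df >= 2)."""
    upper = df + 40 * sqrt(2 * df)   # far enough into the tail for a negligible remainder
    h = upper / steps
    xs = [(i + 0.5) * h for i in range(steps)]
    ws = [chi2_pdf(x, df) * h for x in xs]
    mean = sum(x * w for x, w in zip(xs, ws))
    var = sum((x - mean) ** 2 * w for x, w in zip(xs, ws))
    third = sum((x - mean) ** 3 * w for x, w in zip(xs, ws))
    return mean, var, third / var ** 1.5

for df in (4, 50):
    mean, var, skew = chi2_moments(df)
    print(f"df = {df}: mean = {mean:.4f}, variance = {var:.4f}, "
          f"skewness = {skew:.4f}, sqrt(8/df) = {sqrt(8 / df):.4f}")
```

The skewness falls from about 1.41 at df = 4 to 0.40 at df = 50, which illustrates how slowly the distribution approaches the symmetric normal shape.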