Praise for the Second Edition

"A must-have book for anyone expecting to do research and/or applications in Categorical Data Analysis." --Statistics in Medicine

"It is a total delight reading this book." --Pharmaceutical Research

"If you do any analysis of categorical data, this is an essential desktop reference." --Technometrics

The use of statistical methods for analyzing categorical data has increased dramatically, particularly in the biomedical, social sciences, and financial industries. Responding to new developments, this book offers a comprehensive treatment of the most important methods for Categorical Data Analysis.

Categorical Data Analysis, Third Edition summarizes the latest methods for univariate and correlated multivariate categorical responses. Readers will find a unified generalized linear models approach that connects logistic regression and Poisson and negative binomial loglinear models for discrete data with normal regression for continuous data. This edition also features:

* An emphasis on logistic and probit regression methods for binary, ordinal, and nominal responses for independent observations and for clustered data with marginal models and random effects models
* Two new chapters on alternative methods for binary response data, including smoothing and regularization methods, classification methods such as linear discriminant analysis and classification trees, and cluster analysis
* New sections introducing the Bayesian approach for methods in that chapter
* More than 100 analyses of data sets and over 600 exercises
* Notes at the end of each chapter that provide references to recent research and topics not covered in the text, linked to a bibliography of more than 1,200 sources
* A supplementary website showing how to use R and SAS for all examples in the text, with information also about SPSS and Stata and with exercise solutions

Categorical Data Analysis, Third Edition is an invaluable tool for statisticians and methodologists, such as biostatisticians and researchers in the social and behavioral sciences, medicine and public health, marketing, education, finance, biological and agricultural sciences, and industrial quality control.
Page count: 1535
Contents
Cover
Half Title page
Title page
Copyright page
Dedication
Preface
Chapter 1: Introduction: Distributions and Inference for Categorical Data
1.1 Categorical Response Data
1.2 Distributions for Categorical Data
1.3 Statistical Inference for Categorical Data
1.4 Statistical Inference for Binomial Parameters
1.5 Statistical Inference for Multinomial Parameters
1.6 Bayesian Inference for Binomial and Multinomial Parameters
Notes
Exercises
Chapter 2: Describing Contingency Tables
2.1 Probability Structure for Contingency Tables
2.2 Comparing Two Proportions
2.3 Conditional Association in Stratified 2 × 2 Tables
2.4 Measuring Association in I × J Tables
Notes
Exercises
Chapter 3: Inference for Two-Way Contingency Tables
3.1 Confidence Intervals for Association Parameters
3.2 Testing Independence in Two-way Contingency Tables
3.3 Following-up Chi-Squared Tests
3.4 Two-Way Tables with Ordered Classifications
3.5 Small-Sample Inference for Contingency Tables
3.6 Bayesian Inference for Two-way Contingency Tables
3.7 Extensions for Multiway Tables and Nontabulated Responses
Notes
Exercises
Chapter 4: Introduction to Generalized Linear Models
4.1 The Generalized Linear Model
4.2 Generalized Linear Models for Binary Data
4.3 Generalized Linear Models for Counts and Rates
4.4 Moments and Likelihood for Generalized Linear Models
4.5 Inference and Model Checking for Generalized Linear Models
4.6 Fitting Generalized Linear Models
4.7 Quasi-Likelihood and Generalized Linear Models
Notes
Exercises
Chapter 5: Logistic Regression
5.1 Interpreting Parameters in Logistic Regression
5.2 Inference for Logistic Regression
5.3 Logistic Models with Categorical Predictors
5.4 Multiple Logistic Regression
5.5 Fitting Logistic Regression Models
Notes
Exercises
Chapter 6: Building, Checking, and Applying Logistic Regression Models
6.1 Strategies in Model Selection
6.2 Logistic Regression Diagnostics
6.3 Summarizing the Predictive Power of a Model
6.4 Mantel–Haenszel and Related Methods for Multiple 2 × 2 Tables
6.5 Detecting and Dealing with Infinite Estimates
6.6 Sample Size and Power Considerations
Notes
Exercises
Chapter 7: Alternative Modeling of Binary Response Data
7.1 Probit and Complementary Log–log Models
7.2 Bayesian Inference for Binary Regression
7.3 Conditional Logistic Regression
7.4 Smoothing: Kernels, Penalized Likelihood, Generalized Additive Models
7.5 Issues in Analyzing High-Dimensional Categorical Data
Notes
Exercises
Chapter 8: Models for Multinomial Responses
8.1 Nominal Responses: Baseline-Category Logit Models
8.2 Ordinal Responses: Cumulative Logit Models
8.3 Ordinal Responses: Alternative Models
8.4 Testing Conditional Independence in I × J × K Tables
8.5 Discrete-Choice Models
8.6 Bayesian Modeling of Multinomial Responses
Notes
Exercises
Chapter 9: Loglinear Models for Contingency Tables
9.1 Loglinear Models for Two-way Tables
9.2 Loglinear Models for Independence and Interaction in Three-way Tables
9.3 Inference for Loglinear Models
9.4 Loglinear Models for Higher Dimensions
9.5 Loglinear—Logistic Model Connection
9.6 Loglinear Model Fitting: Likelihood Equations and Asymptotic Distributions
9.7 Loglinear Model Fitting: Iterative Methods and Their Application
Notes
Exercises
Chapter 10: Building and Extending Loglinear Models
10.1 Conditional Independence Graphs and Collapsibility
10.2 Model Selection and Comparison
10.3 Residuals for Detecting Cell-Specific Lack of Fit
10.4 Modeling Ordinal Associations
10.5 Generalized Loglinear and Association Models, Correlation Models, and Correspondence Analysis
10.6 Empty Cells and Sparseness in Modeling Contingency Tables
10.7 Bayesian Loglinear Modeling
Notes
Exercises
Chapter 11: Models for Matched Pairs
11.1 Comparing Dependent Proportions
11.2 Conditional Logistic Regression for Binary Matched Pairs
11.3 Marginal Models for Square Contingency Tables
11.4 Symmetry, Quasi-Symmetry, and Quasi-Independence
11.5 Measuring Agreement Between Observers
11.6 Bradley–Terry Model for Paired Preferences
11.7 Marginal Models and Quasi-Symmetry Models for Matched Sets
Notes
Exercises
Chapter 12: Clustered Categorical Data: Marginal and Transitional Models
12.1 Marginal Modeling: Maximum Likelihood Approach
12.2 Marginal Modeling: Generalized Estimating Equations (GEEs) Approach
12.3 Quasi-Likelihood and Its GEE Multivariate Extension: Details
12.4 Transitional Models: Markov Chain and Time Series Models
Notes
Exercises
Chapter 13: Clustered Categorical Data: Random Effects Models
13.1 Random Effects Modeling of Clustered Categorical Data
13.2 Binary Responses: Logistic-Normal Model
13.3 Examples of Random Effects Models for Binary Data
13.4 Random Effects Models for Multinomial Data
13.5 Multilevel Modeling
13.6 GLMM Fitting, Inference, and Prediction
13.7 Bayesian Multivariate Categorical Modeling
Notes
Exercises
Chapter 14: Other Mixture Models for Discrete Data
14.1 Latent Class Models
14.2 Nonparametric Random Effects Models
14.3 Beta-Binomial Models
14.4 Negative Binomial Regression
14.5 Poisson Regression with Random Effects
Notes
Exercises
Chapter 15: Non-Model-Based Classification and Clustering
15.1 Classification: Linear Discriminant Analysis
15.2 Classification: Tree-Structured Prediction
15.3 Cluster Analysis for Categorical Data
Notes
Exercises
Chapter 16: Large- and Small-Sample Theory for Multinomial Models
16.1 Delta Method
16.2 Asymptotic Distributions of Estimators of Model Parameters and Cell Probabilities
16.3 Asymptotic Distributions of Residuals and Goodness-of-fit Statistics
16.4 Asymptotic Distributions for Logit/Loglinear Models
16.5 Small-Sample Significance Tests for Contingency Tables
16.6 Small-Sample Confidence Intervals for Categorical Data
16.7 Alternative Estimation Theory for Parametric Models
Notes
Exercises
Chapter 17: Historical Tour of Categorical Data Analysis
17.1 Pearson–Yule Association Controversy
17.2 R. A. Fisher’s Contributions
17.3 Logistic Regression
17.4 Multiway Contingency Tables and Loglinear Models
17.5 Bayesian Methods for Categorical Data
17.6 A Look Forward, and Backward
Appendix A: Statistical Software for Categorical Data Analysis
A.1 SAS
A.2 R and S-Plus
A.3 Stata
A.4 SPSS
A.5 StatXact and LogXact
A.6 Other Software
Appendix B: Chi-Squared Distribution Values
References
Author Index
Example Index
Subject Index
Categorical Data Analysis
WILEY SERIES IN PROBABILITY AND STATISTICS
ESTABLISHED BY WALTER A. SHEWHART AND SAMUEL S. WILKS
Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Harvey Goldstein, Iain M. Johnstone, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg
Editors Emeriti: Vic Barnett, J. Stuart Hunter, Joseph B. Kadane, Jozef L. Teugels
The Wiley Series in Probability and Statistics is well established and authoritative. It covers many topics of current research interest in both pure and applied statistics and probability theory. Written by leading statisticians and institutions, the titles span both state-of-the-art developments in the field and classical methods.
Reflecting the wide range of current research in statistics, the series encompasses applied, methodological and theoretical statistics, ranging from applications and new techniques made possible by advances in computerized practice to rigorous treatment of theoretical approaches.
This series provides essential and invaluable reading for all statisticians, whether in academia, industry, government, or research.
† ABRAHAM and LEDOLTER • Statistical Methods for Forecasting
AGRESTI • Analysis of Ordinal Categorical Data, Second Edition
AGRESTI • An Introduction to Categorical Data Analysis, Second Edition
AGRESTI • Categorical Data Analysis, Third Edition
ALTMAN, GILL, and McDONALD • Numerical Issues in Statistical Computing for the Social Scientist
AMARATUNGA and CABRERA • Exploration and Analysis of DNA Microarray and Protein Array Data
ANDĚL • Mathematics of Chance
ANDERSON • An Introduction to Multivariate Statistical Analysis, Third Edition
* ANDERSON • The Statistical Analysis of Time Series
ANDERSON, AUQUIER, HAUCK, OAKES, VANDAELE, and WEISBERG • Statistical Methods for Comparative Studies
ANDERSON and LOYNES • The Teaching of Practical Statistics
ARMITAGE and DAVID (editors) • Advances in Biometry
ARNOLD, BALAKRISHNAN, and NAGARAJA • Records
* ARTHANARI and DODGE • Mathematical Programming in Statistics
* BAILEY • The Elements of Stochastic Processes with Applications to the Natural Sciences
BAJORSKI • Statistics for Imaging, Optics, and Photonics
BALAKRISHNAN and KOUTRAS • Runs and Scans with Applications
BALAKRISHNAN and NG • Precedence-Type Tests and Applications
BARNETT • Comparative Statistical Inference, Third Edition
BARNETT • Environmental Statistics
BARNETT and LEWIS • Outliers in Statistical Data, Third Edition
BARTHOLOMEW, KNOTT, and MOUSTAKI • Latent Variable Models and Factor Analysis: A Unified Approach, Third Edition
BARTOSZYNSKI and NIEWIADOMSKA-BUGAJ • Probability and Statistical Inference, Second Edition
BASILEVSKY • Statistical Factor Analysis and Related Methods: Theory and Applications
BATES and WATTS • Nonlinear Regression Analysis and Its Applications
BECHHOFER, SANTNER, and GOLDSMAN • Design and Analysis of Experiments for Statistical Selection, Screening, and Multiple Comparisons
BEIRLANT, GOEGEBEUR, SEGERS, TEUGELS, and DE WAAL • Statistics of Extremes: Theory and Applications
BELSLEY • Conditioning Diagnostics: Collinearity and Weak Data in Regression
† BELSLEY, KUH, and WELSCH • Regression Diagnostics: Identifying Influential Data and Sources of Collinearity
BENDAT and PIERSOL • Random Data: Analysis and Measurement Procedures, Fourth Edition
BERNARDO and SMITH • Bayesian Theory
BERZUINI, DAWID, and BERNARDINELLI • Causality: Statistical Perspectives and Applications
BHAT and MILLER • Elements of Applied Stochastic Processes, Third Edition
BHATTACHARYA and WAYMIRE • Stochastic Processes with Applications
BIEMER, GROVES, LYBERG, MATHIOWETZ, and SUDMAN • Measurement Errors in Surveys
BILLINGSLEY • Convergence of Probability Measures, Second Edition
BILLINGSLEY • Probability and Measure, Anniversary Edition
BIRKES and DODGE • Alternative Methods of Regression
BISGAARD and KULAHCI • Time Series Analysis and Forecasting by Example
BISWAS, DATTA, FINE, and SEGAL • Statistical Advances in the Biomedical Sciences: Clinical Trials, Epidemiology, Survival Analysis, and Bioinformatics
BLISCHKE and MURTHY (editors) • Case Studies in Reliability and Maintenance
BLISCHKE and MURTHY • Reliability: Modeling, Prediction, and Optimization
BLOOMFIELD • Fourier Analysis of Time Series: An Introduction, Second Edition
BOLLEN • Structural Equations with Latent Variables
BOLLEN and CURRAN • Latent Curve Models: A Structural Equation Perspective
BOROVKOV • Ergodicity and Stability of Stochastic Processes
BOSQ and BLANKE • Inference and Prediction in Large Dimensions
BOULEAU • Numerical Methods for Stochastic Processes
* BOX and TIAO • Bayesian Inference in Statistical Analysis
BOX • Improving Almost Anything, Revised Edition
* BOX and DRAPER • Evolutionary Operation: A Statistical Method for Process Improvement
BOX and DRAPER • Response Surfaces, Mixtures, and Ridge Analyses, Second Edition
BOX, HUNTER, and HUNTER • Statistics for Experimenters: Design, Innovation, and Discovery, Second Edition
BOX, JENKINS, and REINSEL • Time Series Analysis: Forecasting and Control, Fourth Edition
BOX, LUCEÑO, and PANIAGUA-QUIÑONES • Statistical Control by Monitoring and Adjustment, Second Edition
* BROWN and HOLLANDER • Statistics: A Biomedical Introduction
CAIROLI and DALANG • Sequential Stochastic Optimization
CASTILLO, HADI, BALAKRISHNAN, and SARABIA • Extreme Value and Related Models with Applications in Engineering and Science
CHAN • Time Series: Applications to Finance with R and S-Plus, Second Edition
CHARALAMBIDES • Combinatorial Methods in Discrete Distributions
CHATTERJEE and HADI • Regression Analysis by Example, Fifth Edition
CHATTERJEE and HADI • Sensitivity Analysis in Linear Regression
CHERNICK • Bootstrap Methods: A Guide for Practitioners and Researchers, Second Edition
CHERNICK and FRIIS • Introductory Biostatistics for the Health Sciences
CHILÈS and DELFINER • Geostatistics: Modeling Spatial Uncertainty, Second Edition
CHOW and LIU • Design and Analysis of Clinical Trials: Concepts and Methodologies, Second Edition
CLARKE • Linear Models: The Theory and Application of Analysis of Variance
CLARKE and DISNEY • Probability and Random Processes: A First Course with Applications, Second Edition
* COCHRAN and COX • Experimental Designs, Second Edition
COLLINS and LANZA • Latent Class and Latent Transition Analysis: With Applications in the Social, Behavioral, and Health Sciences
CONGDON • Applied Bayesian Modelling
CONGDON • Bayesian Models for Categorical Data
CONGDON • Bayesian Statistical Modelling, Second Edition
CONOVER • Practical Nonparametric Statistics, Third Edition
COOK • Regression Graphics
COOK and WEISBERG • An Introduction to Regression Graphics
COOK and WEISBERG • Applied Regression Including Computing and Graphics
CORNELL • A Primer on Experiments with Mixtures
CORNELL • Experiments with Mixtures, Designs, Models, and the Analysis of Mixture Data, Third Edition
COX • A Handbook of Introductory Statistical Methods
CRESSIE • Statistics for Spatial Data, Revised Edition
CRESSIE and WIKLE • Statistics for Spatio-Temporal Data
CSÖRGŐ and HORVÁTH • Limit Theorems in Change Point Analysis
DAGPUNAR • Simulation and Monte Carlo: With Applications in Finance and MCMC
DANIEL • Applications of Statistics to Industrial Experimentation
DANIEL • Biostatistics: A Foundation for Analysis in the Health Sciences, Eighth Edition
* DANIEL • Fitting Equations to Data: Computer Analysis of Multifactor Data, Second Edition
DASU and JOHNSON • Exploratory Data Mining and Data Cleaning
DAVID and NAGARAJA • Order Statistics, Third Edition
* DEGROOT, FIENBERG, and KADANE • Statistics and the Law
DEL CASTILLO • Statistical Process Adjustment for Quality Control
DEMARIS • Regression with Social Data: Modeling Continuous and Limited Response Variables
DEMIDENKO • Mixed Models: Theory and Applications
DENISON, HOLMES, MALLICK and SMITH • Bayesian Methods for Nonlinear Classification and Regression
DETTE and STUDDEN • The Theory of Canonical Moments with Applications in Statistics, Probability, and Analysis
DEY and MUKERJEE • Fractional Factorial Plans
DILLON and GOLDSTEIN • Multivariate Analysis: Methods and Applications
* DODGE and ROMIG • Sampling Inspection Tables, Second Edition
* DOOB • Stochastic Processes
DOWDY, WEARDEN, and CHILKO • Statistics for Research, Third Edition
DRAPER and SMITH • Applied Regression Analysis, Third Edition
DRYDEN and MARDIA • Statistical Shape Analysis
DUDEWICZ and MISHRA • Modern Mathematical Statistics
DUNN and CLARK • Basic Statistics: A Primer for the Biomedical Sciences, Fourth Edition
DUPUIS and ELLIS • A Weak Convergence Approach to the Theory of Large Deviations
EDLER and KITSOS • Recent Advances in Quantitative Methods in Cancer and Human Health Risk Assessment
* ELANDT-JOHNSON and JOHNSON • Survival Models and Data Analysis
ENDERS • Applied Econometric Time Series, Third Edition
† ETHIER and KURTZ • Markov Processes: Characterization and Convergence
EVANS, HASTINGS, and PEACOCK • Statistical Distributions, Third Edition
EVERITT, LANDAU, LEESE, and STAHL • Cluster Analysis, Fifth Edition
FEDERER and KING • Variations on Split Plot and Split Block Experiment Designs
FELLER • An Introduction to Probability Theory and Its Applications, Volume I, Third Edition, Revised; Volume II, Second Edition
FITZMAURICE, LAIRD, and WARE • Applied Longitudinal Analysis, Second Edition
* FLEISS • The Design and Analysis of Clinical Experiments
FLEISS • Statistical Methods for Rates and Proportions, Third Edition
† FLEMING and HARRINGTON • Counting Processes and Survival Analysis
FUJIKOSHI, ULYANOV, and SHIMIZU • Multivariate Statistics: High-Dimensional and Large-Sample Approximations
FULLER • Introduction to Statistical Time Series, Second Edition
† FULLER • Measurement Error Models
GALLANT • Nonlinear Statistical Models
GEISSER • Modes of Parametric Statistical Inference
GELMAN and MENG • Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives
GEWEKE • Contemporary Bayesian Econometrics and Statistics
GHOSH, MUKHOPADHYAY, and SEN • Sequential Estimation
GIESBRECHT and GUMPERTZ • Planning, Construction, and Statistical Analysis of Comparative Experiments
GIFI • Nonlinear Multivariate Analysis
GIVENS and HOETING • Computational Statistics
GLASSERMAN and YAO • Monotone Structure in Discrete-Event Systems
GNANADESIKAN • Methods for Statistical Data Analysis of Multivariate Observations, Second Edition
GOLDSTEIN • Multilevel Statistical Models, Fourth Edition
GOLDSTEIN and LEWIS • Assessment: Problems, Development, and Statistical Issues
GOLDSTEIN and WOOFF • Bayes Linear Statistics
GREENWOOD and NIKULIN • A Guide to Chi-Squared Testing
GROSS, SHORTLE, THOMPSON, and HARRIS • Fundamentals of Queueing Theory, Fourth Edition
GROSS, SHORTLE, THOMPSON, and HARRIS • Solutions Manual to Accompany Fundamentals of Queueing Theory, Fourth Edition
* HAHN and SHAPIRO • Statistical Models in Engineering
HAHN and MEEKER • Statistical Intervals: A Guide for Practitioners
HALD • A History of Probability and Statistics and their Applications Before 1750
† HAMPEL • Robust Statistics: The Approach Based on Influence Functions
HARTUNG, KNAPP, and SINHA • Statistical Meta-Analysis with Applications
HEIBERGER • Computation for the Analysis of Designed Experiments
HEDAYAT and SINHA • Design and Inference in Finite Population Sampling
HEDEKER and GIBBONS • Longitudinal Data Analysis
HELLER • MACSYMA for Statisticians
HERITIER, CANTONI, COPT, and VICTORIA-FESER • Robust Methods in Biostatistics
HINKELMANN and KEMPTHORNE • Design and Analysis of Experiments, Volume 1: Introduction to Experimental Design, Second Edition
HINKELMANN and KEMPTHORNE • Design and Analysis of Experiments, Volume 2: Advanced Experimental Design
HINKELMANN (editor) • Design and Analysis of Experiments, Volume 3: Special Designs and Applications
HOAGLIN, MOSTELLER, and TUKEY • Fundamentals of Exploratory Analysis of Variance
* HOAGLIN, MOSTELLER, and TUKEY • Exploring Data Tables, Trends and Shapes
* HOAGLIN, MOSTELLER, and TUKEY • Understanding Robust and Exploratory Data Analysis
HOCHBERG and TAMHANE • Multiple Comparison Procedures
HOCKING • Methods and Applications of Linear Models: Regression and the Analysis of Variance, Second Edition
HOEL • Introduction to Mathematical Statistics, Fifth Edition
HOGG and KLUGMAN • Loss Distributions
HOLLANDER and WOLFE • Nonparametric Statistical Methods, Second Edition
HOSMER and LEMESHOW • Applied Logistic Regression, Second Edition
HOSMER, LEMESHOW, and MAY • Applied Survival Analysis: Regression Modeling of Time-to-Event Data, Second Edition
HUBER • Data Analysis: What Can Be Learned From the Past 50 Years
HUBER • Robust Statistics
† HUBER and RONCHETTI • Robust Statistics, Second Edition
HUBERTY • Applied Discriminant Analysis, Second Edition
HUBERTY and OLEJNIK • Applied MANOVA and Discriminant Analysis, Second Edition
HUITEMA • The Analysis of Covariance and Alternatives: Statistical Methods for Experiments, Quasi-Experiments, and Single-Case Studies, Second Edition
HUNT and KENNEDY • Financial Derivatives in Theory and Practice, Revised Edition
HURD and MIAMEE • Periodically Correlated Random Sequences: Spectral Theory and Practice
HUSKOVA, BERAN, and DUPAC • Collected Works of Jaroslav Hajek—with Commentary
HUZURBAZAR • Flowgraph Models for Multistate Time-to-Event Data
JACKMAN • Bayesian Analysis for the Social Sciences
† JACKSON • A User’s Guide to Principal Components
JOHN • Statistical Methods in Engineering and Quality Assurance
JOHNSON • Multivariate Statistical Simulation
JOHNSON and BALAKRISHNAN • Advances in the Theory and Practice of Statistics: A Volume in Honor of Samuel Kotz
JOHNSON, KEMP, and KOTZ • Univariate Discrete Distributions, Third Edition
JOHNSON and KOTZ (editors) • Leading Personalities in Statistical Sciences: From the Seventeenth Century to the Present
JOHNSON, KOTZ, and BALAKRISHNAN • Continuous Univariate Distributions, Volume 1, Second Edition
JOHNSON, KOTZ, and BALAKRISHNAN • Continuous Univariate Distributions, Volume 2, Second Edition
JOHNSON, KOTZ, and BALAKRISHNAN • Discrete Multivariate Distributions
JUDGE, GRIFFITHS, HILL, LÜTKEPOHL, and LEE • The Theory and Practice of Econometrics, Second Edition
JUREK and MASON • Operator-Limit Distributions in Probability Theory
KADANE • Bayesian Methods and Ethics in a Clinical Trial Design
KADANE and SCHUM • A Probabilistic Analysis of the Sacco and Vanzetti Evidence
KALBFLEISCH and PRENTICE • The Statistical Analysis of Failure Time Data, Second Edition
KARIYA and KURATA • Generalized Least Squares
KASS and VOS • Geometrical Foundations of Asymptotic Inference
† KAUFMAN and ROUSSEEUW • Finding Groups in Data: An Introduction to Cluster Analysis
KEDEM and FOKIANOS • Regression Models for Time Series Analysis
KENDALL, BARDEN, CARNE, and LE • Shape and Shape Theory
KHURI • Advanced Calculus with Applications in Statistics, Second Edition
KHURI, MATHEW, and SINHA • Statistical Tests for Mixed Linear Models
* KISH • Statistical Design for Research
KLEIBER and KOTZ • Statistical Size Distributions in Economics and Actuarial Sciences
KLEMELÄ • Smoothing of Multivariate Data: Density Estimation and Visualization
KLUGMAN, PANJER, and WILLMOT • Loss Models: From Data to Decisions, Fourth Edition
KLUGMAN, PANJER, and WILLMOT • Student Solutions Manual to Accompany Loss Models: From Data to Decisions, Fourth Edition
KOSKI and NOBLE • Bayesian Networks: An Introduction
KOTZ, BALAKRISHNAN, and JOHNSON • Continuous Multivariate Distributions, Volume 1, Second Edition
KOTZ and JOHNSON (editors) • Encyclopedia of Statistical Sciences: Volumes 1 to 9 with Index
KOTZ and JOHNSON (editors) • Encyclopedia of Statistical Sciences: Supplement Volume
KOTZ, READ, and BANKS (editors) • Encyclopedia of Statistical Sciences: Update Volume 1
KOTZ, READ, and BANKS (editors) • Encyclopedia of Statistical Sciences: Update Volume 2
KOWALSKI and TU • Modern Applied U-Statistics
KRISHNAMOORTHY and MATHEW • Statistical Tolerance Regions: Theory, Applications, and Computation
KROESE, TAIMRE, and BOTEV • Handbook of Monte Carlo Methods
KROONENBERG • Applied Multiway Data Analysis
KULINSKAYA, MORGENTHALER, and STAUDTE • Meta Analysis: A Guide to Calibrating and Combining Statistical Evidence
KULKARNI and HARMAN • An Elementary Introduction to Statistical Learning Theory
KUROWICKA and COOKE • Uncertainty Analysis with High Dimensional Dependence Modelling
KVAM and VIDAKOVIC • Nonparametric Statistics with Applications to Science and Engineering
LACHIN • Biostatistical Methods: The Assessment of Relative Risks, Second Edition
LAD • Operational Subjective Statistical Methods: A Mathematical, Philosophical, and Historical Introduction
LAMPERTI • Probability: A Survey of the Mathematical Theory, Second Edition
LAWLESS • Statistical Models and Methods for Lifetime Data, Second Edition
LAWSON • Statistical Methods in Spatial Epidemiology, Second Edition
LE • Applied Categorical Data Analysis, Second Edition
LE • Applied Survival Analysis
LEE • Structural Equation Modeling: A Bayesian Approach
LEE and WANG • Statistical Methods for Survival Data Analysis, Third Edition
LEPAGE and BILLARD • Exploring the Limits of Bootstrap
LESSLER and KALSBEEK • Nonsampling Errors in Surveys
LEYLAND and GOLDSTEIN (editors) • Multilevel Modelling of Health Statistics
LIAO • Statistical Group Comparison
LIN • Introductory Stochastic Analysis for Finance and Insurance
LITTLE and RUBIN • Statistical Analysis with Missing Data, Second Edition
LLOYD • The Statistical Analysis of Categorical Data
LOWEN and TEICH • Fractal-Based Point Processes
MAGNUS and NEUDECKER • Matrix Differential Calculus with Applications in Statistics and Econometrics, Revised Edition
MALLER and ZHOU • Survival Analysis with Long Term Survivors
MARCHETTE • Random Graphs for Statistical Pattern Recognition
MARDIA and JUPP • Directional Statistics
MARKOVICH • Nonparametric Analysis of Univariate Heavy-Tailed Data: Research and Practice
MARONNA, MARTIN and YOHAI • Robust Statistics: Theory and Methods
MASON, GUNST, and HESS • Statistical Design and Analysis of Experiments with Applications to Engineering and Science, Second Edition
McCOOL • Using the Weibull Distribution: Reliability, Modeling, and Inference
McCULLOCH, SEARLE, and NEUHAUS • Generalized, Linear, and Mixed Models, Second Edition
McFADDEN • Management of Data in Clinical Trials, Second Edition
* McLACHLAN • Discriminant Analysis and Statistical Pattern Recognition
McLACHLAN, DO, and AMBROISE • Analyzing Microarray Gene Expression Data
McLACHLAN and KRISHNAN • The EM Algorithm and Extensions, Second Edition
McLACHLAN and PEEL • Finite Mixture Models
McNEIL • Epidemiological Research Methods
MEEKER and ESCOBAR • Statistical Methods for Reliability Data
MEERSCHAERT and SCHEFFLER • Limit Distributions for Sums of Independent Random Vectors: Heavy Tails in Theory and Practice
MENGERSEN, ROBERT, and TITTERINGTON • Mixtures: Estimation and Applications
MICKEY, DUNN, and CLARK • Applied Statistics: Analysis of Variance and Regression, Third Edition
* MILLER • Survival Analysis, Second Edition
MONTGOMERY, JENNINGS, and KULAHCI • Introduction to Time Series Analysis and Forecasting
MONTGOMERY, PECK, and VINING • Introduction to Linear Regression Analysis, Fifth Edition
MORGENTHALER and TUKEY • Configural Polysampling: A Route to Practical Robustness
MUIRHEAD • Aspects of Multivariate Statistical Theory
MULLER and STOYAN • Comparison Methods for Stochastic Models and Risks
MURTHY, XIE, and JIANG • Weibull Models
MYERS, MONTGOMERY, and ANDERSON-COOK • Response Surface Methodology: Process and Product Optimization Using Designed Experiments, Third Edition
MYERS, MONTGOMERY, VINING, and ROBINSON • Generalized Linear Models. With Applications in Engineering and the Sciences, Second Edition
NATVIG • Multistate Systems Reliability Theory With Applications
† NELSON • Accelerated Testing, Statistical Models, Test Plans, and Data Analyses
† NELSON • Applied Life Data Analysis
NEWMAN • Biostatistical Methods in Epidemiology
NG, TIAN, and TANG • Dirichlet and Related Distributions: Theory, Methods and Applications
OKABE, BOOTS, SUGIHARA, and CHIU • Spatial Tessellations: Concepts and Applications of Voronoi Diagrams, Second Edition
OLIVER and SMITH • Influence Diagrams, Belief Nets and Decision Analysis
PALTA • Quantitative Methods in Population Health: Extensions of Ordinary Regressions
PANJER • Operational Risk: Modeling and Analytics
PANKRATZ • Forecasting with Dynamic Regression Models
PANKRATZ • Forecasting with Univariate Box-Jenkins Models: Concepts and Cases
PARDOUX • Markov Processes and Applications: Algorithms, Networks, Genome and Finance
PARMIGIANI and INOUE • Decision Theory: Principles and Approaches
* PARZEN • Modern Probability Theory and Its Applications
PEÑA, TIAO, and TSAY • A Course in Time Series Analysis
PESARIN and SALMASO • Permutation Tests for Complex Data: Applications and Software
PIANTADOSI • Clinical Trials: A Methodologic Perspective, Second Edition
POURAHMADI • Foundations of Time Series Analysis and Prediction Theory
POWELL • Approximate Dynamic Programming: Solving the Curses of Dimensionality, Second Edition
POWELL and RYZHOV • Optimal Learning
PRESS • Subjective and Objective Bayesian Statistics, Second Edition
PRESS and TANUR • The Subjectivity of Scientists and the Bayesian Approach
PURI, VILAPLANA, and WERTZ • New Perspectives in Theoretical and Applied Statistics
† PUTERMAN • Markov Decision Processes: Discrete Stochastic Dynamic Programming
QIU • Image Processing and Jump Regression Analysis
* RAO • Linear Statistical Inference and Its Applications, Second Edition
RAO • Statistical Inference for Fractional Diffusion Processes
RAUSAND and HØYLAND • System Reliability Theory: Models, Statistical Methods, and Applications, Second Edition
RAYNER, THAS, and BEST • Smooth Tests of Goodness of Fit: Using R, Second Edition
RENCHER and SCHAALJE • Linear Models in Statistics, Second Edition
RENCHER and CHRISTENSEN • Methods of Multivariate Analysis, Third Edition
RENCHER • Multivariate Statistical Inference with Applications
RIGDON and BASU • Statistical Methods for the Reliability of Repairable Systems
* RIPLEY • Spatial Statistics
* RIPLEY • Stochastic Simulation
ROHATGI and SALEH • An Introduction to Probability and Statistics, Second Edition
ROLSKI, SCHMIDLI, SCHMIDT, and TEUGELS • Stochastic Processes for Insurance and Finance
ROSENBERGER and LACHIN • Randomization in Clinical Trials: Theory and Practice
ROSSI, ALLENBY, and MCCULLOCH • Bayesian Statistics and Marketing
† ROUSSEEUW and LEROY • Robust Regression and Outlier Detection
ROYSTON and SAUERBREI • Multivariate Model Building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modeling Continuous Variables
* RUBIN • Multiple Imputation for Nonresponse in Surveys
RUBINSTEIN and KROESE • Simulation and the Monte Carlo Method, Second Edition
RUBINSTEIN and MELAMED • Modern Simulation and Modeling
RYAN • Modern Engineering Statistics
RYAN • Modern Experimental Design
RYAN • Modern Regression Methods, Second Edition
RYAN • Statistical Methods for Quality Improvement, Third Edition
SALEH • Theory of Preliminary Test and Stein-Type Estimation with Applications
SALTELLI, CHAN, and SCOTT (editors) • Sensitivity Analysis
SCHERER • Batch Effects and Noise in Microarray Experiments: Sources and Solutions
* SCHEFFE • The Analysis of Variance
SCHIMEK • Smoothing and Regression: Approaches, Computation, and Application
SCHOTT • Matrix Analysis for Statistics, Second Edition
SCHOUTENS • Levy Processes in Finance: Pricing Financial Derivatives
SCOTT • Multivariate Density Estimation: Theory, Practice, and Visualization
* SEARLE • Linear Models
† SEARLE • Linear Models for Unbalanced Data
† SEARLE • Matrix Algebra Useful for Statistics
† SEARLE, CASELLA, and McCULLOCH • Variance Components
SEARLE and WILLETT • Matrix Algebra for Applied Economics
SEBER • A Matrix Handbook For Statisticians
† SEBER • Multivariate Observations
SEBER and LEE • Linear Regression Analysis, Second Edition
† SEBER and WILD • Nonlinear Regression
SENNOTT • Stochastic Dynamic Programming and the Control of Queueing Systems
* SERFLING • Approximation Theorems of Mathematical Statistics
SHAFER and VOVK • Probability and Finance: It’s Only a Game!
SHERMAN • Spatial Statistics and Spatio-Temporal Data: Covariance Functions and Directional Properties
SILVAPULLE and SEN • Constrained Statistical Inference: Inequality, Order, and Shape Restrictions
SINGPURWALLA • Reliability and Risk: A Bayesian Perspective
SMALL and MCLEISH • Hilbert Space Methods in Probability and Statistical Inference
SRIVASTAVA • Methods of Multivariate Statistics
STAPLETON • Linear Statistical Models, Second Edition
STAPLETON • Models for Probability and Statistical Inference: Theory and Applications
STAUDTE and SHEATHER • Robust Estimation and Testing
STOYAN • Counterexamples in Probability, Second Edition
STOYAN, KENDALL, and MECKE • Stochastic Geometry and Its Applications, Second Edition
STOYAN and STOYAN • Fractals, Random Shapes and Point Fields: Methods of Geometrical Statistics
STREET and BURGESS • The Construction of Optimal Stated Choice Experiments: Theory and Methods
STYAN • The Collected Papers of T. W. Anderson: 1943–1985
SUTTON, ABRAMS, JONES, SHELDON, and SONG • Methods for Meta-Analysis in Medical Research
TAKEZAWA • Introduction to Nonparametric Regression
TAMHANE • Statistical Analysis of Designed Experiments: Theory and Applications
TANAKA • Time Series Analysis: Nonstationary and Noninvertible Distribution Theory
THOMPSON • Empirical Model Building: Data, Models, and Reality, Second Edition
THOMPSON • Sampling, Third Edition
THOMPSON • Simulation: A Modeler’s Approach
THOMPSON and SEBER • Adaptive Sampling
THOMPSON, WILLIAMS, and FINDLAY • Models for Investors in Real World Markets
TIERNEY • LISP-STAT: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics
TSAY • Analysis of Financial Time Series, Third Edition
TSAY • An Introduction to Analysis of Financial Data with R
UPTON and FINGLETON • Spatial Data Analysis by Example, Volume II: Categorical and Directional Data
† VAN BELLE • Statistical Rules of Thumb, Second Edition
VAN BELLE, FISHER, HEAGERTY, and LUMLEY • Biostatistics: A Methodology for the Health Sciences, Second Edition
VESTRUP • The Theory of Measures and Integration
VIDAKOVIC • Statistical Modeling by Wavelets
VIERTL • Statistical Methods for Fuzzy Data
VINOD and REAGLE • Preparing for the Worst: Incorporating Downside Risk in Stock Market Investments
WALLER and GOTWAY • Applied Spatial Statistics for Public Health Data
WEISBERG • Applied Linear Regression, Third Edition
WEISBERG • Bias and Causation: Models and Judgment for Valid Comparisons
WELSH • Aspects of Statistical Inference
WESTFALL and YOUNG • Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment
* WHITTAKER • Graphical Models in Applied Multivariate Statistics
WINKER • Optimization Heuristics in Economics: Applications of Threshold Accepting
WOODWORTH • Biostatistics: A Bayesian Introduction
WOOLSON and CLARKE • Statistical Methods for the Analysis of Biomedical Data, Second Edition
WU and HAMADA • Experiments: Planning, Analysis, and Parameter Design Optimization, Second Edition
WU and ZHANG • Nonparametric Regression Methods for Longitudinal Data Analysis
YIN • Clinical Trial Design: Bayesian and Frequentist Adaptive Methods
YOUNG, VALERO-MORA, and FRIENDLY • Visual Statistics: Seeing Data with Dynamic Interactive Graphics
ZACKS • Stage-Wise Adaptive Designs
* ZELLNER • An Introduction to Bayesian Inference in Econometrics
ZELTERMAN • Discrete Distributions—Applications in the Health Sciences
ZHOU, OBUCHOWSKI, and MCCLISH • Statistical Methods in Diagnostic Medicine, Second Edition
*Now available in a lower priced paperback edition in the Wiley Classics Library.
†Now available in a lower priced paperback edition in the Wiley–Interscience Paperback Series.
Copyright 2013 by John Wiley & Sons. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax 978-750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, 201-748-6011, fax 201-748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at 800-762-2974, outside the United States at 317-572-3993 or fax 317-572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data
Agresti, Alan. Categorical data analysis / Alan Agresti. – 3rd ed. p. cm. – (Wiley series in probability and statistics; 792) Includes bibliographical references and index. ISBN 978-0-470-46363-5 (hardback) 1. Multivariate analysis. I. Title. QA278.A353 2013 519.5′35–dc23 2012009792
To Jacki
Preface
The explosion in the development of methods for analyzing categorical data that began in the 1960s has continued apace in recent years. This book provides an overview of these methods, as well as older, now standard, methods. It gives special emphasis to generalized linear modeling techniques, which extend linear model methods for continuous variables, and their extensions for multivariate responses.
Chapters 1–10 present the core methods for categorical response variables. Chapters 1–3 cover distributions for categorical responses and traditional methods for two-way contingency tables. Chapters 4–8 introduce logistic regression and related models such as the probit model for binary and multicategory response variables. Chapters 9 and 10 cover loglinear models for contingency tables.
In the past quarter century, a major area of new research has been the development of methods for repeated measurement and other forms of clustered categorical data. Chapters 11–14 present these methods, including marginal models and generalized linear mixed models with random effects. Chapter 15 introduces non-model-based methods for classification and clustering. Chapter 16 presents theoretical foundations as well as alternatives to the maximum likelihood paradigm that this text adopts. Chapter 17 is devoted to a historical overview of the development of the methods. It examines contributions of noted statisticians, such as Pearson and Fisher, whose pioneering efforts—and sometimes vocal debates—broke the ground for this evolution.
Appendices illustrate the use of statistical software for analyzing categorical data. The website for the text, www.stat.ufl.edu/~aa/cda/cda.html, contains an appendix with detailed examples of the use of software (especially R, SAS, and Stata) for performing the analyses in this book, solutions to many of the exercises, extra exercises, and corrections.
Given the explosion of research in the past 50 years on categorical data methods, it is an increasing challenge to write a comprehensive book covering all the commonly used methods. The second edition of this book already exceeded 700 pages. In including much new material without letting the book grow much, I have necessarily had to make compromises in depth and use relatively simple examples. I try to present a broad overview, while presenting bibliographic notes with many references in which the reader can find more details. In attempting to make the book relatively comprehensive while presenting substantive new material, every chapter of the first two editions has been extensively rewritten. The major changes are:
A new Chapter 7 presents alternative methods for binary response data, including some regularization methods that are becoming popular in this age of massive data sets with enormous numbers of variables.
A new Chapter 15 introduces non-model-based methods of classification, such as linear discriminant analysis and classification trees, and cluster analysis.
Many chapters now include a section describing the Bayesian approach for the methods of that chapter. We also have added material (e.g., Sections 6.5 and 7.4) about ways that frequentist methods can deal with awkward situations such as infinite maximum likelihood estimates.
The use of various software for categorical data methods is discussed at a much expanded website for the text, www.stat.ufl.edu/~aa/cda/cda.html. Examples are shown of the use of R, SAS, and Stata for most of the examples in the text, and there is discussion also about SPSS, StatXact, and other software. That website also contains many of the text’s data sets, some of which have only excerpts shown in the text itself, as well as solutions for many exercises and corrections of errors found in early printings of the book. I recommend that you refer to this appendix (or specialized software manuals) while reading the text, perhaps printing the pages about the software you prefer, as an aid to implementing the methods. This material was placed at the website partly because the text is already so long without it and also because it is then easier to keep the presentation up-to-date.
In this text, I interpret categorical data analysis to refer to methods for categorical response variables. For most methods, explanatory variables can be categorical or quantitative, as in ordinary regression. Thus, the focus is intended to be more general than contingency table analysis, although for simplicity of data presentation, most examples use contingency tables. These examples are simplistic, but should help you focus on understanding the methods themselves and make it easier for you to replicate results with your favorite software.
Other special features of the text include:
More than 100 analyses of data sets.
About 600 exercises, some directed toward theory and methods and some toward applications and data analysis.
Notes at the end of each chapter that provide references for recent research and many topics not covered in the text, linked to a bibliography of more than 1200 sources.
I intend this book to be accessible to the diverse mix of students who take graduate-level courses in categorical data analysis. But I have also written it with practicing statisticians and biostatisticians in mind. I hope it enables them to catch up with recent advances and learn about methods that sometimes receive inadequate attention in the traditional statistics curriculum.
The development of new methods has influenced—and been influenced by—the increasing availability of data sets with categorical responses in the social, behavioral, and biomedical sciences, as well as in public health, genetics, ecology, education, marketing and the financial industry, and industrial quality control. And so, although this book is directed mainly to statisticians and biostatisticians, I also aim for it to be helpful to methodologists in these fields.
Readers should possess a background that includes regression and analysis of variance models, as well as maximum likelihood methods of statistical theory. Those not having much theory background should be able to follow most methodological discussions. Those with mainly applied interests can skip most of Chapter 4 on the theory of generalized linear models and proceed to other chapters. However, the book has a distinctly higher technical level and is more thorough and complete than my lower-level text, An Introduction to Categorical Data Analysis, Second Edition (Wiley, 2007).
Today, because of the ubiquity of categorical data in applications, most statistics and biostatistics departments offer courses on categorical data analysis or on generalized linear models with strong emphasis on methods for discrete data. This book can be used as a text for such courses. The material in Chapters 1–6 forms the heart of most courses. There is too much material in this book for a single course, but a one-term course can be based on the following outline:
Basic contingency table analysis, covering Chapters 1–3, perhaps skipping some tangential sections such as 1.5.7, 1.6, 2.4, 3.4–3.7.
Logistic regression and related methods for binary data, covering Chapters 4–6, perhaps skipping some tangential sections such as 4.4–4.7 and 6.4–6.6.
Multinomial response models, covering at least Sections 8.1 and 8.2.
Matched pairs and clustered data, covering at least Sections 11.1–11.2.
Courses with biostatistical orientation may want to include bits from Chapters 12 and 13 on marginal and random effects models. Courses with social science emphasis may want to include some topics on loglinear modeling from Chapters 9 and 10. Some courses may want to select specialized topics from Chapter 7, such as probit modeling, conditional logistic regression, Bayesian binary data modeling, smoothing, and issues in the analysis of high-dimensional data.
I thank those who commented on parts of the manuscript or provided help of some type. Special thanks to Anna Gottard, David Hoaglin, Maria Kateri, Bernhard Klingenberg, Keli Liu, and Euijung Ryu, who gave insightful comments on some chapters and made many helpful suggestions, and Brett Presnell for his advice and resources about R software and his comments about some of the material. Thanks to people who made suggestions about new material for this edition, including Jonathan Bischof, James Booth, Brian Caffo, Tianxi Cai, Brent Coull, Nicholas Cox, Ralitza Gueorguieva, Debashis Ghosh, John Henretta, David Hitchcock, Galin Jones, Robert Kushler, Xihong Lin, Jun Liu, Gianfranco Lovison, Giovanni Marchetti, David Olive, Art Owen, Alessandra Petrucci, Michael Radelet, Gerard Scallan, Maura Stokes, Anestis Touloumis, and Ming Yang. Thanks to those who commented on aspects of the second edition, including pointing out errors or typos, such as Pat Altham, Roberto Bertolusso, Nicholas Cox, David Firth, Rene Gonin, David Hoaglin, Harry Khamis, Bernhard Klingenberg, Robert Kushler, Gianfranco Lovison, Theo Nijsse, Richard Reyment, Misha Salganik, William Santo, Laura Thompson, Michael Vock, and Zhongming Yang. Thanks also to Laura Thompson for preparing her very helpful manual on using R and S-Plus for examples in the second edition. Thanks to the many who reviewed material or suggested examples for the first two editions, mentioned in the Prefaces of those editions. Thanks also to Wiley Executive Editor Steve Quigley and Associate Editor Jacqueline Palmieri for their steadfast encouragement and facilitation of this project. Finally, thanks to my wife Jacki Levine for continuing support of all kinds.
ALAN AGRESTI
Gainesville, Florida and Brookline, Massachusetts
February 2012
From helping to assess the value of new medical treatments to evaluating the factors that affect our opinions and behaviors, analysts today are finding myriad uses for categorical data methods. In this book we introduce these methods and the theory behind them.
Statistical methods for categorical responses were late in gaining the level of sophistication achieved early in the twentieth century by methods for continuous responses. Despite influential work around 1900 by the British statistician Karl Pearson, relatively little development of models for categorical responses occurred until the 1960s. In this book we describe the early fundamental work that still has importance today but place primary emphasis on more recent modeling approaches.
A categorical variable has a measurement scale consisting of a set of categories. For instance, political philosophy is often measured as liberal, moderate, or conservative. Diagnoses regarding breast cancer based on a mammogram use the categories normal, benign, probably benign, suspicious, and malignant.
The development of methods for categorical variables was stimulated by the need to analyze data generated in research studies in both the social and biomedical sciences. Categorical scales are pervasive in the social sciences for measuring attitudes and opinions. Categorical scales in biomedical sciences measure outcomes such as whether a medical treatment is successful.
Categorical data are by no means restricted to the social and biomedical sciences. They frequently occur in the behavioral sciences (e.g., type of mental illness, with the categories schizophrenia, depression, neurosis), epidemiology and public health (e.g., contraceptive method at last sexual intercourse, with the categories none, condom, pill, IUD, other), genetics (type of allele inherited by an offspring), botany and zoology (e.g., whether or not a particular organism is observed in a sampled quadrat), education (e.g., whether a student response to an exam question is correct or incorrect), and marketing (e.g., consumer preference among the three leading brands of a product). They even occur in highly quantitative fields such as engineering sciences and industrial quality control. Examples are the classification of items according to whether they conform to certain standards, and subjective evaluation of some characteristic: how soft to the touch a certain fabric is, how good a particular food product tastes, or how easy a worker finds it to perform a certain task.
Categorical variables are of many types. In this section we provide ways of classifying them.
Statistical analyses distinguish between response (or dependent) variables and explanatory (or independent) variables. This book focuses on methods for categorical response variables. As in ordinary regression modeling, explanatory variables can be any type. For instance, a study might analyze how opinion about whether same-sex marriages should be legal (yes or no) changes according to values of explanatory variables, such as religious affiliation, political ideology, number of years of education, annual income, age, gender, and race.
Many categorical variables have only two categories. Such variables, for which the two categories are often given the generic labels “success” and “failure,” are called binary variables. A major topic of this book is the modeling of binary response variables.
When a categorical variable has more than two categories, we distinguish between two types of categorical scales. Variables having categories without a natural ordering are said to be measured on a nominal scale and are called nominal variables. Examples are mode of transportation to get to work (automobile, bicycle, bus, subway, walk), favorite type of music (classical, country, folk, jazz, rock), and choice of residence (apartment, condominium, house, other). For nominal variables, the order of listing the categories is irrelevant to the statistical analysis.
Many categorical variables do have ordered categories. Such variables are said to be measured on an ordinal scale and are called ordinal variables. Examples are social class (upper, middle, lower), political philosophy (very liberal, slightly liberal, moderate, slightly conservative, very conservative), patient condition (good, fair, serious, critical), and rating of a movie for Netflix (1 to 5 stars, representing hated it, didn’t like it, liked it, really liked it, loved it). For ordinal variables, distances between categories are unknown. Although a person categorized as very liberal is more liberal than a person categorized as slightly liberal, no numerical value describes how much more liberal that person is.
An interval variable is one that does have numerical distances between any two values. For example, systolic blood pressure level, length of prison term, and annual income are interval variables. For most such variables, it is also possible to compare two values by their ratio, in which case the variable is also called a ratio variable.
The way that a variable is measured determines its classification. For example, “education” is only nominal when measured as (public school, private school, home schooling); it is ordinal when measured by highest degree attained, using the categories (none, high school, bachelor’s, master’s, doctorate); it is interval when measured by number of years of education completed, using the integers 0, 1, 2, 3, ….
A variable’s measurement scale determines which statistical methods are appropriate. It is usually best to apply methods appropriate for the actual scale. In the measurement hierarchy, interval variables are highest, ordinal variables are next, and nominal variables are lowest. Statistical methods for variables of one type can also be used with variables at higher levels but not at lower levels. For instance, statistical methods for nominal variables can be used with ordinal variables by ignoring the ordering of categories. Methods for ordinal variables cannot, however, be used with nominal variables, since their categories have no meaningful ordering. The distinction between ordered and unordered categories is not important for binary variables, because ordinal methods and nominal methods then typically reduce to equivalent methods.
In this book, we present methods for the analysis of binary, nominal, and ordinal variables. The methods also apply to interval variables having a small number of distinct values (e.g., number of times married, number of distinct side effects experienced in taking some drug) or for which the values are grouped into ordered categories (e.g., education measured as ≤12 years, > 12 but < 16 years, ≥16 years).
Variables are classified as discrete or continuous, according to whether the number of values they can take is countable. Actual measurement of all variables occurs in a discrete manner, due to precision limitations in measuring instruments. The discrete–continuous classification, in practice, distinguishes between variables that take few values and variables that take lots of values. For instance, statisticians often treat discrete interval variables having a large number of values (such as test scores) as continuous, using them in methods for continuous responses.
This book deals with certain types of discretely measured responses: (1) binary variables, (2) nominal variables, (3) ordinal variables, (4) discrete interval variables having relatively few values, and (5) continuous variables grouped into a small number of categories.
Nominal variables are qualitative: distinct categories differ in quality, not in quantity. Interval variables are quantitative: distinct levels have differing amounts of the characteristic of interest. The position of ordinal variables in the qualitative–quantitative classification is fuzzy. Analysts often treat them as qualitative, using methods for nominal variables. But in many respects, ordinal variables more closely resemble interval variables than they resemble nominal variables. They possess important quantitative features: Each category has a greater or smaller magnitude of the characteristic than another category; and although not possible to measure, an underlying continuous variable is often present. The political ideology classification (very liberal, slightly liberal, moderate, slightly conservative, very conservative) crudely measures an inherently continuous characteristic.
Analysts often utilize the quantitative nature of ordinal variables by assigning numerical scores to the categories or assuming an underlying continuous distribution. This requires good judgment and guidance from researchers who use the scale, but it provides benefits in the variety of methods available for data analysis.
The models for categorical response variables discussed in this book resemble regression models for continuous response variables; however, they assume binomial or multinomial response distributions instead of normality. One type of model receives special attention—logistic regression. Ordinary logistic regression models apply with binary responses and assume a binomial distribution. Generalizations of logistic regression apply with multicategory responses and assume a multinomial distribution.
The book has four main units. In the first, Chapters 1 through 3, we summarize descriptive and inferential methods for univariate and bivariate categorical data. These chapters cover discrete distributions, methods of inference, and measures of association for contingency tables. They summarize the non-model-based methods developed prior to about 1960.
In the second and primary unit, Chapters 4 through 10, we introduce models for categorical responses. In Chapter 4 we describe a class of generalized linear models having models of this text as special cases. Chapters 5 and 6 cover the most important model for binary responses, logistic regression. Chapter 7 presents alternative methods for binary data, including the probit, Bayesian fitting, and smoothing methods. In Chapter 8 we present generalizations of the logistic regression model for nominal and ordinal multicategory response variables. In Chapters 9 and 10 we introduce the modeling of multivariate categorical response data, in terms of association and interaction patterns among the variables. The models, called loglinear models, apply to counts in the table that cross-classifies those responses.
In the third unit, Chapters 11 through 14, we discuss models for handling repeated measurement and other forms of clustered data. In Chapter 11 we present models for a categorical response with matched pairs; these apply, for instance, with a categorical response measured for the same subjects at two times. Chapter 12 covers models for more general types of repeated categorical data, such as longitudinal data from several times with explanatory variables. In Chapter 13 we present a broad class of models, generalized linear mixed models, that use random effects to account for dependence with such data. In Chapter 14 further extensions of the models from Chapters 11 through 13 are described, unified by treating the response as having a mixture distribution of some type.
The fourth and final unit has a different nature than the others. In Chapter 15 we consider non-model-based classification and clustering methods. In Chapter 16 we summarize large-sample and small-sample theory for categorical data models. This theory is the basis for behavior of model parameter estimators and goodness-of-fit statistics. Chapter 17 presents a historical overview of the development of categorical data methods.
Maximum likelihood methods receive primary attention throughout the book. Many chapters, however, contain a section presenting corresponding Bayesian methods.
In Appendix A we review software that can perform the analyses in this book. The website www.stat.ufl.edu/~aa/cda/cda.html for this book contains an appendix that gives more information about using R, SAS, Stata, and other software, with sample programs for text examples. In addition, that site has complete data sets for many text examples and exercises, solutions to some exercises, extra exercises, corrections, and links to other useful sites. For instance, a manual prepared by Dr. Laura Thompson provides examples of how to use R and S-Plus for all examples in the second edition of this text, many of which (or very similar ones) are also in this edition.
In the rest of this chapter, we provide background material. In Section 1.2 we review the key distributions for categorical data: the binomial and multinomial, as well as another that is important for discrete data, the Poisson. In Section 1.3 we review the primary mechanisms for statistical inference using maximum likelihood. In Sections 1.4 and 1.5 we illustrate these by presenting significance tests and confidence intervals for binomial and multinomial parameters. In Section 1.6 we introduce Bayesian inference for these parameters.
Inferential data analyses require assumptions about the random mechanism that generated the data. For regression models with continuous responses, the normal distribution plays the central role. In this section we review the three key distributions for categorical responses: binomial, multinomial, and Poisson.
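The probability mass function for the possible outcomes y for Y, the number of successes in n independent and identical binary trials with success probability π, is

$$p(y) = \frac{n!}{y!\,(n-y)!}\,\pi^{y}(1-\pi)^{n-y}, \qquad y = 0, 1, 2, \ldots, n. \tag{1.1}$$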
There is no guarantee that successive binary observations are independent or identical. Thus, occasionally, we will utilize other distributions. One such case is sampling binary outcomes without replacement from a finite population, such as observations on whether a homework assignment was completed for 10 students sampled from a class of size 20. The hypergeometric distribution, studied in Section 3.5.1, is then relevant. In Section 1.2.4 we discuss another case that violates the binomial assumptions.
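As a rough illustration of the contrast between sampling with independent, identical trials and sampling without replacement, the following Python sketch compares the binomial and hypergeometric probability mass functions for the homework example. The class composition (12 of the 20 students having completed the assignment) is a hypothetical value chosen only for illustration, and the function names are not taken from the text.

```python
from math import comb

def binomial_pmf(y, n, pi):
    """P(Y = y) for y successes in n independent, identical trials."""
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

def hypergeometric_pmf(y, n, successes, population):
    """P(Y = y) when n units are drawn without replacement from a
    population that contains `successes` successes."""
    return (comb(successes, y) * comb(population - successes, n - y)
            / comb(population, n))

def mean_and_variance(pmf_values):
    """Mean and variance of a distribution on 0, 1, ..., len(pmf_values) - 1."""
    mean = sum(y * p for y, p in enumerate(pmf_values))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf_values))
    return mean, var

# Hypothetical class: 12 of 20 students completed the homework; sample 10.
N_CLASS, N_DONE, N_SAMPLE = 20, 12, 10
binom = [binomial_pmf(y, N_SAMPLE, N_DONE / N_CLASS)
         for y in range(N_SAMPLE + 1)]
hyper = [hypergeometric_pmf(y, N_SAMPLE, N_DONE, N_CLASS)
         for y in range(N_SAMPLE + 1)]

binom_mean, binom_var = mean_and_variance(binom)
hyper_mean, hyper_var = mean_and_variance(hyper)
print(f"binomial:       mean={binom_mean:.3f}, variance={binom_var:.3f}")
print(f"hypergeometric: mean={hyper_mean:.3f}, variance={hyper_var:.3f}")
```

Under these assumed values the two distributions share the same mean, but the hypergeometric variance is smaller, because sampling half of a small class without replacement leaves less room for the count to vary.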
(1.2)
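For the multinomial distribution,

$$E(n_j) = n\pi_j, \qquad \mathrm{var}(n_j) = n\pi_j(1-\pi_j), \qquad \mathrm{cov}(n_j, n_k) = -n\pi_j\pi_k. \tag{1.3}$$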
We derive the covariance in Section 16.1.4. The marginal distribution of each nj is binomial.
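The standard multinomial moments, E(n_j) = nπ_j, var(n_j) = nπ_j(1 − π_j), and cov(n_j, n_k) = −nπ_jπ_k, can be checked by brute-force enumeration for a small case. The sketch below does this in Python; the number of trials and the category probabilities are arbitrary illustrative values, not taken from the text.

```python
from itertools import product
from math import factorial

def multinomial_pmf(counts, probs):
    """Probability of the count vector `counts` in n = sum(counts) trials."""
    coef = factorial(sum(counts))
    prob = 1.0
    for n_j, pi_j in zip(counts, probs):
        coef //= factorial(n_j)   # each division is exact
        prob *= pi_j ** n_j
    return coef * prob

def exact_moments(n, probs):
    """Means and covariance matrix of the counts, by full enumeration."""
    c = len(probs)
    outcomes = [k for k in product(range(n + 1), repeat=c) if sum(k) == n]
    weights = [multinomial_pmf(k, probs) for k in outcomes]
    means = [sum(w * k[j] for k, w in zip(outcomes, weights))
             for j in range(c)]
    cov = [[sum(w * (k[j] - means[j]) * (k[l] - means[l])
                for k, w in zip(outcomes, weights))
            for l in range(c)]
           for j in range(c)]
    return means, cov

# Illustrative values (assumptions): 5 trials, three categories.
N_TRIALS = 5
PROBS = (0.2, 0.3, 0.5)
means, cov = exact_moments(N_TRIALS, PROBS)
print("means:", [round(m, 4) for m in means])
print("cov(n_1, n_2):", round(cov[0][1], 4))
```

The negative covariances reflect the fixed total: a larger count in one category leaves fewer trials for the others.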
(1.4)
A key feature of the Poisson distribution is that its variance equals its mean. Sample counts vary more when their mean is higher. When the mean number of daily fatal accidents equals 15, greater variability occurs from day to day than when the mean equals 2.
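A minimal numerical check of this property, using the two means mentioned above (2 and 15), might look as follows in Python. The truncation point of 200 is an assumption made for the sketch; the Poisson probability beyond it is negligible for these means.

```python
from math import exp

def poisson_pmf_table(mu, upper):
    """Poisson probabilities P(Y = 0), ..., P(Y = upper), built from the
    recursion P(Y = y) = P(Y = y - 1) * mu / y to avoid large factorials."""
    probs = [exp(-mu)]
    for y in range(1, upper + 1):
        probs.append(probs[-1] * mu / y)
    return probs

def mean_and_variance(probs):
    """Mean and variance of a distribution on 0, 1, ..., len(probs) - 1."""
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

UPPER = 200  # truncation point (assumption); the tail mass is negligible here
results = {mu: mean_and_variance(poisson_pmf_table(mu, UPPER))
           for mu in (2, 15)}
for mu, (mean, var) in results.items():
    print(f"mu={mu}: mean={mean:.4f}, variance={var:.4f}")
```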
In practice, count observations often exhibit variability exceeding that predicted by the binomial or Poisson. This phenomenon is called overdispersion. We assumed above that each person has the same probability each day of dying in a fatal auto accident. More realistically, these probabilities vary from day to day according to the amount of road traffic and weather conditions and vary from person to person according to factors such as the amount of time spent in autos, whether the person wears a seat belt, how much of the driving is at high speeds, gender, and age. Such variation causes fatality counts to display more variation than predicted by the Poisson model.
Assuming a Poisson distribution for a count variable is often too simplistic, because of factors that cause overdispersion. The negative binomial is a related distribution for count data that has a second parameter and permits the variance to exceed the mean. We introduce it in Section 4.3.4.
Analyses assuming binomial (or multinomial) distributions are also sometimes invalid because of overdispersion. This might happen because the true distribution is a mixture of different binomial distributions, with the parameter varying because of unmeasured variables. To illustrate, suppose that an experiment exposes pregnant mice to a toxin and then after a week observes the number of fetuses in each mouse’s litter that show signs of malformation. Let ni denote the number of fetuses in the litter for mouse i. The pregnant mice also vary according to other factors, such as their weight, overall health, and genetic makeup. Extra variation then occurs because of the variability from litter to litter in the probability π of malformation. The distribution of the number of fetuses per litter showing malformations might cluster near 0 and near ni, showing more dispersion than expected for binomial sampling with a single value of π. Overdispersion could also occur when π varies among fetuses in a litter according to some distribution (Exercise 1.17). In Chapters 4, 13, and 14 we introduce methods for data that are overdispersed relative to binomial and Poisson assumptions.
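The litter example can be sketched with a two-component mixture of binomials. The litter size (10) and the two malformation probabilities (0.1 and 0.9, equally likely) are hypothetical values chosen to make the clustering near 0 and near the litter size easy to see; they are not taken from any data set.

```python
from math import comb

def binomial_pmf(y: int, n: int, pi: float) -> float:
    """P(Y = y) for Y ~ bin(n, pi)."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)

def variance(probs):
    """Variance of a distribution on 0, 1, ..., len(probs) - 1."""
    mean = sum(y * p for y, p in enumerate(probs))
    return sum((y - mean) ** 2 * p for y, p in enumerate(probs))

n = 10                                 # hypothetical litter size
components = [(0.5, 0.1), (0.5, 0.9)]  # hypothetical (weight, pi) pairs

# Distribution of malformed fetuses when pi varies from litter to litter.
mixture = [sum(w * binomial_pmf(y, n, pi) for w, pi in components)
           for y in range(n + 1)]

# Binomial distribution with a single pi equal to the average probability.
pi_bar = sum(w * pi for w, pi in components)
single = [binomial_pmf(y, n, pi_bar) for y in range(n + 1)]

for y in range(n + 1):
    print(f"y = {y:2d}: mixture = {mixture[y]:.4f}, single binomial = {single[y]:.4f}")
print(f"variance: mixture = {variance(mixture):.2f}, "
      f"single binomial = {variance(single):.2f}")
```

Both distributions have mean 5, but the mixture puts most of its probability near 0 and near 10, and its variance is several times the binomial value.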
With Poisson sampling the total count n is random rather than fixed. If we assume a Poisson model but condition on n, {Yi} no longer have Poisson distributions, since each Yi cannot exceed n. Given n, {Yi} are also no longer independent, since the value of one affects the possible range for the others.
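For c independent Poisson variates {Yi} with means {μi}, conditioning on their total n = Σi Yi gives
$$P\Bigl(Y_1 = n_1, \ldots, Y_c = n_c \Bigm| \sum_j Y_j = n\Bigr) = \frac{n!}{\prod_i n_i!}\,\prod_i \pi_i^{n_i}, \qquad \pi_i = \frac{\mu_i}{\sum_j \mu_j}. \tag{1.5}$$
This is the multinomial distribution with index n and probabilities {πi}.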
Many categorical data analyses assume a multinomial distribution. Such analyses usually have the same inferential results as those of analyses assuming a Poisson distribution, because of the similarity in the likelihood functions.
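The connection between the two sampling models can be verified numerically: for independent Poisson counts, the conditional probability of the counts given their total equals a multinomial probability with πi = μi/Σj μj. In the sketch below the Poisson means and the counts are hypothetical values.

```python
from math import exp, factorial

def poisson_pmf(y: int, mu: float) -> float:
    """P(Y = y) for a Poisson variate with mean mu."""
    return exp(-mu) * mu**y / factorial(y)

def multinomial_pmf(counts, probs):
    """Probability of the category counts under a multinomial distribution."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # exact integer multinomial coefficient
    p = float(coef)
    for c, pi in zip(counts, probs):
        p *= pi**c
    return p

mus = (2.0, 3.0, 5.0)  # hypothetical Poisson means
counts = (1, 2, 3)     # hypothetical observed counts
n = sum(counts)

# Joint probability of the counts under independent Poisson sampling.
joint = 1.0
for y, mu in zip(counts, mus):
    joint *= poisson_pmf(y, mu)

# The total of independent Poisson variates is Poisson with the summed mean.
total_prob = poisson_pmf(n, sum(mus))
conditional = joint / total_prob

probs = tuple(mu / sum(mus) for mu in mus)
multinomial = multinomial_pmf(counts, probs)

print(f"P(counts | total = {n}) under Poisson sampling = {conditional:.6f}")
print(f"multinomial probability                       = {multinomial:.6f}")
```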
Another distribution of fundamental importance for categorical data is the chi-squared, not as a distribution for the data but rather as a sampling distribution for many statistics. Because of its importance, we summarize here a few of its properties.
The chi-squared distribution with degrees of freedom denoted by df has mean df, variance 2(df), and skewness $\sqrt{8/df}$. It converges (slowly) to normality as df increases, the approximation being reasonably good when df is at least about 50.
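These properties can be checked from the raw moments of the chi-squared distribution, E(X^k) = 2^k Γ(df/2 + k)/Γ(df/2), which follow from its being a gamma distribution with shape df/2 and scale 2. The sketch below is one way to do this with the standard library; it compares the computed skewness with the closed form √(8/df).

```python
from math import exp, lgamma, log, sqrt

def raw_moment(k: int, df: float) -> float:
    """E[X^k] for X ~ chi-squared(df): 2^k * Gamma(df/2 + k) / Gamma(df/2)."""
    return exp(k * log(2.0) + lgamma(df / 2 + k) - lgamma(df / 2))

def chi2_summary(df: float):
    """Mean, variance, and skewness computed from the raw moments."""
    m1, m2, m3 = (raw_moment(k, df) for k in (1, 2, 3))
    var = m2 - m1**2
    third_central = m3 - 3 * m1 * m2 + 2 * m1**3
    return m1, var, third_central / var**1.5

for df in (1, 5, 50):
    mean, var, skew = chi2_summary(df)
    print(f"df = {df:2d}: mean = {mean:.3f}, variance = {var:.3f}, "
          f"skewness = {skew:.3f}  (sqrt(8/df) = {sqrt(8 / df):.3f})")
```

Even at df = 50 the skewness is still 0.4, which illustrates why the convergence to normality is described as slow.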