Presents new models, methods, and techniques and considers important real-world applications in political science, sociology, economics, marketing, and finance
Emphasizing interdisciplinary coverage, Bayesian Inference in the Social Sciences builds upon the recent growth in Bayesian methodology and examines an array of topics in model formulation, estimation, and applications. The book presents recent and trending developments in a diverse, yet closely integrated, set of research topics within the social sciences and facilitates the transmission of new ideas and methodology across disciplines while maintaining manageability, coherence, and a clear focus.
Bayesian Inference in the Social Sciences features innovative methodology and novel applications in addition to new theoretical developments and modeling approaches, including the formulation and analysis of models with partial observability, sample selection, and incomplete data. Additional areas of inquiry include a Bayesian derivation of empirical likelihood and method of moment estimators, and the analysis of treatment effect models with endogeneity. The book emphasizes practical implementation, reviews and extends estimation algorithms, and examines innovative applications in a multitude of fields. Time series techniques and algorithms are discussed for stochastic volatility, dynamic factor, and time-varying parameter models. Additional features include:
Page count: 524
Publication year: 2014
Contents
Cover
Half Title page
Title page
Copyright page
Preface
Chapter 1: Bayesian Analysis of Dynamic Network Regression with Joint Edge/Vertex Dynamics
1.1 Introduction
1.2 Statistical Models for Social Network Data
1.3 Dynamic Network Logistic Regression with Vertex Dynamics
1.4 Empirical Examples and Simulation Analysis
1.5 Discussion
1.6 Conclusion
Bibliography
Chapter 2: Ethnic Minority Rule and Civil War: A Bayesian Dynamic Multilevel Analysis
2.1 Introduction: Ethnic Minority Rule and Civil War
2.2 EMR: Grievance and Opportunities of Rebellion
2.3 Bayesian GLMM-AR(p) Model
2.4 Variables, Model, and Data
2.5 Empirical Results and Interpretation
2.6 Civil War: Prediction
2.7 Robustness Checking: Alternative Measures of EMR
2.8 Conclusion
Bibliography
Chapter 3: Bayesian Analysis of Treatment Effect Models
3.1 Introduction
3.2 Linear Treatment Response Models Under Normality
3.3 Nonlinear Treatment Response Models
3.4 Other Issues and Extensions: Non-Normality, Model Selection, and Instrument Imperfection
3.5 Illustrative Application
3.6 Conclusion
Bibliography
Chapter 4: Bayesian Analysis of Sample Selection Models
4.1 Introduction
4.2 Univariate Selection Models
4.3 Multivariate Selection Models
4.4 Semiparametric Models
4.5 Conclusion
Bibliography
Chapter 5: Modern Bayesian Factor Analysis
5.1 Introduction
5.2 Normal Linear Factor Analysis
5.3 Factor Stochastic Volatility
5.4 Spatial Factor Analysis
5.5 Additional Developments
5.6 Modern Non-Bayesian Factor Analysis
5.7 Final Remarks
Bibliography
Chapter 6: Estimation of Stochastic Volatility Models with Heavy Tails and Serial Dependence
6.1 Introduction
6.2 Stochastic Volatility Model
6.3 Moving Average Stochastic Volatility Model
6.4 Stochastic Volatility Models with Heavy-Tailed Error Distributions
Bibliography
Chapter 7: From the Great Depression to the Great Recession: A Model-Based Ranking of U.S. Recessions
7.1 Introduction
7.2 Methodology
7.3 Results
7.4 Conclusions
Appendix: Data
Bibliography
Chapter 8: What Difference Fat Tails Make: A Bayesian MCMC Estimation of Empirical Asset Pricing Models
8.1 Introduction
8.2 Methodology
8.3 Data
8.4 Empirical Results
8.5 Concluding Remarks
Bibliography
Chapter 9: Stochastic Search for Price Insensitive Consumers
9.1 Introduction
9.2 Random Utility Models in Marketing Applications
9.3 The Censored Mixing Distribution in Detail
9.4 Reference Price Models with Price Thresholds
9.5 Conclusion
Bibliography
Chapter 10: Hierarchical Modeling of Choice Concentration of U.S. Households
10.1 Introduction
10.2 Data Description
10.3 Measures of Choice Concentration
10.4 Methodology
10.5 Results
10.6 Interpreting θ
10.7 Decomposing the Effects of Time, Number of Decisions and Concentration Preference
10.8 Conclusion
Bibliography
Chapter 11: Approximate Bayesian Inference in Models Defined Through Estimating Equations
11.1 Introduction
11.2 Examples
11.3 Frequentist Estimation
11.4 Bayesian Estimation
11.5 Simulating from the Posteriors
11.6 Asymptotic Theory
11.7 Bayesian Validity
11.8 Application
11.9 Conclusions
Bibliography
Chapter 12: Reacting to Surprising Seemingly Inappropriate Results
12.1 Introduction
12.2 Statistical Framework
12.3 Empirical Illustration
12.4 Discussion
Bibliography
Chapter 13: Identification and MCMC Estimation of Bivariate Probit Models with Partial Observability
13.1 Introduction
13.2 Bivariate Probit Model
13.3 Identification in a Partially Observable Model
13.4 Monte Carlo Simulations
13.5 Bayesian Methodology
13.6 Application
13.7 Conclusion
Appendix
Bibliography
Chapter 14: School Choice Effects in Tokyo Metropolitan Area: A Bayesian Spatial Quantile Regression Approach
14.1 Introduction
14.2 The Model
14.3 Posterior Analysis
14.4 Empirical Analysis
14.5 Conclusions
Bibliography
Index
Bayesian Inference in the Social Sciences
Copyright © 2014 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic format. For information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Bayesian inference in the social sciences / edited by Ivan Jeliazkov, Department of Economics, University of California, Irvine, California, USA, Xin-She Yang, School of Science and Technology, Middlesex University, London, United Kingdom.
pages cm
Includes bibliographical references and index.
ISBN 978-1-118-77121-1 (hardback)
1. Social sciences—Statistical methods. 2. Bayesian statistical decision theory. I. Jeliazkov, Ivan, 1973- II. Yang, Xin-She.
HA29.B38345 2014
519.5′42—dc23
2014011437
PREFACE
No researcher is an island. Indeed, in scientific inquiry, we are generally motivated by, and build upon, the work of a long line of other researchers to produce new techniques and findings that in turn form a basis for further study and scientific discourse. Interdisciplinary research can be particularly fruitful in enhancing this interconnectedness despite the intrinsic difficulty of having to carefully sail the unfamiliar waters of multiple fields.
Bayesian Inference in the Social Sciences was conceived as a manifestation, if any were needed, of the major advances in model building, estimation, and evaluation that have been achieved in the Bayesian paradigm in the past few decades. These advances have been uneven across the various fields, but have nonetheless been widespread and far-reaching. Today, all branches in the social sciences make use of the tools of Bayesian statistics. In part, this is due to the conceptual simplicity and intellectual appeal of the Bayesian approach, but it also has much to do with the ability of Bayesian methods to handle previously intractable problems due to the computational revolution that started in the 1990s. The book provides chapters from leading scholars in political science, sociology, economics, marketing, and finance, and offers clear, self-contained, and in-depth coverage of many central topics in these fields. Examples of novel theoretical developments and important applications are found throughout the book, aiming to appeal to a wide audience, including readers with a taste for conceptual detail, as well as those looking for genuine practical applications.
Although the specific topics and terminology differ, much common ground can be found in the use of novel state-of-the-art computational algorithms, elaborate hierarchical modeling, and careful examination of model uncertainty. We hope that this book will enhance the spread of new ideas and will inspire a new generation of applied social scientists to employ Bayesian methodology, build more realistic and flexible models, and study important social phenomena with rigor and clarity.
We wish to thank and acknowledge the hard work of the contributing authors and referees, and the production team at Wiley for their patience and professionalism.
IVAN JELIAZKOV AND XIN-SHE YANG
July, 2014
ZACK W. ALMQUIST1 AND CARTER T. BUTTS2
1University of Minnesota, USA.
2University of California, Irvine, USA.
Change in network structure and composition has been a topic of extensive theoretical and methodological interest over the last two decades; however, the effects of endogenous group change on interaction dynamics within the context of social networks remain surprisingly understudied. Network dynamics may be viewed as a process of change in the edge structure of a network, in the vertex set on which edges are defined, or in both simultaneously. Recently, Almquist and Butts (2014) introduced a simple family of models for network panel data with vertex dynamics—referred to here as dynamic network logistic regression (DNR)—expanding on a subfamily of temporal exponential-family random graph models (TERGM) (see Robins and Pattison, 2001; Hanneke et al., 2010). Here, we further elaborate this existing approach by exploring Bayesian methods for parameter estimation and model assessment. We propose and implement techniques for Bayesian inference via both maximum a posteriori probability (MAP) and Markov chain Monte Carlo (MCMC) under several different priors, with an emphasis on minimally informative priors that can be employed in a range of empirical settings. These different approaches are compared in terms of model fit and predictive model assessment using several reference data sets.
This chapter is laid out as follows: (1) We introduce the standard (exponential family) framework for modeling static social network data, including both MLE and Bayesian estimation methodology; (2) we introduce network panel data models, discussing both MLE and Bayesian estimation procedures; (3) we introduce a subfamily of the more general panel data models (dynamic network logistic regression)—which allows for vertex dynamics—and expand standard MLE procedures to include Bayesian estimation; (4) through simulation and empirical examples we explore the effect of different prior specifications on both parameter estimation/hypothesis tests and predictive adequacy; (5) finally, we conclude with a summary and discussion of our findings.
The literature on statistical models for network analysis has grown substantially over the last two decades (for a brief review see Butts, 2008b). Further, the literature on dynamic networks has expanded extensively in the last decade – a good overview can be found in Almquist and Butts (2014). In this chapter we use a combination of commonly used statistical and graph theoretic notation. First, we briefly introduce necessary notation and literature for the current state of the art in network panel data models, then we review these panel data models in their general form, including their Bayesian representation. Last, we discuss a specific model family (DNR) which reduces to an easily employed regression-like structure, and extend it to the Bayesian context.
When modeling social or other networks, it is often helpful to represent their distributions via random graphs in discrete exponential family form. Graph distributions expressed in this way are called exponential family random graph models or ERGMs. Holland and Leinhardt (1981) are generally credited with the first explicit use of statistical exponential families to represent random graph models for social networks, with important extensions by Frank and Strauss (1986) and subsequent elaboration by Wasserman and Pattison (1996), Pattison and Wasserman (1999), Pattison and Robins (2002), Snijders et al. (2006), Butts (2007), and others. The power of this framework lies in the extensive body of inferential, computational, and stochastic process theory [borrowed from the general theory of discrete exponential families, see, e.g., Barndorff-Nielsen (1978); Brown (1986)] that can be brought to bear on models specified in its terms.
We begin with the “static” case in which we have a single random graph, G, with support 𝒢. It is convenient to model G via its adjacency matrix Y, with 𝒴 representing the associated support (i.e., the set of adjacency matrices corresponding to all elements of 𝒢). In ERGM form, we express the pmf of Y as follows:

\[ p(Y = y \mid \theta, X) = \frac{\exp\left(\theta^\top S(y, X)\right)}{\kappa(\theta, X)}\, \mathbb{I}_{\mathcal{Y}}(y), \qquad \kappa(\theta, X) = \sum_{y' \in \mathcal{Y}} \exp\left(\theta^\top S(y', X)\right), \tag{1.1} \]

where S : 𝒴, X → ℝ^s is a vector of sufficient statistics, θ ∈ ℝ^s is a vector of natural parameters, X is a collection of covariates, and 𝕀_𝒴 is the indicator function (i.e., 1 if its argument is in the support 𝒴, 0 otherwise).1 If |𝒢| is finite, then the pmf for any G can obviously be written with finite-dimensional S, θ (e.g., by letting S be a vector of indicator variables for elements of 𝒴); this is not necessarily true in the more general case, although a representation with S, θ of countable dimension still exists. In practice, it is generally assumed that S is of low dimension, or at least that the vector of natural parameters can be mapped to a low-dimensional vector of “curved” parameters [see, e.g., Hunter and Handcock (2006)].
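To make the normalizing factor κ in equation (1.1) concrete, the following Python sketch (ours, not the chapter's; the statistic choices and parameter values are illustrative assumptions) computes an exact ERGM pmf by enumerating all 8 undirected graphs on 3 nodes, using edge and triangle counts as sufficient statistics:

```python
import itertools
import math

def ergm_pmf(theta, stats, support):
    """Exact ERGM pmf over an enumerable support: p(y) = exp(theta'S(y)) / kappa.
    Feasible only for tiny graphs, but it makes kappa concrete."""
    weights = [math.exp(sum(t * s for t, s in zip(theta, stats(g)))) for g in support]
    kappa = sum(weights)  # normalizing factor kappa(theta): a sum over the support
    return [w / kappa for w in weights]

def stats(g):
    """Sufficient statistics for a 3-node undirected graph g (a frozenset of edges):
    (edge count, triangle count)."""
    return (len(g), 1 if len(g) == 3 else 0)  # only the complete graph has a triangle

# Enumerate all subsets of the 3 possible edges on nodes {0, 1, 2}.
possible_edges = list(itertools.combinations(range(3), 2))
support = [frozenset(c) for r in range(4)
           for c in itertools.combinations(possible_edges, r)]

theta = (0.5, 1.0)  # positive edge and triangle parameters (made-up values)
probs = ergm_pmf(theta, stats, support)
```

With positive edge and triangle parameters, the complete graph receives the largest probability mass, and the probabilities sum to one by construction; for graphs of realistic size the sum defining κ has far too many terms to enumerate, which is the source of the computational difficulties discussed below.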
Theoretical developments in the ERGM literature have arguably lagged inferential and computational advances, although this has become an increasingly active area of research. A major concern of the theoretical literature on ERGMs is the problem of degeneracy, defined differently by different authors but generally involving an inappropriately large concentration of probability mass on a small set of (generally unrealistic) structures. This issue was recognized as early as Strauss (1986), who showed asymptotic concentration of probability mass on graphs of high density for models based on triangle statistics. [This motivated the use of local triangulation by Strauss and Ikeda (1990), a recommendation that went unheeded in later work.] More general treatments of the degeneracy problem can be found in Handcock (2003), Schweinberger (2011), and Chatterjee and Diaconis (2011). Butts (2011) introduced analytical methods that can be used to bound the behavior of general ERGMs by Bernoulli graphs (i.e., ERGMs with independent edge variables), and used these to show sufficient conditions for ERGMs to avoid certain forms of degeneracy as N → ∞. One area of relatively rich theoretical development in the ERGM literature has been the derivation of sufficient statistics from first principles (particularly dependence conditions). Following the early work of Frank and Strauss (1986), many papers in this area employ Hammersley-Clifford constructions (Besag, 1974) in which initially posited axioms for conditional dependence among edge variables (usually based on substantive theory) are used to generate sets of statistics sufficient to represent all pmfs with the posited dependence structure. Examples of such work for single-graph ERGMs include Wasserman and Pattison (1996), Pattison and Robins (2002), and Snijders et al. (2006), with multi-relational examples including Pattison and Wasserman (1999) and Koehly and Pattison (2005). 
Snijders (2010) has shown that statistics based on certain forms of dependence lead to models that allow conditional marginalization across components (i.e., graph components are conditionally independent); this suggests statistics that may be appropriate for social processes in which edges can only influence each other “through” the network itself, and provides insight into circumstances which facilitate inference for population network parameters from data sampled at the component level (see also Shalizi and Rinaldo, 2013). An alternative way to motivate model statistics is via generative models that treat the observed network as arising from a stochastic choice process. Examples of such developments include Snijders (2001) and Almquist and Butts (2013).
1.2.2.1 Bayesian Inference for ERGM Parameters Given the likelihood of equation (1.1), Bayesian inference follows in the usual fashion by application of Bayes’ Theorem, i.e.,

\[ p(\theta \mid y, X) \propto p(y \mid \theta, X)\, p(\theta). \]

The natural conjugate prior for this exponential-family likelihood takes the form

\[ p(\theta \mid \phi, \nu) \propto \exp\left(\nu\, \phi^\top \theta\right) \kappa(\theta, X)^{-\nu}, \]

where ϕ ∈ ℝ^s and ν > 0 are hyperparameters and κ is the ERGM normalizing factor (as defined above). Note that ϕ and ν have natural interpretations in terms of “prior pseudo-data” and “prior pseudo-sample size,” as is clear from the joint posterior:

\[ p(\theta \mid y, X, \phi, \nu) \propto \exp\left(\left(S(y, X) + \nu \phi\right)^\top \theta\right) \kappa(\theta, X)^{-(\nu + 1)}, \tag{1.2} \]

where the proportionality constant in equation (1.2) follows by (re)normalization.
Despite the attractiveness of the conjugate prior, it is less helpful than it might be due to the intractability of the ERGM normalizing factor. While standard MCMC methods (e.g., the Metropolis-Hastings algorithm) can often manage intractable normalizing constants of a posterior density when the posterior density in question is known up to a constant, the kernel of equation (1.2) also involves the (usually intractable) normalizing factor κ from the ERGM likelihood. Such posteriors have been described as “doubly intractable” (Murray et al., 2012), and pose significant computational challenges in practice. In the more general case for which p(θ) does not necessarily include κ (i.e., non-conjugate priors), MCMC or related approaches must generally deal with posterior odds of the form

\[ \frac{p(\theta' \mid y, X)}{p(\theta \mid y, X)} = \frac{\exp\left(\theta'^\top S(y, X)\right) p(\theta')}{\exp\left(\theta^\top S(y, X)\right) p(\theta)} \cdot \frac{\kappa(\theta, X)}{\kappa(\theta', X)}, \]

which still require evaluation of normalizing factor ratios at each step. Provided that the prior ratio can be easily calculated, the complexity of this calculation is no worse than the associated ratios required for likelihood maximization, and indeed MAP estimation can be performed in such cases using MCMC-MLE methods (see, e.g., Hunter et al., 2008, 2012, for the MLE case) via the addition of prior odds as a penalty function. Approaches to direct posterior simulation in this regime include the use of exchange algorithms (Caimo and Friel, 2011) and other approximate MCMC methods (see Hunter et al., 2012, for a review). To date these latter methods have proven too computationally expensive for routine use, but the area is one of active research.
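To see where the normalizing-factor ratios enter the posterior odds in practice, here is an illustrative Python sketch (ours; the model, prior scale, and tuning values are assumptions) running a random-walk Metropolis sampler for a one-parameter edge-count ERGM on 3 nodes, where κ(θ) is small enough to compute exactly at every step:

```python
import itertools
import math
import random

# Support: all undirected graphs on 3 nodes; single statistic = edge count.
EDGES = list(itertools.combinations(range(3), 2))
SUPPORT = [frozenset(c) for r in range(4) for c in itertools.combinations(EDGES, r)]

def log_kappa(theta):
    """Exact log normalizing factor; in general this is the intractable part."""
    return math.log(sum(math.exp(theta * len(g)) for g in SUPPORT))

def log_posterior(theta, obs_edges, prior_sd=10.0):
    """Log posterior kernel: ERGM log-likelihood plus a vague N(0, prior_sd^2) prior."""
    return theta * obs_edges - log_kappa(theta) - theta ** 2 / (2.0 * prior_sd ** 2)

def metropolis(obs_edges, n_iter=5000, step=0.8, seed=1):
    rng = random.Random(seed)
    theta, draws = 0.0, []
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, step)
        # The log acceptance ratio contains log kappa(theta) - log kappa(prop),
        # i.e., the ratio that makes general ERGM posteriors "doubly intractable".
        if math.log(rng.random()) < log_posterior(prop, obs_edges) - log_posterior(theta, obs_edges):
            theta = prop
        draws.append(theta)
    return draws

draws = metropolis(obs_edges=2)  # observed graph with 2 of the 3 possible edges
```

With two of three possible edges observed, the posterior concentrates on positive θ (edge density above one half), though a single small graph leaves substantial posterior uncertainty.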
An alternative (if less generically satisfying) approach to the problem arises by observing that there are some classes of models for which κ is directly computable, and hence for which Bayesian analysis is more readily performed. An early example of this work is that of Wong (1987), who provided a fully Bayesian treatment of the p1 family of Holland and Leinhardt (1981). Because the likelihood for this family factors as a product of categorical pmfs (the four edge variable states associated with each dyad), κ is easily calculated and Bayesian inference is greatly simplified. This intuition was subsequently elaborated by van Duijn et al. (2004), who used it as a basis for a much richer family of effects. Although we are focused here on models in ERGM form, it should also be noted that many latent variable models for networks can be viewed as positing that Y is drawn from an ERGM with strong conditional independence properties (leading to a tractable normalizing factor), given a (possibly very complex) set of latent covariates on which a prior structure is placed. Models such as those of Hoff et al. (2002), Handcock et al. (2007), Nowicki and Snijders (2001) and Airoldi et al. (2008) can be viewed in this light. While the simultaneous dependence in cross-sectional data tends to limit the utility of simplified ERGMs (or to require a shifting of computational burden into a complexly specified parameter structure), this problem is sometimes reduced in dynamic data due to the ability to condition on past observations (i.e., replacing simultaneous dependence in the present with dependence on the past) (Almquist and Butts, 2014). It is to this setting that we now turn.
Temporal models for social network data can be generally classified into two broad categories: (1) continuous time models; and (2) panel data models. Here we will focus only on panel data models – for examples of models for continuous time interaction data see Butts (2008a), DuBois, Butts, McFarland, and Smyth (2013), and DuBois, Butts, and Smyth (2013). Current theory and software are focused on statistical inference for panel data models based on four general approaches. The first is the family of actor oriented models, which assumes an underlying continuous-time model of network dynamics, where each observed event represents a single actor altering his or her outgoing links to optimize a function based on sufficient statistics (for details, see Snijders, 1996; Snijders and Van Duijn, 1997; Snijders, 2001, 2005). The second is the family of latent dynamic structure models, which treat network dynamics as emerging from a simple network process influenced by the evolution of a set of latent covariates; for example, see Sarkar and Moore (2005), Sarkar et al. (2007), and Foulds et al. (2011). The third is the family of temporal exponential family random graph models (TERGMs), which attempt to directly parameterize the joint pmf of a graph sequence using discrete exponential families (Hanneke and Xing, 2007a; Hanneke et al., 2010; Hanneke and Xing, 2007b; Cranmer and Desmarais, 2011; Desmarais and Cranmer, 2011, 2012; Almquist and Butts, 2012, 2013, 2014). Finally, the fourth approach is the separable temporal ERGM family (or STERGM), which assumes each panel observation is a cross-sectional observation from a latent continuous time process in which edges evolve via two separable processes of edge formation and edge dissolution (Krivitsky and Handcock, 2010). Here, we focus on the TERGM case.
TERGMs can be viewed as the natural analog of time series (e.g., VAR) models for the random graph case. Typically, we assume a time series of adjacency matrices …, Y_{t−1}, Y_t, … and parameterize the conditional pmf of Y_t | Y_{t−1}, Y_{t−2}, … in ERGM form. As with classical time series models, it is typical to introduce a temporal Markov assumption of limited dependence on past states; specifically, we assume the existence of some k ≥ 0 such that Y_t is independent of Y_{t−k−1}, Y_{t−k−2}, … given Y_{t−1}, …, Y_{t−k} ≡ Y^{t−1}_{t−k}. Under this assumption, the standard TERGM likelihood for a single observation is written as

\[ p\left(Y_t = y_t \mid Y^{t-1}_{t-k} = y^{t-1}_{t-k}, \theta, X\right) = \frac{\exp\left(\theta^\top S\left(y_t, y^{t-1}_{t-k}, X\right)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(\theta^\top S\left(y', y^{t-1}_{t-k}, X\right)\right)}\, \mathbb{I}_{\mathcal{Y}}(y_t). \tag{1.3} \]

As before, S is an s-vector of real-valued sufficient statistics, but for TERGMs S : 𝒴^{k+1}, X → ℝ^s (i.e., each function may involve observations at the k time points prior to t instead of a single graph). Otherwise, nothing is intrinsically different from the cross-sectional case. (In particular, note that from the point of view of Y_t, y^{t−1}_{t−k} is a fully observed covariate. This is useful for the development that follows.) The denominator of (1.3) is again intractable in the general case, as it is for ERGMs.
For a complete TERGM series, the joint likelihood of the sequence Y_1, …, Y_t is given by

\[ p\left(y_{k+1}, \ldots, y_t \mid y^{k}_{1}, \theta, X\right) = \prod_{i=k+1}^{t} p_{\mathrm{TERG}}\left(y_i \mid y^{i-1}_{i-k}, \theta, X\right), \]

where TERG refers to the single-observation TERGM likelihood of equation (1.3). MCMC-based maximum likelihood estimation for θ is feasible for very short series, but becomes costly as sequence length grows. Cranmer and Desmarais (2011) propose estimation via MPLE combined with a bootstrapping procedure to estimate standard errors as a computationally cheaper alternative. Alternately, scalable estimation is greatly simplified for TERGMs with no simultaneous dependence terms; i.e., models such that Y_{ij,t} is conditionally independent of Y_{kl,t} given Y^{t−1}_{t−k} for all distinct (i, j), (k, l). The TERGM likelihood for such models reduces to a product of Bernoulli graph pmfs, and hence the corresponding inference problem is equivalent to (dynamic) logistic regression. Although by no means novel, these conditional Bernoulli families have recently been advocated by Almquist and Butts (2014) as viable alternatives for network time series in which the time period between observations is on or faster than the time scale of network evolution, or where it is for other reasons possible to capture much of the simultaneous dependence among edges by conditioning on the past history of the network. Almquist and Butts (2014) also show how this family can be easily extended to incorporate endogenous vertex dynamics (a feature not currently treated in other dynamic network families). In the remainder of this chapter, we focus on this case, with a particular emphasis on Bayesian inference for joint vertex/edge dynamics.
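Because a TERGM without simultaneous dependence terms reduces to a logistic regression of current edge indicators on functions of the network's past, its parameters can be recovered with standard logistic-regression machinery. The Python sketch below (an illustration of ours, with made-up generating values) simulates a directed panel series with strong edge persistence and recovers the intercept and lag coefficient by gradient ascent:

```python
import math
import random

def fit_logistic(X, y, steps=500, lr=1.0):
    """Logistic regression by gradient ascent on the mean log-likelihood;
    stands in for the standard logistic machinery that DNR reduces to."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, xi))))
            for j, xj in enumerate(xi):
                grad[j] += (yi - p) * xj
        beta = [b + lr * g / len(y) for b, g in zip(beta, grad)]
    return beta

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Simulate 30 panels of a 10-node directed network: an edge's log-odds are
# -2.0, plus 3.0 if the same edge was present in the previous panel.
rng = random.Random(0)
n, T, intercept, lag_effect = 10, 30, -2.0, 3.0
prev = [[rng.random() < 0.3 for _ in range(n)] for _ in range(n)]
X, y = [], []
for _ in range(T):
    cur = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            cur[i][j] = rng.random() < logistic(intercept + lag_effect * prev[i][j])
            X.append([1.0, 1.0 if prev[i][j] else 0.0])  # [intercept, lagged edge]
            y.append(1 if cur[i][j] else 0)
    prev = cur
beta_hat = fit_logistic(X, y)
```

The estimates land near the generating values (−2.0, 3.0); conditional independence of edges given the past is precisely what makes this estimation as cheap as an ordinary logistic regression.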
Almquist and Butts (2014) extend this framework to incorporate endogenous vertex dynamics. Writing Z_t = (V_t, Y_t) for the joint vertex/edge observation, the conditional likelihood of a single observation may be written as

\[ p\left(V_t = v_t, Y_t = y_t \mid z^{t-1}_{t-k}, \theta, \psi, X\right) = \frac{\exp\left(\psi^\top W\left(v_t, z^{t-1}_{t-k}, X\right)\right)}{\sum_{v' \in \mathcal{V}} \exp\left(\psi^\top W\left(v', z^{t-1}_{t-k}, X\right)\right)} \cdot \frac{\exp\left(\theta^\top S\left(y_t, v_t, z^{t-1}_{t-k}, X\right)\right)}{\sum_{y' \in \mathcal{Y}_{v_t}} \exp\left(\theta^\top S\left(y', v_t, z^{t-1}_{t-k}, X\right)\right)}\, \mathbb{I}_{\mathcal{Y}_{v_t}}(y_t), \tag{1.4} \]

where 𝒴_{v_t} is the set of possible adjacency matrices compatible with vertex set v_t, W is a w-vector of sufficient statistics on the vertex set, and ψ is a w-vector of vertex set parameters. The joint TERGM likelihood for a time series is then the product of the likelihoods for each observation. We refer to the conditional likelihood of a single observation in equation (1.4) as TERGV (i.e., temporal exponential family random graph with vertex processes) in the discussion that follows.
The likelihood of equation (1.4) is inferentially “separable” in the sense that it factorizes into terms respectively dealing with ψ (and the vertex set) and with θ (and the edge set). These may be estimated separately, even when both depend on the same data (i.e., the edge history and vertex history may both enter into S and W). On the other hand, inferential separability does not imply predictive separability: the vertex model will strongly impact the edge structure of graphs drawn from the model, and in some cases vice versa. [See Almquist and Butts (2014) for a discussion.]
1.2.3.2 Bayesian Estimation of TERGMs As before, Bayesian inference for the full TERGM family (with vertex dynamics) is based on the posterior distribution of θ, ψ given Z_1, …, Z_t:

\[ p\left(\theta, \psi \mid z_1, \ldots, z_t, X\right) \propto p\left(z_{k+1}, \ldots, z_t \mid z^{k}_{1}, \theta, \psi, X\right) p\left(\theta, \psi \mid X\right). \]

It is frequently reasonable to treat the parameters of the edge and vertex processes as a priori independent given X. In that case, the above factors as

\[ p\left(\theta, \psi \mid z_1, \ldots, z_t, X\right) \propto \left[\prod_{i=k+1}^{t} p\left(y_i \mid v_i, z^{i-1}_{i-k}, \theta, X\right)\right] p(\theta \mid X) \left[\prod_{i=k+1}^{t} p\left(v_i \mid z^{i-1}_{i-k}, \psi, X\right)\right] p(\psi \mid X), \]

which implies that the joint posterior itself factors as

\[ p\left(\theta, \psi \mid z_1, \ldots, z_t, X\right) = p\left(\theta \mid z_1, \ldots, z_t, X\right) p\left(\psi \mid z_1, \ldots, z_t, X\right). \tag{1.5} \]

This is a manifestation of the inferential separability remarked on previously. Although ψ and θ are jointly dependent on both the covariates and the observed data, the two may be analyzed independently. In the special case where no vertex dynamics are present, or where such dynamics are exogenous, the joint posterior simplifies to the edge-parameter factor of equation (1.5).
As with the ERGM case, posterior estimation for TERGMs inherits the normalizing factor problem (exacerbated in the case of vertex dynamics by the presence of two distinct exponential families, each with a normalizing factor!). Because of these technical complications there has been very little work in applying Bayesian analysis to the more general TERGM framework.3 In the special case in which all observations in the present are independent conditional on the past, however, the normalizing factor becomes tractable and analysis is greatly simplified. As noted above, the similarity of the resulting inference problem to logistic regression (and the direct analogy with existing network regression models) has led to these being dubbed “dynamic network logistic regression” (DNR) families (Almquist and Butts, 2014). In the following section we will discuss Bayesian estimation in the DNR case, with and without vertex dynamics.
Under the conditional independence assumptions of the DNR family, the vertex likelihood is given by

\[ p\left(V_t = v_t \mid z^{t-1}_{t-k}, \psi, X\right) = \prod_{i=1}^{|V_{\max}|} B\left( \mathbb{I}_{v_t}(v(i)) \,\middle|\, \mathrm{logit}^{-1}\left(\psi^\top W\left(v(i), z^{t-1}_{t-k}, X\right)\right) \right), \]

and the edge likelihood by

\[ p\left(Y_t = y_t \mid v_t, z^{t-1}_{t-k}, \theta, X\right) = \prod_{\substack{(i,j) \in v_t \times v_t \\ i \neq j}} B\left( y_{ij,t} \,\middle|\, \mathrm{logit}^{-1}\left(\theta^\top S\left(i, j, z^{t-1}_{t-k}, X\right)\right) \right), \tag{1.6} \]

where B is understood to be the Bernoulli pmf, 𝕀 is the indicator function, and v(i) indicates the ith vertex from a known total risk set V_max. (Thus, the support 𝒱 of V_t is the power set of V_max.) The analogy of this model family with logistic regression is clear from the form of the joint likelihood, which is equivalent to a (relatively complex) logistic regression of indicator variables for edge and vertex set memberships on a set of statistics associated with the network history and/or covariates. In the special case when V_t is exogenously varying, the joint likelihood of the data reduces to the edge process in equation (1.6); when it is fixed, the likelihood reduces to the “classic” dynamic network logistic regression model. Model specification, maximum likelihood based inference, and adequacy checking for this family are discussed in Almquist and Butts (2013, 2014).
Because the DNR family reduces to a logistic regression structure, Bayesian inference is nominally straightforward. However, choice of prior structure for DNR families has not been explored to date. Justifiably or otherwise, researchers typically seek to employ a default prior specification if they do not have a strong rationale for endorsing a specific prior. There is an extensive literature on noninformative, default, and reference prior distributions within the Bayesian statistical field (see Jeffreys, 1998; Hartigan, 1964; Bernardo, 1979; Spiegelhalter and Smith, 1982; Yang and Berger, 1994; Kass and Wasserman, 1996). More recent work has continued the traditions of both research on informative prior distributions using application-specific information and on minimally informative priors (often motivated by invariance principles) (for a review see Gelman et al., 2008). One increasingly widely used approach to evaluating default priors (particularly in the machine learning literature) is the use of predictive assessment, i.e., examination of the extent to which a given prior structure reliably leads to accurate predictions on test data for a given body of training data. While arguably less principled than priors derived from other considerations, priors found to give good predictive performance on past data may be attractive on pragmatic grounds; by turns, such priors can also be justified more substantively as representing distributions compatible with past observations on similar data, and hence plausible at least as a crude starting point. Likewise, priors that consistently lead to poor predictive performance on test data should be suspect, whatever the principles used to construct them. The balance of this chapter is thus concerned with the predictive evaluation of various candidate priors in the context of DNR models.
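As a concrete illustration of why a weakly informative default prior can matter, the Python sketch below (ours; Gelman et al. (2008) propose a Cauchy(0, 2.5) default for standardized logistic coefficients, and the data here are a made-up toy example) contrasts grid-search maximum likelihood with MAP estimation on completely separated data, where the MLE diverges but the Cauchy prior keeps the estimate finite:

```python
import math

def neg_log_post(beta, data, cauchy_scale=None):
    """Negative log-posterior for a one-predictor logistic regression.
    cauchy_scale=None gives the pure negative log-likelihood; otherwise a
    Cauchy(0, cauchy_scale) prior on beta is added as a penalty term."""
    nll = 0.0
    for x, y in data:
        z = beta * x
        nll += math.log1p(math.exp(-z)) if y == 1 else math.log1p(math.exp(z))
    if cauchy_scale is not None:
        nll += math.log1p((beta / cauchy_scale) ** 2)
    return nll

def grid_argmin(data, cauchy_scale, grid):
    return min(grid, key=lambda b: neg_log_post(b, data, cauchy_scale))

# Completely separated data: y == 1 exactly when x > 0, so the likelihood
# keeps improving as beta grows without bound.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
grid = [i / 100.0 for i in range(0, 5001)]  # beta in [0, 50]

beta_mle = grid_argmin(data, None, grid)  # runs to the edge of the grid
beta_map = grid_argmin(data, 2.5, grid)   # the prior reins the estimate in
```

This is the "reining in of extreme parameter values" discussed above: the prior barely moves well-identified coefficients but prevents pathological estimates when the data alone cannot pin a parameter down.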
While both MCMC and MAP are feasible for inference in DNR families, our focus here will be on posterior simulation via MCMC. In addition to giving us a more complete view of the posterior distribution, posterior simulation is particularly well-adapted to predictive model adequacy checking (Gelman et al., 2004). Specifically, simulation of future observations conditional on a point estimate (e.g., the posterior mean or mode) can greatly underestimate the uncertainty associated with the posterior distribution, and by extension can fail to reveal the benefits to be gained by, e.g., prior specifications that rein in extreme parameter values without greatly changing the central tendency of the posterior distribution.
Given the above, there are many reasonable choices of prior specifications for DNR families that may be applicable in one or another context. Given that our focus here is on evaluating simple, default priors, we will focus our attention on four prior specifications suggested as default priors for logistic regression in the Bayesian statistical literature (e.g., Gelman et al., 2008). Our core questions are as follows. First, what are the inferential consequences of employing these reference priors versus maximum likelihood estimation for DNR families in typical social network settings? Second, to what extent do various reasonable default priors lead to differences in either point estimation or posterior uncertainty in such settings? Finally, what differences (if any) does selection of one or another default prior make to prediction, in the specific sense of forecasting properties of an evolving network? If, in typical settings, inferential and predictive outcomes are fairly insensitive to choice of prior, then selection based on computational or other factors may be a reasonable practice. If, by contrast, we find substantial differences in inferential and/or predictive performance among default priors, then these choices must be scrutinized far more carefully. The balance of this chapter is intended as a first step towards assessing these questions.
Although simulation and inference for ERGMs is in general a highly specialized art (see, e.g., Snijders, 2002; Hunter et al., 2008; Wang et al., 2009), the logistic form of DNR families facilitates parameter estimation (if not network simulation) using more standardized tools and techniques. Examples of off-the-shelf toolkits suitable for posterior simulation in the cases studied here include WinBUGS, JAGS, and the MCMCpack package for R (Spiegelhalter et al., 2003; Plummer, 2003; Martin et al., 2011). Here, we employ the Metropolis-Hastings algorithm as implemented in Martin et al. (2011) for posterior simulation, with sufficient statistics computed in the same manner as Almquist and Butts (2014, 2013) via custom statnet-based tools (Handcock et al., 2003). The latter were also used for posterior predictive simulation of graph-theoretic quantities. MCMC convergence was assessed using both Geweke’s convergence diagnostic (Geweke et al., 1991) and Raftery and Lewis’s diagnostic (Raftery and Lewis, 1992).
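For readers who want a language-neutral picture of what this kind of posterior simulation involves, the following is a generic random-walk Metropolis-Hastings sketch for a logistic likelihood with independent normal priors, run on toy data. It is a Python stand-in, not the authors' R code; the prior scale, step size, and data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_post(theta, X, y, prior_sd=10.0):
    """Log-posterior for logistic regression with independent
    normal priors (a stand-in for a weakly informative default)."""
    eta = X @ theta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    logprior = -0.5 * np.sum((theta / prior_sd) ** 2)
    return loglik + logprior

def metropolis(X, y, n_iter=5000, step=0.1):
    """Random-walk Metropolis-Hastings targeting the posterior above."""
    theta = np.zeros(X.shape[1])
    draws = np.empty((n_iter, X.shape[1]))
    lp = log_post(theta, X, y)
    for i in range(n_iter):
        prop = theta + step * rng.normal(size=theta.size)
        lp_prop = log_post(prop, X, y)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept/reject step
            theta, lp = prop, lp_prop
        draws[i] = theta
    return draws

# Toy data: intercept plus one covariate, true slope 1.
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.0, 1.0])))))
draws = metropolis(X, y)
print(draws[2500:].mean(axis=0))  # posterior means after burn-in
```

Discarding the first half of the chain as burn-in and checking the remainder with diagnostics such as Geweke's or Raftery-Lewis's mirrors the workflow described in the text.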
As noted above, inferential separability of the edge and vertex processes in the DNR context allows both to be treated as independent logistic regression problems (so long as the associated parameters are a priori independent). As this implies, there is no reason that the same prior structure must be used for both; for simplicity and practicality, however, we here consider the case in which the edge and vertex parameters are assigned the same prior distributions.
As noted above, we here consider a number of typical priors for logistic regression (which have been recommended in the literature) for use as priors in DNR with and without vertex dynamics. The baseline point of comparison for all Bayesian results will be the ML estimate (and its sampling distribution), reflecting the dominant practice within the ERGM literature. In addition to this baseline, we consider five prior specifications within three general classes: (1) an improper uniform prior (i.e., the fully Bayesian counterpart to maximum likelihood estimation); (2–3) independent normally distributed priors (one centered at 0, and one offset from 0 to induce deliberate prior bias); and (4–5) weakly informative t families of proper priors recommended by Gelman et al. (2008) (a Cauchy distribution centered at 0 with a scale parameter of 2.5, and a t7 prior centered at 0). Empirical experiments on the effects of these priors on analysis and interpretation can be found in Section 1.4.
Given the six specifications described above (including the MLE), we seek to evaluate the practical consequences of these priors for Bayesian inference in typical settings. To that end, we consider a comparative analysis of inference under our proposed priors for two empirical cases. The first is a dynamic network of citations among bloggers during the 2004 US presidential election (“Blog”), and the second is a dynamic network of face-to-face communication ties among windsurfers on a southern California beach (“Beach”). Both networks are typical of social network data sets in that each involves multiple, complex mechanisms of interaction as well as actor-level heterogeneity. These networks have also been studied in the literature, making them useful reference cases for our present analysis.
Given a collection of alternative priors, what are the consequences of choosing one or another for inference in the case of a DNR family with a fixed vertex set? To assess this, we consider parameter estimation and prediction on the Blog data. While the entire time series consists of 484 time points, we here restrict ourselves to a much smaller series of 32 time points, in order to form a reasonable comparison with the Beach data. As noted below, the likelihood specification (i.e., sufficient statistics or effects) for models employed here is based on prior work by Almquist and Butts (2013); we hold the effects constant across all models to isolate the impact of prior specifications.
Because Vt is fixed here, we need only consider priors for the edge parameters (θ). Posterior distributions based on these models are compared with the standard ML estimate derived from iterative weighted least squares methods (McCullagh and Nelder, 1999), noting that the ML estimate can also be thought of as a MAP estimator under an improper uniform prior. The posteriors evaluated are as follows: (1) an improper uniform prior (i.e., a fully Bayesian analog to the MLE); (2) independent N(0,.1) priors on each parameter (a simple default choice with relatively strong shrinkage towards 0); (3) independent N(5, 1) priors on each parameter (included as a non-normative test to evaluate the consequences of poor prior specification); (4) independent standard t7 priors on each parameter, emulating the example used by Gelman et al. (2008); and (5) independent scaled Cauchy priors C(0, 2.5) as recommended by Gelman et al. (2008) for logistic regression in another context. All Bayesian DNR models were estimated via posterior simulation by MCMC, thus allowing for improved predictive modeling. This is discussed in the following section.
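The five specifications can be written down as coordinate-wise log prior densities (up to additive constants). The sketch below is a numpy-only illustration; reading the second argument of N(·,·) as a standard deviation is an assumption on our part, as is everything else in the snippet.

```python
import numpy as np

def norm_lp(th, mu, sd):
    """Independent normal log density (up to a constant)."""
    return np.sum(-0.5 * ((th - mu) / sd) ** 2 - np.log(sd))

def t_lp(th, df, mu=0.0, scale=1.0):
    """Independent Student-t log density (up to a constant);
    df = 1 gives the Cauchy as a special case."""
    z = (th - mu) / scale
    return np.sum(-0.5 * (df + 1) * np.log1p(z ** 2 / df) - np.log(scale))

log_priors = {
    "flat":   lambda th: 0.0,                       # (1) improper uniform
    "n0":     lambda th: norm_lp(th, 0.0, 0.1),     # (2) N(0, .1)
    "n5":     lambda th: norm_lp(th, 5.0, 1.0),     # (3) "extreme" N(5, 1)
    "t7":     lambda th: t_lp(th, 7.0),             # (4) standard t7
    "cauchy": lambda th: t_lp(th, 1.0, scale=2.5),  # (5) C(0, 2.5)
}

theta = np.array([0.5, -1.0])
for name, lp in log_priors.items():
    print(name, lp(theta))
```

Note how the heavy-tailed t and Cauchy priors penalize large coefficients far less than the sharply shrinking N(0, .1) prior, which is the sense in which they are "weakly informative."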
Our specification of the Blog network model follows Almquist and Butts (2013); for expository purposes, we have simplified the original model slightly by removing less important effects. Those retained fall into two main classes: (1) density effects, e.g., DNC/RNC mixing and daily seasonality; and (2) lagged network effects, e.g., interaction or preferential attachment effects (Wasserman and Faust, 1994). In Table 1.1 we consider first mixing effects for DNC and RNC internal ties (respectively), followed by terms for interaction between DNC and RNC blogs; next we consider daily effects (Tuesday through Sunday, with Monday as the reference category); a lag effect (Yt−1); and a lagged indegree effect, log(Deg(Yt−1)), which can be thought of as a preferential attachment effect. The table is laid out to illustrate the similarities and differences among the parameter estimates under the different minimally informative priors. Table 1.1 demonstrates that most reasonable minimally informative priors produce results comparable to each other and to the MLE, as suggested by Gelman et al. (2008). It is worth pointing out that the prior can have a large effect on the parameter estimates (and thus on the resulting predictions; see Section 1.4.3.1): the N(5, 1) estimates are noticeably farther from the others (although the bias is not always in a positive direction, e.g., for the lag and lagged degree terms). That said, it is perhaps reassuring that even a fairly strong prior still leads to generally comparable estimates for a data set of only moderate size (44 nodes over 32 time points). In general, the MLE and posterior mean estimates for the standard default priors differ by an amount that is approximately an order of magnitude smaller than the statistical uncertainty associated with the estimates themselves; for the “extreme” N(5, 1) prior, the differences are on the scale of the posterior standard deviation.
Likewise, the posterior standard deviations themselves are generally quite similar to each other, and to the standard error of the MLE (although, again, the “extreme” prior deviates somewhat).
Table 1.1: Parameter estimates: MLE and posterior means under the five prior specifications discussed in Section 1.4.3 (standard errors and posterior standard deviations are given in parentheses).
To further illustrate the similarities and differences among estimates, we consider Figure 1.1, which contains the 95% posterior marginals for each parameter (where the ML case shows the asymptotic Gaussian approximation to the sampling distribution of the MLE). Here we see again that the typical minimally informative priors result in relatively similar posterior distributions, both in the central mass and in the tails; all are likewise quite similar to the MLE. The deviations associated with the “extreme” prior are clearer here, and demonstrate visually the fact that (for models of this sort) a bias in the prior may not manifest within the posterior in a clear way. Specifically, for this case the inflated prior mean on the baseline interaction rate parameters tends to lead to a compensating deflation of the lagged terms; such effects suggest that it is important to be careful when using highly informative priors, since influence on one parameter may propagate to the posterior marginals for other parameters in a non-obvious manner.
Figure 1.1: Posterior marginals for the five prior specifications discussed in Section 1.4.3; MLE sampling distribution based on asymptotic Gaussian approximation.
The fact that all reasonable prior specifications lead to posterior marginals that are close to each other and to the sampling distribution of the MLE suggests that we are effectively operating within the regime in which the posterior is close to its asymptotic Gaussian limit (Gelman et al., 2004). Although this phenomenon is data dependent, our case suggests that, for models of this kind, data sets of moderate size may prove adequate for the asymptotic approximation to hold. As such, any long-tailed, symmetric prior distribution is likely to work well from an inferential point of view, and choice of distribution can be reasonably made for computational or other reasons.
1.4.3.1 One-step Prediction for Model Assessment Inference is vital for hypothesis testing, model understanding, and the like; however, one can obtain parameters that are qualitatively reasonable from models that nevertheless predict poorly. In this section we focus on one type of predictive model assessment procedure, described by Almquist and Butts (2014) in the DNR context as inhomogeneous Bernoulli prediction. The algorithm suggested by Almquist and Butts (2014) is as follows: for each time point t, we simulate the edge structure n times (i.e., we take n draws from the posterior predictive distribution of the network at time t, given the previous k observed time steps). We then summarize each resulting network via a suite of Graph Level Indices (GLIs; Anderson et al., 1999), yielding a GLI distribution for each time point.
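The procedure can be sketched schematically as follows. The helper names and toy dimensions here are hypothetical; the authors' actual implementation uses statnet-based tools on the real sufficient statistics.

```python
import numpy as np

rng = np.random.default_rng(2)

def tie_probs(theta, stats_t):
    """Inhomogeneous Bernoulli tie probabilities from lagged statistics."""
    return 1 / (1 + np.exp(-(stats_t @ theta)))

def one_step_gli(theta_draws, stats_t, gli, n_draws=100):
    """For time t: draw n networks from the posterior predictive
    (one parameter draw per simulated network), then summarize each
    with a graph-level index (GLI)."""
    idx = rng.choice(len(theta_draws), size=n_draws)
    out = []
    for i in idx:
        p = tie_probs(theta_draws[i], stats_t)
        y_sim = rng.binomial(1, p)          # simulated dyad states
        out.append(gli(y_sim))
    return np.array(out)

# Toy example: 3 sufficient statistics per ordered dyad of a 45-node
# directed graph, with density (the mean tie state) as the GLI.
theta_draws = rng.normal([-1.0, 0.5, 0.2], 0.1, size=(1000, 3))
stats_t = rng.normal(size=(45 * 44, 3))
dens = one_step_gli(theta_draws, stats_t, gli=np.mean)
print(np.percentile(dens, [2.5, 97.5]))     # 95% one-step prediction interval
```

Repeating this at every time point yields the per-time-point GLI distributions (and prediction intervals) used in the figures below.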
Here we consider two important graph level indices: density (Wasserman and Faust, 1994) and the fraction of triads that form 3-cliques (i.e., triangles) (Wasserman and Faust, 1994). The first of these is a fundamental GLI that describes the fraction of all possible ties that are present in the network; the second is a simple index related to clustering. For purposes of assessment, we examine the one-step inhomogeneous Bernoulli predictions of graph density and 3-clique formation rates (respectively) under the ML and Bayesian posterior predictive distributions. The results are summarized in Figures 1.2 and 1.3. An important advantage of using posterior predictive distributions (versus point estimates) in a temporal modeling context is that they realistically propagate the posterior uncertainty in the parameter estimates into the resulting predictive distribution. We can see this quite clearly in Figures 1.2 and 1.3, where the prediction intervals for the MCMC Bayesian estimates are typically much larger than those of the MLE (blue dotted lines). Further, we see that the “extreme” N(5, 1) prior has surprisingly minimal effect on the prediction estimates and prediction intervals. Overall, the increased variance appears to be a more realistic portrayal of the data: the wider Bayesian prediction intervals often cover the observed values, while the “pure” ML-based predictions often do not.
Figure 1.2: One-step prediction of graph density; red dots indicate observed values.
Figure 1.3: One-step prediction of fraction of triads forming 3-cliques; red dots indicate observed values.
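The two GLIs used above are easy to compute directly from an adjacency matrix. The following is a naive O(n³) sketch for undirected graphs, given for concreteness; the chapter's own computations use statnet rather than this code.

```python
import numpy as np
from itertools import combinations

def density(adj):
    """Fraction of possible undirected ties present."""
    n = adj.shape[0]
    return adj[np.triu_indices(n, k=1)].mean()

def triangle_fraction(adj):
    """Fraction of vertex triads that form 3-cliques (triangles)."""
    n = adj.shape[0]
    tri = sum(adj[i, j] and adj[j, k] and adj[i, k]
              for i, j, k in combinations(range(n), 3))
    return tri / (n * (n - 1) * (n - 2) / 6)

# Toy 4-node graph: a triangle {0, 1, 2} plus an isolated vertex 3.
A = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 2)]:
    A[i, j] = A[j, i] = 1
print(density(A), triangle_fraction(A))   # 0.5 0.25
```

Here three of the six possible ties are present (density 0.5), and one of the four triads is a triangle (fraction 0.25).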
To experiment with the effects of different minimally informative priors on models with vertex dynamics, we again look to an empirical case; here we begin by considering the beach data set collected by Freeman et al. (1988), which comprises windsurfers engaging in interpersonal communication on a southern California beach in early fall of 1986. This is a temporally evolving network that includes endogenous vertex dynamics, i.e., changes over time in which windsurfers are present on the beach. Thus, our priors here involve two classes of parameters: θ (the edge parameters) and ψ (the vertex parameters). For the present study, we use the same priors as in the Blog case (along with the MLE), with the same prior specification employed in each instance for θ and for ψ.
Our specification of the likelihood for the Beach data is a simplified version of the model employed by Almquist and Butts (2014). Because the model includes vertex dynamics, effects must be specified both for the edge set (given the vertices present each day) and for the vertex set (given the past history up to the day in question). We include two main classes of effects: (1) density effects, e.g., weekday/weekend effects and the log of the vertex set size; and (2) lagged network effects, e.g., interaction or appearance on the previous day and the number of interactions one engaged in on the previous day (i.e., degree effects). Degree effects may be particularly important in this context, as they capture a general tendency toward engagement in the network on the previous day. In Table 1.2 we consider weekend/weekday effects for both the vertex and edge set dynamics (in a beach setting we expect these to be particularly important, as they capture the natural rhythm of activity of the work week in the United States); a log(nt) effect for the edge set; a lag effect for both the edge set (Yt−1) and vertex set (Vt−1); and, finally, a lagged degree effect for both the edge set, log(Deg(Yt−1)), and vertex set, Deg(Vt−1). The table is laid out to illustrate the similarities and differences among the parameter estimates under the different minimally informative priors (as well as the “extreme” N(5, 1) prior as a point of comparison).
Table 1.2: Posterior means and MLE estimates for the Beach data under the prior specifications discussed in Section 1.4.4 (posterior standard deviations and MLE standard errors given in parentheses).
Table 1.2 demonstrates that (as with the Blog model) we obtain comparable point estimates using either the MLE or standard default priors, with variation across prior specifications generally at or below the level of statistical uncertainty associated with the estimates themselves. As before, the “extreme” prior has a noteworthy effect, although for many parameters the posterior means given this prior are quite close to those obtained via other specifications. The primary exceptions in this regard are the density-related effects, which are more weakly estimated and subject to considerable influence by a strongly informative prior. On the whole, however, our results suggest that reasonable default priors would lead to qualitatively (and quantitatively) similar conclusions regarding the processes shaping the Beach network.
Turning to the posterior marginal distributions, Figure 1.4 shows results that closely parallel the above (and our findings for the Blog data). The posteriors under reasonable default priors closely mirror each other and the MLE sampling distribution, suggesting that we are once again within the Gaussian asymptotic regime. The use of an extreme informative prior does clearly disrupt inference, with a stronger effect on some parameters than others; as before, the sign of the effect on a given parameter may not reflect the direction of prior bias, due to relationships among the parameters within the likelihood.
Figure 1.4: Posterior marginals and MLE sampling distributions for the Beach data.
1.4.4.1 One-step Prediction for Model Assessment For the Blog models, we saw that while the MLE was inferentially comparable to the posterior distributions obtained by reasonable default priors, the resulting predictions (MLE-based simulations versus the full posterior predictive) were quite different. Here, we repeat this analysis for the Beach models, bearing in mind that we now sample vertices as well as edges. (That is, at each step we first draw the vertex set from the associated model, then draw the edge set conditional on the realized vertex set.) As before, we employ the density and fraction of triads forming 3-cliques in the next time period as our outcome measures.
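The two-stage sampling scheme can be sketched as follows: a toy illustration of drawing the vertex set first, then the edge set conditional on it. All statistics, parameter values, and helper names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
expit = lambda z: 1 / (1 + np.exp(-z))

def one_step_joint(psi, theta, v_stats, e_stats_fn):
    """One posterior predictive step with vertex dynamics:
    (1) draw the vertex set from the vertex (psi) model, then
    (2) draw edges among the realized vertices from the edge (theta) model."""
    v_sim = rng.binomial(1, expit(v_stats @ psi))   # who is present today
    present = np.flatnonzero(v_sim)
    e_stats = e_stats_fn(present)                   # dyad stats among present
    y_sim = rng.binomial(1, expit(e_stats @ theta))
    return present, y_sim

# Toy run: 20 candidate vertices, 2 vertex covariates, 3 edge covariates.
v_stats = rng.normal(size=(20, 2))
e_stats_fn = lambda present: rng.normal(
    size=(len(present) * (len(present) - 1) // 2, 3))
present, y_sim = one_step_joint(np.array([0.5, -0.2]),
                                np.array([-1.0, 0.3, 0.1]),
                                v_stats, e_stats_fn)
print(len(present), y_sim.sum())
```

Because the edge draw is conditioned on the realized vertex set, uncertainty about who appears propagates into the predicted GLIs along with uncertainty about the tie parameters.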
The results of our simulation experiment are summarized in Figures 1.5 and 1.6. Broadly, the results are similar to those of the Blog data: incorporating posterior uncertainty into our predictions leads to much wider prediction intervals, and these are often necessary to capture the observed data. In general, predicting clique structure is hard for this model (it has no simultaneous dependence, and many other potential predictors have been removed), but the uncertainty associated with the model predictions is clearer for the Bayesian models than for the MLE. There is not an obvious “winner” from the default priors, although the extreme tails of their prediction intervals do vary somewhat from time point to time point. In general, the differences between them appear small and unsystematic.
Figure 1.5: One-step prediction of graph density on the Beach data for all five Bayesian DNR models and MLE.
Figure 1.6: One-step prediction of graph 3-cliques on the Beach data for all five Bayesian DNR models and MLE.
In our test cases for Bayesian DNR with and without vertex dynamics, we see that several standard minimally informative priors produce both inferential and predictive results that are comparable to each other and—in the case of inference—to maximum likelihood estimation. Differences between point estimates for the default priors tested here are typically smaller (often by an order of magnitude) than the statistical uncertainty associated with the estimates themselves, and all are hence likely to lead to very similar interpretations in practice. Since these results were achieved with two very different types of dynamic social networks of relatively moderate size, we are led to the preliminary conclusion that selection of default priors for DNR families is unlikely to strongly impact results in typical settings (and the choice of prior can thus be made on computational or other grounds).
It is noteworthy that there is a general advantage for the use of Bayesian posterior predictive distributions versus predictive distributions based on the MLE, in terms of reduced overconfidence. This is a general and well-known phenomenon, and not particular to network models; however, it is perhaps worth reinforcing the point that predictions should incorporate uncertainty regarding parameters, and the Bayesian approach greatly facilitates this practice.
While we saw that inference in our test cases was generally fairly robust to prior specification, it is of course possible to “break” a data set by employing a sufficiently informative prior. Here, we observed that use of an “extreme” prior biased in a single direction strongly affected some (but not all) parameters, and that the direction of bias in the posterior did not necessarily correspond to the direction of bias in the prior (due to correlations among the associated statistics). As would be expected, parameters that are more poorly estimated are more subject to influence from the prior, and these (along with others closely related to them via the likelihood) are the ones that are most vulnerable to poor prior specifications. As in other contexts, our results suggest that use of strongly informative priors in DNR must be undertaken with caution, and in particular with an awareness of the degree to which parameters are entangled via the likelihood. Robustness tests are strongly recommended.
The strong concordance between posteriors under various choices of prior with each other and the sampling distribution of the MLE strongly suggests that our test cases place us in the asymptotic Gaussian regime, an encouraging development given that our networks are of relatively modest size. That said, it is important to bear in mind the large number of degrees of freedom inherent in dynamic network data. In general, data size for such problems grows as O(N2T), where N is the vertex set size and T is the number of time points. For graphs of even moderate size—and even for small numbers of time points—this can easily result in an extremely large number of edge variables. Of course, the asymptotic limit depends on more than simply the number of degrees of freedom (the sparsity of the data is also important), but this heuristic provides a reasonable intuition for why the Gaussian approximation is likely to work well here. This is a property that is potentially exploitable for, e.g., approximate Bayesian computation in large N, T settings.
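To make the O(N2T) heuristic concrete for the Blog case (44 nodes over 32 time points, per Section 1.4.3), assuming a directed network without self-ties:

```python
# Number of edge variables in a directed dynamic network without
# self-ties: N*(N-1) ordered dyads per time point, times T time points.
def n_edge_vars(N, T):
    return N * (N - 1) * T

print(n_edge_vars(44, 32))  # 60544 tie variables from a 44-node, 32-day series
```

Even this modest series thus supplies tens of thousands of binary observations, which helps explain why the posterior already sits close to its Gaussian limit.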
In this chapter, we have reviewed the problem of modeling static and dynamic networks in exponential family form. As we have shown, Bayesian analysis of both problems in the general case is made difficult by the central role of the incomputable ERGM normalizing factor, which enters into the likelihood (and sometimes the prior) in a manner that makes traditional MCMC-based sampling schemes slow and/or impractical. For some model families, however, this problem does not apply (due to the presence of a tractable normalizing factor); such families are especially useful in the case of dynamic network modeling, where conditioning on the past can in some cases allow us to model edges as conditionally independent in the present. For such cases, the temporal ERGM form reduces to a simple product of inhomogeneous Bernoulli graph likelihoods, dubbed “dynamic network regression” because of the similarity of the resulting model to logistic regression on time series data. Because of the simplicity of this family, and its similarity to logistic regression, it would be desirable to be able to employ standard “default” priors for analysis in routine settings. Our experiments with two different data sets suggest that this is a reasonable approach: alternative default priors lead to very similar conclusions, with all being similar to inferences resulting from maximum likelihood estimation. One reason for this concordance is that even fairly modest dynamic network data sets supply enough data degrees of freedom to—for DNR families—place the posterior within the asymptotic Gaussian regime. While it is possible to obtain poor results by selection of an especially inappropriate prior, reasonable choices thus lead to reasonable outcomes. Given this, and given the advantages of the Bayesian approach for problems such as prediction, there seems little reason not to recommend this as a standard technique for analyzing network dynamics with TERGM DNR families.
Acknowledgements: This work was supported in part by Office of Naval Research (ONR) award # N00014-08-1-1015, National Science Foundation (NSF) awards # BCS-0827027 and # SES-1260798 and National Institute of Health (NIH)/National Institute of Child Health & Human Development (NICHD) award # 1R01HD068395-01.
Airoldi, E. M., D. M. Blei, S. E. Fienberg, and E. P. Xing (2008, June). Mixed membership stochastic blockmodels. Journal of Machine Learning Research 9, 1981–2014.
Almquist, Z. W. and C. T. Butts (2012). Evolving context: Evidence from temporal change in organizational collaboration over the course of the 2005 Katrina disaster. Working paper, University of California, Irvine.
Almquist, Z. W. and C. T. Butts (2013). Dynamic network logistic regression: A logistic choice analysis of inter- and intra-group blog citation dynamics in the 2004 US presidential election. Political Analysis 21(4), 430–448.
Almquist, Z. W. and C. T. Butts (2014). Logistic network regression for scalable analysis of networks with joint edge/vertex dynamics. Sociological Methodology (forthcoming).
Anderson, B. S., C. Butts, and K. Carley (1999). The interaction of size and density with graph-level indices. Social Networks 21(3), 239–267.
Anderson, C. J., S. Wasserman, and B. Crouch (1999). A p* primer: Logit models for social networks. Social Networks 21(1), 37–66.
Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical Theory. New York: John Wiley and Sons.
Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference. Journal of the Royal Statistical Society. Series B (Methodological), 113–147.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological) 36(2), 192–236.
Besag, J. (2001). Markov chain Monte Carlo for statistical inference. Center for Statistics and the Social Sciences Working Papers(9).
Brown, L. D. (1986). Fundamentals of Statistical Exponential Families: with Applications in Statistical Decision Theory. Hayward, CA: Institute of Mathematical Statistics.
Butts, C. T. (2007). Permutation models for relational data. Sociological Methodology 37(1), 257–281.
Butts, C. T. (2008a). A relational event framework for social action. Sociological Methodology 38(1), 155–200.
Butts, C. T. (2008b). Social networks: A methodological introduction. Asian Journal of Social Psychology 11(1), 13–41.
Butts, C. T. (2009). Revisiting the foundations of network analysis. Science 325, 414–416.
Butts, C. T. (2011). Bernoulli graph bounds for general random graphs. Sociological Methodology 41, 299–345.
