Univariate, Bivariate, and Multivariate Statistics Using R - Daniel J. Denis - E-Book


Description

A practical source for performing essential statistical analyses and data management tasks in R

Univariate, Bivariate, and Multivariate Statistics Using R offers a practical and very user-friendly introduction to the use of R software that covers a range of statistical methods featured in data analysis and data science. The author, a noted expert in quantitative teaching, has written a quick go-to reference for performing essential statistical analyses and data management tasks in R. Requiring only minimal prior knowledge, the book introduces the concepts needed for an immediate yet clear understanding of the statistics essential to interpreting software output. The author explores univariate, bivariate, and multivariate statistical methods, as well as select nonparametric tests. Altogether, it is a hands-on manual on the applied statistics and essential R computing capabilities needed to write theses, dissertations, and research publications. The book is comprehensive in its coverage of univariate through to multivariate procedures, while serving as a friendly and gentle introduction to R software for the newcomer.

This important resource:

* Offers an introductory, concise guide to the computational tools that are useful for making sense out of data using R statistical software
* Provides a resource for students and professionals in the social, behavioral, and natural sciences
* Puts the emphasis on the computational tools used in the discovery of empirical patterns
* Features a variety of popular statistical analyses and data management tasks that can be immediately and quickly applied as needed to research projects
* Shows how to apply statistical analysis using R to data sets in order to get started quickly performing essential tasks in data analysis and data science

Written for students, professionals, and researchers primarily in the social, behavioral, and natural sciences, Univariate, Bivariate, and Multivariate Statistics Using R offers an easy-to-use guide for performing data analysis fast, with an emphasis on drawing conclusions from empirical observations. The book can also serve as a primary or secondary textbook for courses in data analysis or data science, or others in which quantitative methods are featured.


Page count: 596

Year of publication: 2020




Table of Contents

Cover

Preface

This Book’s Objective

What Does It Mean to “Know” R?

Intended Audience and Advice for Instructors

1 Introduction to Applied Statistics

1.1 The Nature of Statistics and Inference

1.2 A Motivating Example

1.3 What About “Big Data”?

1.4 Approach to Learning R

1.5 Statistical Modeling in a Nutshell

1.6 Statistical Significance Testing and Error Rates

1.7 Simple Example of Inference Using a Coin

1.8 Statistics Is for Messy Situations

1.9 Type I versus Type II Errors

1.10 Point Estimates and Confidence Intervals

1.11 So What Can We Conclude from One Confidence Interval?

1.12 Variable Types

1.13 Sample Size, Statistical Power, and Statistical Significance

1.14 How “p < 0.05” Happens

1.15 Effect Size

1.16 The Verdict on Significance Testing

1.17 Training versus Test Data

1.18 How to Get the Most Out of This Book

Exercises

2 Introduction to R and Computational Statistics

2.1 How to Install R on Your Computer

2.2 How to Do Basic Mathematics with R

2.3 Vectors and Matrices in R

2.4 Matrices in R

2.5 How to Get Data into R

2.6 Merging Data Frames

2.7 How to Install a Package in R, and How to Use It

2.8 How to View the Top, Bottom, and “Some” of a Data File

2.9 How to Select Subsets from a Dataframe

2.10 How R Deals with Missing Data

2.11 Using ls() to See Objects in the Workspace

2.12 Writing Your Own Functions

2.13 Writing Scripts

2.14 How to Create Factors in R

2.15 Using the table() Function

2.16 Requesting a Demonstration Using the example() Function

2.17 Citing R in Publications

Exercises

3 Exploring Data with R: Essential Graphics and Visualization

3.1 Statistics, R, and Visualization

3.2 R's plot() Function

3.3 Scatterplots and Depicting Data in Two or More Dimensions

3.4 Communicating Density in a Plot

3.5 Stem‐and‐Leaf Plots

3.6 Assessing Normality

3.7 Box‐and‐Whisker Plots

3.8 Violin Plots

3.9 Pie Graphs and Charts

3.10 Plotting Tables

Exercises

4 Means, Correlations, Counts: Drawing Inferences Using Easy‐to‐Implement Statistical Tests

4.1 Computing z and Related Scores in R

4.2 Plotting Normal Distributions

4.3 Correlation Coefficients in R

4.4 Evaluating Pearson's r for Statistical Significance

4.5 Spearman's Rho: A Nonparametric Alternative to Pearson

4.6 Alternative Correlation Coefficients in R

4.7 Tests of Mean Differences

4.8 Categorical Data

4.9 Radar Charts

4.10 Cohen's Kappa

Exercises

5 Power Analysis and Sample Size Estimation Using R

5.1 What Is Statistical Power?

5.2 Does That Mean Power and Huge Sample Sizes Are “Bad?”

5.3 Should I Be Estimating Power or Sample Size?

5.4 How Do I Know What the Effect Size Should Be?

5.5 Power for t‐Tests

5.6 Estimating Power for a Given Sample Size

5.7 Power for Other Designs – The Principles Are the Same

5.8 Power for Correlations

5.9 Concluding Thoughts on Power

Exercises

6 Analysis of Variance: Fixed Effects, Random Effects, Mixed Models, and Repeated Measures

6.1 Revisiting t‐Tests

6.2 Introducing the Analysis of Variance (ANOVA)

6.3 Evaluating Assumptions

6.4 Performing the ANOVA Using aov()

6.5 Alternative Way of Getting ANOVA Results via lm()

6.6 Factorial Analysis of Variance

6.7 Example of Factorial ANOVA

6.8 Should Main Effects Be Interpreted in the Presence of Interaction?

6.9 Simple Main Effects

6.10 Random Effects ANOVA and Mixed Models

6.11 Mixed Models

6.12 Repeated‐Measures Models

Exercises

7 Simple and Multiple Linear Regression

7.1 Simple Linear Regression

7.2 Ordinary Least‐Squares Regression

7.3 Adjusted R²

7.4 Multiple Regression Analysis

7.5 Verifying Model Assumptions

7.6 Collinearity Among Predictors and the Variance Inflation Factor

7.7 Model‐Building and Selection Algorithms

7.8 Statistical Mediation

7.9 Best Subset and Forward Regression

7.10 Stepwise Selection

7.11 The Controversy Surrounding Selection Methods

Exercises

8 Logistic Regression and the Generalized Linear Model

8.1 The “Why” Behind Logistic Regression

8.2 Example of Logistic Regression in R

8.3 Introducing the Logit: The Log of the Odds

8.4 The Natural Log of the Odds

8.5 From Logits Back to Odds

8.6 Full Example of Logistic Regression

8.7 Logistic Regression on Challenger Data

8.8 Analysis of Deviance Table

8.9 Predicting Probabilities

8.10 Assumptions of Logistic Regression

8.11 Multiple Logistic Regression

8.12 Training Error Rate Versus Test Error Rate

Exercises

9 Multivariate Analysis of Variance (MANOVA) and Discriminant Analysis

9.1 Why Conduct MANOVA?

9.2 Multivariate Tests of Significance

9.3 Example of MANOVA in R

9.4 Effect Size for MANOVA

9.5 Evaluating Assumptions in MANOVA

9.6 Outliers

9.7 Homogeneity of Covariance Matrices

9.8 Linear Discriminant Function Analysis

9.9 Theory of Discriminant Analysis

9.10 Discriminant Analysis in R

9.11 Computing Discriminant Scores Manually

9.12 Predicting Group Membership

9.13 How Well Did the Discriminant Function Analysis Do?

9.14 Visualizing Separation

9.15 Quadratic Discriminant Analysis

9.16 Regularized Discriminant Analysis

Exercises

10 Principal Component Analysis

10.1 Principal Component Analysis Versus Factor Analysis

10.2 A Very Simple Example of PCA

10.3 What Are the Loadings in PCA?

10.4 Properties of Principal Components

10.5 Component Scores

10.6 How Many Components to Keep?

10.7 Principal Components of USA Arrests Data

10.8 Unstandardized Versus Standardized Solutions

Exercises

11 Exploratory Factor Analysis

11.1 Common Factor Analysis Model

11.2 A Technical and Philosophical Pitfall of EFA

11.3 Factor Analysis Versus Principal Component Analysis on the Same Data

11.4 The Issue of Factor Retention

11.5 Initial Eigenvalues in Factor Analysis

11.6 Rotation in Exploratory Factor Analysis

11.7 Estimation in Factor Analysis

11.8 Example of Factor Analysis on the Holzinger and Swineford Data

12 Cluster Analysis

12.1 A Simple Example of Cluster Analysis

12.2 The Concepts of Proximity and Distance in Cluster Analysis

12.3 k‐Means Cluster Analysis

12.4 Minimizing Criteria

12.5 Example of k‐Means Clustering in R

12.6 Hierarchical Cluster Analysis

12.7 Why Clustering Is Inherently Subjective

Exercises

13 Nonparametric Tests

13.1 Mann–Whitney U Test

13.2 Kruskal–Wallis Test

13.3 Nonparametric Test for Paired Comparisons and Repeated Measures

13.4 Sign Test

Exercises

References

Index

End User License Agreement

List of Tables

Chapter 2

Table 2.1 Hypothetical data on quantitative and verbal ability as a function ...

Chapter 4

Table 4.1 Favorability of movies for two individuals in terms of ranks.

Chapter 5

Table 5.1 R² → f² → f conversions.

Chapter 6

Table 6.1 Achievement as a function of teacher.

Table 6.2 Achievement as a function of teacher and textbook.

Table 6.3 Achievement as a function of teacher and textbook.

Table 6.4 Achievement cell means teacher * textbook.

Table 6.5 Learning as a function of trial (hypothetical data).

Table 6.6 Learning as a function of trial and treatment (hypothetical data).

Chapter 8

Table 8.1 Hypothetical data on quantitative and verbal ability for those rece...

Table 8.2 Achievement as a function of teacher and textbook.

Chapter 12

Table 12.1 Fictional data for simple cluster analysis.

List of Illustrations

Chapter 6

Figure 6.1 (a) Cell means for teacher * textbook on achievement. (b) Distanc...

Chapter 7

Figure 7.1 Classic single‐variable mediation model.

Chapter 12

Figure 12.1 (a) Plot of height and weight. (b) Identifying similarity.


Univariate, Bivariate, and Multivariate Statistics Using R

Quantitative Tools for Data Analysis and Data Science

Daniel J. Denis

This edition first published 2020

© 2020 John Wiley & Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Daniel J. Denis to be identified as the author of this work has been asserted in accordance with law.

Registered Office: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office: 111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data applied for

ISBN: 9781119549932

Cover Design: Wiley

Cover Image: © whiteMocca/Shutterstock

To Kaiser

Preface

In many departments and programs, a 16‐week course in applied statistics or data science, even at the graduate level, is all that is allotted (unfortunately) for applied science majors. These include courses in psychology, sociology, education, chemistry, forestry, business, and possibly biology. The student must assimilate topics from so‐called “elementary” statistics to much more advanced multivariate methods in a 3‐month period. For such programs, students need a good introduction to the most commonly used fundamental techniques and a way to implement these techniques in software. For such courses and programs, it is hoped that the current book will fill this need.

This Book’s Objective

This book is an elementary introduction to univariate through to multivariate applied statistics and data science featuring the use of R software. The book surveys many of the most common and classical statistical methods used in the applied social and natural sciences. It is not a deep, theoretical work on statistics, nor an in‐depth manual of every computational possibility in R or of advanced visualization. Instead, the primary goal of the book can be summarized as follows:

This book emphasizes getting many common results quickly using the most popular functions in R, while at the same time introducing the reader to concepts relevant to data analysis and applied statistics. In this spirit, the book can be used as a general introduction or elementary primer to applied univariate through to multivariate statistics and data science using R, focusing on the core, fundamental techniques used by most social and natural scientists.

The book is designed to be used in upper division undergraduate through to graduate courses in a wide range of disciplines, from behavioral and social science courses to any courses requiring a data‐analytic and computational component. It is primarily a “how to” book, in most cases providing only the most essential statistical theory needed to implement and interpret a variety of univariate and multivariate methods. Computationally, the book simply “gets you started,” by surveying a few of the more common ways of obtaining essential output. The book will likely not be of value to those users who are already familiar with the essentials both in theory and computation, and wish to extend that knowledge to a more sophisticated and computational foundation. For those users, more advanced sources, both in theory and computation, are recommended.

While the book can in theory be used without much prior exposure to statistics, an introductory course in statistics at the undergraduate level at some point in the reader’s past is preferred. Most of what is learned in a first course in statistics is usually quickly forgotten and concepts never truly mastered on a deeper level. However, a prior introductory course serves to provide at least some exposure and initiation to many of the concepts discussed in this book so that they do not feel completely “new” to the reader. Statistical learning is often accomplished by successive approximations and iterations, and often prior familiarity with concepts is revealed as the material is learned again “for the first time.” What you may grasp today is often a product of prior experience having a long history, even if the mastery of a concept today can feel entirely “sudden.” The experience of “Oh, now I get it!” is usually determined by a much longer trail than we may at first realize. What suddenly makes sense now may have in one way or another been lurking below the limen of awareness for some time.

The book you hold in your hands is similar in approach to the author’s earlier applied statistics book based on SPSS, SPSS Data Analysis for Univariate, Bivariate, and Multivariate Statistics (2019), also with Wiley. Both this book on R and the earlier one on SPSS are special cases of a wider, more thorough text by the author, Applied Univariate, Bivariate, and Multivariate Statistics (2016), also with Wiley (a second, and better, edition is set to be published in 2020). That book, however, while still applied, is a bit heavier on theory. Both the SPSS book and this one were written for those users who want a quicker and less theoretical introduction to these topics, and wish to get on fast with using the relevant software in applying these methods. Hence, both the SPSS book and this one are appropriate for courses, typically 16 weeks or shorter, that seek to include the data‐analytic component without getting too bogged down in theoretical discussion or development. Instructors can select the book of their choice, SPSS or R, depending on the software they prefer. What sets this book apart from many other books at this level is the explicit explanation provided in interpreting software output. In most places, output is explained as much as possible, or quality references are provided where further reading may be required for a deeper understanding.

A few features that are designed to make the book useful, enjoyable to read, and suitable as an advanced undergraduate or beginning graduate textbook are as follows:

* Bullet points are used quite liberally to annotate and summarize different features of the output that are most relevant. What is not most relevant is generally not discussed, or further references are provided.

* “Don’t Forget!” Brief “don’t forget” summaries of the most relevant information appear throughout the book. These are designed to help reinforce the most pertinent information to facilitate mastery of it.

* Exercises appear at the end of each chapter. These include both conceptual open‐ended questions (great for discussing and exploring concepts) as well as exercises using R.

What Does It Mean to “Know” R?

A General Comment on Software Knowledge versus Statistical Knowledge

R users range from those who are able to use the software with varying success to analyze data in support of their scientific pursuits, to those who are proficient experts at the software and its underlying language. Most of the readers who are using and learning from this book will fall into the former category. Computing, at the level of this book, isn’t “rocket science,” and you do not have to be an “R expert” to get something to work to analyze your data. You do not have to know the intricacies of the R language, the formalities of the language, etc., in order to use R in this fashion. All you really need is the ability to try different things and a whole lot of patience for figuring things out and debugging code. Though we do demonstrate quite a few of the functionalities available in R, we more or less stick to relatively easy‐to‐use functions that you can apply immediately to get results quickly.

Beyond the R capabilities covered in elementary introductions such as this one are true experts in R who can generate programs on the fly (though they, too, proceed by trial and error and debug constantly). These are the ones who truly interact with the software’s language on a deeper level. Hence, in response to a question of “How do I do such‐and‐such in R?” these experts, even if they don’t know the immediate answer, can often “figure it out,” not through looking up code, but through understanding how R functions, and programming code more or less “on the spot” to make it work. We mention this simply so that you are aware of the level at which we are instructing on using R in this book. Beyond this basic level is a whole new world of programming possibilities where your specialization becomes not one of the science you are practicing, but rather of computer programming. Indeed, many of those who contribute packages to R are programming specialists who understand computing languages in general (not only R) at a very deep level. For most common scientific applications of R such as those featured in this book, you definitely do not need to know the software at anywhere near that level. What you need to know is how to make use of R to meet your data‐analytic needs.

However, the same cannot be said of the statistics you are using R for. Understanding the statistics is, not surprisingly, the more difficult part. You can’t “look up” statistical understanding as you can R code. Let me give you an example of what I mean. Consider the following exchange between students:

Student 1: “Do you know how to create a 3‐D scatterplot in R?”

Student 2: “No, but I can look it up and figure it out using what I know about R, and keep trying until it works.”

The above is categorically different from the following exchange:

Student 1: “Do you understand the difference between a p‐value and an effect size?”

Student 2: “No, but I can look it up and figure it out and keep trying until I do.”

The first scenario involves already understanding the nature of multivariable scatterplots and simply “figuring out” how to get one in R. The second scenario requires more understanding of applied statistics that develops over time, experience, and study. That is, it isn’t simply a matter of “digging something up” on the internet as it was for the first example. The second scenario involves understanding the principles at play that develop via study and contemplation. In the age of software, knowing how to generate an ANOVA through software, for example, does not imply knowledge of how ANOVA works on a deeper level, and this distinction should be kept in mind. As I like to say to prospective Ph.D. candidates, assuming your dissertation is on a scientific topic, nobody at your defense meeting is going to ask you (or “test you on”) how you computed something in R. However, someone will likely ask you to explain the meaning behind what you computed and how it applies to your research question. Now had your dissertation been titled “Computational Methods in R,” then questions about how you communicated with R to obtain this or that output would have been fair and relevant. For most readers of this book, however, their primary topic is their chosen science, not the software.
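As a minimal sketch of why the second exchange asks for more than code, the following base R lines hold a hypothetical effect size fixed and vary only the sample size. The numbers are assumptions chosen for illustration and are not taken from the book: a standardized mean difference (Cohen's d) of 0.2 between two independent groups of equal size, with the t statistic obtained from the identity t = d·sqrt(n/2) on 2n − 2 degrees of freedom.

```r
# Hypothetical two-group comparison with equal group sizes.
# Assumption: the observed standardized mean difference (Cohen's d)
# is the same small value, 0.2, at every sample size.
d <- 0.2
n_per_group <- c(20, 200, 2000)

# For two independent groups of equal size n and a pooled standard
# deviation, t = d * sqrt(n / 2) on 2n - 2 degrees of freedom.
t_stat <- d * sqrt(n_per_group / 2)
df <- 2 * n_per_group - 2

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df)

results <- data.frame(
  n_per_group = n_per_group,
  cohens_d    = d,
  t_stat      = t_stat,
  p_value     = p_value
)
print(results)
```

The effect size column is identical in every row, while the p‐value shrinks as the groups grow. Producing the table takes a few lines of code; explaining why the two columns behave differently is the statistical knowledge the second student would still need to acquire.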

It is important to remain aware of the distinction between statistical knowledge and software knowledge. In this book, software knowledge can be “dug up” as needed, whereas statistical knowledge may require a bit more thinking and deliberating. Being able to generate a result in software does not equate to understanding the concepts behind what you have computed.

Intended Audience and Advice for Instructors

As mentioned, the book is suitable for upper‐division undergraduate or beginning graduate courses in applied statistics in the applied sciences and related areas. Because it isn’t merely an R programming manual, the book will be well‐suited for applied statistics courses that feature or use R as one of their software options. Depending on the nature of the course and the goals of the instructor, the book can be used either as a primary text or as a supplement to a more theoretical book, relying on the current work to provide guidance using R. Experienced instructors may also choose to develop the concepts mentioned in the book at a much deeper level via classroom notes, etc., while using the book to help guide the course. Because definitions can sometimes appear elusive without examples of their use, many concepts in this book are introduced or reviewed in the context of how they are used. This helps the student grasp the meaning behind a concept, rather than memorize definitions, which are never perfectly precise accounts of the underlying concept. Text in bold is used to provide emphasis on the word, concept, or sentence.

I hope you enjoy this book as a useful introduction to the world of introductory to advanced statistics using R. Thank you to my Editor, Mindy Okura‐Marszycki, and all at Wiley who made this book possible, as well as students, colleagues, and others who have in one way or another influenced my own professional development. Please contact me at [email protected] or [email protected] with any comments or corrections. For data files and errata, please visit www.datapsyc.com.

Daniel J. Denis

January, 2020