Statistical Analysis with Python For Dummies

Joseph Schmuller

Description

Wrangle stats as you learn how to graph, analyze, and interpret data with Python

Statistical Analysis with Python For Dummies introduces you to the tool of choice for digging deep into data to inform business decisions. Even if you're new to coding, this book unlocks the magic of Python and shows you how to apply it to statistical analysis tasks. You'll learn to set up a coding environment and use Python's libraries and functions to mine data for correlations and test hypotheses. You'll also get a crash course in the concepts of probability, including graphing and explaining your results. Part coding book, part stats class, part business analyst guide, this book is ideal for anyone tasked with squeezing insight from data.

  • Get clear explanations of the basics of statistics and data analysis
  • Learn how to summarize and analyze data with Python, step by step
  • Improve business decisions with objective evidence and analysis
  • Explore hypothesis testing, regression analysis, and prediction techniques

This is the perfect introduction to Python for students, professionals, and the stat-curious.




Statistical Analysis with Python® For Dummies®

To view this book's Cheat Sheet, simply go to www.dummies.com and search for “Statistical Analysis with Python For Dummies Cheat Sheet” in the Search box.

Table of Contents

Cover

Table of Contents

Title Page

Copyright

Introduction

About This Book

Similarity with These Other For Dummies Books

What You Can Safely Skip

Foolish Assumptions

How This Book Is Organized

Icons Used in This Book

Where to Go from Here

Part 1: Getting Started with Statistical Analysis with Python

Chapter 1: Data, Statistics, and Decisions

The Statistical (and Related) Notions You Just Have to Know

Inferential Statistics: Testing Hypotheses

Chapter 2: Python: What It Does and How It Does It

Introducing Colab

Exploring the Colab Environment

Introducing Python

Working with Python Functions

Checking Out Python Libraries

Going Round and Round with Looping

Considering Conditionals

Comprehending List Comprehension

Defining Your Own Functions

Wrapping Up

Part 2: Describing Data

Chapter 3: Getting Graphic

Getting the Data

Creating a Histogram

Barhopping

Slicing the Pie

The Plot of Scatter

Of Boxes and Whiskers

Continuous Variables

Wrapping Up

Chapter 4: Finding Your Center

Means: The Lure of Averages

The Average in Python

Medians: Caught in the Middle

The Median in Python

Statistics à la Mode

The Mode in Python

Chapter 5: Deviating from the Average

Measuring Variation

Variance in Python

Back to the Roots: Standard Deviation

Standard Deviation in Python

Conditions, Conditions, Conditions …

Chapter 6: Meeting Standards and Standings

Catching Some Z’s

z-Scores in Python

Where Do You Stand?

Chapter 7: Summarizing It All

How Many?

The High and the Low

Living in the Moments

Tuning in the Frequency

Summarizing a DataFrame

Chapter 8: What’s Normal?

Hitting the Curve

Working with Normal Distributions

A Distinguished Member of the Family

Part 3: Drawing Conclusions from Data

Chapter 9: The Confidence Game: Estimation

Understanding Sampling Distributions

An EXTREMELY Important Idea: The Central Limit Theorem

Confidence: It Has Its Limits!

Finding Confidence Limits for a Mean

Fit to a t

Chapter 10: One-Sample Hypothesis Testing

Hypotheses, Tests, and Errors

Hypothesis Tests and Sampling Distributions

Catching Some Z’s Again

z-Testing in Python

t for One

t Testing in Python

Working with t-Distributions

Visualizing t-Distributions

Testing a Variance

Testing a Variance in Python

Working with Chi-Square Distributions

Visualizing Chi-Square Distributions

Chapter 11: Two-Sample Hypothesis Testing

Hypotheses Built for Two

Sampling Distributions Revisited

t for Two

t-Testing in Python

A Matched Set: Hypothesis Testing for Paired Samples

Paired Sample t-Testing in Python

Testing Two Variances

Working with F-Distributions

Visualizing F-Distributions

Chapter 12: Testing More than Two Samples

Testing More than Two

ANOVA in Python

After the ANOVA

Another Kind of Hypothesis, Another Kind of Test

Getting Trendy

Trend Analysis in Python

Chapter 13: More Complicated Testing

Cracking the Combinations

Two-Way ANOVA in Python

Visualizing the Two-Way Results

Two Kinds of Variables … at Once

After the Analysis

Multivariate Analysis of Variance

Chapter 14: Regression: Linear, Multiple, and the General Linear Model

The Plot of Scatter

Graphing Lines

Regression: What a Line!

Linear Regression in Python

Juggling Many Relationships at Once: Multiple Regression

ANOVA: Another Look

Analysis of Covariance: The Final Component of the GLM

But Wait — There's More

Chapter 15: Correlation: The Rise and Fall of Relationships

Scatterplots, Again

Understanding Correlation

Correlation and Regression

Testing Hypotheses About Correlation

Correlation in Python

Multiple Correlation

Partial Correlation

Partial Correlation in Python

Semipartial Correlation

Semipartial Correlation in Python

Chapter 16: Curvilinear Regression: When Relationships Get Complicated

What Is a Logarithm?

What Is e?

Power Regression

Exponential Regression

Logarithmic Regression

Polynomial Regression: A Higher Power

Which Model Should You Use?

Part 4: Working with Probability

Chapter 17: Introducing Probability

What Is Probability?

Compound Events

Conditional Probability

Large Sample Spaces

Python Functions for Counting Rules

Random Variables: Discrete and Continuous

Probability Distributions and Density Functions

The Binomial Distribution

The Binomial and Negative Binomial in Python

Hypothesis Testing with the Binomial Distribution

More on Hypothesis Testing: Python versus Tradition

Chapter 18: Introducing Modeling

Modeling a Distribution

A Simulating Discussion

Chapter 19: Probability Meets Regression: Logistic Regression

Getting the Data

Doing the Analysis

Part 5: The Part of Tens

Chapter 20: Ten Tips for R Veterans

Python Libraries Are (Somewhat) Different from R Libraries

Python's Statistics Functions Live in Libraries

In Python, Distributions Also Live in Libraries

Dot Notation in Python Is Important

Dot in Python is Much Like $ in R

Two Important Libraries: NumPy and Pandas

Use the Dictionary

Learn the statsmodels Library

Where Are the Vectors?

A Python Grammar of Graphics

Chapter 21: Ten Valuable Python Resources

Python.org

Python Library Websites

W3 Schools

Pythonbooks

The Python Papers

Python for Everybody

KDNuggets

Geeks for Geeks

Real Python

The Zen of Python

Index

About the Author

Connect with Dummies

End User License Agreement



Statistical Analysis with Python® For Dummies®

Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com

Copyright © 2026 by John Wiley & Sons, Inc. All rights reserved, including rights for text and data mining and training of artificial intelligence technologies or similar technologies.

Media and software compilation copyright © 2026 by John Wiley & Sons, Inc. All rights reserved, including rights for text and data mining and training of artificial intelligence technologies or similar technologies.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

The manufacturer’s authorized representative according to the EU General Product Safety Regulation is Wiley-VCH GmbH, Boschstr. 12, 69469 Weinheim, Germany, e-mail: [email protected].

Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. Python is a registered trademark of Python Software Foundation. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.

For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit https://hub.wiley.com/community/support/dummies.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number is available from the publisher.

ISBN: 978-1-394-37032-0 (pbk); ISBN: 978-1-394-37033-7 (ebk); ISBN: 978-1-394-37034-4 (ebk)

Introduction

So, you’re holding a statistics book. In my humble (and absolutely biased) opinion, it’s not just another statistics book. Nor is it just another Python book. I say this for two reasons.

First, many statistics books teach you the concepts but give you no easy way to apply them. That often leads to a lack of understanding. Because Python has a wealth of features and wide-ranging applicability, it’s a good tool for applying (and learning) statistics concepts.

Second, let’s look at it from the opposite direction: Before I tell you about one of Python’s statistics-related features, I give you the statistical foundation it’s based on. That way, you understand that feature when you use it — and you use it more effectively.

I didn’t want to write a book that only covers the details of Python and introduces some clever coding techniques. Some of that is necessary, of course, in any book that shows you how to use a language like Python. My goal was to venture far beyond that.

Neither did I want to write a statistics “cookbook” — when-faced-with-problem-category-#431-use-statistical-procedure-#763. My goal was to venture far beyond that, too.

Bottom line: This book isn’t just about statistics or just about Python — it’s firmly at the intersection of the two. In the proper context, Python can be a useful tool for teaching and learning statistics, and I’ve tried to supply the proper context.

About This Book

Although the field of statistics proceeds in a logical way, I’ve organized this book so that you can open it in any chapter and start reading. The idea is for you to find the information you’re looking for in a hurry and use it immediately — whether it’s a statistical concept or a Python-related one.

On the other hand, reading from cover to cover is okay if you’re so inclined. If you’re a statistics newbie and you have to use Python to analyze data, I recommend that you begin at the beginning.

Similarity with These Other For Dummies Books

You might be aware that I’ve written two other books: Statistical Analysis with Excel For Dummies and Statistical Analysis with R For Dummies (both from Wiley). This is not a shameless plug for those books. (Shameless plugs appear elsewhere.)

I’m just letting you know that the sections in this book that explain statistical concepts are much like the corresponding sections in the other books. I use (mostly) the same examples and, in many cases, the same words. I’ve developed that material during decades of teaching statistics and found it to be quite effective. (Reviewers seem to like it, too.) Also, if you happen to have read either or both of the other books and you’re transitioning to Python, the common material might just help you make the switch.

And, you know: If it ain’t broke… .

What You Can Safely Skip

Any reference book throws lots of information at you, and this one is no exception. I intended for it all to be useful, but I didn’t aim it all at the same level. So if you’re not deeply into the subject matter, you can avoid paragraphs marked with the Technical Stuff icon.

As you read, you’ll run into sidebars. They provide information that elaborates on a topic, but they’re not part of the main path. If you’re in a hurry, you can breeze past them.

Foolish Assumptions

I’m assuming this much about you:

  • You know how to work with Windows or the Mac. I don’t describe the details of pointing, clicking, selecting, and other actions.

  • You’re able to install Google Colaboratory (I show you how in Chapter 2) and follow along with the examples. I work in Windows, but you should have no problem if you’re working on a Mac.

How This Book Is Organized

I’ve organized this book into five parts.

Part 1: Getting Started with Statistical Analysis with Python

In Part 1, I provide a general introduction to statistics and to Python. I discuss important statistical concepts and describe useful Python techniques. If it’s been a long time since your last course in statistics or if you’ve never even had a statistics course, start with Part 1. If you have never worked with Python, definitely start with Part 1.

Part 2: Describing Data

Part of working with statistics is to summarize data in meaningful ways. In Part 2, you find out how to do that. Most people know about averages and how to compute them. But that’s not the whole story. In Part 2, I tell you about additional statistics that fill in the gaps, and I show you how to use Python to work with those statistics. I also introduce Python graphics in this part.

Part 3: Drawing Conclusions from Data

Part 3 addresses the fundamental aim of statistical analysis: to go beyond the data and help you make decisions. Usually, the data are measurements of a sample taken from a large population. The goal is to use these data to figure out what’s going on in the population.

This opens a wide range of questions: What does an average mean? What does the difference between two averages mean? Are two things associated? These are only a few of the questions I address in Part 3, and I discuss the Python capabilities that help you answer them.

Part 4: Working with Probability

Probability is the basis for statistical analysis and decision-making. In Part 4, I tell you all about it. I show you how to apply probability, particularly in the area of modeling. Part 4 also includes a chapter on a statistical technique, called logistic regression, that marries a method from Part 3 with probability. Python provides a rich set of capabilities that deal with probability, and here’s where you find them.

Part 5: The Part of Tens

Part 5 has two chapters. In the first, I give R users ten tips for moving to Python. In the second, I cover ten valuable Python-related resources you can find online.

Icons Used in This Book

Icons appear all over For Dummies books, and this one is no exception. Each one is a little picture in the margin that lets you know something special about the paragraph it sits next to.

This icon points out a hint or a shortcut that can help you in your work (and perhaps make you a finer, kinder, and more insightful human being).

This one points out timeless wisdom to take with you on your continuing quest for statistics knowledge.

Pay attention to the information accompanied by this icon. It’s a reminder to avoid an action that might gum up the works for you.

As I mention in the earlier section “What You Can Safely Skip,” this icon indicates material you can blow past if it’s just too technical. (I’ve kept this to a minimum.)

Where to Go from Here

You can start reading this book anywhere, but here are a couple of hints. Want to learn the foundations of statistics? Turn the page. Introduce yourself to Python? That’s Chapter 2. Want to start with Python graphics? Hit Chapter 3. For anything else, find it in the table of contents or the index and go for it.

In addition to what you’re reading right now, this product comes with a free, access-anywhere cheat sheet that presents a selected list of Python capabilities and describes what they do. To get this cheat sheet, visit www.dummies.com and type Statistical Analysis with Python For Dummies Cheat Sheet in the search box. Also, be sure to check out this book’s companion website at www.dummies.com/go/statisticalanalysiswithpythonfd, where you will find all the sample code I use in this book in a downloadable format.

Part 1

Getting Started with Statistical Analysis with Python

IN THIS PART …

Find out about Python’s statistical capabilities.

Explore how to work with populations and samples.

Test your hypotheses.

Understand errors in decision-making.

Determine independent and dependent variables.

Chapter 1

Data, Statistics, and Decisions

IN THIS CHAPTER

Introducing statistical concepts

Generalizing from samples to populations

Getting into probability

Testing hypotheses

Two types of error

Statistics? That’s all about crunching numbers into arcane-looking formulas, right? Not really. Statistics, first and foremost, is about decision-making. Some number-crunching is involved, of course, but the primary goal is to use numbers to make decisions. Statisticians look at data and wonder what the numbers are saying. What kinds of trends are in the data? What kinds of predictions are possible? What conclusions can you make?

To make sense of data and answer these questions, statisticians have developed a wide variety of analytical tools.

About the number-crunching part: If you had to do it via pencil-and-paper (or with the aid of a pocket calculator), you’d soon grow discouraged with the amount of computation involved and the errors that might creep in. Software like Python helps you crunch the data and compute the numbers. As a bonus, working with Python can also help you comprehend statistical concepts.

Although Python is an all-purpose computing language, many of its libraries make it ideal for statistical work. I wrote this book to show you how to use these libraries and the statistical tools they make available.

The Statistical (and Related) Notions You Just Have to Know

The analytical tools you find in Python are based on statistical concepts I help you explore in the remainder of this chapter. As you’ll see, these concepts are based on common sense.

Samples and populations

If you watch TV on election night, you know that one exciting occurrence that takes place before the main event is the prediction of the outcome immediately after the polls close (and before all the votes are counted). How is it that pundits almost always get it right?

The idea is to talk to a sample of voters right after they vote. If they’re truthful about how they marked their ballots, and if the sample is representative of the population of voters, analysts can use the sample data to draw conclusions about the population.

That, in a nutshell, is what statistics is all about — using the data from samples to draw conclusions about populations.

Here’s another example. Imagine that your job is to find the average height of 10-year-old children in the United States. Because you probably wouldn’t have the time or the resources to measure every child, you’d measure the heights of a representative sample. Then you’d average those heights and use that average as the estimate of the population average.

Estimating the population average is one kind of inference that statisticians make from sample data. I discuss inference in more detail in the later section “Inferential Statistics: Testing Hypotheses.”

Here’s some important terminology: Properties of a population (like the population average) are called parameters, and properties of a sample (like the sample average) are called statistics. If your only concern is the sample properties (like the heights of the children in your sample), the statistics you calculate are descriptive. If you’re concerned about estimating the population properties, your statistics are inferential.

Now for an important convention about notation: Statisticians use Greek letters (μ, σ, ρ) to stand for parameters, and English letters (x̄, s, r) to stand for statistics. Figure 1-1 summarizes the relationship between populations and samples, and between parameters and statistics.

FIGURE 1-1: The relationship between populations, samples, parameters, and statistics.
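To make the distinction concrete, here’s a quick Python sketch of the idea. The heights are hypothetical numbers I made up for illustration, and the snippet uses the standard library’s statistics module; it computes the sample mean (a statistic, x̄) that you’d use to estimate the population mean (a parameter, μ):

import statistics

# Hypothetical heights (in inches) of a small sample of 10-year-olds
sample_heights = [54.2, 55.1, 53.8, 56.0, 54.7, 55.5]

# The sample mean (a statistic, x-bar) serves as an estimate of the
# population mean (a parameter, mu)
x_bar = statistics.mean(sample_heights)
print(x_bar)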

Variables: Dependent and independent

A variable is something that can take on different values at different times — like your age, the value of the dollar against other currencies, or the number of games your favorite sports team wins. Something that can have only one value is a constant. Scientists tell us that the speed of light is a constant, and we use the constant π to calculate the area of a circle.

Statisticians work with independent variables and dependent variables. In any study or experiment, you’ll find both kinds. Statisticians assess the relationship between them.

Imagine a computerized training method designed to increase a person’s IQ. How would a researcher find out whether this method does what it’s supposed to do? First, that person would randomly assign a sample of people to one of two groups. One group would receive the training method, and the other would complete another kind of computer-based activity — like reading text on a website. Before and after each group completes its activities, the researcher measures each person’s IQ. What happens next? I discuss that topic in the later section “Inferential Statistics: Testing Hypotheses.”

For now, understand that the independent variable here is Type of Activity. The two possible values of this variable are IQ Training and Reading Text. The dependent variable is the change in IQ from Before to After.

A dependent variable is what a researcher measures. In an experiment, an independent variable is what a researcher manipulates. In other contexts, a researcher can’t manipulate an independent variable. Instead, they note naturally occurring values of the independent variable and how they affect a dependent variable.

In general, the objective is to find out whether changes in an independent variable are associated with changes in a dependent variable.

In the examples that appear throughout this book, I show you how to use Python to calculate characteristics of groups of scores or to compare groups of scores. Whenever I show you a group of scores, I’m talking about the values of a dependent variable.

Types of data

When you do statistical work, you can run into four kinds of data. And when you work with a variable, the way you work with it depends on what kind of data it is. The first kind is nominal data. If a set of numbers happens to be nominal data, the numbers are labels — their values don’t signify anything. On a sports team, the jersey numbers are nominal. They just identify the players.

The next kind is ordinal data. In this data type, the numbers are more than just labels. As the name ordinal might tell you, the order of the numbers is important. If I were to ask you to rank ten foods from the one you like best (1) to the one you like least (10), we’d have a set of ordinal data.

But the difference between your third-favorite food and your fourth-favorite food might not be the same as the difference between your ninth-favorite and your tenth-favorite. So this type of data lacks equal intervals and equal differences.

Interval data gives us equal differences. The Fahrenheit scale of temperature is a good example. The difference between 30° and 40° is the same as the difference between 90° and 100°. So each degree is an interval.

People are sometimes surprised to find out that on the Fahrenheit scale, a temperature of 80° is not twice as hot as 40°. For ratio statements (“twice as much as,” “half as much as”) to make sense, zero has to mean the complete absence of the thing you’re measuring. A temperature of 0° F doesn’t mean the complete absence of heat — it’s just an arbitrary point on the Fahrenheit scale. (The same holds true for Celsius.)

The fourth kind of data, ratio, provides a meaningful zero point. On the Kelvin scale of temperature, zero means “absolute zero,” where all molecular motion (the basis of heat) stops. So 200° Kelvin is twice as hot as 100° Kelvin. Another example is length. Eight inches is twice as long as 4 inches. Zero inches means “a complete absence of length.”
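To see that idea in a few lines of Python (my own quick example, not one of the book’s), convert two Fahrenheit readings to Kelvin, the scale with a true zero, and check the ratio:

def fahrenheit_to_kelvin(f):
    # Kelvin has a true zero point, so ratios make sense on this scale
    return (f - 32) * 5 / 9 + 273.15

k40 = fahrenheit_to_kelvin(40)   # about 277.6 K
k80 = fahrenheit_to_kelvin(80)   # about 299.8 K
print(k80 / k40)                 # roughly 1.08, not 2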

An independent variable or a dependent variable can be either nominal, ordinal, interval, or ratio. The analytical tools you use depend on the type of data you work with.

A little probability

When statisticians make decisions, they use probability to express their confidence about those decisions. They can never be absolutely certain about what they decide. They can only tell you how probable their conclusions are.

What do I mean by probability? Mathematicians and philosophers might give you complex definitions. In my experience, however, the best way to understand probability is in terms of examples.

Here’s a simple example: If you toss a coin, what’s the probability that it turns up heads? If the coin is fair, you might figure that you have a 50-50 chance of heads and a 50-50 chance of tails. And you’d be right. In terms of the kinds of numbers associated with probability, that’s ½.

Think about rolling a fair die (one member of a pair of dice). What’s the probability that you roll a 4? Well, a die has six faces and one of them is 4, so that’s ⅙. Still another example: Select 1 card at random from a standard deck of 52 cards. What’s the probability that it’s a diamond? A deck of cards has four suits, so that’s ¼.

These examples tell you that if you want to know the probability that an event occurs, count how many ways that event can happen and divide by the total number of events that can happen. In the first two examples (heads, 4), the event you’re interested in happens in only one way. For the coin, you divide 1 by 2. For the die, you divide 1 by 6. In the third example (diamond), the event can happen in 13 ways (ace through king), so you divide 13 by 52 (to get ¼).

Now for a slightly more complicated example. Toss a coin and roll a die at the same time. What’s the probability of tails and a 4? Think about all the possible events that can happen when you toss a coin and roll a die at the same time. You could have tails and 1 through 6, or heads and 1 through 6. That adds up to 12 possibilities. The tails-and-4 combination can happen only one way. So the probability is 1/12.

In general, the formula for the probability that a particular event occurs is

pr(event) = (number of ways the event can occur) ÷ (total number of possible events)

At the beginning of this section, I say that statisticians express their confidence about their conclusions in terms of probability, which is why I brought all this up in the first place. This line of thinking leads to conditional probability — the probability that an event occurs given that some other event occurs. Suppose that I roll a die, look at it (so that you don’t see it), and tell you that I rolled an odd number. What’s the probability that I’ve rolled a 5? Ordinarily, the probability of a 5 is ⅙, but “I rolled an odd number” narrows it down. That piece of information eliminates the three even numbers (2, 4, 6) as possibilities. Only the three odd numbers (1, 3, 5) are possible, so the probability is ⅓.
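If you want to verify these numbers yourself, here’s a small sketch (my own illustration) that applies the same count-and-divide logic with Python’s built-in fractions and itertools modules:

from fractions import Fraction
from itertools import product

# Toss a coin and roll a die: 2 x 6 = 12 equally likely outcomes
outcomes = list(product(['H', 'T'], range(1, 7)))
favorable = sum(1 for coin, die in outcomes if coin == 'T' and die == 4)
print(Fraction(favorable, len(outcomes)))   # 1/12

# Conditional probability: the chance of a 5, given an odd roll
odd_rolls = [roll for roll in range(1, 7) if roll % 2 == 1]
print(Fraction(odd_rolls.count(5), len(odd_rolls)))   # 1/3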

What’s the big deal about conditional probability? What role does it play in statistical analysis? Read on.

Inferential Statistics: Testing Hypotheses

Before any statistician begins a study, they draw up a tentative explanation — a hypothesis that tells why the data might come out a certain way. After gathering all the data, the statistician has to decide whether to reject the hypothesis.

That decision is the answer to a conditional probability question — what’s the probability of obtaining the data, given that this hypothesis is correct? Statisticians have tools that calculate the probability. If the probability turns out to be low, the statistician rejects the hypothesis.

Back to coin-tossing for an example: Imagine that you’re interested in whether a particular coin is fair — whether it has an equal chance of heads or tails on any toss. Let’s start with “The coin is fair” as the hypothesis.

To test the hypothesis, you’d toss the coin a number of times — let’s say 100. These 100 tosses are the sample data. If the coin is fair (as per the hypothesis), you’d expect 50 heads and 50 tails.

If it’s 99 heads and 1 tail, you’d surely reject the fair-coin hypothesis: The conditional probability of 99 heads and 1 tail given a fair coin is very low. Of course, the coin could still be fair and you could, quite by chance, get a 99-1 split, right? Sure. You never really know. You have to gather the sample data (the 100-toss results) and then decide. Your decision might be right, or it might not.
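Just to put a number on “very low”: the sketch below (my own example; the binomial distribution it relies on shows up again in Part 4) uses the scipy library to compute the probability of getting 99 or more heads in 100 tosses of a fair coin:

from scipy.stats import binom

# Probability of 99 or more heads in 100 tosses of a fair coin
n, p = 100, 0.5
print(binom.sf(98, n, p))   # P(X >= 99), roughly 8e-29: reject the fair-coin hypothesis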

Juries make these types of decisions. In the United States, the starting hypothesis is that the defendant is not guilty (“innocent until proven guilty”). Think of the evidence as data. Jury members consider the evidence and answer a conditional probability question: What’s the probability of the evidence, given that the defendant is not guilty? Their answer determines the verdict.

Null and alternative hypotheses

Think again about that coin-tossing study I just mentioned. The sample data are the results from the 100 tosses. I said that we can start with the hypothesis that the coin is fair. This starting point is called the null hypothesis. The statistical notation for the null hypothesis is H0. According to this hypothesis, any heads-tails split in the data is consistent with a fair coin. Think of it as the idea that nothing in the sample data is out of the ordinary.

An alternative hypothesis is possible — that the coin isn’t a fair one and it’s loaded to produce an unequal number of heads and tails. This hypothesis says that any heads-tails split is consistent with an unfair coin. This alternative hypothesis is called, believe it or not, the alternative hypothesis. The statistical notation for the alternative hypothesis is H1.

Now toss the coin 100 times and note the number of heads and tails. If the results are something like 90 heads and 10 tails, it’s a good idea to reject H0. If the results are around 50 heads and 50 tails, don’t reject H0.

Similar ideas apply to the IQ example I gave earlier. One sample receives the computer-based IQ training method, and the other participates in a different computer-based activity — like reading text on a website. Before and after each group completes its activities, the researcher measures each person’s IQ. The null hypothesis, H0, is that one group’s improvement isn’t different from the other. If the improvements are greater with the IQ training than with the other activity — so much greater that it’s unlikely that the two aren’t different from one another — reject H0. If they’re not, don’t reject H0.

Notice that I did not say “accept H0.” The way the logic works, you never accept a hypothesis. You either reject H0 or don’t reject H0. In a jury trial, the verdict is either “guilty” (reject the null hypothesis of “not guilty”) or “not guilty” (don’t reject H0). “Innocent” (acceptance of the null hypothesis) is not a possible verdict.

Notice also that in the coin-tossing example, I said “around 50 heads and 50 tails.” What does around mean? Also, I said that if it’s 90-10, reject H0. What about 85-15? 80-20? 70-30? Exactly how much different from 50-50 does the split have to be for you to reject H0? In the IQ training example, how much greater does the IQ improvement have to be to reject H0?

I won’t answer these questions now. Statisticians have formulated decision rules for situations like this, and I’ll help you explore those rules throughout this book.

Two types of error

Whenever you evaluate data and decide to reject H0 or not reject H0, you can never be absolutely sure. You never really know the “true” state of the world. In the coin-tossing example, that means you can’t be certain whether the coin is fair. All you can do is make a decision based on the sample data. If you want to know for sure about the coin, you have to have the data for the entire population of tosses — which means you have to keep tossing the coin until the end of time.

Because you’re never certain about your decisions, you can make an error either way you decide. As I mention earlier, the coin could be fair, and you just happen to get 99 heads in 100 tosses. That’s not likely, and that’s why you reject H0 if that happens. It’s also possible that the coin is biased, yet you just happen to toss 50 heads in 100 tosses. Again, that’s not likely, and you don’t reject H0 in that case.

Although those errors aren’t likely, they’re possible. They lurk in every study that involves inferential statistics. Statisticians have named them Type I errors and Type II errors.

If you reject H0 and you shouldn’t, that’s a Type I error. In the coin example, that’s rejecting the hypothesis that the coin is fair when in reality it’s a fair coin.

If you don’t reject H0 and you should have, that’s a Type II error. It happens when you don’t reject the hypothesis that the coin is fair, and in reality, it’s biased.

How do you know whether you’ve made either type of error? You don’t — at least not right after you make the decision to reject or not reject H0. (If it’s possible to know, you wouldn’t make the error in the first place!) All you can do is gather more data and see whether the additional data is consistent with your decision.

If you think of H0 as a tendency to maintain the status quo and not interpret anything as being out of the ordinary (no matter how it looks), a Type II error means you’ve missed out on something big. In fact, some iconic mistakes are Type II errors.

Here’s what I mean. On New Year’s Day in 1962, a rock group consisting of three guitarists and a drummer auditioned in the London studio of a major recording company. Legend has it that the recording executives didn’t like what they heard, didn’t like what they saw, and believed that guitar groups were on their way out. Although the musicians played their hearts out, the group failed the audition.

Who was that group? The Beatles!

And that’s a Type II error.

Chapter 2

Python: What It Does and How It Does It

IN THIS CHAPTER

Working with Colab

Learning Python functions

Learning Python structures

Working with libraries

Creating your own functions

Python is a computer language. You can use it for doing the kinds of computation and number-crunching that can set the stage for effective statistical analysis and decision-making. An important aspect of statistical analysis is to present the results in a comprehensible way. For this reason, I explore Python’s extensive graphics capabilities (in Chapter 3).

The brainchild of Guido van Rossum, Python is named after the long-running BBC hit comedy series “Monty Python’s Flying Circus.” He intended Python to be easy and intuitive to use, open source, and suitable for everyday tasks, from website creation to more involved efforts like machine learning and data science. He also wanted humans to be able to easily understand Python code.

To say that van Rossum succeeded is putting it mildly. In 2024, Python became the most-often-used language on GitHub (the world’s largest code management website), and it’s been one of the ten most popular languages since 2004.

To read more about how Python began, check out Python: The Documentary on YouTube (www.youtube.com/watch?v=GfH4QL4VqJ0). It’s a bit technically oriented, but it’s fun to watch Guido and his colleagues reminisce.

Introducing Colab

At this point, it might seem logical to tell you how to download Python and install it on your computer.

Instead, I move in a different direction. Why? In this book, I don’t use Python on a local machine. Instead, I show you how to do your computing in the cloud. That way, you don’t have to worry about installation issues or local hardware limitations. You don’t fuss with command lines or path names. All you need is a Gmail account and a working Internet connection — and of course, the faster, the better.

What makes all this possible is the Google Colaboratory, dubbed Colab by its users. Google hosts Colab as a platform for learning, exploration, and experimentation. Because Google hosts it online, you work with Google hardware rather than your own. If you’re an aspiring data scientist or machine learning engineer, or if you just want to learn Python, it’s a good idea to get into Colab.

Colab is a browser-based version of a locally installable app known as the Jupyter Notebook, so named because of the languages it accommodates (Julia, Python, and R). It’s pronounced “Jupiter,” like the planet. (To be consistent with its spelling, I think it should be pronounced “JuPYter,” but as usual, nobody asked me.)

To make the going as easy as possible, I use the Chrome browser and store my work on Google Drive. That way, everything stays in the Google family.

The most user-friendly installable version of Jupyter, in my view, is Jupyter Lab Desktop (https://github.com/jupyterlab/jupyterlab-desktop). You can try it if you like, but I have nothing more to say about it in this book.

Time to dive into Colab:

After registering your free Gmail account, open Colab by navigating to https://colab.research.google.com.

Doing so opens a page with an Open Notebook dialog box that resembles Figure 2-1. (I say “resembles” because yours won’t look exactly like mine. I’ve been working with Colab for a while, so I have some files saved.)

Click the blue New Notebook button to get started.

Figure 2-2 shows the page that opens.

First things first — let’s give this notebook a name.

FIGURE 2-1: The Open Notebook dialog box on the Colab welcome page.

FIGURE 2-2: The page that opens when you click the New Notebook button.

Double-click on Untitled and type My First Notebook.

Press Enter to apply the name change.
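With the notebook named, you might want to confirm that everything works. The cell below is just a quick sanity check I suggest (not one of the book’s examples): type it into the code cell and run it by clicking the Run arrow or pressing Shift+Enter.

# A first cell: confirm that Python is up and running in Colab
message = "Hello from My First Notebook"
print(message)
print(2 + 2)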