Building Statistical Models in Python - Huy Hoang Nguyen - E-Book

Description

The ability to perform statistical modeling proficiently is a fundamental skill for data scientists and essential for businesses that rely on data insights. Building Statistical Models in Python is a comprehensive guide that will empower you to apply mathematical and statistical principles to assess data, understand it, and draw inferences from it.

This book not only equips you with skills to navigate the complexities of statistical modeling, but also provides practical guidance for immediate implementation through illustrative examples. Through emphasis on application and code examples, you’ll understand the concepts while gaining hands-on experience. With the help of Python and its essential libraries, you’ll explore key statistical models, including hypothesis testing, regression, time series analysis, classification, and more.

By the end of this book, you’ll have gained fluency in statistical modeling and be able to harness the full potential of Python's rich ecosystem for data analysis.

You can read the e-book in the Legimi apps or in any app that supports the following format:

EPUB

Page count: 546

Year of publication: 2023

Building Statistical Models in Python

Develop useful models for regression, classification, time series, and survival analysis

Huy Hoang Nguyen

Paul N Adams

Stuart J Miller

BIRMINGHAM—MUMBAI

Building Statistical Models in Python

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Ali Abidi

Publishing Product Manager: Sanjana Gupta

Senior Editor: Sushma Reddy

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Book Project Manager: Kirti Pisat

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Hemangini Bari

Production Designer: Prashant Ghare

Marketing Coordinator: Nivedita Singh

First published: August 2023

Production reference: 3310823

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul's Square

Birmingham

B3 1RB, UK.

ISBN 978-1-80461-428-0

www.packtpub.com

To my parents, Thieu and Tang, for their enormous support and faith in me.

To my wife, Tam, for her endless love, dedication, and courage.

– Huy Hoang Nguyen

To my daughter, Lydie, for demonstrating how work and dedication regenerate inspiration and creativity. To my wife, Helene, for her love and support.

– Paul Adams

To my partner, Kate, who has always supported my endeavors.

– Stuart Miller

Contributors

About the authors

Huy Hoang Nguyen is a mathematician and data scientist with extensive experience in advanced mathematics, strategic leadership, and applied machine learning research. He holds a PhD in Mathematics, as well as two Master’s degrees in Applied Mathematics and Data Science. His previous work focused on Partial Differential Equations, Functional Analysis, and their applications in Fluid Mechanics. After transitioning from academia to the healthcare industry, he has undertaken a variety of data science projects, ranging from traditional machine learning to deep learning.

Paul Adams is a Data Scientist with a background primarily in the healthcare industry. Paul applies statistics and machine learning in multiple areas of industry, focusing on projects in process engineering, process improvement, metrics and business rules development, anomaly detection, forecasting, clustering, and classification. Paul holds an MSc in Data Science from Southern Methodist University.

Stuart Miller is a Machine Learning Engineer with a wide range of experience. Stuart has applied machine learning methods to various projects in industries ranging from insurance to semiconductor manufacturing. Stuart holds degrees in data science, electrical engineering, and physics.

About the reviewers

Krishnan Raghavan is an IT Professional with over 20 years of experience in software development and delivery excellence across multiple domains and technologies ranging from C++ to Java, Python, Data Warehousing, and Big Data tools and technologies.

When not working, Krishnan likes to spend time with his wife and daughter, reading fiction and nonfiction as well as technical books. Krishnan tries to give back to the community by being part of the GDG Pune Volunteer Group, helping the team organize events. Currently, he is unsuccessfully trying to learn how to play the guitar.

You can connect with Krishnan via LinkedIn: www.linkedin.com/in/krishnan-raghavan.

I would like to thank my wife Anita and daughter Ananya for giving me the time and space to review this book.

Karthik Dulam is a Principal Data Scientist at EDB. He is passionate about all things data with a particular focus on data engineering, statistical modeling, and machine learning. He has a diverse background delivering machine learning solutions for the healthcare, IT, automotive, telecom, tax, and advisory industries. He actively engages with students as a guest speaker at esteemed universities delivering insightful talks on machine learning use cases.

I would like to thank my wife, Sruthi Anem, for her unwavering support and patience. I also want to thank my family, friends, and colleagues who have played an instrumental role in shaping the person I am today. Their unwavering support, encouragement, and belief in me have been a constant source of inspiration.

Table of Contents

Preface

Part 1: Introduction to Statistics

1

Sampling and Generalization

Software and environment setup

Population versus sample

Population inference from samples

Randomized experiments

Observational study

Sampling strategies – random, systematic, stratified, and clustering

Probability sampling

Non-probability sampling

Summary

2

Distributions of Data

Technical requirements

Understanding data types

Nominal data

Ordinal data

Interval data

Ratio data

Visualizing data types

Measuring and describing distributions

Measuring central tendency

Measuring variability

Measuring shape

The normal distribution and central limit theorem

The Central Limit Theorem

Bootstrapping

Confidence intervals

Standard error

Correlation coefficients (Pearson’s correlation)

Permutations

Permutations and combinations

Permutation testing

Transformations

Summary

References

3

Hypothesis Testing

The goal of hypothesis testing

Overview of a hypothesis test for the mean

Scope of inference

Hypothesis test steps

Type I and Type II errors

Type I errors

Type II errors

Basics of the z-test – the z-score, z-statistic, critical values, and p-values

The z-score and z-statistic

A z-test for means

z-test for proportions

Power analysis for a two-population pooled z-test

Summary

4

Parametric Tests

Assumptions of parametric tests

Normally distributed population data

Equal population variance

T-test – a parametric hypothesis test

T-test for means

Two-sample t-test – pooled t-test

Two-sample t-test – Welch’s t-test

Paired t-test

Tests with more than two groups and ANOVA

Multiple tests for significance

ANOVA

Pearson’s correlation coefficient

Power analysis examples

Summary

References

5

Non-Parametric Tests

When parametric test assumptions are violated

Permutation tests

The Rank-Sum test

The test statistic procedure

Normal approximation

Rank-Sum example

The Signed-Rank test

The Kruskal-Wallis test

Chi-square distribution

Chi-square goodness-of-fit

Chi-square test of independence

Chi-square goodness-of-fit test power analysis

Spearman’s rank correlation coefficient

Summary

Part 2: Regression Models

6

Simple Linear Regression

Simple linear regression using OLS

Coefficients of correlation and determination

Coefficients of correlation

Coefficients of determination

Required model assumptions

A linear relationship between the variables

Normality of the residuals

Homoscedasticity of the residuals

Sample independence

Testing for significance and validating models

Model validation

Summary

7

Multiple Linear Regression

Multiple linear regression

Adding categorical variables

Evaluating model fit

Interpreting the results

Feature selection

Statistical methods for feature selection

Performance-based methods for feature selection

Recursive feature elimination

Shrinkage methods

Ridge regression

LASSO regression

Elastic Net

Dimension reduction

PCA – a hands-on introduction

PCR – a hands-on salary prediction study

Summary

Part 3: Classification Models

8

Discrete Models

Probit and logit models

Multinomial logit model

Poisson model

The Poisson distribution

Modeling count data

The negative binomial regression model

Negative binomial distribution

Summary

9

Discriminant Analysis

Bayes’ theorem

Probability

Conditional probability

Discussing Bayes’ Theorem

Linear Discriminant Analysis

Supervised dimension reduction

Quadratic Discriminant Analysis

Summary

Part 4: Time Series Models

10

Introduction to Time Series

What is a time series?

Goals of time series analysis

Statistical measurements

Mean

Variance

Autocorrelation

Cross-correlation

The white-noise model

Stationarity

Summary

References

11

ARIMA Models

Technical requirements

Models for stationary time series

Autoregressive (AR) models

Moving average (MA) models

Autoregressive moving average (ARMA) models

Models for non-stationary time series

ARIMA models

Seasonal ARIMA models

More on model evaluation

Summary

References

12

Multivariate Time Series

Multivariate time series

Time-series cross-correlation

ARIMAX

Preprocessing the exogenous variables

Fitting the model

Assessing model performance

VAR modeling

Step 1 – visual inspection

Step 2 – selecting the order of AR(p)

Step 3 – assessing cross-correlation

Step 4 – building the VAR(p,q) model

Step 5 – testing the forecast

Step 6 – building the forecast

Summary

References

Part 5: Survival Analysis

13

Time-to-Event Variables – An Introduction

What is censoring?

Left censoring

Right censoring

Interval censoring

Type I and Type II censoring

Survival data

Survival Function, Hazard and Hazard Ratio

Summary

14

Survival Models

Technical requirements

Kaplan-Meier model

Model definition

Model example

Exponential model

Model example

Cox Proportional Hazards regression model

Step 1

Step 2

Step 3

Step 4

Step 5

Summary

Index

Other Books You May Enjoy

Part 1: Introduction to Statistics

This part will cover the statistical concepts that are foundational to statistical modeling.

It includes the following chapters:

Chapter 1, Sampling and Generalization

Chapter 2, Distributions of Data

Chapter 3, Hypothesis Testing

Chapter 4, Parametric Tests

Chapter 5, Non-Parametric Tests