Essential Statistics for Non-STEM Data Analysts - Rongpeng Li - E-Book

Description

Statistics remain the backbone of modern analysis tasks, helping you to interpret the results produced by data science pipelines. This book is a detailed guide covering the math and various statistical methods required for undertaking data science tasks.
The book starts by showing you how to preprocess data and inspect distributions and correlations from a statistical perspective. You’ll then get to grips with the fundamentals of statistical analysis and apply its concepts to real-world datasets. As you advance, you’ll find out how statistical concepts emerge from different stages of data science pipelines, understand the summary of datasets in the language of statistics, and use it to build a solid foundation for robust data products such as explanatory models and predictive models. Once you’ve uncovered the working mechanism of data science algorithms, you’ll cover essential concepts for efficient data collection, cleaning, mining, visualization, and analysis. Finally, you’ll implement statistical methods in key machine learning tasks such as classification, regression, tree-based methods, and ensemble learning.
By the end of this Essential Statistics for Non-STEM Data Analysts book, you’ll have learned how to build and present a self-contained, statistics-backed data product to meet your business goals.

You can read the e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 388

Publication year: 2020


Essential Statistics for Non-STEM Data Analysts

Get to grips with the statistics and math knowledge needed to enter the world of data science with Python

Rongpeng Li

BIRMINGHAM—MUMBAI

Copyright © 2020 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty

Acquisition Editor: Devika Battike

Senior Editor: Roshan Kumar

Content Development Editor: Sean Lobo

Technical Editor: Sonam Pandey

Copy Editor: Safis Editing

Project Coordinator: Aishwarya Mohan

Proofreader: Safis Editing

Indexer: Pratik Shirodkar

Production Designer: Roshan Kawale

First published: November 2020

Production reference: 1111120

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-83898-484-7

www.packt.com

Packt.com

Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Fully searchable for easy access to vital information

Copy and paste, print, and bookmark content

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Rongpeng Li is a data science instructor and a senior data scientist at Galvanize, Inc. He has previously been a research programmer at Information Sciences Institute, working on knowledge graphs and artificial intelligence. He has also been the host and organizer of the Data Analysis Workshop Designed for Non-STEM Busy Professionals at LA.

Michael Hansen (https://www.linkedin.com/in/michael-n-hansen/), a friend of mine, provided invaluable English language editing suggestions for this book. Michael has great attention to detail, which made him a great language reviewer. Thank you, Michael!

About the reviewers

James Mott, PhD, is a senior education consultant with extensive experience in teaching statistical analysis, modeling, data mining, and predictive analytics. He has over 30 years of experience using SPSS products in his own research, including IBM SPSS Statistics, IBM SPSS Modeler, and IBM SPSS Amos. He has also taught these products to IBM/SPSS customers for over 30 years. In addition, he is an experienced historian with expertise in the research and teaching of 20th-century United States political history and quantitative methods. His specialties are data mining, quantitative methods, statistical analysis, teaching, and consulting.

Yidan Pan obtained her PhD in systems, synthetic, and physical biology from Rice University. Her research interest is profiling mutagenesis at the genomic and transcriptional levels using molecular biology wet labs, bioinformatics, statistical analysis, and machine learning models. She believes that this book will give its readers a lot of practical skills for data analysis.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Preface

Section 1: Getting Started with Statistics for Data Science

Chapter 1: Fundamentals of Data Collection, Cleaning, and Preprocessing

Technical requirements

Collecting data from various data sources

Reading data directly from files

Obtaining data from an API

Obtaining data from scratch

Data imputation

Preparing the dataset for imputation

Imputation with mean or median values

Imputation with the mode/most frequent value

Outlier removal

Data standardization – when and how

Examples involving the scikit-learn preprocessing module

Imputation

Standardization

Summary

Chapter 2: Essential Statistics for Data Assessment

Classifying numerical and categorical variables

Distinguishing between numerical and categorical variables

Understanding mean, median, and mode

Mean

Median

Mode

Learning about variance, standard deviation, quartiles, percentiles, and skewness

Variance

Standard deviation

Quartiles

Skewness

Knowing how to handle categorical variables and mixed data types

Frequencies and proportions

Transforming a continuous variable to a categorical one

Using bivariate and multivariate descriptive statistics

Covariance

Cross-tabulation

Summary

Chapter 3: Visualization with Statistical Graphs

Basic examples with the Python Matplotlib package

Elements of a statistical graph

Exploring important types of plotting in Matplotlib

Advanced visualization customization

Customizing the geometry

Customizing the aesthetics

Query-oriented statistical plotting

Example 1 – preparing data to fit the plotting function API

Example 2 – combining analysis with plain plotting

Presentation-ready plotting tips

Use styling

Font matters a lot

Summary

Section 2: Essentials of Statistical Analysis

Chapter 4: Sampling and Inferential Statistics

Understanding fundamental concepts in sampling techniques

Performing proper sampling under different scenarios

The dangers associated with non-probability sampling

Probability sampling – the safer approach

Understanding statistics associated with sampling

Sampling distribution of the sample mean

Standard error of the sample mean

The central limit theorem

Summary

Chapter 5: Common Probability Distributions

Understanding important concepts in probability

Events and sample space

The probability mass function and the probability density function

Subjective probability and empirical probability

Understanding common discrete probability distributions

Bernoulli distribution

Binomial distribution

Poisson distribution

Understanding the common continuous probability distribution

Uniform distribution

Exponential distribution

Normal distribution

Learning about joint and conditional distribution

Independency and conditional distribution

Understanding the power law and black swan

The ubiquitous power law

Be aware of the black swan

Summary

Chapter 6: Parametric Estimation

Understanding the concepts of parameter estimation and the features of estimators

Evaluation of estimators

Using the method of moments to estimate parameters

Example 1 – the number of 911 phone calls in a day

Example 2 – the bounds of uniform distribution

Applying the maximum likelihood approach with Python

Likelihood function

MLE for uniform distribution boundaries

MLE for modeling noise

MLE and the Bayesian theorem

Summary

Chapter 7: Statistical Hypothesis Testing

An overview of hypothesis testing

Understanding P-values, test statistics, and significance levels

Making sense of confidence intervals and P-values from visual examples

Calculating the P-value from discrete events

Calculating the P-value from the continuous PDF

Significance levels in t-distribution

The power of a hypothesis test

Using SciPy for common hypothesis testing

The paradigm

T-test

The normality hypothesis test

The goodness-of-fit test

A simple ANOVA model

Stationarity tests for time series

Examples of stationary and non-stationary time series

Appreciating A/B testing with a real-world example

Conducting an A/B test

Randomization and blocking

Common test statistics

Common mistakes in A/B tests

Summary

Section 3: Statistics for Machine Learning

Chapter 8: Statistics for Regression

Understanding a simple linear regression model and its rich content

Least squared error linear regression and variance decomposition

The coefficient of determination

Hypothesis testing

Connecting the relationship between regression and estimators

Simple linear regression as an estimator

Having hands-on experience with multivariate linear regression and collinearity analysis

Collinearity

Learning regularization from logistic regression examples

Summary

Chapter 9: Statistics for Classification

Understanding how a logistic regression classifier works

The formulation of a classification problem

Implementing logistic regression from scratch

Evaluating the performance of the logistic regression classifier

Building a naïve Bayes classifier from scratch

Underfitting, overfitting, and cross-validation

Summary

Chapter 10: Statistics for Tree-Based Methods

Overviewing tree-based methods for classification tasks

Growing and pruning a classification tree

Understanding how splitting works

Evaluating decision tree performance

Exploring regression tree

Using tree models in scikit-learn

Summary

Chapter 11: Statistics for Ensemble Methods

Revisiting bias, variance, and memorization

Understanding the bootstrapping and bagging techniques

Understanding and using the boosting module

Exploring random forests with scikit-learn

Summary

Section 4: Appendix

Chapter 12: A Collection of Best Practices

Understanding the importance of data quality

Understanding why data can be problematic

Avoiding the use of misleading graphs

Example 1 – COVID-19 trend

Example 2 – Bar plot cropping

Fighting against false arguments

Summary

Chapter 13: Exercises and Projects

Exercises

Chapter 1 – Fundamentals of Data Collection, Cleaning, and Preprocessing

Chapter 2 – Essential Statistics for Data Assessment

Chapter 3 – Visualization with Statistical Graphs

Chapter 4 – Sampling and Inferential Statistics

Chapter 5 – Common Probability Distributions

Chapter 6 – Parameter Estimation

Chapter 7 – Statistical Hypothesis Testing

Chapter 8 – Statistics for Regression

Chapter 9 – Statistics for Classification

Chapter 10 – Statistics for Tree-Based Methods

Chapter 11 – Statistics for Ensemble Methods

Project suggestions

Non-tabular data

Real-time weather data

Goodness of fit for discrete distributions

Building a weather prediction web app

Building a typing suggestion app

Further reading

Textbooks

Visualization

Exercising your mind

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Section 1: Getting Started with Statistics for Data Science

In this section, you will learn how to preprocess data and inspect distributions and correlations from a statistical perspective.

This section consists of the following chapters:

Chapter 1, Fundamentals of Data Collection, Cleaning, and Preprocessing

Chapter 2, Essential Statistics for Data Assessment

Chapter 3, Visualization with Statistical Graphs