The Statistics and Machine Learning with R Workshop - Liu Peng - E-Book

Liu Peng

Description

The Statistics and Machine Learning with R Workshop is a comprehensive resource packed with insights into statistics and machine learning, along with a deep dive into R libraries. The learning experience is further enhanced by practical examples and hands-on exercises that provide explanations of key concepts.
Starting with the fundamentals, you’ll explore the complete model development process, covering everything from data pre-processing to model development. In addition to machine learning, you’ll also delve into R's statistical capabilities, learning to manipulate various data types and tackle complex mathematical challenges from algebra and calculus to probability and Bayesian statistics. You’ll discover linear regression techniques and more advanced statistical methodologies to hone your skills and advance your career.
By the end of this book, you'll have a robust foundational understanding of statistics and machine learning. You’ll also be proficient in using R's extensive libraries for tasks such as data processing and model training and be well-equipped to leverage the full potential of R in your future projects.

You can read this e-book in the Legimi apps or in any app that supports the following format:

EPUB

Page count: 626

Publication year: 2023




The Statistics and Machine Learning with R Workshop

Unlock the power of efficient data science modeling with this hands-on guide

Liu Peng

The Statistics and Machine Learning with R Workshop

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Ali Abidi

Book Project Manager: Farheen Fathima

Senior Editor: Nazia Shaikh

Technical Editor: Devanshi Ayare

Copy Editor: Safis Editing

Proofreader: Safis Editing

Indexer: Tejal Daruwale Soni

Production Designer: Joshua Misquitta

DevRel Marketing Coordinator: Vinshika Kalra

First published: September 2023

Production reference: 1290923

Published by Packt Publishing Ltd

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB

ISBN 978-1-80324-030-5

www.packtpub.com

This book is dedicated to my family, particularly my wife, Zheng, and my children, Jiaran, Jiaxin, and Jiayu. Jiaran comes first this time, as her older sister (Jiaxin) already declared victory in my other book.

Contributors

About the author

Liu Peng is an assistant professor of quantitative finance (practice) at Singapore Management University and an adjunct researcher at the National University of Singapore. He holds a Ph.D. in statistics from the National University of Singapore and has 10 years of working experience as a data scientist across the banking, technology, and hospitality industries.

This volume encapsulates a decade-long odyssey through the multifaceted landscape of data science, a journey that began as a spark of personal curiosity and evolved into an integrated blend of theoretical and practical insights. I owe a debt of gratitude to my esteemed mentors—Teo Chung Piaw, Chen Ying, and Ian Wilson—who have been instrumental in shaping my academic and professional trajectory, providing unwavering support every step of the way.

About the reviewers

Usha Rengaraju currently heads the data science research at Exa Protocol, and she is the first female triple Kaggle Grandmaster worldwide. She specializes in deep learning and probabilistic graphical models and was also one of the judges of the TigerGraph Graph for All Million Dollar Challenge. She was ranked as one of the top 10 data scientists in India by Analytics India Magazine and also ranked as one of the top 150 AI leaders and influencers by 3AI magazine. She is one of the winners of the ML in Action competition organized by the ML developer programs team at Google, and her team won first place at the WiDS Datathon 2022 organized by Stanford University. She was also the winner of the Kaggle ML Research Spotlight for 2022 and the winner of the TensorFlow Community Spotlight 2023.

Vybhavreddy KC is a dedicated data science practitioner by profession. He has fortified his passion for data with a Bachelor's degree in Computer Science and a Master's degree in Analytics. Vybhav's expertise includes leading the development of innovative ML/AI-driven solutions for a compliance and regulatory product suite. When he's not immersed in the realm of numbers and algorithms, Vybhav cherishes his free time and loves playing with his children.

I would like to thank my wife, Srilakshmi, and my lovely kids, Varshil and Reyansh, for their unwavering support in achieving my academic and professional goals.

Table of Contents

Preface

Part 1: Statistics Essentials

1

Getting Started with R

Technical requirements

Introducing R

Covering the R and RStudio basics

Common data types in R

Common data structures in R

Vector

Matrix

Data frame

List

Control logic in R

Relational operators

Logical operators

Conditional statements

Loops

Exploring functions in R

Summary

2

Data Processing with dplyr

Technical requirements

Introducing tidyverse and dplyr

Data transformation with dplyr

Slicing the dataset using the filter() function

Sorting the dataset using the arrange() function

Adding or changing a column using the mutate() function

Selecting columns using the select() function

Selecting the top rows using the top_n() function

Combining the five verbs

Introducing other verbs

Data aggregation with dplyr

Counting observations using the count() function

Aggregating data via group_by() and summarize()

Data merging with dplyr

Case study – working with the Stack Overflow dataset

Summary

3

Intermediate Data Processing

Technical requirements

Transforming categorical and numeric variables

Recoding categorical variables

Creating variables using case_when()

Binning numeric variables using cut()

Reshaping the DataFrame

Converting from long format into wide format using spread()

Converting from wide format into long format using gather()

Manipulating string data

Creating strings

Converting numbers into strings

Connecting strings

Working with stringr

Basics of stringr

Pattern matching in a string

Splitting a string

Replacing a string

Putting it together

Introducing regular expressions

Working with tidy text mining

Converting text into tidy data using unnest_tokens()

Working with a document-term matrix

Summary

4

Data Visualization with ggplot2

Technical requirements

Introducing ggplot2

Building a scatter plot

Understanding the grammar of graphics

Geometries in graphics

Understanding geometry in scatter plots

Introducing bar charts

Introducing line plots

Controlling themes in graphics

Adjusting themes

Exploring ggthemes

Summary

5

Exploratory Data Analysis

Technical requirements

EDA fundamentals

Analyzing categorical data

Summarizing categorical variables using counts

Converting counts into proportions

Marginal distribution and faceted bar charts

Analyzing numerical data

Visualization in higher dimensions

Measuring the central concentration

Measuring variability

Working with skewed distributions

EDA in practice

Obtaining the stock price data

Univariate analysis of individual stock prices

Correlation analysis

Summary

6

Effective Reporting with R Markdown

Technical requirements

Fundamentals of R Markdown

Getting started with R Markdown

Getting to know the YAML header

Formatting textual information

Writing R code

Generating a financial analysis report

Getting and displaying the data

Performing data analysis

Adding plots to the report

Adding tables to the report

Configuring code chunks

Customizing R Markdown reports

Adding a table of contents

Creating a report with parameters

Customizing the report style

Summary

Part 2: Fundamentals of Linear Algebra and Calculus in R

7

Linear Algebra in R

Technical requirements

Introducing linear algebra

Working with vectors

Working with matrices

Matrix vector multiplication

Matrix multiplication

The identity matrix

Transposing a matrix

Inverting a matrix

Solving a system of linear equations

System of linear equations

The solution to matrix-vector equations

Geometric interpretation of solving a system of linear equations

Obtaining a unique solution to a system of linear equations

Overdetermined and underdetermined systems of linear equations

Summary

8

Intermediate Linear Algebra in R

Technical requirements

Introducing the matrix determinant

Interpreting the determinant

Connection to the matrix rank

Introducing the matrix trace

Special properties of the matrix trace

Understanding the matrix norm

Understanding the vector norm

Calculating the L1-norm of a vector

Calculating the L2-norm of a vector

Calculating the L∞-norm of a vector

Understanding the matrix norm

Calculating the L1-norm of a matrix

Calculating the Frobenius norm of a matrix

Calculating the infinity norm of a matrix

Getting to know eigenvalues and eigenvectors

Understanding scalar-vector multiplication

Defining eigenvalues and eigenvectors

Computing eigenvalues and eigenvectors

Introducing principal component analysis

Understanding the variance-covariance matrix

Connecting to PCA

Performing PCA

Summary

9

Calculus in R

Technical requirements

Introducing calculus

Differential and integral calculus

More on functions

Vertical line test

Functional symmetry

Increasing and decreasing functions

Slope of a function

Function composition

Common functions

Understanding limits

Infinite limit

Limit at infinity

Introducing derivatives

Common derivatives

Common properties and rules of derivatives

Introducing integral calculus

Indefinite integrals

Indefinite integrals of basic functions

Properties of indefinite integrals

Integration by parts

Definite integrals

Working with calculus in R

Plotting basic functions

Working with derivatives

Using symbolic parameters

Working with the second derivative

Working with partial derivatives

Working with integration in R

More on antiderivatives

Evaluating the definite integral

Summary

Part 3: Fundamentals of Mathematical Statistics in R

10

Probability Basics

Technical requirements

Introducing probability distribution

Exploring common discrete probability distributions

The Bernoulli distribution

The binomial distribution

The Poisson distribution

Poisson approximation to binomial distribution

The geometric distribution

Comparing different discrete probability distributions

Discovering common continuous probability distributions

The normal distribution

The exponential distribution

Uniform distribution

Generating normally distributed random samples

Understanding common sampling distributions

Common sampling distributions

Understanding order statistics

Extracting order statistics

Calculating the value at risk

Summary

11

Statistical Estimation

Statistical inference for categorical data

Statistical inference for a single parameter

Introducing the General Social Survey dataset

Calculating the sample proportion

Calculating the confidence interval

Interpreting the confidence interval of the sample proportion

Hypothesis testing for the sample proportion

Inference for the difference in sample proportions

Type I and Type II errors

Testing the independence of two categorical variables

Introducing the contingency table

Applying the chi-square test for independence between two categorical variables

Statistical inference for numerical data

Generating a bootstrap distribution for the median

Constructing the bootstrapped confidence interval

Re-centering a bootstrap distribution

Introducing the central limit theorem used in t-distribution

Constructing the confidence interval for the population mean using the t-distribution

Performing hypothesis testing for two means

Introducing ANOVA

Summary

12

Linear Regression in R

Introducing linear regression

Understanding simple linear regression

Introducing multiple linear regression

Seeking a higher coefficient of determination

More on adjusted R²

Developing an MLR model

Introducing Simpson’s Paradox

Working with categorical variables

Introducing the interaction term

Handling nonlinear terms

More on the logarithmic transformation

Working with the closed-form solution

Dealing with multicollinearity

Dealing with heteroskedasticity

Introducing penalized linear regression

Working with ridge regression

Working with lasso regression

Summary

13

Logistic Regression in R

Technical requirements

Introducing logistic regression

Understanding the sigmoid function

Grokking the logistic regression model

Comparing logistic regression with linear regression

Making predictions using the logistic regression model

More on log odds and odds ratio

Introducing the cross-entropy loss

Evaluating a logistic regression model

Dealing with an imbalanced dataset

Penalized logistic regression

Extending to multi-class classification

Summary

14

Bayesian Statistics

Technical requirements

Introducing Bayesian statistics

A first look into the Bayesian theorem

Understanding the generative model

Understanding prior distributions

Introducing the likelihood function

Introducing the posterior model

Diving deeper into Bayesian inference

Introducing the normal-normal model

Introducing MCMC

The full Bayesian inference procedure

Bayesian linear regression with a categorical variable

Summary

Index

Other Books You May Enjoy

Part 1: Statistics Essentials

This part is designed to equip you with knowledge of statistical and programming fundamentals, focusing particularly on the versatile R language, which will serve as the cornerstone for more advanced topics in subsequent parts.

By the end of this part, you’ll have a strong grasp of the core statistical and programming concepts essential for any data science practitioner to understand. With these foundational skills in hand, you’ll be well prepared to delve into the more specialized topics that await you in subsequent parts of this book.

This part has the following chapters:

Chapter 1, Getting Started with R
Chapter 2, Data Processing with dplyr
Chapter 3, Intermediate Data Processing
Chapter 4, Data Visualization with ggplot2
Chapter 5, Exploratory Data Analysis
Chapter 6, Effective Reporting with R Markdown