The Data Science Workshop - Anthony So - E-Book

The Data Science Workshop E-Book

Anthony So

Description

Cut through the noise and get real results with a step-by-step approach to data science




Key Features



  • Ideal for the data science beginner who is just getting started


  • A data science tutorial with step-by-step exercises and activities that help build key skills


  • Structured to let you progress at your own pace, on your own terms


  • Use your physical print copy to redeem free access to the online interactive edition



Book Description



You already know you want to learn data science, and a smarter way to learn data science is to learn by doing. The Data Science Workshop focuses on building up your practical skills so that you can understand how to develop simple machine learning models in Python, or even build an advanced model for detecting potential bank fraud with effective, modern data science. You'll learn from real examples that lead to real results.







Throughout The Data Science Workshop, you'll take an engaging step-by-step approach to understanding data science. You won't have to sit through any unnecessary theory. If you're short on time, you can jump into a single exercise each day or spend an entire weekend training a model using scikit-learn. It's your choice. Learning on your terms, you'll build up and reinforce key skills in a way that feels rewarding.







Every physical print copy of The Data Science Workshop unlocks access to the interactive edition. With videos detailing all exercises and activities, you'll always have a guided solution. You can also benchmark yourself against assessments, track progress, and receive content updates. You'll even earn a secure credential that you can share and verify online upon completion. It's a premium learning experience that's included with your printed copy. To redeem, follow the instructions located at the start of your data science book.







Fast-paced and direct, The Data Science Workshop is the ideal companion for data science beginners. You'll learn about machine learning algorithms the way a data scientist does, picking them up as you go. This process means that your new skills will stick, embedded as best practice: a solid foundation for the years ahead.




What you will learn



  • Find out the key differences between supervised and unsupervised learning


  • Manipulate and analyze data using scikit-learn and pandas libraries


  • Learn about different algorithms such as regression, classification, and clustering


  • Discover advanced techniques, such as ensembling, to improve model accuracy


  • Speed up the process of creating new features with automated feature engineering tools such as Featuretools


  • Simplify machine learning using open source Python packages



Who this book is for



Our goal at Packt is to help you be successful, in whatever it is you choose to do. The Data Science Workshop is an ideal data science tutorial for the data science beginner who is just getting started. Pick up a Workshop today and let Packt help you develop skills that stick with you for life.

You can read this e-book in Legimi apps or in any other app that supports the following format:

EPUB

Page count: 790

Year of publication: 2020




The Data Science Workshop

A New, Interactive Approach to Learning Data Science

Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, and Dr. Samuel Asare

The Data Science Workshop

Copyright © 2020 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, and Dr. Samuel Asare

Technical Reviewers: Tianxiang Liu, Tiffany Ford, and Pritesh Tiwari

Managing Editors: Adrian Cardoza and Snehal Tambe

Acquisitions Editor: Sarah Lawton

Production Editor: Salma Patel

Editorial Board: Shubhopriya Banerjee, Bharat Botle, Ewan Buckingham, Megan Carlisle, Mahesh Dhyani, Manasa Kumar, Alex Mazonowicz, Bridget Neale, Dominic Pereira, Shiny Poojary, Abhishek Rane, Brendan Rodrigues, Erol Staveley, Ankita Thakur, Nitesh Thakur, and Jonathan Wray

First Published: January 2020

Production Reference: 1280120

ISBN: 978-1-83898-126-6

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Table of Contents

Preface   i

1. Introduction to Data Science in Python   1

Introduction   2

Application of Data Science   2

What Is Machine Learning?   3

Supervised Learning 4

Unsupervised Learning 5

Reinforcement Learning 6

Overview of Python   6

Types of Variable   6

Numeric Variables 6

Text Variables 7

Python List 8

Python Dictionary 10

Exercise 1.01: Creating a Dictionary That Will Contain Machine Learning Algorithms   13

Python for Data Science   16

The pandas Package   16

DataFrame and Series 17

CSV Files 18

Excel Spreadsheets 20

JSON 20

Exercise 1.02: Loading Data of Different Formats into a pandas DataFrame   22

Scikit-Learn   25

What Is a Model? 25

Model Hyperparameters 28

The sklearn API 28

Exercise 1.03: Predicting Breast Cancer from a Dataset Using sklearn   31

Activity 1.01: Train a Spam Detector Algorithm   35

Summary   36

2. Regression   39

Introduction   40

Simple Linear Regression   42

The Method of Least Squares   43

Multiple Linear Regression   44

Estimating the Regression Coefficients (β0, β1, β2 and β3)   45

Logarithmic Transformations of Variables   45

Correlation Matrices   45

Conducting Regression Analysis Using Python   45

Exercise 2.01: Loading and Preparing the Data for Analysis   47

The Correlation Coefficient   54

Exercise 2.02: Graphical Investigation of Linear Relationships Using Python   55

Exercise 2.03: Examining a Possible Log-Linear Relationship Using Python   58

The Statsmodels formula API   59

Exercise 2.04: Fitting a Simple Linear Regression Model Using the Statsmodels formula API   60

Analyzing the Model Summary   61

The Model Formula Language   62

Intercept Handling   64

Activity 2.01: Fitting a Log-Linear Model Using the Statsmodels formula API   64

Multiple Regression Analysis   66

Exercise 2.05: Fitting a Multiple Linear Regression Model Using the Statsmodels formula API   66

Assumptions of Regression Analysis   68

Activity 2.02: Fitting a Multiple Log-Linear Regression Model   69

Explaining the Results of Regression Analysis   70

Regression Analysis Checks and Balances   72

The F-test   73

The t-test   74

Summary   74

3. Binary Classification   77

Introduction   78

Understanding the Business Context   79

Business Discovery   79

Exercise 3.01: Loading and Exploring the Data from the Dataset   80

Testing Business Hypotheses Using Exploratory Data Analysis   82

Visualization for Exploratory Data Analysis   83

Exercise 3.02: Business Hypothesis Testing for Age versus Propensity for a Term Loan   87

Intuitions from the Exploratory Analysis   91

Activity 3.01: Business Hypothesis Testing to Find Employment Status versus Propensity for Term Deposits   92

Feature Engineering    94

Business-Driven Feature Engineering   94

Exercise 3.03: Feature Engineering – Exploration of Individual Features   95

Exercise 3.04: Feature Engineering – Creating New Features from Existing Ones   100

Data-Driven Feature Engineering   106

A Quick Peek at Data Types and a Descriptive Summary   106

Correlation Matrix and Visualization   108

Exercise 3.05: Finding the Correlation in Data to Generate a Correlation Plot Using Bank Data   108

Skewness of Data   111

Histograms   112

Density Plots   113

Other Feature Engineering Methods   114

Summarizing Feature Engineering   116

Building a Binary Classification Model Using the Logistic Regression Function   117

Logistic Regression Demystified   119

Metrics for Evaluating Model Performance   120

Confusion Matrix   121

Accuracy   122

Classification Report   122

Data Preprocessing   123

Exercise 3.06: A Logistic Regression Model for Predicting the Propensity of Term Deposit Purchases in a Bank   124

Activity 3.02: Model Iteration 2 – Logistic Regression Model with Feature Engineered Variables   129

Next Steps   130

Summary   132

4. Multiclass Classification with RandomForest   135

Introduction   136

Training a Random Forest Classifier   136

Evaluating the Model's Performance   140

Exercise 4.01: Building a Model for Classifying Animal Type and Assessing Its Performance   142

Number of Trees Estimator   146

Exercise 4.02: Tuning n_estimators to Reduce Overfitting   149

Maximum Depth   152

Exercise 4.03: Tuning max_depth to Reduce Overfitting   154

Minimum Sample in Leaf   157

Exercise 4.04: Tuning min_samples_leaf   159

Maximum Features   162

Exercise 4.05: Tuning max_features   165

Activity 4.01: Train a Random Forest Classifier on the ISOLET Dataset   168

Summary   169

5. Performing Your First Cluster Analysis   173

Introduction   174

Clustering with k-means   175

Exercise 5.01: Performing Your First Clustering Analysis on the ATO Dataset   177

Interpreting k-means Results   181

Exercise 5.02: Clustering Australian Postcodes by Business Income and Expenses   186

Choosing the Number of Clusters   191

Exercise 5.03: Finding the Optimal Number of Clusters   195

Initializing Clusters   200

Exercise 5.04: Using Different Initialization Parameters to Achieve a Suitable Outcome   203

Calculating the Distance to the Centroid   208

Exercise 5.05: Finding the Closest Centroids in Our Dataset   212

Standardizing Data   219

Exercise 5.06: Standardizing the Data from Our Dataset   223

Activity 5.01: Perform Customer Segmentation Analysis in a Bank Using k-means   228

Summary   230

6. How to Assess Performance   233

Introduction   234

Splitting Data   234

Exercise 6.01: Importing and Splitting Data   235

Assessing Model Performance for Regression Models   239

Data Structures – Vectors and Matrices   240

Scalars 240

Vectors 241

Matrices 242

R2 Score   244

Exercise 6.02: Computing the R2 Score of a Linear Regression Model   245

Mean Absolute Error   249

Exercise 6.03: Computing the MAE of a Model   249

Exercise 6.04: Computing the Mean Absolute Error of a Second Model   252

Other Evaluation Metrics 256

Assessing Model Performance for Classification Models   257

Exercise 6.05: Creating a Classification Model for Computing Evaluation Metrics   257

The Confusion Matrix   261

Exercise 6.06: Generating a Confusion Matrix for the Classification Model   261

More on the Confusion Matrix 262

Precision   263

Exercise 6.07: Computing Precision for the Classification Model   264

Recall   265

Exercise 6.08: Computing Recall for the Classification Model   265

F1 Score   266

Exercise 6.09: Computing the F1 Score for the Classification Model   266

Accuracy   267

Exercise 6.10: Computing Model Accuracy for the Classification Model   267

Logarithmic Loss   268

Exercise 6.11: Computing the Log Loss for the Classification Model   268

Receiver Operating Characteristic Curve   269

Exercise 6.12: Computing and Plotting ROC Curve for a Binary Classification Problem   269

Area Under the ROC Curve   275

Exercise 6.13: Computing the ROC AUC for the Caesarian Dataset   276

Saving and Loading Models   277

Exercise 6.14: Saving and Loading a Model   277

Activity 6.01: Train Three Different Models and Use Evaluation Metrics to Pick the Best Performing Model   280

Summary   282

7. The Generalization of Machine Learning Models   285

Introduction   286

Overfitting   286

Training on Too Many Features   286

Training for Too Long   287

Underfitting   287

Data   288

The Ratio for Dataset Splits   288

Creating Dataset Splits   289

Exercise 7.01: Importing and Splitting Data   290

Random State   294

Exercise 7.02: Setting a Random State When Splitting Data   296

Cross-Validation   297

KFold   298

Exercise 7.03: Creating a Five-Fold Cross-Validation Dataset   298

Exercise 7.04: Creating a Five-Fold Cross-Validation Dataset Using a Loop for Calls   301

cross_val_score   304

Exercise 7.05: Getting the Scores from Five-Fold Cross-Validation   305

Understanding Estimators That Implement CV   307

LogisticRegressionCV   308

Exercise 7.06: Training a Logistic Regression Model Using Cross-Validation   308

Hyperparameter Tuning with GridSearchCV   312

Decision Trees   312

Exercise 7.07: Using Grid Search with Cross-Validation to Find the Best Parameters for a Model   317

Hyperparameter Tuning with RandomizedSearchCV   322

Exercise 7.08: Using Randomized Search for Hyperparameter Tuning   322

Model Regularization with Lasso Regression   327

Exercise 7.09: Fixing Model Overfitting Using Lasso Regression   327

Ridge Regression   337

Exercise 7.10: Fixing Model Overfitting Using Ridge Regression   338

Activity 7.01: Find an Optimal Model for Predicting the Critical Temperatures of Superconductors   347

Summary   349

8. Hyperparameter Tuning   351

Introduction   352

What Are Hyperparameters?   352

Difference between Hyperparameters and Statistical Model Parameters   353

Setting Hyperparameters   354

A Note on Defaults   356

Finding the Best Hyperparameterization   356

Exercise 8.01: Manual Hyperparameter Tuning for a k-NN Classifier   357

Advantages and Disadvantages of a Manual Search   360

Tuning Using Grid Search   361

Simple Demonstration of the Grid Search Strategy   361

GridSearchCV   365

Tuning using GridSearchCV   365

Support Vector Machine (SVM) Classifiers 370

Exercise 8.02: Grid Search Hyperparameter Tuning for an SVM   371

Advantages and Disadvantages of Grid Search   375

Random Search   376

Random Variables and Their Distributions   376

Simple Demonstration of the Random Search Process   381

Tuning Using RandomizedSearchCV   387

Exercise 8.03: Random Search Hyperparameter Tuning for a Random Forest Classifier   389

Advantages and Disadvantages of a Random Search   393

Activity 8.01: Is the Mushroom Poisonous?   394

Summary   396

9. Interpreting a Machine Learning Model   399

Introduction   400

Linear Model Coefficients   401

Exercise 9.01: Extracting the Linear Regression Coefficient    403

RandomForest Variable Importance   409

Exercise 9.02: Extracting RandomForest Feature Importance   413

Variable Importance via Permutation   418

Exercise 9.03: Extracting Feature Importance via Permutation   422

Partial Dependence Plots   426

Exercise 9.04: Plotting Partial Dependence   429

Local Interpretation with LIME    432

Exercise 9.05: Local Interpretation with LIME   438

Activity 9.01: Train and Analyze a Network Intrusion Detection Model   441

Summary   443

10. Analyzing a Dataset   445

Introduction   446

Exploring Your Data   447

Analyzing Your Dataset   451

Exercise 10.01: Exploring the Ames Housing Dataset with Descriptive Statistics   454

Analyzing the Content of a Categorical Variable   458

Exercise 10.02: Analyzing the Categorical Variables from the Ames Housing Dataset   459

Summarizing Numerical Variables   462

Exercise 10.03: Analyzing Numerical Variables from the Ames Housing Dataset   466

Visualizing Your Data   469

How to use the Altair API   470

Histogram for Numerical Variables   475

Bar Chart for Categorical Variables    478

Boxplots   481

Exercise 10.04: Visualizing the Ames Housing Dataset with Altair   484

Activity 10.01: Analyzing Churn Data Using Visual Data Analysis Techniques   494

Summary   497

11. Data Preparation   499

Introduction   500

Handling Row Duplication   500

Exercise 11.01: Handling Duplicates in a Breast Cancer Dataset   506

Converting Data Types   509

Exercise 11.02: Converting Data Types for the Ames Housing Dataset   512

Handling Incorrect Values   517

Exercise 11.03: Fixing Incorrect Values in the State Column   520

Handling Missing Values   526

Exercise 11.04: Fixing Missing Values for the Horse Colic Dataset   530

Activity 11.01: Preparing the Speed Dating Dataset   535

Summary   539

12. Feature Engineering   543

Introduction   544

Merging Datasets   544

The left join 548

The right join 549

Exercise 12.01: Merging the ATO Dataset with the Postcode Data   552

Binning Variables   557

Exercise 12.02: Binning the YearBuilt variable from the AMES Housing dataset   560

Manipulating Dates   564

Exercise 12.03: Date Manipulation on Financial Services Consumer Complaints   568

Performing Data Aggregation   573

Exercise 12.04: Feature Engineering Using Data Aggregation on the AMES Housing Dataset   579

Activity 12.01: Feature Engineering on a Financial Dataset   583

Summary   585

13. Imbalanced Datasets   587

Introduction   588

Understanding the Business Context   588

Exercise 13.01: Benchmarking the Logistic Regression Model on the Dataset   589

Analysis of the Result   593

Challenges of Imbalanced Datasets   594

Strategies for Dealing with Imbalanced Datasets   596

Collecting More Data   597

Resampling Data   597

Exercise 13.02: Implementing Random Undersampling and Classification on Our Banking Dataset to Find the Optimal Result   598

Analysis   603

Generating Synthetic Samples   604

Implementation of SMOTE and MSMOTE   605

Exercise 13.03: Implementing SMOTE on Our Banking Dataset to Find the Optimal Result   606

Exercise 13.04: Implementing MSMOTE on Our Banking Dataset to Find the Optimal Result   609

Applying Balancing Techniques on a Telecom Dataset   612

Activity 13.01: Finding the Best Balancing Technique by Fitting a Classifier on the Telecom Churn Dataset   612

Summary   615

14. Dimensionality Reduction   617

Introduction   618

Business Context   619

Exercise 14.01: Loading and Cleaning the Dataset   620

Creating a High-Dimensional Dataset   627

Activity 14.01: Fitting a Logistic Regression Model on a High-Dimensional Dataset   629

Strategies for Addressing High-Dimensional Datasets   632

Backward Feature Elimination (Recursive Feature Elimination)   632

Exercise 14.02: Dimensionality Reduction Using Backward Feature Elimination   633

Forward Feature Selection   640

Exercise 14.03: Dimensionality Reduction Using Forward Feature Selection   640

Principal Component Analysis (PCA)   644

Exercise 14.04: Dimensionality Reduction Using PCA   648

Independent Component Analysis (ICA)   652

Exercise 14.05: Dimensionality Reduction Using Independent Component Analysis   653

Factor Analysis   657

Exercise 14.06: Dimensionality Reduction Using Factor Analysis   657

Comparing Different Dimensionality Reduction Techniques   661

Activity 14.02: Comparison of Dimensionality Reduction Techniques on the Enhanced Ads Dataset   663

Summary   667

15. Ensemble Learning   669

Introduction   670

Ensemble Learning   670

Variance   671

Bias   671

Business Context   672

Exercise 15.01: Loading, Exploring, and Cleaning the Data   672

Activity 15.01: Fitting a Logistic Regression Model on Credit Card Data   678

Simple Methods for Ensemble Learning   679

Averaging   679

Exercise 15.02: Ensemble Model Using the Averaging Technique   680

Weighted Averaging   684

Exercise 15.03: Ensemble Model Using the Weighted Averaging Technique   684

Iteration 2 with Different Weights 687

Max Voting 688

Exercise 15.04: Ensemble Model Using Max Voting   689

Advanced Techniques for Ensemble Learning   692

Bagging 692

Exercise 15.05: Ensemble Learning Using Bagging   694

Boosting   696

Exercise 15.06: Ensemble Learning Using Boosting   696

Stacking   698

Exercise 15.07: Ensemble Learning Using Stacking   700

Activity 15.02: Comparison of Advanced Ensemble Techniques   702

Summary   704

16. Machine Learning Pipelines   707

Introduction   708

Pipelines   708

Business Context   709

Exercise 16.01: Preparing the Dataset to Implement Pipelines   710

Automating ML Workflows Using Pipeline   714

Automating Data Preprocessing Using Pipelines   715

Exercise 16.02: Applying Pipelines for Feature Extraction to the Dataset   717

ML Pipeline with Processing and Dimensionality Reduction   721

Exercise 16.03: Adding Dimensionality Reduction to the Feature Extraction Pipeline   721

ML Pipeline for Modeling and Prediction   723

Exercise 16.04: Modeling and Predictions Using ML Pipelines   724

ML Pipeline for Spot-Checking Multiple Models   726

Exercise 16.05: Spot-Checking Models Using ML Pipelines   726

ML Pipelines for Identifying the Best Parameters for a Model   728

Cross-Validation   729

Grid Search   729

Exercise 16.06: Grid Search and Cross-Validation with ML Pipelines   729

Applying Pipelines to a Dataset   732

Activity 16.01: Complete ML Workflow in a Pipeline   735

Summary   737

17. Automated Feature Engineering   741

Introduction   742

Feature Engineering   743

Automating Feature Engineering Using Feature Tools   743

Business Context   744

Domain Story for the Problem Statement   744

Featuretools – Creating Entities and Relationships   745

Exercise 17.01: Defining Entities and Establishing Relationships   747

Feature Engineering – Basic Operations   752

Featuretools – Automated Feature Engineering   755

Exercise 17.02: Creating New Features Using Deep Feature Synthesis   757

Exercise 17.03: Classification Model after Automated Feature Generation   763

Featuretools on a New Dataset   774

Activity 17.01: Building a Classification Model with Features that have been Generated Using Featuretools   774

Summary   777

Preface

About

This section briefly introduces the coverage of this book, the technical skills you'll need to get started, and the software requirements required to complete all of the included activities and exercises.

About the Book

You already know you want to learn data science, and a smarter way to learn data science is to learn by doing. The Data Science Workshop focuses on building up your practical skills so that you can understand how to develop simple machine learning models in Python, or even build an advanced model for detecting potential bank fraud with effective, modern data science. You'll learn from real examples that lead to real results.

Throughout The Data Science Workshop, you'll take an engaging step-by-step approach to understanding data science. You won't have to sit through any unnecessary theory. If you're short on time, you can jump into a single exercise each day or spend an entire weekend training a model using scikit-learn. It's your choice. Learning on your terms, you'll build up and reinforce key skills in a way that feels rewarding.

Every physical print copy of The Data Science Workshop unlocks access to the interactive edition. With videos detailing all exercises and activities, you'll always have a guided solution. You can also benchmark yourself against assessments, track progress, and receive content updates. You'll even earn a secure credential that you can share and verify online upon completion. It's a premium learning experience that's included with your printed copy. To redeem, follow the instructions located at the start of your data science book.

Fast-paced and direct, The Data Science Workshop is the ideal companion for data science beginners. You'll learn about machine learning algorithms like a data scientist, learning along the way. This process means that you'll find that your new skills stick, embedded as best practice, a solid foundation for the years ahead.

About the Chapters

Chapter 1, Introduction to Data Science in Python, will introduce you to the field of data science and walk you through an overview of Python's core concepts and their application in the world of data science.

Chapter 2, Regression, will acquaint you with linear regression analysis and its application to practical problem solving in data science.

Chapter 3, Binary Classification, will teach you a supervised learning technique called classification to generate business outcomes.

Chapter 4, Multiclass Classification with RandomForest, will show you how to train a multiclass classifier using the Random Forest algorithm.

Chapter 5, Performing Your First Cluster Analysis, will introduce you to unsupervised learning tasks, where algorithms have to automatically learn patterns from data by themselves as no target variables are defined beforehand.

Chapter 6, How to Assess Performance, will teach you to evaluate a model and assess its performance before you decide to put it into production.

Chapter 7, The Generalization of Machine Learning Models, will teach you how to make best use of your data to train better models, by either splitting the data or making use of cross-validation.

Chapter 8, Hyperparameter Tuning, will guide you to find further predictive performance improvements via the systematic evaluation of estimators with different hyperparameters.

Chapter 9, Interpreting a Machine Learning Model, will show you how to interpret a machine learning model's results and get deeper insights into the patterns it found.

Chapter 10, Analyzing a Dataset, will introduce you to the art of performing exploratory data analysis and visualizing the data in order to identify quality issues, potential data transformations, and interesting patterns.

Chapter 11, Data Preparation, will present the main techniques you can use to handle data issues in order to ensure your data is of a high enough quality for successful modeling.

Chapter 12, Feature Engineering, will teach you some of the key techniques for creating new variables on an existing dataset.

Chapter 13, Imbalanced Datasets, will equip you to identify use cases where datasets are likely to be imbalanced, and formulate strategies for dealing with imbalanced datasets.

Chapter 14, Dimensionality Reduction, will show how to analyze datasets with high dimensions and deal with the challenges posed by these datasets.

Chapter 15, Ensemble Learning, will teach you to apply different ensemble learning techniques to your dataset.

Chapter 16, Machine Learning Pipelines, will show how to perform preprocessing, dimensionality reduction, and modeling using the pipeline utility.

Chapter 17, Automated Feature Engineering, will show you how to use automated feature engineering techniques.

Note

You can find the bonus chapter on Model as a Service with Flask and the solution set to the activities at https://packt.live/2sSKX3D.

Conventions

Code words in text, database table names, folder names, filenames, file extensions, path names, dummy URLs, user input, and Twitter handles are shown as follows:

"sklearn has a class called train_test_split, which provides the functionality for splitting data."

Words that you see on the screen, for example, in menus or dialog boxes, also appear in the same format.

A block of code is set as follows:

# import libraries

import pandas as pd

from sklearn.model_selection import train_test_split
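
As an illustrative aside (a minimal sketch, not taken from the book's exercises), the train_test_split utility quoted above could be used as follows to split a small dataset:

import pandas as pd
from sklearn.model_selection import train_test_split

# A toy DataFrame with one feature column and one target column
df = pd.DataFrame({'feature': [1, 2, 3, 4, 5, 6],
                   'target': [0, 1, 0, 1, 0, 1]})

# Hold out 25% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    df[['feature']], df['target'], test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)  # (4, 1) (2, 1)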

New terms and important words are shown like this:

"A dictionary contains multiple elements, like a list, but each element is organized as a key-value pair."

Before You Begin

Every great journey begins with a humble step, and our upcoming adventure with Data Science is no exception. Before we can do awesome things with Data Science, we need to be set up with a productive environment. In this short note, we will see how to do that.

How to Set Up Google Colab

There are many integrated development environments (IDEs) for Python. The most popular one for running Data Science projects is Jupyter Notebook from Anaconda, but this is not the one we are recommending for this book. As you are starting your journey into Data Science, rather than asking you to set up a Python environment from scratch, we think it is better for you to use a plug-and-play solution so that you can fully focus on learning the concepts presented in this book. We want to remove most of the blockers so that you can take this first step into Data Science as seamlessly and as quickly as possible.

Luckily, such a tool does exist, and it is called Google Colab. It is a free tool provided by Google that runs in the cloud, so you don't need to buy a new laptop or computer or upgrade its specs. The other benefit of using Colab is that most of the Python packages we use in this book are already installed, so you can use them straight away. The only thing you need is a Google account. If you don't have one, you can create one here: https://packt.live/37mea5X.

Then, you will need to subscribe to the Colab service:

First, log into Google Drive: https://packt.live/2TM1v8w

Then, go to the following URL: https://packt.live/2NKaAuP

You should see the following screen:

Figure 0.1: Google Colab Introduction page

Then, you can click on NEW PYTHON 3 NOTEBOOK, and you should see a new Colab notebook:

Figure 0.2: New Colab notebook

You have just added Google Colab to your Google account, and you are now ready to write and run your own Python code.

How to Use Google Colab

Now that you have added Google Colab to your account, let's see how to use it. Google Colab is very similar to Jupyter Notebook. It is actually based on Jupyter, but it runs on Google servers and has additional integrations with Google services such as Google Drive.

To open a new Colab notebook, you need to log into your Google Drive account and then click on the + New icon:

Figure 0.3: Option to open new notebook

In the menu displayed, select More and then Google Colaboratory:

Figure 0.4: Option to open Colab notebook from Google Drive

A new Colab notebook will be created.

Figure 0.5: New Colab notebook

A Colab notebook is an interactive IDE where you can run Python code or add text using cells. A cell is a container where you add your lines of code or any text information related to your project. In each cell, you can put as many lines of code or text as you want. A cell can display the output of your code after running it, so it is a very powerful way of testing and checking the results of your work. It is good practice not to overload each cell with tons of code; try to split your code across multiple cells so that you can run them independently and track, step by step, whether your code is working.

Let us now see how we can write some Python code in a cell and run it. A cell is composed of four main parts:

  • The text box where you will write your code
  • The Run button for running your code
  • The options menu that will provide additional functionalities
  • The output display

Figure 0.6: Parts of the Colab notebook cell

In the preceding example, we wrote a simple line of code that adds 2 to 3. Then, we either need to click on the Run button or use the shortcut Ctrl + Enter to run the code. The result will then be displayed below the cell. If your code breaks (that is, when there is an error), the error message will be displayed below the cell:

Figure 0.7: Error message on Google Colab

As you can see, we tried to add an integer to a string, which is not possible because their data types are not compatible, and this is exactly what the error message is telling us.
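
As a minimal sketch of what the two cells might contain (the exact code shown in the screenshots is not reproduced here), assuming each snippet is run in its own cell:

# Cell 1: a simple addition; the value 5 is displayed below the cell
2 + 3

# Cell 2: adding an integer to a string raises a TypeError,
# and the error message is displayed below the cell
2 + '3'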

To add a new cell, you just need to click on either the + Code or + Text button on the options bar at the top:

Figure 0.8: New cell button

If you add a new Text cell, you have access to specific options for editing your text, such as bold, italics, hyperlinks, and so on:

Figure 0.9: Different options on cell

This type of cell is actually Markdown compatible, so you can easily create titles, subtitles, bullet points, and so on. Here is a link for learning more about the Markdown options: https://packt.live/2NVgVDT.

With the cell options menu, you can delete a cell or move it up or down in the notebook:

Figure 0.10: Cell options

If you need to install a specific Python package that is not available in Google Colab, you just need to run a cell with the following syntax:

!pip install <package_name>

Note

The '!' prefix is a special command that lets you run shell commands from within a notebook cell.

Figure 0.11: Using "!" command
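
For example, if a package such as featuretools (used in Chapter 17) happened not to be preinstalled in your Colab runtime, you could install it by running the following in a cell:

!pip install featuretools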

You have just learned the main functionalities provided by Google Colab for running Python code. There are many more features available, but you now know enough to work through the lessons and contents of this book.

Installing the Code Bundle

Download the code files from GitHub at https://packt.live/2ucwsId. Refer to these code files for the complete code bundle.

If you have any issues or questions regarding installation, please email us at [email protected].

The high-quality color images used in this book can be found at https://packt.live/30O91Bd.

1. Introduction to Data Science in Python