The Data Science Workshop - Anthony So - E-Book

The Data Science Workshop E-Book

Anthony So

Description

Cut through the noise and get real results with a step-by-step approach to data science




Key Features



  • Ideal for the data science beginner who is just getting started


  • A data science tutorial with step-by-step exercises and activities that help build key skills


  • Structured to let you progress at your own pace, on your own terms


  • Use your physical print copy to redeem free access to the online interactive edition



Book Description



You already know you want to learn data science, and a smarter way to learn data science is to learn by doing. The Data Science Workshop focuses on building up your practical skills so that you can understand how to develop simple machine learning models in Python, or even build an advanced model for detecting potential bank fraud with effective, modern data science. You'll learn from real examples that lead to real results.







Throughout The Data Science Workshop, you'll take an engaging step-by-step approach to understanding data science. You won't have to sit through any unnecessary theory. If you're short on time, you can jump into a single exercise each day or spend an entire weekend training a model using scikit-learn. It's your choice. Learning on your terms, you'll build up and reinforce key skills in a way that feels rewarding.







Every physical print copy of The Data Science Workshop unlocks access to the interactive edition. With videos detailing all exercises and activities, you'll always have a guided solution. You can also benchmark yourself against assessments, track progress, and receive content updates. You'll even earn a secure credential that you can share and verify online upon completion. It's a premium learning experience that's included with your printed copy. To redeem, follow the instructions located at the start of your data science book.







Fast-paced and direct, The Data Science Workshop is the ideal companion for data science beginners. You'll learn about machine learning algorithms the way a data scientist does, picking them up as you go. This process means that your new skills will stick, embedded as best practice: a solid foundation for the years ahead.




What you will learn



  • Find out the key differences between supervised and unsupervised learning


  • Manipulate and analyze data using scikit-learn and pandas libraries


  • Learn about different algorithms such as regression, classification, and clustering


  • Discover advanced techniques, such as ensembling, to improve model accuracy


  • Speed up the process of creating new features with automated feature engineering tools such as Featuretools


  • Simplify machine learning using open source Python packages



Who this book is for



Our goal at Packt is to help you be successful, in whatever it is you choose to do. The Data Science Workshop is an ideal data science tutorial for the data science beginner who is just getting started. Pick up a Workshop today and let Packt help you develop skills that stick with you for life.

You can read this e-book in Legimi apps or in any other app that supports the following format:

EPUB

Page count: 790

Year of publication: 2020




The Data Science Workshop

A New, Interactive Approach to Learning Data Science

Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, and Dr. Samuel Asare

The Data Science Workshop

Copyright © 2020 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, and Dr. Samuel Asare

Technical Reviewers: Tianxiang Liu, Tiffany Ford, and Pritesh Tiwari

Managing Editors: Adrian Cardoza and Snehal Tambe

Acquisitions Editor: Sarah Lawton

Production Editor: Salma Patel

Editorial Board: Shubhopriya Banerjee, Bharat Botle, Ewan Buckingham, Megan Carlisle, Mahesh Dhyani, Manasa Kumar, Alex Mazonowicz, Bridget Neale, Dominic Pereira, Shiny Poojary, Abhishek Rane, Brendan Rodrigues, Erol Staveley, Ankita Thakur, Nitesh Thakur, and Jonathan Wray

First Published: January 2020

Production Reference: 1280120

ISBN: 978-1-83898-126-6

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Table of Contents

Preface   i

1. Introduction to Data Science in Python   1

Introduction   2

Application of Data Science   2

What Is Machine Learning?   3

Supervised Learning 4

Unsupervised Learning 5

Reinforcement Learning 6

Overview of Python   6

Types of Variable   6

Numeric Variables 6

Text Variables 7

Python List 8

Python Dictionary 10

Exercise 1.01: Creating a Dictionary That Will Contain Machine Learning Algorithms   13

Python for Data Science   16

The pandas Package   16

DataFrame and Series 17

CSV Files 18

Excel Spreadsheets 20

JSON 20

Exercise 1.02: Loading Data of Different Formats into a pandas DataFrame   22

Scikit-Learn   25

What Is a Model? 25

Model Hyperparameters 28

The sklearn API 28

Exercise 1.03: Predicting Breast Cancer from a Dataset Using sklearn   31

Activity 1.01: Train a Spam Detector Algorithm   35

Summary   36

2. Regression   39

Introduction   40

Simple Linear Regression   42

The Method of Least Squares   43

Multiple Linear Regression   44

Estimating the Regression Coefficients (β0, β1, β2 and β3)   45

Logarithmic Transformations of Variables   45

Correlation Matrices   45

Conducting Regression Analysis Using Python   45

Exercise 2.01: Loading and Preparing the Data for Analysis   47

The Correlation Coefficient   54

Exercise 2.02: Graphical Investigation of Linear Relationships Using Python   55

Exercise 2.03: Examining a Possible Log-Linear Relationship Using Python   58

The Statsmodels formula API   59

Exercise 2.04: Fitting a Simple Linear Regression Model Using the Statsmodels formula API   60

Analyzing the Model Summary   61

The Model Formula Language   62

Intercept Handling   64

Activity 2.01: Fitting a Log-Linear Model Using the Statsmodels formula API   64

Multiple Regression Analysis   66

Exercise 2.05: Fitting a Multiple Linear Regression Model Using the Statsmodels formula API   66

Assumptions of Regression Analysis   68

Activity 2.02: Fitting a Multiple Log-Linear Regression Model   69

Explaining the Results of Regression Analysis   70

Regression Analysis Checks and Balances   72

The F-test   73

The t-test   74

Summary   74

3. Binary Classification   77

Introduction   78

Understanding the Business Context   79

Business Discovery   79

Exercise 3.01: Loading and Exploring the Data from the Dataset   80

Testing Business Hypotheses Using Exploratory Data Analysis   82

Visualization for Exploratory Data Analysis   83

Exercise 3.02: Business Hypothesis Testing for Age versus Propensity for a Term Loan   87

Intuitions from the Exploratory Analysis   91

Activity 3.01: Business Hypothesis Testing to Find Employment Status versus Propensity for Term Deposits   92

Feature Engineering    94

Business-Driven Feature Engineering   94

Exercise 3.03: Feature Engineering – Exploration of Individual Features   95

Exercise 3.04: Feature Engineering – Creating New Features from Existing Ones   100

Data-Driven Feature Engineering   106

A Quick Peek at Data Types and a Descriptive Summary   106

Correlation Matrix and Visualization   108

Exercise 3.05: Finding the Correlation in Data to Generate a Correlation Plot Using Bank Data   108

Skewness of Data   111

Histograms   112

Density Plots   113

Other Feature Engineering Methods   114

Summarizing Feature Engineering   116

Building a Binary Classification Model Using the Logistic Regression Function   117

Logistic Regression Demystified   119

Metrics for Evaluating Model Performance   120

Confusion Matrix   121

Accuracy   122

Classification Report   122

Data Preprocessing   123

Exercise 3.06: A Logistic Regression Model for Predicting the Propensity of Term Deposit Purchases in a Bank   124

Activity 3.02: Model Iteration 2 – Logistic Regression Model with Feature Engineered Variables   129

Next Steps   130

Summary   132

4. Multiclass Classification with RandomForest   135

Introduction   136

Training a Random Forest Classifier   136

Evaluating the Model's Performance   140

Exercise 4.01: Building a Model for Classifying Animal Type and Assessing Its Performance   142

Number of Trees Estimator   146

Exercise 4.02: Tuning n_estimators to Reduce Overfitting   149

Maximum Depth   152

Exercise 4.03: Tuning max_depth to Reduce Overfitting   154

Minimum Sample in Leaf   157

Exercise 4.04: Tuning min_samples_leaf   159

Maximum Features   162

Exercise 4.05: Tuning max_features   165

Activity 4.01: Train a Random Forest Classifier on the ISOLET Dataset   168

Summary   169

5. Performing Your First Cluster Analysis   173

Introduction   174

Clustering with k-means   175

Exercise 5.01: Performing Your First Clustering Analysis on the ATO Dataset   177

Interpreting k-means Results   181

Exercise 5.02: Clustering Australian Postcodes by Business Income and Expenses   186

Choosing the Number of Clusters   191

Exercise 5.03: Finding the Optimal Number of Clusters   195

Initializing Clusters   200

Exercise 5.04: Using Different Initialization Parameters to Achieve a Suitable Outcome   203

Calculating the Distance to the Centroid   208

Exercise 5.05: Finding the Closest Centroids in Our Dataset   212

Standardizing Data   219

Exercise 5.06: Standardizing the Data from Our Dataset   223

Activity 5.01: Perform Customer Segmentation Analysis in a Bank Using k-means   228

Summary   230

6. How to Assess Performance   233

Introduction   234

Splitting Data   234

Exercise 6.01: Importing and Splitting Data   235

Assessing Model Performance for Regression Models   239

Data Structures – Vectors and Matrices   240

Scalars 240

Vectors 241

Matrices 242

R2 Score   244

Exercise 6.02: Computing the R2 Score of a Linear Regression Model   245

Mean Absolute Error   249

Exercise 6.03: Computing the MAE of a Model   249

Exercise 6.04: Computing the Mean Absolute Error of a Second Model   252

Other Evaluation Metrics 256

Assessing Model Performance for Classification Models   257

Exercise 6.05: Creating a Classification Model for Computing Evaluation Metrics   257

The Confusion Matrix   261

Exercise 6.06: Generating a Confusion Matrix for the Classification Model   261

More on the Confusion Matrix 262

Precision   263

Exercise 6.07: Computing Precision for the Classification Model   264

Recall   265

Exercise 6.08: Computing Recall for the Classification Model   265

F1 Score   266

Exercise 6.09: Computing the F1 Score for the Classification Model   266

Accuracy   267

Exercise 6.10: Computing Model Accuracy for the Classification Model   267

Logarithmic Loss   268

Exercise 6.11: Computing the Log Loss for the Classification Model   268

Receiver Operating Characteristic Curve   269

Exercise 6.12: Computing and Plotting ROC Curve for a Binary Classification Problem   269

Area Under the ROC Curve   275

Exercise 6.13: Computing the ROC AUC for the Caesarian Dataset   276

Saving and Loading Models   277

Exercise 6.14: Saving and Loading a Model   277

Activity 6.01: Train Three Different Models and Use Evaluation Metrics to Pick the Best Performing Model   280

Summary   282

7. The Generalization of Machine Learning Models   285

Introduction   286

Overfitting   286

Training on Too Many Features   286

Training for Too Long   287

Underfitting   287

Data   288

The Ratio for Dataset Splits   288

Creating Dataset Splits   289

Exercise 7.01: Importing and Splitting Data   290

Random State   294

Exercise 7.02: Setting a Random State When Splitting Data   296

Cross-Validation   297

KFold   298

Exercise 7.03: Creating a Five-Fold Cross-Validation Dataset   298

Exercise 7.04: Creating a Five-Fold Cross-Validation Dataset Using a Loop for Calls   301

cross_val_score   304

Exercise 7.05: Getting the Scores from Five-Fold Cross-Validation   305

Understanding Estimators That Implement CV   307

LogisticRegressionCV   308

Exercise 7.06: Training a Logistic Regression Model Using Cross-Validation   308

Hyperparameter Tuning with GridSearchCV   312

Decision Trees   312

Exercise 7.07: Using Grid Search with Cross-Validation to Find the Best Parameters for a Model   317

Hyperparameter Tuning with RandomizedSearchCV   322

Exercise 7.08: Using Randomized Search for Hyperparameter Tuning   322

Model Regularization with Lasso Regression   327

Exercise 7.09: Fixing Model Overfitting Using Lasso Regression   327

Ridge Regression   337

Exercise 7.10: Fixing Model Overfitting Using Ridge Regression   338

Activity 7.01: Find an Optimal Model for Predicting the Critical Temperatures of Superconductors   347

Summary   349

8. Hyperparameter Tuning   351

Introduction   352

What Are Hyperparameters?   352

Difference between Hyperparameters and Statistical Model Parameters   353

Setting Hyperparameters   354

A Note on Defaults   356

Finding the Best Hyperparameterization   356

Exercise 8.01: Manual Hyperparameter Tuning for a k-NN Classifier   357

Advantages and Disadvantages of a Manual Search   360

Tuning Using Grid Search   361

Simple Demonstration of the Grid Search Strategy   361

GridSearchCV   365

Tuning using GridSearchCV   365

Support Vector Machine (SVM) Classifiers 370

Exercise 8.02: Grid Search Hyperparameter Tuning for an SVM   371

Advantages and Disadvantages of Grid Search   375

Random Search   376

Random Variables and Their Distributions   376

Simple Demonstration of the Random Search Process   381

Tuning Using RandomizedSearchCV   387

Exercise 8.03: Random Search Hyperparameter Tuning for a Random Forest Classifier   389

Advantages and Disadvantages of a Random Search   393

Activity 8.01: Is the Mushroom Poisonous?   394

Summary   396

9. Interpreting a Machine Learning Model   399

Introduction   400

Linear Model Coefficients   401

Exercise 9.01: Extracting the Linear Regression Coefficient    403

RandomForest Variable Importance   409

Exercise 9.02: Extracting RandomForest Feature Importance   413

Variable Importance via Permutation   418

Exercise 9.03: Extracting Feature Importance via Permutation   422

Partial Dependence Plots   426

Exercise 9.04: Plotting Partial Dependence   429

Local Interpretation with LIME    432

Exercise 9.05: Local Interpretation with LIME   438

Activity 9.01: Train and Analyze a Network Intrusion Detection Model   441

Summary   443

10. Analyzing a Dataset   445

Introduction   446

Exploring Your Data   447

Analyzing Your Dataset   451

Exercise 10.01: Exploring the Ames Housing Dataset with Descriptive Statistics   454

Analyzing the Content of a Categorical Variable   458

Exercise 10.02: Analyzing the Categorical Variables from the Ames Housing Dataset   459

Summarizing Numerical Variables   462

Exercise 10.03: Analyzing Numerical Variables from the Ames Housing Dataset   466

Visualizing Your Data   469

How to use the Altair API   470

Histogram for Numerical Variables   475

Bar Chart for Categorical Variables    478

Boxplots   481

Exercise 10.04: Visualizing the Ames Housing Dataset with Altair   484

Activity 10.01: Analyzing Churn Data Using Visual Data Analysis Techniques   494

Summary   497

11. Data Preparation   499

Introduction   500

Handling Row Duplication   500

Exercise 11.01: Handling Duplicates in a Breast Cancer Dataset   506

Converting Data Types   509

Exercise 11.02: Converting Data Types for the Ames Housing Dataset   512

Handling Incorrect Values   517

Exercise 11.03: Fixing Incorrect Values in the State Column   520

Handling Missing Values   526

Exercise 11.04: Fixing Missing Values for the Horse Colic Dataset   530

Activity 11.01: Preparing the Speed Dating Dataset   535

Summary   539

12. Feature Engineering   543

Introduction   544

Merging Datasets   544

The left join 548

The right join 549

Exercise 12.01: Merging the ATO Dataset with the Postcode Data   552

Binning Variables   557

Exercise 12.02: Binning the YearBuilt variable from the AMES Housing dataset   560

Manipulating Dates   564

Exercise 12.03: Date Manipulation on Financial Services Consumer Complaints   568

Performing Data Aggregation   573

Exercise 12.04: Feature Engineering Using Data Aggregation on the AMES Housing Dataset   579

Activity 12.01: Feature Engineering on a Financial Dataset   583

Summary   585

13. Imbalanced Datasets   587

Introduction   588

Understanding the Business Context   588

Exercise 13.01: Benchmarking the Logistic Regression Model on the Dataset   589

Analysis of the Result   593

Challenges of Imbalanced Datasets   594

Strategies for Dealing with Imbalanced Datasets   596

Collecting More Data   597

Resampling Data   597

Exercise 13.02: Implementing Random Undersampling and Classification on Our Banking Dataset to Find the Optimal Result   598

Analysis   603

Generating Synthetic Samples   604

Implementation of SMOTE and MSMOTE   605

Exercise 13.03: Implementing SMOTE on Our Banking Dataset to Find the Optimal Result   606

Exercise 13.04: Implementing MSMOTE on Our Banking Dataset to Find the Optimal Result   609

Applying Balancing Techniques on a Telecom Dataset   612

Activity 13.01: Finding the Best Balancing Technique by Fitting a Classifier on the Telecom Churn Dataset   612

Summary   615

14. Dimensionality Reduction   617

Introduction   618

Business Context   619

Exercise 14.01: Loading and Cleaning the Dataset   620

Creating a High-Dimensional Dataset   627

Activity 14.01: Fitting a Logistic Regression Model on a High-Dimensional Dataset   629

Strategies for Addressing High-Dimensional Datasets   632

Backward Feature Elimination (Recursive Feature Elimination)   632

Exercise 14.02: Dimensionality Reduction Using Backward Feature Elimination   633

Forward Feature Selection   640

Exercise 14.03: Dimensionality Reduction Using Forward Feature Selection   640

Principal Component Analysis (PCA)   644

Exercise 14.04: Dimensionality Reduction Using PCA   648

Independent Component Analysis (ICA)   652

Exercise 14.05: Dimensionality Reduction Using Independent Component Analysis   653

Factor Analysis   657

Exercise 14.06: Dimensionality Reduction Using Factor Analysis   657

Comparing Different Dimensionality Reduction Techniques   661

Activity 14.02: Comparison of Dimensionality Reduction Techniques on the Enhanced Ads Dataset   663

Summary   667

15. Ensemble Learning   669

Introduction   670

Ensemble Learning   670

Variance   671

Bias   671

Business Context   672

Exercise 15.01: Loading, Exploring, and Cleaning the Data   672

Activity 15.01: Fitting a Logistic Regression Model on Credit Card Data   678

Simple Methods for Ensemble Learning   679

Averaging   679

Exercise 15.02: Ensemble Model Using the Averaging Technique   680

Weighted Averaging   684

Exercise 15.03: Ensemble Model Using the Weighted Averaging Technique   684

Iteration 2 with Different Weights 687

Max Voting 688

Exercise 15.04: Ensemble Model Using Max Voting   689

Advanced Techniques for Ensemble Learning   692

Bagging 692

Exercise 15.05: Ensemble Learning Using Bagging   694

Boosting   696

Exercise 15.06: Ensemble Learning Using Boosting   696

Stacking   698

Exercise 15.07: Ensemble Learning Using Stacking   700

Activity 15.02: Comparison of Advanced Ensemble Techniques   702

Summary   704

16. Machine Learning Pipelines   707

Introduction   708

Pipelines   708

Business Context   709

Exercise 16.01: Preparing the Dataset to Implement Pipelines   710

Automating ML Workflows Using Pipeline   714

Automating Data Preprocessing Using Pipelines   715

Exercise 16.02: Applying Pipelines for Feature Extraction to the Dataset   717

ML Pipeline with Processing and Dimensionality Reduction   721

Exercise 16.03: Adding Dimensionality Reduction to the Feature Extraction Pipeline   721

ML Pipeline for Modeling and Prediction   723

Exercise 16.04: Modeling and Predictions Using ML Pipelines   724

ML Pipeline for Spot-Checking Multiple Models   726

Exercise 16.05: Spot-Checking Models Using ML Pipelines   726

ML Pipelines for Identifying the Best Parameters for a Model   728

Cross-Validation   729

Grid Search   729

Exercise 16.06: Grid Search and Cross-Validation with ML Pipelines   729

Applying Pipelines to a Dataset   732

Activity 16.01: Complete ML Workflow in a Pipeline   735

Summary   737

17. Automated Feature Engineering   741

Introduction   742

Feature Engineering   743

Automating Feature Engineering Using Feature Tools   743

Business Context   744

Domain Story for the Problem Statement   744

Featuretools – Creating Entities and Relationships   745

Exercise 17.01: Defining Entities and Establishing Relationships   747

Feature Engineering – Basic Operations   752

Featuretools – Automated Feature Engineering   755

Exercise 17.02: Creating New Features Using Deep Feature Synthesis   757

Exercise 17.03: Classification Model after Automated Feature Generation   763

Featuretools on a New Dataset   774

Activity 17.01: Building a Classification Model with Features that have been Generated Using Featuretools   774

Summary   777

Preface

About

This section briefly introduces the coverage of this book, the technical skills you'll need to get started, and the software requirements required to complete all of the included activities and exercises.

About the Book

You already know you want to learn data science, and a smarter way to learn data science is to learn by doing. The Data Science Workshop focuses on building up your practical skills so that you can understand how to develop simple machine learning models in Python, or even build an advanced model for detecting potential bank fraud with effective, modern data science. You'll learn from real examples that lead to real results.

Throughout The Data Science Workshop, you'll take an engaging step-by-step approach to understanding data science. You won't have to sit through any unnecessary theory. If you're short on time, you can jump into a single exercise each day or spend an entire weekend training a model using scikit-learn. It's your choice. Learning on your terms, you'll build up and reinforce key skills in a way that feels rewarding.

Every physical print copy of The Data Science Workshop unlocks access to the interactive edition. With videos detailing all exercises and activities, you'll always have a guided solution. You can also benchmark yourself against assessments, track progress, and receive content updates. You'll even earn a secure credential that you can share and verify online upon completion. It's a premium learning experience that's included with your printed copy. To redeem, follow the instructions located at the start of your data science book.

Fast-paced and direct, The Data Science Workshop is the ideal companion for data science beginners. You'll learn about machine learning algorithms like a data scientist, learning along the way. This process means that you'll find that your new skills stick, embedded as best practice, a solid foundation for the years ahead.

About the Chapters

Chapter 1, Introduction to Data Science in Python, will introduce you to the field of data science and walk you through an overview of Python's core concepts and their application in the world of data science.

Chapter 2, Regression, will acquaint you with linear regression analysis and its application to practical problem solving in data science.

Chapter 3, Binary Classification, will teach you a supervised learning technique called classification to generate business outcomes.

Chapter 4, Multiclass Classification with RandomForest, will show you how to train a multiclass classifier using the Random Forest algorithm.

Chapter 5, Performing Your First Cluster Analysis, will introduce you to unsupervised learning tasks, where algorithms have to automatically learn patterns from data by themselves as no target variables are defined beforehand.

Chapter 6, How to Assess Performance, will teach you to evaluate a model and assess its performance before you decide to put it into production.

Chapter 7, The Generalization of Machine Learning Models, will teach you how to make best use of your data to train better models, by either splitting the data or making use of cross-validation.

Chapter 8, Hyperparameter Tuning, will guide you to find further predictive performance improvements via the systematic evaluation of estimators with different hyperparameters.

Chapter 9, Interpreting a Machine Learning Model, will show you how to interpret a machine learning model's results and get deeper insights into the patterns it found.

Chapter 10, Analyzing a Dataset, will introduce you to the art of performing exploratory data analysis and visualizing the data in order to identify quality issues, potential data transformations, and interesting patterns.

Chapter 11, Data Preparation, will present the main techniques you can use to handle data issues in order to ensure your data is of a high enough quality for successful modeling.

Chapter 12, Feature Engineering, will teach you some of the key techniques for creating new variables on an existing dataset.

Chapter 13, Imbalanced Datasets, will equip you to identify use cases where datasets are likely to be imbalanced, and formulate strategies for dealing with imbalanced datasets.

Chapter 14, Dimensionality Reduction, will show how to analyze datasets with high dimensions and deal with the challenges posed by these datasets.

Chapter 15, Ensemble Learning, will teach you to apply different ensemble learning techniques to your dataset.

Chapter 16, Machine Learning Pipelines, will show how to perform preprocessing, dimensionality reduction, and modeling using the pipeline utility.

Chapter 17, Automated Feature Engineering, will show you how to use automated feature engineering techniques.

Note

You can find the bonus chapter on Model as a Service with Flask and the solution set to the activities at https://packt.live/2sSKX3D.

Conventions

Code words in text, database table names, folder names, filenames, file extensions, path names, dummy URLs, user input, and Twitter handles are shown as follows:

"sklearn has a class called train_test_split, which provides the functionality for splitting data."

Words that you see on the screen, for example, in menus or dialog boxes, also appear in the same format.

A block of code is set as follows:

# import libraries

import pandas as pd

from sklearn.model_selection import train_test_split
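
As an illustrative aside (a minimal sketch, not taken from the book's exercises), the train_test_split utility quoted above could be used as follows to split a small dataset:

import pandas as pd
from sklearn.model_selection import train_test_split

# A toy DataFrame with one feature column and one target column
df = pd.DataFrame({'feature': [1, 2, 3, 4, 5, 6],
                   'target': [0, 1, 0, 1, 0, 1]})

# Hold out 25% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    df[['feature']], df['target'], test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)  # (4, 1) (2, 1)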

New terms and important words are shown like this:

"A dictionary contains multiple elements, like a list, but each element is organized as a key-value pair."

Before You Begin

Every great journey begins with a humble step, and our upcoming adventure with Data Science is no exception. Before we can do awesome things with Data Science, we need to be set up with a productive environment. In this short note, we will see how to do that.

How to Set Up Google Colab

There are many integrated development environments (IDEs) for Python. The most popular one for running Data Science projects is Jupyter Notebook from Anaconda, but this is not the one we are recommending for this book. As you are starting your journey into Data Science, rather than asking you to set up a Python environment from scratch, we think it is better for you to use a plug-and-play solution so that you can fully focus on learning the concepts presented in this book. We want to remove most of the blockers so that you can take this first step into Data Science as seamlessly and as quickly as possible.

Luckily, such a tool does exist, and it is called Google Colab. It is a free tool provided by Google that runs in the cloud, so you don't need to buy a new laptop or computer or upgrade its specs. The other benefit of using Colab is that most of the Python packages we use in this book are already installed, so you can use them straight away. The only thing you need is a Google account. If you don't have one, you can create one here: https://packt.live/37mea5X.

Then, you will need to subscribe to the Colab service:

First, log into Google Drive: https://packt.live/2TM1v8w

Then, go to the following URL: https://packt.live/2NKaAuP

You should see the following screen:

Figure 0.1: Google Colab Introduction page

Then, you can click on NEW PYTHON 3 NOTEBOOK, and you should see a new Colab notebook:

Figure 0.2: New Colab notebook

You have just added Google Colab to your Google account, and you are now ready to write and run your own Python code.

How to Use Google Colab

Now that you have added Google Colab to your account, let's see how to use it. Google Colab is very similar to Jupyter Notebook. It is actually based on Jupyter, but it runs on Google servers and has additional integrations with Google services such as Google Drive.

To open a new Colab notebook, you need to log into your Google Drive account and then click on the + New icon:

Figure 0.3: Option to open new notebook

In the menu displayed, select More and then Google Colaboratory:

Figure 0.4: Option to open Colab notebook from Google Drive

A new Colab notebook will be created.

Figure 0.5: New Colab notebook

A Colab notebook is an interactive IDE where you can run Python code or add text using cells. A cell is a container where you add your lines of code or any text information related to your project. In each cell, you can put as many lines of code or text as you want. A cell can display the output of your code after running it, so it is a very powerful way of testing and checking the results of your work. It is good practice not to overload each cell with tons of code; try to split your code across multiple cells so that you can run them independently and track, step by step, whether your code is working.

Let us now see how we can write some Python code in a cell and run it. A cell is composed of four main parts:

  • The text box where you will write your code
  • The Run button for running your code
  • The options menu that will provide additional functionalities
  • The output display

Figure 0.6: Parts of the Colab notebook cell

In the preceding example, we wrote a simple line of code that adds 2 to 3. Then, we either need to click on the Run button or use the shortcut Ctrl + Enter to run the code. The result will then be displayed below the cell. If your code breaks (that is, when there is an error), the error message will be displayed below the cell:

Figure 0.7: Error message on Google Colab

As you can see, we tried to add an integer to a string, which is not possible because their data types are not compatible, and this is exactly what the error message is telling us.
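
As a minimal sketch of what the two cells might contain (the exact code shown in the screenshots is not reproduced here), assuming each snippet is run in its own cell:

# Cell 1: a simple addition; the value 5 is displayed below the cell
2 + 3

# Cell 2: adding an integer to a string raises a TypeError,
# and the error message is displayed below the cell
2 + '3'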

To add a new cell, you just need to click on either the + Code or + Text button on the options bar at the top:

Figure 0.8: New cell button

If you add a new Text cell, you have access to specific options for editing your text, such as bold, italics, hyperlinks, and so on:

Figure 0.9: Different options on cell

This type of cell is actually Markdown compatible, so you can easily create titles, subtitles, bullet points, and so on. Here is a link for learning more about the Markdown options: https://packt.live/2NVgVDT.

With the cell options menu, you can delete a cell or move it up or down in the notebook:

Figure 0.10: Cell options

If you need to install a specific Python package that is not available in Google Colab, you just need to run a cell with the following syntax:

!pip install <package_name>

Note

The '!' prefix is a special command that lets you run shell commands from within a notebook cell.

Figure 0.11: Using "!" command
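
For example, if a package such as featuretools (used in Chapter 17) happened not to be preinstalled in your Colab runtime, you could install it by running the following in a cell:

!pip install featuretools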

You have just learned the main functionalities provided by Google Colab for running Python code. There are many more features available, but you now know enough to work through the lessons and contents of this book.

Installing the Code Bundle

Download the code files from GitHub at https://packt.live/2ucwsId. Refer to these code files for the complete code bundle.

If you have any issues or questions regarding installation, please email us at [email protected].

The high-quality color images used in this book can be found at https://packt.live/30O91Bd.

1. Introduction to Data Science in Python