Applied Data Science with Python and Jupyter E-Book

Alex Galea

Publisher: Packt Publishing
Category: Lifestyle
Language: English

Description

Getting started with data science doesn't have to be an uphill battle. Applied Data Science with Python and Jupyter is a step-by-step guide ideal for beginners who know a little Python and are looking for a fast-paced introduction to these concepts. In this book, you'll learn every aspect of the standard data workflow process, including collecting, cleaning, investigating, visualizing, and modeling data. You'll start with the basics of Jupyter, which will be the backbone of the book. After familiarizing yourself with its standard features, you'll see it in practice in a first analysis. In the next lesson, you'll dive right into predictive analytics, where multiple classification algorithms are implemented. Finally, the book ends by looking at data collection techniques. You'll see how web data can be acquired with scraping techniques and via APIs, and then briefly explore interactive visualizations.

Details

You can read the e-book in Legimi apps or in any app that supports the following formats:

EPUB

MOBI

Page count: 173

Publication year: 2018

Sample

Applied Data Science with Python and Jupyter

Use powerful industry-standard tools to unlock new, actionable insights from your data

Alex Galea

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Author: Alex Galea

Reviewer: Elie Kawerk

Managing Editor: Mahesh Dhyani

Acquisitions Editor: Aditya Date

Production Editor: Samita Warang

Editorial Board: David Barnes, Ewan Buckingham, Simon Cox, Manasa Kumar, Alex Mazonowicz, Douglas Paterson, Dominic Pereira, Shiny Poojary, Saman Siddiqui, Erol Staveley, Ankita Thakur, and Mohita Vyas

First Published: October 2018

Production Reference: 2051218

ISBN: 978-1-78995-817-1

Preface

Jupyter Fundamentals

Introduction

Basic Functionality and Features

What is a Jupyter Notebook and Why is it Useful?

Navigating the Platform

Exercise 1: Introducing Jupyter Notebooks

Jupyter Features

Exercise 2: Implementing Jupyter's Most Useful Features

Converting a Jupyter Notebook to a Python Script

Python Libraries

Exercise 3: Importing the External Libraries and Setting Up the Plotting Environment

Our First Analysis - The Boston Housing Dataset

Loading the Data into Jupyter Using a Pandas DataFrame

Exercise 4: Loading the Boston Housing Dataset

Data Exploration

Exercise 5: Analyzing the Boston Housing Dataset

Introduction to Predictive Analytics with Jupyter Notebooks

Exercise 6: Applying Linear Models With Seaborn and Scikit-learn

Activity 1: Building a Third-Order Polynomial Model

Using Categorical Features for Segmentation Analysis

Exercise 7: Creating Categorical Fields From Continuous Variables and Make Segmented Visualizations

Summary

Data Cleaning and Advanced Machine Learning

Introduction

Preparing to Train a Predictive Model

Determining a Plan for Predictive Analytics

Exercise 8: Explore Data Preprocessing Tools and Methods

Activity 2: Preparing to Train a Predictive Model for the Employee-Retention Problem

Training Classification Models

Introduction to Classification Algorithms

Exercise 9: Training Two-Feature Classification Models With Scikit-learn

The plot_decision_regions Function

Exercise 10: Training K-nearest Neighbors for Our Model

Exercise 11: Training a Random Forest

Assessing Models With K-fold Cross-Validation and Validation Curves

Exercise 12: Using K-fold Cross Validation and Validation Curves in Python With Scikit-learn

Dimensionality Reduction Techniques

Exercise 13: Training a Predictive Model for the Employee Retention Problem

Summary

Web Scraping and Interactive Visualizations

Introduction

Scraping Web Page Data

Introduction to HTTP Requests

Making HTTP Requests in the Jupyter Notebook

Exercise 14: Handling HTTP Requests With Python in a Jupyter Notebook

Parsing HTML in the Jupyter Notebook

Exercise 15: Parsing HTML With Python in a Jupyter Notebook

Activity 3: Web Scraping With Jupyter Notebooks

Interactive Visualizations

Building a DataFrame to Store and Organize Data

Exercise 16: Building and Merging Pandas DataFrames

Introduction to Bokeh

Exercise 17: Introduction to Interactive Visualization With Bokeh

Activity 4: Exploring Data with Interactive Visualizations

Summary

Appendix A

Preface

About

This section briefly introduces the author, the coverage of this book, the technical skills you'll need to get started, and the hardware and software required to complete all of the included activities and exercises.

Jupyter Fundamentals

Learning Objectives

By the end of this chapter, you will be able to:

Describe Jupyter Notebooks and how they are used for data analysis

Describe the features of Jupyter Notebooks

Use Python data science libraries

Perform simple exploratory data analysis

In this chapter, you will learn and implement the fundamental features of the Jupyter Notebook by completing several hands-on exercises.

Introduction

Jupyter Notebooks are one of the most important tools for data scientists using Python. This is because they're an ideal environment for developing reproducible data analysis pipelines. Data can be loaded, transformed, and modeled all inside a single Notebook, where it's quick and easy to test out code and explore ideas along the way. Furthermore, all of this can be documented "inline" using formatted text, so you can make notes for yourself or even produce a structured report.

Other comparable platforms - for example, RStudio or Spyder - present the user with multiple windows, which promote arduous tasks such as copying and pasting code around and rerunning code that has already been executed. These tools also tend to involve Read-Eval-Print Loops (REPLs), where code is run in a terminal session that has saved memory. This type of development environment is bad for reproducibility and not ideal for development either. Jupyter Notebooks solve all these issues by giving the user a single window where code snippets are executed and outputs are displayed inline. This lets users develop code efficiently and allows them to look back at previous work for reference, or even to make alterations.

We'll start the chapter by explaining exactly what Jupyter Notebooks are and continue to discuss why they are so popular among data scientists. Then, we'll open a Notebook together and go through some exercises to learn how the platform is used. Finally, we'll dive into our first analysis and perform an exploratory analysis in

Basic Functionality and Features

In this section, we first demonstrate the usefulness of Jupyter Notebooks with examples and through discussion. Then, to cover the fundamentals for beginners, we'll go over basic usage: launching the platform and interacting with it. For those who have used Jupyter Notebooks before, this will be mostly a review; however, you will certainly see new things in this topic as well.

What is a Jupyter Notebook and Why is it Useful?

Jupyter Notebooks are locally run web applications that contain live code, equations, figures, interactive apps, and Markdown text. The standard language is Python, and that's what we'll be using for this book; however, note that a variety of alternatives are supported. These include the other dominant data science language, R:

Figure 1.1: Jupyter Notebook sample workbook

Those familiar with R will know about R Markdown. R Markdown documents allow for Markdown-formatted text to be combined with executable code. Markdown is a simple language used for styling text on the web. For example, most GitHub repositories have a README.md Markdown file. This format is useful for basic text formatting. It's comparable to HTML but allows for much less customization.

Commonly used symbols in Markdown include hashes (#) to make text into a heading, square and round brackets to insert hyperlinks, and stars to create italicized or bold text:

Figure 1.2: Sample Markdown document
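As a rough illustration of the symbols described above, the following Python snippet holds a small Markdown sample as a string. The sample text is made up for illustration and is not taken from the book's figure.

```python
# A small, made-up Markdown sample stored as a Python string.
markdown_sample = """\
# Analysis notes

Some *italic* text and some **bold** text.

A hyperlink: [Project Jupyter](https://jupyter.org)
"""

# Print the raw Markdown source, symbols included.
print(markdown_sample)
```

Pasting the printed text into a Markdown cell should render a heading, italic and bold text, and a hyperlink.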

Having seen the basics of Markdown, let's come back to R Markdown, where Markdown text can be written alongside executable code. Jupyter Notebooks offer the equivalent functionality for Python, although, as we'll see, they function quite differently than R Markdown documents. For example, R Markdown assumes you are writing Markdown unless otherwise specified, whereas Jupyter Notebooks assume you are inputting code. This makes it more appealing to use Jupyter Notebooks for rapid development and testing.

From a data science perspective, there are two primary types of Jupyter Notebook, depending on how they are used: lab-style and deliverable.

Lab-style Notebooks are meant to serve as the programming analog of research journals. These should contain all the work you've done to load, process, analyze, and model the data. The idea here is to document everything you've done for future reference, so it's usually not advisable to delete or alter previous lab-style Notebooks. It's also a good idea to accumulate multiple date-stamped versions of the Notebook as you progress through the analysis, in case you want to look back at previous states.

Deliverable Notebooks are intended to be presentable and should contain only select parts of the lab-style Notebooks. For example, this could be an interesting discovery to share with your colleagues, an in-depth report of your analysis for a manager, or a summary of the key findings for stakeholders.

In either case, an important concept is reproducibility. If you've been diligent in documenting your software versions, anyone receiving the reports will be able to rerun the Notebook and compute the same results as you did. In the scientific community, where reproducibility is becoming increasingly difficult, this is a breath of fresh air.
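One simple way to document software versions is to print them from a cell at the top of the Notebook. The sketch below uses only the Python standard library; the dictionary name and its keys are illustrative choices, not something prescribed by the book.

```python
import platform

# Collect basic facts about the environment that produced the results.
environment = {
    "python_version": platform.python_version(),
    "implementation": platform.python_implementation(),
    "platform": platform.platform(),
}

for name, value in environment.items():
    print(f"{name}: {value}")
```

Versions of external libraries could be recorded in the same way, for example by printing each library's `__version__` attribute where one is available.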

Navigating the Platform

Now, we are going to open up a Jupyter Notebook and start to learn the interface. Here, we will assume you have no prior knowledge of the platform and go over the basic usage.

Exercise 1: Introducing Jupyter Notebooks

Navigate to the companion material directory in the terminal.

Note

On Unix machines such as Mac or Linux, command-line navigation can be done using ls to display directory contents and cd to change directories. On Windows machines, use dir to display directory contents and cd to change directories. If, for example, you want to change the drive from C: to D:, you should execute d: to change drives.

Start a new local Notebook server here by typing the following into the terminal:

jupyter notebook

A new window or tab of your default browser will open the Notebook Dashboard to the working directory. Here, you will see a list of folders and files contained therein.

Click on a folder to navigate to that particular path and open a file by clicking on it. Although its main use is editing IPYNB Notebook files, Jupyter functions as a standard text editor as well.

Reopen the terminal window used to launch the app. We can see the NotebookApp being run on a local server. In particular, you should see a line like this:

[I 20:03:01.045 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/?token=e915bb06866f19ce462d959a9193a94c7c088e81765f9d8a

Going to that HTTP address will load the app in your browser window, as was done automatically when starting the app. Closing the window does not stop the app; this should be done from the terminal by typing Ctrl + C.
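To see what that address is made of, here is a minimal sketch that splits the URL from the log line above into its host, port, and token using Python's standard urllib.parse module. The variable names are illustrative.

```python
from urllib.parse import parse_qs, urlparse

# The address printed by the Notebook server in the log line above.
url = "http://localhost:8888/?token=e915bb06866f19ce462d959a9193a94c7c088e81765f9d8a"

parts = urlparse(url)

# The token is passed as a query-string parameter.
token = parse_qs(parts.query)["token"][0]

print(parts.hostname)  # localhost
print(parts.port)      # 8888
print(token)
```

The host and port identify the local server, and the token appears to act as a credential for it, so the full link is best kept private.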

Close the app by typing Ctrl + C in the terminal. You may also have to confirm by entering y. Close the web browser window as well.

Load the list of available options by running the following code:

jupyter notebook --help

Open the NotebookApp at local port 9000 by running the following:

jupyter notebook --port 9000

Click New in the upper-right corner of the Jupyter Dashboard and select a kernel from the drop-down menu (that is, select something in the Notebooks section):

Figure 1.3: Selecting a kernel from the drop down menu

This is the primary method of creating a new Jupyter Notebook.

Kernels provide programming language support for the Notebook. If you have installed Python with Anaconda, that version should be the default kernel. Conda virtual environments will also be available here.

Note

Virtual environments are a great tool for managing multiple projects on the same machine. Each virtual environment may contain a different version of Python and external libraries. Python has built-in virtual environments; however, the Conda virtual environment integrates better with Jupyter Notebooks and boasts other nice features. The documentation is available at: https://conda.io/docs/user-guide/tasks/manage-environments.html.
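When several environments are available, it can be useful to confirm which one a kernel is actually using. A minimal sketch, using only the standard library, is to print the interpreter path and environment root from a cell:

```python
import sys

# Path of the Python interpreter backing the current kernel or process.
print(sys.executable)

# Root directory of the active environment (for example, a Conda env folder).
print(sys.prefix)
```

If the printed paths point somewhere other than the environment you expected, the Notebook is probably running on a different kernel than intended.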

With the newly created blank Notebook, click the top cell and type print('hello world'), or any other code snippet that writes to the screen. Click the cell and press Shift + Enter or select Run Cell in the Cell menu.

Any stdout or stderr output from the code will be displayed beneath as the cell runs. Furthermore, the string representation of the object written in the final line will be displayed as well. This is very handy, especially for displaying tables, but sometimes we don't want the final object to be displayed. In such cases, a semicolon (;) can be added to the end of the line to suppress the display. New cells expect and run code input by default; however, they can be changed to render Markdown instead.
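The behavior described above can be tried with a few short cells. The sketch below marks hypothetical cell boundaries with comments; it also runs as a plain script, although the automatic display of the final line only happens inside a Notebook.

```python
import sys

# --- Cell 1: both output streams are shown beneath the cell as it runs ---
print("this goes to stdout")
print("this goes to stderr", file=sys.stderr)

# --- Cell 2: the value of the final line is displayed automatically ---
squares = [n ** 2 for n in range(5)]
squares

# --- Cell 3: a trailing semicolon suppresses that display ---
squares;
```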

Click an empty cell and change it to accept the Markdown-formatted text. This can be done from the drop-down menu icon in the toolbar or by selecting Markdown from the Cell menu. Write some text in here (any text will do), making sure to utilize Markdown formatting symbols such as #.

Scroll to the Play icon in the tool bar:

Figure 1.4: Jupyter Notebook tool bar

This can be used to run cells. As we'll see later, however, it's handier to use the keyboard shortcut Shift + Enter to run cells.

Right next to this is a Stop icon, which can be used to stop cells from running. This is useful, for example, if a cell is taking too long to run:

Figure 1.5: Stop icon in Jupyter Notebooks

New cells can be manually added from the Insert menu:

Figure 1.6: Adding new cells from the Insert menu in Jupyter Notebooks

Cells can be copied, pasted, and deleted using icons or by selecting options from the Edit menu:

Figure 1.7: Edit Menu in the Jupyter Notebooks

Figure 1.8: Cutting and copying cells in Jupyter Notebooks

Cells can also be moved up and down this way:

Figure 1.9: Moving cells up and down in Jupyter Notebooks

There are useful options under the Cell menu to run a group of cells or the entire Notebook:

Figure 1.10: Running cells in Jupyter Notebooks

Experiment with the toolbar options to move cells up and down, insert new cells, and delete cells. An important thing to understand about these Notebooks is the shared memory between cells. It's quite simple: every cell existing on the sheet has access to the global set of variables. So, for example, a function defined in one cell could be called from any other, and the same applies to variables. As one would expect, anything within the scope of a function will not be a global variable and can only be accessed from within that specific function.
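A short sketch of this shared memory is given below, with hypothetical cell boundaries marked by comments. The names greeting, shout, and loud are made up for illustration.

```python
# --- Cell 1: define a global variable and a function ---
greeting = "hello"

def shout(text):
    # `loud` is local to the function and is not visible to other cells.
    loud = text.upper() + "!"
    return loud

# --- Cell 2: both names defined in Cell 1 are available here ---
result = shout(greeting)
print(result)  # HELLO!

# --- Cell 3: the function's local variable never became a global ---
print("loud" in globals())  # False
```

Because the cells share one namespace, the order in which they are run matters: running Cell 2 before Cell 1 in a fresh kernel would raise a NameError.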

Open the Kernel menu to see the selections. The Kernel menu is useful for stopping script executions and restarting the Notebook if the kernel dies. Kernels can also be swapped here at any time, but it is inadvisable to use multiple kernels for a single Notebook due to reproducibility concerns.

Open the File menu to see the selections. The File menu contains options for downloading the Notebook in various formats. In particular, it's recommended to save an HTML version of your Notebook, where the content is rendered statically and can be opened and viewed "as you would expect" in web browsers.

The Notebook name will be displayed in the upper-left corner. New Notebooks will automatically be named Untitled.

Change the name of your IPYNB Notebook file by clicking on the current name in the upper-left corner and typing the new name. Then, save the file.

Close the current tab in your web browser (exiting the Notebook) and go to the Jupyter Dashboard tab, which should still be open. (If it's not open, then reload it by copying and pasting the HTTP link from the terminal.)

Since we didn't shut down the Notebook, and we just saved and exited, it will have a green book symbol next to its name in the Files section of the Jupyter Dashboard and will be listed as Running on the right side next to the last modified date. Notebooks can be shut down from here.

Quit the Notebook you have been working on by selecting it (checkbox to the left of the name) and then clicking the orange Shutdown button:

Note

Read through the basic keyboard shortcuts and test them.

Figure 1.11: Shutting down the Jupyter notebook

Note

If you plan to spend a lot of time working with Jupyter Notebooks, it's worthwhile to learn the keyboard shortcuts. This will speed up your workflow considerably. Particularly useful commands to learn are the shortcuts for manually adding new cells and converting cells from code to Markdown formatting. Click on Keyboard Shortcuts from the Help menu to see how.

Jupyter Features
