Machine Learning with scikit-learn Quick Start Guide

Kevin Jolly


Publisher: Packt Publishing
Category: Science and New Technologies
Language: English

Description

Deploy supervised and unsupervised machine learning algorithms using scikit-learn to perform classification, regression, and clustering.

Key Features

Build your first machine learning model using scikit-learn

Train supervised and unsupervised models using popular techniques such as classification, regression and clustering

Understand how scikit-learn can be applied to different types of machine learning problems

Book Description

Scikit-learn is a robust machine learning library for the Python programming language. It provides a set of supervised and unsupervised learning algorithms. This book is the easiest way to learn how to deploy, optimize, and evaluate all of the important machine learning algorithms that scikit-learn provides.

This book teaches you how to use scikit-learn for machine learning. You will start by setting up and configuring your machine learning environment with scikit-learn. To put scikit-learn to use, you will learn how to implement various supervised and unsupervised machine learning models. You will learn classification, regression, and clustering techniques to work with different types of datasets and train your models.

Finally, you will learn about an effective pipeline to help you build a machine learning project from scratch. By the end of this book, you will be confident in building your own machine learning models for accurate predictions.

What you will learn

Learn how to work with all of scikit-learn's machine learning algorithms

Install and set up scikit-learn to build your first machine learning model

Employ unsupervised machine learning algorithms to cluster unlabeled data into groups

Perform classification and regression machine learning

Use an effective pipeline to build a machine learning project from scratch

Who this book is for

This book is for aspiring machine learning developers who want to get started with scikit-learn. Intermediate knowledge of Python programming and some fundamental knowledge of linear algebra and probability will help.

Details

You can read this e-book in Legimi apps or in any app that supports the following format:

EPUB

Page count: 156

Publication year: 2018


Machine Learning with scikit-learn Quick Start Guide

Classification, regression, and clustering techniques in Python

Kevin Jolly

BIRMINGHAM - MUMBAI


All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Aditi Gour
Content Development Editor: Smit Carvalho
Technical Editor: Jinesh Topiwala
Copy Editor: Safis Editing
Project Coordinator: Hardik Bhinde
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Jason Monteiro
Production Coordinator: Jyoti Chauhan

First published: October 2018

Production reference: 1291018

Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.

ISBN 978-1-78934-370-0

www.packtpub.com

To my parents for their unconditional support for all the choices I make.

– Kevin Jolly

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Kevin Jolly is a formally educated data scientist with a master's degree in data science from the prestigious King's College London. Kevin works as a statistical analyst with a digital healthcare start-up, Connido Limited, in London, where he is primarily involved in leading the data science projects that the company undertakes. He has built machine learning pipelines for small and big data, with a focus on scaling such pipelines into production for the products that the company has built.

Kevin is also the author of a book titled Hands-On Data Visualization with Bokeh, published by Packt. He is the editor-in-chief of Linear, a weekly online publication on data science software and products.

About the reviewers

Mehar Pratap Singh is one of the co-founders of ProCogia, and divides his time between their corporate headquarters in Vancouver, BC, and their Seattle office. Among his priorities is the marshaling of the diverse talents at ProCogia to expand the possibilities of data science and give their clients unbeatable competitive advantages.

Mehar was previously a data science consultant at T-Mobile, Microsoft, and several start-up ventures in Seattle. He holds an MBA from the University of Washington and an MS in electrical engineering from the University of Wisconsin.

Mehar's favorite party trick is to recite the entire chemical periodic table from memory. He is also an avid basketball player and loves following the NBA.

Joydeep Bhattacharjee is a principal engineer working for Nineleaps Technology Solutions. After graduating from the National Institute of Technology at Silchar, he started working in the software industry, where he stumbled upon Python. Through Python, he came across machine learning. Now he primarily develops intelligent systems that can parse and process data to solve challenging problems at work. He believes in sharing knowledge and loves mentoring in machine learning. He has published a book on FastText, a popular natural language processing tool, and loves speaking about machine learning at various national and international conferences.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Title Page

Machine Learning with scikit-learn Quick Start Guide

Dedication

Packt Upsell

Why subscribe?

Packt.com

Contributors

About the author

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Code in action

Conventions used

Get in touch

Reviews

Introducing Machine Learning with scikit-learn

A brief introduction to machine learning

Supervised learning

Unsupervised learning

What is scikit-learn?

Installing scikit-learn

The pip method

The Anaconda method

Additional packages

Pandas

Matplotlib

Tree

Pydotplus

Image

Algorithms that you will learn to implement using scikit-learn

Supervised learning algorithms

Unsupervised learning algorithms

Summary

Predicting Categories with K-Nearest Neighbors

Technical requirements

Preparing a dataset for machine learning with scikit-learn

Dropping features that are redundant

Reducing the size of the data

Encoding the categorical variables

Missing values

The k-NN algorithm

Implementing the k-NN algorithm using scikit-learn

Splitting the data into training and test sets

Implementation and evaluation of your model

Fine-tuning the parameters of the k-NN algorithm

Scaling for optimized performance

Summary

Predicting Categories with Logistic Regression

Technical requirements

Understanding logistic regression mathematically 

Implementing logistic regression using scikit-learn

Splitting the data into training and test sets

Fine-tuning the hyperparameters

Scaling the data

Interpreting the logistic regression model

Summary

Predicting Categories with Naive Bayes and SVMs

Technical requirements

The Naive Bayes algorithm 

Implementing the Naive Bayes algorithm in scikit-learn

Support vector machines

Implementing the linear support vector machine algorithm in scikit-learn

Hyperparameter optimization for the linear SVMs

Graphical hyperparameter optimization

Hyperparameter optimization using GridSearchCV

Scaling the data for performance improvement

Summary

Predicting Numeric Outcomes with Linear Regression

Technical requirements

The inner mechanics of the linear regression algorithm

Implementing linear regression in scikit-learn

Linear regression in two dimensions 

Using linear regression to predict mobile transaction amount

Scaling your data

Model optimization 

Ridge regression

Lasso regression

Summary

Classification and Regression with Trees

Technical requirements

Classification trees

The decision tree classifier

Picking the best feature

The Gini coefficient

Implementing the decision tree classifier in scikit-learn

Hyperparameter tuning for the decision tree

Visualizing the decision tree

The random forests classifier

Implementing the random forest classifier in scikit-learn

Hyperparameter tuning for random forest algorithms

The AdaBoost classifier

Implementing the AdaBoost classifier in scikit-learn

Hyperparameter tuning for the AdaBoost classifier

Regression trees

The decision tree regressor

Implementing the decision tree regressor in scikit-learn

Visualizing the decision tree regressor

The random forest regressor

Implementing the random forest regressor in scikit-learn

The gradient boosted tree

Implementing the gradient boosted tree in scikit-learn

Ensemble classifier

Implementing the voting classifier in scikit-learn

Summary

Clustering Data with Unsupervised Machine Learning

Technical requirements

The k-means algorithm

Assignment of centroids

When does the algorithm stop iterating?

Implementing the k-means algorithm in scikit-learn

Creating the base k-means model

The optimal number of clusters

Feature engineering for optimization

Scaling

Principal component analysis

Cluster visualization

t-SNE

Hierarchical clustering

Step 1 – Individual features as individual clusters

Step 2 – The merge

Step 3 – Iteration

Implementing hierarchical clustering

Going from unsupervised to supervised learning

Creating a labeled dataset 

Building the decision tree

Summary

Performance Evaluation Methods

Technical requirements

Why is performance evaluation critical?

Performance evaluation for classification algorithms

The confusion matrix

The normalized confusion matrix

Area under the curve

Cumulative gains curve

Lift curve

K-S statistic plot

Calibration plot

Learning curve

Cross-validated box plot

Performance evaluation for regression algorithms

Mean absolute error

Mean squared error

Root mean squared error

Performance evaluation for unsupervised algorithms

Elbow plot

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

The fundamental aim of this book is to help its readers quickly deploy, optimize, and evaluate every kind of machine learning algorithm that scikit-learn provides.

Readers will learn how to deploy supervised machine learning algorithms, such as logistic regression, k-nearest neighbors, linear regression, Support Vector Machines, Naive Bayes, and tree-based algorithms, in order to solve classification and regression machine learning problems.

Readers will also learn how to deploy unsupervised machine learning algorithms such as the k-means algorithm in order to cluster unlabeled data into groups.

Finally, readers will be provided with different techniques to visually interpret and evaluate the performance of the algorithms that they build.

Who this book is for

This book is for data scientists, software engineers, and people interested in machine learning with a background in Python who would like to understand, implement, and evaluate a wide range of machine learning algorithms using the scikit-learn framework.

What this book covers

Chapter 1, Introducing Machine Learning with scikit-learn, is a brief introduction to the different types of machine learning and its applications.

Chapter 2, Predicting Categories with K-Nearest Neighbors, covers working with and implementing the k-nearest neighbors algorithm to solve classification problems in scikit-learn.

Chapter 3, Predicting Categories with Logistic Regression, explains the workings and implementation of the logistic regression algorithm when solving classification problems in scikit-learn.

Chapter 4, Predicting Categories with Naive Bayes and SVMs, explains the workings and implementation of the Naive Bayes and the Linear Support Vector Machines algorithms when solving classification problems in scikit-learn.

Chapter 5, Predicting Numeric Outcomes with Linear Regression, explains the workings and implementation of the linear regression algorithm when solving regression problems in scikit-learn.

Chapter 6, Classification and Regression with Trees, explains the workings and implementation of tree-based algorithms such as decision trees, random forests, and the boosting and ensemble algorithms when solving classification and regression problems in scikit-learn.

Chapter 7, Clustering Data with Unsupervised Machine Learning, explains the workings and implementation of the k-means algorithm when solving unsupervised problems in scikit-learn.

Chapter 8, Performance Evaluation Methods, contains visual performance evaluation techniques for supervised and unsupervised machine learning algorithms.

To get the most out of this book

To get the most out of this book:

Prior knowledge of Python is assumed at a basic level.

Jupyter Notebook as a development environment is preferred but not necessary.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Go to www.packt.com.

Select the SUPPORT tab.

Click on Code Downloads and Errata.

Enter the name of the book in the search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-with-scikit-learn-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Code in action

Visit the following link to check out videos of the code being run: http://bit.ly/2OcWIGH

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Introducing Machine Learning with scikit-learn

Welcome to the world of machine learning with scikit-learn. I'm thrilled that you have chosen this book in order to begin or further advance your knowledge of the vast field of machine learning. Machine learning can be overwhelming at times, partly because of the large number of tools that are available on the market. This book narrows that choice of tools down to one: scikit-learn.

If I were to tell you what this book can do for you in one sentence, it would be this: the book gives you pipelines that you can implement to solve a wide range of machine learning problems. True to that sentence, you will learn how to construct an end-to-end machine learning pipeline using some of the most popular algorithms in industry and in professional competitions, such as those hosted on Kaggle.

However, in this introductory chapter, we will go through the following topics:

A brief introduction to machine learning

What is scikit-learn?

Installing scikit-learn

Algorithms that you will learn to implement using scikit-learn in this book

Now, let's begin this fun journey into the world of machine learning with scikit-learn!

A brief introduction to machine learning

Machine learning has generated quite the buzz – from Elon Musk fearing the role of unregulated artificial intelligence in society, to Mark Zuckerberg having a view that contradicts Musk's.

So, what exactly is machine learning? Simply put, machine learning is a set of methods that can detect patterns in data and use those patterns to make future predictions. Machine learning has found immense value in a wide range of industries, ranging from finance to healthcare. This translates into a growing demand for people with machine learning skills.

Broadly speaking, machine learning can be categorized into three main types:

Supervised learning

Unsupervised learning

Reinforcement learning

Scikit-learn is designed to tackle problems pertaining to supervised and unsupervised learning only, and does not support reinforcement learning at present.

Supervised learning

Supervised learning is a form of machine learning in which our data comes with a set of labels or a target variable that is numeric. These labels/categories usually belong to one feature/attribute, which is commonly known as the target variable. For instance, each row of your data could either belong to the category of Healthy or Not Healthy.

Given a set of features such as weight, blood sugar levels, and age, we can use the supervised machine learning algorithm to predict whether the person is healthy or not.
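To make this concrete, here is a minimal sketch of that setup in scikit-learn. The six rows of data are invented purely for illustration, and the choice of the k-nearest neighbors classifier (covered later in this book) is an assumption; any scikit-learn classifier could take its place.

```python
# Toy supervised-learning sketch: predict Healthy / Not Healthy from features.
# All numbers below are invented for illustration; they are not real medical data.
from sklearn.neighbors import KNeighborsClassifier

# Each row is one person: [weight in kg, blood sugar in mg/dL, age in years].
X = [
    [62, 85, 25],
    [70, 90, 31],
    [68, 95, 40],
    [95, 160, 52],
    [102, 175, 60],
    [99, 150, 47],
]

# The target variable: one label per row of X.
y = ["Healthy", "Healthy", "Healthy", "Not Healthy", "Not Healthy", "Not Healthy"]

# Fit the algorithm on the labeled examples.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Predict labels for two people that the model has not seen before.
new_people = [[65, 88, 29], [100, 165, 55]]
predictions = model.predict(new_people)
print(predictions)
```

The first new person sits close to the rows labeled Healthy and the second close to the rows labeled Not Healthy, so the model assigns those labels respectively.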

In the following simple mathematical expression, S is the supervised learning algorithm, X is the set of input features, such as weight and age, and Y is the target variable with the labels Healthy or Not Healthy:
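One way to write this relationship, assuming that the algorithm S is treated as a function applied to the input features X, is:

```latex
Y = S(X)
```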

Although supervised machine learning is the most common type of machine learning implemented with scikit-learn and in the industry, most datasets do not come with predefined labels. In such cases, unsupervised learning algorithms are first used to cluster the unlabeled data into distinct groups, to which we can then assign labels. This is discussed in detail in the following section.

Unsupervised learning

Unsupervised learning is a form of machine learning in which the algorithm tries to detect/find patterns in data that do not have an outcome/target variable. In other words, we do not have data that comes with pre-existing labels. Thus, the algorithm will typically use a metric such as distance to group data together depending on how close they are to each other.

As discussed in the previous section, most of the data that you will encounter in the real world will not come with a set of predefined labels and, as such, will only have a set of input features without a target attribute.

In the following simple mathematical expression, U is the unsupervised learning algorithm, while X is a set of input features, such as weight and age:
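One way to write this, assuming that U is treated as a function applied to X, is shown here. The symbol G is a hypothetical name for the resulting groups and does not come from the text:

```latex
G = U(X)
```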

Given this data, our objective is to create groups that could potentially be labeled as Healthy or Not Healthy. The unsupervised learning algorithm will use a metric such as distance in order to identify how close a set of points are to each other and how far apart two such groups are. The algorithm will then proceed to cluster the data points into two distinct groups, as illustrated in the following diagram:

Clustering two groups together
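The grouping described above can be sketched with the k-means algorithm, which is covered later in this book. The data points are invented for illustration, and the parameter values (n_init=10, random_state=0) are assumptions chosen only to keep the run repeatable.

```python
# Toy unsupervised-learning sketch: group unlabeled points by distance.
# The data is invented for illustration and carries no labels.
import numpy as np
from sklearn.cluster import KMeans

# Each row is one person: [weight in kg, age in years].
X = np.array([
    [60, 25], [62, 28], [65, 30],    # points that lie close together
    [95, 55], [98, 60], [101, 58],   # a second, distant set of points
])

# Ask for two groups; k-means assigns each point to its nearest centroid.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# The ids (0 or 1) are arbitrary names for the groups, not labels
# such as Healthy or Not Healthy; those still have to be assigned by us.
print(cluster_ids)
```

The algorithm only reports which points belong together. Deciding what each group means is a separate step.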

What is scikit-learn?

Scikit-learn is free and open source software that helps you tackle supervised and unsupervised machine learning problems. The software is built largely in Python and utilizes some of the most popular libraries that Python has to offer, namely NumPy and SciPy.

The main reason why scikit-learn is very popular stems from the fact that most of the world's most popular machine learning algorithms can be implemented quite quickly in a plug and play format once you know what the core pipeline is like. Another reason is that popular algorithms for classification such as logistic regression and support vector machines are written in Cython. Cython is used to give these algorithms C-like performance and thus makes the use of scikit-learn quite efficient in the process.
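As a rough illustration of this plug and play format, the following sketch trains three different classifiers through the same fit and score calls. The Iris dataset, the 70/30 split, and the parameter values are assumptions made for the example rather than choices taken from this book.

```python
# Sketch of the shared estimator interface: every model below is
# trained and evaluated with the same two calls, fit() and score().
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# A small built-in dataset, split into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Swapping algorithms only means swapping the object in this dictionary.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "linear SVM": LinearSVC(max_iter=10000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # learn from the training set
    scores[name] = model.score(X_test, y_test)   # accuracy on the test set
    print(f"{name}: {scores[name]:.2f}")
```

Because the loop body never changes, trying another algorithm is a one-line edit to the dictionary.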

Installing scikit-learn

There are two ways in which you can install scikit-learn on your personal device:

By using the pip method

By using the Anaconda method

The pip method can be implemented on the macOS/Linux Terminal or the Windows PowerShell, while the Anaconda method will work with the Anaconda prompt.

Choosing between these two methods of installation is pretty straightforward:

If you would like all the common Python package distributions for data science to be installed in one environment, the Anaconda method works best

If you would like to build your own environment from scratch for scikit-learn, the pip method works best (for advanced users of Python)

This book will be using Python 3.6 for all the code that is displayed throughout every chapter, unless mentioned otherwise.

The pip method

Scikit-learn requires a few packages to be installed on your device before you can install it. These are as follows:

NumPy: Version 1.8.2 or greater

SciPy: Version 0.13.3 or greater

These can be installed using the pip method by using the following commands:

pip3 install numpy

pip3 install scipy

Next, we can install scikit-learn using the following code:

pip3 install scikit-learn

Additionally, if you already have scikit-learn installed on your device and you simply want to upgrade it to the latest version, you can use the following code:

pip3 install -U scikit-learn

The version of scikit-learn implemented in the book is 0.19.1.
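After installing, a quick way to confirm which versions ended up on your device is to print them from Python. This is a minimal sketch; the version numbers you see will depend on your own installation and may well be newer than those used in this book.

```python
# Print the versions of Python and of the installed packages.
import sys

import numpy
import scipy
import sklearn

versions = {
    "python": sys.version.split()[0],   # for example, a string such as 3.6.x
    "numpy": numpy.__version__,
    "scipy": scipy.__version__,
    "scikit-learn": sklearn.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```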

The Anaconda method

In the event that you have installed Python using the Anaconda distribution, you can install scikit-learn from the Anaconda prompt.

The first step is to install the dependencies:

conda install numpy

conda install scipy

Next, we can install scikit-learn by using the following code:

conda install scikit-learn

Additionally, if you already have scikit-learn installed with the Anaconda distribution, you can upgrade it to the latest version by using the following code in the Anaconda prompt:

conda update scikit-learn

When upgrading or uninstalling a copy of scikit-learn that was installed with Anaconda, avoid using pip, as it is likely to fail to upgrade or remove all of the required files. Whichever method you installed with, pip or Anaconda, stick with it in order to maintain consistency.
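If you are unsure which method applies to your current environment, the following sketch offers a simple heuristic. It assumes that conda environments keep a conda-meta directory under the interpreter prefix; the function name running_in_conda is hypothetical and is not part of any library.

```python
# Heuristic check: does the current interpreter live in a conda environment?
import os
import sys


def running_in_conda(prefix=sys.prefix):
    """Return True if the given prefix contains a conda-meta directory."""
    return os.path.isdir(os.path.join(prefix, "conda-meta"))


if running_in_conda():
    print("conda environment detected: prefer conda install / conda update")
else:
    print("no conda environment detected: prefer pip3 install")
```

This only inspects the directory layout, so treat the result as a hint rather than a guarantee.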

Additional packages

In this section, we will talk about the packages that we will be installing outside of scikit-learn that will be used throughout this book.

Pandas

To install Pandas, you can use either the pip method or the Anaconda method, as follows:

Pip method:

pip3 install pandas

Anaconda method:

conda install pandas

Matplotlib

To install matplotlib, you can use either the pip method or the Anaconda method, as follows:

Pip method:

pip3 install matplotlib

Anaconda method:

conda install matplotlib

Tree

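To install tree, you can use either the pip method or the Anaconda method, as follows. Note that scikit-learn's own decision tree module, sklearn.tree, ships with scikit-learn and needs no separate installation: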

Pip method:

pip3 install tree

Anaconda method:

conda install tree

Pydotplus

To install pydotplus, you can use either the pip method or the Anaconda method, as follows:

Pip method:

pip3 install pydotplus

Anaconda method:

conda install pydotplus

Image

To install Image, you can use either the pip method or the Anaconda method, as follows:

Pip method:

pip3 install Image

Anaconda method:

conda install Image
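Once the additional packages are installed, a short check such as the following can confirm that Python can find them. This is a sketch only: the helper name check_packages is hypothetical, and the import names listed are assumed to match the package names. The tree and Image packages are left out because their import names are not stated in this chapter.

```python
# Report which of the additional packages can be imported.
from importlib.util import find_spec

# Assumed import names for the packages installed above.
import_names = ["pandas", "matplotlib", "pydotplus"]


def check_packages(names):
    """Map each top-level module name to True if it is importable."""
    return {name: find_spec(name) is not None for name in names}


status = check_packages(import_names)
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

Any package reported as missing can be installed with the matching pip or Anaconda command from this section.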

Algorithms that you will learn to implement using scikit-learn

The algorithms that you will learn about in this book are broadly classified into the following two categories:

Supervised learning algorithms

Unsupervised learning algorithms

Supervised learning algorithms

Supervised learning algorithms can be used to solve both classification and regression problems. In this book, you will learn how to implement some of the most popular supervised machine learning algorithms. Popular supervised machine learning algorithms are the ones that are widely used in industry and research, and have helped us solve a wide range of problems across a wide range of domains. These supervised learning algorithms are as follows:

Linear regression: This supervised learning algorithm is used to predict continuous numeric outcomes such as house prices, stock prices, and temperature, to name a few

Logistic regression: The logistic regression algorithm is a popular classification algorithm that is especially used in the credit industry in order to predict loan defaults

k-Nearest Neighbors: The k-NN algorithm is a classification algorithm that is used to classify data into two or more categories. It can be used, for example, to classify houses into expensive and affordable categories based on price, area, bedrooms, and a whole range of other features

Support vector machines