Mastering Machine Learning with scikit-learn - Second Edition

Gavin Hackeling
Description

Use scikit-learn to apply machine learning to real-world problems

About This Book

  • Master popular machine learning models including k-nearest neighbors, random forests, logistic regression, k-means, naive Bayes, and artificial neural networks
  • Learn to build efficient models and evaluate their performance using scikit-learn
  • A practical guide to mastering the fundamentals and learning from real-life applications of machine learning

Who This Book Is For

This book is intended for software engineers who want to understand how common machine learning algorithms work and develop an intuition for how to use them, and for data scientists who want to learn about the scikit-learn API. Familiarity with machine learning fundamentals and Python is helpful, but not required.

What You Will Learn

  • Review fundamental concepts such as bias and variance
  • Extract features from categorical variables, text, and images
  • Predict the values of continuous variables using linear regression and K Nearest Neighbors
  • Classify documents and images using logistic regression and support vector machines
  • Create ensembles of estimators using bagging and boosting techniques
  • Discover hidden structures in data using K-Means clustering
  • Evaluate the performance of machine learning systems in common tasks

In Detail

Machine learning is the buzzword bringing computer science and statistics together to build smart and efficient models. Using the powerful algorithms and techniques that machine learning offers, you can automate a wide range of analytical tasks.

This book examines a variety of machine learning models, including popular algorithms such as k-nearest neighbors, logistic regression, naive Bayes, k-means, decision trees, and artificial neural networks. It discusses data preprocessing, hyperparameter optimization, and ensemble methods. You will build systems that classify documents, recognize images, detect ads, and more. You will learn to use scikit-learn's API to extract features from categorical variables, text, and images; evaluate model performance; and develop an intuition for how to improve your model's performance.

By the end of this book, you will have mastered the scikit-learn concepts required to build efficient models and carry out advanced tasks with a practical approach.

Style and approach

This book is motivated by the belief that you do not understand something until you can describe it simply. Work through toy problems to develop your understanding of the learning algorithms and models, then apply what you have learned to real-life problems.


Mastering Machine Learning with scikit-learn

 

Second Edition

 

 

 

 

 

Learn to implement and evaluate machine learning solutions with scikit-learn

 

 

 

 

 

 

 

 

 

 

Gavin Hackeling

 

 

 

 

 

 

BIRMINGHAM - MUMBAI

 

Mastering Machine Learning with scikit-learn

Second Edition

 

Copyright © 2017 Packt Publishing

 

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, its dealers, or its distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: October 2014

Second edition published: July 2017

Production reference: 1200717

 

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78829-987-9

www.packtpub.com

Credits

Author

Gavin Hackeling

Copy Editors

Safis Editing

Vikrant Phadkay

Reviewer

Oleg Okun

Project Coordinator

Nidhi Joshi

Commissioning Editor

Amey Varangaonkar

Proofreader

Safis Editing

Acquisition Editor

Aman Singh

Indexer

Tejal Daruwale Soni

Content Development Editor

Aishwarya Pandere

Graphics

Tania Dutta

Technical Editor

Suwarna Patil

Production Coordinator

Arvindkumar Gupta

About the Author

Gavin Hackeling is a data scientist and author. He has worked on a variety of machine learning problems, including automatic speech recognition, document classification, object recognition, and semantic segmentation. An alumnus of the University of North Carolina and New York University, he lives in Brooklyn with his wife and cat.

 

I would like to thank my wife, Hallie, and the scikit-learn community.

About the Reviewer

Oleg Okun is a machine learning expert and an author/editor of four books, numerous journal articles, and conference papers. His career spans more than a quarter of a century. He was employed in both academia and industry in his motherland, Belarus, and abroad (Finland, Sweden, and Germany). His work experience includes document image analysis, fingerprint biometrics, bioinformatics, online/offline marketing analytics, credit scoring analytics, and text analytics.

He is interested in all aspects of distributed machine learning and the Internet of Things. Oleg currently lives and works in Hamburg, Germany.

I would like to express my deepest gratitude to my parents for everything that they have done for me.

 

 

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1788299876.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We reward our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

The Fundamentals of Machine Learning

Defining machine learning

Learning from experience

Machine learning tasks

Training data, testing data, and validation data

Bias and variance

An introduction to scikit-learn

Installing scikit-learn

Installing using pip

Installing on Windows

Installing on Ubuntu 16.04

Installing on Mac OS

Installing Anaconda

Verifying the installation

Installing pandas, Pillow, NLTK, and matplotlib

Summary

Simple Linear Regression

Simple linear regression

Evaluating the fitness of the model with a cost function

Solving OLS for simple linear regression

Evaluating the model

Summary

Classification and Regression with k-Nearest Neighbors

K-Nearest Neighbors

Lazy learning and non-parametric models

Classification with KNN

Regression with KNN

Scaling features

Summary

Feature Extraction

Extracting features from categorical variables

Standardizing features

Extracting features from text

The bag-of-words model

Stop word filtering

Stemming and lemmatization

Extending bag-of-words with tf-idf weights

Space-efficient feature vectorizing with the hashing trick

Word embeddings

Extracting features from images

Extracting features from pixel intensities

Using convolutional neural network activations as features

Summary

From Simple Linear Regression to Multiple Linear Regression

Multiple linear regression

Polynomial regression

Regularization

Applying linear regression

Exploring the data

Fitting and evaluating the model

Gradient descent

Summary

From Linear Regression to Logistic Regression

Binary classification with logistic regression

Spam filtering

Binary classification performance metrics

Accuracy

Precision and recall

Calculating the F1 measure

ROC AUC

Tuning models with grid search

Multi-class classification

Multi-class classification performance metrics

Multi-label classification and problem transformation

Multi-label classification performance metrics

Summary

Naive Bayes

Bayes' theorem

Generative and discriminative models

Naive Bayes

Assumptions of Naive Bayes

Naive Bayes with scikit-learn

Summary

Nonlinear Classification and Regression with Decision Trees

Decision trees

Training decision trees

Selecting the questions

Information gain

Gini impurity

Decision trees with scikit-learn

Advantages and disadvantages of decision trees

Summary

From Decision Trees to Random Forests and Other Ensemble Methods

Bagging

Boosting

Stacking

Summary

The Perceptron

The perceptron

Activation functions

The perceptron learning algorithm

Binary classification with the perceptron

Document classification with the perceptron

Limitations of the perceptron

Summary

From the Perceptron to Support Vector Machines

Kernels and the kernel trick

Maximum margin classification and support vectors

Classifying characters in scikit-learn

Classifying handwritten digits

Classifying characters in natural images

Summary

From the Perceptron to Artificial Neural Networks

Nonlinear decision boundaries

Feed-forward and feedback ANNs

Multi-layer perceptrons

Training multi-layer perceptrons

Backpropagation

Training a multi-layer perceptron to approximate XOR

Training a multi-layer perceptron to classify handwritten digits

Summary

K-means

Clustering

K-means

Local optima

Selecting K with the elbow method

Evaluating clusters

Image quantization

Clustering to learn features

Summary

Dimensionality Reduction with Principal Component Analysis

Principal component analysis

Variance, covariance, and covariance matrices

Eigenvectors and eigenvalues

Performing PCA

Visualizing high-dimensional data with PCA

Face recognition with PCA

Summary

Preface

In recent years, popular imagination has become fascinated by machine learning. The discipline has found a variety of applications. Some of these applications, such as spam filtering, are ubiquitous and have been rendered mundane by their successes. Many other applications have only recently been conceived, and hint at machine learning's potential.

In this book, we will examine several machine learning models and learning algorithms. We will discuss tasks that machine learning is commonly applied to, and we will learn to measure the performance of machine learning systems. We will work with a popular library for the Python programming language called scikit-learn, which has assembled state-of-the-art implementations of many machine learning algorithms under an intuitive and versatile API.

What this book covers

Chapter 1, The Fundamentals of Machine Learning, defines machine learning as the study and design of programs that improve their performance of a task by learning from experience. This definition guides the other chapters; in each, we will examine a machine learning model, apply it to a task, and measure its performance.

Chapter 2, Simple Linear Regression, discusses a model that relates a single feature to a continuous response variable. We will learn about cost functions and use the normal equation to optimize the model.

Chapter 3, Classification and Regression with K-Nearest Neighbors, introduces a simple, nonlinear model for classification and regression tasks.

Chapter 4, Feature Extraction, describes methods for representing text, images, and categorical variables as features that can be used in machine learning models.

Chapter 5, From Simple Linear Regression to Multiple Linear Regression, discusses a generalization of simple linear regression that regresses a continuous response variable onto multiple features.

Chapter 6, From Linear Regression to Logistic Regression, further generalizes multiple linear regression and introduces a model for binary classification tasks.

Chapter 7, Naive Bayes, discusses Bayes’ theorem and the Naive Bayes family of classifiers, and compares generative and discriminative models.

Chapter 8, Nonlinear Classification and Regression with Decision Trees, introduces the decision tree, a simple, nonlinear model for classification and regression tasks.

Chapter 9, From Decision Trees to Random Forests and other Ensemble Methods, discusses three methods for combining models called bagging, boosting, and stacking.

Chapter 10, The Perceptron, introduces a simple online model for binary classification.

Chapter 11, From the Perceptron to Support Vector Machines, discusses a powerful, discriminative model for classification and regression called the support vector machine, and a technique for efficiently projecting features to higher dimensional spaces.

Chapter 12, From the Perceptron to Artificial Neural Networks, introduces powerful nonlinear models for classification and regression built from graphs of artificial neurons.

Chapter 13, K-means, discusses an algorithm that can be used to find structures in unlabeled data.

Chapter 14, Dimensionality Reduction with Principal Component Analysis, describes a method for reducing the dimensions of data that can mitigate the curse of dimensionality.

What you need for this book

The examples in this book require Python >= 2.7 or >= 3.3 and pip, the PyPA-recommended tool for installing Python packages. The examples are intended to be executed in a Jupyter notebook or an IPython interpreter. Chapter 1, The Fundamentals of Machine Learning, shows how to install scikit-learn 0.18.1, its dependencies, and other libraries on Ubuntu, Mac OS, and Windows.

Who this book is for

This book is intended for software engineers who want to understand how common machine learning algorithms work and develop an intuition for how to use them. It is also for data scientists who want to learn about the scikit-learn API. Familiarity with machine learning fundamentals and Python is helpful but not required.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The package is named sklearn because scikit-learn is not a valid Python package name."

# In[1]:
import sklearn
sklearn.__version__

# Out[1]:
'0.18.1'

New terms and important words are shown in bold.

Warnings or important notes appear like this.
Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you thought about this book, what you liked or disliked. Reader feedback is important for us as it helps us to develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Machine-Learning-with-scikit-learn-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us to improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to the list of existing errata under the Errata section of that title. To view previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspects of this book, you can contact us at [email protected], and we will do our best to address it.

The Fundamentals of Machine Learning

In this chapter, we will review fundamental concepts in machine learning. We will compare supervised and unsupervised learning; discuss the uses of training, testing, and validation data; and describe applications of machine learning. Finally, we will introduce scikit-learn, and install the tools required in subsequent chapters.

Defining machine learning

Our imaginations have long been captivated by visions of machines that can learn and imitate human intelligence. While machines capable of general artificial intelligence, like Arthur C. Clarke's HAL and Isaac Asimov's Sonny, have yet to be realized, software programs that can acquire new knowledge and skills through experience are becoming increasingly common. We use such machine learning programs to discover new music that we might enjoy, and to find exactly the shoes we want to purchase online. Machine learning programs allow us to dictate commands to our smartphones, and allow our thermostats to set their own temperatures. Machine learning programs can decipher sloppily-written mailing addresses better than humans, and can guard credit cards from fraud more vigilantly. From investigating new medicines to estimating the page views for versions of a headline, machine learning software is becoming central to many industries. Machine learning has even encroached on activities that have long been considered uniquely human, such as writing the sports column recapping the Duke basketball team's loss to UNC.

Machine learning is the design and study of software artifacts that use past experience to inform future decisions; machine learning is the study of programs that learn from data. The fundamental goal of machine learning is to generalize, or to induce an unknown rule from examples of the rule's application. The canonical example of machine learning is spam filtering. By observing thousands of emails that have been previously labeled as either spam or ham, spam filters learn to classify new messages. Arthur Samuel, a computer scientist who pioneered the study of artificial intelligence, said that machine learning is the "study that gives computers the ability to learn without being explicitly programmed". Throughout the 1950s and 1960s, Samuel developed programs that played checkers. While the rules of checkers are simple, complex strategies are required to defeat skilled opponents. Samuel never explicitly programmed these strategies, but through the experience of playing thousands of games, the program learned complex behaviors that allowed it to beat many human opponents.

A popular quote from computer scientist Tom Mitchell defines machine learning more formally: "A program can be said to learn from experience 'E' with respect to some class of tasks 'T' and performance measure 'P', if its performance at tasks in 'T', as measured by 'P', improves with experience 'E'." For example, assume that you have a collection of pictures. Each picture depicts either a dog or a cat. A task could be sorting the pictures into separate collections of dog and cat photos. A program could learn to perform this task by observing pictures that have already been sorted, and it could evaluate its performance by calculating the percentage of correctly classified pictures.
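To make Mitchell's performance measure concrete, here is a minimal sketch that scores such a picture-sorting program as the fraction of correctly classified pictures; the labels are invented for illustration.

# A minimal sketch of the performance measure P: the fraction of pictures
# classified correctly. Both label lists are invented for illustration.
from sklearn.metrics import accuracy_score

true_labels = ['dog', 'cat', 'dog', 'dog', 'cat']       # how the pictures were actually sorted
predicted_labels = ['dog', 'cat', 'cat', 'dog', 'cat']  # the program's predictions

# accuracy_score returns the proportion of predictions that match the labels
print(accuracy_score(true_labels, predicted_labels))    # 0.8, or 80% correct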

We will use Mitchell's definition of machine learning to organize this chapter. First, we will discuss types of experience, including supervised learning and unsupervised learning. Next, we will discuss common tasks that can be performed by machine learning systems. Finally, we will discuss performance measures that can be used to assess machine learning systems.

Learning from experience

Machine learning systems are often described as learning from experience either with or without supervision from humans. In supervised learning problems, a program predicts an output for an input by learning from pairs of labeled inputs and outputs. That is, the program learns from examples of the "right answers". In unsupervised learning, a program does not learn from labeled data. Instead, it attempts to discover patterns in data. For example, assume that you have collected data describing the heights and weights of people. An example of an unsupervised learning problem is dividing the data points into groups. A program might produce groups that correspond to men and women, or children and adults. Now assume that the data is also labeled with the person's sex. An example of a supervised learning problem is to induce a rule for predicting whether a person is male or female based on his or her height and weight. We will discuss algorithms and examples of supervised and unsupervised learning in the following chapters.
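The height-and-weight example can be sketched in a few lines of scikit-learn; the measurements and labels below are fabricated for illustration.

# A minimal sketch of the example above; the heights (cm) and weights (kg)
# are fabricated for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = np.array([[160, 54], [163, 59], [172, 70], [180, 82], [158, 52], [178, 77]])

# Unsupervised: divide the unlabeled data points into two groups
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))

# Supervised: the same observations, now labeled with each person's sex
y = ['f', 'f', 'm', 'm', 'f', 'm']
clf = LogisticRegression().fit(X, y)
print(clf.predict([[170, 68]]))  # predict the sex for a new observation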

Supervised learning and unsupervised learning can be thought of as occupying opposite ends of a spectrum. Some types of problem, called semi-supervised learning problems, make use of both supervised and unsupervised data; these problems are located on the spectrum between supervised and unsupervised learning. Reinforcement learning is located near the supervised end of the spectrum. Unlike supervised learning, reinforcement learning programs do not learn from labeled pairs of inputs and outputs. Instead, they receive feedback for their decisions, but errors are not explicitly corrected. For example, a reinforcement learning program that is learning to play a side-scrolling video game like Super Mario Bros may receive a reward when it completes a level or exceeds a certain score, and a punishment when it loses a life. However, this supervised feedback is not associated with specific decisions to run, avoid Goombas, or pick up fire flowers. We will focus primarily on supervised and unsupervised learning, as these categories include most common machine learning problems. In the next sections, we will review supervised and unsupervised learning in more detail.

A supervised learning program learns from labeled examples of the outputs that should be produced for an input. There are many names for the output of a machine learning program. Several disciplines converge in machine learning, and many of those disciplines use their own terminology. In this book, we will refer to the output as the response variable. Other names for response variables include "dependent variables", "regressands", "criterion variables", "measured variables", "responding variables", "explained variables", "outcome variables", "experimental variables", "labels", and "output variables". Similarly, the input variables have several names. In this book, we will refer to inputs as features, and the phenomena they represent as explanatory variables. Other names for explanatory variables include "predictors", "regressors", "controlled variables", and "exposure variables". Response variables and explanatory variables may take real or discrete values.

The collection of examples that comprise supervised experience is called a training set. A collection of examples that is used to assess the performance of a program is called a test set. The response variable can be thought of as the answer to the question posed by the explanatory variables; supervised learning problems learn from a collection of answers to different questions. That is, supervised learning programs are provided with the correct answers and must learn to respond correctly to unseen, but similar, questions.

Machine learning tasks

Two of the most common supervised machine learning tasks are classification and regression. In classification tasks, the program must learn to predict discrete values for one or more response variables from one or more features. That is, the program must predict the most probable category, class, or label for new observations. Applications of classification include predicting whether a stock's price will rise or fall, or deciding whether a news article belongs to the politics or leisure sections. In regression problems, the program must predict the values of one or more continuous response variables from one or more features. Examples of regression problems include predicting the sales revenue for a new product, or predicting the salary for a job based on its description. Like classification, regression problems require supervised learning. A short sketch can contrast the two tasks, as shown below.
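# A minimal sketch contrasting the two supervised tasks on invented toy data.
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [3], [5], [7]]  # a single feature, e.g. years of experience

# Regression: predict a continuous response, such as a salary in thousands
salaries = [40, 55, 70, 88]
print(LinearRegression().fit(X, salaries).predict([[4]]))

# Classification: predict a discrete class, such as whether a price rose or fell
labels = ['fell', 'fell', 'rose', 'rose']
print(LogisticRegression().fit(X, labels).predict([[4]]))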

A common unsupervised learning task is to discover groups of related observations, called clusters, within the dataset. This task, called clustering or cluster analysis, assigns observations into groups such that observations within a group are more similar to each other, based on some similarity measure, than they are to observations in other groups. Clustering is often used to explore a dataset. For example, given a collection of movie reviews, a clustering algorithm might discover the sets of positive and negative reviews. The system will not be able to label the clusters as positive or negative; without supervision, it will only have knowledge that the grouped observations are similar to each other by some measure. A common application of clustering is discovering segments of customers within a market for a product. By understanding what attributes are common to particular groups of customers, marketers can decide what aspects of their campaigns to emphasize. Clustering is also used by internet radio services; given a collection of songs, a clustering algorithm might be able to group the songs according to their genres. Using different similarity measures, the same clustering algorithm might group the songs by their keys, or by the instruments they contain.

Dimensionality reduction is another task that is commonly accomplished using unsupervised learning. Some problems may contain thousands or millions of features, which can be computationally costly to work with. Additionally, the program's ability to generalize may be reduced if some of the features capture noise or are irrelevant to the underlying relationship. Dimensionality reduction is the process of discovering the features that account for the greatest changes in the response variable. Dimensionality reduction can also be used to visualize data. It is easy to visualize a regression problem such as predicting the price of a home from its size; the size of the home can be plotted on the graph's x axis, and the price of the home can be plotted on the y axis. It is similarly easy to visualize the housing price regression problem when a second feature is added; the number of bathrooms in the house could be plotted on the z axis, for instance. A problem with thousands of features, however, becomes impossible to visualize.
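As a small sketch of the idea, scikit-learn's PCA (covered in the final chapter) can project the bundled four-feature iris dataset down to two dimensions for plotting.

# A minimal sketch of dimensionality reduction for visualization, using the
# iris dataset that is bundled with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
print(iris.data.shape)  # (150, 4): four features per observation

# Project the four features onto the two directions of greatest variance
reduced = PCA(n_components=2).fit_transform(iris.data)
print(reduced.shape)    # (150, 2): now easy to plot on the x and y axes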

Training data, testing data, and validation data

As mentioned previously, a training set is a collection of observations. These observations comprise the experience that the algorithm uses to learn. In supervised learning problems, each observation consists of an observed response variable and features of one or more observed explanatory variables. The test set is a similar collection of observations. The test set is used to evaluate the performance of the model using some performance metric. It is important that no observations from the training set are included in the test set. If the test set does contain examples from the training set, it will be difficult to assess whether the algorithm has learned to generalize from the training set or has simply memorized it. A program that generalizes well will be able to effectively perform a task with new data. In contrast, a program that memorizes the training data by learning an overly-complex model could predict the values of the response variable for the training set accurately, but will fail to predict the value of the response variable for new examples. Memorizing the training set is called overfitting. A program that memorizes its observations may not perform its task well, as it could memorize relations and structure that are coincidental in the training data. Balancing generalization and memorization is a problem common to many machine learning algorithms. In later chapters, we will discuss regularization, which can be applied to many models to reduce overfitting.

In addition to the training and test data, a third set of observations, called a validation or hold-out set, is sometimes required. The validation set is used to tune variables called hyperparameters that control how the algorithm learns from the training data. The program is still evaluated on the test set to provide an estimate of its performance in the real world. The validation set should not be used to estimate real-world performance because the program has been tuned to learn from the training data in a way that optimizes its score on the validation data; the program will not have this advantage in the real world.

It is common to partition a single set of supervised observations into training, validation, and test sets. There are no requirements for the sizes of the partitions, and they may vary according to the amount of data available. It is common to allocate between fifty and seventy-five percent of the data to the training set, ten to twenty-five percent of the data to the test set, and the remainder to the validation set.
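One way to produce such a partition with scikit-learn is to call train_test_split twice; the 60/20/20 proportions below are just one choice within the ranges mentioned above.

# A minimal sketch of a 60/20/20 train/validation/test partition.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off a test set, then split the remainder into training and validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30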

Some training sets may contain only a few hundred observations; others may include millions. Inexpensive storage, increased network connectivity, and the ubiquity of sensor-packed smartphones have contributed to the contemporary state of big data, or training sets with millions or billions of examples. While this book will not work with datasets that require parallel processing on tens or hundreds of computers, the predictive power of many machine learning algorithms improves as the amount of training data increases. However, machine learning algorithms also follow the maxim "garbage in, garbage out". A student who studies for a test by reading a large, confusing textbook that contains many errors likely will not score better than a student who reads a short but well-written textbook. Similarly, an algorithm trained on a large collection of noisy, irrelevant, or incorrectly-labeled data will not perform better than an algorithm trained on a smaller set of data that is more representative of the problem in the real world.

Many supervised training sets are prepared manually or by semi-automated processes. Creating a large collection of supervised data can be costly in some domains. Fortunately, several datasets are bundled with scikit-learn, allowing developers to focus on experimenting with models instead. During development, and particularly when training data is scarce, a practice called cross-validation can be used to train and validate a model on the same data. In cross-validation, the training data is partitioned. The model is trained using all but one of the partitions, and tested on the remaining partition. The partitions are then rotated several times so that the model is trained and evaluated on all of the data. The mean of the model's scores on each of the partitions is a better estimate of performance in the real world than an evaluation using a single training/testing split. The following diagram depicts cross validation with five partitions, or folds.

The original dataset is partitioned into five subsets of equal size labeled A through E. Initially the model is trained on partitions B through E, and tested on partition A. In the next iteration, the model is trained on partitions A, C, D, and E, and tested on partition B. The partitions are rotated until models have been trained and tested on all of the partitions. Cross-validation provides a more accurate estimate of the model's performance than testing a single partition of the data.
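In scikit-learn, this procedure is a one-liner; the following sketch runs the five-fold rotation described above on a bundled dataset and averages the scores.

# A minimal sketch of five-fold cross-validation on a bundled dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # the cross-validated estimate of real-world performance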

Bias and variance

Many metrics can be used to measure whether or not a program is learning to perform its task more effectively. For supervised learning problems, many performance metrics measure the amount of prediction error. There are two fundamental causes of prediction error: a model's bias, and its variance. Assume that you have many training sets that are all unique, but equally representative of the population. A model with high bias will produce similar errors for an input regardless of the training set it used to learn; the model biases its own assumptions about the real relationship over the relationship demonstrated in the training data. A model with high variance, conversely, will produce different errors for an input depending on the training set that it used to learn. A model with high bias is inflexible, but a model with high variance may be so flexible that it models the noise in the training set. That is, a model with high variance over-fits the training data, while a model with high bias under-fits the training data. It can be helpful to visualize bias and variance as darts thrown at a dartboard. Each dart is analogous to a prediction, and is thrown by a model trained on a different dataset every time. A model with high bias but low variance will throw darts that are tightly clustered, but the cluster could be far from the bull's-eye. A model with high bias and high variance will throw darts all over the board; the darts are far from the bull's-eye and from each other. A model with low bias and high variance will throw darts that are widely scattered but centered on the bull's-eye. Finally, a model with low bias and low variance will throw darts that are tightly clustered around the bull's-eye.

Ideally, a model will have both low bias and low variance, but efforts to decrease one will frequently increase the other. This is known as the bias-variance trade-off. We will discuss the biases and variances of many of the models introduced in this book.
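A quick way to see the trade-off is to fit models of different flexibility to the same noisy data; the following sketch, with invented data, compares an inflexible degree-1 polynomial to a very flexible degree-9 polynomial.

# A minimal sketch of the trade-off on invented noisy data: a degree-1 model
# is inflexible (high bias), while a degree-9 model is flexible enough to fit
# the noise (high variance).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=20)

for degree in (1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    # The training R^2 rises with degree; a high score on the training data
    # alone can signal memorization rather than generalization.
    print(degree, model.score(X, y))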

Unsupervised learning problems do not have an error signal to measure; instead, performance metrics for unsupervised learning problems measure some attribute of the structure discovered in the data, such as the distances within and between clusters.