Scikit-learn Cookbook - Second Edition - E-Book

Julian Avila

Description

Learn to use scikit-learn operations and functions for machine learning and deep learning applications.

About This Book

  • Handle a variety of machine learning tasks effortlessly by leveraging the power of scikit-learn
  • Perform supervised and unsupervised learning with ease, and evaluate the performance of your model
  • Practical, easy-to-understand recipes aimed at helping you choose the right machine learning algorithm

Who This Book Is For

Data analysts who are already familiar with Python but not so much with scikit-learn, and who want quick solutions to common machine learning problems, will find this book very useful. If you are a Python programmer who wants to take a dive into the world of machine learning in a practical manner, this book will help you too.

What You Will Learn

  • Build predictive models in minutes by using scikit-learn
  • Understand the differences and relationships between classification and regression, two types of supervised learning
  • Use distance metrics to predict in clustering, a type of unsupervised learning
  • Find points with similar characteristics with nearest neighbors
  • Use automation and cross-validation to find the best model and focus on it for a data product
  • Choose the best algorithm among many, or use several together in an ensemble
  • Create your own estimator with the simple syntax of sklearn
  • Explore the feed-forward neural networks available in scikit-learn

In Detail

Python is quickly becoming the go-to language for analysts and data scientists due to its simplicity and flexibility, and within the Python data space, scikit-learn is the unequivocal choice for machine learning. This book includes walkthroughs and solutions to common as well as not-so-common problems in machine learning, and shows how scikit-learn can be leveraged to perform various machine learning tasks effectively.

The second edition begins by taking you through recipes on evaluating the statistical properties of data and generating synthetic data for machine learning modelling. As you progress through the chapters, you will come across recipes that teach you to implement techniques such as data pre-processing, linear regression, logistic regression, K-NN, Naive Bayes, classification, decision trees, ensembles, and much more. Furthermore, you'll learn to optimize your models with multiclass classification, cross-validation, and model evaluation, and dive deeper into implementing deep learning with scikit-learn. Along with covering the enhanced features on model selection, the API, and new features such as classifiers, regressors, and estimators, the book also contains recipes on evaluating and fine-tuning the performance of your model.

By the end of this book, you will have explored a plethora of features offered by scikit-learn for Python to solve any machine learning problem you come across.

Style and Approach

This book consists of practical recipes on scikit-learn that target novices as well as intermediate users. It goes deep into technical issues and includes many real-life examples so that you can apply the techniques to your daily work.




scikit-learn Cookbook

Second Edition

Over 80 recipes for machine learning in Python with scikit-learn

Julian Avila
Trent Hauck

BIRMINGHAM - MUMBAI

scikit-learn Cookbook

Second Edition

 

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: November 2014

Second edition: November 2017

 

Production reference: 1141117

 

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

 

ISBN 978-1-78728-638-2

 

www.packtpub.com

Credits

Authors

Julian Avila

Trent Hauck

Copy Editors

Vikrant Phadkay

Safis Editing

Reviewer

Oleg Okun

Project Coordinator

Nidhi Joshi

Commissioning Editor

Amey Varangaonkar

Proofreader

Safis Editing

Acquisition Editor

Vinay Argekar

Indexer

Tejal Daruwale Soni

Content Development Editor

Mayur Pawanikar

Graphics

Tania Dutta

Technical Editor

Dinesh Pawar

Production Coordinator

Aparna Bhagat

About the Authors

Julian Avila is a programmer and data scientist in the fields of finance and computer vision. He graduated from the Massachusetts Institute of Technology (MIT) in mathematics, where he researched quantum mechanical computation, a field involving physics, math, and computer science. While at MIT, Julian first picked up classical and flamenco guitar, machine learning, and artificial intelligence through discussions with friends in the CSAIL lab. 

He started programming in middle school, including games and geometrically artistic animations. He competed successfully in math and programming and worked for several groups at MIT. Julian has written complete software projects in elegant Python with just-in-time compilation. Some memorable projects of his include a large-scale facial recognition system for videos with neural networks on GPUs, recognizing parts of neurons within pictures, and stock market trading programs.

I would like to thank most of all my wife, Karen, for her immense support while writing this book. I would like to thank my daughters, Annelise and Sofia. Annelise considers this her book. I am very grateful to Bo Morgan, who suggested I use scikit-learn many years ago. We had many artificial intelligence discussions, and Bo introduced me to Marvin Minsky's layers of mental activities and neural networks. I would like to thank as well the late Marvin, who was Bo's advisor. I am grateful to Jose Ramirez, co-founder of Ayaakua, where I applied neural networks and machine learning with scikit-learn to computer vision problems. Special thanks to MIT professor emeritus Robert Rose, who was very encouraging in regard to writing this particular machine learning book. I would also like to thank professors Seth Lloyd and Peter Shor for introducing me to computations of a probabilistic nature, the many-worlds that might be of quantum mechanics; Dr. Paul Bamberg for teaching statistics (although I took a geometry class from him); and Dr. Michael Artin for his humor and geometric algebra knowledge. Finally, I would like to thank Dr. Yuri Chernyak, who taught me a lot about problem solving. I would like to thank Packt for writing (and helping me write) very direct and practical books. I would also like to thank the Python community and their philosophies. Python is a very welcoming and elegant language, particularly effective for solving very tough problems and fine-tuning requirements very fast. I would like to thank you in advance for reading this book and pushing the data science frontier further with scikit-learn.

Trent Hauck is a data scientist living and working in the Seattle area. He grew up in Wichita, Kansas and received his undergraduate and graduate degrees from the University of Kansas.

He is the author of the book Instant Data Intensive Apps with pandas How-to, by Packt Publishing—a book that can get you up to speed quickly with pandas and other associated technologies.

About the Reviewer

Oleg Okun is a machine learning expert and an author/editor of four books, numerous journal articles, and conference papers. His career spans more than a quarter of a century.

He was employed in both academia and industry in his mother country, Belarus, and abroad (Finland, Sweden, and Germany). His work experience includes document image analysis, fingerprint biometrics, bioinformatics, online/offline marketing analytics, credit scoring analytics, and text analytics. He is interested in all aspects of distributed machine learning and the Internet of Things.

Oleg currently lives and works in Hamburg, Germany.

I would like to express my deepest gratitude to my parents for everything that they have done for me.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

 

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/178728638X.

If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

Who this book is for

What you need for this book

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

High-Performance Machine Learning – NumPy

Introduction

NumPy basics

How to do it...

The shape and dimension of NumPy arrays

NumPy broadcasting

Initializing NumPy arrays and dtypes

Indexing

Boolean arrays

Arithmetic operations

NaN values

How it works...

Loading the iris dataset

Getting ready

How to do it...

How it works...

Viewing the iris dataset

How to do it...

How it works...

There's more...

Viewing the iris dataset with pandas

How to do it...

How it works...

Plotting with NumPy and matplotlib

Getting ready

How to do it...

A minimal machine learning recipe – SVM classification

Getting ready

How to do it...

How it works...

There's more...

Introducing cross-validation

Getting ready

How to do it...

How it works...

There's more...

Putting it all together

How to do it...

There's more...

Machine learning overview – classification versus regression

The purpose of scikit-learn

Supervised versus unsupervised

Getting ready

How to do it...

Quick SVC – a classifier and regressor

Making a scorer

How it works...

There's more...

Linear versus nonlinear

Black box versus not

Interpretability

A pipeline

Pre-Model Workflow and Pre-Processing

Introduction

Creating sample data for toy analysis

Getting ready

How to do it...

Creating a regression dataset

Creating an unbalanced classification dataset

Creating a dataset for clustering

How it works...

Scaling data to the standard normal distribution

Getting ready

How to do it...

How it works...

Creating binary features through thresholding

Getting ready

How to do it...

There's more...

Sparse matrices

The fit method

Working with categorical variables

Getting ready

How to do it...

How it works...

There's more...

DictVectorizer class

Imputing missing values through various strategies

Getting ready

How to do it...

How it works...

There's more...

A linear model in the presence of outliers

Getting ready

How to do it...

How it works...

Putting it all together with pipelines

Getting ready

How to do it...

How it works...

There's more...

Using Gaussian processes for regression

Getting ready

How to do it…

Cross-validation with the noise parameter

There's more...

Using SGD for regression

Getting ready

How to do it…

How it works…

Dimensionality Reduction

Introduction

Reducing dimensionality with PCA

Getting ready

How to do it...

How it works...

There's more...

Using factor analysis for decomposition

Getting ready

How to do it...

How it works...

Using kernel PCA for nonlinear dimensionality reduction

Getting ready

How to do it...

How it works...

Using truncated SVD to reduce dimensionality

Getting ready

How to do it...

How it works...

There's more...

Sign flipping

Sparse matrices

Using decomposition to classify with DictionaryLearning

Getting ready

How to do it...

How it works...

Doing dimensionality reduction with manifolds – t-SNE

Getting ready

How to do it...

How it works...

Testing methods to reduce dimensionality with pipelines

Getting ready

How to do it...

How it works...

Linear Models with scikit-learn

Introduction

Fitting a line through data

Getting ready

How to do it...

How it works...

There's more...

Fitting a line through data with machine learning

Getting ready

How to do it...

Evaluating the linear regression model

Getting ready

How to do it...

How it works...

There's more...

Using ridge regression to overcome linear regression's shortfalls

Getting ready

How to do it...

Optimizing the ridge regression parameter

Getting ready

How to do it...

How it works...

There's more...

Bayesian ridge regression

Using sparsity to regularize models

Getting ready

How to do it...

How it works...

LASSO cross-validation – LASSOCV

LASSO for feature selection

Taking a more fundamental approach to regularization with LARS

Getting ready

How to do it...

How it works...

There's more...

References

Linear Models – Logistic Regression

Introduction

Using linear methods for classification – logistic regression

Loading data from the UCI repository

How to do it...

Viewing the Pima Indians diabetes dataset with pandas

How to do it...

Looking at the UCI Pima Indians dataset web page

How to do it...

View the citation policy

Read about missing values and context

Machine learning with logistic regression

Getting ready

Define X, y – the feature and target arrays

How to do it...

Provide training and testing sets

Train the logistic regression

Score the logistic regression

Examining logistic regression errors with a confusion matrix

Getting ready

How to do it...

Reading the confusion matrix

General confusion matrix in context

Varying the classification threshold in logistic regression

Getting ready

How to do it...

Receiver operating characteristic – ROC analysis

Getting ready

Sensitivity

A visual perspective

How to do it...

Calculating TPR in scikit-learn

Plotting sensitivity

There's more...

The confusion matrix in a non-medical context

Plotting an ROC curve without context

How to do it...

Perfect classifier

Imperfect classifier

AUC – the area under the ROC curve

Putting it all together – UCI breast cancer dataset

How to do it...

Outline for future projects

Building Models with Distance Metrics

Introduction

Using k-means to cluster data

Getting ready

How to do it…

How it works...

Optimizing the number of centroids

Getting ready

How to do it...

How it works...

Assessing cluster correctness

Getting ready

How to do it...

There's more...

Using MiniBatch k-means to handle more data

Getting ready

How to do it...

How it works...

Quantizing an image with k-means clustering

Getting ready

How to do it…

How it works…

Finding the closest object in the feature space

Getting ready

How to do it...

How it works...

There's more...

Probabilistic clustering with Gaussian mixture models

Getting ready

How to do it...

How it works...

Using k-means for outlier detection

Getting ready

How to do it...

How it works...

Using KNN for regression

Getting ready

How to do it…

How it works...

Cross-Validation and Post-Model Workflow

Introduction

Selecting a model with cross-validation

Getting ready

How to do it...

How it works...

K-fold cross-validation

Getting ready

How to do it...

There's more...

Balanced cross-validation

Getting ready

How to do it...

There's more...

Cross-validation with ShuffleSplit

Getting ready

How to do it...

Time series cross-validation

Getting ready

How to do it...

There's more...

Grid search with scikit-learn

Getting ready

How to do it...

How it works...

Randomized search with scikit-learn

Getting ready

How to do it...

Classification metrics

Getting ready

How to do it...

There's more...

Regression metrics

Getting ready

How to do it...

Clustering metrics

Getting ready

How to do it...

Using dummy estimators to compare results

Getting ready

How to do it...

How it works...

Feature selection

Getting ready

How to do it...

How it works...

Feature selection on L1 norms

Getting ready

How to do it...

There's more...

Persisting models with joblib or pickle

Getting ready

How to do it...

Opening the saved model

There's more...

Support Vector Machines

Introduction

Classifying data with a linear SVM

Getting ready

Load the data

Visualize the two classes

How to do it...

How it works...

There's more...

Optimizing an SVM

Getting ready

How to do it...

Construct a pipeline

Construct a parameter grid for a pipeline

Provide a cross-validation scheme

Perform a grid search

There's more...

Randomized grid search alternative

Visualize the nonlinear RBF decision boundary

More meaning behind C and gamma

Multiclass classification with SVM

Getting ready

How to do it...

OneVsRestClassifier

Visualize it

How it works...

Support vector regression

Getting ready

How to do it...

Tree Algorithms and Ensembles

Introduction

Doing basic classifications with decision trees

Getting ready

How to do it...

Visualizing a decision tree with pydot

How to do it...

How it works...

There's more...

Tuning a decision tree

Getting ready

How to do it...

There's more...

Using decision trees for regression

Getting ready

How to do it...

There's more...

Reducing overfitting with cross-validation

How to do it...

There's more...

Implementing random forest regression

Getting ready

How to do it...

Bagging regression with nearest neighbors

Getting ready

How to do it...

Tuning gradient boosting trees

Getting ready

How to do it...

There's more...

Finding the best parameters of a gradient boosting classifier

Tuning an AdaBoost regressor

How to do it...

There's more...

Writing a stacking aggregator with scikit-learn

How to do it...

Text and Multiclass Classification with scikit-learn

Using LDA for classification

Getting ready

How to do it...

How it works...

Working with QDA – a nonlinear LDA

Getting ready

How to do it...

How it works...

Using SGD for classification

Getting ready

How to do it...

There's more...

Classifying documents with Naive Bayes

Getting ready

How to do it...

How it works...

There's more...

Label propagation with semi-supervised learning

Getting ready

How to do it...

How it works...

Neural Networks

Introduction

Perceptron classifier

Getting ready

How to do it...

How it works...

There's more...

Neural network – multilayer perceptron

Getting ready

How to do it...

How it works...

Philosophical thoughts on neural networks

Stacking with a neural network

Getting ready

How to do it...

First base model – neural network

Second base model – gradient boost ensemble

Third base model – bagging regressor of gradient boost ensembles

Some functions of the stacker

Meta-learner – extra trees regressor

There's more...

Create a Simple Estimator

Introduction

Create a simple estimator

Getting ready

How to do it...

How it works...

There's more...

Trying the new GEE classifier on the Pima diabetes dataset

Saving your trained estimator

Preface

Starting with installing and setting up scikit-learn, this book contains highly practical recipes on common supervised and unsupervised machine learning concepts. Acquire your data for analysis; select the necessary features for your model; and implement popular techniques such as linear models, classification, regression, clustering, and more in no time at all! The book also contains recipes on evaluating and fine-tuning the performance of your model. The recipes contain both the underlying motivations and theory for trying a technique, plus all the code in detail.

"Premature optimization is the root of all evil"

- Donald Knuth

scikit-learn and Python allow fast prototyping, which is in a sense the opposite of Donald Knuth's premature optimization. Personally, scikit-learn has allowed me to prototype what I once thought was impossible, including large-scale facial recognition systems and stock market trading simulations. You can gain instant insights and build prototypes with scikit-learn. Data science is, by definition, scientific and has many failed hypotheses. Thankfully, with scikit-learn you can see what works (and what does not) within the next few minutes.

Additionally, Jupyter (IPython) notebooks feature a nice interface that is very welcoming to beginners and experts alike and encourages a new scientific software engineering mindset. This welcoming nature is refreshing because, in innovation, we are all beginners.

In the last chapter of this book, you can make your own estimator and Python transitions from a scripting language to more of an object-oriented language. The Python data science ecosystem has the basic components for you to make your own unique style and contribute heavily to the data science team and artificial intelligence.

In analogous fashion, algorithms work as a team in the stacker. Diverse algorithms of different styles vote to make better predictions. Some make better choices than others, but as long as the algorithms are different, the choice in the end will be the best. Stackers and blenders came to prominence in the Netflix $1 million prize competition won by the team Pragmatic Chaos.

Welcome to the world of scikit-learn: a very powerful, simple, and expressive machine learning library. I am truly excited to see what you come up with.

What this book covers

Chapter 1, High-Performance Machine Learning – NumPy, features your first machine learning algorithm with support vector machines. We distinguish between classification (what type?) and regression (how much?). We predict an outcome on data we have not seen.

Chapter 2, Pre-Model Workflow and Pre-Processing, exposes a realistic industrial setting with plenty of data munging and pre-processing. To do machine learning, you need good data, and this chapter tells you how to get it and get it into good form for machine learning.

Chapter 3, Dimensionality Reduction, discusses reducing the number of features to simplify machine learning and allow better use of computational resources.

Chapter 4, Linear Models with scikit-learn, tells the story of linear regression, the oldest predictive model, through the lenses of machine learning and artificial intelligence. You deal with correlated features with ridge regression, eliminate related features with LASSO and cross-validation, or eliminate outliers with robust median-based regression.

Chapter 5, Linear Models – Logistic Regression, examines the important healthcare datasets for cancer and diabetes with logistic regression. This model highlights both similarities and differences between regression and classification, the two types of supervised learning.

Chapter 6, Building Models with Distance Metrics, places points in your familiar Euclidean space of school geometry, as distance is synonymous with similarity. How close (similar) or far away are two points? Can we group them together? With Euclid's help, we can approach unsupervised learning with k-means clustering and place points in categories we do not know in advance.

Chapter 7, Cross-Validation and Post-Model Workflow, features how to select a model that works well with cross-validation: iterated training and testing of predictions. We also save computational work with the pickle module.

Chapter 8, Support Vector Machines, examines in detail the support vector machine, a powerful and easy-to-understand algorithm.

Chapter 9, Tree Algorithms and Ensembles, features the algorithms of decision making: decision trees. This chapter introduces meta-learning algorithms, diverse algorithms that vote in some fashion to increase overall predictive accuracy.

Chapter 10, Text and Multiclass Classification with scikit-learn, reviews the basics of natural language processing with the simple bag-of-words model. More generally, we look at classification with three or more categories.

Chapter 11, Neural Networks, introduces a neural network and perceptrons, the components of a neural network. Each layer figures out a step in a process, leading to a desired outcome. As we do not program any steps specifically, we venture into artificial intelligence. Save the neural network so that you can keep training it later, or load it and utilize it as part of a stacking ensemble.

Chapter 12, Create a Simple Estimator, helps you make your own scikit-learn estimator, which you can contribute to the scikit-learn community and take part in the evolution of data science with scikit-learn.

Who this book is for

This book is for data analysts who are familiar with Python but not so much with scikit-learn, and Python programmers who would like to dive into the world of machine learning in a direct, straightforward fashion.

What you need for this book

You will need to install the following libraries:

anaconda 4.1.1

numba 0.26.0

numpy 1.12.1

pandas 0.20.3

pandas-datareader 0.4.0

patsy 0.4.1

scikit-learn 0.19.0

scipy 0.19.1

statsmodels 0.8.0

sympy 1.0
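
If you want to verify your setup, a minimal sketch such as the following prints the installed version of each core library; note that scikit-learn is imported as sklearn:

import numpy, pandas, scipy, sklearn, statsmodels, sympy #Core libraries used in this book

print(numpy.__version__) #For example, 1.12.1
print(pandas.__version__) #For example, 0.20.3
print(scipy.__version__) #For example, 0.19.1
print(sklearn.__version__) #For example, 0.19.0
print(statsmodels.__version__) #For example, 0.8.0
print(sympy.__version__) #For example, 1.0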

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The scikit-learn library requires input tables of two-dimensional NumPy arrays."

Any command-line input or output is written as follows:

import numpy as np #Load the numpy library for fast array computations
import pandas as pd #Load the pandas data-analysis library
import matplotlib.pyplot as plt #Load the pyplot visualization library

New terms and important words are shown in bold.

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/scikit-learn-Cookbook-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

High-Performance Machine Learning – NumPy

In this chapter, we will cover the following recipes:

NumPy basics

Loading the iris dataset

Viewing the iris dataset

Viewing the iris dataset with pandas

Plotting with NumPy and matplotlib

A minimal machine learning recipe – SVM classification

Introducing cross-validation

Putting it all together

Machine learning overview – classification versus regression

Introduction

In this chapter, we'll learn how to make predictions with scikit-learn. Machine learning emphasizes measuring the ability to predict, and with scikit-learn we will predict accurately and quickly.

We will examine the iris dataset, which consists of measurements of three types of Iris flowers: Iris Setosa, Iris Versicolor, and Iris Virginica.

To measure the strength of the predictions, we will:

1. Save some data for testing.
2. Build a model using only training data.
3. Measure the predictive power on the test set.

The prediction, one of three flower types, is categorical. This type of problem is called a classification problem.

Informally, classification asks, Is it an apple or an orange? Contrast this with machine learning regression, which asks, How many apples? By the way, the answer can be 4.5 apples for regression.
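
As a preview of recipes later in this chapter, here is a minimal sketch of those three steps on the iris dataset. The linear kernel and the 25 percent test split are illustrative choices, not the book's exact settings:

from sklearn import datasets #Built-in datasets, including iris
from sklearn.model_selection import train_test_split #Splits data into training and testing sets
from sklearn.svm import SVC #Support vector classifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=7) #Save some data for testing
svc = SVC(kernel='linear')
svc.fit(X_train, y_train) #Build a model using only training data
print(svc.score(X_test, y_test)) #Measure the predictive power on the test set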

By the evolution of its design, scikit-learn addresses machine learning mainly via four categories:

  • Classification:
    • Non-text classification, like the Iris flowers example
    • Text classification
  • Regression
  • Clustering
  • Dimensionality reduction

NumPy basics

Data science deals in part with structured tables of data. The scikit-learn library requires input tables of two-dimensional NumPy arrays. In this section, you will learn about the numpy library.

How to do it...

We will try a few operations on NumPy arrays. NumPy arrays have a single type for all of their elements and a predefined shape. Let us look first at their shape.
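
The recipes that follow operate on a two-dimensional array named array_1. Here is a minimal sketch of a definition consistent with the indexing outputs shown next; the exact construction used in the recipe may differ:

import numpy as np #Load the numpy library

array_1 = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]) #A two-dimensional array with five rows and two columns
array_1.shape #Returns (5, 2): the predefined shape
array_1.dtype #Returns dtype('int64') on most platforms: a single type for all elements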

Indexing

Look up the values of the two-dimensional arrays with indexing:

array_1[0,0] #Finds value in first row and first column.

1

View the first row:

array_1[0,:]
array([1, 2])

Then view the first column:

array_1[:,0]
array([1, 3, 5, 7, 9])

View specific values along both axes. Also view the second to the fourth rows:
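
A minimal sketch of such slicing, assuming the array_1 defined earlier:

array_1[1:4, 0] #Rows two to four along the first axis, first column along the second
array([3, 5, 7])

array_1[1:4] #The second to the fourth rows
array([[3, 4],
       [5, 6],
       [7, 8]])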