Python: Real World Machine Learning - Prateek Joshi - E-Book


Prateek Joshi

Description

Learn to solve challenging data science problems by building powerful machine learning models using Python

About This Book

  • Understand which algorithms to use in a given context with the help of this exciting recipe-based guide
  • This practical tutorial tackles real-world computing problems through a rigorous and effective approach
  • Build state-of-the-art models and develop personalized recommendations to perform machine learning at scale

Who This Book Is For

This Learning Path is for Python programmers who are looking to use machine learning algorithms to create real-world applications. It is ideal for Python professionals who want to work with large and complex datasets, and for Python developers, analysts, and data scientists who want to add some of the most powerful recent trends in data science to their existing skills. Experience with Python, Jupyter Notebooks, and command-line execution, together with enough mathematical knowledge to follow the concepts, is expected. Basic knowledge of machine learning is also expected.

What You Will Learn

  • Use predictive modeling and apply it to real-world problems
  • Understand how to perform market segmentation using unsupervised learning
  • Apply your new-found skills to solve real problems, through clearly-explained code for every technique and test
  • Compete with top data scientists by gaining a practical and theoretical understanding of cutting-edge deep learning algorithms
  • Increase predictive accuracy with deep learning and scalable data-handling techniques
  • Work with modern state-of-the-art large-scale machine learning techniques
  • Learn to use Python code to implement a range of machine learning algorithms and techniques

In Detail

Machine learning is becoming increasingly pervasive in the modern data-driven world. It is used extensively across many fields, such as search engines, robotics, self-driving cars, and more. Machine learning is transforming the way we understand and interact with the world around us.

In the first module, Python Machine Learning Cookbook, you will learn how to perform various machine learning tasks using a wide variety of machine learning algorithms to solve real-world problems and use Python to implement these algorithms.

The second module, Advanced Machine Learning with Python, is designed to take you on a guided tour of the most relevant and powerful machine learning techniques; along the way, you'll acquire a broad set of skills in feature selection and feature engineering.

The third module in this learning path, Large Scale Machine Learning with Python, dives into scalable machine learning and the three forms of scalability. It covers the most effective machine learning techniques on a MapReduce framework with Hadoop and Spark in Python.

This Learning Path will teach you Python machine learning for the real world. The machine learning techniques covered in this Learning Path are at the forefront of commercial practice.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

  • Python Machine Learning Cookbook by Prateek Joshi
  • Advanced Machine Learning with Python by John Hearty
  • Large Scale Machine Learning with Python by Bastiaan Sjardin, Alberto Boschetti, Luca Massaron

Style and approach

This course is a smooth learning path that will teach you how to get started with Python machine learning for the real world and develop solutions to real-world problems. Through this comprehensive course, you'll learn to implement the most effective machine learning techniques from scratch and more!


Page count: 1091

Publication year: 2016




Table of Contents

Python: Real World Machine Learning
Python: Real World Machine Learning
Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
I. Module 1
1. The Realm of Supervised Learning
Introduction
Preprocessing data using different techniques
Getting ready
How to do it…
Mean removal
Scaling
Normalization
Binarization
One Hot Encoding
Label encoding
How to do it…
Building a linear regressor
Getting ready
How to do it…
Computing regression accuracy
Getting ready
How to do it…
Achieving model persistence
How to do it…
Building a ridge regressor
Getting ready
How to do it…
Building a polynomial regressor
Getting ready
How to do it…
Estimating housing prices
Getting ready
How to do it…
Computing the relative importance of features
How to do it…
Estimating bicycle demand distribution
Getting ready
How to do it…
There's more…
2. Constructing a Classifier
Introduction
Building a simple classifier
How to do it…
There's more…
Building a logistic regression classifier
How to do it…
Building a Naive Bayes classifier
How to do it…
Splitting the dataset for training and testing
How to do it…
Evaluating the accuracy using cross-validation
Getting ready…
How to do it…
Visualizing the confusion matrix
How to do it…
Extracting the performance report
How to do it…
Evaluating cars based on their characteristics
Getting ready
How to do it…
Extracting validation curves
How to do it…
Extracting learning curves
How to do it…
Estimating the income bracket
How to do it…
3. Predictive Modeling
Introduction
Building a linear classifier using Support Vector Machines (SVMs)
Getting ready
How to do it…
Building a nonlinear classifier using SVMs
How to do it…
Tackling class imbalance
How to do it…
Extracting confidence measurements
How to do it…
Finding optimal hyperparameters
How to do it…
Building an event predictor
Getting ready
How to do it…
Estimating traffic
Getting ready
How to do it…
4. Clustering with Unsupervised Learning
Introduction
Clustering data using the k-means algorithm
How to do it…
Compressing an image using vector quantization
How to do it…
Building a Mean Shift clustering model
How to do it…
Grouping data using agglomerative clustering
How to do it…
Evaluating the performance of clustering algorithms
How to do it…
Automatically estimating the number of clusters using DBSCAN algorithm
How to do it…
Finding patterns in stock market data
How to do it…
Building a customer segmentation model
How to do it…
5. Building Recommendation Engines
Introduction
Building function compositions for data processing
How to do it…
Building machine learning pipelines
How to do it…
How it works…
Finding the nearest neighbors
How to do it…
Constructing a k-nearest neighbors classifier
How to do it…
How it works…
Constructing a k-nearest neighbors regressor
How to do it…
How it works…
Computing the Euclidean distance score
How to do it…
Computing the Pearson correlation score
How to do it…
Finding similar users in the dataset
How to do it…
Generating movie recommendations
How to do it…
6. Analyzing Text Data
Introduction
Preprocessing data using tokenization
How to do it…
Stemming text data
How to do it…
How it works…
Converting text to its base form using lemmatization
How to do it…
Dividing text using chunking
How to do it…
Building a bag-of-words model
How to do it…
How it works…
Building a text classifier
How to do it…
How it works…
Identifying the gender
How to do it…
Analyzing the sentiment of a sentence
How to do it…
How it works…
Identifying patterns in text using topic modeling
How to do it…
How it works…
7. Speech Recognition
Introduction
Reading and plotting audio data
How to do it…
Transforming audio signals into the frequency domain
How to do it…
Generating audio signals with custom parameters
How to do it…
Synthesizing music
How to do it…
Extracting frequency domain features
How to do it…
Building Hidden Markov Models
How to do it…
Building a speech recognizer
How to do it…
8. Dissecting Time Series and Sequential Data
Introduction
Transforming data into the time series format
How to do it…
Slicing time series data
How to do it…
Operating on time series data
How to do it…
Extracting statistics from time series data
How to do it…
Building Hidden Markov Models for sequential data
Getting ready
How to do it…
Building Conditional Random Fields for sequential text data
Getting ready
How to do it…
Analyzing stock market data using Hidden Markov Models
How to do it…
9. Image Content Analysis
Introduction
Operating on images using OpenCV-Python
How to do it…
Detecting edges
How to do it…
Histogram equalization
How to do it…
Detecting corners
How to do it…
Detecting SIFT feature points
How to do it…
Building a Star feature detector
How to do it…
Creating features using visual codebook and vector quantization
How to do it…
Training an image classifier using Extremely Random Forests
How to do it…
Building an object recognizer
How to do it…
10. Biometric Face Recognition
Introduction
Capturing and processing video from a webcam
How to do it…
Building a face detector using Haar cascades
How to do it…
Building eye and nose detectors
How to do it…
Performing Principal Components Analysis
How to do it…
Performing Kernel Principal Components Analysis
How to do it…
Performing blind source separation
How to do it…
Building a face recognizer using Local Binary Patterns Histogram
How to do it…
11. Deep Neural Networks
Introduction
Building a perceptron
How to do it…
Building a single layer neural network
How to do it…
Building a deep neural network
How to do it…
Creating a vector quantizer
How to do it…
Building a recurrent neural network for sequential data analysis
How to do it…
Visualizing the characters in an optical character recognition database
How to do it…
Building an optical character recognizer using neural networks
How to do it…
12. Visualizing Data
Introduction
Plotting 3D scatter plots
How to do it…
Plotting bubble plots
How to do it…
Animating bubble plots
How to do it…
Drawing pie charts
How to do it…
Plotting date-formatted time series data
How to do it…
Plotting histograms
How to do it…
Visualizing heat maps
How to do it…
Animating dynamic signals
How to do it…
II. Module 2
1. Unsupervised Machine Learning
Principal component analysis
PCA – a primer
Employing PCA
Introducing k-means clustering
Clustering – a primer
Kick-starting clustering analysis
Tuning your clustering configurations
Self-organizing maps
SOM – a primer
Employing SOM
Further reading
Summary
2. Deep Belief Networks
Neural networks – a primer
The composition of a neural network
Network topologies
Restricted Boltzmann Machine
Introducing the RBM
Topology
Training
Applications of the RBM
Further applications of the RBM
Deep belief networks
Training a DBN
Applying the DBN
Validating the DBN
Further reading
Summary
3. Stacked Denoising Autoencoders
Autoencoders
Introducing the autoencoder
Topology
Training
Denoising autoencoders
Applying a dA
Stacked Denoising Autoencoders
Applying the SdA
Assessing SdA performance
Further reading
Summary
4. Convolutional Neural Networks
Introducing the CNN
Understanding the convnet topology
Understanding convolution layers
Understanding pooling layers
Training a convnet
Putting it all together
Applying a CNN
Further Reading
Summary
5. Semi-Supervised Learning
Introduction
Understanding semi-supervised learning
Semi-supervised algorithms in action
Self-training
Implementing self-training
Finessing your self-training implementation
Improving the selection process
Contrastive Pessimistic Likelihood Estimation
Further reading
Summary
6. Text Feature Engineering
Introduction
Text feature engineering
Cleaning text data
Text cleaning with BeautifulSoup
Managing punctuation and tokenizing
Tagging and categorising words
Tagging with NLTK
Sequential tagging
Backoff tagging
Creating features from text data
Stemming
Bagging and random forests
Testing our prepared data
Further reading
Summary
7. Feature Engineering Part II
Introduction
Creating a feature set
Engineering features for ML applications
Using rescaling techniques to improve the learnability of features
Creating effective derived variables
Reinterpreting non-numeric features
Using feature selection techniques
Performing feature selection
Correlation
LASSO
Recursive Feature Elimination
Genetic models
Feature engineering in practice
Acquiring data via RESTful APIs
Testing the performance of our model
Twitter
Translink Twitter
Consumer comments
The Bing Traffic API
Deriving and selecting variables using feature engineering techniques
The weather API
Further reading
Summary
8. Ensemble Methods
Introducing ensembles
Understanding averaging ensembles
Using bagging algorithms
Using random forests
Applying boosting methods
Using XGBoost
Using stacking ensembles
Applying ensembles in practice
Using models in dynamic applications
Understanding model robustness
Identifying modeling risk factors
Strategies for managing model robustness
Further reading
Summary
9. Additional Python Machine Learning Tools
Alternative development tools
Introduction to Lasagne
Getting to know Lasagne
Introduction to TensorFlow
Getting to know TensorFlow
Using TensorFlow to iteratively improve our models
Knowing when to use these libraries
Further reading
Summary
A. Chapter Code Requirements
III. Module 3
1. First Steps to Scalability
Explaining scalability in detail
Making large scale examples
Introducing Python
Scale up with Python
Scale out with Python
Python for large scale machine learning
Choosing between Python 2 and Python 3
Package upgrades
Scientific distributions
Introducing Jupyter/IPython
Python packages
NumPy
SciPy
Pandas
Scikit-learn
The matplotlib package
Gensim
H2O
XGBoost
Theano
TensorFlow
The sknn library
Theanets
Keras
Other useful packages to install on your system
Summary
2. Scalable Learning in Scikit-learn
Out-of-core learning
Subsampling as a viable option
Optimizing one instance at a time
Building an out-of-core learning system
Streaming data from sources
Datasets to try the real thing yourself
The first example – streaming the bike-sharing dataset
Using pandas I/O tools
Working with databases
Paying attention to the ordering of instances
Stochastic learning
Batch gradient descent
Stochastic gradient descent
The Scikit-learn SGD implementation
Defining SGD learning parameters
Feature management with data streams
Describing the target
The hashing trick
Other basic transformations
Testing and validation in a stream
Trying SGD in action
Summary
3. Fast SVM Implementations
Datasets to experiment with on your own
The bike-sharing dataset
The covertype dataset
Support Vector Machines
Hinge loss and its variants
Understanding the Scikit-learn SVM implementation
Pursuing nonlinear SVMs by subsampling
Achieving SVM at scale with SGD
Feature selection by regularization
Including non-linearity in SGD
Trying explicit high-dimensional mappings
Hyperparameter tuning
Other alternatives for SVM fast learning
Nonlinear and faster with Vowpal Wabbit
Installing VW
Understanding the VW data format
Python integration
A few examples using reductions for SVM and neural nets
Faster bike-sharing
The covertype dataset crunched by VW
Summary
4. Neural Networks and Deep Learning
The neural network architecture
What and how neural networks learn
Choosing the right architecture
The input layer
The hidden layer
The output layer
Neural networks in action
Parallelization for sknn
Neural networks and regularization
Neural networks and hyperparameter optimization
Neural networks and decision boundaries
Deep learning at scale with H2O
Large scale deep learning with H2O
Gridsearch on H2O
Deep learning and unsupervised pretraining
Deep learning with theanets
Autoencoders and unsupervised learning
Autoencoders
Summary
5. Deep Learning with TensorFlow
TensorFlow installation
TensorFlow operations
GPU computing
Linear regression with SGD
A neural network from scratch in TensorFlow
Machine learning on TensorFlow with SkFlow
Deep learning with large files – incremental learning
Keras and TensorFlow installation
Convolutional Neural Networks in TensorFlow through Keras
The convolution layer
The pooling layer
The fully connected layer
CNN's with an incremental approach
GPU Computing
Summary
6. Classification and Regression Trees at Scale
Bootstrap aggregation
Random forest and extremely randomized forest
Fast parameter optimization with randomized search
Extremely randomized trees and large datasets
CART and boosting
Gradient Boosting Machines
max_depth
learning_rate
Subsample
Faster GBM with warm_start
Speeding up GBM with warm_start
Training and storing GBM models
XGBoost
XGBoost regression
XGBoost and variable importance
XGBoost streaming large datasets
XGBoost model persistence
Out-of-core CART with H2O
Random forest and gridsearch on H2O
Stochastic gradient boosting and gridsearch on H2O
Summary
7. Unsupervised Learning at Scale
Unsupervised methods
Feature decomposition – PCA
Randomized PCA
Incremental PCA
Sparse PCA
PCA with H2O
Clustering – K-means
Initialization methods
K-means assumptions
Selection of the best K
Scaling K-means – mini-batch
K-means with H2O
LDA
Scaling LDA – memory, CPUs, and machines
Summary
8. Distributed Environments – Hadoop and Spark
From a standalone machine to a bunch of nodes
Why do we need a distributed framework?
Setting up the VM
VirtualBox
Vagrant
Using the VM
The Hadoop ecosystem
Architecture
HDFS
MapReduce
YARN
Spark
pySpark
Summary
9. Practical Machine Learning with Spark
Setting up the VM for this chapter
Sharing variables across cluster nodes
Broadcast read-only variables
Accumulators write-only variables
Broadcast and accumulators together – an example
Data preprocessing in Spark
JSON files and Spark DataFrames
Dealing with missing data
Grouping and creating tables in-memory
Writing the preprocessed DataFrame or RDD to disk
Working with Spark DataFrames
Machine learning with Spark
Spark on the KDD99 dataset
Reading the dataset
Feature engineering
Training a learner
Evaluating a learner's performance
The power of the ML pipeline
Manual tuning
Cross-validation
Final cleanup
Summary
A. Introduction to GPUs and Theano
GPU computing
Theano – parallel computing on the GPU
Installing Theano
A. Bibliography
Index

Python: Real World Machine Learning

Python: Real World Machine Learning

Learn to solve challenging data science problems by building powerful machine learning models using Python

A course in three modules

BIRMINGHAM - MUMBAI

Python: Real World Machine Learning

Copyright © 2016 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Published on: October 2016

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78712-321-2

www.packtpub.com

Credits

Authors

Prateek Joshi

John Hearty

Bastiaan Sjardin

Luca Massaron

Alberto Boschetti

Reviewers

Dr. Vahid Mirjalili

Jared Huffman

Ashwin Pajankar

Oleg Okun

Kai Londenberg

Content Development Editor

Aishwarya Pandere

Production Coordinator

Nilesh Mohite

Preface

Machine learning is becoming increasingly pervasive in the modern data-driven world. It is used extensively across many fields, such as search engines, robotics, self-driving cars, and so on. In this course, you will explore various real-life scenarios where you can use machine learning. You will understand what algorithms you should use in a given context using this exciting recipe-based guide.

This course starts by talking about various realms in machine learning followed by practical examples.

What this learning path covers

Module 1, Python Machine Learning Cookbook, teaches you about the algorithms that we use to build recommendation engines. We will learn how to apply these algorithms to collaborative filtering and movie recommendations.

Module 2, Advanced Machine Learning with Python, explains how to apply several semi-supervised learning techniques, including CPLE, self-learning, and S3VM.

Module 3, Large Scale Machine Learning with Python, covers interesting deep learning techniques together with an online method for neural networks. Although TensorFlow is only in its infancy, the framework provides elegant machine learning solutions.

What you need for this learning path

Module 1: Although the language keeps moving forward and better versions keep coming out, many developers still use Python 2.x, and many operating systems ship with it built in. This course focuses on machine learning in Python rather than on Python itself, and sticking with Python 2.x also maintains compatibility with libraries that haven't yet been ported to Python 3.x. Hence, the code in this module is oriented towards Python 2.x, although we have tried to keep it as version-agnostic as possible.

Module 2: The entirety of this course's content leverages openly available data and code, including open source Python libraries and frameworks. Each chapter's example code is accompanied by a README file documenting all the libraries required to run that chapter's accompanying scripts, and the contents of these files are collated here for your convenience. Some libraries required for earlier chapters should also be available when you work with code from any later chapter; these requirements are identified using bold text. In particular, it is important to set up the first chapter's required libraries before working with any later content in the book.

Module 3: Running the code examples provided in this book requires Python 2.7 or a higher version installed on macOS, Linux, or Microsoft Windows.

The examples throughout the book will make frequent use of Python’s essential libraries, such as SciPy, NumPy, Scikit-learn, and StatsModels, and to a minor extent, matplotlib and pandas, for scientific and statistical computing. We will also make use of an out-of-core cloud computing application called H2O. This book is highly dependent on Jupyter and its Notebooks powered by the Python kernel. We will use its most recent version, 4.1, for this book. The first chapter will provide you with all the step-by-step instructions and some useful tips to set up your Python environment, these core libraries, and all the necessary tools.

Who this learning path is for

This Learning Path is for Python programmers who are looking to use machine learning algorithms to create real-world applications. Python professionals intending to work with large and complex datasets, as well as Python developers, analysts, and data scientists who want to add some of the most powerful recent trends in data science to their existing skills, will find this Learning Path useful. Experience with Python, Jupyter Notebooks, and command-line execution, together with enough mathematical knowledge to follow the concepts, is expected. Basic knowledge of machine learning is also expected.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this course—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the course’s title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.
  2. Hover the mouse pointer on the SUPPORT tab at the top.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the course in the Search box.
  5. Select the course for which you're looking to download the code files.
  6. Choose from the drop-down menu where you purchased this course from.
  7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the course’s webpage at the Packt Publishing website. This page can be accessed by entering the course’s name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/Python-Real-World-Machine-Learning. We also have other code bundles from our rich catalog of books, videos, and courses available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our courses—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this course. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the course in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this course, you can contact us at <[email protected]>, and we will do our best to address the problem.

Part I. Module 1

Python Machine Learning Cookbook

100 recipes that teach you how to perform various machine learning tasks in the real world

Chapter 1. The Realm of Supervised Learning

In this chapter, we will cover the following recipes:

  • Preprocessing data using different techniques
  • Label encoding
  • Building a linear regressor
  • Computing regression accuracy
  • Achieving model persistence
  • Building a ridge regressor
  • Building a polynomial regressor
  • Estimating housing prices
  • Computing the relative importance of features
  • Estimating bicycle demand distribution

Introduction

If you are familiar with the basics of machine learning, you will certainly know what supervised learning is all about. To give you a quick refresher, supervised learning refers to building a machine learning model that is based on labeled samples. For example, if we build a system to estimate the price of a house based on various parameters, such as size, locality, and so on, we first need to create a database and label it. We need to tell our algorithm what parameters correspond to what prices. Based on this data, our algorithm will learn how to calculate the price of a house using the input parameters.
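The house-price idea above can be sketched in a few lines with scikit-learn. The sizes, prices, and the exact 3,000-per-square-metre relationship below are invented for illustration, and the sketch uses Python 3 syntax, whereas the book's own recipes target Python 2.x:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical labeled samples: house size (square metres) -> price.
# The data is constructed to follow price = 3000 * size exactly.
sizes = np.array([[50], [80], [110], [140]])
prices = np.array([150000, 240000, 330000, 420000])

model = LinearRegression()
model.fit(sizes, prices)  # learn the size-to-price relationship

# Predict the price of an unseen 100 square-metre house
predicted = model.predict(np.array([[100]]))[0]
print(round(predicted))  # -> 300000
```

Because the toy data is perfectly linear, the fitted model recovers the underlying relationship exactly; real data would, of course, leave residual error.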

Unsupervised learning is the opposite of what we just discussed. There is no labeled data available here. Let's assume that we have a bunch of datapoints, and we just want to separate them into multiple groups. We don't exactly know what the criteria of separation would be. So, an unsupervised learning algorithm will try to separate the given dataset into a fixed number of groups in the best possible way. We will discuss unsupervised learning in the upcoming chapters.
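As a minimal sketch of this grouping idea, assuming scikit-learn is available (the points below are invented, and the sketch uses Python 3 syntax rather than the book's Python 2.x), k-means can separate unlabeled points into a fixed number of groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs of unlabeled 2D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Ask for two groups; the algorithm finds the separation itself
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Points in the same blob receive the same cluster label
print(labels)
```

Note that the numeric label assigned to each group is arbitrary; only the grouping itself is meaningful.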

We will use various Python packages, such as NumPy, SciPy, scikit-learn, and matplotlib, during the course of this book to build various things. If you use Windows, it is recommended that you use a SciPy-stack compatible version of Python. You can check the list of compatible versions at http://www.scipy.org/install.html. These distributions come with all the necessary packages already installed. If you use Mac OS X or Ubuntu, installing these packages is fairly straightforward. Here are some useful links for installation and documentation:

  • NumPy: http://docs.scipy.org/doc/numpy-1.10.1/user/install.html
  • SciPy: http://www.scipy.org/install.html
  • scikit-learn: http://scikit-learn.org/stable/install.html
  • matplotlib: http://matplotlib.org/1.4.2/users/installing.html

Make sure that you have these packages installed on your machine before you proceed.

Computing regression accuracy

Now that we know how to build a regressor, it's important to understand how to evaluate the quality of a regressor as well. In this context, an error is defined as the difference between the actual value and the value that is predicted by the regressor.

Getting ready

Let's quickly understand what metrics can be used to measure the quality of a regressor. A regressor can be evaluated using many different metrics, such as the following:

  • Mean absolute error: This is the average of absolute errors of all the datapoints in the given dataset.
  • Mean squared error: This is the average of the squares of the errors of all the datapoints in the given dataset. It is one of the most popular metrics out there!
  • Median absolute error: This is the median of all the errors in the given dataset. The main advantage of this metric is that it's robust to outliers. A single bad point in the test dataset wouldn't skew the entire error metric, as opposed to a mean error metric.
  • Explained variance score: This score measures how well our model can account for the variation in our dataset. A score of 1.0 indicates that our model is perfect.
  • R2 score: This is pronounced as R-squared, and this score refers to the coefficient of determination. This tells us how well the unknown samples will be predicted by our model. The best possible score is 1.0, and the values can be negative as well.

How to do it…

There is a module in scikit-learn that provides functionalities to compute all the following metrics. Open a new Python file and add the following lines:

import sklearn.metrics as sm

print "Mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred), 2)
print "Mean squared error =", round(sm.mean_squared_error(y_test, y_test_pred), 2)
print "Median absolute error =", round(sm.median_absolute_error(y_test, y_test_pred), 2)
print "Explained variance score =", round(sm.explained_variance_score(y_test, y_test_pred), 2)
print "R2 score =", round(sm.r2_score(y_test, y_test_pred), 2)

Keeping track of every single metric can get tedious, so we pick one or two metrics to evaluate our model. A good practice is to make sure that the mean squared error is low and the explained variance score is high.
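The recipe's snippet uses Python 2 print statements and assumes `y_test` and `y_test_pred` from the earlier linear-regressor recipe. A self-contained Python 3 equivalent, with made-up true and predicted values standing in for those arrays, might look like this:

```python
import sklearn.metrics as sm

# Hypothetical values standing in for the recipe's y_test / y_test_pred
y_test = [3.0, -0.5, 2.0, 7.0]
y_test_pred = [2.5, 0.0, 2.0, 8.0]

print("Mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred), 2))
print("Mean squared error =", round(sm.mean_squared_error(y_test, y_test_pred), 2))
print("Median absolute error =", round(sm.median_absolute_error(y_test, y_test_pred), 2))
print("Explained variance score =", round(sm.explained_variance_score(y_test, y_test_pred), 2))
print("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))
```

With these toy values, the mean absolute error comes out to 0.5 and the R2 score close to 0.95, illustrating a model that tracks the data well but not perfectly.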

Chapter 2. Constructing a Classifier

In this chapter, we will cover the following recipes:

  • Building a simple classifier
  • Building a logistic regression classifier
  • Building a Naïve Bayes classifier
  • Splitting the dataset for training and testing
  • Evaluating the accuracy using cross-validation
  • Visualizing the confusion matrix
  • Extracting the performance report
  • Evaluating cars based on their characteristics
  • Extracting validation curves
  • Extracting learning curves
  • Estimating the income bracket

Introduction

In the field of machine learning, classification refers to the process of using the characteristics of data to separate it into a certain number of classes. This is different from regression that we discussed in the previous chapter where the output is a real number. A supervised learning classifier builds a model using labeled training data and then uses this model to classify unknown data.

A classifier can be any algorithm that implements classification. In simple cases, this classifier can be a straightforward mathematical function. In more real-world cases, this classifier can take very complex forms. In the course of study, we will see that classification can be either binary, where we separate data into two classes, or it can be multiclass, where we separate data into more than two classes. The mathematical techniques that are devised to deal with the classification problem tend to deal with two classes, so we extend them in different ways to deal with the multiclass problem as well.

Evaluating the accuracy of a classifier is an important step in the world of machine learning. We need to learn how to use the available data to get an idea of how this model will perform in the real world. In this chapter, we will look at recipes that deal with all these things.

Building a logistic regression classifier

Despite the word regression being present in the name, logistic regression is actually used for classification purposes. Given a set of datapoints, our goal is to build a model that can draw linear boundaries between our classes. It extracts these boundaries by solving a set of equations derived from the training data.

How to do it…

Let's see how to do this in Python. We will use the logistic_regression.py file that is provided to you as a reference. Assuming that you imported the necessary packages, let's create some sample data along with training labels:
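The recipe's own listing from logistic_regression.py is not reproduced in this excerpt. As a minimal sketch of the idea, with invented sample points and scikit-learn's LogisticRegression standing in for the book's code (and Python 3 syntax rather than the book's Python 2.x):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented 2D sample points with training labels for three classes
X = np.array([[4.0, 7.0], [3.5, 8.0], [3.1, 6.2],   # class 0
              [0.5, 1.0], [1.0, 2.0], [1.2, 1.9],   # class 1
              [6.0, 2.0], [5.7, 1.5], [5.4, 2.2]])  # class 2
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# A large C weakens regularization, yielding sharper linear boundaries
classifier = LogisticRegression(C=100.0)
classifier.fit(X, y)

# Classify a new point near the class-0 cluster
print(classifier.predict([[3.3, 7.0]]))
```

Because the clusters are well separated, the learned linear boundaries assign each new point to the class of its nearby cluster.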