Combine popular machine learning techniques to create ensemble models using Python
Key Features
Book Description
Ensembling is a technique of combining two or more similar or dissimilar machine learning algorithms to create a model that delivers superior predictive power. This book will demonstrate how you can use a variety of weak algorithms to make a strong predictive model.
With its hands-on approach, you'll not only get up to speed on the basic theory but also the application of various ensemble learning techniques. Using examples and real-world datasets, you'll be able to produce better machine learning models to solve supervised learning problems such as classification and regression. Furthermore, you'll go on to leverage ensemble learning techniques such as clustering to produce unsupervised machine learning models. As you progress, the chapters will cover different machine learning algorithms that are widely used in the practical world to make predictions and classifications. You'll even get to grips with the use of Python libraries such as scikit-learn and Keras for implementing different ensemble models.
By the end of this book, you will be well-versed in ensemble learning, and have the skills you need to understand which ensemble method is required for which problem, and successfully implement them in real-world scenarios.
What you will learn
Who this book is for
This book is for data analysts, data scientists, machine learning engineers and other professionals who are looking to generate advanced models using ensemble techniques. An understanding of Python code and basic knowledge of statistics is required to make the most out of this book.
You can read this e-book in Legimi apps or in any app that supports the following format:
Page count: 273
Year of publication: 2019
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Sunith Shetty
Acquisition Editor: Devika Battike
Content Development Editor: Athikho Sapuni Rishana
Senior Editor: Martin Whittemore
Technical Editor: Utkarsha S. Kadam
Copy Editor: Safis Editing
Project Coordinator: Kirti Pisat
Proofreader: Safis Editing
Indexer: Manju Arasan
Production Designer: Alishon Mendonsa
First published: July 2019
Production reference: 1180719
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78961-285-1
www.packtpub.com
Packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
George Kyriakides is a Ph.D. researcher, studying distributed neural architecture search. His interests and experience include the automated generation and optimization of predictive models for a wide array of applications, such as image recognition, time series analysis, and financial applications. He holds an M.Sc. in computational methods and applications, and a B.Sc. in applied informatics, both from the University of Macedonia, Thessaloniki, Greece.
Konstantinos G. Margaritis has been a teacher and researcher in computer science for more than 30 years. His research interests include parallel and distributed computing, as well as computational intelligence and machine learning. He holds an M.Eng. in electrical engineering (Aristotle University of Thessaloniki, Greece), as well as an M.Sc. and a Ph.D. in computer science (Loughborough University, UK). He is a professor at the Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece.
Greg Walters has been involved with computers and computer programming since 1972. Currently, he is extremely well versed in Visual Basic, Visual Basic .NET, Python, and SQL (using MySQL, SQLite, Microsoft SQL Server, and Oracle), as well as C++, Delphi, Modula-2, Pascal, C, 80x86 Assembler, COBOL, and Fortran. He is a programming trainer and has trained numerous people on many pieces of computer software, including MySQL, Open Database Connectivity, Quattro Pro, Corel Draw!, Paradox, Microsoft Word, Excel, DOS, Windows 3.11, Windows for Workgroups, Windows 95, Windows NT, Windows 2000, Windows XP, and Linux. He is currently retired and, in his spare time, is a musician and loves to cook, but he is also open to working as a freelancer on various projects.
Bhavesh Bhatt is a technology postgraduate at BITS Pilani with a keen interest in machine learning, data science, and computer vision. He currently works as a data scientist at Fractal Analytics. He has taught data science using the Python programming language to hundreds of students in the classroom. Additionally, Bhavesh hosts a machine learning-based educational YouTube channel with over 4,400 subscribers.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Hands-On Ensemble Learning with Python
About Packt
Why subscribe?
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Code in action
Conventions used
Get in touch
Reviews
Section 1: Introduction and Required Software Tools
A Machine Learning Refresher
Technical requirements
Learning from data
Popular machine learning datasets
Diabetes
Breast cancer
Handwritten digits
Supervised and unsupervised learning
Supervised learning
Unsupervised learning
Dimensionality reduction
Performance measures
Cost functions
Mean absolute error
Mean squared error
Cross entropy loss
Metrics
Classification accuracy
Confusion matrix
Sensitivity, specificity, and area under the curve
Precision, recall, and the F1 score
Evaluating models
Machine learning algorithms
Python packages
Supervised learning algorithms
Regression
Support vector machines
Neural networks
Decision trees
K-Nearest Neighbors
K-means
Summary
Getting Started with Ensemble Learning
Technical requirements
Bias, variance, and the trade-off
What is bias?
What is variance?
Trade-off
Ensemble learning
Motivation
Identifying bias and variance
Validation curves
Learning curves
Ensemble methods
Difficulties in ensemble learning
Weak or noisy data
Understanding interpretability
Computational cost
Choosing the right models
Summary
Section 2: Non-Generative Methods
Voting
Technical requirements
Hard and soft voting
Hard voting
Soft voting
Python implementation
Custom hard voting implementation
Analyzing our results using Python
Using scikit-learn
Hard voting implementation
Soft voting implementation
Analyzing our results
Summary
Stacking
Technical requirements
Meta-learning
Stacking
Creating metadata
Deciding on an ensemble's composition
Selecting base learners
Selecting the meta-learner
Python implementation
Stacking for regression
Stacking for classification
Creating a stacking regressor class for scikit-learn
Summary
Section 3: Generative Methods
Bagging
Technical requirements
Bootstrapping
Creating bootstrap samples
Bagging
Creating base learners
Strengths and weaknesses
Python implementation
Implementation
Parallelizing the implementation
Using scikit-learn 
Bagging for classification
Bagging for regression
Summary
Boosting
Technical requirements
AdaBoost
Weighted sampling
Creating the ensemble
Implementing AdaBoost in Python
Strengths and weaknesses
Gradient boosting
Creating the ensemble
Further reading
Implementing gradient boosting in Python
Using scikit-learn
Using AdaBoost
Using gradient boosting
XGBoost
Using XGBoost for regression
Using XGBoost for classification
Other boosting libraries
Summary
Random Forests
Technical requirements
Understanding random forest trees
Building trees
Illustrative example
Extra trees
Creating forests
Analyzing forests
Strengths and weaknesses
Using scikit-learn
Random forests for classification
Random forests for regression
Extra trees for classification
Extra trees regression
Summary
Section 4: Clustering
Clustering
Technical requirements
Consensus clustering
Hierarchical clustering
K-means clustering
Strengths and weaknesses
Using scikit-learn
Using voting
Using OpenEnsembles
Using graph closure and co-occurrence linkage
Graph closure
Co-occurrence matrix linkage
Summary
Section 5: Real World Applications
Classifying Fraudulent Transactions
Technical requirements
Getting familiar with the dataset
Exploratory analysis
Evaluation methods
Voting
Testing the base learners
Optimizing the decision tree
Creating the ensemble
Stacking
Bagging
Boosting
XGBoost
Using random forests
Comparative analysis of ensembles
Summary
Predicting Bitcoin Prices
Technical requirements
Time series data
Bitcoin data analysis
Establishing a baseline
The simulator
Voting
Improving voting
Stacking
Improving stacking
Bagging
Improving bagging
Boosting
Improving boosting
Random forests
Improving random forest
Summary
Evaluating Sentiment on Twitter
Technical requirements
Sentiment analysis tools 
Stemming
Getting Twitter data
Creating a model
Classifying tweets in real time
Summary
Recommending Movies with Keras
Technical requirements
Demystifying recommendation systems
Neural recommendation systems
Using Keras for movie recommendations
Creating the dot model
Creating the dense model
Creating a stacking ensemble
Summary
Clustering World Happiness
Technical requirements
Understanding the World Happiness Report
Creating the ensemble
Gaining insights
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Ensembling is a technique for combining two or more similar or dissimilar machine learning algorithms to create a model that delivers superior predictive power. This book will demonstrate how you can use a variety of weak algorithms to make a strong predictive model. With its hands-on approach, you'll not only get up to speed on the basic theory, but also the application of various ensemble learning techniques. Using examples and real-world datasets, you'll be able to produce better machine learning models to solve supervised learning problems such as classification and regression. Later in the book, you'll go on to leverage ensemble learning techniques such as clustering to produce unsupervised machine learning models. As you progress, the chapters will cover different machine learning algorithms that are widely used in the practical world to make predictions and classifications. You'll even get to grips with using Python libraries such as scikit-learn and Keras to implement different ensemble models. By the end of this book, you will be well versed in ensemble learning and have the skills you need to understand which ensemble method is required for which problem, in order to successfully implement them in real-world scenarios.
This book is for data analysts, data scientists, machine learning engineers, and other professionals who are looking to generate advanced models using ensemble techniques.
Chapter 1, A Machine Learning Refresher, presents an overview of machine learning, including basic concepts such as training/test sets, performance measures, supervised and unsupervised learning, machine learning algorithms, and benchmark datasets.
Chapter 2, Getting Started with Ensemble Learning, introduces the concept of ensemble learning, highlighting the problems that it solves as well as the problems that it poses.
Chapter 3, Voting, introduces the most simple ensemble learning technique, voting, while explaining the difference between hard and soft voting. You will learn how to implement a custom classifier, as well as use scikit-learn's implementation of hard/soft voting.
Chapter 4, Stacking, covers meta-learning (stacking), a more advanced ensemble learning method. After reading this chapter, you will be able to implement a stacking classifier in Python and use it with scikit-learn classifiers.
Chapter 5, Bagging, introduces bootstrap resampling and the first generative ensemble learning technique, bagging. Furthermore, this chapter guides you through the process of implementing the technique in Python, as well as how to use the scikit-learn implementation.
Chapter 6, Boosting, touches on more advanced subjects in ensemble learning. This chapter explains how popular boosting algorithms work and are implemented. Furthermore, it presents XGBoost, a highly successful distributed boosting library.
Chapter 7, Random Forests, goes through the process of creating random decision trees by subsampling the instances and features of a dataset. Moreover, this chapter explains how to utilize an ensemble of random trees to create a random forest. Finally, this chapter presents scikit-learn's implementations and how to use them.
Chapter 8, Clustering, introduces the possibility of using ensembles for unsupervised learning tasks, such as clustering. Furthermore, the OpenEnsembles Python library is introduced, along with guidance on using it.
Chapter 9, Classifying Fraudulent Transactions, presents an application for the classification of a real-world dataset, using ensemble learning techniques presented in earlier chapters. The dataset concerns fraudulent credit card transactions.
Chapter 10, Predicting Bitcoin Prices, presents an application for the regression of a real-world dataset, using ensemble learning techniques presented in earlier chapters. The dataset concerns the price of the popular cryptocurrency Bitcoin.
Chapter 11, Evaluating Sentiment on Twitter, presents an application for evaluating the sentiment of various tweets using a real-world dataset.
Chapter 12, Recommending Movies with Keras, presents the process of creating a recommender system using ensembles of neural networks.
Chapter 13, Clustering World Happiness, presents the process of using an ensemble learning approach to cluster data from the World Happiness Report 2018.
This book is aimed at analysts, data scientists, engineers, and other professionals who have an interest in generating advanced models that describe and generalize datasets of interest to them. It is assumed that the reader has basic experience of programming in Python and is familiar with elementary machine learning models. Furthermore, a basic understanding of statistics is assumed, although key points and more advanced concepts are briefly presented. Familiarity with Python's scikit-learn module would be greatly beneficial, although it is not strictly required. A standard Python installation is required. Anaconda Distribution (https://www.anaconda.com/distribution/) greatly simplifies the task of installing and managing the various Python packages, although it is not necessary. Finally, a good Integrated Development Environment (IDE) is extremely useful for managing your code and debugging. In our examples, we usually utilize the Spyder IDE, which can be easily installed through Anaconda.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest versions of the following:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for macOS
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Ensemble-Learning-with-Python. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789612851_ColorImages.pdf.
Visit the following link to check out videos of the code being run: http://bit.ly/2GfnRrv.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."
A block of code is set as follows:
# --- SECTION 6 ---
# Accuracy of hard voting
print('-'*30)
print('Hard Voting:', accuracy_score(y_test, hard_predictions))
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Thus, the preferred approach is to utilize K-fold cross validation."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
This section is a refresher on basic machine learning concepts and an introduction to ensemble learning. We will have an overview of machine learning and various concepts pertaining to it, such as train and test sets, supervised and unsupervised learning, and more. We will also learn about the concept of ensemble learning.
This section comprises the following chapters:
Chapter 1, A Machine Learning Refresher
Chapter 2, Getting Started with Ensemble Learning
Machine learning is a subfield of artificial intelligence (AI) focused on developing algorithms and techniques that enable computers to learn from massive amounts of data. Given the increasing rate at which data is produced, machine learning has played a critical role in solving difficult problems in recent years. This success has been the main driving force behind the funding and development of many great machine learning libraries that make use of data in order to build predictive models. Furthermore, businesses have started to realize the potential of machine learning, and the demand for data scientists and machine learning engineers who can design better-performing predictive models has reached new heights.
This chapter serves as a refresher on the main concepts and terminology, as well as an introduction to the frameworks that will be used throughout the book, in order to approach ensemble learning with a solid foundation.
The main topics covered in this chapter are the following:
The various machine learning problems and datasets
How to evaluate the performance of a predictive model
Machine learning algorithms
Python environment setup and the required libraries
You will require basic knowledge of machine learning techniques and algorithms. Furthermore, knowledge of Python conventions and syntax is required. Finally, familiarity with the NumPy library will greatly help the reader to understand some of the custom algorithm implementations.
The code files of this chapter can be found on GitHub:
https://github.com/PacktPublishing/Hands-On-Ensemble-Learning-with-Python/tree/master/Chapter01
Check out the following video to see the Code in Action: http://bit.ly/30u8sv8.
Data is the raw ingredient of machine learning. Processing data can produce information; for example, measuring the height of a portion of a school's students (data) and calculating their average (processing) can give us an idea of the whole school's height (information). If we process the data further, for example, by grouping males and females and calculating two averages (one for each group), we will gain more information, as we will have an idea about the average height of the school's males and females. Machine learning strives to produce the most information possible from any given data. In this example, we produced a very basic predictive model. By calculating the two averages, we can predict the average height of any student just by knowing whether the student is male or female.
The set of data that a machine learning algorithm is tasked with processing is called the problem's dataset. In our example, the dataset consists of height measurements (in centimeters) and each student's sex (male/female). In machine learning, input variables are called features and output variables are called targets. In this dataset, the features of our predictive model consist solely of the students' sex, while our target is the students' height in centimeters. The predictive model that is produced and maps features to targets will be referred to as simply the model from now on, unless otherwise specified. Each data point is called an instance. In this problem, each student is an instance of the dataset.
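A rough sketch of this toy model in Python could look as follows (the heights and group sizes here are made up purely for illustration):

import numpy as np

# Toy data: one instance per student; the feature is the student's sex,
# the target is their height in centimeters (values made up for illustration).
heights = np.array([160, 172, 168, 155, 180, 163])
sexes = np.array(['F', 'M', 'M', 'F', 'M', 'F'])

# "Training": compute one average height per group.
model = {sex: heights[sexes == sex].mean() for sex in np.unique(sexes)}

# "Prediction": for a new student, look up the group average.
print(model['F'], model['M'])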
When the target is a continuous variable (a number), it presents a regression problem, as the aim is to regress the target on the features. When the target is a set of categories, it presents a classification problem, as we try to assign each instance to a category or class.
Note that, in classification problems, the target class can be represented by a number; this does not mean that it is a regression problem. The most useful way to determine whether it is a regression problem is to think about whether the instances can be ordered by their targets. In our example, the target is height, so we can order the students from tallest to shortest, as 100 cm is less than 110 cm. As a counterexample, if the target was their favorite color, we could represent each color by a number, but we could not order them. Even if we represented red as one and blue as two, we could not say that red is "before" or "less than" blue. Thus, this counterexample is a classification problem.
Machine learning relies on data in order to produce high-performing models. Without data, it's not even possible to create models. In this section, we'll present some popular machine learning datasets, which we will utilize throughout this book.
The diabetes dataset concerns 442 individual diabetes patients and the progression of the disease one year after a baseline measurement. The dataset consists of 10 features, which are the patient's age, sex, body mass index (bmi), average blood pressure (bp), and six measurements of their blood serum. The dataset target is the progression of the disease one year after the baseline measurement. This is a regression dataset, as the target is a number.
In this book, the dataset features are mean-centered and scaled such that the dataset sum of squares for each feature equals one. The following table depicts a sample of the diabetes dataset:
age   | sex   | bmi   | bp    | s1    | s2    | s3    | s4    | s5    | s6    | target
0.04  | 0.05  | 0.06  | 0.02  | -0.04 | -0.03 | -0.04 | 0.00  | 0.02  | -0.02 | 151
0.00  | -0.04 | -0.05 | -0.03 | -0.01 | -0.02 | 0.07  | -0.04 | -0.07 | -0.09 | 75
0.09  | 0.05  | 0.04  | -0.01 | -0.05 | -0.03 | -0.03 | 0.00  | 0.00  | -0.03 | 141
-0.09 | -0.04 | -0.01 | -0.04 | 0.01  | 0.02  | -0.04 | 0.03  | 0.02  | -0.01 | 206
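A sample like this can be pulled directly from scikit-learn's bundled copy of the dataset; a minimal sketch:

from sklearn.datasets import load_diabetes

diabetes = load_diabetes()          # features come mean-centered and scaled
X, y = diabetes.data, diabetes.target
print(X.shape)                      # (442, 10)
print(diabetes.feature_names)       # ['age', 'sex', 'bmi', 'bp', 's1', ..., 's6']
print(y[:4])                        # [151.  75. 141. 206.]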
The breast cancer dataset concerns 569 biopsies of malignant and benign tumors. The dataset provides 30 features extracted from images of fine-needle aspiration biopsies that describe cell nuclei. The images provide information about the shape, size, and texture of each cell nucleus. Furthermore, for each characteristic, three distinct values are provided: the mean, the standard error, and the worst (largest) value. This ensures that, for each image, the cell population is adequately described.
The dataset target concerns the diagnosis, that is, whether a tumor is malignant or benign. Thus, this is a classification dataset. The available features are listed as follows:
Mean radius
Mean texture
Mean perimeter
Mean area
Mean smoothness
Mean compactness
Mean concavity
Mean concave points
Mean symmetry
Mean fractal dimension
Radius error
Texture error
Perimeter error
Area error
Smoothness error
Compactness error
Concavity error
Concave points error
Symmetry error
Fractal dimension error
Worst radius
Worst texture
Worst perimeter
Worst area
Worst smoothness
Worst compactness
Worst concavity
Worst concave points
Worst symmetry
Worst fractal dimension
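These features, along with the malignant/benign target, can be inspected through scikit-learn's loader; a minimal sketch:

from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()
print(bc.data.shape)          # (569, 30)
print(bc.target_names)        # ['malignant' 'benign']
print(bc.feature_names[:4])   # the first few of the 30 features listed above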
The handwritten digit dataset used here is a small-scale version of the famous MNIST image recognition problem. It consists of square, 8 x 8 pixel images, each containing a single handwritten digit. Thus, the features of each instance form an 8 x 8 matrix containing each pixel's grayscale intensity. The target consists of 10 classes, one for each digit from 0 to 9. This is a classification dataset. The following figure is a sample from the handwritten digit dataset:
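A figure of this kind can be reproduced with scikit-learn's bundled digits dataset; the sketch below assumes matplotlib is installed:

from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

digits = load_digits()
print(digits.images.shape)              # (1797, 8, 8): one 8 x 8 grayscale matrix per image

# Plot a few sample digits with their labels.
fig, axes = plt.subplots(1, 5, figsize=(8, 2))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.imshow(image, cmap='gray_r')
    ax.set_title(label)
    ax.axis('off')
plt.show()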
Machine learning can be divided into many subcategories; two broad categories are supervised and unsupervised learning. These categories contain some of the most popular and widely used machine learning methods. In this section, we present them, as well as some toy example uses of supervised and unsupervised learning.
In examples such as those in the previous section, the data consisted of some features and a target, regardless of whether the target was quantitative (regression) or categorical (classification). Under these circumstances, we call the dataset a labeled dataset. When we try to produce a model from a labeled dataset in order to make predictions about unseen or future data (for example, to diagnose a new tumor case), we make use of supervised learning. In simple cases, supervised learning models can be visualized as a line. This line's purpose is to either separate the data based on the target (in classification) or to closely follow the data (in regression).
The following figure illustrates a simple regression example. Here, y is the target and x is the dataset feature. Our model consists of the simple equation y=2x-5. As is evident, the line closely follows the data. In order to estimate the y value of a new unseen point, we calculate its value using the preceding formula. The following figure shows a simple regression with y=2x-5 as the predictive model:
In the following figure, a simple classification problem is depicted. Here, the dataset features are x and y, while the target is the instance's color. Again, the dotted line is y=2x-5, but this time we test whether a point lies above or below the line. If the point's y value is lower than the line predicts, we expect it to be orange; if it is higher, we expect it to be blue. The following figure is a simple classification with y=2x-5 as the boundary:
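Both figures use the same line, so the two uses boil down to a few lines of code; a toy sketch, with the colors following the description above:

# The model/boundary from the figures: y = 2x - 5.
def model(x):
    return 2 * x - 5

# Regression: estimate the target for a new, unseen x.
print(model(3.0))            # 1.0

# Classification: label a point by whether it lies above or below the line.
def classify(x, y):
    return 'blue' if y > model(x) else 'orange'

print(classify(3.0, 4.0))    # above the line, so 'blue'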
In both regression and classification, we have a clear understanding of how the data is structured or how it behaves. Our goal is to simply model that structure or behavior. In some cases, we do not know how the data is structured. In those cases, we can utilize unsupervised learning in order to discover the structure, and thus information, within the data. The simplest form of unsupervised learning is clustering. As the name implies, clustering techniques attempt to group (or cluster) data instances. Thus, instances that belong to the same cluster share many similarities in their features, while being dissimilar to instances that belong to separate clusters. A simple example with three clusters is depicted in the following figure. Here, the dataset features are x and y, while there is no target.
The clustering algorithm discovered three distinct groups, centered around the points (0, 0), (1, 1), and (2, 2):
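A clustering of this kind can be reproduced with scikit-learn's KMeans; a minimal sketch, where the spread of each group is an assumption made for illustration:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
centers = np.array([[0, 0], [1, 1], [2, 2]])
# Three groups of points scattered around (0, 0), (1, 1), and (2, 2).
X = np.vstack([c + 0.1 * rng.randn(50, 2) for c in centers])

labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
print(labels[:10])   # cluster index assigned to the first ten instances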
Another form of unsupervised learning is dimensionality reduction. The number of features present in a dataset equals the dataset's dimensions. Often, many features can be correlated, noisy, or simply not provide much information. Nonetheless, the cost of storing and processing data is correlated with a dataset's dimensionality. Thus, by reducing the dataset's dimensions, we can help the algorithms to better model the data.
Another use of dimensionality reduction is for the visualization of high-dimensional datasets. For example, using the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm, we can reduce the breast cancer dataset to two dimensions or components. Although it is not easy to visualize 30 dimensions, it is quite easy to visualize two.
Furthermore, we can visually test whether the information contained within the dataset can be utilized to separate the dataset's classes or not. The next figure depicts the two components on the y and x axis, while the color represents the instance's class. Although we cannot plot all of the dimensions, by plotting the two components, we can conclude that a degree of separability between the classes exists:
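The projection itself can be computed with scikit-learn's t-SNE implementation; a minimal sketch (the plotting choices are assumptions, and matplotlib is assumed to be installed):

from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

bc = load_breast_cancer()
# Reduce the 30 features to 2 components for visualization.
embedded = TSNE(n_components=2, random_state=0).fit_transform(bc.data)

plt.scatter(embedded[:, 0], embedded[:, 1], c=bc.target, s=8)
plt.xlabel('First component')
plt.ylabel('Second component')
plt.show()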
Machine learning is a highly quantitative field. Although we can gauge the performance of a model by plotting how it separates classes and how closely it follows data, more quantitative performance measures are needed in order to evaluate models. In this section, we present cost functions and metrics. Both of them are used in order to assess a model's performance.
A machine learning model's objective is to model our dataset. In order to assess each model's performance, we define an objective function. These functions usually express a cost, or how far from perfect a model is. These cost functions usually utilize a loss function to assess how well the model performed on each individual dataset instance.
Some of the most widely used cost functions are described in the following sections, assuming that the dataset has n instances, the target's true value for instance i is t_i, and the model's output is y_i.
Mean absolute error (MAE) or L1 loss is the mean absolute distance between the target's real values and the model's outputs. It is calculated as follows:
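In LaTeX notation, using the n, t_i, and y_i defined above, the standard form is:

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| t_i - y_i \right|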
Mean squared error (MSE) or L2 loss is the mean squared distance between the target's real values and the model's output. It is calculated as follows:
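In the same notation:

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left( t_i - y_i \right)^2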
Cross entropy loss is used in models that output probabilities between 0 and 1, usually to express the probability that an instance is a member of a specific class. As the output probability diverges from the actual label, the loss increases. For a simple case where the dataset consists of two classes, it is calculated as follows:
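In its common two-class form, where t_i is the true label (0 or 1) and y_i is the predicted probability of the positive class:

\mathrm{CE} = -\frac{1}{n}\sum_{i=1}^{n}\left[ t_i \log(y_i) + (1 - t_i)\log(1 - y_i) \right]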
Cost functions are useful when we try to numerically optimize our models. But as humans, we need metrics that are useful and intuitive to understand and report. As such, there are a number of metrics available that give insight into a model's performance. The most common metrics are presented in the following sections.
The simplest and easiest to grasp of all, classification accuracy refers to the percentage of correct predictions. In order to calculate accuracy, we divide the number of correct predictions by the total number of instances:
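In symbols, with n instances in total:

\mathrm{Accuracy} = \frac{\text{number of correct predictions}}{n}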
