Modern Scala Projects

Ilango Gurusamy

Develop robust, Scala-powered projects with the help of machine learning libraries such as SparkML to harvest meaningful insight




Key Features



  • Gain hands-on experience in building data science projects with Scala


  • Exploit powerful functionalities of machine learning libraries


  • Use machine learning algorithms and decision tree models for enterprise apps





Book Description



Scala, together with the Spark Framework, forms a rich and powerful data processing ecosystem. Modern Scala Projects is a journey into the depths of this ecosystem. The machine learning (ML) projects presented in this book enable you to create practical, robust data analytics solutions, with an emphasis on automating data workflows with the Spark ML pipeline API. This book showcases, or carefully cherry-picks from, Scala's functional libraries and other constructs to help readers roll out their own scalable data processing frameworks. The projects in this book enable data practitioners across all industries to gain insights into data that will help organizations obtain a strategic and competitive advantage.






Modern Scala Projects focuses on the application of supervised learning ML techniques that classify data and make predictions. You'll begin by working on a project to predict the class of a flower by implementing a simple machine learning model. Next, you'll create a cancer diagnosis classification pipeline, followed by projects delving into stock price prediction, spam filtering, fraud detection, and a recommendation engine.






By the end of this book, you will be able to build efficient data science projects that fulfill your software requirements.





What you will learn



  • Create pipelines to extract data or analytics and visualizations


  • Automate your process pipeline with jobs that are reproducible


  • Extract intelligent data efficiently from large, disparate datasets


  • Automate the extraction, transformation, and loading of data


  • Develop tools that collate, model, and analyze data


  • Maintain the integrity of data as data flows become more complex


  • Develop tools that predict outcomes based on “pattern discovery”


  • Build fast and accurate machine learning models in Scala





Who this book is for



Modern Scala Projects is for Scala developers who would like to gain hands-on experience with interesting real-world projects. Prior programming experience with Scala is necessary.




Modern Scala Projects

Leverage the power of Scala for building data-driven and high-performance projects

Ilango Gurusamy

BIRMINGHAM - MUMBAI

Modern Scala Projects

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Richa Tripathi
Acquisition Editor: Sandeep Mishra
Content Development Editor: Priyanka Sawant
Technical Editor: Gaurav Gala
Copy Editor: Safis Editing
Project Coordinator: Vaidehi Sawant
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Jason Monteiro
Production Coordinator: Aparna Bhagat

First published: July 2018

Production reference: 1280718

Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78862-411-4

www.packtpub.com

 
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Ilango Gurusamy holds an MS degree in computer science from California State University. He has led Java projects at Northrop Grumman, AT&T, and other companies, before moving into Scala and functional programming. His current interests are IoT, navigational applications, and all things Scala-related. A strategic thinker, speaker, and writer, he also loves yoga, skydiving, cars, dogs, and fishing. You can learn more about his achievements on his blog, scalanirvana. His LinkedIn username is ilangogurusamy.

About the reviewer

Adithya Selvaprithiviraj is a Scala developer in the Innovation Centre Network at SAP Labs. Currently, he is involved in the development of a modern, typesafe framework to ease enterprise application development in the SAP landscape. Previously, Adithya was part of several machine learning projects. You can find out more about his achievements on his blog, adithyaselv.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Modern Scala Projects

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the author

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Predict the Class of a Flower from the Iris Dataset

A multivariate classification problem

Understanding multivariate

Different kinds of variables

Categorical variables 

Fisher's Iris dataset

The Iris dataset represents a multiclass, multidimensional classification task

The training dataset

The mapping function

An algorithm and its mapping function 

Supervised learning – how it relates to the Iris classification task

Random Forest classification algorithm

Project overview – problem formulation

Getting started with Spark

Setting up prerequisite software

Installing Spark in standalone deploy mode

Developing a simple interactive data analysis utility

Reading a data file and deriving DataFrame out of it

Implementing the Iris pipeline 

Iris pipeline implementation objectives

Step 1 – getting the Iris dataset from the UCI Machine Learning Repository

Step 2 – preliminary EDA

Firing up Spark shell

Loading the iris.csv file and building a DataFrame

Calculating statistics

Inspecting your SparkConf again

Calculating statistics again

Step 3 – creating an SBT project

Step 4 – creating Scala files in SBT project

Step 5 – preprocessing, data transformation, and DataFrame creation

DataFrame Creation

Step 6 – creating, training, and testing data

Step 7 – creating a Random Forest classifier

Step 8 – training the Random Forest classifier

Step 9 – applying the Random Forest classifier to test data

Step 10 – evaluating the Random Forest classifier

Step 11 – running the pipeline as an SBT application

Step 12 – packaging the application

Step 13 – submitting the pipeline application to Spark local

Summary

Questions

Build a Breast Cancer Prognosis Pipeline with the Power of Spark and Scala

Breast cancer classification problem

Breast cancer dataset at a glance

Logistic regression algorithm

Salient characteristics of LR

Binary logistic regression assumptions

A fictitious dataset and LR

LR as opposed to linear regression

Formulation of a linear regression classification model

Logit function as a mathematical equation

LR function

Getting started

Setting up prerequisite software

Implementation objectives

Implementation objective 1 – getting the breast cancer dataset

Implementation objective 2 – deriving a dataframe for EDA

Step 1 – conducting preliminary EDA 

Step 2 – loading data and converting it to an RDD[String]

Step 3 – splitting the resilient distributed dataset and reorganizing individual rows into an array

Step 4 – purging the dataset of rows containing question mark characters

Step 5 – running a count after purging the dataset of rows with question mark characters

Step 6 – getting rid of header

Step 7 – creating a two-column DataFrame

Step 8 – creating the final DataFrame

Random Forest breast cancer pipeline

Step 1 – creating an RDD and preprocessing the data

Step 2 – creating training and test data

Step 3 – training the Random Forest classifier

Step 4 – applying the classifier to the test data

Step 5 – evaluating the classifier

Step 6 – running the pipeline as an SBT application

Step 7 – packaging the application

Step 8 – deploying the pipeline app into Spark local

LR breast cancer pipeline

Implementation objectives

Implementation objectives 1 and 2

Implementation objective 3 – Spark ML workflow for the breast cancer classification task

Implementation objective 4 – coding steps for building the indexer and logit machine learning model

Extending our pipeline object with the WisconsinWrapper trait

Importing the StringIndexer algorithm and using it

Splitting the DataFrame into training and test datasets

Creating a LogisticRegression classifier and setting hyperparameters on it

Running the LR model on the test dataset

Building a breast cancer pipeline with two stages

Implementation objective 5 – evaluating the binary classifier's performance

Summary

Questions

Stock Price Predictions

Stock price binary classification problem

Stock price prediction dataset at a glance

Getting started

Support for hardware virtualization

Installing the supported virtualization application 

Downloading the HDP Sandbox and importing it

Hortonworks Sandbox virtual appliance overview

Turning on the virtual machine and powering up the Sandbox

Setting up SSH access for data transfer between Sandbox and the host machine

Setting up PuTTY, a third-party SSH and Telnet client

Setting up WinSCP, an SFTP client for Windows

Updating the default Python required by Zeppelin

What is Zeppelin?

Updating our Zeppelin instance

Launching the Ambari Dashboard and Zeppelin UI

Updating Zeppelin Notebook configuration by adding or updating interpreters

Updating a Spark 2 interpreter

Implementation objectives

List of implementation goals

Step 1 – creating a Scala representation of the path to the dataset file

Step 2 – creating an RDD[String]

Step 3 – splitting the RDD around the newline character in the dataset

Step 4 – transforming the RDD[String] 

Step 5 – carrying out preliminary data analysis

Creating DataFrame from the original dataset

Dropping the Date and Label columns from the DataFrame

Having Spark describe the DataFrame

Adding a new column to the DataFrame and deriving Vector out of it

Removing stop words – a preprocessing step 

Transforming the merged DataFrame

Transforming a DataFrame into an array of NGrams

Adding a new column to the DataFrame, devoid of stop words

Constructing a vocabulary from our dataset corpus

Training CountVectorizer

Using StringIndexer to transform our input label column

Dropping the input label column

Adding a new column to our DataFrame 

Dividing the DataSet into training and test sets

Creating labelIndexer to index the indexedLabel column

Creating StringIndexer to index a column label

Creating RandomForestClassifier

Creating a new data pipeline with three stages

Creating a new data pipeline with hyperparameters

Training our new data pipeline

Generating stock price predictions

Summary

Questions

Building a Spam Classification Pipeline

Spam classification problem

Relevant background topics 

Multidimensional data

Features and their importance

Classification task

Classification outcomes

Two possible classification outcomes

Project overview – problem formulation

Getting started

Setting up prerequisite software

Spam classification pipeline 

Implementation steps

Step 1 – setting up your project folder

Step 2 – upgrading your build.sbt file

Step 3 – creating a trait called SpamWrapper

Step 4 – describing the dataset

Description of the SpamHam dataset

Step 5 – creating a new spam classifier class

Step 6 – listing the data preprocessing steps

Step 7 – regex to remove punctuation marks and whitespaces

Step 8 – creating a ham dataframe with punctuation removed

Creating a labeled ham dataframe

Step 9 – creating a spam dataframe devoid of punctuation

Step 10 – joining the spam and ham datasets

Step 11 – tokenizing our features

Step 12 – removing stop words

Step 13 – feature extraction

Step 14 – creating training and test datasets

Summary

Questions

Further reading

Build a Fraud Detection System

Fraud detection problem

Fraud detection dataset at a glance

Precision, recall, and the F1 score

Feature selection

The Gaussian Distribution function

Where does Spark fit in all this?

Fraud detection approach

Project overview – problem formulation

Getting started

Setting up Hortonworks Sandbox in the cloud

Creating your Azure free account, and signing in

The Azure Marketplace

The HDP Sandbox home page

Implementation objectives

Implementation steps

Create the FraudDetection trait

Broadcasting mean and standard deviation vectors

Calculating PDFs

F1 score

Calculating the best error term and best F1 score

Maximum and minimum values of a probability density

Step size for best error term calculation

A loop to generate the best F1 and the best error term

Generating predictions – outliers that represent fraud

Generating the best error term and best F1 measure

Preparing to compute precision and recall

A recap of how we looped through a range of Epsilons, the best error term, and the best F1 measure

Function to calculate false positives

Summary

Questions

Further reading

Build Flights Performance Prediction Model

Overview of flight delay prediction

The flight dataset at a glance

Problem formulation of flight delay prediction

Getting started

Setting up prerequisite software

Increasing Java memory

Reviewing the JDK version

MongoDB installation

Implementation and deployment

Implementation objectives

Creating a new Scala project

Building the AirlineWrapper Scala trait

Summary

Questions

Further reading

Building a Recommendation Engine

Problem overviews

Recommendations on Amazon

Brief overview

Detailed overview

On-site recommendations

Recommendation systems

Definition

Categorizing recommendations

Implicit recommendations

Explicit recommendations

Recommendations for machine learning

Collaborative filtering algorithms

Recommendations problem formulation

Understanding datasets

Detailed overview

Recommendations regarding problem formulation

Defining explicit feedback

Building a narrative

Sales leads and past sales

Weapon sales leads and past sales data

Implementation and deployment 

Implementation

Step 1 – creating the Scala project

Step 2 – creating the AirlineWrapper definition

Step 3 – creating a weapon sales orders schema

Step 4 – creating a weapon sales leads schema

Step 5 – building a weapon sales order dataframe

Step 6 – displaying the weapons sales dataframe

Step 7 – displaying the customer-weapons-system dataframe

Step 8 – generating predictions

Step 9 – displaying predictions

Compilation and deployment

Compiling the project

What is an assembly.sbt file?

Creating assembly.sbt

Contents of assembly.sbt

Running the sbt assembly task

Upgrading the build.sbt file

Rerunning the assembly command

Deploying the recommendation application

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

Scala, along with the Spark Framework, forms a rich and powerful data processing ecosystem. This book is a journey into the depths of this ecosystem. The machine learning (ML) projects presented in this book enable you to create practical, robust data analytics solutions, with an emphasis on automating data workflows with the Spark ML pipeline API. This book showcases, or carefully cherry-picks from, Scala's functional libraries and other constructs to help readers roll out their own scalable data processing frameworks. The projects in this book enable data practitioners across all industries to gain insights into data that will help organizations to obtain a strategic and competitive advantage. Modern Scala Projects focuses on the application of supervised learning ML techniques that classify data and make predictions. You'll begin by working on a project to predict the class of a flower by implementing a simple machine learning model. Next, you'll create a cancer diagnosis classification pipeline, followed by projects delving into stock price prediction, spam filtering, fraud detection, and a recommendation engine.

By the end of this book, you will be able to build efficient data science projects that fulfill your software requirements.

Who this book is for

This book is for Scala developers who would like to gain hands-on experience with interesting real-world projects. Prior programming experience with Scala is necessary.

What this book covers

Chapter 1, Predict the Class of a Flower from the Iris Dataset, focuses on building a machine learning model leveraging a time-tested statistical method based on regression. The chapter draws the reader into data processing, all the way to training and testing a relatively simple machine learning model.

Chapter 2, Build a Breast Cancer Prognosis Pipeline with the Power of Spark and Scala, taps into a publicly available breast cancer dataset. It evaluates various feature selection algorithms, transforms data, and builds a classification model.

Chapter 3, Stock Price Predictions, acknowledges that stock price prediction can seem an impossible task. In this chapter, we take a new approach: we build and train a neural network model with training data to solve the apparently intractable problem of stock price prediction. A data pipeline, with Spark at its core, distributes training of the model across multiple machines in a cluster. A real-life dataset is fed into the pipeline. Training data goes through preprocessing and normalization steps before a model is trained to fit the data. We may also provide a means to visualize the results of our prediction and evaluate our model after training.

Chapter 4, Building a Spam Classification Pipeline, informs the reader that the overarching learning objective of this chapter is to implement a spam filtering data analysis pipeline. We will rely on the Spark ML library's machine learning APIs and its supporting libraries to build a spam classification pipeline.

Chapter 5, Build a Fraud Detection System, applies machine learning techniques and algorithms to build a practical ML pipeline that helps find questionable charges on consumers’ credit cards. The data is drawn from a publicly accessible Consumer Complaints Database. The chapter demonstrates the tools contained in Spark ML for building, evaluating, and tuning a pipeline. Feature extraction is one function served by Spark ML that is covered here.

Chapter 6, Build Flights Performance Prediction Model, shows how to leverage flight departure and arrival data to predict whether a given flight will be delayed or canceled. Here, we will build a decision tree-based model to derive useful predictors, such as which time of day is best for booking a seat on a flight with a minimum chance of delay.

Chapter 7, Building a Recommendation Engine, draws the reader into the implementation of a scalable recommendations engine. The collaborative-filtering approach is laid out as the reader walks through a phased recommendations-generating process based on users’ past preferences.

To get the most out of this book

Prior knowledge of Scala is assumed. Familiarity with basic Spark ML concepts will be an added advantage.

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packtpub.com.

2. Select the SUPPORT tab.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Modern-Scala-Projects. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/ModernScalaProjects_ColorImages.pdf

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

Predict the Class of a Flower from the Iris Dataset

This chapter kicks off a machine learning (ML) initiative in Scala and Spark. Spark's machine learning library, MLlib, whose DataFrame-based API lives under the spark.ml package, will help us develop scalable data analysis applications. This DataFrame-based API, also known as Spark ML, provides powerful learning algorithms and pipeline-building tools for data analysis. Starting with this chapter, we will leverage MLlib's classification algorithms.

The Spark ecosystem, which also boasts APIs for R, Python, and Java in addition to Scala, empowers readers, whether beginners or seasoned data professionals, to make sense of and extract analytics from various datasets.

Speaking of datasets, the Iris dataset underpins the simplest, yet most famous, data analysis task in the ML space. This chapter builds a solution to the classification task that the Iris dataset represents.

Here is the dataset we will refer to:

UCI Machine Learning Repository: Iris Data Set (accessed July 13, 2018): https://archive.ics.uci.edu/ml/datasets/Iris
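To make this concrete before we begin, here is a minimal sketch of loading the dataset into a Spark DataFrame. It assumes the file from the URL above has been saved locally as data/iris.csv; the path and the column names are our own illustrative choices, not fixed by the dataset or by this chapter's final code:

import org.apache.spark.sql.SparkSession

object LoadIris extends App {
  // Spark local mode, as used throughout this chapter
  val spark = SparkSession.builder()
    .appName("IrisLoader")
    .master("local[*]")
    .getOrCreate()

  // The UCI file ships without a header row, so we name the columns ourselves
  val irisDf = spark.read
    .option("inferSchema", "true")
    .csv("data/iris.csv")
    .toDF("sepalLength", "sepalWidth", "petalLength", "petalWidth", "species")

  irisDf.show(5)       // peek at the first five observations
  irisDf.printSchema() // confirm the inferred column types

  spark.stop()
}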

The overarching learning objective of this chapter is to implement a Scala solution to the so-called multivariate classification task represented by the Iris dataset.

The following list is a section-wise breakdown of individual learning outcomes:

A multivariate classification problem

Project overview—problem formulation

Getting started with Spark

Implementing a multiclass classification pipeline

The following section offers the reader an in-depth perspective on the Iris dataset classification problem.

A multivariate classification problem

The most famous dataset in data science history is Sir Ronald Aylmer Fisher's classic Iris flower dataset, also known as Anderson's dataset. It was introduced in 1936 as a study in multivariate (or multiclass) classification. What, then, is multivariate?

Understanding multivariate

The term multivariate can bear two meanings:

As an adjective, multivariate means having or involving more than one variable.

As a noun, a multivariate may represent a mathematical vector whose individual elements are variates. Each individual element in this vector is a measurable quantity or variable.

Both meanings share a common denominator: the variable. Conducting a multivariate analysis of an experimental unit involves more than one measurable quantity or variable. A classic example of such an analysis is the Iris dataset, which has multiple variables per observation.

In this subsection, we understood multivariate in terms of variables. In the next subsection, we briefly touch upon different kinds of variables, one of them being categorical variables.

Different kinds of variables

In general, variables are of two types:

Quantitative variable: A variable representing a measurement that is quantified by a numeric value. Some examples of quantitative variables are:

A variable representing the age of a girl called Huan (Age_Huan). In September of 2017, this variable contained the value 24. One year later, that variable would be her current age with the number 1 (arithmetically) added to it.

A variable representing the number of planets in the solar system (Planet_Number). Currently, pending the discovery of any new planets, this variable contains the number 8. If scientists found a new celestial body tomorrow that they agreed qualifies as a planet, the Planet_Number variable's value would be bumped up from its current value of 8 to 9.

Categorical variable: A variable that cannot be assigned a numerical measure in the natural order of things, for example, the residency status of an individual in the United States, which could be one of the following values: citizen, permanent resident, or non-resident.

In the next subsection, we will describe categorical variables in some detail.

Categorical variables 

We will draw upon the definition of a categorical variable from the previous subsection. Categorical variables distinguish themselves from quantitative variables in a fundamental way. As opposed to a quantitative variable, which represents a measure of something in numerical terms, a categorical variable represents a grouping or category name, and can take one of a finite number of possible categories. For example, the species of an Iris flower is a categorical variable, and the value it takes could be one value from a finite set of categorical values: Iris-setosa, Iris-virginica, and Iris-versicolor.

It may be useful to draw on other examples of categorical variables; these are listed as follows:

The blood group of an individual, as in A+, A-, B+, B-, AB+, AB-, O+, or O-

The county that an individual is a resident of, given a finite list of counties in the state of Missouri

The political affiliation of a United States citizen, which could take categorical values such as Democrat, Republican, or Green Party

In global warming studies, the type of a forest, a categorical variable that could take one of three values: tropical, temperate, or taiga

The first item in the preceding list, the blood group of a person, is a categorical variable whose corresponding data (values) are categorized (classified) into eight groups (A, B, AB, or O with their positives or negatives). In a similar vein, the species of an Iris flower is a categorical variable whose data (values) are categorized (classified) into three species groups—Iris-setosa, Iris-versicolor, and Iris-virginica. 

That said, a common data analysis task in ML is to index, or encode, string representations of categorical values into numeric form (doubles, for example). Such indexing is a prelude to a prediction on the target, or label, which we shall talk more about shortly.
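As a hedged sketch of what such indexing looks like in Spark ML, consider the StringIndexer transformer below; irisDf stands for a DataFrame with a string-valued species column, as in the earlier loading sketch, and the column names are our own:

import org.apache.spark.ml.feature.StringIndexer

// Map each species string to a double (0.0, 1.0, or 2.0),
// ordered by how frequently each label occurs in the data
val indexer = new StringIndexer()
  .setInputCol("species")
  .setOutputCol("label")

val indexed = indexer.fit(irisDf).transform(irisDf)
indexed.select("species", "label").distinct().show()

The fitted indexer remembers the string-to-number mapping it learned, so the same transformation can later be applied consistently to new data.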

With respect to the Iris flower dataset, its species variable is subject to a classification (or categorization) task with the express purpose of making a prediction on the species of an Iris flower. At this point, we want to examine the Iris dataset: its rows, row characteristics, and much more, which is the focus of the upcoming topic.

Fisher's Iris dataset

The Iris flower dataset comprises a total of 150 rows, where each row represents one flower. Each row is also known as an observation. This 150-observation dataset is made up of three kinds of observations related to three different Iris flower species. The following table is an illustration:

Iris dataset observation breakup table

Referring to the preceding table, it is clear that three flower species are represented in the Iris dataset, each contributing 50 observations apiece. Each observation holds four measurements. One measurement corresponds to one flower feature, where each flower feature is one of the following:

Sepal Length

Sepal Width

Petal Length

Petal Width

 

The features listed earlier are illustrated in the following table for clarity:

Iris features
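As a quick aside in plain Scala, one might model a single observation as a case class holding the four quantitative features plus the categorical species label. The type and field names below are hypothetical, and the sample values are the first row of the dataset:

// One Iris observation: four quantitative features, one categorical label
case class IrisObservation(
  sepalLength: Double, // cm
  sepalWidth: Double,  // cm
  petalLength: Double, // cm
  petalWidth: Double,  // cm
  species: String      // Iris-setosa, Iris-versicolor, or Iris-virginica
)

// The first row of the dataset as a sample value
val sample = IrisObservation(5.1, 3.5, 1.4, 0.2, "Iris-setosa")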

Okay, so three flower species are represented in the Iris dataset. Speaking of species, we will henceforth replace the term species with the term classes whenever we need to stick to an ML terminology context. That means #1-Iris-setosa from earlier refers to Class #1, #2-Iris-virginica to Class #2, and #3-Iris-versicolor to Class #3.

We just listed three different Iris flower species that are represented in the Iris dataset. What do they look like? What do their features look like? These questions are answered in the following screenshot:

Representations of three species of Iris flower

 

That said, let's look at the sepal and petal portions of each class of Iris flower. The sepal (the larger part) and petal (the smaller part) dimensions are what relate, and distinguish, each class of Iris flower from the other two classes. In the next section, we will summarize our discussion and expand its scope to the Iris dataset as a multiclass, multidimensional classification task.

The Iris dataset represents a multiclass, multidimensional classification task

In this section, we will restate the facts about the Iris dataset and describe it in the context of an ML classification task:

The Iris dataset classification task is multiclass because the prediction for a new incoming Iris flower from the wild can belong to any of three classes. Indeed, this chapter is all about attempting a species classification (inferring the target class of a new Iris flower) using sepal and petal dimensions as feature parameters.

The Iris dataset classification task is multidimensional because there are four features.

There are 150 observations, where each observation comprises measurements on four features. These measurements are also known by the following terms:

Input attributes or instances

Predictor variables (X)

Input variables (X)

Classification of an Iris flower picked in the wild is carried out by a model (the computed mapping function) that is given four flower feature measurements.

The outcome of the Iris flower classification task is the identification of a (computed) predicted value for the response from the predictors, by a process of learning (or fitting) a discrete number of targets or category labels (Y). The outcome or predicted value may mean the same as the following:

Categorical response variable: in a later section, we shall see that an indexer algorithm transforms all categorical values to numbers

Response or outcome variable (Y)

So far, we have claimed that the outcome (Y) of our multiclass classification task is dependent on inputs (X). Where will these inputs come from? This is answered in the next section.

The training dataset

An integral aspect of our data analysis or classification task that we have not hitherto mentioned is the training dataset. A training dataset is our classification task's source of input data (X). We take advantage of this dataset to obtain a prediction on each target class, simply by deriving optimal parameters or boundary conditions. We have just redefined our classification process by adding in the extra detail of the training dataset. For a classification task, then, we have X on one side and Y on the other, with an inferred mapping function in the middle. That brings us to the mapping, or predictor, function, which is the focus of the next section.
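Before we get there, here is a minimal sketch of carving a training dataset and a held-out test dataset out of the indexed DataFrame from the earlier sketches; the 80/20 ratio and the seed are arbitrary, illustrative choices:

// An 80/20 split; fixing the seed makes the split reproducible
val Array(trainingData, testData) =
  indexed.randomSplit(Array(0.8, 0.2), seed = 1234L)

println(s"training rows: ${trainingData.count()}, test rows: ${testData.count()}")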

An algorithm and its mapping function 

This section starts with a schematic depicting the components of the mapping function and the algorithm that learns it, as shown in the following diagram:

An input to output mapping function and an algorithm learning the mapping function

The goal of our classification process is to let the algorithm derive the best possible approximation of a mapping function by a learning (or fitting) process. When we find an Iris flower out in the wild and want to classify it, we use its measurements as new input data that our algorithm's mapping function will accept in order to give us a predicted value (Y). In other words, given the feature measurements of an Iris flower (the new data), the mapping function produced by a supervised learning algorithm (in our case, a random forest) will classify the flower.
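To ground this, here is a hedged sketch of learning and applying such a mapping function with Spark ML's random forest. The column names and hyperparameters are assumptions carried over from the earlier sketches, not the chapter's final pipeline code:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.RandomForestClassifier

// Gather the four feature columns into the single vector column
// that Spark ML classifiers expect
val assembler = new VectorAssembler()
  .setInputCols(Array("sepalLength", "sepalWidth", "petalLength", "petalWidth"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")       // the indexed species column
  .setFeaturesCol("features")
  .setNumTrees(10)            // an arbitrary, illustrative choice

// Learn an approximation of the mapping function from the training data...
val model = rf.fit(assembler.transform(trainingData))

// ...then classify unseen flowers by their four measurements
val predictions = model.transform(assembler.transform(testData))
predictions.select("label", "prediction").show(5)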

Supervised learning algorithms can solve two kinds of ML problems. These are as follows:

Classification tasks

Regression tasks

In the following paragraphs, we will talk about the mapping function with an example, explain the role played by a supervised learning classification task in deducing the mapping function, and introduce the concept of a model.

Let's say we already knew that the (mapping) function f(x) for the Iris dataset classification task is exactly of the form x + 1; then there would be no need for us to find a new mapping function. But recall that a mapping function is one that maps the relationship between flower features, such as sepal length and sepal width, and the species the flower belongs to. Does x + 1 do that? No.

Therefore, there is no preexisting function, x + 1 or otherwise, that clearly maps the relationship between flower features and the flower's species. What we need is a model that captures the aforementioned relationship as closely as possible. Data and its classification seldom tend to be straightforward. A supervised learning classification task starts life with no knowledge of what the function f(x) is. A supervised learning classification process applies ML techniques and strategies in an iterative process of deduction to ultimately learn what f(x) is.

In our case, such an ML endeavor is a classification task, a task where the function or mapping function is referred to in statistical or ML terminology as a model.

In the next section, we will describe what supervised learning is and how it relates to the Iris dataset classification.