Develop robust, Scala-powered projects with the help of machine learning libraries such as SparkML to harvest meaningful insight
Key Features
Book Description
Scala, together with the Spark Framework, forms a rich and powerful data processing ecosystem. Modern Scala Projects is a journey into the depths of this ecosystem. The machine learning (ML) projects presented in this book enable you to create practical, robust data analytics solutions, with an emphasis on automating data workflows with the Spark ML pipeline API. This book showcases, and carefully cherry-picks from, Scala's functional libraries and other constructs to help readers roll out their own scalable data processing frameworks. The projects in this book enable data practitioners across all industries to gain insights into data that will help their organizations obtain a strategic and competitive advantage.
Modern Scala Projects focuses on the application of supervised learning ML techniques that classify data and make predictions. You'll begin by working on a project to predict the class of a flower by implementing a simple machine learning model. Next, you'll create a cancer diagnosis classification pipeline, followed by projects delving into stock price prediction, spam filtering, fraud detection, and a recommendation engine.
By the end of this book, you will be able to build efficient data science projects that fulfil your software requirements.
What you will learn
Who this book is for
Modern Scala Projects is for Scala developers who would like to gain hands-on experience with interesting real-world projects. Prior programming experience with Scala is necessary.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Richa Tripathi
Acquisition Editor: Sandeep Mishra
Content Development Editor: Priyanka Sawant
Technical Editor: Gaurav Gala
Copy Editor: Safis Editing
Project Coordinator: Vaidehi Sawant
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Jason Monteiro
Production Coordinator: Aparna Bhagat
First published: July 2018
Production reference: 1280718
Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.
ISBN 978-1-78862-411-4
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Ilango Gurusamy holds an MS degree in computer science from California State University. He has led Java projects at Northrop Grumman, AT&T, and other companies, and later moved into Scala and functional programming. His current interests are IoT, navigational applications, and all things Scala related. A strategic thinker, speaker, and writer, he also loves yoga, skydiving, cars, dogs, and fishing. You can find out more about his achievements on his blog, titled scalanirvana. His LinkedIn username is ilangogurusamy.
Adithya Selvaprithiviraj is a Scala developer in the Innovation Centre Network at SAP Labs. Currently, he is involved in the development of a modern typesafe framework to ease enterprise application development in the SAP landscape. Previously, Adithya was part of several machine learning projects. You can find out more about his achievements in his blog, titled adithyaselv.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Modern Scala Projects
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the author
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Predict the Class of a Flower from the Iris Dataset
A multivariate classification problem
Understanding multivariate
Different kinds of variables
Categorical variables 
Fisher's Iris dataset
The Iris dataset represents a multiclass, multidimensional classification task
The training dataset
The mapping function
An algorithm and its mapping function 
Supervised learning – how it relates to the Iris classification task
Random Forest classification algorithm
Project overview – problem formulation
Getting started with Spark
Setting up prerequisite software
Installing Spark in standalone deploy mode
Developing a simple interactive data analysis utility
Reading a data file and deriving DataFrame out of it
Implementing the Iris pipeline 
Iris pipeline implementation objectives
Step 1 – getting the Iris dataset from the UCI Machine Learning Repository
Step 2 – preliminary EDA
Firing up Spark shell
Loading the iris.csv file and building a DataFrame
Calculating statistics
Inspecting your SparkConf again
Calculating statistics again
Step 3 – creating an SBT project
Step 4 – creating Scala files in SBT project
Step 5 – preprocessing, data transformation, and DataFrame creation
DataFrame Creation
Step 6 – creating, training, and testing data
Step 7 – creating a Random Forest classifier
Step 8 – training the Random Forest classifier
Step 9 – applying the Random Forest classifier to test data
Step 10 – evaluate Random Forest classifier 
Step 11 – running the pipeline as an SBT application
Step 12 – packaging the application
Step 13 – submitting the pipeline application to Spark local
Summary
Questions
Build a Breast Cancer Prognosis Pipeline with the Power of Spark and Scala
Breast cancer classification problem
Breast cancer dataset at a glance
Logistic regression algorithm
Salient characteristics of LR
Binary logistic regression assumptions
A fictitious dataset and LR
LR as opposed to linear regression
Formulation of a linear regression classification model
Logit function as a mathematical equation
LR function
Getting started
Setting up prerequisite software
Implementation objectives
Implementation objective 1 – getting the breast cancer dataset
Implementation objective 2 – deriving a dataframe for EDA
Step 1 – conducting preliminary EDA 
Step 2 – loading data and converting it to an RDD[String]
Step 3 – splitting the resilient distributed dataset and reorganizing individual rows into an array
Step 4 – purging the dataset of rows containing question mark characters
Step 5 – running a count after purging the dataset of rows with questionable characters
Step 6 – getting rid of header
Step 7 – creating a two-column DataFrame
Step 8 – creating the final DataFrame
Random Forest breast cancer pipeline
Step 1 – creating an RDD and preprocessing the data
Step 2 – creating training and test data
Step 3 – training the Random Forest classifier
Step 4 – applying the classifier to the test data
Step 5 – evaluating the classifier
Step 6 – running the pipeline as an SBT application
Step 7 – packaging the application
Step 8 – deploying the pipeline app into Spark local
LR breast cancer pipeline
Implementation objectives
Implementation objectives 1 and 2
Implementation objective 3 – Spark ML workflow for the breast cancer classification task
Implementation objective 4 – coding steps for building the indexer and logit machine learning model
Extending our pipeline object with the WisconsinWrapper trait
Importing the StringIndexer algorithm and using it
Splitting the DataFrame into training and test datasets
Creating a LogisticRegression classifier and setting hyperparameters on it
Running the LR model on the test dataset
Building a breast cancer pipeline with two stages
Implementation objective 5 – evaluating the binary classifier's performance
Summary
Questions
Stock Price Predictions
Stock price binary classification problem
Stock price prediction dataset at a glance
Getting started
Support for hardware virtualization
Installing the supported virtualization application 
Downloading the HDP Sandbox and importing it
Hortonworks Sandbox virtual appliance overview
Turning on the virtual machine and powering up the Sandbox
Setting up SSH access for data transfer between Sandbox and the host machine
Setting up PuTTY, a third-party SSH and Telnet client
Setting up WinSCP, an SFTP client for Windows
Updating the default Python required by Zeppelin
What is Zeppelin?
Updating our Zeppelin instance
Launching the Ambari Dashboard and Zeppelin UI
Updating Zeppelin Notebook configuration by adding or updating interpreters
Updating a Spark 2 interpreter
Implementation objectives
List of implementation goals
Step 1 – creating a Scala representation of the path to the dataset file
Step 2 – creating an RDD[String]
Step 3 – splitting the RDD around the newline character in the dataset
Step 4 – transforming the RDD[String] 
Step 5 – carrying out preliminary data analysis
Creating DataFrame from the original dataset
Dropping the Date and Label columns from the DataFrame
Having Spark describe the DataFrame
Adding a new column to the DataFrame and deriving Vector out of it
Removing stop words – a preprocessing step 
Transforming the merged DataFrame
Transforming a DataFrame into an array of NGrams
Adding a new column to the DataFrame, devoid of stop words
Constructing a vocabulary from our dataset corpus
Training CountVectorizer
Using StringIndexer to transform our input label column
Dropping the input label column
Adding a new column to our DataFrame 
Dividing the DataSet into training and test sets
Creating labelIndexer to index the indexedLabel column
Creating StringIndexer to index a column label
Creating RandomForestClassifier
Creating a new data pipeline with three stages
Creating a new data pipeline with hyperparameters
Training our new data pipeline
Generating stock price predictions
Summary
Questions
Building a Spam Classification Pipeline
Spam classification problem
Relevant background topics 
Multidimensional data
Features and their importance
Classification task
Classification outcomes
Two possible classification outcomes
Project overview – problem formulation
Getting started
Setting up prerequisite software
Spam classification pipeline 
Implementation steps
Step 1 – setting up your project folder
Step 2 – upgrading your build.sbt file
Step 3 – creating a trait called SpamWrapper
Step 4 – describing the dataset
Description of the SpamHam dataset
Step 5 – creating a new spam classifier class
Step 6 – listing the data preprocessing steps
Step 7 – regex to remove punctuation marks and whitespaces
Step 8 – creating a ham dataframe with punctuation removed
Creating a labeled ham dataframe
Step 9 – creating a spam dataframe devoid of punctuation
Step 10 – joining the spam and ham datasets
Step 11 – tokenizing our features
Step 12 – removing stop words
Step 13 – feature extraction
Step 14 – creating training and test datasets
Summary
Questions
Further reading
Build a Fraud Detection System
Fraud detection problem
Fraud detection dataset at a glance
Precision, recall, and the F1 score
Feature selection
The Gaussian Distribution function
Where does Spark fit in all this?
Fraud detection approach
Project overview – problem formulation
Getting started
Setting up Hortonworks Sandbox in the cloud
Creating your Azure free account, and signing in
The Azure Marketplace
The HDP Sandbox home page
Implementation objectives
Implementation steps
Create the FraudDetection trait
Broadcasting mean and standard deviation vectors
Calculating PDFs
F1 score
Calculating the best error term and best F1 score
Maximum and minimum values of a probability density
Step size for best error term calculation
A loop to generate the best F1 and the best error term
Generating predictions – outliers that represent fraud
Generating the best error term and best F1 measure
Preparing to compute precision and recall
A recap of how we looped through a range of epsilons, the best error term, and the best F1 measure
Function to calculate false positives
Summary
Questions
Further reading
Build Flights Performance Prediction Model
Overview of flight delay prediction
The flight dataset at a glance
Problem formulation of flight delay prediction
Getting started
Setting up prerequisite software
Increasing Java memory
Reviewing the JDK version
MongoDB installation
Implementation and deployment
Implementation objectives
Creating a new Scala project
Building the AirlineWrapper Scala trait
Summary
Questions
Further reading
Building a Recommendation Engine
Problem overviews
Recommendations on Amazon
Brief overview
Detailed overview
On-site recommendations
Recommendation systems
Definition
Categorizing recommendations
Implicit recommendations
Explicit recommendations
Recommendations for machine learning
Collaborative filtering algorithms
Recommendations problem formulation
Understanding datasets
Detailed overview
Recommendations regarding problem formulation
Defining explicit feedback
Building a narrative
Sales leads and past sales
Weapon sales leads and past sales data
Implementation and deployment 
Implementation
Step 1 – creating the Scala project
Step 2 – creating the AirlineWrapper definition
Step 3 – creating a weapon sales orders schema
Step 4 – creating a weapon sales leads schema
Step 5 – building a weapon sales order dataframe
Step 6 – displaying the weapons sales dataframe
Step 7 – displaying the customer-weapons-system dataframe
Step 8 – generating predictions
Step 9 – displaying predictions
Compilation and deployment
Compiling the project
What is an assembly.sbt file?
Creating assembly.sbt
Contents of assembly.sbt
Running the sbt assembly task
Upgrading the build.sbt file
Rerunning the assembly command
Deploying the recommendation application
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Scala, along with the Spark Framework, forms a rich and powerful data processing ecosystem. This book is a journey into the depths of this ecosystem. The machine learning (ML) projects presented in this book enable you to create practical, robust, data analytics solutions, with an emphasis on automating data workflows with the Spark ML pipeline API. This book showcases, or carefully cherry-picks from, Scala's functional libraries and other constructs to help readers roll out their own scalable data processing frameworks. The projects in this book enable data practitioners across all industries to gain insights into data that will help organizations to obtain a strategic and competitive advantage. Modern Scala Projects focuses on the application of supervised learning ML techniques that classify data and make predictions. You'll begin by working on a project to predict the class of a flower by implementing a simple machine learning model. Next, you'll create a cancer diagnosis classification pipeline, followed by projects delving into stock price prediction, spam filtering, fraud detection, and a recommendation engine.
By the end of this book, you will be able to build efficient data science projects that fulfill your software requirements.
This book is for Scala developers who would like to gain hands-on experience with interesting real-world projects. Prior programming experience with Scala is necessary.
Chapter 1, Predict the Class of a Flower from the Iris Dataset, focuses on building a machine learning model leveraging a time-tested statistical method based on regression. The chapter draws the reader into data processing, all the way to training and testing a relatively simple machine learning model.
Chapter 2, Build a Breast Cancer Prognosis Pipeline with the Power of Spark and Scala, taps into a publicly available breast cancer dataset. It evaluates various feature selection algorithms, transforms data, and builds a classification model.
Chapter 3, Stock Price Predictions, acknowledges that stock price prediction can seem an impossible task. In this chapter, we take a new approach: we build and train a neural network model with training data to solve the apparently intractable problem of stock price prediction. A data pipeline, with Spark at its core, distributes training of the model across multiple machines in a cluster. A real-life dataset is fed into the pipeline. Training data goes through preprocessing and normalization steps before a model is trained to fit the data. We may also provide a means to visualize the results of our prediction and evaluate our model after training.
Chapter 4, Building a Spam Classification Pipeline, informs the reader that the overarching learning objective of this chapter is to implement a spam filtering data analysis pipeline. We will rely on the Spark ML library's machine learning APIs and its supporting libraries to build a spam classification pipeline.
Chapter 5, Build a Fraud Detection System, applies machine learning techniques and algorithms to build a practical ML pipeline that helps find questionable charges on consumers’ credit cards. The data is drawn from a publicly accessible Consumer Complaints Database. The chapter demonstrates the tools contained in Spark ML for building, evaluating, and tuning a pipeline. Feature extraction is one function served by Spark ML that is covered here.
Chapter 6, Build Flights Performance Prediction Model, shows how to leverage flight departure and arrival data to predict whether a user's flight will be delayed or canceled. Here, we will build a decision tree-based model to derive useful predictors, such as which time of day is best to take a flight with a minimum chance of delay.
Chapter 7, Building a Recommendation Engine, draws the reader into the implementation of a scalable recommendations engine. The collaborative-filtering approach is laid out as the reader walks through a phased recommendations-generating process based on users’ past preferences.
Prior knowledge of Scala is assumed. Knowledge of basic concepts such as Spark ML will be an added advantage.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Modern-Scala-Projects. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/ModernScalaProjects_ColorImages.pdf
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
This chapter kicks off a machine learning (ML) initiative in Scala and Spark. Speaking of Spark, its Machine Learning Library (MLlib), whose DataFrame-based API lives under the spark.ml package, will help us develop scalable data analysis applications. The MLlib DataFrame-based API, also known as Spark ML, provides powerful learning algorithms and pipeline-building tools for data analysis. Needless to say, starting with this chapter, we will leverage MLlib's classification algorithms.
The Spark ecosystem, which also boasts APIs for R, Python, and Java in addition to Scala, empowers our readers, be they beginners or seasoned data professionals, to make sense of and extract analytics from various datasets.
Speaking of datasets, the Iris dataset represents perhaps the simplest, yet most famous, data analysis task in the ML space. This chapter builds a solution to the classification task that the Iris dataset represents.
Here is the dataset we will refer to:
UCI Machine Learning Repository: Iris Data Set (accessed July 13, 2018): https://archive.ics.uci.edu/ml/datasets/Iris
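Before diving into the chapter's objectives, the following is a minimal sketch of how the downloaded dataset might be loaded into a Spark DataFrame from the Spark shell (where a SparkSession is already available as spark). The local file path and the column names are assumptions for illustration, not the book's exact settings.

// In the Spark shell, a SparkSession is already available as `spark`
// The UCI file has no header row, so we supply the column names ourselves
val iris = spark.read
  .option("inferSchema", "true")
  .csv("iris.csv")   // hypothetical local path to the downloaded file
  .toDF("sepal_length", "sepal_width", "petal_length", "petal_width", "species")

iris.show(5)
iris.printSchema()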
The overarching learning objective of this chapter is to implement a Scala solution to the so-called multivariate classification task represented by the Iris dataset.
The following list is a section-wise breakdown of individual learning outcomes:
A multivariate classification problem
Project overview—problem formulation
Getting started with Spark
Implementing a multiclass classification pipeline
The following section offers the reader an in-depth perspective on the Iris dataset classification problem.
The most famous dataset in data science history is Sir Ronald Aylmer Fisher's classical Iris flower dataset, also known as Anderson's dataset. It was introduced in 1936, as a study in understanding multivariate (or multiclass) classification. What then is multivariate?
The term multivariate can bear two meanings:
As an adjective, multivariate means having or involving more than one variable.
As a noun, a multivariate may represent a mathematical vector whose individual elements are variates. Each individual element in this vector is a measurable quantity or variable.
Both meanings share a common denominator: the variable. Conducting a multivariate analysis of an experimental unit involves more than one measurable quantity or variable. A classic example of such an analysis is the Iris dataset, which has multiple variables per observation.
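To make the "vector of variates" reading concrete, here is a small illustrative sketch using Spark ML's linear algebra types: one Iris observation expressed as a four-element vector of measurements. The specific values are only an example.

import org.apache.spark.ml.linalg.Vectors

// One multivariate observation: sepal length, sepal width, petal length, petal width (cm)
val observation = Vectors.dense(5.1, 3.5, 1.4, 0.2)

println(observation.size)   // 4 variables in this observation
println(observation(2))     // the petal length element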
In this subsection, we understood multivariate in terms of variables. In the next subsection, we briefly touch upon different kinds of variables, one of them being categorical variables.
In general, variables are of two types:
Quantitative variable: A variable representing a measurement that is quantified by a numeric value. Some examples of quantitative variables are:
A variable representing the age of a girl called Huan (Age_Huan). In September of 2017, the variable representing her age contained the value 24. One year later, that variable would hold 1 added (arithmetically) to her current age.
A variable representing the number of planets in the solar system (Planet_Number). Currently, pending the discovery of any new planets, this variable contains the number 8. If scientists found a new celestial body tomorrow that they thought qualified as a planet, the Planet_Number variable's value would be bumped up from its current value of 8 to 9.
Categorical variable: A variable that cannot be assigned a numerical measure in the natural order of things. For example, the residency status of an individual in the United States, which could be one of the following values: citizen, permanent resident, or non-resident.
In the next subsection, we will describe categorical variables in some detail.
We will draw upon the definition of a categorical variable from the previous subsection. Categorical variables distinguish themselves from quantitative variables in a fundamental way. As opposed to a quantitative variable, which represents a measure of something in numerical terms, a categorical variable represents a grouping name or a category name, and it can take one of a finite number of possible categories. For example, the species of an Iris flower is a categorical variable, and the value it takes could be one value from a finite set of categorical values: Iris-setosa, Iris-virginica, or Iris-versicolor.
It may be useful to draw on other examples of categorical variables; these are listed as follows:
The blood group of an individual, as in A+, A-, B+, B-, AB+, AB-, O+, or O-
The county that an individual is a resident of, given a finite list of counties in the state of Missouri
The political affiliation of a United States citizen, which could take categorical values such as Democrat, Republican, or Green Party
In global warming studies, the type of a forest, a categorical variable that could take one of three values: tropical, temperate, or taiga
The first item in the preceding list, the blood group of a person, is a categorical variable whose corresponding data (values) are categorized (classified) into eight groups (A, B, AB, or O with their positives or negatives). In a similar vein, the species of an Iris flower is a categorical variable whose data (values) are categorized (classified) into three species groups—Iris-setosa, Iris-versicolor, and Iris-virginica.
That said, a common data analysis task in ML is to index, or encode, current string representations of categorical values into a numeric form; doubles for example. Such indexing is a prelude to a prediction on the target or label, which we shall talk more about shortly.
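As a taste of what that indexing step looks like in Spark ML, here is a minimal, hedged sketch that uses StringIndexer to encode a string species column as doubles. The DataFrame name irisDF and its column names are assumptions carried over from the earlier loading sketch.

import org.apache.spark.ml.feature.StringIndexer

// Encode the categorical species strings (for example, "Iris-setosa") as doubles (0.0, 1.0, 2.0)
val labelIndexer = new StringIndexer()
  .setInputCol("species")
  .setOutputCol("label")

// irisDF stands in for a DataFrame already built from the Iris data
val indexedDF = labelIndexer.fit(irisDF).transform(irisDF)
indexedDF.select("species", "label").distinct().show()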
With respect to the Iris flower dataset, its species variable data is subject to a classification (or categorization) task, with the express purpose of being able to make a prediction of the species of an Iris flower. At this point, we want to examine the Iris dataset, its rows, row characteristics, and much more, which is the focus of the upcoming topic.
The Iris flower dataset comprises a total of 150 rows, where each row represents one flower. Each row is also known as an observation. This 150-observation Iris dataset is made up of three kinds of observations, related to three different Iris flower species. The following table is an illustration:
Referring to the preceding table, it is clear that three flower species are represented in the Iris dataset. Each flower species in this dataset contributes 50 observations apiece. Each observation holds four measurements. One measurement corresponds to one flower feature, where each flower feature corresponds to one of the following:
Sepal Length
Sepal Width
Petal Length
Petal Width
The features listed earlier are illustrated in the following table for clarity:
Okay, so three flower species are represented in the Iris dataset. Speaking of species, we will henceforth replace the term species with the term classes whenever we need to stick to ML terminology. That means #1-Iris-setosa from earlier refers to Class #1, #2-Iris-virginica to Class #2, and #3-Iris-versicolor to Class #3.
We just listed three different Iris flower species that are represented in the Iris dataset. What do they look like? What do their features look like? These questions are answered in the following screenshot:
That said, let's look at the sepal and petal portions of each class of Iris flower. The sepal (the larger part) and petal (the smaller part) dimensions are how each class of Iris flower bears a relationship to the other two classes of Iris flowers. In the next section, we will summarize our discussion and expand its scope to the Iris dataset as a multiclass, multidimensional classification task.
In this section, we will restate the facts about the Iris dataset and describe it in the context of an ML classification task:
The Iris dataset classification task is multiclass because a prediction of the class of a new incoming Iris flower from the wild can belong to any of three classes.
Indeed, this chapter is all about attempting a species classification (inferring the target class of a new Iris flower) using sepal and petal dimensions as feature parameters.
The Iris dataset classification is multidimensional because there are four features.
There are 150 observations, where each observation comprises measurements of four features. These measurements are also known by the following terms:
Input attributes or instances
Predictor variables (X)
Input variables (X)
Classification of an Iris flower picked in the wild is carried out by a model (the computed mapping function) that is given four flower feature measurements.
The outcome of the Iris flower classification task is the identification of a (computed) predicted value for the response from the predictors, by a process of learning (or fitting) a discrete number of targets or category labels (Y). The outcome or predicted value may mean the same as the following:
Categorical response variable: in a later section, we shall see that an indexer algorithm will transform all categorical values to numbers
Response or outcome variable (Y)
So far, we have claimed that the outcome (Y) of our multiclass classification task is dependent on inputs (X). Where will these inputs come from? This is answered in the next section.
An integral aspect of our data analysis or classification task that we have not mentioned so far is the training dataset. A training dataset is our classification task's source of input data (X). We take advantage of this dataset to obtain a prediction of each target class, simply by deriving optimal parameters or boundary conditions. We have just refined our definition of the classification process by adding in the extra detail of the training dataset. For a classification task, then, we have X on one side and Y on the other, with an inferred mapping function in the middle. That brings us to the mapping or predictor function, which is the focus of the next section.
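In Spark ML terms, carving out a training dataset typically amounts to a random split of the source DataFrame. The sketch below assumes the indexedDF DataFrame from the earlier StringIndexer sketch; the 80/20 ratio and the seed are illustrative choices, not the book's settings.

// Carve a training dataset and a held-out test dataset from the indexed DataFrame
val Array(trainingData, testData) = indexedDF.randomSplit(Array(0.8, 0.2), seed = 1234L)

println(s"training rows: ${trainingData.count()}, test rows: ${testData.count()}")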
This section starts with a schematic depicting the components of the mapping function and an algorithm that learns the mapping function. The algorithm is learning the mapping function, as shown in the following diagram:
The goal of our classification process is to let the algorithm derive the best possible approximation of a mapping function by a learning (or fitting) process. When we find an Iris flower out in the wild and want to classify it, we use its measurements as new input data that our algorithm's mapping function will accept in order to give us a predicted value (Y). In other words, given feature measurements of an Iris flower (the new data), the mapping function produced by a supervised learning algorithm (this will be a random forest) will classify the flower.
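To make the idea of a learned mapping function concrete, here is a hedged sketch of fitting a random forest to the training split and applying it to unseen data. It assumes the trainingData and testData splits and column names from the earlier sketches, and it assembles a features vector column with VectorAssembler; the hyperparameter value is purely illustrative.

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the four flower measurements into a single features vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("sepal_length", "sepal_width", "petal_length", "petal_width"))
  .setOutputCol("features")

val trainVec = assembler.transform(trainingData)
val testVec  = assembler.transform(testData)

// The fitted model is our learned approximation of the mapping function from X to Y
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(10)   // an illustrative hyperparameter, not a tuned value

val model = rf.fit(trainVec)

// New, unseen measurements (a flower from the wild) are classified via transform
val predictions = model.transform(testVec)
predictions.select("label", "prediction").show(5)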
Two kinds of ML problems exist that supervised learning algorithms can solve. These are as follows:
Classification tasks
Regression tasks
In the following paragraph, we will talk about a mapping function with an example. We explain the role played by a "supervised learning classification task" in deducing the mapping function. The concept of a model is introduced.
Let's say we already knew that the (mapping) function f(x) for the Iris dataset classification task was exactly of the form x + 1; then there would be no need for us to find a new mapping function. But recall that a mapping function is one that maps the relationship between flower features, such as sepal length and sepal width, and the species the flower belongs to. Is there a ready-made function of that kind? No.
Therefore, there is no preexisting function that clearly maps the relationship between flower features and the flower's species. What we need is a model that captures that relationship as closely as possible. Data and its classification seldom tend to be straightforward. A supervised learning classification task starts life with no knowledge of what the function f(x) is. A supervised learning classification process applies ML techniques and strategies in an iterative process of deduction to ultimately learn what f(x) is.
In our case, such an ML endeavor is a classification task, a task where the function or mapping function is referred to in statistical or ML terminology as a model.
In the next section, we will describe what supervised learning is and how it relates to the Iris dataset classification.
