Practical Machine Learning

Sunila Gollapudi

Description

Tackle the real-world complexities of modern machine learning with innovative, cutting-edge techniques

About This Book

  • Fully-coded working examples using a wide range of machine learning libraries and tools, including Python, R, Julia, and Spark
  • Comprehensive practical solutions taking you into the future of machine learning
  • Go a step further and integrate your machine learning projects with Hadoop

Who This Book Is For

This book has been created for data scientists who want to see machine learning in action and explore its real-world application. With guidance on everything from the fundamentals of machine learning and predictive analytics to the latest innovations set to lead the big data revolution into the future, this is an unmissable resource for anyone dedicated to tackling current big data challenges. Knowledge of programming (Python and R) and mathematics is advisable if you want to get started immediately.

What You Will Learn

  • Implement a wide range of algorithms and techniques for tackling complex data
  • Get to grips with some of the most powerful languages in data science, including R, Python, and Julia
  • Harness the capabilities of Spark and Hadoop to manage and process data successfully
  • Apply the appropriate machine learning technique to address real-world problems
  • Get acquainted with Deep learning and find out how neural networks are being used at the cutting edge of machine learning
  • Explore the future of machine learning and dive deeper into polyglot persistence, semantic data, and more

In Detail

Finding meaning in increasingly larger and more complex datasets is a growing demand of the modern world. Machine learning and predictive analytics have become the most important approaches to uncover data gold mines. Machine learning uses complex algorithms to make improved predictions of outcomes based on historical patterns and the behavior of datasets. Machine learning can deliver dynamic insights into trends, patterns, and relationships within data, which is immensely valuable to business growth and development.

This book explores an extensive range of machine learning techniques, uncovering hidden tips and tricks for several types of data using practical, real-world examples. While machine learning can be highly theoretical, this book offers a refreshing hands-on approach without losing sight of the underlying principles. Inside, a full exploration of the various algorithms gives you high-quality guidance, so you can begin to see just how effective machine learning is at tackling the contemporary challenges of big data.

This is the only book you need to implement a whole suite of open source tools, frameworks, and languages in machine learning. We will cover the leading data science languages, Python and R, and the underrated but powerful Julia, as well as a range of other big data platforms including Spark, Hadoop, and Mahout. Practical Machine Learning is an essential resource for modern data scientists who want to get to grips with machine learning's real-world application.

With this book, you will not only learn the fundamentals of machine learning but also dive deep into the complexities of real-world data before moving on to using Hadoop and its wider ecosystem of tools to process and manage your structured and unstructured data.

You will explore different machine learning techniques for both supervised and unsupervised learning; from decision trees to Naive Bayes classifiers and linear and clustering methods, you will learn strategies for a truly advanced approach to the statistical analysis of data. The book also explores the cutting-edge advancements in machine learning, with worked examples and guidance on deep learning and reinforcement learning, providing you with practical demonstrations and samples that help take the theory (and mystery) out of even the most advanced machine learning methodologies.

Style and approach

A practical data science tutorial designed to give you an insight into the practical application of machine learning, this book takes you through complex concepts and tasks in an accessible way. Featuring information on a wide range of data science techniques, Practical Machine Learning is a comprehensive data science resource.


Table of Contents

Practical Machine Learning
Credits
Foreword
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Introduction to Machine learning
Machine learning
Definition
Core Concepts and Terminology
What is learning?
Data
Labeled and unlabeled data
Tasks
Algorithms
Models
Logical models
Geometric models
Probabilistic models
Data and inconsistencies in Machine learning
Under-fitting
Over-fitting
Data instability
Unpredictable data formats
Practical Machine learning examples
Types of learning problems
Classification
Clustering
Forecasting, prediction or regression
Simulation
Optimization
Supervised learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning
Deep learning
Performance measures
Is the solution good?
Mean squared error (MSE)
Mean absolute error (MAE)
Normalized MSE and MAE (NMSE and NMAE)
Solving the errors: bias and variance
Some complementing fields of Machine learning
Data mining
Artificial intelligence (AI)
Statistical learning
Data science
Machine learning process lifecycle and solution architecture
Machine learning algorithms
Decision tree based algorithms
Bayesian method based algorithms
Kernel method based algorithms
Clustering methods
Artificial neural networks (ANN)
Dimensionality reduction
Ensemble methods
Instance based learning algorithms
Regression analysis based algorithms
Association rule based learning algorithms
Machine learning tools and frameworks
Summary
2. Machine learning and Large-scale datasets
Big data and the context of large-scale Machine learning
Functional versus Structural – A methodological mismatch
Commoditizing information
Theoretical limitations of RDBMS
Scaling-up versus Scaling-out storage
Distributed and parallel computing strategies
Machine learning: Scalability and Performance
Too many data points or instances
Too many attributes or features
Shrinking response time windows – need for real-time responses
Highly complex algorithm
Feed forward, iterative prediction cycles
Model selection process
Potential issues in large-scale Machine learning
Algorithms and Concurrency
Developing concurrent algorithms
Technology and implementation options for scaling-up Machine learning
MapReduce programming paradigm
High Performance Computing (HPC) with Message Passing Interface (MPI)
Language Integrated Queries (LINQ) framework
Manipulating datasets with LINQ
Graphics Processing Unit (GPU)
Field Programmable Gate Array (FPGA)
Multicore or multiprocessor systems
Summary
3. An Introduction to Hadoop's Architecture and Ecosystem
Introduction to Apache Hadoop
Evolution of Hadoop (the platform of choice)
Hadoop and its core elements
Machine learning solution architecture for big data (employing Hadoop)
The Data Source layer
The Ingestion layer
The Hadoop Storage layer
The Hadoop (Physical) Infrastructure layer – supporting appliance
Hadoop platform / Processing layer
The Analytics layer
The Consumption layer
Explaining and exploring data with Visualizations
Security and Monitoring layer
Hadoop core components framework
Hadoop Distributed File System (HDFS)
Secondary Namenode and Checkpoint process
Splitting large data files
Block loading to the cluster and replication
Writing to and reading from HDFS
Handling failures
HDFS command line
RESTful HDFS
MapReduce
MapReduce architecture
What makes MapReduce cater to the needs of large datasets?
MapReduce execution flow and components
Developing MapReduce components
InputFormat
OutputFormat
Mapper implementation
Hadoop 2.x
Hadoop ecosystem components
Hadoop installation and setup
Installing JDK 1.7
Creating a system user for Hadoop (dedicated)
Disable IPv6
Steps for installing Hadoop 2.6.0
Starting Hadoop
Hadoop distributions and vendors
Summary
4. Machine Learning Tools, Libraries, and Frameworks
Machine learning tools – A landscape
Apache Mahout
How does Mahout work?
Installing and setting up Apache Mahout
Setting up Maven
Setting-up Apache Mahout using Eclipse IDE
Setting up Apache Mahout without Eclipse
Mahout Packages
Implementing vectors in Mahout
R
Installing and setting up R
Integrating R with Apache Hadoop
Approach 1 – Using R and Streaming APIs in Hadoop
Approach 2 – Using the Rhipe package of R
Approach 3 – Using RHadoop
Summary of R/Hadoop integration approaches
Implementing in R (using examples)
R Expressions
Assignments
Functions
R Vectors
Assigning, accessing, and manipulating vectors
R Matrices
R Factors
R Data Frames
R Statistical frameworks
Julia
Installing and setting up Julia
Downloading and using the command line version of Julia
Using Juno IDE for running Julia
Using Julia via the browser
Running the Julia code from the command line
Implementing in Julia (with examples)
Using variables and assignments
Numeric primitives
Data structures
Working with Strings and String manipulations
Packages
Interoperability
Integrating with C
Integrating with Python
Integrating with MATLAB
Graphics and plotting
Benefits of adopting Julia
Integrating Julia and Hadoop
Python
Toolkit options in Python
Implementation of Python (using examples)
Installing Python and setting up scikit-learn
Loading data
Apache Spark
Scala
Programming with Resilient Distributed Datasets (RDD)
Spring XD
Summary
5. Decision Tree based learning
Decision trees
Terminology
Purpose and uses
Constructing a Decision tree
Handling missing values
Considerations for constructing Decision trees
Choosing the appropriate attribute(s)
Information gain and Entropy
Gini index
Gain ratio
Termination Criteria / Pruning Decision trees
Decision trees in a graphical representation
Inducing Decision trees – Decision tree algorithms
CART
C4.5
Greedy Decision trees
Benefits of Decision trees
Specialized trees
Oblique trees
Random forests
Evolutionary trees
Hellinger trees
Implementing Decision trees
Using Mahout
Using R
Using Spark
Using Python (scikit-learn)
Using Julia
Summary
6. Instance and Kernel Methods Based Learning
Instance-based learning (IBL)
Nearest Neighbors
Value of k in KNN
Distance measures in KNN
Euclidean distance
Hamming distance
Minkowski distance
Case-based reasoning (CBR)
Locally weighted regression (LWR)
Implementing KNN
Using Mahout
Using R
Using Spark
Using Python (scikit-learn)
Using Julia
Kernel methods-based learning
Kernel functions
Support Vector Machines (SVM)
Inseparable Data
Implementing SVM
Using Mahout
Using R
Using Spark
Using Python (Scikit-learn)
Using Julia
Summary
7. Association Rules based learning
Association rules based learning
Association rule – a definition
Apriori algorithm
Rule generation strategy
Rules for defining appropriate minsup
Apriori – the downside
FP-growth algorithm
Apriori versus FP-growth
Implementing Apriori and FP-growth
Using Mahout
Using R
Using Spark
Using Python (Scikit-learn)
Using Julia
Summary
8. Clustering based learning
Clustering-based learning
Types of clustering
Hierarchical clustering
Partitional clustering
The k-means clustering algorithm
Convergence or stopping criteria for the k-means clustering
K-means clustering on disk
Advantages of the k-means approach
Disadvantages of the k-means algorithm
Distance measures
Complexity measures
Implementing k-means clustering
Using Mahout
Using R
Using Spark
Using Python (scikit-learn)
Using Julia
Summary
9. Bayesian learning
Bayesian learning
Statistician's thinking
Important terms and definitions
Probability
Types of events
Mutually exclusive or disjoint events
Independent events
Dependent events
Types of probability
Distribution
Bernoulli distribution
Binomial distribution
Poisson probability distribution
Exponential distribution
Normal distribution
Relationship between the distributions
Bayes' theorem
Naïve Bayes classifier
Multinomial Naïve Bayes classifier
The Bernoulli Naïve Bayes classifier
Implementing Naïve Bayes algorithm
Using Mahout
Using R
Using Spark
Using scikit-learn
Using Julia
Summary
10. Regression based learning
Regression analysis
Revisiting statistics
Properties of expectation, variance, and covariance
Properties of variance
Properties of covariance
Example
ANOVA and F Statistics
Confounding
Effect modification
Regression methods
Simple regression or simple linear regression
Multiple regression
Polynomial (non-linear) regression
Generalized Linear Models (GLM)
Logistic regression (logit link)
Odds ratio in logistic regression
Model
Poisson regression
Implementing linear and logistic regression
Using Mahout
Using R
Using Spark
Using scikit-learn
Using Julia
Summary
11. Deep learning
Background
The human brain
Neural networks
Neuron
Synapses
Artificial neurons or perceptrons
Linear neurons
Rectified linear neurons / linear threshold neurons
Binary threshold neurons
Sigmoid neurons
Stochastic binary neurons
Neural Network size
An example
Neural network types
Multilayer fully connected feedforward networks or Multilayer Perceptrons (MLP)
Jordan networks
Elman networks
Radial Basis Function (RBF) networks
Hopfield networks
Dynamic Learning Vector Quantization (DLVQ) networks
Gradient descent method
Backpropagation algorithm
Softmax regression technique
Deep learning taxonomy
Convolutional neural networks (CNN/ConvNets)
Convolutional layer (CONV)
Pooling layer (POOL)
Fully connected layer (FC)
Recurrent Neural Networks (RNNs)
Restricted Boltzmann Machines (RBMs)
Deep Boltzmann Machines (DBMs)
Autoencoders
Implementing ANNs and Deep learning methods
Using Mahout
Using R
Using Spark
Using Python (Scikit-learn)
Using Julia
Summary
12. Reinforcement learning
Reinforcement Learning (RL)
The context of Reinforcement Learning
Examples of Reinforcement Learning
Evaluative Feedback
n-Armed Bandit problem
Action-value methods
Reinforcement comparison methods
The Reinforcement Learning problem – the world grid example
Markov Decision Process (MDP)
Basic RL model – agent-environment interface
Delayed rewards
The policy
Reinforcement Learning – key features
Reinforcement learning solution methods
Dynamic Programming (DP)
Generalized Policy Iteration (GPI)
Monte Carlo methods
Temporal difference (TD) learning
Sarsa - on-Policy TD
Q-Learning – off-Policy TD
Actor-critic methods (on-policy)
R Learning (Off-policy)
Implementing Reinforcement Learning algorithms
Using Mahout
Using R
Using Spark
Using Python (Scikit-learn)
Using Julia
Summary
13. Ensemble learning
Ensemble learning methods
The wisdom of the crowd
Key use cases
Recommendation systems
Anomaly detection
Transfer learning
Stream mining or classification
Ensemble methods
Supervised ensemble methods
Boosting
AdaBoost
Bagging
Wagging
Random forests
Gradient boosting machines (GBM)
Unsupervised ensemble methods
Implementing ensemble methods
Using Mahout
Using R
Using Spark
Using Python (Scikit-learn)
Using Julia
Summary
14. New generation data architectures for Machine learning
Evolution of data architectures
Emerging perspectives & drivers for new age data architectures
Modern data architectures for Machine learning
Semantic data architecture
The business data lake
Semantic Web technologies
Ontology and data integration
Vendors
Multi-model database architecture / polyglot persistence
Vendors
Lambda Architecture (LA)
Vendors
Summary
Index

Practical Machine Learning

Practical Machine Learning

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: January 2016

Production reference: 2270116

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78439-968-9

www.packtpub.com

Credits

Author

Sunila Gollapudi

Reviewers

Rahul Agrawal

Rahul Jain

Ryota Kamoshida

Ravi Teja Kankanala

Dr. Jinfeng Yi

Commissioning Editor

Akram Hussain

Acquisition Editor

Sonali Vernekar

Content Development Editor

Sumeet Sawant

Technical Editor

Murtaza Tinwala

Copy Editor

Yesha Gangani

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Tejal Daruwale Soni

Graphics

Jason Monteiro

Production Coordinator

Manu Joseph

Cover Work

Manu Joseph

Foreword

Can machines think? This question has fascinated scientists and researchers around the world. In the 1950s, Alan Turing shifted the paradigm from "Can machines think?" to "Can machines do what humans (as thinking entities) can do?". Since then, the field of Machine learning/Artificial Intelligence continues to be an exciting topic and considerable progress has been made.

The advances in various computing technologies, the pervasive use of computing devices, and the resultant information/data glut have shifted the focus of Machine learning from an exciting esoteric field to prime time. Today, organizations around the world have understood the value of Machine learning in the crucial role of knowledge discovery from data, and have started to invest in these capabilities.

Most developers around the world have heard of Machine learning; the "learning" part seems daunting since this field needs multidisciplinary thinking across big data, statistics, mathematics, and computer science. Sunila has stepped in to fill this void. She takes a fresh approach to mastering Machine learning, addressing the computing side of the equation: handling scale, the complexity of datasets, and rapid response times.

Practical Machine Learning is aimed at being a guidebook for both established and aspiring data scientists/analysts. She presents an enriching journey for readers to understand the fundamentals of Machine learning, and manages to handhold them at every step, leading them down a practical implementation path.

She progressively uncovers three key learning blocks. The foundation block focuses on conceptual clarity, with a detailed review of the theoretical nuances of the discipline. The next stage connects these concepts to real-world problems and establishes an ability to rationalize an optimal application. Finally, she explores the implementation aspects of the latest and best tools in the market to demonstrate their value to business users.

V. Laxmikanth

Managing Director, Broadridge Financial Solutions (India) Pvt Ltd

About the Author

Sunila Gollapudi works as Vice President Technology with Broadridge Financial Solutions (India) Pvt. Ltd., a wholly owned subsidiary of the US-based Broadridge Financial Solutions Inc. (BR). She has close to 14 years of rich hands-on experience in the IT services space. She currently runs the Architecture Center of Excellence from India and plays a key role in the big data and data science initiatives. Prior to joining Broadridge, she held key positions at leading global organizations, and she specializes in Java, distributed architecture, big data technologies, advanced analytics, Machine learning, semantic technologies, and data integration tools. Sunila represents Broadridge in global technology leadership and innovation forums, most recently at IEEE for her work on semantic technologies and their role in business data lakes. Sunila's signature strength is her ability to stay connected with the ever-changing global technology landscape, where new technologies mushroom rapidly, connect the dots, and architect practical solutions for business delivery. A postgraduate in computer science, her first publication was Getting Started with Greenplum for Big Data Analytics, Packt Publishing, on the big data warehouse solution Greenplum. She's a noted Indian classical dancer at both national and international levels and a painting artist, in addition to being a mother and a wife.

Acknowledgments

At the outset, I would like to express my sincere gratitude to Broadridge Financial Solutions (India) Pvt Ltd., for providing the platform to pursue my passion in the field of technology.

My heartfelt thanks to Laxmikanth V, my mentor and Managing Director of the firm, for his continued support and the foreword for this book, Dr. Dakshinamurthy Kolluru, President, International School of Engineering (INSOFE), for helping me discover my love for Machine learning and Mr. Nagaraju Pappu, Founder & Chief Architect Canopus Consulting, for being my mentor in Enterprise Architecture.

This acknowledgement is incomplete without a special mention of Packt Publishing, for giving me the opportunity to outline and conceptualize this book and for providing complete support in releasing it. This is my second publication with them, and again it has been a pleasure to work with a highly professional crew and the expert reviewers.

To my husband, family, and friends, for their continued support as always. The one person to whom I owe the most is my lovely and understanding daughter, Sai Nikita, who was as excited as I was throughout this journey of writing the book. I only wish there were more than 24 hours in a day, so that I could have spent all that time with you, Niki!

Lastly, this book is a humble submission to all the restless minds in the technology world for their relentless pursuit to build something new every single day that makes the lives of people better and more exciting.

About the Reviewers

Rahul Agrawal is a Principal Research Manager at Bing Sponsored Search in Microsoft India, where he heads a team of applied scientists solving problems in the domain of query understanding, ad matching, and large-scale data mining in real time. His research interests include large-scale text mining, recommender systems, deep neural networks, and social network analysis. Prior to Microsoft, he worked with Yahoo! Research, where he built click prediction models for display advertising. He is a postgraduate from the Indian Institute of Science and has 13 years of experience in Machine learning and massive-scale data mining.

Rahul Jain is a big data / search consultant from Hyderabad, India, where he helps organizations in scaling their big data / search applications. He has 8 years of experience in the development of Java- and J2EE-based distributed systems, with 3 years of experience in working with big data technologies (Apache Hadoop / Spark), NoSQL (MongoDB, HBase, and Cassandra), and search / IR systems (Lucene, Solr, or Elasticsearch). In his previous assignments, he was associated with IVY Comptech as an architect, where he worked on the implementation of big data solutions using Kafka, Spark, and Solr. Prior to that, he worked with Aricent Technologies and Wipro Technologies Ltd, Bangalore, on the development of multiple products.

He runs one of the top technology meet-ups in Hyderabad, the Big Data Hyderabad Meetup, which focuses on big data and its ecosystem. He is a frequent speaker and has given several talks on multiple topics in the big data / search domain at various meet-ups and conferences in India and abroad. In his free time, he enjoys meeting new people and learning new skills.

I would like to thank my wife, Anshu, for standing beside me throughout my career and reviewing this book. She has been my inspiration and motivation for continuing to improve my knowledge and move my career forward.

Ryota Kamoshida is the maintainer of Python library MALSS (https://github.com/canard0328/malss) and now works as a researcher in computer science at a Japanese company.

Ravi Teja Kankanala is a Machine learning expert who loves making sense of large amounts of data and predicting trends through advanced algorithms. At Xlabs, he leads all research and data product development efforts, addressing the healthcare and market research domains. Prior to that, he developed data science products for various use cases in the telecom sector at Ericsson R&D. Ravi did his BTech in computer science at IIT Madras.

Dr. Jinfeng Yi is a Research Staff Member at IBM's Thomas J. Watson Research Center, concentrating on data analytics for complex real-world applications. His research interests lie in Machine learning and its application to various domains, including recommender systems, crowdsourcing, social computing, and spatio-temporal analysis. Jinfeng is particularly interested in developing theoretically principled and practically efficient algorithms for learning from massive datasets. He has published over 15 papers in top Machine learning and data mining venues, such as ICML, NIPS, KDD, AAAI, and ICDM. He also holds multiple US and international patents related to large-scale data management, electronic discovery, spatio-temporal analysis, and privacy-preserved data sharing.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

I dedicate this work of mine to my father G V L N Sastry, and my mother, late G Vijayalakshmi. I wouldn't have been what I am today without your perseverance, love, and confidence in me.

Preface

Finding something meaningful in increasingly larger and more complex datasets is a growing demand of the modern world. Machine learning and predictive analytics have become the most important approaches to uncover data gold mines. Machine learning uses complex algorithms to make improved predictions of outcomes based on historical patterns and the behavior of datasets. Machine learning can deliver dynamic insights into trends, patterns, and relationships within data, which is immensely valuable to the growth and development of business.

With this book, you will not only learn the fundamentals of Machine learning, but you will also dive deep into the complexities of the real-world data before moving onto using Hadoop and its wider ecosystem of tools to process and manage your structured and unstructured data.

What this book covers

Chapter 1, Introduction to Machine learning, will cover the basics of Machine learning and the landscape of Machine learning semantics. It will also define Machine learning in simple terms and introduce Machine learning jargon or commonly used terms. This chapter will form the base for the rest of the chapters.

Chapter 2, Machine learning and Large-scale datasets, will explore qualifiers of large datasets, common characteristics, problems of repetition, the reasons for the hyper-growth in the volumes, and approaches to handle the big data.

Chapter 3, An Introduction to Hadoop's Architecture and Ecosystem, will cover all about Hadoop, starting from its core frameworks to its ecosystem components. At the end of this chapter, readers will be able to set up Hadoop and run some MapReduce functions, and they will be able to use one or more ecosystem components. They will also be able to run and manage a Hadoop environment and understand the command-line usage.

Chapter 4, Machine Learning Tools, Libraries, and Frameworks, will explain open source options to implement Machine learning and cover the installation, implementation, and execution of libraries, tools, and frameworks, such as Apache Mahout, Python, R, Julia, and Apache Spark's MLlib. Very importantly, we will cover the integration of these frameworks with the big data platform, Apache Hadoop.

Chapter 5, Decision Tree based learning, will explore a supervised learning technique with Decision trees to solve classification and regression problems. We will cover methods to select attributes and split and prune the tree. Among all the other Decision tree algorithms, we will explore the CART, C4.5, Random forests, and advanced decision tree techniques.

Chapter 6, Instance and Kernel methods based learning, will explore two learning algorithms: instance-based and kernel methods; and we will discover how they address the classification and prediction requirements. In instance-based learning methods, we will explore the Nearest Neighbor algorithm in detail. Similarly in kernel-based methods, we will explore Support Vector Machines using real-world examples.

Chapter 7, Association Rules based learning, will explore association rule based learning methods and algorithms: Apriori and FP-growth. With a common example, you will learn how to do frequent pattern mining using the Apriori and FP-growth algorithms with a step-by-step debugging of the algorithm.

Chapter 8, Clustering based learning, will cover clustering based learning methods in the context of unsupervised learning. We will take a deep dive into the k-means clustering algorithm using an example and learn to implement it using Mahout, R, Python, Julia, and Spark.

Chapter 9, Bayesian learning, will explore Bayesian Machine learning. Additionally, we will cover all the core concepts of statistics, starting from basic nomenclature to various distributions. We will cover Bayes' theorem in depth with examples to understand how to apply it to real-world problems.

Chapter 10, Regression based learning, will cover regression analysis-based Machine learning and, specifically, how to implement linear and logistic regression models using Mahout, R, Python, Julia, and Spark. Additionally, we will cover other related concepts of statistics such as variance, covariance, and ANOVA, among others. We will also cover regression models in depth with examples to understand how to apply them to real-world problems.

Chapter 11, Deep learning, will cover the model for a biological neuron and will explain how an artificial neuron is related to its function. You will learn the core concepts of neural networks and understand how fully-connected layers work. We will also explore some key activation functions that are used in conjunction with matrix multiplication.

Chapter 12, Reinforcement learning, will explore a new learning technique called reinforcement learning. We will see how this is different from the traditional supervised and unsupervised learning techniques. We will also explore the elements of MDP and learn about it using an example.

Chapter 13, Ensemble learning, will cover the ensemble learning methods of Machine learning. Specifically, we will look at some supervised ensemble learning techniques with some real-world examples. Finally, this chapter will have source-code examples for the gradient boosting algorithm using R, Python (scikit-learn), Julia, and Spark Machine learning tools, and recommendation engines using Mahout libraries.

Chapter 14, New generation data architectures for Machine learning, will be on the implementation aspects of Machine learning. We will understand what the traditional analytics platforms are and how they cannot fit modern data requirements. You will also learn about the architecture drivers that promote new data architecture paradigms, such as Lambda Architecture and polyglot persistence (multi-model database architecture), and you will learn how semantic architectures help in seamless data integration.

What you need for this book

You'll need the following software for this book:

  • R (2.15.1)
  • Apache Mahout (0.9)
  • Python (scikit-learn)
  • Julia (0.3.4)
  • Apache Spark (with Scala 2.10.4)

Who this book is for

This book has been created for data scientists who want to see Machine learning in action and explore its real-world application. With guidance on everything from the fundamentals of Machine learning and predictive analytics to the latest innovations set to lead the big data revolution into the future, this is an unmissable resource for anyone dedicated to tackling current big data challenges. Knowledge of programming (Python and R) and mathematics is advisable if you want to get started immediately.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

The author will be updating the code on https://github.com/PacktCode/Practical-Machine-Learning for you to download as and when there are version updates.

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/Practical_Machine_Learning_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. Introduction to Machine learning

The goal of this chapter is to take you through the Machine learning landscape and lay out the basic concepts upfront for the chapters that follow. More importantly, the focus is to help you explore various learning strategies and take a deep dive into the different subfields of Machine learning. The techniques and algorithms under each subfield, and the overall architecture that forms the core for any Machine learning project implementation, are covered in depth.

There are many publications on Machine learning, and a lot of work has been done in the past in this field. Beyond the concepts of Machine learning, the focus will be primarily on specific practical implementation aspects, demonstrated through real-world examples. It is important that you already have a relatively high degree of knowledge of basic programming techniques and algorithmic paradigms, although for every programming section, the required primers are in place.

The following topics are covered in depth in this chapter:

  • Introduction to Machine learning
  • A basic definition and the usage context
  • The differences and similarities between Machine learning and data mining, Artificial Intelligence (AI), statistics, and data science
  • The relationship with big data
  • The terminology and mechanics: model, accuracy, data, features, complexity, and evaluation measures
  • Machine learning subfields: supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, and deep learning; specific Machine learning techniques and algorithms are also covered under each of the subfields
  • Machine learning problem categories: Classification, Regression, Forecasting, and Optimization
  • Machine learning architecture, process lifecycle, and practical problems
  • Machine learning technologies, tools, and frameworks

Machine learning

Machine learning has been around for many years now and all social media users, at some point in time, have been consumers of Machine learning technology. One of the common examples is face recognition software, which is the capability to identify whether a digital photograph includes a given person. Today, Facebook users can see automatic suggestions to tag their friends in the digital photographs that are uploaded. Some cameras and software such as iPhoto also have this capability. There are many examples and use cases that will be discussed in more detail later in this chapter.

The following concept map represents the key aspects and semantics of Machine learning that will be covered throughout this chapter:

Definition

Let's start with defining what Machine learning is. There are many technical and functional definitions for Machine learning, and some of them are as follows:

 

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

  --Tom M. Mitchell
 

"Machine learning is the training of a model from data that generalizes a decision against a performance measure."

  --Jason Brownlee
 

"A branch of artificial intelligence in which a computer generates rules underlying or based on raw data that has been fed into it."

  --Dictionary.com
 

"Machine learning is a scientific discipline that is concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases."

  --Wikipedia

The preceding definitions are fascinating and relevant. They either have an algorithmic, statistical, or mathematical perspective.

Beyond these definitions, a single term or definition for Machine learning is the key to facilitating the definition of a problem-solving platform. Basically, it is a mechanism for pattern search and building intelligence into a machine to be able to learn, implying that it will be able to do better in the future from its own experience.

Drilling down a little more into what a pattern typically is: pattern search or pattern recognition is essentially the study of how machines perceive the environment, learn to discriminate the behavior of interest from the rest, and make reasonable decisions about categorizing that behavior, a task more often performed by humans. The goal is to foster accuracy and speed, and to avoid the possibility of inappropriate use of the system.

Machine learning algorithms that are constructed this way handle building intelligence. Essentially, machines make sense of data in much the same way that humans do.

The primary goal of a Machine learning implementation is to develop a general purpose algorithm that solves a practical and focused problem. Some of the aspects that are important and need to be considered in this process include data, time, and space requirements. Most importantly, with the ability to be applied to a broad class of learning problems, the goal of a learning algorithm is to produce a result that is a rule and is as accurate as possible.

Another important aspect is the big data context; that is, Machine learning methods are known to be effective even in cases where insights need to be uncovered from datasets that are large, diverse, and rapidly changing. More on the large scale data aspect of Machine learning will be covered in Chapter 2, Machine Learning and Large-scale Datasets.

Core Concepts and Terminology

At the heart of Machine learning is knowing and using the data appropriately. This includes collecting the right data, cleansing the data, and processing the data using learning algorithms iteratively to build models using certain key features of data, and based on the hypotheses from these models, making predictions.

In this section, we will cover the standard nomenclature or terminology used in machine learning, starting from how to describe data, learning, modeling, algorithms, and specific machine learning tasks.

What is learning?

Now, let us look at the definition of "learning" in the context of Machine learning. In simple terms, historical data or observations are used to predict or derive actionable tasks. Very clearly, one mandate for an intelligent system is its ability to learn. The following are some considerations to define a learning problem:

  • Provide a definition of what the learner should learn and the need for learning.
  • Define the data requirements and the sources of the data.
  • Define whether the learner should operate on the dataset in its entirety or whether a subset will do.

Before we plunge into understanding the internals of each learning type in the following sections, you need to understand the simple process that is followed to solve a learning problem, which involves building and validating models that solve a problem with maximum accuracy.

Tip

A model is nothing but an output from applying an algorithm to a dataset, and it is usually a representation of the data. We cover more on models in the later sections.

In general, for performing Machine learning, there are primarily two types of datasets required. The first dataset is usually manually prepared, where the input data and the expected output data are available and prepared. It is important that every piece of input data has an expected output data point available as this will be used in a supervised manner to build the rule. The second dataset is where we have the input data, and we are interested in predicting the expected output.

As a first step, the given data is segregated into three datasets: training, validation, and testing. There is no hard rule on what percentage of the data should go into the training, validation, and testing datasets. It can be 70-10-20, 60-30-10, 50-25-25, or any other split.

The training dataset refers to the data examples that are used to learn or build a classifier, for example. The validation dataset refers to the data examples that are verified against the built classifier and can help tune the accuracy of the output. The testing dataset refers to the data examples that help assess the performance of the classifier.

There are typically three phases for performing Machine learning:

  • Phase 1 – Training phase: This is the phase where training data is used to train the model by pairing the given input with the expected output. The output of this phase is the learning model itself.
  • Phase 2 – Validation and test phase: This phase measures how good the trained learning model is and estimates the model properties, such as error measures, recall, precision, and others. This phase uses a validation dataset, and the output is a refined learning model.
  • Phase 3 – Application phase: In this phase, the model is subjected to the real-world data for which the results need to be derived.
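To make the three-way split concrete, the following is a minimal sketch using scikit-learn's train_test_split; the arrays X and y and the 60/20/20 proportions are illustrative assumptions rather than values fixed by the book:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Toy placeholder data: 50 instances with 2 features each.
    X = np.arange(100).reshape(50, 2)
    y = np.arange(50) % 2  # toy binary labels

    # First hold out 40% of the data, then split that holdout evenly
    # into validation and test sets (20% of the original each).
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.4, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.5, random_state=42)

    print(len(X_train), len(X_val), len(X_test))  # prints: 30 10 10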

The following figure depicts how learning can be applied to predict the behavior:

Data

Data forms the main source of learning in Machine learning. The data that is being referenced here can be in any format, can be received at any frequency, and can be of any size. When it comes to handling large datasets in the Machine learning context, there are some new techniques that have evolved and are being experimented with. There are also more big data aspects, including parallel processing, distributed storage, and execution. More on the large-scale aspects of data will be covered in the next chapter, including some unique differentiators.

When we think of data, dimensions come to mind. To start with, we have rows and columns when it comes to structured and unstructured data. This book will cover handling both structured and unstructured data in the machine learning context. In this section, we will cover the terminology related to data within the Machine learning context.

The following terms describe data in the context of Machine learning:

Feature, attribute, field, or variable: This is a single column of data being referenced by the learning algorithms. Some features can be inputs to the learning algorithm, and some can be the outputs.

Instance: This is a single row of data in the dataset.

Feature vector or tuple: This is a list of features.

Dimension: This is a subset of attributes used to describe a property of data. For example, a date dimension consists of three attributes: day, month, and year.

Dataset: A collection of rows or instances is called a dataset. In the context of Machine learning, there are different types of datasets that are meant to be used for different purposes. An algorithm is run on different datasets at different stages to measure the accuracy of the model. There are three types of dataset: training, testing, and evaluation datasets. Any given comprehensive dataset is usually split into the three categories in the following proportions: 60% training, 30% testing, and 10% evaluation.

  a. Training dataset: The base dataset against which the model is built or trained.

  b. Testing dataset: The dataset that is used to validate the model built. This dataset is also referred to as a validating dataset.

  c. Evaluation dataset: The dataset that is used for final verification of the model (and can be treated more as user acceptance testing).

Data types: Attributes or features can have different data types, such as the following:

  • Categorical (for example: young, old)
  • Ordinal (for example: 0, 1)
  • Numeric (for example: 1.3, 2.1, 3.2, and so on)

Coverage: The percentage of the dataset for which a prediction is made or that the model covers. This determines the confidence of the prediction model.
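As a toy illustration of this terminology (the values below are invented), a dataset can be pictured as a two-dimensional array in which each row is an instance and each column is a feature:

    import numpy as np

    # Each row is an instance; the columns are features
    # (say: age, income, label).
    dataset = np.array([
        [25, 50000, 1],
        [42, 80000, 0],
        [31, 62000, 1],
    ])

    feature_vector = dataset[0]  # the feature vector (tuple) of instance 0
    age_feature = dataset[:, 0]  # a single feature across all instances
    print(feature_vector, age_feature)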

Labeled and unlabeled data

Data in the Machine learning context can either be labeled or unlabeled. Before we go deeper into the Machine learning basics, you need to understand this categorization, and what data is used when, as this terminology will be used throughout this book.

Unlabeled data is usually the raw form of the data. It consists of samples of natural or human-created artifacts. This category of data is easily available in abundance. For example, video streams, audio, photos, and tweets among others. This form of data usually has no explanation of the meaning attached.

The unlabeled data becomes labeled data the moment a meaning is attached. Here, we are talking about attaching a "tag" or "label" that is required to interpret and define the relevance. For example, labels for a photo can be the details of what it contains, such as animal, tree, college, and so on, or, in the context of an audio file, a political meeting, a farewell party, and so on. More often than not, the labels are mapped or defined by humans and are significantly more expensive to obtain than unlabeled raw data.

The learning models can be applied to both labeled and unlabeled data. We can derive more accurate models using a combination of labeled and unlabeled datasets. The following diagram represents labeled and unlabeled data. Both triangles and bigger circles represent labeled data and small circles represent unlabeled data.

The application of labeled and unlabeled data is discussed in more detail in the following sections. You will see that supervised learning adopts labeled data and unsupervised learning adopts unlabeled data. Semi-supervised learning and deep learning techniques apply a combination of labeled and unlabeled data in a variety of ways to build accurate models.
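A small sketch can make the distinction concrete; it borrows scikit-learn's semi-supervised convention of marking unlabeled targets with -1 (the data values are invented for illustration):

    import numpy as np

    X = np.array([[1.0], [1.2], [5.0], [5.3], [3.1], [2.9]])
    y = np.array([0, 0, 1, 1, -1, -1])  # the last two instances are unlabeled

    labeled = y != -1
    X_labeled, y_labeled = X[labeled], y[labeled]  # usable for supervised learning
    X_unlabeled = X[~labeled]                      # usable for unsupervised learning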

Tasks

A task is a problem that the Machine learning algorithm is built to solve. It is important that we measure the performance on a task. The term "performance" in this context is nothing but the extent or confidence with which the problem is solved. Different algorithms when run on different datasets produce a different model. It is important that the models thus generated are not compared, and instead, the consistency of the results with different datasets and different models is measured.

Algorithms

After getting a clear understanding of the Machine learning problem at hand, the focus is on what data and algorithms are relevant or applicable. There are several algorithms available. These algorithms are either grouped by the learning subfields (such as supervised, unsupervised, reinforcement, semi-supervised, or deep) or the problem categories (such as Classification, Regression, Clustering or Optimization). These algorithms are applied iteratively on different datasets, and output models that evolve with new data are captured.

Models

Models are central to any Machine learning implementation. A model describes data that is observed in a system. Models are the output of algorithms applied to a dataset. In many cases, these models are applied to new datasets that help the models learn new behavior and also predict them. There is a vast range of machine learning algorithms that can be applied to a given problem. At a very high level, models are categorized as the following:

  • Logical models
  • Geometric models
  • Probabilistic models

Logical models

Logical models are more algorithmic in nature and help us derive a set of rules by running the algorithms iteratively. A Decision tree is one such example:
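To give a flavor of what a logical model looks like in code, here is a hand-written rule set of the kind a Decision tree induces; the features and thresholds are invented for illustration:

    def classify_email(contains_viagra: bool, num_links: int) -> str:
        """A tiny rule-based (logical) model for spam detection."""
        if contains_viagra:
            return "spam"
        if num_links > 10:
            return "spam"
        return "ham"

    print(classify_email(contains_viagra=False, num_links=2))  # -> ham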

Geometric models

Geometric models use geometric concepts such as lines, planes, and distances. These models usually operate, or can operate, on high volumes of data. Usually, linear transformations help compare different Machine learning methods:

Probabilistic models

Probabilistic models are statistical models that employ statistical techniques. These models are based on a strategy that defines the relationship between two variables. Because a random background process is involved, this relationship cannot be derived with certainty. In most cases, a subset of the overall data can be considered for processing. For example, the following table gives the conditional probabilities of an e-mail being spam or ham, given whether the words "Viagra" and "lottery" occur in it:

Viagra   Lottery   P(Y = spam | Viagra, lottery)   P(Y = ham | Viagra, lottery)
0        0         0.31                            0.69
0        1         0.65                            0.35
1        0         0.80                            0.20
1        1         0.40                            0.60
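One way to read the table is as a direct lookup model; the following sketch encodes it as a Python dictionary and predicts the more probable class (the probabilities are taken verbatim from the table above):

    # Keys are the (Viagra, lottery) indicator pair; values are
    # P(Y = spam | Viagra, lottery) from the table.
    p_spam = {
        (0, 0): 0.31,
        (0, 1): 0.65,
        (1, 0): 0.80,
        (1, 1): 0.40,
    }

    def classify(viagra: int, lottery: int) -> str:
        return "spam" if p_spam[(viagra, lottery)] > 0.5 else "ham"

    print(classify(1, 0))  # -> spam (P(spam) = 0.80)
    print(classify(1, 1))  # -> ham  (P(spam) = 0.40, so P(ham) = 0.60)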

Data and inconsistencies in Machine learning

This section details all the possible data inconsistencies that may be encountered while implementing Machine learning projects, such as:

  • Under-fitting
  • Over-fitting
  • Data instability
  • Unpredictable data formats

Fortunately, there are some established processes in place today to address these inconsistencies. The following sections cover these inconsistencies.

Under-fitting

A model is said to be under-fitting when it doesn't take into consideration enough information to accurately model the actual data. For example, if only two points on an exponential curve are mapped, this possibly becomes a linear representation, but there could be a case where a pattern does not exist. In cases like these, we will see increasing errors and subsequently an inaccurate model. Also, in cases where the classifier is too rigid or is not complex enough, under-fitting is caused not just due to a lack of data, but can also be a result of incorrect modeling. For example, if the two classes form concentric circles and we try to fit a linear model, assuming they were linearly separable, this could potentially result in under-fitting.

The accuracy of the model is determined by a measure called "power" in the statistical world. If the dataset size is too small, we can never target an optimal solution.

Over-fitting

This case is just the opposite of the under-fitting case explained before. While too small a sample is not appropriate to define an optimal solution, a large dataset also runs the risk of having the model over-fit the data. Over-fitting usually occurs when the statistical model describes noise instead of describing the relationships. Elaborating on the preceding example in this context, let's say we have 500,000 data points. If the model ends up catering to accommodate all 500,000 data points, this becomes over-fitting. This will in effect mean that the model is memorizing the data. This model works well as long as the dataset does not have points outside the curve. A model that is over-fit demonstrates poor performance, as minor fluctuations in data tend to be exaggerated. Another primary reason for over-fitting can be that the criterion used to train the model is different from the criterion used to judge its efficacy. In simple terms, this situation occurs more often when the model memorizes the training data rather than learning from it.

Now, mitigating the problem of under-fitting by giving the model more data can in itself be a risk and end in over-fitting. Considering that more data can mean more complexity and noise, we could potentially end up with a solution model that fits the current data at hand and nothing else, which makes it unusable. In the following graph, with increasing model complexity and errors, the conditions for over-fit and under-fit are pointed out:
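The same trade-off can be reproduced in a few lines of code; this sketch (not from the book) fits polynomials of increasing degree to noisy samples of a sine curve, and the training error keeps falling even as the high-degree fit merely memorizes the noise:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 20)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

    for degree in (1, 3, 15):  # under-fit, reasonable fit, over-fit
        # Least-squares polynomial fit; numpy may warn that the
        # high-degree fit is poorly conditioned.
        coeffs = np.polyfit(x, y, degree)
        train_mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
        print(f"degree={degree:2d}  training MSE={train_mse:.4f}")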

Data instability

Machine learning algorithms are usually robust to noise within the data. A problem will occur if the outliers are due to manual error or misinterpretation of the relevant data. This will result in a skewing of the data, which will ultimately result in an incorrect model.

Therefore, there is a strong need to have a process to correct or handle human errors that can result in building an incorrect model.

Unpredictable data formats

Machine learning is meant to work with new data constantly coming into the system and to learn from that data. Complexity will creep in when the new data entering the system arrives in formats that are not supported by the Machine learning system. It is difficult to say whether our models will work well for the new data, given the instability of the formats in which we receive it, unless a mechanism is built to handle this.

Practical Machine learning examples

In this section, let's explore some real-world machine learning applications. We covered various examples in the introductory section of this chapter; we will now cover some domain-specific examples with a brief description of each problem.

Some of the following examples, spanning both online and offline applications, can easily be guessed. In the chapters to follow, a subset of these examples will be picked to demonstrate practical implementation aspects using suitable Machine learning algorithms.

Each entry below names a problem or problem domain, followed by a brief description.

Spam detection

The problem statement here is to identify which e-mails are "spam". A Machine learning algorithm can categorize an e-mail as spam based on rules it builds from key features of the e-mail data. Once an e-mail is marked as spam, it is moved to the spam folder and the rest are left in the inbox.

Credit card fraud detection

This is one of the recent problems that credit card firms need a solution for. Based on a consumer's credit card usage patterns and purchase behavior, the need is to identify any transaction potentially not made by the customer and mark it as fraudulent so that necessary action can be taken.

Digit recognition

This is a very simple use case that requires the ability to sort mail based on the zip code. It involves interpreting a handwritten numeral accurately and bucketing the mail by zip code for faster processing.

Speech recognition

Automated call centers need this capability: a user's spoken request on the phone is interpreted and mapped to one of the tasks for execution. Once the request can be mapped to a task, its execution can be automated. A model of this problem allows a program to understand a request and attempt to fulfill it. Siri on the iPhone has this capability.

Face detection

This is one of the key features that today's social media websites provide. It gives the ability to tag a person across many digital photographs and, therefore, to group or categorize photographs by person. Some cameras and software such as iPhoto have this capability.

Product recommendation or customer segmentation

This capability is found in almost all of the top online shopping websites today. Given a purchase history for a customer and a large inventory of products, the idea is to identify those products that the customer will most likely be interested in buying, thus motivating more product purchases. There are many online shopping and social websites that support this feature (for example: Amazon, Facebook, Google+, and many others).

There are other related cases, such as predicting whether a trial-version customer will opt for the paid version of the product.

Stock trading

This means predicting stock performance based on current and past stock movements. This task is critical to financial analysts and provides decision support when buying and selling stocks.

Sentiment analysis

Many times, we find that customers make decisions based on opinions shared by others; for example, we buy a product because it has received positive feedback from the majority of its users. Beyond commercial businesses, sentiment analysis is also used by political strategists to gauge public opinion on policy announcements and campaign messages.

Types of learning problems

This section elaborates on the different categories of learning problems; Machine learning algorithms are also classified under these categories. The following figure depicts the various types of learning problems.

Classification

Classification is a technique for grouping a given dataset such that, depending on the value of a target or output attribute, each record can be qualified as belonging to a class. The technique helps identify patterns in data behavior and is, in short, a discrimination mechanism.

For example, a sales manager needs help identifying a prospective customer and wants to determine whether it is worth spending the effort and time the customer demands. The key input for the manager is the customer's data, and this kind of analysis is commonly referred to as Total Lifetime Value (TLV).

We take the data and plot it on a graph (as shown in the following graph), with the x axis representing the total items purchased and the y axis representing the total money spent (in multiples of hundreds of dollars). We then define the criteria to determine, for example, whether a customer is good or bad. In the following graph, all customers who spend more than 800 dollars in a single purchase are categorized as good customers (note that this is a hypothetical example and analysis).

Now when new customer data comes in, the sales manager can plot the new customers on this graph and based on which side they fall, predict whether the customer is likely to be good or bad.

Tip

Note that classification need not always be binary (yes or no, male or female, good or bad, and so on); any number of classes can be defined (poor, below average, average, above average, good) based on the problem definition.
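A minimal sketch of the hypothetical rule above in Python (the customer names and amounts are illustrative): once the 800-dollar boundary is defined, new customers can be classified as they arrive.

```python
# Sketch of the hypothetical good/bad customer rule; the 800-dollar
# threshold comes from the example, everything else is made up.

def classify_customer(total_spent):
    """Label a customer based on a single purchase amount in dollars."""
    return "good" if total_spent > 800 else "bad"

# New customers arriving with their single-purchase totals
new_customers = {"A": 950, "B": 430, "C": 820}
for name, spent in new_customers.items():
    print(name, classify_customer(spent))
```

In practice, the boundary would be learned from labeled historical data rather than hand-coded, but the prediction step works the same way.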

Clustering

In many cases, the data analyst is simply given some data and is expected to unearth interesting patterns that may help derive intelligence. The main difference between this task and classification is that in a classification problem, the business user specifies what he or she is looking for (a good customer or a bad customer, a success or a failure, and so on).

Let's now expand on the same example considered in the classification section. Here, the patterns used to group the customers are identified without any target in mind or any prior classification, and unlike a classification run, the results may not always be the same (for example, they depend on how the initial centroids are picked). An example modeling method for clustering is k-means clustering. More details on k-means clustering are covered in the next section and, in detail, in the following chapters.

In short, clustering is a classification analysis that does not start with a specific target in mind (good/bad, will buy/will not buy).
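As a sketch, assuming scikit-learn and the same hypothetical (items purchased, money spent) data shape as before (neither the library choice nor the numbers come from the text), k-means discovers the groups without being told what to look for:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: (items purchased, money spent in hundreds)
customers = np.array([
    [2, 1.5], [3, 2.0], [2, 1.8],       # low spenders
    [10, 9.0], [12, 11.5], [11, 10.0],  # high spenders
])

# No labels are given; k-means discovers the two groups on its own
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment per customer
print(kmeans.cluster_centers_)  # discovered group centers
```

Fixing random_state pins down the initial centroids, which is why the results here are repeatable even though, in general, k-means runs can differ.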

Forecasting, prediction or regression

Similar to classification, forecasting or prediction is about identifying how things will unfold in the future, derived from past experience or knowledge. In some cases, there is not enough data, and the future has to be defined through regression. Forecasting and prediction results are always presented along with a degree of uncertainty or probability. This class of problems is also called rule extraction.

Let's take an example: an agricultural scientist is working on a new crop that she developed. As a trial, the seed was planted at various altitudes and the yield was computed. The requirement here is to predict the yield of the crop given the altitude (and some more related data points). The relationship between the yield gained and the altitude is determined by plotting a graph of the two parameters. An equation is identified that fits most of the data points, and points that do not fit the curve can be discarded as outliers. This technique is called regression.
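A minimal sketch of this regression, with entirely hypothetical altitude and yield numbers, fits a quadratic curve and uses it to predict the yield at a new altitude:

```python
import numpy as np

# Hypothetical trial data: altitude (meters) vs. crop yield (tons/hectare)
altitude = np.array([100, 400, 800, 1200, 1600, 2000])
crop_yield = np.array([4.1, 4.6, 5.2, 5.0, 4.4, 3.7])

# Fit a quadratic curve; in this made-up data, yield peaks mid-range
coeffs = np.polyfit(altitude, crop_yield, deg=2)

# Predict the yield for an altitude that was never planted
print("predicted yield at 1000 m:", np.polyval(coeffs, 1000))
```

The degree of the fitted polynomial is itself a modeling choice: too low and we under-fit, too high and we over-fit, exactly as discussed earlier.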

Simulation

In addition to all the techniques we have defined so far, there are situations where the data in context itself carries a lot of uncertainty. For example, an outsourcing manager is given a task and can estimate, from experience, that it can be done by an identified team with certain skills in 2-4 hours.

Let's say the cost of input material may vary between $100 and $120, and the number of employees who come to work on any given day may be between 6 and 9. An analyst then estimates how much time the project might take. Solving such problems requires simulating a vast number of alternatives.
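A Monte Carlo sketch of this estimate is shown below. The 2-4 hour, $100-120, and 6-9 employee ranges come from the example; the project size of 100 tasks and the even split of work across employees are assumptions added purely for illustration.

```python
import random
import statistics

random.seed(42)

TASKS = 100  # assumed project size; not given in the text

durations = []
costs = []
for _ in range(10_000):
    # Each uncertain input is sampled fresh for every simulated scenario
    hours_per_task = random.uniform(2, 4)        # time one task takes
    employees = random.randint(6, 9)             # staff on a given day
    material_per_task = random.uniform(100, 120) # dollars of material

    # Assume tasks are split evenly across the available employees
    durations.append(TASKS * hours_per_task / employees)
    costs.append(TASKS * material_per_task)

print("median duration (hours):", round(statistics.median(durations), 1))
print("median material cost ($):", round(statistics.median(costs)))
```

Because thousands of alternatives are sampled, the analyst gets a distribution of outcomes rather than a single point estimate, which is the whole point of simulation.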

Typically in forecasting, classification, and unsupervised learning, we are given data and we really do not know how the data is interconnected. There is no equation to describe one variable as a function of others.

Essentially, data scientists combine one or more of the preceding techniques to solve challenging problems such as:

- Web search and information extraction
- Drug design
- Predicting capital market behavior
- Understanding customer behavior
- Designing robots

Optimization

Optimization, in simple terms, is a mechanism to make something better, or to define the context in which a solution is best.

Consider a production scenario with two machines that produce the desired product. One machine requires more energy to run at high production speed but fewer raw materials, while the other requires more raw materials and less energy to produce the same output in the same time. It is important to understand the patterns in the output based on variations in the inputs; the combination that yields the highest profit is probably the one the production manager wants to know. You, as an analyst, need to identify the best possible way to distribute production between the machines to give the manager the highest profit.

The following image shows the point of highest profit when a graph was plotted for various distribution options between the two machines. Identifying this point is the goal of this technique.

Unlike the case of simulations where there is uncertainty associated with the input data, in optimization we not only have access to data, but also have the information on the dependencies and relationships between data attributes.
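As a sketch of this search (all costs, prices, and the 100-unit total are hypothetical, since the text gives no numbers), a simple grid search over every whole-unit split finds the most profitable distribution between the two machines:

```python
# Grid-search sketch of the two-machine production split.
UNITS = 100   # assumed total units to produce
PRICE = 12.0  # assumed revenue per unit

def profit(units_a):
    """Profit for producing units_a on machine A, the rest on machine B."""
    units_b = UNITS - units_a
    # Machine A: energy cost grows with speed, material cost is low
    cost_a = units_a * (5.0 + 0.02 * units_a)
    # Machine B: material cost is high, energy cost grows slowly
    cost_b = units_b * (7.0 + 0.005 * units_b)
    return UNITS * PRICE - cost_a - cost_b

# Evaluate every whole-unit split and keep the best one
best_split = max(range(UNITS + 1), key=profit)
print("units on machine A:", best_split)
print("maximum profit: $", round(profit(best_split), 2))
```

With these made-up cost curves the optimum lands at an interior point (60 units on machine A), mirroring the peak-of-the-curve picture described above; a real problem would use measured costs and, for larger spaces, a proper optimizer instead of exhaustive search.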

One of the key concepts in Machine learning is a process called induction. The following learning subfields use the induction process to build models. Inductive learning is a reasoning process that uses the results of one experiment to run the next set of experiments and iteratively evolve a model from specific information.

The following figure depicts various subfields of Machine learning. These subfields are one of the ways in which machine learning algorithms are classified.

Supervised learning

Supervised learning is all about operating to a known expectation: what needs to be analyzed from the data is defined up front. The input datasets in this context are also referred to as "labeled" datasets. Algorithms classified under this category focus on establishing a relationship between the input and output attributes, and use this relationship to generate an output for new input data points. The example defined for the classification problem in the preceding section is also an example of supervised learning. Labeled data helps build reliable models but is usually expensive and limited.

When the input and output attributes of the data are known, the key task in supervised learning is mapping the inputs to the outputs. There may be many examples of such mappings available, but the complicated function that links the input and output attributes is not known in advance. A supervised learning algorithm discovers this link and, given a large dataset of input/output pairs, the learned function can predict the output for any new input value.
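A minimal supervised-learning sketch, assuming scikit-learn and a small hypothetical labeled dataset (inputs are items purchased and money spent, the label marks a good customer; none of this comes from the text), learns the input-to-output mapping and applies it to a new point:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled data: (items purchased, money spent in hundreds)
X = [[2, 1.5], [3, 2.0], [2, 1.8], [10, 9.0], [12, 11.5], [11, 10.0]]
y = [0, 0, 0, 1, 1, 1]  # known labels: 0 = bad customer, 1 = good customer

# The algorithm learns the input-to-output mapping from the labeled pairs
model = LogisticRegression().fit(X, y)

# The learned mapping then predicts the label for a brand-new input
print(model.predict([[8, 7.0]]))  # likely classified as a good customer
```

The contrast with the clustering sketch earlier is exactly the supervised/unsupervised divide: here the labels y are supplied up front, whereas k-means had to invent its own groups.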

Unsupervised learning

In some of the learning