Practical Reinforcement Learning - Dr. Engr. S.M. Farrukh Akhtar - E-Book

Description

Master different reinforcement learning techniques and their practical implementation using OpenAI Gym, Python and Java

About This Book

  • Take your machine learning skills to the next level with reinforcement learning techniques
  • Build automated decision-making capabilities in your systems
  • Cover Reinforcement Learning concepts, frameworks, algorithms, and more in detail

Who This Book Is For

Machine learning/AI practitioners, data scientists, data analysts, machine learning engineers, and developers who are looking to expand their existing knowledge to build optimized machine learning models will find this book very useful.

What You Will Learn

  • Understand the basics of reinforcement learning methods, algorithms, and more, and the differences between supervised, unsupervised, and reinforcement learning
  • Master the Markov Decision Process math framework by building an OO-MDP Domain in Java
  • Learn dynamic programming principles and the implementation of Fibonacci computation in Java
  • Understand Python implementation of temporal difference learning
  • Develop Monte Carlo methods and various policies used to build a Monte Carlo simulator using Python
  • Understand Policy Gradient methods and policies applied in the reinforcement domain
  • Instill reinforcement methods in the autonomous platform using a moving car example
  • Apply reinforcement learning algorithms in games with REINFORCEjs

In Detail

Reinforcement learning (RL) is becoming a popular tool for constructing autonomous systems that can improve themselves with experience. We will break the RL framework into its core building blocks, and provide you with details of each element.

This book aims to strengthen your machine learning skills by acquainting you with reinforcement learning algorithms and techniques. This book is divided into three parts. The first part defines Reinforcement Learning and describes its basics. It also covers the basics of Python and Java frameworks, which we are going to use later in the book. The second part discusses learning techniques with basic algorithms such as Temporal Difference, Monte Carlo, and Policy Gradient—all with practical examples. Lastly, in the third part we apply Reinforcement Learning with the most recent and widely used algorithms via practical applications.

By the end of this book, you'll have worked through practical implementations and case studies, and seen the current research activities that can help you advance further with reinforcement learning.

Style and approach

This hands-on book will further expand your machine learning skills by teaching you the different reinforcement learning algorithms and techniques using practical examples.


Page count: 417

Year of publication: 2017




Practical Reinforcement Learning

Develop self-evolving, intelligent agents with OpenAI Gym, Python, and Java

Dr. Engr. S.M. Farrukh Akhtar

BIRMINGHAM - MUMBAI

Practical Reinforcement Learning

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

 

First published: October 2017

Production reference: 1131017

 

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

 

ISBN 978-1-78712-872-9

www.packtpub.com

Credits

Author

Dr. Engr. S.M. Farrukh Akhtar

 

 

Copy Editors

Vikrant Phadkay

Alpha Singh

 

Reviewers

Ruben Oliva Ramos

Juan Tomás Oliva Ramos

Vijayakumar Ramdoss

 

Project Coordinator

Nidhi Joshi

 

 

 

Commissioning Editor

Wilson D'souza

Proofreader

Safis Editing 

Acquisition Editor

Tushar Gupta

Indexer

Tejal Daruwale Soni

Content Development Editor

Mayur Pawanikar

Graphics

Tania Dutta

Technical Editor

Suwarna Patil

Production Coordinator

Aparna Bhagat

About the Author

Dr. Engr. S.M. Farrukh Akhtar is an active researcher and speaker with more than 13 years of industrial experience analyzing, designing, developing, integrating, and managing large applications in different countries and diverse industries. He has worked in Dubai, Pakistan, Germany, Singapore, and Malaysia. He is currently working at Hewlett Packard as an enterprise solution architect.

He received a PhD in artificial intelligence from European Global School, France. He also received two master's degrees: a master's in intelligent systems from the University Technology Malaysia, and an MBA in business strategy from the International University of Georgia. Farrukh completed his BSc in computer engineering at Sir Syed University of Engineering and Technology, Pakistan. He is also an active contributor to and member of the machine learning for data science research group at the University Technology Malaysia. His research and focus areas are mainly big data, deep learning, and reinforcement learning.

He has cross-platform expertise and has achieved recognition for his expertise from IBM, Sun Microsystems, Oracle, and Microsoft. Farrukh received the following accolades:

Sun Certified Java Programmer in 2001

Microsoft Certified Professional and Sun Certified Web Component Developer in 2002

Microsoft Certified Application Developer in 2003

Microsoft Certified Solution Developer in 2004

Oracle Certified Professional in 2005

IBM Certified Solution Developer - XML in 2006

IBM Certified Big Data Architect and Scrum Master Certified - For Agile Software Practitioners in 2017

He also contributes his experience and services as a member of the board of directors in K.K. Abdal Institute of Engineering and Management Sciences, Pakistan, and is a board member of Alam Educational Society.

Skype id: farrukh.akhtar

 

About the Reviewers

Ruben Oliva Ramos is a computer systems engineer with a master's degree in computer and electronic systems engineering, teleinformatics, and networking, with a specialization from the University of Salle Bajio in Leon, Guanajuato, Mexico. He has more than 5 years of experience in developing web applications to control and monitor devices connected with Arduino and Raspberry Pi, and using web frameworks and cloud services to build Internet of Things applications.

He is a mechatronics teacher at the University of Salle Bajio and teaches students of master's in design and engineering of mechatronics systems. Ruben also works at Centro de Bachillerato Tecnologico Industrial 225 in Leon, teaching subjects such as electronics, robotics and control, automation, and microcontrollers.

He is a technician, consultant, and developer of monitoring systems and datalogger data using technologies such as Android, iOS, Windows Phone, HTML5, PHP, CSS, Ajax, JavaScript, Angular, ASP.NET databases (SQlite, MongoDB, web servers, Node.js, IIS), hardware programming (Arduino, Raspberry Pi, Ethernet Shield, GPS, and GSM/GPRS), ESP8266, and control and monitor systems for data acquisition and programming.

He has written a book called Internet of Things Programming with JavaScript, published by Packt.

I would like to thank my savior and lord, Jesus Christ, for giving me strength and courage to pursue this project. Thanks to my dearest wife, Mayte, our two lovely sons, Ruben and Dario, my father, Ruben, my dearest mom, Rosalia, my brother, Juan Tomas, and my sister, Rosalia, whom I love. This is for all their support while reviewing this book, for allowing me to pursue my dreams and tolerating not being with them after my busy day's work.

 

Juan Tomás Oliva Ramos is an environmental engineer from the University of Guanajuato, Mexico, with a master's degree in administrative engineering and quality. He has more than 5 years of experience in management and development of patents, technological innovation projects, and development of technological solutions through the statistical control of processes. He has been a teacher of statistics, entrepreneurship, and technological development of projects since 2011.  He became an entrepreneur mentor and started a new department of technology management and entrepreneurship at Instituto Tecnologico Superior de Purisima del Rincon.

Juan is an Alfaomega reviewer and has worked on the book Wearable designs for Smart watches, Smart TVs and Android mobile devices.

He has developed prototypes through programming and automation technologies for the improvement of operations, which have been registered for patents.

I want to thank God for giving me the wisdom and humility to review this book. I thank Packt for giving me the opportunity to review this amazing book and to collaborate with a group of committed people. I want to thank my beautiful wife, Brenda; our two magic princesses, Regina and Renata; and our next member, Angel Tadeo; all of you give me the strength, happiness, and joy to start a new day. Thanks for being my family.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

 

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787128725.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Reinforcement Learning

Overview of machine learning

What is machine learning?

Speech conversion from one language to another

Suspicious activity detection from CCTVs

Medical diagnostics for detecting diseases

Supervised learning

Unsupervised learning

Reinforcement learning

Introduction to reinforcement learning

Positive reinforcement learning

Negative reinforcement learning

Applications of reinforcement learning

Self-driving cars

Drone autonomous aerial taxi

Aerobatics autonomous helicopter

TD-Gammon – computer game

AlphaGo

The agent environment setup

Exploration versus exploitation

Neural network and reinforcement learning

Reinforcement learning frameworks/toolkits

OpenAI Gym

Getting Started with OpenAI Gym

Docker

Docker installation on Windows environment

Docker installation on a Linux environment

Running an environment

Brown-UMBC Reinforcement Learning and Planning

Walkthrough with Hello GridWorld

Hello GridWorld project

Summary

Markov Decision Process

Introduction to MDP

State

Action

Model

Reward

Policy

MDP - more about rewards

Optimal policy

More about policy

Bellman equation

A practical example of building an MDP domain

GridWorld

Terminal states

Java interfaces for MDP definitions

Single-agent domain

State

Action

Action type

SampleModel

Environment

EnvironmentOutcome

TransitionProb

Defining a GridWorld state

Defining a GridWorld model

Creating the state visualizer

Testing it out

Markov chain

Building an object-oriented MDP domain

Summary

Dynamic Programming

Learning and planning

Evaluating a policy

Value iteration

Value iteration implementation using BURLAP

Output of the value iteration

Policy iteration

Bellman equations

The relationship between Bellman equations

Summary

Temporal Difference Learning

Introducing TD learning

TD lambda

Estimating from data

Learning rate

Properties of learning rate

Overview of TD(1)

An example of TD(1)

Why TD(1) is wrong

Overview of TD(0)

TD lambda rule

K-step estimator

Relationship between k-step estimators and TD lambda

Summary

Monte Carlo Methods

Monte Carlo methods

First visit Monte Carlo

Example – Blackjack

Objective of the game

Card scoring/values

The deal

Naturals

The gameplay

Applying the Monte Carlo approach

Blackjack game implementation

Monte Carlo for control

Monte Carlo Exploring Starts

Example - Blackjack

Summary

Learning and Planning

Q-learning

Q-learning example by hand

Value iteration

Testing the value iteration code

Q-learning code

Testing Q-learning code

Output of the Q-learning program

Summary

Deep Reinforcement Learning

What is a neural network?

A single neuron

Feed-forward neural network

Multi-Layer Perceptron

Deep learning

Deep Q Network

Experience replay

The DQN algorithm

DQN example – PyTorch and Gym

Task

Packages

Replay memory

Q-network

Input extraction

Training

Training loop

Example – Flappy Bird using Keras

Dependencies

qlearn.py

Game screen input

Image preprocessing

Convolution Neural Network

DQN implementation

Complete code

Output

Summary

Game Theory

Introduction to game theory

Example of game theory

Minimax

Fundamental results

Game tree

von Neumann theorem

Mini Poker game

Mixed strategies

OpenAI Gym examples

Agents

Environments

Example 1 – simple random agent

Example 2 – learning agent

Example 3 - keyboard learning agent

Summary

Reinforcement Learning Showdown

Reinforcement learning frameworks

PyBrain

Setup

Ready to code

Environment

Agent

Task

Experiment

RLPy

Setup

Ready to code

Maja Machine Learning Framework

Setup

RL-Glue

Setup

RL-Glue components

Sample project

sample_sarsa_agent.py

sample_mines_environment.py

sample_experiment.py

Mindpark

Setup

Summary

Applications and Case Studies – Reinforcement Learning

Inverse Reinforcement Learning

IRL algorithm

Implementing a car obstacle avoidance problem

Results and observations

Partially Observable Markov Decision Process

POMDP example

State estimator

Value iteration in POMDP

Reinforcement learning for POMDP

Summary

Current Research – Reinforcement Learning

Hierarchical reinforcement learning

Advantages of hierarchical reinforcement learning

The SMDP model

Hierarchical RL model

Reinforcement learning with hierarchies of abstract machines

HAM framework

Running a HAM algorithm

HAM for mobile robot example

HAM for a RoboCup keepaway example

MAXQ value function decomposition

Taxi world example

Decomposition of the projected value function

Summary


Preface

This book is divided into three parts. The first part starts by defining reinforcement learning. It describes the basics and the Python and Java frameworks we are going to use in this book. The second part discusses learning techniques with basic algorithms such as temporal difference, Monte Carlo, and policy gradient, with practical examples. The third part applies reinforcement learning with the most recent and widely used algorithms in practical applications. We end with practical implementations of case studies and current research activities.

What this book covers

Chapter 1, Reinforcement Learning, is about machine learning and the types of machine learning (supervised, unsupervised, and reinforcement learning), with real-life examples. We also discuss positive and negative reinforcement learning. Then we look at the trade-off between exploration and exploitation, which is a very common problem in reinforcement learning. We also see various practical applications of reinforcement learning, such as self-driving cars, autonomous drone taxis, and AlphaGo. Furthermore, we learn about the reinforcement learning frameworks OpenAI Gym and BURLAP, set up the development environment, and write our first program on both frameworks.

Chapter 2, Markov Decision Process, discusses MDP, which defines the reinforcement learning problem, and we discuss the solutions of that problem. We learn all about states, actions, transitions, rewards, and discount. In that context, we also discuss policies and value functions (utilities). Moreover, we cover the practical implementation of MDP and you also learn how to create an object-oriented MDP.

Chapter 3, Dynamic Programming, shows how dynamic programming is used in reinforcement learning, and then we solve the Bellman equation using value iteration and policy iteration. We also implement the value iteration algorithm using BURLAP.
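As a taste of what value iteration looks like, here is a minimal sketch on a toy two-state MDP (the MDP, its states, and its rewards are invented for this preface; the chapter itself works with BURLAP's GridWorld): the Bellman backup is applied repeatedly until the values stop changing.

```python
# Toy deterministic MDP, invented for illustration:
# transitions[state][action] = (next_state, reward)
transitions = {
    0: {"stay": (0, 0.0), "move": (1, 1.0)},
    1: {"stay": (1, 2.0), "move": (0, 0.0)},
}
gamma = 0.9  # discount factor

# Value iteration: V(s) <- max_a [ r(s, a) + gamma * V(s') ]
V = {0: 0.0, 1: 0.0}
for _ in range(200):
    V = {
        s: max(r + gamma * V[s2] for (s2, r) in actions.values())
        for s, actions in transitions.items()
    }

# Read off the greedy policy from the converged values.
policy = {
    s: max(actions, key=lambda a: actions[a][1] + gamma * V[actions[a][0]])
    for s, actions in transitions.items()
}
# policy is {0: "move", 1: "stay"}: always head for the self-rewarding state 1.
```

Here V converges to V(1) = 2 / (1 - 0.9) = 20 and V(0) = 1 + 0.9 * 20 = 19, a handy sanity check for any implementation.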

Chapter 4, Temporal Difference Learning, covers one of the most commonly used approaches for policy evaluation. It is a central part of solving reinforcement learning tasks. For optimal control, policies have to be evaluated. We discuss three ways to think about it: model-based learning, value-based learning, and policy-based learning.

Chapter 5, Monte Carlo Methods, discusses Monte Carlo approaches. The idea behind Monte Carlo is simple: using randomness to solve problems. Monte Carlo methods learn directly from episodes of experience. They are model-free and need no knowledge of MDP transitions and rewards.
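The classic π-estimation trick makes that idea concrete in a few lines (this example illustrates the Monte Carlo principle; it is not code from the chapter):

```python
import random

def estimate_pi(samples=100000):
    """Monte Carlo estimate of pi: the fraction of uniform random points
    in the unit square that fall inside the quarter circle tends to pi/4."""
    inside = sum(
        1
        for _ in range(samples)
        if random.random() ** 2 + random.random() ** 2 <= 1.0
    )
    return 4.0 * inside / samples

print(estimate_pi())  # close to 3.14; varies a little from run to run
```

No model of anything is needed: the estimate comes purely from sampled trials, which is the same property that lets Monte Carlo methods learn from episodes without knowing the MDP.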

Chapter 6, Learning and Planning, explains how to implement your own planning and learning algorithms. We start with Q-learning and later look at value iteration. I highly recommend that you use BURLAP's existing implementations of value iteration and Q-learning, since they support a number of other features (options, learning rate decay schedules, and so on).

Chapter 7, Deep Reinforcement Learning, discusses how deep learning and reinforcement learning work together to create artificial agents that achieve human-level performance across many challenging domains. We start with neural networks, then discuss the single neuron, feed-forward neural networks, and multi-layer perceptrons. Then we look at neural networks combined with reinforcement learning, deep learning, DQN, the DQN algorithm, and an example (PyTorch).

Chapter 8, Game Theory, shows how game theory is related to machine learning and how we apply reinforcement learning in gaming practice. We discuss pure and mixed strategies, the von Neumann theorem, and how to construct the matrix normal form of a game. We also learn the principles of decision making in games with hidden information. We implement some examples on OpenAI Gym's simulated Atari environments, including a simple random agent and learning agents.

Chapter 9, Reinforcement Learning Showdown, looks at other very interesting reinforcement learning frameworks, such as PyBrain, RLPy, and Maja. We also discuss in detail Reinforcement Learning Glue (RL-Glue), which enables us to write reinforcement learning programs in many languages.

Chapter 10, Applications and Case Studies – Reinforcement Learning, covers advanced topics of reinforcement learning. We discuss Inverse Reinforcement Learning and POMDPs.

Chapter 11, Current Research – Reinforcement Learning, describes the current ongoing research areas in reinforcement learning. We discuss hierarchical reinforcement learning, then look into reinforcement learning with hierarchies of abstract machines. Later in the chapter, we learn about MAXQ value function decomposition.

What you need for this book

This book covers all the practical examples in Python and Java. You need to install Python 2.7 or Python 3.6 on your computer. If you are working with Java, you have to install Java 8.

All the other reinforcement-learning-related toolkits or framework installations will be covered in the relevant sections.

Who this book is for

This book is meant for machine learning/AI practitioners, data scientists, engineers who wish to expand their spectrum of skills in AI and learn about developing self-evolving intelligent agents.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We need to initialize our environment with the reset() method."

A block of code is set as follows:

for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())

Any command-line input or output is written as follows:

cd gym

pip install -e .

New terms and important words are shown in bold.

Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:

Log in or register to our website using your email address and password.

Hover the mouse pointer on the

SUPPORT

tab at the top.

Click on

Code Downloads & Errata

.

Enter the name of the book in the

Search

box.

Select the book for which you're looking to download the code files.

Choose from the drop-down menu where you purchased this book from.

Click on

Code Download

.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Reinforcement-Learning. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Reinforcement Learning

In this chapter, we will learn what machine learning is and how reinforcement learning differs from other machine learning techniques, such as supervised learning and unsupervised learning. Furthermore, we will look into reinforcement learning elements such as state, agent, environment, and reward. After that, we will discuss positive and negative reinforcement learning. Then we will explore the latest applications of reinforcement learning. As this book covers both the Java and Python programming languages, the later part of the chapter will cover various frameworks of reinforcement learning. We will see how to set up the development environment and develop some programs using OpenAI Gym and Brown-UMBC Reinforcement Learning and Planning (BURLAP).

Overview of machine learning

In this era of technological advancement, machine learning is not utilized the way it used to be in the past. The purpose of machine learning is to solve problems, such as pattern recognition, or to perform specific tasks, without the computer being explicitly programmed. Researchers are interested in algorithms through which a computer can learn from data. The iterative nature of machine learning is vital: as models are exposed to new data over time, they are able to adjust independently. They learn from past performance to produce more reliable results and decisions. Machine learning is not a new subject, but nowadays it is gaining fresh momentum.

What is machine learning?

Machine learning is a subject based on computer algorithms whose purpose is to learn from data and perform specific tasks. Humans have always been interested in building intelligent computers that can make predictions and perform tasks without supervision. Machine learning comes into action here, producing algorithms that learn from past experience and make decisions to do better in the future.

Arthur Samuel, way back in 1959, said: "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed".

Can a computer learn from experience? The answer is yes, and that is precisely what machine learning is. Here, past experience is called data. We can say that machine learning is a field that gives computers the capability to learn without being explicitly programmed.

For example, a telecom company is very interested in knowing which customers are going to terminate their service. If it can identify or predict those customers, it can offer them special deals to retain them. A machine learning program always learns from past data and improves with time. In simpler words, if a computer program improves on a certain task based on past experience, then we can say that it has learned.

Machine learning is a field that discovers structures of algorithms that enable learning from data. These algorithms build a model that accepts inputs, and based on these inputs, they make predictions or results. We cannot provide all the preconditions in the program; the algorithm is designed in such a way that it learns itself.

Sometimes the terms machine learning and Artificial Intelligence (AI) are used interchangeably. However, machine learning and AI are two distinct areas of computing. Machine learning is solely focused on writing software that can learn from past experience.

Applications of machine learning include sentiment analysis, email spam detection, targeted advertisements (Google AdSense), recommendation engines used by e-commerce sites, and pattern mining for market basket analysis. Some real-life examples of machine learning are covered in the next section.

Speech conversion from one language to another

This Skype feature helps break the language barrier during voice/video calling. It translates a conversation into another language in real time, allowing both sides of speakers to effectively share their views in their native languages.

Suspicious activity detection from CCTVs

This is a wonderful example of how an application of machine learning can make society a safer place. The idea is to have a machine learning algorithm capture and analyze CCTV footage all the time and learn from it the normal activities of people, such as walking, running, and so on. If any suspicious activity occurs, say robbery, it alerts the authorities in real time about the incident.

Medical diagnostics for detecting diseases

Doctors and hospitals are now increasingly being assisted in detecting diseases such as skin cancer faster and more accurately. A system designed by IBM picked cancerous lesions (damage) in some images with 95 percent accuracy, whereas a doctor's accuracy is usually between 75 and 84 percent using manual methods. So, the computing approach can help doctors make more informed decisions by increasing the efficiency of recognizing melanoma and spotting cases that are difficult for a doctor to identify.

Machine learning can be divided into three categories:

Figure 1.1: Types of machine learning

Unsupervised learning

Unsupervised learning is a type of machine learning in which we have only input variables and no output variables. We need to find some relationship or structure in these input variables. Here, the data is unlabeled; that is, no output label is attached to any example.

It is called unsupervised learning because there are no labels and no supervision. The algorithm learns based on the grouping or structure in the data; for example, an algorithm identifying that a picture contains an animal, tree, or chair. The algorithm doesn't have any prior knowledge or training data. It simply works from the raw pixels and groups them based on the data provided.

In unsupervised learning, we group parts of the data based on their similarity to each other. Since the data is unlabeled, there are no column names, but this does not matter because we have no prior knowledge of the data anyway.

Unsupervised learning problems can be further grouped as clustering and association problems:

Clustering: A clustering problem involves discovering a pattern or a natural grouping in the given data. An example is grouping customers by region, or by age.

Association: An association problem is rule-based learning in which you discover rules that describe a large portion of the given data. For example, in an online book shop, the recommendation engine suggests that people who buy book A also buy certain other books.

Some popular examples of unsupervised learning algorithms are:

Apriori algorithm (association problems)

K-means (clustering problems)
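To make the association idea concrete, here is a small, self-contained sketch. The baskets and book names are made up for illustration; the Apriori algorithm generalizes this kind of pairwise counting to larger item sets:

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets from an online book shop
baskets = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "C"},
]

# Count how often each pair of books is bought together
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support: fraction of all baskets containing both A and B
support = pair_counts[("A", "B")] / len(baskets)

# Confidence of the rule "A -> B": of the baskets containing A,
# how many also contain B?
baskets_with_a = sum(1 for b in baskets if "A" in b)
confidence = pair_counts[("A", "B")] / baskets_with_a

print(support, round(confidence, 2))  # 0.5 0.67
```

A rule such as "people who buy book A also buy book B" would be surfaced when both its support and its confidence clear chosen thresholds.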

Now let's take up the same fruit grouping example again from the earlier section. Suppose we have a bag full of fruits and our task is to arrange similar fruits into groups.

In this instance, we have no prior knowledge of the fruits; we have never seen them before. How do we perform this task? The first step is to take a fruit from the bag and examine a physical characteristic, say its color, and then arrange the fruits by color. We end up with the grouping in Table 1.2:

| Color       | Fruit Name          |
|-------------|---------------------|
| Red group   | Cherries and apples |
| Green group | Grapes and bananas  |

Table 1.2: Grouping based on color

Now we will group them based on size and color. See the result in Table 1.3:

| Size and color  | Fruit Name |
|-----------------|------------|
| Big and red     | Apple      |
| Small and red   | Cherry     |
| Big and green   | Banana     |
| Small and green | Grapes     |

Table 1.3: Grouping based on size and color

It's done now! We've successfully grouped them.

This is called unsupervised learning and the approach is called clustering.

Note that in unsupervised learning, we don't have any training data or past example to learn from. In the preceding example, we didn't have any prior knowledge of the fruits.
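The fruit grouping above can also be reproduced algorithmically. The sketch below runs a minimal k-means from scratch on made-up (size, redness) feature values for the four fruits; the numbers are illustrative, not measurements:

```python
import random

# Toy fruit data: (size, redness) features, hypothetical values on a 0-1 scale
fruits = {
    "apple":  (0.8, 0.9),
    "cherry": (0.1, 0.9),
    "banana": (0.9, 0.1),
    "grape":  (0.15, 0.1),
}

def kmeans(points, k, iters=20):
    """A minimal k-means: alternate cluster assignment and centroid update."""
    random.seed(0)
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared distance)
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its assigned points
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans(list(fruits.values()), k=2)
```

With k=2 the algorithm finds a two-group split of the fruits (which of the two natural splits it converges to depends on the random initialization); raising k toward 4 separates the fruits further, as in Table 1.3.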

Reinforcement learning

Reinforcement learning is a type of machine learning in which an agent chooses actions within a specific environment in order to maximize a reward. A defining characteristic of reinforcement learning is that the agent only receives a reward after performing an action, and thus must keep interacting with the environment, discovering the optimal policy through trial and error.

Let's take an example: how did you learn to ride a cycle? Was somebody telling you exactly how to ride, and you just followed those instructions? What kind of learning is that? Some people mistake it for supervised learning, reasoning along the lines of "my uncle held me while I started cycling, he was telling me what to do", and so on. But at best, what anyone could tell you is "Watch out, don't fall off the cycle", "Be careful", or some other general instruction. That does not count as supervision. Supervised learning would mean that someone gives you the exact steps, such as "Push down your left foot with 5 pounds of pressure" or "Move your handlebar to 80 degrees".

Someone would have to give you the exact control signals for riding a cycle to count as supervised learning, and you may rightly think that if someone gave instructions of this kind, a child would never learn to cycle. So people then say it must be unsupervised learning, the justification now being that no one tells you how to ride a cycle. But let's analyze this. If it were truly unsupervised learning, a kid would watch hundreds of videos of other cyclists, figure out the patterns, and then get on a cycle and reproduce them. That is essentially unsupervised learning: you have lots of data, you find the patterns, and you try to execute them. Cycling does not work that way! You have to get on the cycle and try it yourself. Learning to cycle is neither supervised nor unsupervised learning. It's a different paradigm, reinforcement learning, in which you learn by trial and error.

During this learning process, the feedback signals that tell us how well we do are either pain... Ouch! I fell! That hurts! I will avoid doing what led to this next time!, or reward... Wow! I am riding the bike! This feels great! I just need to keep doing what I am doing right now!

Introduction to reinforcement learning

Have you seen a baby learn to walk? The baby rarely succeeds the first time: it stands up, tries to walk a few steps, falls down, and stands up again. Over time, the baby learns to walk. No one really teaches it how to walk; the learning happens by trial and error. There are many situations in which humans, and biological systems in general, do not get detailed instructions on how to perform a task. They nevertheless undertake such tasks, evaluate the outcomes, and try to improve their behavior based on those evaluations. Reinforcement learning is a mathematical framework that captures this kind of trial-and-error learning. The goal is to learn about a system through interaction with it.

Reinforcement learning is inspired by behavioral psychology. In the 1890s, a Russian physiologist named Ivan Pavlov was experimenting on salivation in dogs when they are being fed. He noticed that whenever he rang a bell, his dog began to salivate; even when he brought no food and just rang the bell, the dog started to salivate. Pavlov started from the theory that there are some responses a dog does not need to learn: dogs do not learn to salivate when they see food. This reaction is hard-wired into the dog. In behavioral science terms, it is an unconditioned response (a stimulus-response connection that requires no learning). In behavioral psychology terms, we write: Unconditioned Stimulus > Unconditioned Response.

The dog actually forms an association between the bell and the food. Later on, when you ring the bell without serving food, the dog starts salivating in anticipation of the food it expects to be delivered. Essentially, the food is the pay-off, a reward, and the dog forms an association between the signal (ringing the bell) and the reward it is going to get.

After this experiment, many more and increasingly complex experiments were run on animals, and people came up with many theories. A lot of reinforcement learning terminology is taken from the behavioral psychology literature.

There are a few terms in reinforcement learning that will be used several times in this book: agent, environment, states, actions, and rewards. Let me explain these terms briefly here; later, we will go into the details of each in Chapter 2, Markov Decision Process.

An agent in reinforcement learning is the entity that takes actions. For example, a plane moving left and right in a video game is an agent, and the character moving left, right, up, and down in the Pac-Man game is an agent.

A state is the place or situation in which the agent finds itself, for example, the plane's current position in the video game, or the current layout of the Pac-Man board.

An action, as the name implies, is something the agent can do in a given situation. In the Pac-Man game, the character can move left, right, up, or down; in our plane example, the plane can go left or right.

A reward is a feedback signal, and it can be positive or negative. For example, in the Pac-Man game, if the agent moves left and avoids the enemy, it gets a positive reward. In the same way, if our plane dodges a bomb by going right or left, it gets a reward.

Figure 1.3: Reinforcement learning elements
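To make these terms concrete, here is a minimal, hypothetical sketch of the agent-environment loop: a toy corridor environment (not from any library) in which a randomly acting agent observes states, takes actions, and collects rewards:

```python
import random

class CorridorEnv:
    """A toy environment: the agent walks a 5-cell corridor; reaching the
    rightmost cell yields reward +1, while every other step costs -0.1."""
    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (move left) or +1 (move right), clipped to the corridor
        self.state = max(0, min(self.length - 1, self.state + action))
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1
        return self.state, reward, done

# The agent-environment loop: observe a state, act, receive a reward
random.seed(0)
env = CorridorEnv()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random.choice([-1, 1])   # a random policy, for illustration
    state, reward, done = env.step(action)
    total_reward += reward
```

Replacing the random choice with a learned policy, one that prefers actions with higher expected reward, is exactly what the algorithms in later chapters do.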

Reinforcement learning is all about learning from the environment and learning to be more accurate with time. There are two types of reinforcement learning, called positive reinforcement learning and negative reinforcement learning. We will discuss both the approaches in the next section.

Positive reinforcement learning

Positive reinforcement learning means getting a positive reward: something desirable is given when you take an action. Let me give you an example to illustrate the concept. Say that after studying hard, you secure the first position in your class. Now that you know your good action results in a positive reward, you will try to continue such good actions. This is called positive reinforcement learning.

Another example, continued from the previous section on riding a bicycle, is of someone clapping for you when you are finally able to ride a cycle. It's a positive feedback and is called positive reinforcement learning.

Negative reinforcement learning

On the other hand, negative reinforcement learning involves negative rewards: something undesirable happens when you take an action. For example, you go to a cinema to watch a movie and feel very cold, making it uncomfortable to continue watching. The next time you go to the same theater, you feel the same cold again; it is surely uncomfortable to watch a movie in that environment.

The third time you visit the theater you wear a jacket. With this action, the negative element is removed.

Again, taking up the same ride-a-cycle example here, you fall down and get hurt. That's a negative feedback and it's called negative reinforcement learning.

The goal here is to learn by interacting with the system; it's not something that is completely offline. You have some level of interaction with the system and you learn about the system through the interaction.

Now consider another example, from the game of chess. You sit with an opponent and make a sequence of moves; at the end of the game, you either win or lose. If you win, you get a reward: someone pays you $80. If you lose, you pay the opponent $80. That is all the feedback you get; you either win $80 or lose $80 at the end of the game. No one tells you that, given this position, this is the move you have to take. That is why I said reinforcement learning is learning from reward and punishment in the absence of detailed supervision.

A dog can be trained to keep a room tidy by giving it tastier food when it behaves well and reducing the amount of its favorite food when it dirties the room.

The dog can be considered the agent and the room the environment. You are the source of the reward signal (tasty food). Although the feedback given to the dog is vague, eventually its neural circuitry will figure out that there is a relation between good food and good behavior.

The dog will likely behave well and stop messing up the room in order to maximize its goal of eating more tasty food. Thus we have seen reinforcement learning in a non-computer setting.

This reinforces the idea that reinforcement learning can be a powerful tool for AI applications. Self-driving cars are a good example of AI applications.

With reinforcement learning, we aim to mimic biological entities.

Another example is a robot as an agent whose goal is to move around a room without bumping into obstacles. A negative score (punishment) for bumping into an obstacle and a positive score (reward) for avoiding one define the agent's final score. The reward can be maximized by moving around the room while avoiding obstacles; here, we can say that the goal is to maximize the score.

An agent can maximize the reward by acting appropriately on the environment and performing optimum actions. Reinforcement learning is thus also used in self-adapting systems, such as AlphaGo.
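Learning to act from reward alone can be shown in a few lines. The following sketch is a hypothetical two-armed bandit (a slot machine with two levers whose pay-off probabilities are hidden from the agent); the agent follows an epsilon-greedy rule, mostly exploiting its current value estimates but occasionally exploring at random:

```python
import random

random.seed(1)

# Two slot-machine arms with hidden win probabilities (unknown to the agent)
true_probs = [0.3, 0.7]
estimates = [0.0, 0.0]   # the agent's running estimate of each arm's value
counts = [0, 0]
epsilon = 0.1            # fraction of the time the agent explores at random

for _ in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(2)                        # explore
    else:
        arm = max(range(2), key=lambda a: estimates[a])  # exploit
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    # Incremental mean update: estimate += (reward - estimate) / n
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)  # the estimate for arm 1 should approach 0.7
```

The epsilon parameter encodes the exploration-exploitation trade-off: without occasional random tries, the agent could lock onto the inferior arm and never discover the better one.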

Applications of reinforcement learning

Reinforcement learning is used in a wide variety of applications.

Figure 1.4: The many faces of reinforcement learning

Self-driving cars

Self-driving cars are not science fiction anymore. Companies such as Toyota and Ford have invested millions of dollars in R&D for this technology. Taxi services such as Uber and Lyft, which currently pay human drivers, may soon deploy entire fleets of self-driving cars. In the next two to three years, hundreds of thousands of self-driving cars may be sold to regular consumers.

Google is also taking a lead in this. The Google self-driving car project is called Waymo, which stands for "a new way forward in mobility". Waymo is now an autonomous car development company in its own right.

Thirty-three corporations are working on autonomous vehicles, and over $450 million has been invested across 36 deals to date; auto tech start-ups are on track for yearly highs in both deals and dollars.

Many influential figures from the automobile and technology industries predict that this will happen. The big question is: when? Timing is key here; many of the relevant companies are planning to launch autonomous cars by around 2020. Refer to the following predictions by motor companies:

| Motor company        | Launch prediction |
|----------------------|-------------------|
| Audi                 | 2020              |
| NuTonomy (Singapore) | 2018              |
| Delphi and Mobileye  | 2019              |
| Ford                 | 2021              |
| Volkswagen           | 2019              |
| General Motors       | 2020              |
| BMW                  | 2021              |
| Baidu                | 2019              |
| Toyota               | 2020              |
| Elon Musk            | 2019              |
| Jaguar               | 2024              |
| Nissan               | 2020              |
| Google               | 2018              |

Autonomous cars are a core, long-term strategy for these companies. IEEE predicts that 75 percent of vehicles will be fully autonomous by 2040.

Planning for a self-driving car is done via reinforcement learning: during training, the car learns through trial and error to continuously correct its driving over time.

We will learn how to create a self-driving car using a simulator in upcoming chapters.

Drone autonomous aerial taxi

While people are still debating about the safety of self-driving cars, the United Arab Emirates is actually preparing to launch an autonomous aerial taxi or drone taxi.

It is one of the finest examples of applying reinforcement learning.

The Roads and Transport Authority (RTA) of Dubai employs the Chinese firm EHang's 184, the world's first passenger drone. It is capable of a range of about 30 miles on a single charge and can carry one person weighing up to 220 lbs, plus a small suitcase. The entire flight is managed by a command center; all you need to do is hop in and choose your landing point from a list of predetermined destinations.

Riders use a smartphone app to book a flight and be picked up from a designated zone. The drone taxi arrives at the designated place; the rider gets in, takes a seat, selects the pre-programmed destination using a touchscreen, and then just sits back and enjoys the flight. All flights are monitored remotely from the control room for passenger safety.

This autonomous drone taxi can carry a payload of about 100 kg (220 lbs), and it uses eight motors to fly at speeds of up to 70 kilometers per hour.

Aerobatics autonomous helicopter

Computer scientists at Stanford have successfully created an AI system that enables robotic helicopters to learn and perform difficult stunts by watching other helicopters perform the same maneuvers. The result is an autonomous helicopter that can perform a complete airshow of tricks on its own.

Autonomous helicopter flight is widely regarded to be a highly challenging control problem. Despite this fact, human experts can reliably fly helicopters through a wide range of maneuvers, including aerobatic maneuvers.

How does it work? The system treats flight as an optimal control problem and applies reinforcement learning, optimizing over a learned model and a reward function.

We will look into all these reinforcement learning algorithms practically in the upcoming chapters.

TD-Gammon – computer game

TD-Gammon is a well-known computer backgammon program developed in 1992. It is a neural network that teaches itself to play backgammon by playing against itself and learning from the results, and it is a classic example of a reinforcement learning algorithm. Starting from random initial weights (and hence a random initial strategy), TD-Gammon eventually develops a strong level of play. Given only a raw description of the board state, with zero backgammon knowledge built in, the system teaches itself to play at an intermediate level; with additional hand-crafted features, it performs stunningly well.

Figure 1.5: A backgammon game

The current version of TD-Gammon plays very close to the level of the best human players of all time. It explored strategies that humans had never used, and some of these have since been adopted by expert backgammon players.
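The idea at the heart of TD-Gammon, nudging the value of the current state toward the reward plus the value of the next state, can be sketched on a toy problem. Below is a hypothetical TD(0) learner on a five-state random walk (states 0 and 4 are terminal; only reaching state 4 pays a reward):

```python
import random

random.seed(0)

# TD(0) value learning on a 5-state random walk, starting each episode at
# state 2. Reaching state 4 pays reward 1; reaching state 0 pays nothing.
values = [0.5] * 5      # initial value guesses for each state
alpha = 0.1             # learning rate

for _ in range(1000):
    state = 2
    while state not in (0, 4):
        next_state = state + random.choice([-1, 1])
        reward = 1.0 if next_state == 4 else 0.0
        # Terminal states have value 0; otherwise bootstrap from the next state
        target = reward + (0.0 if next_state in (0, 4) else values[next_state])
        values[state] += alpha * (target - values[state])
        state = next_state

print([round(v, 2) for v in values[1:4]])  # should approach [0.25, 0.5, 0.75]
```

TD-Gammon applies this same update, with a neural network in place of the value table and self-play backgammon games in place of the random walk.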

AlphaGo

The game of Go originated in China more than 3,000 years ago. The rules of the game are simple. Players take turns to place white or black stones on a board, trying to capture the opponent's stones or surround empty space to make points out of territory. As simple as the rules are, Go is a game of profound complexity. There are more possible positions in Go than there are atoms in the universe. That makes Go more complex than chess.

The game of Go is a classic and very challenging game. For decades, computer scientists struggled to make computers play Go at even a beginner's human level. Now, with advances in deep reinforcement learning, the computer learns a policy network (which selects actions) and a value network (which predicts the winner) through self-play.

AlphaGo combines a state-of-the-art tree search with deep neural network techniques. In October 2015, it became the first program to beat a professional human player. Later, AlphaGo also defeated Lee Sedol, one of the strongest players in the world with 18 world titles. The final score was 4 to 1, and the match was watched by 200 million viewers.

The agent environment setup

Reinforcement learning is learning from interaction with an environment. The learner is called the Agent, and everything outside the Agent is called the Environment. The Agent