Master different reinforcement learning techniques and their practical implementation using OpenAI Gym, Python and Java
Machine learning/AI practitioners, data scientists, data analysts, machine learning engineers, and developers looking to expand their existing knowledge to build optimized machine learning models will find this book very useful.
Reinforcement learning (RL) is becoming a popular tool for constructing autonomous systems that can improve themselves with experience. We will break the RL framework into its core building blocks, and provide you with details of each element.
This book aims to strengthen your machine learning skills by acquainting you with reinforcement learning algorithms and techniques. It is divided into three parts. The first part defines reinforcement learning and describes its basics, along with the Python and Java frameworks we are going to use later in the book. The second part discusses learning techniques with basic algorithms such as temporal difference, Monte Carlo, and policy gradient, all with practical examples. Lastly, in the third part we apply reinforcement learning with the most recent and widely used algorithms via practical applications.
By the end of this book, you'll know the practical implementation of case studies and current research activities to help you advance further with Reinforcement Learning.
This hands-on book will further expand your machine learning skills by teaching you the different reinforcement learning algorithms and techniques using practical examples.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2017
Production reference: 1131017
ISBN 978-1-78712-872-9
www.packtpub.com
Author
Dr. Engr. S.M. Farrukh Akhtar
Copy Editors
Vikrant Phadkay
Alpha Singh
Reviewers
Ruben Oliva Ramos
Juan Tomás Oliva Ramos
Vijayakumar Ramdoss
Project Coordinator
Nidhi Joshi
Commissioning Editor
Wilson D'souza
Proofreader
Safis Editing
Acquisition Editor
Tushar Gupta
Indexer
Tejal Daruwale Soni
Content Development Editor
Mayur Pawanikar
Graphics
Tania Dutta
Technical Editor
Suwarna Patil
Production Coordinator
Aparna Bhagat
Dr. Engr. S.M. Farrukh Akhtar is an active researcher and speaker with more than 13 years of industrial experience analyzing, designing, developing, integrating, and managing large applications in different countries and diverse industries. He has worked in Dubai, Pakistan, Germany, Singapore, and Malaysia. He is currently working at Hewlett Packard as an enterprise solution architect.
He received a PhD in artificial intelligence from European Global School, France. He also received two master's degrees: a master's in intelligent systems from the University of Technology Malaysia, and an MBA in business strategy from the International University of Georgia. Farrukh completed his BSc in computer engineering at Sir Syed University of Engineering and Technology, Pakistan. He is also an active contributor to, and member of, the machine learning for data science research group at the University of Technology Malaysia. His research and focus areas are mainly big data, deep learning, and reinforcement learning.
He has cross-platform expertise and has achieved recognition from IBM, Sun Microsystems, Oracle, and Microsoft. Farrukh received the following accolades:
Sun Certified Java Programmer in 2001
Microsoft Certified Professional and Sun Certified Web Component Developer in 2002
Microsoft Certified Application Developer in 2003
Microsoft Certified Solution Developer in 2004
Oracle Certified Professional in 2005
IBM Certified Solution Developer - XML in 2006
IBM Certified Big Data Architect and Scrum Master Certified - For Agile Software Practitioners in 2017
He also contributes his experience and services as a member of the board of directors in K.K. Abdal Institute of Engineering and Management Sciences, Pakistan, and is a board member of Alam Educational Society.
Skype id: farrukh.akhtar
Ruben Oliva Ramos is a computer systems engineer with a master's degree in computer and electronic systems engineering, teleinformatics, and networking, with a specialization from the University of Salle Bajio in Leon, Guanajuato, Mexico. He has more than 5 years of experience in developing web applications to control and monitor devices connected with Arduino and Raspberry Pi, and using web frameworks and cloud services to build Internet of Things applications.
He is a mechatronics teacher at the University of Salle Bajio and teaches students in the master's program in design and engineering of mechatronics systems. Ruben also works at Centro de Bachillerato Tecnologico Industrial 225 in Leon, teaching subjects such as electronics, robotics and control, automation, and microcontrollers.
He is a technician, consultant, and developer of monitoring and data-logging systems, using technologies such as Android, iOS, Windows Phone, HTML5, PHP, CSS, Ajax, JavaScript, and Angular; databases such as SQLite and MongoDB; web servers such as Node.js and IIS (ASP.NET); and hardware programming with Arduino, Raspberry Pi, Ethernet Shield, GPS, GSM/GPRS, and ESP8266, building control and monitoring systems for data acquisition.
He has written a book called Internet of Things Programming with JavaScript, published by Packt.
Juan Tomás Oliva Ramos is an environmental engineer from the University of Guanajuato, Mexico, with a master's degree in administrative engineering and quality. He has more than 5 years of experience in management and development of patents, technological innovation projects, and development of technological solutions through the statistical control of processes. He has been a teacher of statistics, entrepreneurship, and technological development of projects since 2011. He became an entrepreneur mentor and started a new department of technology management and entrepreneurship at Instituto Tecnologico Superior de Purisima del Rincon.
Juan is an Alfaomega reviewer and has worked on the book Wearable designs for Smart watches, Smart TVs and Android mobile devices.
He has developed prototypes through programming and automation technologies for the improvement of operations, which have been registered for patents.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787128725.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
Reinforcement Learning
Overview of machine learning
What is machine learning?
Speech conversion from one language to another
Suspicious activity detection from CCTVs
Medical diagnostics for detecting diseases
Supervised learning
Unsupervised learning
Reinforcement learning
Introduction to reinforcement learning
Positive reinforcement learning
Negative reinforcement learning
Applications of reinforcement learning
Self-driving cars
Drone autonomous aerial taxi
Aerobatics autonomous helicopter
TD-Gammon – computer game
AlphaGo
The agent environment setup
Exploration versus exploitation
Neural network and reinforcement learning
Reinforcement learning frameworks/toolkits
OpenAI Gym
Getting Started with OpenAI Gym
Docker
Docker installation on Windows environment
Docker installation on a Linux environment
Running an environment
Brown-UMBC Reinforcement Learning and Planning
Walkthrough with Hello GridWorld
Hello GridWorld project
Summary
Markov Decision Process
Introduction to MDP
State
Action
Model
Reward
Policy
MDP - more about rewards
Optimal policy
More about policy
Bellman equation
A practical example of building an MDP domain
GridWorld
Terminal states
Java interfaces for MDP definitions
Single-agent domain
State
Action
Action type
SampleModel
Environment
EnvironmentOutcome
TransitionProb
Defining a GridWorld state
Defining a GridWorld model
Creating the state visualizer
Testing it out
Markov chain
Building an object-oriented MDP domain
Summary
Dynamic Programming
Learning and planning
Evaluating a policy
Value iteration
Value iteration implementation using BURLAP
Output of the value iteration
Policy iteration
Bellman equations
The relationship between Bellman equations
Summary
Temporal Difference Learning
Introducing TD learning
TD lambda
Estimating from data
Learning rate
Properties of learning rate
Overview of TD(1)
An example of TD(1)
Why TD(1) is wrong
Overview of TD(0)
TD lambda rule
K-step estimator
Relationship between k-step estimators and TD lambda
Summary
Monte Carlo Methods
Monte Carlo methods
First visit Monte Carlo
Example – Blackjack
Objective of the game
Card scoring/values
The deal
Naturals
The gameplay
Applying the Monte Carlo approach
Blackjack game implementation
Monte Carlo for control
Monte Carlo Exploring Starts
Example - Blackjack
Summary
Learning and Planning
Q-learning
Q-learning example by hand
Value iteration
Testing the value iteration code
Q-learning code
Testing Q-learning code
Output of the Q-learning program
Summary
Deep Reinforcement Learning
What is a neural network?
A single neuron
Feed-forward neural network
Multi-Layer Perceptron
Deep learning
Deep Q Network
Experience replay
The DQN algorithm
DQN example – PyTorch and Gym
Task
Packages
Replay memory
Q-network
Input extraction
Training
Training loop
Example – Flappy Bird using Keras
Dependencies
qlearn.py
Game screen input
Image preprocessing
Convolution Neural Network
DQN implementation
Complete code
Output
Summary
Game Theory
Introduction to game theory
Example of game theory
Minimax
Fundamental results
Game tree
von Neumann theorem
Mini Poker game
Mixed strategies
OpenAI Gym examples
Agents
Environments
Example 1 – simple random agent
Example 2 – learning agent
Example 3 - keyboard learning agent
Summary
Reinforcement Learning Showdown
Reinforcement learning frameworks
PyBrain
Setup
Ready to code
Environment
Agent
Task
Experiment
RLPy
Setup
Ready to code
Maja Machine Learning Framework
Setup
RL-Glue
Setup
RL-Glue components
Sample project
sample_sarsa_agent.py
sample_mines_environment.py
sample_experiment.py
Mindpark
Setup
Summary
Applications and Case Studies – Reinforcement Learning
Inverse Reinforcement Learning
IRL algorithm
Implementing a car obstacle avoidance problem
Results and observations
Partially Observable Markov Decision Process
POMDP example
State estimator
Value iteration in POMDP
Reinforcement learning for POMDP
Summary
Current Research – Reinforcement Learning
Hierarchical reinforcement learning
Advantages of hierarchical reinforcement learning
The SMDP model
Hierarchical RL model
Reinforcement learning with hierarchies of abstract machines
HAM framework
Running a HAM algorithm
HAM for mobile robot example
HAM for a RoboCup keepaway example
MAXQ value function decomposition
Taxi world example
Decomposition of the projected value function
Summary
This book is divided into three parts. The first part starts by defining reinforcement learning. It describes the basics and the Python and Java frameworks we are going to use in this book. The second part discusses learning techniques with basic algorithms such as temporal difference, Monte Carlo, and policy gradient, with practical examples. The third part applies reinforcement learning with the most recent and widely used algorithms through practical applications. We end with practical implementations of case studies and current research activities.
Chapter 1, Reinforcement Learning, is about machine learning and the types of machine learning (supervised, unsupervised, and reinforcement learning), with real-life examples. We also discuss positive and negative reinforcement learning. Then we see the trade-off between exploration and exploitation, which is a very common problem in reinforcement learning. We also see various practical applications of reinforcement learning, such as self-driving cars, autonomous drone taxis, and AlphaGo. Furthermore, we learn about the reinforcement learning frameworks OpenAI Gym and BURLAP, set up the development environment, and write our first program on both frameworks.
Chapter 2, Markov Decision Process, discusses MDP, which defines the reinforcement learning problem, and we discuss the solutions of that problem. We learn all about states, actions, transitions, rewards, and discount. In that context, we also discuss policies and value functions (utilities). Moreover, we cover the practical implementation of MDP and you also learn how to create an object-oriented MDP.
Chapter 3, Dynamic Programming, shows how dynamic programming is used in reinforcement learning, and then we solve the Bellman equation using value iteration and policy iteration. We also implement the value iteration algorithm using BURLAP.
Chapter 4, Temporal Difference Learning, covers one of the most commonly used approaches for policy evaluation, a central part of solving reinforcement learning tasks. For optimal control, policies have to be evaluated. We discuss three ways to think about this: model-based learning, value-based learning, and policy-based learning.
Chapter 5, Monte Carlo Methods, discusses Monte Carlo approaches. The idea behind Monte Carlo is simple: use randomness to solve problems. Monte Carlo methods learn directly from episodes of experience. They are model-free and need no knowledge of MDP transitions and rewards.
Chapter 6, Learning and Planning, explains how to implement your own planning and learning algorithms. We start with Q-learning and later look at value iteration. Here, I highly recommend that you use BURLAP's existing implementations of value iteration and Q-learning, since they support a number of other features (options, learning rate decay schedules, and so on).
Chapter 7, Deep Reinforcement Learning, discusses how deep learning and reinforcement learning combine to create artificial agents that achieve human-level performance across many challenging domains. We start with neural networks, then discuss the single neuron, feed-forward neural networks, and multi-layer perceptrons. Then we look at neural networks with reinforcement learning, deep learning, DQN, the DQN algorithm, and an example (PyTorch).
Chapter 8, Game Theory, shows how game theory is related to machine learning and how we apply reinforcement learning in gaming practice. We discuss pure and mixed strategies, the von Neumann theorem, and how to construct the matrix normal form of a game. We also learn the principles of decision making in games with hidden information. We implement some examples on OpenAI Gym's simulated Atari environments, including a simple random agent and learning agents.
Chapter 9, Reinforcement Learning Showdown, looks at other very interesting reinforcement learning frameworks, such as PyBrain, RLPy, and Maja. We also discuss in detail Reinforcement Learning Glue (RL-Glue), which enables us to write reinforcement learning programs in many languages.
Chapter 10, Applications and Case Studies – Reinforcement Learning, covers advanced topics of reinforcement learning. We discuss inverse reinforcement learning and POMDPs.
Chapter 11, Current Research – Reinforcement Learning, describes current ongoing research areas in reinforcement learning. We discuss hierarchical reinforcement learning, then look into reinforcement learning with hierarchies of abstract machines. Later in the chapter, we learn about MAXQ value function decomposition.
This book covers all the practical examples in Python and Java. You need to install Python 2.7 or Python 3.6 on your computer. If you are working in Java, you have to install Java 8.
All the other reinforcement-learning-related toolkits or framework installations will be covered in the relevant sections.
This book is meant for machine learning/AI practitioners, data scientists, and engineers who wish to expand their spectrum of skills in AI and learn about developing self-evolving intelligent agents.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We need to initialize our environment with the reset() method."
A block of code is set as follows:
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())
Any command-line input or output is written as follows:
cd gym
pip install -e .
New terms and important words are shown in bold.
Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter."
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:
Log in or register to our website using your email address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Reinforcement-Learning. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to the list of existing errata under the Errata section of that title. To view previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
In this chapter, we will learn what machine learning is and how reinforcement learning is different from other machine learning techniques, such as supervised learning and unsupervised learning. Furthermore, we will look into reinforcement learning elements such as state, agent, environment, and reward. After that, we will discuss positive and negative reinforcement learning. Then we will explore the latest applications of reinforcement learning. As this book covers both Java and Python programming languages, the later part of the chapter will cover various frameworks of reinforcement learning. We will see how to set up the development environment and develop some programs using OpenAI Gym and Brown-UMBC Reinforcement Learning and Planning (BURLAP).
In this era of technological advancement, machine learning is not utilized the way it used to be in the past. The purpose of machine learning is to solve problems, such as pattern recognition, or to perform specific tasks, that a computer can learn without being explicitly programmed. Researchers are interested in algorithms with which a computer can learn from data. The iterative nature of machine learning is vital: as models receive new data over time, they are able to adjust independently. They learn from past performance to produce more reliable results and decisions. Machine learning is not a new subject, but nowadays it is gaining fresh momentum.
Machine learning is a subject based on computer algorithms whose purpose is to learn and perform specific tasks. Humans have always been interested in making intelligent computers that can help them make predictions and perform tasks without supervision. Machine learning comes into action here, producing algorithms that learn from past experience and make decisions to do better in the future.
Arthur Samuel, way back in 1959, said: "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed".
Can a computer learn from experience? The answer is yes, and that is precisely what machine learning is. Here, past experiences are called data. We can say that machine learning is actually a field that gives computers the capability to learn without being explicitly programmed.
For example, a telecom company is very much interested in knowing which customers are going to terminate their service. If it can identify or predict those customers, it can offer them special deals to retain them. A machine learning program always learns from past data and improves with time. In simpler words, if a computer program improves on a certain task based on past experience, then we can say that it has learned.
Machine learning is a field that develops algorithms that enable learning from data. These algorithms build a model that accepts inputs and, based on these inputs, makes predictions or produces results. We cannot provide all the preconditions in the program; the algorithm is designed in such a way that it learns by itself.
Sometimes the words machine learning and Artificial Intelligence (AI) are used interchangeably. However, machine learning and AI are two distinct areas of computing. Machine learning is solely focused on writing software that can learn from past experiences.
Applications of machine learning include sentiment analysis, email spam detection, targeted advertisements (Google AdSense), recommendation engines used by e-commerce sites, and pattern mining for market basket analysis. Some real-life examples of machine learning are covered in the next section.
This Skype feature helps break the language barrier during voice/video calling. It translates a conversation into another language in real time, allowing speakers on both sides to effectively share their views in their native languages.
This is a wonderful example of how an application of machine learning can make society a safer place. The idea is to have a machine learning algorithm capture and analyze CCTV footage all the time and learn from it the normal activities of people, such as walking, running, and so on. If any suspicious activity occurs, say robbery, it alerts the authorities in real time about the incident.
Doctors and hospitals are now increasingly being assisted in detecting diseases such as skin cancer faster and more accurately. A system designed by IBM picked cancerous lesions (damage) in some images with 95 percent accuracy, whereas a doctor's accuracy is usually between 75 and 84 percent using manual methods. So, the computing approach can help doctors make more informed decisions by increasing the efficiency of recognizing melanoma and spotting cases that are difficult for a doctor to identify.
Machine learning can be divided into three categories: supervised learning, unsupervised learning, and reinforcement learning.
Unsupervised learning is a type of machine learning in which we have only input variables and no output variables. We need to find some relationship or structure in these input variables. Here, the data is unlabeled; that is, there is no specific meaning for any column.
It is called unsupervised learning because there is no training and no supervision. The algorithm learns based on the grouping or structure in the data; for example, consider an algorithm given a picture containing an animal, a tree, or a chair. The algorithm doesn't have any prior knowledge or training data; it just converts the picture into pixels and groups them based on the data provided.
Unsupervised learning problems can be further grouped as clustering and association problems:
Clustering: A clustering problem is about discovering a pattern or an inherent grouping in the given data. An example is a grouping of customers by region, or a grouping based on age.
Association: Association is a rule-based learning problem where you discover a pattern that describes a major portion of the given data. For example, in an online book shop, the recommendation engine suggests that people who buy book A also buy certain other books.
Some popular examples of unsupervised learning algorithms are:
Apriori algorithm (association problems)
K-means (clustering problems)
Now let's take up the fruit grouping example again from the earlier section. Suppose we have a bag full of fruits and our task is to arrange the fruits in groups.
In this instance, we have no prior knowledge of the fruits; that is, we have never seen these fruits before and it's the first time we are seeing them. Now, how do we perform this task? What steps will we take to complete it? The first step is to take a fruit from the bag and observe its physical characteristics, say its color, and then arrange the fruits based on color. We end up with the grouping in Table 1.2:
Color | Fruit name
Red group | Cherries and apples
Green group | Grapes and bananas
Now we will group them based on size and color. See the result in Table 1.3:
Size and color | Fruit name
Big and red | Apple
Small and red | Cherry
Big and green | Banana
Small and green | Grapes
It's done now! We've successfully grouped them.
This is called unsupervised learning and the approach is called clustering.
Note that in unsupervised learning, we don't have any training data or past examples to learn from. In the preceding example, we didn't have any prior knowledge of the fruits.
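To connect this to code, here is a minimal, illustrative sketch of the clustering approach using K-means from scikit-learn (the fruit measurements below are invented for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Each fruit is described by two features: size in cm and redness (0 = green, 1 = red)
fruits = np.array([
    [8.0, 0.90],   # big and red (apple)
    [2.0, 0.80],   # small and red (cherry)
    [18.0, 0.10],  # big and green (banana)
    [2.5, 0.20],   # small and green (grape)
    [7.5, 0.85],   # another apple-like fruit
    [2.2, 0.15],   # another grape-like fruit
])

# Ask for four clusters, matching the four groups in Table 1.3
kmeans = KMeans(n_clusters=4, random_state=0).fit(fruits)
print(kmeans.labels_)  # the cluster index assigned to each fruit

The algorithm is given no fruit names at all; it groups the fruits purely from the structure of the measurements, just as we grouped them by color and size above.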
Reinforcement learning is a type of machine learning that determines the action within a specific environment in order to maximize a reward. One of the characteristics of reinforcement learning is that the agent can only receive a reward after performing the action, and thus must continue to interact with the environment in order to determine the optimal policy through trial and error.
Let's take an example. How did you learn to ride a cycle? Was somebody telling you how to ride it, and did you just follow those instructions? What kind of learning is that? Some people have the misconception that it is supervised learning, reasoning that "my uncle held me while I started cycling, he was telling me what to do", and so on. At best, what anyone could tell you is "Watch out, don't fall off the cycle", "Be careful", or some other instruction. That does not count as supervision. Supervised learning would mean that you are ready to ride a cycle and someone gives you the exact steps, such as "Push down your left foot with 5 pounds of pressure" or "Move your handle to 80 degrees".
Someone has to give you the exact control signals in order for you to ride a cycle. Now you may think that if someone gave instructions of this kind, a child would never learn to cycle. So people immediately say that it is unsupervised learning, and the justification they give is that no one tells you how to ride a cycle. But let's analyze this. If it were truly unsupervised learning, kids would watch hundreds of videos of other cyclists riding, figure out the pattern, and then get on a cycle and repeat it. That is essentially unsupervised learning: you have lots of data, you figure out the patterns, and you try to execute those patterns. Cycling does not work that way! You have to get on the cycle and try it yourself. Learning to cycle is neither supervised nor unsupervised learning. It's a different paradigm: reinforcement learning, where you learn by trial and error.
During this learning process, the feedback signals that tell us how well we do are either pain... Ouch! I fell! That hurts! I will avoid doing what led to this next time!, or reward... Wow! I am riding the bike! This feels great! I just need to keep doing what I am doing right now!
Have you seen a baby learn to walk? The baby rarely succeeds at first: it stands up, tries to walk a few steps, falls down, and stands up again. Over time, the baby learns to walk. No one really teaches it how to walk; the learning is by trial and error. There are many learning situations where humans, as well as other biological systems, do not get detailed instructions on how to perform a task. Instead, they attempt the task, receive evaluative feedback, and try to improve their behavior based on those evaluations. Reinforcement learning is a mathematical framework that captures this kind of trial-and-error learning. The goal is to learn about a system through interaction with it.
Reinforcement learning is inspired by behavioral psychology. In the 1890s, a Russian physiologist named Ivan Pavlov was doing an experiment on salivation in dogs when they are being fed. He noticed that whenever he rang the bell, his dog began to salivate. Even when he was not bringing food and just rang the bell, the dog started to salivate. Pavlov started from the theory that there are some responses a dog does not need to learn; for example, dogs do not learn to salivate whenever they see food. This reaction is hard-wired into the dog. In behavioral science terms, it is an unconditioned response (a stimulus-response connection that requires no learning). In behavioral psychology terms, we write: Unconditioned Stimulus > Unconditioned Response.
The dog actually forms an association between the bell and the food. Later on, when the bell rings without food being served, the dog starts salivating to digest the food it expects to be delivered. Essentially, the food is the payoff, like a reward, and the dog forms an association between the signal (ringing the bell) and the reward it is going to get.
After this experiment, there were many more complex experiments on animals, and people came up with lots of theories. Many papers on reinforcement learning draw from behavioral psychology journals.
There are a few terms in reinforcement learning that are used several times in this book: agent, environment, state, action, and reward. Let me explain these terms briefly here; we will go into the details of each in Chapter 2, Markov Decision Process.
An agent in reinforcement learning always takes actions. Let me give you an example. A plane moving left and right in a video game is an agent. Likewise, if the character moves left, right, up, and down in the Pac-Man game, then that character is an agent.
A state is the place or situation in which the agent finds itself; for example, the current situation of the plane in our game, or the current board of the Pac-Man game, is a state.
An action, as the name implies, is something the agent does in a certain scenario. An agent can perform certain actions: in the Pac-Man game, the character can move left, right, up, or down; in our other example, the plane in the video game can go left or right.
A reward is a feedback signal; it can be positive or negative. For example, in the Pac-Man game, if the agent goes left and avoids the enemy, it gets a positive reward. In the same way, if our plane goes right or left and dodges a bomb, it gets a reward.
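Putting these terms together, here is a minimal, illustrative sketch of one episode of agent-environment interaction using OpenAI Gym (introduced later in this chapter); the agent here simply picks random actions:

import gym

env = gym.make('CartPole-v0')   # the environment
state = env.reset()             # the initial state

done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()             # the agent picks an action (randomly here)
    state, reward, done, info = env.step(action)   # the environment returns the next state and a reward
    total_reward += reward

print('Total reward for this episode:', total_reward)

A learning agent would replace the random choice with a policy that improves based on the rewards it receives.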
Reinforcement learning is all about learning from the environment and learning to be more accurate with time. There are two types of reinforcement learning, called positive reinforcement learning and negative reinforcement learning. We will discuss both approaches in the next section.
Positive reinforcement learning means getting a positive reward. It is something desirable that is given when you take an action. Let me give you an example to understand the concept. Let's say after studying hard, you've secured the first position in your class. Now that you know your good action results in a positive reward, you'll actually try more to continue such good actions. This is called positive reinforcement learning.
Another example, continued from the previous section on riding a bicycle, is of someone clapping for you when you are finally able to ride a cycle. It's a positive feedback and is called positive reinforcement learning.
On the other hand, negative reinforcement learning means getting negative rewards, or something undesirable given to you when you take an action. For example, you go to a cinema to watch a movie and you feel very cold, so it is uncomfortable to continue watching the movie. Next time you go to the same theater and feel the same cold again. It's surely uncomfortable to watch a movie in this environment.
The third time you visit the theater you wear a jacket. With this action, the negative element is removed.
Again, taking up the same ride-a-cycle example here, you fall down and get hurt. That's a negative feedback and it's called negative reinforcement learning.
The goal here is to learn by interacting with the system; it's not something that is completely offline. You have some level of interaction with the system and you learn about the system through the interaction.
Now let's discuss another example, from a game of chess. You sit with an opponent and make a sequence of moves; at the end of the game, either you win or you lose. If you win, you get a reward: someone pays you $80. If you lose, you have to pay the opponent $80. That's all that happens; the only feedback you get is that you either win $80 or lose $80 at the end of the game. No one tells you that, given this position, this is the move you have to take. That's why I said it is learning from reward and punishment in the absence of detailed supervision.
A dog can be trained to keep the room tidy by giving him more tasty food when he behaves well and reducing the amount of his favorite food if he dirties the room.
The dog can be considered as an agent and the room as the environment. You are the source of the reward signal (tasty food). Although the feedback given to the dog is vague, eventually his neural networks will figure out that there is a relation between good food and good behavior.
The dog will possibly behave well and stop messing up the room to maximize the goal of eating more tasty food. Thus we've seen reinforcement learning in a non-computer setting.
This reinforces the idea that reinforcement learning can be a powerful tool for AI applications. Self-driving cars are a good example of AI applications.
With reinforcement learning, we aim to mimic biological entities.
Another example is a robot as an agent, with the goal of moving around a room without bumping into obstacles. A negative score (punishment) for bumping into an obstacle and a positive score (reward) for avoiding one define the final score for the agent. The reward can be maximized by moving around the room while avoiding obstacles; here, we can say that the goal is to maximize the score.
An agent can maximize the reward by acting appropriately on the environment and performing optimal actions. Reinforcement learning is thus also used in self-adapting systems, such as AlphaGo.
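To give a small taste of how an agent can learn such optimal actions (Q-learning is covered in detail in Chapter 6), here is a minimal, illustrative sketch of tabular Q-learning on a made-up five-state corridor; the states, rewards, and parameters are all invented for illustration:

import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate
ACTIONS = [-1, +1]                      # move left or move right
GOAL = 4                                # reaching state 4 yields reward +1
Q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}

for episode in range(500):
    s = 0
    while s != GOAL:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), GOAL)
        r = 1.0 if s_next == GOAL else 0.0
        # One-step Q-learning update toward r + gamma * max over a' of Q(s', a')
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

# After training, the greedy action in state 0 points toward the goal
print(max(ACTIONS, key=lambda act: Q[(0, act)]))   # prints 1

The agent is never told which way the goal lies; it discovers the optimal action in each state purely from the rewards it collects through trial and error.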
Reinforcement learning is used in a wide variety of applications.
Self-driving cars are not science fiction anymore. Companies such as Toyota and Ford have invested millions of dollars in R&D for this technology. Taxi services such as Uber and Lyft, which currently pay human drivers, may soon deploy entire fleets of self-driving cars. In the next two to three years, hundreds of thousands of self-driving cars may be sold to regular consumers.
Google is also taking a lead in this. The Google self-driving car project is called Waymo; it stands for a new way forward in mobility.
Thirty-three corporations are working on autonomous vehicles, and over $450M has been invested across 36 deals to date; auto tech start-ups are on track for yearly highs in both deals and dollars.
Many influential figures from the automobile and technology industries predict that this will happen. But the big question is: when will it actually happen? Timing is the key here; by 2020, many relevant companies are planning to launch autonomous cars. Refer to the following predictions by motor companies:
Motor company | Launch prediction
Audi | 2020
NuTonomy (Singapore) | 2018
Delphi and Mobileye | 2019
Ford | 2021
Volkswagen | 2019
General Motors | 2020
BMW | 2021
Baidu | 2019
Toyota | 2020
Elon Musk | 2019
Jaguar | 2024
Nissan | 2020
Autonomous cars are a core, long-term strategy for the industry. IEEE predicts that 75 percent of vehicles will be fully autonomous by 2040.
Planning for a self-driving car is done via reinforcement learning. Through trial and error during training, the car learns to continuously correct its driving over time.
We will learn how to create a self-driving car using a simulator in upcoming chapters.
While people are still debating the safety of self-driving cars, the United Arab Emirates is actually preparing to launch an autonomous aerial taxi, or drone taxi.
It is one of the finest examples of applying reinforcement learning.
Dubai's Road and Transport Authority (RTA) employs the Chinese firm Ehang's 184, the world's first passenger drone. It's capable of a range of about 30 miles on a single charge and can carry one person weighing up to 220 lbs, as well as a small suitcase. The entire flight is managed by a command center; all you need to do is hop in and choose from the list of predetermined destinations where you want to land.
Riders can use a smartphone app to book flights that pick them up from designated zones. The drone taxi arrives at the designated place, and the rider gets into a seat and selects the pre-programmed destination using a touchscreen, then just sits back and enjoys the flight. All flights are monitored remotely in the control room for passenger safety.
This autonomous drone taxi can carry a weight of 110 kg, and it uses eight motors to fly at speeds of up to 70 kilometers per hour.
Computer scientists at Stanford have successfully created an AI system that enables robotic helicopters to learn and perform difficult stunts by watching other helicopters perform the same maneuvers. This has resulted in autonomous helicopters that can perform a complete airshow of tricks on their own.
Controlling autonomous helicopter flight is widely regarded as a highly challenging control problem. Despite this, human experts can reliably fly helicopters through a wide range of maneuvers, including aerobatic ones.
How does it work? By using reinforcement learning for the optimal control problem, the system optimizes the model and reward functions.
We will look into all these reinforcement learning algorithms practically in the upcoming chapters.
TD-Gammon is a computer backgammon program developed in 1992. It is a neural network that teaches itself to play backgammon and improves its strategy by playing the game against itself and learning from the results; it is a good example of a reinforcement learning algorithm. Beginning with random initial weights (and hence a random initial strategy), TD-Gammon eventually develops a strong level of play. Given only a raw description of the board state, with zero knowledge built in, the system teaches itself and develops the ability to play at an intermediate level. Moreover, with additional hand-crafted features, the system performs stunningly well.
The current version of TD-Gammon is very close to the level of the best human player of all time. It explored many strategies that humans had never used, and these discoveries have advanced how backgammon is played today.
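TD-Gammon's self-teaching is based on temporal difference learning, which we cover in Chapter 4. As a taste, here is a minimal, illustrative sketch of the TD(0) value update on a made-up value table (TD-Gammon itself trains a neural network with TD(lambda)):

# TD(0): move V(s) toward the one-step estimate r + gamma * V(s_next)
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {'start': 0.0, 'mid': 0.0, 'win': 0.0}   # invented game states
td0_update(V, 'mid', 1.0, 'win')             # observed transition: mid -> win, reward 1
print(V['mid'])                              # prints 0.1

Each observed move nudges the estimated value of a position toward the outcome that actually followed it, which is how self-play alone can improve the program's judgment.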
The game of Go originated in China more than 3,000 years ago. The rules of the game are simple. Players take turns to place white or black stones on a board, trying to capture the opponent's stones or surround empty space to make points out of territory. As simple as the rules are, Go is a game of profound complexity. There are more possible positions in Go than there are atoms in the universe. That makes Go more complex than chess.
The game of Go is a classic and very challenging game. Computer scientists tried for decades to achieve even a beginner level of performance with a computer compared to a human. Now, with advances in deep reinforcement learning, the computer learns a policy network (which selects actions) and a value network (which predicts the winner) through self-play.
AlphaGo uses state-of-the-art tree search and deep neural network techniques. It is the first program to beat a professional human player, which it did in October 2015. Later, AlphaGo also defeated Lee Sedol, one of the strongest players in the world and winner of 18 world titles. The final score of the match was 4 to 1, and it was watched by 200 million viewers.
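To illustrate the idea of separate policy and value networks (this sketch is not AlphaGo's actual architecture; the board encoding and layer sizes are invented), here is a minimal PyTorch example of a network with a shared body, a policy head that scores moves, and a value head that predicts the winner:

import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, board_size=9):
        super(PolicyValueNet, self).__init__()
        n = board_size * board_size
        self.body = nn.Sequential(nn.Linear(n, 128), nn.ReLU())  # shared representation
        self.policy_head = nn.Linear(128, n)   # one score per board point
        self.value_head = nn.Linear(128, 1)    # predicted game outcome

    def forward(self, board):
        h = self.body(board)
        policy = torch.softmax(self.policy_head(h), dim=-1)   # probability over moves
        value = torch.tanh(self.value_head(h))                # winner prediction in [-1, 1]
        return policy, value

net = PolicyValueNet()
board = torch.zeros(1, 81)          # an empty 9 x 9 board, flattened
policy, value = net(board)
print(policy.shape, value.item())   # torch.Size([1, 81]) and a scalar

In AlphaGo, networks of this kind guide a tree search: the policy narrows down which moves are worth considering, while the value judges positions without playing them out to the end.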
Reinforcement learning is learning from interaction with the environment. Here the learner is called the Agent. Everything outside the Agent is called the Environment. The Agent
