The book begins with a chapter on traditional methods of supervised learning, covering least squares estimation, recursive least squares, least mean squares, and stochastic approximation. Chapter 2 covers single-agent reinforcement learning; topics include learning value functions, Markov decision processes, and TD learning with eligibility traces. Chapter 3 discusses two-player matrix games with both pure and mixed strategies; numerous algorithms and examples are presented. Chapter 4 covers learning in multiplayer stochastic (Markov) games, focusing on two-player grid games, Q-learning, and Nash Q-learning. Chapter 5 discusses differential games, including multi-player differential games, the actor–critic structure, adaptive fuzzy control and fuzzy inference systems, the evader–pursuer game, and the game of guarding a territory. Chapter 6 discusses new ideas on learning within robotic swarms and the innovative idea of the evolution of personality traits.
• Framework for understanding a variety of methods and approaches in multi-agent machine learning.
• Discusses reinforcement learning methods, including several forms of multi-agent Q-learning
• Applicable to research professors and graduate students studying electrical and computer engineering, computer science, and mechanical and aerospace engineering
Page count: 318
Publication year: 2014
Cover
Title
Copyright
Preface
References
Chapter 1: A Brief Review of Supervised Learning
1.1 Least Squares Estimates
1.2 Recursive Least Squares
1.3 Least Mean Squares
1.4 Stochastic Approximation
References
Chapter 2: Single-Agent Reinforcement Learning
2.1 Introduction
2.2 n-Armed Bandit Problem
2.3 The Learning Structure
2.4 The Value Function
2.5 The Optimal Value Functions
2.6 Markov Decision Processes
2.7 Learning Value Functions
2.8 Policy Iteration
2.9 Temporal Difference Learning
2.10 TD Learning of the State-Action Function
2.11 Q-Learning
2.12 Eligibility Traces
References
Chapter 3: Learning in Two-Player Matrix Games
3.1 Matrix Games
3.2 Nash Equilibria in Two-Player Matrix Games
3.3 Linear Programming in Two-Player Zero-Sum Matrix Games
3.4 The Learning Algorithms
3.5 Gradient Ascent Algorithm
3.6 WoLF-IGA Algorithm
3.7 Policy Hill Climbing (PHC)
3.8 WoLF-PHC Algorithm
3.9 Decentralized Learning in Matrix Games
3.10 Learning Automata
3.11 Linear Reward–Inaction Algorithm
3.12 Linear Reward–Penalty Algorithm
3.13 The Lagging Anchor Algorithm
3.14 The Linear Reward–Inaction Lagging Anchor Algorithm
References
Chapter 4: Learning in Multiplayer Stochastic Games
4.1 Introduction
4.2 Multiplayer Stochastic Games
4.3 Minimax-Q Algorithm
4.4 Nash Q-Learning
4.5 The Simplex Algorithm
4.6 The Lemke–Howson Algorithm
4.7 Nash-Q Implementation
4.8 Friend-or-Foe Q-Learning
4.9 Infinite Gradient Ascent
4.10 Policy Hill Climbing
4.11 WoLF-PHC Algorithm
4.12 Guarding a Territory Problem in a Grid World
4.13 Extension of the Linear Reward–Inaction Lagging Anchor Algorithm to Stochastic Games
4.14 The Exponential Moving-Average Q-Learning (EMA Q-Learning) Algorithm
4.15 Simulation and Results Comparing EMA Q-Learning to Other Methods
References
Chapter 5: Differential Games
5.1 Introduction
5.2 A Brief Tutorial on Fuzzy Systems
5.3 Fuzzy Q-Learning
5.4 Fuzzy Actor–Critic Learning
5.5 Homicidal Chauffeur Differential Game
5.6 Fuzzy Controller Structure
5.7 Q(λ)-Learning Fuzzy Inference System
5.9 Learning in the Evader–Pursuer Game with Two Cars
5.10 Simulation of the Game of Two Cars
5.11 Differential Game of Guarding a Territory
5.12 Reward Shaping in the Differential Game of Guarding a Territory
5.13 Simulation Results
References
Chapter 6: Swarm Intelligence and the Evolution of Personality Traits
6.1 Introduction
6.2 The Evolution of Swarm Intelligence
6.3 Representation of the Environment
6.4 Swarm-Based Robotics in Terms of Personalities
6.5 Evolution of Personality Traits
6.6 Simulation Framework
6.7 A Zero-Sum Game Example
6.8 Implementation for Next Sections
6.9 Robots Leaving a Room
6.10 Tracking a Target
6.11 Conclusion
References
Index
End User License Agreement
Chapter 2: Single-Agent Reinforcement Learning
Table 2.1 Temporal difference Q-table learning result.
Table 2.2 Temporal difference Q-table learning result.
Chapter 3: Learning in Two-Player Matrix Games
Table 3.1 Examples of two-player matrix games.
Table 3.2 Comparison of learning algorithms in matrix games.
Table 3.3 Examples of two-player matrix games.
Chapter 4: Learning in Multiplayer Stochastic Games
Table 4.1 Action-value function in Example 4.1.
Table 4.2 Minimax solution for the defender in the given state.
Table 4.3 Minimax solution for the defender in the given state. (a) Q-values of the defender for the state. (b) Linear constraints for the defender in the state.
Table 4.4 States and strategies.
Table 4.5 Grid game 1: Nash Q-values in state (0, 2).
Table 4.6 Grid game 1: Nash Q-values in state (1, 3).
Table 4.7 Comparison of multiagent reinforcement learning algorithms.
Chapter 5: Differential Games
Table 5.1 Tabular format.
Table 5.2 Pursuer's fuzzy decision table before learning.
Table 5.3 Evader's fuzzy decision table before learning.
Table 5.4 Capture time (s) for different numbers of learning episodes.
Table 5.5 The evader's fuzzy decision table after 1000 learning episodes.
Table 5.6 The pursuer's fuzzy decision table after 1000 learning episodes.
Table 5.7 Summary of the time of capture for different numbers of learning episodes in the game of two cars.
Table 5.8 The evader's fuzzy decision table and the output constant after learning.
Table 5.9 The pursuer's fuzzy decision table and the output constant after learning.
Chapter 6: Swarm Intelligence and the Evolution of Personality Traits
Table 6.1 Zero-sum game example.
Table 6.2 Optimal mixed strategies.
Table 6.3 Experimental results obtained for both players.
Table 6.4 Modeling of a game between two robots trying to leave a room.
Table 6.5 Utility payoffs for states.
Table 6.6 Convergence of the personality traits.
Table 6.7 Simulation results.
Chapter 2: Single-Agent Reinforcement Learning
Figure 2-1 Agent–environment interaction in reinforcement learning.
Figure 2-2 Armed bandit with varying ε.
Figure 2-3 Example of the grid world.
Figure 2-4 Values for each of the states.
Figure 2-5 Resulting optimal policies.
Figure 2-6 Resulting state values based on the optimal policies.
Figure 2-7 Comparison of TD learning with and without eligibility traces.
Figure 2-8 Comparison of Q-learning with and without eligibility traces for Q(1, UP).
Chapter 3: Learning in Two-Player Matrix Games
Figure 3-1 Simplex method for player 1 in the matching pennies game.
Figure 3-2 Simplex method for player 1 in the revised matching pennies game.
Figure 3-3 Simplex method in Example 3.3. (a) Simplex method for player 1. (b) Simplex method for player 2.
Figure 3-4 Players' NE strategies versus the game parameter.
Figure 3-5 GA in matching pennies game.
Figure 3-6 PHC matching pennies game, player 1, probability of choosing action 1, heads.
Figure 3-7 PHC matching pennies game, player 1, probability of choosing action 1, heads when player 2 always chooses heads.
Figure 3-8 WoLF-PHC matching pennies game, player 1, probability of choosing action 1.
Figure 3-9 Trajectories of players' strategies during learning in matching pennies.
Figure 3-10 Trajectories of players' strategies during learning in prisoners' dilemma.
Figure 3-11 Trajectories of players' strategies during learning in rock-paper-scissors.
Chapter 4: Learning in Multiplayer Stochastic Games
Figure 4-1 Example of stochastic games. (a) A grid game with two players. (b) The numbered cells in the game. (c) Possible state transitions given the players' joint action. Reproduced from [5], © X. Lu.
Figure 4-2 A grid game. (a) Initial positions of the players. (b) Invader in the top-right versus defender in the bottom-left. (c) Invader in the bottom-left versus defender in the top-right. Reproduced from [5], © X. Lu.
Figure 4-3 Minimax Q-learning for the defender/invader game. Action probability for the defender.
Figure 4-4 Two stochastic games [7]. (a) Grid game 1. (b) Grid game 2.
Figure 4-5 (a) Nash equilibrium of grid game 1. (b) Nash equilibrium of grid game 2. Reproduced from [8] with permission from MIT press.
Figure 4-6 Grid game with barriers, start position (0,1).
Figure 4-7 Constraint equations plotted for the simplex method.
Figure 4-8 Polytope defined by player 1's constraints.
Figure 4-9 Polytope defined by player 2's constraints.
Figure 4-10 Nash-Q learner with exploit-explore. Reproduced from [15], © P. De Beck-Courcelle.
Figure 4-11 Nash-Q learner with explore only. Reproduced from [15], © P. De Beck-Courcelle.
Figure 4-12 Nash-Q learning with exploit only. Reproduced from [15], © P. De Beck-Courcelle.
Figure 4-13 Guarding a territory in a grid world. (a) Initial positions of the players when the game starts. (b) Terminal positions of the players when the game ends. Reproduced from [5], © X. Lu.
Figure 4-14 Players' strategies using the minimax-Q algorithm in the first simulation for the grid game. (a) Defender's strategy components (solid and dashed lines). (b) Invader's strategy components (solid and dashed lines). Reproduced from [5], © X. Lu.
Figure 4-15 Players' strategies using the WoLF-PHC algorithm in the first simulation for the grid game. (a) Defender's strategy components (solid and dashed lines). (b) Invader's strategy components (solid and dashed lines). Reproduced from [5], © X. Lu.
Figure 4-16 Defender's strategy in the second simulation for the grid game. (a) Minimax-Q-learned strategy of the defender against the invader using a fixed strategy. Solid line: probability of the defender moving up; dashed line: probability of the defender moving left. (b) WoLF-PHC-learned strategy of the defender against the invader using a fixed strategy. Solid line: probability of the defender moving up; dashed line: probability of the defender moving left. Reproduced from [5], © X. Lu.
Figure 4-17 A grid game. (a) Initial positions of the players. (b) One of the terminal positions of the players. Reproduced from [5], © X. Lu.
Figure 4-18 Results in the first simulation for the grid game. (a) Result of the minimax-Q-learned strategy of the defender against the minimax-Q-learned strategy of the invader. (b) Result of the WoLF-PHC-learned strategy of the defender against the WoLF-PHC-learned strategy of the invader. Reproduced from [5], © X. Lu.
Figure 4-19 Results in the second simulation for the grid game. (a) Result of the minimax-Q-learned strategy of the defender against the invader using a fixed strategy. (b) Result of the WoLF-PHC-learned strategy of the defender against the invader using a fixed strategy. Reproduced from [5], © X. Lu.
Figure 4-20 Hu and Wellman's grid game. (a) Grid game. (b) Nash equilibrium path 1. (c) Nash equilibrium path 2. Reproduced from [24] © M. Awheda and Schwartz, H. M.
Figure 4-21 Learning trajectories of players' strategies at the initial state in the grid game. Reproduced from [5] © X. Lu.
Figure 4-22 Probability distributions of the second actions for both players in the dilemma game. (a) The EMA Q-learning, (b) PGA-APP, and (c) WPL algorithms are shown. Reproduced from [24] © M. Awheda and Schwartz, H. M.
Figure 4-23 Probability distributions of the first actions for the three players in the three-player matching pennies game. (a) The EMA Q-learning, (b) PGA-APP, and (c) WPL algorithms are shown. Reproduced from [24] © M. Awheda and Schwartz, H. M.
Figure 4-24 Probability distributions of player 1's actions in the Shapley's game. (a) The EMA Q-learning, (b) PGA-APP, and (c) WPL algorithms are shown. Reproduced from [24] © M. Awheda and Schwartz, H. M.
Figure 4-25 Probability distributions of the first actions for both players in the biased game. (a) The EMA Q-learning, (b) PGA-APP, and (c) WPL algorithms are shown. Reproduced from [24] © M. Awheda and Schwartz, H. M.
Figure 4-26 Grid game 1. (a) Probability of action North of player 1 when learning with the EMA Q-learning algorithm with different values of the constant gain. Plots (b) and (c) illustrate the probability of action North of player 1 and player 2, respectively, when learning with the EMA Q-learning, PGA-APP, and WPL algorithms. Reproduced from [24] © M. Awheda and Schwartz, H. M.
Figure 4-27 Two stochastic games [8]. (a) Grid game 1. (b) Grid game 2. Reproduced from [24] © M. Awheda and Schwartz, H. M.
Figure 4-28 (a) Nash equilibrium of grid game 1. (b) Nash equilibrium of grid game 2 [8] with permission from MIT press. Reproduced from [24] © M. Awheda and Schwartz, H. M.
Figure 4-29 Grid game 2. (a) Probability of selecting action North by player 1 when learning with the EMA Q-learning, PGA-APP, and WPL algorithms. (b) Probability of selecting action West by player 2 when learning with the EMA Q-learning, PGA-APP, and WPL algorithms. Reproduced from [24] © M. Awheda and Schwartz, H. M.
Chapter 5: Differential Games
Figure 5-1 Examples of membership functions.
Figure 5-2 Fuzzy system components.
Figure 5-3 Membership functions. (a) Membership functions of five fuzzy sets. (b) Membership functions of seven fuzzy sets.
Figure 5-4 Nonlinear function and its estimation with five rules and seven rules.
Figure 5-5 Estimation error with five rules and seven rules.
Figure 5-6 Basic configuration of fuzzy systems.
Figure 5-7 Architecture of the actor–critic learning system.
Figure 5-8 Homicidal chauffeur problem model.
Figure 5-9 The vehicle cannot turn into the circular region defined by its minimum turning radius.
Figure 5-10 Membership functions before training. (a) Pursuer membership functions before training. (b) Evader membership functions before training.
Figure 5-11 Construction of the learning system where white Gaussian noise is added as an exploration mechanism.
Figure 5-12 The pursuer captures the evader with 100 learning episodes.
Figure 5-13 The evader increases the capture time after 500 learning episodes.
Figure 5-14 The evader learns to escape after 1000 learning episodes.
Figure 5-15 The evader avoids capture for the given angle (in rad).
Figure 5-16 The pursuer can capture the evader for the given angle (in rad).
Figure 5-17 The game of two cars.
Figure 5-18 The pursuer captures the evader with 100 learning episodes.
Figure 5-19 The evader increases the capture time after 500 learning episodes.
Figure 5-20 The evader learns to escape after 1300 learning episodes. (a) The evader learns to escape after 1300 learning episodes. (b) Zoomed version of (a).
Figure 5-21 The pursuer's membership functions after training. (a) The angle difference φ. (b) The rate of change of the angle difference.
Figure 5-22 The evader's membership functions after training. (a) The angle difference φ. (b) The distance d between the pursuer and the evader.
Figure 5-23 The time of capture with the use of eligibility traces in the game of two cars.
Figure 5-24 The differential game of guarding a territory.
Figure 5-25 MFs for the input variable.
Figure 5-26 Membership functions for input variables.
Figure 5-27 Membership functions for input variables.
Figure 5-28 Reinforcement learning with no shaping function in Example 5.2. (a) Trained defender using FQL with no shaping function. (b) Trained defender using FACL with no shaping function.
Figure 5-29 Reinforcement learning with a bad shaping function in Example 5.2. (a) Trained defender using FQL with the bad shaping function in Example 5.2. (b) Trained defender using FACL with the bad shaping function in Example 5.2.
Figure 5-30 Reinforcement learning with a good shaping function in Example 5.2. (a) Trained defender using FQL with the good shaping function in Example 5.2. (b) Trained defender using FACL with the good shaping function in Example 5.2.
Figure 5-31 Initial positions of the defender in the training and testing episodes in Example 5.3.
Figure 5-32 Example 5.3: average performance of the trained defender versus the NE invader. (a) Average performance error in the FQL algorithm. (b) Average performance error in the FACL algorithm.
Figure 5-33 The differential game of guarding a territory with three players.
Figure 5-34 Reinforcement learning without shaping or with a bad shaping function in Example 5.4. (a) Two trained defenders using FACL with no shaping function versus the NE invader after one training trial. (b) Two trained defenders using FACL with the bad shaping function versus the NE invader after one training trial.
Figure 5-35 Two trained defenders using FACL with the good shaping function versus the NE invader after one training trial in Example 5.4.
Figure 5-36 Example 5.5: average performance of the two trained defenders versus the NE invader. (a) Initial positions of the players in the training and testing episodes. (b) Average performance error for the trained defenders versus the NE invader.
Chapter 6: Swarm Intelligence and the Evolution of Personality Traits
Figure 6-1 (a) Actual configuration of the world. (b) The way robot A perceives it. (c) The way robot B perceives it. Reproduced from [21] © S. Givigi and H. M. Schwartz.
Figure 6-2 Simplex of a player with two strategies. Reproduced from [21] © S. Givigi and H. M. Schwartz.
Figure 6-3 Artistic depiction of the problem of robots leaving a room. Reproduced from [21] © S. Givigi and H. M. Schwartz.
Figure 6-4 Artistic depiction of the simulation environment. Reproduced from [21] © S. Givigi and H. M. Schwartz.
Figure 6-5 Utility function and personality traits of one robot. Reproduced from [21] © S. Givigi and H. M. Schwartz.
Figure 6-6 State of the robots during the simulation. Reproduced from [21] © S. Givigi and H. M. Schwartz.
Figure 6-7 State of the simulation when two robots turned courageous. Reproduced from [21] © S. Givigi and H. M. Schwartz.
Figure 6-8 State of the simulation when five robots turned courageous. Reproduced from [21] © S. Givigi and H. M. Schwartz.
Figure 6-9 State of the simulation when 10 robots turned courageous. Reproduced from [21] © S. Givigi and H. M. Schwartz.
Figure 6-10 Robot waiting for a more courageous robot. Reproduced from [21] © S. Givigi and H. M. Schwartz.
Howard M. Schwartz
Department of Systems and Computer Engineering
Carleton University
Copyright © 2014 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Schwartz, Howard M., editor.
Multi-agent machine learning : a reinforcement approach / Howard M. Schwartz.
pages cm
Includes bibliographical references and index.
ISBN 978-1-118-36208-2 (hardback)
1. Reinforcement learning. 2. Differential games. 3. Swarm intelligence. 4. Machine learning. I. Title.
Q325.6.S39 2014
519.3–dc23
2014016950
For a decade I have taught a course on adaptive control. The course focused on the classical methods of system identification, using such classic texts as Ljung [1, 2]. The course addressed traditional methods of model reference adaptive control and nonlinear adaptive control using Lyapunov techniques. However, the theory had become out of sync with current engineering practice. As such, my own research and the focus of the graduate course changed to include adaptive signal processing, and to incorporate adaptive channel equalization and echo cancellation using the least mean squares (LMS) algorithm. The course name likewise changed, from “Adaptive Control” to “Adaptive and Learning Systems.” My research was still focused on system identification and nonlinear adaptive control with application to robotics. However, by the early 2000s, I had started work with teams of robots. It was now possible to use handy robot kits and low-cost microcontroller boards to build several robots that could work together. The graduate course in adaptive and learning systems changed again; the theoretical material on nonlinear adaptive control using Lyapunov techniques was reduced, replaced with ideas from reinforcement learning. A whole new range of applications developed. The teams of robots had to learn to work together and to compete.
Today, the graduate course focuses on system identification using recursive least squares techniques, some model reference adaptive control (still using Lyapunov techniques), adaptive signal processing using the LMS algorithm, and reinforcement learning using Q-learning. The first two chapters of this book present these ideas in an abridged form, but in sufficient detail to demonstrate the connections among the available learning algorithms: how they are the same and how they differ. There are other texts that cover this material in detail [2–4].
The research then began to focus on teams of robots learning to work together. The work examined applications of robots working together for search and rescue, and for securing important infrastructure and border regions. It also began to focus on reinforcement learning and multiagent reinforcement learning. The robots are the learning agents. How do children learn how to play tag? How do we learn to play football, or how do police work together to capture a criminal? What strategies do we use, and how do we formulate these strategies? Why can I play touch football with a new group of people and quickly assess everyone's capabilities and then take a particular strategy in the game?
As our research team began to delve further into the ideas associated with multiagent machine learning and game theory, we discovered that the published literature covered many ideas but was poorly coordinated or focused. Although there are a few survey articles [5], they do not give sufficient details to appreciate the different methods. The purpose of this book is to introduce the reader to a particular form of machine learning. The book focuses on multiagent machine learning, but it is tied together with the central theme of learning algorithms in general. Learning algorithms come in many different forms. However, they tend to have a similar approach. We will present the differences and similarities of these methods.
This book is based on my own work and the work of several doctoral and masters students who have worked under my supervision over the past 10 years. In particular, I would like to thank Prof. Sidney Givigi. Prof. Givigi was instrumental in developing the ideas and algorithms presented in Chapter 6. The doctoral research of Xiaosong (Eric) Lu has also found its way into this book. The work on guarding a territory is largely based on his doctoral dissertation. Other graduate students who helped me in this work include Badr Al Faiya, Mostafa Awheda, Pascal De Beck-Courcelle, and Sameh Desouky. Without the dedicated work of this group of students, this book would not have been possible.
H. M. Schwartz
Ottawa, Canada
September 2013
[1] L. Ljung, System Identification: Theory for the User. Upper Saddle River, NJ: Prentice Hall, 2nd ed., 1999.
[2] L. Ljung and T. Soderstrom, Theory and Practice of Recursive Identification. Cambridge, Massachusetts: The MIT Press, 1983.
[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, Massachusetts: The MIT Press, 1998.
[4] K. J. Åström and B. Wittenmark, Adaptive Control. Boston, Massachusetts: Addison-Wesley Longman, 2nd ed., 1994, ISBN 0201558661.
[5] L. Buşoniu, R. Babuška, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Trans. Syst. Man Cybern. Part C, vol. 38, no. 2, pp. 156–172, 2008.
There are a number of algorithms that are typically used for system identification, adaptive control, adaptive signal processing, and machine learning. These algorithms all have particular similarities and differences, and they all need to process some type of experimental data. How we collect and process the data determines the most suitable algorithm to use. In adaptive control, there is a device referred to as the self-tuning regulator. In this case, the algorithm measures the states as outputs, estimates the model parameters, and outputs the control signals. In reinforcement learning, the algorithms process rewards, estimate value functions, and output actions. Although one may refer to the recursive least squares (RLS) algorithm in the self-tuning regulator as a supervised learning algorithm and to reinforcement learning as an unsupervised learning algorithm, the two are very similar. In this chapter, we will present a number of well-known baseline supervised learning algorithms.
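To make the parallel concrete, here is a minimal sketch (not code from the book) placing a recursive least squares parameter update next to a one-step Q-learning update. The function names, the synthetic data, and the parameters alpha and gamma are assumptions chosen purely for illustration; the point is that both rules correct an old estimate by a gain times a prediction error.

```python
import numpy as np

# Recursive least squares (RLS): estimate theta in y = phi^T theta + noise.
def rls_update(theta, P, phi, y):
    """One RLS step: theta <- theta + K * (y - phi^T theta)."""
    K = P @ phi / (1.0 + phi @ P @ phi)      # gain vector
    theta = theta + K * (y - phi @ theta)    # correct the estimate by the prediction error
    P = P - np.outer(K, phi @ P)             # update the covariance matrix
    return theta, P

# One-step Q-learning: estimate action values Q(s, a) from observed rewards.
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: Q <- Q + alpha * (target - Q)."""
    target = r + gamma * np.max(Q[s_next])   # bootstrapped target built from the reward
    Q[s, a] += alpha * (target - Q[s, a])    # same estimate-plus-gain-times-error form as RLS
    return Q

# Tiny RLS demo on synthetic data (values assumed purely for illustration).
rng = np.random.default_rng(0)
true_theta = np.array([2.0, -1.0])
theta, P = np.zeros(2), 100.0 * np.eye(2)
for _ in range(200):
    phi = rng.normal(size=2)                     # regressor (measured signals)
    y = phi @ true_theta + 0.01 * rng.normal()   # noisy measurement
    theta, P = rls_update(theta, P, phi, y)
print("RLS estimate:", theta)                    # converges near [2, -1]
```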
