TY - GEN
T1 - Q-learning and enhanced policy iteration in discounted dynamic programming
AU - Bertsekas, Dimitri P.
AU - Yu, Huizhen
PY - 2010
Y1 - 2010
AB - We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy-iteration-like algorithm for finding the optimal Q-factors. Instead of performing policy evaluation by solving a linear system of equations, our algorithm involves the (possibly inexact) solution of an optimal stopping problem. This problem can be solved with simple Q-learning iterations when a lookup table representation is used; it can also be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99] when feature-based Q-factor approximations are used. In its exact/lookup table representation form, our algorithm admits asynchronous and stochastic iterative implementations, in the spirit of asynchronous/modified policy iteration, with lower overhead than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively resolves the inherent difficulties that existing schemes face due to inadequate exploration.
UR - http://www.scopus.com/inward/record.url?scp=79953158573&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79953158573&partnerID=8YFLogxK
U2 - 10.1109/CDC.2010.5717930
DO - 10.1109/CDC.2010.5717930
M3 - Conference contribution
AN - SCOPUS:79953158573
SN - 9781424477456
T3 - Proceedings of the IEEE Conference on Decision and Control
SP - 1409
EP - 1416
BT - 2010 49th IEEE Conference on Decision and Control, CDC 2010
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 49th IEEE Conference on Decision and Control, CDC 2010
Y2 - 15 December 2010 through 17 December 2010
ER -