TY - GEN
T1 - Q-learning and enhanced policy iteration in discounted dynamic programming
AU - Bertsekas, Dimitri P.
AU - Yu, Huizhen
PY - 2010
Y1 - 2010
AB - We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy-iteration-like algorithm for finding the optimal Q-factors. Instead of performing policy evaluation by solving a linear system of equations, our algorithm involves the (possibly inexact) solution of an optimal stopping problem. This problem can be solved with simple Q-learning iterations when a lookup table representation is used; it can also be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99] when feature-based Q-factor approximations are used. In its exact/lookup table representation form, our algorithm admits asynchronous and stochastic iterative implementations, in the spirit of asynchronous/modified policy iteration, with lower overhead than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively resolves the inherent difficulties that existing schemes face due to inadequate exploration.
UR - http://www.scopus.com/inward/record.url?scp=79953158573&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79953158573&partnerID=8YFLogxK
U2 - 10.1109/CDC.2010.5717930
DO - 10.1109/CDC.2010.5717930
M3 - Conference contribution
AN - SCOPUS:79953158573
SN - 9781424477456
T3 - Proceedings of the IEEE Conference on Decision and Control
SP - 1409
EP - 1416
BT - 2010 49th IEEE Conference on Decision and Control, CDC 2010
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 49th IEEE Conference on Decision and Control, CDC 2010
Y2 - 15 December 2010 through 17 December 2010
ER -