Multiagent value iteration algorithms in dynamic programming and reinforcement learning

Dimitri Bertsekas

doi:10.1016/j.rico.2020.100003

Research output: Contribution to journal › Article › peer-review

16 Scopus citations

Abstract

We consider infinite horizon dynamic programming problems, where the control at each stage consists of several distinct decisions, each one made by one of several agents. In an earlier work we introduced a policy iteration algorithm, where the policy improvement is done one-agent-at-a-time in a given order, with knowledge of the choices of the preceding agents in the order. As a result, the amount of computation for each policy improvement grows linearly with the number of agents, as opposed to exponentially for the standard all-agents-at-once method. For the case of a finite-state discounted problem, we showed convergence to an agent-by-agent optimal policy. In this paper, this result is extended to value iteration and optimistic versions of policy iteration, as well as to more general DP problems where the Bellman operator is a contraction mapping, such as stochastic shortest path problems with all policies being proper.
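The Python sketch below illustrates the computational point in the abstract. It is not the paper's algorithm. It assumes a finite-state discounted problem with a hypothetical data layout: P[u][x, y] and G[u][x, y] are transition probabilities and stage costs, keyed by a joint-control tuple u. It also assumes one plausible update order. At each state, agents minimize the one-step lookahead Q-factor in turn, each over its own control only. Agents earlier in the order use their new choices, and later agents stay at the current policy. The functions q_value, agent_by_agent_vi and all_at_once_vi, and the random test problem, are illustrative assumptions.

```python
import itertools

import numpy as np


def q_value(P, G, J, x, u, alpha):
    """Expected stage cost plus discounted cost-to-go of joint control u at state x."""
    return float(P[u][x] @ (G[u][x] + alpha * J))


def agent_by_agent_vi(P, G, controls, alpha, mu, J, iters):
    """Hypothetical sketch of value iteration with one-agent-at-a-time minimization.

    P[u][x, y], G[u][x, y]: transition probabilities and stage costs for joint control u.
    controls: one list of admissible controls per agent.
    mu: dict mapping state -> tuple of per-agent controls (the current policy).
    Returns the final cost vector, the policy, and the number of Q-factor evaluations.
    """
    n = len(J)
    evals = 0
    for _ in range(iters):
        J_new = np.empty(n)
        mu_new = {}
        for x in range(n):
            u = list(mu[x])  # agents not yet processed keep the current policy's choice
            for l, U_l in enumerate(controls):
                # Agent l optimizes only its own component: agents before l use their
                # new choices, agents after l are held at mu[x].
                best_c, best_q = u[l], np.inf
                for c in U_l:
                    q = q_value(P, G, J, x, tuple(u[:l] + [c] + u[l + 1:]), alpha)
                    evals += 1
                    if q < best_q:
                        best_c, best_q = c, q
                u[l] = best_c
            mu_new[x] = tuple(u)
            J_new[x] = q_value(P, G, J, x, mu_new[x], alpha)
        J, mu = J_new, mu_new
    return J, mu, evals


def all_at_once_vi(P, G, controls, alpha, J, iters):
    """Standard value iteration: minimizes over every joint control (product of sets)."""
    joint = list(itertools.product(*controls))
    for _ in range(iters):
        J = np.array([min(q_value(P, G, J, x, u, alpha) for u in joint) for x in range(len(J))])
    return J


# Hypothetical random problem: 3 states, 3 agents with 3 controls each, discount 0.9.
rng = np.random.default_rng(0)
n, controls, alpha, iters = 3, [[0, 1, 2]] * 3, 0.9, 200
P, G = {}, {}
for u in itertools.product(*controls):
    M = rng.random((n, n))
    P[u] = M / M.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    G[u] = rng.random((n, n))

mu0 = {x: (0, 0, 0) for x in range(n)}
J, mu, evals = agent_by_agent_vi(P, G, controls, alpha, mu0, np.zeros(n), iters)
J_star = all_at_once_vi(P, G, controls, alpha, np.zeros(n), iters)

print("agent-by-agent cost:", J)
print("all-at-once cost:   ", J_star)
print("Q-factor evaluations per state per iteration:", evals // (n * iters))  # 3 + 3 + 3 = 9, vs 27 joint controls
```

Each agent-by-agent update picks a joint control that is also a candidate in the full minimization. Started from the same vector, its costs can therefore be no lower than those of standard value iteration. The saving is in per-state work. In this sketch it is the sum of the agents' control-set sizes (9 here), not their product (27).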

Original language	English (US)
Article number	100003
Journal	Results in Control and Optimization
Volume	1
DOIs	https://doi.org/10.1016/j.rico.2020.100003
State	Published - Dec 2020

ASJC Scopus subject areas

Control and Optimization
Applied Mathematics
Artificial Intelligence
Modeling and Simulation
Control and Systems Engineering

Cite this

@article{f63dc22521f74ed399c1f5a52d864d82,
  title = "Multiagent value iteration algorithms in dynamic programming and reinforcement learning",
  abstract = "We consider infinite horizon dynamic programming problems, where the control at each stage consists of several distinct decisions, each one made by one of several agents. In an earlier work we introduced a policy iteration algorithm, where the policy improvement is done one-agent-at-a-time in a given order, with knowledge of the choices of the preceding agents in the order. As a result, the amount of computation for each policy improvement grows linearly with the number of agents, as opposed to exponentially for the standard all-agents-at-once method. For the case of a finite-state discounted problem, we showed convergence to an agent-by-agent optimal policy. In this paper, this result is extended to value iteration and optimistic versions of policy iteration, as well as to more general DP problems where the Bellman operator is a contraction mapping, such as stochastic shortest path problems with all policies being proper.",
  author = "Dimitri Bertsekas",
  note = "Publisher Copyright: {\textcopyright} 2020 The Author",
  year = "2020",
  month = dec,
  doi = "10.1016/j.rico.2020.100003",
  language = "English (US)",
  volume = "1",
  journal = "Results in Control and Optimization",
  issn = "2666-7207",
  publisher = "Elsevier BV",
}

TY  - JOUR
T1  - Multiagent value iteration algorithms in dynamic programming and reinforcement learning
AU  - Bertsekas, Dimitri
PY  - 2020/12
Y1  - 2020/12
N2  - We consider infinite horizon dynamic programming problems, where the control at each stage consists of several distinct decisions, each one made by one of several agents. In an earlier work we introduced a policy iteration algorithm, where the policy improvement is done one-agent-at-a-time in a given order, with knowledge of the choices of the preceding agents in the order. As a result, the amount of computation for each policy improvement grows linearly with the number of agents, as opposed to exponentially for the standard all-agents-at-once method. For the case of a finite-state discounted problem, we showed convergence to an agent-by-agent optimal policy. In this paper, this result is extended to value iteration and optimistic versions of policy iteration, as well as to more general DP problems where the Bellman operator is a contraction mapping, such as stochastic shortest path problems with all policies being proper.
AB  - We consider infinite horizon dynamic programming problems, where the control at each stage consists of several distinct decisions, each one made by one of several agents. In an earlier work we introduced a policy iteration algorithm, where the policy improvement is done one-agent-at-a-time in a given order, with knowledge of the choices of the preceding agents in the order. As a result, the amount of computation for each policy improvement grows linearly with the number of agents, as opposed to exponentially for the standard all-agents-at-once method. For the case of a finite-state discounted problem, we showed convergence to an agent-by-agent optimal policy. In this paper, this result is extended to value iteration and optimistic versions of policy iteration, as well as to more general DP problems where the Bellman operator is a contraction mapping, such as stochastic shortest path problems with all policies being proper.
UR  - http://www.scopus.com/inward/record.url?scp=85111130812&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85111130812&partnerID=8YFLogxK
U2  - 10.1016/j.rico.2020.100003
DO  - 10.1016/j.rico.2020.100003
M3  - Article
AN  - SCOPUS:85111130812
SN  - 2666-7207
VL  - 1
JO  - Results in Control and Optimization
JF  - Results in Control and Optimization
M1  - 100003
ER  -