TY - GEN
T1 - Advice-Guided Reinforcement Learning in a non-Markovian Environment
AU - Neider, Daniel
AU - Gaglione, Jean Raphael
AU - Gavran, Ivan
AU - Topcu, Ufuk
AU - Wu, Bo
AU - Xu, Zhe
N1 - Funding Information:
This work was partially supported by the grants ARL W911NF2020132, ONR N00014-20-1-2115, and ARO W911NF-20-1-0140.
Publisher Copyright:
© 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2021
Y1 - 2021
N2 - We study a class of reinforcement learning tasks in which the agent receives sparse rewards for complex, temporally extended behaviors. For such tasks, the problem is how to efficiently augment the state space so as to make the reward function Markovian. While some existing solutions assume that the reward function is explicitly provided to the learning algorithm (e.g., in the form of a reward machine), others learn the reward function from interactions with the environment, assuming no prior knowledge provided by the user. In this paper, we generalize both approaches and enable the user to give advice to the agent, representing the user’s best knowledge about the reward function, potentially fragmented, partial, or even incorrect. We formalize advice as a set of DFAs and present a reinforcement learning algorithm that takes advantage of such advice, with guaranteed convergence to an optimal policy. The experiments show that using well-chosen advice can reduce the number of training steps needed for convergence to an optimal policy, and can decrease the computation time to learn the reward function by up to two orders of magnitude.
AB - We study a class of reinforcement learning tasks in which the agent receives sparse rewards for complex, temporally extended behaviors. For such tasks, the problem is how to efficiently augment the state space so as to make the reward function Markovian. While some existing solutions assume that the reward function is explicitly provided to the learning algorithm (e.g., in the form of a reward machine), others learn the reward function from interactions with the environment, assuming no prior knowledge provided by the user. In this paper, we generalize both approaches and enable the user to give advice to the agent, representing the user’s best knowledge about the reward function, potentially fragmented, partial, or even incorrect. We formalize advice as a set of DFAs and present a reinforcement learning algorithm that takes advantage of such advice, with guaranteed convergence to an optimal policy. The experiments show that using well-chosen advice can reduce the number of training steps needed for convergence to an optimal policy, and can decrease the computation time to learn the reward function by up to two orders of magnitude.
UR - http://www.scopus.com/inward/record.url?scp=85110480987&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85110480987&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85110480987
T3 - 35th AAAI Conference on Artificial Intelligence, AAAI 2021
SP - 9073
EP - 9080
BT - 35th AAAI Conference on Artificial Intelligence, AAAI 2021
PB - Association for the Advancement of Artificial Intelligence
T2 - 35th AAAI Conference on Artificial Intelligence, AAAI 2021
Y2 - 2 February 2021 through 9 February 2021
ER -