Constrained policy improvement for efficient reinforcement learning

Elad Sarafian, Aviv Tamar, Sarit Kraus

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

9 Scopus citations

Abstract

We propose a policy improvement algorithm for Reinforcement Learning (RL) termed Rerouted Behavior Improvement (RBI). RBI is designed to take into account the evaluation errors of the Q-function. Such errors are common in RL when learning the Q-value from finite experience data. Greedy policies, or even constrained policy optimization algorithms that ignore these errors, may suffer from an improvement penalty (i.e., a policy impairment). To reduce the penalty, the idea of RBI is to attenuate rapid policy changes to actions that were rarely sampled. This approach is shown to avoid catastrophic performance degradation and reduce regret when learning from a batch of transition samples. Through a two-armed bandit example, we show that it also increases data efficiency when the optimal action has a high variance. We evaluate RBI in two tasks in the Atari Learning Environment: (1) learning from observations of multiple behavior policies and (2) iterative RL. Our results demonstrate the advantage of RBI over greedy policies and other constrained policy optimization algorithms, both in learning from observations and in RL tasks.
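
The core idea described in the abstract, attenuating policy changes to rarely sampled actions, can be illustrated with a short Python sketch. This is not the authors' implementation: the function name rbi_improvement and the ratio bounds c_min and c_max are assumptions used here to show one plausible way a "reroute"-style constraint (keeping the improved policy within a bounded ratio of the behavior policy) could be enforced.

    import numpy as np

    def rbi_improvement(q_values, behavior_policy, c_min=0.5, c_max=1.5):
        """Illustrative constrained policy improvement step (sketch).

        Each action's new probability is kept within [c_min, c_max] times
        its probability under the behavior policy, so actions that were
        rarely sampled (low behavior probability) can only change slowly.
        Remaining probability mass is rerouted greedily toward high-Q actions.
        """
        beta = np.asarray(behavior_policy, dtype=float)
        q = np.asarray(q_values, dtype=float)

        # Start every action at its lower bound.
        pi = c_min * beta
        # Probability mass still to distribute so that pi sums to 1.
        free_mass = 1.0 - pi.sum()

        # Assign the remaining mass to the highest-Q actions first,
        # never exceeding the per-action upper bound c_max * beta.
        for a in np.argsort(q)[::-1]:
            headroom = c_max * beta[a] - pi[a]
            delta = min(headroom, free_mass)
            pi[a] += delta
            free_mass -= delta
            if free_mass <= 1e-12:
                break

        return pi

For example, with a behavior policy of [0.9, 0.1], Q estimates of [0.0, 1.0], and c_max = 1.5, this sketch raises the rarely sampled second action only from 0.1 to 0.15, rather than greedily assigning it all probability mass, which reflects the attenuation of rapid policy changes that the abstract describes.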

Original language: English
Title of host publication: Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI 2020
Editors: Christian Bessiere
Publisher: International Joint Conferences on Artificial Intelligence
Pages: 2863-2871
Number of pages: 9
ISBN (Electronic): 9780999241165
State: Published - 2020
Event: 29th International Joint Conference on Artificial Intelligence, IJCAI 2020 - Yokohama, Japan
Duration: 1 Jan 2021 → …

Publication series

Name: IJCAI International Joint Conference on Artificial Intelligence
Volume: 2021-January
ISSN (Print): 1045-0823

Conference

Conference: 29th International Joint Conference on Artificial Intelligence, IJCAI 2020
Country/Territory: Japan
City: Yokohama
Period: 1/01/21 → …

Bibliographical note

Publisher Copyright:
© 2020 International Joint Conferences on Artificial Intelligence. All rights reserved.

Funding

This work was supported by the Israel Innovation Authority.

Funders: Israel Innovation Authority
