Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction

Assaf Hallak, Gal Dalal, Steven Dalton, Iuri Frosio, Shie Mannor, Gal Chechik

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

3 Scopus citations

Abstract

Tree Search (TS) is crucial to some of the most influential successes in reinforcement learning. Here, we tackle two major challenges with TS that limit its usability: distribution shift and scalability. We first discover and analyze a counter-intuitive phenomenon: action selection through TS and a pre-trained value function often leads to lower performance compared to the original pre-trained agent, even when having access to the exact state and reward in future steps. We show this is due to a distribution shift to areas where value estimates are highly inaccurate and analyze this effect using Extreme Value theory. To overcome this problem, we introduce a novel off-policy correction term that accounts for the mismatch between the pre-trained value and its corresponding TS policy by penalizing under-sampled trajectories. We prove that our correction eliminates the above mismatch and bound the probability of sub-optimal action selection. Our correction significantly improves pre-trained Rainbow agents without any further training, often more than doubling their scores on Atari games. Next, we address the scalability issue given by the computational complexity of exhaustive TS that scales exponentially with the tree depth. We introduce Batch-BFS: a GPU breadth-first search that advances all nodes in each depth of the tree simultaneously. Batch-BFS reduces runtime by two orders of magnitude and, beyond inference, enables also training with TS of depths that were not feasible before. We train DQN agents from scratch using TS and show improvement in several Atari games compared to both the original DQN and the more advanced Rainbow.

Original languageEnglish
Title of host publicationAdvances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
EditorsMarc'Aurelio Ranzato, Alina Beygelzimer, Yann Dauphin, Percy S. Liang, Jenn Wortman Vaughan
PublisherNeural information processing systems foundation
Pages5518-5530
Number of pages13
ISBN (Electronic)9781713845393
StatePublished - 2021
Externally publishedYes
Event35th Conference on Neural Information Processing Systems, NeurIPS 2021 - Virtual, Online
Duration: 6 Dec 202114 Dec 2021

Publication series

NameAdvances in Neural Information Processing Systems
Volume7
ISSN (Print)1049-5258

Conference

Conference35th Conference on Neural Information Processing Systems, NeurIPS 2021
CityVirtual, Online
Period6/12/2114/12/21

Bibliographical note

Publisher Copyright:
© 2021 Neural information processing systems foundation. All rights reserved.

Fingerprint

Dive into the research topics of 'Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction'. Together they form a unique fingerprint.

Cite this