Reinforcement Learning (RL) for robotic applications can benefit from a warm start, in which the agent is initialized with a pretrained behavioral policy. However, once RL updates begin, performance can degrade, which may compromise the robot's safety. This degradation, which reflects an inability to properly exploit the pretrained policy, is attributed to extrapolation error in the value function: high values are assigned to Out-Of-Distribution (OOD) actions that are not present in the behavioral policy's data. We investigate why the magnitude of degradation varies across policies and why the policy fails to quickly recover its behavioral performance. We present visual confirmation of our analysis and draw comparisons to the Offline RL setting, which suffers from similar difficulties. We propose a novel method, Confidence Constrained Learning (CCL) for Warm-Start RL, which reduces degradation by balancing the policy gradient against constrained learning according to a confidence measure of the Q-values. For the constrained learning component, we propose a novel objective, Positive Q-value Distance (CCL-PQD). We investigate a variety of constraint-based methods that aim to overcome the degradation and find that they constitute solutions to a multi-objective optimization problem between maximal performance and minimal degradation. Our results demonstrate that hyperparameter tuning for CCL-PQD produces solutions on the Pareto front of this multi-objective problem, allowing the user to trade off performance against tolerable compromises to the robot's safety.
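The abstract frames constraint-based methods as points in a two-objective trade-off between maximal performance and minimal degradation. As a generic illustration of what "on the Pareto front" means (not the paper's evaluation code), the Python sketch below keeps only the non-dominated hyperparameter configurations. It assumes each configuration has already been summarized as a hypothetical `(performance, degradation)` pair, where higher performance and lower degradation are both better; the function name `pareto_front`, the run names, and the sample numbers are illustrative placeholders, not results from the paper.

```python
from typing import Dict, List, Tuple


def pareto_front(results: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of non-dominated configurations, sorted by name.

    Each value is a hypothetical (performance, degradation) pair:
    higher performance is better, lower degradation is better.
    """
    front = []
    for name, (perf, deg) in results.items():
        # A configuration is dominated if another one is at least as good on
        # both objectives and strictly better on at least one of them.
        dominated = any(
            o_perf >= perf and o_deg <= deg and (o_perf > perf or o_deg < deg)
            for other, (o_perf, o_deg) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


if __name__ == "__main__":
    # Illustrative numbers only; they are NOT results from the paper.
    runs = {
        "run_a": (0.90, 0.30),
        "run_b": (0.80, 0.10),
        "run_c": (0.70, 0.20),  # dominated by run_b
        "run_d": (0.95, 0.50),
    }
    print(pareto_front(runs))  # ['run_a', 'run_b', 'run_d']
```

The pairwise check is quadratic in the number of configurations, which is adequate for a modest hyperparameter sweep. How performance and degradation are actually measured is left open here, since the abstract does not specify it.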
| Field | Value |
|---|---|
| Title of host publication | IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2022 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Number of pages | 8 |
| State | Published - 2022 |
| Event | 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2022, Kyoto, Japan |
| Duration | 23 Oct 2022 → 27 Oct 2022 |
| Publication series | IEEE International Conference on Intelligent Robots and Systems |
Bibliographical note (funding information):
Benjamin Wexler, Elad Sarafian and Sarit Kraus are with the Department of Computer Science at Bar-Ilan University, Ramat-Gan, Israel. This research has been partly supported by the EU Project TAILOR under grant 952215 and the DSI at BIU.
© 2022 IEEE.