All Dates/Times are Australian Eastern Standard Time (AEST)

Technical Program

Paper Detail

Paper ID D4-S3-T3.1
Paper Title Actor-only Deterministic Policy Gradient via Zeroth-order Gradient Oracles in Action Space
Authors Harshat Kumar, University of Pennsylvania, United States; Dionysios S Kalogerias, Michigan State University, United States; George J Pappas, Alejandro Ribeiro, University of Pennsylvania, United States
Session D4-S3-T3: Reinforcement Learning
Chaired Session: Thursday, 15 July, 22:40 - 23:00
Engagement Session: Thursday, 15 July, 23:00 - 23:20
Abstract Deterministic policies demonstrate substantial empirical success over their stochastic counterparts, as they remove a level of randomness in Policy Gradient (PG) methods when applied to stochastic search problems involving Markov decision processes. However, current implementations require the use of state-action value (Q-function) approximators, also known as critics, to obtain estimates of the associated policy-reward gradient. In this work, we propose the use of two-point stochastic evaluations to obtain gradient estimates of a smoothed Q-function surrogate, constructed by evaluating the Q-function at pairs of low-dimensional, randomized initial action perturbations. This procedure lifts the dependence on a critic and restores truly model-free policy learning, with provable algorithmic stability. In fact, our finite complexity bounds improve upon existing results by up to 2 orders of magnitude in terms of iteration complexity, and by up to 3/2 orders of magnitude in terms of sample complexity. Simulation results on an agent navigation problem showcase the effectiveness of our proposed algorithm in a practical setting as well.
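
To make the two-point evaluation concrete, the sketch below illustrates a generic zeroth-order gradient oracle in action space. It assumes a rollout-based Q-function estimator q_fn, a smoothing radius mu, and perturbation directions drawn uniformly from the unit sphere; these names and choices are illustrative assumptions, not the paper's exact construction.

import numpy as np

def two_point_action_gradient(q_fn, state, action, mu=0.1, rng=None):
    # Two-point zeroth-order estimate of the gradient of a smoothed
    # Q-function surrogate with respect to the action.
    # q_fn(state, action) -> float is a hypothetical rollout-based return
    # estimator; mu is the smoothing radius (illustrative, not from the paper).
    rng = np.random.default_rng() if rng is None else rng
    d = action.shape[0]
    # Random perturbation direction, uniform on the unit sphere.
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    # Evaluate the Q-function at a pair of perturbed initial actions.
    q_plus = q_fn(state, action + mu * u)
    q_minus = q_fn(state, action - mu * u)
    # Two-point finite-difference estimate of the smoothed gradient.
    return (d / (2.0 * mu)) * (q_plus - q_minus) * u

# Example with a toy quadratic Q-function (for illustration only):
# q = lambda s, a: -np.sum((a - s) ** 2)
# g = two_point_action_gradient(q, state=np.zeros(2), action=np.ones(2))

Because the estimator only requires two Q-function evaluations per perturbation, it avoids fitting a critic; in an actor-only setting these evaluations would come from sampled rollouts rather than a learned approximator.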