Actor-only Deterministic Policy Gradient via Zeroth-order Gradient Oracles in Action Space
Harshat Kumar, University of Pennsylvania, United States; Dionysios S. Kalogerias, Michigan State University, United States; George J. Pappas, Alejandro Ribeiro, University of Pennsylvania, United States
D4-S3-T3: Reinforcement Learning
Thursday, 15 July, 22:40 - 23:00
Thursday, 15 July, 23:00 - 23:20
Deterministic policies demonstrate substantial empirical success over their stochastic counterparts, as they remove a level of randomness in Policy Gradient (PG) methods applied to stochastic search problems involving Markov decision processes. However, current implementations require state-action value (Q-function) approximators, also known as critics, to obtain estimates of the associated policy-reward gradient. In this work, we propose the use of two-point stochastic evaluations to obtain gradient estimates of a smoothed Q-function surrogate, constructed by evaluating pairs of the Q-function at low-dimensional, randomized initial action perturbations. This procedure lifts the dependence on a critic and restores true model-free policy learning, with provable algorithmic stability. In fact, our finite complexity bounds improve upon existing results by up to 2 orders of magnitude in iteration complexity, and by up to 3/2 orders of magnitude in sample complexity. Simulation results on an agent navigation problem also showcase the effectiveness of the proposed algorithm in a practical setting.
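The core primitive described above is a two-point zeroth-order gradient oracle: the gradient of a smoothed Q-function surrogate is estimated from paired Q-function evaluations at random action perturbations, with no critic. The following is a minimal illustrative sketch of that estimator, not the paper's exact algorithm; the function names, the Gaussian-on-sphere direction sampling, and the smoothing details are assumptions for the example.

```python
import numpy as np

def two_point_zo_gradient(q_fn, action, delta=1e-3, num_samples=32, seed=None):
    """Two-point zeroth-order estimate of grad_a Q(a) in action space.

    Averages d * (Q(a + delta*u) - Q(a - delta*u)) / (2*delta) * u over
    random unit directions u, which (in expectation) is the gradient of a
    delta-smoothed surrogate of Q. Illustrative only; details differ from
    the paper's construction.
    """
    rng = np.random.default_rng(seed)
    d = action.shape[0]
    grad = np.zeros(d)
    for _ in range(num_samples):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)  # uniform direction on the unit sphere
        # Paired (two-point) Q-function evaluations at perturbed actions
        diff = q_fn(action + delta * u) - q_fn(action - delta * u)
        grad += (d * diff / (2.0 * delta)) * u
    return grad / num_samples
```

Because only evaluations of Q along perturbed actions are used, the estimator needs no learned critic and no differentiability of Q, which is exactly what makes the scheme actor-only and model-free.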