2021 IEEE International Symposium on Information Theory

12-20 July 2021 • Melbourne, Victoria, Australia

IEEE Information Theory Society

Institute of Electrical and Electronics Engineers (IEEE)

All Dates/Times are Australian Eastern Standard Time (AEST)

Technical Program

Paper Detail

Paper ID	D3-S4-T1.1
Paper Title	Differentially Quantized Gradient Descent
Authors	Chung-Yi Lin, Victoria Kostina, Babak Hassibi, Caltech, United States
Session	D3-S4-T1: Distributed Computation II
Chaired Session:	Wednesday, 14 July, 23:00 - 23:20
Engagement Session:	Wednesday, 14 July, 23:20 - 23:40
Abstract	Consider the following distributed optimization scenario. A worker has access to training data that it uses to compute the gradients, while a server decides when to stop iterative computation based on its target accuracy or delay constraints. The only information that the server knows about the problem instance is what it receives from the worker via a rate-limited noiseless communication channel. We introduce a technique we call differential quantization (DQ) that compensates past quantization errors to make the descent trajectory of a quantized algorithm follow that of its unquantized counterpart. Assuming that the objective function is smooth and strongly convex, we prove that differentially quantized gradient descent (DQ-GD) attains a linear convergence rate of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$, where $\sigma_{\mathrm{GD}}$ is the convergence rate of unquantized gradient descent (GD), $\rho_n$ is the covering efficiency of the quantizer, $n$ is the problem dimension, and $R$ is the bitrate per problem dimension. Thus at any $R \geq \log_2(\rho_n / \sigma_{\mathrm{GD}})$, the convergence rate of DQ-GD is the same as that of unquantized GD, i.e., there is no loss due to quantization. We show a converse demonstrating that no GD-like quantized algorithm can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$. Since quantizers exist with $\rho_n \to 1$ as $n \to \infty$ (Rogers, 1963), this means that DQ-GD is asymptotically optimal. In contrast, naively quantized GD, where the worker directly quantizes the gradient, attains only $\sigma_{\mathrm{GD}} + \rho_n 2^{-R}$. The technique of differential quantization continues to apply to gradient methods with momentum, such as Nesterov's accelerated gradient descent and Polyak's heavy ball method. For these algorithms as well, if the rate is above a certain threshold, there is no loss in the convergence rate obtained by the differentially quantized algorithm compared to its unquantized counterpart. Experimental results on both simulated and real-world least-squares problems validate our theoretical analysis.
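
The error-compensation idea in the abstract can be illustrated with a minimal Python sketch. It is one plausible reading of "compensating past quantization errors", not necessarily the authors' exact algorithm. Several things here are assumptions made for illustration: the function names, the random least-squares instance, the step size 2/(L + mu), and above all the quantizer, which is a fixed-step uniform scalar quantizer rather than the rate-limited quantizer with covering efficiency $\rho_n$ analyzed in the paper. The sketch shows only the trajectory-tracking property: the compensated point z follows unquantized GD, and the server's iterate differs from it by eta times the last quantization error.

```python
import numpy as np


def quantize(v, step):
    """Uniform scalar quantizer (an assumption, not the paper's quantizer):
    round each coordinate to the nearest multiple of `step`."""
    return step * np.round(v / step)


def run_gd(grad, x0, eta, iters):
    """Unquantized gradient descent."""
    x = x0.copy()
    for _ in range(iters):
        x = x - eta * grad(x)
    return x


def run_naive_qgd(grad, x0, eta, iters, step):
    """Naive scheme: the worker quantizes the gradient directly."""
    x = x0.copy()
    for _ in range(iters):
        x = x - eta * quantize(grad(x), step)
    return x


def run_dq_gd(grad, x0, eta, iters, step):
    """Sketch of differential quantization: the previous quantization error
    is subtracted before quantizing, so errors do not accumulate."""
    x_hat = x0.copy()              # iterate known to the server
    e_prev = np.zeros_like(x0)     # previous quantization error
    for _ in range(iters):
        z = x_hat + eta * e_prev   # point on the unquantized GD trajectory
        u = grad(z) - e_prev       # gradient with past error compensated
        q = quantize(u, step)      # message sent to the server
        e_prev = q - u             # new quantization error
        x_hat = x_hat - eta * q    # server-side update
    return x_hat, x_hat + eta * e_prev


# Hypothetical least-squares instance: f(x) = 0.5 * ||A x - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)
H = A.T @ A
c = A.T @ b


def grad(x):
    return H @ x - c


eigs = np.linalg.eigvalsh(H)       # ascending: eigs[0] = mu, eigs[-1] = L
eta = 2.0 / (eigs[0] + eigs[-1])   # standard GD step size for this class
step = 0.5                         # quantizer step (illustrative value)
iters = 30
x0 = np.zeros(5)
x_star = np.linalg.solve(H, c)

x_gd = run_gd(grad, x0, eta, iters)
x_naive = run_naive_qgd(grad, x0, eta, iters, step)
x_hat, z = run_dq_gd(grad, x0, eta, iters, step)

print("GD    distance to optimum:", np.linalg.norm(x_gd - x_star))
print("naive distance to optimum:", np.linalg.norm(x_naive - x_star))
print("DQ-GD distance to optimum:", np.linalg.norm(x_hat - x_star))
print("max |z - x_gd|           :", np.max(np.abs(z - x_gd)))
```

Because the step is fixed, neither quantized variant converges to the exact optimum in this sketch; reproducing the linear rate $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$ would require a quantizer whose range shrinks across iterations, which is omitted here.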