Automatically batching control-intensive programs for modern accelerators

Alexey Radul
Dougal Maclaurin
Third Conference on Systems and Machine Learning, Austin, TX (2020)

Abstract

We present a general approach to batching arbitrary computations for
GPU and TPU accelerators. We demonstrate the effectiveness of our
method with orders-of-magnitude speedups on the No U-Turn Sampler
(NUTS), a workhorse algorithm in Bayesian statistics. The central
challenge of batching NUTS and other Markov chain Monte Carlo
algorithms is data-dependent control flow and recursion. We overcome
this by mechanically transforming a single-example implementation into
a form that explicitly tracks the current program point for each batch
member, and only steps forward those in the same place. We present
two different batching algorithms: a simpler, previously published one
that inherits recursion from the host Python, and a more complex,
novel one that implmenents recursion directly and can batch across it.
We implement these batching methods as a general program
transformation on Python source. Both the batching system and the
NUTS implementation presented here are available as part of the
popular TensorFlow Probability software package.