Efficient Language Model Inference using Statistical Tools

Lecture / Panel
For NYU Community



Ananda Theertha Suresh
Research Scientist, Google, New York


"Efficient Language Model Inference using Statistical Tools"


Autoregressive sampling from large language models has led to state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time, which makes it slow and even prohibitive in certain tasks. One way to speed up sampling is speculative decoding: use a small model to sample a draft (a block or sequence of tokens), and then score all tokens in the draft with the large language model in parallel. A subset of the tokens in the draft is accepted (and the rest rejected) based on a statistical method that guarantees the final output follows the distribution of the large model.
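
As a rough illustration of the accept/reject step, here is a minimal sketch of the standard single-token speculative sampling rule from the literature, which is not necessarily the algorithm presented in the talk. The function name `speculative_step` and the toy distributions `p_large` and `q_small` are hypothetical stand-ins for the next-token distributions of the large and small models.

```python
import random


def speculative_step(p_large, q_small, rng):
    """Sample one token via draft-then-verify over a shared vocabulary.

    p_large, q_small: lists of probabilities of equal length, each summing
    to 1 (toy stand-ins for the large and small model distributions).
    Returns (token, accepted), where accepted says whether the draft was kept.
    """
    vocab = range(len(q_small))
    # Draft a token from the small model.
    draft = rng.choices(vocab, weights=q_small)[0]
    # Accept the draft with probability min(1, p(draft) / q(draft)).
    if rng.random() < min(1.0, p_large[draft] / q_small[draft]):
        return draft, True
    # On rejection, resample from the residual max(p - q, 0), renormalized
    # implicitly by rng.choices. The residual is nonzero whenever a
    # rejection can occur, because rejection requires p(draft) < q(draft).
    residual = [max(p - q, 0.0) for p, q in zip(p_large, q_small)]
    token = rng.choices(vocab, weights=residual)[0]
    return token, False


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.5, 0.3, 0.2]  # assumed large-model distribution
    q = [0.2, 0.5, 0.3]  # assumed small-model distribution
    samples = [speculative_step(p, q, rng)[0] for _ in range(10000)]
    print([samples.count(t) / len(samples) for t in range(3)])
```

Under this rule the returned token is distributed according to `p_large`, and the draft is accepted with probability equal to the sum over tokens of min(p, q), that is, one minus the total variation distance between the two distributions.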

In this talk, we provide a principled understanding of speculative decoding through the lens of distribution coupling and optimal transport theory. This new formulation enables us to improve upon speculative decoding in two ways. First, we propose an optimal draft acceptance algorithm that provides additional wall-clock speedup without incurring additional computation cost. Next, we ask whether the latency can be improved further with extra parallel computations. We answer this question affirmatively by showing that, given multiple drafts from the small model, we can use them to improve the speedup further, albeit at the cost of extra parallel computations. We provide theoretical guarantees on the proposed algorithms and characterize the expected speedup. We further empirically demonstrate the practicality of the new algorithms on standard datasets.

About Speaker

Ananda Theertha Suresh is a research scientist at Google Research, New York. He received his PhD from the University of California, San Diego, where he was advised by Prof. Alon Orlitsky. His research focuses on theoretical and algorithmic aspects of machine learning. He is a recipient of the 2017 Paul Baran Marconi Young Scholar award and a co-recipient of best paper awards at NeurIPS 2015, ALT 2020, and CCS 2021.