Speculative Decoding and Efficient LLM Inference with Chris Lott
EPISODE 717 | FEBRUARY 3, 2025
About this Episode
Today, we're joined by Chris Lott, senior director of engineering at Qualcomm AI Research, to discuss accelerating large language model inference. We explore the challenges presented by LLM encoding and decoding (aka generation), and how these phases interact with hardware constraints such as FLOPS, memory footprint, and memory bandwidth to limit key inference metrics such as time-to-first-token, tokens per second, and tokens per joule. We then dig into a variety of techniques that can be used to accelerate inference, such as KV cache compression, quantization, pruning, speculative decoding, and leveraging small language models (SLMs). We also discuss future directions for enabling on-device agentic experiences, such as parallel generation and software tools like Qualcomm AI Orchestrator.
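To give a flavor of the speculative decoding technique discussed in the episode, here is a minimal, self-contained sketch of the draft-then-verify loop with greedy verification. The `draft_model` and `target_model` functions below are toy stand-ins invented for illustration (real systems use a small and a large LLM), and the greedy-acceptance rule is a simplification of the sampling-based acceptance test used in practice.

```python
# Toy speculative decoding sketch (greedy verification).
# The "models" are hypothetical stand-ins: each maps a token context
# to its single most likely next token.

def draft_model(context):
    # Cheap, fast, but imperfect: predicts the next integer,
    # except it is wrong whenever the correct token is a multiple of 5.
    nxt = context[-1] + 1
    return nxt + 1 if nxt % 5 == 0 else nxt

def target_model(context):
    # Expensive but authoritative: always predicts the next integer.
    return context[-1] + 1

def speculative_decode(prompt, num_tokens, k=4):
    """Generate num_tokens tokens: draft k candidates at a time with the
    cheap model, then keep the longest prefix the target model agrees
    with; on the first mismatch, substitute the target's own token."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1) Draft k candidate tokens autoregressively (cheap).
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify against the target model (in a real LLM, all k
        #    positions are scored in a single parallel forward pass,
        #    which is where the speedup comes from).
        ctx = list(tokens)
        for t in draft:
            expected = target_model(ctx)
            if t == expected:
                ctx.append(t)       # draft token accepted
            else:
                ctx.append(expected)  # reject; take target's token
                break
        tokens = ctx
    return tokens[len(prompt):][:num_tokens]

print(speculative_decode([0], 8))  # -> [1, 2, 3, 4, 5, 6, 7, 8]
```

The output always matches what the target model alone would produce; the draft model only changes how many expensive target-model steps are needed, not the result.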
About the Guest
Chris Lott
Qualcomm AI Research
Thanks to our sponsor Qualcomm AI Research
Qualcomm AI Research is dedicated to advancing AI to make its core capabilities — perception, reasoning, and action — ubiquitous across devices. Their work makes it possible for billions of users around the world to have AI-enhanced experiences on devices powered by Qualcomm Technologies. To learn more about what Qualcomm Technologies is up to on the research front, visit twimlai.com/qualcomm.
Resources
- Why Qualcomm AI Orchestrator is the key to next generation AI experiences
- Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement
- Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs
- On Speculative Decoding for Multimodal Large Language Models
- AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability
- Your Apps Are on Borrowed Time. AI Agents Are on the Way
- Qualcomm AI Engine for Snapdragon 8 Elite Mobile Platform
- Snapdragon X Elite
- Accelerating generative AI at the edge
- How to Quadruple LLM Decoding Performance with Speculative Decoding (SpD) and Microscaling (MX) Formats on Qualcomm® Cloud AI 100
- AI at the Edge: Qualcomm AI Research at NeurIPS 2024 with Arash Behboodi - #711
- Gen AI at the Edge: Qualcomm AI Research at CVPR 2024 with Fatih Porikli - #688
- Quantizing Transformers by Helping Attention Heads Do Nothing with Markus Nagel - #663
