How to Engineer AI Inference Systems with Philip Kiely
EPISODE 766 | APRIL 30, 2026
About this Episode
In this episode, Philip Kiely, head of AI education at Baseten, joins us to unpack the fast-evolving discipline of inference engineering. We explore why inference has become the stickiest and most critical workload in AI, how it blends GPU programming, applied research, and large-scale distributed systems, and where the line sits between inference and model serving. Philip shares how research-to-production can move in hours, not months, and why understanding “the knobs” of inference—batching, quantization, speculation, and KV cache reuse—lets teams design better products and SLAs. We trace the inference maturity journey from closed APIs to dedicated deployments and in-house platforms, discuss GPU lifecycles, and survey today’s runtime landscape, including vLLM, SGLang, and TensorRT-LLM. Finally, we look ahead to agents and multimodality, making the case for specialized, workload-specific runtimes when performance and efficiency matter most.
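To make those knobs concrete, here is a minimal sketch using vLLM’s offline `LLM` API, one of the runtimes discussed in the episode. This is an illustrative example, not something prescribed in the conversation: the parameter names follow recent vLLM releases and may differ across versions, and the model checkpoint is an assumption (quantization requires a compatible pre-quantized model).

```python
# A sketch of the inference "knobs" discussed above, expressed through
# vLLM's offline LLM API. Parameter names follow recent vLLM releases
# and may vary by version; the checkpoint below is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    # Assumed AWQ-quantized checkpoint; quantization="awq" needs a
    # pre-quantized model, not a standard fp16 one.
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    quantization="awq",           # quantization: trade accuracy headroom for memory/speed
    max_num_seqs=64,              # batching: cap on sequences scheduled per engine step
    enable_prefix_caching=True,   # KV cache reuse: share cached prompt prefixes across requests
    speculative_config={          # speculation: cheap draft tokens verified by the target model
        "method": "ngram",
        "num_speculative_tokens": 4,
        "prompt_lookup_max": 4,
    },
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain KV cache reuse in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Each knob trades latency, throughput, or quality against the others, which is why understanding how they interact matters when designing the products and SLAs Philip describes.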
About the Guest
Philip Kiely
Baseten
Resources
- Inference Engineering Book
- Baseten
- PolarQuant: Quantizing KV Caches with Polar Transformation
- NVIDIA TensorRT
- NVIDIA Hopper Architecture
- NVIDIA Blackwell Architecture
- NVIDIA Ampere Architecture
- The path to ubiquitous AI
- Gemini Enterprise Agent Platform
- Amazon Bedrock
- Microsoft Foundry
- Wispr Flow
- Jane Street Signs $6 Billion AI Cloud Agreement with CoreWeave
- EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
- AWS and Cerebras Collaboration Aims to Set a New Standard for AI Inference Speed and Performance in the Cloud
- NVIDIA H200 GPU
- NVIDIA GB300 NVL72
- NVIDIA H100 GPU
- NVIDIA L4 Tensor Core GPU
- NVIDIA L40 GPU