How to Find the Agent Failures Your Evals Miss with Scott Clark
EPISODE 767 | MAY 7, 2026
About this Episode
In this episode, Scott Clark, co-founder and CEO of Distributional, joins us to explore how teams can reliably operate and improve complex LLM systems and agents in production. Scott introduces a Maslow-style hierarchy of observability: telemetry for logging, monitoring for known signals, and post-production or online analytics to surface unknown unknowns. We dig into examples of real-world failures Scott’s team has seen in production systems, such as “lazy” tool-use hallucinations that standard evals miss, and how mapping traces into vector fingerprints enables clustering and topic discovery to uncover emergent behaviors. Scott explains how analytics can feed the data flywheel by generating evals, guardrails, and training data, and why online, adaptive approaches are essential for non-stationary models. We also touch on practical how-tos such as instrumentation with OpenTelemetry, the GenAI semantic conventions, and the role of dedicated analytics tools.
About the Guest
Scott Clark
Distributional
Thanks to our sponsor Distributional
This show is brought to you by our friends at Distributional, the AI analytics platform built for teams that are serious about agent quality. Distributional finds patterns in production agent traces, creating actionable insights with suggestions for new evals, refined guardrails, and improvements to your agent based on real usage. Don't take their word for it; use Distributional yourself for free. Go to app.dbnl.com to create your free hosted account, which also comes with a free LLM endpoint to power evals and analytics. Or install a self-hosted version of Distributional for free in your own environment.
To learn more, visit dbnl.com and start improving production agent quality today.
Resources
- Distributional
- Distributional App
- Distributional Docs
- Clio: Privacy-Preserving Insights into Real-World AI Use
- Where the goblins came from
- An update on recent Claude Code quality reports
- Datadog
- Statsig
- Braintrust
- Mixpanel
- Decagon
- Harvey AI
- Supporting Rapid Model Development at Two Sigma with Scott Clark & Matthew Adereth - #273
- Bayesian Optimization for Hyperparameter Tuning with Scott Clark - #50
- Democast: Automated Model Tuning with Scott Clark
- Building Real-World LLM Products with Fine-Tuning and More with Hamel Husain - #694

